Crawler-Based Search Engine
By: Bryan Chapman, Ryan Caplet, Morris Wright
Background
- A crawler is a script/bot that searches the web in a methodical, automated manner (Wikipedia, "web crawler")
- With these results we will index and analyze the contents to create a usable search engine
- We have limited the scope of the crawl to a single domain due to space and information-gathering constraints
Task Breakdown
- Bryan Chapman: implementing the crawler; writing several scripts to analyze the results
- Ryan Caplet: search functionality; testing
- Morris Wright: UI development; database management; web server account management
- Keyword extraction will be a group effort
Functionality Overview
- The crawler creates a "mirror" of our intended scope of websites on a local hard drive
- Using a script, the title is then extracted from the relevant files and placed into a DB table
- Another script then visits each URL and extracts keywords to populate the second DB table
- A search is then sent from the UI, where results are requested from queries of the databases
Product Functionality
- Our crawler is nothing more than a recursive call to the standard Linux command wget, starting with a base URL and limited to this domain
- Once a "mirror" is created, a script is run recursively on our base directory to extract the <title> tag's contents from the files for indexing (see the sketch below)
- This process involves several built-in libraries and the Perl scripting language
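The slides specify Perl (with HTML::TokeParser) for this step; purely to illustrate the same idea in the project's other scripting language, here is a minimal PHP sketch that walks a hypothetical mirror directory (as created by wget's recursive mode, e.g. `wget -r -np <base-url>`) and prints each page's <title>. The `/var/mirror` path and the file filter are assumptions, not the project's actual values.

```php
<?php
// Hypothetical sketch: walk a local wget mirror and print each file's
// path alongside its <title> contents. Path and filter are assumptions.
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/mirror', FilesystemIterator::SKIP_DOTS)
);
foreach ($files as $file) {
    if (!preg_match('/\.html?$/i', $file->getFilename())) {
        continue; // only index HTML files
    }
    $html = file_get_contents($file->getPathname());
    // Take the contents of the first <title> tag, if one exists.
    if (preg_match('#<title>(.*?)</title>#is', $html, $m)) {
        echo $file->getPathname() . "\t" . trim($m[1]) . "\n";
    }
}
```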
Product Functionality
- Once this is accomplished, our first database is populated with indexing information and has the layout seen below (sketched in code after the table)

Site Index Table
- ID: used as a primary key
- URL: stores the site's URL address
- TITLE: stores the extracted title
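A minimal sketch of this layout; the database name, credentials, table name `site_index`, and column sizes are all assumptions, and PDO is used purely for illustration.

```php
<?php
// Minimal sketch of the site index layout from the slide above.
// Names, credentials, and column types are placeholders.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$db->exec('CREATE TABLE IF NOT EXISTS site_index (
    ID    INT AUTO_INCREMENT PRIMARY KEY,  -- primary key
    URL   VARCHAR(255) NOT NULL,           -- site URL address
    TITLE VARCHAR(255)                     -- extracted title string
)');
// One row per mirrored page: its URL and the extracted <title> string.
$stmt = $db->prepare('INSERT INTO site_index (URL, TITLE) VALUES (?, ?)');
$stmt->execute(['http://...', 'For all your Technology Needs']);
```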
Product Functionality
- We then move our scripting language to PHP, where we loop through all the URL listings in our indexing database to create keywords
- By first stripping unwanted HTML syntax and punctuation characters, we can use PHP's built-in function array_count_values to create a list of keywords and their frequencies (see the sketch below)
- This process is very detailed and we expect most of our time to be spent here
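Since the slide names PHP's array_count_values directly, a small self-contained sketch of the stripping-and-counting step might look like this; stop-word filtering and other refinements are omitted.

```php
<?php
// Sketch of the keyword step: strip markup and punctuation, then count
// word frequencies with array_count_values().
function extract_keywords($html) {
    $text  = strtolower(strip_tags($html));  // drop HTML syntax, normalize case
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);       // word => frequency
}

$counts = extract_keywords(file_get_contents('page.html'));
arsort($counts);    // most frequent keywords first
print_r($counts);
```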
Product Functionality
- Once this list is created for a given website, we populate our keyword database by either creating a new table for the keyword or simply adding a new entry to an existing table (see the sketch below)

'Keyword' Table
- ID: used as a primary key
- URL: stores the site's URL address
- FREQUENCY: stores the keyword's frequency
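A rough sketch of the one-table-per-keyword scheme described above; the `kw_` table-name prefix, the sanitization rule, and the column types are assumptions for illustration only.

```php
<?php
// Sketch of the "one table per keyword" layout: create the keyword's table
// on first sight, then record (URL, frequency) for the page.
function store_keyword(PDO $db, $keyword, $url, $freq) {
    // Hypothetical naming scheme: strip anything non-alphanumeric.
    $table = 'kw_' . preg_replace('/[^a-z0-9]/', '', strtolower($keyword));
    $db->exec("CREATE TABLE IF NOT EXISTS $table (
        ID        INT AUTO_INCREMENT PRIMARY KEY,
        URL       VARCHAR(255) NOT NULL,
        FREQUENCY INT NOT NULL
    )");
    $stmt = $db->prepare("INSERT INTO $table (URL, FREQUENCY) VALUES (?, ?)");
    $stmt->execute([$url, $freq]);
}
```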
Product Functionality - Example
Consider the following results:

Site 0 - Title: For all your Technology Needs
- technology: 4
- information: 10

Site 1 - Title: For all your Sports Information
- football: 10
- information: 12
Product Functionality - Example

Site Indexing Database
- 0 | http://... | For all your Technology Needs
- 1 | http://... | For all your Sports Information

'Technology' Table
- 0 | http://... | 4

'Information' Table
- 0 | http://... | 10
- 1 | http://... | 12

'Football' Table
- 1 | http://... | 10
Product Functionality
- Once the databases have been populated, the search engine is ready to do its work
- A query is entered into the search field, where an attempt is made to locate a corresponding table entry for each separate word
- Each URL match is then given a ranking based upon the accumulated totals of its frequencies across all the keywords (see the sketch below)
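A minimal sketch of this lookup-and-rank step, reusing the hypothetical per-keyword tables from the earlier store_keyword() sketch; the score is simply the accumulated frequency total, as the slide describes.

```php
<?php
// Sketch of ranking: look up each query word's table and sum the stored
// frequencies per URL. Table naming follows the earlier sketch.
function search(PDO $db, $query) {
    $scores = [];
    foreach (preg_split('/\s+/', strtolower(trim($query))) as $word) {
        $table = 'kw_' . preg_replace('/[^a-z0-9]/', '', $word);
        try {
            $rows = $db->query("SELECT URL, FREQUENCY FROM $table");
        } catch (PDOException $e) {
            continue; // no table for this word => no page contains it
        }
        if ($rows === false) {
            continue; // older PHP: silent error mode returns false instead
        }
        foreach ($rows as $row) {
            $scores[$row['URL']] = ($scores[$row['URL']] ?? 0) + $row['FREQUENCY'];
        }
    }
    arsort($scores);  // highest accumulated total first
    return $scores;   // URL => ranking score
}
```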
Product Functionality - Example
- The search results are then displayed by listing the URL, title string, and any keywords present
- Results from the past example...

Query: Football Information
1) For all your Sports Information - keywords: football: 10, information: 12
2) For all your Technology Needs - keywords: information: 10
Environments
Development Environment
- Primarily Linux; Windows used where necessary
- Coding done with PHP/HTML, Perl, and MySQL
User Environment
- Product will function in any environment, assuming a graphical web browser is installed
Product Constraints
- Due to space constraints we limited our crawling to a single pass, resulting in roughly 2.5 GB
- In an actual deployment, dedicated servers would be crawling/analyzing 24/7 to keep indexes up to date
- The last "official" estimate was that Google maintains ~450,000 servers
Software Dependencies
- Within Perl, the following libraries are required: File::Compare, HTML::TokeParser, LWP::Simple, File::Basename, DBI, DBD::mysql
- For our PHP scripting, we will make use of the cURL library (see the sketch below)
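As a small illustration of the PHP cURL usage the slide anticipates, fetching one page into a string; the URL is a placeholder.

```php
<?php
// Minimal cURL sketch: fetch a single page into $html.
$ch = curl_init('http://...');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
$html = curl_exec($ch);
curl_close($ch);
```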
Action Plan
- Configure web server and MySQL server
- Begin to "mirror" until established capacity is met
- Write script to extract title tags
- Write script to extract keywords
- Code search functions and ranking system
- Design interface and link with existing code
Timeline
- August 25 – September 24: configure servers and populate index database with title strings and matching URLs
- September 25 – October 15: extract keywords and populate the keyword database
- October 16 – November 12: write search functions and integrate with GUI
- November 13 – December 3: testing period
Security Concerns
- Server security: hosted by ECS, so server-level security concerns are out of our control
- Preventing injections: ensure input validation, and use HackBar for security auditing (see the sketch below)
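One hedged sketch of the injection-prevention idea: validate the raw input, then bind it with a prepared statement so user input is never spliced into the SQL text. The validation pattern and the `site_index` table carry over from the earlier sketches and are illustrative only.

```php
<?php
// Illustrative only: validate, then bind with a prepared statement.
$db   = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$term = trim($_GET['q'] ?? '');
if ($term === '' || !preg_match('/^[\w\s]{1,100}$/', $term)) {
    exit('Invalid search query'); // basic input validation
}
$stmt = $db->prepare('SELECT URL, TITLE FROM site_index WHERE TITLE LIKE ?');
$stmt->execute(['%' . $term . '%']); // value is bound, not concatenated
```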
Test Plans
Test plans for this project will be...
- Keeping good consistency of rendering across different operating systems, browsers, and browser versions
- Checking to make sure that search queries correspond to expected results based on what is stored in the database (a rough check is sketched below)
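A rough sketch of the second check, assuming the search() function from the earlier sketch; the expected URL is a placeholder, not real data.

```php
<?php
// Illustrative check only: run a search and compare the top-ranked URL
// against what the keyword tables say it should be.
$results = search($db, 'football information');
$top = array_key_first($results); // highest-ranked URL
assert($top === 'http://...');    // expected URL is a placeholder
```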
Questions?
By: Ryan Caplet, Bryan Chapman, Morris Wright