Published by Egbert Richardson. Modified over 8 years ago.
1
The Anatomy of a Large-Scale Hypertextual Web Search Engine (The creation of Google)
2
Introduction
● New type of search engine
● Originally dubbed BackRub
● Released as Google in 1998
● Changed the way people use the Internet
● Designed to handle the expansion of the WWW
Sergey Brin & Lawrence Page
3
Growth of the Internet
4
Goals of Google
Accurate Searches
● Most major search engines of the time could not find their own site when searched for by name
● Number of documents matching queries was rapidly increasing
● Humans are only interested in the first 10 or so results
● Need some way to recognise better matches
Academic Usage
● Search engine development was secretive
● Search information is commercially valuable
● Enable large-scale web data processing for research
5
Predicting Market fluctuations via Google search information
6
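Features of Google
PageRank
● Uses the citation (link) graph of the web
● Can estimate the relevance of search results
● PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
● T1 ... Tn are the pages linking to A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1
● Modeled on human behaviour: the Random Surfer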
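The PageRank formula above can be evaluated by simple iteration. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the dictionary input format, the fixed iteration count and the starting value of 1.0 are all assumptions made for illustration. The default d = 0.85 is the value the original paper says is usually used.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages.

    links maps each page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Invert the graph: for each page, which pages link to it?
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / len(links[t]) for t in inbound[page])
            for page in pages
        }
    return pr


if __name__ == "__main__":
    ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
    for page, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(page, round(rank, 3))
```

In this toy graph, C is linked to by two pages and passes its whole rank on to A, so C and A end up well above B, which nobody links to.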
7
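Features of Google
Anchor Text
● Link text is associated with both the page it appears on and the page it points to
● Allows access to pages that have not been crawled
● Allows non-text documents such as images to be indexed
Other Features
● Location (word position) information for every hit, used for proximity in searching
● Font properties such as relative size and boldness
● Full raw HTML of pages kept in a repository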
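The anchor text idea can be sketched as follows. This is an illustrative toy, not the actual indexer: the function name `build_anchor_index`, the input format and the example URLs are hypothetical. Each word of a link's text is indexed under both the page containing the link and the page the link points to, so a target that was never downloaded can still be found.

```python
from collections import defaultdict


def build_anchor_index(crawled_pages):
    """Map each anchor-text word to the URLs it should retrieve.

    crawled_pages: {source_url: [(anchor_text, target_url), ...]}
    (hypothetical format for this sketch).
    """
    index = defaultdict(set)
    for source_url, links in crawled_pages.items():
        for anchor_text, target_url in links:
            for word in anchor_text.lower().split():
                index[word].add(source_url)  # the page the link is on
                index[word].add(target_url)  # the page the link points to
    return index


if __name__ == "__main__":
    crawled = {
        "http://a.example/": [("Stanford logo", "http://b.example/logo.gif")],
    }
    index = build_anchor_index(crawled)
    # The image was never crawled, yet a search for "logo" can return it.
    print(sorted(index["logo"]))
```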
8
System Anatomy
● The URL Server sends lists of URLs to the Crawlers
● Crawlers download pages and send them to the store server
● These pages are assigned docIDs, compressed and stored in the repository
● The Indexer retrieves files from the repository, uncompresses them and then parses them
● Additional URLs found during parsing are also given docIDs
● Each document is converted into a set of word occurrences called hits
● Hits record the word, its position in the document and its formatting, and are stored in the “barrels”
● Anchor text information is also extracted by the Indexer
● The URL Resolver builds a links database from the anchors, which is used to calculate PageRanks
● The Sorter re-sorts the barrels by wordID instead of docID
● The words discovered during indexing are merged into the Lexicon, which the Searcher uses to respond to queries
9
Major Data Structures
BigFiles
● Virtual files spanning multiple file systems
● Allowed Google to work around the limitations of a 32-bit OS
● Later replaced by the Google File System (GFS)
● GFS replaced by GFS2 “Colossus” in 2010
10
Major Data Structures
Repository
● Contains the full HTML of every crawled web page
● Sacrifices compression ratio in favour of speed
● Entire system can be rebuilt from the repository
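A repository record can be sketched along these lines. The original paper reports choosing zlib because it was fast, accepting a weaker compression ratio; everything else here is an assumption for illustration, including the exact byte layout, the field widths and the names `pack_record` and `unpack_record`.

```python
import struct
import zlib

# Hypothetical packet header: docID, URL length, page length (little-endian).
HEADER = struct.Struct("<IHI")


def pack_record(doc_id, url, html):
    """Build one length-prefixed, zlib-compressed repository record."""
    url_bytes = url.encode("utf-8")
    html_bytes = html.encode("utf-8")
    packet = HEADER.pack(doc_id, len(url_bytes), len(html_bytes)) + url_bytes + html_bytes
    compressed = zlib.compress(packet, 1)  # level 1: favour speed over ratio
    return struct.pack("<I", len(compressed)) + compressed


def unpack_record(record):
    """Recover (doc_id, url, html) from a record made by pack_record."""
    (length,) = struct.unpack_from("<I", record, 0)
    packet = zlib.decompress(record[4:4 + length])
    doc_id, url_len, page_len = HEADER.unpack_from(packet, 0)
    start = HEADER.size
    url = packet[start:start + url_len].decode("utf-8")
    html = packet[start + url_len:start + url_len + page_len].decode("utf-8")
    return doc_id, url, html


if __name__ == "__main__":
    record = pack_record(7, "http://example.com/", "<html>hi</html>")
    print(unpack_record(record))
```

Because every record carries its own docID, URL and length, the other structures can in principle be regenerated by scanning the records in order, which is the property the last bullet describes.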
11
Major Data Structures
Document Index
● Contains information about each document in the repository
● Includes the URL, and the title if the page has been crawled
● Designed so that a record can be fetched with only one disk seek
● Also contains a file that is used to convert URLs into docIDs
● The URL Resolver uses batch processing to reduce disk seeks
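The URL-to-docID file can be sketched as a list of URL checksums sorted by checksum and searched with binary search, which is how the original paper describes it. In this sketch CRC32 merely stands in for whatever checksum was really used, collisions are ignored, and the names `build_checksum_table` and `lookup_doc_id` are hypothetical.

```python
import bisect
import zlib


def build_checksum_table(url_to_doc_id):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted(
        (zlib.crc32(url.encode("utf-8")), doc_id)
        for url, doc_id in url_to_doc_id.items()
    )


def lookup_doc_id(table, url):
    """Binary-search the sorted table; None means the URL is unknown."""
    checksum = zlib.crc32(url.encode("utf-8"))
    # (checksum,) sorts before any (checksum, doc_id) pair with the same checksum.
    i = bisect.bisect_left(table, (checksum,))
    if i < len(table) and table[i][0] == checksum:
        return table[i][1]
    return None


if __name__ == "__main__":
    table = build_checksum_table({"http://a.example/": 1, "http://b.example/": 2})
    print(lookup_doc_id(table, "http://b.example/"))
```

Batch conversion follows from the same layout: sorting a batch of URLs by checksum lets them be merged against the table in a single sequential pass instead of one seek per link.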
12
Major Data Structures
Hit Lists
● Occurrences of a word in a document
● Includes position and formatting
● Two types of hits: Fancy and Plain
● Fancy hits are words in URLs, titles, anchor text or meta tags
● Plain hits include everything else
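The original paper describes packing each hit into two bytes: a plain hit holds a capitalisation bit, a 3-bit font size and a 12-bit position, while a fancy hit sets the font size to 7 and uses 4 bits for the hit type and 8 bits for the position. The sketch below follows those field widths; the order of the fields inside the 16 bits, the clamping of out-of-range values and the function names are assumptions.

```python
FANCY_FONT = 7  # font size value reserved to flag a fancy hit


def pack_plain_hit(capitalized, font_size, position):
    """1 bit capitalisation, 3 bits font size (0-6), 12 bits position."""
    font_size = min(font_size, FANCY_FONT - 1)  # 7 is reserved
    position = min(position, 0xFFF)             # clamp to 12 bits (assumed)
    return (int(capitalized) << 15) | (font_size << 12) | position


def pack_fancy_hit(capitalized, hit_type, position):
    """1 bit capitalisation, font size 7, 4 bits type, 8 bits position."""
    position = min(position, 0xFF)              # clamp to 8 bits (assumed)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | ((hit_type & 0xF) << 8) | position


def unpack_hit(hit):
    """Decode a 16-bit hit back into its fields."""
    capitalized = bool(hit >> 15)
    font_size = (hit >> 12) & 0x7
    if font_size == FANCY_FONT:
        return {"kind": "fancy", "capitalized": capitalized,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"kind": "plain", "capitalized": capitalized,
            "font_size": font_size, "position": hit & 0xFFF}


if __name__ == "__main__":
    print(unpack_hit(pack_plain_hit(True, 3, 100)))
    print(unpack_hit(pack_fancy_hit(False, 2, 5)))
```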
13
Major Data Structures
Forward and Inverted Index
● The forward index is stored in 64 barrels, each holding a range of wordIDs
● A document's hits are placed in the barrels whose wordID ranges match its words
● The inverted barrels are kept in two sets
● One contains only anchor and title hits
● The other contains all hits
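Turning a forward barrel (ordered by docID) into an inverted barrel (ordered by wordID) is essentially a sort. A minimal in-memory sketch, assuming a hypothetical list-of-tuples layout and the name `invert_barrel`; the real barrels are on-disk structures:

```python
def invert_barrel(forward_barrel):
    """Re-sort a forward barrel by wordID.

    forward_barrel: [(doc_id, [(word_id, hit_list), ...]), ...]
    Returns {word_id: [(doc_id, hit_list), ...]} with docIDs ascending.
    """
    # Flatten to (word_id, doc_id, hits) postings.
    postings = [
        (word_id, doc_id, hits)
        for doc_id, words in forward_barrel
        for word_id, hits in words
    ]
    # The sorter's job: order by wordID first, then docID.
    postings.sort(key=lambda posting: (posting[0], posting[1]))

    inverted = {}
    for word_id, doc_id, hits in postings:
        inverted.setdefault(word_id, []).append((doc_id, hits))
    return inverted


if __name__ == "__main__":
    forward = [
        (1, [(10, [3]), (11, [5])]),
        (2, [(10, [1, 7])]),
    ]
    print(invert_barrel(forward))
```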
14
Crawling the Web
● Fragile process, prone to errors and likely to crash
● Originally written in Python, but changed to C++ in 2000
● Crawler throughput is limited by server response times
● Asynchronous IO helps to offset this
● Crawling attracts the attention of website owners
○ “How did you like my website?”, “This page is copyrighted and should not be indexed.”
● A crawler can only be tested properly by running it on the live web
● Required a lot of work monitoring emails and logs
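The benefit of asynchronous IO can be sketched with Python's `asyncio`. This is not the original crawler: `fake_fetch` is a hypothetical stand-in that only sleeps, so the example runs without network access, and the cap of 300 open connections echoes the figure given in the original paper. While one request waits on a slow server, the others keep making progress.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for a download: just wait, then return HTML."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls, max_connections=300, delay=0.05):
    """Fetch all URLs concurrently, with at most max_connections in flight."""
    semaphore = asyncio.Semaphore(max_connections)

    async def fetch_one(url):
        async with semaphore:
            try:
                return url, await fake_fetch(url, delay)
            except Exception as exc:  # one bad page must not stop the crawl
                return url, exc

    results = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(results)


if __name__ == "__main__":
    urls = [f"http://site{i}.example/" for i in range(20)]
    pages = asyncio.run(crawl(urls))
    print(len(pages), "pages fetched")
```

Errors are returned as values rather than raised, a small nod to the first bullet: a crawler meets so many odd servers and pages that failures have to be treated as routine.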
15
Searching
● Focused on quality over efficiency
● Original search stopped after finding 40,000 matching documents
● Hits, PageRank, font parameters and all other information are combined to produce the ranking of returned pages
● Trusted users were used to provide feedback
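How several signals might be combined into one ranking can be illustrated with a toy scoring function. The real ranking function is not given in these slides, so everything below is hypothetical: the weights in `TYPE_WEIGHTS`, the use of a logarithm so that repeated hits give diminishing returns, and the choice to multiply the hit-based score by PageRank.

```python
import math

# Hypothetical weights: hits in more prominent places count for more.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}


def ir_score(hit_counts):
    """Weight each hit type; log1p makes extra hits taper off."""
    return sum(TYPE_WEIGHTS[hit_type] * math.log1p(count)
               for hit_type, count in hit_counts.items())


def rank(candidates, limit=10):
    """candidates: {url: (hit_counts, pagerank)}; returns URLs, best first."""
    scores = {url: ir_score(hits) * pagerank
              for url, (hits, pagerank) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:limit]


if __name__ == "__main__":
    candidates = {
        "a": ({"title": 1}, 1.0),
        "b": ({"plain": 1}, 1.0),
        "c": ({"plain": 1}, 20.0),
    }
    print(rank(candidates))
```

With these made-up numbers, a title hit outranks a plain hit at equal PageRank, while a much higher PageRank lets page "c" overtake both.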
16
Results
● Favourable compared to existing search engines
● Queries return sensible results
● Can return pages that have not been crawled
● Proximity weighting helps multi-word queries