Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
The Web
Index sizes
– 1994: World Wide Web Worm (McBryan) indexed 110,000 pages/documents
– 1997: search engines claimed from 2 to 100 million pages indexed
– 2005: Google claimed 8 billion
– a later claim of 25 billion was removed
Queries
– 1994: WWWW handled 1,500 queries per day
– 1997: AltaVista handled 20 million queries per day
– 2002: Google handled 250 million queries per day
– 2004: 2.5 billion queries per day
Clearly, Web search engines must scale to both increased corpus size and increased use.
Google Design Goals
Scale up to work with much larger collections
– In 1997, older search engines were breaking down due to manipulation by advertisers, etc.
Exhibit high precision among top-ranked documents
– Many documents are somewhat relevant
– For the Web, “relevant” should mean only the very best documents
Provide data sets for academic research on search engines
Changes from Earlier Engines
PageRank
– Ranks the importance of pages
– Uses citation patterns, as in academic research
– Normalizes for the number of links on a page
– A variation on the probability that a random surfer would view a page (see the sketch below)
Anchor Text
– Stores the text of anchors with the pages they link to
– Incoming link anchors can be a better index than the page’s own content
– Allows uncrawled and nontextual content to be indexed
– Not a new idea: used by WWWW in 1994
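A minimal power-iteration sketch of the random-surfer idea, in Python. The toy graph, the damping factor d = 0.85, and the uniform (1 - d)/n term (which makes the ranks a probability distribution, one common variant of the formula) are illustrative assumptions, not the production algorithm.

    # A toy PageRank by power iteration. The dangling-page handling and
    # the (1 - d) / n term are common choices, not necessarily Google's.
    def pagerank(links, d=0.85, iters=50):
        """links: dict mapping each page to the pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - d) / n for p in pages}
            for p, outs in links.items():
                targets = outs or pages   # dangling page: spread rank evenly
                for q in targets:
                    new[q] += d * rank[p] / len(targets)  # normalize by out-degree
            rank = new
        return rank

    # Toy graph: A and B cite each other; C cites A. A ends up ranked highest.
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))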
Related Work
Information retrieval
– Traditional IR metrics (TREC ’96) called 20 GB a very large corpus
– Term vectors assume all documents have some value
  – short documents (few words) that match a query term get ranked highest
  – “bill clinton” returns a page that says only “bill clinton sucks”
Different from well-controlled collections
– Content has abnormal vocabulary (e.g. product codes)
– Content generated by systems (e.g. complex IDs, binhexed data)
– Content includes bugs, flakiness, and disinformation
System Anatomy
Google Anatomy
URL Server
– Manages lists of URLs to be retrieved
Crawlers
– Retrieve content for the URLs provided (see the sketch below)
Store Server
– Compresses and stores retrieved content in the repository
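A minimal sketch of this URL server, crawler, and store server flow, with a hypothetical fetch_page() standing in for a real HTTP fetch; the actual system ran several distributed crawlers, each keeping hundreds of connections open.

    import zlib
    from collections import deque

    def fetch_page(url):
        """Hypothetical stand-in for retrieving a page over HTTP."""
        return b"<html>example content</html>"

    url_queue = deque(["http://example.com/"])   # fed by the URL server
    repository = []                              # written by the store server

    while url_queue:
        url = url_queue.popleft()        # URL server hands out a URL
        content = fetch_page(url)        # a crawler retrieves it
        repository.append((url, zlib.compress(content)))  # store server compresses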
Google Anatomy
Indexer
– Reads and uncompresses content in the repository
– Parses documents for word occurrences (“hits”) and links
– Records the word, position in document, relative font size, and capitalization for each hit (see the sketch below)
– Hits are divided into barrels
– Records the anchor text plus from/to information for each link in an Anchors file
– Generates a lexicon (vocabulary list)
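A minimal sketch of hit extraction, assuming a pre-tokenized document; font size and barrel assignment are omitted, and all names here are illustrative.

    # Turn a tokenized document into hits while growing the lexicon.
    def index_document(doc_id, tokens, lexicon):
        """tokens: words in document order; lexicon: word -> word ID."""
        hits = []
        for position, word in enumerate(tokens):
            word_id = lexicon.setdefault(word.lower(), len(lexicon))
            hits.append((doc_id, word_id, position, word[0].isupper()))
        return hits

    lexicon = {}
    print(index_document(7, ["Google", "indexes", "the", "Web"], lexicon))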
Google Anatomy
URL Resolver
– Converts relative URLs into absolute URLs
– Generates a document ID for each absolute URL
– Puts the anchor text into the forward index of the page the link points to
– Generates a database of links as pairs of document IDs (see the sketch below)
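A minimal sketch of the resolver in Python, assuming document IDs are assigned in order of first sight; urllib.parse.urljoin does the relative-to-absolute conversion.

    from urllib.parse import urljoin

    doc_ids = {}      # absolute URL -> document ID
    links_db = []     # (from_docID, to_docID) pairs

    def doc_id(url):
        return doc_ids.setdefault(url, len(doc_ids))

    def resolve_link(base_url, href):
        absolute = urljoin(base_url, href)   # "../about.html" -> full URL
        links_db.append((doc_id(base_url), doc_id(absolute)))

    resolve_link("http://example.com/dir/page.html", "../about.html")
    print(links_db)   # [(0, 1)]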
Google Anatomy
Sorter
– Re-sorts the barrels (in place) by word ID to generate the inverted index
– Produces a list of word IDs and offsets into the inverted index (see the sketch below)
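A minimal sketch of that re-sort, with small in-memory lists standing in for barrels; the entry layout and names are illustrative.

    forward_barrel = [        # (doc_id, word_id, hit_position)
        (7, 3, 0), (7, 9, 1), (12, 3, 4), (12, 1, 0),
    ]

    # Re-sort by word ID to turn the forward barrel into an inverted one.
    inverted_barrel = sorted(forward_barrel, key=lambda e: (e[1], e[0]))

    word_offsets = []         # (word_id, first offset in inverted_barrel)
    for i, (doc, word, pos) in enumerate(inverted_barrel):
        if not word_offsets or word_offsets[-1][0] != word:
            word_offsets.append((word, i))

    print(inverted_barrel)    # entries now grouped by word ID
    print(word_offsets)       # [(1, 0), (3, 1), (9, 3)]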
Google Anatomy
PageRank
– Generates page rankings based on links (see the PageRank sketch earlier)
DumpLexicon
– Takes the lexicon and the inverted index and generates the lexicon used by the Searcher
Google Data Structures
BigFiles
– Create a virtual file system spanning multiple file systems
– Addressable by 64-bit integers
Repository
– Page contents compressed using zlib to balance speed against compression ratio
– Stores the document ID, length, and URL prefixed to each document
– Documents stored sequentially (see the sketch below)
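A minimal sketch of sequential repository records, using a simplified fixed header (docID, URL length, compressed length); the real record format is richer, so treat this layout as an assumption.

    import struct, zlib

    def write_record(f, doc_id, url, html):
        body = zlib.compress(html)
        u = url.encode()
        f.write(struct.pack("<QII", doc_id, len(u), len(body)))  # 16-byte header
        f.write(u)
        f.write(body)

    def read_records(f):
        while header := f.read(16):      # records are packed back-to-back
            doc_id, ulen, blen = struct.unpack("<QII", header)
            url = f.read(ulen).decode()
            html = zlib.decompress(f.read(blen))
            yield doc_id, url, html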
Google Data Structures
Document Index
– Ordered by document ID
– Includes document status, a pointer into the repository, a document checksum, and other statistics
Lexicon
– 14 million words (rare words not included) concatenated together with separating nulls
– A hash table of pointers into the word list (see the sketch below)
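A minimal sketch of the null-separated lexicon, assuming the hash table maps each word to its byte offset in the concatenated blob.

    words = ["anatomy", "google", "pagerank"]

    lexicon = b""
    offsets = {}                       # word -> offset of its first byte
    for w in words:
        offsets[w] = len(lexicon)
        lexicon += w.encode() + b"\0"  # words separated by nulls

    def lookup(word):
        start = offsets[word]
        end = lexicon.index(b"\0", start)
        return lexicon[start:end].decode()

    print(lookup("google"))            # "google", read back from the blob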
Google Data Structures
Hit Lists
– Each hit is encoded in two bytes
– Plain and fancy hits
– Plain hits include:
  1 capitalization bit
  3 font-size bits
  12 position bits
– Fancy hits (identified by 111 in the font-size bits) include:
  4 bits for hit type
  8 bits for position
– Anchor-type hits have:
  4 bits for a hash of the anchor’s document ID
  4 bits for position
(A bit-packing sketch follows below.)
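A minimal sketch of packing a plain hit into two bytes using the layout above; font value 7 (binary 111) is left reserved for fancy hits.

    def pack_plain_hit(cap, font, position):
        assert font < 7 and position < 4096   # 3-bit font, 12-bit position
        return (cap & 1) << 15 | (font & 0b111) << 12 | position

    def unpack_plain_hit(hit):
        return hit >> 15, (hit >> 12) & 0b111, hit & 0xFFF

    hit = pack_plain_hit(cap=1, font=2, position=300)
    print(f"{hit:016b}", unpack_plain_hit(hit))   # (1, 2, 300)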
Google Data Structures
Forward Index
– Stored in 64 barrels
– Each barrel holds a range of word IDs
– A barrel entry holds a document ID and the list of word IDs that belong in that barrel
– Document IDs are duplicated across barrels
– Word IDs are stored as differences from the barrel’s minimum word ID (24 bits per word; see the sketch below)
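A minimal sketch of that delta trick, assuming (for simplicity) that every barrel covers an equal-width range of word IDs.

    NUM_BARRELS = 64
    WORDS_PER_BARREL = 1 << 24          # 24-bit offsets within a barrel

    def encode(word_id):
        barrel = word_id // WORDS_PER_BARREL
        assert barrel < NUM_BARRELS     # word ID fits the 64-barrel layout
        offset = word_id - barrel * WORDS_PER_BARREL   # fits in 24 bits
        return barrel, offset

    def decode(barrel, offset):
        return barrel * WORDS_PER_BARREL + offset

    barrel, offset = encode(40_000_000)
    assert decode(barrel, offset) == 40_000_000
    print(barrel, offset)               # 2 6445568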
Google Data Structures
Inverted Index
– Same barrels as the forward index
– The lexicon points to the barrel and to a list of document IDs with their hit lists
Two sets of barrels
– One for title and link-anchor text hits: searched first
– One for the rest: used if the first set does not yield enough hits
(A sketch of this two-tier lookup follows below.)
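A minimal sketch of the two-tier lookup, with plain dicts standing in for the short (title/anchor) and full barrels; all values are illustrative.

    short_barrels = {"google": [1, 5]}           # word -> docIDs (title/anchor)
    full_barrels = {"google": [1, 2, 5, 9, 14]}  # word -> docIDs (all hits)

    def lookup(word, want=3):
        docs = short_barrels.get(word, [])   # try high-quality hits first
        if len(docs) < want:                 # not enough: fall back to the rest
            docs = full_barrels.get(word, [])
        return docs[:want]

    print(lookup("google"))                  # falls through to [1, 2, 5]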
Crawling, Indexing, Searching
Web sites were surprised to be crawled in 1997
– Required people to deal with email questions
Parsing HTML is a challenge
– Google does not use YACC to generate a CFG parser: too much overhead (too slow)
– It uses flex to generate a lexical analyzer
Ranking
– Uses a vector of count weights and a vector of type weights (see the sketch below)
– Biases toward proximity of the search terms
– Combines with PageRank to give the final rank
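A minimal sketch of the single-word IR score as a dot product of count weights and type weights; the taper (extra occurrences stop helping past a cap) and the weight values are illustrative guesses, not the paper's numbers.

    TYPE_WEIGHTS = {"title": 10, "anchor": 8, "plain": 1}   # assumed values

    def count_weight(count, cap=8):
        return min(count, cap)      # counts beyond the cap add nothing

    def ir_score(hit_counts):
        """hit_counts: hit type -> number of hits of that type."""
        return sum(count_weight(n) * TYPE_WEIGHTS[t]
                   for t, n in hit_counts.items())

    # One title hit outweighs many plain-text hits: 1*10 + 8*1 = 18.
    print(ir_score({"title": 1, "plain": 12}))

How this score is blended with PageRank and the proximity bias to produce the final rank is not specified on the slide.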