Published by Morris Hubbard
1
Google: Case Study
cs430 lecture 15, 03/13/01
Kamen Yotov
2
03/08/2001 Google: Case Study

Introduction: What’s new?
- Amount of web information is growing
- Number of inexperienced users is growing
- Surfers are willing to start from indices like Yahoo!
  - Expensive to build and maintain
  - Slow to improve
  - Cannot cover all topics!
- Google: a large-scale search engine
  - Name from “googol” = 10^100
  - Heavy use of additional structure = quality results
3
Introduction (continued…)
- Search engine technology has to scale
- Server requests have to scale similarly…
- Technology advances help… but not so much!
  - E.g. disk seek time, operating system problems
- Expect the cost of indexing/storing text/HTML to drop relative to the amount of information available!
4
Main goals: Quality, Quality, …
- Completeness of the index is just one factor
- Lots of junk in the results
- Number of documents increases exponentially, but user ability does not!
- High precision is very important!
- Link structure and link text are valuable
- … Not much information; commercial!
5
Features: PageRank
- Heavy use of the link structure
- Performs well even when indexing only the titles
- Counting links to a page
- Weights on the sources
- Page A has pages T_i pointing to it
- d: damping factor
- C(A): number of links out of A
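The symbols above are those of the PageRank recurrence, usually written as PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + … + PR(T_n)/C(T_n)). The following is a minimal sketch of iterating that recurrence on a tiny graph. The function name `pagerank`, the default d = 0.85, and the fixed iteration count are illustrative assumptions, not taken from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    d and iterations are assumed defaults, chosen for illustration.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of links going out of page T.
    out_count = {page: len(links.get(page, [])) for page in pages}

    # For every page A, collect the pages T_i that point to it.
    incoming = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        rank = {
            page: (1 - d) + d * sum(rank[t] / out_count[t] for t in incoming[page])
            for page in pages
        }
    return rank
```

For example, with `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}`, page C ends up ranked above page B, because C is linked from both A and B while B only receives half of A's weight.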
6
Related Work: Applicability
- Information retrieval
  - Size does matter!
  - Large IR corpora are small by the standards of Web search (20GB vs. 147GB)
  - Vector methods often tend to return short documents
  - Argument: users should specify more concretely what they search for! Google: disagree!
- Other differences from controlled collections
  - No restrictions on format or language, no control
  - Extended meta information
7
From Inside…
- Mostly C/C++
- Solaris/Linux
- Module-based architecture
- Multi-machine
- Multi-threaded
- Resource dedication
8
Major Structures
- BigFiles
  - Span several file systems
  - 64-bit addressed
  - Descriptor management
  - Compression
- Document index
  - ISAM (Indexed Sequential Access Method), ordered by docID
  - Pointer to Repository, Status, Statistics
  - Pointer to URL and Title in docinfo file if crawled
- URL to docID conversion (checksum)
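One way to picture the checksum-based URL to docID conversion is a table of (checksum, docID) pairs kept sorted by checksum and searched with binary search. This is a sketch under assumptions: the slide only says "checksum", so the sorted-table layout, the use of CRC32 as the checksum, and the names `url_checksum` and `UrlToDocId` are all hypothetical. Checksum collisions are ignored here.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC32 is a stand-in; the slides do not name the checksum function.
    return zlib.crc32(url.encode("utf-8"))


class UrlToDocId:
    """Hypothetical lookup table: checksums sorted, docIDs stored alongside."""

    def __init__(self, url_docid_pairs):
        entries = sorted((url_checksum(url), doc_id) for url, doc_id in url_docid_pairs)
        self._checksums = [checksum for checksum, _ in entries]
        self._doc_ids = [doc_id for _, doc_id in entries]

    def lookup(self, url):
        # Binary search for the checksum; return None if the URL is unknown.
        checksum = url_checksum(url)
        i = bisect.bisect_left(self._checksums, checksum)
        if i < len(self._checksums) and self._checksums[i] == checksum:
            return self._doc_ids[i]
        return None
```

Keeping the table sorted means a batch of new URLs can also be converted by sorting their checksums and merging, instead of doing one lookup per link.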
9
Major Structures (continued)
- Repository
  - Zlib compressed
  - docID, Length, URL
  - Self-consistent data
- Lexicon
  - Memory resident
  - List of words and a hash table of pointers
- Other auxiliary information… (out of scope)
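A repository record of this kind can be sketched as a small fixed header followed by the URL and the zlib-compressed page. The exact on-disk layout is not given in the slides, so the header format `<QII`, the field order, and the names `pack_record` and `unpack_record` are assumptions made for illustration.

```python
import struct
import zlib

# Assumed header: docID (8 bytes), URL length (4 bytes), compressed length (4 bytes).
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(buffer, offset=0):
    """Return (doc_id, url, html, next_offset) for the record at offset."""
    doc_id, url_len, body_len = HEADER.unpack_from(buffer, offset)
    url_start = offset + HEADER.size
    body_start = url_start + url_len
    url = buffer[url_start:body_start].decode("utf-8")
    html = zlib.decompress(buffer[body_start:body_start + body_len]).decode("utf-8")
    return doc_id, url, html, body_start + body_len
```

Because every record carries its own docID, lengths, and URL, the records can be read back one after another without consulting any other structure, which is one reading of "self-consistent data".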
10
Major Structures (continued 2)
- Hit Lists
  - Occurrence of a word in a document + typesetting information (hand-encoded)
  - Take up most of the space of all indices
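Hand-encoding a hit typically means packing its fields into a few bits by hand. The sketch below packs one hit into 16 bits: 1 capitalization bit, 3 bits of font size, and 12 bits of word position. These bit widths and the names `encode_hit` and `decode_hit` are assumptions for illustration; the slide itself only says the typesetting information is hand-encoded.

```python
POSITION_BITS = 12
FONT_BITS = 3
MAX_POSITION = (1 << POSITION_BITS) - 1  # 4095
MAX_FONT = (1 << FONT_BITS) - 1          # 7


def encode_hit(capitalized, font_size, position):
    """Pack one hit into a 16-bit integer (assumed layout: cap | font | position)."""
    if not 0 <= font_size <= MAX_FONT:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    # Positions past the representable range are clamped to the maximum.
    position = min(position, MAX_POSITION)
    return (int(capitalized) << 15) | (font_size << POSITION_BITS) | position


def decode_hit(hit):
    """Inverse of encode_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> POSITION_BITS) & MAX_FONT, hit & MAX_POSITION
```

Since hit lists dominate index size, saving even a byte per hit matters, which is the usual motivation for a compact hand-rolled encoding over a general-purpose one.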
11
Major Structures (continued 3)
- Forward Index
  - Partially sorted
  - Stored in a number of barrels
  - Each barrel holds a range of wordIDs + hit lists
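As a rough sketch of how a document could be distributed over forward barrels: each word goes to the barrel whose wordID range contains it, and each barrel records the docID followed by that document's (wordID, hit list) entries. The fixed-width ranges (`barrel_size`) and the function name `add_document` are assumptions; the slides do not say how the wordID ranges are chosen.

```python
from collections import defaultdict


def add_document(barrels, doc_id, word_hits, barrel_size=1000):
    """Append one document to the forward barrels.

    barrels: dict barrel number -> list of (docID, [(wordID, hits), ...])
    word_hits: dict wordID -> list of hits for this document
    barrel_size: assumed width of each barrel's wordID range
    """
    per_barrel = defaultdict(list)
    for word_id, hits in word_hits.items():
        per_barrel[word_id // barrel_size].append((word_id, hits))

    for barrel_number, entries in per_barrel.items():
        # Entries are ordered by wordID within the document, but documents
        # arrive in crawl order, so a barrel is only partially sorted.
        entries.sort(key=lambda entry: entry[0])
        barrels[barrel_number].append((doc_id, entries))
```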
12
Major Structures (continued 4)
- Inverted Index
  - Same barrels, but processed by the sorter
  - Doclists are not stored sorted by ranking of occurrence, for the sake of speed
  - Two sets of inverted barrels
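The sorter's job can be sketched as regrouping a forward barrel by wordID so that each word gets a doclist. In this sketch the doclists are ordered by docID, which is an assumption consistent with not sorting by ranking; the function name `invert_barrel` and the in-memory representation are hypothetical.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn [(docID, [(wordID, hits), ...]), ...] into {wordID: [(docID, hits), ...]}.

    Doclists are sorted by docID (assumed), and words are emitted in wordID order.
    """
    postings = defaultdict(list)
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            postings[word_id].append((doc_id, hits))

    return {
        word_id: sorted(doclist, key=lambda posting: posting[0])
        for word_id, doclist in sorted(postings.items())
    }
```

Ordering every doclist by docID is what makes multi-word queries cheap: several doclists can be merged in a single pass.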
13
Crawling the Web
- We talked about this before…
- Fragile, beyond our control
- Implemented in Python
- Internal DNS cache for each crawler
- Social issues
  - Phone calls, support
  - Preventing indexing
- Virtually unable to debug… just test!
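A per-crawler DNS cache can be as simple as a dictionary in front of the system resolver, so that repeated fetches from the same host skip the lookup. This is a minimal sketch: the class name `DnsCache` and the injectable `resolver` parameter are assumptions, and entries never expire here, which a real crawler would need to handle.

```python
import socket


class DnsCache:
    """Hypothetical in-process DNS cache, one instance per crawler."""

    def __init__(self, resolver=socket.gethostbyname):
        # The resolver is injectable so the cache can be exercised offline.
        self._resolver = resolver
        self._cache = {}

    def resolve(self, host):
        host = host.lower()  # host names are case-insensitive
        if host not in self._cache:
            self._cache[host] = self._resolver(host)
        return self._cache[host]
```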
14
Indexing the Web
- Parsing problems
  - Errors in HTML
  - Non-ASCII characters
  - Home-grown parser (not YACC)
- Indexing documents into barrels
  - Shared lexicon: too much locking
  - Log file of new words… processed at the end
- Sorting
15
Searching
1. Parse the query
2. Convert words to wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through until a document that matches all terms is encountered
5. Compute the rank of that document
6. Repeat the same thing for the full barrel
7. Sort the matched documents by rank and return the first few
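The seven steps above can be sketched in a few lines. Everything concrete here is an assumption for illustration: barrels are modeled as dictionaries mapping wordID to {docID: hit count}, the rank is just the total hit count summed over both barrels, and a set intersection stands in for the sequential scan of step 4.

```python
def search(query, lexicon, short_barrel, full_barrel, top_k=10):
    """Return up to top_k docIDs matching every query word, best first."""
    words = query.lower().split()                      # 1. parse the query
    try:
        word_ids = [lexicon[word] for word in words]   # 2. words -> wordIDs
    except KeyError:
        return []                                      # unknown word: no match
    if not word_ids:
        return []

    scores = {}
    for barrel in (short_barrel, full_barrel):         # 3. short barrel, 6. full barrel
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        # 4. documents that appear in the doclist of every term
        matching = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in matching:
            # 5. placeholder rank: total number of hits across all terms
            rank = sum(doclist[doc_id] for doclist in doclists)
            scores[doc_id] = scores.get(doc_id, 0) + rank

    # 7. sort by rank (ties broken by docID) and return the first few
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]
```

A real ranking function would combine signals such as PageRank, anchor text, and proximity rather than a raw hit count.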
16
Results and Performance
- Quality of results
- Manual ranking
- Sorting
- PageRank
- Anchor text
- Proximity
- Broken links

Query: bill clinton

http://www.whitehouse.gov/ 100.00% (no date) (0K)
  http://www.whitehouse.gov/
Office of the President 99.67% (Dec 23 1996) (2K)
  http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
Welcome To The White House 99.98% (Nov 09 1997) (5K)
  http://www.whitehouse.gov/WH/Welcome.html
Send Electronic Mail to the President 99.86% (Jul 14 1997) (5K)
  http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
mailto:president@whitehouse.gov 99.98%
mailto:President@whitehouse.gov 99.27%
The "Unofficial" Bill Clinton 94.06% (Nov 11 1997) (14K)
  http://zpub.com/un/un-bc.html
Bill Clinton Meets The Shrinks 86.27% (Jun 29 1997) (63K)
  http://zpub.com/un/un-bc9.html
President Bill Clinton - The Dark Side 97.27% (Nov 10 1997) (15K)
  http://www.realchange.org/clinton.htm
$3 Bill Clinton 94.73% (no date) (4K)
  http://www.gatewy.net/~tjohnson/clinton1.html
17
Performance
- Storage
  - Scales with the size of the Web
  - Repository is comparatively small
  - Good/fast compression/decompression
- System
  - Crawling, Indexing, Sorting
  - Last two run simultaneously
- Searching
  - Bounded by disk I/O over LAN (NFS)
18
Conclusion
- Google: scalable search engine
- Complete architecture
- Many research ideas arise
  - Always something to improve
  - Matter of time
- High-quality search is the dominant factor