1 Google: Case Study cs430 lecture 15 03/13/01 Kamen Yotov.

Presentation transcript:

Slide 1: Google: Case Study (cs430 lecture 15, 03/13/01, Kamen Yotov)

Slide 2 (slide footer dated 03/08/2001): Introduction: What's new?
- The amount of information on the web is growing
- The number of inexperienced users is growing
- Surfers are willing to start from indices like Yahoo!
  - Expensive to build and maintain; slow to improve; cannot cover all topics!
- Google: a large-scale search engine
  - Name comes from "googol" = 10^100
  - Makes heavy use of additional structure to produce quality results

Slide 3: Introduction (continued…)
- Search engine technology has to scale
- Server requests have to scale similarly…
- Technology advances help… but not so much!
  - E.g. disk seek time, operating system problems
- Expect the cost of indexing/storing text/HTML to drop relative to the amount of information available!

Slide 4: Main goals: Quality, Quality,…
- Completeness of the index is just one factor
- Lots of junk in the results
- The number of documents increases exponentially, but user ability does not!
- High precision is very important!
- Link structure & link text are valuable
- … Not much information; Commercial!

Slide 5: Features: PageRank
- Heavy use of the link structure
- Performs well even when indexing only the titles
- Counting links to a page
- Weights on the sources
- Page A has pages T_i pointing to it
- d: damping factor
- C(A): number of links out of A
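The formula on this slide did not survive extraction. In the Brin and Page paper that this lecture follows, PageRank is defined with the same symbols as PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n)). The following is a minimal Python sketch of iterating that definition on a toy graph. The graph, the iteration count, the function name, and the default d = 0.85 are illustrative assumptions, not taken from the lecture.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (assumed input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {page: 1.0 for page in pages}  # arbitrary starting value
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # in this sketch, pages without out-links pass nothing on
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C collects links from both A and B, so it ends up with the highest rank, while B, which receives only half of A's weight, ends up lowest.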

Slide 6: Related Work: Applicability
- Information retrieval
  - Size does matter!
  - Large corpora are small by the standards of Web search (20GB/147GB)
  - Vector methods often tend to return short documents
  - Argument: users should specify more concretely what they search for! Google: disagree!
- Other differences from controlled collections
  - No format or language restrictions, no control
  - Extended meta information

Slide 7: From Inside…
- Mostly C/C++
- Solaris/Linux
- Module-based architecture
- Multi-machine
- Multi-thread
- Resource dedication

Slide 8: Major Structures
- BigFiles
  - Span several file systems
  - 64-bit addressed
  - Descriptor management
  - Compression
- Document index
  - ISAM (Index sequential access mode), ordered by docID
  - Pointer to Repository, Status, Statistics
  - Pointer to URL and Title in docinfo file if crawled
  - URL to docID conversion (checksum)

Slide 9: Major Structures (continued)
- Repository
  - Zlib compressed
  - docID, Length, URL
  - Self-consistent data
- Lexicon
  - Memory resident
  - List of words and a hash-table of pointers
- Other auxiliary information… (out of scope)
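One way to picture the repository is as a sequence of records, each carrying its own docID, lengths, URL, and a zlib-compressed page, so the file can be re-read without any other structure. The sketch below assumes a record layout (field order and widths in HEADER) and function names that are purely illustrative; the slide does not specify them.

```python
import struct
import zlib

# Assumed header layout: 8-byte docID, 2-byte URL length, 4-byte compressed page length.
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Build one self-describing repository record."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_records(blob):
    """Walk the repository from the start, using only the lengths stored in each record."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


# Hypothetical two-document repository.
repo = (pack_record(1, "http://a.example/", "<html>one</html>")
        + pack_record(2, "http://b.example/", "<html>two</html>"))
docs = list(unpack_records(repo))
```

Because every record stores its own lengths, the other structures can in principle be rebuilt from the repository alone, which is one reading of "self-consistent data".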

Slide 10: Major Structures (continued 2)
- Hit Lists
  - Word in a document + typesetting information (hand-encoded)
  - Take most of the space of all indices
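"Hand-encoded" here means packing each hit into a few bits by hand rather than using a general-purpose encoding. A rough Python sketch follows, assuming a 16-bit hit with 1 bit for capitalization, 3 bits for relative font size, and 12 bits for word position. Those widths follow my reading of the plain-hit layout in the Brin and Page paper and should be treated as an assumption, since the slide gives no numbers; the function names are hypothetical.

```python
MAX_POSITION = 0xFFF  # largest position that fits in 12 bits (assumed width)


def encode_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: [capitalized:1][font_size:3][position:12]."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)  # clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & MAX_POSITION


hit = encode_hit(True, 3, 42)
```

Two bytes per hit matters because, as the slide notes, hit lists take up most of the space of all indices.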

Slide 11: Major Structures (continued 3)
- Forward Index
  - Partially sorted
  - Stored in a number of barrels
  - Each barrel holds a range of wordIDs + hitlist

Slide 12: Major Structures (continued 4)
- Inverted Index
  - Same barrels, but processed by the sorter
  - Not stored in order of ranking of occurrences, for the sake of speed
  - Two sets of inverted barrels
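The relationship between the forward and inverted barrels can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the input format, the fixed barrel_width used to assign wordID ranges, and both function names are hypothetical, and real barrels would live on disk.

```python
from collections import defaultdict


def build_forward_barrels(docs, barrel_width):
    """docs: {docID: [(wordID, position), ...]}.

    Returns {barrel number: [(docID, wordID, [positions]), ...]}, where each
    barrel covers a range of wordIDs and entries are grouped by docID.
    """
    barrels = defaultdict(list)
    for doc_id in sorted(docs):
        hits_by_word = defaultdict(list)
        for word_id, position in docs[doc_id]:
            hits_by_word[word_id].append(position)
        for word_id, hits in hits_by_word.items():
            barrels[word_id // barrel_width].append((doc_id, word_id, hits))
    return barrels


def invert_barrel(forward_entries):
    """The 'sorter' step: reorder one barrel by wordID to get wordID -> doclist."""
    inverted = defaultdict(list)
    for doc_id, word_id, hits in sorted(forward_entries, key=lambda e: (e[1], e[0])):
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Hypothetical documents: docID -> list of (wordID, position).
toy_docs = {1: [(0, 0), (5, 1), (0, 2)], 2: [(5, 0), (12, 1)]}
forward = build_forward_barrels(toy_docs, barrel_width=10)
inverted_0 = invert_barrel(forward[0])
```

The forward barrel is only partially sorted (by barrel, then by document); sorting each barrel by wordID turns it into an inverted barrel without moving data between barrels.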

Slide 13: Crawling the Web
- We talked before…
- Fragile, beyond our control
- Implemented in Python
- Internal DNS cache for each crawler
- Social issues
  - Phone calls, support
  - Preventing indexing
- Virtually unable to debug… just test!
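The per-crawler DNS cache can be illustrated with a very small sketch: remember each host's address after the first lookup so that repeated fetches from the same site skip the resolver. The class name and the injectable resolver parameter are assumptions made for illustration, and a real crawler would also need expiry and error handling.

```python
import socket


class DNSCache:
    """Per-crawler cache of hostname -> IP address lookups."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # injectable so it can be replaced in tests
        self._cache = {}

    def resolve(self, host):
        # Only the first request for a host reaches the resolver.
        if host not in self._cache:
            self._cache[host] = self._resolver(host)
        return self._cache[host]
```

Keeping one cache per crawler, as the slide describes, avoids any coordination between crawler processes at the cost of some duplicate lookups.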

Slide 14: Indexing the Web
- Parsing problems
  - Errors in HTML
  - Non-ASCII characters
  - Home-grown parser (not YACC)
- Indexing documents into barrels
  - Shared lexicon: too much locking
  - Log file of new words… processed at end
- Sorting

Slide 15: Searching
1. Parse the query
2. Convert words to wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through until a document that matches all terms is encountered
5. Compute the rank of that document
6. Repeat the same thing for the full barrel
7. Sort the documents matched by rank and return the first few
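The seven steps above can be sketched as a short Python function. This is a simplification under stated assumptions: barrels are modeled as in-memory dictionaries of the form {wordID: {docID: hits}}, set intersection stands in for the sequential scan over doclists, and the lexicon, barrels, and rank function are hypothetical inputs supplied by the caller.

```python
def search(query, lexicon, short_barrel, full_barrel, rank, top_k=10):
    words = query.lower().split()                      # 1. parse the query
    if not words or any(w not in lexicon for w in words):
        return []                                      # an unknown word matches nothing
    word_ids = [lexicon[w] for w in words]             # 2. words -> wordIDs
    scores = {}
    for barrel in (short_barrel, full_barrel):         # 6. short barrel, then full barrel
        doclists = [barrel.get(w, {}) for w in word_ids]       # 3. one doclist per word
        matching = set(doclists[0]).intersection(*doclists[1:])  # 4. docs with all terms
        for doc_id in matching:
            hits = [doclist[doc_id] for doclist in doclists]
            score = rank(doc_id, hits)                 # 5. rank the document
            scores[doc_id] = max(score, scores.get(doc_id, score))
    # 7. sort by rank (ties broken by docID) and return the first few
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


def toy_rank(doc_id, hits):
    """Placeholder ranking: documents whose first hit comes earlier score higher."""
    return 1.0 / (1 + min(h[0] for h in hits))


toy_lexicon = {"bill": 0, "clinton": 1}
toy_short = {0: {1: [0]}, 1: {1: [1]}}
toy_full = {0: {1: [3], 2: [0], 3: [5]}, 1: {1: [4], 2: [1]}}
results = search("bill clinton", toy_lexicon, toy_short, toy_full, toy_rank)
```

Document 3 contains only one of the two query words, so it is dropped in step 4; documents 1 and 2 both match and are returned in rank order.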

Slide 16: Results and Performance
- Quality of results
- Manual ranking
- Sorting
- PageRank
- Anchor text
- Proximity
- Broken links

Query: bill clinton
  % (no date) (0K)
  Office of the President 99.67% (Dec ) (2K)
  Welcome To The White House 99.98% (Nov ) (5K)
  Send Electronic Mail to the President 99.86% (Jul ) (5K)
  % 99.27%
  The "Unofficial" Bill Clinton 94.06% (Nov ) (14K)
  Bill Clinton Meets The Shrinks 86.27% (Jun ) (63K)
  President Bill Clinton - The Dark Side 97.27% (Nov ) (15K)
  $3 Bill Clinton 94.73% (no date) (4K)

Slide 17: Performance
- Storage
  - Scales with the size of the Web
  - Repository is comparatively small
  - Good/fast compression/decompression
- System
  - Crawling, Indexing, Sorting
  - Last two simultaneously
- Searching
  - Bounded by disk IO over LAN (NFS)

Slide 18: Conclusion
- Google: Scalable search engine
- Complete architecture
- Many research ideas arise
- Always something to improve
- Matter of time
- High quality search is the dominant factor