The Anatomy of a Large-Scale Hypertextual Web Search Engine
By Sergey Brin and Lawrence Page
Presented by Joshua Haley, Zeyad Zainal, Michael Lopez, Michael Galletti, Britt Phillips, Jeff Masson
Searching in the 90’s
Search engine technology had to keep up with the rapid growth of the web.
Google Will Scale
They wanted a search engine that:
– Crawls the web quickly
– Uses storage space efficiently
– Processes indexes quickly
– Handles queries quickly
They had to deal with scaling difficulties:
– Disk seek times and OS robustness are not scaling as well as hardware performance and cost
The Google Goals
Improve search quality
– Remove junk results (prioritize results)
Academic search engine research
– Create literature on the subject of search engine databases
Gather usage data
– The databases can support further research
Support novel research activities on web data
System Features
Two important features help Google produce high-precision results:
– PageRank
– Anchor text
PageRank
The graph structure of hyperlinks had not been used by other search engines.
Google’s citation graph contains 518 million hyperlinks.
Text matching restricted to page titles performs well once pages are prioritized by PageRank.
Results are similar when matching against entire pages.
PageRank Formula
Not all pages linking to a page are counted equally.
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– A: the page being ranked
– T1…Tn: pages that link to A
– C(A): number of links going out of page A
– d: “damping factor” (usually set to around 0.85)
PageRank for 26 million pages can be computed in a few hours.
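A minimal sketch of the iterative fixed-point computation the formula defines. The graph, damping factor, tolerance, and iteration cap below are illustrative choices, not the paper’s actual implementation.

# PageRank sketch following PR(A) = (1-d) + d * sum(PR(T)/C(T) over pages T linking to A).
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}              # start every page at 1
    out_degree = {p: len(links.get(p, [])) for p in pages}

    for _ in range(max_iter):
        new_pr = {p: (1 - d) for p in pages}  # the (1 - d) term
        for src, targets in links.items():
            if not targets:
                continue
            share = d * pr[src] / out_degree[src]
            for t in targets:                 # propagate src's rank to the pages it links to
                new_pr[t] += share
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            return new_pr
        pr = new_pr
    return pr

# Example: B and C both link to A, and A links back to B.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))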
Intuitive Justification
A page can have a high PageRank if many pages link to it, or if a page that links to it has a high PageRank itself (e.g., Yahoo News).
– A page is unlikely to be linked from such a source if it is low quality or if the link is broken.
PageRank handles these cases by propagating the weights of pages through the link structure.
Anchor Text
Anchors often provide more accurate descriptions of a page than the page itself.
Anchors exist for documents that are not text-based (e.g., images, videos).
Google indexed more than 259 million anchors from just 24 million pages.
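A tiny sketch of the idea: anchor text found on a source page is credited to the page the link points to, so even non-text targets become searchable. The names and URLs are purely illustrative.

from collections import defaultdict

anchor_index = defaultdict(list)   # target URL -> list of anchor phrases

def record_anchor(source_url, target_url, anchor_text):
    # The anchor text is credited to the *target* page, not the page it appears on.
    anchor_index[target_url].append(anchor_text)

record_anchor("http://example.com/reviews", "http://example.com/cat.jpg", "photo of my cat")
print(anchor_index["http://example.com/cat.jpg"])   # ['photo of my cat']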
Other Features
Words in a larger or bold font carry more weight than other words.
Related Work
Early Search Engines
The World Wide Web Worm (WWWW)
– One of the first web search engines (developed in 1994)
– Had a database of 300,000 multimedia objects
Some early search engines retrieved results by post-processing the results of other search engines.
Information Retrieval
The science of searching for documents, for information within documents, and for metadata about documents.
Most research is on small collections, such as scientific papers or news stories on a related topic.
The Text Retrieval Conference (TREC) is the primary benchmark for information retrieval
– Uses small, well-controlled collections for its benchmarks
– Even its “Very Large Corpus” benchmark is only 20 GB
Information Retrieval
Techniques that work well on TREC often do not produce good results on the web
– e.g., a search for “Bill Clinton” could top-rank a page that only says “Bill Clinton Sucks” and shows a picture of him.
Brin and Page believe that a search for “Bill Clinton” should return reasonable results because there is so much information on the topic.
Standard information retrieval work needs to be extended to deal effectively with the web.
Differences Between the Web and Well Controlled Collections
Web documents differ internally in their language, vocabulary, type or format, and may even be machine generated.
External meta information is information that can be inferred about a document but is not contained within it.
– e.g., reputation of the source, update frequency, quality, popularity
A heavily visited page like Yahoo needs to be treated differently than an obscure page that receives one view every ten years.
Differences Between the Web and Well Controlled Collections
There is no control over what people can put on the web.
Some companies manipulate search engines to route traffic for profit.
Metadata efforts have largely failed for web search engines: a manipulated engine can return pages that have nothing to do with the query.
System Anatomy
High-level discussion of the architecture
Descriptions of the data structures
– Repository
– Lexicon
– HitLists
– Forward and inverted indices
Major applications
– Crawling
– Indexing
– Searching
Google Architecture Overview
Implemented in C and C++
– Runs efficiently on Linux and Solaris
Many distributed web crawlers
– Receive lists of URLs to crawl from the URL Server
Crawlers send pages to the Store Server
– Compressed pages are stored in the Repository
– Each page is stored with its docID
Indexer
– Converts documents from the Repository into hit lists
– Sends hit lists to the Barrels
– Writes links out to an anchors file
Google Architecture Overview
URL Resolver
– Reads the anchors file
– Converts URLs to docIDs and sends them to the Barrels
– Stores pairs of docIDs in the Links database
Sorter
– The Barrels are presorted by docID (the forward index)
– Re-sorts them by wordID to create the inverted index
– Dumps a list of the associated wordIDs for the Lexicon
Lexicon
– Keeps the list of words
Searcher
– Uses the Lexicon, the inverted index, and PageRank to answer queries
Repository
BigFiles
– Virtual files spanning multiple file systems
– Needed because operating systems did not provide enough for the system’s needs
Repository contents
– Contains the full HTML of every web page
– Compression decision: bzip offers about 4:1 compression; zlib offers about 3:1 but is faster
– Opted for speed over compression ratio
Repository access
– No additional data structures are necessary to read it
– Reduces complexity
– All other data structures can be rebuilt from the Repository
[Figure: repository data structure — 53.5 GB compressed; each packet is stored as (sync, length, compressed packet); an uncompressed packet holds (docid, ecode, urllen, pagelen, url, page)]
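A sketch of writing one repository packet (docID, URL, page) with zlib, mirroring the “speed over ratio” choice. The field order, field widths, and sync marker here are illustrative stand-ins, not the actual on-disk format.

import struct, zlib

SYNC = b"\x00\x00\xff\xff"          # illustrative sync marker between packets

def write_packet(f, docid, url, html):
    url_b, page_b = url.encode(), html.encode()
    # Illustrative header: docid (4 bytes), url length (2 bytes), page length (4 bytes).
    payload = struct.pack("<IHI", docid, len(url_b), len(page_b)) + url_b + page_b
    compressed = zlib.compress(payload)              # zlib: ~3:1 ratio, but fast
    f.write(SYNC + struct.pack("<I", len(compressed)) + compressed)

with open("repository.dat", "wb") as f:
    write_packet(f, 1, "http://example.com/", "<html>hello</html>")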
Document Index and Lexicon
Document Index
– Stores information about each document
– Fixed-width ISAM (Index Sequential Access Mode) index, ordered by docID
– Information includes: status, a pointer into the Repository, a checksum, and various statistics
– If the document has been crawled, its record points to a docinfo file with its URL and title; otherwise it points to the URL in the URLlist
docID allocation
– A file of all document checksums paired with their docIDs, sorted by checksum
– To find a docID: compute the checksum of the URL, then binary search over the file
– Lookups may be done in batches
Lexicon
– Small enough to fit in the main memory of a machine
– Holds 14 million words
– Implemented as a list of the words plus a hash table of pointers
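A sketch of the checksum-to-docID lookup: a file of (checksum, docID) pairs sorted by checksum is binary searched. An in-memory list stands in for the on-disk file, and crc32 is an illustrative stand-in for the actual checksum function.

import bisect, zlib

# Sorted list of (url_checksum, docID) pairs.
checksum_to_docid = sorted(
    (zlib.crc32(u.encode()), i)
    for i, u in enumerate(["http://a.com/", "http://b.com/", "http://c.com/"])
)

def lookup_docid(url):
    cs = zlib.crc32(url.encode())
    i = bisect.bisect_left(checksum_to_docid, (cs, -1))   # binary search by checksum
    if i < len(checksum_to_docid) and checksum_to_docid[i][0] == cs:
        return checksum_to_docid[i][1]
    return None            # not yet crawled -> the URL would go to the URL list

print(lookup_docid("http://b.com/"))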
HitLists and Encoding
Hit
– An occurrence of a word in a document, encoded in 2 bytes
– Two kinds: fancy hits and plain hits
– Records capitalization, font size relative to the document, and position
HitList
– The list of hits for a word in a document
– Requires the most space of any data structure
– Many possible encoding schemes: simple, hand-optimized, Huffman
– Hand-optimized encoding chosen as a time vs. space compromise
Bit allocation for hits (2 bytes)
– Plain:  cap: 1, size: 3, position: 12
– Fancy:  cap: 1, size = 7, type: 4, position: 8
– Anchor: cap: 1, size = 7, type: 4, hash: 4, pos: 4
Anchor hits
– Include a hash of the docID the anchor occurs in
Storing
– Hit lists are stored in the Barrels
– To save space, the list length is combined with the wordID (forward index) or the docID (inverted index)
– If the list length does not fit in the remaining bits, an escape code is placed there and the next two bytes store the actual length
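A sketch of packing a plain hit into 2 bytes: 1 capitalization bit, 3 bits of relative font size, and 12 bits of position, per the allocation above. The ordering of the fields within the 16 bits is an illustrative choice.

def pack_plain_hit(cap, size, position):
    assert 0 <= size < 8 and 0 <= position < 4096
    return (cap & 1) << 15 | (size & 0x7) << 12 | (position & 0xFFF)

def unpack_plain_hit(hit):
    return hit >> 15, (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(cap=1, size=3, position=117)
print(hex(h), unpack_plain_hit(h))   # capitalized, relative size 3, 117th word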
Forward and Inverted Indices
Forward Index
– Stored in 64 barrels, each corresponding to a range of wordIDs
– A document’s words are split by range; the docID is recorded into each matching barrel, followed by a list of wordIDs with their hit lists
– wordIDs are stored relative to the barrel’s starting wordID, so they fit in 24 bits, leaving 8 bits for the hit list length
– Requires slightly more storage because of duplicated docIDs, but coding complexity is greatly reduced
Inverted Index
– Created after the Barrels go through the Sorter
– For each valid wordID, the Lexicon contains a pointer into the corresponding barrel
– The pointer leads to a docList of docIDs with their matching hit lists
– Represents every document in which a particular word appears
docList ordering
– Sort by docID: quick merging for multi-word queries
– Sort by a ranking of the word’s occurrence: one-word queries are trivial and multi-word answers are likely near the start of the list, but merging and development are difficult
– Compromise: keep two sets of Barrels
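A sketch of the two passes: the forward index buckets (docID, wordID, hits) by wordID range into barrels, and the sorter turns each barrel into an inverted index mapping wordID to a docID-ordered posting list. In-memory dictionaries stand in for the on-disk barrels, and the even split of the wordID space is an illustrative choice.

from collections import defaultdict

NUM_BARRELS = 64
LEXICON_SIZE = 14_000_000           # 14 million words, as above

def barrel_for(word_id):
    return word_id * NUM_BARRELS // LEXICON_SIZE

def build_indices(documents):
    """documents: dict docID -> dict wordID -> hit list."""
    forward = [defaultdict(dict) for _ in range(NUM_BARRELS)]   # barrel -> docID -> wordID -> hits
    for doc_id, words in documents.items():
        for word_id, hits in words.items():
            forward[barrel_for(word_id)][doc_id][word_id] = hits

    inverted = defaultdict(list)                                # wordID -> [(docID, hits), ...]
    for barrel in forward:
        for doc_id, words in barrel.items():
            for word_id, hits in words.items():
                inverted[word_id].append((doc_id, hits))
    for postings in inverted.values():
        postings.sort()                                         # docID order eases multi-word merging
    return forward, inverted

_, inv = build_indices({1: {42: [3, 17]}, 2: {42: [5]}})
print(inv[42])   # [(1, [3, 17]), (2, [5])]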
Crawling The Web
Crawling
– Accessing millions of web pages and logging data
– DNS caching for increased performance
– Complaints and questions from web admins
– Unpredictable bugs
– Copyright problems
– robots.txt compliance
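A sketch of two of these practical concerns: caching DNS lookups and checking robots.txt before fetching. It uses only the Python standard library; the user-agent name is hypothetical and error handling is minimal.

import socket
from urllib import request, robotparser
from urllib.parse import urlparse

dns_cache = {}
robots_cache = {}

def resolve(host):
    if host not in dns_cache:                      # avoid repeated DNS round trips
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

def allowed(url, agent="ExampleCrawler"):
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = robotparser.RobotFileParser(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None                              # robots.txt unreachable -> assume allowed
        robots_cache[host] = rp
    rp = robots_cache[host]
    return rp is None or rp.can_fetch(agent, url)

def crawl(url):
    resolve(urlparse(url).netloc)                  # a real crawler would reuse this cached address
    if not allowed(url):
        return None
    with request.urlopen(url, timeout=10) as resp:
        return resp.read()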
Indexing the Web
– Parsing HTML data, handling a wide variety of errors
– Encoding documents into the Barrels
– Turning words into wordIDs
– Hashing all the data
– Sorting the data recursively with a bucket sort
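A sketch of the bucket-sort idea: data too large to sort in memory at once is split into wordID-range buckets that are each sorted independently and then concatenated. The bucket count, wordID range, and tuple layout are toy values.

def bucket_sort_barrel(entries, num_buckets=4, max_word_id=1 << 24):
    """entries: list of (wordID, docID) tuples, far larger than memory in practice."""
    buckets = [[] for _ in range(num_buckets)]
    for word_id, doc_id in entries:
        buckets[word_id * num_buckets // max_word_id].append((word_id, doc_id))
    result = []
    for bucket in buckets:          # each bucket fits in memory and is sorted on its own
        bucket.sort()
        result.extend(bucket)       # bucket ranges are disjoint, so concatenation stays sorted
    return result

print(bucket_sort_barrel([(900, 2), (5, 1), (9_000_000, 3), (5, 4)]))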
Searching
– Quality comes first
– Search depth is limited (the first 40,000 matching documents)
– No single factor has too much impact on the ranking
– Factors include titles, font size, word distance, and count
– Together these produce an IR (relevance) score
– The final ranking combines PageRank with the IR score
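A sketch of that final step: a type-weighted IR score computed from damped hit counts is combined with PageRank. The weight table, the log damping, and the mixing parameter alpha are illustrative assumptions; the paper describes type-weights and tapering count-weights but does not publish its parameters.

import math

TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "large_font": 3.0, "plain": 1.0}

def ir_score(hit_counts):
    """hit_counts: dict of hit type -> count; log damping keeps any one factor
    from dominating the score."""
    return sum(TYPE_WEIGHTS[t] * math.log1p(c) for t, c in hit_counts.items())

def final_score(hit_counts, pagerank, alpha=0.5):
    return alpha * ir_score(hit_counts) + (1 - alpha) * math.log1p(pagerank)

print(final_score({"title": 1, "plain": 12}, pagerank=4.2))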
User Feedback
– User input is vital to improving search results
– Trusted users can evaluate results and send their ratings back
– Feedback is used to adjust the ranking system
– It also verifies that, after changes, old searches still return valid results
Results and Performance
The most important measure of a search engine is the quality of its search results.
– “Our own experience with Google has shown it to produce better results than the major commercial search engines for most searches.”
– Results are generally high-quality pages with few broken links.
Storage Requirements
– The total size of the repository is about 53 GB (a relatively cheap source of data)
– The total of all data used by the engine requires about 55 GB
– With better compression, the engine could fit on a 7 GB drive
System and Search Performance
– Google’s major operations: crawling, indexing, and sorting
– The indexer runs faster than the crawlers
– The indexer runs at roughly 54 pages per second
– Using four machines, the whole sorting process takes about 24 hours
– Most queries are answered within 10 seconds
– No query caching or subindices on common terms yet
Conclusions
Google is designed to be a scalable search engine that provides high-quality search results.
Future work:
– Query caching, smart disk allocation, and subindices
– Smart algorithms to decide which old web pages should be recrawled and which new ones should be crawled
– Using proxy caches to build search databases; adding boolean operators, negation, and stemming
– Supporting user context and result summarization
High Quality Search
– Users want high-quality results without being frustrated and wasting time.
– Google returns higher quality search results than current commercial search engines.
– Link structure analysis determines the quality of pages; link (anchor) descriptions determine relevance.
Scalable Architecture
– Google is efficient in both space and time.
– Google has overcome bottlenecks in CPU, memory access and capacity, and disk I/O during its various operations.
– Crawling, indexing, and sorting are efficient enough to build an index of 24 million pages in less than a week.