The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.

Overview
– Introduction
– PageRank
– Architecture Overview
– Major Data Structures
– Major Applications
– Query Evaluation
– Conclusion

Introduction
– Authors: Sergey Brin, Lawrence Page
– Google: a prototype of a large-scale search engine
– Makes heavy use of the structure present in hypertext
– Designed to crawl and index the Web efficiently

PageRank
– PageRank of page A:
  PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
– where:
  – d = damping factor (0 < d < 1), usually set to about 0.85
  – T1 … Tn = the pages pointing to page A
  – C(Ti) = number of links going out of page Ti
– [Figure: pages T1, T2, …, Tn linking to page A]

PageRank
– PR forms a probability distribution over web pages
  – The PageRanks of all web pages sum to one
– Calculated using a simple iterative algorithm
– Corresponds to the principal eigenvector of the normalized link matrix of the web
– Intuitively, it models user behavior:
  – PR: probability that a random surfer visits the page
  – d: damping factor. The paper calls it the probability that the random surfer gets bored and requests another random page; in the formula as written, that random jump corresponds to the (1-d) term
  – Variation: d is added only to a single page, or to a group of pages
– A high PR value is possible if:
  – There are many pages that point to the page
  – There are some pages that point to it that themselves have a high PR
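The "simple iterative algorithm" can be sketched as below. This is a minimal illustration and not the authors' implementation: the function name `pagerank`, the dictionary-based link graph, the fixed iteration count, and the decision to ignore dangling pages are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # dangling page: passes on no rank in this sketch
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

With the formula exactly as written and no dangling pages, the values in this sketch sum to the number of pages; dividing each value by the page count rescales them so that they sum to one.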

Google Architecture Overview
– URL Server
  – Sends URLs to be fetched to the crawlers
– Crawler
  – Downloads web pages
  – Crawling is done by several distributed crawlers
– Store Server
  – Compresses and stores web pages
– Repository
  – Each web page is associated with a docID

Google Architecture Overview
– Indexer
  – Reads the repository, uncompresses the docs, and parses them
  – Each doc is converted to a set of word occurrences, called "hits"
    – A hit records the word, position, font, and capitalization
    – Hits are distributed into a set of "barrels", creating a partially sorted forward index
  – Parses all links in web pages and stores them in an "anchors file"
    – Contains enough info to determine where each link points from and to, and the text of the link

Google Architecture Overview
– URL Resolver
  – Reads the anchors file
  – Converts relative URLs to absolute URLs, and in turn into docIDs
  – Puts anchor text into the forward index, associated with the docID that the anchor points to
  – Generates a database of links (pairs of docIDs), which is used to compute PR
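The URL resolver's steps can be illustrated with a short sketch. The input format (tuples of source URL, href, and anchor text), the function name `resolve_anchors`, and the sequential docID assignment are assumptions for the example; only `urllib.parse.urljoin` is a real standard-library call.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """Turn (source_url, href, anchor_text) tuples into docID link pairs.

    doc_ids maps absolute URL -> docID and is extended as new URLs appear
    (sequential numbering is an assumption of this sketch).
    """
    links = []        # (source docID, target docID) pairs, the input to PageRank
    anchor_text = {}  # target docID -> anchor texts pointing at it
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute URL
        src = doc_ids.setdefault(source_url, len(doc_ids) + 1)
        dst = doc_ids.setdefault(target_url, len(doc_ids) + 1)
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


doc_ids = {}
links, anchor_text = resolve_anchors(
    [
        ("http://example.com/a/index.html", "../b.html", "page b"),
        ("http://example.com/b.html", "/a/index.html", "home"),
    ],
    doc_ids,
)
```

Note that the anchor text is filed under the docID of the page the link points to, not the page it appears on, which matches the slide.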

Google Architecture Overview
– Sorter
  – Takes the barrels, which are sorted by docID
  – Re-sorts them by wordID to generate the inverted index
  – Also produces a list of wordIDs and offsets into the inverted index
– DumpLexicon
  – Takes the above list along with the lexicon produced by the indexer
  – Generates a new lexicon for use by the searcher
– Searcher
  – Uses the above lexicon together with the inverted index and PR to answer queries

Major Data Structures
1. BigFiles
– Virtual files spanning multiple file systems
– Addressable by 64-bit integers
– Allocation among the file systems is handled automatically
– Handles allocation and deallocation of file descriptors
– Supports rudimentary compression options

Major Data Structures
2. Repository
– Contains the full HTML of every web page
– Each page is compressed with zlib
– Docs are stored one after another
– Each doc is prefixed by: docID, length, URL
– Requires no other data structure to be accessed
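A repository of this shape (records stored back to back, each prefixed by docID, length, and URL, with the page compressed by zlib) might look like the sketch below. The exact header layout, field widths, and function names are assumptions for illustration and are not taken from the slides.

```python
import struct
import zlib

# Assumed header layout: 8-byte docID, 4-byte URL length, 4-byte compressed length.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Build one repository record: header, URL, zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def read_records(blob):
    """Walk the records one after another; no other data structure is needed."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://example.com/", "<html>hello</html>")
    + pack_record(2, "http://example.com/a", "<html>world</html>")
)
records = list(read_records(repository))
```

Because every record carries its own lengths, the whole repository can be scanned sequentially, which is what makes it possible to rebuild the other structures from it alone.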

Major Data Structures
3. Document Index
– Keeps information about each doc
– A fixed-width ISAM (Indexed Sequential Access Method) index, ordered by docID
– Each entry includes the current doc status, a pointer into the repository, and a doc checksum
– If the doc has been crawled, the entry also points to a variable-width file, docinfo (contains its URL and title)
– Otherwise, it points into the URL list (contains just the URL)

Major Data Structures
4. Lexicon
– Fits in main memory
– Contains in excess of 14 million words
– Implemented as:
  – A list of words (concatenated, but separated by nulls)
  – A hash table of pointers
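The lexicon layout (one concatenated, null-separated word list plus a hash table of pointers into it) can be sketched as follows. The function names are hypothetical, and a Python `dict` stands in for the hash table.

```python
def build_lexicon(words):
    """Return (blob, offsets): null-separated words and a word -> offset table."""
    blob = bytearray()
    offsets = {}  # stands in for the hash table of pointers
    for word in words:
        if word in offsets:
            continue  # each word is stored once
        offsets[word] = len(blob)
        blob += word.encode("utf-8") + b"\0"
    return bytes(blob), offsets


def word_at(blob, offset):
    """Read the word starting at offset, up to its terminating null."""
    end = blob.index(b"\0", offset)
    return blob[offset:end].decode("utf-8")


blob, offsets = build_lexicon(["web", "search", "web", "engine"])
```

Storing the words in one contiguous buffer avoids per-word allocation overhead, which is one plausible reason a lexicon of this size can fit in main memory.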

Major Data Structures
5. Hit Lists
– Occurrences of a word in a doc, with position, font, and capitalization info
– Account for most of the space in the forward and inverted indices
– Use a hand-optimized compact encoding (2 bytes per hit)
– Types:
  – Plain hits
    – Capitalization bit + font size (relative, 3 bits) + word position (12 bits)
  – Fancy hits
    – Hits occurring in a URL, title, anchor text, or meta tag
    – Capitalization bit + font size field set to 7 (marks the hit as fancy) + 4 bits to encode the type + position (8 bits)
    – Anchor hits: the 8 position bits are split into 4 bits for the position in the anchor + 4 bits for a hash of the docID the anchor occurs in
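The plain-hit encoding packs one capitalization bit, three font-size bits, and twelve position bits into 16 bits. The sketch below shows one way to do that; the bit order, the function names, and the clamping of large positions to 4095 are assumptions of this example.

```python
FANCY_MARKER = 7  # font size value reserved to flag fancy hits


def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 cap bit | 3 font bits | 12 position bits."""
    if not 0 <= font_size < FANCY_MARKER:
        raise ValueError("plain hits use font sizes 0-6")
    position = min(position, 0xFFF)  # clamp positions that do not fit in 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Reverse pack_plain_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
```

A fancy hit would reuse the same 16 bits with the font field set to `FANCY_MARKER`, followed by 4 type bits and 8 position bits.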

Major Data Structures
6. Forward Index
– Partially sorted
– Stored in a number of barrels (~64)
– Each barrel holds a range of wordIDs
– For each doc containing words in its range, the barrel stores the docID, followed by a list of wordIDs with their corresponding hit lists
– Instead of the actual wordID, the relative difference from the barrel's minimum wordID is stored
– This leaves 8 bits for the hit list length
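Storing a wordID relative to the barrel's minimum wordID lets the wordID and the hit list length share one 32-bit field. In the sketch below, the 24-bit width for the relative wordID comes from the original paper rather than the slide; the function names and bit order are assumptions.

```python
def pack_word_header(word_id, barrel_min_word_id, hit_count):
    """Pack (relative wordID, hit list length) into 32 bits: 24 bits + 8 bits."""
    relative = word_id - barrel_min_word_id
    if not 0 <= relative < (1 << 24):
        raise ValueError("wordID is outside this barrel's range")
    if not 0 <= hit_count < (1 << 8):
        raise ValueError("hit list length does not fit in 8 bits")
    return (relative << 8) | hit_count


def unpack_word_header(header, barrel_min_word_id):
    """Recover (wordID, hit list length) from a packed header."""
    return (header >> 8) + barrel_min_word_id, header & 0xFF


header = pack_word_header(1_000_123, 1_000_000, 5)
```

The barrel's minimum wordID has to be known when decoding, so in a scheme like this it would be stored once per barrel rather than once per entry.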

Major Data Structures
7. Inverted Index
– Consists of the same barrels as the forward index, after they have been processed by the sorter
– For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into
– The pointer leads to a doclist of docIDs with their corresponding hit lists
– The doclist represents all occurrences of that word in all docs
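The step from forward index to inverted index is essentially a re-sort from docID order to wordID order. The following is a small in-memory sketch of that idea; the nested-list input format and the function name `invert` are assumptions, and the real system sorts barrels on disk.

```python
from collections import defaultdict


def invert(forward_index):
    """Convert [(docID, [(wordID, hits), ...]), ...] into wordID -> doclist."""
    postings = [
        (word_id, doc_id, hits)
        for doc_id, words in forward_index
        for word_id, hits in words
    ]
    postings.sort(key=lambda p: (p[0], p[1]))  # re-sort by wordID, then docID

    inverted = defaultdict(list)
    for word_id, doc_id, hits in postings:
        inverted[word_id].append((doc_id, hits))  # doclist: docID + hit list
    return dict(inverted)


forward = [
    (1, [(7, [3, 9]), (5, [1])]),
    (2, [(5, [4])]),
]
inverted = invert(forward)
```

Each value in `inverted` plays the role of a doclist: every doc containing the word, paired with that word's hit list in the doc.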

Major Applications
– Crawling the Web
  – Uses a fast distributed crawling system
  – Each crawler maintains its own DNS cache
– Indexing the Web
  – Parsing
  – Indexing documents into barrels
  – Sorting
– Searching

Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
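The query evaluation steps can be approximated with the simplified sketch below. It is not Google's ranking: the barrels are plain dictionaries (wordID -> {docID: hit count}), the score is a toy combination of hit counts and PageRank, and every name is hypothetical. It only shows the control flow of matching all terms in the short barrels, then the full barrels, and returning the top k.

```python
import heapq


def matching_docs(doclists):
    """Yield docIDs that appear in every doclist (step 4)."""
    if not doclists:
        return
    smallest = min(doclists, key=len)
    for doc_id in sorted(smallest):
        if all(doc_id in doclist for doclist in doclists):
            yield doc_id


def evaluate(query, lexicon, short_barrel, full_barrel, pagerank, k=10):
    words = query.lower().split()                      # step 1: parse
    if not words or any(w not in lexicon for w in words):
        return []                                      # some term matches nothing
    word_ids = [lexicon[w] for w in words]             # step 2: words -> wordIDs

    scores = {}
    for barrel in (short_barrel, full_barrel):         # steps 3 and 6
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        for doc_id in matching_docs(doclists):         # steps 4 and 7
            hits = sum(doclist[doc_id] for doclist in doclists)
            score = hits + pagerank.get(doc_id, 0.0)   # step 5: toy rank
            scores[doc_id] = max(score, scores.get(doc_id, 0.0))
    return heapq.nlargest(k, scores, key=scores.get)   # step 8: top k by rank


lexicon = {"web": 1, "search": 2}
short_barrel = {1: {10: 1}, 2: {10: 1, 11: 1}}
full_barrel = {1: {10: 3, 11: 2, 12: 1}, 2: {10: 1, 11: 4}}
page_ranks = {10: 0.5, 11: 1.5, 12: 3.0}
top = evaluate("web search", lexicon, short_barrel, full_barrel, page_ranks, k=2)
```

In this toy data, document 12 has the highest PageRank but is never returned for the two-word query, because it does not contain every search term.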

Conclusion
– PageRank allows for better quality of search results
– The system is designed to scale effectively

Reference
– The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page
  db.stanford.edu/~backrub/google.html