The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page (1998). Presented by Carlton Northern and Jeffrey Shipman.



Overview
- Introduction
- Problem
- Design Goals
- System Features
- Architecture
- Results
- Conclusion
- Discussion

Problem
- The number of web pages is very large and is growing exponentially.

Problem
- Classical information retrieval (IR) techniques do not work well for web search because:
  - Most IR research has been done on small (relative to the internet), controlled, homogeneous collections.
  - The web is a collection of uncontrolled, heterogeneous documents of varying authority.
  - Keyword-matching algorithms return low-quality matches on the web:
    - Advertisers manipulate content to be listed higher in result sets.
    - They tend to return results with smaller amounts of content.
- Example (next slide)

Problem
- Bill Clinton example: a search performed in 1996 for the words “Bill Clinton” on a leading web search engine.
- Top result: “Bill Clinton Sucks!”

Problem
- Little substantial research has been done on web search engines.
- Users are most likely to look only at the first 10 results.

Design Goals
- Improve the quality of web search engines.
- List the most relevant documents in the top 10 results, favoring precision even at the cost of recall.
  - Precision: the number of relevant documents returned, as a fraction of all documents returned.
  - Recall: the number of relevant documents returned, as a fraction of all relevant documents that could have been returned.
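The two definitions above can be made concrete with a small sketch. The function name, the document IDs, and the relevance judgments below are hypothetical and only illustrate the arithmetic.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list and a set of relevant docs."""
    returned = list(returned)
    relevant = set(relevant)
    # Count returned documents that are actually relevant.
    hits = sum(1 for doc in returned if doc in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 10 results are returned and 4 of them are relevant,
# while 8 relevant documents exist in the whole collection.
top10 = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant_docs = {"d1", "d3", "d5", "d7", "d20", "d21", "d22", "d23"}
p, r = precision_recall(top10, relevant_docs)
print(p, r)  # precision is 4/10, recall is 4/8
```

A search engine that optimizes for the first page of results cares mostly about the first number: missing some relevant pages is acceptable as long as the ones shown are good.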

Design Goals
- Scale with the internet.
- Support novel research activities on large-scale web data.
- Don’t let advertising affect the ranking of search results.
- Example (next slide)

Design Goals
- Cell phone example: a search for “cell phone” on Google in 1998 returns “The Effect of Cellular Phone Use Upon Driver Attention” as its top result.
- If advertisers had an impact on results, a cell phone advertisement would surely have taken the top position.

System Features
- PageRank
- Use of anchor text
- Use of location
- Use of the font size of words
- Cached pages kept in the repository
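The slides name PageRank without giving its definition. As a rough sketch, the version below iterates the formula from the original Brin and Page paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (0.85 here). The toy graph, the function name, and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_count = {p: len(links.get(p, [])) for p in pages}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the rank contributed by every page q that links to p.
            incoming = sum(
                rank[q] / out_count[q] for q in pages if p in links.get(q, [])
            )
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up ranked highest because both other pages link to it, which is the intuition behind using link structure as a quality signal.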

Google Architecture (diagram)
- Components shown: URL Server, Crawler, Store Server, Repository, Indexer, Anchor, URL Resolver, Links, Doc Index, Barrels, Lexicon, Sorter, Searcher, PageRank

URL Server, Crawler, Store Server
- A single URL server
- A number of crawlers
- The store server compresses and stores the fetched pages

Repository
- The store server compresses pages and stores them in the repository.
- Repository record layout: sync || length || compressed packet
- Packet layout: docID || encode || urllen || pagelen || url || page
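A minimal sketch of the record layout above, assuming zlib compression and made-up field widths. The sync marker value, the integer sizes, and the function names are all hypothetical; the slide gives only the field order.

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"  # hypothetical sync marker
HEADER = "<IBHI"            # assumed widths: docID, encode, urllen, pagelen


def pack_record(doc_id, encode, url, page):
    """Build sync || length || compressed(docID || encode || urllen || pagelen || url || page)."""
    url_b = url.encode("utf-8")
    page_b = page.encode("utf-8")
    header = struct.pack(HEADER, doc_id, encode, len(url_b), len(page_b))
    packet = zlib.compress(header + url_b + page_b)
    return SYNC + struct.pack("<I", len(packet)) + packet


def unpack_record(record):
    """Reverse pack_record and return (doc_id, encode, url, page)."""
    if record[:4] != SYNC:
        raise ValueError("missing sync marker")
    (length,) = struct.unpack("<I", record[4:8])
    raw = zlib.decompress(record[8:8 + length])
    size = struct.calcsize(HEADER)
    doc_id, encode, urllen, pagelen = struct.unpack(HEADER, raw[:size])
    url = raw[size:size + urllen].decode("utf-8")
    page = raw[size + urllen:size + urllen + pagelen].decode("utf-8")
    return doc_id, encode, url, page


record = pack_record(42, 1, "http://example.com/", "<html>hello</html>")
print(unpack_record(record))
```

Storing explicit lengths ahead of the variable-size fields is what lets a reader walk the repository record by record without any other index.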

Indexer and Document Index
- The indexer reads pages from the repository and uncompresses them.
- Each web page is given a docID.
- A checksum is computed to look up a docID.
- The document index is a fixed-width ISAM (index sequential access mode) index. Each entry holds:
  - The current document status
  - A pointer into the repository
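The checksum lookup can be sketched as a sorted table searched by bisection. This assumes the checksum is taken over the URL and uses CRC-32 as a stand-in; the slide does not say which checksum is used, and all names and URLs below are hypothetical.

```python
import bisect
import zlib


def url_checksum(url):
    """Stand-in checksum (CRC-32); the real checksum function is not specified."""
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(url_to_docid):
    """Return a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup_docid(table, url):
    """Binary-search the sorted table for the URL's checksum."""
    target = url_checksum(url)
    # (target,) sorts just before any (target, docID) pair.
    i = bisect.bisect_left(table, (target,))
    if i < len(table) and table[i][0] == target:
        return table[i][1]
    return None


table = build_checksum_table({
    "http://example.com/": 1,
    "http://example.com/about": 2,
    "http://example.org/": 3,
})
print(lookup_docid(table, "http://example.com/about"))
```

Keeping the table sorted by checksum means a lookup costs a handful of comparisons rather than a scan over every known URL.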

Indexer
- Each document is converted into a set of word occurrences (hits).
- A hit records the word, its position in the document, its font size, and its capitalization.
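One way to picture a hit is as a small packed integer. The sketch below assumes the 16-bit "plain hit" layout described in the original paper (1 bit for capitalization, 3 bits for font size, 12 bits for position); those widths come from the paper rather than from the slide, and the function names are hypothetical. The word itself is not stored in the hit, since hits are grouped under a wordID.

```python
def pack_hit(capitalized, font_size, position):
    """Pack a hit into 16 bits: 1 bit capitalization, 3 bits font size, 12 bits position."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 0xFFF)  # positions past the 12-bit range are capped
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Return (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_hit(True, 3, 100)
print(hit, unpack_hit(hit))
```

Packing hits this tightly matters because the index stores one hit per word occurrence across the whole crawl.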

Forward Index
- Partially sorted.
- Each barrel holds a range of wordIDs.
- Diagram labels: Doc 6 (left, right); barrel e - h: Doc 6, Doc 4, Doc 1; barrel q - s: Doc 6, Doc 3, Doc 2
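A small sketch of how a forward index might be split into barrels by wordID range. The ranges, wordIDs, and hit positions are invented for illustration; only the idea that each barrel covers a range of wordIDs comes from the slide.

```python
# Hypothetical barrels: each covers a half-open range of wordIDs.
BARREL_RANGES = [(0, 100), (100, 200)]


def barrel_for(word_id):
    """Return the index of the barrel whose range contains word_id."""
    for i, (low, high) in enumerate(BARREL_RANGES):
        if low <= word_id < high:
            return i
    raise ValueError(f"no barrel covers wordID {word_id}")


def add_document(barrels, doc_id, word_hits):
    """Append (docID, wordID, hits) entries to the barrel that owns each wordID."""
    for word_id, hits in sorted(word_hits.items()):
        barrels[barrel_for(word_id)].append((doc_id, word_id, hits))


barrels = [[] for _ in BARREL_RANGES]
# Documents are added in docID order, so each barrel stays ordered by docID.
add_document(barrels, 1, {7: [3, 18], 150: [5]})
add_document(barrels, 2, {7: [1], 42: [9, 11]})
for i, barrel in enumerate(barrels):
    print(i, barrel)
```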

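URL Resolver and Anchors
- The anchors file contains information about which page each link points from and which page it points to.
- The URL resolver converts the links and places them where they are needed: in the barrels and in the links database used to compute PageRank.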
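The resolver step can be sketched as turning each anchor's relative URL into an absolute one and then into a pair of docIDs. The anchor tuple format, the URL-to-docID table, and the function name are assumptions for illustration; `urljoin` is the standard-library helper for resolving relative URLs.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, url_to_docid):
    """Convert (source_url, href, anchor_text) into (source_docID, target_docID, text).

    Links whose source or target has no docID yet are skipped in this sketch.
    """
    links = []
    for source_url, href, text in anchors:
        absolute = urljoin(source_url, href)  # relative URL -> absolute URL
        src = url_to_docid.get(source_url)
        dst = url_to_docid.get(absolute)
        if src is not None and dst is not None:
            links.append((src, dst, text))
    return links


# Hypothetical docID table and anchors file contents.
url_to_docid = {
    "http://example.com/a/index.html": 1,
    "http://example.com/b.html": 2,
}
anchors = [
    ("http://example.com/a/index.html", "../b.html", "page b"),
    ("http://example.com/a/index.html", "missing.html", "not crawled"),
]
links = resolve_anchors(anchors, url_to_docid)
print(links)
```

The resulting docID pairs are the kind of link data a PageRank computation needs, and the anchor text travels with each pair so it can be indexed for the target page.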

Sorter and Lexicon
- The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID.
- The sorter thereby produces the inverted index.
- The lexicon is the list of words.
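Re-sorting a forward barrel by wordID is what turns it into an inverted index. A minimal in-memory sketch, with hypothetical entries and names:

```python
def invert(forward_entries):
    """Re-sort (docID, wordID, hits) entries by wordID and group them per word."""
    inverted = {}
    for doc_id, word_id, hits in sorted(forward_entries, key=lambda e: (e[1], e[0])):
        inverted.setdefault(word_id, []).append((doc_id, hits))
    return inverted


# Hypothetical forward barrel, ordered by docID.
forward = [
    (1, 7, [3, 18]),
    (1, 42, [5]),
    (2, 7, [1]),
    (3, 42, [9, 11]),
]
# Hypothetical lexicon mapping each word to its wordID.
lexicon = {"search": 7, "engine": 42}

inverted = invert(forward)
print(inverted[lexicon["search"]])
```

After inversion, answering "which documents contain this word" is a single dictionary lookup instead of a scan over every document.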

Searcher
- The searcher is run by a web server and uses the lexicon, PageRank, and the inverted index to answer queries.

Searcher and Rankings
- Simple case: one-word searches
- Multi-word searches
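A simplified sketch of how a searcher might combine the lexicon, the inverted index, and PageRank for one-word and multi-word queries. Ordering purely by PageRank is an assumption made to keep the example short; the actual ranking described in the paper also weighs hit types and word proximity. All names and data are hypothetical.

```python
def search(query, lexicon, inverted_index, pagerank, limit=10):
    """Return up to `limit` docIDs containing every query word, best PageRank first."""
    doc_sets = []
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # a word missing from the lexicon matches nothing
        doc_sets.append(set(inverted_index.get(word_id, [])))
    if not doc_sets:
        return []
    # One-word query: a single set. Multi-word query: intersect the sets.
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda d: pagerank.get(d, 0.0), reverse=True)[:limit]


# Hypothetical index data.
lexicon = {"search": 7, "engine": 42}
inverted_index = {7: [1, 2, 3], 42: [1, 3]}
pagerank = {1: 0.4, 2: 1.9, 3: 1.2}

print(search("search", lexicon, inverted_index, pagerank))
print(search("search engine", lexicon, inverted_index, pagerank))
```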

Results
- Search quality is the number one criterion, although it may be subjective.
- Prior to Google, web searches behaved like database searches.
- Search engines now employ some offshoot of Google’s methodologies.

Conclusions
- Google is scalable.
- The primary goal is high-quality searches.
- The web is dynamic and growing.
- Google makes heavy use of hypertext information.

Discussion
- Questions?
- What is the most important concept when considering Google's architecture?
- What is more important in Google's structure: software or hardware?
- Google’s advertising: has Google moved away from its initial position on advertising with the advent of AdWords and AdSense? (demo)