Presented By: Carlton Northern and Jeffrey Shipman. The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Lawrence Page and Sergey Brin (1998)


1 Presented By: Carlton Northern and Jeffrey Shipman. The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Lawrence Page and Sergey Brin (1998)

2 Overview: Introduction, Problem, Design Goals, System Features, Architecture, Results, Conclusion, Discussion

3 Problem: The number of web pages is very large and is growing exponentially. (Source: http://www.zakon.org/robert/internet/timeline/)

4 Problem: Classical information retrieval (IR) techniques do not work well for web search because: Most IR research has been done on small (relative to the internet), controlled, homogeneous collections. The web is a collection of uncontrolled, heterogeneous documents of varying authority. Keyword-matching algorithms return low-quality matches on the web. Advertisers manipulate content to be listed higher in result sets. Keyword matching also tends to return results with smaller amounts of content. Example (next slide).

5 Problem: Bill Clinton example. A search performed in 1996 for the words “Bill Clinton” on a leading web search engine returned “Bill Clinton Sucks!” as the top result.

6 Problem: Little substantial research has been done on web search engines. Users are likely to look only at the first 10 results.

7 Design Goals: Improve the quality of web search engines. List the highest-precision documents in the top 10 results, even at the cost of recall. Precision: the number of relevant documents returned divided by the number of all documents returned. Recall: the number of relevant documents returned divided by the total number of relevant documents that could be returned.
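The two definitions on this slide can be sketched as a small calculation. The function name and the sample document IDs below are hypothetical, chosen only to show a top-10 result list with high precision and low recall.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for returned documents vs. relevant documents."""
    returned_set = set(returned)
    relevant_set = set(relevant)
    relevant_returned = returned_set & relevant_set
    precision = len(relevant_returned) / len(returned_set) if returned_set else 0.0
    recall = len(relevant_returned) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical data: 10 results shown, 8 of them relevant,
# out of 40 relevant documents that exist in the collection.
top10 = [f"d{i}" for i in range(1, 11)]
relevant = {f"d{i}" for i in range(1, 9)} | {f"r{i}" for i in range(32)}

precision, recall = precision_recall(top10, relevant)
print(precision, recall)  # 0.8 0.2
```

Here the top 10 is mostly relevant (precision 0.8) even though most relevant documents are never shown (recall 0.2), which is the trade-off the slide accepts.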

8 Design Goals: Scale with the internet. Support novel research activities on large-scale web data. Don’t let advertising affect the ranking of search results. Example (next slide).

9 Design Goals: Cell phone example. A search for “cell phone” on Google in 1998 returned “The Effect of Cellular Phone Use Upon Driver Attention” as its top result. If advertisers had influenced the results, a cell phone advertisement would surely have taken the top position.

10 System Features: PageRank. Use of anchor text. Use of location. Use of the font size of words. Cached pages kept in the repository.
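The slide names PageRank without defining it. As a minimal sketch, the iteration below uses the formula given in the Brin and Page paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), with damping factor d = 0.85. The function name, the three-page graph, and the fixed iteration count are assumptions for illustration, and pages without outgoing links are not treated specially.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum PR(T) / C(T) over every page T that links to this page.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

C ends up ranked highest because it receives links from both other pages, and A follows because it is the only page C links to.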

11 Google Architecture: [Diagram with the components URL Server, Crawler, Store Server, Repository, Indexer, Anchor, URL Resolver, Links, Barrels, Lexicon, Doc Index, Sorter, Searcher, PageRank]

12 URL Server, Crawler, Store Server: A single URL server feeds a number of crawlers. The store server compresses and stores the fetched pages.

13 Repository: The store server compresses pages and stores them in the repository. Repository entry: sync || length || compressed packet. Packet: docID || encode || urllen || pagelen || url || page.
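The record layout on this slide can be sketched as a pack/unpack pair. The field widths, the sync marker value, and the function names are assumptions, since the slide gives only the field order. zlib is used for compression here; the slide does not name the compressor.

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"   # hypothetical sync marker
HEADER = "<IBHI"             # assumed widths: docID, encode, urllen, pagelen


def pack_record(doc_id, encoding, url, page):
    """Build sync || length || compressed packet for one page."""
    url_bytes = url.encode("utf-8")
    page_bytes = page.encode("utf-8")
    header = struct.pack(HEADER, doc_id, encoding, len(url_bytes), len(page_bytes))
    packet = zlib.compress(header + url_bytes + page_bytes)
    return SYNC + struct.pack("<I", len(packet)) + packet


def unpack_record(record):
    """Reverse pack_record and return (doc_id, encoding, url, page)."""
    if record[:4] != SYNC:
        raise ValueError("missing sync marker")
    (length,) = struct.unpack_from("<I", record, 4)
    packet = zlib.decompress(record[8:8 + length])
    doc_id, encoding, url_len, page_len = struct.unpack_from(HEADER, packet, 0)
    start = struct.calcsize(HEADER)
    url = packet[start:start + url_len].decode("utf-8")
    page = packet[start + url_len:start + url_len + page_len].decode("utf-8")
    return doc_id, encoding, url, page


record = pack_record(7, 1, "http://example.com/", "<html>hello</html>")
print(unpack_record(record))
```

The explicit length field lets a reader skip from one record to the next without decompressing, and the sync marker gives a way to check that a read starts on a record boundary.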

14 Indexer: The indexer reads from the repository. Web pages are given a docID when the URL is uncompressed. docID: a checksum is computed to find the docID. The document index is a fixed-width ISAM (index sequential access mode) index that holds the current document status and a pointer into the repository. [Diagram: Repository, Indexer]
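The checksum lookup mentioned on this slide can be sketched as a binary search over a table sorted by checksum. CRC32 stands in for whatever checksum the real system used, and the function names and sample URLs are hypothetical. Checksum collisions are ignored in this sketch.

```python
import bisect
import zlib


def url_checksum(url):
    """Stand-in checksum (CRC32); the slide does not say which checksum is used."""
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(urls):
    """Assign docIDs in order of appearance and sort the table by checksum."""
    return sorted((url_checksum(url), doc_id) for doc_id, url in enumerate(urls))


def find_doc_id(table, url):
    """Binary-search the sorted table; return the docID or None if absent."""
    target = url_checksum(url)
    index = bisect.bisect_left(table, (target, -1))
    if index < len(table) and table[index][0] == target:
        return table[index][1]
    return None


urls = ["http://example.com/", "http://example.com/a", "http://example.org/b"]
table = build_checksum_table(urls)
print(find_doc_id(table, "http://example.com/a"))
```

Keeping the table sorted by checksum means a lookup costs a logarithmic number of comparisons instead of a scan over every known URL.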

15 Indexer: Each document is converted into a set of word occurrences (hits). Hits record the word, position, font size, and capitalization. [Diagram: Indexer, Anchor, Barrels, Lexicon]
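A hit can be packed compactly. The sketch below follows the plain-hit description in the Brin and Page paper (one capitalization bit, three font-size bits, twelve position bits, two bytes in total). The bit order, the clamping of long positions to 4095, and the function names are assumptions.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack one word occurrence into a 16-bit integer."""
    if not 0 <= font_size <= 6:
        # The value 7 (binary 111) is reserved in the paper to flag a fancy hit.
        raise ValueError("font_size must be between 0 and 6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumed clamp so the value fits in 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
print(hit, unpack_plain_hit(hit))  # 45098 (True, 3, 42)
```

Two bytes per hit keeps the index small, at the cost of losing exact positions for words far into long documents.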

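16 Forward Index: The forward index is already partially sorted by the way it is implemented. Each barrel holds a range of wordIDs. [Diagram: forward-index barrels, each covering a word range (for example e - h and q - s) and listing document entries such as Doc 6, Doc 4, Doc 1 and Doc 6, Doc 3, Doc 2]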
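The barrel idea can be sketched as a partition of the forward index by wordID range. The barrel size, the tiny lexicon, and the nested dictionary layout are hypothetical. A real barrel would store packed hit lists rather than Python lists of positions.

```python
def build_forward_barrels(docs, lexicon, barrel_size):
    """Group (docID -> wordID -> positions) entries into barrels by wordID range."""
    barrels = {}
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id]):
            word_id = lexicon[word]
            # Barrel k holds wordIDs k * barrel_size .. (k + 1) * barrel_size - 1.
            barrel = barrels.setdefault(word_id // barrel_size, {})
            barrel.setdefault(doc_id, {}).setdefault(word_id, []).append(position)
    return barrels


lexicon = {"engine": 0, "google": 1, "rank": 2, "search": 3}
docs = {1: ["search", "engine"], 2: ["rank", "search", "rank"]}
barrels = build_forward_barrels(docs, lexicon, barrel_size=2)
print(barrels[0])  # {1: {0: [1]}}
print(barrels[1])  # {1: {3: [0]}, 2: {2: [0, 2], 3: [1]}}
```

Because each barrel already covers a contiguous wordID range, the later sort by wordID only has to reorder entries within one barrel at a time.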

17 Resolver and Anchor: The anchor file contains information on which page points to which other page. The URL resolver's duty is to convert the URLs and place the links, which are then used for PageRank. [Diagram: Anchor, URL Resolver, Barrels, PageRank]
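The resolver's job can be sketched with the standard library: turn each relative link into an absolute URL, map both ends to docIDs, and emit docID pairs for the links database. The record layout (source URL, href, anchor text), the function name, and the sample URLs are assumptions.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, url_to_doc_id):
    """Return (links, anchor_text) from (source_url, href, text) records."""
    links = []          # pairs of (source docID, target docID)
    anchor_text = {}    # target docID -> list of anchor texts pointing at it
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute URL
        source_id = url_to_doc_id.get(source_url)
        target_id = url_to_doc_id.get(target_url)
        if source_id is None or target_id is None:
            continue  # this sketch skips URLs that have no docID yet
        links.append((source_id, target_id))
        anchor_text.setdefault(target_id, []).append(text)
    return links, anchor_text


url_to_doc_id = {
    "http://example.com/a/index.html": 1,
    "http://example.com/b.html": 2,
    "http://example.com/a/c.html": 3,
}
anchors = [
    ("http://example.com/a/index.html", "../b.html", "page b"),
    ("http://example.com/a/index.html", "c.html", "page c"),
]
links, anchor_text = resolve_anchors(anchors, url_to_doc_id)
print(links)        # [(1, 2), (1, 3)]
print(anchor_text)  # {2: ['page b'], 3: ['page c']}
```

The anchor text is stored under the page the link points to, not the page that contains it, so a page can be described by what others say about it.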

18 Sorter and Lexicon: The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID. In doing so, the sorter produces the inverted index. The lexicon is a list of words. [Diagram: Barrels, Sorter, Lexicon]
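The sorter step can be sketched as a re-sort of forward-index entries. The tuple layout (docID, wordID, positions) and the function name are hypothetical. The real sorter works on packed barrels on disk rather than on in-memory lists.

```python
def invert_barrel(forward_entries):
    """Turn entries ordered by docID into wordID -> [(docID, positions), ...]."""
    inverted = {}
    # Sort by wordID first, then docID, so each posting list is in docID order.
    for doc_id, word_id, positions in sorted(
        forward_entries, key=lambda entry: (entry[1], entry[0])
    ):
        inverted.setdefault(word_id, []).append((doc_id, positions))
    return inverted


forward = [(1, 3, [0]), (1, 0, [1]), (2, 2, [0, 2]), (2, 3, [1])]
inverted = invert_barrel(forward)
print(inverted)
# {0: [(1, [1])], 2: [(2, [0, 2])], 3: [(1, [0]), (2, [1])]}
```

After inversion, answering "which documents contain wordID 3" is a single dictionary lookup instead of a scan over every document.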

19 Searcher: The searcher is run by a web server and uses the lexicon, PageRank, and the inverted index to answer queries. [Diagram: Barrels, Lexicon, Searcher, PageRank]

20 Searcher and Rankings: Simple case: one-word searches. Multi-word searches.
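The slide does not show how the one-word case is scored. As a rough sketch only, the code below weights hits by type, caps the benefit of repeated hits, and adds a PageRank term. Every weight, the cap, the additive combination, and all names are assumptions and are not taken from the presentation. Multi-word proximity scoring is left out.

```python
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "plain": 1.0}  # hypothetical weights
COUNT_CAP = 8  # hypothetical: hits beyond this count add nothing


def ir_score(hit_counts):
    """Score one document from its per-type hit counts for the query word."""
    return sum(
        TYPE_WEIGHTS[hit_type] * min(count, COUNT_CAP)
        for hit_type, count in hit_counts.items()
    )


def rank_single_word(postings, pagerank, pagerank_weight=10.0):
    """Order docIDs by IR score plus a weighted PageRank term, best first."""
    scores = {
        doc_id: ir_score(counts) + pagerank_weight * pagerank.get(doc_id, 0.0)
        for doc_id, counts in postings.items()
    }
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


postings = {
    1: {"plain": 20},              # many body hits, capped at COUNT_CAP
    2: {"title": 1, "plain": 2},
    3: {"anchor": 3},
}
pagerank = {1: 0.2, 2: 1.0, 3: 0.4}
ranking = rank_single_word(postings, pagerank)
print(ranking)  # [2, 3, 1]
```

Document 1 repeats the word most often but ranks last, because the cap limits the value of repetition while title hits, anchor hits, and PageRank lift the other two.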

21 Results: Quality of search is the number one criterion, though it may be subjective. Prior to Google, web searches worked like database searches. Search engines now employ some offshoot of Google’s methodologies.

22 Conclusions: Google is scalable. The primary goal is high-quality searches. The web is dynamic and growing. Google makes heavy use of hypertext information.

23 Discussion: Questions? What is the most important concept when considering Google's architecture? What is more important in Google's structure, software or hardware? Google’s advertising: has Google lost its initial position on advertising with the advent of AdWords and AdSense? (demo)
