Anatomy of a search engine Design criteria of a search engine Architecture Data structures.

Anatomy of a search engine Design criteria of a search engine Architecture Data structures

Step-1: Crawling the web  Google has a fast distributed crawling system  Each crawler keeps roughly 300 connection open at once  Google can crawl over 100 web pages per second using four crawlers at peak speeds (roughly 600K per second of data)  Each crawler maintains a its own DNS cache  The crawler uses asynchronous IO and a number of queues Ref: http://www.elibrary.icrisat.org/Google%20Search/how%20google%20works.htm

Step2: Indexing the Web (1) Parsing  Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones.  Use flex to generate a lexical analyzer for maximum speed URL ServerStore Server Crawler Repository Indexer Barrel s Indexer Sorter

Step2: Indexing the Web (2)

Searching (pseudo code) 1.Parse the query 2.Convert words into wordIDs 3.Seek to the start of the doclist in the short barrel for every word 4.Scan through the doclists until there is a document that matches all the search terms 5.Compute the rank of that document for the query 6.If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4 7.If we are not at the end of any doclist go to step 4 8.Sort the documents that have matched by rank and return the top k Figure 4. Google Query Evaluation

Searching

Page Rank

Google architecture URL server sends list of URLs to be fetched to crawlers StoreServer compresses and stores pages Indexer extracts words, their pos., size, capital. Anchors cont.links and their text Sorter generates inverted index Searcher uses Lexicon, II, and PR

Major data Structure

Reference [Cho 98] Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998. [Page 98] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. [Chakrabarti 98] S.Chakrabarti, B.Dom, D.Gibson, J.Kleinberg, P. Raghavan and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998. [Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference On Management Of Data, 1994. Sergey Brin and Lawrence Page; The Anatomy of a Large-Scale Hypertextual Web Search Engine; Computer Science Department, Stanford University, Stanford, CA 94305

Anatomy of a search engine Design criteria of a search engine Architecture Data structures.

Similar presentations

Presentation on theme: "Anatomy of a search engine Design criteria of a search engine Architecture Data structures."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Anatomy of a search engine Design criteria of a search engine Architecture Data structures.

Similar presentations

Presentation on theme: "Anatomy of a search engine Design criteria of a search engine Architecture Data structures."— Presentation transcript:

Similar presentations

About project

Feedback