David Evans CS150: Computer Science University of Virginia Computer Science Class 38: Googling.

Presentation transcript:

David Evans CS150: Computer Science University of Virginia Computer Science Class 38: Googling

2 CS150 Fall 2005: Lecture 38: Googling Google
Some searches...
“David Evans”
“Dave Evans”
“idiot”
“lawn lighting”: tomorrow at 6pm (but Google doesn’t know that!)

3 Building a Web Search Engine
Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

4 Crawling
Crawler
  activeURLs = [ “…” ]
  while (len(activeURLs) > 0):
      newURLs = [ ]
      for URL in activeURLs:
          page = downloadPage(URL)
          newURLs += extractLinks(page)
      activeURLs = newURLs
Problems:
  – Will keep revisiting the same pages
  – Will take very long to get a good view of the web
  – Will annoy web server admins
  – downloadPage and extractLinks must be very robust

5 Crawling
Crawler
  activeURLs = [ “…” ]
  visitedURLs = [ ]
  while (len(activeURLs) > 0):
      newURLs = [ ]
      for URL in activeURLs:
          visitedURLs += URL
          page = downloadPage(URL)
          newURLs += extractLinks(page) - visitedURLs
      activeURLs = newURLs
What is the complexity?
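A minimal runnable sketch of the crawler above. The pseudocode's list operations (`+= URL`, list subtraction) are not valid Python, so this version uses a set for visitedURLs. FAKE_WEB and extract_links are hypothetical stand-ins for the real web and for downloadPage/extractLinks, so the example runs offline. With a list, each "is this URL already visited?" check costs time proportional to the number of visited pages; with a set it is O(1) on average, so the whole crawl is roughly linear in pages plus links.

```python
# Hypothetical in-memory "web": page URL -> outgoing links.
# It replaces downloadPage/extractLinks so the sketch needs no network.
FAKE_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}


def extract_links(url):
    """Stand-in for extractLinks(downloadPage(url))."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls):
    """Level-by-level crawl that fetches each page at most once."""
    visited = set()          # average O(1) membership test
    active = list(seed_urls)
    order = []               # pages in the order they were fetched
    while active:
        new_urls = []
        for url in active:
            if url in visited:       # queued twice in the same round
                continue
            visited.add(url)
            order.append(url)
            for link in extract_links(url):
                if link not in visited:
                    new_urls.append(link)
        active = new_urls
    return order


print(crawl(["a.example"]))
```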

6 Distributed Crawler
  activeURLs = [ “…” ]
  visitedURLs = [ ]
  while (len(activeURLs) > 0):
      newURLs = [ ]
      parfor URL in activeURLs:
          visitedURLs += URL
          page = downloadPage(URL)
          newURLs += extractLinks(page) - visitedURLs
      activeURLs = newURLs
Is this as “easy” as distributing finding aliens?
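One possible way to approximate the parfor in Python is a thread pool that downloads a whole round of pages concurrently. This is only a sketch: FAKE_WEB and fetch_links are hypothetical stand-ins for the web and for downloadPage/extractLinks. It hints at why the question is not trivial: the workers share visitedURLs and newURLs, so here only the downloads run in parallel and the merge happens in one thread to avoid concurrent updates.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": page URL -> outgoing links.
FAKE_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}


def fetch_links(url):
    """Stand-in for extractLinks(downloadPage(url)); safe to run in parallel."""
    return FAKE_WEB.get(url, [])


def parallel_crawl(seed_urls, workers=4):
    """Crawl round by round, fetching each round's pages concurrently."""
    active = list(dict.fromkeys(seed_urls))   # drop duplicate seeds, keep order
    visited = set(active)                     # a URL is marked when it is queued
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while active:
            # The "parfor": every page of this round is fetched concurrently.
            link_lists = list(pool.map(fetch_links, active))
            # Merge in a single thread, so the shared set needs no lock.
            new_urls = []
            for links in link_lists:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        new_urls.append(link)
            active = new_urls
    return visited


print(sorted(parallel_crawl(["a.example"])))
```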

7 Building a Web Search Engine
Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

8 Building an Index
What if we just stored all the pages?
Answering a query would be Θ(size of the database)
  (need to look at all characters in database)
For Google: about 4 Billion pages (actual size is now considered a corporate secret) * 60 KB (average web page size) = ~184 Trillion
Linear is not nearly good enough when n is Trillions
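A quick check of the size estimate, assuming 1 KB = 1024 bytes (the slide does not say which convention it uses). Under that assumption, ~184 trillion bytes corresponds to about 3 billion pages, and 4 billion pages would give about 246 trillion. Either way, a linear scan per query is far too slow.

```python
KB = 1024                      # assumption: 1 KB = 1024 bytes
avg_page_bytes = 60 * KB       # average web page size from the slide

for pages in (3_000_000_000, 4_000_000_000):
    total = pages * avg_page_bytes
    print(f"{pages:,} pages -> {total / 1e12:.1f} trillion bytes")
```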

9 Reverse Index
Word       Locations
…
“David”    [ …, … ]
…
“Evans”    [ …, … ]
…
What is time complexity of search now?
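A small sketch of how such a reverse index could be built. The (document id, word position) location format and the sample pages are assumptions for illustration; the slide leaves the location entries elided.

```python
from collections import defaultdict


def build_reverse_index(pages):
    """pages: dict of doc id -> text. Returns word -> list of (doc id, position)."""
    index = defaultdict(list)
    for doc_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


# Hypothetical two-page "web".
pages = {
    1: "David Evans teaches CS150",
    2: "Evans Hall",
}
index = build_reverse_index(pages)
print(index["evans"])   # [(1, 1), (2, 0)]
```

Answering a query no longer scans every page; the cost moves to finding the word's entry in the index, which is what the next slides examine.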

10 Best Possible Searching
Searching Problem:
  – Input: a target key key, a list of n <key, value> pairs, sorted by key using a comparison function cf
  – Output: if key is in the list, the value associated with key; otherwise, not found
What is the best possible solution to the general searching problem?

11 Recall Class 13: Sorting problem is Ω(n log n)
There are n! possible orderings
Each comparison can eliminate at best ½ of them
So, best possible sorting procedure is Ω(log2(n!))
Stirling’s approximation: n! ≈ √(2πn) (n/e)^n, so log(n!) = Θ(n log n)
  – So, best possible sorting procedure is Ω(log(n!)) = Ω(n log n)
Recall log multiplication is normal addition: log mn = log m + log n
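The bound on log2(n!) can also be checked directly, using only the rule that the log of a product is the sum of the logs. This is a standard argument, not taken from the slide:

```latex
\begin{align*}
\log_2 n! &= \sum_{k=1}^{n} \log_2 k
  && \text{(log of a product is a sum of logs)} \\
&\ge \sum_{k=\lceil n/2 \rceil}^{n} \log_2 k
  && \text{(the dropped terms are non-negative)} \\
&\ge \frac{n}{2} \log_2 \frac{n}{2}
  && \text{(at least $n/2$ terms, each at least $\log_2 (n/2)$)} \\
&= \Omega(n \log n).
\end{align*}
```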

12 Searching Problem is Θ(log n)
It is Ω(log n)
  – Each comparison can eliminate at best ½ of all the elements from consideration
It is O(log n)
  – We know a procedure that solves it in Θ(log n)
For Google: n is the number of distinct words on the web (hundreds of millions?)
  – Θ(log n) is not good enough
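The procedure that achieves the O(log n) bound is binary search. A minimal sketch over a sorted list of (key, value) pairs; the sample pairs are made up for illustration:

```python
def binary_search(key, pairs):
    """pairs: list of (key, value) tuples sorted by key.

    Returns the value for key, or None if key is not in the list.
    Each comparison halves the range still under consideration.
    """
    low, high = 0, len(pairs) - 1
    while low <= high:
        mid = (low + high) // 2
        mid_key, value = pairs[mid]
        if mid_key == key:
            return value
        if mid_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return None   # not found


# Hypothetical index entries: word -> list of document ids.
pairs = [("david", [1]), ("evans", [1, 2]), ("google", [3])]
print(binary_search("evans", pairs))    # [1, 2]
print(binary_search("yahoo", pairs))    # None
```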

13 Faster Searching?
The proof that searching is Ω(log n) relied on knowing that the best a comparison can do is eliminate ½ the entries
Can we do better?
  – Without knowing anything about comparison: no
  – With knowing about comparison: yes
What if one comparison can eliminate O(n) of the entries?

14 Bin Searching
First Letter   Items
a              [ <“aardvark”, [ … ]>, … ]
b              [ … ]
…
z              [ …, ]
  def binsearch(key, table):
      return search(key, table[key[0]])
What is time complexity of binsearch?
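A runnable sketch of first-letter binning, assuming non-empty lowercase keys and using the standard library's bisect for the search inside a bin. make_table is a hypothetical helper, not from the slides. With a fixed 26 bins, each bin still holds about n/26 items, so the search inside a bin takes on the order of log(n/26) = log n - log 26 comparisons, which is still Θ(log n).

```python
from bisect import bisect_left


def make_table(words):
    """Group sorted words into bins keyed by first letter (hypothetical helper)."""
    table = {}
    for word in sorted(words):
        table.setdefault(word[0], []).append(word)
    return table


def binsearch(key, table):
    """Pick the bin by first letter, then binary-search inside that bin."""
    bin_items = table.get(key[0], [])
    i = bisect_left(bin_items, key)
    return i < len(bin_items) and bin_items[i] == key


table = make_table(["zebra", "apple", "aardvark", "banana"])
print(binsearch("apple", table))     # True
print(binsearch("avocado", table))   # False
```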

15 Searching in O(1)
To do better than Θ(log n) the number of bins must scale with n
  – Average number of elements in a bin must be O(1)
  – One comparison must eliminate O(n) of the elements

16 Hash Tables
Bin = H(key, number of bins)
  – H is a hash function
  – We’ve seen cryptographic hash functions where H must be collision resistant
  – For this, we don’t need that; H just needs to distribute the keys well across the bins
Finding a good H is difficult
  – You can download Google’s from …
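A toy hash table along these lines. The hash function H below is a simple multiply-and-add fold chosen for illustration only; it is an assumption and is not Google's function or a recommended one. Collisions are handled by keeping a short list in each bin.

```python
def H(key, nbins):
    """Toy hash (illustrative assumption): fold character codes into range(nbins)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % nbins
    return h


class HashTable:
    def __init__(self, nbins):
        self.bins = [[] for _ in range(nbins)]

    def insert(self, key, value):
        bin_items = self.bins[H(key, len(self.bins))]
        for i, (existing_key, _) in enumerate(bin_items):
            if existing_key == key:          # replace an existing entry
                bin_items[i] = (key, value)
                return
        bin_items.append((key, value))

    def lookup(self, key):
        # Only one bin is examined; if bins hold O(1) items on average,
        # the lookup takes O(1) time on average.
        for existing_key, value in self.bins[H(key, len(self.bins))]:
            if existing_key == key:
                return value
        return None


words = HashTable(nbins=8)
words.insert("david", 17)
words.insert("evans", 42)
print(words.lookup("evans"))   # 42
```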

17 Google’s Lexicon
1998: 14 million words (much more today)
Lookup word in H(word, nbins)
Maps to WordID
Key          Words
0            [ , ... ]
1            [ , ..., ]
...
nbins – 1    [ , ..., ]

18 Google’s Reverse Index
WordId    ndocs    pointer
(From 1998 paper... may have changed some since then)
Lexicon: 293 MB (1998)
Inverted Barrels: 41 GB (1998)

19 Inverted Barrels
docid (27 bits)    nhits (5 bits)    hits (16 bits each)
plain hit:
  – capitalized: 1 bit
  – font size: 3 bits
  – position: 12 bits (first 4095 chars, everything else)
extra info for anchors, titles (fewer position bits)
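A sketch of packing a plain hit into 16 bits with the field widths from the slide. The bit order (capitalized in the top bit, then font size, then position) and the clamping of large positions to 4095 are assumptions for illustration; the slide gives only the widths.

```python
MAX_POSITION = 4095   # largest value that fits in 12 bits


def pack_plain_hit(capitalized, font_size, position):
    """Pack (1-bit flag, 3-bit font size, 12-bit position) into one 16-bit int."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)   # assumption: clamp "everything else"
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Inverse of pack_plain_hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 100)
print(hit, unpack_plain_hit(hit))   # 45156 (True, 3, 100)
```

The docid (27 bits) and nhits (5 bits) fields together fill one 32-bit word, and 5 bits allow a count of up to 31 hits per entry.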

20 Building a Web Search Engine
Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents

21 Finding the “Best” Documents
Humans rate them
  – “Jerry and David’s Guide to the World Wide Web” (became Yahoo!)
Machines rate them
  – Count number of occurrences of keyword
      Easy for sites to rig this
  – Machine language understanding not good enough
Business Model
  – Whoever pays you the most is listed first

22 Random Walk Model
  Initialize all page ranks = 0
  p = select a random URL
  for as long as you feel like
      p.rank = p.rank + 1
      p = select random link from Links(p)
Eventually, ranks measure probability a random websurfer would encounter a page
Problems with this?
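A small simulation of the random walk, on a made-up four-page link graph. Two of the problems show up directly: a page with no outgoing links would stop the walk (the sketch jumps to a random page in that case, which is an added assumption, not part of the slide), and a page with no back links, like "d" here, is never reached unless the walk happens to start there.

```python
import random

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],      # nobody links to d
}


def random_walk_ranks(links, steps, seed=0):
    """Count how often a random surfer lands on each page."""
    rng = random.Random(seed)        # seeded so runs are repeatable
    pages = list(links)
    ranks = {page: 0 for page in pages}
    p = rng.choice(pages)
    for _ in range(steps):
        ranks[p] += 1
        out = links[p]
        # Assumption: at a dead end, restart from a random page.
        p = rng.choice(out) if out else rng.choice(pages)
    return ranks


ranks = random_walk_ranks(LINKS, steps=10_000)
print(ranks)
```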

23 Back Links
= 219 backlinks

24 Counting Back Links
link:
  – 109 backlinks (hey, I should be first!)
Back links are not a good measure
  – Most of mine are from my own pages
      But Google doesn’t know that (always)
  – Some pages are more important than others

25 PageRank
Weight the back links by the popularity of the linking page
  def PageRank(u):
      rank = 0
      for b in BackLinks(u):
          rank = rank + PageRank(b) / Links(b)
      return rank
Would this work?

26 Converging PageRank
Ranks of all pages depend on ranks of all other pages
Keep recalculating ranks until they converge
  def CalculatePageRanks(urls):
      initially, every rank is 1
      for as many times as necessary
          calculate a new rank for each page (using old ranks of other pages)
          replace the old ranks with the new ranks
How do initial ranks affect results?
How many iterations are necessary?
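A runnable sketch of the iteration described above, using the slide's formula: the new rank of u is the sum of rank(b) / (number of links on b) over the pages b that link to u. The link graph, the tolerance and the iteration cap are assumptions for illustration. The sketch has no damping factor, unlike the published PageRank formulation, and it assumes every page has at least one outgoing link and that every link target is itself a key in the graph.

```python
def calculate_page_ranks(links, tolerance=1e-9, max_iterations=1000):
    """links: page -> list of pages it links to. Returns (ranks, iterations used)."""
    ranks = {page: 1.0 for page in links}          # initially, every rank is 1
    backlinks = {page: [] for page in links}
    for page, targets in links.items():
        for target in targets:
            backlinks[target].append(page)

    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # New ranks are computed only from the old ranks.
        new_ranks = {
            u: sum(ranks[b] / len(links[b]) for b in backlinks[u])
            for u in links
        }
        change = max(abs(new_ranks[u] - ranks[u]) for u in links)
        ranks = new_ranks                           # replace old ranks with new
        if change < tolerance:
            break
    return ranks, iteration


# Hypothetical three-page web.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, iterations = calculate_page_ranks(LINKS)
print(ranks, iterations)
```

On this small graph the ranks settle near a = 1.2, b = 0.6, c = 1.2; the total rank stays at 3 because every page passes on all of its rank.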

27 PageRank
Crawlable web (1998): 150 million pages, 1.7 Billion links
Database of 322 million links
  – Converges in ~50 iterations
Initialization matters
  – All pages = 1: very democratic, models browser equally likely to start on random page
  – … = 1, ..., all others = 0
      More like what Google probably uses

28 Query Work
To respond to 1 query (2002)
  – Read 100 MB of data
  – 10s of Billions of CPU cycles
Google in 2002:
  – 15,000 commodity PCs
      Racks of 88 2GB PCs, $278,000 each rack
      Power: 10 MW-h/month ($1,500)
  – If you have 15,000 PCs, there will always be some with faults: load balancing, data partitioning

29 Building a Web Search Engine
Database of web pages
  – Crawling the web, collecting pages and links
  – Indexing them efficiently
Responding to Searches
  – How to find documents that match a query
  – How to rank the “best” documents
Ready to go become the next Google?

30 Charge
Before becoming the next Google, you need to finish PS8!
Tomorrow: 6pm, Lighting of the Lawn
Friday’s class:
  – A few other neat things about Google
  – Guidelines for project presentations
  – Exam review – send me your topics and questions
Monday: project presentations