1 CS 430: Information Discovery Lecture 20 Web Search Engines.

Presentation transcript:

1 CS 430: Information Discovery Lecture 20 Web Search Engines

2 Course Administration

3 Web Search
Goal
  Provide information discovery for large amounts of open access material on the web
Challenges
  – Volume of material -- several billion items, growing steadily
  – Items created dynamically or in databases
  – Great variety -- length, formats, quality control, purpose, etc.
  – Inexperience of users -- range of needs
  – Economic models to pay for the service

4 Strategies
Subject hierarchies
  – Yahoo! -- use of human indexing
Web crawling + automatic indexing
  – General -- Google, AltaVista, AllTheWeb, NorthernLight, ...
  – Subject based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com, ...
Mixed models
  – Human directed web crawling and automatic indexing -- BBC News

5 Components of Web Search Service
Components
  – Web crawler
  – Indexing system
  – Search system
Considerations
  – Economics
  – Scalability

6 Economic Models
Subscription
  – Monthly fee with logon provides unlimited access (introduced by InfoSeek, now Go.Com)
Advertising
  – Access is free, with display advertisements (introduced by Lycos)
  – Can lead to distortion of results to suit advertisers
Licensing
  – Costs of the company are covered by fees, licensing of software, and specialized services

7 Cost Example (Google)
85 people
  – 50% technical, 14 Ph.D. in Computer Science
Equipment
  – 2,500 Linux machines
  – 80 terabytes of spinning disks
  – 30 new machines installed daily
Reported by Larry Page, Google, March 2000
  – At that time, Google was handling 5.5 million searches per day
  – Increase rate was 20% per month
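
To give a feel for what a 20% monthly increase means, the following sketch compounds the reported 5.5 million searches per day. It assumes the rate stays constant, which the slide does not claim, and the function names are illustrative only.

```python
import math


def projected_daily_searches(start: float, monthly_rate: float, months: int) -> float:
    """Compound growth: start * (1 + monthly_rate) ** months."""
    return start * (1.0 + monthly_rate) ** months


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed for the load to double at a constant monthly rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


if __name__ == "__main__":
    # Figures from the slide: 5.5 million searches/day, growing 20% per month.
    after_one_year = projected_daily_searches(5.5e6, 0.20, 12)
    print(f"Searches per day after 12 months: {after_one_year:,.0f}")
    print(f"Doubling time: {doubling_time_months(0.20):.1f} months")
```

Under this assumption the load doubles roughly every 3.8 months and reaches about 49 million searches per day after one year, which is why 30 new machines per day is not an extravagant figure.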

8 Scalability
[Figure: The growth of the web]

9 Scalability
Web search services are centralized systems. Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra functions.
Will this continue? Possible areas for concern are telecommunications costs and disk access rates.

10 Limitations of Web Crawling
  – Time delay. Typically a monthly cycle. Crawlers are ineffective with sites that change rapidly, e.g., news.
  – Pages not linked to. Crawlers find only those pages that are linked by paths from their seeds.
  – Depth of crawl. Crawlers do not index every page on a site (algorithms to avoid crawler traps).
but... Creators of information are increasingly organizing it to be accessible to the web search services (e.g., Springer-Verlag)

11 Indexing Goals: Precision
Short queries applied to very large numbers of items lead to large numbers of hits.
Usability requires:
  – Ranking hits in an order that fits the user's requirements
  – Effective presentation: helpful summary records, removal of duplicates, grouping results from a single site
Completeness of index is just one factor.

12 Indexing the Web Parsing problems –Errors in HTML –Non-ASCII characters Indexing documents Shared files, e.g., word list Sorting

13 Ranking Options
Special factors
  – Conventional methods (e.g., tf.idf) were developed for homogeneous collections, e.g., items of similar length
  – Some items are deliberately constructed to distort indexing
Options
  – Vector space ranking with corrections for document length
  – Extra weighting for specific fields, e.g., title, anchors, etc.
  – Link structure, e.g., Google's PageRank, Kleinberg's Hubs and Authorities
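
The link-structure option can be illustrated with a small power-iteration sketch of PageRank. This is a simplified, normalized variant (ranks sum to 1, dangling pages spread their rank evenly), not Google's production method; the damping factor of 0.85 and the three-page graph in the usage lines are assumptions chosen for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank by repeated redistribution of rank along links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base share regardless of its inlinks.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: a and b both link to c, and c links back to a.
    ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
    for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
        print(page, round(score, 4))
```

In this toy graph, page c ranks highest because two pages link to it, and page b ranks lowest because nothing links to it.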

14 Case Study: Google
  – Perl with C/C++
  – Linux
  – Module-based architecture
  – Multi-machine
  – Multi-thread

15 Major Structures
BigFiles
  – Span several file systems
  – 64-bit addressed
  – Descriptor management
  – Compression
Document index
  – ISAM (Index sequential access mode), ordered by docID
  – Pointer to Repository, Status, Statistics
  – Pointer to URL and Title in docinfo file if crawled
  – URL to docID conversion (checksum)

16 Major Structures (continued)
Repository
  – Zlib compressed
  – docID, Length, URL
  – Self-consistent data
Lexicon
  – Memory resident
  – List of words and a hash-table of pointers
  – Other auxiliary information…
[Figure: repository layout]
  Repository: sync | length | compressed packet (repeated for each document)
  Packet (stored compressed in repository): docid | ecode | pagelen | urllen | url | page
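
The repository layout can be sketched as follows: each record is a sync marker and a length, followed by a zlib-compressed packet holding docid, ecode, pagelen, urllen, url, and page. The slide gives no field widths, so the sync value, the byte sizes, and the function names below are assumptions made only for illustration.

```python
import struct
import zlib
from typing import Iterator, Tuple

SYNC = b"SYNC"                          # hypothetical sync marker
RECORD_HEADER = struct.Struct("<4sI")   # sync, length of compressed packet
PACKET_HEADER = struct.Struct("<QBIH")  # docid, ecode, pagelen, urllen (assumed widths)


def pack_record(docid: int, ecode: int, url: str, page: bytes) -> bytes:
    """Build one repository record: sync + length + compressed packet."""
    url_bytes = url.encode("utf-8")
    packet = PACKET_HEADER.pack(docid, ecode, len(page), len(url_bytes))
    packet += url_bytes + page
    compressed = zlib.compress(packet)
    return RECORD_HEADER.pack(SYNC, len(compressed)) + compressed


def unpack_records(blob: bytes) -> Iterator[Tuple[int, int, str, bytes]]:
    """Walk a repository blob and yield (docid, ecode, url, page) per record."""
    offset = 0
    while offset < len(blob):
        sync, length = RECORD_HEADER.unpack_from(blob, offset)
        if sync != SYNC:
            raise ValueError(f"bad sync marker at offset {offset}")
        offset += RECORD_HEADER.size
        packet = zlib.decompress(blob[offset:offset + length])
        offset += length

        docid, ecode, pagelen, urllen = PACKET_HEADER.unpack_from(packet, 0)
        start = PACKET_HEADER.size
        url = packet[start:start + urllen].decode("utf-8")
        page = packet[start + urllen:start + urllen + pagelen]
        yield docid, ecode, url, page


if __name__ == "__main__":
    blob = pack_record(1, 0, "http://example.com/", b"<html>hello</html>")
    blob += pack_record(2, 0, "http://example.org/", b"<html>world</html>")
    for record in unpack_records(blob):
        print(record)
```

Because each record carries everything needed to interpret it, the other structures can in principle be rebuilt from the repository alone, which is one reading of "self-consistent data".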

17 Major Structures (continued 2)
Hit Lists
  – Word in a document + typesetting information (hand-encoded)
  – Take most of the space of all indices
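
A hand-encoded hit can be sketched as a bit-packed integer. The slide does not give the layout; the sketch below assumes the 16-bit "plain hit" described in Brin and Page's paper (1 capitalization bit, 3 bits of relative font size, 12 bits of word position). Clamping large positions to 4095 and the function names are simplifications of this sketch.

```python
from typing import Tuple

MAX_POSITION = 0xFFF  # largest value that fits in 12 bits


def encode_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack one hit into 16 bits: [cap:1][font_size:3][position:12]."""
    # Font size 7 is assumed to be reserved as a flag for other hit types.
    if not 0 <= font_size <= 6:
        raise ValueError("font_size must be in 0..6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)  # clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit: int) -> Tuple[bool, int, int]:
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & MAX_POSITION


if __name__ == "__main__":
    hit = encode_plain_hit(capitalized=True, font_size=3, position=42)
    print(hex(hit), decode_plain_hit(hit))
```

Packing each hit into two bytes matters because, as the slide notes, hit lists account for most of the space used by the indices.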

18 Major Structures (continued 3)
Forward Index
  – Partially sorted
  – Stored in a number of barrels
  – Each barrel holds range of wordIDs + hitlist

19 Major Structures (continued 4)
Inverted Index
  – Same barrels, but processed by the sorter
  – Not stored by ranking in occurrence for the sake of speed
  – Two sets of inverted barrels
[Figure: lexicon and inverted barrels]
  Lexicon: wordid | ndocs
  Inverted Barrels: repeated entries of docid: 27 | nhits: 5 | hit hit
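
The relationship between the forward index and the inverted index can be sketched as a regrouping step: a forward barrel lists hits per document, and the sorter regroups them per wordID. The in-memory data shapes and function names below are assumptions for illustration, and the real barrels are on-disk structures.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Forward barrel: one entry per document, each with (wordid, hit list) pairs.
ForwardBarrel = List[Tuple[int, List[Tuple[int, List[int]]]]]
# Inverted barrel: wordid -> doclist of (docid, hit list), ordered by docid.
InvertedBarrel = Dict[int, List[Tuple[int, List[int]]]]


def invert_barrel(forward: ForwardBarrel) -> InvertedBarrel:
    """Regroup a forward barrel by wordid, keeping each doclist sorted by docid."""
    grouped: Dict[int, List[Tuple[int, List[int]]]] = defaultdict(list)
    for docid, word_hits in forward:
        for wordid, hits in word_hits:
            grouped[wordid].append((docid, list(hits)))
    return {
        wordid: sorted(doclist, key=lambda entry: entry[0])
        for wordid, doclist in sorted(grouped.items())
    }


def build_lexicon(inverted: InvertedBarrel) -> Dict[int, int]:
    """Map each wordid to ndocs, the number of documents containing it."""
    return {wordid: len(doclist) for wordid, doclist in inverted.items()}


if __name__ == "__main__":
    forward = [
        (20, [(1, [3, 9]), (2, [4])]),
        (10, [(1, [0])]),
    ]
    inverted = invert_barrel(forward)
    print(inverted)
    print(build_lexicon(inverted))
```

Sorting each doclist by docID is what allows several doclists to be merged quickly at query time.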

20 Searching
  1. Parse the query
  2. Convert words to wordIDs
  3. Seek to start of doclist in the short barrel for every word
  4. Scan through until a document that matches all terms is encountered
  5. Compute the rank of that document
  6. Repeat the same thing for the full barrel
  7. Sort the documents matched by rank and return the first few
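
The seven steps above can be sketched over in-memory dictionaries. Here a barrel maps wordID to {docID: hit list}; the scoring rule (hit counts, with short-barrel hits weighted ten times more) is a placeholder assumption, not Google's ranking function, and all names are illustrative.

```python
from typing import Dict, List

Barrel = Dict[int, Dict[int, List[int]]]  # wordid -> {docid: hit list}


def search(query: str, lexicon: Dict[str, int], short_barrel: Barrel,
           full_barrel: Barrel, top_k: int = 10) -> List[int]:
    """Return the docids matching every query term, best first."""
    # Steps 1-2: parse the query and convert words to wordIDs.
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown term means no document matches all terms
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    scores: Dict[int, float] = {}
    # Steps 3-6: process the short barrel first, then the full barrel.
    for barrel, weight in ((short_barrel, 10.0), (full_barrel, 1.0)):
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        matching = set(doclists[0])
        for doclist in doclists[1:]:
            matching &= set(doclist)  # keep only docs containing every term
        for docid in matching:
            nhits = sum(len(doclist[docid]) for doclist in doclists)
            scores[docid] = scores.get(docid, 0.0) + weight * nhits

    # Step 7: sort by rank (ties broken by docid) and return the first few.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [docid for docid, _ in ranked[:top_k]]


if __name__ == "__main__":
    lexicon = {"bill": 1, "clinton": 2}
    short_barrel = {1: {7: [0]}, 2: {7: [1]}}
    full_barrel = {1: {7: [5], 8: [3, 9]}, 2: {7: [6], 8: [4]}}
    print(search("Bill Clinton", lexicon, short_barrel, full_barrel))
```

A production system would scan sorted doclists on disk in lockstep rather than intersect sets in memory; the sets here only keep the sketch short.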

21 Results
Quality of results
Sorting
  – PageRank
  – Anchor text
  – Proximity
Broken links
[Figure: sample results]
  Query: bill clinton
  % (no date) (0K)
  Office of the President 99.67% (Dec ) (2K)
  Welcome To The White House 99.98% (Nov ) (5K)
  Send Electronic Mail to the President 99.86% (Jul ) (5K)
  % 99.27%
  The "Unofficial" Bill Clinton 94.06% (Nov ) (14K)
  Bill Clinton Meets The Shrinks 86.27% (Jun ) (63K)
  President Bill Clinton - The Dark Side 97.27% (Nov ) (15K)
  $3 Bill Clinton 94.73% (no date) (4K)

22 Performance
Storage
  – Scale with the size of the Web
  – Repository is comparatively small
  – Good/Fast compression/decompression
System
  – Crawling, Indexing, Sorting
  – Last two simultaneously
Searching
  – Bounded by disk I/O

23 Question 4: Scaling
Much of the article is about scalability.
  (a) How many pages were they indexing when they wrote the article? How many today? How many queries does the system handle every day?
  (b) What is their strategy for scalability? Where do you think the limitations lie?
  (c) How do they manage to implement such a large-scale (and ever changing) system with a small technical staff?

24 Question 6: Implementation
  (a) What is the function of the Google lexicon? How is it stored?
  (b) What is the function of the hit list? How is it stored?
  (c) What is the function of the forward index? How is it stored?

25 Conclusion
Google:
  – Scalable search engine
  – Complete architecture
Many research ideas arise
  – Always something to improve
  – Matter of time
High quality search is the dominant factor