1 CS 430: Information Discovery Lecture 20 Web Search Engines.

2 Course Administration

3 Web Search
Goal
  – Provide information discovery for large amounts of open access material on the web
Challenges
  – Volume of material -- several billion items, growing steadily
  – Items created dynamically or in databases
  – Great variety -- length, formats, quality control, purpose, etc.
  – Inexperience of users -- range of needs
  – Economic models to pay for the service

4 Strategies
Subject hierarchies
  – Yahoo! -- use of human indexing
Web crawling + automatic indexing
  – General -- Google, AltaVista, AllTheWeb, NorthernLight,...
  – Subject based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com,...
Mixed models
  – Human directed web crawling and automatic indexing -- BBC News

5 Components of Web Search Service
Components
  – Web crawler
  – Indexing system
  – Search system
Considerations
  – Economics
  – Scalability

6 Economic Models
Subscription
  – Monthly fee with logon provides unlimited access (introduced by InfoSeek, now Go.Com)
Advertising
  – Access is free, with display advertisements (introduced by Lycos)
  – Can lead to distortion of results to suit advertisers
Licensing
  – Costs of the company are covered by fees, licensing of software, and specialized services

7 Cost Example (Google)
85 people
  – 50% technical, 14 with a Ph.D. in Computer Science
Equipment
  – 2,500 Linux machines
  – 80 terabytes of spinning disks
  – 30 new machines installed daily
Reported by Larry Page, Google, March 2000
  – At that time, Google was handling 5.5 million searches per day
  – Increase rate was 20% per month
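
To get a feel for what a 20% monthly increase implies, the arithmetic can be sketched as below. This assumes the growth rate stays constant, which the slide does not claim; the function name is made up for illustration.

```python
def project_daily_searches(current: float, monthly_growth: float, months: int) -> float:
    """Compound a daily query volume forward at a fixed monthly growth rate."""
    return current * (1.0 + monthly_growth) ** months


start = 5.5e6   # searches per day in March 2000 (from the slide)
rate = 0.20     # 20% increase per month (from the slide)

# Assumption: the rate is held constant, purely for illustration.
for months in (6, 12):
    print(months, "months:", round(project_daily_searches(start, rate, months)))
```

Under that assumption the load roughly triples in six months (about 16 million searches per day) and reaches about 49 million per day after a year, which is why the slide pairs the query figures with 30 new machines per day.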

8 Scalability
[Figure: The growth of the web, 1994 to 2000, on a logarithmic scale from 1 to 10,000,000,000]

9 Scalability
Web search services are centralized systems.
Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra functionality.
Will this continue?
Possible areas for concern are telecommunications costs and disk access rates.

10 Limitations of Web Crawling
Time delay. Typically a monthly cycle. Crawlers are ineffective with sites that change rapidly, e.g., news.
Pages not linked to. Crawlers find only those pages that are linked by paths from their seeds.
Depth of crawl. Crawlers do not index every page on a site (algorithms to avoid crawler traps).
but...
Creators of information are increasingly organizing it to be accessible to the web search services (e.g., Springer-Verlag)
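
Two of these limitations, unlinked pages and bounded crawl depth, can be seen in a toy breadth-first crawler. This is a minimal sketch over an in-memory link graph rather than real HTTP fetching; the URLs and the `max_depth` parameter are hypothetical.

```python
from collections import deque


def crawl(links, seeds, max_depth):
    """Breadth-first traversal of a link graph, stopping max_depth hops from a seed."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit: do not follow links any further
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical toy web: page -> pages it links to.
toy_web = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/2"],
    "a.example/2": ["a.example/3"],
    "b.example/": [],
    "orphan.example/": ["a.example/"],  # links out, but nothing links to it
}
reached = crawl(toy_web, ["a.example/"], max_depth=2)
print(reached)
```

Here `orphan.example/` is never found because no path leads to it from the seed, and `a.example/3` is skipped because it lies three hops away.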

11 Indexing Goals: Precision
Short queries applied to very large numbers of items lead to large numbers of hits.
Usability requires:
  – Ranking hits in an order that fits the user's requirements
  – Effective presentation
    – helpful summary records
    – removal of duplicates
    – grouping results from a single site
Completeness of index is just one factor.

12 Indexing the Web
Parsing problems
  – Errors in HTML
  – Non-ASCII characters
Indexing documents
Shared files, e.g., word list
Sorting

13 Ranking Options
Special factors
  – Conventional methods (e.g., tf.idf) were developed for homogeneous collections, e.g., items of similar length
  – Some items are deliberately constructed to distort indexing
Options
  – Vector space ranking with corrections for document length
  – Extra weighting for specific fields, e.g., title, anchors, etc.
  – Link structure, e.g., Google's PageRank, Kleinberg's Hubs and Authorities
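
As an illustration of ranking by link structure, the following is a minimal sketch of the textbook power-iteration form of PageRank. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions for the example; it is not a description of Google's production computation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: C is linked to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest score because three pages link to it, while D, which nothing links to, keeps only the baseline share.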

14 Case Study: Google
  – Perl with C/C++
  – Linux
  – Module-based architecture
  – Multi-machine
  – Multi-thread

15 Major Structures
BigFiles
  – Span several file systems
  – 64-bit addressed
  – Descriptor management
  – Compression
Document index
  – ISAM (indexed sequential access method), ordered by docID
  – Pointer to Repository, Status, Statistics
  – Pointer to URL and Title in docinfo file if crawled
  – URL to docID conversion (checksum)
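
The checksum-based URL to docID conversion can be sketched as a table of (checksum, docID) pairs kept sorted by checksum and searched by binary search. The choice of CRC-32 as the checksum, the assignment of docIDs in input order, and the function names are assumptions for illustration; the slide does not specify any of them, and checksum collisions are ignored here.

```python
import bisect
import zlib


def url_checksum(url: str) -> int:
    """Assumed checksum: CRC-32 of the UTF-8 encoded URL."""
    return zlib.crc32(url.encode("utf-8"))


def build_table(urls):
    """Assign docIDs in input order and return (checksum, docid) pairs sorted by checksum."""
    return sorted((url_checksum(url), docid) for docid, url in enumerate(urls))


def lookup(table, url):
    """Binary-search the sorted table; return the docID, or None if the URL is unknown."""
    target = url_checksum(url)
    i = bisect.bisect_left(table, (target, -1))
    if i < len(table) and table[i][0] == target:
        return table[i][1]
    return None


table = build_table(["http://a.example/", "http://b.example/", "http://c.example/"])
print(lookup(table, "http://b.example/"))
```

Keeping the table sorted means a lookup costs a logarithmic number of comparisons, and a batch of URLs can be converted in one merge pass against the table.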

16 Major Structures (continued)
Repository
  – Zlib compressed
  – docID, Length, URL
  – Self-consistent data
Lexicon
  – Memory resident
  – List of words and a hash-table of pointers
  – Other auxiliary information…
[Figure: Repository layout]
  Repository: sync | length | compressed packet | sync | length | compressed packet
  Packet (stored compressed in repository): docid | ecode | pagelen | urllen | url | page
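
The repository layout (a sync marker, a length, then a zlib-compressed packet holding docid, ecode, pagelen, urllen, url and page) might be serialized roughly as follows. The field widths, the byte order and the sync value are assumptions chosen for this sketch; the slide gives only the field names.

```python
import struct
import zlib

SYNC = 0x5AFE5AFE                 # hypothetical sync marker
HEADER = struct.Struct("<II")     # sync, length of the compressed packet
FIELDS = struct.Struct("<IBIH")   # docid, ecode, pagelen, urllen (assumed widths)


def pack_record(docid, ecode, url, page):
    """Build one repository record: header followed by the compressed packet."""
    url_b, page_b = url.encode("utf-8"), page.encode("utf-8")
    packet = FIELDS.pack(docid, ecode, len(page_b), len(url_b)) + url_b + page_b
    compressed = zlib.compress(packet)
    return HEADER.pack(SYNC, len(compressed)) + compressed


def unpack_records(blob):
    """Walk a repository blob record by record, yielding (docid, ecode, url, page)."""
    offset = 0
    while offset < len(blob):
        sync, length = HEADER.unpack_from(blob, offset)
        if sync != SYNC:
            raise ValueError("lost sync at offset %d" % offset)
        offset += HEADER.size
        packet = zlib.decompress(blob[offset:offset + length])
        offset += length
        docid, ecode, pagelen, urllen = FIELDS.unpack_from(packet, 0)
        body = packet[FIELDS.size:]
        url = body[:urllen].decode("utf-8")
        page = body[urllen:urllen + pagelen].decode("utf-8")
        yield docid, ecode, url, page


blob = (pack_record(27, 0, "http://a.example/", "<html>hello</html>")
        + pack_record(28, 0, "http://b.example/", "<html>world</html>"))
for record in unpack_records(blob):
    print(record)
```

Because each record carries its own length and URL, the repository can be read sequentially without any other structure, which is one reading of "self-consistent data".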

17 Major Structures (continued 2)
Hit Lists
  – Word in a document + typesetting information (hand-encoded)
  – Take up most of the space of all indices
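
A hand-encoded hit can be illustrated by packing the typesetting information and word position into a single 16-bit value. The widths used here (1 bit for capitalization, 3 bits for font size, 12 bits for position) are assumptions for this sketch, modeled on the plain-hit layout described in the Brin and Page article that the case study discusses; the slide itself gives no bit layout.

```python
def encode_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack one hit into 16 bits: [capitalized:1][font_size:3][position:12]."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # positions past the 12-bit range are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_hit(hit: int):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = encode_hit(True, 3, 42)
print(hit, decode_hit(hit))
```

Two bytes per hit keeps the hit lists compact, which matters because they dominate the size of the indices.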

18 Major Structures (continued 3)
Forward Index
  – Partially sorted
  – Stored in a number of barrels
  – Each barrel holds range of wordIDs + hitlist
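
One way to picture the forward index is that each document's words are distributed to barrels according to the wordID range each barrel covers. The sketch below assumes fixed-width ranges of 1,000 wordIDs per barrel and plain integer positions as hits; both are hypothetical simplifications.

```python
from collections import defaultdict

BARREL_WIDTH = 1000  # hypothetical: each barrel covers 1,000 consecutive wordIDs


def barrel_for(word_id: int) -> int:
    """Return the number of the barrel whose wordID range contains word_id."""
    return word_id // BARREL_WIDTH


def add_document(barrels, doc_id, word_hits):
    """Append one document to the forward barrels.

    word_hits maps wordID -> list of hits (here, word positions) in this document.
    """
    per_barrel = defaultdict(list)
    for word_id, hits in word_hits.items():
        per_barrel[barrel_for(word_id)].append((word_id, hits))
    for barrel_no, entries in per_barrel.items():
        barrels[barrel_no].append((doc_id, entries))


barrels = defaultdict(list)
add_document(barrels, 27, {5: [1, 9], 1200: [4]})
add_document(barrels, 31, {5: [2], 7: [3, 8]})
for barrel_no in sorted(barrels):
    print(barrel_no, barrels[barrel_no])
```

The index is only partially sorted: entries are grouped by wordID range, but within a barrel they appear in the order documents were processed.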

19 Major Structures (continued 4)
Inverted Index
  – Same barrels, but processed by the sorter
  – For the sake of speed, entries are not stored in order of occurrence ranking
  – Two sets of inverted barrels
[Figure: Each Lexicon entry (wordid, ndocs) points into the Inverted Barrels, where every document entry holds a docid, nhits, and the hits themselves, e.g., docid: 27, nhits: 5, hit hit ...]
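
The sorter's job can be sketched as re-sorting a forward barrel by wordID so that each word gets a list of (docid, hits) entries. The in-memory list and dict representations, the ordering of each list by docID, and the function names are assumptions for illustration.

```python
def invert(forward_barrel):
    """Turn [(docid, [(wordid, hits), ...]), ...] into {wordid: [(docid, hits), ...]}."""
    postings = []
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            postings.append((word_id, doc_id, hits))
    # Sort by wordID, then docID, so each word's doclist is ordered by docID.
    postings.sort(key=lambda p: (p[0], p[1]))
    inverted = {}
    for word_id, doc_id, hits in postings:
        inverted.setdefault(word_id, []).append((doc_id, hits))
    return inverted


def lexicon_entry(inverted, word_id):
    """Return (wordid, ndocs), mirroring the lexicon fields shown in the figure."""
    return word_id, len(inverted.get(word_id, []))


forward = [(31, [(5, [2]), (7, [3, 8])]), (27, [(5, [1, 9])])]
inverted = invert(forward)
print(inverted)
print(lexicon_entry(inverted, 5))
```

Ordering each doclist by docID rather than by rank makes it cheap to merge the lists of several query words.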

20 Searching
1. Parse the query
2. Convert words to wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through until a document that matches all terms is encountered
5. Compute the rank of that document
6. Repeat the same thing for the full barrel
7. Sort the documents matched by rank and return the first few
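
The seven steps can be sketched as follows. Several simplifications are assumed: barrels are dicts mapping wordID to {docID: hit count}, the scan of step 4 is replaced by a set intersection, and the score is a made-up weighted hit count in which short-barrel matches count ten times as much. None of these details come from the slide.

```python
def search(query, lexicon, short_barrel, full_barrel, limit=10):
    """Return up to `limit` docIDs containing every query word, best score first."""
    words = query.lower().split()                      # step 1: parse the query
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]             # step 2: words -> wordIDs
    scores = {}
    # Steps 3-6: process the short barrel first, then the full barrel.
    for barrel, weight in ((short_barrel, 10), (full_barrel, 1)):
        doclists = [barrel.get(w, {}) for w in word_ids]
        # Documents that match all terms (intersection instead of a linear scan).
        matches = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in matches:
            hits = sum(doclist[doc_id] for doclist in doclists)
            scores[doc_id] = scores.get(doc_id, 0) + weight * hits
    # Step 7: sort by rank and return the first few.
    return sorted(scores, key=lambda d: (-scores[d], d))[:limit]


# Hypothetical data: wordID -> {docID: number of hits}.
lexicon = {"bill": 1, "clinton": 2}
short_barrel = {1: {27: 1}, 2: {27: 1}}
full_barrel = {1: {27: 3, 31: 2, 40: 1}, 2: {27: 2, 31: 4}}
print(search("Bill Clinton", lexicon, short_barrel, full_barrel))
```

Document 27 ranks first because it matches in both barrels, document 31 matches only in the full barrel, and document 40 is dropped because it lacks one of the terms.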

21 Results
Quality of results
Sorting
  – PageRank
  – Anchor text
  – Proximity
Broken links

Query: bill clinton
http://www.whitehouse.gov/ 100.00% (no date) (0K)
  http://www.whitehouse.gov/
Office of the President 99.67% (Dec 23 1996) (2K)
  http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
Welcome To The White House 99.98% (Nov 09 1997) (5K)
  http://www.whitehouse.gov/WH/Welcome.html
Send Electronic Mail to the President 99.86% (Jul 14 1997) (5K)
  http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
mailto:president@whitehouse.gov 99.98%
mailto:President@whitehouse.gov 99.27%
The "Unofficial" Bill Clinton 94.06% (Nov 11 1997) (14K)
  http://zpub.com/un/un-bc.html
Bill Clinton Meets The Shrinks 86.27% (Jun 29 1997) (63K)
  http://zpub.com/un/un-bc9.html
President Bill Clinton - The Dark Side 97.27% (Nov 10 1997) (15K)
  http://www.realchange.org/clinton.htm
$3 Bill Clinton 94.73% (no date) (4K)
  http://www.gatewy.net/~tjohnson/clinton1.html

22 Performance
Storage
  – Scale with the size of the Web
  – Repository is comparatively small
  – Good/Fast compression/decompression
System
  – Crawling, Indexing, Sorting
  – Last two simultaneously
Searching
  – Bounded by disk I/O

23 Question 4: Scaling
Much of the article is about scalability.
(a) How many pages were they indexing when they wrote the article? How many today? How many queries does the system handle every day?
(b) What is their strategy for scalability? Where do you think the limitations lie?
(c) How do they manage to implement such a large-scale (and ever-changing) system with a small technical staff?

24 Question 6: Implementation
(a) What is the function of the Google lexicon? How is it stored?
(b) What is the function of the hit list? How is it stored?
(c) What is the function of the forward index? How is it stored?

25 Conclusion
Google:
  – Scalable search engine
  – Complete architecture
Many research ideas arise
  – Always something to improve
  – Matter of time
High quality search is the dominant factor
