Information Retrieval B (Recuperação de Informação B), Chapter 13.4.4: Ranking; 13.4.5: Crawling the Web; 13.4.6: Indices. December 06, 1999.


Ranking
- Most used models: Boolean or Vector, and their variations
- Ranking has to be performed without accessing the text, using just the index
- The details of commercial ranking algorithms are kept "top secret"; it is also almost impossible to measure recall, as the number of relevant pages can be quite large even for simple queries

Ranking
- Yuwono and Lee (1997) proposed three ranking algorithms in addition to the TF x IDF scheme:
  - Boolean spread: classical Boolean model plus a simplified form of link analysis
  - Most-cited: based only on the terms included in pages having a link to the pages in the answer
  - Vector spread activation: vector space model combined with a spread activation model (76%)
  - Vector space model (TF x IDF): word distribution statistics (75%; a small scoring sketch follows below)
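As an illustration of the TF x IDF vector space scoring listed above, here is a minimal sketch in Python. The two documents and the query are made-up examples, not data from the Yuwono and Lee study; real engines use many refinements (length normalization, weighting variants) not shown here.

```python
# A minimal sketch of TF x IDF ranking over a tiny hypothetical collection.
import math
from collections import Counter

docs = {
    "d1": "web search engines crawl and index web pages",
    "d2": "ranking web pages by hyperlink popularity",
}

def tf_idf_rank(query, docs):
    """Rank documents by a simple tf x idf dot product with the query."""
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter()                       # document frequency of each term
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        scores[d] = sum(tf[t] * math.log(n / df[t])
                        for t in query.split() if t in tf)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(tf_idf_rank("ranking web pages", docs))
```

Note how terms that occur in every document (such as "web" here) get an idf of zero and contribute nothing to the score.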

Ranking
- A comparison of these techniques over 56 queries on a collection of 2,400 Web pages indicates that the vector model yields the best recall-precision curve, with an average precision of 75%
- Some of the newer ranking algorithms also use hyperlink information
- An important difference between the Web and normal IR databases is that the number of hyperlinks pointing to a page provides a measure of its popularity and quality
- Links in common between pages often indicate a relationship between those pages

Ranking
- Three examples of ranking techniques based on link analysis:
  - WebQuery, which also allows visual browsing of Web pages. WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is (see the sketch below)
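The following is a rough sketch of the WebQuery idea as described above: rank the pages of an answer set by how connected each page is within the set. The link graph and the connectedness measure (in-degree plus out-degree) are illustrative assumptions, not WebQuery's actual definition.

```python
# Rank the pages of a (hypothetical) answer set by how connected they are within the set.
answer_links = {
    "pageA": {"pageB", "pageC", "pageD"},
    "pageB": {"pageC"},
    "pageC": set(),
    "pageD": set(),
}

def connectedness(page, links):
    """In-degree plus out-degree of the page, counted within the answer set."""
    out_degree = len(links.get(page, ()))
    in_degree = sum(page in targets for targets in links.values())
    return in_degree + out_degree

ranked = sorted(answer_links, key=lambda p: connectedness(p, answer_links), reverse=True)
print(ranked)  # the most connected pages in the set come first
```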

Ranking: WebQuery (figure not captured in the transcript)

Ranking
- Kleinberg's ranking scheme (HITS) depends on the query and considers the set of pages S that point to, or are pointed to by, pages in the answer
  - Pages in S that have many links pointing to them are called authorities
  - Pages that have many outgoing links are called hubs
- Better authority pages have incoming edges from good hubs, and better hub pages have outgoing edges to good authorities

Ranking (hub and authority score computation; the equations on this slide were not captured in the transcript)
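A short sketch of the standard hub/authority (HITS) iteration that the slide above refers to: the authority score of a page is raised by the hubs that point to it, and the hub score of a page by the authorities it points to. The example graph and the fixed number of iterations are assumptions for illustration.

```python
# Standard HITS updates on a tiny hypothetical graph.
import math

def hits(out_links, iterations=20):
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p grows with the hub scores of the pages that point to p
        auth = {p: sum(hub[v] for v in pages if p in out_links[v]) for p in pages}
        # hub score of p grows with the authority scores of the pages p points to
        hub = {p: sum(auth[u] for u in out_links[p]) for p in pages}
        # normalize so the scores do not diverge
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: x / a_norm for p, x in auth.items()}
        hub = {p: x / h_norm for p, x in hub.items()}
    return auth, hub

graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
authorities, hubs = hits(graph)
print(authorities)  # "c" ends up with the highest authority score
print(hubs)         # "a" ends up with the highest hub score
```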

Ranking
- PageRank simulates a user navigating randomly on the Web, who jumps to a random page with probability q or follows a random hyperlink on the current page with probability 1 - q
- This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed
- Let C(a) be the number of outgoing links of page a, and suppose that page a is pointed to by pages p_1 to p_n

Ranking
- PR(a) = q + (1 - q) * [ PR(p_1)/C(p_1) + ... + PR(p_n)/C(p_n) ]
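A minimal power-iteration sketch of the formula above. The tiny graph and the value q = 0.15 are illustrative assumptions; real implementations also handle pages without outgoing links and work on much larger graphs.

```python
# Iterate the PageRank formula until the values settle (hypothetical three-page graph).
def pagerank(out_links, q=0.15, iterations=50):
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            a: q + (1 - q) * sum(pr[p] / len(out_links[p])
                                 for p in pages if a in out_links[p])
            for a in pages
        }
    return pr

graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}
print(pagerank(graph))  # pages with more (and better) incoming links get higher PR values
```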

Crawling the Web
- Start with a set of URLs, extract further URLs from the retrieved pages, and follow them recursively in a breadth-first or depth-first fashion
- Search engines allow users to submit top Web sites, which are added to the URL set
  - A variation is to start with a set of popular URLs, because we can expect them to contain information that is frequently requested

Crawling the Web
- Both approaches work well for a single crawler, but it is difficult to coordinate several crawlers so that they avoid visiting the same page more than once
- Another technique is to partition the Web using country codes or Internet domain names, assign one or more robots to each partition, and explore each partition exhaustively
- Considering how the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: what we see has never existed as a whole, because the light has traveled different distances to reach our eyes

Crawling the Web
- Similarly, the Web pages referenced in an index were explored at different dates, and some may no longer exist
- How fresh are the Web pages referenced in an index? Typically they range from one day to two months old; for that reason, most search engines show in the answer the date when each page was indexed
- The percentage of invalid links stored in search engines varies from 2% to 9%

Crawling the Web
- User-submitted pages are usually crawled after a few days or weeks
- Some engines traverse the whole Web site, while others select just a sample of pages, or pages up to a certain depth; non-submitted pages may wait from weeks up to a couple of months to be detected
- Some engines learn the change frequency of a page and visit it accordingly
- The currently fastest crawlers can traverse up to 10 million Web pages per day

Crawling the Web
- The order in which URLs are traversed is important (a breadth-first sketch follows below)
  - Using a breadth-first policy, we first look at all the pages linked from the current page, and so on. This matches well Web sites that are structured around related topics; on the other hand, coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests
  - In the depth-first case, we follow the first link of a page and do the same on the linked page, until we cannot go any deeper, returning recursively
  - Good ordering schemes can make a difference if they crawl better pages first (for example, ordering by PageRank)
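A minimal breadth-first crawling sketch, as referenced above. The fetch_links function and the in-memory fake_web graph are stand-ins for downloading a page and extracting its hyperlinks; a real crawler would fetch over HTTP, respect politeness delays, and persist its frontier.

```python
# Breadth-first traversal of a (simulated) Web, avoiding repeated visits.
from collections import deque

fake_web = {
    "http://seed.example/": ["http://seed.example/a", "http://seed.example/b"],
    "http://seed.example/a": ["http://seed.example/b"],
    "http://seed.example/b": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return fake_web.get(url, [])

def crawl_bfs(seeds, max_pages=100):
    seen = set(seeds)
    frontier = deque(seeds)
    visited_order = []
    while frontier and len(visited_order) < max_pages:
        url = frontier.popleft()
        visited_order.append(url)
        for link in fetch_links(url):
            if link not in seen:        # avoid visiting the same page more than once
                seen.add(link)
                frontier.append(link)
    return visited_order

print(crawl_bfs(["http://seed.example/"]))
```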

Crawling the Web
- Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed (see the sketch below)
- Crawlers can also have problems with HTML pages that use frames or image maps; in addition, dynamically generated pages and password-protected pages cannot be indexed
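The best-known of these guidelines is the robots exclusion protocol (robots.txt). Below is a small sketch using Python's standard library parser; the robots.txt content, the example URLs, and the crawler name are hypothetical.

```python
# Check the robots exclusion protocol before fetching pages.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)   # in practice the crawler would download /robots.txt from the site

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))      # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x.html"))  # False
```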

Indices
- Most indices use variants of the inverted file (a small sketch follows below)
  - An inverted file is a sorted list of words (the vocabulary), each with a set of pointers to the pages where it occurs
  - Some search engines eliminate stopwords to reduce the size of the index. Normalization operations may include removal of punctuation, multiple spaces, etc.
  - To give the user some idea of each document retrieved, the index is complemented with a short description of each Web page
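A minimal sketch of building such an inverted file, as referenced above. The pages, the stopword list, and the normalization (lowercasing and whitespace splitting only) are simplified assumptions.

```python
# Build a small inverted file: sorted vocabulary -> set of pages where each word occurs.
STOPWORDS = {"the", "a", "of", "and"}

pages = {
    "http://example.com/1": "the anatomy of a web search engine",
    "http://example.com/2": "crawling and indexing the web",
}

def build_inverted_file(pages):
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            if word in STOPWORDS:        # stopword elimination reduces index size
                continue
            index.setdefault(word, set()).add(url)
    # keep the vocabulary sorted so queries can use binary search
    return dict(sorted(index.items()))

inverted = build_inverted_file(pages)
print(list(inverted))        # the sorted vocabulary
print(inverted["web"])       # pages containing "web"
```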

Indices
- Assuming that 500 bytes are required to store the URL and the description of each Web page, we need 50 GB to store the descriptions of 100 million pages
- As the user initially receives only a subset of the complete answer to each query, the search engine usually keeps the whole answer set in memory, to avoid having to recompute it if the user asks for more documents

Indices
- Indexing techniques can reduce the size of an inverted file to about 30% of the size of the text (less if stopwords are used); for 100 million pages, this implies about 15 GB of disk space
- A query is answered by doing a binary search on the sorted list of words of the inverted file (a sketch follows below)
- When searching for multiple words, the results for each word have to be combined to generate the final answer
- Problem: very frequent words have long lists of occurrences, which makes this combination step expensive
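A sketch of answering a query against a sorted vocabulary, as described above: a binary search locates each query word, and the page lists are intersected for a multi-word (AND) query. The small vocabulary and postings are hypothetical.

```python
# Binary search on the sorted vocabulary plus intersection of page lists.
from bisect import bisect_left

vocabulary = ["anatomy", "crawling", "engine", "indexing", "search", "web"]
postings = {
    "anatomy": {"p1"}, "crawling": {"p2"}, "engine": {"p1"},
    "indexing": {"p2"}, "search": {"p1"}, "web": {"p1", "p2"},
}

def lookup(word):
    i = bisect_left(vocabulary, word)            # binary search on the sorted vocabulary
    if i < len(vocabulary) and vocabulary[i] == word:
        return postings[word]
    return set()

def and_query(words):
    result = None
    for w in words:
        pages = lookup(w)
        result = pages if result is None else result & pages
    return result or set()

print(and_query(["web", "search"]))   # {'p1'}
```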

Indices
- Inverted files can also point to the actual occurrences of a word within a document, but for the Web this is too costly in space, because each pointer has to specify a page and a position inside the page (word numbers can be used instead of byte offsets)
- Having the positions of the words in a page, we can answer phrase searches or proximity queries by finding words that are near each other in a page (see the sketch below)
- Currently, some search engines provide phrase searches, but their actual implementation is not publicly known
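A sketch of a phrase query over a positional index, as mentioned above: each posting records the word positions inside a page, and a two-word phrase matches where the positions are consecutive. The index contents are hypothetical.

```python
# Phrase search using word positions stored in the postings.
positional = {
    "search": {"p1": [2, 9]},
    "engine": {"p1": [3], "p2": [5]},
}

def phrase_query(w1, w2, index):
    """Pages where w2 occurs at the position immediately after w1."""
    hits = set()
    for page, positions in index.get(w1, {}).items():
        next_positions = set(index.get(w2, {}).get(page, []))
        if any(pos + 1 in next_positions for pos in positions):
            hits.add(page)
    return hits

print(phrase_query("search", "engine", positional))  # {'p1'}
```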

Indices
- Finding words that start with a given prefix requires two binary searches in the sorted list of words (see the sketch below)
- More complex searches can be performed by doing a sequential scan over the vocabulary
- The best sequential algorithms for this type of query can search around 20 MB of text stored in RAM per second (5 MB is roughly the vocabulary size for 1 GB of text). For the Web this is still slow but not impossible: using Heaps' law with β = 0.7 for the Web, the vocabulary size for 1 TB of text is about 630 MB, which implies a search time of about half a minute
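A sketch of the prefix search described above, using two binary searches over the sorted vocabulary; the vocabulary is a made-up example.

```python
# Two binary searches delimit the range of words sharing a prefix.
from bisect import bisect_left

vocabulary = ["crawl", "crawler", "crawling", "index", "indexing", "web"]

def prefix_search(prefix, vocab):
    lo = bisect_left(vocab, prefix)                 # first word >= prefix
    hi = bisect_left(vocab, prefix + "\uffff")      # first word past the prefix range
    return vocab[lo:hi]

print(prefix_search("crawl", vocabulary))  # ['crawl', 'crawler', 'crawling']
```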

Indices
- Pointing to pages or to word positions determines the granularity of the index
- The index can be made less dense by pointing to logical blocks instead of pages
  - Blocks reduce the variance of document sizes, by making all blocks roughly the same size
  - This reduces the size of the pointers (because there are fewer blocks than documents)
  - It also reduces the number of pointers (because words have locality of reference)

Indices
- This idea was used in Glimpse (a small sketch follows below)
- Queries are resolved as for inverted files, obtaining a list of candidate blocks that are then searched sequentially (exact sequential search can be done at over 30 MB per second in RAM)
- Glimpse originally used only 256 blocks and was efficient up to about 200 MB of text for words that were not too frequent, with an index of only a small percentage of the text size
- This scheme is not used on the Web because sequential search over the pages cannot be afforded, as it would imply network accesses. However, in a distributed architecture where the index itself is distributed, logical blocks make sense
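A small sketch of block addressing in the spirit of Glimpse, as referenced above: the index points to fixed logical blocks rather than to individual pages, and the matching blocks are then scanned sequentially. The block size and the text are illustrative assumptions.

```python
# Coarse block index: word -> set of block numbers, followed by sequential scan of candidates.
text = ("web search engines crawl the web and build indices "
        "ranking uses links and glimpse uses block addressing")
words = text.split()
BLOCK_SIZE = 6  # words per logical block (tiny, for illustration)
blocks = [words[i:i + BLOCK_SIZE] for i in range(0, len(words), BLOCK_SIZE)]

block_index = {}
for b, chunk in enumerate(blocks):
    for word in chunk:
        block_index.setdefault(word, set()).add(b)   # fewer, smaller pointers than a full index

def search(word):
    """Use the block index to select candidate blocks, then scan them sequentially."""
    return [b for b in sorted(block_index.get(word, ())) if word in blocks[b]]

print(search("web"))      # blocks containing "web"
print(search("glimpse"))  # a word occurring in only one block
```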


Conclusion
- Current search engines basically use Boolean or Vector models and their variations
- Link analysis techniques seem to be the "next generation" of search engines
- "Bots" will be "Pokémons"
- Indices: compression and a distributed architecture are key

References
- CHAKRABARTI, S.; DOM, B.; RAGHAVAN, P.; RAJAGOPALAN, S.; GIBSON, D.; KLEINBERG, J. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text
- GIBSON, D.; KLEINBERG, J.; RAGHAVAN, P. Structural Analysis of the World Wide Web. Position paper at the WWW Consortium Web Characterization workshop, 1998
- GOOGLE
- Members of the Clever Project. Hypersearching the Web. Scientific American, March 1999
- TODOBR
- WEBGLIMPSE, glimpse.cs.arizona.edu/webglimpse/