Data Mining
Chapter 6: Search Engines
Introduction to Data Mining with Case Studies
Author: G. K. Gupta
Prentice Hall India, 2006.

Search Engines
There are more than twenty billion documents on the Web. Google itself claimed to index more than 16 billion pages as of November 2008. In a library, every book is indexed individually by hand. Given the volume of information on the Web, a more automated system is needed. There are two approaches: search engines (e.g. Google) and hierarchical directories (e.g. Yahoo!).
25/12/2018 ©GKGupta

IR vs Web search
- Bulk
- Dynamic Web: about one-third changes each year
- Heterogeneity: text, pictures, audio, etc.
- Duplication: as much as 30%
- High linkage
- Wide variety of users
- User behaviour: 85% only look at the first screen, 78% never modify their first query

IR vs Web search
Terms in query    Share of queries
0                 21%
1                 26%
More than 3       12%

The goals of Web search
- Speed
- Recall
- Precision: relevance
- Precision in top 10 result pages

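As a rough illustration of the recall and precision goals listed above, the sketch below computes both for a ranked result list, along with precision over only the top k results. The function names, document ids, and relevance judgements are hypothetical and exist only for this example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in set(retrieved) if doc in relevant)
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k=10):
    """Precision computed over only the first k ranked results."""
    return precision(ranked[:k], relevant)


# Hypothetical ranked results and relevance judgements (assumed data).
ranked_results = ["d3", "d7", "d1", "d9", "d4"]
relevant_docs = {"d1", "d3", "d5", "d8"}

p = precision(ranked_results, relevant_docs)   # 2 of the 5 retrieved are relevant
r = recall(ranked_results, relevant_docs)      # 2 of the 4 relevant were retrieved
p_at_2 = precision_at_k(ranked_results, relevant_docs, k=2)  # 1 of the top 2
```

Because most users look only at the first screen of results, precision over the first few results often matters more in practice than precision over the whole result list.
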
Search engine architecture
Three major components:
- the crawler: collects pages from the Web
- the indexer: indexes collected pages
- the query server: accepts and processes the query and returns results

The crawler
An application that automatically traverses the Web by retrieving a page and then recursively retrieving the pages it references. Some search engines use several distributed crawlers.

The crawler
- Base: a set of known working hyperlinks
- Queue: put the base in the queue
- Retrieve: retrieve the next page in the queue, process it and store it in the database
- Add to the queue: add the new links from the page to the queue
- Continue the process until finished
If no other page links to a page, the search engine can never find it.

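The steps above can be sketched as a breadth-first traversal. This is a minimal sketch, not any particular search engine's crawler: it runs over a hypothetical in-memory "Web" instead of fetching real pages, and the names `crawl`, `fetch_links`, `max_pages`, and `toy_web` are assumptions made for the example.

```python
from collections import deque


def crawl(base_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl starting from a base set of known URLs.

    fetch_links(url) is a caller-supplied function that returns the list
    of URLs referenced by the page at url.
    """
    queue = deque(base_urls)      # Queue: put the base set in the queue
    seen = set(base_urls)         # URLs already queued, to avoid repeats
    stored = []                   # stands in for the page database

    while queue and len(stored) < max_pages:
        url = queue.popleft()     # Retrieve: take the next page in the queue
        links = fetch_links(url)
        stored.append(url)        # process and store the page
        for link in links:        # Add to the queue: new links only
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


# Hypothetical in-memory "Web": page -> pages it links to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
    "orphan": ["a"],   # links out, but no other page links to it
}

crawled = crawl(["a"], lambda url: toy_web.get(url, []))
```

In this toy example the page named `orphan` is never reached, because no crawled page links to it and it is not in the base set.
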
The crawler
- The more pages a crawler retrieves, the more pages are discovered.
- It is a large task.
- A crawler that finds one million pages a day must retrieve about 700 pages per minute (1,000,000 pages / 1,440 minutes is roughly 694).

Indexing Web pages
An index is needed to answer queries efficiently. Indexing should also assist ranking. A query for "data mining" returned 2.2 million pages! A good ranking algorithm is needed to deal with this abundance. Many indexing algorithms are based on the inverted index technique or on superimposed coding. Google uses an inverted file index.

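To make the inverted index idea concrete, here is a minimal sketch that maps each term to the set of documents containing it and answers a query by intersecting those sets. It is an illustration only, not Google's implementation; the function names and the three-document collection are hypothetical, and tokenising by whitespace is a simplifying assumption.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical three-document collection (assumed data).
docs = {
    1: "data mining with case studies",
    2: "mining the web for data",
    3: "search engines index the web",
}

index = build_inverted_index(docs)
result = search(index, "data mining")   # documents containing both terms
```

The index answers a query by looking up only the query terms, so the cost depends on the length of their posting lists and not on the size of the whole collection. Ordering the matching documents is a separate ranking step.
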
Indexing Web pages
Search engines use either keyword search or concept search. Building an index requires document analysis and term extraction. In automatic indexing of Web documents, many parts of the documents are difficult to use for indexing. Some search engines extract terms only from the title, while others use the full document. The index usually resides in the search engine's databases and can be somewhat stale.

Indexing
Once a crawler finds a page, the page is indexed using the techniques of that search engine. Often this requires information about the text of the page and about the links from the page as well as the links to it.

Manual indexing
Some search engines, including Yahoo! and Google, do manual indexing. A group of individuals maintains a list of documents that are categorised by hand. In some cases users are allowed to submit documents by category. Manual indexing is labour intensive and is becoming obsolete.

Search Engines
Concept-based search tries to determine what you mean, not just what you say. So the search is more “about” the subject that you are searching for. Concept-based search is based on clustering; words are examined in relation to other words nearby. Excite used the concept approach. It determined meaning by calculating the frequency with which certain words appeared together.

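The idea of examining words in relation to nearby words can be sketched by counting how often pairs of words fall within a small window of each other. This is only an illustration of co-occurrence counting, not Excite's actual method; the function name, the window size, and the sample sentence are assumptions.

```python
from collections import Counter


def cooccurrence_counts(text, window=2):
    """Count how often two distinct words appear within `window` positions."""
    words = text.lower().split()
    counts = Counter()
    for i, word in enumerate(words):
        # Look only ahead, so each pair of positions is counted once.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


# Hypothetical sample text (assumed data).
sample = "heart attack risk and heart disease risk"
pairs = cooccurrence_counts(sample, window=2)
```

Pairs with high counts, such as `("heart", "risk")` in this sample, could then be treated as related terms when clustering words into concepts.
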
Rankings
Many search engines provide rankings of the results. Some provide facilities for searching similar documents.

Rankings
Google uses a ranking algorithm based on page popularity, counting how many pages link to each page, along with other factors such as the proximity of your keywords to those in the documents. Let page A be pointed to by pages T1, T2, ..., Tn, and let C(P) be the number of links going out from a page P, so that C(T1) is the number of links going out from T1. With d as a damping factor, the page rank of A is given by:
PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

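Since each page's rank depends on the ranks of the pages that link to it, the formula above is usually evaluated by repeated substitution until the values settle. The sketch below does this on a hypothetical three-page Web. The value d = 0.85, the fixed iteration count, and the names `pagerank` and `toy_links` are assumptions for the example, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to. Every page must
    appear as a key and is assumed to have at least one outgoing link.
    """
    pr = {page: 1.0 for page in links}   # initial guess: rank 1 for every page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum PR(T) / C(T) over every page T that links to this page.
            incoming = sum(
                pr[t] / len(links[t]) for t in links if page in links[t]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page Web: A -> B, C; B -> C; C -> A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph page C receives links from both A and B and ends up with the highest rank, while B, which is linked only from A and shares A's rank with C, ends up with the lowest.
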
Rankings
Kleinberg’s HITS algorithm is also used as a ranking algorithm. Rankings and searches for similar documents are becoming more effective. The ACM Digital Library rankings appear to be very good, but its similar-documents search does not appear to work as well as one would expect.