DISTRIBUTED INFORMATION RETRIEVAL
Lee Won Hee
Abstract
A multi-database model of distributed information retrieval. Full-text information retrieval in this setting consists of:
- Discovering database contents
- Ranking databases by their expected ability to satisfy the query
- Searching a small number of databases
- Merging the results returned by the different databases
This paper presents algorithms for each task.
Introduction
The multi-database model of distributed information retrieval reflects the distributed location and control of information in a wide-area computer network. It involves three tasks:
1) Resource description - the contents of each text database must be described
2) Resource selection - given an information need and a set of resource descriptions, a decision must be made about which database(s) to search
3) Results merging - integrating the ranked lists returned by searching each database into a single, coherent ranked list
Multi-database Testbeds
Marcus (1983) addressed resource description and selection in the EXPERT CONIT system.
The creation of the TREC corpora: the text collections created by the U.S. National Institute of Standards and Technology (NIST) for its TREC conferences.
- Sufficiently large and varied
- Can be divided into smaller databases
Summary statistics are given for three distributed IR testbeds.
Resource Description
Unigram language models (Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al.):
- Represent each database by a description consisting of the words that occur in the database and their frequencies of occurrence
- Compact, and can be obtained automatically by examining the documents in a database or the document indexes
- Can be extended easily to include phrases, proper names, and other text features
A resource description based on terms and frequencies is a small fraction of the size of the original text database.
Resource descriptions of this form can also be acquired from uncooperative databases by a technique called query-based sampling, described later; a sketch of building such a description follows.
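A minimal sketch (in Python, over hypothetical document strings) of building a term-and-frequency resource description by counting document frequencies and collection term frequencies:

    from collections import Counter

    def build_resource_description(documents):
        # For each term: df = number of documents containing it,
        # ctf = total number of occurrences across the database.
        df = Counter()
        ctf = Counter()
        for doc in documents:
            terms = doc.lower().split()      # simplistic whitespace tokenizer
            ctf.update(terms)
            df.update(set(terms))            # count each term once per document
        return df, ctf

    # Hypothetical two-document database
    df, ctf = build_resource_description(["apple banana apple", "banana cherry"])
    # df["banana"] == 2, ctf["apple"] == 2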
Resource Selection (1/4)
In a distributed information retrieval system, resource selection is the process of selecting databases relevant to the query.
- Collections are treated analogously to documents in a database
- The CORI database selection algorithm is used
Resource Selection (2/4)
The CORI algorithm (Callan et al., 1995) scores each resource R_i with respect to each query term r_k using:
- df : the number of documents in R_i containing r_k
- cw : the number of indexing terms in resource R_i
- avg_cw : the average number of indexing terms in each resource
- C : the number of resources
- cf : the number of resources containing term r_k
- b : the minimum belief component (usually 0.4)
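The belief formula itself did not survive extraction; the sketch below reconstructs the CORI term belief as usually published (Callan et al., 1995), using the quantities defined above, with the customary constants 50 and 150:

    import math

    def cori_belief(df, cf, cw, avg_cw, C, b=0.4):
        # Belief p(r_k | R_i) that resource R_i can satisfy query term r_k.
        # df: documents in R_i containing r_k; cw: indexing terms in R_i;
        # avg_cw: average indexing terms per resource; C: number of resources;
        # cf: resources containing r_k; b: minimum belief component.
        T = df / (df + 50 + 150 * cw / avg_cw)            # term-frequency component
        I = math.log((C + 0.5) / cf) / math.log(C + 1.0)  # scaled inverse collection frequency
        return b + (1 - b) * T * I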
Resource Selection (3/4)
INQUERY query operators (Turtle, 1990; Turtle and Croft, 1991) can be used for ranking both databases and documents.
- p_j : p(r_j | R_i), the belief in database R_i due to query term r_j
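As a sketch of how the per-term beliefs are combined into a database ranking, the example below assumes a simple #sum-style combination (the average of the beliefs p_j), one of the simpler INQUERY operators; the database identifiers and belief values are hypothetical:

    def rank_databases(query_terms, term_beliefs):
        # term_beliefs: {database_id: {term: p(r_j | R_i)}}
        # Combine per-term beliefs with a #sum-style operator (their average)
        # and sort databases by the combined belief.
        scores = {
            db: sum(beliefs.get(t, 0.0) for t in query_terms) / len(query_terms)
            for db, beliefs in term_beliefs.items()
        }
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    # Hypothetical example with two databases and a two-term query
    print(rank_databases(["apple", "pie"],
                         {"db1": {"apple": 0.6, "pie": 0.5},
                          "db2": {"apple": 0.45}}))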
Resource Selection (4/4)
The effectiveness of a resource ranking algorithm is measured by comparing a given database ranking at rank n to a desired database ranking at rank n.
- rg_i : the number of relevant documents in the i-th ranked database under the given ranking
- rd_i : the number of relevant documents in the i-th ranked database under a desired ranking, in which databases are ordered by the number of relevant documents they contain
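The measure itself was lost in extraction; a common formulation is the recall-style ratio R_n = (sum of rg_i) / (sum of rd_i) over the top n databases, sketched here with hypothetical counts:

    def ranking_recall(rg, rd, n):
        # R_n: relevant documents reachable in the top n databases of the given
        # ranking, relative to those reachable under the desired ranking.
        # rg[i]: relevant docs in the i-th database of the given ranking;
        # rd[i]: relevant docs in the i-th database of the desired ranking.
        return sum(rg[:n]) / sum(rd[:n])

    # Hypothetical example: the given ranking reaches 5 + 3 relevant documents
    # in its top two databases; the desired ranking would reach 6 + 5.
    print(ranking_recall([5, 3, 1], [6, 5, 2], n=2))   # 8 / 11 ≈ 0.727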
Merging Document Rankings (1/2)
After a set of databases is searched, the ranked results from each database must be merged into a single ranking.
This is difficult when the individual databases are not cooperative:
- Each database may be based on different corpus statistics, representations, and/or retrieval algorithms
Result merging techniques:
- Cooperative approaches: use of global idf or the same ranking algorithm; recomputing document scores at the search client
- Non-cooperative approach: estimate normalized document scores as a combination of the score of the database and the score of the document
Merging Document Rankings (2/2)
The normalized document score D'' is estimated as a product of the unnormalized document score D and a weight derived from the database scores, using:
- N : the number of resources searched
- D : the unnormalized document score
- R_i : the score of database R_i
- Avg_R : the average database score
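The exact weighting was lost in extraction; the sketch below uses one commonly cited CORI-style merging heuristic, D'' = D * (1 + N * (R_i - Avg_R) / Avg_R), which is consistent with the quantities defined above but should be treated as an assumption:

    def merge_results(results, db_scores):
        # results: {database_id: [(doc_id, D), ...]} with unnormalized scores D
        # db_scores: {database_id: R_i} from resource selection
        N = len(db_scores)                        # number of resources searched
        avg_R = sum(db_scores.values()) / N       # average database score
        merged = []
        for db, ranking in results.items():
            w = 1 + N * (db_scores[db] - avg_R) / avg_R            # database weight
            merged.extend((doc_id, D * w) for doc_id, D in ranking)  # D'' = D * w
        return sorted(merged, key=lambda item: item[1], reverse=True)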
Acquiring Resource Descriptions (1/2)
Query-based sampling (Callan et al., 1999; Callan & Connell, 2001):
- Does not require the cooperation of the databases
- Queries the database with random single-word queries
- The initial query term is selected from a large dictionary of terms
- Subsequent query terms are drawn from documents already sampled from the database
Acquiring Resource Descriptions (2/2)
Query-based sampling algorithm (a sketch of the loop follows):
1. Select an initial query term
2. Run a one-term query on the database
3. Retrieve the top N documents returned by the database
4. Update the resource description based on the characteristics of the retrieved documents
   - Extract words and their frequencies from the top N documents returned by the database
   - Add the words and their frequencies to the learned resource description
5. If a stopping criterion has not yet been reached:
   - Select a new query term
   - Go to Step 2
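A sketch of this loop, assuming a hypothetical search_database(term, n) interface that returns the text of the top n documents for a one-term query; the stopping criterion here (a fixed number of sampled documents) is one common choice:

    import random
    from collections import Counter

    def query_based_sampling(search_database, seed_terms,
                             docs_per_query=4, max_docs=300, max_queries=500):
        # search_database(term, n) is a hypothetical interface returning the
        # text of the top n documents matching a one-term query.
        description = Counter()                  # learned terms and frequencies
        docs_seen = 0
        query = random.choice(seed_terms)        # step 1: initial query term
        for _ in range(max_queries):
            docs = search_database(query, docs_per_query)   # steps 2-3
            docs_seen += len(docs)
            for doc in docs:                     # step 4: update the description
                description.update(doc.lower().split())
            if docs_seen >= max_docs:            # step 5: stopping criterion
                break
            candidates = list(description) or seed_terms
            query = random.choice(candidates)    # select a new query term
        return description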
Accuracy of Unigram Language Models (1/3)
Test corpora for the query-based sampling experiments.
ctf ratio: measures how well the learned vocabulary matches the actual vocabulary.
- V' : the learned vocabulary
- V : the actual vocabulary
- ctf_i : the number of times term i occurs in the database
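The formula did not survive extraction; the usual definition is the sum of ctf_i over the terms in V' divided by the sum of ctf_i over all terms in V, as in this sketch (the example counts are hypothetical):

    def ctf_ratio(learned_vocab, actual_ctf):
        # Fraction of all term occurrences in the database (over V) that are
        # covered by terms in the learned vocabulary V'.
        covered = sum(ctf for term, ctf in actual_ctf.items() if term in learned_vocab)
        return covered / sum(actual_ctf.values())

    # Hypothetical example: the learned vocabulary covers 9 of 12 occurrences
    print(ctf_ratio({"apple", "banana"}, {"apple": 5, "banana": 4, "cherry": 3}))  # 0.75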
Accuracy of Unigram Language Models (2/3)
Spearman rank correlation coefficient: measures how well the learned term frequencies indicate the frequency of each term in the database.
Interpretation of the coefficient:
- 1 : the two orderings are identical
- 0 : they are uncorrelated
- -1 : they are in reverse order
Quantities used (with tie corrections):
- d_i : the rank difference of common term i
- n : the number of terms
- f_k : the number of ties in the k-th group of ties in the learned resource description
- g_m : the number of ties in the m-th group of ties in the actual resource description
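The full formula with tie corrections was lost in extraction; in practice the coefficient can be computed with a standard routine such as scipy.stats.spearmanr, which applies the usual tie handling. A sketch over hypothetical term-frequency dictionaries:

    from scipy.stats import spearmanr

    def rank_correlation(learned_ctf, actual_ctf):
        # Spearman rank correlation between learned and actual term frequencies,
        # computed over the terms common to both descriptions (ties are handled
        # by the library routine).
        common = sorted(set(learned_ctf) & set(actual_ctf))
        learned = [learned_ctf[t] for t in common]
        actual = [actual_ctf[t] for t in common]
        return spearmanr(learned, actual).correlation

    # Hypothetical example: identical frequency orderings give a coefficient of 1.0
    print(rank_correlation({"a": 10, "b": 5, "c": 1}, {"a": 100, "b": 40, "c": 7}))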
Accuracy of Unigram Language Models (3/3)
Experimental results
Accuracy of Resource Rankings
Experimental results
Accuracy of Document Rankings
Experimental results
Summary and Conclusions
This work covered:
- Techniques for acquiring descriptions of resources controlled by uncooperative parties
- Using resource descriptions to rank text databases by their likelihood of satisfying a query
- Merging the document rankings returned by different text databases
The major remaining weakness is the algorithm for merging document rankings produced by different databases:
- Computational cost due to parsing and reranking the documents
Many of the traditional IR tools, such as relevance feedback, have yet to be applied to multi-database environments.