A Logistic Regression Approach to Distributed IR
Ray R. Larson, School of Information Management & Systems, University of California, Berkeley

The Problem

Hundreds or thousands of servers host databases that range widely in content, topic, and format.
– Broadcast search is expensive, both in bandwidth and in the processing of too many irrelevant results.
– How do we select the "best" databases to search: what to search first, and which to search next?
– Topical or domain constraints on the search selections.
– Variable contents of the databases (metadata only, full text, ...).

Distributed IR Tasks

– Resource Description: how to collect metadata about digital libraries and their collections or databases.
– Resource Selection: how to select relevant digital library collections or databases from a large number of databases.
– Distributed Search: how to perform parallel or sequential searching over the selected digital library databases.
– Data Fusion: how to merge query results from different digital libraries, with their different search engines, differing record structures, and so on.

Test Database Characteristics

TREC Disk   Source         Size (MB)   Size (docs)
1           WSJ (86-89)       270         98,732
1           AP (89)           259         84,678
1           ZIFF              245         75,180
1           FR (89)           262         25,960
2           WSJ (90-92)       247         74,520
2           AP (88)           241         79,919
2           ZIFF              178         56,920
2           FR (88)           211         19,860
3           AP (90)           242         78,321
3           SJMN (91)         290         90,257
3           PAT               245          6,711
Totals                      2,690        691,058

TREC Disk   Source         Num DBs   DBs per disk
1           WSJ (86-89)       29       Disk 1: 67
1           AP (89)           12
1           ZIFF              14
1           FR (89)           12
2           WSJ (90-92)       22       Disk 2: 54
2           AP (88)           11
2           ZIFF              11 (1 dup)
2           FR (88)           10
3           AP (90)           12       Disk 3: 115
3           SJMN (91)         11
3           PAT               92
Totals                       236

We used collections formed by dividing the documents on TIPSTER disks 1, 2, and 3 into 236 sets based on source and month (the same contents as in the evaluations by Powell & French and by Callan). The query set was drawn from the TREC queries, and collection relevance information was based on whether any documents in a collection were relevant according to the TREC relevance judgements. This relevance information was used both for estimating the logistic regression coefficients (using a sample of the data) and for the evaluation (using the full data).

Probabilistic Retrieval Using Logistic Regression

We attempt to estimate the probability of relevance of a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time. Logistic regression over a sample set of documents is used to determine the values of the coefficients. At retrieval time the probability of relevance for a particular query Q and collection C is estimated from a model of the form

    log O(R | Q, C) = c_0 + sum over i = 1..6 of c_i * X_i

where the six X attribute measures are:
– Average Absolute Query Frequency
– Query Length
– Average Absolute Collection Frequency
– Collection Length
– Average Inverse Collection Frequency
– Inverse Collection Frequency
(N = number of collections; M = number of terms in common between the query and the document.)
The probabilities are actually calculated as log odds and then converted to probabilities. The c_i coefficients were estimated separately for three query types; during retrieval, the length of the query was used to differentiate these.
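To make the scoring above concrete, here is a minimal Python sketch of ranking candidate collections with a fitted model of this form. The coefficient values, feature vectors, and collection identifiers below are placeholders for illustration only; they are not the values estimated in this work.

```python
import math

# Hypothetical fitted coefficients (c0..c6); the real values were estimated
# separately for three query types and are not reproduced here.
COEFFS = [-3.7, 1.2, -0.1, 0.7, -0.2, 1.5, 0.9]

def log_odds_of_relevance(x, coeffs=COEFFS):
    """log O(R | Q, C) = c0 + sum_i c_i * X_i over the six X attribute
    measures (average absolute query frequency, query length, average
    absolute collection frequency, collection length, average inverse
    collection frequency, inverse collection frequency)."""
    assert len(x) == 6 and len(coeffs) == 7
    return coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))

def probability_of_relevance(x, coeffs=COEFFS):
    """Convert the log odds to a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds_of_relevance(x, coeffs)))

def rank_collections(features_by_collection, coeffs=COEFFS):
    """Rank candidate collections by estimated probability of relevance.

    features_by_collection: {collection_id: [X1..X6]} computed for the query.
    """
    return sorted(features_by_collection,
                  key=lambda cid: probability_of_relevance(features_by_collection[cid], coeffs),
                  reverse=True)

# Example with made-up feature vectors for three collections:
if __name__ == "__main__":
    feats = {"AP-89-01": [1.8, 2.4, 3.1, 95.0, 2.2, 1.4],
             "FR-88-07": [0.9, 2.4, 1.2, 60.0, 2.2, 1.4],
             "PAT-90":   [0.4, 2.4, 0.8, 40.0, 2.2, 1.4]}
    print(rank_collections(feats))
```

In a real deployment, the X values for each candidate collection would be computed from the query and the harvested collection statistics described under "Data Harvesting and Collection Document Creation" below.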
Our Approach Using Z39.50

[Architecture diagram: a MetaSearch server sends Z39.50 Explain and SCAN queries over the Internet to the individual search engines and their databases (DB 1 through DB 6), maps queries and results between them, and maintains a distributed index. A second diagram shows the planned server hierarchy: database servers, general servers, meta-topical servers, and replicated servers.]

MetaSearch

A new approach to building metasearch based on Z39.50. Instead of using broadcast search, we use two Z39.50 services:
– identification of database metadata using Z39.50 Explain;
– extraction of distributed indexes using Z39.50 SCAN;
– creation of "collection documents" from the index contents.
Evaluation questions:
– How efficiently can we build distributed indexes?
– How effectively can we choose databases using the index?
– How effective is merging search results from multiple sources?
– Do hierarchies of servers (general / meta-topical / individual) work?

Data Harvesting and Collection Document Creation

For all servers (possibly a topical subset):
– Get Explain information to find which indexes are supported, along with the collection statistics.
– For each index, use SCAN to extract terms and frequency information, and add term + frequency + source index + database metadata to the metasearch XML "collection document".
– Index the collection documents for retrieval by the algorithm described above.

Planned Extensions

Post-process the indexes for special types of data (especially geographic names, etc.), e.g. to create "geographical coverage" indexes.

Distributed Retrieval Testing and Results

– Tested using the collection representatives as harvested over the network, together with the TIPSTER relevance judgements.
– Tested by comparing our approach to known algorithms for ranking collections.
– Preliminary results were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (MAX).
– Evaluation uses a recall analog: how many of the relevant documents occur in the top n databases, averaged over the queries.
[A box on the poster reproduces the CORI ranking formula used for that baseline.]

Comparative Evaluation: Effectiveness Measures

Assume each database has some merit for a given query q. Given a baseline ranking B and an estimated (test) ranking E for the evaluated system, let db_bi and db_ei denote the databases in the i-th ranked position of rankings B and E, and let B_i = merit(q, db_bi) and E_i = merit(q, db_ei). From these we can define recall analogs and a precision analog (see below).
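The analog formulas themselves appear to have been images on the original poster and did not survive in the transcript. The definitions below follow French & Powell's standard formulation of these measures, which this evaluation follows; they are offered as a reconstruction under that assumption rather than a verbatim copy of the poster.

```latex
% Recall analog: fraction of the total attainable merit (e.g. relevant
% documents) accumulated in the top n databases of the test ranking E,
% relative to the baseline ranking B.
\[
  \hat{R}_n \;=\; \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}
\]
% Precision analog: fraction of the top n databases in E that have any
% merit at all for the query.
\[
  P_n \;=\; \frac{\left|\{\, db \in \mathrm{top}_n(E) : \mathrm{merit}(q, db) > 0 \,\}\right|}{n}
\]
```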
Results and Discussion

[Three result plots: Title Queries, Long Queries, and Very Long Queries, each plotting the recall analog R-hat against the number of collections searched.]

The figures summarize our results from the preliminary evaluation. The X axis is the number of collections in the ranking, and the Y axis is the recall analog R-hat, which measures the proportion of the total possible relevant documents accumulated in the top N databases, averaged across all of the queries. The Max line is the optimal result, obtained when the collections are ranked by the number of relevant documents they contain. Ideal(0) is an implementation of the GlOSS "Ideal" algorithm, and CORI is an implementation of Callan's inference-net approach. The Prob line is the logistic regression method described above. For title queries the described method performs slightly better than CORI for up to about 100 collections, beyond which CORI exceeds it. For long queries our method is virtually identical to CORI, and CORI performs better for very long queries. Both CORI and the logistic regression method outperform the Ideal(0) implementation.

Application and Further Research

The method described here is being applied to two distributed systems of servers in the UK. The first, the Distributed Archives Hub, will be made up of individual servers containing archival descriptions in the EAD (Encoded Archival Description) DTD. The second, MerseyLibraries.org, is a consortium of university and public libraries in the Merseyside area. In both cases the method described here is being used to build a central index that provides efficient distributed search over the various servers. The basic model is the hierarchy shown in the architecture diagrams above: individual database servers are harvested to create (potentially) a hierarchy of servers used to intelligently route queries to the databases most likely to contain relevant materials. We are also continuing to refine both the harvesting method and the collection ranking algorithm. We believe that additional collection and collection-document statistics may provide a better ranking of results and thus more effective routing of queries.

Acknowledgements

This research was sponsored at U.C. Berkeley and the University of Liverpool by the National Science Foundation and the Joint Information Systems Committee (UK) under International Digital Libraries Program award #IIS. James French and Allison Powell kindly provided the CORI and Ideal(0) results used in the evaluation.
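As a closing illustration of the routing model described under "Application and Further Research", the sketch below shows one way a hierarchy of servers could route a query using ranked collection documents. All class, server, and function names here are hypothetical, and the scoring callback stands in for the probabilistic ranking sketched earlier; this is not code from the systems described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ServerNode:
    """A node in the server hierarchy: a database server at the leaves, or a
    general/meta-topical server that only holds collection documents
    describing its children."""
    name: str
    children: List["ServerNode"] = field(default_factory=list)

    def is_database(self) -> bool:
        return not self.children

def route_query(node: ServerNode, query: str,
                score: Callable[[str, str], float],
                fanout: int = 3) -> List[str]:
    """Recursively route a query down the hierarchy.

    At each level the node scores its children using their harvested
    collection documents (the `score` callback stands in for the collection
    ranking) and forwards the query only to the top `fanout` children,
    instead of broadcasting. Returns the database servers to be searched.
    """
    if node.is_database():
        return [node.name]
    ranked = sorted(node.children, key=lambda c: score(query, c.name), reverse=True)
    results: List[str] = []
    for child in ranked[:fanout]:
        results.extend(route_query(child, query, score, fanout))
    return results

# Example with made-up servers and a toy word-overlap scoring function:
if __name__ == "__main__":
    leaves = [ServerNode(f"db-{i}") for i in range(6)]
    topical = [ServerNode("archives-hub", leaves[:3]),
               ServerNode("mersey-libraries", leaves[3:])]
    root = ServerNode("general", topical)
    toy_score = lambda q, name: float(len(set(q.split()) & set(name.split("-"))))
    print(route_query(root, "mersey public libraries", toy_score, fanout=1))
```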