
Cue Validity Variance (CVV) Database Selection Algorithm Enhancement Travis Emmitt 9 August 1999

Collection Selection Overview
[Diagram: Users X, Y, and Z send Queries A, B, and C to a broker, which forwards query clones to Collections 1–3; each collection holds documents relevant to different queries.]
Broker actions:
1) Receives query from user
2) Decides which collections are likely to contain resources relevant to the query, and ranks them
3) Forwards query to the top-n ranked collection servers
4) Receives results from the collection servers
5) Merges and presents results to the user
Step 2 is the Collection Selection (a/k/a Database Selection) problem. This is our focus.
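As a rough sketch of this flow (not code from the presentation; rank_collections, search_collection, and merge are hypothetical placeholder callables), in Python:

```python
def broker(query, collections, rank_collections, search_collection, merge, top_n=10):
    """Hypothetical broker loop: rank collections, query the top-n, merge the results."""
    # Step 2: collection selection -- the focus of this talk
    ranked = rank_collections(query, collections)      # best-first list of collection ids
    selected = ranked[:top_n]                          # apply the collection cut-off
    # Steps 3-4: forward the query and gather per-collection result lists
    per_collection = [search_collection(c, query) for c in selected]
    # Step 5: merge into a single result list for the user
    return merge(per_collection)
```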

TREC-Based Test Environment
6 sources: AP, FR, PATN, SJM, WSJ, ZIFF
– Sources partitioned into a total of 236 collections (a/k/a sites, databases)
– Different decomposition (partitioning) methods
250 queries
– Often contain many [possibly repeating] query terms
1000s of relevance judgements
– Humans listed the documents relevant to each query
– Complete only for a subset of the queries

SYM Decomposition
SYM = Source, Year, Month
Example collections:
– AP – contains all Associated Press documents from February 1988
– ZIFF – contains all ZIFF documents from December
236 collections total

SYM’s “Goto AP” Problem
– In SYM, larger collections (by document count) tend to contain more of the relevant documents
– Some collections (AP) are so large relative to the others that any algorithm favoring large collections over small ones will perform well, independently of the query contents

UDC Decomposition
UDC = Uniform Document Count
Example collections:
– AP.01 – contains the first 1/236 (approx.) of the complete set of documents; restricted to AP documents only [so no overlap with FR]
– ZIFF.45 – contains the last 1/236 (approx.) of the complete set of documents
236 collections total

Baselines
RBR = Relevance Based Ranking
– merit_{query,coll} = number of documents in the collection which were deemed relevant to the query
Others...
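A minimal sketch of the RBR baseline merit, assuming the relevance judgements are available as a mapping from query id to the set of relevant document ids (this data layout is an assumption, not from the slides):

```python
def rbr_merit(query_id, coll_docs, relevant_docs):
    """RBR baseline: count of documents in the collection judged relevant to the query.

    coll_docs     -- set of document ids held by the collection
    relevant_docs -- dict: query id -> set of document ids judged relevant to that query
    """
    return len(coll_docs & relevant_docs.get(query_id, set()))
```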

Estimates
Estimate names/categories:
– gGLOSS – ideal(k)
– CORI – U.Mass, best performance
– SMART – ntn_ntn, etc. (tweaked components)
– CVV – Yuwono/Lee’s “traditional” version
– CVV_{p,q,r,s} – hybrid, with elements from CVV and SMART
– SBR, random, etc.
merit_{query,coll} = estimated merit (“goodness”)

Ranks
For each query, collections are ranked according to [estimated] merit, highest first
A ranking represents the order in which collections should be searched for documents
Often we want to consider only the top-n collections
– n is the “collection cut-off”, or simply “cut-off”
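For illustration, turning per-collection merits into a ranking with an optional cut-off could look like this (dict-based layout assumed):

```python
def rank_by_merit(merits, cutoff=None):
    """Order collection ids by estimated merit, highest first.

    merits -- dict: collection id -> estimated merit for one query
    cutoff -- if given, keep only the top-n collections
    """
    ranked = sorted(merits, key=merits.get, reverse=True)
    return ranked if cutoff is None else ranked[:cutoff]
```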

Rank Comparisons
An estimate’s ranks are compared against a baseline’s ranks (e.g., RBR vs. CORI)
Different comparison metrics:
– Mean Squared Error (MSE), Spearman’s rho
– P(n) = precision at cut-off n: proportion of the estimated top-n collections with real merit > 0
– R(n) = recall at cut-off n: proportion of the top-n real merit captured by the estimated top-n collections
– R^(n) = “R hat”(n) = H(n): proportion of the total real merit captured by the estimated top-n collections
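A sketch of the three cut-off metrics as defined above, assuming rankings are lists of collection ids and real (baseline) merits come as a per-collection dict (layouts assumed):

```python
def precision_at(estimated_rank, real_merit, n):
    """P(n): proportion of the estimated top-n collections whose real merit is > 0."""
    return sum(1 for c in estimated_rank[:n] if real_merit.get(c, 0) > 0) / n

def recall_at(estimated_rank, baseline_rank, real_merit, n):
    """R(n): real merit in the estimated top-n, relative to the real merit in the baseline top-n."""
    estimated = sum(real_merit.get(c, 0) for c in estimated_rank[:n])
    best = sum(real_merit.get(c, 0) for c in baseline_rank[:n])
    return estimated / best if best else 0.0

def r_hat_at(estimated_rank, real_merit, n):
    """R^(n), a/k/a H(n): proportion of the total real merit in the estimated top-n."""
    total = sum(real_merit.values())
    estimated = sum(real_merit.get(c, 0) for c in estimated_rank[:n])
    return estimated / total if total else 0.0
```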

Overall Performance Metric
In this study, we use R(avg), averaged over all queries
This is a normalized “area under the curve”

What is CVV?
CVV = Cue Validity Variance
The CVV algorithm:
– a/k/a the “CVV ranking method”
– an estimator: it generates merits, which can be ranked and compared against a baseline
– consists of a CVV component and a DF component
The CVV component is derived from DF and N
What are DF and N?

Terminology, cont.
C = set of collections in the system
– |C| = number of collections in the system (236)
N_coll = number of documents in collection coll
DF_{term,coll} = “document frequency”
– number of documents in collection coll in which term occurs at least once
– So DF_{term,coll} <= N_coll

Cue Validity (CV)
Density_{term,coll} = DF_{term,coll} / N_coll
External_{term,coll} = Σ_{c != coll} DF_{term,c} / Σ_{c != coll} N_c
CV_{term,coll} = Density_{term,coll} / (Density_{term,coll} + External_{term,coll})
CV_term = Σ_{coll in C} CV_{term,coll} / |C|   (the mean of CV_{term,coll} over all collections)
CV_{term,coll} values are always between 0 and 1
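A sketch of these definitions in Python, assuming DF is a dict keyed by (term, collection id) and N maps each collection id to its document count (hypothetical layout):

```python
def cue_validity(term, coll, collections, DF, N):
    """CV_{term,coll} = Density / (Density + External), per the definitions above.

    collections -- list of all collection ids (the set C)
    DF          -- dict: (term, collection id) -> document frequency
    N           -- dict: collection id -> number of documents
    """
    density = DF.get((term, coll), 0) / N[coll]
    others = [c for c in collections if c != coll]
    external = sum(DF.get((term, c), 0) for c in others) / sum(N[c] for c in others)
    denom = density + external
    return density / denom if denom else 0.0

def mean_cue_validity(term, collections, DF, N):
    """CV_term: the mean of CV_{term,coll} over all collections."""
    return sum(cue_validity(term, c, collections, DF, N) for c in collections) / len(collections)
```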

Cue Validity Variance (CVV)
CVV_term = Σ_{coll in C} (CV_{term,coll} - CV_term)^2 / |C|
merit_{query,coll} = estimated “goodness” = Σ_{term in query} (CVV_term * DF_{term,coll})   [2 components]
The basic CVV algorithm ignores the number of times a term occurs in the query (the Query Term Weight)
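Continuing in the same style, a sketch of CVV_term and the basic two-component merit, with the cue validities assumed precomputed into a (term, collection)-keyed CV dict:

```python
def cue_validity_variance(term, collections, CV):
    """CVV_term: variance of CV_{term,coll} over all collections.

    CV -- dict: (term, collection id) -> cue validity
    """
    values = [CV[(term, c)] for c in collections]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def basic_cvv_merit(query_terms, coll, collections, CV, DF):
    """Basic CVV merit: sum of CVV_term * DF_{term,coll} over the *distinct* query terms.

    Repeated query terms contribute only once -- there is no query term weight here.
    """
    return sum(cue_validity_variance(t, collections, CV) * DF.get((t, coll), 0)
               for t in set(query_terms))
```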

Basic CVV “Problem”
What if query = “cat cat dog”?
– Basic CVV ignores the number of times “cat” appears in the query, so merits (and consequently ranks) will be the same as for the query “cat dog”
– Unlike in CVV’s development environment, many of our test queries have multiply-occurring terms
– Performance of basic CVV tends to be noticeably poorer than that of algorithms which incorporate query term frequency in their merit calculations
– We can modify CVV to incorporate query term frequency

Enhancing CVV with QTW
QTW_{query,term} = query term weight (or frequency) = number of times term occurs in query
merit_{query,coll} = Σ_{term in query} (CVV_term * DF_{term,coll} * QTW_{query,term})   [3 components]
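A sketch of the three-component merit, with CVV_term values assumed precomputed into a per-term dict:

```python
from collections import Counter

def qtw_cvv_merit(query_terms, coll, CVV, DF):
    """QTW-enhanced merit: CVV_term * DF_{term,coll} * QTW_{query,term}, summed over query terms.

    CVV -- dict: term -> cue validity variance
    DF  -- dict: (term, collection id) -> document frequency
    """
    qtw = Counter(query_terms)   # QTW_{query,term} = occurrences of term in the query
    return sum(CVV.get(t, 0.0) * DF.get((t, coll), 0) * qtw[t] for t in qtw)
```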

Effects of QTW Enhancement
For R(avg) averaged over all queries…
– SYM performance increased from .8416 to .8486
– UDC performance increased from .6735 to .6846
– Meanwhile, CORI’s SYM = .8972 and UDC = .7884
– So performance increased, but not dramatically
For most individual queries, performance improved or stayed the same, but for 20–30% of queries performance decreased
– Could QTW be overcorrecting?

Further Questions...
What if we varied the degree to which QTW influences the merit calculations? We could:
– Add an exponent to the QTW component:
  merit_{query,coll} = Σ_{term in query} (CVV_term * DF_{term,coll} * QTW_{query,term}^e)
– Evaluate using a range of exponents, searching for the version that maximizes performance (e = 0, 0.5, 1, 2, …)
What if we added exponents to the other two components as well, CVV and DF?
What about adding an ICF component?

Inverse Collection Frequency (ICF)
CF_term = collection frequency of term = number of collections in which term occurs
ICF_term = log((|C| + 1) / CF_term)
ICF is used to decrease the contribution of terms which appear in many collections and are therefore not good discriminators; terms which occur in few collections are the best discriminators.
ICF has proven useful in other algorithms.
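A sketch of CF and ICF as defined above, reusing the assumed (term, collection)-keyed DF dict; treating a term that occurs in no collection as contributing nothing is an assumption of this sketch:

```python
import math

def icf(term, collections, DF):
    """ICF_term = log((|C| + 1) / CF_term), where CF_term counts the collections containing term."""
    cf = sum(1 for c in collections if DF.get((term, c), 0) > 0)
    return math.log((len(collections) + 1) / cf) if cf else 0.0
```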

Final Enhancement Equation
merit_{query,coll} = Σ_{term in query} (CVV_term^p * DF_{term,coll}^q * QTW_{query,term}^r * ICF_term^s)   [4 components]
Notes:
– DF is the only component dependent upon the collection
– CVV(1,1,0,0) = basic CVV = CVV(0) or “cvv”
– CVV(1,1,1,0) = QTW-enhanced CVV = CVV(1)
– CVV(0,1,1,2) = ntn_ntn
– CVV(*,0,*,*) = alphabetical (same merits for all collections)
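A sketch of the fully parameterized merit, with CVV and ICF assumed precomputed into per-term dicts; the exponent defaults (1, 1, 1, 0) correspond to the QTW-enhanced version, and (1, 1, 0, 0) reproduces basic CVV:

```python
from collections import Counter

def cvv_hybrid_merit(query_terms, coll, CVV, DF, ICF, p=1.0, q=1.0, r=1.0, s=0.0):
    """merit_{query,coll} = sum over query terms of CVV^p * DF^q * QTW^r * ICF^s.

    CVV, ICF -- dicts: term -> component value
    DF       -- dict: (term, collection id) -> document frequency
    """
    qtw = Counter(query_terms)
    return sum((CVV.get(t, 0.0) ** p) * (DF.get((t, coll), 0) ** q)
               * (qtw[t] ** r) * (ICF.get(t, 0.0) ** s)
               for t in qtw)
```

With q = 0 the DF factor becomes 1 for every collection, which is why CVV(*,0,*,*) assigns the same merit to all collections.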

Performance
[Table comparing the estimates CORI, Basic CVV, Ideal(0), and ntn_ntn on SYM, UDC, and Joint; the numeric scores and the trailing note defining ntn_ntn did not survive in this transcript.]

Essentiality
What’s the best performance you can get if you hold a component’s exponent at 0?
The CVV component appears to be the least essential; DF appears to be the most essential.

Omitted Comp | SYM | UDC | Joint   (best performer and performance)
none | (.8938) | (.7776) | (.8325)
CVV | (.8932) | (.7737) | (.8298)
DF | tied (.6081) | tied (.6017) | tied (.6049)
QTW | (.8851) | (.7525) | (.8053)
ICF | (.8755) | (.7647) | (.8188)

Other Reductions

Active Comps | SYM | UDC | Joint   (best performer and performance)
all | (.8938) | (.7776) | (.8325)
DF, CVV | (.8533) | (.7525) | (.7945)
DF, QTW | (.8725) | (.7306) | (.8015)
DF, ICF | (.8831) | (.7372) | (.8053)
DF only | (.8533) | (.6889) | (.7711)

Comments:
– Fewer sample points, since we didn’t focus searches here

Open Questions
How much would other algorithms’ performances improve if we tweaked them?
– Would CORI get so much better that a CVV hybrid couldn’t even get close?
– Or is CORI already optimized?
Is there value in automating these searches, using adaptive programming?

Basic CVV Example
Given:
– query1 = “cat dog”
– CVV_cat = 0.8 (“cat” is unevenly distributed)
– CVV_dog = 0.2 (“dog” is more evenly distributed)
– DF_{cat,collA} = 1 (“cat” appears in only one of collA’s documents)
– DF_{dog,collA} = 20
– DF_{cat,collB} = 5
– DF_{dog,collB} = 3
merit_{query1,collA} = (0.8*1) + (0.2*20) = 0.8 + 4.0 = 4.8
merit_{query1,collB} = (0.8*5) + (0.2*3) = 4.0 + 0.6 = 4.6
For query1, collA would be ranked higher (better) than collB.
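The arithmetic above can be reproduced directly (illustrative only):

```python
CVV = {"cat": 0.8, "dog": 0.2}
DF = {("cat", "collA"): 1, ("dog", "collA"): 20,
      ("cat", "collB"): 5, ("dog", "collB"): 3}

def basic_merit(query_terms, coll):
    # Basic CVV merit summed over the distinct query terms
    return sum(CVV[t] * DF[(t, coll)] for t in set(query_terms))

print(round(basic_merit(["cat", "dog"], "collA"), 2))   # 4.8
print(round(basic_merit(["cat", "dog"], "collB"), 2))   # 4.6 -> collA ranks above collB
```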