Term Necessity Prediction P(t | R_q)
Le Zhao and Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Oct 27, CIKM

Main Points
Necessity is as important as idf (theory)
Explains behavior of IR models (practice)
Can be predicted
Performance gain

Definition of Necessity P(t | R_q)
Directly calculated given relevance judgements for q
[Figure: Venn diagram of the Collection in which the set "Docs that contain t" overlaps the set "Relevant (q)"; in the example shown, P(t | R_q) = 0.4]
Necessity == 1 – mismatch == term recall

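The definition can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `term_necessity` and the toy documents are assumptions, chosen so that the result matches the 0.4 example on the slide.

```python
def term_necessity(term, relevant_docs):
    """Estimate P(t | R_q): the fraction of relevant documents that contain the term."""
    if not relevant_docs:
        raise ValueError("need at least one judged-relevant document")
    containing = sum(1 for doc_terms in relevant_docs if term in doc_terms)
    return containing / len(relevant_docs)


# Hypothetical toy data: five judged-relevant documents for a query q,
# each represented as a set of terms.
relevant_docs = [
    {"third", "party", "prognosis"},
    {"third", "party", "candidate"},
    {"independent", "party", "outlook"},
    {"party", "election", "prognosis"},
    {"perot", "campaign", "future"},
]

necessity = term_necessity("prognosis", relevant_docs)  # 2 of 5 documents
mismatch = 1 - necessity
```

Here `necessity` is the term recall of "prognosis" over the relevant set, and `mismatch` is the share of relevant documents the term fails to match.
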
Why Necessity? Roots in Probabilistic Models
Binary Independence Model
–[Robertson and Spärck Jones 1976]
–“Relevance Weight”, “Term Relevance”
[Equation: the term weight, annotated as two parts: necessity odds and idf (sufficiency)]
P(t | R) is effectively the only part about relevance.

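For reference, the standard Binary Independence Model term weight splits into the two parts named on the slide. The shorthand p_t and q_t is introduced here for readability and is not notation from the slides; \bar{R} denotes the non-relevant documents.

```latex
w_t = \log \frac{p_t\,(1 - q_t)}{q_t\,(1 - p_t)}
    = \underbrace{\log \frac{p_t}{1 - p_t}}_{\text{necessity odds}}
    + \underbrace{\log \frac{1 - q_t}{q_t}}_{\text{idf (sufficiency)}},
\qquad p_t = P(t \mid R), \quad q_t = P(t \mid \bar{R}).
```

If q_t is approximated by the collection statistic df_t / N, the second term becomes log((N - df_t) / df_t), which behaves like idf. Only the first term depends on relevance.
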
Without Necessity
The emphasis problem for idf-only term weighting
–Emphasize high idf terms in query
 “prognosis/viability of a political third party in U.S.” (Topic 206)

Ground Truth (TREC 4 topic 206)
[Table: one column per query term (party, political, third, viability, prognosis); rows: True P(t | R), idf, Emphasis. Cell values are missing from the source.]

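The contrast between idf-only emphasis and necessity-aware emphasis can be sketched numerically. The collection size, document frequencies, and P(t | R) values below are hypothetical, chosen only for illustration; they are not the values from the slide. The weight is the standard Binary Independence Model form, with the non-relevant probability approximated by df / N.

```python
import math


def idf_weight(df, n_docs):
    """idf-style part of the BIM weight: log((N - df) / df)."""
    return math.log((n_docs - df) / df)


def necessity_weight(p_t_given_r, df, n_docs):
    """BIM weight including necessity odds: log(p / (1 - p)) + log((N - df) / df)."""
    return math.log(p_t_given_r / (1 - p_t_given_r)) + idf_weight(df, n_docs)


# Hypothetical statistics (NOT taken from the slides).
N_DOCS = 500_000
terms = {
    # term: (document frequency, assumed P(t | R))
    "prognosis": (2_000, 0.1),  # rare, but seldom present in relevant documents
    "party": (40_000, 0.8),     # common, but present in most relevant documents
}

idf_only = {t: idf_weight(df, N_DOCS) for t, (df, _) in terms.items()}
with_necessity = {t: necessity_weight(p, df, N_DOCS) for t, (df, p) in terms.items()}
```

Under these assumed numbers, idf alone ranks "prognosis" above "party", while adding the necessity odds reverses the order.
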
Indri Top Results
1. (ZF) Recession concerns lead to a discouraging prognosis for …
2. (AP) Politics … party … Robertson's viability as a candidate
3. (WSJ) political parties …
4. (AP) there is no viable opposition …
5. (WSJ) A third of the votes
6. (WSJ) politics, party, two thirds
7. (AP) third ranking political movement…
8. (AP) political parties
9. (AP) prognosis for the Sunday school
10. (ZF) third party provider
(Google, Bing still have top 10 false positives. Emphasis also a problem for large search engines!)

Without Necessity
The emphasis problem for idf-only term weighting
–Emphasize high idf terms in query
 “prognosis/viability of a political third party in U.S.” (Topic 206)
–False positives throughout rank list
 especially detrimental at top rank
–No term recall hurts precision at all recall levels
–(This is true for BIM, and also BM25, LM that use tf.)
How significant is the emphasis problem?

Failure Analysis of 44 Topics from TREC
RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)
[Figure: failure categories annotated with remedies: Necessity term weighting; Necessity guided expansion; Bigrams and Term restriction using doc fields]
Basis: Term Necessity Prediction

Given True Necessity
+100% over BIM (in precision at all recall levels)
–[Robertson and Spärck Jones 1976]
…% over Language Model, BM25 (in MAP)
–This work
For a new query w/o relevance judgements, need to predict necessity.
–Predictions don’t need to be very accurate to show performance gain.

How Necessary are Words?
(Examples from TREC 3 topics)
[Table: columns "Term in Query" and "P(t | R)"; the P(t | R) values are missing from the source. Queries shown:]
–Oil Spills
–Term limitations for US Congress members
–Insurance Coverage which pays for Long Term Care
–School Choice Voucher System and its effects on the US educational program
–Vitamin the cure or cause of human ailments

Mismatch Statistics
Mismatch variation across terms
[Figure: two panels, (TREC 3 title) and (TREC 9 desc)]
–Not constant, need prediction

Mismatch Statistics (2)
Mismatch variation for the same term in different queries
[Figure: TREC 3 recurring words]
–Query dependent features needed (1/3 term occurrences have necessity variation > 0.1)

Prior Prediction Approaches
Croft/Harper combination match (1979)
–treats P(t | R) as a tuned constant
–when >0.5, rewards docs that match more query terms
Greiff’s (1998) exploratory data analysis
–Used idf to predict overall term weighting
–Improved over BIM
Metzler’s (2008) generalized idf
–Used idf to predict P(t | R)
–Improved over BIM
Years of simple idf feature, limited success
–Missing piece: P(t | R) = term necessity = term recall

Factors that Affect Necessity
What causes a query term to not appear in relevant documents?
Topic Centrality (Concept Necessity)
–E.g., Laser research related or potentially related to US defense, Welfare laws propounded as reforms
Synonyms
–E.g., movie == film == …
Abstractness
–E.g., Ailments in the vitamin query, Dog Maulings, Christian Fundamentalism
–Worst thing is a rare & abstract term, e.g. prognosis

Features
We need to
–Identify synonyms/searchonyms of a query term
–in a query dependent way
Use Thesauri?
–Biased (not collection dependent)
–Static (not query dependent)
–Not promising, Not easy
Term-term similarity in concept space!
–Local LSI (Latent Semantic Indexing)
 LSI of (e.g. 200) top ranked documents
 keep (e.g. 150) dimensions

Features
Topic Centrality
–Length of term vector after dimension reduction (local LSI)
Synonymy (Concept Necessity)
–Average similarity scores of top 5 similar terms
Replaceability
–Adjust the Synonymy measure by how many new documents the synonyms match
Abstractness
–Users modify abstract terms with concrete terms
–E.g., effects on the US educational program; prognosis of a political third party

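The first two features can be sketched as follows, assuming the local LSI step has already produced one reduced-dimension vector per term. The toy vectors, the function names, and the choice of cosine as the similarity score are assumptions made for illustration; the slides do not specify the similarity function.

```python
import math


def vector_length(vec):
    """Euclidean length of a term vector in the reduced concept space."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    denom = vector_length(u) * vector_length(v)
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / denom


def centrality(term, term_vectors):
    """Topic Centrality feature: length of the term's reduced vector."""
    return vector_length(term_vectors[term])


def synonymy(term, term_vectors, top_k=5):
    """Synonymy feature: average similarity of the top_k most similar terms."""
    sims = sorted(
        (cosine(term_vectors[term], vec)
         for other, vec in term_vectors.items() if other != term),
        reverse=True,
    )[:top_k]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical 2-dimensional term vectors (a real local LSI space would keep
# far more dimensions, e.g. 150 as on the previous slide).
term_vectors = {
    "movie": [0.9, 0.1],
    "film": [0.8, 0.2],
    "cinema": [0.7, 0.1],
    "tax": [0.0, 0.6],
}

movie_synonymy = synonymy("movie", term_vectors, top_k=2)
tax_synonymy = synonymy("tax", term_vectors, top_k=2)
```

In this toy space "movie" has close neighbours ("film", "cinema") and therefore a high synonymy score, while "tax" does not.
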
Experiments
Necessity Prediction Error
–Regression problem
 Model: RBF kernel regression, M: P(t | R)
Necessity for Term Weighting
–End-to-End retrieval performance
–How to weight terms by their necessity
 In BM25
 –Binary Independence Model
 In Language Models
 –Relevance model P_m(t | R) – multinomial (Lavrenko and Croft 2001)

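The regression setup can be illustrated with a small stand-in. The sketch below uses a Nadaraya-Watson kernel-weighted average with an RBF kernel, which is a simpler technique than whatever regressor the authors trained; it is shown only to make the "features in, P(t | R) out" idea concrete. The feature vectors, training targets, and `gamma` value are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def predict_necessity(features, train_x, train_y, gamma=1.0):
    """Kernel-weighted average of training necessities (Nadaraya-Watson)."""
    weights = [rbf_kernel(features, x, gamma) for x in train_x]
    total = sum(weights)
    if total == 0.0:
        # All kernel weights underflowed; fall back to the training mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical training terms: [centrality, synonymy] -> true P(t | R).
train_x = [[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]]
train_y = [0.9, 0.8, 0.2, 0.3]

central_term = predict_necessity([0.85, 0.25], train_x, train_y, gamma=10.0)
replaceable_term = predict_necessity([0.15, 0.85], train_x, train_y, gamma=10.0)
```

Because the prediction is a weighted average of training targets, it always stays within the range of observed P(t | R) values, which is convenient when the target is a probability.
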
Necessity Prediction Example
Trained on TREC 3, tested on TREC 4
[Table: one column per query term (party, political, third, viability, prognosis); rows: True P(t | R), Predicted, Emphasis. Cell values are missing from the source.]

Necessity Prediction Error
[Figure: L1 Loss of the predictions; the lower the better]

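A minimal sketch of the error measure, assuming the L1 loss is averaged over query terms (the slide does not say whether it is a sum or a mean). The function name and sample values are illustrative only.

```python
def l1_loss(true_values, predicted_values):
    """Mean absolute difference between true and predicted P(t | R)."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = (abs(t - p) for t, p in zip(true_values, predicted_values))
    return sum(diffs) / len(true_values)


# Hypothetical true and predicted necessities for two terms.
loss = l1_loss([0.8, 0.4], [0.6, 0.5])  # (0.2 + 0.1) / 2
```
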
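Predicted Necessity Weighting
[Table, partially recoverable. Columns: TREC train sets; Test/x-validation ("4688" in the source, column boundaries lost). Rows: LM desc – Baseline (values missing); LM desc – Necessity (values missing); Improvement: 26.38%, 23.52%, 20.33%, 21.32%. Two further Baseline/Necessity row pairs have lost their metric labels and values.]
…% gain (necessity weight)
10-20% gain (top Precision)
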
Predicted Necessity Weighting (ctd.)
[Table, partially recoverable. Columns: TREC train sets; Test/x-validation. Rows: LM desc – Baseline (values missing); LM desc – Necessity (values missing); Improvement: 11.43%, 11.25%, 149.8%, 24.82% (split on "%" from the fused source string; cell boundaries uncertain). Two further Baseline/Necessity row pairs have lost their metric labels and values.]

vs. Relevance Model
[Table: column Test/x-validation; rows: Relevance Model desc, RM reweight-Only desc, RM reweight-Trained desc. Cell values are missing from the source.]
Weight Only ≈ Expansion
Supervised > Unsupervised (5-10%)
Relevance Model: #weight( 1-λ #combine( t_1 t_2 ) λ #weight( w_1 t_1 w_2 t_2 w_3 t_3 … ) )
w_1 ~ P(t_1 | R), w_2 ~ P(t_2 | R)
[Figure: plot of x against y, annotated x ~ y]

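The query template can be assembled programmatically. The sketch below builds the reweight-only form, in which only the original query terms receive weights; the function name, the number formatting, and the example weights are assumptions made for illustration, while `#weight` and `#combine` are the Indri operators shown on the slide.

```python
def necessity_weighted_query(term_weights, lam=0.5):
    """Build an Indri query string in the Relevance Model form.

    term_weights maps each query term to a weight standing in for P(t | R);
    lam is the interpolation weight (the lambda of the template).
    """
    if not term_weights:
        raise ValueError("need at least one query term")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    original = " ".join(term_weights)
    weighted = " ".join(f"{w:.3f} {t}" for t, w in term_weights.items())
    return (
        f"#weight( {1 - lam:.2f} #combine( {original} ) "
        f"{lam:.2f} #weight( {weighted} ) )"
    )


# Hypothetical necessity estimates for two query terms.
query = necessity_weighted_query({"party": 0.8, "prognosis": 0.1}, lam=0.7)
```

With these inputs, `query` is `#weight( 0.30 #combine( party prognosis ) 0.70 #weight( 0.800 party 0.100 prognosis ) )`. Adding expansion terms would only mean adding entries to the inner `#weight` clause while leaving `#combine` restricted to the original query.
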
Take Home Messages
Necessity is as important as idf (theory)
Explains behavior of IR models (practice)
Effective features can predict necessity
Performance gain

Acknowledgements
Reviewers from multiple venues
Ni Lao, Frank Lin, Yiming Yang, Stephen Robertson, Bruce Croft, Matthew Lease
–Discussions & references
David Fisher, Mark Hoy
–Maintaining the Lemur toolkit
Andrea Bastoni and Lorenzo Clemente
–Maintaining LSI code for Lemur toolkit
SVM-light, Stanford parser
TREC
–All the data
NSF Grant IIS and IIS
Feedback: Le Zhao