
1 Term Necessity Prediction P(t | R_q). Le Zhao and Jamie Callan, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Oct 15, 2010. Main Points: Necessity is as important as idf (theory). Explains behavior of IR models (practice). Can be predicted. Performance gain.

2 Definition of Necessity P(t | R_q): directly calculated given relevance judgements for q. In the diagram on the slide, within the collection, the docs that contain t overlap the docs relevant to q; the fraction of relevant docs that contain t gives P(t | R_q) = 0.4 in the example. Necessity == 1 – mismatch == term recall.
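
To make this definition concrete, here is a minimal sketch of computing P(t | R_q) from relevance judgements. The document IDs, term sets, and judgements below are hypothetical, not taken from the talk.

```python
def term_necessity(term, relevant_doc_ids, doc_terms):
    """P(t | R_q): the fraction of judged-relevant documents that contain the term."""
    if not relevant_doc_ids:
        return 0.0
    containing = sum(1 for d in relevant_doc_ids if term in doc_terms[d])
    return containing / len(relevant_doc_ids)

# Hypothetical judged-relevant documents for one query, with their (stemmed) term sets.
doc_terms = {
    "d1": {"third", "parti", "polit"},
    "d2": {"third", "parti", "viabil"},
    "d3": {"parti", "polit"},
    "d4": {"third", "parti"},
    "d5": {"parti", "prognosi"},
}
relevant = ["d1", "d2", "d3", "d4", "d5"]

print(term_necessity("parti", relevant, doc_terms))     # 1.0  -> rarely mismatched
print(term_necessity("prognosi", relevant, doc_terms))  # 0.2  -> mostly mismatched
```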

3 Why Necessity? Roots in Probabilistic Models. Binary Independence Model [Robertson and Spärck Jones 1976] – the "Relevance Weight" / "Term Relevance". P(t | R) is effectively the only part about relevance. The term weight factors into a necessity-odds part and an idf (sufficiency) part. Main Points: Necessity is as important as idf (theory). Explains behavior of IR models (practice). Can be predicted. Performance gain.
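
A worked form of that factorization, written out here because the annotated equation on the slide did not survive the transcript; this is the standard RSJ relevance weight, with p = P(t | R) and s = P(t | non-relevant):

```latex
w_t \;=\; \log\frac{p\,(1-s)}{s\,(1-p)}
    \;=\; \underbrace{\log\frac{p}{1-p}}_{\text{necessity odds}}
    \;+\; \underbrace{\log\frac{1-s}{s}}_{\text{idf-like (sufficiency)}}
```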

4 Without Necessity. The emphasis problem for idf-only term weighting – it emphasizes the high-idf terms in the query "prognosis/viability of a political third party in U.S." (Topic 206).

5 Ground Truth (TREC 4, topic 206)
Term | True P(t | R) | idf
party | 0.9796 | 2.402
political | 0.7143 | 2.513
third | 0.5918 | 2.187
viability | 0.0408 | 5.017
prognosis | 0.0204 | 7.471
Emphasis: the highest-idf terms (viability, prognosis) are the least necessary.

6 Indri Top Results
1. (ZF32-220-147) Recession concerns lead to a discouraging prognosis for 1991
2. (AP880317-0017) Politics … party … Robertson's viability as a candidate
3. (WSJ910703-0174) political parties …
4. (AP880512-0050) there is no viable opposition …
5. (WSJ910815-0072) A third of the votes
6. (WSJ900710-0129) politics, party, two thirds
7. (AP880729-0250) third ranking political movement …
8. (AP881111-0059) political parties
9. (AP880224-0265) prognosis for the Sunday school
10. (ZF32-051-072) third party provider
(Google and Bing still have top-10 false positives. Emphasis is also a problem for large search engines!)

7 Without Necessity. The emphasis problem for idf-only term weighting – emphasizes high-idf terms in the query "prognosis/viability of a political third party in U.S." (Topic 206) – false positives throughout the ranked list, especially detrimental at top ranks – ignoring term recall hurts precision at all recall levels – (this is true for BIM, and also for BM25 and language models, which use tf). How significant is the emphasis problem?

8 Failure Analysis of 44 Topics from TREC 6-8. RIA workshop 2003 (7 top research IR systems, >56 expert-weeks). Necessity term weighting and necessity-guided expansion; basis: term necessity prediction. Main Points: Necessity is as important as idf (theory). Explains behavior of IR models (practice) – & bigrams, & term restriction using doc fields. Can be predicted. Performance gain.

9 Given True Necessity: +100% over BIM in precision at all recall levels [Robertson and Spärck Jones 1976]; +30-80% over Language Model and BM25 in MAP (this work). For a new query without relevance judgements, necessity must be predicted – predictions don't need to be very accurate to show a performance gain.

10 How Necessary are Words? (Examples from TREC 3 topics)
Term in Query | P(t | R)
Oil Spills | 0.9914
Term limitations for US Congress members | 0.9831
Insurance Coverage which pays for Long Term Care | 0.6885
School Choice Voucher System and its effects on the US educational program | 0.2821
Vitamin the cure or cause of human ailments | 0.1071

11 Mismatch Statistics. Mismatch variation across terms (TREC 3 title queries; TREC 9 desc queries) – not constant, so prediction is needed.

12 Mismatch Statistics (2). Mismatch variation for the same term in different queries (TREC 3 recurring words) – query-dependent features are needed (1/3 of term occurrences have a necessity variation > 0.1).

13 Prior Prediction Approaches.
Croft/Harper combination match (1979) – treats P(t | R) as a tuned constant – when > 0.5, it rewards docs that match more query terms.
Greiff's (1998) exploratory data analysis – used idf to predict the overall term weight – improved over BIM.
Metzler's (2008) generalized idf – used idf to predict P(t | R) – improved over BIM.
Years of relying on the simple idf feature brought limited success – the missing piece: P(t | R) = term necessity = term recall.

14 Factors that Affect Necessity. What causes a query term to not appear in relevant documents?
Topic Centrality (Concept Necessity) – e.g., Laser research related or potentially related to US defense; Welfare laws propounded as reforms.
Synonyms – e.g., movie == film == …
Abstractness – e.g., ailments in the vitamin query, Dog Maulings, Christian Fundamentalism.
The worst case is a rare & abstract term, e.g. prognosis.

15 Features. We need to identify synonyms/searchonyms of a query term, in a query-dependent way.
Use thesauri? Biased (not collection dependent), static (not query dependent) – not promising, not easy.
Instead: term-term similarity in concept space – local LSI (Latent Semantic Indexing): LSI of the (e.g. 200) top-ranked documents, keeping (e.g. 150) dimensions. A sketch follows below.
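
As referenced above, a minimal sketch of the local-LSI step: run a truncated SVD over the top-ranked documents for one query, then compare terms in the reduced concept space. The 200-document / 150-dimension settings follow the slide; the tf-idf weighting, cosine similarity, and scikit-learn machinery are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_term_vectors(top_docs, n_dims=150):
    """Map each term to its vector in the concept space of the top-ranked docs."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(top_docs)                      # docs x terms
    k = max(1, min(n_dims, min(X.shape) - 1))            # keep (e.g.) 150 dimensions
    svd = TruncatedSVD(n_components=k).fit(X)
    # Columns of components_ correspond to terms; scale by the singular values.
    term_mat = (svd.components_ * svd.singular_values_[:, None]).T
    return dict(zip(vec.get_feature_names_out(), term_mat))

def top_similar_terms(term, term_vecs, k=5):
    """Cosine similarity of one query term to every other term in concept space."""
    v = term_vecs[term]
    sims = {}
    for t, u in term_vecs.items():
        if t == term:
            continue
        denom = (np.linalg.norm(v) * np.linalg.norm(u)) or 1.0
        sims[t] = float(v @ u) / denom
    return sorted(sims.items(), key=lambda x: -x[1])[:k]

# Usage: top_docs = texts of the ~200 top-ranked documents for the query, then
# term_vecs = local_lsi_term_vectors(top_docs); top_similar_terms("oil", term_vecs)
```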

16 Features. Topic Centrality – length of the term vector after dimension reduction (local LSI). Synonymy (Concept Necessity) – average similarity score of the top 5 most similar terms. Replaceability – adjusts the Synonymy measure by how many new documents the synonyms match. Abstractness – users modify abstract terms with concrete terms, e.g. "effects on the US educational program", "prognosis of a political third party".
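
Continuing the sketch above, the first two features could be computed from those local-LSI term vectors roughly as follows; this mirrors the slide's one-line descriptions, not the paper's exact formulas.

```python
import numpy as np

def centrality_feature(term, term_vecs):
    """Topic centrality: length of the term's vector after dimension reduction."""
    v = term_vecs.get(term)
    return float(np.linalg.norm(v)) if v is not None else 0.0

def synonymy_feature(term, term_vecs, k=5):
    """Synonymy: average similarity score of the top-k most similar terms."""
    sims = top_similar_terms(term, term_vecs, k=k)   # from the previous sketch
    return sum(score for _, score in sims) / len(sims) if sims else 0.0
```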

17 Experiments. Necessity prediction error – a regression problem; model: RBF kernel regression, M: features → P(t | R). Necessity for term weighting – end-to-end retrieval performance – how to weight terms by their necessity: in BM25 via the Binary Independence Model; in Language Models via the relevance model's multinomial P_m(t | R) (Lavrenko and Croft 2001).
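
For the regression step, a stand-in sketch using scikit-learn's kernel ridge regression with an RBF kernel; the slide only names "RBF kernel regression", so the specific library, hyperparameters, and the example feature values below are assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# X: one feature vector per (query, term) pair from training topics
#    (centrality, synonymy, replaceability, abstractness/DepLeaf, idf);
# y: the true P(t | R) computed from relevance judgements. Values are hypothetical.
X_train = np.array([[1.2, 0.45, 0.30, 0.0, 2.4],
                    [0.3, 0.05, 0.02, 1.0, 7.5],
                    [0.9, 0.25, 0.20, 0.0, 5.0]])
y_train = np.array([0.98, 0.02, 0.59])

model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X_train, y_train)

X_new = np.array([[0.7, 0.20, 0.15, 0.0, 5.0]])    # features of a new query term
pred = np.clip(model.predict(X_new), 0.0, 1.0)      # keep predictions in [0, 1]
print(pred)
```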

18 Necessity Prediction Example (Emphasis; trained on TREC 3, tested on TREC 4)
Term | True P(t | R) | Predicted
party | 0.9796 | 0.7585
political | 0.7143 | 0.6523
third | 0.5918 | 0.6236
viability | 0.0408 | 0.3080
prognosis | 0.0204 | 0.2869

19 Necessity Prediction Error (figure: L1 loss, the lower the better). Main Points: Necessity is as important as idf. Explains behavior of IR models. Can be predicted. Performance gain.

20 Predicted Necessity Weighting
TREC train sets | 3 | 3-5 | 3-7 | 7
Test / x-validation | TREC 4 | TREC 6 | TREC 8 | TREC 8
LM desc – Baseline | 0.1789 | 0.1586 | 0.1923 | 0.1923
LM desc – Necessity | 0.2261 | 0.1959 | 0.2314 | 0.2333
Improvement | 26.38% | 23.52% | 20.33% | 21.32%
P@10 Baseline | 0.4160 | 0.2980 | 0.3860 | 0.3860
P@10 Necessity | 0.4940 | 0.3420 | 0.4220 | 0.4380
P@20 Baseline | 0.3450 | 0.2440 | 0.3310 | 0.3310
P@20 Necessity | 0.4180 | 0.2900 | 0.3540 | 0.3610
10-25% gain from necessity weighting; 10-20% gain in top precision.

21 Predicted Necessity Weighting (ctd.)
TREC train sets | 3-9 | 9 | 11 | 13
Test / x-validation | TREC 10 | TREC 10 | TREC 12 | TREC 14
LM desc – Baseline | 0.1627 | 0.1627 | 0.0239 | 0.1789
LM desc – Necessity | 0.1813 | 0.1810 | 0.0597 | 0.2233
Improvement | 11.43% | 11.25% | 149.8% | 24.82%
P@10 Baseline | 0.3180 | 0.3180 | 0.0200 | 0.4720
P@10 Necessity | 0.3280 | 0.3400 | 0.0467 | 0.5360
P@20 Baseline | 0.2400 | 0.2400 | 0.0211 | 0.4460
P@20 Necessity | 0.2790 | 0.2810 | 0.0411 | 0.5030
Main Points: Necessity is as important as idf. Explains behavior of IR models. Can be predicted. Performance gain.

22 vs. Relevance Model (MAP, desc queries)
Test / x-validation | 4 | 6 | 8 | 8 | 10 | 10 | 12 | 14
Relevance Model | 0.2423 | 0.1799 | 0.2352 | 0.2352 | 0.1888 | 0.1888 | 0.0221 | 0.1774
RM reweight-Only | 0.2215 | 0.1705 | 0.2435 | 0.2435 | 0.1700 | 0.1700 | 0.0692 | 0.1945
RM reweight-Trained | 0.2330 | 0.1921 | 0.2542 | 0.2563 | 0.1809 | 0.1793 | 0.0534 | 0.2258
Weight-Only ≈ Expansion; Supervised > Unsupervised (5-10%).
Relevance Model query template: #weight( (1-λ) #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) ), with w1 ~ P(t1 | R), w2 ~ P(t2 | R), …
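
A small helper, as a sketch, that turns predicted necessities into the interpolated Indri query of the template above; the interpolation weight and the example terms/necessities are hypothetical.

```python
def necessity_weighted_indri_query(term_necessities, lam=0.5):
    """Interpolate the plain #combine query with a #weight query whose weights
    are the (predicted) term necessities P(t | R)."""
    terms = " ".join(term_necessities)
    weighted = " ".join(f"{p:.2f} {t}" for t, p in term_necessities.items())
    return f"#weight( {1 - lam:.2f} #combine( {terms} ) {lam:.2f} #weight( {weighted} ) )"

print(necessity_weighted_indri_query({"third": 0.59, "party": 0.98, "political": 0.71}))
# #weight( 0.50 #combine( third party political ) 0.50 #weight( 0.59 third 0.98 party 0.71 political ) )
```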

23 Take Home Messages: Necessity is as important as idf (theory). Explains behavior of IR models (practice). Effective features can predict necessity. Performance gain.

24 Acknowledgements. Reviewers from multiple venues. Ni Lao, Frank Lin, Yiming Yang, Stephen Robertson, Bruce Croft, Matthew Lease – discussions & references. David Fisher, Mark Hoy – maintaining the Lemur toolkit. Andrea Bastoni and Lorenzo Clemente – maintaining the LSI code for the Lemur toolkit. SVM-light, Stanford parser. TREC – all the data. NSF Grants IIS-0707801 and IIS-0534345. Feedback: Le Zhao (lezhao@cs.cmu.edu)

25 Not Concept Necessity, not necessity for good performance.

26 Related Work.
P(t | R) or term weighting prediction – Berkeley regression, Cooper et al. (1993) – regression rank, Lease et al. (2009) – exploratory data analysis, Greiff (1998) – generalized IDF, Metzler (2008).
Key concepts and long query reduction – Bendersky and Croft (2008): "must be … in a retrieved document in order for it to be relevant." – Kumaran and Carvalho (2009).

27 Future Research Directions.
To improve necessity prediction – click-through data & query rewrites (close to relevance judgements; associate clicks with result snippets) – better understanding of necessity & better features.
For applying necessity in retrieval – ad hoc retrieval: recall for phrases and more complex structured terms – structured query formulation, e.g. ( (A1 OR A2 OR A3) AND (B1 OR B2 OR B3) ), see the sketch below – Google's automatic synonym operator: ~ – where to expand, and which expansion terms to include – relevance feedback.
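
As referenced in the list above, one way to format such a structured query, as a sketch: a soft conjunction (#combine) over synonym groups (#syn) in Indri-style syntax. The aspect/synonym terms are hypothetical, and a strict boolean AND would use a different operator.

```python
def cnf_synonym_query(aspects):
    """Format aspects (each a list of synonyms/searchonyms) as
    #combine( #syn( ... ) #syn( ... ) ... )."""
    groups = " ".join("#syn( " + " ".join(terms) + " )" for terms in aspects)
    return f"#combine( {groups} )"

print(cnf_synonym_query([["movie", "film"], ["review", "critique"]]))
# #combine( #syn( movie film ) #syn( review critique ) )
```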

28 Knowledge: How Necessity Explains the Behavior of IR Techniques.
Why weight query bigrams 0.1 but query unigrams 0.9? – A bigram decreases term recall; the weight reflects recall.
Why do bigrams not give stable improvements? – Term recall is more of a problem.
Why does using document structure (fields, semantic annotation) not improve performance? – It improves precision; structural mismatch needs to be solved first.
Word sense disambiguation – enhances precision; instead, it should be used in mismatch modeling: identify the query term sense for searchonym identification or learning across queries, and disambiguate collection term senses for more accurate replaceability.
Personalization – biases results toward what a community/person likes to read (precision) – may work well in a mobile setting with short queries.

29 Why Necessity? System Failure Analysis. Reliable Information Access (RIA) workshop (2003) – failure analysis for 7 top research IR systems: 11 groups of researchers (academia & industry), 28 people directly involved in the analysis (senior & junior), >56 human-weeks (analysis + running experiments), 45 topics selected from the 150 TREC 6-8 topics (difficult topics).
Causes (necessity in various disguises):
Emphasize 1 aspect, missing another aspect (14+2 topics)
Emphasize 1 aspect, missing another term (7 topics)
Missing either 1 of 2 aspects, need both (5 topics)
Missing a difficult aspect that needs human help (7 topics)
Need to expand a general term, e.g. "Europe" (4 topics)
Precision problem, e.g. "euro", not "euro-…" (4 topics)

30

31

32 Recurring Words. How much is necessity term-dependent? – Use a term's necessity in one query to predict the same term's necessity in another query – easy with industry-scale relevance judgements or query logs.

33 Local LSI Top Similar Terms
Query | Oil spills | Insurance coverage which pays for long term care | Term limitations for US Congress members | Vitamin the cure of or cause for human ailments
Query term | oil | term | term | ail
Similar terms | spill 0.5828 | term 0.3310 | term 0.3339 | ail 0.4415
 | oil 0.4210 | long 0.2173 | limit 0.1696 | health 0.0825
 | tank 0.0986 | nurse 0.2114 | ballot 0.1115 | disease 0.0720
 | crude 0.0972 | care 0.1694 | elect 0.1042 | basler 0.0718
 | water 0.0830 | home 0.1268 | care 0.0997 | dr 0.0695

34 Predicting Necessity. Problem definition – training samples: the features of each term t_i of a training query q, paired with the term's true P(t_i | R_q) – prediction: apply the trained model M to the features of a new query's terms – training objective: minimize the prediction loss. One way to write this out is sketched below.
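
Written out explicitly, since the slide's own formulas did not survive the transcript; this follows the regression framing from slide 17 and the L1 loss reported on slide 19:

```latex
\text{Training data: } \mathcal{D} = \big\{\, (\mathbf{f}(t, q),\; P(t \mid R_q)) \;:\; t \in q,\ q \in \text{training queries} \,\big\}

\text{Prediction: } \hat{P}(t \mid R_q) = M\big(\mathbf{f}(t, q)\big)

\text{Objective: } M^{*} = \arg\min_{M} \sum_{(t,q) \in \mathcal{D}} \big|\, M(\mathbf{f}(t, q)) - P(t \mid R_q) \,\big|
```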

35

36 Necessity Term Weighting.
Baseline vs. true necessity weighting – 30-80% gain in MAP.
Baseline vs. predicted necessity weighting – 10-25% gain in MAP.
Relevance Model vs. reweight-only & necessity prediction on RM weights – weighting matters more than expansion (long queries).
Ablation study – all features help.

37 How Necessary are Words? (Examples from TREC 3 topics)
Query | P(t | R) | idf
Oil Spills | 0.9914 | 5.201
Term limitations for US Congress members | 0.9831 | 2.010
Insurance Coverage which pays for Long Term Care | 0.6885 |
School Choice Voucher System and its effects on the US educational program | 0.2821 | 1.647
Vitamin the cure or cause of human ailments | 0.1071 | 6.405

38 True Necessity Weighting
TREC | 4 | 6 | 8 | 9 | 10 | 12 | 14
Document collection | disk 2,3 | disk 4,5 | d4,5 w/o cr | WT10g | WT10g | .GOV | .GOV2
Topic numbers | 201-250 | 301-350 | 401-450 | 451-500 | 501-550 | TD1-50 | 751-800
LM desc – Baseline | 0.1789 | 0.1586 | 0.1923 | 0.2145 | 0.1627 | 0.0239 | 0.1789
LM desc – Necessity | 0.2703 | 0.2808 | 0.3057 | 0.2770 | 0.2216 | 0.0868 | 0.2674
Improvement | 51.09% | 77.05% | 58.97% | 29.14% | 36.20% | 261.7% | 49.47%
p – randomization: 0.0000 0.0001
p – sign test: 0.0000 0.0005 0.0000 0.0002
Multinomial-abs | 0.1988 | 0.2088 | 0.2345 | 0.2239 | 0.1653 | 0.0645 | 0.2150
Multinomial RM | 0.2613 | 0.2660 | 0.2969 | 0.2590 | 0.2259 | 0.1219 | 0.2260
Okapi desc – Baseline | 0.2055 | 0.1773 | 0.2183 | 0.1944 | 0.1591 | 0.0449 | 0.2058
Okapi desc – Necessity | 0.2679 | 0.2786 | 0.2894 | 0.2387 | 0.2003 | 0.0776 | 0.2403
LM title – Baseline | N/A | 0.2362 | 0.2518 | 0.1890 | 0.1577 | 0.0964 | 0.2511
LM title – Necessity | N/A | 0.2514 | 0.2606 | 0.2058 | 0.2137 | 0.1042 | 0.2674

39 Feature Correlation
Feature | f1 Centrality | f2 Synonymy | f3 Replaceability | f4 DepLeaf | f5 idf | RMw
Correlation | 0.3719 | 0.3758 | -0.1872 | 0.1278 | -0.1339 | 0.6296
Predicted Necessity: 0.7989

40 Prediction Based on RMw (x-axis)

41 Ablation Study
Features used | MAP | Features used | MAP
IDF only | 0.1776 | All 5 features | 0.2261
IDF + Centrality | 0.2076 | All but Centrality | 0.2235
IDF + Synonymy | 0.2129 | All but Synonymy | 0.2066
IDF + Replaceable | 0.1699 | All but Replaceable | 0.2211
IDF + DepLeaf | 0.1900 | All but DepLeaf | 0.2226

42 Using Document Structure. Stylistic: XML. Syntactic/semantic: POS, semantic role labels. Current approaches – all precision oriented. Need to solve mismatch first?

43 Be mean! Apply Necessity to your retrieval models!

44 Be mean! Is the term Necessary for doc relevance? IR theory – potential in reality – prediction – factors – term weighting – features.

