Modeling and Solving Term Mismatch for Full-Text Retrieval

Presentation transcript:

Modeling and Solving Term Mismatch for Full-Text Retrieval
Dissertation Presentation
Le Zhao, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
July 26, 2012
Committee: Jamie Callan (Chair), Jaime Carbonell, Yiming Yang, Bruce Croft (UMass)
Talk comfortably, never hurry! Quick sentences, but more pauses and transitions! One transition sentence for every important slide, to motivate it! Spend more time on charts and results. Explain. For equations, say: you don't need to fully understand this, just know that ... Explain jargon: TREC, idf, precision. Hello everyone, my name is Le; it's pronounced like "Learn" but without the rn at the end. It means happy in Chinese, and I'm very glad to be here to give my job talk on the term mismatch problem in retrieval. Before I begin with details, I want to share with you my view of why I think we should work on search.

What is Full-Text Retrieval? The task. The Cranfield evaluation [Cleverdon 1960] abstracts away the user and allows objective & automatic evaluations. (Diagram: the user issues a query, the retrieval engine searches the document collection, and results are returned to the user.)

Where are We (Going)? Current retrieval models: formal models from the 1970s, the best ones from the 1990s, based on simple collection statistics (tf.idf), with no deep understanding of natural language texts. Perfect retrieval would imply textual entailment (a difficult natural language task), e.g. Query: "information retrieval", Answer: "… text search …". Searcher frustration [Feild, Allan and Jones 2010]. We are still far away; what has been holding us back? This work argues that two long-standing problems might be the culprit.

Two Long-Standing Problems in Retrieval. (1) Term mismatch [Furnas, Landauer, Gomez and Dumais 1987]: no clear definition in a retrieval setting. (2) Relevance, i.e. query-dependent term importance P(t | R): traditionally approximated by idf (rareness) [Robertson and Spärck Jones 1976; Greiff 1998], with few clues about how to effectively estimate it. This work connects the two problems, shows they can result in huge gains in retrieval, and uses a predictive approach toward solving both.

What is Term Mismatch & Why Care? Job search: you look for information retrieval jobs on the market; they want text search skills. Mismatch costs you job opportunities (50%, even if you are careful). Legal discovery: you look for bribery or foul play in corporate documents; they say grease or pay off. Mismatch costs you cases. Patent/publication search: it costs businesses. Medical record retrieval: it costs lives. Why you should care about term mismatch: in the areas where you care most, mismatch in search can cost you a lot. Of course there is also the Lady Gaga picture query, where you probably don't care that much if you miss a few good results. But if you are able to find cool Lady Gaga pictures that others couldn't find, that can make you kind of a cool person. And if you care about that, you should also care about term mismatch and this talk. Because in this talk, I will show you the following.

Prior Approaches. Document side: full-text indexing (instead of only indexing key words), stemming (include morphological variants), document expansion (inlink anchor text, user tags). Query side: query expansion, reformulation. Both: Latent Semantic Indexing, translation-based models. This is an important problem, and many techniques were designed to solve mismatch. But our solution is different: we start by clearly defining the problem, and then see whether it really is a problem.

Main Questions Answered: Definition; Significance (theory & practice); Mechanism (what causes the problem); Model and solution. To solve a problem, we first clearly define it and quantitatively analyze mismatch; we try to understand the significance of the problem both theoretically and practically. To further address the problem, we try to understand the underlying mechanism of how mismatch affects retrieval, as well as what factors may cause mismatch. Based on these understandings, we design principled solutions to model and predict mismatch, and solve mismatch in a theoretically motivated manner. We'll come back to this slide throughout the talk to show that I've fulfilled my promises. Two bullets show that mismatch is a real & significant problem; two bullets reveal the underlying mechanism of how the problem occurs and how it affects retrieval performance; two bullets show how to model and solve mismatch, and improve retrieval in principled ways. (Is there a problem? How large is the problem? How to solve it?)

Definition of Mismatch P(t̄ | Rq). (Outline: Definition, Importance, Prediction, Solution.) (Venn diagram: within the collection, the relevant set for query q, i.e. all relevant jobs, and the documents that contain t, e.g. "retrieval"; the relevant jobs outside the latter set are the mismatched ones.) We define term mismatch as the probability that a term t doesn't appear in a document relevant to the query. Of course, for new topics without relevance judgments we need to predict it. That's the main problem in this work, and we start by assuming that we have relevance judgments, so that we can understand mismatch before being able to predict it. Mismatch P(t̄ | Rq) == 1 - term recall P(t | Rq), and it is directly calculated given relevance judgments for q: P(\bar{t} \mid R_q) = |\{d : t \notin d \wedge d \in R_q\}| / |R_q|. [CIKM 2010]
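A minimal sketch of this computation in Python, assuming the relevance judgments for a query are available as a list of relevant documents, each represented as a set of terms (function and variable names are illustrative, not from the dissertation):

    def term_recall(term, relevant_docs):
        """P(t | R_q): fraction of the query's relevant documents that contain the term."""
        if not relevant_docs:
            return 0.0
        return sum(1 for d in relevant_docs if term in d) / len(relevant_docs)

    def term_mismatch(term, relevant_docs):
        """P(t-bar | R_q) = 1 - term recall."""
        return 1.0 - term_recall(term, relevant_docs)

    # Example: 3 of 4 relevant documents contain "retrieval" -> recall 0.75, mismatch 0.25.
    rel = [{"text", "retrieval", "jobs"}, {"search", "jobs"},
           {"retrieval", "engine"}, {"retrieval", "ranking"}]
    print(term_recall("retrieval", rel), term_mismatch("retrieval", rel))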

How Often do Terms Match? (Outline: Definition, Importance, Prediction, Solution.) Term in query vs. P(t | R): "Oil Spills" 0.9914; "Term limitations for US Congress members" 0.9831; "Insurance Coverage which pays for Long Term Care" 0.6885; "School Choice Voucher System and its effects on the US educational program" 0.2821; "Vitamin the cure or cause of human ailments" 0.1071. (Example TREC-3 topics.) Let's look at some examples. These queries are from TREC, run by NIST, which builds standard test collections, including relevance judgments, to help us evaluate retrieval techniques.

Main Questions. Definition: P(t | R) or P(t̄ | R), simple, estimated from relevant documents, used to analyze mismatch. Significance (theory & practice). Mechanism (what causes the problem). Model and solution. With a simple definition that allows us to estimate mismatch from relevance judgments, we can analyze mismatch in more detail than prior research.

Term Mismatch & Probabilistic Retrieval Models. (Outline: Definition, Importance: Theory, Prediction, Solution.) Binary Independence Model [Robertson and Spärck Jones 1976]: the optimal ranking score for each document d decomposes, per term, into a term recall part and an idf (rareness) part; the same term weight is used in Okapi BM25. Other advanced models behave similarly, and these weights are used as effective features in Web search engines. This is technical if you are familiar with IR; if not, you only need to know that BIM is the basic model, assuming binary term occurrence, but more advanced models like Okapi BM25 are built on top of BIM, and these advanced models have a similar ranking behavior. Explain idf!

Term Mismatch & Probabilistic Retrieval Models. (Outline: Definition, Importance: Theory, Prediction, Solution.) Binary Independence Model [Robertson and Spärck Jones 1976]: in the optimal ranking score for each document d, P(t | R), historically called the "relevance weight" or "term relevance", is the only part of the term weight that is about the query and about relevance; the other part is idf (rareness), and the P(t | R) part is exactly term recall. PAUSE 5 secs after this slide. This is the single most important theoretical slide in the talk; it shows how the term mismatch and P(t | R) prediction problems are related, and how they affect retrieval.
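The BIM/RSJ term weight itself appeared as an image on the slide; reconstructed in the usual notation (N documents in the collection, df_t of them containing t, R the relevant set), the weight summed over matching query terms is:

    w_t = \log\frac{P(t \mid R)\,\bigl(1 - P(t \mid \bar{R})\bigr)}{\bigl(1 - P(t \mid R)\bigr)\,P(t \mid \bar{R})}
        = \underbrace{\log\frac{P(t \mid R)}{1 - P(t \mid R)}}_{\text{term recall part}}
        + \underbrace{\log\frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})}}_{\text{idf part}\;\approx\;\log\frac{N - df_t}{df_t}}

The first summand is the term recall (necessity) part discussed on this slide; the second is the familiar idf-like rareness part.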

Main Questions. Definition. Significance: Theory (on par with idf, and the only part about relevance); Practice? Mechanism (what causes the problem). Model and solution.

Term Mismatch & Probabilistic Retrieval Models. (Outline: Definition, Importance: Practice: Mechanism, Prediction, Solution.) Binary Independence Model [Robertson and Spärck Jones 1976]: in the optimal ranking score for each document d, P(t | R) ("relevance weight", "term relevance") is the only part about the query and relevance; the weight combines term recall and idf (rareness).

Without Term Recall: the emphasis problem for tf.idf retrieval models. (Outline: Definition, Importance: Practice: Mechanism, Prediction, Solution.) When a retrieval model only uses idf for term weighting, it emphasizes the high-idf (rare) terms in the query, e.g. "prognosis/viability of a political third party in U.S." (Topic 206).

Ground Truth (Term Recall). (Outline: Definition, Importance: Practice: Mechanism, Prediction, Solution.) Query: prognosis/viability of a political third party. True P(t | R): party 0.9796, political 0.7143, third 0.5918, viability 0.0408, prognosis 0.0204. idf: party 2.402, political 2.513, third 2.187, viability 5.017, prognosis 7.471. Term recall puts the emphasis on party/political/third; idf puts the (wrong) emphasis on viability/prognosis.

Top Results (Language model) Definition Importance: Practice: Mechanism Prediction Solution Top Results (Language model) Query: prognosis/viability of a political third party 1. … discouraging prognosis for 1991 … 2. … Politics … party … Robertson's viability as a candidate … 3. … political parties … 4. … there is no viable opposition … 5. … A third of the votes … 6. … politics … party … two thirds … 7. … third ranking political movement… 8. … political parties … 9. … prognosis for the Sunday school … 10. … third party provider … All are false positives. Emphasis / Mismatch problem, not precision. ( , are better, but still have top 10 false positives. Emphasis / Mismatch also a problem for large search engines!) Hazard of emphasis problem: easy to misunderstand the situation and think that it’s a precision problem!

Without Term Recall: the emphasis problem for tf.idf retrieval models. (Outline: Definition, Importance: Practice: Mechanism, Prediction, Solution.) Emphasizing the high-idf (rare) terms in the query "prognosis/viability of a political third party in U.S." (Topic 206) produces false positives throughout the rank list, which is especially detrimental at the top ranks. Having no term recall hurts precision at all recall levels. How significant is the emphasis problem?

Failure Analysis of 44 Topics from TREC 6-8. (Outline: Definition, Importance: Practice: Mechanism, Prediction, Solution.) (Chart: failure breakdown; the mismatch slice is 27%; remedies: recall term weighting, mismatch-guided expansion; basis: term mismatch prediction.) People have found it difficult to interpret & use the RIA analysis results, but we can explain 90% of the failures with term mismatch. Behaviors explained: bigrams weighted 0.1 vs. unigrams 0.9; field-based retrieval, WSD and personalization are precision enhancing, but precision is not the largest problem for ad hoc retrieval. RIA workshop 2003 (7 top research IR systems, >56 expert*weeks). Failure analyses of retrieval models & techniques are still standard today.

Main Questions. Definition. Significance: Theory (on par with idf, and the only part about relevance); Practice (explains common failures and other behavior: personalization, WSD, structured retrieval). Mechanism (what causes the problem). Model and solution. Starting at 20 mins.

Failure Analysis of 44 Topics from TREC 6-8. (Outline: Definition, Importance: Practice: Potential, Prediction, Solution.) (Chart: failure breakdown; the mismatch slice is 27%; remedies: recall term weighting, mismatch-guided expansion; basis: term mismatch prediction.) RIA workshop 2003 (7 top research IR systems, >56 expert*weeks).

True Term Recall Effectiveness. (Outline: Definition, Importance: Practice: Potential, Prediction, Solution.) +100% over BIM (in precision at all recall levels) [Robertson and Spärck Jones 1976]; +30-80% over Language Model and BM25 (in MAP) in this work. For a new query without relevance judgments we need to predict term recall (how to do that is not yet clear!), but predictions don't need to be very accurate to show a performance gain. Ok, failure analysis is one thing; what about more direct, end-to-end performance gains? … Ok, now that we know the problem to attack, where do we start? Data, always data.

Main Questions. Definition. Significance: Theory (on par with idf, and the only part about relevance); Practice (explains common failures and other behavior; +30 to 80% potential from term weighting). Mechanism (what causes the problem). Model and solution. Starting at 21 mins. New MSR timing: 27 mins.

How Often do Terms Match? (Outline: Definition, Importance, Prediction: Idea, Solution.) The same term can have different recall in different queries. Term in query vs. P(t | R): "Oil Spills" 0.9914; "Term limitations for US Congress members" 0.9831; "Insurance Coverage which pays for Long Term Care" 0.6885; "School Choice Voucher System and its effects on the US educational program" 0.2821; "Vitamin the cure or cause of human ailments" 0.1071. P(t | R) varies from 0 to 1 and differs from idf (idf values shown on the slide: 5.201, 2.010, 1.647, 6.405). (Examples from TREC 3 topics.) Before we go on to prediction, let's look at some statistics.

Statistics: term recall across all query terms (average ~55-60%). (Outline: Definition, Importance, Prediction: Idea, Solution.) (Charts: TREC 3 titles, 4.9 terms/query, average 55% term recall; TREC 9 descriptions, 6.3 terms/query, average 59% term recall.)

Statistics: term recall on shorter queries (average ~70%). (Outline: Definition, Importance, Prediction: Idea, Solution.) (Charts: TREC 9 titles, 2.5 terms/query, average 70% term recall; TREC 13 titles, 3.1 terms/query, average 66% term recall.) This should be very alarming to you, because whenever you include a keyword in your query you are likely to have excluded 30%-40% of the relevant documents.

Statistics: term recall is query dependent (but for many terms, the variance is small). (Outline: Definition, Importance, Prediction: Idea, Solution.) (Chart: term recall for repeating terms; 364 recurring words from TREC 3-7, 350 topics.) Take some time to explain, or drop?

P(t | R) vs. idf. (Outline: Definition, Importance, Prediction: Idea, Solution.) (Scatter plot: P(t | R) on the y-axis vs. df/N on the x-axis, for TREC 4 desc query terms, as in Greiff, 1998.) Don't talk too quickly; walk the audience through the slide!

Prior Prediction Approaches. (Outline: Definition, Importance, Prediction: Idea, Solution.) Croft/Harper combination match (1979): treats P(t | R) as a tuned constant, or estimates it from pseudo-relevance feedback; when P(t | R) > 0.5, it rewards documents that match more query terms. Greiff's (1998) exploratory data analysis: used idf to predict the overall term weighting, improving over basic BIM. Metzler's (2008) generalized idf: used idf to predict P(t | R); a simple feature (idf), limited success. Missing piece: P(t | R) = term recall = 1 - term mismatch. We don't know R, and need to predict it for queries without judgments.

What Factors can Cause Mismatch? (Outline: Definition, Importance, Prediction: Idea, Solution.) Topic centrality (is the concept central to the topic?): "Laser research related or potentially related to defense", "Welfare laws propounded as reforms". Synonyms (how often do they replace the original term?): "retrieval" == "search" == … Abstractness: "Laser research … defense", "Welfare laws", "Prognosis/viability" (rare & abstract).

Main Questions. Definition. Significance. Mechanism: the causes of mismatch are unnecessary concepts, and terms replaced by synonyms or by more specific terms. Model and solution.

Designing Features to Model the Factors. (Outline: Definition, Importance, Prediction: Implement, Solution.) We need to identify synonyms/searchonyms of a query term in a query dependent way. External resources (WordNet, Wikipedia, or a query log)? They are biased (coverage problems, collection independent) and static (not query dependent): not easy to use, and not used here. Instead: term-term similarity in a concept space, i.e. local LSI (Latent Semantic Indexing) features. (Yi: Why design features? 1. No simple numeric feature correlates well with mismatch. 2. The feature space is huge, while the set of effective features is relatively small. Designing seems like the easiest way to go.) In an attempt to design general and effective features, we made specific choices in our design. There could be better ways to design these features, but the query dependent features we developed try to model the causes of mismatch and have been shown successful. So, let's see how we did it. To find synonyms/searchonyms, we find similar terms in a concept space, in a query dependent way. LSI is like topic modeling: generate a concept space, represent terms in that space, and compute term similarity in that space. (Diagram: the query is run against the document collection by the retrieval engine; the top 500 results are used to build a concept space of 150 dimensions.)

Synonyms from Local LSI. (Outline: Definition, Importance, Prediction: Implement, Solution.) Example queries and P(t | Rq): "Term limitation for US Congress members" 0.9831, "Insurance Coverage which pays for Long Term Care" 0.6885, "Vitamin the cure or cause of human ailments" 0.1071. We have here the top similar terms for each query term. We can compute the features as follows: we want the term itself to be central to the topic; we want the supporting synonyms to be central as well; and we want supporting synonyms not to appear in place of the original term in collection documents.

Synonyms from Local LSI. (Outline: Definition, Importance, Prediction: Implement, Solution.) Example queries and P(t | Rq): "Term limitation for US Congress members" 0.9831, "Insurance Coverage which pays for Long Term Care" 0.6885, "Vitamin the cure or cause of human ailments" 0.1071. Features: (1) magnitude of self similarity (term centrality); (2) average similarity of the supporting terms (concept centrality); (3) how likely synonyms are to replace term t in the collection (replaceability). We have here the top similar terms for each query term, and we compute the features as follows: we want the term itself to be central to the topic; we want the supporting synonyms to be central as well; and we want supporting synonyms not to appear in place of the original term in collection documents.

Features that Model the Factors. (Outline: Definition, Importance, Prediction: Experiment, Solution.) Correlation with P(t | R): term centrality (self-similarity, i.e. length of t, after dimension reduction) 0.3719; concept centrality (average similarity of the supporting terms, i.e. top synonyms) 0.3758; replaceability (how frequently synonyms appear in place of the original query term in collection documents) -0.1872; abstractness (users modify abstract terms with concrete terms, e.g. "effects on the US educational program", "prognosis of a political third party") -0.1278; idf -0.1339. Given these numeric features, for a new term in a new query, we can predict its recall.
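A minimal sketch of how such local-LSI features could be computed in Python, assuming a term-by-document count matrix built from the top retrieved documents; the dissertation's exact dimensionality, similarity measure and replaceability formula may differ (replaceability is defined over collection documents there), so everything below is illustrative only:

    import numpy as np

    def local_lsi_features(term_doc, vocab, query_term, n_dims=150, n_syn=5):
        """term_doc: |V| x |D| numpy count matrix over the top retrieved documents;
        vocab: list of terms in row order. Returns the three LSI-based features."""
        U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
        k = min(n_dims, len(s))
        term_vecs = U[:, :k] * s[:k]              # terms in the local concept space

        t = vocab.index(query_term)
        # (1) Term centrality: vector length of t after dimension reduction.
        term_centrality = float(np.linalg.norm(term_vecs[t]))

        # Cosine similarity of every term to the query term in concept space.
        norms = np.linalg.norm(term_vecs, axis=1) + 1e-9
        sims = term_vecs @ term_vecs[t] / (norms * norms[t])
        syn_ids = [i for i in np.argsort(-sims) if i != t][:n_syn]   # top searchonyms

        # (2) Concept centrality: average similarity of the supporting synonyms.
        concept_centrality = float(np.mean(sims[syn_ids]))

        # (3) Replaceability: how often the synonyms occur where t itself does not
        # (approximated here on the local document set).
        t_docs = term_doc[t] > 0
        syn_docs = (term_doc[syn_ids] > 0).any(axis=0)
        replaceability = float((syn_docs & ~t_docs).sum() / max(1, syn_docs.sum()))

        return term_centrality, concept_centrality, replaceability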

Prediction Model: regression modeling. (Outline: Definition, Importance, Prediction: Implement, Solution.) Model M: <f1, f2, ..., f5> -> P(t | R). Train on one set of queries (known relevance), test on another set of queries (unknown relevance). RBF-kernel support vector regression; we also used boosted decision trees, with similar but slightly lower results.
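A minimal sketch of this regression setup using scikit-learn's RBF-kernel SVR; the dissertation's exact hyperparameters, feature scaling and toolkit are not specified here, and train_features / train_term_recall / test_features are placeholders for the per-term feature vectors and judged recall values:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # One row per (training query, query term) pair:
    # [term_centrality, concept_centrality, replaceability, abstractness, idf]
    X_train = np.array(train_features)
    y_train = np.array(train_term_recall)     # true P(t | R) from relevance judgments

    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.05))
    model.fit(X_train, y_train)

    # Predict term recall for terms of unseen test queries, clipped into [0, 1].
    predicted_recall = np.clip(model.predict(np.array(test_features)), 0.0, 1.0)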

A General View of Retrieval Modeling as Transfer Learning. (Outline: Definition, Importance, Prediction, Solution.) The traditional, restricted view sees a retrieval model as a document classifier for a given query. The more general view: a retrieval model really is a meta-classifier, responsible for many queries, mapping a query to a document classifier. Learning a retrieval model == transfer learning: using knowledge from related tasks (training queries) to classify documents for a new task (the test query). Our features and model facilitate the transfer. A more general view allows more principled investigations and more advanced techniques. I'll briefly mention, for people familiar with the machine learning literature, that our term weight learning framework demands this more general view of the retrieval modeling task … (Prior research focused on a lower level of abstraction than we do here. This is perhaps part of the reason why prior work did not come up with our framework and new features.)

Experiments. (Outline: Definition, Importance, Prediction: Experiment, Solution.) Term recall prediction error: L1 loss (absolute prediction error). Term recall based term weighting retrieval: Mean Average Precision (overall retrieval success), Precision at top 10 (precision at the top of the rank list).

Term Recall Prediction Example. (Outline: Definition, Importance, Prediction: Experiment, Solution.) Query: prognosis/viability of a political third party (trained on TREC 3). True P(t | R): party 0.9796, political 0.7143, third 0.5918, viability 0.0408, prognosis 0.0204. Predicted: party 0.7585, political 0.6523, third 0.6236, viability 0.3080, prognosis 0.2869. The predicted values put the emphasis on the right terms.

Term Recall Prediction Error. (Outline: Definition, Importance, Prediction: Experiment, Solution.) (Chart: L1 loss, the lower the better; with all 5 features the loss is 0.1800, and a linear model is similar.)

Main Questions. Definition. Significance. Mechanism. Model and solution: term recall can be predicted, and we have a framework to design and evaluate features.

Using Predicted P(t | R) in Retrieval Models. (Outline: Definition, Importance, Prediction, Solution: Weighting.) In BM25: plug the predicted P(t | R) into the Binary Independence Model term weight. In Language Modeling (LM): use the Relevance Model form [Lavrenko and Croft 2001]; instead of a uniform query term probability, use the relevance distribution and compute KL(R || D). Only term weighting, no expansion.
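A minimal sketch of the weighting side in Python, plugging a predicted P(t | R) into the RSJ/BIM weight inside a BM25-style scorer; the exact BM25 variant and parameter values used in the dissertation may differ, and all names below are illustrative:

    import math

    def rsj_weight(p_t_r, df, n_docs):
        """BIM/RSJ term weight with an explicit term-recall estimate P(t|R)."""
        p = min(max(p_t_r, 0.01), 0.99)                    # avoid log(0)
        recall_part = math.log(p / (1.0 - p))              # term recall (necessity) part
        idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))   # rareness part
        return recall_part + idf_part

    def bm25_score(doc_tf, doc_len, avg_len, query, p_t_r, df, n_docs, k1=1.2, b=0.75):
        """Recall-weighted BM25: replace the plain idf weight with the RSJ weight."""
        score = 0.0
        for t in query:
            tf = doc_tf.get(t, 0)
            if tf == 0:
                continue
            tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
            score += rsj_weight(p_t_r[t], df[t], n_docs) * tf_part
        return score

On the language modeling side, the same predicted values can be normalized into a distribution over the query terms and used as the query model in a KL-divergence ranker, which is what the Relevance Model form above amounts to when no expansion terms are added.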

Predicted Recall Weighting. (Outline: Definition, Importance, Prediction, Solution: Weighting.) (Chart: 10-25% gain in MAP for recall-weighted LM on desc queries.) 50 queries for each TREC set, about 300 training samples to learn the necessity prediction model. 6 different train-test datasets, from earlier TREC ad hoc retrieval tracks to later Web track data. Ad hoc: informational search on newswire texts; Web: on Web pages. Earlier ad hoc tracks have different collections each year, 0.5 million docs, fairly complete judgments. Datasets: train -> test. "*": significantly better by sign & randomization tests.

Predicted Recall Weighting. (Outline: Definition, Importance, Prediction, Solution: Weighting.) (Chart: 10-20% gain in top precision for recall-weighted LM on desc queries.) Datasets: train -> test. "*": Prec@10 is significantly better. "!": Prec@20 is significantly better.

vs. Relevance Model. (Outline: Definition, Importance, Prediction, Solution: Weighting.) The Relevance Model [Lavrenko and Croft 2001] estimates term weights from term occurrences in the top-ranked documents, unsupervised, via query likelihood. (Scatter plot: RM weight on the x-axis vs. term recall on the y-axis; the RM estimates Pm(t1 | R), Pm(t2 | R) roughly track the true P(t1 | R), P(t2 | R).) Be clear: we are not doing expansion, only re-weighting the original query terms. However, we did use top-ranked documents to generate features for prediction. Our supervised predictions are 5-10% better than the unsupervised RM weights.
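For reference, a minimal sketch of the unsupervised Relevance Model (RM1-style) weights that the supervised predictions are compared against; smoothing and other details from Lavrenko and Croft 2001 are simplified, and all names are illustrative:

    import math
    from collections import defaultdict

    def rm1_weights(top_docs, doc_scores, query_terms):
        """RM1-style estimate: P(t|R) proportional to sum_d P(t|d) P(q|d) over top docs.

        top_docs   : list of {term: count} dictionaries for the top-ranked documents
        doc_scores : log query-likelihood score for each of those documents
        """
        m = max(doc_scores)
        post = [math.exp(s - m) for s in doc_scores]      # unnormalized P(q|d)

        weights = defaultdict(float)
        for doc, w in zip(top_docs, post):
            dlen = sum(doc.values())
            for t, c in doc.items():
                weights[t] += w * c / dlen                # P(t|d) * P(q|d)

        z = sum(weights.values())
        rm = {t: v / z for t, v in weights.items()}
        # For re-weighting only (no expansion), keep just the original query terms.
        return {t: rm.get(t, 0.0) for t in query_terms}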

Main Questions. Definition. Significance. Mechanism. Model and solution: term weighting solves the emphasis problem for long queries; what about the mismatch problem? Starting at 34 mins. New timing for MSR: 48 mins.

Failure Analysis of 44 Topics from TREC 6-8. (Outline: Definition, Importance, Prediction, Solution: Expansion.) (Chart: failure breakdown; the mismatch slice is 27%; remedies: recall term weighting, mismatch-guided expansion; basis: term mismatch prediction.) We need a technique that can solve both the emphasis and the mismatch problems. RIA workshop 2003 (7 top research IR systems, >56 expert*weeks).

Recap: Term Mismatch. (Outline: Definition, Importance, Prediction, Solution: Expansion.) Term mismatch ranges 30%-50% on average, so relevance matching can degrade quickly for multi-word queries. Solution: fix every query term. Even if you make an honest effort to create the query, chances are there is a 30% loss with each query term, and before you know it, by just using two terms, you will be left with less than 50% of the relevant results to search from. E.g. "Merger statement by CEO": idf highlights Merger (idf-high), necessity flags CEO (necessity-low), and the necessity-based choice wins by 25% in MAP (0.022->0.0414, 0.0559->0.0868; Legal 2007 collection, topic 100). [SIGIR 2012]

Conjunctive Normal Form (CNF) Expansion. (Outline: Definition, Importance, Prediction, Solution: Expansion.) Example keyword query: placement of cigarette signs on television watched by children -> Manual CNF: (placement OR place OR promotion OR logo OR sign OR signage OR merchandise) AND (cigarette OR cigar OR tobacco) AND (television OR TV OR cable OR network) AND (watch OR view) AND (children OR teen OR juvenile OR kid OR adolescent). Expressive & compact (1 CNF == 100s of alternatives). Highly effective (this work: 50-300% over the base keyword query). Used by lawyers, librarians and other expert searchers. But CNF queries are tedious & difficult to create, and there is little research on them. Users do not effectively create free-form Boolean queries. Can we guide user effort and make the task easier?
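Computationally, a CNF query is just a conjunction of OR-groups; a document matches when every group is covered. A minimal Python sketch using the example above (assuming stemmed/normalized terms; a real system would instead run this as a structured query, for example an Indri-style #band of #syn groups):

    cnf_query = [
        ["placement", "place", "promotion", "logo", "sign", "signage", "merchandise"],
        ["cigarette", "cigar", "tobacco"],
        ["television", "tv", "cable", "network"],
        ["watch", "view"],
        ["children", "teen", "juvenile", "kid", "adolescent"],
    ]

    def matches_cnf(doc_terms, cnf):
        """True if the document contains at least one term from every OR-group."""
        terms = set(doc_terms)
        return all(any(t in terms for t in group) for group in cnf)

    # One CNF stands in for hundreds of alternative keyword queries: any
    # one-term-per-group combination, e.g. "signage tobacco tv view teen".
    print(matches_cnf(["signage", "tobacco", "tv", "view", "teen", "ads"], cnf_query))  # True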

Diagnostic Intervention. (Outline: Definition, Importance, Prediction, Solution: Expansion.) Query: placement of cigarette signs on television watched by children. Diagnosis: low predicted P(t | R) terms vs. high idf (rare) terms flag different query terms for expansion. Expansion into CNF, with the two diagnoses leading to different partial expansions: (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent), versus (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND (television OR tv OR cable OR network) AND watch AND children. Guide user effort to the more productive areas of the whole search interaction. Goal: the least amount of user effort for near-optimal performance, e.g. expand 2 terms to get 90% of the total improvement.
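A minimal sketch of the diagnosis step, assuming predicted term recall values from the regression model above; the recall values and the choice of k below are hypothetical, for illustration only:

    def diagnose(query_terms, predicted_recall, k=2):
        """Return the k query terms most likely to mismatch (lowest predicted P(t|R)),
        i.e. the terms whose expansion should be most productive for the user."""
        return sorted(query_terms, key=lambda t: predicted_recall[t])[:k]

    query = ["placement", "cigarette", "signs", "television", "watched", "children"]
    p_hat = {"placement": 0.35, "cigarette": 0.80, "signs": 0.45,     # hypothetical
             "television": 0.60, "watched": 0.55, "children": 0.40}  # values
    print(diagnose(query, p_hat))   # -> ['placement', 'children']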

Diagnostic Intervention. (Outline: Definition, Importance, Prediction, Solution: Expansion.) Query: placement of cigarette signs on television watched by children. Diagnosis: low predicted P(t | R) terms vs. high idf (rare) terms. Expansion as a weighted bag-of-words query (original query plus expansion terms), with the two diagnoses leading to different expansion queries: [ 0.9 (placement cigar television watch children) 0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 teen 0.4 juvenile 0.2 kid 0.1 adolescent) ] versus [ 0.9 (placement cigar television watch children) 0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 tv 0.4 cable 0.2 network) ]. Guide user effort to the more productive areas of the whole search interaction. Goal: the least amount of user effort for near-optimal performance, e.g. expand 2 terms to get 90% of the total improvement.

Diagnostic Intervention (We Hope to). (Outline: Definition, Importance, Prediction, Solution: Expansion.) (Flowchart of the intended online user study: the user issues a keyword query, e.g. (child AND cigar); the diagnosis system, using P(t | R) or idf, flags problem query terms, e.g. child > cigar; the user expands them, e.g. child -> teen; query formulation produces a CNF or keyword query, e.g. (child OR teen) AND cigar; the retrieval engine runs it; evaluation follows.)

We Ended up Using Simulation. (Outline: Definition, Importance, Prediction, Solution: Expansion.) (Flowchart: an expert user creates the full CNF offline, e.g. (child OR teen) AND (cigar OR tobacco); the online simulation then replays the pipeline: keyword query, e.g. (child AND cigar); diagnosis system (P(t | R) or idf) flags problem query terms, e.g. child > cigar; simulated user expansion, e.g. child -> teen; query formulation (CNF or keyword), e.g. (child OR teen) AND cigar; retrieval engine; evaluation.) Even though this is a simulation, it uses real user queries that were fully expanded, to simulate partial expansions, and is still very realistic. This allows us to answer many of the same questions we hoped to get answered from an online user study.
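A minimal sketch of the simulation logic, assuming the expert's full CNF is aligned one OR-group per keyword and reusing the diagnose() helper sketched earlier (the dissertation's assumptions A1-A3 about the expansion process are listed on a later backup slide):

    def simulate_partial_expansion(keyword_query, full_cnf, predicted_recall, k=2):
        """Expand only the k diagnosed terms, taking each term's expansion group
        from the expert's full CNF; all other terms stay as single keywords."""
        to_expand = set(diagnose(keyword_query, predicted_recall, k))
        partial_cnf = []
        for term, group in zip(keyword_query, full_cnf):
            partial_cnf.append(list(group) if term in to_expand else [term])
        return partial_cnf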

Diagnostic Intervention Datasets. (Outline: Definition, Importance, Prediction, Solution: Expansion.) Document sets: TREC 2007 Legal track, 7 million tobacco company documents; TREC 4 Ad hoc track, 0.5 million newswire documents. CNF queries, 50 topics per dataset: TREC 2007 created by lawyers, TREC 4 by Univ. of Waterloo. Relevance judgments: TREC 2007 sparse, TREC 4 dense. Evaluation measures: TREC 2007 statAP, TREC 4 MAP.

Results – Diagnosis: P(t | R) vs. idf diagnosis. (Outline: Definition, Importance, Prediction, Solution: Expansion.) (Chart: diagnostic CNF expansion on TREC 4 and TREC 2007, with Full Expansion and No Expansion as reference points.)

Results – Form of Expansion: CNF vs. bag-of-word expansion. (Outline: Definition, Importance, Prediction, Solution: Expansion.) (Chart: P(t | R) guided expansion on TREC 4 and TREC 2007; 50% to 300% gain, with a similar level of gain in top precision for both forms.)

Main Questions. Definition. Significance. Mechanism. Model and solution: term weighting for long queries; term mismatch prediction diagnoses problem terms, and produces simple & effective CNF queries. This is new: previously people expanded the words that were easy to expand, now we can expand the words that really need expansion.

Efficient P(t | R) Prediction. (Outline: Definition, Importance, Prediction: Efficiency, Solution: Weighting.) 3-10X speedup (close to simple keyword retrieval), while maintaining 70-90% of the gain: predict using P(t | R) values from similar, previously-seen queries. [CIKM 2012]
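A minimal sketch of the reuse idea; the actual CIKM 2012 method is richer than a per-term cache, and the fallback prior below (roughly the average term recall from the statistics slides) is illustrative:

    from collections import defaultdict

    class HistoricalRecallPredictor:
        """Cache P(t|R) estimates from previously-seen queries and reuse them,
        avoiding the expensive query-dependent LSI features at query time."""

        def __init__(self, default=0.6):
            self.history = defaultdict(list)
            self.default = default          # fallback for terms never seen before

        def add_observation(self, term, p_t_r):
            self.history[term].append(p_t_r)

        def predict(self, term):
            vals = self.history.get(term)
            return sum(vals) / len(vals) if vals else self.default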

Contributions: two long-standing problems, mismatch & P(t | R). (Outline: Definition, Importance, Prediction, Solution.) Definition and initial quantitative analysis of mismatch (do better/new features and prediction methods exist?). Role of term mismatch in basic retrieval theory, and principled ways to solve term mismatch (what about advanced learning to rank, transfer learning?). Ways to automatically predict term mismatch: initial modeling of the causes of mismatch, features, and efficient prediction using historic information (are there better analyses or modeling of the causes?). This work has connected two long-standing problems in retrieval, the term mismatch problem and the estimation of the term relevance probability P(t | R), and made progress toward solving them.

Contributions: effectiveness of ad hoc retrieval. (Outline: Definition, Importance, Prediction, Solution.) Term weighting & diagnostic expansion (how to do automatic CNF expansion? better formalisms: transfer learning, & more tasks?). Diagnostic intervention: mismatch diagnosis guides targeted expansion (how to diagnose specific types of mismatch problems, or different problems: mismatch/emphasis/precision?). Guide NLP, personalization, etc. to solve the real problem (how to proactively identify search and other user needs?).

Acknowledgements Committee: Jamie Callan, Jaime Carbonell, Yiming Yang, Bruce Croft Ni Lao, Frank Lin, Siddharth Gopal, Jon Elsas, Jaime Arguello, Hui (Grace) Yang, Stephen Robertson, Matthew Lease, Nick Craswell, Yi Zhang (and her group), Jin Young Kim, Yangbo Zhu, Runting Shi, Yi Wu, Hui Tan, Yifan Yanggong, Mingyan Fan, Chengtao Wen Discussions & references & feedback Reviewers: papers & NSF proposal David Fisher, Mark Hoy, David Pane Maintaining the Lemur toolkit Andrea Bastoni and Lorenzo Clemente Maintaining LSI code for Lemur toolkit SVM-light, Stanford parser TREC: data NSF Grant IIS-1018317 Xiangmin Jin, and my whole family and volleyball packs at CMU & SF Bay

END

Prior Definition of Mismatch. Vocabulary mismatch (Furnas et al., 1987): how likely 2 people are to disagree in vocabulary choice; domain experts disagree 80-90% of the time. This led to Latent Semantic Indexing (Deerwester et al., 1988). The query independent version is the average of P(t̄ | Rq) over queries q, so it can be reduced to our query dependent definition of term mismatch.
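Written out (with Q_t denoting the set of queries containing t, a notation assumed here), the reduction is simply an average of the query dependent quantity:

    \text{mismatch}(t) \;=\; \frac{1}{|Q_t|} \sum_{q \in Q_t} P(\bar{t} \mid R_q)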

Knowledge: how necessity explains the behavior of IR techniques. Why weight query bigrams 0.1 while weighting query unigrams 0.9? A bigram decreases term recall; the weight reflects recall. Why are bigrams not gaining stable improvements? Term recall is more of a problem. Why does using document structure (fields, semantic annotation) not improve performance? It improves precision; one needs to solve structural mismatch first. Word sense disambiguation enhances precision; instead, it should be used in mismatch modeling: identify the query term sense for searchonym identification or for learning across queries, and disambiguate collection term senses for more accurate replaceability. Personalization biases results toward what a community/person likes to read (precision); it may work well in a mobile setting, with short queries.

Why Necessity? System Failure Analysis. Reliable Information Access (RIA) workshop (2003): failure analysis of 7 top research IR systems; 11 groups of researchers (both academia & industry); 28 people directly involved in the analysis (senior & junior); >56 human*weeks (analysis + running experiments); 45 topics selected from 150 TREC 6-8 topics (difficult topics). Causes (necessity in various disguises): emphasize 1 aspect, missing another aspect (14+2 topics); emphasize 1 aspect, missing another term (7 topics); missing either 1 of 2 aspects, need both (5 topics); missing a difficult aspect that needs human help (7 topics); need to expand a general term, e.g. "Europe" (4 topics); precision problem, e.g. "euro", not "euro-…" (4 topics).

Local LSI Top Similar Terms Oil spills Insurance coverage which pays for long term care Term limitations for US Congress members Vitamin the cure of or cause for human ailments oil term ail spill 0.5828 0.3310 0.3339 0.4415 0.4210 long 0.2173 limit 0.1696 health 0.0825 tank 0.0986 nurse 0.2114 ballot 0.1115 disease 0.0720 crude 0.0972 care 0.1694 elect 0.1042 basler 0.0718 water 0.0830 home 0.1268 0.0997 dr 0.0695

Necessity vs. idf (and emphasis)

True Necessity Weighting (MAP; columns are TREC 4 / 6 / 8 / 9 / 10 / 12 / 14).
Document collection: disk 2,3 / disk 4,5 / disk 4,5 w/o CR / WT10g / WT10g / .GOV / .GOV2.
Topic numbers: 201-250 / 301-350 / 401-450 / 451-500 / 501-550 / TD1-50 / 751-800.
LM desc – Baseline: 0.1789 / 0.1586 / 0.1923 / 0.2145 / 0.1627 / 0.0239 / 0.1789.
LM desc – Necessity: 0.2703 / 0.2808 / 0.3057 / 0.2770 / 0.2216 / 0.0868 / 0.2674.
Improvement: 51.09% / 77.05% / 58.97% / 29.14% / 36.20% / 261.7% / 49.47%.
p – randomization: 0.0000, 0.0001; p – sign test: 0.0005, 0.0002.
Multinomial-abs: 0.1988 / 0.2088 / 0.2345 / 0.2239 / 0.1653 / 0.0645 / 0.2150.
Multinomial RM: 0.2613 / 0.2660 / 0.2969 / 0.2590 / 0.2259 / 0.1219 / 0.2260.
Okapi desc – Baseline: 0.2055 / 0.1773 / 0.2183 / 0.1944 / 0.1591 / 0.0449 / 0.2058.
Okapi desc – Necessity: 0.2679 / 0.2786 / 0.2894 / 0.2387 / 0.2003 / 0.0776 / 0.2403.
LM title – Baseline: N/A / 0.2362 / 0.2518 / 0.1890 / 0.1577 / 0.0964 / 0.2511.
LM title – Necessity: 0.2514, 0.2606, 0.2137, 0.1042.

Predicted Necessity Weighting: 10-25% gain with the necessity weight (MAP), 10-20% gain in top precision.
Columns: train 3 -> test 4; train 3-5 -> test 6; train 3-7 -> test 8; train 7 -> 8 (test or x-validation).
LM desc – Baseline (tests 4 / 6 / 8): 0.1789 / 0.1586 / 0.1923.
LM desc – Necessity: 0.2261 / 0.1959 / 0.2314 / 0.2333. Improvement: 26.38% / 23.52% / 20.33% / 21.32%.
P@10 – Baseline: 0.4160 / 0.2980 / 0.3860; Necessity: 0.4940 / 0.3420 / 0.4220 / 0.4380.
P@20 – Baseline: 0.3450 / 0.2440 / 0.3310; Necessity: 0.4180 / 0.2900 / 0.3540 / 0.3610.
50 queries for each TREC set, about 300 training samples to learn the necessity prediction model. 6 different train-test datasets, from earlier TREC ad hoc retrieval tracks to later Web track data. Earlier ad hoc tracks have different collections each year, 0.5 million docs, fairly complete judgments.

Predicted Necessity Weighting (ctd.).
Columns: train 3-9 -> test 10; train 9 -> 10; train 11 -> 12; train 13 -> 14 (test or x-validation).
LM desc – Baseline (tests 10 / 12 / 14): 0.1627 / 0.0239 / 0.1789.
LM desc – Necessity: 0.1813 / 0.1810 / 0.0597 / 0.2233. Improvement: 11.43% / 11.25% / 149.8% / 24.82%.
P@10 – Baseline: 0.3180 / 0.0200 / 0.4720; Necessity: 0.3280 / 0.3400 / 0.0467 / 0.5360.
P@20 – Baseline: 0.2400 / 0.0211 / 0.4460; Necessity: 0.2790 / 0.2810 / 0.0411 / 0.5030.
Later Web track collections are much larger: WT10g for TREC 9/10 has 1.7 million docs, .GOV for 11/12 has 1.2 million, .GOV2 for 13/14 has 25 million docs, with a sparser judgment level.

vs. Relevance Model. (Scatter plot: RM weight, x, vs. term recall, y; w1 ~ P(t1 | R), w2 ~ P(t2 | R).) Relevance Model query form: #weight( 1-λ #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) ). Weight-only ≈ expansion; supervised > unsupervised (5-10%).
Results (MAP; test/x-validation sets TREC 4 / 6 / 8 / 10 / 12 / 14):
LM desc – Baseline: 0.1789 / 0.1586 / 0.1923 / 0.1627 / 0.0239 / 0.1789.
Relevance Model desc: 0.2423 / 0.1799 / 0.2352 / 0.1888 / 0.0221 / 0.1774.
RM reweight-Only desc: 0.2215 / 0.1705 / 0.2435 / 0.1700 / 0.0692 / 0.1945.
RM reweight-Trained desc: 0.2330 (4) / 0.1921 (6) / 0.2542 and 0.2563 (8, two train settings) / 0.1809 and 0.1793 (10, two train settings) / 0.0534 (12) / 0.2258 (14).

Feature Correlation with term recall (TREC 4 test set): f1 Term centrality 0.3719; f2 Concept centrality 0.3758; f3 Replaceability -0.1872; f4 DepLeaf (abstractness) 0.1278; f5 idf -0.1339; RM weight 0.6296; Predicted necessity: 0.7989.

vs. Relevance Model. Supervised > unsupervised (5-10%); weight-only ≈ expansion; RM is unstable. Datasets: train -> test.

Efficient Prediction of Term Recall. (Outline: Definition, Importance, Prediction: Idea, Solution.) Currently: slow, query dependent features that require retrieval. Can they be made more effective and more efficient? We need to understand the causes of the query dependent variation, and design a minimal set of efficient features to capture it.

Causes of Query Dependent Variation (1). Cause: different word sense; example: bear (verb) vs. bear (noun).

Causes of Query Dependent Variation (2). Cause: different word use, e.g. the term appearing in a phrase vs. not; example: "Seasonal affective disorder syndrome (SADS)" vs. "Agoraphobia as a disorder".

Causes of Query Dependent Variation (3). Cause: different Boolean semantics of the queries, AND vs. OR; example: "Canada or Mexico" vs. "Canada".

Causes of Query Dependent Variation (4). Cause: different association level with the topic.

Efficient P(t | R) Prediction (2). Causes of P(t | R) variation of the same term across different queries: different query semantics (Canada or Mexico vs. Canada); different word sense (bear as a verb vs. bear as a noun); different word use (Seasonal affective disorder syndrome (SADS) vs. Agoraphobia as a disorder); different association level with the topic. Use historic occurrences to predict the current one: 70-90% of the total gain, 3-10X faster, close to simple keyword retrieval.

Efficient P(t | R) Prediction (2). There is low variation of the same term across different queries, so use historic occurrences to predict the current one: 3-10X faster, close to the slower method in effectiveness & close to real time in speed.

Using Document Structure. Stylistic: XML. Syntactic/semantic: POS, semantic role labels. Current approaches are all precision oriented; need to solve mismatch first?

Motivation. Search is important: it is the information portal. Search is research-worthy: SIGIR, WWW, CIKM, ASIST, ECIR, AIRS. Search is difficult: retrieval modeling is at least as hard as sentence paraphrasing; studied since the 1970s but still not fully understood, with basic problems like mismatch; it must adapt to the changing requirements of the mobile, social and semantic Web, and model the user's needs. (Diagrams: user, query, retrieval model, document collection and results; and the broader view with user activities, retrieval model, and multiple collections.)

Online or Offline Study? An online study requires controlling confounding variables (quality of the expansion terms, the user's prior knowledge of the topic, interaction form & effort) and enrolling many users and repeating experiments. Offline simulations can avoid all of these and still make reasonable observations.

Simulation Assumptions. Real full CNFs are used to simulate partial expansions, under 3 assumptions about the user's expansion process (expansions of individual terms are independent of each other): A1: always the same set of expansion terms for a given query term, no matter which subset of query terms gets expanded. A2: the same sequence of expansion terms, no matter … A3: the keyword query is re-constructed from the CNF query, with a procedure to ensure the vocabulary is faithful to that of the original keyword description. Highly effective CNF queries ensure a reasonable keyword baseline. Talk this slide into the previous slide!

Take Home Message for Ordinary Search Users (people and software)

Be mean! Is the term necessary for document relevance? If not, remove, replace or expand it. Is the term really necessary for document relevance; will all relevant documents contain the term?