1 Integrating Term Relationships into Language Models for Information Retrieval Jian-Yun Nie RALI, Dept. IRO University of Montreal, Canada.


1 Integrating Term Relationships into Language Models for Information Retrieval Jian-Yun Nie RALI, Dept. IRO University of Montreal, Canada

2 Overview Language model Interesting theoretical framework Efficient probability estimation and smoothing methods Good effectiveness Limitations Most approaches use uni-grams, and independence assumption Just a different way to weight terms? Extensions Integrating term relationships? Experiments Conclusions

3 Principle of language modeling Goal: create a statistical model so that one can calculate the probability of a sequence of words s = w_1, w_2, …, w_n in a language. General approach: estimate the probabilities of the observed elements from a training corpus; the model then assigns a probability P(s) to any sequence s.

4 Examples of utilization Speech recognition Training corpus = signals + words Probabilities: P(word|signal), P(word2|word1) Utilization: signals → sequence of words Statistical tagging Training corpus = words + tags (n, v) Probabilities: P(word|tag), P(tag2|tag1) Utilization: sentence → sequence of tags

5 Prob. of a sequence of words P(s) = P(w_1) P(w_2|w_1) … P(w_n|w_1 … w_{n-1}) = ∏_i P(w_i|h_i), where h_i = w_1 … w_{i-1} is the history of w_i. Elements to be estimated: P(w_i|h_i). If h_i is too long, one cannot observe (h_i, w_i) in the training corpus, and P(w_i|h_i) is hard to generalize. Solution: limit the length of h_i.

6 n-grams Limit h_i to the n-1 preceding words. Most used cases: Uni-gram: P(s) = ∏_i P(w_i) Bi-gram: P(s) = ∏_i P(w_i | w_{i-1}) Tri-gram: P(s) = ∏_i P(w_i | w_{i-2} w_{i-1})

7 A simple example (corpus = words, bi-grams) Uni-gram: P(I, talk) = P(I) P(talk) = 0.001 * … ; P(I, talks) = P(I) P(talks) = 0.001 * … Bi-gram: P(I, talk) = P(I | #) P(talk | I) = 0.008 * 0.2 ; P(I, talks) = P(I | #) P(talks | I) = 0.008 * 0
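The example above can be reproduced with a small maximum-likelihood estimator. A minimal sketch; the toy corpus below is invented for illustration, so the probabilities differ from the slide's:

```python
from collections import Counter

def train_ngrams(sentences):
    """Estimate uni-gram and bi-gram MLE probabilities from a toy corpus.
    '#' marks the start of a sentence, as on the slide. Note: '#' is
    counted in the uni-gram totals too; this is only a sketch."""
    uni = Counter()
    bi = Counter()
    for s in sentences:
        tokens = ["#"] + s.split()
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    p_bi = {(w1, w2): c / uni[w1] for (w1, w2), c in bi.items()}
    return p_uni, p_bi

corpus = ["I talk to you", "they talk a lot", "I see"]
p_uni, p_bi = train_ngrams(corpus)
# Uni-gram: P(I, talk) = P(I) * P(talk)
p_seq_uni = p_uni["I"] * p_uni["talk"]
# Bi-gram: P(I, talk) = P(I | #) * P(talk | I); unseen bi-grams get 0
p_seq_bi = p_bi[("#", "I")] * p_bi.get(("I", "talk"), 0.0)
```

As on the slide, any bi-gram unseen in training (e.g. "I talks") gets probability 0 without smoothing.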

8 Estimation History: short ↔ long; modeling: coarse ↔ refined; estimation: easy ↔ difficult. Maximum likelihood estimation (MLE): P(w_i | h_i) = c(h_i w_i) / c(h_i). If (h_i w_i) is not observed in the training corpus, P(w_i|h_i) = 0: e.g. P(they, talk) = P(they | #) P(talk | they) = 0 → smoothing

9 Smoothing Goal: assign a low (non-zero) probability to words or n-grams not observed in the training corpus, redistributing some probability mass from the observed words (smoothed distribution vs. P_MLE).

10 Smoothing methods Change the frequency of occurrences of n-grams: Laplace smoothing (add-one): P(w) = (c(w) + 1) / (N + |V|) Good-Turing: change the frequency r to r* = (r + 1) n_{r+1} / n_r, where n_r = no. of n-grams of freq. r
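Both count-modifying schemes can be sketched in a few lines; the toy counts and vocabulary size are illustrative, not from the slides:

```python
from collections import Counter

def laplace(counts, vocab_size):
    """Add-one smoothing: P(w) = (c(w) + 1) / (N + |V|)."""
    n = sum(counts.values())
    return lambda w: (counts.get(w, 0) + 1) / (n + vocab_size)

def good_turing_adjusted_count(r, freq_of_freqs):
    """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r,
    where freq_of_freqs[r] = number of n-grams seen r times."""
    n_r = freq_of_freqs.get(r, 0)
    n_r1 = freq_of_freqs.get(r + 1, 0)
    if n_r == 0:
        return 0.0
    return (r + 1) * n_r1 / n_r

counts = Counter({"data": 3, "analysis": 2, "model": 1, "query": 1})
p = laplace(counts, vocab_size=10)
```

With these counts (N = 7, |V| = 10), an unseen word receives probability 1/17 instead of 0.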

11 Smoothing (cont’d) Combine a model with a lower-order model: Backoff (Katz): fall back to the lower-order model for unseen n-grams. Interpolation (Jelinek-Mercer): P(w|h) = λ P_ML(w|h) + (1-λ) P(w|h'), where h' is a shorter history. In IR, combine the document model with the corpus model.

12 Smoothing (cont’d) Dirichlet: P(w|d) = (c(w,d) + μ P(w|C)) / (|d| + μ) Two-stage: first smooth the document model with a Dirichlet prior to explain unseen words, then interpolate with the corpus model: P(w|d) = (1-λ) (c(w,d) + μ P(w|C)) / (|d| + μ) + λ P(w|C)
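The two document-smoothing formulas most used in IR can be sketched directly; the function names and the background model `p_corpus` are illustrative, and the Jelinek-Mercer parameterization shown (1-λ on the document model) is one common convention:

```python
def p_jm(w, doc_counts, doc_len, p_corpus, lam=0.5):
    """Jelinek-Mercer: P(w|d) = (1-lam) * c(w,d)/|d| + lam * P(w|C)."""
    p_ml = doc_counts.get(w, 0) / doc_len
    return (1 - lam) * p_ml + lam * p_corpus(w)

def p_dirichlet(w, doc_counts, doc_len, p_corpus, mu=2000):
    """Dirichlet: P(w|d) = (c(w,d) + mu * P(w|C)) / (|d| + mu)."""
    return (doc_counts.get(w, 0) + mu * p_corpus(w)) / (doc_len + mu)
```

Both give every word a non-zero probability as long as the corpus model does, which is what makes P(Q|D) usable for ranking.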

13 Using LM in IR Principle 1: Document D: language model P(w|M_D); query Q = sequence of words q_1, q_2, …, q_n (uni-grams); matching: P(Q|M_D). Principle 2: Query Q: language model P(w|M_Q); document D = sequence of words d_1, d_2, …, d_n; matching: P(D|M_Q). Principle 3: Document D: language model P(w|M_D); query Q: language model P(w|M_Q); matching: comparison between P(w|M_D) and P(w|M_Q). Principle 4: Translate D to Q.

14 Principle 1: Document LM Document D: model M_D. Query Q: q_1, q_2, …, q_n (uni-grams). P(Q|D) = P(Q|M_D) = P(q_1|M_D) P(q_2|M_D) … P(q_n|M_D). Problem of smoothing: a short document gives a coarse M_D with unseen words; smoothing changes word frequencies and smooths with the corpus.
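Principle 1 can be sketched as a query-likelihood scorer over a Dirichlet-smoothed document model; all names here are illustrative:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, corpus_terms, mu=2000):
    """log P(Q|M_D) with a Dirichlet-smoothed document model (Principle 1).
    corpus_terms plays the role of the background collection C."""
    d = Counter(doc_terms)
    c = Counter(corpus_terms)
    n_c = len(corpus_terms)
    score = 0.0
    for q in query_terms:
        p_c = c.get(q, 0) / n_c                      # corpus model P(q|C)
        p = (d.get(q, 0) + mu * p_c) / (len(doc_terms) + mu)
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

A query term present in the document raises the score relative to one explained only by the corpus model.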

15 Determine λ Expectation maximization (EM): choose the λ that maximizes the likelihood of the text. Initialize λ; E-step; M-step; loop on E and M until convergence.
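The EM loop above can be sketched for a two-component mixture of a document model and a corpus model; the function names and toy distributions are hypothetical:

```python
def em_lambda(text, p_doc, p_corpus, iters=20, lam=0.5):
    """EM for the mixture weight lam in P(w) = lam*p_doc(w) + (1-lam)*p_corpus(w).
    E-step: posterior that each token was generated by the document model;
    M-step: lam = average of those posteriors."""
    for _ in range(iters):
        post = []
        for w in text:
            a = lam * p_doc(w)
            b = (1 - lam) * p_corpus(w)
            post.append(a / (a + b))
        lam = sum(post) / len(post)       # M-step
    return lam
```

On text that the document model explains better than the corpus model, λ climbs toward 1; when both explain the text equally, it stays at its initialization.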

16 Principle 2: Query LM Query Q: M_Q. Document D: d_1, d_2, …, d_n. Matching: P(Q|D) = P(D|M_Q) P(M_Q) / P(D) ∝ P(D|M_Q) / P(D). The query is even shorter than a document, so P(D|M_Q) is difficult to estimate; not directly used.

17 Principle 3: Doc. likelihood / divergence between M_D and M_Q Question: is the document likelihood increased when a query is submitted? (Is the query likelihood increased when D is retrieved?) Rank by log [P(Q|D) / P(Q)], with P(Q|D) calculated with P(Q|M_D) and P(Q) estimated as P(Q|M_C).

18 Divergence of M_D and M_Q KL: Kullback-Leibler divergence, measuring the divergence between two probability distributions: KL(M_Q || M_D) = Σ_w P(w|M_Q) log [P(w|M_Q) / P(w|M_D)]. Assume Q follows a multinomial distribution; ranking by -KL(M_Q || M_D) is then equivalent to ranking by the cross entropy Σ_w P(w|M_Q) log P(w|M_D).
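Since the entropy of M_Q is the same for every document, ranking by negative KL reduces to the cross-entropy sum. A minimal sketch over explicit word distributions (all toy values invented):

```python
import math

def neg_kl_score(p_q, p_d):
    """Rank documents by -KL(M_Q || M_D): only the cross-entropy term
    sum_w P(w|M_Q) log P(w|M_D) varies across documents, so that is
    what we compute. p_d must be smoothed (no zero probabilities)."""
    return sum(pq * math.log(p_d[w]) for w, pq in p_q.items() if pq > 0)
```

A document model closer to the query model scores higher, as expected of a divergence-based ranking.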

19 Principle 4: IR as translation Noisy channel: message sent → channel → message received. Transmit D through the channel and receive Q: P(Q|D) = ∏_i Σ_j P(q_i|w_j) P(w_j|D), where P(w_j|D) = prob. that D generates w_j and P(q_i|w_j) = prob. of translating w_j by q_i. Possibility to consider relationships between words. How to estimate P(q_i|w_j)? Berger & Lafferty: pseudo-parallel texts (align sentence with paragraph).
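The translation-model matching score follows directly from the formula; the toy document model and translation table below are invented for illustration:

```python
def translation_likelihood(query, doc_model, trans):
    """P(Q|D) = prod_i sum_j P(q_i|w_j) P(w_j|D).
    doc_model: dict w -> P(w|D); trans: dict (q, w) -> P(q|w)."""
    p = 1.0
    for q in query:
        p *= sum(trans.get((q, w), 0.0) * pw for w, pw in doc_model.items())
    return p

doc_model = {"car": 0.7, "engine": 0.3}
trans = {("automobile", "car"): 0.3, ("car", "car"): 0.6,
         ("engine", "engine"): 0.5}
```

Because "automobile" translates "car", the query term matches a document that never contains it; this is how the model captures word relationships.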

20 Summary on LM Can a query be generated from a document model? Does a document become more likely when a query is submitted (or reverse)? Is a query a "translation" of a document? Smoothing is crucial Often use uni-grams

21 Beyond uni-grams Bi-grams. Bi-terms: do not consider word order in bi-grams, e.g. (analysis, data) ≡ (data, analysis).

22 Relevance model LM does not capture “relevance”. Using pseudo-relevance feedback: construct a “relevance” model from the top-ranked documents. Document model + relevance model (feedback) + corpus model.

23 Model using document clusters Document smoothing with the collection: some documents are more similar to the given document (document clustering). Different levels of smoothing: (document + cluster) + collection.
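The two-level (document + cluster) + collection smoothing can be sketched as nested mixtures; the weights λ and β and the component models are illustrative:

```python
def p_cluster_smoothed(w, p_doc, p_cluster, p_coll, lam=0.7, beta=0.8):
    """Two-level smoothing: first mix the document model with its
    cluster's model, then mix the result with the whole collection."""
    p_dc = beta * p_doc(w) + (1 - beta) * p_cluster(w)   # document + cluster
    return lam * p_dc + (1 - lam) * p_coll(w)            # + collection
```

The cluster acts as a background model that is closer to the document's topic than the full collection, so unseen but topically related words get a higher probability than under corpus-only smoothing.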

24 Experimental results LM vs. vector space model with tf*idf (SMART): usually better. LM vs. prob. model (Okapi): often similar. Bi-gram LM vs. uni-gram LM: slight improvements (but with a much larger model).

25 Comparison: LM vs. tf*idf log P(Q|D) ~ VSM with tf*idf and document length normalization; smoothing ~ idf + length normalization.

26 Contributions of LM to IR Well founded theoretical framework Exploit the mass of data available Techniques of smoothing for probability estimation Explain some empirical and heuristic methods by smoothing Interesting experimental results Existing tools for IR using LM (Lemur)

27 Problems Increased complexity. Limitation to uni-grams: no dependence between words. Problems with bi-grams: they consider all adjacent word pairs (noise), cannot capture more distant dependencies, and word order is not always important for IR. Entirely data-driven, no external knowledge (e.g. the relation between programming and computer). Logic well hidden behind numbers: key = smoothing; maybe too much emphasis on smoothing, and too little on the underlying logic. Direct comparison between D and Q requires that they contain identical words (except in the translation model); cannot deal with synonymy and polysemy.

28 Extensions Classical LM: document (t1, t2, …) ↔ query (independent terms). Extension 1: dependent terms within document and query (e.g. comp. archi. as a compound). Extension 2: term relations between document and query (e.g. prog. → comp.).

29 Extensions (1): link terms in document and query Dependence LM (Gao et al. 04): capture more distant dependencies within a sentence. Syntactic analysis; statistical analysis. Only retain the most probable dependencies in the query. Example: (how) (has) affirmative action affected (the) construction industry

30 Estimate the prob. of links (EM) For a corpus C: 1. Initialization: link each pair of words with a window of 3 words 2. For each sentence in C: Apply the link prob. to select the strongest links that cover the sentence 3. Re-estimate link prob. 4. Repeat 2 and 3

31 Calculation of P(Q|D) 1. Determine the links in Q (the required links). 2. Calculate the likelihood of Q (words and links): requirements on both the words and the bi-term links.
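The two steps above can be sketched as a score over query words plus one term per retained link. This is a hedged simplification, not the exact factorization of the Gao et al. (04) model; all function names are hypothetical:

```python
import math

def dependence_score(query, links, p_word, p_link):
    """Simplified dependence-model score: log-likelihood of the query
    words under the document model, plus a log term for each link
    (q_i, q_j) retained in the query's dependency structure."""
    score = sum(math.log(p_word(q)) for q in query)
    score += sum(math.log(p_link(qi, qj)) for qi, qj in links)
    return score
```

A document that supports both the query words and their dependencies scores higher than one that only matches the words.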

32 Experiments

33 Extension (2): Inference in IR Logical deduction: (A → B) ∧ (B → C) ⊨ A → C. In IR: (D → Q') ∧ (Q' → Q) ⊨ D → Q (direct matching + inference on the query); (D → D') ∧ (D' → Q) ⊨ D → Q (inference on the document + direct matching).

34 How to make inference in IR? Language modeling Translation model: P(Q|D) = ∏_i Σ_j P(q_i|w_j) P(w_j|D), vs. classical LM: P(Q|D) = ∏_i P(q_i|M_D).

35 How to make inference in IR simply? Language modeling Term relationships from co-occurrences: use the document collection to estimate P(w_2|w_1). Term relationships from a thesaurus: use term relationships in WordNet (synonymy, hypernymy, …) + co-occurrence information to estimate their probabilities. Combine both through smoothing.

36 Illustration: Bayesian network A query term q_i is generated from the document's words w_1, w_2, …, w_n through three component models interpolated with weights λ_1, λ_2, λ_3: a WordNet relation model (P_WN(q_i|w), P_WN(w_i|D)), a co-occurrence model (P_CO(q_i|w), P_CO(w_i|D)), and a uni-gram model (P_UG(w_i|D)).

37 Experimental results (Cao et al. 05) Collections: WSJ, AP, SJM. Models: unigram model, dependency model, LM with unique WN relations, LM with typed WN relations; for each, AvgP (with % change over the unigram model) and recall are reported, with * and ** marking significant improvements. Recall with WN relations: WSJ 1706/2172 and 1719/2172; AP 3523/6101 and 3530/6101; SJM …/2322. Integrating different types of relationships in LM may improve effectiveness.

38 Doc expansion vs. query expansion Document expansion; query expansion.

39 Question: how to implement QE in LM? Considered a difficult task. KL divergence: rank by Σ_w P(w|M_Q) log P(w|M_D); expansion can then be done by modifying the query model M_Q.

40 Expanding the query model P'(w|M_Q) = λ P(w|M_Q) + (1-λ) Σ_t P(w|t) P(t|M_Q): the classical LM term plus a relation-model term.
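The mixture above can be sketched directly; `rel` maps (w, t) to the relation probability P(w|t) and is assumed normalized over w for each t (the toy values below are invented):

```python
def expand_query_model(p_q, rel, lam=0.6):
    """Expanded query model:
    P'(w|M_Q) = lam * P(w|M_Q) + (1-lam) * sum_t P(w|t) P(t|M_Q).
    p_q: dict term -> P(t|M_Q); rel: dict (w, t) -> P(w|t)."""
    vocab = set(p_q) | {w for (w, t) in rel}
    expanded = {}
    for w in vocab:
        p_rel = sum(p * p_q.get(t, 0.0)
                    for (w2, t), p in rel.items() if w2 == w)
        expanded[w] = lam * p_q.get(w, 0.0) + (1 - lam) * p_rel
    return expanded
```

Terms related to the original query words (e.g. "nasa" inferred from "space") receive non-zero probability in the expanded model and so contribute to the KL score.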

41 Using co-occurrence information Using an external knowledge base (e.g. Wordnet) Other term relationships

42 Defining the relational model HAL (Hyperspace Analogue to Language): a special co-occurrence matrix (Bruza & Song). Example: “the effects of pollution on the population”: “effects” and “pollution” co-occur in 2 windows (L = 3), so HAL(effects, pollution) = 2 = L - distance + 1.
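The HAL weighting can be sketched in a few lines; with window L = 3 this reproduces HAL(effects, pollution) = 2 for the slide's sentence:

```python
from collections import defaultdict

def hal_matrix(tokens, window=3):
    """HAL: for each pair (t_i, t_j) with j > i and distance
    d = j - i <= window, add (window - d + 1) to HAL[(t_i, t_j)]."""
    hal = defaultdict(float)
    for i, t in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j < len(tokens):
                hal[(t, tokens[j])] += window - d + 1
    return hal

tokens = "the effects of pollution on the population".split()
hal = hal_matrix(tokens, window=3)
```

Closer pairs get larger weights (distance 1 scores 3, distance 3 scores 1), so the matrix encodes graded, directional co-occurrence rather than raw counts.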

43 From HAL to inference relation Example: the HAL vector for superconductors: … Combining terms: space ⊕ program, with different importance for space and program.

44 From HAL to inference relation (information flow) space ⊕ program |- {program:1.00, space:1.00, nasa:0.97, new:0.97, U.S.:0.96, agency:0.95, shuttle:0.95, …, science:0.88, scheduled:0.87, reagan:0.87, director:0.87, programs:0.87, air:0.87, put:0.87, center:0.87, billion:0.87, aeronautics:0.87, satellite:0.87, …}

45 Two types of term relationship Pairwise: P(t_2|t_1). Inference relationship: from a combination of terms to a term (information flow). Inference relationships are less ambiguous and produce less noise (Qiu & Frei 93).

46 1. Query expansion with pairwise term relationships Select a set (85) of strongest HAL relationships

47 2. Query expansion with IF term relationships 85 strongest IF relationships

48 Experiments (Bai et al. 05) (AP89 collection, queries 1-50) AvgPr (change over the LM baseline): Jelinek-Mercer +5% (QE with HAL), +30% (QE with IF), +35% (QE with IF & FB); Dirichlet +4%, +25%, +32%; Absolute +5%, +26%, +35%; Two-Stage +3%, +25%, +31%. Recall (out of 3301): Jelinek-Mercer 1542 baseline, +3% with HAL, 2240 (+45%) with IF, 2366 (+53%) with IF & FB; Dirichlet 1569, +2%, 2246 (+43%), 2356 (+50%); Absolute 1560, +3%, 2151 (+38%), 2289 (+47%); Two-Stage 1573, +1%, 2221 (+41%), 2356 (+50%).

49 Experiments (AP88-90, topics …) AvgPr (change over the LM baseline): Jelinek-Mercer +5% (QE with HAL), +29% (QE with IF), +51% (QE with IF & FB); Dirichlet +4%, +17%, +35%; Absolute +5%, +22%, +43%; Two-Stage +4%, +19%, +35%. Recall (out of 4805): Jelinek-Mercer 3061 baseline, +3% with HAL, 3675 (+20%) with IF, 3895 (+27%) with IF & FB; Dirichlet 3156, +3%, 3738 (+18%), 3930 (+25%); Absolute 3031, +3%, 3572 (+18%), 3842 (+27%); Two-Stage 3134, +2%, 3713 (+18%), 3901 (+24%).

50 Observations Possible to implement query/document expansion in LM. Expansion using inference relationships is more context-sensitive: better than context-independent expansion (Qiu & Frei). Every kind of knowledge is useful (co-occurrence, WordNet, IF relationships, etc.). LM with some inferential power.

51 Conclusions LM = suitable model for IR Classical LM = independent terms (n-grams) Possibility to integrate term relationships: Within document and within query (link constraint ~ compound term) Between document and query (inference) Both (future work) Automatic parameter estimation = powerful tool for data-driven IR First experiments showed encouraging results