Language Models for IR
Debapriyo Majumdar
Information Retrieval, Indian Statistical Institute Kolkata, Spring 2015
Credit for several slides to Jimmy Lin (Maryland) and the course website for CS276A (Stanford)

Standard Probabilistic IR
[Diagram: an information need is expressed as a query, which is matched against documents d1, d2, …, dn from the document collection.]

IR based on Language Model (LM)
[Diagram: each document d1, d2, …, dn in the collection is treated as a generator; the query is viewed as being generated from a document's model.]
 A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
 The LM approach directly exploits that idea!

Formal Language Model
 Traditional generative model: generates strings
   – Finite state machines, regular grammars, etc.
   – Generates: I wish, I wish I wish, I wish I wish I wish, …
   – Does not generate: wish I wish

Stochastic Language Models  An alphabet ∑  What is the probability of some string being generated? 0.2the 0.1a 0.01man 0.01woman 0.03said 0.02likes … themanlikesthewoman multiply? Model M P(the man likes the woman | M) = Yes, if the words occur independently! Probability?

Stochastic Language Models  A statistical model for generating text  Probability distribution over strings in a given language M P ( | M )= P ( | M ) P ( | M, )

Unigram and higher-order models
 Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)
 Unigram Language Models: P(w1) P(w2) P(w3) P(w4) – easy, effective!
 Bigram (generally, n-gram) Language Models: P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
 Other Language Models
   – Grammar-based models (PCFGs), etc.
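
To make the two factorizations concrete, here is a small sketch contrasting unigram and bigram scoring; the probability tables are made up for illustration, not trained values:

```python
# Made-up probability tables, for illustration only.
unigram = {"i": 0.2, "wish": 0.1}
bigram = {("<s>", "i"): 0.5, ("i", "wish"): 0.4, ("wish", "i"): 0.3}

def p_unigram(words):
    # P(w1..wn) = P(w1) P(w2) ... P(wn)
    p = 1.0
    for w in words:
        p *= unigram.get(w, 0.0)
    return p

def p_bigram(words):
    # P(w1..wn) = P(w1|<s>) P(w2|w1) ... P(wn|w(n-1)); <s> marks the start
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram.get((prev, w), 0.0)
        prev = w
    return p

print(p_unigram("i wish i wish".split()))  # 0.2*0.1*0.2*0.1 = 0.0004
print(p_bigram("i wish i wish".split()))   # 0.5*0.4*0.3*0.4 = 0.024
```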

Using Language Models in IR
 Treat each document as the basis for a model (e.g., unigram statistics)
 Rank document d based on P(d | q)
 P(d | q) = P(q | d) × P(d) / P(q)
   – P(q) is the same for all documents, so we can ignore it
   – P(d) [the prior] is often treated as the same for all d, but we could use criteria like authority, length, genre
   – P(q | d) is the probability of q given d's model
 A very general formal approach

The fundamental problem of LMs
 Usually we don't know the model M
   – But we have a sample of text representative of that model
 Estimate a language model M̂ from the sample
 Then compute the observation probability P(string | M̂)

Ranking with Language Models
 Build a model M_D for every document D
 Rank document d based on P(M_D | q)
 Using Bayes' Theorem: P(M_D | q) = P(q | M_D) × P(M_D) / P(q)
 Same as ranking by P(q | M_D):
   – P(q) is the same for all documents; doesn't change ranks
   – P(M_D) [the prior] is assumed to be the same for all d

What does it mean?
Ranking by P(M_D | q) – asking each document model 1, 2, …, n "Hey, what's the probability this query came from you?" –
is the same as ranking by P(q | M_D) – asking each model "Hey, what's the probability that you generated this query?"

Ranking Models?
Ranking by P(q | M_D) – asking each model 1, 2, …, n "Hey, what's the probability that you generated this query?" – is the same as ranking the documents themselves, since model i is a model of document i.

Building Document Models
 How do we build a language model for a document?
 Physical metaphor: what's in the urn? What colored balls, and how many of each?

Simple Approach
 Simply count the frequencies in the document = maximum likelihood estimate

   P(w | M_S) = #(w, S) / |S|

   where #(w, S) = number of times w occurs in sequence S, and |S| = length of S
 Urn example: a color drawn twice from a sample of four balls gets probability 1/2; a color drawn once gets 1/4

Query generation probability (1)
 Ranking formula: p(Q, d) = p(d) · p(Q | M_d)
 The probability of producing the query given the language model of document d, using MLE, is:

   p̂(Q | M_d) = Π_{t ∈ Q} p̂_mle(t | M_d) = Π_{t ∈ Q} tf(t,d) / dl(d)

   where M_d is the language model of document d, tf(t,d) is the raw tf of term t in document d, and dl(d) is the total number of tokens in document d
 Unigram assumption: given a particular language model, the query terms occur independently
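
A sketch of this MLE query-generation formula in Python; the function names are illustrative, not from any particular toolkit:

```python
from collections import Counter

def mle_model(doc_tokens):
    """p_mle(t | M_d) = tf(t, d) / dl(d)."""
    counts, total = Counter(doc_tokens), len(doc_tokens)
    return {t: c / total for t, c in counts.items()}

def query_likelihood(query_tokens, model):
    """p(Q | M_d): product of term probabilities (unigram assumption)."""
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)  # zero for unseen terms -- see the next slide
    return p

doc = "the man likes the woman".split()
print(query_likelihood(["man", "likes"], mle_model(doc)))  # (1/5)*(1/5) = 0.04
```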

Zero-Frequency Problem
 Suppose some event is not in our observation S
   – The model will assign zero probability to that event
 Urn example: with P = 1/2 and P = 1/4 estimated for the observed colors, a sequence containing one unseen color scores (1/2) × (1/4) × 0 × (1/4) = 0!

Why is this a bad idea?
 Modeling a document
   – Just because a word didn't appear doesn't mean it'll never appear…
   – But it is safe to assume that unseen words are rare
 Think of the document model as a topic
   – There are many documents that can be written about a single topic
   – We're trying to figure out what the model is based on just one document
   – Analogy: fishes in the sea
 Practical effect: assigning zero probability to unseen words forces exact match
   – But partial matches are useful also!

Smoothing
The solution: "smooth" the word probabilities.
[Plot: P(w) vs. w – the maximum likelihood estimate drops to zero for unseen words; the smoothed probability distribution shifts some probability mass onto them.]

How do you smooth?
 Assign some small probability to unseen events
   – But remember to take away "probability mass" from other events
 Simplest example: for words you didn't see, pretend you saw each of them once (see the sketch below)
 Other more sophisticated methods:
   – Absolute discounting
   – Linear interpolation, Jelinek-Mercer
   – Dirichlet, Witten-Bell
   – Good-Turing
   – …
 Lots of performance to be gotten out of smoothing!
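
A sketch of the simplest idea above, add-one (Laplace) smoothing: every vocabulary word's count is incremented by one, and the denominator grows by the vocabulary size. Names and the toy vocabulary are illustrative:

```python
from collections import Counter

def laplace_model(doc_tokens, vocabulary):
    """P(w | M_d) = (tf(w, d) + 1) / (|d| + |V|): unseen words get a small probability."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + len(vocabulary)
    return {w: (counts[w] + 1) / denom for w in vocabulary}

vocab = {"the", "man", "likes", "woman", "fish"}
model = laplace_model("the man likes the woman".split(), vocab)
print(model["fish"])  # unseen word now gets (0 + 1) / (5 + 5) = 0.1 instead of 0
```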

Mixture model
 P(w | d) = λ P_mle(w | M_d) + (1 – λ) P_mle(w | M_c)
 Mixes the probability from the document with the general collection frequency of the word
 Correctly setting λ is very important
 A high value of lambda makes the search "conjunctive-like" – suitable for short queries
 A low value is more suitable for long queries
 Can tune λ to optimize performance
   – Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
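
A minimal sketch of this Jelinek-Mercer mixture, assuming a toy two-document collection; function and variable names are illustrative:

```python
from collections import Counter

def jm_score(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    """P(q | d) = prod over t of [ lam*P_mle(t|M_d) + (1-lam)*P_mle(t|M_c) ]."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    p = 1.0
    for t in query_tokens:
        p_doc = doc_counts[t] / len(doc_tokens)
        p_coll = coll_counts[t] / len(collection_tokens)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

docs = ["the man likes the woman".split(), "the woman said hello".split()]
collection = [w for d in docs for w in d]  # pool all documents for M_c
for d in docs:
    print(jm_score(["man", "likes"], d, collection, lam=0.5))
# ~0.0242 for the first document; ~0.0031 for the second -- a partial match
# now gets a small nonzero score instead of zero.
```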

Basic mixture model summary
 General formulation of the LM for IR:

   p(Q, d) = p(d) · Π_{t ∈ Q} [ (1 – λ) p(t | M_c) + λ p(t | M_d) ]

   with p(t | M_c) from the general (collection) language model and p(t | M_d) from the individual-document model
 The user has a document in mind, and generates the query from this document
 The equation represents the probability that the document the user had in mind was in fact this one

LM vs. Prob. Model for IR
 The main difference is whether "relevance" figures explicitly in the model or not
   – The LM approach attempts to do away with modeling relevance
 The LM approach assumes that documents and expressions of information problems are of the same type
 Computationally tractable, intuitively appealing

LM vs. Prob. Model for IR
 Problems of the basic LM approach
   – The assumption of equivalence between document and information-problem representation is unrealistic
   – Very simple models of language
   – Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance
   – Can't easily accommodate phrases, passages, Boolean operators
 Current extensions focus on putting relevance back into the model, etc.

Language models: pro & con
 Novel way of looking at the problem of text retrieval based on probabilistic language modeling
   – Conceptually simple and explanatory
   – Formal mathematical model
   – Natural use of collection statistics, not heuristics (almost…)
 LMs provide effective retrieval, and can be improved to the extent that the following conditions can be met:
   – Our language models are accurate representations of the data
   – Users have some sense of term distribution*
   *Or we get more sophisticated with a translation model

Comparison With Vector Space  There’s some relation to traditional tf.idf models: – (unscaled) term frequency is directly in model – the probabilities do length normalization of term frequencies – the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

Comparison With Vector Space  Similar in some ways – Term weights based on frequency – Terms often used as if they were independent – Inverse document/collection frequency used – Some form of length normalization useful  Different in others – Based on probability rather than similarity Intuitions are probabilistic rather than geometric – Details of use of document length and term, document, and collection frequency differ

Resources
J.M. Ponte and W.B. Croft. 1998. A language modelling approach to information retrieval. In SIGIR 21.
D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. In ECDL 2, pp. 569–584.
A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. In SIGIR 22, pp. 222–229.
D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. In SIGIR 22, pp. 214–221.
[Several relevant newer papers at SIGIR 23–25, 2000–2002.]
Workshop on Language Modeling and Information Retrieval, CMU, 2001.
The Lemur Toolkit for Language Modeling and Information Retrieval (2.cs.cmu.edu/~lemur/): CMU/UMass LM and IR system in C(++), currently actively developed.