Language Model Approach to IR

Slides:



Advertisements
Similar presentations
Term-Specific Smoothing On a paper by D. Hiemstra Alexandru A. Chitea Universität des SaarlandesMarch 10, 2005 Seminar CS 555 – Language.
Advertisements

Information Retrieval and Organisation Chapter 12 Language Models for Information Retrieval Dell Zhang Birkbeck, University of London.
Information Retrieval and Organisation Chapter 11 Probabilistic Information Retrieval Dell Zhang Birkbeck, University of London.
Information Retrieval – Language models for IR
CS276A Text Retrieval and Mining Lecture 12 [Borrows slides from Viktor Lavrenko and Chengxiang Zhai]
Metadata in Carrot II Current metadata –TF.IDF for both documents and collections –Full-text index –Metadata are transferred between different nodes Potential.
Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.
1 Language Models for TR (Lecture for CS410-CXZ Text Info Systems) Feb. 25, 2011 ChengXiang Zhai Department of Computer Science University of Illinois,
Language Models Hongning Wang
Information Retrieval Models: Probabilistic Models
Language Model based Information Retrieval: University of Saarland 1 A Hidden Markov Model Information Retrieval System Mahboob Alam Khalid.
Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR
Chapter 7 Retrieval Models.
1 Language Model CSC4170 Web Intelligence and Social Computing Tutorial 8 Tutor: Tom Chao Zhou
ISP 433/533 Week 2 IR Models.
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 11: Probabilistic Information Retrieval.
Modeling Modern Information Retrieval
Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Language Modeling Approaches for Information Retrieval Rong Jin.
Chapter 7 Retrieval Models.
Probabilistic Models in IR Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Using majority of the slides from.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Multi-Style Language Model for Web Scale Information Retrieval Kuansan Wang, Xiaolong Li and Jianfeng Gao SIGIR 2010 Min-Hsuan Lai Department of Computer.
Language Models for IR Debapriyo Majumdar Information Retrieval Indian Statistical Institute Kolkata Spring 2015 Credit for several slides to Jimmy Lin.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 9 9/20/2011.
A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.
1 Computing Relevance, Similarity: The Vector Space Model.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
Lecture 1: Overview of IR Maya Ramanath. Who hasn’t used Google? Why did Google return these results first ? Can we improve on it? Is this a good result.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Chapter 23: Probabilistic Language Models April 13, 2004.
A Language Modeling Approach to Information Retrieval 한 경 수  Introduction  Previous Work  Model Description  Empirical Results  Conclusions.
Language Models LBSC 796/CMSC 828o Session 4, February 16, 2004 Douglas W. Oard.
Language Models. Language models Based on the notion of probabilities and processes for generating text Documents are ranked based on the probability.
Language Model in Turkish IR Melih Kandemir F. Melih Özbekoğlu Can Şardan Ömer S. Uğurlu.
Relevance-Based Language Models Victor Lavrenko and W.Bruce Croft Department of Computer Science University of Massachusetts, Amherst, MA SIGIR 2001.
Dependence Language Model for Information Retrieval Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao, Dependence Language Model for Information Retrieval,
Language Modeling Putting a curve to the bag of words Courtesy of Chris Jordan.
CpSc 881: Information Retrieval. 2 Using language models (LMs) for IR ❶ LM = language model ❷ We view the document as a generative model that generates.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval.
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval Chengxiang Zhai, John Lafferty School of Computer Science Carnegie.
Introduction to Information Retrieval Probabilistic Information Retrieval Chapter 11 1.
Probabilistic Retrieval LBSC 708A/CMSC 838L Session 4, October 2, 2001 Philip Resnik.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 14: Language Models for IR.
1 Probabilistic Models for Ranking Some of these slides are based on Stanford IR Course slides at
Plan for Today’s Lecture(s)
CS276A Text Information Retrieval, Mining, and Exploitation
Lecture 13: Language Models for IR
Text Based Information Retrieval
CSCI 5417 Information Retrieval Systems Jim Martin
Information Retrieval Models: Probabilistic Models
Language Models for Information Retrieval
Murat Açar - Zeynep Çipiloğlu Yıldız
Introduction to Statistical Modeling
Basic Information Retrieval
Representation of documents and queries
Lecture 12 The Language Model Approach to IR
John Lafferty, Chengxiang Zhai School of Computer Science
Junghoo “John” Cho UCLA
Language Models Hongning Wang
CS 430: Information Discovery
CS590I: Information Retrieval
INF 141: Information Retrieval
Conceptual grounding Nisheeth 26th March 2019.
Information Retrieval and Web Design
Language Models for TR Rong Jin
Presentation transcript:

Language Model Approach to IR Hankz Hankui Zhuo http://plan.sysu.edu.cn

Standard Probabilistic IR Information need d1 matching d2 query … dn document collection

IR based on Language Model (LM) Information need d1 generation d2 query … … A common search heuristic is to use words that you expect to find in matching documents as your query The LM approach directly exploits that idea! dn document collection

Formal Language (Model) Traditional generative model: generates strings Finite state machines or regular grammars, etc. Example: I wish I wish I wish I wish I wish I wish I wish I wish I wish I wish I wish … *wish I wish

Stochastic Language Models Models probability of generating strings in the language (commonly all strings over alphabet ∑) Model M 0.2 the 0.1 a 0.01 man 0.01 woman 0.03 said 0.02 likes … the man likes the woman 0.2 0.01 0.02 0.2 0.01 multiply P(s | M) = 0.00000008

Stochastic Language Models Model probability of generating any string Model M1 Model M2 0.2 the 0.0001 class 0.03 sayst 0.02 pleaseth 0.1 yon 0.01 maiden 0.0001 woman 0.2 the 0.01 class 0.0001 sayst 0.0001 pleaseth 0.0001 yon 0.0005 maiden 0.01 woman maiden class pleaseth yon the 0.0005 0.01 0.0001 0.2 0.02 0.1 P(s|M2) > P(s|M1)

Stochastic Language Models A statistical model for generating text Probability distribution over strings in a given language M P ( | M ) = P ( | M ) P ( | M, ) P ( | M, ) P ( | M, )

Unigram and higher-order models Unigram Language Models Bigram (generally, n-gram) Language Models Other Language Models Grammar-based models (PCFGs), etc. Probably not the first thing to try in IR P ( ) = P ( ) P ( | ) P ( | ) P ( | ) Easy. Effective! P ( ) P ( ) P ( ) P ( ) P ( ) P ( | ) P ( | ) P ( | )

Using Language Models in IR Treat each document as the basis for a model Rank document d based on P(d | Q) P(d | Q) = P(Q | d) x P(d) / P(Q) P(Q) is the same for all documents, so ignore P(d) [the prior] is often treated as the same for all d But we could use criteria like authority, length, genre P(Q | d) is the probability of Q given d’s model Very general formal approach

The fundamental problem of LMs Usually we don’t know the model M But have a sample of text representative of that model Estimate a language model from a sample Then compute the observation probability P ( | M ( ) ) M

Query generation probability (1) Ranking formula The probability of producing the query given the language model of document d using MLE is: Unigram assumption: Given a particular language model, the query terms occur independently : language model of document d : raw tf of term t in document d : total number of tokens in document d

Insufficient data Zero probability General approach May not wish to assign a probability of zero to a document that is missing one or more of the query terms [gives conjunction semantics] General approach If , : raw count of term t in the collection : raw collection size(total number of tokens in the collection)

Mixture model P(t|d) = Pmle(t|Md) + (1 – )Pmle(t|Mc) Mixes the probability from the document with the general collection frequency of the word. Correctly setting  is very important General formulation of the LM for IR general language model individual-document model

Example Document collection (2 documents) d1: Xerox reports a profit but revenue is down d2: Lucent narrows quarter loss but revenue decreases further Model: MLE unigram from documents;  = ½ Query: revenue down P(Q|d1) = ? P(Q|d2) = ? Ranking: d1 > d2

Example Document collection (2 documents) d1: Xerox reports a profit but revenue is down d2: Lucent narrows quarter loss but revenue decreases further Model: MLE unigram from documents;  = ½ Query: revenue down P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2] = 1/8 x 3/32 = 3/256 P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2] = 1/8 x 1/32 = 1/256 Ranking: d1 > d2

LM vs. Prob. Model for IR The main difference is whether “Relevance” figures explicitly in the model or not LM approach attempts to do away with modeling relevance Computationally tractable, intuitively appealing

Extension: 3-level model Whole collection model ( ) Specific-topic model ( ) Individual-document model ( ) Relevance hypothesis A request is generated from a specific-topic model { , }. If a document is relevant to the topic, the same topic model will apply to the document. It will replace part of the individual-document model in explaining the document.

3-level model d1 d2 … … … dn Information need generation query document collection

Translation model Basic LMs do not address issues of synonymy. Or any deviation in expression of information need from language of documents A translation model lets you generate query words not in document via “translation” to synonyms etc. Or to do cross-language IR, or multimedia IR Basic LM Translation Need to learn a translation model (using a dictionary or via statistical machine translation)

Language models: pro & con Novel way of looking at the problem of text retrieval based on probabilistic language modeling Conceptually simple and explanatory Formal mathematical model LMs provide effective retrieval and can be improved to the extent that the following conditions can be met Our language models are accurate representations of the data Users have some sense of term distribution Or we get more sophisticated with translation model

Comparison With Vector Space Similar in some ways Term weights based on frequency Terms often used as if they were independent Inverse document/collection frequency used Some form of length normalization useful Different in others Based on probability rather than similarity Intuitions are probabilistic rather than geometric Details of use of document length and term, document, and collection frequency differ

Resources J.M. Ponte and W.B. Croft. 1998. A language modelling approach to information retrieval. In SIGIR 21. D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584. A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229. D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221. [Several relevant newer papers at SIGIR 23–25, 2000–2002.] Workshop on Language Modeling and Information Retrieval, CMU 2001. http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/ . The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/~lemur/ . CMU/Umass LM and IR system in C(++), currently actively developed.