Language Models for Information Retrieval


Language Models for Information Retrieval
Berlin Chen, Department of Computer Science & Information Engineering, National Taiwan Normal University

References:
1. W. B. Croft and J. Lafferty (Eds.). Language Modeling for Information Retrieval. July 2003.
2. X. Liu and W. B. Croft. Statistical Language Modeling for Information Retrieval. Annual Review of Information Science and Technology, vol. 39, 2005.
3. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. (Chapter 12)
4. D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 2004. (Chapter 2)
5. C. X. Zhai. Statistical Language Models for Information Retrieval (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers, 2008.

Taxonomy of Classic IR Models (organized by document property: text, links, multimedia)

- Classic Models: Boolean, Vector, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean, Set-based
- Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks, Support Vector Machines
- Probabilistic: BM25, Language Models, Divergence from Randomness, Bayesian Networks
- Semi-structured Text: Proximal Nodes and others, XML-based
- Web: Page Rank, Hubs & Authorities
- Multimedia Retrieval: Image Retrieval, Audio and Music Retrieval, Video Retrieval

Statistical Language Models (1/2) A probabilistic mechanism for "generating" a piece of text: it defines a distribution over all possible word sequences, and the LM is used to quantify the acceptability of a given word sequence. What are LMs used for? Speech recognition, spelling correction, handwriting recognition, optical character recognition, machine translation, document classification and routing, information retrieval, and more.

Statistical Language Models (2/2) (Statistical) language models (LMs) have been widely used for speech recognition and language (machine) translation for more than thirty years. However, their use for information retrieval started only in 1998 [Ponte and Croft, SIGIR 1998]. Basically, a query is considered to be generated from an "ideal" document that satisfies the information need. The system's job is then to estimate the likelihood of each document in the collection being the ideal document and rank them accordingly (in decreasing order). Ponte and Croft. A language modeling approach to information retrieval. SIGIR 1998

Three Ways of Developing LM Approaches for IR: (a) query likelihood, (b) document likelihood, and (c) model comparison. Each can be realized with literal term matching or concept matching.

Query-Likelihood Language Models Criterion: documents are ranked based on P(D|Q). By Bayes (decision) rule, P(D|Q) = P(Q|D) P(D) / P(Q). P(Q) is the same for all documents and can be ignored. The document prior P(D) might have to do with authority, length, genre, etc.; there is no general way to estimate it, so it can be treated as uniform across all documents. Documents can therefore be ranked based on the query likelihood P(Q|M_D) alone. The intuition: the user has a prototype (ideal) document in mind and generates a query based on words that appear in this document, so each document is treated as a model M_D used to predict (generate) the query.

Another Criterion: Maximum Mutual Information Documents can be ranked based on their mutual information with the query (in decreasing order): MI(Q, D) = log [ P(Q, D) / (P(Q) P(D)) ] = log P(Q|D) − log P(Q). Document ranking by mutual information (MI) is equivalent to ranking by likelihood, since log P(Q) is the same for all documents and can hence be ignored.

Yet Another Criterion: Minimum KL Divergence Documents are ranked by the Kullback-Leibler (KL) divergence between the query model and the document model, in increasing order: KL(M_Q || M_D) = Σ_w P(w|M_Q) log [ P(w|M_Q) / P(w|M_D) ] = −Σ_w P(w|M_Q) log P(w|M_D) + Σ_w P(w|M_Q) log P(w|M_Q). The second term is the same for all documents and can be disregarded; the first term is the cross entropy between the language models of the query and the document. Ranking by increasing KL divergence is therefore equivalent to ranking in decreasing order of Σ_w P(w|M_Q) log P(w|M_D). Relevant documents are deemed to have lower cross entropies.
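
As a concrete illustration of the cross-entropy criterion, here is a minimal sketch in Python (not from the original slides): it scores a document model against a query model by Σ_w P(w|M_Q) log P(w|M_D), with a small probability floor standing in for proper smoothing.

```python
import math

def cross_entropy_score(query_model, doc_model, floor=1e-12):
    """Return sum_w P(w|M_Q) * log P(w|M_D); a higher value means a lower
    cross entropy, i.e., a better-matching document."""
    return sum(p_q * math.log(max(doc_model.get(w, 0.0), floor))
               for w, p_q in query_model.items())

# Toy usage: the document that emphasizes "retrieval" more wins.
q = {"language": 0.5, "retrieval": 0.5}
d1 = {"language": 0.2, "retrieval": 0.6, "models": 0.2}
d2 = {"language": 0.6, "retrieval": 0.1, "models": 0.3}
print(cross_entropy_score(q, d1) > cross_entropy_score(q, d2))  # True
```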

Schematic Depiction for the Query-Likelihood Approach [Figure: each document D1, D2, D3, ... in the collection induces a document model MD1, MD2, MD3, ...; a query Q is scored against every model by the likelihood P(Q|MD1), P(Q|MD2), P(Q|MD3), ..., and documents are ranked by these scores.]

Building Document Models: n-grams Multiplication (chain) rule: decompose the probability of a sequence of events into the probability of each successive event conditioned on the earlier events, P(w1, w2, ..., wn) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, ..., wn−1). n-gram assumptions truncate the history. Unigram: P(w1, ..., wn) = ∏_i P(wi); each word occurs independently of the other words, giving the so-called "bag-of-words" model (which cannot, e.g., distinguish "street market" from "market street"). Bigram: P(w1, ..., wn) = P(w1) ∏_{i≥2} P(wi|wi−1). Most language-modeling work in IR has used unigram models: IR does not directly depend on the structure of sentences.

Unigram Model (1/4) The likelihood of a query Q = q1 q2 ... q|Q| given a document D: P(Q|M_D) = ∏_{i=1..|Q|} P(qi|M_D). Words are assumed conditionally independent of each other given the document. How do we estimate the probability of a (query) word given the document? Assume that word counts follow a multinomial distribution given the document (if the multinomial coefficient is included, every permutation of the query words is accounted for; being constant for a given query, it does not affect ranking).

Unigram Model (2/4) Use each document itself as a sample for estimating its corresponding unigram (multinomial) model. If Maximum Likelihood Estimation (MLE) is adopted: P(w|M_D) = c(w, D) / |D|. Example: Doc D = wc wa wb wb wa wa wc wd wa wb, so P(wa|MD) = 0.4, P(wb|MD) = 0.3, P(wc|MD) = 0.2, P(wd|MD) = 0.1, P(we|MD) = P(wf|MD) = 0.0. The zero-probability problem: if we and wf do not occur in D, then P(we|MD) = P(wf|MD) = 0. This causes a problem in predicting the query likelihood (see the equation on the preceding slide): a single unseen query word zeroes out the whole product.
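
A quick sketch reproduces these MLE numbers from the toy document above and exposes the zero-probability problem:

```python
from collections import Counter

doc = ["wc", "wa", "wb", "wb", "wa", "wa", "wc", "wd", "wa", "wb"]
counts = Counter(doc)
mle = {w: c / len(doc) for w, c in counts.items()}   # P(w|M_D) = c(w,D)/|D|
print(mle)                  # {'wc': 0.2, 'wa': 0.4, 'wb': 0.3, 'wd': 0.1}
print(mle.get("we", 0.0))   # 0.0 -- any query containing 'we' gets P(Q|M_D) = 0
```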

Unigram Model (3/4) Smooth the document-specific unigram model with a collection model (two states, i.e., a mixture of two multinomials): P(w|D) = λ P(w|M_D) + (1 − λ) P(w|M_C). The role of the collection unigram model P(w|M_C): it helps solve the zero-probability problem, and it helps differentiate the contributions of different missing terms in a document (global information, akin to IDF). The collection unigram model can be estimated in the same way as the document-specific unigram model, e.g., as a normalized frequency over the whole collection.
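
The smoothed query likelihood can be sketched as follows — a minimal Jelinek-Mercer-style mixture matching the two-state interpretation above; the weight lam and the collection model passed in are assumptions to be estimated elsewhere:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_model, lam=0.5):
    """log P(Q|D) with P(w|D) = lam * c(w,D)/|D| + (1-lam) * P(w|M_C).
    The collection term guarantees every query word a nonzero probability."""
    counts, dlen = Counter(doc), len(doc)
    score = 0.0
    for w in query:
        p = lam * counts[w] / dlen + (1 - lam) * coll_model.get(w, 1e-12)
        score += math.log(p)
    return score
```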

Unigram Model (4/4) An evaluation on the Topic Detection and Tracking (TDT) corpora compared the unigram language model against the vector space model. [Results table not reproduced here.] Consideration of contextual information (higher-order language models, e.g., bigrams) will not always lead to improved performance.

Training Mixture Weights of LMs Expectation-Maximization (EM) training, with the weights tied among the documents. For example, the weight m1 of the Type I HMM can be trained using the following update: m1_new = (1/|Q_train|) Σ_{Q∈Q_train} (1/|R_Q|) Σ_{D∈R_Q} (1/|Q|) Σ_{q∈Q} [ m1_old P(q|M_D) / ( m1_old P(q|M_D) + m2_old P(q|M_C) ) ], where Q_train is the set of training query exemplars, R_Q is the set of documents relevant to a specific training query exemplar Q, |Q| is the length of the query, and |R_Q| is the total number of documents relevant to the query. (Reported setup: 819 training queries, each with at most 2,265 relevant documents.)
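
The update can be sketched as one EM iteration over a training set; the container names here (train_queries, rel_docs, doc_models) are hypothetical, and the inner fraction is the posterior probability that a query word was emitted by the document state rather than the collection state:

```python
def em_update_m1(train_queries, rel_docs, doc_models, coll_model, m1):
    """One tied-weight EM step following the update above: for each training
    query, average the document-state posterior over its relevant documents
    and query words, then average over the training queries."""
    total = 0.0
    for q_id, query in train_queries.items():
        per_query = 0.0
        for d in rel_docs[q_id]:
            post = 0.0
            for w in query:
                p_doc = m1 * doc_models[d].get(w, 0.0)
                p_col = (1 - m1) * coll_model.get(w, 1e-12)
                post += p_doc / (p_doc + p_col)
            per_query += post / len(query)
        total += per_query / len(rel_docs[q_id])
    return total / len(train_queries)   # the new (tied) m1
```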

Discriminative Training of LMs (1/5) Minimum Classification Error (MCE) training: given a query Q and a desired relevant document D, define a classification error function d(Q, D) that compares the score of D against those of its competing documents; d(Q, D) > 0 means the pair is misclassified, while d(Q, D) ≤ 0 means a correct decision. The error function is then transformed into a smooth loss function in the range between 0 and 1, e.g., a sigmoid l(Q, D) = 1 / (1 + exp(−α d(Q, D))). B. Chen et al., "A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents," ACM Transactions on Asian Language Information Processing, 3(2), pp. 128-145, June 2004

Discriminative Training of LMs (2/5) Minimum Classification Error (MCE) training: the loss function is plugged into the MCE procedure for iteratively updating the weighting parameters by gradient descent. Constraints such as the mixture weights summing to one (e.g., for the Type I HMM) are maintained by a parameter transformation: the unconstrained transformed parameters are updated iteratively and then mapped back to valid weights.

Discriminative Training of LMs (3/5)-(5/5) Minimum Classification Error (MCE) training, continued: the transformed weights (e.g., of the Type I HMM) are updated iteratively, with each gradient-descent step mapping the old weight to a new weight. The final update equations take the same form for every mixture weight, so the remaining weights can be updated in the same way (the detailed equations are given in Chen et al., 2004).

Statistical Translation Model (1/2) Berger & Lafferty (1999): a query is viewed as a translation or distillation of a document. That is, the similarity measure is computed by estimating the probability that the query would have been generated as a translation of that document, using word-to-word translation probabilities: P(Q|D) = ∏_{q∈Q} Σ_w P(q|w) P(w|M_D). The model assumes context-independence (so its ability to handle the ambiguity of word senses is limited); however, it has the capability of handling synonymy (multiple terms having similar meanings) and polysemy (the same term having multiple meanings). A. Berger and J. Lafferty. Information retrieval as statistical translation. SIGIR 1999
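
A minimal sketch of this scoring rule; trans_prob is an assumed dictionary of word-to-word translation probabilities P(q|w), and estimating it is the hard part discussed on the next slide:

```python
import math

def translation_query_likelihood(query, doc_model, trans_prob, floor=1e-12):
    """log P(Q|D) = sum over query words q of
    log( sum over document words w of P(q|w) * P(w|M_D) )."""
    score = 0.0
    for q in query:
        p = sum(trans_prob.get((q, w), 0.0) * p_w
                for w, p_w in doc_model.items())
        score += math.log(max(p, floor))
    return score
```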

Statistical Translation Model (2/2) Weaknesses of the statistical translation model: the need for a large collection of training data for estimating translation probabilities, and inefficiency in ranking documents. Jin et al. (2002) proposed a "Title Language Model" approach to capture the intrinsic document-to-query translation patterns. Queries are more like titles than documents (queries and titles both tend to be very short, concise descriptions of information, created through a similar generation process), so the statistical translation model is trained on the document-title pairs in the whole collection. R. Jin et al. Title language model for information retrieval. SIGIR 2002

Probabilistic Latent Semantic Analysis (PLSA) Hofmann (1999): also called the aspect model, or probabilistic latent semantic indexing (PLSI). Its graphical model representation is a kind of Bayesian network. Compared with the basic language (unigram) model, PLSA introduces latent variables: the unobservable class variables Tk (topics or domains). T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 2001

PLSA: Formulation Definitions: P(Di) is the probability of selecting a document Di; P(Tk|Di) is the probability of picking a latent class Tk for the document; P(wj|Tk) is the probability of generating a word wj from the class. The joint probability of a document and a word is then P(Di, wj) = P(Di) Σ_k P(wj|Tk) P(Tk|Di).

PLSA: Assumptions Bag-of-words: documents are treated as memoryless sources; words are generated independently. Conditional independence: the document and the word are independent conditioned on the state of the associated latent variable.

PLSA: Training (1/2) Probabilities are estimated by maximizing the collection likelihood L = Σ_i Σ_j c(Di, wj) log P(Di, wj) using the Expectation-Maximization (EM) algorithm. EM tutorial: Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," U.C. Berkeley TR-97-021

PLSA: Training (2/2) E (expectation) step: P(Tk|Di, wj) = P(wj|Tk) P(Tk|Di) / Σ_{k'} P(wj|Tk') P(Tk'|Di). M (maximization) step: P(wj|Tk) = Σ_i c(Di, wj) P(Tk|Di, wj) / Σ_{j'} Σ_i c(Di, wj') P(Tk|Di, wj'), and P(Tk|Di) = Σ_j c(Di, wj) P(Tk|Di, wj) / |Di|, where c(Di, wj) is the count of word wj in document Di and |Di| is the document length.
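
These updates fit in a few lines of NumPy; the sketch below (the count-matrix layout and array shapes are assumptions, not from the slides) runs the E and M steps on a word-by-document count matrix:

```python
import numpy as np

def plsa(C, K, iters=50, seed=0):
    """PLSA via EM. C is an (n_words, n_docs) count matrix c(w_j, D_i).
    Returns P(w|T) with shape (n_words, K) and P(T|D) with shape (K, n_docs)."""
    rng = np.random.default_rng(seed)
    n_w, n_d = C.shape
    p_w_t = rng.random((n_w, K)); p_w_t /= p_w_t.sum(axis=0, keepdims=True)
    p_t_d = rng.random((K, n_d)); p_t_d /= p_t_d.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(T_k | D_i, w_j), normalized over topics k
        post = np.einsum('wk,kd->wkd', p_w_t, p_t_d)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        expected = C[:, None, :] * post          # c(w,D) * P(T|D,w)
        p_w_t = expected.sum(axis=2)
        p_w_t /= p_w_t.sum(axis=0, keepdims=True) + 1e-12
        p_t_d = expected.sum(axis=0)
        p_t_d /= p_t_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_t, p_t_d
```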

PLSA: Latent Probability Space (1/2) [Figure: with dimensionality K = 128 latent classes, example factors correspond to recognizable topics such as medical imaging, image sequence analysis, contour/boundary detection, and phonetic segmentation.]

PLSA: Latent Probability Space (2/2) The PLSA decomposition can be written in matrix form, mirroring the SVD used in LSA: P = U Σk V^T, where P is the m×n matrix of joint probabilities [P(wj, Di)], U is the m×k matrix [P(wj|Tk)], Σk is the k×k diagonal matrix diag(P(Tk)), and V is the n×k matrix [P(Di|Tk)].

PLSA: One More Example on the TDT1 Dataset [Figure: example latent topics learned from TDT1, e.g., one topic covering aviation and space missions and another covering family, love, and Hollywood.]

PLSA: Experiment Results (1/4) Two ways to smooth the empirical distribution with PLSA: PLSA-U*, which combines the cosine score with that of the vector space model (as is done for LSA; see the next slide), and PLSA-Q*, which combines the multinomials individually. Both provide almost identical performance. The performance of PLSA is not promising when it is used alone.

PLSA: Experiment Results (2/4) PLSA-U*: use the low-dimensional representations of the query and the document (viewed in the k-dimensional latent topic space) to evaluate relevance by means of the cosine measure, and combine the cosine score with that of the vector space model. An ad hoc approach is used to re-weight the different model components (dimensions); new queries are handled by online folding-in.

PLSA: Experiment Results (3/4) Why the cosine in the latent space? Recall that in LSA the relations between any two documents (e.g., Ds and Di) can be formulated through the rank-reduced matrix A' = U'Σ'V'^T: A'^T A' = (U'Σ'V'^T)^T (U'Σ'V'^T) = V'Σ'^T U'^T U'Σ'V'^T = (V'Σ')(V'Σ')^T, since U'^T U' = I. Document-document similarities thus depend only on the rows of V'Σ'; PLSA mimics LSA in its similarity measure.
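
The identity is easy to verify numerically (a standalone check, not from the slides):

```python
import numpy as np

A = np.random.default_rng(0).random((6, 4))        # toy word-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                              # rank-k truncation
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
Ak = Uk @ Sk @ Vk.T                                # A' = U' Sigma' V'^T
lhs = Ak.T @ Ak                                    # doc-doc products A'^T A'
rhs = (Vk @ Sk) @ (Vk @ Sk).T                      # (V' Sigma')(V' Sigma')^T
print(np.allclose(lhs, rhs))                       # True
```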

PLSA: Experiment Results (4/4)

PLSA vs. LSA Decomposition/approximation criterion: LSA uses a least-squares criterion measured on the L2 (Frobenius) norm of the word-document matrix, while PLSA maximizes the likelihood function, i.e., minimizes the cross entropy (Kullback-Leibler divergence) between the empirical distribution and the model. Computational complexity: LSA requires an SVD decomposition; PLSA requires EM training, which is time-consuming over many iterations. The model complexity of both LSA and PLSA grows linearly with the number of training documents, and there is no general way to estimate or predict the vector representation (for LSA) or the model parameters (for PLSA) of a newly observed document.

Latent Dirichlet Allocation (LDA) (1/2) Blei et al. (2003): the basic generative process of LDA closely resembles PLSA. However, in PLSA the topic mixture θ is conditioned on each document (θ is fixed but unknown), while in LDA the topic mixture θ is drawn from a Dirichlet distribution, the so-called conjugate prior (θ is unknown and follows a probability distribution). In Bayesian probability theory, a class of prior probability distributions p(θ) is said to be conjugate to a class of likelihood functions p(x|θ) if the resulting posterior distributions p(θ|x) are in the same family as p(θ). Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003
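
The contrast is easy to see by sampling (the hyperparameter values below are arbitrary): in LDA each document's topic mixture θ is a fresh draw from the Dirichlet, a point on the probability simplex, rather than a fixed per-document parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])    # assumed Dirichlet hyperparameters
theta = rng.dirichlet(alpha)         # one topic mixture per document (LDA)
print(theta, theta.sum())            # nonnegative components summing to 1.0
```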

Latent Dirichlet Allocation (2/2) [Figure: the probability simplex over three words, with axes X = P(w1), Y = P(w2), Z = P(w3) and the constraint X + Y + Z = 1; each point on the simplex is one unigram distribution.]

Word Topic Models (WTM) Each word of the language is treated as a word topical mixture model for predicting the occurrences of other words. WTM can also be viewed as a nonnegative factorization of a "word-word" matrix consisting of probability entries; each column encodes the vicinity information of all occurrences of a distinct word in the document collection. B. Chen, "Word topic models for spoken document retrieval and transcription," ACM Transactions on Asian Language Information Processing, 8(1), pp. 2:1-2:27, March 2009

Comparison of WTM and PLSA/LDA A schematic comparison of the matrix factorizations: PLSA/LDA factorize the normalized "word-document" co-occurrence matrix (words × documents) into mixture components (words × topics) and mixture weights (topics × documents). WTM instead factorizes a normalized "word-word" co-occurrence matrix (words × vicinities of words) into mixture components (words × topics) and mixture weights (topics × vicinities of words).

WTM: Information Retrieval (1/4) The relevance measure between a query and a document can be expressed by having the WTMs of the document words predict the query words, e.g., P(Q|D) = ∏_{q∈Q} Σ_{w∈D} P(q|M_w) P(w|D), where P(q|M_w) = Σ_k P(q|Tk) P(Tk|M_w). Unsupervised training: the WTM of each word can be trained by concatenating the words occurring within a context window of a fixed size around each occurrence of the word, which are postulated to be relevant to the word.

WTM: Information Retrieval (2/4) Supervised training: the model parameters are trained using a training set of query exemplars and the associated query-document relevance information, maximizing the log-likelihood of the training query exemplars being generated by their relevant documents.

WTM: Information Retrieval (3/4) Example topic distributions of WTM (word, weight):
Topic 13: Vena (靜脈) 1.202; Resection (切除) 0.674; Myoma (肌瘤) 0.668; Cephalitis (腦炎) 0.618; Uterus (子宮) 0.501; Bronchus (支氣管) 0.500
Topic 14: Land tax (土地稅) 0.704; Tobacco and alcohol tax law (菸酒稅法) 0.489; Tax (財稅) 0.457; Amendment drafts (修正草案) 0.446; Acquisition (購併) 0.396; Insurance law (保險法) 0.373
Topic 23: Cholera (霍亂) 0.752; Colorectal cancer (大腸直腸癌) 0.681; Salmonella enterica (沙門氏菌) 0.471; Aphtae epizooticae (口蹄疫) 0.337; Thyroid (甲狀腺) 0.303; Gastric cancer (胃癌) 0.298

WTM: Information Retrieval (4/4) Pairing of PLSA and WTM: the two models share the same set of latent topics, jointly factorizing the normalized "word-document" and "word-word" co-occurrence matrices with a common set of mixture components (words × topics), while each keeps its own mixture weights (topics × documents for PLSA, topics × vicinities of words for WTM).

Applying Relevance Feedback to the LM Framework (1/2) There is still no formal mechanism to incorporate relevance feedback (judgments) into the language-modeling framework (especially for the query-likelihood approach): the query is treated as a fixed sample, and the focus is on accurate estimation of the document language models. Ponte (1998) proposed a limited way to incorporate blind relevance feedback into the LM framework: think of example relevant documents as examples of what the query might have been, and re-sample (or expand) the query by adding k highly descriptive words from these documents. J. M. Ponte, A language modeling approach to information retrieval, Ph.D. dissertation, UMass, 1998

Applying Relevance Feedback to the LM Framework (2/2) Miller et al. (1999) proposed two relevance feedback approaches. Query expansion: add to the initial query those words that appear in two or more of the top m retrieved documents. Document model re-estimation: use a set of outside training query exemplars to train the transition probabilities (mixture weights) of the document models, with the same tied EM update shown earlier, where Q_train is the set of training query exemplars, R_Q is the set of documents relevant to a specific training query exemplar Q, |Q| is the length of the query, and |R_Q| is the total number of documents relevant to the query (819 queries, each with at most 2,265 relevant documents). Miller et al., A hidden Markov model information retrieval system, SIGIR 1999

Relevance Modeling (1/3) A schematic illustration of an information retrieval system with relevance feedback for improved query modeling: the initial query model is matched against the document models in an initial round of retrieval; the top-ranked documents, taken as representative, feed various query modeling techniques; the enhanced query model then drives a second round of retrieval over the document collection to produce the final retrieval result. K.-Y. Chen et al., "Exploring the use of unsupervised query modeling techniques for speech recognition and summarization," Speech Communication, 80, pp. 49-59, June 2016

Relevance Modeling (2/3) Use the top-ranked pseudo-relevant documents to approximate the relevance class of each query. The joint probability of a query Q and any word w being generated by the relevance class is computed on the basis of the top-ranked list D_top of pseudo-relevant documents obtained from the initial round of retrieval: P(w, Q) ≈ Σ_{Dm∈D_top} P(Dm) P(w, Q|Dm). If we further assume that words are conditionally independent given Dm and that their order is of no importance, P(w, Q) ≈ Σ_{Dm∈D_top} P(Dm) P(w|Dm) ∏_{i=1..|Q|} P(qi|Dm).

Relevance Modeling (3/3) The enhanced query model can therefore be expressed by normalizing the joint probability over the vocabulary: P(w|R_Q) = P(w, Q) / Σ_{w'} P(w', Q), optionally interpolated with the original query model. Latent topic information can also be incorporated, e.g., by estimating the document models P(w|Dm) with a topic model such as PLSA rather than literal unigram counts.
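
A compact sketch of this estimation (function and container names are illustrative; doc_models holds smoothed unigram models of the pseudo-relevant documents, and the document prior P(Dm) is taken as uniform):

```python
def relevance_model(query, top_docs, doc_models, floor=1e-12):
    """Estimate P(w|R_Q): accumulate P(D) * P(w|M_D) * prod_i P(q_i|M_D)
    over the pseudo-relevant documents, then normalize over the vocabulary."""
    vocab = set().union(*(m.keys() for m in doc_models.values()))
    p_w = dict.fromkeys(vocab, 0.0)
    prior = 1.0 / len(top_docs)                 # uniform P(D_m)
    for d in top_docs:
        m = doc_models[d]
        q_lik = 1.0
        for q in query:
            q_lik *= max(m.get(q, 0.0), floor)  # prod_i P(q_i|M_D)
        for w in vocab:
            p_w[w] += prior * m.get(w, 0.0) * q_lik
    z = sum(p_w.values()) or 1.0
    return {w: p / z for w, p in p_w.items()}
```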

Leveraging Non-relevant Information Hypothesis: the low-ranked (or pseudo non-relevant) documents can also provide useful cues to boost the retrieval effectiveness of a given query. For this idea to work, we may estimate a non-relevance model for each test query based on selected pseudo non-relevant documents. The similarity measure between a query and a document can then contrast the two models, rewarding closeness to the document model while penalizing closeness to the non-relevance model.

Incorporating Prior Knowledge into the LM Framework Several efforts have been made to use prior knowledge in the LM framework, especially for modeling the document prior: document length, document source, average word length, aging (time information/period), URL, and page links.

Implementation Notes: Probability Manipulation For language-modeling approaches to IR, many conditional probabilities are usually multiplied, which can result in floating-point underflow. It is better to perform the computation by adding logarithms of probabilities instead; the logarithm function is monotonic (order-preserving), so rankings are unaffected. We should also avoid the problem of zero probabilities (or estimates) owing to sparse data by using appropriate probability smoothing techniques.
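
A small demonstration of why log space matters (the numbers are illustrative): the naive product of many small probabilities underflows to zero, while the sum of logs stays finite and comparable across documents.

```python
import math

probs = [0.01] * 400                          # 400 small conditional probabilities
product = 1.0
for p in probs:
    product *= p                              # underflows to exactly 0.0
log_score = sum(math.log(p) for p in probs)   # stays finite and comparable
print(product, log_score)                     # 0.0 -1842.0680743952366
```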

Implementation Notes: Converting to tf-idf-like Weighting Take the query-likelihood retrieval model with mixture smoothing and apply the logarithm (a monotonic, rank-preserving transformation): log P(Q|D) = Σ_{q∈Q} log [ λ P(q|M_D) + (1 − λ) P(q|M_C) ] = Σ_{q∈Q∩D} log [ 1 + λ P(q|M_D) / ((1 − λ) P(q|M_C)) ] + Σ_{q∈Q} log [ (1 − λ) P(q|M_C) ]. The second sum is the same for all documents and can be discarded! The remaining score grows with the term's frequency in the document (a tf-like factor) and shrinks with the term's collection frequency (an idf-like factor), so it can be efficiently implemented with inverted files (to be discussed later on!).

[Plate notation legend: D and w are observed variables; T is a latent variable; N is the number of distinct words in the vocabulary; M is the number of documents in the collection.]
