Models for Retrieval Berlin Chen 2003


Models for Retrieval
Berlin Chen, 2003

1. HMM/N-gram-based retrieval
2. Latent Semantic Indexing (LSI)
3. Probabilistic Latent Semantic Analysis (PLSA)

References:
1. Berlin Chen et al., “An HMM/N-gram-based Linguistic Processing Approach for Mandarin Spoken Document Retrieval,” EUROSPEECH 2001
2. M. W. Berry et al., “Using Linear Algebra for Intelligent Information Retrieval,” technical report, 1994
3. Thomas Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, 2001

Taxonomy of Classic IR Models
Classic models: Boolean, Vector, Probabilistic
Set theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing (LSI), Neural Networks
Probabilistic (probability-based): Inference Network, Belief Network, Hidden Markov Model, Probabilistic LSI, Language Model
Structured models: Non-Overlapping Lists, Proximal Nodes
User task: Retrieval (ad hoc, filtering), Browsing (flat, structure guided, hypertext)

HMM/N-gram-based Model
Model the query Q as a sequence of input observations (index terms), Q = q1 q2 ... qN
Model the doc D as a discrete HMM composed of distributions of N-gram parameters
The relevance measure, P(Q|D), can be estimated by the N-gram probabilities of the index term sequence for the query, Q, predicted by the doc D
A generative model for IR, with the assumption that a doc is relevant to a query if its model is likely to have generated the query

HMM/N-gram-based Model
Given a word sequence W = w1 w2 ... wN of length N, how can we estimate its corresponding probability P(W)?
Applying the chain rule: P(W) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wN|w1,...,wN-1)
Too complicated to estimate all the necessary probability terms!

HMM/N-gram-based Model
N-gram approximation (language model):
Unigram: P(W) ≈ P(w1) P(w2) ... P(wN)
Bigram: P(W) ≈ P(w1) P(w2|w1) P(w3|w2) ... P(wN|wN-1)
Trigram: P(W) ≈ P(w1) P(w2|w1) P(w3|w1,w2) ... P(wN|wN-2,wN-1)
...

HMM/N-gram-based Model
A discrete HMM composed of distributions of N-gram parameters: the doc model is a mixture of N probability distributions (e.g., doc and corpus unigrams/bigrams), combined with mixture weights

HMM/N-gram-based Model
Three types of HMM structures:
Type I: Unigram-Based (Uni)
Type II: Unigram/Bigram-Based (Uni+Bi)
Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)
Example (query “President Chen Shui-bian inspects the Alishan train”; a code sketch follows below):
P(Chen Shui-bian, president, inspects, Alishan train | D)
= [m1 P(Chen Shui-bian|D) + m2 P(Chen Shui-bian|C)]
x [m1 P(president|D) + m2 P(president|C) + m3 P(president|Chen Shui-bian, D) + m4 P(president|Chen Shui-bian, C)]
x [m1 P(inspects|D) + m2 P(inspects|C) + m3 P(inspects|president, D) + m4 P(inspects|president, C)]
x ...
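
As a concrete illustration of the Type III (Uni+Bi*) scoring above, here is a minimal Python sketch, assuming the doc and corpus language models are given as plain dictionaries; the names doc_lm, corpus_lm and the example weights are illustrative assumptions, not part of the original slides.

```python
import math

def score_query(query_terms, doc_lm, corpus_lm, m=(0.5, 0.2, 0.2, 0.1)):
    """Log-relevance of a query under the Type III (Uni+Bi*) mixture.

    doc_lm / corpus_lm are dicts with 'uni': {term: prob} and
    'bi': {(prev_term, term): prob}; m = (m1, m2, m3, m4) are tied weights.
    """
    m1, m2, m3, m4 = m
    log_p = 0.0
    prev = None
    for w in query_terms:
        # unigram contributions from the doc and the corpus
        p = m1 * doc_lm['uni'].get(w, 0.0) + m2 * corpus_lm['uni'].get(w, 0.0)
        if prev is not None:
            # bigram contributions, conditioned on the previous query term
            p += m3 * doc_lm['bi'].get((prev, w), 0.0)
            p += m4 * corpus_lm['bi'].get((prev, w), 0.0)
        log_p += math.log(p) if p > 0 else float('-inf')
        prev = w
    return log_p
```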

HMM/N-gram-based Model
The role of the corpus N-gram probabilities:
Model the general distribution of the index terms
Help solve the zero-frequency problem
Help differentiate the contributions of different missing terms in a doc
The corpus N-gram probabilities were estimated using an outside corpus
Illustration: for a doc D containing the terms {qa, qa, qa, qa, qb, qb, qb, qc, qc, qd}, the MLE unigram gives P(qa|D)=0.4, P(qb|D)=0.3, P(qc|D)=0.2, P(qd|D)=0.1, while unseen terms get P(qe|D)=0.0 and P(qf|D)=0.0; the corpus model assigns such missing terms small but differentiated probabilities

HMM/N-gram-based Model
Estimation of N-grams (language models): maximum likelihood estimation (MLE) for the doc N-grams
Unigram: P(qi|D) = C(qi, D) / |D|, where C(qi, D) is the count of term qi in doc D and |D| is the length of doc D (the number of terms in D)
Bigram: P(qi|qj, D) = C(qj qi, D) / C(qj, D), where C(qj qi, D) is the count of the term pair (qj, qi) in doc D
Similar formulas hold for the corpus N-grams; the corpus is an outside corpus or just the doc collection
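
A minimal sketch of the MLE estimates above, producing the dictionary layout assumed by the earlier scoring sketch; the function and field names are illustrative.

```python
from collections import Counter

def estimate_doc_lm(doc_terms):
    """MLE unigram and bigram distributions for one document."""
    uni_counts = Counter(doc_terms)
    bi_counts = Counter(zip(doc_terms, doc_terms[1:]))
    n = len(doc_terms)
    # P(qi|D) = C(qi, D) / |D|
    uni = {w: c / n for w, c in uni_counts.items()}
    # P(qi|qj, D) = C(qj qi, D) / C(qj, D)
    bi = {(w1, w2): c / uni_counts[w1] for (w1, w2), c in bi_counts.items()}
    return {'uni': uni, 'bi': bi}
```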

HMM/N-gram-based Model
Basically, m1, m2, m3, m4 can be estimated with the Expectation-Maximization (EM) algorithm; all docs share the same weights mi here
The N-gram probability distributions can also be estimated with the EM algorithm instead of maximum likelihood estimation
For docs with training queries, m1, m2, m3, m4 can be estimated with the Minimum Classification Error (MCE) training algorithm; the docs can then have different weights, though the training data may be insufficient

HMM/N-gram-based Model
Expectation-Maximization training: the weights are tied among the documents
E.g., m1 of the Type I HMM can be trained with the following update:
m1_new = [ Σ_{Q ∈ TrainSet} Σ_{D ∈ D_Q} Σ_{q ∈ Q} m1_old P(q|D) / (m1_old P(q|D) + m2_old P(q|C)) ] / [ Σ_{Q ∈ TrainSet} |D_Q| · |Q| ]
where TrainSet is the set of training query exemplars, D_Q is the set of docs relevant to a specific training query exemplar Q, |Q| is the length of the query Q, and |D_Q| is the total number of docs relevant to the query
(Training set here: 819 queries, ≤ 2265 relevant docs)

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation
Write the log-likelihood of the query word sequence Q given the doc D, introducing the hidden mixture sequence K, and take the expectation over all possible mixture sequences K, conditioned on Q and D

HMM/N-gram-based Model
Explanation: the preceding derivation steps follow from the independence assumption (the query words are generated independently given the mixture components)

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation (cont.)
Express the expected log-likelihood using two auxiliary functions: one over the model parameters (the Φ-function) and one over the posterior of the hidden mixture components

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation (cont.)
We want the updated parameters to yield a log-likelihood that is no smaller than that of the current parameters

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation (cont.)
The log-likelihood difference has the following property: it exceeds the gain in the auxiliary function by a Kullback-Leibler (KL) distance, which is nonnegative by Jensen’s inequality

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation (cont.)
Therefore, to maximize the log-likelihood, we only need to maximize the Φ-function (auxiliary function)
If unigrams are used (Type I), the Φ-function can be further expressed in terms of the mixture weights and the unigram probabilities

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation (cont.)
The auxiliary function pairs the empirical distribution of the query terms with the model probabilities

HMM/N-gram-based Model
Expectation-Maximization training, Step 1: Expectation (cont.)
The Φ-function (auxiliary function) can be treated in two parts: one involving the mixture weights and one involving the N-gram probability distributions
The reestimation of the N-gram probabilities will not be discussed here!

HMM/N-gram-based Model
Expectation-Maximization training, Step 2: Maximization
Apply a Lagrange multiplier to enforce the constraint that the mixture weights sum to one

HMM/N-gram-based Model
Expectation-Maximization training, Step 2: Maximization (cont.)
Apply Lagrange multipliers for the normalization constraints and solve for the reestimated weights

HMM/N-gram-based Model
Expectation-Maximization training, Step 2: Maximization (cont.)
Extension: multiple training queries per doc; the weights remain tied among the docs (a code sketch follows below)
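
A minimal Python sketch of the tied-weight EM reestimation for the Type I (Uni) mixture, under the assumption that the relevant docs are known for each training query; all container names are illustrative.

```python
def em_update_weights(train_queries, relevant_docs, doc_lms, corpus_lm,
                      m1=0.5, m2=0.5, n_iters=10):
    """EM reestimation of the tied Type I mixture weights (m1, m2).

    train_queries: {query_id: [terms]}
    relevant_docs: {query_id: [doc_id, ...]}  relevant docs per query
    doc_lms:       {doc_id: {'uni': {term: prob}}}
    corpus_lm:     {'uni': {term: prob}}
    """
    for _ in range(n_iters):
        resp_sum, total = 0.0, 0.0
        for qid, terms in train_queries.items():
            for doc_id in relevant_docs[qid]:
                doc_uni = doc_lms[doc_id]['uni']
                for w in terms:
                    p_doc = m1 * doc_uni.get(w, 0.0)
                    p_bg = m2 * corpus_lm['uni'].get(w, 0.0)
                    denom = p_doc + p_bg
                    if denom > 0:
                        resp_sum += p_doc / denom  # posterior of the doc component
                    total += 1.0
        m1 = resp_sum / total
        m2 = 1.0 - m1
    return m1, m2
```

The inner ratio is the posterior probability that a query term was generated by the doc component rather than the corpus component; averaging it over all query terms, relevant docs, and training queries gives the new tied weight m1.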

HMM/N-gram-based Model
Experimental results with EM training (table comparing the HMM/N-gram-based approach and the vector space model): the HMM/N-gram-based approach is consistently better than the vector space model

Review: The EM Algorithm
Introduction to EM (Expectation-Maximization): why EM?
Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent (hidden) data; in our case, the state sequence is the latent data
Direct access to the data necessary to estimate the parameters is impossible or difficult
Two major steps:
E: take the expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
M: provide a new estimate of the parameters according to ML (or MAP)

Review: The EM Algorithm
The EM algorithm is important for HMMs and other learning techniques
It discovers new model parameters that maximize the log-likelihood of the incomplete data by iteratively maximizing the expectation of the log-likelihood of the complete data
Example: the observable training data O; we want to maximize P(O|θ), where θ is a parameter vector
The hidden (unobservable) data: e.g., which component density generated each observation, or the underlying state sequence in an HMM

HMM/N-gram-based Model
Minimum Classification Error (MCE) training
Given a query Q and a desired relevant doc D, define the classification error function as the difference between the score of the competing docs and the score of the relevant doc: a value > 0 means a misclassification, a value ≤ 0 means a correct decision
Transform the error function into a loss function (a sigmoid of the error), which lies in the range between 0 and 1

HMM/N-gram-based Model
Minimum Classification Error (MCE) training
Apply the loss function in the MCE procedure to iteratively update the weighting parameters
Constraints: the weights must stay nonnegative and sum to one, so a parameter transformation is used (e.g., for the Type I HMM, the weights are expressed through unconstrained parameters and normalized)
The transformed parameters are then updated iteratively (e.g., Type I HMM)

HMM/N-gram-based Model
Minimum Classification Error (MCE) training
Iteratively update the transformed parameters with the gradient of the loss (e.g., Type I HMM)

HMM/N-gram-based Model
Minimum Classification Error (MCE) training
Iteratively update (e.g., Type I HMM): the new weight is obtained from the old weight by a gradient step on the loss

HMM/N-gram-based Model
Minimum Classification Error (MCE) training: final equations
The weights are updated iteratively; the remaining weights can be updated in a similar way (see the sketch below)
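
The slides' exact MCE update equations are not reproduced here; the following is a minimal sketch of the general recipe they describe (a softmax parameter transformation, a sigmoid loss on the classification error, and a gradient step), using a numerical gradient and hypothetical names throughout.

```python
import math

def mce_update(w, query_terms, doc_lm, competing_lms, corpus_lm,
               lr=0.1, alpha=1.0):
    """One MCE gradient step on unconstrained parameters w = (w1, w2)
    for a Type I (Uni) mixture; m = softmax(w) keeps the weights valid.

    competing_lms: language models of competing (irrelevant) docs.
    The softmax/sigmoid choices and all names are illustrative assumptions.
    """
    def mix_weights(w):
        z = [math.exp(x) for x in w]
        s = sum(z)
        return [x / s for x in z]

    def log_score(m, lm):
        lp = 0.0
        for t in query_terms:
            p = m[0] * lm['uni'].get(t, 0.0) + m[1] * corpus_lm['uni'].get(t, 0.0)
            lp += math.log(p) if p > 0 else -1e9
        return lp

    def loss(w):
        m = mix_weights(w)
        # error = best competing score minus the relevant-doc score
        err = max(log_score(m, lm) for lm in competing_lms) - log_score(m, doc_lm)
        return 1.0 / (1.0 + math.exp(-alpha * err))   # sigmoid loss in (0, 1)

    # numerical gradient keeps the sketch short and library-free
    eps = 1e-4
    grad = []
    for i in range(len(w)):
        w_hi = list(w); w_hi[i] += eps
        w_lo = list(w); w_lo[i] -= eps
        grad.append((loss(w_hi) - loss(w_lo)) / (2 * eps))
    return [wi - lr * gi for wi, gi in zip(w, grad)]
```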

HMM/N-gram-based Model
Experimental results with MCE training (average precision vs. MCE iterations, word-based and syllable-based indexing, TQ/TD and TQ/SD conditions, 100 iterations): the results for the syllable-level index features were significantly improved over those before MCE training

HMM/N-gram-based Model
Advantages:
A formal mathematical framework
Uses collection statistics rather than heuristics
The retrieval system can be gradually improved through usage
Disadvantages:
Only literal term matching (a word-overlap measure)
The issue of relevance or aboutness is not taken into consideration
The implementation of relevance feedback or query expansion is not straightforward

Latent Semantic Indexing (LSI)
LSI: a technique that projects queries and docs into a space with “latent” semantic dimensions
Co-occurring terms are projected onto the same dimensions
In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
Dimensions of the reduced space correspond to the axes of greatest variation
Closely related to Principal Component Analysis (PCA)

Latent Semantic Indexing (LSI)
Dimension reduction and feature extraction: PCA projects the feature space onto k orthonormal basis vectors; SVD (in LSI) factors the m x n matrix A into U (m x r), Σ (r x r), and V^T (r x n), with r ≤ min(m, n); keeping only the k largest singular values gives A' and a k-dimensional latent semantic space

Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD) is used for the word-document matrix: a least-squares method for dimension reduction
Projection of a vector x onto a unit-length direction u: (x^T u) u, i.e., the component of x along u

Latent Semantic Indexing (LSI)
Frameworks to circumvent vocabulary mismatch (diagram): doc terms and query terms can be matched by literal term matching, enriched by doc expansion or query expansion, or mapped through a latent semantic structure model before retrieval

Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI)
Example query: “human computer interaction”; one of the query words is an OOV word

Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD) of the m x n word-document matrix (rows w1..wm, columns d1..dn):
A_mxn = U_mxr Σ_rxr V^T_rxn, with r ≤ min(m, n)
Both U and V have orthonormal column vectors: U^T U = I_rxr and V^T V = I_rxr
Truncating to the k ≤ r largest singular values gives the rank-k approximation
A'_mxn = U_mxk Σ_kxk V^T_kxn, with ||A||_F^2 ≥ ||A'||_F^2
Docs and queries are represented in a k-dimensional space; the axes can be properly weighted according to the associated diagonal values of Σ_k (a numpy sketch follows below)
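
A minimal numpy sketch of the truncated SVD described above; the function name and returned values are illustrative.

```python
import numpy as np

def lsi_decompose(A, k):
    """Rank-k SVD approximation of a word-document matrix A (terms x docs)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    A_k = U_k @ S_k @ Vt_k           # rank-k approximation A'
    term_vecs = U_k @ S_k            # rows: terms in the latent space
    doc_vecs = Vt_k.T @ S_k          # rows: docs in the latent space
    return U_k, S_k, Vt_k, A_k, term_vecs, doc_vecs
```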

Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD):
A^T A is a symmetric n x n matrix; all of its eigenvalues λ_j are nonnegative real numbers, and its eigenvectors v_j can be chosen orthonormal
Define the singular values σ_j = sqrt(λ_j), i.e., as the square roots of the eigenvalues of A^T A, or equivalently as the lengths of the vectors A v_1, A v_2, ..., A v_n (||A v_j|| = σ_j)
For λ_i ≠ 0, i = 1, ..., r, {A v_1, A v_2, ..., A v_r} is an orthogonal basis of Col A
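
A small numerical check of these two facts on a toy matrix (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))               # toy word-document matrix

# Singular values of A ...
sigma = np.linalg.svd(A, compute_uv=False)

# ... equal the square roots of the eigenvalues of A^T A (sorted descending)
eigvals, V = np.linalg.eigh(A.T @ A)
print(np.allclose(sigma, np.sqrt(eigvals[::-1])))                   # True

# and ||A v_j|| = sigma_j for the corresponding eigenvectors v_j
print(np.allclose(np.linalg.norm(A @ V[:, ::-1], axis=0), sigma))   # True
```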

Latent Semantic Indexing (LSI)
{A v_1, A v_2, ..., A v_r} is an orthogonal basis of Col A
Suppose that A (or A^T A) has rank r ≤ n
Define an orthonormal basis {u_1, u_2, ..., u_r} for Col A by u_i = A v_i / σ_i, and extend it to an orthonormal basis {u_1, u_2, ..., u_m} of R^m
V = [v_1 ... v_n] is an orthonormal matrix, and the factorization A = U Σ V^T follows

Latent Semantic Indexing (LSI)
(figure of the m x n matrix factorization)

Latent Semantic Indexing (LSI)
Fundamental comparisons based on SVD
In the original word-document matrix A (m x n, rows w1..wm, columns d1..dn):
compare two terms → dot product of two rows of A, i.e., an entry of A A^T
compare two docs → dot product of two columns of A, i.e., an entry of A^T A
compare a term and a doc → an individual entry of A
In the new word-document matrix A' (with U' = U_mxk, Σ' = Σ_k, V' = V_nxk):
compare two terms → dot product of two rows of U'Σ', since A'A'^T = (U'Σ'V'^T)(U'Σ'V'^T)^T = U'Σ'V'^T V'Σ'^T U'^T = (U'Σ')(U'Σ')^T
compare two docs → dot product of two rows of V'Σ', since A'^T A' = (U'Σ'V'^T)^T (U'Σ'V'^T) = (V'Σ')(V'Σ')^T
compare a query word and a doc → an individual entry of A'
Σ' only stretches or shrinks the axes

Latent Semantic Indexing (LSI)
Fold-in: find representations for pseudo-docs q, i.e., for objects (new queries or docs) that did not appear in the original analysis
Fold-in of a new m x 1 query (or doc) vector q: q_hat = q^T U_k Σ_k^(-1), which behaves just like a row of V (the query is represented by the weighted sum of its constituent term row vectors)
Relevance is then the cosine measure between the query and doc vectors in the latent semantic space; the separate dimensions are differentially weighted by Σ_k (a sketch follows below)
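
A minimal sketch of query fold-in and cosine ranking in the latent space, reusing U_k, S_k, and V_k = Vt_k.T from the earlier SVD sketch; the names and the exact weighting choice are assumptions.

```python
import numpy as np

def fold_in_query(q, U_k, S_k):
    """Map an m-dim term vector q (counts/weights) into the k-dim latent space.

    q_hat = q^T U_k S_k^{-1}, so the query behaves like an extra row of V_k.
    """
    return q @ U_k @ np.linalg.inv(S_k)

def rank_docs(q_hat, V_k, S_k):
    """Cosine similarity between the folded-in query and every doc.

    Both query and docs are multiplied by S_k, so the dimensions are
    differentially weighted by the singular values.
    """
    qv = q_hat @ S_k                 # 1 x k
    dv = V_k @ S_k                   # n x k (one row per doc)
    sims = dv @ qv / (np.linalg.norm(dv, axis=1) * np.linalg.norm(qv) + 1e-12)
    return np.argsort(-sims), sims
```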

Latent Semantic Indexing (LSI)
Fold-in of a new 1 x n term vector t: t_hat = t V_k Σ_k^(-1), which behaves just like a row of U_k

Latent Semantic Indexing (LSI)
Experimental results (recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection, using word-level indexing terms):
HMM is consistently better than VSM at all recall levels
LSI is better than VSM at higher recall levels

Latent Semantic Indexing (LSI)
Advantages:
A clean formal framework and a clearly defined optimization criterion (least squares)
Conceptual simplicity and clarity
Handles synonymy problems (“heterogeneous vocabulary”)
Good results for high-recall search
Takes term co-occurrence into account
Disadvantages:
High computational complexity
LSI offers only a partial solution to polysemy (e.g., bank, bass, ...)

Probabilistic Latent Semantic Analysis (PLSA)
Thomas Hofmann, 1999
Also called the aspect model, or Probabilistic Latent Semantic Indexing (PLSI)
Can be viewed as a complex HMM model
The latent variables are the unobservable class variables T_i (topics or domains)

Probabilistic Latent Semantic Analysis (PLSA)
Definition:
P(d_i): the probability of selecting a doc d_i
P(T_k|d_i): the probability of picking a latent class T_k for the doc d_i
P(w_j|T_k): the probability of generating a word w_j from the class T_k
The joint model: P(d_i, w_j) = P(d_i) Σ_k P(T_k|d_i) P(w_j|T_k)

Probabilistic Latent Semantic Analysis (PLSA)
Assumptions:
Bag-of-words: treat docs as a memoryless source; words are generated independently
Conditional independence: the doc and the word are independent conditioned on the state of the associated latent variable
Can be viewed as a set of doc HMMs in which the topics are tied among the docs

Probabilistic Latent Semantic Analysis (PLSA)
Probability estimation with the EM (expectation-maximization) algorithm, E (expectation) step:
Define the auxiliary function as the expectation of the complete-data log-likelihood, taken over the posterior of the latent classes, relating the empirical distribution (term counts) to the model
No query exemplars are introduced for training

Probabilistic Latent Semantic Analysis (PLSA)
Probability estimation with EM, E (expectation) step (cont.):
The log-likelihood can be further decomposed into the auxiliary function plus a Kullback-Leibler divergence term, so increasing the auxiliary function increases the likelihood

Probabilistic Latent Semantic Analysis (PLSA)
Probability estimation with EM, M (maximization) step:
Maximize the auxiliary function subject to the normalization constraints (Σ_j P(w_j|T_k) = 1 and Σ_k P(T_k|d_i) = 1), using Lagrange multipliers

Probabilistic Latent Semantic Analysis (PLSA)
Probability estimation with EM, M (maximization) step (cont.):
Differentiating and solving gives the training formulas:
P(w_j|T_k) = Σ_i c(w_j, d_i) P(T_k|d_i, w_j) / Σ_{j'} Σ_i c(w_{j'}, d_i) P(T_k|d_i, w_{j'})
P(T_k|d_i) = Σ_j c(w_j, d_i) P(T_k|d_i, w_j) / |d_i|
where P(T_k|d_i, w_j) ∝ P(T_k|d_i) P(w_j|T_k) is computed in the E step and c(w_j, d_i) is the count of w_j in d_i
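
A compact numpy sketch of this EM procedure on a term-document count matrix; the dense arrays and the variable names are illustrative simplifications.

```python
import numpy as np

def plsa(counts, K, n_iters=50, seed=0):
    """EM training of PLSA on a term-document count matrix.

    counts: (n_docs, n_words) array of c(w_j, d_i).
    Returns P(w|T) of shape (K, n_words) and P(T|d) of shape (n_docs, K).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_t = rng.random((K, n_words));  p_w_t /= p_w_t.sum(axis=1, keepdims=True)
    p_t_d = rng.random((n_docs, K));   p_t_d /= p_t_d.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E step: posterior P(T_k | d_i, w_j), shape (n_docs, n_words, K)
        post = p_t_d[:, None, :] * p_w_t.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12

        # M step: reestimate P(w|T) and P(T|d) from expected counts
        weighted = counts[:, :, None] * post            # c(w, d) * posterior
        p_w_t = weighted.sum(axis=0).T                  # (K, n_words)
        p_w_t /= p_w_t.sum(axis=1, keepdims=True) + 1e-12
        p_t_d = weighted.sum(axis=1)                    # (n_docs, K)
        p_t_d /= p_t_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_t, p_t_d
```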

Probabilistic Latent Semantic Analysis (PLSA)
Latent probability space: dimensionality K = 128 latent classes; example factors include medical imaging, image sequence analysis, context of contour/boundary detection, and phonetic segmentation

Probabilistic Latent Semantic Analysis (PLSA)
Latent probability space: the m x n matrix of joint probabilities P = [P(w_j, d_i)] can be written in an SVD-like form P = U Σ V^T, where U_mxr holds P(w_j|T_k), Σ_rxr is diagonal with the class priors P(T_k), and V^T_rxn holds P(d_i|T_k)

Probabilistic Latent Semantic Analysis (PLSA)
One more example, on the TDT-1 dataset: latent factors such as “aviation”, “space missions”, “family love”, and “Hollywood love”

Probabilistic Latent Semantic Analysis (PLSA)
Comparison with LSI
Decomposition/approximation:
LSI: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrices
PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model
Computational complexity:
LSI: SVD decomposition
PLSA: EM training, which is time-consuming over the iterations

Probabilistic Latent Semantic Analysis (PLSA)
Experimental results, PLSI-U*:
Two ways to smooth the empirical distribution with PLSI:
Combine the cosine score with that of the vector space model (as is also done for LSI)
Combine the multinomials individually
Both provide almost identical performance; it is not known how PLSA performs when used alone

Probabilistic Latent Semantic Analysis (PLSA)
Experimental results, PLSI-Q*:
Use the low-dimensional representations of the query and the doc (viewed in a k-dimensional latent space) to evaluate relevance by means of the cosine measure
Combine the cosine score with that of the vector space model
Use an ad hoc approach to reweight the different model components (dimensions)

Probabilistic Latent Semantic Analysis (PLSA) Experimental Results