Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004.

Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004

Abstract Discriminative model vs. Generative model Discriminative – attractive theoretical properties Performance Comparison Discriminative – Maximum Entropy, Support Vector Machine Generative – Language Modeling Experiment Ad-hoc Retrieval ME is worse than LM, SVM are on par with LM Home-Page Finding Prefer SVM over LM

Introduction Traditional IR A problem of measuring the similarity between docs and query, such as Vector Space Model Shortcoming Term-weights – empirically tuned No theoretical basis for computing optimum weights Binary independence retrieval (BIR) Robertson and Sparck Jones (1976) First model that viewed IR as a classification problem This allows us to leverage many sophisticated techniques developed in ML domanin Discriminative models Good success in many applications of ML

Discriminative and Generative Classifiers Pattern Classification The problem of classifying an example based on its vector of features x into its class C through a posterior probability P(C|x) or simply a confidence score g(C|x) Discriminative models Model the posterior directly or learn a direct map from inputs x to the class labels C Generative models Model the class-conditional probability P(x|C) and the prior probability P(C) and estimate the posterior through the Bayes ’ rule

Probabilistic IR models as Classifiers (1/3) Binary Independence Retrieval (BIR) model Ranking is done by the log-likelihood ration of relevance The model has not met with good empirical success owing to the difficulty in estimating the class conditional P(x i =1|R) Assume uniform probability distribution over the entire vocabulary and update the probabilities as relevant docs are provided by the user

Probabilistic IR models as Classifiers (2/3) Two-Poisson model Follow the same framework as that of BIR model, but they use a mixture of two Poisson distributions to model the class conditions and This also is a generative model Similar to the BIR model, it also needs relevance feedback for accurate parameter estimation

Probabilistic IR models as Classifiers (3/3) Language Models Ponte and Croft (1998) The ranking of a doc is given by the probability of generation of the query from doc ’ s language model This model circumvent the problem of estimating the model of relevant documents that the BIR model and Two-Poisson suffer from LM can be considered generative classifiers in a multi-class classification sense

The Case for Discriminative Models for IR (1/3) Discriminative vs. Generative One should solve the problem (classification) directly and never solve a more general problem (class-conditional) as an intermediate step Model Assumptions GM Terms are conditionally independent LM assume docs obey a multinomial distribution of terms DM Typically make few assumptions and in a sense, let the data speak for itself.

The Case for Discriminative Models for IR (2/3) Expressiveness GM - LM are not expressive enough to incorporate many features into the model DM - It can include all features effortlessly into a single model Learning arbitrary features In view of the many query dependent and query- independent doc features and user-preferences that influence features, we believe that a DM that learns all the features is best suited for the generalized IR problem

The Case for Discriminative Models for IR (3/3) Notion of Relevance In LM, there is no explicit notion of relevance. There has been considerable controversy on the missing relevance variable in LM. We believe that Robertson ’ s view of IR as a binary classification problem of relevance is more realistic than the implicit notion of relevance as it exists in LM.

Discriminative Models Used in Current Work (1/2) Maximum Entropy Model The principle of ME – model all that is known and assume nothing about that which is unknown The parametric form of the ME probability function can be expressed by The feature weights (λ) are learned from training data using a fast gradient descent algorithm As in Robertson ’ s BIR model, we use the log-likelihood ratio as the scoring function for ranking as shown follows

Discriminative Models Used in Current Work (2/2) Support Vector Machines (SVM) Basic idea – find the hyper-plane that separate the two classes of training examples with the largest margin If f(D,Q) is the vector of features, then the discriminative function is given by The SVM is trained such that g(R|D,Q)>=1 for positive (relevant) examples and g(R|D,Q) <= -1 for negative (non-relevant) examples as long as the data is separable Both DMs Retaining the basic framework of the BIR model, while avoid estimating the class-conditional and instead directly compute the posterior P(R|Q,D) or the mapping function g(R|R,Q)

Other Modeling Issues Out of Vocabulary words (OOV) problem Test queries are almost always guaranteed to contain words that are not seen in the training queries The features are not based on words themselves, but on query-based statistics of documents such as the total frequency of occurrences or the sum-total of the idf-values of the query terms Unbalanced data The classes (non-relevant) is a large portion of all the examples, while the other (relevant) class have only a small percent of of the examples Over-sampling the minority class Under-sampling the majority class

Experiments and Results (1/8) Ad-hoc retrieval Data set Preprocessing K-stemmer and removing stop-words Only use title queries for retrieval LM Training the LM consists of learning the optimal value of the smoothing parameter All LM runs were performed using Lemur

Experiments and Results (2/8) DM Features SVM svm-light for SVM runs Linear kernel gives the best performance on most data set (converge rapidly) ME The toolkit of Zhang

Experiments and Results (3/8) The comparison of performance of LM, SVM and ME 50% (8/16) is indistinguishable; 12.5% (2/16) is that SVM is statistically better than LM; 37.5 (6/16) is that LM is superior to SVM

Experiments and Results (4/8) Discussion Official TREC runs – query expansion DM can improve the performance by including other features such as proximity of query terms, occurrence of query terms as noun-phrases, etc. Such features would not be easy to incorporate into the LM framework In the emergence of modern IR collections such as web and scientific literature that are characterized by a diverse variety of features, we will increasingly rely on models that can automatically learn these features from examples

Experiments and Results (5/8) Home-page finding on web collection Choose the home-page finding task of TREC-10 where many features such as title, anchor-text and link structure influence relevance Example – returned the web-page http://trec.nist.gov where the query “ Text Retrieval Conference ” is issuedhttp://trec.nist.gov Corpus – WT10G Queries 50 for training, 50 for development and 145 for testing Evaluation Metrics The mean reciprocal rank (MRR) Success rate – an answer is found in top 10 Failure rate – no answer is returned in top 100

Experiments and Results (6/8) Three Indexes A content index consisting of the textual content of the documents with all the HTML tags removed An index of the anchor text documents An index of the titles of all documents 20 Features Use 6 previous features from each of the three indexes Two additional features URL-depth – a home-page typically is at depth 1 Link-Factor

Experiments and Results (7/8) Performance on development set Performance on test set

Experiments and Results (8/8) Discussion SVMs leverage a variety of features and improve on the baseline LM performance by 48.6% in MRR The best run in TREC-10 achieved an MRR of 0.77 on the test set; however, their feature weights were optimized using empirical means while our models learn them automatically. Only demonstrate the learning ability of SVMs We believe there is a lot more that needs to be in defining the right kind of features, such as PageRank for the link- factor feature.

Related Works A few attempts in applying discriminative models for IR Cooper and Huizinga make a strong case for applying the maximum entropy approach to the problems of information retrieval Kantor and Lee extend the analysis of the principle of maximum entropy in the context of information retrieval Greiff and Ponte showed that the classic binary independence model and the maximum entropy approach are equivalent Gey suggested the method of logistic regression, which is equivalent to the method of maximum entropy used in our work

Conclusion and Future Work (1/2) Treat IR as a problem of binary classification Quantify relevance explicitly Permit us apply sophisticated pattern classification technique Explore SVMs and MEs to IR Their main utility to IR lies in their ability to learn automatically a variety of features that influence relevance Ad-hoc retrieval SVMs perform as well as LMs Home-page finding SVMs outperform the baseline runs by about 50% in MRR

Conclusion and Future Work (2/2) Future Work Further improvement through better feature engineering and by leveraging a huge body of literature on SMVs and other learning algorithms Evaluate the performance of SVMs on ad-hoc retrieval task with longer queries Enhance features such as proximity of query terms, synonyms, etc. Study user modeling by incorporating user-preferences as features in the SVMs

Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004.

Similar presentations

Presentation on theme: "Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004.

Similar presentations

Presentation on theme: "Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004."— Presentation transcript:

Similar presentations

About project

Feedback