Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004.

Abstract
- Discriminative vs. generative models: discriminative models have attractive theoretical properties
- Performance comparison: discriminative (maximum entropy, support vector machines) vs. generative (language modeling)
- Experiments: in ad-hoc retrieval, ME is worse than LM and SVMs are on par with LM; in home-page finding, SVMs are preferred over LM

Introduction
- Traditional IR: a problem of measuring the similarity between documents and a query, e.g. the Vector Space Model
  - Shortcoming: term weights are empirically tuned; there is no theoretical basis for computing optimum weights
- Binary Independence Retrieval (BIR): Robertson and Sparck Jones (1976)
  - The first model to view IR as a classification problem
  - This allows us to leverage many sophisticated techniques developed in the ML domain
- Discriminative models have had good success in many applications of ML

Discriminative and Generative Classifiers
- Pattern classification: the problem of classifying an example, based on its vector of features x, into a class C through a posterior probability P(C|x) or simply a confidence score g(C|x)
- Discriminative models: model the posterior directly, or learn a direct map from inputs x to the class labels C
- Generative models: model the class-conditional probability P(x|C) and the prior probability P(C), and estimate the posterior through Bayes' rule
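The generative route described above can be sketched in a few lines. This is a minimal illustration of estimating the posterior via Bayes' rule; the probability values are made up for the example, not taken from the paper.

```python
# Minimal sketch: a generative classifier estimates P(R|x) indirectly
# via Bayes' rule, while a discriminative one models P(R|x) directly.
# All numbers below are illustrative, not from the paper.

def posterior(prior_r, lik_r, prior_n, lik_n):
    """P(R|x) = P(x|R)P(R) / (P(x|R)P(R) + P(x|N)P(N))."""
    joint_r = lik_r * prior_r
    joint_n = lik_n * prior_n
    return joint_r / (joint_r + joint_n)

# Rare relevance prior, but the features strongly favor relevance.
p = posterior(prior_r=0.01, lik_r=0.6, prior_n=0.99, lik_n=0.05)
print(round(p, 3))  # -> 0.108
```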

Probabilistic IR Models as Classifiers (1/3)
- Binary Independence Retrieval (BIR) model
  - Ranking is done by the log-likelihood ratio of relevance
  - The model has not met with good empirical success, owing to the difficulty of estimating the class conditional P(x_i = 1 | R)
  - Assumes a uniform probability distribution over the entire vocabulary and updates the probabilities as relevant docs are provided by the user
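The BIR ranking on this slide can be sketched as follows. Writing p_i = P(x_i = 1 | R) and q_i = P(x_i = 1 | N), the score sums the log-likelihood ratio over query terms present in the document. The probability values here are illustrative, not estimated from data.

```python
import math

# Hedged sketch of BIR-style ranking (Robertson & Sparck Jones, 1976):
# score(D) = sum over matched terms of log[ p_i(1-q_i) / (q_i(1-p_i)) ].
def bir_score(doc_terms, p, q):
    score = 0.0
    for term in p:
        if term in doc_terms:
            score += math.log((p[term] * (1 - q[term]))
                              / (q[term] * (1 - p[term])))
    return score

p = {"retrieval": 0.8, "model": 0.5}  # illustrative P(x_i=1|R)
q = {"retrieval": 0.1, "model": 0.3}  # illustrative P(x_i=1|N)
print(bir_score({"retrieval", "model", "the"}, p, q))
```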

Probabilistic IR Models as Classifiers (2/3)
- Two-Poisson model
  - Follows the same framework as the BIR model, but uses a mixture of two Poisson distributions to model the class conditionals
  - This also is a generative model
  - Like the BIR model, it needs relevance feedback for accurate parameter estimation

Probabilistic IR Models as Classifiers (3/3)
- Language Models: Ponte and Croft (1998)
  - The ranking of a document is given by the probability of generating the query from the document's language model
  - This model circumvents the problem of estimating the model of relevant documents that the BIR and Two-Poisson models suffer from
  - LMs can be considered generative classifiers in a multi-class classification sense
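The query-likelihood idea on this slide can be sketched as below. Note the smoothing here is Jelinek-Mercer interpolation with the collection model; Ponte and Croft's original estimator differs, so treat this as an illustration of the general approach rather than their exact model.

```python
import math
from collections import Counter

# Sketch of query-likelihood ranking with Jelinek-Mercer smoothing:
# score(Q, D) = sum over query terms of log[ lam*P(w|D) + (1-lam)*P(w|C) ].
def query_log_likelihood(query, doc, collection, lam=0.5):
    d, c = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_doc = d[w] / dlen          # maximum-likelihood doc model
        p_coll = c[w] / clen         # collection background model
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

doc = "language models for information retrieval".split()
coll = ("language models for information retrieval "
        "retrieval of documents with language models").split()
print(query_log_likelihood(["language", "retrieval"], doc, coll))
```

A document containing the query terms scores higher than one that only matches through the background model, which is the behavior the ranking relies on.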

The Case for Discriminative Models for IR (1/3)
- Discriminative vs. generative: one should solve the classification problem directly, and never solve a more general problem (the class-conditional) as an intermediate step
- Model assumptions
  - GM: terms are conditionally independent; LMs assume documents obey a multinomial distribution of terms
  - DM: typically makes few assumptions and, in a sense, lets the data speak for itself

The Case for Discriminative Models for IR (2/3)
- Expressiveness
  - GM: LMs are not expressive enough to incorporate many features into the model
  - DM: can include all features effortlessly in a single model
- Learning arbitrary features
  - In view of the many query-dependent and query-independent document features and user preferences that influence relevance, we believe a DM that learns all the features is best suited for the generalized IR problem

The Case for Discriminative Models for IR (3/3)
- Notion of relevance
  - In LM there is no explicit notion of relevance; there has been considerable controversy over the missing relevance variable in LM
  - We believe that Robertson's view of IR as a binary classification problem of relevance is more realistic than the implicit notion of relevance as it exists in LM

Discriminative Models Used in Current Work (1/2)
- Maximum Entropy (ME) model
  - The principle of ME: model all that is known and assume nothing about that which is unknown
  - The parametric form of the ME probability function is an exponential of a weighted sum of features
  - The feature weights (λ) are learned from training data using a fast gradient descent algorithm
  - As in Robertson's BIR model, we use the log-likelihood ratio as the scoring function for ranking
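The slide's equations are images in this transcript; in the standard conditional maximum-entropy formulation (feature functions f_i, weights λ_i, normalizer Z), the parametric form and the log-odds ranking score are:

```latex
P(R \mid D, Q) = \frac{1}{Z(D,Q)} \exp\!\Big( \sum_i \lambda_i \, f_i(D,Q) \Big),
\qquad
\mathrm{score}(D,Q) = \log \frac{P(R \mid D, Q)}{P(\bar{R} \mid D, Q)}
```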

Discriminative Models Used in Current Work (2/2)
- Support Vector Machines (SVMs)
  - Basic idea: find the hyperplane that separates the two classes of training examples with the largest margin
  - If f(D,Q) is the vector of features, the discriminant function is a linear function of f(D,Q)
  - The SVM is trained such that g(R|D,Q) >= 1 for positive (relevant) examples and g(R|D,Q) <= -1 for negative (non-relevant) examples, as long as the data is separable
- Both DMs retain the basic framework of the BIR model while avoiding the class-conditional estimation; instead they directly compute the posterior P(R|Q,D) or the mapping function g(R|D,Q)
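The linear discriminant above amounts to a dot product plus a bias; documents are then ranked by this score. The weights and feature vector below are hypothetical placeholders, not learned values from the paper.

```python
# Sketch of the linear decision function g(R|D,Q) = w . f(D,Q) + b
# after training. Weights and features are illustrative only.
def svm_score(features, weights, bias):
    return sum(w * x for w, x in zip(weights, features)) + bias

weights = [0.7, 1.2, -0.3]  # one learned weight per feature (hypothetical)
bias = -0.5
f_dq = [0.4, 0.9, 0.1]      # hypothetical feature vector for a (D, Q) pair
print(svm_score(f_dq, weights, bias))  # documents are ranked by this score
```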

Other Modeling Issues
- Out-of-vocabulary (OOV) words problem
  - Test queries are almost always guaranteed to contain words not seen in the training queries
  - The features are therefore not based on the words themselves, but on query-based statistics of documents, such as the total frequency of occurrences or the sum of the idf values of the query terms
- Unbalanced data
  - One class (non-relevant) accounts for a large portion of all the examples, while the other (relevant) class has only a small percentage of the examples
  - Remedies: over-sampling the minority class, or under-sampling the majority class
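Query-based features of the kind described above can be sketched as follows. The three features here (normalized total tf, summed idf, fraction of query terms matched) are illustrative stand-ins, not the paper's exact feature set; the point is that they aggregate over query terms, so the feature space does not depend on the vocabulary of the training queries.

```python
import math

# Hedged sketch of vocabulary-independent, query-based document features.
def query_features(query, doc_tf, doc_len, df, num_docs):
    total_tf = sum(doc_tf.get(w, 0) for w in query)
    sum_idf = sum(math.log(num_docs / df[w]) for w in query if w in df)
    frac_matched = sum(1 for w in query if doc_tf.get(w, 0) > 0) / len(query)
    return [total_tf / doc_len, sum_idf, frac_matched]

doc_tf = {"entropy": 3, "model": 1}   # term frequencies in one document
feats = query_features(["entropy", "model"], doc_tf, doc_len=100,
                       df={"entropy": 5, "model": 50}, num_docs=1000)
print(feats)
```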

Experiments and Results (1/8)
- Ad-hoc retrieval
  - Data sets preprocessed with the K-stemmer and stop-word removal
  - Only title queries are used for retrieval
- LM
  - Training the LM consists of learning the optimal value of the smoothing parameter
  - All LM runs were performed using Lemur

Experiments and Results (2/8)
- DM features
- SVM
  - SVM-light used for the SVM runs
  - A linear kernel gives the best performance on most data sets (and converges rapidly)
- ME
  - The toolkit of Zhang

Experiments and Results (3/8)
- Comparison of the performance of LM, SVM and ME
  - In 50% (8/16) of the runs the difference is statistically indistinguishable; in 12.5% (2/16) SVM is statistically better than LM; in 37.5% (6/16) LM is superior to SVM

Experiments and Results (4/8)
- Discussion
  - Official TREC runs use query expansion
  - DMs can improve performance by including other features, such as proximity of query terms or the occurrence of query terms as noun phrases; such features would not be easy to incorporate into the LM framework
  - With the emergence of modern IR collections such as the web and scientific literature, which are characterized by a diverse variety of features, we will increasingly rely on models that can automatically learn these features from examples

Experiments and Results (5/8)
- Home-page finding on a web collection
  - Chose the home-page finding task of TREC-10, where many features such as title, anchor text and link structure influence relevance
  - Example: return the web page http://trec.nist.gov when the query "Text Retrieval Conference" is issued
  - Corpus: WT10G
  - Queries: 50 for training, 50 for development and 145 for testing
- Evaluation metrics
  - Mean reciprocal rank (MRR)
  - Success rate: an answer is found in the top 10
  - Failure rate: no answer is returned in the top 100
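The three metrics above can be computed from the rank of the first correct answer per query. The ranks below are made-up example values, not results from the paper.

```python
# Sketch of the evaluation metrics named on the slide: mean reciprocal
# rank (MRR), success rate (answer in top 10), failure rate (no answer
# in top 100). `ranks` holds the rank of the first correct answer per
# query, or None if it was never retrieved.
def evaluate(ranks):
    mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)
    success = sum(1 for r in ranks if r is not None and r <= 10) / len(ranks)
    failure = sum(1 for r in ranks if r is None or r > 100) / len(ranks)
    return mrr, success, failure

mrr, success, failure = evaluate([1, 3, None, 12])  # illustrative ranks
print(round(mrr, 3), success, failure)
```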

Experiments and Results (6/8)
- Three indexes
  - A content index consisting of the textual content of the documents with all HTML tags removed
  - An index of the anchor-text documents
  - An index of the titles of all documents
- 20 features
  - The 6 previous features, computed on each of the three indexes
  - Two additional features:
    - URL depth: a home page is typically at depth 1
    - Link factor

Experiments and Results (7/8)
- Performance on the development set (table)
- Performance on the test set (table)

Experiments and Results (8/8)
- Discussion
  - SVMs leverage a variety of features and improve on the baseline LM performance by 48.6% in MRR
  - The best run in TREC-10 achieved an MRR of 0.77 on the test set; however, their feature weights were optimized empirically, while our models learn them automatically
  - These results only demonstrate the learning ability of SVMs; we believe a lot more needs to be done in defining the right kind of features, such as PageRank for the link-factor feature

Related Work
- A few prior attempts at applying discriminative models to IR:
  - Cooper and Huizinga make a strong case for applying the maximum entropy approach to the problems of information retrieval
  - Kantor and Lee extend the analysis of the principle of maximum entropy in the context of information retrieval
  - Greiff and Ponte showed that the classic binary independence model and the maximum entropy approach are equivalent
  - Gey suggested the method of logistic regression, which is equivalent to the method of maximum entropy used in this work

Conclusion and Future Work (1/2)
- Treating IR as a problem of binary classification
  - Quantifies relevance explicitly
  - Permits us to apply sophisticated pattern classification techniques
- Applied SVMs and ME models to IR
  - Their main utility to IR lies in their ability to automatically learn a variety of features that influence relevance
- Ad-hoc retrieval: SVMs perform as well as LMs
- Home-page finding: SVMs outperform the baseline runs by about 50% in MRR

Conclusion and Future Work (2/2)
- Future work
  - Further improvement through better feature engineering and by leveraging the huge body of literature on SVMs and other learning algorithms
  - Evaluate the performance of SVMs on the ad-hoc retrieval task with longer queries
  - Add features such as proximity of query terms, synonyms, etc.
  - Study user modeling by incorporating user preferences as features in the SVMs