Using Asymmetric Distributions to Improve Text Classifier Probability Estimates
Paul N. Bennett, Computer Science Dept., Carnegie Mellon University
SIGIR 2003

Abstract
• Text classifiers that give probability estimates are more readily applicable in a variety of scenarios.
• The quality of these estimates is crucial.
• Review: a variety of standard approaches to converting scores (and poor probability estimates) from a text classifier into high-quality estimates.

Cont’d
• New models: motivated by the intuition that the empirical score distributions for the “extremely irrelevant”, “hard to discriminate”, and “obviously relevant” documents often differ significantly.

Problem Definition & Approach
• Differences from earlier approaches:
–Asymmetric parametric models suitable for use when little training data is available
–Explicitly analyze the quality of probability estimates and provide significance tests
–Target text classifier outputs, whereas the majority of the previous literature targeted the output of search engines

Problem Definition

Cont’d
• There are two general types of parametric approaches:
–Fit the posterior function directly, i.e., there is one function estimator that performs a direct mapping of the score s to the probability P(+|s(d))
–Break the problem down as shown in the gray box: an estimator for each of the class-conditional densities (p(s|+) and p(s|-)) is produced, then Bayes’ rule and the class priors are used to obtain the estimate for P(+|s(d)) (a small sketch of this second route follows below)
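As an illustration of the second route (not code from the paper), the conversion is just Bayes’ rule, P(+|s) = p(s|+)P(+) / (p(s|+)P(+) + p(s|-)P(-)). A minimal Python sketch, where density_pos and density_neg stand in for whatever parametric density estimators are fit to the scores:

    def posterior_positive(s, density_pos, density_neg, prior_pos):
        """Estimate P(+|s) from class-conditional densities via Bayes' rule."""
        prior_neg = 1.0 - prior_pos
        joint_pos = density_pos(s) * prior_pos   # p(s|+) P(+)
        joint_neg = density_neg(s) * prior_neg   # p(s|-) P(-)
        return joint_pos / (joint_pos + joint_neg)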

Motivation for Asymmetric Distributions
• Using standard Gaussians fails to capitalize on a basic characteristic commonly seen in the score distributions.
• Intuitively, the area between the modes corresponds to the hard examples, which are difficult for this classifier to distinguish, while the areas outside the modes are the extreme examples that are usually easily distinguished.

Cont’d

• Ideally, there will exist scores θ− and θ+ such that all examples with a score greater than θ+ are relevant and all examples with a score less than θ− are irrelevant.
• The distance |θ− − θ+| corresponds to the margin in some classifiers, and an attempt is often made to maximize this quantity.
• Because text classifiers have training data to use to separate the classes, the final behavior of the score distributions is primarily a function of the amount of training data and the consequent class separation achieved.

Cont’d
• Practically, some examples will fall between θ− and θ+, and it is often important to estimate the probabilities of these examples well (since they correspond to the “hard” examples).
• Justifications can be given both for why you may find more examples between θ− and θ+ than outside of them and for why you may find fewer, but there are few empirical reasons to believe that the distributions should be symmetric.
• A natural first candidate for an asymmetric distribution is to generalize a common symmetric distribution, e.g., the Laplace or the Gaussian.

Asymmetric Laplace
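The slide shows the asymmetric Laplace density. A standard parameterization with mode θ and separate inverse scales β (left of the mode) and γ (right of the mode), consistent with the paper's description though possibly differing in notation, is

\[
p(x \mid \theta, \beta, \gamma) =
\begin{cases}
\dfrac{\beta\gamma}{\beta+\gamma}\, e^{-\beta(\theta - x)}, & x \le \theta,\\[6pt]
\dfrac{\beta\gamma}{\beta+\gamma}\, e^{-\gamma(x - \theta)}, & x > \theta,
\end{cases}
\]

which reduces to the symmetric Laplace when β = γ.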

Asymmetric Gaussian
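Likewise, the asymmetric Gaussian can be written with mode θ and separate standard deviations σ_l and σ_r on each side (again a standard form, reconstructed rather than copied from the slide):

\[
p(x \mid \theta, \sigma_l, \sigma_r) =
\begin{cases}
\dfrac{2}{\sqrt{2\pi}\,(\sigma_l + \sigma_r)}\, \exp\!\left(-\dfrac{(x-\theta)^2}{2\sigma_l^2}\right), & x \le \theta,\\[6pt]
\dfrac{2}{\sqrt{2\pi}\,(\sigma_l + \sigma_r)}\, \exp\!\left(-\dfrac{(x-\theta)^2}{2\sigma_r^2}\right), & x > \theta,
\end{cases}
\]

which integrates to one and reduces to the ordinary Gaussian when σ_l = σ_r.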

Gaussians vs. Asymmetric Gaussian

Parameter Estimation
• Two choices:
–(1) Use numerical estimation to estimate all three parameters at once.
–(2) Fix the value of θ, estimate the other two given this choice of θ, and then consider alternate values of θ.
• Because of the simplicity of analysis in the latter alternative, we choose this method (a small sketch follows below).
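To make option (2) concrete for the asymmetric Laplace, here is a minimal sketch (my illustration, not the paper's code): sweep θ over the observed scores, plug in the closed-form MLEs for β and γ given θ (shown on the next slide), and keep the θ with the highest log-likelihood.

    import numpy as np

    def fit_asymmetric_laplace(scores):
        """Fit (theta, beta, gamma) by sweeping theta over the observed scores."""
        scores = np.asarray(scores, dtype=float)
        n = len(scores)
        best_loglik, best_params = -np.inf, None
        for theta in np.unique(scores):
            s_l = np.sum(theta - scores[scores <= theta])  # left absolute deviations
            s_r = np.sum(scores[scores > theta] - theta)   # right absolute deviations
            if s_l == 0 or s_r == 0:                       # degenerate split; skip
                continue
            root = np.sqrt(s_l * s_r)
            beta, gamma = n / (s_l + root), n / (s_r + root)  # MLEs given theta
            loglik = n * np.log(beta * gamma / (beta + gamma)) - beta * s_l - gamma * s_r
            if loglik > best_loglik:
                best_loglik, best_params = loglik, (theta, beta, gamma)
        return best_params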

Asymmetric Laplace MLEs
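With θ fixed, write N for the number of scores, S_l = Σ_{x ≤ θ} (θ − x), and S_r = Σ_{x > θ} (x − θ). Setting the derivatives of the asymmetric Laplace log-likelihood to zero then gives closed-form estimates (a standard derivation; the slide's notation may differ):

\[
\hat{\beta} = \frac{N}{S_l + \sqrt{S_l S_r}}, \qquad
\hat{\gamma} = \frac{N}{S_r + \sqrt{S_l S_r}}.
\]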

Asymmetric Gaussian MLEs
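For the asymmetric Gaussian with θ fixed, let SS_l = Σ_{x ≤ θ} (θ − x)² and SS_r = Σ_{x > θ} (x − θ)². The same procedure yields (again a reconstruction of the standard result, not copied from the slide):

\[
\hat{\sigma}_l^2 = \frac{SS_l + SS_l^{2/3}\, SS_r^{1/3}}{N}, \qquad
\hat{\sigma}_r^2 = \frac{SS_r + SS_r^{2/3}\, SS_l^{1/3}}{N}.
\]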

Methods Compared
• Gaussians
• Asymmetric Gaussians
• Laplace Distributions
• Asymmetric Laplace Distributions
• Logistic Regression (a small calibration sketch follows below)
• Logistic Regression with Noisy Class Labels
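For contrast with the generative fits above, logistic regression fits the posterior directly from the classifier score, in the spirit of Platt's method. A minimal sketch using scikit-learn (my illustration, not the paper's implementation):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_score_calibrator(scores, labels):
        """Fit P(+|s) = sigmoid(a*s + b) from held-out scores and binary labels."""
        model = LogisticRegression()
        model.fit(np.asarray(scores, dtype=float).reshape(-1, 1), labels)
        return lambda s: model.predict_proba(np.asarray(s, dtype=float).reshape(-1, 1))[:, 1]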

Data
• MSN Web Directory
–A large collection of heterogeneous web pages that have been hierarchically classified.
–13 categories used, train/test = 50078/10024
• Reuters
–The Reuters corpus.
–135 classes, train/test = 9603/3299
• TREC-AP
–A collection of AP news stories from 1988 to 1990.
–20 categories, train/test = /66992

Performance Measures
• Log-loss
–For a document d with class c(d) ∈ {+, −}, log-loss is defined as below, where δ(a, b) = 1 if a = b and 0 otherwise.
• Squared error
• Error
–How the methods would perform if a false positive were penalized the same as a false negative.
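In the standard form for binary probability estimates (a reconstruction; notation may differ slightly from the slide):

\[
\text{log-loss}(d) = \delta(c(d), +)\, \log \hat{P}(+ \mid d) + \delta(c(d), -)\, \log \hat{P}(- \mid d),
\]
\[
\text{squared error}(d) = \big(\delta(c(d), +) - \hat{P}(+ \mid d)\big)^2 + \big(\delta(c(d), -) - \hat{P}(- \mid d)\big)^2 .
\]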

Results & Discussion

Cont’d
• A. Laplace, LR+Noise, and LogReg quite clearly outperform the other methods.
• LR+Noise and LogReg tend to perform slightly better than A. Laplace on some tasks with respect to log-loss and squared error.
• However, A. Laplace always produces the fewest errors across all the tasks.

Goodness of Fit – naive Bayes

Cont’d -- SVM

LogOdds vs. s(d) – naive Bayes

Cont’d -- SVM

Gaussian vs. Laplace
• The asymmetric Gaussian tends to place the mode more accurately than a symmetric Gaussian.
• However, the asymmetric Gaussian distributes too much mass to the outer tails while failing to fit around the mode accurately enough.
• A. Gaussian is penalized quite heavily when outliers are present.

Cont’d
• The asymmetric Laplace places much more emphasis around the mode.
• Even in cases where the test distribution differs from the training distribution, A. Laplace still yields a solution that gives a better fit than LogReg.