Tackling the Poor Assumptions of Naive Bayes Text Classifiers Published by: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David R. Karger Presented by: Liang Lan 11/19/2007

Outline Introduce the Multinomial Naive Bayes model for text classification. The poor assumptions of the Multinomial Naive Bayes model. Solutions to some problems of the Naive Bayes classifier.

Multinomial Naive Bayes Model for Text Classification Given: A description of the document d as a word-frequency vector f = (f_1, …, f_n), where f_i is the frequency count of word i occurring in document d. A fixed set of classes C = {1, 2, …, m}. A parameter vector for each class: the parameter vector for class c is θ_c = (θ_c1, …, θ_cn), where θ_ci is the probability that word i occurs in class c. Determine: the class label of d.

Introducing the Multinomial Naive Bayes Model for Text Classification The likelihood of a document is a product of the parameters of the words that appear in the document. The classifier selects the class with the largest posterior probability.
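In symbols, a minimal sketch of the multinomial likelihood and the MAP decision rule, using the f_i and θ_ci defined above:

```latex
% Multinomial likelihood of a document with word counts f_i under class c
p(d \mid \theta_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i \theta_{ci}^{f_i}

% MAP rule: select the class with the largest posterior probability
l(d) = \arg\max_c \left[ \log p(\theta_c) + \sum_i f_i \log \theta_{ci} \right]
```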

Parameter Estimation for the Naive Bayes Model The parameters θ_ci must be estimated from the training data with a smoothed maximum-likelihood estimate: θ̂_ci = (N_ci + α_i) / (N_c + α), where N_ci is the number of times word i occurs in training documents of class c, N_c = Σ_i N_ci, and α_i is a smoothing parameter with α = Σ_i α_i. This gives the MNB classifier. For simplicity, we use a uniform class prior. With weights w_ci = log θ̂_ci, the decision rule is l_MNB(d) = argmax_c Σ_i f_i w_ci.
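As a concrete illustration, here is a minimal Python sketch of MNB training and prediction following the estimate above. The function names (train_mnb, predict_mnb) are illustrative, and documents are assumed to already be word-count dictionaries.

```python
import math
from collections import defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate weights w_ci = log((N_ci + alpha) / (N_c + alpha * |V|)).

    docs: list of {word: count} dicts; labels: list of class labels.
    """
    vocab = {w for d in docs for w in d}
    counts = defaultdict(lambda: defaultdict(float))  # counts[c][w] = N_ci
    totals = defaultdict(float)                       # totals[c]   = N_c
    for d, c in zip(docs, labels):
        for w, f in d.items():
            counts[c][w] += f
            totals[c] += f
    weights = {}
    for c in counts:
        denom = totals[c] + alpha * len(vocab)
        weights[c] = {w: math.log((counts[c][w] + alpha) / denom) for w in vocab}
    return weights

def predict_mnb(weights, doc):
    """Uniform class prior: l_MNB(d) = argmax_c sum_i f_i * w_ci.

    Words unseen in training contribute equally to every class and are skipped.
    """
    return max(weights, key=lambda c: sum(f * weights[c][w]
                                          for w, f in doc.items() if w in weights[c]))
```

For example, predict_mnb(train_mnb(docs, labels), {"boston": 2, "game": 1}) returns the label whose word distribution best explains the counts.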

The Poor Assumptions of the Multinomial Naive Bayes Model Two systemic errors (occurring in any Naive Bayes classifier): 1. Skewed data bias (caused by uneven training set sizes). 2. Weight magnitude errors (caused by the independence assumption). In addition, the multinomial distribution does not model text well.

Correcting the Skewed Data Bias Having more training examples for one class than another can cause the classifier to prefer one class over the other. Solution: use Complement Naive Bayes (CNB), which estimates each class's word probabilities from the training documents of all other classes; Ñ_ci is the number of times word i occurred in documents in classes other than c.
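A sketch of the complement estimate and the resulting decision rule, in the notation above (with Ñ_c = Σ_i Ñ_ci, and α_i, α as in the MNB estimate):

```latex
% Complement estimate: word counts come from every class other than c
\tilde{\theta}_{ci} = \frac{\tilde{N}_{ci} + \alpha_i}{\tilde{N}_{c} + \alpha}

% CNB rule: assign the class whose complement fits the document worst
l_{\mathrm{CNB}}(d) = \arg\max_c \left[ \log p(\theta_c) - \sum_i f_i \log \tilde{\theta}_{ci} \right]
```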

Correcting the Weight Magnitude Errors These errors are caused by the independence assumption. Ex: "San Francisco" contributes twice as much weight as "Boston" because its two words are counted as independent evidence. Solution: normalize the weight vectors (see the formula below). We call this Weight-normalized Complement Naive Bayes (WCNB).
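A sketch of the normalization, assuming w_ci = log θ̃_ci as above:

```latex
% Scale each class's weight vector to unit L1 norm
w_{ci} \leftarrow \frac{w_{ci}}{\sum_{k} \lvert w_{ck} \rvert}
```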

Modeling Text Better Three transforms: (1) transforming term frequency, (2) transforming by document frequency, (3) transforming based on length.

Transforming Term Frequency The empirical term distribution has heavier tails than predicted by the multinomial model, appearing instead like a power-law distribution. A power-law probability can also be written in multinomial form, so we can use the multinomial model to generate probabilities proportional to a class of power-law distributions via a simple transform of the counts (see below).
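A sketch of the connection and the transform it motivates (the paper sets d = 1):

```latex
% A power-law term in (d + f_i) equals a multinomial-style term in log-counts
(d + f_i)^{\log \theta_{ci}} = \theta_{ci}^{\log(d + f_i)}

% Term-frequency transform applied to every count (with d = 1)
f_i' = \log(1 + f_i)
```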

Transforming by Document Frequency Common words are unlikely to be related to the class of a document, but random variations can create apparently fictitious correlations. Solution: discount the weight of common words using inverse document frequency (IDF), a common IR transform that discounts terms by their document frequency (see below).
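A sketch of the IDF transform, where δ_ij = 1 if word i occurs in document j and 0 otherwise:

```latex
% Discount a term by the fraction of documents it appears in
f_i' = f_i \log \frac{\sum_j 1}{\sum_j \delta_{ij}}
```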

Transforming Based on Length The jump in probability from repeated occurrences of a term is disproportionately large for long documents. Solution: discount the influence of long documents by normalizing the term-frequency vector (see below):
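A sketch of the length normalization, dividing each count by the document's L2 norm:

```latex
% Normalize the count vector so long documents do not dominate
f_i' = \frac{f_i}{\sqrt{\sum_k (f_k)^2}}
```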

The New Naive Bayes Procedure Combining the corrections above — the text transforms, the complement estimate, and weight normalization — gives the Transformed Weight-normalized Complement Naive Bayes (TWCNB) procedure, sketched below.
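A minimal Python sketch of the full procedure under the assumptions above (dense count matrices, numpy only; the function names are illustrative, not the paper's code):

```python
import numpy as np

def twcnb_train(X, y, alpha=1.0):
    """Train Transformed Weight-normalized Complement Naive Bayes (TWCNB).

    X: (n_docs, n_words) array of raw term counts; y: array of class labels.
    Returns the class labels and a (n_classes, n_words) weight matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # 1. Term-frequency transform: f' = log(1 + f)
    X = np.log1p(X)
    # 2. Inverse-document-frequency transform: discount common words
    df = np.count_nonzero(X, axis=0)
    X *= np.log(X.shape[0] / np.maximum(df, 1))
    # 3. Length normalization: divide each document by its L2 norm
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

    classes = np.unique(y)
    weights = np.zeros((len(classes), X.shape[1]))
    for k, c in enumerate(classes):
        comp = X[y != c]                    # documents NOT in class c
        num = comp.sum(axis=0) + alpha      # smoothed complement counts
        w = np.log(num / num.sum())         # w_ci = log(theta~_ci)
        weights[k] = w / np.abs(w).sum()    # weight normalization (unit L1 norm)
    return classes, weights

def twcnb_predict(classes, weights, x):
    """Label a document by argmin_c sum_i f_i * w_ci (complement weights).

    x: the test document's raw term-count vector over the same vocabulary.
    """
    scores = weights @ np.asarray(x, dtype=float)
    return classes[np.argmin(scores)]
```

Usage: classes, W = twcnb_train(X_train, y_train); label = twcnb_predict(classes, W, x_test).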

The experimental results comparing MNB, TWCNB, and the SVM show that TWCNB's performance is substantially better than MNB's and approaches the SVM's performance.