On the Optimality of the Simple Bayesian Classifier under Zero-One Loss
Pedro Domingos, Michael Pazzani
Presented by Lu Ren, Oct. 1, 2007

Outline:
Introduction to the simple Bayesian classifier and its optimality
The simple Bayesian classifier in machine learning and the empirical evidence
Optimality without independence and its conditions
When will the Bayesian classifier outperform other learners?
How is the Bayesian classifier best extended?
Conclusions

1. Introduction
Many classifiers can be viewed as computing a set of discriminant functions $f_i(E)$ of the example. If $f_i(E) > f_j(E)$ for all $j \ne i$, we choose class $C_i$ for the example $E = (a_1, a_2, \ldots, a_n)$, which is a vector of attribute values. Zero-one loss is minimized if and only if $E$ is assigned to the class $C_i$ for which $P(C_i \mid E)$ is maximum. If the attributes are independent given the class, then by Bayes' theorem
$$P(C_i \mid E) \propto P(C_i) \prod_j P(a_j \mid C_i),$$
and the simple Bayesian classifier assigns $E$ to the class maximizing this product.
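To make the decision rule concrete, here is a minimal sketch (mine, not code from the paper; the data structures and names are hypothetical) of the simple Bayesian classifier's prediction step, assuming the priors and conditional probability tables have already been estimated from data:

```python
import math

def simple_bayes_predict(example, priors, cond_probs):
    """Pick the class maximizing P(C) * prod_j P(a_j | C).

    priors:     dict class -> P(C)
    cond_probs: dict (attribute index, value, class) -> P(a_j = value | C)
    Log-space is used to avoid underflow when there are many attributes.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(example):
            score += math.log(cond_probs[(j, value, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```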

Question: can the simple Bayesian classifier (BC) be optimal even when the assumption of attribute independence does not hold? The tacit assumption has been that the answer is "no". However, the BC can perform very well even in domains where attribute dependencies exist. The article derives the most general conditions for the BC's optimality, with the following corollary: the BC's true region of optimal performance is far greater than the one implied by the attribute independence assumption.

2. The simple BC in machine learning
When compared with more sophisticated algorithms, the simple BC was the most accurate overall. The BC's limited performance in many domains was not in fact intrinsic to it, but due to unwarranted Gaussian assumptions about numeric attributes. Attempting to improve accuracy by modeling attribute dependencies is not always helpful. The simple Bayesian classifier is also robust in accuracy even when compared with Bayesian networks.

Empirical evidence: numeric attributes were discretized when verifying the Bayesian classifier's performance, and missing values were ignored. The zero-counts problem is handled by the Laplace correction: the uncorrected estimate of $P(A_j = v_{jk} \mid C_i)$ is $n_{ijk}/n_i$, and the corrected estimate is
$$\hat{P}(A_j = v_{jk} \mid C_i) = \frac{n_{ijk} + f}{n_i + f\,n_j},$$
where $n_{ijk}$ is the number of training examples of class $C_i$ with $A_j = v_{jk}$, $n_i$ is the number of training examples of class $C_i$, $n_j$ is the number of values of attribute $A_j$, and $f$ is a small smoothing constant. Experimental setup: twenty-eight data sets; part of each data set was used for training and the rest for testing; twenty runs were conducted to report average accuracies and confidence levels.
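A minimal sketch of the corrected estimate described above (the function name and the default value of the smoothing constant f are my own choices; the slide does not specify f):

```python
def laplace_corrected_estimate(n_ijk, n_i, n_j, f=1.0):
    """Laplace-corrected estimate of P(A_j = v_jk | C_i).

    n_ijk: count of class-C_i training examples with A_j = v_jk
    n_i:   count of training examples of class C_i
    n_j:   number of distinct values of attribute A_j
    f:     smoothing constant (f = 1 gives classic add-one smoothing)
    """
    return (n_ijk + f) / (n_i + f * n_j)

# Example: attribute with 3 values, class with 10 examples, value never seen.
# The uncorrected estimate would be 0; the corrected one stays strictly positive.
print(laplace_corrected_estimate(0, 10, 3))  # 1/13 ~= 0.077
```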

Dependencies between pairs of attributes given the class were measured empirically. The BC achieves higher accuracy than more sophisticated approaches in many domains, and the measured degree of attribute dependence is not a good predictor of the BC's relative performance.

3. Optimality without independence
Consider three attributes A, B and C and two classes "+" and "-", with $P(+) = P(-) = \tfrac{1}{2}$. Assume A = B (so A and B are completely dependent) and A is independent of C. Because B adds no information beyond A, the optimal classification procedure assigns the example to class "+" if $P(+\mid A, C) > P(-\mid A, C)$, i.e., if $P(A\mid +)\,P(C\mid +) > P(A\mid -)\,P(C\mid -)$. The simple BC counts A twice and assigns the example to "+" if $P(A\mid +)^2\,P(C\mid +) > P(A\mid -)^2\,P(C\mid -)$. Let $p = P(+\mid A)$ and $q = P(+\mid C)$; then the two classification procedures can be written as
$$p\,q > (1-p)(1-q) \quad\text{(optimal)}, \qquad p^2 q > (1-p)^2 (1-q) \quad\text{(simple Bayes)}.$$
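As a quick numerical illustration (my own sketch, not from the paper), one can sweep the (p, q) unit square and measure how often the two decision rules above disagree; they give the same decision over most of the square, disagreeing only in a band around the decision boundary:

```python
import numpy as np

# Sweep the (p, q) unit square and compare the two rules from the slide above:
# p = P(+|A), q = P(+|C), with A = B and equal class priors.
grid = np.linspace(0.01, 0.99, 199)
p, q = np.meshgrid(grid, grid)

optimal_plus = p * q > (1 - p) * (1 - q)         # optimal rule
bayes_plus = p**2 * q > (1 - p)**2 * (1 - q)     # simple Bayes rule

disagree = np.mean(optimal_plus != bayes_plus)
print(f"fraction of (p, q) space where the decisions differ: {disagree:.3f}")
```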

Local optimality
Definition 1 (zero-one loss): the loss incurred on an example is 0 if the classifier assigns it to the correct class and 1 otherwise.

More general definitions:
Definition 2 (Bayes rate): the lowest zero-one loss achievable by any classifier on a given example.
Definition 3 (locally optimal): a classifier is locally optimal for a given example iff its zero-one loss on that example is equal to the Bayes rate.
Definition 4 (globally optimal): a classifier is globally optimal for a given sample (data set) iff it is locally optimal for every example in that sample; a classifier is globally optimal for a given problem iff it is globally optimal for all possible samples of that problem.

Zero-one loss for classification vs. squared-error loss for probability estimation: Equation 2 yields minimal squared-error estimates of the class probabilities only when the estimates equal the true values (i.e., when the attribute independence assumption holds). But under Equation 1, the classifier can still achieve minimal zero-one loss as long as the class with the highest estimated probability is the class with the highest true probability.

Consider the general two-class case with classes "+" and "-". For an example $E$, let $p = P(+\mid E)$ be the true posterior probability, and let $r = P(+)\prod_j P(a_j\mid +)$ and $s = P(-)\prod_j P(a_j\mid -)$ be the simple BC's scores for the two classes. A necessary and sufficient condition for the local optimality of the BC is as follows:
Theorem 1: The Bayesian classifier is locally optimal under zero-one loss for an example $E$ iff, for $E$, $(p \ge \tfrac{1}{2}$ and $r \ge s)$ or $(p \le \tfrac{1}{2}$ and $r \le s)$.
Corollary 1: The Bayesian classifier is locally optimal under zero-one loss in half the volume of the space of possible values of $(p, r, s)$.
This is not an asymptotic result; it also holds for finite samples.
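A tiny sketch (mine, not the paper's) that encodes the Theorem 1 condition and estimates, by Monte Carlo sampling of the unit cube, the fraction of (p, r, s) values for which it holds; the estimate should come out close to the one half stated in Corollary 1:

```python
import random

def locally_optimal(p, r, s):
    """Theorem 1 condition: the simple BC's decision matches the optimal
    one iff (p >= 1/2 and r >= s) or (p <= 1/2 and r <= s)."""
    return (p >= 0.5 and r >= s) or (p <= 0.5 and r <= s)

# Monte Carlo estimate of the volume where the condition holds,
# sampling (p, r, s) uniformly from the unit cube.
random.seed(0)
trials = 100_000
hits = sum(
    locally_optimal(random.random(), random.random(), random.random())
    for _ in range(trials)
)
print(f"estimated fraction of the (p, r, s) cube: {hits / trials:.3f}")  # ~0.5
```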

Under squared-error loss, Equation 2 is optimal only when the independence assumption holds, i.e., only on the line $r = p$, $s = 1 - p$ within the $(p, r, s)$ space. Intuitions based on squared-error loss are therefore incorrectly applied to the BC's performance under zero-one loss.

Global optimality:
Theorem 2: The Bayesian classifier is globally optimal under zero-one loss for a sample (data set) iff the condition of Theorem 1 holds for every example in the sample.
Necessary conditions:
Theorem 3: The Bayesian classifier cannot be globally optimal for more than a finite number of different problems, where the bound depends on d, the number of different numbers representable on the machine implementing the Bayesian classifier (for example, with 16 bits, d = 2^16 = 65536).
Theorem 4: When all attributes are nominal, the Bayesian classifier is not globally optimal for classes that are not discriminable by linear functions of the corresponding features.

With nominal attributes, the Bayesian classifier is equivalent to a linear machine whose discriminant function for class $C_i$ is
$$\log P(C_i) + \sum_j \log P(a_j \mid C_i),$$
a linear function of indicator features. But it can fail even for concepts that are linearly separable. An m-of-n concept is true if m or more of the n attributes defining the example space are true.
Theorem 5: The Bayesian classifier is not globally optimal for m-of-n concepts.
(In the analysis, $P(A \mid C)$ denotes the probability that an attribute A is true given that the concept C is true.)
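As an empirical companion to Theorem 5 (my own sketch, not code from the paper), the function below trains a naive Bayes classifier on the full truth table of an m-of-n concept, using exact frequency estimates, and counts how many examples it then misclassifies. Note that m = n corresponds to a conjunction of literals and m = 1 to a disjunction, the cases covered by Theorems 7 and 8 below.

```python
from itertools import product
from collections import Counter

def m_of_n_errors(m, n):
    """Train naive Bayes on the full truth table of the m-of-n concept
    (exact frequency estimates, no smoothing) and count misclassifications.
    Ties are broken in favour of the negative class."""
    examples = list(product((0, 1), repeat=n))
    labels = [1 if sum(x) >= m else 0 for x in examples]

    class_count = Counter(labels)
    cond = Counter()                      # (attribute, value, class) -> count
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            cond[(j, v, c)] += 1

    def score(x, c):
        s = class_count[c] / len(examples)            # prior P(C)
        for j, v in enumerate(x):
            s *= cond[(j, v, c)] / class_count[c]     # P(a_j | C)
        return s

    return sum(
        1 for x, c in zip(examples, labels)
        if (score(x, 1) > score(x, 0)) != (c == 1)
    )

# e.g. m_of_n_errors(3, 7) reports how many truth-table rows the simple
# Bayesian classifier gets wrong for the 3-of-7 concept.
```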

If the Bayesian classifier is trained on all examples of an m-of-n concept, its output depends only on the number j of true-valued attributes in the test example. It makes a false-positive error when its score difference in favor of the concept is positive even though j < m, and a false-negative error when that difference is negative even though j >= m.

Sufficient conditions:
Theorem 6: The Bayesian classifier is globally optimal if, for all classes $C_i$ and examples $E$, $P(E \mid C_i) = \prod_j P(a_j \mid C_i)$, i.e., if the attributes really are independent given the class.
Theorem 7: The Bayesian classifier is globally optimal for learning conjunctions of literals.
Theorem 8: The Bayesian classifier is globally optimal for learning disjunctions of literals.
4. When will the BC outperform other learners?
Squared-error loss decomposes into noise + statistical bias + variance. The BC is often a more accurate classifier than C4.5 because, when data are limited, a classifier with high bias and low variance tends to produce lower zero-one loss.

[Figure: results for concepts with 16, 32, and 64 attributes.]

When the small sample size is the dominant limiting factor, the BC may be the better choice. However, as the sample size increases, the BC's capacity to store information is exhausted sooner than that of more powerful classifiers, and the more powerful classifiers become better.
5. How is the BC best extended?
Detecting attribute dependencies is not necessarily the best way to improve performance. Two measures for determining the best pair of attributes to join were compared: (1) leave-one-out cross-validation accuracy on the training set, and (2) Equation 4, which finds the pair of attributes with the largest violation of the conditional independence assumption. A sketch of the first measure follows below.
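A minimal sketch of the cross-validation measure, assuming nominal attributes; the helper names (nb_loo_accuracy, join_pair) and the use of add-one smoothing are my own choices, not details given in the paper:

```python
import numpy as np

def nb_loo_accuracy(X, y):
    """Leave-one-out accuracy of a naive Bayes classifier on nominal
    attributes, with add-one (Laplace) smoothing of the conditionals."""
    n, d = X.shape
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        Xtr, ytr, x = X[keep], y[keep], X[i]
        best_class, best_score = None, -np.inf
        for c in np.unique(ytr):
            Xc = Xtr[ytr == c]
            score = np.log(len(Xc) / len(Xtr))
            for j in range(d):
                n_vals = len(np.unique(Xtr[:, j]))
                count = np.sum(Xc[:, j] == x[j])
                score += np.log((count + 1) / (len(Xc) + n_vals))
            if score > best_score:
                best_class, best_score = c, score
        correct += int(best_class == y[i])
    return correct / n

def join_pair(X, j, k):
    """Replace attributes j and k by a single Cartesian-product attribute."""
    joined = np.array([f"{a}|{b}" for a, b in zip(X[:, j], X[:, k])])
    return np.column_stack([np.delete(X, [j, k], axis=1), joined])

# Joining attributes j and k is accepted only if it improves the
# cross-validated accuracy: compare nb_loo_accuracy(join_pair(X, j, k), y)
# against nb_loo_accuracy(X, y).
```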

[Figures: accuracy on the test set plotted against (a) the accuracy estimated by cross-validation on the training set and (b) the entropy-based measure of attribute dependence.]
Cross-validation accuracy is a better predictor of the effect of an attribute join than the degree of dependence given the class.

Because, under zero-one loss, the Bayesian classifier can tolerate significant violations of the independence assumption, an approach that directly estimates the effect of possible changes on this loss measure (such as cross-validation) yields a more substantial improvement than one based on measuring dependence.

6. Conclusions
Verified that the BC performs quite well even when strong attribute dependencies are present.
Derived necessary and sufficient conditions for the BC's optimality.
Hypothesized that the BC may often be a better classifier than more powerful alternatives when the sample size is small.
Verified that searching for attribute dependencies is not necessarily the best approach to improving the BC's performance.