Bayesian decision theory: A framework for making decisions when uncertainty exists
Lecture Notes for E Alpaydın 2010, Introduction to Machine Learning 2e, © The MIT Press (V1.0)

Modeling data as random variables
Example: coin toss.
Given sufficient knowledge, we could use Newton's laws of motion to calculate the result of each toss with minimal uncertainty. In conjunction with our model, analysis of experimental trajectories will probably reveal why the coin is unfair, if heads and tails do not occur with equal probability.
Alternative: accept doubt about the result of the toss. Treat the result as a random variable X subject to P(X = x). Use P(X = x) to make a rational decision about the result of the next toss. Assume that we are not interested in why the coin is unfair, if that is the case. "The reason is in the data."

Statistical analysis of coin-toss data
Let heads = 1, tails = 0. Boolean random variables obey Bernoulli statistics:
P(x) = p_o^x (1 − p_o)^(1−x), where p_o is the probability of heads.
Given a sample of N tosses, an unbiased estimator of p_o is the fraction of tosses that show heads.
Prediction of the next toss: heads if p_o > 1/2, tails otherwise.
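As a quick illustration, here is a minimal sketch of this estimator in Python (NumPy assumed); the toss data and variable names are made up for the example:

```python
import numpy as np

# Hypothetical sample of N coin tosses: heads = 1, tails = 0
tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Unbiased estimator of p_o: the fraction of tosses showing heads
p_hat = tosses.mean()

# Predict the next toss: heads if p_hat > 1/2, tails otherwise
prediction = "heads" if p_hat > 0.5 else "tails"
print(f"p_hat = {p_hat:.2f}, predict {prediction}")
```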

Review: Bayes' rule for binary classification
posterior = class likelihood × prior / normalization:
P(C_i|x) = p(x|C_i) P(C_i) / p(x), with p(x) = p(x|C_1) P(C_1) + p(x|C_2) P(C_2).
The prior P(C_i) is information relevant to classifying that is independent of the attributes x. The class likelihood p(x|C_i) is the probability that a member of class C_i will have attributes x.
Assign a client with attributes x to class C_1 if P(C_1|x) > 0.5.
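A minimal numerical sketch of this rule, assuming we already have a prior and the two class likelihoods for an observed x (the numbers below are illustrative only):

```python
# Binary Bayes rule: P(C1|x) = p(x|C1) P(C1) / p(x), with
# p(x) = p(x|C1) P(C1) + p(x|C2) P(C2)

prior_c1 = 0.3          # P(C1), e.g. fraction of high-risk clients
likelihood_c1 = 0.6     # p(x|C1) for the observed attributes x
likelihood_c2 = 0.2     # p(x|C2)

evidence = likelihood_c1 * prior_c1 + likelihood_c2 * (1 - prior_c1)
posterior_c1 = likelihood_c1 * prior_c1 / evidence

# Assign x to C1 if its posterior exceeds 0.5
print(f"P(C1|x) = {posterior_c1:.3f} ->", "C1" if posterior_c1 > 0.5 else "C2")
```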

Review: Bayes' rule for K > 2 classes
P(C_i|x) = p(x|C_i) P(C_i) / p(x) = p(x|C_i) P(C_i) / Σ_k p(x|C_k) P(C_k), with P(C_i) ≥ 0 and Σ_i P(C_i) = 1.
Assign x to the class with the highest posterior: choose C_i if P(C_i|x) = max_k P(C_k|x).

Review: estimating priors and class likelihoods from data
With class labels r_i^t (r_i^t = 1 if example x^t belongs to C_i, 0 otherwise), the estimators are
P(C_i) ≈ Σ_t r_i^t / N, so the fraction of examples in a class estimates its prior;
m_i = Σ_t r_i^t x^t / Σ_t r_i^t and S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T / Σ_t r_i^t.
If we assume members of a class are Gaussian distributed, then the mean and covariance parameterize the class likelihood p(x|C_i).
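A short sketch of these estimators, assuming NumPy and integer class labels in place of the indicators r_i^t; the function name and the synthetic data are illustrative:

```python
import numpy as np

def estimate_class_parameters(X, y, num_classes):
    """Maximum-likelihood estimates of priors, means and covariances.

    X: (N, d) attribute matrix; y: (N,) integer class labels in {0, ..., K-1}.
    """
    N = len(y)
    priors, means, covs = [], [], []
    for i in range(num_classes):
        Xi = X[y == i]                       # examples with r_i^t = 1
        priors.append(len(Xi) / N)           # P(C_i) ~ N_i / N
        means.append(Xi.mean(axis=0))        # m_i
        covs.append(np.cov(Xi, rowvar=False, bias=True))  # S_i (ML estimate)
    return np.array(priors), np.array(means), np.array(covs)

# Tiny synthetic example with two 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 1.0, (30, 2))])
y = np.array([0] * 50 + [1] * 30)
priors, means, covs = estimate_class_parameters(X, y, num_classes=2)
print(priors, means[1])
```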

Review: naïve Bayes classification
Each class is characterized by a set of means and variances for the components of the attributes in that class. A simpler model results from assuming that the components of x are independent random variables: the covariance matrix is diagonal and p(x|C) is the product of the probabilities of each component of x.
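The sketch below illustrates this diagonal-covariance (naïve) likelihood in Python: the class likelihood is the product of univariate Gaussian densities, computed in log space for stability. The parameter values are made up for the example:

```python
import numpy as np

def naive_bayes_log_posteriors(x, priors, means, variances):
    """Unnormalized log posteriors under the naive (diagonal-covariance) model.

    means[i], variances[i] hold the per-component Gaussian parameters of class i;
    p(x|C_i) is the product of the univariate densities of the components of x.
    """
    log_post = []
    for prior, mu, var in zip(priors, means, variances):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_post.append(np.log(prior) + log_lik)
    return np.array(log_post)

# Illustrative two-class, two-attribute model
priors = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
variances = [np.array([1.0, 1.0]), np.array([1.5, 0.5])]
x = np.array([1.8, 1.6])
print("assign to class", int(np.argmax(naive_bayes_log_posteriors(x, priors, means, variances))))
```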

Minimizing risk given attributes x
Actions: α_i is the action of assigning x to C_i, one of K classes. A loss λ_ik occurs if we take action α_i when x actually belongs to C_k.
Expected risk (Duda and Hart, 1973): R(α_i|x) = Σ_k λ_ik P(C_k|x).
Choose the action with minimum expected risk: take α_i if R(α_i|x) = min_k R(α_k|x).
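A small sketch of this computation; the loss matrix and posteriors below are illustrative (they match the asymmetric-loss example a few slides later), and NumPy is assumed:

```python
import numpy as np

# Loss matrix: loss_matrix[i, k] = lambda_ik, the loss of taking action
# alpha_i (assign to C_i) when the true class is C_k.
loss_matrix = np.array([[0.0, 10.0],
                        [1.0, 0.0]])

posteriors = np.array([0.85, 0.15])   # P(C_1|x), P(C_2|x) for some x

# Expected risk of each action: R(alpha_i|x) = sum_k lambda_ik * P(C_k|x)
risks = loss_matrix @ posteriors

best_action = int(np.argmin(risks))   # choose the action with minimum risk
print(risks, "-> take action", best_action + 1)
```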

Special case: correct decisions incur no loss and all errors have equal cost (the "0/1 loss function"):
λ_ik = 0 if i = k and 1 otherwise, so R(α_i|x) = Σ_{k≠i} P(C_k|x) = 1 − P(C_i|x).
For minimum risk, choose the most probable class.

Add a rejection option: don't assign a class
Introduce an extra action, rejection, with loss λ, 0 < λ < 1. The risk of no assignment is λ; the risk of choosing C_i is 1 − P(C_i|x).
Choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 − λ; otherwise reject and make no assignment.
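A minimal sketch of this decision rule, assuming 0/1 loss plus a reject action of cost λ (the function name and example posteriors are illustrative):

```python
import numpy as np

def decide_with_reject(posteriors, reject_loss):
    """0/1 loss with a reject option of cost reject_loss (0 < reject_loss < 1).

    Risk of choosing C_i is 1 - P(C_i|x); risk of rejecting is reject_loss,
    so we assign the most probable class only if its posterior exceeds
    1 - reject_loss, and reject otherwise.
    """
    best = int(np.argmax(posteriors))
    if posteriors[best] > 1.0 - reject_loss:
        return best          # index of the chosen class
    return None              # reject: make no assignment

print(decide_with_reject(np.array([0.55, 0.45]), reject_loss=0.3))  # None (reject)
print(decide_with_reject(np.array([0.80, 0.20]), reject_loss=0.3))  # 0
```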

Example of risk minimization with λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1 (recall that the loss λ_ik occurs if we take α_i when x belongs to C_k):
R(α_1|x) = λ_11 P(C_1|x) + λ_12 P(C_2|x) = 10 P(C_2|x)
R(α_2|x) = λ_21 P(C_1|x) + λ_22 P(C_2|x) = P(C_1|x)
Choose C_1 if R(α_1|x) < R(α_2|x), which is true if 10 P(C_2|x) < P(C_1|x); using the normalization of posteriors, P(C_2|x) = 1 − P(C_1|x), this becomes P(C_1|x) > 10/11.
The consequence of erroneously assigning an instance to C_1 is so bad that we choose C_1 only when we are virtually certain it is correct.
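The 10/11 threshold can be checked numerically; this short sketch uses the slide's loss values and a few illustrative posterior values:

```python
# Asymmetric-loss example: lambda_11 = lambda_22 = 0,
# lambda_12 = 10 (choosing C1 when the truth is C2), lambda_21 = 1.
def choose_class(p_c1):
    p_c2 = 1.0 - p_c1
    risk_alpha1 = 10.0 * p_c2      # R(alpha_1|x)
    risk_alpha2 = 1.0 * p_c1       # R(alpha_2|x)
    return "C1" if risk_alpha1 < risk_alpha2 else "C2"

# The crossover sits at P(C1|x) = 10/11, about 0.909
for p in (0.85, 0.90, 0.92):
    print(p, "->", choose_class(p))
```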

Bayes' classifier based on neighbors
Consider a data set with N examples, N_i of which belong to class i, so P(C_i) = N_i / N.
Given a new example x, draw a hyper-sphere of volume V in attribute space, centered on x and containing precisely k training examples, irrespective of their class. Suppose this sphere contains n_i examples from class i; then p(x|C_i) ≈ n_i / (N_i V), so p(x|C_i) P(C_i) ≈ n_i / (N V).

Bayes' classifier based on neighbors
Using Bayes' rule, and noting that p(x) ≈ k / (N V), we find the posteriors P(C_i|x) = n_i / k.
Assign x to the class with the highest posterior, which is the class with the highest representation among the k training examples in the hyper-sphere centered on x.
For k = 1 (the nearest-neighbor rule), assign x to the class of its nearest neighbor in the training data.
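A minimal sketch of this posterior estimate P(C_i|x) = n_i / k, using Euclidean distance; the function name and the synthetic data are illustrative only:

```python
import numpy as np

def knn_posteriors(X_train, y_train, x, k, num_classes):
    """Posterior estimates P(C_i|x) = n_i / k from the k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]               # indices of the k nearest neighbors
    counts = np.bincount(y_train[nearest], minlength=num_classes)
    return counts / k

# Tiny synthetic two-class example
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y_train = np.array([0] * 40 + [1] * 40)
x = np.array([2.5, 2.5])
post = knn_posteriors(X_train, y_train, x, k=7, num_classes=2)
print(post, "-> assign to class", int(np.argmax(post)))
```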

Bayes' classifier based on k nearest neighbors (kNN)
Usually choose k from a range of values based on validation error.
In 2D, we can visualize the classification by applying kNN to every point in the (x_1, x_2) plane. As k increases, expect fewer islands and smoother decision boundaries.

Analysis of binary classification: beyond the confusion matrix

Quantities defined by the binary confusion matrix
Let C_1 be the positive class, C_2 the negative class, and N the number of instances.
Error rate = (FP + FN) / N = 1 − accuracy
False positive rate = FP / (FP + TN) = fraction of C_2 instances misclassified
True positive rate = TP / (TP + FN) = fraction of C_1 instances correctly classified
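These quantities are straightforward to compute once the four cells of the confusion matrix are counted; the counts below are illustrative:

```python
# Illustrative counts from a binary confusion matrix (C1 = positive class)
TP, FN = 80, 20      # C1 instances classified as C1 / as C2
FP, TN = 10, 90      # C2 instances classified as C1 / as C2
N = TP + FN + FP + TN

error_rate = (FP + FN) / N                 # = 1 - accuracy
false_positive_rate = FP / (FP + TN)       # fraction of C2 instances misclassified
true_positive_rate = TP / (TP + FN)        # fraction of C1 instances correctly classified

print(error_rate, false_positive_rate, true_positive_rate)
```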

Receiver operating characteristic (ROC) curve
Let C_1 be the positive class and let θ be the threshold on P(C_1|x) for assigning x to C_1.
If θ is near 1, assignments to C_1 are rare but have a high probability of being correct: both the FP-rate and the TP-rate are small.
As θ decreases, both the FP-rate and the TP-rate increase.
For every value of θ, (FP-rate, TP-rate) is a point on the ROC curve.

ROC curves (figure): the diagonal corresponds to classification by chance alone; a curve only slightly above the diagonal indicates marginal success.

Drawing ROC curves
Assume C_1 is the positive class. Rank all examples by decreasing P(C_1|x).
In decreasing rank order, move up by 1/N_1 for each positive example and right by 1/N_2 for each negative example, where N_1 and N_2 are the numbers of positive and negative examples.
If all examples are correctly classified, the ROC curve hugs the upper left corner. If P(C_1|x) is not correlated with the class labels, the ROC curve will be close to the diagonal.
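A compact sketch of this procedure, assuming we have an array of estimated P(C_1|x) scores and the true labels (the example values are made up):

```python
import numpy as np

def roc_points(scores, labels):
    """(FP-rate, TP-rate) points obtained by sweeping the threshold theta.

    scores: estimated P(C1|x) for each example; labels: 1 for C1, 0 for C2.
    Walking the examples in decreasing score order moves the curve up by
    1 / (number of positives) for each positive example and right by
    1 / (number of negatives) for each negative example.
    """
    order = np.argsort(-scores)                 # rank by decreasing P(C1|x)
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tpr = np.cumsum(labels) / n_pos
    fpr = np.cumsum(1 - labels) / n_neg
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

# Illustrative scores and true labels
scores = np.array([0.95, 0.85, 0.80, 0.70, 0.55, 0.40, 0.30, 0.10])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr.round(2), tpr.round(2))))
```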

Results (figure): performance with the reduced attribute set is slightly improved; the number of misclassified malignant cases decreased by 2.