Foundations 2


Modeling data as a random variable

- Data usually come from a process that is not completely known.
- Example: coin toss. The result (heads or tails) is deterministic: given sufficient knowledge of the physics, we could use Newton's laws to calculate the result of each toss.
- Alternative: accept uncertainty about the result. Treat the result as a random variable X governed by P(X = x), and use P(X = x) to make a rational decision about the next toss.

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Statistical analysis of coin-toss data

- Let heads = 1 and tails = 0. The outcome obeys Bernoulli statistics:
  P(X = x) = p0^x (1 - p0)^(1 - x), where p0 is the probability of heads.
- Given a sample of N tosses, an unbiased estimator of p0 is (number of heads) / (number of tosses).
- Prediction of the next toss: heads if p0 > 1/2, tails otherwise.
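The estimator and prediction rule above can be sketched in Python; this is a minimal illustration, and the simulated coin's true p0 is an assumed value, not from the slides:

```python
import random

def estimate_p0(tosses):
    """Unbiased estimator of P(heads): the fraction of 1s in the sample."""
    return sum(tosses) / len(tosses)

def predict_next(p0_hat):
    """Predict heads (1) if the estimated probability of heads exceeds 1/2."""
    return 1 if p0_hat > 0.5 else 0

random.seed(0)
true_p0 = 0.7  # assumed, for the simulation only
sample = [1 if random.random() < true_p0 else 0 for _ in range(1000)]
p0_hat = estimate_p0(sample)
print(p0_hat, predict_next(p0_hat))
```

With a large N the estimate concentrates near the true p0, which is why the simple "majority of tosses" predictor is rational.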

Discriminant functions

- Consider a 2D dichotomizer for bank loans: attributes = income and savings; class C = 1 means high risk.
- P(C = 1 | x1, x2), the probability of high risk given the attribute values, is a "discriminant" function for classification.
- If the probability is normalized, we can use the rule: choose C = 1 if P(C = 1 | x1, x2) > 0.5, otherwise choose C = 0.
- Normalization is unnecessary; we can equivalently use the rule: choose C = 1 if P(C = 1 | x1, x2) > P(C = 0 | x1, x2).

Bayes' rule allows for prior knowledge about the class ("most of our clients are high risk"):

P(C | x) = p(x | C) P(C) / p(x)
(posterior = class likelihood × prior / evidence)

- Prior P(C): probability of class C independent of the attributes; what we know about credit risk before we observe a client's attributes, e.g. per-capita bankruptcies.
- Class likelihood p(x | C): probability of observing attributes x conditioned on the event of being in class C; e.g. given that a client is high risk (C = 1), how likely is X = {x1, x2}? Deduced from data on a set of known high-risk clients.
- Evidence p(x): a normalization factor independent of class; also called the "marginal probability" that x is seen regardless of class.
- Posterior P(C | x): probability that the client belongs to class C conditioned on the attributes being x. When normalized by the evidence, the posteriors sum to 1.
- Assign the client to the class with the higher posterior: the maximum a posteriori (MAP) approach.

A discriminant function for a 2-class problem can be defined as the ratio of class likelihoods:
g(x) = p(x | C1) / p(x | C2)
What is the rule for using this discriminant function in classification?
If g(x) > 1, choose C1; otherwise choose C2.

A discriminant function for a 2-class problem can also be defined as the Bayesian log-odds ratio:
g(x) = log(P(C1 | x) / P(C2 | x))
What is the rule for using this discriminant in classification? By Bayes' rule,
g(x) = log(p(x | C1) / p(x | C2)) + log(P(C1) / P(C2));
choose C1 if g(x) > 0, else choose C2.
When does classification using this discriminant function become independent of the attributes? When the prior term dominates the likelihood term.
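The log-odds discriminant can be sketched as follows. The slides do not specify a form for the class likelihoods, so this sketch assumes one-dimensional Gaussian likelihoods; the means, variances, and priors are illustrative parameters:

```python
import math

def log_odds(x, mu1, sigma1, mu2, sigma2, prior1, prior2):
    """g(x) = log p(x|C1)/p(x|C2) + log P(C1)/P(C2), assuming Gaussian likelihoods."""
    def log_gauss(x, mu, sigma):
        # log density of N(mu, sigma^2) at x
        return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return (log_gauss(x, mu1, sigma1) - log_gauss(x, mu2, sigma2)
            + math.log(prior1) - math.log(prior2))

def classify(x, **params):
    """Choose C1 if g(x) > 0, else C2."""
    return "C1" if log_odds(x, **params) > 0 else "C2"
```

With equal priors and equal variances, g(x) = 0 exactly at the midpoint between the two means; unequal priors shift that boundary, which is how the prior term can come to dominate.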

Normalized Bayesian dichotomizer

P(C = 1 | x) = p(x | C = 1) P(C = 1) / p(x)
(posterior = likelihood × prior / evidence)

- Normalized priors: P(C = 0) + P(C = 1) = 1.
- Normalized posteriors: P(C = 0 | x) + P(C = 1 | x) = 1.
- Assign the client to the class with the higher posterior.

Normalized Bayesian classifier, K > 2 classes

P(Ci | x) = p(x | Ci) P(Ci) / p(x)

- MAP approach: choose the class with the highest posterior.
- Priors, likelihoods, posteriors, and margins are class specific; the evidence is the sum of the margins over the classes.

Minimizing the risk of decisions

- Action αi is assigning x to class Ci of K classes.
- λik is the loss that occurs if we take action αi when x belongs to class Ck.
- Expected risk (Duda and Hart, 1973): R(αi | x) = Σk λik P(Ck | x)

- For minimum risk, choose the most probable class.
- Special case, the "0/1 loss function": correct decisions incur no loss and all errors have equal cost (λik = 0 if i = k, 1 otherwise).
- With normalized posteriors, R(αi | x) = 1 − P(Ci | x), so minimizing risk again means choosing the most probable class.

Duda and Hart model applied to a 2-class problem

R(α1 | x) = λ11 P(C1 | x) + λ12 P(C2 | x)
R(α2 | x) = λ21 P(C1 | x) + λ22 P(C2 | x)

- λ11 = λ22 = 0: no cost for correct decisions.
- λ12 = 10 and λ21 = 1: the cost of an incorrect assignment to C1 is ten times the cost of an incorrect assignment to C2.
- Posteriors are normalized. State the classification rule in terms of P(C1 | x).
- First step: substitute the costs, giving R(α1 | x) = 10 P(C2 | x) and R(α2 | x) = P(C1 | x).
- Choose C1 if R(α1 | x) < R(α2 | x), which is true if 10 P(C2 | x) < P(C1 | x); since the posteriors are normalized, this can be written as 10 (1 − P(C1 | x)) < P(C1 | x).
- Solving this inequality gives a rule involving only P(C1 | x): choose C1 if P(C1 | x) > 10/11.
- The consequence of erroneously assigning an instance to C1 is so great that we choose C1 only when we are virtually certain it is correct.

Substituting the cost matrix into the model:
R(α1 | x) = 10 P(C2 | x)
R(α2 | x) = P(C1 | x)
What is the rule for choosing C1?

Rule for assigning the client to C1:
Choose C1 if R(α1 | x) < R(α2 | x), which becomes 10 P(C2 | x) < P(C1 | x) in this risk model.
How do we get a rule that involves only P(C1 | x)?

Use normalization to eliminate P(C2 | x):
10 (1 − P(C1 | x)) < P(C1 | x)
Solving for P(C1 | x): choose C1 if P(C1 | x) > 10/11.
What does this mean?

With λ12 = 10 and λ21 = 1, the consequence of erroneously assigning a client to C1 is so great that we choose C1 only when we are virtually certain it is correct.
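The two-action risk computation above can be sketched in a few lines; the loss values come from the slide, and the decision boundary at P(C1 | x) = 10/11 falls out of the comparison:

```python
def expected_risks(p_c1, loss):
    """Expected risk of each action, given P(C1|x) and a 2x2 loss matrix
    loss[i][k] = cost of taking action i+1 when the true class is k+1."""
    p_c2 = 1.0 - p_c1  # normalized posteriors
    r1 = loss[0][0] * p_c1 + loss[0][1] * p_c2  # risk of assigning to C1
    r2 = loss[1][0] * p_c1 + loss[1][1] * p_c2  # risk of assigning to C2
    return r1, r2

def choose(p_c1, loss=((0, 10), (1, 0))):
    """Pick the minimum-risk action under the slide's cost matrix."""
    r1, r2 = expected_risks(p_c1, loss)
    return "C1" if r1 < r2 else "C2"
```

For example, choose(0.95) picks C1 because 0.95 > 10/11 ≈ 0.909, while choose(0.90) picks C2.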

Rejection option: the risk of not assigning a class

Assume normalized posteriors. Let λ be the risk of not assigning (i.e. the risk of rejection), so 1 − λ is the risk of assigning. Choose Ci if P(Ci | x) is the highest posterior and P(Ci | x) > 1 − λ; otherwise reject.

- If the risk of making an assignment is low (large λ), the threshold on the maximum posterior for choosing a class is lower. We can make class assignments even when the posteriors are not very discriminating (i.e. even when the classifier is weak). We use such a classifier when the risk associated with making no decision is high.
- If the risk of making an assignment is high (small λ), we want a classifier that is very discriminating, i.e. one that generates a posterior near unity for the correct class.

Relationship between the accuracy of classification and the risk of not assigning

With this risk model: choose Ci if P(Ci | x) is the highest posterior and P(Ci | x) > 1 − λ, the risk of making an assignment.

If the classifier is accurate, the risk of making an assignment can be low, which is achieved by making λ closer to unity.
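The rejection rule above can be sketched as a small function; the dictionary representation of the posteriors is an assumption for illustration:

```python
def classify_with_reject(posteriors, lam):
    """posteriors: dict class -> P(C|x), assumed normalized.
    lam: risk of rejection; assign only if the max posterior exceeds 1 - lam."""
    best_class = max(posteriors, key=posteriors.get)
    if posteriors[best_class] > 1.0 - lam:
        return best_class
    return "reject"
```

For instance, with posteriors {C1: 0.6, C2: 0.4}, a large λ = 0.5 (threshold 0.5) yields C1, while a small λ = 0.3 (threshold 0.7) yields rejection: the weaker the tolerance for assignment risk, the more discriminating the posteriors must be.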

Question: if more accurate classifiers are more expensive, propose a 3-level classification cascade that minimizes cost. If the classifiers in the cascade adopt the 0/1 loss function with rejection, how should λ change across the components of the cascade?
Answer: the more accurate (more expensive) downstream classifiers can have a lower risk of assigning, which is 1 − λ in the 0/1 risk model with rejection; therefore λ1 < λ2 < λ3.

The more accurate (more expensive) classifiers are placed downstream so that they are used only when necessary. Downstream classifiers can have a lower risk of assigning, which is 1 − λ in the 0/1 risk model with rejection; therefore λ1 < λ2 < λ3.

Examples of discriminant functions

Boundaries are defined by gk(x) = constant: hyper-surfaces in attribute space that separate the classes. Example: P(C1 | x) = 0.5 is a boundary for a Bayesian dichotomizer with normalized posteriors.

Bayes' classifier based on neighbors

P(C | x) = p(x | C) P(C) / p(x)
(posterior = likelihood × prior / evidence)

Assign the client to the class with the highest posterior. How can we use neighbors in attribute space to estimate the posteriors?

Bayes' classifier based on neighbors

- Consider a data set with N examples, ni of which belong to class i.
- Given a new example x, draw a hyper-sphere in attribute space, centered on x, containing precisely K training examples irrespective of their class.
- Suppose this sphere has volume V and contains ki examples from class i. Then the class likelihood is p(x | Ci) = ki / (ni V).
- The evidence p(x) = K / (N V) is the unconditional probability of drawing x irrespective of its class.

Bayes' classifier based on neighbors

- Class priors: P(Ci) = ni / N (i.e. how well class i is represented in the training data).
- When we use Bayes' rule to calculate the posteriors, the explicit dependence on V cancels out and P(Ci | x) = ki / K.
- Assign x to the class with the highest posterior, i.e. the class with the highest representation among the K training examples in the hyper-sphere centered on x.
- K = 1 (nearest-neighbor rule): assign x to the class of its nearest neighbor in the training data.

Pseudo-code for KNN

Given K, how do we calculate P(Ci | x) = ki / K?

- Find the "distance" between x and all other members of the data set.
- If the attributes are real numbers, use Euclidean distance.
- If the attributes are binary, try Hamming distance: how many bits are different?
  x = 0 1 0 0 1 0 1
  y = 1 1 0 1 1 0 0
  Hamming distance between x and y = 3
- How do we use the distance measurements?

- Sort the distances in increasing order.
- Among the K smallest, count the number ki belonging to each class i.
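The steps above can be sketched as a minimal KNN classifier; Euclidean distance is used, and the list-of-(attributes, label) data format is an assumption for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two attribute tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_posteriors(x, data, k):
    """data: list of (attributes, label) pairs.
    Returns P(Ci|x) = ki/K for each class among the k nearest examples."""
    ranked = sorted(data, key=lambda ex: euclidean(x, ex[0]))  # sort by distance to x
    counts = Counter(label for _, label in ranked[:k])          # ki for each class
    return {c: n / k for c, n in counts.items()}

def knn_classify(x, data, k):
    """Assign x to the class with the highest posterior ki/K."""
    post = knn_posteriors(x, data, k)
    return max(post, key=post.get)
```

Setting k = 1 recovers the nearest-neighbor rule from the previous slide.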

Bayes' classifier based on neighbors

In a 2-attribute data set, we can visualize the classification by applying KNN to each point of the plane. As K increases, expect fewer islands and smoother decision boundaries.

Dichotomizer confusion matrix

- Error rate = # of errors / # of instances = (FN + FP) / N
- Recall = # of positives found / # of positives = TP / (TP + FN); also called sensitivity or hit rate
- Precision = # of true positives found / # of positives found = TP / (TP + FP)
- Specificity = TN / (TN + FP)
- False alarm rate = FP / (FP + TN) = 1 − specificity
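These definitions translate directly into a small sketch that derives all five metrics from the confusion-matrix counts:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard dichotomizer metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "error_rate": (fn + fp) / n,
        "recall": tp / (tp + fn),            # sensitivity, hit rate
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false_alarm_rate": fp / (fp + tn),  # = 1 - specificity
    }
```

For example, with tp=8, fp=1, tn=9, fn=2 the recall is 0.8 (8 of 10 actual positives found) while the false alarm rate is 0.1 (1 of 10 actual negatives flagged).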

Definition of the ROC curve

- Let θ be the threshold on the normalized posterior P(C | x) for assigning example x to the positive class C.
- If θ is almost 1, we rarely make an assignment to C, but we expect few false positives.
- As θ decreases, we make more assignments to C and expect more false positives.
- Sweeping 0 < θ < 1 parametrically defines a true-positive versus false-positive curve called the "receiver operating characteristic" (ROC).

Pseudo-code for ROC construction

- Rank all examples by the value of P(C | x) (or some other diagnostic score for the positive class).
- Flag each example as positive or negative (positive = the example truly belongs to class C, negative = it does not).
- For each example, calculate (TP rate, FP rate):
  TP rate = fraction of positives with an equal or greater score
  FP rate = fraction of negatives with an equal or greater score
- Plot the points and calculate the area under the curve.
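The construction above can be sketched as follows; the scores and labels in the usage note are illustrative, and ties in score are broken by processing one example at a time rather than grouping:

```python
def roc_points(scores, labels):
    """scores: P(C|x) per example; labels: True for actual positives.
    Returns (FP rate, TP rate) points, starting from the origin."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)  # decreasing score
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative traces the upper-left corner and has AUC 1.0; one that does the reverse has AUC 0.0.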

[Figure: example data and ROC curves. Filled circle = positive, open circle = negative; one curve shows a good classifier, the other a poor classifier.]

ROC-related curves

Other combinations of confusion-matrix variables can be used in θ-parameter curve definitions.