Classification

Introduction

A discriminant is a function that separates the examples of different classes. For example:

– IF (income > Q1 AND savings > Q2) THEN low-risk customer
– ELSE high-risk customer

With a rule like this, the main application is prediction. Once we have a rule that fits the past data, we can make correct predictions for new instances.
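A minimal sketch of this rule in code; the threshold values Q1 and Q2 below are hypothetical, not values given in the slides:

```python
# Hypothetical thresholds for illustration only.
Q1 = 30_000  # income threshold
Q2 = 10_000  # savings threshold

def risk_rule(income, savings):
    """Return 'low risk' if both thresholds are exceeded, else 'high risk'."""
    if income > Q1 and savings > Q2:
        return "low risk"
    return "high risk"

print(risk_rule(45_000, 12_000))  # -> low risk
print(risk_rule(20_000, 5_000))   # -> high risk
```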

Introduction

Given a new customer application with a certain income and savings, we can easily decide whether it is low risk or high risk. In some cases, instead of making a hard 0/1 (low-risk/high-risk) decision, we may want to calculate a probability, namely P(Y|X), where X denotes the customer attributes and Y is 0 for low risk and 1 for high risk. From this perspective, we can see classification as learning an association from X to Y.

Introduction

Then, for a given X = x, if we have P(Y=1|X=x) = 0.8, we say that the customer has an 80% probability of being high risk, or equivalently a 20% probability of being low risk. We write the model as y = g(x|Q), where g(·) is the model and Q are its parameters. Remember that y is a number in regression, whereas in classification y is a class code (e.g., 0/1). In regression, g(·) is the regression function; in classification, it is the discriminant function separating the instances of different classes.
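As one concrete (hypothetical) form of g(x|Q) for this two-attribute example, a linear discriminant uses parameters Q = (w1, w2, w0) and predicts class 1 when the weighted sum is positive; this is only an illustrative sketch, not the specific model used in the slides:

```python
def g(x1, x2, Q):
    """A linear discriminant g(x|Q) with parameters Q = (w1, w2, w0)."""
    w1, w2, w0 = Q
    return w1 * x1 + w2 * x2 + w0

def classify(x1, x2, Q):
    """Return the class code y: 1 if the discriminant is positive, else 0."""
    return 1 if g(x1, x2, Q) > 0 else 0

# Hypothetical parameter values, for illustration only.
print(classify(3.0, 4.0, Q=(0.5, 1.0, -3.0)))  # 0.5*3 + 1*4 - 3 = 2.5 > 0 -> 1
```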

Example

An AI program optimizes the parameters Q such that the approximation error is minimized, that is, so that our estimates are as close as possible to the correct values given in the training set. As an example, we observe a customer's yearly income and savings, which we represent by two random variables X1 and X2. The credibility of the customer is denoted by a Bernoulli random variable C conditioned on the observables X = [X1, X2], where C=1 indicates a high-risk customer and C=0 indicates a low-risk customer.

Example

If we know P(C|X1,X2), then when a new application arrives with X1 = x1 and X2 = x2, we can choose:

– C=1 if P(C=1|x1,x2) > 0.5
– C=0 otherwise

or equivalently:

– C=1 if P(C=1|x1,x2) > P(C=0|x1,x2)
– C=0 otherwise
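The two forms are equivalent because P(C=0|x1,x2) = 1 − P(C=1|x1,x2). A minimal sketch of the decision rule, taking the posterior probability as given:

```python
def decide(p_c1_given_x):
    """Choose C=1 if P(C=1|x) > 0.5 (equivalently, > P(C=0|x)), else C=0."""
    return 1 if p_c1_given_x > 0.5 else 0

print(decide(0.8))  # -> 1 (high risk)
print(decide(0.3))  # -> 0 (low risk)
```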

Bayes' rule

P(C|X) = P(X|C) P(C) / P(X)

– P(C|X) is called the posterior
– P(X|C) is called the likelihood
– P(C) is called the prior
– P(X) is called the evidence

P(C=1) is the prior probability that C takes the value 1, which in our example corresponds to the probability that a customer is high risk, regardless of the value of X.

Bayes' rule

It is called the prior probability because it is the knowledge we have about the value of C before looking at the observation X, and it satisfies:

– P(C=0) + P(C=1) = 1

P(X|C) is called the class likelihood and is the conditional probability that an event belonging to class C has the associated observation values X. For example, P(x1,x2|C=1) is the probability that a high-risk customer has X1 = x1 and X2 = x2. It is what the data tell us regarding the class.

Bayes' rule

P(X), the evidence, is the marginal probability that an observation X is seen, regardless of whether it is a positive or negative example:

– P(X) = P(X|C=1) P(C=1) + P(X|C=0) P(C=0)

Because of the normalization by the evidence, the posteriors sum up to 1:

– P(C=0|X) + P(C=1|X) = 1
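A small sketch of the two-class computation with made-up prior and likelihood values; the evidence is the normalizer that makes the two posteriors sum to 1:

```python
def posteriors(prior_c1, lik_c1, lik_c0):
    """Return (P(C=0|x), P(C=1|x)) via Bayes' rule for a two-class problem."""
    prior_c0 = 1.0 - prior_c1
    evidence = lik_c1 * prior_c1 + lik_c0 * prior_c0   # P(x)
    return lik_c0 * prior_c0 / evidence, lik_c1 * prior_c1 / evidence

# Illustrative numbers (not from the slides): 20% prior probability of high risk,
# and the observed x is three times as likely under the high-risk class.
p0, p1 = posteriors(prior_c1=0.2, lik_c1=0.6, lik_c0=0.2)
print(p1)        # 0.4285... = P(C=1|x)
print(p0 + p1)   # 1.0: posteriors sum to 1 after normalization by the evidence
```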

General Case

In the general case, we have K mutually exclusive and exhaustive classes Ci, i = 1, 2, …, K. We have prior probabilities satisfying P(Ci) ≥ 0 and Σi P(Ci) = 1, and P(X|Ci) is the probability of seeing X as the input when it is known to belong to class Ci. The posterior probability of class Ci can then be calculated as

P(Ci|X) = P(X|Ci) P(Ci) / P(X), where P(X) = Σk P(X|Ck) P(Ck)

For minimum error, the Bayes' classifier chooses the class with the highest posterior probability; that is, we choose Ci if P(Ci|X) = maxk P(Ck|X).
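A sketch of the K-class rule with illustrative numbers; since P(X) is the same for every class it does not change the argmax, but it is included here to show the full posteriors:

```python
def bayes_classify(priors, likelihoods):
    """
    priors[i]      = P(Ci)
    likelihoods[i] = P(x|Ci) for the observed x
    Returns the index i of the class with the highest posterior P(Ci|x).
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))           # P(x)
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]  # P(Ci|x)
    return max(range(len(posteriors)), key=lambda i: posteriors[i])

# Illustrative three-class example (numbers are made up).
print(bayes_classify(priors=[0.5, 0.3, 0.2], likelihoods=[0.1, 0.4, 0.3]))  # -> 1
```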

Learning Multiple Classes

In the general case, we have K classes Ci, i = 1, 2, …, K, and an input instance belongs to one and exactly one of them. The training set is now of the form X = {x^t, r^t}, t = 1, …, N, where the label r^t has K dimensions and

r_i^t = 1 if x^t ∈ Ci, and 0 otherwise

For classification, we would like to learn the boundary separating the instances of one class from the instances of all other classes. We can thus view a K-class classification problem as K two-class problems.
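A small sketch of this K-dimensional label encoding; each column r_i can then serve as the 0/1 target of one of the K two-class problems:

```python
def one_hot(class_index, K):
    """Encode the label r^t for an instance x^t in class C_i: r_i = 1, all others 0."""
    return [1 if i == class_index else 0 for i in range(K)]

# A training set with K=3 classes; each r^t has K dimensions.
labels = [0, 2, 1]
R = [one_hot(c, K=3) for c in labels]
print(R)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```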