
CLASSIFICATION DISCRIMINATION LECTURE 15

What is Discrimination or Classification? Consider an example where we have two populations P1 and P2, distributed as N(μ1, Σ1) and N(μ2, Σ2) respectively. A new observation x is observed, and it is known to come from one of these two populations. The task of a discriminant function is to provide a "rule" for deciding which of the two populations x is most likely to have come from. How we come up with such a rule is what we need to study.

Supervised Learning In computer science this is known as SUPERVISED learning. Essentially, we know the class labels ahead of time. What we need to do is find a RULE, using features in the data, that DISCRIMINATES effectively between the classes, so that when we see a new observation with its features we can classify it correctly.

Example 1 Suppose you are a doctor choosing between two different anesthetics for a patient. You have some information about the patient: gender, age, and some medical-history variables. What we need is a data set of past patients containing this information and whether or not each anesthetic was SAFE for that patient. USING the available variables, we build a MODEL or RULE that says whether anesthetic A or B is better for a patient, and then use this rule to decide whether to give the new patient A or B.

Example 2: Turkey Thief There was a legal case in Kansas where a turkey farmer accused his neighbor of stealing turkeys from the farm. When the neighbor was arrested and the police looked in his freezer, there were multiple frozen turkeys there. The accused claimed these were WILD turkeys that he had caught. A statistician was called in to give evidence, since there are biological differences between domestic and wild turkeys. A biologist measured the bones and other body characteristics of domestic and wild turkeys, and the statistician built a DISCRIMINANT function. They used the classification function to see whether the turkeys in the freezer fell into the WILD or the DOMESTIC class. THEY ALL fell in the DOMESTIC class!

The Idea USING knowledge of the classes, we build the FUNCTION, and we want to minimize the misclassification error. Question: should we use ALL the data to build the MODEL? If we do, we no longer have a good way to estimate the misclassification probabilities. Generally, separate training and testing sets are used.
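
A minimal sketch (not from the lecture) of such a split in R, assuming a hypothetical data frame dat with a class column cls:
## Hold out roughly one third of the rows for testing (illustrative only)
set.seed(1)
n <- nrow(dat)
train.idx <- sample(1:n, size = round(2 * n / 3))
train.set <- dat[train.idx, ]
test.set  <- dat[-train.idx, ]
## Build the rule on train.set, then estimate misclassification on test.set.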

Some Common Statistical Rules Suppose we want to classify between two multivariate normal populations: P1 with parameters μ1 and Σ1, and P2 with parameters μ2 and Σ2. Suppose a new observation vector x is known to come from either P1 or P2. Various statistical rules allow us to PREDICT which population x most likely came from.

1. Likelihood Rule Choose P1 if L(x; μ1, Σ1) > L(x; μ2, Σ2), else choose P2. Here x is the observation vector and L is the multivariate normal likelihood. This is a sensible mathematical rule under the assumption of normality.
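
A rough sketch of this rule in R, assuming the mvtnorm package and hypothetical known parameters mu1, Sigma1, mu2, Sigma2:
library(mvtnorm)                        # provides dmvnorm()
## Likelihood rule: evaluate each population's density at x and pick the larger
L1 <- dmvnorm(x, mean = mu1, sigma = Sigma1)
L2 <- dmvnorm(x, mean = mu2, sigma = Sigma2)
choice <- if (L1 > L2) "P1" else "P2"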

2. Linear Discriminant Function (LDA) Rule Choose P1 if b'x − k > 0 and P2 otherwise, where b = Σ⁻¹(μ1 − μ2) and k = (1/2)(μ1 − μ2)'Σ⁻¹(μ1 + μ2). The function b'x is called the linear discriminant function. This rule assumes equal covariance matrices (Σ1 = Σ2 = Σ). It is a single linear function of x that summarizes all the information in x relevant to the classification.
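
A small sketch of the same rule in R, reusing the hypothetical mu1, mu2 and common Sigma from above:
## b = Sigma^(-1) (mu1 - mu2),  k = 0.5 * (mu1 - mu2)' Sigma^(-1) (mu1 + mu2)
b <- solve(Sigma, mu1 - mu2)
k <- 0.5 * sum(b * (mu1 + mu2))
## Classify the new observation x: P1 if b'x - k > 0, else P2
choice <- if (sum(b * x) - k > 0) "P1" else "P2"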

3. Mahalanobis Distance Rule Choose P1 if d1 < d2, where di = (x − μi)'Σ⁻¹(x − μi) for i = 1, 2. The quantity di measures how far x is from μi, taking the variance-covariance structure into account. This also assumes equal covariance matrices Σ. Under normality and equal covariance, the likelihood rule is equivalent to this rule.
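
Base R already provides mahalanobis() for the squared distance; a sketch with the same illustrative names:
## Squared Mahalanobis distance of x to each population mean
d1 <- mahalanobis(x, center = mu1, cov = Sigma)
d2 <- mahalanobis(x, center = mu2, cov = Sigma)
choice <- if (d1 < d2) "P1" else "P2"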

4. Posterior Probability Rule Choose P1 if P(P1|x) > P(P2|x), where P(Pi|x) = exp[(−1/2)di] / {exp[(−1/2)d1] + exp[(−1/2)d2]}. This also assumes equal covariance. It is not a true probability, since "x belongs to P1" is not a random event: the observation belongs to either P1 or P2. It does, however, give an idea of how confident we are in the discrimination.
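
Continuing the same sketch, these posterior-style confidences follow directly from the d1 and d2 computed above:
## "Posterior" weights from the Mahalanobis distances (equal priors assumed)
p1.post <- exp(-0.5 * d1) / (exp(-0.5 * d1) + exp(-0.5 * d2))
p2.post <- 1 - p1.post
choice <- if (p1.post > p2.post) "P1" else "P2"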

Caveats Generally μi and Σi are not known, and we use sample values. Under equal covariance, all four rules are equivalent in terms of how they discriminate between the groups. Also, in general we have more than two populations into which to classify the observations.

Sample Discriminant Rules Since we never know the parameters, we use sample estimates, generally the maximum likelihood (or pooled) estimates of the mean vectors and common covariance matrix, and form sample discriminant rules exactly as before.
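
A sketch of the usual plug-in estimates, assuming two hypothetical training matrices X1 and X2 (rows = observations):
xbar1 <- colMeans(X1)
xbar2 <- colMeans(X2)
n1 <- nrow(X1); n2 <- nrow(X2)
## Pooled estimate of the common covariance (unbiased version rather than the strict MLE)
S.pooled <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
## Plug xbar1, xbar2, S.pooled into the rules above in place of mu1, mu2, Sigma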

Estimating the Probability of Misclassification 1. Resubstitution Estimate: Apply the discriminant function to the same data used to develop the rule and see how well it discriminates. This USES the SAME data to both build and validate the model, so it tends to be optimistic.
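
A minimal sketch of the resubstitution estimate with MASS::lda, assuming a hypothetical data frame dat with class column Sp:
library(MASS)
fit  <- lda(Sp ~ ., data = dat)           # fit the rule on ALL the data
pred <- predict(fit, dat)$class           # ...and predict the SAME data
mean(pred != dat$Sp)                      # apparent (usually optimistic) error rate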

2. Holdout Data: Keep part of the data out of the part used to construct the rule, apply the rule to the held-out part, and see how well it performs. The problem is that if you do not have many samples, this is not the most efficient use of the data for building the model.

3. Cross-Validation: Remove one observation at a time, construct the rule from the remaining observations, and predict the removed one; do this for the first observation, then the second, the third, and so on. Summarize the misclassifications over all data points (e.g., in a confusion table). This is also called jackknifing. Obviously, a rule that classifies correctly a HIGHER proportion of the time is preferred.
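
A sketch of leave-one-out cross-validation around MASS::lda, again with the hypothetical dat and class column Sp:
library(MASS)
pred <- character(nrow(dat))
for (i in seq_len(nrow(dat))) {
  fit <- lda(Sp ~ ., data = dat[-i, ])                  # fit without observation i
  pred[i] <- as.character(predict(fit, dat[i, , drop = FALSE])$class)
}
table(observed = dat$Sp, predicted = pred)              # misclassification summary
mean(pred != dat$Sp)                                    # leave-one-out error rate
(MASS::lda also has a CV = TRUE argument that returns leave-one-out predictions directly.)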

The Issue for Microarrays (MA) Often it is known in advance WHERE the samples come from and what conditions they have been exposed to. In fact, we are often interested in gene expression profiles precisely to distinguish between different conditions or classes. In the past, schemes like voting were used to assign class membership in MA studies. MANY, MANY methods are available, but the general consensus is that a few of them have robust performance, e.g., the linear discriminant function (LDA) and k-nearest neighbors (k-NN).

Cost Function and Prior Probabilities When there are only two populations, all four rules discussed earlier treat misclassifying a P1 observation into P2 the same as misclassifying a P2 observation into P1. That is NOT generally a good idea, especially in our anesthetic example: if you are going to err, err on the side of caution. Hence we need to take the COST of misclassification into account.

Some Math Details Define U = b'x − k from LDA, i.e. U = (μ1 − μ2)'Σ⁻¹x − (1/2)(μ1 − μ2)'Σ⁻¹(μ1 + μ2). Under normality and equal covariance, if x comes from P1 then U ~ N(d/2, d), and if x comes from P2 then U ~ N(−d/2, d), where d = (μ1 − μ2)'Σ⁻¹(μ1 − μ2). Our LDA rule is: choose P1 if U > 0 and P2 otherwise. To make it asymmetric, you can use a rule of the form U > u, where u is chosen so that the probability of misclassifying into one of the populations is at most a fixed number, say α.

A General Rule Define the cost function C(i|j) as the cost of misclassifying an observation from Pj into Pi, and the prior probability pi for the ith group. Average cost of misclassification (two groups): p1·C(2|1)·P(2|1) + p2·C(1|2)·P(1|2). Bayes rule: choose P1 if p1·f(x; μ1)·C(2|1) > p2·f(x; μ2)·C(1|2). Observe that if p1 = p2 and C(2|1) = C(1|2), this reduces to the likelihood rule. Under normality and equal covariance it reduces to: choose P1 if d1* < d2*, where di* = (1/2)(x − μi)'Σ⁻¹(x − μi) − log(pi·C(j|i)) with j the other group, e.g. d1* = (1/2)(x − μ1)'Σ⁻¹(x − μ1) − log(p1·C(2|1)).
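
A sketch of this cost-weighted rule under normality with equal covariance, reusing the hypothetical mu1, mu2, Sigma together with priors p1, p2 and costs C21 = C(2|1), C12 = C(1|2):
## di* = 0.5 * (x - mui)' Sigma^(-1) (x - mui) - log(pi * cost of misclassifying group i)
d1.star <- 0.5 * mahalanobis(x, mu1, Sigma) - log(p1 * C21)
d2.star <- 0.5 * mahalanobis(x, mu2, Sigma) - log(p2 * C12)
choice <- if (d1.star < d2.star) "P1" else "P2"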

Probabilistic Classification Theory (PCT) Most classification methods can be described as special implementations of Bayes classifiers. The decision rule for classifying x into one of the classes P1, …, Pk depends upon:
– Prior information about the class frequencies p1, …, pk.
– Information about how class membership affects the gene expression profiles xi (i = 1, …, n).
– Misclassification costs C(j|i) of classifying an observation that belongs to class Pi into class Pj.
Our aim is to find a classification rule R that minimizes the expected classification cost.

PCT II: Bayes Rule Recall that the cost of misclassification is given by: C(j|i) = 0 if i = j, and C(j|i) = Ci if i ≠ j (generally Ci is set to 1). Result: the classification rule that minimizes the expected misclassification cost is the one that maximizes the posterior probability: R(x) = arg max_c P(C = c | x) = arg max_c P(x | C = c)·πc. This is called the Bayes rule.

PCT III: Prior Information Hence the idea is: IF we know the prior probabilities of class membership πc, and the conditional probability of the data given the class, P(x|C), we can find the optimum classification rule. In general it is VERY difficult to KNOW the prior class-membership probabilities. To obtain P(x|C), the likelihood of the data, we often assume a normal distribution (usually after log-transforming the gene expression values so they are approximately normal). This is done on the training set.
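
A sketch of this k-class Bayes rule in R, assuming mvtnorm and hypothetical lists of estimated class means (mus) and covariances (Sigmas) plus a vector of priors:
library(mvtnorm)
## Pick the class maximizing prior * normal likelihood (parameters estimated on the training set)
bayes.classify <- function(x, mus, Sigmas, priors) {
  scores <- mapply(function(m, S, p) p * dmvnorm(x, mean = m, sigma = S),
                   mus, Sigmas, priors)
  which.max(scores)    # index of the predicted class
}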

Steps in Discriminant Analysis in MA
1. Selection of features
2. Model fitting
3. Model validation

Selection of Features Selecting a set of genes. We do not want to use all the genes, since that tends to over-fit the data and also causes singularity problems. How to select genes (gene filtering):
– Use ONLY differentially expressed genes, identified with an ANOVA-type model xi = μC(xi) + εi, i.e. a class-specific mean plus error (see the sketch below).
– Look at multiple genes or gene groups. Doing PCA on all the genes is not very efficient.
– Partial Least Squares (PLS): find orthogonal linear combinations that maximize Cov(Xl, y).
– Do PCA and then rank the PCs by the ratio of between-class to within-class variance.
– Other methods: projection pursuit, etc.
Most common: differential expression filtering or PLS.
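
A rough sketch of the ANOVA-type filter above, assuming a hypothetical expression matrix expr (genes in rows, samples in columns) and a class factor cls:
## Rank genes by the F statistic of a one-way ANOVA of expression on class
f.stat <- apply(expr, 1, function(g) summary(aov(g ~ cls))[[1]][["F value"]][1])
top.genes <- order(f.stat, decreasing = TRUE)[1:50]   # keep, say, the 50 top-ranked genes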

MODEL FITTING Commonly used:
– LDA
– k-Nearest Neighbor (k-NN)
Other related methods:
– DLDA (Diagonal LDA)
– RDA (Regularized DA) (there is an R package for this)
– PAM (Prediction Analysis for Microarrays) (there is an R package for this)
– FDA (Flexible DA)

Validation See how well the classifiers assign the observations to the different classes. The most commonly used method is leave-one-out cross-validation, though a test data set (holdout sample) and resubstitution are still used.

Linear Discriminant Analysis (LDA) An easy, useful method that has been found to be robust for MA data. Idea: the main assumption is that the class densities are multivariate normal. In R one uses lda in the MASS library. Hence:
– P(x | C = k) = MVN(μk, Σ), with class-specific means μ1, …, μk and a common covariance Σ.
– Maximize: P(C = k | x) = P(x | C = k)·πk / Σj P(x | C = j)·πj.
– If the feature set is known, this is fairly straightforward; otherwise one has to use some technique (forward, backward, or stepwise selection) to choose the features.

k-Nearest Neighbor (kNN) Assumption: samples with similar features should belong to the same class. In other words, given a set of genes (g1, …, gm) known to be important for class membership, the kNN classifier assigns an unclassified sample to the class most prevalent among the k samples whose expression values for the m genes are closest to those of the sample of interest. Typically, each profile for sample j is compared to the other profiles using Euclidean distance (although other distances, such as Manhattan or correlation-based distances, can be used as well). The aim of kNN is to estimate the posterior probability P(C(X) = j | X = x) of a profile belonging to a class directly: for a particular k, it estimates this probability as the fraction of samples belonging to class j among the k samples with the most similar profiles. It is essentially a non-linear classifier and may have VERY irregular decision boundaries.

lda example from R
> Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
+                    Sp = rep(c("s","c","v"), rep(50,3)))
> train <- sample(1:150, 75)
> table(Iris$Sp[train])
 c  s  v
## your answer may differ, since train is a random sample

Running lda
> library(MASS)    # lda() lives in the MASS package
> z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
> predict(z, Iris[-train, ])$class
 [1] s s s s s s s s s s s s s s s s s s s s s s s s s s c c c c c c c c c c c c
[39] c c c c c c c c c c c v v v v c v v v v v v v v v v v c v v c v v v v v v
Levels: c s v

Contd…
> (z1 <- update(z, . ~ . - Petal.W.))
Call:
lda(Sp ~ Sepal.L. + Sepal.W. + Petal.L., data = Iris, prior = c(1, 1, 1)/3,
    subset = train)

Prior probabilities of groups:
        c         s         v

Contd…
Group means:
  Sepal.L. Sepal.W. Petal.L.
c
s
v

Coefficients of linear discriminants:
              LD1      LD2
Sepal.L.
Sepal.W.
Petal.L.

Proportion of trace:
  LD1   LD2

knn
> library(class)
> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
> knn(train, test, cl, k = 3, prob = TRUE)