Classification 10/03/07.


Diagnose disease by gene expression pattern Golub et al. 1999

Two types of statistical learning Supervised: the classes are predefined and the memberships of a set of objects are known; try to develop a rule to predict the membership of a new object. Unsupervised: discover clusters of patterns from the observed data; both the memberships and the clusters need to be identified. Classification is a kind of supervised learning.

How good is good enough? Suppose a test is used to screen for a certain disease. The test has 99% sensitivity and 99% specificity. The disease is rare: 1 case out of 1 million people. Question: Is this test useful?

How good is good enough? The test's misclassification rate = 0.999999 * 0.01 + 0.000001 * 0.01 = 0.01. If instead we simply predict that no one has the disease, the misclassification rate = 0.000001 * 1 = 0.000001, which is far lower. Does that mean the test is no good?
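An illustrative Python sketch of this arithmetic, also computing the positive predictive value (not shown on the slide, but it makes the same point about rare diseases); the prevalence, sensitivity, and specificity are those quoted above:

```python
# Screening-test example from the slide: prevalence 1 in 1,000,000,
# sensitivity 99%, specificity 99%.
prevalence = 1e-6
sensitivity = 0.99   # P(test positive | diseased)
specificity = 0.99   # P(test negative | healthy)

# Overall misclassification rate of the test.
misclass_test = (1 - prevalence) * (1 - specificity) + prevalence * (1 - sensitivity)

# Misclassification rate of the trivial rule "predict everyone healthy".
misclass_trivial = prevalence * 1.0

# Positive predictive value: P(diseased | test positive).
ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)

print(f"test misclassification rate:    {misclass_test:.6f}")    # ~0.01
print(f"trivial rule misclassification: {misclass_trivial:.6f}") # 0.000001
print(f"P(diseased | positive test):    {ppv:.6f}")              # ~0.0001
```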

Loss function Often our goal is to minimize the misclassification error rate. Sometimes an error in one direction outweighs an error in the other direction. For example, it is more costly to classify a sick patient as healthy than to classify a healthy patient as sick. In general, we want to minimize a loss function L(C_true, C_predict).

Procedure for developing a classifier Collect data with known class associations. Set aside a subset and don't touch it; this will be the testing subset. Build a model using information from the rest of the data, i.e., the training set. Apply the trained model to the testing data. Evaluate model performance. If you use all the data to train your model, you will be overfitting, and the apparent performance will be exaggerated.
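A minimal Python sketch of this procedure, using made-up data and a trivial majority-class rule as a stand-in for any classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 20 features, binary class labels.
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Step 1: set aside a testing subset and do not touch it during training.
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:30], idx[30:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Step 2: build the model using only the training set.
# Here the "model" is a trivial majority-class rule, standing in for any classifier.
majority_class = np.bincount(y_train).argmax()

# Step 3: apply the trained model to the testing data and evaluate.
y_pred = np.full(len(y_test), majority_class)
test_error = np.mean(y_pred != y_test)
print(f"test misclassification rate: {test_error:.2f}")
```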

k-nearest-neighbor classifier [Figure: an unlabeled point "?" among labeled points of two classes]

k-nearest-neighbor classifier Find the k nearest neighbors [Figure: the five nearest points, labeled 1-5]

k-nearest-neighbor classifier Find the k nearest neighbors. Classify the unknown case by majority vote. Despite its simplicity, kNN can be effective.
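An illustrative from-scratch sketch of the majority-vote rule (Euclidean distance, k = 3, and the toy data are arbitrary choices):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two well-separated classes in 2-D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```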

Issues with the k-nearest-neighbor classifier Computationally intensive. How to choose k? Nearest neighbors may not be close (especially when X is high-dimensional). Most genes are probably irrelevant to the prediction anyway. Pre-select features using dimension-reduction methods (discussed by Prof. Cai last time). Dimension reduction is important for other classifiers as well.

Feature selection The dimension of the model (the number of genes) is very high. It is hard to find close neighbors in high-dimensional space, and many genes are irrelevant. Pre-select genes using dimension-reduction methods. Dimension reduction is required for other models as well.

Feature selection

Feature selection methods Stepwise regression, PCA, PLS, ridge regression, LASSO, etc. (Cai)

Classification Methods Linear discriminant analysis (LDA) Logistic regression Classification trees Support vector machine (SVM) Neural network Many other methods!

Linear methods [Figure: data points from Class 1 and Class 2]

Linear Discriminant Analysis (LDA) Approximate the probability distribution within each class by a Gaussian distribution. [Figure: Gaussian contours fitted to Class 1 and Class 2]

Bayes Rule The posterior probability of class k is P(G = k | X = x) = π_k f_k(x) / Σ_l π_l f_l(x), where π_k is the prior probability of class k and f_k is the density of X within class k. Select the k with the largest posterior probability; this rule minimizes the average misclassification rate. The maximum likelihood rule (select the k with the largest f_k(x)) is equivalent to the Bayes rule with a uniform prior. The decision boundary between classes k and l is the set {x : P(G = k | X = x) = P(G = l | X = x)}.

Linear Discriminant Analysis Assume the density of X within class k is multivariate Gaussian, f_k(x) = N(μ_k, Σ_k), with class-specific mean μ_k and covariance matrix Σ_k, and prior probability π_k for class k.

Linear Discriminant Analysis When all classes share a common covariance matrix (Σ_k = Σ), the log posterior odds are linear in x, and the rule reduces to assigning x to the class with the largest linear discriminant function δ_k(x) = x'Σ⁻¹μ_k − ½ μ_k'Σ⁻¹μ_k + log π_k, with μ_k, Σ, and π_k estimated from the training data.

LDA The boundary is linear if the covariance matrices of the two classes are the same. Otherwise the boundary is quadratic, and the method is called QDA. [Figure: Class 1 and Class 2]
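A minimal two-class LDA sketch under the common-covariance Gaussian assumption, with class priors estimated from class frequencies; the toy data are made up:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class means, a pooled covariance matrix, and class priors."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    priors = {k: np.mean(y == k) for k in classes}
    # Pooled within-class covariance (common to all classes).
    n, p = X.shape
    pooled = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k] - means[k]
        pooled += Xk.T @ Xk
    pooled /= (n - len(classes))
    return classes, means, priors, np.linalg.inv(pooled)

def lda_predict(model, x):
    """Assign x to the class with the largest linear discriminant score."""
    classes, means, priors, Sinv = model
    scores = {k: x @ Sinv @ means[k] - 0.5 * means[k] @ Sinv @ means[k] + np.log(priors[k])
              for k in classes}
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
y = np.repeat([0, 1], 50)
model = lda_fit(X, y)
print(lda_predict(model, np.array([2.5, 2.8])))  # -> 1
```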

Diabetes Data Set

Logistic regression Model the log-odds of the k-th class vs. a reference class (e.g., the 1st class) as a linear function of x: log[ P(G = k | X = x) / P(G = 1 | X = x) ] = β_k0 + β_k'x. Select the k with the largest P(G = k | X = x). Question: how do we estimate the β's?

Fitting the logistic regression model Let p_k(x_i; β) = P(G = k | X = x_i; β). Maximize the conditional log-likelihood l(β) = Σ_i log p_{g_i}(x_i; β), where g_i is the observed class of the i-th observation. In the special case of two classes, let y_i = 0 when g_i = 1 and y_i = 1 when g_i = 2. Then l(β) = Σ_i [ y_i β'x_i − log(1 + e^{β'x_i}) ], with x_i including the intercept term. The maximum is achieved when the score equations ∂l/∂β = Σ_i x_i (y_i − p(x_i; β)) = 0 hold.

Fitting the logistic regression model (ctd) Since this is a system of non-linear equations, it can only be solved numerically. This is achieved by the Newton-Raphson method: β_new = β_old − (∂²l/∂β∂β')⁻¹ ∂l/∂β, with the derivatives evaluated at β_old. Note: global convergence is not guaranteed. For multiple classes the β's can be estimated similarly.
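An illustrative sketch of the two-class Newton-Raphson iteration on synthetic data (the iteration count, tolerance, and data are arbitrary choices):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Two-class logistic regression fitted by Newton-Raphson.

    X is assumed to already contain an intercept column; y is 0/1.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # fitted probabilities
        grad = X.T @ (y - p)                          # score: sum_i x_i (y_i - p_i)
        W = p * (1 - p)                               # diagonal of the weight matrix
        hess = -(X * W[:, None]).T @ X                # Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad)
        beta = beta - step                            # beta_new = beta_old - H^{-1} grad
        if np.linalg.norm(step) < tol:
            break
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-1.0, 2.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)
print(fit_logistic(X, y))   # roughly recovers [-1, 2]
```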

Connection between LDA and logistic regression Both model the log posterior odds as a linear function of x; LDA estimates the coefficients from the joint (Gaussian) likelihood of X and G, whereas logistic regression maximizes the conditional likelihood of G given X and therefore makes fewer distributional assumptions.

Diabetes Data Set

Naïve Bayes method From Bayes' rule, P(G = k | X) ∝ π_k p_k(X). If X is high-dimensional (its dimension being the number of genes considered), the joint density p_k(X) is difficult to estimate. However, if we assume the X_j's are independent of each other, i.e., p_k(X) = Π_j p_kj(X_j), then each one-dimensional density p_kj(X_j) can be easily estimated.

Naïve Bayes method Therefore, assign X to the class k that maximizes log π_k + Σ_j log p_kj(X_j). Note: surprisingly, even though the assumption that the X_j's are independent is almost never met, the naïve Bayes classifier often performs well, even beating more sophisticated methods. Up to this point we have discussed linear methods; nonlinear methods are discussed next.
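Before turning to nonlinear methods, an illustrative Gaussian naïve Bayes sketch built on the independence assumption above, with each feature modeled as a univariate normal within each class; the data are synthetic:

```python
import numpy as np

def naive_bayes_fit(X, y):
    """Per-class priors plus per-feature means and variances (independence assumption)."""
    classes = np.unique(y)
    params = {}
    for k in classes:
        Xk = X[y == k]
        params[k] = (np.mean(y == k), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)
    return params

def naive_bayes_predict(params, x):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log-densities."""
    best_class, best_score = None, -np.inf
    for k, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best_class, best_score = k, score
    return best_class

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(1.5, 1, (60, 5))])
y = np.repeat([0, 1], 60)
params = naive_bayes_fit(X, y)
print(naive_bayes_predict(params, np.full(5, 1.2)))  # -> 1
```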

Classification tree Goal: predict whether a person owns a house by asking a few questions with yes-or-no answers. Predictors: Age, Car Type, etc. [Figure: a small example tree that splits first on Age (>= 30 vs. < 30) and then on Car Type (sports car vs. minivan), with YES/NO leaves]

[Figure: the same example, showing the Age and Car Type data alongside the fitted tree]

Regression tree: Algorithm The response is continuous. Goal: select a partition of regions (nodes) R_1, …, R_M, so that the response can be modeled as a constant c_m in each region. Step 1: For a splitting variable X_j and a splitting point s, define the pair of regions R_1(j, s) = {X | X_j ≤ s} and R_2(j, s) = {X | X_j > s}. Seek j and s so that the total within-region squared error, Σ_{x_i in R_1(j,s)} (y_i − c_1)² + Σ_{x_i in R_2(j,s)} (y_i − c_2)², with c_1 and c_2 the region means, is minimized. Step 2: For each R_m, refine the partition by repeating Step 1; stop when the number of nodes reaches a predefined cutoff.
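An illustrative sketch of the Step 1 search over splitting variables and split points, using the squared-error criterion above on synthetic data:

```python
import numpy as np

def best_split(X, y):
    """Find the (variable j, split point s) minimizing the summed squared error
    of fitting a constant in each of the two resulting regions."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(100, 3))
y = np.where(X[:, 1] > 0.5, 2.0, -2.0) + rng.normal(0, 0.1, 100)
j, s, sse = best_split(X, y)
print(f"split on X{j} at {s:.2f} (SSE = {sse:.2f})")   # should pick variable 1 near 0.5
```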

Classification tree: Pruning Define a subtree to be any tree that can be obtained by pruning the full tree T. Let |T| denote the number of terminal nodes of a tree T, N_m the number of observations falling in terminal node m, and ĉ_m the mean response in node m. The quality of a node is measured by Q_m(T) = (1/N_m) Σ_{x_i in R_m} (y_i − ĉ_m)². Define a cost-complexity criterion for a pre-selected level α: C_α(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α|T|. Seek the subtree T_α that minimizes C_α(T).

Classification tree: Pruning Find the weakest link, that is, the internal node whose collapse produces the minimum increase of Σ_m N_m Q_m(T). Repeat the above procedure until a single-node tree is reached. Theorem (Breiman et al. 1984): the optimal subtree T_α is contained in the above sequence of subtrees. The level of α can be determined through cross-validation. (We will talk about cross-validation later.)

Classification tree A classification tree differs from a regression tree in the quality term Q_m(T). For a regression tree, minimize the within-node squared error (1/N_m) Σ_{x_i in R_m} (y_i − ĉ_m)². For a classification tree, let p̂_mk be the proportion of class-k observations in node m and minimize one of: Misclassification error: 1 − max_k p̂_mk. Gini index: Σ_k p̂_mk (1 − p̂_mk). Cross-entropy or deviance: −Σ_k p̂_mk log p̂_mk.
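A small numerical sketch of the three impurity measures above (the example node proportions are made up):

```python
import numpy as np

def misclassification_error(p):
    """1 minus the proportion of the majority class in the node."""
    return 1.0 - np.max(p)

def gini_index(p):
    """sum_k p_k (1 - p_k)."""
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    """-sum_k p_k log p_k (also called deviance)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

node_proportions = np.array([0.7, 0.2, 0.1])   # hypothetical class proportions in a node
print(misclassification_error(node_proportions))  # 0.3
print(gini_index(node_proportions))               # 0.46
print(cross_entropy(node_proportions))            # ~0.80
```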

Classification tree Advantages: visually intuitive; mathematically "simple". Drawbacks: unstable (tree structures are sensitive to the data); theoretical properties are not well understood.

Performance of a classifier Cross-validation Bootstrap

Cross-validation The data are divided into a training subset and a testing subset. Model building must be independent of the testing subset, including variable selection, tree structure, and so on. Example: n-fold cross-validation. The dataset is randomly divided into n subsets of equal size; each subset is used in turn as the testing set, while the rest are used as the training set, and the n error estimates are averaged.
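A minimal sketch of n-fold cross-validation wrapped around an arbitrary fit/predict pair (the majority-class model is only a placeholder for a real classifier):

```python
import numpy as np

def cross_validated_error(X, y, n_folds, fit, predict, seed=0):
    """Average test misclassification rate over n folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train_idx], y[train_idx])      # training uses only n-1 folds
        y_pred = predict(model, X[test_idx])         # evaluate on the held-out fold
        errors.append(np.mean(y_pred != y[test_idx]))
    return float(np.mean(errors))

# Placeholder classifier: always predict the majority class of the training fold.
fit = lambda X, y: np.bincount(y).argmax()
predict = lambda model, X: np.full(len(X), model)

rng = np.random.default_rng(5)
X = rng.normal(size=(90, 4))
y = rng.integers(0, 2, size=90)
print(cross_validated_error(X, y, n_folds=5, fit=fit, predict=predict))
```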

Bootstrap methods Idea: randomly draw, with replacement, from the training data, each bootstrap sample having the same size as the original training set. Fit the model on each resampled dataset, then treat the original training data as testing data and estimate the prediction error by averaging over the bootstrap samples. Improved versions (e.g., the leave-one-out bootstrap) evaluate each observation only on bootstrap samples that do not contain it, since the basic estimate is optimistic.
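A sketch of the basic bootstrap error estimate: resample with replacement, refit, and evaluate on the original training data (again with a placeholder classifier; B = 50 is an arbitrary choice):

```python
import numpy as np

def bootstrap_error(X, y, fit, predict, B=50, seed=0):
    """Average, over B bootstrap samples, of the error of the refitted model
    evaluated on the original training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(B):
        boot_idx = rng.integers(0, n, size=n)        # draw n samples with replacement
        model = fit(X[boot_idx], y[boot_idx])        # fit on the bootstrap sample
        y_pred = predict(model, X)                   # evaluate on the original data
        errors.append(np.mean(y_pred != y))
    return float(np.mean(errors))

fit = lambda X, y: np.bincount(y).argmax()           # placeholder classifier
predict = lambda model, X: np.full(len(X), model)

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 4))
y = rng.integers(0, 2, size=80)
print(bootstrap_error(X, y, fit, predict))
```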

Use cross-validation to select parameters A classifier may have several tunable parameters, for example the number of nearest neighbors in kNN or α for pruning a classification tree. These parameters can be selected by CV. In that case, the full dataset is divided into three parts: a training set, testing set 1, and testing set 2. Testing set 1 is used to tune the parameters, so it cannot also be used to objectively estimate model performance; therefore testing set 2 is needed.
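An illustrative sketch of the three-way split: tune the number of nearest neighbors on testing set 1, then report final performance on testing set 2 (split sizes, candidate k values, and the data are arbitrary):

```python
import numpy as np
from collections import Counter

def knn_error(X_tr, y_tr, X_te, y_te, k):
    """Misclassification rate of a k-nearest-neighbor rule."""
    errs = 0
    for x, y_true in zip(X_te, y_te):
        nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
        y_pred = Counter(y_tr[nearest]).most_common(1)[0][0]
        errs += (y_pred != y_true)
    return errs / len(y_te)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (90, 3)), rng.normal(2, 1, (90, 3))])
y = np.repeat([0, 1], 90)
idx = rng.permutation(len(y))
train, tune, test = idx[:100], idx[100:140], idx[140:]   # training / testing set 1 / testing set 2

# Tune k on testing set 1 only.
best_k = min([1, 3, 5, 7, 9],
             key=lambda k: knn_error(X[train], y[train], X[tune], y[tune], k))

# Report an objective error estimate on testing set 2.
print(best_k, knn_error(X[train], y[train], X[test], y[test], best_k))
```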

Acknowledgement Sources of slides: Cheng Li http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt