The Broad Institute of MIT and Harvard Classification / Prediction.

1 The Broad Institute of MIT and Harvard Classification / Prediction

2 Classification: “Supervised Learning”
Use a “training set” of examples to create a model that can predict, given an unknown sample, which of two or more classes that sample belongs to.

3 What we’ll cover
–How to build a classifier.
–How to evaluate a classifier.
–Using GenePattern to classify expression data.

4 What Is a Classifier
A predictive rule that uses a set of inputs (genes) to predict the value of the output (phenotype). Known examples (training data) are used to build the predictive rule.
Goal:
–Achieve high generalization (predictive) power.
–Avoid over-fitting.

5 Classification: computational methodology
Inputs: data and known classes.
Steps: (1) assess gene-class correlation and perform feature (marker) selection; (2) build classifier; (3) test classifier by cross-validation; (4) evaluate classifier on independent test set.

6 Classification: computational methodology
Inputs: expression data and known classes.
Steps: (1) assess gene-class correlation and perform feature selection; (2) build classifier; (3) test classifier by cross-validation; (4) evaluate classifier on independent test set.
Classifiers: Regression Trees (CART), Weighted Voting, k-Nearest Neighbors, Support Vector Machines.

7 Classifiers
Important issues:
–Few cases, many variables (genes).
–Redundancy: many highly correlated genes.
–Noise: measurements are very imprecise.
–Feature selection: reducing p (the number of variables) is a necessity.
Avoid over-fitting.

8 k-NN Classifier
Example: K=5, 2 genes, 2 classes.
Project samples in gene space.
[Figure: samples plotted on axes gene 1 and gene 2; classes orange and black.]

9 k-NN Classifier
Example: K=5, 2 genes, 2 classes.
Project unknown sample.
[Figure: unknown sample "?" added to the plot of gene 1 vs. gene 2; classes orange and black.]

10 k-NN Classifier
Example: K=5, 2 genes, 2 classes.
"Consult" the 5 closest neighbors of the unknown sample:
–3 black
–2 orange
With a majority vote, the unknown sample is assigned to class black.
Distance measures: Euclidean distance, 1 - Pearson correlation, …
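
The k-NN example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates and the names `euclidean` and `knn_predict` are made up for this sketch, chosen so that the 5 closest neighbors split 3 black to 2 orange as on the slide. Only Euclidean distance is shown, since with just 2 genes a Pearson-based distance is not informative.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points in gene space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, labels, sample, k=5, distance=euclidean):
    """Return (predicted class, vote counts) from the k closest training samples."""
    ranked = sorted(zip(train, labels), key=lambda pair: distance(pair[0], sample))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes


# Hypothetical training samples: (gene 1, gene 2) expression values.
train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0),
         (4.0, 3.5), (5.0, 5.0), (5.5, 4.5), (6.0, 5.5)]
labels = ["black", "black", "black", "black",
          "orange", "orange", "orange", "orange"]

unknown = (3.2, 3.0)  # hypothetical unknown sample
predicted, votes = knn_predict(train, labels, unknown, k=5)
print(predicted, dict(votes))
```

Swapping in another distance function only requires passing a different `distance` argument; the voting step stays the same.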

11 Support Vector Machine (SVM)
(Noble, Nat Biotech 2006)

12 Weighted Voting
Mixture of Experts approach:
–Each gene casts a vote for one of the possible classes.
–The vote is weighted by a score assessing the reliability of the expert (in this case, the gene).
–The class receiving the highest vote will be the predicted class.
–The vote can be used as a proxy for the probability of class membership (prediction strength).
Slonim et al., RECOMB 2000
[Figure: the genes g1, g2, …, gi, …, gn-1, gn of a new sample "?" compared against the Class 1 centroid and the Class 2 centroid.]
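
A minimal sketch of a two-class weighted-voting scheme follows. The slide does not give the formulas, so this assumes the commonly described signal-to-noise formulation: each gene's weight is (mean1 - mean2) / (sd1 + sd2), its decision boundary is the midpoint of the two class means, and prediction strength is (V_win - V_lose) / (V_win + V_lose). The function names and the toy data are hypothetical.

```python
from statistics import mean, stdev


def train_weighted_voting(class1, class2):
    """Compute a (weight, boundary) pair per gene from two lists of samples.

    Assumes a signal-to-noise weight; a gene with zero spread in both
    classes would cause a division by zero and should be filtered out first.
    """
    params = []
    for g1, g2 in zip(zip(*class1), zip(*class2)):
        mu1, mu2 = mean(g1), mean(g2)
        weight = (mu1 - mu2) / (stdev(g1) + stdev(g2))
        boundary = (mu1 + mu2) / 2
        params.append((weight, boundary))
    return params


def predict_weighted_voting(params, sample):
    """Return (winning class, prediction strength) for one new sample."""
    v1 = v2 = 0.0
    for (weight, boundary), x in zip(params, sample):
        vote = weight * (x - boundary)  # positive votes favor class 1
        if vote > 0:
            v1 += vote
        else:
            v2 += -vote
    winner = 1 if v1 >= v2 else 2
    total = v1 + v2
    strength = abs(v1 - v2) / total if total else 0.0
    return winner, strength


# Hypothetical data: rows are samples, columns are genes.
class1 = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
class2 = [[1.0, 5.0], [2.0, 6.0], [1.5, 7.0]]
params = train_weighted_voting(class1, class2)
winner, strength = predict_weighted_voting(params, [6.0, 1.0])
print(winner, strength)
```

In this toy case both genes vote for class 1, so the prediction strength is at its maximum of 1.0; samples near the boundaries give values closer to 0.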

13 Class Prediction: computational methodology
Inputs: expression data and known classes.
Steps: (1) assess gene-class correlation and perform feature selection; (2) build predictor; (3) test predictor by cross-validation; (4) evaluate predictor on independent test set.
Predictors: Regression Trees (CART), Weighted Voting, k-Nearest Neighbors, Support Vector Machines.

14 Testing the Classifier
Evaluation on independent test set:
–Build the classifier on the train set.
–Assess prediction performance on the test set.
Maximize generalization / avoid over-fitting.
Performance measure: error rate = (# of cases incorrectly classified) / (total # of cases)
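
As a small worked example, the performance measure can be computed as follows, taking the error rate to be the fraction of test cases that are misclassified (accuracy is its complement). The function name and the labels are hypothetical.

```python
def error_rate(true_labels, predicted_labels):
    """Fraction of cases whose predicted class differs from the known class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one case")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Hypothetical test set of 4 cases with 1 misclassification.
known = ["ALL", "AML", "ALL", "AML"]
predicted = ["ALL", "ALL", "ALL", "AML"]
rate = error_rate(known, predicted)
print(rate)  # 1 wrong out of 4 cases
```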

15 Testing the Classifier
Evaluation on independent test set:
–What if we don’t have an independent test set?
Cross Validation (XV):
–Split the dataset into n folds (e.g., 10 folds of 10 cases each).
–For each fold (e.g., for each group of 10 cases), train (i.e., build the model) on the other n-1 folds (e.g., on 90 cases) and test (i.e., predict) on the left-out fold (e.g., on the remaining 10 cases).
–Combine test results.
–Frequently, leave-one-out XV is used (when the sample size is small).
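
The cross-validation procedure above can be sketched as follows. This is an illustration only: `k_fold_indices`, `cross_validate`, and the toy one-gene nearest-centroid classifier are hypothetical names and choices made for this sketch, not GenePattern code. Setting `n_folds` equal to the number of cases gives leave-one-out XV.

```python
import random


def k_fold_indices(n_cases, n_folds, seed=0):
    """Shuffle case indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(samples, labels, n_folds, fit, predict, seed=0):
    """Train on n-1 folds, test on the left-out fold, and pool the errors."""
    errors = 0
    for held_out in k_fold_indices(len(samples), n_folds, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        errors += sum(predict(model, samples[i]) != labels[i] for i in held_out)
    return errors / len(samples)


# Toy classifier on a single gene: assign the class with the nearest mean.
def fit_centroids(xs, ys):
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in set(ys)}


def predict_centroid(model, x):
    return min(model, key=lambda c: abs(x - model[c]))


# Hypothetical dataset: 100 cases, two well-separated classes.
samples = [float(v) for v in range(50)] + [float(v) for v in range(100, 150)]
labels = ["A"] * 50 + ["B"] * 50

folds = k_fold_indices(len(samples), 10)  # 10 folds of 10 cases each
xv_error = cross_validate(samples, labels, 10, fit_centroids, predict_centroid)
print([len(f) for f in folds], xv_error)
```

Pooling the errors over all left-out folds means every case is predicted exactly once by a model that never saw it during training.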

16 Testing the Classifier
Example: ALL vs. MLL vs. AML.
[Figure: learning curves, leave-one-out cross-validation.]

17 Testing the Classifier
Error rate estimate:
–Evaluation on an independent test set: the best error estimate.
–Cross Validation: needed when the sample size is small, and for model selection.

18 Classification Cookbook
Start by splitting the data into a train set and a test set (stratified).
“Forget” about the test set until the very end.
Explore different feature selection methods and different classifiers on the train set by XV.
Once the “best” classifier and best classifier parameters have been selected (based on XV):
–Build a classifier with the given parameters on the entire train set.
–Apply the classifier to the test set.
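
The first step of the cookbook, a stratified train/test split, might look like the sketch below. The function name `stratified_split`, the 25% test fraction, and the class counts are assumptions made for illustration; the point is only that each class contributes the same proportion of its cases to the test set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train indices, test indices), sampling each class separately."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical dataset: 12 ALL cases and 8 AML cases.
labels = ["ALL"] * 12 + ["AML"] * 8
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

After this split, the test indices are set aside; feature selection, classifier choice, and parameter tuning would use only the train indices.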

19 Classification: GenePattern methods
Split data into train and test set:
–SplitDatasetTrainTest
Explore different feature selection methods and different classifiers on the train set by XV:
–CARTXValidation
–KNNXValidation
–WeightedVotingXValidation
Once the “best” classifier and best classifier parameters have been selected (based on XV):
–CART
–KNN
–WeightedVoting
Examine results:
–PredictionResultsViewer
–FeatureSummaryViewer

20 References
1. Golub, T.R., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, October 15 1999. 286(5439): p. 531-537.
2. Quackenbush, J., Computational Analysis of Microarray Data. Nature Reviews Genetics, June 2001. 2: p. 418-427.
3. Tibshirani, R., et al., Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 2002. 99(10): p. 6567-6572.
4. Slonim, D.K., et al., Class Prediction and Discovery Using Gene Expression Data, in RECOMB 2000: The Fourth Annual International Conference on Research in Computational Molecular Biology. 2000: Tokyo, Japan. p. 263-272.
5. Ramoni, M.F., P. Sebastiani, and I.S. Kohane, From the Cover: Cluster analysis of gene expression dynamics. PNAS, 2002. 99(14): p. 9121-9126.
6. Savage, K.J., et al., The molecular signature of mediastinal large B-cell lymphoma differs from that of other diffuse large B-cell lymphomas and shares features with classical Hodgkin's lymphoma. Blood, 2003. 102(12): p. 3871-3879.
7. …

21 Classification Example

