Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.


The microarray of Ms. Smith yields the expression profile of Ms. Smith. Classification

The expression profile, the properties of Ms. Smith:
- a list of 30,000 numbers
- some of them reflect her health problem (e.g., cancer)
- the profile is an image of Ms. Smith's physiology
How can these numbers tell us (predict) whether Ms. Smith has tumor type A or tumor type B? Classification

Ms. Smith: compare her profile to the profiles of patients with tumor type A and to patients with tumor type B. Looking for similarities. Classification

Ms. Smith: there are patients of known class, the training samples, and there are patients of unknown class, the "new" samples. Training and Prediction

Ms. Smith: use the training samples to learn how to predict the class of the "new" samples. Training and Prediction

Color-coded expression levels of training samples. Ms. Smith → type A? Ms. Smith → type B? Ms. Smith → borderline? Which color shade is a good decision boundary? Prediction using one Gene

Use the cutoff with the fewest misclassifications on the training samples, i.e., the smallest training error. [Figure: distributions of expression values in type A and type B, decision boundary, training error] Optimal decision rule
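The single-gene rule above can be sketched in a few lines of Python (illustrative only; `best_cutoff` and the toy data are hypothetical, not from the slides): scan candidate cutoffs between sorted expression values and keep the one with the smallest training error.

```python
import numpy as np

def best_cutoff(values, labels):
    """Find the expression cutoff with the fewest misclassifications
    on the training samples (labels: 'A' or 'B')."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    # candidate cutoffs: midpoints between consecutive sorted values
    cuts = (v[:-1] + v[1:]) / 2
    best = None
    for c in cuts:
        # rule: predict 'A' below the cutoff, 'B' above it
        pred = np.where(v < c, "A", "B")
        err = np.mean(pred != y)
        if best is None or err < best[1]:
            best = (c, err)
    return best  # (cutoff, training error)

# toy data: type A tends to have low expression, type B high
vals = [1.0, 1.2, 1.5, 2.8, 3.0, 3.3]
labs = ["A", "A", "A", "B", "B", "B"]
print(best_cutoff(vals, labs))
```

On this toy data the cutoff lands between the two groups with zero training error; on real data, a nonzero training error usually remains.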

Test error: The decision boundary was chosen to minimize the training error. The two distributions of expression values for type A and B will be similar, but not identical, in a set of new cases. We cannot adjust the decision boundary, because we do not know the class of the new samples. Test errors are usually larger than training errors; this phenomenon is called overfitting. [Figure: training set vs. test set, training error vs. test error] Optimal decision rule

Combining information across genes: taking means across genes. [Figure: the top gene vs. the average of the top 10 genes, ALL vs. AML, Golub et al.]

Using a weighted average of the expression values with "good weights", you get an improved separation. Combining information across genes

The geometry of weighted averages: calculating a weighted average is identical to orthogonally projecting the expression profiles onto the line defined by the weight vector (of length 1). Combining information across genes
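As a small numerical sketch (the numbers are made up): with a unit weight vector w, the weighted average w·x of a profile x is exactly the coordinate of the orthogonal projection of x onto the line through w.

```python
import numpy as np

# expression profile of one sample (2 genes) and a weight vector of length 1
x = np.array([3.0, 1.0])
w = np.array([0.6, 0.8])          # unit vector: 0.6**2 + 0.8**2 == 1

weighted_avg = np.dot(w, x)       # the weighted average score: 2.6
projection = weighted_avg * w     # the projected point on the line through w

print(weighted_avg)   # 2.6
print(projection)     # [1.56 2.08]
```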

Hyperplanes: together with an offset β₀, the weight vector defines a hyperplane that cuts the data into the two groups A and B (a line for 2 genes, a plane for 3 genes). Linear decision rules

Linear signatures: compute the score y = w·x + β₀ of an expression profile x. If y ≥ 0 → disease A; if y < 0 → disease B. Linear decision rules

Nearest Centroids

Diagonal Linear Discriminant Analysis (DLDA): rescale the axes according to the variances of the genes. Linear Discriminant Analysis
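A minimal sketch of DLDA as variance-rescaled nearest centroids (the function name and toy data are hypothetical): compute a pooled per-gene variance, then assign a new profile to the class whose centroid is nearest in the rescaled metric.

```python
import numpy as np

def dlda_predict(X_train, y_train, x_new):
    """Diagonal LDA sketch: nearest centroid after rescaling each
    gene axis by its pooled within-class variance."""
    classes = np.unique(y_train)
    var = np.zeros(X_train.shape[1])
    for c in classes:
        Xc = X_train[y_train == c]
        var += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    var /= (len(y_train) - len(classes))      # pooled variance per gene
    dists = [np.sum((x_new - X_train[y_train == c].mean(axis=0)) ** 2 / var)
             for c in classes]
    return classes[int(np.argmin(dists))]

# toy training data: two genes, two patients per class
X = np.array([[1.0, 5.0], [1.2, 4.8], [3.0, 1.0], [3.2, 0.8]])
y = np.array(["A", "A", "B", "B"])
print(dlda_predict(X, y, np.array([1.1, 5.1])))  # 'A'
```

Dropping the variance rescaling (setting `var` to all ones) recovers plain nearest centroids.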

Discriminant Analysis: the data often shows evidence of non-identical covariances of the genes in the two groups. Hence using LDA, DLDA or NC introduces a model bias (= wrong model assumptions, here due to oversimplification). Linear Discriminant Analysis

Gene Filtering:
- Rank genes according to a score
- Choose the top n genes
- Build a signature with these genes only
These are still weights, but most of them are zero. Note that the data decides which are zero and which are not. Limitation: you have no(?) chance to find these two genes among 30,000 non-informative genes. Feature Reduction
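A hedged sketch of such a filter (the score and helper name are illustrative, not the slides' exact score): rank genes by an absolute t-like score and zero out all weights except the top n.

```python
import numpy as np

def top_gene_weights(X, y, n_genes=10):
    """Rank genes by an absolute two-sample t-like score and keep only
    the top n; all other weights are set to zero."""
    a, b = X[y == "A"], X[y == "B"]
    score = (a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-9)
    weights = np.zeros(X.shape[1])
    top = np.argsort(-np.abs(score))[:n_genes]
    weights[top] = score[top]          # signed weights for the chosen genes
    return weights

# synthetic data: gene 0 is informative, the other 49 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = np.array(["A"] * 10 + ["B"] * 10)
X[y == "A", 0] += 5
print(np.count_nonzero(top_gene_weights(X, y, n_genes=3)))  # 3
```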

How many genes? Is this a biological or a statistical question? Biology: how many genes carry diagnostic information? Statistics: how many genes should we use for classification? The microarray offers 30,000 genes or more. Feature Reduction

Finding the needle in the haystack. A common myth: classification information in gene expression signatures is restricted to a small number of genes; the challenge is to find them. Feature Reduction

The Avalanche (Verbundprojekt maligne Lymphome, collaborative project on malignant lymphomas): aggressive lymphomas with (MYC-pos) and without (MYC-neg) a MYC breakpoint. Feature Reduction

Independent Validation: the accuracy of a signature on the data it was learned from is biased because of the overfitting phenomenon. Validation of a signature requires independent test data. [Figure: training set vs. test set, training error vs. test error] Cross Validation

Generating Test Sets: split the data randomly into test data and training data. Learn the classifier on the training data, and apply it to the test data. [Figure: correctly ("ok") and incorrectly ("mistake") classified test samples] Cross Validation
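A minimal sketch of such a random split (the sizes and names are arbitrary): shuffle the patient indices once, then partition them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                # hypothetical number of patients
idx = rng.permutation(n)
test_idx, train_idx = idx[:n // 3], idx[n // 3:]

# the classifier is learned on train_idx only and applied once to test_idx
print(len(train_idx), len(test_idx))   # 67 33
```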

Cross-validation evaluates the algorithm with which the signature was built. Gene selection and adjustment of all parameters must be repeated for every learning step within the training sample of the cross-validation. n-fold cross-validation (n=5): train and evaluate on each fold in turn. Cross Validation
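The rule that gene selection must live inside the loop can be sketched as follows (the helper names are hypothetical): the `train_and_predict` callback sees only the training fold, so any gene selection or tuning it performs is honestly repeated per fold.

```python
import numpy as np

def cross_validate(X, y, train_and_predict, n_folds=5, seed=0):
    """n-fold CV: gene selection and all parameter tuning must happen
    inside train_and_predict, using the training fold only."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = train_and_predict(X[train], y[train], X[test])
        errs.append(np.mean(pred != y[test]))
    return float(np.mean(errs))
```

Selecting genes on the full data first and only then running CV would leak information from the test folds into the signature.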

Estimators of performance have a variance, which can be high. The chance that a meaningless signature produces 100% accuracy on test data is high if the test data includes only a few patients. [Figure: nested 10-fold CV, variance from 100 random partitions] Cross Validation

With a growing number of genes in the model, the gap between training error and test error becomes wider. There is a good statistical reason for not including hundreds of genes in a model, even if they are biologically affected. Bias & Overfitting

The shrunken centroid method and the PAM package (Tibshirani et al., 2002). Centroid Shrinkage

Shrinkage Δ

How much shrinkage is good in PAM (Prediction Analysis of Microarrays)? Cross-validation: compute the CV performance for several values of Δ, and pick the Δ that gives you the smallest number of CV misclassifications. PAM does this routinely. Adaptive Model Selection. Centroid Shrinkage
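The core of centroid shrinkage is soft-thresholding; a sketch (assuming the standard soft-threshold form of Tibshirani et al., with Δ written as `delta`): each class-vs-overall centroid difference is shrunk towards zero, and genes whose difference falls below Δ drop out of the signature entirely.

```python
import numpy as np

def shrink(d, delta):
    """Soft-threshold a centroid difference d: shrink towards zero by delta;
    genes whose difference does not exceed delta get weight zero."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

d = np.array([2.5, -0.3, 0.8, -1.9])   # made-up centroid differences
print(shrink(d, 1.0))                  # genes 2 and 3 drop out
```

Sweeping `delta` over a grid and evaluating each value by cross-validation yields the adaptive model selection described above.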

The test data must not be used for gene selection or adaptive model selection; otherwise the observed (cross-validation-based) accuracy is biased. Selection Bias

Small , many genes poor performance due to overfitting High , few genes, poor performance due to lack of information – underfitting - The optimal  is somewhere in the middle Cross Validation

Predictive genes are not causal genes: assume protein A binds to protein B and inhibits it, and the clinical phenotype is caused by active protein A. Then the predictive information is in the expression of A minus the expression of B. Calling signature genes markers for a certain disease is misleading! Naïve idea: don't calculate weights based on single-gene scores, but optimize over all possible hyperplanes.

Problem 1: no separating line. Problem 2: many separating lines (why is this a problem?). For a given data set, only one of these problems exists. Optimal decision rules

This problem is related to overfitting... more soon Optimal decision rules

With the microarray we have more genes than patients. Think about this in three dimensions: there are three genes, two patients with known diagnosis (red and yellow), and Ms. Smith (green). There is always one plane separating red and yellow with Ms. Smith on the yellow side, and a second separating plane with Ms. Smith on the red side. If all points fall onto one line it does not always work; however, for measured values this is very unlikely and never happens in practice. The p > N problem

The overfitting disaster: from the data alone we cannot decide which genes are important for the diagnosis, nor can we give a reliable diagnosis for a new patient. This has little to do with medicine; it is a geometrical problem. If you find a separating signature, it does not mean (yet) that you have a top publication; in most cases it means nothing. The p > N problem
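The geometric argument can be demonstrated numerically (a sketch with made-up random data): with more genes than patients, a linear signature fits even purely random labels perfectly, so zero training error means nothing by itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes = 10, 100          # p > N, as on a microarray
X = rng.normal(size=(n_patients, n_genes))
y = rng.choice([-1.0, 1.0], size=n_patients)   # labels are pure noise

# least-squares "signature": with p > N it fits any labeling exactly
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_err = np.mean(np.sign(X @ w) != y)
print(train_err)   # 0.0 -- perfect separation of meaningless labels
```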

There always exist separating signatures caused by overfitting - meaningless signatures. Hopefully there is also a separating signature caused by a disease mechanism, or which is at least predictive for the disease - a meaningful signature. We need to learn how to find and validate meaningful signatures. Finding meaningful signatures

Which hyperplane is the best? Separating hyperplanes

Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large-margin separation exists, chances are good that we have found something relevant. Large Margin Classifiers. Support Vector Machines (SVMs)

Maximal Margin Hyperplane: there are theoretical results that the size of the margin correlates with the test (!) error (V. Vapnik). SVMs are not only optimized to fit the training data, but directly for predictive performance. Support Vector Machines (SVMs)

Non-separable training set: the penalty of an error is its distance to the hyperplane, multiplied by a parameter c. This balances over- and underfitting. Support Vector Machines (SVMs)
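A sketch of this trade-off using scikit-learn's `SVC` (assuming scikit-learn is available; the data are synthetic, not from the slides): the parameter `C`, called c on the slide, sets the penalty for margin violations, with small values favoring a wide margin and large values favoring few training errors.

```python
import numpy as np
from sklearn.svm import SVC

# two synthetic classes in 2D with slight overlap
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# small C: wide margin, tolerates training errors (towards underfitting)
# large C: narrow margin, penalizes every error heavily (towards overfitting)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.score(X, y))
```

In practice, c itself should be chosen by cross-validation, just like Δ in PAM.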

Documenting a signature is conceptually different from giving a list of genes, although a gene list is what most publications give you. In order to validate a signature on external data or apply it in practice:
- All model parameters need to be specified
- The scale of the normalized data to which the model refers needs to be specified
External Validation and Documentation

Establishing a signature: split the data into training and test data. On the training data only: select genes and find the optimal number of genes (cross-validation); learn the model parameters (machine learning). Finally: external validation on the test data.

Cookbook for good classifiers:
1. Decide on your diagnosis model (PAM, SVM, etc...) and don't change your mind later on
2. Split your profiles randomly into a training set and a test set
3. Put the data in the test set away... far away!
4. Train your model only using the data in the training set (select genes, define centroids, calculate normal vectors for large-margin separators, perform adaptive model selection...) - don't even think of touching the test data at this time
5. Apply the model to the test data... don't even think of changing the model at this time
6. Do steps 1-5 only once and accept the result... don't even think of optimizing this procedure

Acknowledgements: Rainer Spang, University of Regensburg; Florian Markowetz, Cancer Research UK, Cambridge