
Sample classification using Microarray Data

A B: We have two sample entities, for example malignant vs. benign tumor, or patient responding to a drug vs. patient resistant to the drug, etc.

A B: From a tissue database we get biopsies for both entities.

We do expression profiling: tissue → DNA chip → expression profile.

A B: And the situation looks like this. What about differences in the profiles? Do they exist? What do they mean?

What characteristics does this data have?

This is a statistical question, and the answer comes from biology.

The data describes a huge and complex interactive network... that we do not know.

Expression profiles describe states of this network.

Diseases can (should) be understood (defined) as characteristic states of this network.

This data is very high dimensional and consists of highly dependent variables (genes).

A B: Back to the statistical classification problem. What is the problem with high-dimensional, dependent data?

A B, and a new patient: a two-gene scenario where everything works out fine.

And here things go wrong. Problem 1: no separating line. Problem 2: too many separating lines.

And in high-dimensional spaces?

Spend a minute thinking about this in three dimensions. OK: there are three genes, two patients with known diagnosis, one patient with unknown diagnosis, and separating planes instead of lines. Problem 1 never exists! Problem 2 exists almost always! OK, if all points fall onto one line it does not always work; however, for measured values this is very unlikely and never happens in practice.

In summary: There is always a linear signature separating the entities... a biological reason for this is not needed. Hence, if you find a separating signature, it does not (yet) mean that you have a nice publication; in most cases it means nothing.
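
As a quick sanity check of this claim, here is a minimal sketch in Python with numpy and scikit-learn (these tools and all names are illustrative, not from the slides): a linear classifier fitted to pure noise with far more genes than samples still separates the training data perfectly.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_samples, n_genes = 40, 5000                 # far more genes than patients
    X = rng.normal(size=(n_samples, n_genes))     # pure noise "expression" data
    y = rng.integers(0, 2, size=n_samples)        # random class labels for A and B

    clf = LinearSVC(C=1e6, max_iter=100000)       # essentially unregularized
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))  # typically 1.0: a separating plane exists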

Take it easy! There are meaningful differences in gene expression... we have built our science on them... and they should be reflected in the profiles. In general, a separating linear signature is of no significance at all. There are good signatures and bad ones!

Strategies for finding good signatures (statistical learning): 1. gene selection, 2. factor regression, 3. support vector machines (machine learning), 4. informative priors (Bayesian statistics). What are the ideas?

Gene selection: why does it help? When considering all possible linear planes for separating the patient groups, we always find one that fits perfectly, without a biological reason for this. When considering only planes that depend on at most 20 genes, it is not guaranteed that we find a well-fitting signature. If it exists nevertheless, chances are better that it reflects a transcriptional disorder. If we require in addition that the genes are all good classifiers themselves, i.e. we find them by screening using the t-score, then finding a separating signature is even more exceptional.

Gene selection: what is it? Choose a small number of genes, say 20, and then fit a model using only these genes. How to pick genes: screening (e.g. by t-score), a single-gene association measure; or wrapping, a multiple-gene association approach: 1. choose the best classifying single gene g1, 2. choose the optimal complementing gene g2, so that g1 and g2 together are optimal for classification, 3. etc.
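
A minimal sketch of the screening step, assuming Python with numpy/scipy (the helper name and the choice of 20 genes are illustrative, not from the slides):

    import numpy as np
    from scipy import stats

    def screen_genes(X, y, n_genes=20):
        # rank genes by absolute two-sample t-score and keep the top n_genes
        # X: samples x genes expression matrix, y: 0/1 class labels
        t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
        return np.argsort(-np.abs(t))[:n_genes]

    # selected = screen_genes(X_train, y_train)
    # the classifier is then fitted on X_train[:, selected] only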

Tradeoffs. How many genes: all linear planes; all linear planes depending on at most 20 genes; all linear planes depending on a given set of 20 genes. How you find them: wrapping (multiple-gene association) vs. screening (single-gene association). Along this scale, the probability of finding a fitting signature goes from high to low, while the probability that a found signature is meaningful goes from low to high.

More generally: a flexible model gives a high probability of finding a fitting signature but a low probability that the signature is meaningful; a smooth (restricted) model gives a low probability of finding a fitting signature but a high probability that it is meaningful.

Factor regression: for example PCA, SVD... Use only the first few (3-4) factors. This does not work well with expression data, at least not with a small number of samples.
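
A sketch of the factor-regression idea in Python with scikit-learn (assumed here; the choice of 3 components stands in for the "3-4 factors" of the slide):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # project the profiles onto the first few principal components (factors)
    # and classify in that low-dimensional space
    factor_clf = make_pipeline(PCA(n_components=3), LogisticRegression())
    # factor_clf.fit(X_train, y_train); factor_clf.predict(X_test)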

Support vector machines (large margin classifiers). Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large-margin separation exists, chances are good that we have found something relevant.
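
A sketch of a large-margin classifier on expression profiles, assuming Python with scikit-learn (the parameter value is illustrative):

    from sklearn.svm import SVC

    # linear kernel: a separating hyperplane ("fat plane") in gene space;
    # a smaller C enforces a larger margin at the price of some training errors
    svm = SVC(kernel="linear", C=1.0)
    # svm.fit(X_train, y_train); svm.predict(X_test)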

Informative priors: posterior ∝ likelihood × prior.
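
In symbols, this is just Bayes' rule for the signature weights (the symbol w is introduced here for illustration and does not appear in the slides):

    p(w \mid X, y) \;\propto\; p(y \mid X, w)\, p(w)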

First: singular value decomposition. Data = loadings × singular values × expression levels of the super-genes (an orthogonal matrix).
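
A sketch of this decomposition in Python with numpy (assumed; the genes x samples layout of X is an assumption made to match the slide's ordering):

    import numpy as np

    # slide notation: data = loadings x singular values x super-gene expression
    # X: genes x samples expression matrix (layout assumed here)
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    # U  : loadings, one column per super-gene
    # d  : singular values
    # Vt : expression levels of the super-genes across samples (orthogonal rows)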

Keep all super-genes: the prior needs to be designed in n dimensions, where n is the number of samples. Shape? Center? Orientation? Not too narrow... not too wide.

Shape: multidimensional normal, for simplicity.

Center: assumptions on the model correspond to assumptions on the diagnosis.

Orientation: orthogonal super-genes!

Not too narrow... not too wide: an auto-adjusting model. The scales are hyperparameters with their own priors.

What are the additional assumptions that came in through the prior? The model cannot be dominated by only a few super-genes (genes!). The diagnosis is based on global changes in the expression profile, influenced by many genes. The assumptions are neutral with respect to the individual diagnosis.

A common idea behind all models: they all confine the set of possible signatures a priori, but they do it in different ways. Gene selection aims for few genes in the signature. SVMs go for large margins between the data points and the separating hyperplane. PC regression confines the signature to 3-4 independent factors. The Bayesian model prefers models with small weights (à la ridge regression or weight decay).

... and a common problem of all models: the bias-variance trade-off. Model complexity: maximum number of genes, minimal margin, width of the prior, etc.

How come?

Population mean: Genes have a certain mean expression and correlation in the population

Sample mean: We observe average expression and empirical correlation

Fitted model:

Regularization

How much regularization do we need? The Bayesian answer: what you do not know is a random variable... regularization becomes part of the model. Or: model selection by evaluations...
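
A sketch of the "model selection by evaluations" route in Python with scikit-learn (assumed; the grid of regularization strengths is illustrative):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # choose the regularization parameter C by cross-validated prediction performance
    grid = GridSearchCV(SVC(kernel="linear"),
                        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                        cv=10)
    # grid.fit(X_train, y_train); grid.best_params_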

Model selection with separate data: Training / Selection / Test. Split off some samples for model selection. Train the model on the training data with different choices of the regularization parameter. Apply it to the selection data and optimize this parameter (model selection). Test how well you are doing on the test data (model assessment).
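
A sketch of such a three-way split in Python with scikit-learn (assumed; the split fractions are illustrative):

    from sklearn.model_selection import train_test_split

    # first split off the test set and do not touch it again
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    # then split the rest into training and selection (validation) data
    X_train, X_sel, y_train, y_sel = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
    # train with different regularization values on X_train, pick the best on X_sel,
    # and report performance once on X_test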

10-fold cross-validation: Chop up the training data (don't touch the test data) into 10 sets. Train on 9 of them and predict the remaining one. Iterate, leaving every set out once. Select a model according to the prediction error (deviance).
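
A sketch in Python with scikit-learn (assumed; the linear SVM is a placeholder for whatever model is being selected):

    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.svm import SVC

    # chop the training data into 10 folds, train on 9 and predict the left-out fold
    scores = cross_val_score(SVC(kernel="linear"), X_train, y_train,
                             cv=KFold(n_splits=10, shuffle=True, random_state=0))
    # the mean of scores estimates prediction performance for this model choice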

Leave-one-out cross-validation: essentially the same, but you leave out only one sample at a time and predict it using all the others. Good for small training sets.
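
The same idea in leave-one-out form (Python with scikit-learn, assumed):

    from sklearn.model_selection import cross_val_score, LeaveOneOut
    from sklearn.svm import SVC

    # each sample is left out once and predicted from all the others
    scores = cross_val_score(SVC(kernel="linear"), X_train, y_train, cv=LeaveOneOut())
    # scores.mean() is the leave-one-out estimate of prediction accuracy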

Model assessment: How well did I do? Can I use my signature for clinical diagnosis? How well will it perform? How does it compare to traditional methods?

The most important thing: Don't fool yourself! (... and others.) This guy (and others) thought for some time that he could predict the nodal status of a breast tumor from a profile taken from the primary tumor! ... There are significant differences, but they are not good enough for prediction (West et al., PNAS 2001).

DOs AND DON'Ts: 1. Decide on your diagnosis model (PAM, SVM, etc.) and don't change your mind later on. 2. Split your profiles randomly into a training set and a test set. 3. Put the data in the test set away. 4. Train your model using only the data in the training set (select genes, define centroids, calculate normal vectors for large-margin separators, perform model selection...); don't even think of touching the test data at this time. 5. Apply the model to the test data; don't even think of changing the model at this time. 6. Do steps 1-5 only once and accept the result; don't even think of optimizing this procedure.

The selection bias: You cannot select 20 genes using all your data, then split test and training data using only these 20 genes, and evaluate your method. There is a difference between a model that restricts signatures to depend on only 20 genes and a data set that only contains 20 genes. Your model assessment will look much better than it should.
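
A sketch of the right way to combine gene selection with cross-validation, assuming Python with scikit-learn (illustrative, not from the slides): the selection step is made part of the model, so it is re-fitted inside every training fold and the left-out samples never influence which 20 genes are chosen.

    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # WRONG: select the 20 genes on all samples first and cross-validate afterwards;
    #        the left-out samples then already influenced the gene selection.
    # RIGHT: make selection part of the model, so it is re-fitted inside every fold.
    signature = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
    scores = cross_val_score(signature, X, y, cv=10)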