The Classification of Microarray Data
Geoff McLachlan, Department of Mathematics & Institute for Molecular Bioscience, University of Queensland

Outline of Talk
- Introduction
- Supervised classification of tissue samples – selection bias
- Unsupervised classification (clustering) of tissues – mixture model-based approach

Supervised Classification (Two Classes)
[Schematic: expression matrix with rows Gene 1, ..., Gene p and columns Sample 1, ..., Sample n; each sample is labelled Class 1 (good prognosis) or Class 2 (poor prognosis).]

Microarray to be used as routine clinical screen (C. M. Schubert, Nature Medicine 9, 9, 2003). The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panel of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

Selection bias in gene extraction on the basis of microarray gene-expression data. Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, No. 10, pp. 6562–6566, May 14, 2002.

LINEAR CLASSIFIER. Form $c(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}^T \mathbf{x}$ for the production of the group label $y$ of a future entity with feature vector $\mathbf{x}$: assign to one group if $c(\mathbf{x}) > 0$, and to the other otherwise.

FISHER'S LINEAR DISCRIMINANT FUNCTION: $\boldsymbol{\beta} = S^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$ and $\beta_0 = -\tfrac{1}{2}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)^T S^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$, where $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$ and $S$ are the sample means and pooled sample covariance matrix found from the training data.
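As an illustration (not part of the original slides), a minimal numpy sketch of Fisher's rule computed from training data; the arrays X1 and X2 standing in for the two classes' training samples are simulated:

```python
import numpy as np

# Simulated stand-ins for the two classes' training samples (rows = samples).
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(20, 5))   # class 1 training data
X2 = rng.normal(1.0, 1.0, size=(25, 5))   # class 2 training data

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
# Pooled sample covariance matrix S from the training data.
S = ((n1 - 1) * np.cov(X1, rowvar=False) +
     (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

beta = np.linalg.solve(S, xbar1 - xbar2)
beta0 = -0.5 * (xbar1 + xbar2) @ beta

def classify(x):
    """Assign to class 1 if the discriminant is positive, else class 2."""
    return 1 if beta0 + beta @ x > 0 else 2

print(classify(X1[0]), classify(X2[0]))
```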

SUPPORT VECTOR CLASSIFIER (Vapnik, 1995): $\beta_0$ and $\boldsymbol{\beta}$ are obtained by minimizing $\tfrac{1}{2}\|\boldsymbol{\beta}\|^2 + C \sum_{j=1}^{n} \xi_j$ subject to $y_j(\beta_0 + \boldsymbol{\beta}^T \mathbf{x}_j) \ge 1 - \xi_j$ and $\xi_j \ge 0$ for all $j$, where the $\xi_j$ are the slack variables (all zero in the separable case) and $C$ relates to the penalty on them.
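For concreteness, a small sketch (again not from the talk) using scikit-learn's soft-margin linear SVM, which solves the formulation above; the data X, y here are simulated stand-ins:

```python
import numpy as np
from sklearn.svm import SVC

# Simulated two-class data; C controls the penalty on the slack variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(1, 1, (30, 10))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
beta, beta0 = clf.coef_.ravel(), clf.intercept_[0]  # the fitted (beta, beta_0)
print(np.sign(X @ beta + beta0)[:5])                # predicted group labels
```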

Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199–231. Discussants include Brad Efron and David Cox.

GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning). LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate). COLON DATA: using only 4 genes, the CVE is 2%.

Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data

Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues
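The point behind Figures 1–5 can be reproduced in a few lines. Below is a sketch (not the paper's code) contrasting gene selection done once on all samples with selection redone inside each cross-validation fold, on pure-noise data. The external (biased) version looks far better than it should; the internal (honest) version stays near chance:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 500))      # noninformative "expression" data
y = rng.integers(0, 2, size=40)     # randomly assigned class labels

svm = SVC(kernel="linear")

# Biased: recursive feature elimination on ALL samples, then cross-validate.
X_sel = RFE(svm, n_features_to_select=10).fit_transform(X, y)
biased = cross_val_score(svm, X_sel, y, cv=10).mean()

# Honest: the selection is redone within each training fold.
pipe = Pipeline([("rfe", RFE(svm, n_features_to_select=10)), ("svm", svm)])
honest = cross_val_score(pipe, X, y, cv=10).mean()

print(f"biased CV accuracy: {biased:.2f}")   # optimistically high
print(f"honest CV accuracy: {honest:.2f}")   # near 0.5 (chance)
```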

BOOTSTRAP APPROACH: Efron's (1983, JASA) .632 estimator is $B^{.632} = 0.368\,AE + 0.632\,B1$, where $AE$ is the apparent error rate and $B1$ is the bootstrap error when the rule is applied to a point not in the training sample. A Monte Carlo estimate of $B1$ averages, over bootstrap samples, the error of each bootstrap rule on the original points left out of that bootstrap sample.
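A minimal sketch of the .632 idea (the function name is hypothetical; any classifier with fit/predict methods would do):

```python
import numpy as np

def b632_error(clf, X, y, n_boot=100, rng=None):
    """Efron's .632 estimator: 0.368*AE + 0.632*B1 (Monte Carlo B1)."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(y)
    ae = 1.0 - (clf.fit(X, y).predict(X) == y).mean()  # apparent error AE
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap training sample
        out = np.setdiff1d(np.arange(n), idx)     # points not in the sample
        if out.size == 0:
            continue
        clf.fit(X[idx], y[idx])
        errs.append(1.0 - (clf.predict(X[out]) == y[out]).mean())
    b1 = float(np.mean(errs))                     # Monte Carlo estimate of B1
    return 0.368 * ae + 0.632 * b1
```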

Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR McLachlan (1977) proposed w=w o where w o is chosen to minimize asymptotic bias of A(w) in the case of two homoscedastic normal groups. Value of w 0 was found to range between 0.6 and 0.7, depending on the values of where

.632+ estimate of Efron & Tibshirani (1997, JASA): $B^{.632+} = (1 - w)\,AE + w\,B1$ with $w = 0.632 / (1 - 0.368\,r)$, where $r = (B1 - AE)/(\gamma - AE)$ is the relative overfitting rate and $\gamma$ is an estimate of the no-information error rate. If $r = 0$, then $w = .632$, and so $B^{.632+} = B^{.632}$; if $r = 1$, then $w = 1$, and so $B^{.632+} = B1$.
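The .632+ weighting as a small function (a sketch; $AE$, $B1$ and the no-information rate $\gamma$ are assumed already estimated, e.g. as above):

```python
def b632_plus(ae, b1, gamma):
    """Efron & Tibshirani's .632+ rule given AE, B1 and the
    no-information error rate gamma (all assumed already estimated)."""
    b1 = min(b1, gamma)                 # cap B1 at the no-information rate
    r = (b1 - ae) / (gamma - ae) if gamma > ae else 0.0  # relative overfitting
    w = 0.632 / (1.0 - 0.368 * r)       # w = .632 at r = 0, w = 1 at r = 1
    return (1.0 - w) * ae + w * b1

print(b632_plus(ae=0.05, b1=0.30, gamma=0.50))  # toy numbers
```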

Ten-Fold Cross-Validation
[Schematic: the data are divided into ten blocks; in turn, each block serves as the test set while the remaining nine blocks form the training set.]

MARKER GENES FOR HARVARD DATA. For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times each gene was selected. [Table: number of genes against times selected.]

MARKER GENES FOR HARVARD DATA (genes most often selected; the per-gene selection counts were not preserved in the transcript):
- tubulin, alpha, ubiquitous
- Cluster Incl N90862
- cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4)
- DEK oncogene (DNA binding)
- Cluster Incl AF
- transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
- ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase)
- benzodiazepine receptor (peripheral)
- Cluster Incl D21063
- galactosidase, beta 1
- high-mobility group (nonhistone chromosomal) protein 2
- cold inducible RNA-binding protein
- Cluster Incl U79287
- BAF53
- tubulin, beta polypeptide
- thromboxane A2 receptor
- H1 histone family, member X
- Fc fragment of IgG, receptor, transporter, alpha
- sine oculis homeobox (Drosophila) homolog 3
- transcriptional intermediary factor 1 gamma
- transcription elongation factor A (SII)-like 1
- like mouse brain protein E46
- minichromosome maintenance deficient (mis5, S. pombe) 6
- transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
- guanine nucleotide binding protein (G protein), gamma 3, linked
- dihydropyrimidinase-like 2
- Cluster Incl AI
- transforming growth factor, beta receptor II (70-80kD)
- protein kinase C-like 1

Breast cancer data set of van 't Veer et al. (2002, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415, 530–536). These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups on the basis of the gene-expression information for these groups.

van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but made available only the data for the top 70 genes based on the previous study of van 't Veer et al. (2002).

[Table: error rates by number of genes, with three columns: error rate for the top 70 genes without correction for the selection bias of choosing the top 70; error rate for the top 70 genes with correction for that selection bias; and error rate for all 5,422 genes with correction for selection bias. The numerical entries were not preserved in the transcript.]

Two Clustering Problems:
- Clustering of genes on the basis of tissues – the genes are not independent
- Clustering of tissues on the basis of genes – a nonstandard problem in cluster analysis (n << p)

"Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade." Gary Churchill (The Jackson Laboratory), contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni, Statistical Science (2003) 18.

The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when a metric is known in advance, e.g. k-means. Usually, however, there is no a priori metric (or, equivalently, no user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless their shapes are known.

In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated, and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. As Yeung et al. (2001, Model-based clustering and data transformations for gene expression data, Bioinformatics 17) put it: "in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a 'good' clustering algorithm or the 'right' number of clusters."

McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples. Special issue of the Journal of Multivariate Analysis 90, edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).

MIXTURE OF g NORMAL COMPONENTS: $f(\mathbf{x}) = \sum_{i=1}^{g} \pi_i\, \phi(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$. With common spherical component covariances $\boldsymbol{\Sigma}_i = \sigma^2 I_p$, the implied clustering is based on EUCLIDEAN DISTANCE, $\|\mathbf{x} - \boldsymbol{\mu}_i\|^2$ (up to a constant); with a general common covariance $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$, on MAHALANOBIS DISTANCE, $(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$.

SPHERICAL CLUSTERS: k-means corresponds to fitting a MIXTURE OF g NORMAL COMPONENTS with equal spherical covariance matrices (and equal mixing proportions).
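To illustrate the correspondence (simulated data, not from the talk), k-means and a spherical normal mixture fitted to two well-separated spherical clusters produce essentially the same partition:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two well-separated spherical clusters in two dimensions.
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (0.0, 3.0)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm = GaussianMixture(n_components=2, covariance_type="spherical",
                     random_state=0).fit(X).predict(X)

# Agreement up to an arbitrary swap of the cluster labels.
agree = max(np.mean(km == gm), np.mean(km != gm))
print(f"k-means vs spherical mixture agreement: {agree:.2f}")  # ~1.00
```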

In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.

Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.
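The figure's point in a few lines (a simulated stand-in for the plotted data): two elongated groups whose separation is orthogonal to the direction of greatest variance, so the first principal component carries no group information:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
cov = np.array([[10.0, 0.0], [0.0, 0.1]])   # long, thin principal ellipses
g1 = rng.multivariate_normal([0.0, -1.0], cov, 200)
g2 = rng.multivariate_normal([0.0, 1.0], cov, 200)
X = np.vstack([g1, g2])

pc1 = PCA(n_components=1).fit_transform(X).ravel()
print("second-coordinate means:", g1[:, 1].mean(), g2[:, 1].mean())  # -1 vs 1
print("PC1 means:", pc1[:200].mean(), pc1[200:].mean())  # nearly equal
```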

Mixtures of Factor Analyzers. A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data. One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).

The component-covariance matrices are modelled as $\boldsymbol{\Sigma}_i = B_i B_i^T + D_i$, where $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix.

Single-Factor Analysis Model: $\mathbf{X}_j = \boldsymbol{\mu} + B \mathbf{U}_j + \mathbf{e}_j \quad (j = 1, \dots, n)$.

The $\mathbf{U}_j$ are iid $N(\mathbf{0}, I_q)$, independently of the errors $\mathbf{e}_j$, which are iid $N(\mathbf{0}, D)$, where $D$ is a diagonal matrix.
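As a quick check of the model's implied covariance structure (a simulation sketch; all parameter values below are arbitrary), the covariance of $\mathbf{X}$ under this model is $B B^T + D$:

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, n = 6, 2, 200_000
B = rng.normal(size=(p, q))                      # p x q loading matrix
D = np.diag(rng.uniform(0.1, 0.5, size=p))       # diagonal error covariance

U = rng.normal(size=(n, q))                      # U_j iid N(0, I_q)
e = rng.multivariate_normal(np.zeros(p), D, n)   # e_j iid N(0, D)
X = U @ B.T + e                                  # X_j = mu + B U_j + e_j, mu = 0

# Empirical covariance of X should be close to B B^T + D.
print(np.max(np.abs(np.cov(X, rowvar=False) - (B @ B.T + D))))
```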

Mixtures of Factor Analyzers. A single-factor analysis model provides only a global linear model. A globally nonlinear approach is obtained by postulating a mixture of linear submodels:

Conditional on membership of the $i$th component of the mixture, $\mathbf{X}_j = \boldsymbol{\mu}_i + B_i \mathbf{U}_{ij} + \mathbf{e}_{ij}$, where $\mathbf{U}_{i1}, \dots, \mathbf{U}_{in}$ are independent, identically distributed (iid) $N(\mathbf{0}, I_q)$, independently of the $\mathbf{e}_{ij}$, which are iid $N(\mathbf{0}, D_i)$, where $D_i$ is a diagonal matrix $(i = 1, \dots, g)$.

There is an infinity of choices for $B_i$, as the model still holds if $B_i$ is replaced by $B_i C_i$, where $C_i$ is an orthogonal matrix. Choose $C_i$ so that $B_i^T D_i^{-1} B_i$ is diagonal. The number of free parameters per component-covariance matrix is then $pq + p - \tfrac{1}{2} q(q - 1)$.

We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm. The reduction in the number of parameters, relative to the $\tfrac{1}{2} p(p + 1)$ of an unrestricted covariance matrix, is then $\tfrac{1}{2}\{(p - q)^2 - (p + q)\}$.
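A worked count (illustrative) of the saving per component-covariance matrix:

```python
def covariance_params(p, q):
    """Free parameters: unrestricted p x p covariance vs the
    factor-analytic form B B^T + D under the rotational constraint."""
    full = p * (p + 1) // 2
    factor = p * q + p - q * (q - 1) // 2
    return full, factor, full - factor

# e.g. microarray-sized p with q = 4 factors:
print(covariance_params(2000, 4))   # (2001000, 9994, 1991006)
```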

1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of $\pi_i$ and $\boldsymbol{\mu}_i$. 2nd cycle: declare the missing data to be also the factors. Update the estimates of $B_i$ and $D_i$.

M-step on 1st cycle: $\pi_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} / n$ and $\boldsymbol{\mu}_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} \mathbf{x}_j \big/ \sum_{j=1}^{n} \tau_{ij}^{(k)}$ for $i = 1, \dots, g$, where $\tau_{ij}^{(k)}$ denotes the current posterior probability that $\mathbf{x}_j$ belongs to the $i$th component.

M-step on 2nd cycle: $B_i^{(k+1)} = V_i \gamma_i \big( I_q - \gamma_i^T B_i^{(k)} + \gamma_i^T V_i \gamma_i \big)^{-1}$ and $D_i^{(k+1)} = \operatorname{diag}\big( V_i - B_i^{(k+1)} \gamma_i^T V_i \big)$, where $\gamma_i = \big( B_i^{(k)} B_i^{(k)T} + D_i^{(k)} \big)^{-1} B_i^{(k)}$ and $V_i$ is the current weighted sample covariance matrix of the $i$th component.
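Gathering the two cycles, a compact illustrative sketch of one AECM iteration for a mixture of factor analyzers (a simplification: the full algorithm recomputes the posterior probabilities between the two cycles; see McLachlan and Peel, 2000, for the complete treatment):

```python
import numpy as np
from scipy.stats import multivariate_normal

def aecm_step(X, pi, mu, B, D):
    """One AECM iteration. Shapes: pi (g,), mu (g, p), B (g, p, q),
    D (g, p) holding the diagonals of the error-covariance matrices."""
    n, p = X.shape
    g, q = len(pi), B.shape[2]
    # E-step: posterior probabilities of component membership.
    dens = np.stack([pi[i] * multivariate_normal.pdf(
        X, mu[i], B[i] @ B[i].T + np.diag(D[i])) for i in range(g)])
    tau = dens / dens.sum(axis=0)                         # g x n
    # 1st cycle (missing data = component indicators): update pi, mu.
    pi = tau.sum(axis=1) / n
    mu = (tau @ X) / tau.sum(axis=1)[:, None]
    # 2nd cycle (missing data also the factors): update B, D.
    # (The full algorithm recomputes tau between the cycles; omitted here.)
    for i in range(g):
        Xc = X - mu[i]
        V = (tau[i][:, None] * Xc).T @ Xc / tau[i].sum()  # weighted covariance
        gam = np.linalg.solve(B[i] @ B[i].T + np.diag(D[i]), B[i])  # p x q
        M = np.eye(q) - gam.T @ B[i] + gam.T @ V @ gam
        B[i] = V @ gam @ np.linalg.inv(M)
        D[i] = np.diag(V - B[i] @ gam.T @ V)
    return pi, mu, B, D
```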

Work in the q-dimensional space:
$(B_i B_i^T + D_i)^{-1} = D_i^{-1} - D_i^{-1} B_i (I_q + B_i^T D_i^{-1} B_i)^{-1} B_i^T D_i^{-1}$,
$|B_i B_i^T + D_i| = |D_i| \,/\, |I_q - B_i^T (B_i B_i^T + D_i)^{-1} B_i|$.
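A quick numerical check of these two q-dimensional working identities (arbitrary test matrices):

```python
import numpy as np

rng = np.random.default_rng(6)
p, q = 8, 2
B = rng.normal(size=(p, q))
D = np.diag(rng.uniform(0.5, 1.5, size=p))
Sigma = B @ B.T + D

Dinv = np.linalg.inv(D)
woodbury = Dinv - Dinv @ B @ np.linalg.inv(
    np.eye(q) + B.T @ Dinv @ B) @ B.T @ Dinv
print(np.allclose(np.linalg.inv(Sigma), woodbury))        # True

det_rhs = np.linalg.det(D) / np.linalg.det(
    np.eye(q) - B.T @ np.linalg.inv(Sigma) @ B)
print(np.isclose(np.linalg.det(Sigma), det_rhs))          # True
```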

This PROVIDES A MODEL-BASED APPROACH TO CLUSTERING. McLachlan, Bean, and Peel (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413–422.

Example: Microarray Data. Colon data of Alon et al. (1999): n = 62 (40 tumours; 22 normals) tissue samples on p = 2,000 genes, in a 2,000 × 62 matrix.

Mixture of 2 normal components

Mixture of 2 t components

Clustering of COLON Data Genes using EMMIX-GENE

Grouping for Colon Data

Grouping for Colon Data

Heat Map of Genes in Group G1

Heat Map of Genes in Group G2

Heat Map of Genes in Group G3

"An efficient algorithm based on a heuristically justified objective function, delivered in reasonable time, is usually preferable to a principled statistical approach that takes years to develop or ages to run. Having said this, the case for a more principled approach can be made more effectively once cruder approaches have exhausted their harvest of low-hanging fruit." Gilks (2004)

In bioinformatics, algorithms are generally viewed as more important than models or statistical efficiency. Unless the methodological research results in a web-based tool or, at the very least, downloadable code that can be easily run by the user, it is effectively useless.