The Classification of Microarray Data
Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience, University of Queensland
Outline of Talk
- Introduction
- Supervised classification of tissue samples – selection bias
- Unsupervised classification (clustering) of tissues – mixture model-based approach
Supervised Classification (Two Classes)
[Schematic: data matrix of samples 1, …, n (columns) by genes 1, …, p (rows), with the tissue samples labelled Class 1 (good prognosis) or Class 2 (poor prognosis).]
Microarray to be used as routine clinical screen
C. M. Schubert, Nature Medicine 9, 9 (2003).
The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panel of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.
Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, May 14, 2002.
LINEAR CLASSIFIER
Form: c(x) = β0 + βᵀx
for the prediction of the group label y of a future entity with feature vector x.
FISHER’S LINEAR DISCRIMINANT FUNCTION
c(x) = β0 + βᵀx, with
β = S⁻¹(x̄1 − x̄2) and β0 = −(1/2)(x̄1 + x̄2)ᵀ S⁻¹ (x̄1 − x̄2),
where x̄1, x̄2 and S are the sample means and pooled sample covariance matrix found from the training data.
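As an aside not in the talk, Fisher's rule above is only a few lines of numpy; the two-dimensional data and sample sizes below are arbitrary, chosen only to exercise the formulas.

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Fisher's linear discriminant from two training samples.

    Returns (beta0, beta) so that a new point x is assigned to
    group 1 when beta0 + beta @ x > 0.
    """
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled (bias-corrected) sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, xbar1 - xbar2)
    beta0 = -0.5 * (xbar1 + xbar2) @ beta
    return beta0, beta

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], size=(50, 2))   # group 1 training sample
X2 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))  # group 2 training sample
beta0, beta = fisher_discriminant(X1, X2)
scores1 = beta0 + X1 @ beta   # mostly positive for group 1
scores2 = beta0 + X2 @ beta   # mostly negative for group 2
```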
SUPPORT VECTOR CLASSIFIER (Vapnik, 1995)
β0 and β are obtained as the solution of
min (1/2)‖β‖² + C Σj ξj
subject to yj(β0 + βᵀxj) ≥ 1 − ξj and ξj ≥ 0 (j = 1, …, n),
where the ξj are slack variables (all zero in the separable case) and C controls the tolerance of misclassified training points.
Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199–231. Discussants include Brad Efron and David Cox.
GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)
LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate).
COLON DATA: using only 4 genes, the CVE is 2%.
Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data
Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues
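The permutation experiment behind this figure can be imitated in a small simulation. The sketch below is illustrative only, not the authors' code: it uses a simple nearest-centroid rule, mean-difference gene ranking, and smaller dimensions than the real colon data. The point it makes is the same: selecting the genes on the full data before cross-validating gives an optimistic error rate even when the labels carry no information at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, n_keep = 40, 1000, 10          # smaller than the real 62 x 2000, for speed
X = rng.normal(size=(n, p))          # pure noise: no real class structure
y = rng.integers(0, 2, size=n)       # labels as uninformative as a permutation

def top_genes(X, y, k):
    """Rank genes by the absolute difference in class means."""
    d = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(d)[-k:]

def loocv_error(X, y, select_inside):
    """Leave-one-out CV error of a nearest-centroid rule on the top genes."""
    errors = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        Xtr, ytr = X[tr], y[tr]
        genes = (top_genes(Xtr, ytr, n_keep) if select_inside
                 else top_genes(X, y, n_keep))   # biased: test point included
        m0 = Xtr[ytr == 0][:, genes].mean(axis=0)
        m1 = Xtr[ytr == 1][:, genes].mean(axis=0)
        xi = X[i, genes]
        pred = int(((xi - m1) ** 2).sum() < ((xi - m0) ** 2).sum())
        errors += int(pred != y[i])
    return errors / len(y)

biased = loocv_error(X, y, select_inside=False)  # selection outside CV: optimistic
honest = loocv_error(X, y, select_inside=True)   # selection inside CV: near 50%
```

The biased estimate sits well below the honest one, despite the data containing no signal; this is exactly the selection bias quantified by Ambroise and McLachlan.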
BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator:
B.632 = 0.368 AE + 0.632 B1,
where AE is the apparent error rate and B1 is the bootstrap error rate when the rule is applied to a point not in the training (bootstrap) sample. A Monte Carlo estimate of B1 averages, over the bootstrap replications, the errors made on the points not appearing in each bootstrap sample.
Toussaint & Sharpe (1975) proposed the error-rate estimator
A(w) = (1 − w) AE + w CV,
a weighted combination of the apparent error rate AE and a cross-validated error rate CV. McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the sample sizes and the separation between the two groups.
.632+ estimate of Efron & Tibshirani (1997, JASA):
B.632+ = (1 − w) AE + w B1, with weight w = 0.632 / (1 − 0.368 r),
where r = (B1 − AE) / (γ − AE) is the relative overfitting rate and γ is an estimate of the no-information error rate.
If r = 0, then w = 0.632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
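A minimal sketch of the .632+ estimator (my own illustration, not the talk's code) can be written around any plug-in classifier; here a nearest-centroid rule stands in, and the function names and the simulated two-group data are of course assumptions of the sketch.

```python
import numpy as np

def b632_plus(X, y, fit_predict, n_boot=50, rng=None):
    """Efron & Tibshirani's .632+ error estimator (sketch, binary labels 0/1).

    fit_predict(Xtr, ytr, Xte) must return predicted labels for Xte.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    # apparent error AE: rule trained and tested on the full sample
    ae = np.mean(fit_predict(X, y, X) != y)
    # leave-one-out bootstrap B1: test only on points outside each resample
    errs, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        out = np.setdiff1d(np.arange(n), idx)
        if out.size == 0:
            continue
        pred = fit_predict(X[idx], y[idx], X[out])
        errs[out] += pred != y[out]
        counts[out] += 1
    b1 = np.mean(errs[counts > 0] / counts[counts > 0])
    # no-information error rate gamma and relative overfitting rate r
    p1 = np.mean(y == 1)
    q1 = np.mean(fit_predict(X, y, X) == 1)
    gamma = p1 * (1 - q1) + (1 - p1) * q1
    r = (b1 - ae) / (gamma - ae) if (gamma > ae and b1 > ae) else 0.0
    r = min(r, 1.0)
    w = 0.632 / (1 - 0.368 * r)          # r = 0 -> w = .632; r = 1 -> w = 1
    return (1 - w) * ae + w * b1

def nearest_centroid(Xtr, ytr, Xte):
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return (((Xte - m1) ** 2).sum(axis=1) < ((Xte - m0) ** 2).sum(axis=1)).astype(int)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (30, 5)), rng.normal(1, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
est = b632_plus(X, y, nearest_centroid, rng=np.random.default_rng(3))
```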
Ten-Fold Cross-Validation
[Schematic: the data are divided into ten blocks; in turn, each block is set aside as the test set and the remaining nine blocks are used for training.]
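The splitting scheme in this slide amounts to partitioning the sample indices into ten near-equal folds, for example (an illustrative helper, with the fold count and seed as assumptions):

```python
import numpy as np

def ten_fold_indices(n, rng=None):
    """Partition indices 0..n-1 into ten folds of near-equal size."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.permutation(n)
    return np.array_split(idx, 10)

folds = ten_fold_indices(62)   # e.g. the 62 colon tissue samples
# every sample appears in exactly one test fold
all_test = np.sort(np.concatenate(folds))
```

Each fold serves once as the test set while the other nine are pooled for training, so every sample is tested exactly once.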
MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes and using 10-fold CV, we noted the number of times each gene was selected (tabulated as number of genes versus times selected).
MARKER GENES FOR HARVARD DATA (genes listed by number of times selected; the counts themselves are not recoverable from the slide):
- tubulin, alpha, ubiquitous
- Cluster Incl N90862
- cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4)
- DEK oncogene (DNA binding)
- Cluster Incl AF
- transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
- ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase)
- benzodiazepine receptor (peripheral)
- Cluster Incl D21063
- galactosidase, beta 1
- high-mobility group (nonhistone chromosomal) protein 2
- cold inducible RNA-binding protein
- Cluster Incl U79287 BAF53
- tubulin, beta polypeptide
- thromboxane A2 receptor
- H1 histone family, member X
- Fc fragment of IgG, receptor, transporter, alpha
- sine oculis homeobox (Drosophila) homolog 3
- transcriptional intermediary factor 1 gamma
- transcription elongation factor A (SII)-like 1
- like mouse brain protein E46
- minichromosome maintenance deficient (mis5, S. pombe) 6
- transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
- guanine nucleotide binding protein (G protein), gamma 3, linked
- dihydropyrimidinase-like 2
- Cluster Incl AI
- transforming growth factor, beta receptor II (70-80kD)
- protein kinase C-like 1
Breast cancer data set of van 't Veer et al. (2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415). These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups on the basis of the gene-expression information for these groups.
van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but have only made available the data for the top 70 genes based on the previous study of van 't Veer et al. (2002).
[Table: error rates versus number of genes – for the top 70 genes without correction for the selection bias of choosing them as the top 70; for the top 70 genes with correction for that selection bias; and for all 5,422 genes with correction for selection bias. The numerical entries are not recoverable from the slide.]
Two Clustering Problems:
- Clustering of genes on the basis of tissues – the genes are not independent
- Clustering of tissues on the basis of genes – the latter is a nonstandard problem in cluster analysis (n << p)
Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade. Gary Churchill (The Jackson Laboratory). Contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni, Statistical Science (2003) 18.
The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.
In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. (Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17) “in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.”
McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model- based clustering of the tissue samples. Special issue of the Journal of Multivariate Analysis 90 (2004) edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).
MIXTURE OF g NORMAL COMPONENTS
With common spherical component-covariance matrices, Σi = σ²Ip, the implied metric is the EUCLIDEAN DISTANCE ‖x − μi‖² (up to a constant not depending on the component).
With general (unrestricted) component-covariance matrices Σi, the implied metric is the MAHALANOBIS DISTANCE (x − μi)ᵀ Σi⁻¹ (x − μi).
SPHERICAL CLUSTERS
k-means implicitly assumes spherical clusters; a MIXTURE OF g NORMAL COMPONENTS with common spherical covariance matrices Σi = σ²Ip yields effectively the k-means clustering, while unrestricted covariance matrices allow elliptical clusters.
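As a rough illustration of this connection (my own sketch, not from the talk), here is EM for a mixture with common spherical covariance σ²Ip; the initialisation, data, and iteration count are assumptions of the sketch. With equal spherical covariance matrices, the posterior assignments harden towards the nearest-mean rule of k-means as σ² shrinks.

```python
import numpy as np

def em_spherical_mixture(X, g, n_iter=100):
    """EM for a g-component normal mixture with common covariance sigma^2 * I_p."""
    n, p = X.shape
    # simple deterministic initialisation: g points spread through the sample
    mu = X[np.linspace(0, n - 1, g).astype(int)].copy()
    pi = np.full(g, 1.0 / g)
    sigma2 = X.var()
    for _ in range(n_iter):
        # E-step: posterior probabilities tau_ij of component membership
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)     # (n, g) squared distances
        logw = np.log(pi) - 0.5 * d2 / sigma2
        logw -= logw.max(axis=1, keepdims=True)            # numerical stabilisation
        tau = np.exp(logw)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means, and the common variance
        nk = tau.sum(axis=0)
        pi = nk / n
        mu = (tau.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)     # distances to new means
        sigma2 = (tau * d2).sum() / (n * p)
    return pi, mu, tau

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-3, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
pi, mu, tau = em_spherical_mixture(X, g=2)
labels = tau.argmax(axis=1)   # hard clustering from the posterior probabilities
```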
In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.
Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.
Mixtures of Factor Analyzers
A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data. One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).
Σi = Bi Biᵀ + Di (i = 1, …, g),
where Bi is a p × q matrix of factor loadings and Di is a diagonal matrix.
Single-Factor Analysis Model
Yj = μ + B Uj + ej (j = 1, …, n),
where the factors Uj are iid N(0, Iq), independently of the errors ej, which are iid N(0, D), with D a diagonal matrix.
Mixtures of Factor Analyzers
A single-factor analysis model provides only a global linear model. A globally nonlinear approach is obtained by postulating a mixture of local linear submodels.
Conditional on membership of the ith component of the mixture,
Yj = μi + Bi Uij + eij,
where the factors Ui1, …, Uin are independently, identically distributed (iid) as N(0, Iq), independently of the errors eij, which are iid as N(0, Di), with Di a diagonal matrix (i = 1, …, g).
There is an infinity of choices for Bi, since the model still holds if Bi is replaced by Bi Ci, where Ci is an orthogonal matrix. Choose Ci so that Biᵀ Di⁻¹ Bi is diagonal. The number of free parameters in each component-covariance matrix is then pq + p − q(q − 1)/2.
We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm. The reduction in the number of parameters per component-covariance matrix is then
p(p + 1)/2 − {pq + p − q(q − 1)/2} = {(p − q)² − (p + q)}/2.
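The parameter-count bookkeeping above is easy to check mechanically; the function names below are my own, and the dimensions echo the colon data (p = 2,000) with an assumed q = 6 factors.

```python
def mfa_cov_params(p, q):
    """Free parameters in one component-covariance matrix under the
    factor-analytic form Sigma_i = B_i B_i' + D_i (B_i is p x q,
    D_i diagonal), after removing the q(q-1)/2 rotational freedoms."""
    return p * q + p - q * (q - 1) // 2

def mfa_param_reduction(p, q):
    """Reduction relative to an unrestricted symmetric p x p covariance
    matrix; algebraically equal to ((p - q)**2 - (p + q)) / 2."""
    return p * (p + 1) // 2 - mfa_cov_params(p, q)
```

For p = 2,000 and q = 6 the saving per component is enormous, which is what makes the model usable when n << p.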
1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of the mixing proportions πi and the component means μi.
2nd cycle: declare the missing data to be also the (unobservable) factors. Update the estimates of the factor loadings Bi and the diagonal matrices Di.
M-step on 1st cycle: for i = 1, …, g,
πi = Σj τij / n and μi = Σj τij yj / Σj τij,
where τij is the current posterior probability that yj belongs to the ith component.
M-step on 2nd cycle: for i = 1, …, g, update
Bi = Vi γi (Iq − γiᵀ Bi + γiᵀ Vi γi)⁻¹ and then Di = diag(Vi − Bi γiᵀ Vi),
where γi = (Bi Biᵀ + Di)⁻¹ Bi and Vi is the current weighted sample covariance matrix of the ith component.
Work in the q-dimensional space:
(Bi Biᵀ + Di)⁻¹ = Di⁻¹ − Di⁻¹ Bi (Iq + Biᵀ Di⁻¹ Bi)⁻¹ Biᵀ Di⁻¹,
|Bi Biᵀ + Di| = |Di| / |Iq − Biᵀ (Bi Biᵀ + Di)⁻¹ Bi|.
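These two identities (a Woodbury-type inverse and the matching determinant identity) are what make the E- and M-steps feasible for large p, since only q × q matrices are ever inverted. A quick numerical check, with arbitrary p, q and random B and D:

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 50, 3
B = rng.normal(size=(p, q))                    # p x q loading matrix
D = np.diag(rng.uniform(0.5, 2.0, p))          # diagonal "uniquenesses"
Sigma = B @ B.T + D                            # p x p component covariance

# inverse via the identity: only a q x q matrix is inverted (besides diagonal D)
Dinv = np.diag(1.0 / np.diag(D))
M = np.linalg.inv(np.eye(q) + B.T @ Dinv @ B)  # q x q, cheap
Sigma_inv = Dinv - Dinv @ B @ M @ B.T @ Dinv

# determinant identity, checked on the log scale for stability
sign, logdet = np.linalg.slogdet(Sigma)
logdet_id = (np.log(np.diag(D)).sum()
             - np.linalg.slogdet(np.eye(q) - B.T @ np.linalg.inv(Sigma) @ B)[1])
```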
PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413–422.
Example: Microarray Data
Colon data of Alon et al. (1999): n = 62 (40 tumours; 22 normals) tissue samples on p = 2,000 genes, giving a 2,000 × 62 data matrix.
Mixture of 2 normal components
Mixture of 2 t components
Clustering of COLON Data Genes using EMMIX-GENE
Grouping for Colon Data
Grouping for Colon Data
Heat Map of Genes in Group G 1
Heat Map of Genes in Group G 2
Heat Map of Genes in Group G 3
An efficient algorithm based on a heuristically justified objective function, delivered in reasonable time, is usually preferable to a principled statistical approach that takes years to develop or ages to run. Having said this, the case for a more principled approach can be made more effectively once cruder approaches have exhausted their harvest of low-hanging fruit. Gilks (2004)
In bioinformatics, algorithms are generally viewed as more important than models or statistical efficiency. Unless the methodological research results in a web-based tool or, at the very least, downloadable code that can be easily run by the user, it is effectively useless.