1 Geoff McLachlan Department of Mathematics & Institute of Molecular Bioscience University of Queensland http://www.maths.uq.edu.au/~gjm The Classification of Microarray Data

2

3 Outline of Talk
- Introduction
- Supervised classification of tissue samples – selection bias
- Unsupervised classification (clustering) of tissues – mixture model-based approach

4

5

6 Supervised Classification (Two Classes): the data form a p × n matrix with rows Gene 1, ..., Gene p and columns Sample 1, ..., Sample n; each sample is labelled Class 1 (good prognosis) or Class 2 (poor prognosis).

7 Microarray to be used as routine clinical screen by C. M. Schubert Nature Medicine 9, 9, 2003. The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

8 Selection bias in gene extraction on the basis of microarray gene-expression data Ambroise and McLachlan Proceedings of the National Academy of Sciences Vol. 99, Issue 10, 6562-6566, May 14, 2002 http://www.pnas.org/cgi/content/full/99/10/6562
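The selection bias studied by Ambroise and McLachlan can be illustrated with a small simulation. The sketch below is a hedged illustration, not the paper's SVM-RFE procedure: it uses pure-noise data, a crude mean-difference gene ranking, and a nearest-centroid rule. Selecting genes once on all the data ("internal" CV) gives a misleadingly low cross-validated error; redoing the selection inside every training fold ("external" CV) gives an honest error near 0.5, as it should be for noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, top = 40, 1000, 20          # 40 samples, 1000 noise genes, keep top 20
X = rng.standard_normal((n, p))   # expression data with NO real class signal
y = np.repeat([0, 1], n // 2)     # arbitrary class labels

def select_genes(X, y, k):
    """Rank genes by absolute difference in class means (a crude t-like score)."""
    score = np.abs(X[y == 0].mean(0) - X[y == 1].mean(0))
    return np.argsort(score)[-k:]

def nearest_centroid_error(Xtr, ytr, Xte, yte):
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return np.mean(pred != yte)

folds = np.array_split(rng.permutation(n), 5)

# BIASED: genes selected once, using ALL the data, before cross-validation
genes = select_genes(X, y, top)
internal = np.mean([nearest_centroid_error(
    X[np.setdiff1d(np.arange(n), f)][:, genes],
    y[np.setdiff1d(np.arange(n), f)],
    X[f][:, genes], y[f]) for f in folds])

# UNBIASED: gene selection is redone inside every training fold (external CV)
external = 0.0
for f in folds:
    tr = np.setdiff1d(np.arange(n), f)
    g = select_genes(X[tr], y[tr], top)
    external += nearest_centroid_error(X[tr][:, g], y[tr], X[f][:, g], y[f])
external /= len(folds)

print(f"internal CV error: {internal:.2f}")   # optimistically biased
print(f"external CV error: {external:.2f}")   # close to 0.5, as it should be
```

On noise data the internal estimate can suggest a useful classifier where none exists, which is exactly the bias the slide's paper quantifies.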

9 LINEAR CLASSIFIER of the form c(x; β0, β) = β0 + β′x, for the prediction of the group label y of a future entity with feature vector x.

10 FISHER’S LINEAR DISCRIMINANT FUNCTION c(x) = β0 + β′x, where β = S⁻¹(x̄1 − x̄2) and β0 = −½β′(x̄1 + x̄2), and where x̄1, x̄2, and S are the sample means and pooled sample covariance matrix found from the training data.
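Fisher's rule on the slide can be sketched directly from those two formulas (a minimal illustration on simulated data; the helper name `fisher_discriminant` and the toy groups are ours, not from the talk):

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Fit Fisher's linear discriminant c(x) = b0 + b'x from two training groups.

    b = S^{-1}(xbar1 - xbar2), b0 = -0.5 * b'(xbar1 + xbar2),
    where S is the pooled sample covariance matrix.
    """
    m1, m2 = X1.mean(0), X2.mean(0)
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    b = np.linalg.solve(S, m1 - m2)
    b0 = -0.5 * b @ (m1 + m2)
    return b0, b

# Assign to group 1 when c(x) > 0, group 2 otherwise.
rng = np.random.default_rng(1)
X1 = rng.standard_normal((50, 3)) + np.array([2.0, 0.0, 0.0])
X2 = rng.standard_normal((50, 3)) - np.array([2.0, 0.0, 0.0])
b0, b = fisher_discriminant(X1, X2)
print(b0 + b @ X1.mean(0) > 0)   # True: group-1 mean lies on the group-1 side
```

Note that c(x̄1) = ½(x̄1 − x̄2)′S⁻¹(x̄1 − x̄2) is always positive because S is positive definite, so each group's own mean is always classified to that group.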

11 SUPPORT VECTOR CLASSIFIER (Vapnik, 1995): β0 and β are obtained by minimizing ½‖β‖² + C Σj ξj subject to yj(β0 + β′xj) ≥ 1 − ξj and ξj ≥ 0 (j = 1, ..., n), where the ξj are the slack variables; C = ∞ corresponds to the separable case.

12 Leo Breiman (2001) Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox

13 GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)
LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate).
COLON DATA: using only 4 genes, the CVE is 2%.

14 Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

15 Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data

16 Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues

17 BOOTSTRAP APPROACH: Efron’s (1983, JASA) .632 estimator
B.632 = 0.368 AE + 0.632 B1,
where AE is the apparent error rate and B1 is the bootstrap error rate when the rule is applied to points not in the training sample. A Monte Carlo estimate of B1 averages, over the bootstrap replications, the errors made on the original observations that do not appear in the bootstrap sample.

18 Toussaint & Sharpe (1975) proposed the error rate estimator A(w) = (1 − w)AE + wCV, a weighted combination of the apparent error rate AE and a cross-validated rate CV. McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the underlying parameter values.

19 The .632+ estimate of Efron & Tibshirani (1997, JASA):
B.632+ = (1 − w)AE + wB1, with w = .632 / (1 − .368 r),
where r = (B1 − AE)/(γ − AE) is the relative overfitting rate and γ is an estimate of the no-information error rate.
If r = 0, w = .632, and so B.632+ = B.632; if r = 1, w = 1, and so B.632+ = B1.
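The two limiting cases on this slide can be checked with a few lines of code (a sketch of the estimator as stated here; the capping of B1 at the no-information rate follows Efron & Tibshirani's convention):

```python
def b632_plus(ae, b1, gamma):
    """Efron & Tibshirani's (1997) .632+ error-rate estimator.

    ae    : apparent (resubstitution) error rate
    b1    : leave-one-out bootstrap error rate
    gamma : estimate of the no-information error rate
    """
    b1c = min(b1, gamma)                       # cap B1 at the no-information rate
    r = (b1c - ae) / (gamma - ae) if gamma > ae and b1c > ae else 0.0
    w = 0.632 / (1.0 - 0.368 * r)              # weight: .632 when r = 0, 1 when r = 1
    return (1.0 - w) * ae + w * b1c

# r = 0: reduces to the ordinary .632 estimator, 0.368*AE + 0.632*B1
print(b632_plus(ae=0.10, b1=0.10, gamma=0.50))   # equals AE = B1 = 0.10
# r = 1 (B1 = gamma): the weight becomes 1 and the estimate equals B1
print(b632_plus(ae=0.10, b1=0.50, gamma=0.50))   # equals B1 = 0.50
```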

20 Ten-Fold Cross-Validation: the samples are divided into 10 blocks; each block in turn is set aside as the test set, and the rule is trained on the remaining 9 blocks.
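The ten-fold scheme pictured on this slide can be sketched as follows (the helper name `ten_fold_splits` is ours; n = 62 is used purely as an example, matching the colon data discussed later):

```python
import random

def ten_fold_splits(n, seed=0):
    """Partition indices 0..n-1 into 10 disjoint test blocks of near-equal size,
    returning (training indices, test indices) for each fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::10] for k in range(10)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = ten_fold_splits(62)          # e.g. the 62 colon tissue samples
test_union = sorted(i for _, test in splits for i in test)
print(len(splits), test_union == list(range(62)))   # 10 folds cover every sample once
```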

21 MARKER GENES FOR HARVARD DATA. For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

Times selected | No. of genes
 1             | 55
 2             | 18
 3             | 11
 4             |  7
 5             |  8
 6             |  6
 7             | 10
 8             |  8
 9             | 12
10             | 17

22 MARKER GENES FOR HARVARD DATA:
tubulin, alpha, ubiquitous
Cluster Incl N90862
cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4)
DEK oncogene (DNA binding)
Cluster Incl AF035316
transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase)
benzodiazapine receptor (peripheral)
Cluster Incl D21063
galactosidase, beta 1
high-mobility group (nonhistone chromosomal) protein 2
cold inducible RNA-binding protein
Cluster Incl U79287
BAF53
tubulin, beta polypeptide
thromboxane A2 receptor
H1 histone family, member X
Fc fragment of IgG, receptor, transporter, alpha
sine oculis homeobox (Drosophila) homolog 3
transcriptional intermediary factor 1 gamma
transcription elongation factor A (SII)-like 1
like mouse brain protein E46
minichromosome maintenance deficient (mis5, S. pombe) 6
transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
guanine nucleotide binding protein (G protein), gamma 3, linked
dihydropyrimidinase-like 2
Cluster Incl AI951946
transforming growth factor, beta receptor II (70-80kD)
protein kinase C-like 1

23 Breast cancer data set of van ’t Veer et al. (2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415). These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups on the basis of their gene-expression profiles.

24 van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but made available the data for only the top 70 genes, based on the previous study of van ’t Veer et al. (2002).

25
No. of genes | Error rate, top 70 genes (without correction for selection bias as top 70) | Error rate, top 70 genes (with correction for selection bias as top 70) | Error rate, 5422 genes (with correction for selection bias)
   1 | 0.50 | 0.53 | 0.56
   2 | 0.32 | 0.41 | 0.44
   4 | 0.26 | 0.40 | 0.41
   8 | 0.27 | 0.32 | 0.43
  16 | 0.28 | 0.31 | 0.35
  32 | 0.22 | 0.35 | 0.34
  64 | 0.20 | 0.34 | 0.35
  70 | 0.19 | 0.33 | -
 128 | -    | -    | 0.39
 256 | -    | -    | 0.33
 512 | -    | -    | 0.34
1024 | -    | -    | 0.33
2048 | -    | -    | 0.37
4096 | -    | -    | 0.40
5422 | -    | -    | 0.44

26

27

28 Two Clustering Problems: (a) clustering of genes on the basis of the tissues – the genes are not independent; (b) clustering of tissues on the basis of the genes – the latter is a nonstandard problem in cluster analysis (n << p).

29

30 Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade. Gary Churchill (The Jackson Laboratory) Contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni. Statistical Science (2003) 18, 64-69.

31

32

33 The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.

34 In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

35 Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. (Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17) “in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.”

36 McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model- based clustering of the tissue samples. Special issue of the Journal of Multivariate Analysis 90 (2004) edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).

37 MIXTURE OF g NORMAL COMPONENTS: f(x) = Σi πi φ(x; μi, Σi). With a common spherical covariance matrix Σi = σ²I, assignment is based on the EUCLIDEAN DISTANCE ‖x − μi‖² (up to an additive constant); with a common general covariance matrix Σi = Σ, it is based on the MAHALANOBIS DISTANCE (x − μi)′Σ⁻¹(x − μi).
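The link between the two distances on this slide can be checked numerically (a small sketch; the helper name `mahalanobis_sq` and the random test point are ours):

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(sigma, d)

rng = np.random.default_rng(2)
x, mu = rng.standard_normal(4), rng.standard_normal(4)

# With a spherical covariance sigma^2 * I, the Mahalanobis distance reduces to
# the squared Euclidean distance scaled by 1/sigma^2 -- the k-means geometry.
s2 = 2.5
lhs = mahalanobis_sq(x, mu, s2 * np.eye(4))
rhs = np.sum((x - mu) ** 2) / s2
print(np.isclose(lhs, rhs))   # True
```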

38 k-means implicitly assumes SPHERICAL CLUSTERS: it corresponds to fitting a MIXTURE OF g NORMAL COMPONENTS with a common spherical covariance matrix σ²I.

39 In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.

40 Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.

41 Mixtures of Factor Analyzers. A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data. One approach to reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).

42 The ith component-covariance matrix is modelled as Σi = BiBi′ + Di, where Bi is a p × q matrix of factor loadings and Di is a diagonal matrix.

43 Single-Factor Analysis Model

44 Yj = μ + BUj + ej (j = 1, ..., n), where the Uj are i.i.d. N(0, Iq), independently of the errors ej, which are i.i.d. N(0, D), with D a diagonal matrix.

45 Mixtures of Factor Analyzers. A single-factor analysis model provides only a global linear model. A globally nonlinear approach is obtained by postulating a mixture of linear submodels.

46 Conditional on membership of the ith component of the mixture, Yj = μi + BiUij + eij, where Ui1, ..., Uin are independent, identically distributed (i.i.d.) N(0, Iq), independently of the eij, which are i.i.d. N(0, Di), with Di a diagonal matrix (i = 1, ..., g).

47 There is an infinity of choices for Bi, since the model still holds if Bi is replaced by BiCi, where Ci is an orthogonal matrix. Choose Ci so that Bi′Di⁻¹Bi is diagonal. The number of free parameters per component-covariance matrix is then pq + p − ½q(q − 1).

48 We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm. The reduction in the number of parameters per component-covariance matrix is then ½p(p + 1) − {pq + p − ½q(q − 1)} = ½{(p − q)² − (p + q)}.
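The size of that reduction is easy to compute for microarray-scale dimensions (the function name `mfa_param_reduction` is ours; p = 2000, q = 4 is just an illustrative choice):

```python
def mfa_param_reduction(p, q):
    """Reduction in free covariance parameters per component when
    Sigma_i = B_i B_i' + D_i (B_i p x q, D_i diagonal) replaces a full Sigma_i,
    after fixing the rotational indeterminacy (q(q-1)/2 constraints)."""
    full = p * (p + 1) // 2                # free parameters in a full covariance
    factor = p * q + p - q * (q - 1) // 2  # loadings + diagonal - rotation constraints
    return full - factor

# For p = 2000 genes and q = 4 factors: 2,001,000 vs 9,994 parameters.
print(mfa_param_reduction(p=2000, q=4))   # -> 1991006
```

The result agrees with the closed form ½{(p − q)² − (p + q)} on the slide.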

49 1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of the mixing proportions πi and the component means μi. 2nd cycle: declare the missing data to be also the factors. Update the estimates of Bi and Di.

50 M-step on 1st cycle: the usual normal-mixture updates of πi and μi, for i = 1, ..., g.

51 M-step on 2nd cycle: updates of the estimates of Bi and Di.

52

53 Work in the q-dimensional space:
(BiBi′ + Di)⁻¹ = Di⁻¹ − Di⁻¹Bi(Iq + Bi′Di⁻¹Bi)⁻¹Bi′Di⁻¹,
|BiBi′ + Di| = |Di| / |Iq − Bi′(BiBi′ + Di)⁻¹Bi|.

54 PROVIDES A MODEL-BASED APPROACH TO CLUSTERING. McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18, 413-422. http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf

55

56 Example: Microarray Data. Colon data of Alon et al. (1999): n = 62 tissue samples (40 tumours; 22 normals) on p = 2,000 genes, giving a 2,000 × 62 data matrix.

57

58

59 Mixture of 2 normal components

60 Mixture of 2 t components

61

62 Clustering of COLON Data Genes using EMMIX-GENE

63 Grouping for Colon Data (gene groups 1–20).

64

65

66 Grouping for Colon Data (gene groups 1–20).

67

68

69 Heat Map of Genes in Group G 1

70 Heat Map of Genes in Group G 2

71 Heat Map of Genes in Group G 3

72

73 An efficient algorithm based on a heuristically justified objective function, delivered in reasonable time, is usually preferable to a principled statistical approach that takes years to develop or ages to run. Having said this, the case for a more principled approach can be made more effectively once cruder approaches have exhausted their harvest of low-hanging fruit. Gilks (2004)

74 In bioinformatics, algorithms are generally viewed as more important than models or statistical efficiency. Unless the methodological research results in a web-based tool or, at the very least, downloadable code that can be easily run by the user, it is effectively useless.

