Gene expression profiling identifies molecular subtypes of gliomas Ruty Shai, Tao Shi, Thomas J Kremen, Steve Horvath, Linda M Liau, Timothy F Cloughesy, Paul S Mischel* and Stanley F Nelson Presented by Stephanie Tsung
Outline Description of Data Statistical Methods Multidimensional Scaling Plot Hierarchical Clustering K-means Clustering Gene Filtering/Selection Predictor Comparison Conclusion / Future Work
Background Brain tumors can be classified by tumor origin, cell type of origin, tumor site, etc. Tumor classification is critical for treatment selection and outcome prediction. However, current classification methods are still far from perfect. DNA microarrays, a new technology, have been introduced to classify cancers on the basis of gene expression levels.
Background: Cancer Classification Cancer classification can be divided into two challenges: class discovery and class prediction. Class discovery refers to defining previously unrecognized tumor subtypes. Class prediction refers to the assignment of particular tumor samples to already-defined classes.
Objectives To test whether gene expression measurements can be used to classify different brain tumors; To determine sets of significant genes that distinguish brain tumors of different pathological types, grades and survival times; To validate the selected informative genes in brain tumor classification and prediction.
Data and Pre-Processing Affymetrix HG-U95Av2 chips; 12,555 genes and 42 samples in total. Tumor Types (#): N(7) O(3) D(18) A(2) AA(3) P(9) Data pre-processing: Each tumor was examined by a neuropathologist and dissected into two portions, one for tissue diagnosis and one for RNA extraction. Normalization and model-based expression indices were computed in dChip.
Q. Are the global transcriptional signatures of the different pathologic subtypes of gliomas molecularly distinct? As a first step in the analysis, we asked whether the global transcriptional signatures of the different pathologic subtypes of gliomas were molecularly distinct.
Multidimensional Scaling Plot (MDS Plot) To uncover the hidden structure of data. D(N) -> D(2) Dimension reduction technique: from 12,555-dimensional space to a low-dimensional Euclidean space. Explains observed similarities and dissimilarities between objects, such as correlation, Euclidean distance, etc. R: cmd1 <- cmdscale(dist(dat1[,1:30]), k=2, eig=TRUE) We performed multidimensional scaling, an unsupervised method of data reduction, in which high-dimensional gene expression data are projected onto two viewable dimensions representing linear combinations of genes that provide the most variation in the data set.
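A minimal sketch of the projection described on this slide, in base R. The matrix `expr` is simulated stand-in data (genes in rows, samples in columns), not the study's expression values; the group labels, group sizes and the shift applied to one group are assumptions made only for illustration. Note that `dist()` measures distances between rows, so the matrix is transposed to put samples in rows.

```r
set.seed(1)
n_genes <- 200
n_samples <- 12
# Simulated expression matrix: genes in rows, samples in columns (hypothetical data)
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes, ncol = n_samples)
group <- rep(c("A", "B"), each = n_samples / 2)
# Shift the first 50 genes in group B so that the two groups differ
expr[1:50, group == "B"] <- expr[1:50, group == "B"] + 3

# dist() works on rows, so transpose to get sample-to-sample Euclidean distances
d <- dist(t(expr))

# Classical (metric) MDS: project the samples onto k = 2 dimensions
mds <- cmdscale(d, k = 2, eig = TRUE)
coords <- mds$points   # one row per sample, two columns

plot(coords, col = ifelse(group == "A", "red", "blue"), pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2")
```

In this simulated setting the two groups should fall on opposite sides of the first dimension, which is the kind of separation the MDS plot is meant to reveal.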
MDS Plot Multidimensional scaling analysis of our samples based on expression of all 12 555 probe sets demonstrated that gliomas of different type and grade have distinctive global gene expression signatures. The glioblastomas, lower grade astrocytomas and oligodendrogliomas were all separable from each other, and from normal brain tissue. The multidimensional scaling data also indicate that primary glioblastomas (GBMs), which arise as de novo grade IV tumors, are not molecularly distinct from secondary glioblastomas, which develop from lower grade gliomas. However, the secondary GBMs are more diverse than the primary GBMs. Figure 1. (a) Multidimensional scaling plot of all 42 tissue samples plotted in two-dimensional space using expression values from all 12 555 probesets.
Hierarchical Clustering Step 1: Evaluate all pairwise distances between objects. Step 2: Look for the pair with the shortest distance. Step 3: Construct a 'new object' by averaging the two objects. Step 4: Evaluate the distance from the 'new object' to all other objects and go to Step 2. R: h1 <- hclust(dist(x), method="average")
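The steps on this slide can be sketched in base R as follows. The data are simulated (two shifted groups of hypothetical samples), and `method = "average"` follows the one-liner on the slide; in `hclust()` that setting defines the distance between two clusters as the mean of all pairwise distances between their members, which is one common way to realize the "average" step.

```r
set.seed(2)
# Hypothetical data: 10 samples (rows) by 50 genes (columns), two shifted groups
x <- rbind(matrix(rnorm(5 * 50, mean = 0), nrow = 5),
           matrix(rnorm(5 * 50, mean = 3), nrow = 5))
rownames(x) <- paste0("S", 1:10)

# All pairwise Euclidean distances between samples
d <- dist(x)

# Repeatedly merge the closest pair of clusters; with method = "average" the
# distance between two clusters is the mean of their pairwise distances
h1 <- hclust(d, method = "average")

# Cut the tree into two clusters and inspect the assignment
clusters <- cutree(h1, k = 2)
print(clusters)
plot(h1)   # dendrogram
```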
Hierarchical Clustering Figure 1. (b) The same 42 tissue samples were grouped into hierarchical clusters (labeled I to IV in the figure). Tissue samples are color-coded. I & II: P = 0.00006, Fisher's exact test; III & IV: P = 0.00001
Fisher's Exact Test
Sample   w/ charact.   w/o charact.   Total
1        A             B              A+B
2        C             D              C+D
Total    A+C           B+D            N
H0: The proportion of interest does not differ between the two groups. Ex. 55 8 7 Fisher's exact test is an alternative to the chi-squared test for testing the hypothesis that some proportion of interest differs between two groups. It has the advantage that it does not make any approximations, and so is suitable for small sample sizes. It yields an exact P-value for a 2x2 frequency table with small expected frequencies, for which the chi-squared test is not appropriate. The Fisher exact test for 2 x 2 tables is used when members of two independent groups can fall into one of two mutually exclusive categories. The test is used to determine whether the proportions of those falling into each category differ by group. The chi-squared test of independence can also be used in such situations, but it is only an approximation, whereas Fisher's exact test returns exact one-tailed and two-tailed p-values for a given frequency table. Fisher's exact test computes the probability, given the observed marginal frequencies, of obtaining exactly the frequencies observed and any configuration more extreme. By "more extreme," we mean any configuration (given observed marginals) with a smaller probability of occurrence in the same direction (one-tailed) or in both directions (two-tailed). The two-tailed probability: .326 + .007 + .093 + .163 + .019 = .608
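As an illustration of how the two-tailed probability is assembled, here is a small base R sketch. The 2x2 counts are hypothetical and are not the example from the slide. The first part calls `fisher.test()`; the second part rebuilds the same P-value by hand from the hypergeometric distribution, holding the margins fixed.

```r
# Hypothetical 2x2 table: rows = groups, columns = with / without characteristic
tab <- matrix(c(3, 1,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("1", "2"),
                              characteristic = c("with", "without")))

# Built-in exact test (two-sided by default)
p_two_sided <- fisher.test(tab)$p.value

# Same value from first principles: with the margins fixed, the top-left
# cell follows a hypergeometric distribution
a_obs <- tab[1, 1]
row1 <- sum(tab[1, ])
row2 <- sum(tab[2, ])
col1 <- sum(tab[, 1])
a_all <- max(0, col1 - row2):min(row1, col1)   # all possible top-left counts
probs <- dhyper(a_all, row1, row2, col1)
p_obs <- probs[a_all == a_obs]
# Sum the tables that are no more probable than the observed one
# (the small tolerance guards against floating-point ties)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

print(c(fisher = p_two_sided, manual = p_manual))
```

For this table the possible top-left counts 0 to 4 have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided P-value is (1 + 16 + 16 + 1)/70 = 34/70.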
Q. Can we uncover these subtypes without prior knowledge? i.e. How many categories of gliomas are suggested by the gene expression data? Next, we asked if our data might be used to uncover molecular subtypes of gliomas without prior knowledge of their pathologic type or grade. That is, how many categories of glioma are suggested by the gene expression data?
K-means Clustering To find a K-partition of the observations that minimizes the total within-cluster sum of squares (WSS). The number of clusters, k, needs to be pre-specified. Tibshirani prediction strength can be used to determine the optimal k. R: cl1 <- kmeans(x, 3)
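A short base R sketch of the K-means call on this slide, using simulated data with three shifted groups (hypothetical values, not the glioma samples). The `nstart` argument is an addition not shown on the slide; it reruns the algorithm from several random starts, which usually makes the result less dependent on initialization.

```r
set.seed(3)
# Hypothetical data: 30 samples (rows) by 20 genes (columns), three shifted groups
x <- rbind(matrix(rnorm(10 * 20, mean = 0), nrow = 10),
           matrix(rnorm(10 * 20, mean = 4), nrow = 10),
           matrix(rnorm(10 * 20, mean = 8), nrow = 10))

# k must be chosen in advance; nstart keeps the best of 25 random starts,
# i.e. the solution with the smallest total within-cluster sum of squares
cl1 <- kmeans(x, centers = 3, nstart = 25)

print(cl1$cluster)        # cluster label of each sample
print(cl1$withinss)       # within-cluster sum of squares, one value per cluster
print(cl1$tot.withinss)   # the quantity K-means tries to minimize
```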
Three groups were defined Each tumor is assigned to one of three cluster groups by color: red is group 1, green is group 2 and black is group 3. These data indicate that there are three main molecular subsets of gliomas, which correspond to glioblastomas, astrocytomas, and oligodendrogliomas. Figure 2 Grouping of tumors. All tumor samples were plotted by multidimensional scaling using all 12 555 probesets. We performed nonhierarchical K-means clustering (Kaufman and Rousseeuw, 1990).
Gene Filtering/Selection To find the genes that are differentially expressed in 6 two-group comparisons. The top 30 genes from each comparison were selected by t-test, giving the 170 most differentially expressed genes (6 * 30 - 10 = 170, after removing 10 redundant genes). A final gene list was then constructed by pooling the most differentially expressed genes from these individual comparisons, and redundant genes were eliminated.
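One way to sketch the per-comparison filter in base R is shown below. The expression matrix, group labels, sample sizes and effect size are all simulated assumptions, and the Welch t-test from `t.test()` stands in for whatever t-test variant was actually used. Only one two-group comparison is shown; the final comment indicates how lists from several comparisons could be pooled.

```r
set.seed(4)
n_genes <- 500
# Hypothetical expression matrix: genes in rows, 8 samples per group in columns
expr <- matrix(rnorm(n_genes * 16), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), NULL))
group <- factor(rep(c("A", "B"), each = 8))
# Make the first 30 genes truly different between the two groups
expr[1:30, group == "B"] <- expr[1:30, group == "B"] + 5

# Welch two-sample t statistic for every gene
t_stat <- apply(expr, 1, function(g) {
  unname(t.test(g[group == "A"], g[group == "B"])$statistic)
})
names(t_stat) <- rownames(expr)

# Keep the 30 genes with the largest absolute t statistic
top30 <- names(sort(abs(t_stat), decreasing = TRUE))[1:30]
print(head(top30))

# With several comparisons, pool the per-comparison lists and drop repeats,
# e.g. unique(c(top30_first, top30_second)) for two hypothetical lists
```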
Predictor Comparison Compare the performance of predictors: Gene Vote. Leave-one-out cross-validation error rates were calculated. Leaving-one-out is an elegant and straightforward technique for estimating classifier error rates. Because it is computationally expensive, it has often been reserved for problems where relatively small sample sizes are available. For a given method and sample size, n, a classifier is generated using (n - 1) cases and tested on the single remaining case. This is repeated n times, each time designing a classifier by leaving one case out. Thus, each case in the sample is used as a test case, and each time nearly all the cases are used to design a classifier. The error rate is the number of errors on the single test cases divided by n. The leave-one-out error rate estimator is an almost unbiased estimator of the true error rate of a classifier.
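The leave-one-out loop can be sketched in base R as follows. The classifier here is a simple nearest-centroid rule used as a stand-in; it is not the gene-vote predictor named on the slide, and the data and the helper name `nearest_centroid` are hypothetical.

```r
set.seed(5)
# Hypothetical data: 20 samples (rows) by 10 genes (columns), two classes
x <- rbind(matrix(rnorm(10 * 10, mean = 0), nrow = 10),
           matrix(rnorm(10 * 10, mean = 3), nrow = 10))
y <- factor(rep(c("class1", "class2"), each = 10))

# Stand-in classifier: assign a sample to the class with the nearest centroid
nearest_centroid <- function(train_x, train_y, new_x) {
  centroids <- sapply(levels(train_y), function(cl) {
    colMeans(train_x[train_y == cl, , drop = FALSE])
  })                                       # genes x classes matrix
  d2 <- colSums((centroids - new_x)^2)     # squared distance to each centroid
  names(which.min(d2))
}

n <- nrow(x)
pred <- character(n)
for (i in seq_len(n)) {
  # Train on the other n - 1 samples, test on sample i
  pred[i] <- nearest_centroid(x[-i, , drop = FALSE], y[-i], x[i, ])
}

# Error rate = number of misclassified test cases / n
loocv_error <- mean(pred != as.character(y))
print(loocv_error)
```

If genes are filtered before classification, the filtering would normally be repeated inside each fold so that the held-out case does not influence gene selection.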
Table 1.
Using 170 filtered genes based on t-test Figure 3 Hierarchical clustering of seven normal white matter tissue samples and 26 glial tumor samples using 170 filtered genes. We used dChip to perform hierarchical clustering of the samples using 1 - r, where r is Pearson's correlation coefficient, as the distance measure. Samples are coded by color. Gene expression values are represented as expression relative to the mean of all samples; red is a relatively higher expression and green is a relatively lower expression.
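The 1 - r distance can be reproduced in base R roughly as below. This is a sketch on simulated data, not a re-creation of the dChip analysis; average linkage is an assumption, since the slide does not state which linkage was used.

```r
set.seed(6)
# Hypothetical matrix: 40 genes (rows) by 8 samples (columns)
expr <- matrix(rnorm(40 * 8), nrow = 40,
               dimnames = list(NULL, paste0("S", 1:8)))
# Give S1-S4 one shared gene pattern and S5-S8 another, so that samples
# within each set are strongly correlated
expr[, 1:4] <- expr[, 1:4] + rnorm(40, sd = 3)
expr[, 5:8] <- expr[, 5:8] + rnorm(40, sd = 3)

# cor() correlates columns, so r is a sample-by-sample matrix
r <- cor(expr, method = "pearson")

# 1 - r ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated)
d <- as.dist(1 - r)

# Linkage method is an assumption; it is not given on the slide
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```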
Table 2.
Conclusion Performed MDS plots and K-means clustering analysis and found evidence for three clusters: glioblastomas, lower grade astrocytomas, and oligodendrogliomas (p<0.00001). A relatively small number of genes can be used to distinguish between molecular subtypes. Molecular subsets of gliomas can potentially be used for patient stratification and may suggest potential targets for treatment.
Future Directions Construct predictors using different gene selection methods. Validate the selected genes with new tumor samples. ……
K=3 gave us the best prediction strength among K > 1 (K = 1 always scores 1.000)
Number of Clusters (K)          1      2      3      4      5
Tibshirani Prediction Strength  1.000  0.766  0.881  0.501  0.510
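For readers who want to see how such a table can be produced, here is a simplified base R sketch of prediction strength in the sense of Tibshirani and Walther. It is not the presenter's code: the data are simulated, the helper names `nearest_center` and `prediction_strength` are hypothetical, and the estimate averages a few random half/half splits. For each split, the test half is clustered on its own, the same rows are labeled with the training centroids, and the score is the worst per-cluster share of row pairs that stay together.

```r
set.seed(7)
# Hypothetical data: 60 samples (rows) by 5 features (columns), three groups
x <- rbind(matrix(rnorm(20 * 5, mean = 0), nrow = 20),
           matrix(rnorm(20 * 5, mean = 5), nrow = 20),
           matrix(rnorm(20 * 5, mean = 10), nrow = 20))

# Assign each row of new_x to the nearest of the given centroids
nearest_center <- function(new_x, centers) {
  apply(new_x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
}

# Prediction strength for one random half/half split of the rows
prediction_strength <- function(x, k) {
  if (k == 1) return(1)   # a single cluster is always reproduced
  idx <- sample(nrow(x), nrow(x) %/% 2)
  train <- x[idx, , drop = FALSE]
  test <- x[-idx, , drop = FALSE]
  km_train <- kmeans(train, centers = k, nstart = 20)
  km_test <- kmeans(test, centers = k, nstart = 20)
  # Label the test rows using the training centroids
  by_train <- nearest_center(test, km_train$centers)
  # For each test cluster: share of row pairs that the training centroids
  # also place together (singleton clusters are skipped)
  shares <- sapply(seq_len(k), function(j) {
    members <- which(km_test$cluster == j)
    m <- length(members)
    if (m < 2) return(NA_real_)
    same <- outer(by_train[members], by_train[members], "==")
    (sum(same) - m) / (m * (m - 1))   # exclude the diagonal
  })
  min(shares, na.rm = TRUE)
}

# Average over 5 random splits for each candidate k
ps <- sapply(1:5, function(k) mean(replicate(5, prediction_strength(x, k))))
print(round(ps, 3))
```

A common reading of the output is to prefer the largest k whose prediction strength remains high, since k = 1 scores 1 by construction.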
Statistical problems in response-based classification Identification of new or unknown classes: unsupervised learning. Classification into known classes: supervised learning. Identification of “best” predictor variables: variable selection, e.g. marker genes in microarray data (gene voting, hierarchical clustering).