
1 Integrative Gene Set Analysis and Visualization in Genome-Wide Gene Expression Profiling
Chen-An Tsai, Department of Agronomy, Biometry Division, National Taiwan University, Taipei, Taiwan
Department of Statistics, NCKU, November 8, 2012

2 The Road to the $1,000 Genome
- The revolution in DNA sequencing and the new era of personalized medicine.
- Next-generation sequencing (NGS) platforms, such as Roche/454, Solexa/Illumina, and AB SOLiD, speed up the identification and tracking of genetic variation, helping to unveil the connection between genes and their associated traits.
- "The $1,000 genome, the $100,000 analysis?" Elaine R. Mardis, Genome Medicine 2010, 2:84.
- Whole-genome sequences, gene function analysis, and comparative analysis of different genome sequences can be combined to dissect the molecular genetic mechanisms of important complex traits.

3 Outline
- Background
- The goal of significance analysis for genes in functional groups
- GSA statistics
  - One-sided test
  - Two-sided test
  - AUC-based statistics
- Examples: simulation and data analysis
- Discussion and summary
* This is joint work with Drs. James J. Chen, Zhanfeng Wang, and Yuan-chin Chang.

4 Gene Set Analysis (GSA)
- Gene Set Analysis (GSA): a statistical analysis to determine whether functionally predefined sets of genes are expressed differently (enrichment or depletion) under different experimental conditions (GSEA; Mootha et al., 2003).
- Gene set (gene class): a group of biologically related genes, such as a metabolic pathway, a protein complex, or a GO (Gene Ontology) term.
- Incorporates previously accumulated biological information into gene expression data analysis.
- Investigates the underlying mechanisms of a disease on a set of genes. (GSA considers the entire gene set instead of individual genes.)

5 Outline of Gene Set Analysis Figure courtesy of Tian et al. (2005) PNAS

6 Framework of Hypothesis Testing
- Given a set of genes $G$ with some biological function, test whether there is a coordinated association with a phenotype of interest.
- Two null hypotheses:
  - $H_0(Q_1)$ (competitive null hypothesis): the genes in the gene set $G$ are at most as often differentially expressed as the genes in its complement $G^c$.
  - $H_0(Q_2)$ (self-contained null hypothesis): no genes in $G$ are differentially expressed.

7 Approaches of Gene-Class Testing
- Fisher's exact test approach: define a fixed cutoff in the gene list and count the members of group G below and above this cutoff (Draghici et al., 2003, Genomics), as in GoMiner and similar tools.
- Kolmogorov-Smirnov running-sum statistic: Mootha et al., 2003, Nature Genetics; Subramanian et al., 2005, PNAS.
- t- or Wilcoxon rank test weighted-sum statistic: Tian et al., 2005, PNAS.
- Global testing statistic: Goeman et al., 2004, Bioinformatics.
- ANCOVA test: Mansmann and Meister, 2005, Methods Inf. Med.
- MaxMean approach: Efron and Tibshirani, 2007, The Annals of Applied Statistics.
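
The cutoff-and-count idea behind the Fisher's exact test approach can be sketched as a one-sided hypergeometric tail probability. This is an illustration only: the function name `hypergeom_upper_tail` and all counts are hypothetical, and tools such as GoMiner may differ in detail.

```python
from math import comb

def hypergeom_upper_tail(n_genes, n_set, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_genes genes, of which n_set belong to the gene set.
    This equals the one-sided Fisher's exact test p-value for enrichment."""
    total = comb(n_genes, n_selected)
    upper = min(n_set, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_genes - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1000 genes on the array, 50 in the gene set,
# 100 genes above the cutoff, 12 of which fall in the gene set.
p_value = hypergeom_upper_tail(1000, 50, 100, 12)
print(p_value)
```

The result depends on where the cutoff is placed, which is one motivation for the cutoff-free statistics listed on the same slide.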

8 GSEA (KS test)
- The enrichment score (ES) tests whether the members of a given gene set are enriched among the most differentially expressed (D.E.) genes between two phenotypes.
- It measures the degree of enrichment of the gene set in the ordered gene list.

9 Key Elements of GSEA
- Genes are ranked by a metric that reflects the expression difference between the phenotypes, e.g. the signal-to-noise (SN) ratio, the Wilcoxon statistic, or the t-statistic.
- Calculation of an enrichment score (ES).
- Estimation of the significance level for each gene set: an empirical phenotype-based permutation procedure is used to calculate p-values, which preserves the correlation structure among genes.
- Adjustment for multiple hypothesis testing.
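
The enrichment-score step can be illustrated with a minimal sketch. It assumes the unweighted Kolmogorov-Smirnov-type running sum, in which each gene in the set adds 1/N_hit and each gene outside it subtracts 1/N_miss; the weighted variant of Subramanian et al. (2005) is not reproduced here. The function name `enrichment_score` and the toy gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation from zero of the running sum along the ranking.

    ranked_genes: gene identifiers ordered from most to least differentially
                  expressed (by any ranking metric).
    gene_set:     identifiers belonging to the gene set under test.
    """
    gene_set = set(gene_set)
    in_set = [g in gene_set for g in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for hit in in_set:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranking, {"g1", "g2"}))  # set at the top of the list: 1.0
print(enrichment_score(ranking, {"g5", "g6"}))  # set at the bottom: -1.0
```

In the permutation step, the same function would be re-evaluated on rankings obtained after shuffling the phenotype labels.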

10 Global Testing Approach
- Takes into account the correlation structure of the members of the gene set S.
- The score test for a logistic regression model by Goeman et al. (2004).

11 One-sided and Two-sided Tests
- One-sided test: the changes of gene expression in the gene class are in one direction, either up- or down-regulated (to detect coordinated changes).
  - The genes in the set are closely related and, hence, will have similar expression patterns, either up or down.
- Two-sided test: the changes of gene expression in the gene class are mixtures of up- and down-regulation.
  - In an exploratory context, the primary interest is to identify all potentially differentially expressed gene sets and how individual genes in a gene set respond in different phenotypes.

12 One-Sided Global Testing Statistics
- OLS statistic (O'Brien's OLS) for each gene set.
- LWS statistic (Läuter's LWS): sum of the individual t-statistics on the standardized observations.

13 MANOVA Test
- MANOVA model: $y_{ij} = m_i + e_{ij}$, where $m_i$ is the mean vector for the $i$-th condition.
- Wilks' $\Lambda$ statistic: $\Lambda = \prod_k 1/(1+\lambda_k)$, where the $\lambda_k$ are the eigenvalues of $W^{-1}B$, and $W$ and $B$ are the within and between sum-of-squares matrices.
- For two conditions, the Wilks' $\Lambda$ test is equivalent to Hotelling's $T^2$ test.
- The covariance matrix estimate $S$ used in $T^2$ is singular and ill-conditioned when $m > n_1 + n_2$.
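
The two-condition equivalence can be made explicit with the standard textbook relation between the two statistics. The notation below ($\bar y_1$, $\bar y_2$ for the group mean vectors, $S$ for the pooled covariance estimate) is an assumption, since the slide's own formula did not survive.

```latex
% Two groups with sizes n_1 and n_2, pooled covariance S = W / (n_1 + n_2 - 2)
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar y_1 - \bar y_2)^{\top} S^{-1} (\bar y_1 - \bar y_2)

% B = \frac{n_1 n_2}{n_1 + n_2} (\bar y_1 - \bar y_2)(\bar y_1 - \bar y_2)^{\top}
% has rank one, so W^{-1} B has a single nonzero eigenvalue
\lambda_1 = \frac{T^2}{n_1 + n_2 - 2}
\qquad\Longrightarrow\qquad
\Lambda = \frac{1}{1 + \lambda_1} = \frac{1}{1 + T^2 / (n_1 + n_2 - 2)}
```

Because $\Lambda$ is a decreasing function of $T^2$, the two tests order gene sets identically. Both require $S^{-1}$, which is why the singularity of $S$ matters.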

14 Eigenvalues of the covariance matrix (black) and of the shrinkage covariance matrix estimator (green), calculated from simulated data for p = 100 and n = 10 and 1000. The true eigenvalues are shown as a black dashed line.

15 Shrinkage Covariance-Matrix Estimator
- An improved covariance matrix estimate is obtained via shrinkage, according to the Ledoit-Wolf theorem (Ledoit and Wolf, 2003).
- The shrinkage covariance matrix estimator $S^*_{ij}$ is defined in terms of the sample variances $s_{ii}$, the sample correlations $r_{ij}$, and a shrinkage intensity.
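
A minimal sketch of the shrinkage idea follows. It assumes shrinkage toward a diagonal target, where the sample variances are kept and every off-diagonal entry r_ij * sqrt(s_ii * s_jj) is multiplied by (1 - intensity). The slide's own formulas are not available, so this form, the function names, and the fixed `intensity` argument are assumptions; in practice the intensity is estimated from the data.

```python
def sample_covariance(rows):
    """Unbiased sample covariance matrix; rows are samples, columns are genes."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]

def shrink_covariance(cov, intensity):
    """Keep the diagonal (variances); scale off-diagonal entries by (1 - intensity).

    intensity = 0 returns the sample covariance, intensity = 1 a diagonal matrix.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    p = len(cov)
    return [
        [cov[i][j] if i == j else (1.0 - intensity) * cov[i][j] for j in range(p)]
        for i in range(p)
    ]

# Two perfectly correlated toy genes give a singular sample covariance matrix.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
S = sample_covariance(data)            # [[1.0, 2.0], [2.0, 4.0]], determinant 0
S_star = shrink_covariance(S, 0.5)     # [[1.0, 1.0], [1.0, 4.0]], determinant 3
print(S, S_star)
```

The toy example shows why shrinkage helps in the singular case: the shrunken matrix is invertible even though the sample covariance matrix is not.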

16 Schematic for Testing Procedure
[Flow diagram: observed data matrix X, phenotype y, and indicator $I_g$ for gene set S; permute y or permute $I_g$; compute the global statistic and empirical p-values; test hypothesis Q1 and test hypothesis Q2.]
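
The two permutation schemes in the schematic can be sketched as follows. The mapping used here (permuting the phenotype y for the self-contained hypothesis Q2, permuting gene-set membership for the competitive hypothesis Q1) follows the usual convention and is an assumption about the diagram. The statistic, a sum of squared group-mean differences, is only a stand-in for the global statistics discussed in the talk, and all names and data are hypothetical.

```python
import random

def set_statistic(X, y, genes):
    """Sum over the listed genes of squared group-mean differences.
    X[i][g] is the expression of gene g in sample i; y[i] is 0 or 1."""
    g1 = [row for row, lab in zip(X, y) if lab == 1]
    g0 = [row for row, lab in zip(X, y) if lab == 0]
    total = 0.0
    for g in genes:
        d = sum(r[g] for r in g1) / len(g1) - sum(r[g] for r in g0) / len(g0)
        total += d * d
    return total

def permutation_p_value(X, y, genes, mode, n_perm=999, seed=0):
    """Empirical p-value. mode "Q2" shuffles phenotype labels;
    mode "Q1" draws random gene sets of the same size."""
    rng = random.Random(seed)
    observed = set_statistic(X, y, genes)
    n_genes = len(X[0])
    count = 0
    for _ in range(n_perm):
        if mode == "Q2":
            y_perm = list(y)
            rng.shuffle(y_perm)
            stat = set_statistic(X, y_perm, genes)
        elif mode == "Q1":
            stat = set_statistic(X, y, rng.sample(range(n_genes), len(genes)))
        else:
            raise ValueError("mode must be 'Q1' or 'Q2'")
        count += stat >= observed
    return (1 + count) / (1 + n_perm)

# Toy data: 10 samples, 20 genes; only genes 0 and 1 differ between groups.
y = [0] * 5 + [1] * 5
X = [[(10.0 if lab == 1 and g < 2 else 0.0) + ((i * 7 + g * 3) % 5) * 0.1
      for g in range(20)] for i, lab in enumerate(y)]
p_q2 = permutation_p_value(X, y, [0, 1], "Q2")
p_q1 = permutation_p_value(X, y, [0, 1], "Q1")
print(p_q1, p_q2)
```

Shuffling y keeps each sample's expression vector intact, so the correlation among genes is preserved under the Q2 scheme; the Q1 scheme instead compares the set against other genes measured on the same samples.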

17 Simulation Study
- Compare the proposed approach with competing methods that are widely used for GSA, in terms of type I error rate and power for identifying differential gene sets.
- Simulate 1000 gene sets with 100 genes in each gene set from a multivariate normal distribution MVN($\mu$, $\Sigma$). In the alternative model, the first 20 genes have a mean difference of 2γ and the next 20 genes have a mean difference of −2γ.
- The training sample sizes are $n_1 = n_2 = 10$ and $n_1 = n_2 = 25$ for the diseased and normal groups.
- All simulations are repeated 1000 times.

18 Results of Simulation Study
Type I error of seven GSA methods: Hotelling's $T^2$, PCA, SAM-GS, ANCOVA, Global, GSEA, and MaxMean tests.
Design: 100 genes were simulated from MVN($\mu$, $\Sigma$); $\mu$ were generated from U[0,10] and $\sigma_{ii}$ were generated from U[0.1,10].

Method            $n_1 = n_2 = 10$
Hotelling's T^2   0.050   0.039   0.038   0.050
PCA               0.053   0.042   0.052   0.062
SAM-GS            0.046   0.042   0.038   0.055
ANCOVA            0.042   0.038   0.034   0.052
Global            0.001   0.009   0.016   0.034
GSEA              0.059   0.058   0.052   0.048
MaxMean           0.093   0.094   0.107   0.098

19 Results of Simulation Study (Cont'd)
Type I error of seven GSA methods: Hotelling's $T^2$, PCA, SAM-GS, ANCOVA, Global, GSEA, and MaxMean tests.

Method            $n_1 = n_2 = 25$
Hotelling's T^2   0.046   0.044   0.039   0.049
PCA               0.052   0.061   0.038   0.045
SAM-GS            0.041   0.062   0.052
ANCOVA            0.050   0.052   0.035   0.037
Global            0.005   0.024   0.002   0.037
GSEA              0.054   0.055   0.053   0.048
MaxMean           0.103   0.105   0.115   0.100

20 Results of Simulation Study (Cont’d) Power Analysis

21 Real Examples
- Three microarray studies, the Gender, Leukemia, and P53 datasets, are publicly available at the GSEA website (http://www.broad.mit.edu/gsea). Each study comes with two catalogs of gene sets: the chromosome and cytogenetic catalog (C1) and the functional catalog (C2).
- The gender dataset includes 15,056 mRNA expression profiles from 15 male and 17 female samples of lymphoblastoid cell lines.
- The leukemia dataset is used to study acute lymphoid leukemia (ALL) and acute myeloid leukemia (AML) by comparing 10,056 expression profiles derived from 24 ALL patients and 24 AML patients.
- The p53 dataset is a study to identify targets of the transcription factor p53 from 10,100 gene expression profiles in the NCI-60 collection of cancer cell lines, with 17 normal and 33 mutation samples.

22 Applications
Methods; p-value ≤; Datasets: Gender (C1), Gender (C2), Leukemia (C1), P53 (C2).
Hotelling's $T^2$: 0.01 5518240, 0.05 7918278; PCA: 0.01 3118047, 0.05 3118178; SAM-GS: 0.01 4218230, 0.05 4518249; ANCOVA: 0.01 571717, 0.05 172617616; Global: 0.01 571718, 0.05 222817518; GSEA: 0.01 4359, 0.05 68826; MaxMean: 0.01 5439, 0.05 17271533.
212 308 182 308

23 Real Example for Three-Group Comparison
GSA analysis of a breast cancer dataset* for 9 pathways from MANOVA and ANCOVA (Mansmann and Meister, 2005).

                                   p-values
Gene Set                    Size   Wilks' Λ   ANCOVA
Androgen receptor signaling   72   0.0005     0.0000
Apoptosis                    187   0.0015     0.0005
Cell cycle control            31   0.0000
Notch delta signaling         34   0.0025     0.0545
p53 signaling                 33   0.0000     0.0370
Ras signaling                266   0.0055     0.0000
Tgf beta signaling            82   0.0535     0.1165
Tight junction signaling     326   0.0000
Wnt signaling                176   0.0000     0.0005

* Three tumor grades: 1, 2, and 3, with sample sizes 11, 25, and 60.

24 Leukemia Dataset: gene set #38
Gene Set Enrichment Plot. There are 84 genes. $T_{OLS}$ = 0.004; $T^2$, ANCOVA, and SAM-GS > 0.05.

25 Leukemia Dataset: gene set #60
Gene Set Enrichment Plot. There are 84 genes. $T_{OLS}$ = 0.072; $T^2$, ANCOVA, and SAM-GS < 0.001.

26 P53 Dataset: gene set #264
Gene Set Enrichment Plot. There are 17 genes. $T_{OLS}$, $T^2$, and SAM-GS > 0.05.
* 17 normal and 33 mutation samples; 10,100 genes in 308 gene sets.

27 Gene Set Biplot

28 Framework of Biological Classification
1. Construct a classifier with continuous decision output based on the genes within each gene set.
2. Construct a measure to evaluate the discrimination ability of gene sets.
3. Identify gene sets with high discrimination power.
4. Provide an importance measure to assess the impact of each individual gene within a gene set.

29 AUC for Evaluation of Classification
- The AUC provides a useful summary of the performance of a classifier in terms of sensitivity and specificity.
- The empirical AUC (the Wilcoxon-Mann-Whitney U-statistic) is a consistent estimate of the AUC.
- Differential gene sets are identified using the linear combination of genes with maximum AUC.
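
The empirical AUC can be computed directly as a Mann-Whitney pair count. The sketch below is a generic illustration: the function name is hypothetical, and counting ties as 1/2 is a common convention assumed here rather than taken from the slide.

```python
def empirical_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive sample
    has the higher classifier score; ties contribute 1/2."""
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))

print(empirical_auc([3.0, 4.0], [1.0, 2.0]))  # perfect separation: 1.0
print(empirical_auc([1.0, 3.0], [2.0, 2.0]))  # two of four pairs ordered correctly: 0.5
```

An AUC of 0.5 corresponds to a score that carries no information about the phenotype, and 1.0 to perfect separation.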

30 Linear Combination of Genes
- Construct a linear classifier by using a linear combination of the $n_k$ genes in each gene set $k$.
- Find the linear combination by maximizing the empirical AUC.
- A sigmoid smoothing function is used to approximate the step function in the empirical AUC.
- An "optimal" linear combination coefficient vector is then obtained for each gene set $k$.
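
The idea of maximizing a sigmoid-smoothed AUC over linear combinations can be sketched with plain gradient ascent. This is not the PTIFS algorithm used in the talk. The bandwidth `h`, the step size, the unit-norm constraint on the coefficients, and all function names are assumptions made for illustration.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def score(beta, x):
    """Linear combination of the gene expressions in x."""
    return sum(b * v for b, v in zip(beta, x))

def smoothed_auc(beta, pos, neg, h=0.1):
    """Empirical AUC with the step function replaced by sigmoid(t / h)."""
    total = sum(sigmoid((score(beta, x) - score(beta, z)) / h)
                for x in pos for z in neg)
    return total / (len(pos) * len(neg))

def empirical_auc(beta, pos, neg):
    """Mann-Whitney pair count for the scores beta'x; ties count 1/2."""
    wins = sum((score(beta, x) > score(beta, z))
               + 0.5 * (score(beta, x) == score(beta, z))
               for x in pos for z in neg)
    return wins / (len(pos) * len(neg))

def fit_direction(pos, neg, h=0.1, step=0.5, n_iter=200):
    """Gradient ascent on the smoothed AUC, renormalizing beta to unit length
    after each step because the AUC does not depend on the scale of beta."""
    p = len(pos[0])
    beta = [1.0 / math.sqrt(p)] * p
    n_pairs = len(pos) * len(neg)
    for _ in range(n_iter):
        grad = [0.0] * p
        for x in pos:
            for z in neg:
                diff = [a - b for a, b in zip(x, z)]
                s = sigmoid(score(beta, diff) / h)
                weight = s * (1.0 - s) / h   # derivative of sigmoid(t / h)
                for k in range(p):
                    grad[k] += weight * diff[k]
        beta = [b + step * g / n_pairs for b, g in zip(beta, grad)]
        norm = math.sqrt(sum(b * b for b in beta))
        beta = [b / norm for b in beta]
    return beta

# Toy gene set with two genes: gene 0 separates the groups, gene 1 is noise.
pos = [[2.0, 0.1], [2.5, -0.2], [3.0, 0.3]]   # diseased samples
neg = [[0.0, 0.2], [0.5, -0.1], [1.0, 0.0]]   # normal samples
beta = fit_direction(pos, neg)
print(beta, smoothed_auc(beta, pos, neg), empirical_auc(beta, pos, neg))
```

The fitted coefficients play the role of the within-gene-set weights: a larger absolute coefficient indicates a gene that contributes more to the discrimination.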

31 Linear Combination of Genes (cont'd)
- The parsimonious threshold-independent feature selection (PTIFS) algorithm is used to find the "optimal" coefficients (Wang et al., 2007).
- PTIFS selects features through an iterative inclusion-exclusion updating algorithm, starting from an anchor gene.
- The impacts of individual genes within a gene set are distinguished via the corresponding linear combination coefficients.
- Both gene sets and individual genes can be evaluated simultaneously.

32 Assessment of Gene Set Significance
- Define a statistic for each gene set $k$, using an estimate of the relevant standard deviation.
- The significance of a gene set is determined by its p-value, $p_k$.
- A permutation-based approach is used to calculate the empirical p-value for each gene set.
- Top-ranking gene sets can be identified according to $p_k$.

33 Simulation Study
- Compare the proposed approach with pathwayRF, ls-SVM, ANCOVA, and Global Test in terms of type I error and power for identifying differential gene sets.
- Simulate 20 gene sets with 100 genes in each gene set from a multivariate normal distribution MVN($\mu$, $\Sigma$). In the alternative model, only the first 10 gene sets have discriminant ability, and only the first 100γ genes in these gene sets are differentially expressed, with effect size δ.
- The training sample sizes are m = n = 50 and the testing sample sizes are m = n = 20 for the diseased and normal groups.
- All simulations are repeated 200 times.

34 Results of Simulation Study
Type I error of three GSA methods on testing the significance of gene sets.
Design: 100 genes were simulated from MVN($\mu$, $\Sigma$); $\mu$ were generated from U[0,10] and $\sigma_{ii}$ were generated from U[0.1,10].

35 Simulation Results Average fitting and predicting error rates of top-10 gene sets selected by AUC cv and pathwayRF

36 Simulation Results Within-gene-set coefficients obtained by our proposed approach ρ=0

37 Simulation Results Within-gene-set coefficients obtained by our proposed approach ρ=0.5

38 Real Examples
212318308
AUC method (AUC, $err_{cv}$); pathwayRF ($err_{OOB}$, $err_{cv}$)

39 Gene Set Plot for Gender(C2) Dataset

40 Real Examples Average five-fold cross-validation error rates of the top-ten gene sets identified by AUC cv and pathwayRF.

41 Annotation-Based Classification Classification results of the top 20 gene sets identified by AUC cv and pathwayRF in the P53 dataset.

42 Discussion and Summary -1
- Genes in a gene set (or gene sets) are functionally related and are not independent.
- The complex structure of gene interactions in a gene set is not fully captured by univariate approaches.
- The MANOVA test accounts for the correlation structure among genes, and the shrinkage covariance matrix estimator accounts for the high-dimensional issue.
- Advantages of our AUC-based method:
  - Assesses the discrimination ability of gene sets.
  - Quantifies the impact of individual genes within a gene set.
  - Constructs an ensemble classification of linear combinations of "gene sets" by applying the PTIFS algorithm.

43 Discussion and Summary -2
- Multiple comparison adjustment for multiple tests of gene sets is not discussed in this study.
- Gene set correlation analysis and visualization methods:
  - Analysis of the relationships among gene sets and of the genes shared by different gene sets.
  - Visualization of inter- and intra-gene-set information for the identified gene sets.

44 Gene-set Network Model
Graphical Gaussian models (GGMs) are used to understand regulatory interactions among gene sets.

45 Thank You!

