Gene Set Analysis 09/24/07

From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we have identified 500 genes that are differentially expressed, then what do we do about it? Can we learn something about the underlying biological pathway?

Sometimes one cannot find a single gene that is differentially expressed, as the statistical criteria are too stringent and/or the data is too noisy. Can we still learn something useful from the microarray experiment?

(Mootha 2003)

Gene set A gene set contains genes that are functionally related. The gene set assignment is independent of the microarray data at hand. We want to know whether a gene set is differentially expressed. Functional annotation is usually obtained from the following sources. –Kyoto Encyclopedia of Genes and Genomes (KEGG): –Gene Ontology (GO):

KEGG KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for: –1. Metabolism 2. Genetic Information Processing 3. Environmental Information Processing 4. Cellular Processes 5. Human Diseases and also on the structure relationships (KEGG drug structure maps) in: –6. Drug Development –Website:

GO terms Ontologies are 'specifications of a relational vocabulary'. GO contains three structured vocabularies: cellular component, biological process and molecular function. GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context. Website:

Khatri and Draghici 2005

Li

Over-representation analysis Null hypothesis: The genes in S are at most as often differentially expressed as the genes in S^c.

            Differentially expressed   Not differentially expressed   Total
In S        O1 = a                     O2 = b                         a + b
In S^c      O3 = c                     O4 = d                         c + d
Total       a + c                      b + d                          n

Compare a/(a + b) with (a + c)/n.

Statistical significance –Chi-square test –Fisher’s exact test –Hypergeometric distribution
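The one-sided Fisher's exact test on the 2x2 table of the over-representation slide can be sketched as a hypergeometric tail sum. This is a minimal illustration using only the Python standard library; the function name and the example counts are hypothetical and not taken from the slides.

```python
from math import comb

def hypergeom_upper_tail(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a: differentially expressed genes in S
    b: non-differentially expressed genes in S
    c: differentially expressed genes in the complement of S
    d: non-differentially expressed genes in the complement of S
    Returns P(X >= a) when X counts differentially expressed genes in a
    random draw of a + b genes, with all table margins held fixed.
    """
    n = a + b + c + d
    in_set = a + b      # size of the gene set S
    de = a + c          # total number of differentially expressed genes
    upper = min(in_set, de)
    favorable = sum(
        comb(de, k) * comb(n - de, in_set - k) for k in range(a, upper + 1)
    )
    return favorable / comb(n, in_set)

# Hypothetical counts: 12 of 40 genes in S are differentially expressed,
# against 88 of 960 genes outside S.
p_value = hypergeom_upper_tail(12, 28, 88, 872)
```

A small p-value indicates that a/(a + b) is larger than expected from the overall rate (a + c)/n. For large counts the chi-square test gives a similar answer at lower cost.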

Testing multiple GO nodes simultaneously Determine the significance level for each node. Then adjust for multiple hypothesis testing: FWER; FDR; etc. (GOSurfer)
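The two adjustments named on this slide can be sketched as follows. The slide does not say which procedures are used, so this assumes Bonferroni for FWER control and Benjamini-Hochberg for FDR control; the function names and the example p-values are illustrative.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the original p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-node p-values for four GO nodes.
node_pvalues = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(node_pvalues)
fdr_adjusted = benjamini_hochberg(node_pvalues)
```

Neither procedure accounts for the fact that GO nodes overlap along the hierarchy, so the adjusted values should be read as approximate.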

Problems with using differentially expressed genes The result is sensitive to the criteria for calling genes differentially expressed. It is useless if the criteria are too stringent. Reducing a continuous variable to a binary variable loses useful quantitative information.

ErmineJ Called FCS in Pavlidis et al. The mean of –log(p-value) over all genes in a gene set is used as an aggregate score. Use a permutation test (permuting genes) to obtain the p-value corresponding to the aggregate score. Correct for multiple occurrences of a single gene. Adjust for multiple hypothesis testing by controlling the FDR.
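The aggregate score and its gene-permutation p-value can be sketched as below. This is not ErmineJ's implementation: it omits the correction for genes that occur more than once, and the function names, the number of permutations and the add-one p-value estimate are assumptions made for illustration.

```python
import math
import random

def aggregate_score(pvalues, gene_set):
    """Mean of -log(p-value) over the genes in the set."""
    return sum(-math.log(pvalues[g]) for g in gene_set) / len(gene_set)

def gene_permutation_pvalue(pvalues, gene_set, n_perm=10000, seed=0):
    """Compare the observed score with scores of random sets of equal size.

    pvalues: dict mapping gene name -> per-gene p-value
    gene_set: list of gene names, all present in pvalues
    """
    rng = random.Random(seed)
    genes = list(pvalues)
    observed = aggregate_score(pvalues, gene_set)
    size = len(gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, size)
        if aggregate_score(pvalues, random_set) >= observed:
            hits += 1
    # Add-one estimate so that the reported p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because the random sets are drawn from genes, this is a gene-sampling test: it treats genes as independent and says nothing about a new subject.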

ErmineJ

Permute genes Randomize genes or arrays?

Li Permute array labels

Interpretation of p-values In the gene-sampling setup (e.g., Chi-square test), inference is about a new sample of genes. The expression levels of different genes are assumed to be independent. In the subject-sampling setup (e.g., permutation test), inference is about a new subject. The labels of the subjects (treatment or control) are assumed to be independent. The expression levels of different genes may be correlated. It is more biologically meaningful to use subject-sampling methods.

Gene Set Enrichment Analysis (GSEA) Consider all genes instead of differentially expressed genes. Permute class labels Steps: –1: Calculation of an enrichment score (ES). –2: Estimation of significance level of ES. –3: Adjustment for multiple hypothesis testing. (Mootha 2003)

Basic idea: Rank the genes according to their p-value for being differentially expressed. If there is no correlation between gene expression and membership in gene set A or B, then the rank distributions for the two sets should be approximately equal.

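Enrichment Score Rank the genes by their p-values corresponding to the significance level of differential expression: $R_1, \ldots, R_N$. Following Mootha 2003, let $G$ be the number of genes in S and define $X_i = -\sqrt{G/(N-G)}$ if $R_i$ is not in S, and $X_i = \sqrt{(N-G)/G}$ if $R_i$ is in S. Then $ES(S) = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i$, that is, the maximum deviation from the expected running sum.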
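The running-sum enrichment score can be sketched in a few lines. The step sizes below are the ones used in Mootha et al. (2003), which this slide appears to follow; that choice, the function name and the toy gene names are assumptions, not something the transcript states.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum over a ranked gene list.

    ranked_genes: gene names ordered from most to least significant
    gene_set: iterable of gene names forming the set S
    """
    members = set(gene_set)
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in members)
    if g == 0 or g == n:
        raise ValueError("S must contain some, but not all, ranked genes")
    step_in = math.sqrt((n - g) / g)      # added when the gene is in S
    step_out = -math.sqrt(g / (n - g))    # added when the gene is not in S
    running = 0.0
    best = -math.inf
    for gene in ranked_genes:
        running += step_in if gene in members else step_out
        best = max(best, running)
    return best

ranked = list("abcdefgh")
es_top = enrichment_score(ranked, {"a", "b"})     # set at the top of the list
es_bottom = enrichment_score(ranked, {"g", "h"})  # set at the bottom
```

A set concentrated at the top of the ranking gives a large score, while a set at the bottom gives a score near zero, because the running sum only returns to zero at the end of the list.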

Why Unbiased Normalized
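One way to read the two properties named on this slide, assuming the running-sum steps of Mootha et al. (2003), namely $X_i = \sqrt{(N-G)/G}$ for the $G$ genes in S and $X_i = -\sqrt{G/(N-G)}$ for the other $N-G$ genes: the steps sum to zero over the whole list (unbiased), and their squares sum to $N$ (normalized), whatever the size of the gene set. The symbols $G$ and $X_i$ are part of that assumption.

```latex
\begin{align*}
\sum_{i=1}^{N} X_i
  &= G\sqrt{\frac{N-G}{G}} - (N-G)\sqrt{\frac{G}{N-G}}
   = \sqrt{G(N-G)} - \sqrt{G(N-G)} = 0, \\
\sum_{i=1}^{N} X_i^2
  &= G\cdot\frac{N-G}{G} + (N-G)\cdot\frac{G}{N-G}
   = (N-G) + G = N .
\end{align*}
```

The first identity means the running sum always returns to zero at the end of the list. The second gives every gene set the same total step variance, which makes scores of sets with different sizes more comparable.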

Permutation test of the significance of ES Randomly assign labels to samples, reorder genes, and recompute ES(S). Estimate the p-value by comparing the observed ES(S) with the ES(S) values computed from randomly shuffled data.
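The label-permutation procedure can be sketched as follows. Two things here are assumptions rather than content from the slides: genes are ranked by the difference in group means (a simple stand-in for the ranking statistic actually used), and the running-sum steps follow Mootha et al. (2003). All function names are illustrative.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def rank_genes(expr, labels):
    """Order genes by mean(class 1) - mean(class 0), largest first.

    expr: dict mapping gene name -> list of expression values per sample
    labels: list of 0/1 class labels, one per sample
    """
    def difference(values):
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        return mean(ones) - mean(zeros)
    return sorted(expr, key=lambda gene: difference(expr[gene]), reverse=True)

def enrichment_score(ranked_genes, members):
    """Maximum running sum; assumes S holds some but not all ranked genes."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in members)
    step_in = math.sqrt((n - g) / g)
    step_out = -math.sqrt(g / (n - g))
    running, best = 0.0, -math.inf
    for gene in ranked_genes:
        running += step_in if gene in members else step_out
        best = max(best, running)
    return best

def label_permutation_pvalue(expr, labels, gene_set, n_perm=1000, seed=0):
    """Return (observed ES, p-value) under random shuffles of the labels."""
    rng = random.Random(seed)
    members = set(gene_set)
    observed = enrichment_score(rank_genes(expr, labels), members)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign class labels to samples
        permuted = enrichment_score(rank_genes(expr, shuffled), members)
        if permuted >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Shuffling labels rather than genes keeps the correlation between genes intact, which is why this is a subject-sampling test. With very few samples the number of distinct labelings is small, so the attainable p-values are coarse.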

Multiple hypothesis testing Determine ES(S) for each gene set in the collection. For each S and 1000 fixed permutations π of the array labels, reorder the genes and determine ES(S, π). Adjust for variation in gene set size. Compute FDR.

Applications of GSEA Data –22,000 genes –43 subjects: 17 normal (NGT), 8 partially impaired, 18 diagnosed with disease (DM2) –Gene sets independently curated from literature No single gene is differentially expressed according to the stringent multiple hypothesis testing criteria.

Results from GSEA Select the gene set with maximum ES: (OXPHOS) Genes are consistently down-regulated, although the fold changes are moderate. Selected gene sets are biologically sensible, consistent with expectation.

Starting point for further analysis Apply clustering analysis to the selected gene set. Many genes in the gene set are co-regulated, suggesting they share similar functions.

A self-contained null hypothesis Null hypothesis: –Competitive version: The genes in G are at most as often differentially expressed as the genes in G^c. –Self-contained version: No genes in G are differentially expressed. “Self-contained” is more strict than “competitive”.

Drawback for comparing S against S^c This has been compared to a “zero-sum game”: gene classes are competing with each other. The stronger the evidence in support of differential expression is for one class, the weaker the evidence for differential expression is judged to be for a second class.

Not significant? Drawback for comparing S against S^c

H_comp vs H_self Advantages: –Self-consistent –When a large number of genes are differentially expressed, multiple pathways may be selected. Drawback: –Too aggressive. A gene class containing very few differentially expressed genes may not be biologically meaningful.

Hybrid methods Several aspects of different methods can be mixed, e.g. –Modify GSEA by using self-contained version to evaluate p-value. –Similar treatment to ErmineJ. (J.J.Goeman and P.Buhlmann 2006)

Multivariate analysis Let X_1 and X_2 be the expression levels for subject groups 1 and 2, restricted to a gene set containing q genes. The self-contained null hypothesis can be rephrased as: the q-dimensional mean expression vectors of the two groups (within the given gene set) are the same. Use multivariate hypothesis testing.

Hotelling’s T^2 Under the null hypothesis, T^2 (suitably scaled) follows an F-distribution. Multiple hypothesis testing is addressed by FDR control.
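The formula on this slide did not survive in the transcript. For reference, the standard two-sample Hotelling statistic is given below, in notation that is assumed here rather than taken from the slide: $n_1$ and $n_2$ are the group sizes, $\bar{X}_1$ and $\bar{X}_2$ the $q$-dimensional mean expression vectors, and $S$ the pooled covariance matrix built from the within-group covariances $S_1$ and $S_2$.

```latex
\begin{align*}
S &= \frac{(n_1 - 1)\,S_1 + (n_2 - 1)\,S_2}{n_1 + n_2 - 2}, \\
T^2 &= \frac{n_1 n_2}{n_1 + n_2}\,
       (\bar{X}_1 - \bar{X}_2)^{\top} S^{-1} (\bar{X}_1 - \bar{X}_2), \\
\frac{n_1 + n_2 - q - 1}{(n_1 + n_2 - 2)\,q}\; T^2
  &\sim F_{q,\; n_1 + n_2 - q - 1}
  \quad \text{under the null hypothesis.}
\end{align*}
```

The F result assumes multivariate normal data with a common covariance matrix, and it requires $n_1 + n_2 - 1 > q$ so that $S$ is invertible. That requirement is easily violated when a gene set is larger than the number of subjects.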

Dimension reduction Diagonalize the covariance matrix S and then project onto the principal components. Dimensions corresponding to very small eigenvalues are ignored.
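The equation on this slide is missing from the transcript. One standard way to write the reduction is sketched below, with assumed notation: $u_j$ and $\lambda_j$ are the eigenvectors and eigenvalues of $S$ in decreasing order, $k \le q$ is the number of components kept, and $n_1$, $n_2$, $\bar{X}_1$, $\bar{X}_2$ are the group sizes and mean vectors.

```latex
\begin{align*}
S &= U \Lambda U^{\top}, \qquad
  U = (u_1, \ldots, u_q), \quad
  \Lambda = \operatorname{diag}(\lambda_1 \ge \cdots \ge \lambda_q), \\
T_k^2 &= \frac{n_1 n_2}{n_1 + n_2}
  \sum_{j=1}^{k}
  \frac{\bigl(u_j^{\top}(\bar{X}_1 - \bar{X}_2)\bigr)^2}{\lambda_j}.
\end{align*}
```

With $k = q$ this equals the full $T^2$, since $S^{-1} = U \Lambda^{-1} U^{\top}$. Dropping the components with very small $\lambda_j$ avoids dividing by near-zero eigenvalues, which would otherwise let noise dominate the statistic.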

Results Figure 1 in Sek Kwon’s paper.