Carlo Colantuoni – ccolantu@jhsph.edu Summer Inst. Of Epidemiology and Biostatistics, 2009: Gene Expression Data Analysis 8:30am-12:00pm in Room W2017.


Class Outline
Basic Biology & Gene Expression Analysis Technology
Data Preprocessing, Normalization, & QC
Measures of Differential Expression
Multiple Comparison Problem
Clustering and Classification
The R Statistical Language and Bioconductor
GRADES: independent project with Affymetrix data.

Class Outline - Detailed
Basic Biology & Gene Expression Analysis Technology
  The Biology of Our Genome & Transcriptome
  Genome and Transcriptome Structure & Databases
  Gene Expression & Microarray Technology
Data Preprocessing, Normalization, & QC
  Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
  Background correction (PM-MM, RMA, GCRMA)
  Global Mean Normalization
  Loess Normalization
  Quantile Normalization (RMA & GCRMA)
  Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
  Quality Control: PCA and MDS for dimension reduction
Measures of Differential Expression
  Basic Statistical Concepts
  T-tests and Associated Problems
  Significance analysis in microarrays (SAM) [& Empirical Bayes]
  Complex ANOVAs (limma package in R)
Multiple Comparison Problem
  Bonferroni
  False Discovery Rate Analysis (FDR)
Differential Expression of Functional Gene Groups
  Functional Annotation of the Genome
  Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
  Gene Set Enrichment Analysis (GSEA)
  Parametric Analysis of Gene Set Enrichment (PAGE)
  geneSetTest
Notes on Experimental Design
Clustering and Classification
  Hierarchical clustering
  K-means
  Classification: LDA (PAM), kNN, Random Forests
  Cross-Validation
Additional Topics
  The R Statistical Language
  Bioconductor
  Affymetrix data processing example!

DAY #3: Measures of Differential Expression:
  Review of basic statistical concepts
  T-tests and associated problems
  Significance analysis in microarrays (SAM) (Empirical Bayes)
  Complex ANOVAs (“limma” package in R)
Multiple Comparison Problem:
  Bonferroni
  FDR
Differential Expression of Functional Gene Groups
Notes on Experimental Design

Slides from Rob Scharpf

Fold-Change? T-Statistics?
Some genes are more variable than others

[Figure: two panels, each titled “distribution of …” (panel titles truncated).] Slides from Rob Scharpf

X1-X2 is normally distributed if X1 and X2 are normally distributed – is this the case in microarray data? Slides from Rob Scharpf

Problem 1: T-statistic not t-distributed
Implication: p-values and inference are incorrect. Explain the QQ-plot: the horizontal axis shows the theoretical quantiles of a normal(0,1) distribution; the vertical axis shows the t-statistics from the gene expression experiment. The t-statistics at the smaller quantiles of the observed distribution lie above the identity line: they are bigger than what we’d expect to see if the t-statistics were N(0,1) distributed. Similarly for the biggest quantiles. Because the observed t-statistics are not t-distributed, p-values and inferences are not correct.

P-values by permutation
It is common that the assumptions used to derive the statistics do not hold well enough to yield useful p-values (e.g., when T-statistics are not t-distributed). An alternative is to use permutations.

p-values by permutations
We focus on one gene only. For the bth iteration, b = 1, …, B:
  Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the remaining n2 as “controls”.
  Calculate the corresponding two-sample t-statistic, tb.
After all B permutations are done: p = #{ b : |tb| ≥ |tobserved| } / B
This does not yet address the issue of multiple tests!
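The permutation recipe above can be sketched in a few lines of Python. This is a minimal illustration for a single gene, not code from the course: the function names `t_stat` and `permutation_p_value`, the pooled-variance form of the t-statistic, and the fixed random seed are all assumptions made for the example.

```python
import random
from statistics import mean, variance

def t_stat(a, b):
    """Two-sample t-statistic using the pooled (equal-variance) estimate."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / n1 + 1 / n2)) ** 0.5

def permutation_p_value(treat, ctrl, B=2000, seed=0):
    """p = #{b : |t_b| >= |t_observed|} / B for one gene."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    t_obs = t_stat(treat, ctrl)
    x = list(treat) + list(ctrl)
    n1 = len(treat)
    hits = 0
    for _ in range(B):
        rng.shuffle(x)                 # permute the n data points
        t_b = t_stat(x[:n1], x[n1:])   # first n1 = "treatments", rest = "controls"
        if abs(t_b) >= abs(t_obs):
            hits += 1
    return hits / B

# Hypothetical log2 expression values for one gene
p = permutation_p_value([8.1, 7.9, 8.4, 8.0, 8.3], [6.9, 7.2, 7.0, 7.1, 6.8])
```

With only a handful of samples per group the number of distinct permutations is small, so the smallest attainable p-value is limited by the design rather than by B.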

Another problem with t-tests
The volcano plot shows, for a particular test, negative log p-value against the effect size (M).

Remember this? The variability in expression measures depends on the gene.

Problem 2: the t-statistic is bigger for genes with smaller standard error estimates. Implication: ranking might not be optimal. High-throughput data is nice because we get to see problems. On the left are log2 intensity ratios (M) plotted against the estimated standard error (sigma-hat). Note that the standard errors tend to be larger for bigger |M|. Small standard errors in the denominator inflate the t-statistic for genes with small M. Does the t-test depend on sigma? Volcano plot. Put the red, blue dot plot for 3 genes here.

Problem 2: with low N’s, SD estimates are unstable. Solutions:
  Significance Analysis in Microarrays (SAM)
  Empirical Bayes methods and Stein estimators
We won’t talk about Stein in detail. Suppose you had 1 sample and wanted to estimate mu_1, …, mu_G, and X_1, …, X_G are the maximum-likelihood estimators. Stein showed that their total squared Euclidean error is larger than it needs to be. One could instead calculate (1-w)*x + w*xbar, where the weight w lies between 0 and 1 and depends on how far the x values are from the mean. Values further from the mean are shrunken toward the sample mean.
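The shrinkage formula (1-w)*x + w*xbar can be tried out directly. The slide does not say how w is chosen, so the sketch below swaps in the positive-part James-Stein weight for shrinkage toward the grand mean; that choice, the function name `shrink_toward_mean`, and the assumption of a known common sampling variance `sigma2` are illustrative only.

```python
from statistics import mean

def shrink_toward_mean(x, sigma2):
    """Shrink G estimates toward their mean: (1 - w) * x + w * xbar.

    w is the positive-part James-Stein weight (an assumption of this sketch);
    sigma2 is an assumed known sampling variance shared by every x[g].
    """
    G = len(x)
    if G < 4:
        raise ValueError("shrinkage toward the mean needs at least 4 estimates")
    xbar = mean(x)
    ss = sum((v - xbar) ** 2 for v in x)       # spread of the raw estimates
    w = 1.0 if ss == 0 else min(1.0, (G - 3) * sigma2 / ss)
    return [(1 - w) * v + w * xbar for v in x]

raw = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
shrunk = shrink_toward_mean(raw, sigma2=1.0)
```

Every estimate moves toward xbar by the same fraction w, so the value furthest from the mean (10.0 here) moves the most in absolute terms.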

Significance analysis in microarrays (SAM)
A clever adaptation of the t-ratio to borrow information across genes. Implemented in Bioconductor in the siggenes package. Reference: “Significance analysis of microarrays applied to the ionizing radiation response”, Tusher et al., PNAS 2001.

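SAM d-statistic
For gene i: $d_i = \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{s_i + s_0}$
where $\bar{x}_{i1}$ is the mean of sample 1, $\bar{x}_{i2}$ is the mean of sample 2, $s_i$ is the standard deviation of repeated measurements for gene i, and $s_0$ is the exchangeability factor estimated using all genes.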

The exchangeability factor is chosen to minimize the average CV (coefficient of variation) of d across all genes.
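To see what the exchangeability factor does, the d-statistic can be computed by hand for one gene. This is a rough sketch, not the siggenes implementation: the function name `sam_d` is hypothetical, the gene-specific scatter is computed as a pooled standard error (an assumption of this sketch), and the exchangeability factor `s0` is simply passed in rather than estimated from all genes.

```python
from statistics import mean

def sam_d(group1, group2, s0):
    """Relative difference d = (mean1 - mean2) / (s_i + s0) for one gene."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pooled within-group sum of squares (assumed form of s_i)
    ss = sum((v - m1) ** 2 for v in group1) + sum((v - m2) ** 2 for v in group2)
    s_i = ((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / (s_i + s0)

# A gene with a small difference and a very small variance:
quiet_gene = ([1.00, 1.20, 1.10], [1.00, 1.05, 0.95])
d_plain = sam_d(*quiet_gene, s0=0.0)   # ordinary t-like ratio
d_sam = sam_d(*quiet_gene, s0=0.1)     # damped by the exchangeability factor
```

Adding s0 to the denominator keeps genes with tiny variance estimates from dominating the ranking, which is the problem described for the ordinary t-statistic.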

A fix for this problem: scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements. Random fluctuations in the data were measured by balanced permutations (for cell lines 1 and 2).
  A) Relative difference between irradiated and unirradiated states.
  B) Relative difference between cell lines 1 and 2.
  C) Relative difference between hybridizations A and B.
  D) Relative difference for a permutation of the data that was balanced between cell lines 1 and 2.
Note: the variance of d_i appears independent of gene expression.

SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.

Selected genes: Beyond expected distribution

eBayes: Borrowing Strength
An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes. Empirical Bayes gives us a formal way of doing this: “shrinkage” of variance estimates toward a “prior” yields moderated t-statistics, which eliminates extreme statistics due to small variances. Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple two-sample designs.
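The effect of shrinking a variance estimate toward a prior can be illustrated with a short calculation. This sketch is not limma's code: the function name `moderated_t` is hypothetical, and the prior variance `prior_s2` and prior degrees of freedom `prior_df` are supplied by hand here, whereas limma estimates them from all genes.

```python
def moderated_t(diff, s2, df, n1, n2, prior_s2, prior_df):
    """t-statistic using a variance shrunk toward a prior value.

    The posterior variance is a degrees-of-freedom weighted average of the
    gene's own variance s2 and the prior variance prior_s2.
    """
    s2_post = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    return diff / (s2_post * (1 / n1 + 1 / n2)) ** 0.5

# A gene with an implausibly small variance estimate from 3 vs 3 samples
ordinary = moderated_t(1.0, s2=0.0001, df=4, n1=3, n2=3, prior_s2=0.04, prior_df=0)
moderated = moderated_t(1.0, s2=0.0001, df=4, n1=3, n2=3, prior_s2=0.04, prior_df=4)
```

Setting `prior_df=0` recovers the ordinary t-statistic, so the two calls differ only in how much strength is borrowed from the other genes.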

The Multiple Comparison Problem
Also called the multiple testing problem (some slides courtesy of John Storey)

Hypothesis Testing
Test for each gene. Null hypothesis: no differential expression. Two types of errors can be committed:
  Type I error or false positive: say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis.
  Type II error or false negative: fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis.

Hypothesis testing Once you have a given score for each gene, how do you decide on a cut-off? p-values are most common. How do we decide on a cut-off when we are looking at many 1000’s of “tests”? Are 0.05 and 0.01 appropriate? How many false positives would we get if we applied these cut-offs to long lists of genes?

Multiple Comparison Problem
Even if we have good approximations of our p-values, we still face the multiple comparison problem. When performing many independent tests, p-values no longer have the same interpretation.

Bonferroni Procedure
α = 0.05, # tests = 1000: compare each p-value to α = 0.05 / 1000 = 0.00005, or equivalently compare the adjusted p-value p * 1000 to 0.05. The Bonferroni procedure says that we reject all hypotheses with p-values below this threshold. Think of T(p;alpha) as a threshold. Any genes with p-values below this threshold we call statistically significant.

Bonferroni Procedure
Too conservative. How else can we interpret many 1000’s of observed statistics? Instead of evaluating each statistic individually, can we assess a list of statistics? FDR (Benjamini & Hochberg 1995).
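The contrast between the two corrections is easy to see on a small list of p-values. The sketch below implements the Bonferroni adjustment and the Benjamini & Hochberg (1995) step-up adjustment; the function names and the example p-values are made up for illustration.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                      # walk from largest p down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.01, 0.02, 0.5]
n_bonf = sum(p <= 0.05 for p in bonferroni(pvals))        # genes kept by Bonferroni
n_bh = sum(p <= 0.05 for p in benjamini_hochberg(pvals))  # genes kept at FDR 0.05
```

On this toy list Bonferroni keeps fewer genes than the FDR-based adjustment at the same nominal level, which is the sense in which it is conservative.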

FDR
Null = Equivalent Expression; Alternative = Differential Expression
Given a cut-off statistic, FDR gives us an estimate of the proportion of hits in our list of differentially expressed genes that are false.

False Discovery Rate
The “false discovery rate” measures the proportion of false positives among all genes called significant: FDR = (# false positives) / (# genes called significant). This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives. The false discovery rate gives an estimate of the rate at which further biological verification will result in dead-ends.

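Distribution of Statistics
[Figure: distributions of the permuted and observed statistics. x-axis: statistic.]

Distribution of Statistics
[Figure: the same distributions with a cut-off. FDR = False Pos. / Total Pos. = Permuted / Observed, counting statistics beyond the cut-off.]

Distribution of p-values
[Figure: distributions of the observed and permuted p-values. x-axis: p-value.]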

FDR = False Positives/Total Positive Calls This FDR analysis requires enough samples in each condition to estimate a statistic for each gene: observed statistic distribution. And enough samples in each condition to permute many times and recalculate this statistic: null statistic distribution. What if we don’t have this?
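The ratio of permuted to observed calls can be written out as a small function. This is a simplified sketch of the idea on the slide, not a particular package's estimator: the name `estimate_fdr`, the symmetric cut-off on the absolute statistic, and the toy numbers are assumptions for the example.

```python
def estimate_fdr(observed, permuted_sets, cutoff):
    """Estimate FDR = (false positives) / (total positive calls) at a cut-off.

    observed: one statistic per gene from the real labelling.
    permuted_sets: list of statistic lists, one per permutation of the labels.
    """
    total_pos = sum(1 for s in observed if abs(s) >= cutoff)
    if total_pos == 0:
        return 0.0                      # nothing called, nothing false
    # Average number of null statistics beyond the cut-off per permutation
    false_pos = sum(
        sum(1 for s in perm if abs(s) >= cutoff) for perm in permuted_sets
    ) / len(permuted_sets)
    return min(1.0, false_pos / total_pos)

observed = [3.1, -2.8, 0.2, 0.5, -0.1, 2.9, 0.3, -0.4, 4.0, 0.0]
permuted = [
    [0.1, -2.6, 0.4, 0.0, -0.3, 1.1, 0.2, -0.9, 0.7, 0.5],
    [0.3, 0.2, -1.4, 0.6, 0.1, -0.2, 0.9, 0.0, -0.5, 1.2],
]
fdr = estimate_fdr(observed, permuted, cutoff=2.5)  # 0.5 false / 4 called = 0.125
```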

FDR = 0.05 Beyond ±0.9

False Positive Rate versus False Discovery Rate
False positive rate is the rate at which truly null genes are called significant. False discovery rate is the rate at which significant genes are truly null.

False Positive Rate and P-values
The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate). The p-value is defined to be the minimum false positive rate at which the statistic can be called significant. It can be described as the probability that a truly null statistic is “as or more extreme” than the observed one. So if you’re happy with an FPR of 0.05, corresponding to 0.05 × 10,000 = 500 false positives in a typical microarray, then you …

False Discovery Rate and Q-values
The q-value is a measure of significance in terms of the false discovery rate. The q-value is defined to be the minimum false discovery rate at which the statistic can be called significant. It can be described as the probability that a statistic “as or more extreme” is truly null.

Power and Sample Size Calculations are Hard
Need to specify:
  α (Type I error rate, false positives) or FDR
  σ (stdev: will be sample- and gene-specific)
  Effect size (how do we estimate?)
  Power (1-β, β = Type II error rate)
  Sample size
Some papers:
  Mueller, Parmigiani et al. JASA (2004)
  Rich Simon’s group, Biostatistics (2005)
  Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics 2006 Mar 2;7:106.

Beyond Individual Genes: Functional Gene Groups
Borrow statistical power across entire dataset
Integrate preexisting biological knowledge
Beyond threshold enrichment

Functional Annotation of Lists of Genes
KEGG PFAM SWISS-PROT GO DRAGON DAVID/EASE MatchMiner BioConductor (R)

Gene Cross-Referencing and Gene Annotation Tools In BioConductor
(in the R statistical language) annotate package Microarray-specific “metadata” packages DB-specific “metadata” packages AnnBuilder package

Annotation Tools In BioConductor:
annotate package Functions for accessing data in metadata packages. Functions for accessing NCBI databases. Functions for assembling HTML tables.

Annotation for Commercial Microarrays Array-specific metadata packages

Functional Annotation with other DB’s GO metadata package

Functional Annotation with other DB’s KEGG metadata package

Is there enrichment in our list of differentially expressed genes for a particular functional gene group or pathway?

Threshold Enrichment: One Way of Assessing Differential Expression of Functional Gene Groups
The lower.tail argument of the hypergeometric test call (presumably R’s phyper) indicates whether you are looking for over- or under-representation of differentially expressed genes within a particular functional group (use lower.tail=F for over-representation).
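For readers who want to check a threshold-enrichment p-value by hand, the upper tail of the hypergeometric distribution can be summed directly. The sketch below is a plain Python illustration; the function name `hypergeom_upper_tail` and its argument names are hypothetical. Assuming the slide's call is R's `phyper`, the same quantity would be `phyper(k - 1, set_size, n_total - set_size, n_selected, lower.tail = FALSE)`, because with `lower.tail = FALSE` `phyper` returns P(X > q) rather than P(X ≥ q).

```python
from math import comb

def hypergeom_upper_tail(k, set_size, n_selected, n_total):
    """P(X >= k): chance of seeing k or more genes from a functional group
    of size set_size among n_selected genes drawn from n_total genes."""
    denom = comb(n_total, n_selected)
    upper = min(set_size, n_selected)
    return sum(
        comb(set_size, j) * comb(n_total - set_size, n_selected - j)
        for j in range(k, upper + 1)
    ) / denom

# Toy numbers: 10 genes on the array, 4 in the group, 3 called significant,
# 2 of those 3 belong to the group.
p_over = hypergeom_upper_tail(k=2, set_size=4, n_selected=3, n_total=10)  # 40/120
```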

Can we use more of our data than Threshold Enrichment (that only uses the top of our gene list)?

Functional Gene Subgroups within An Experiment
Swiss-Prot EXP#1 PFAM KEGG

Statistics for Analysis of Differential Expression of Gene Subgroups
Is THIS … … Different from THIS?

Over-Expression of a Group of Functionally Related Genes
p<7.42e-08 T statistic

Statistical Tests: Is THIS … Different from THIS?
Conceptually distinct from Threshold Enrichment and the Hypergeometric test! Statistical tests:
  χ²
  Kolmogorov-Smirnov
  Product of Probabilities
  GSEA
  PAGE
  geneSetTest (Wilcoxon rank sum)

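χ²
χ² is the sum of D values over histogram bins, where $D = \dfrac{(O - E)^2}{E}$.
[Figure: histograms of All Genes (expected counts, E) and the Subset of Interest (observed counts, O), using the same histogram bins.]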
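The binned comparison can be sketched as follows. This is an illustration under assumptions, not the course's code: the function name `chi_square_statistic` is hypothetical, the expected count E in each bin is taken to be the all-genes proportion scaled to the subset size, and bins with E = 0 are skipped.

```python
def chi_square_statistic(all_stats, subset_stats, edges):
    """Sum of (O - E)^2 / E over histogram bins defined by edges."""
    def bin_counts(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for b in range(len(counts)):
                last = b == len(counts) - 1
                # Half-open bins [lo, hi); the last bin also includes its upper edge
                if edges[b] <= v < edges[b + 1] or (last and v == edges[-1]):
                    counts[b] += 1
                    break
        return counts

    all_counts = bin_counts(all_stats)
    observed = bin_counts(subset_stats)
    n_all, n_sub = sum(all_counts), sum(observed)
    chi2 = 0.0
    for o, a in zip(observed, all_counts):
        e = n_sub * a / n_all          # expected count if the subset were typical
        if e > 0:
            chi2 += (o - e) ** 2 / e
    return chi2

all_genes = [0.5] * 10 + [1.5] * 10    # half the genes in each of two bins
subset = [1.5] * 4                     # the whole subset falls in the upper bin
chi2 = chi_square_statistic(all_genes, subset, edges=[0, 1, 2])
```

Here E = 2 in each bin and O = (0, 4), so the statistic is 2 + 2 = 4.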

Kolmogorov-Smirnov All Genes Subset of Interest

Product of Individual Probabilities
All Genes Subset of Interest

Statistics from gene subgroup
What shape/type of distributions would each of these tests be sensitive to? All statistics Statistics from gene subgroup

Gene Set Enrichment Analysis (GSEA)
Subramanian et al, 2005 PNAS

Gene Set Enrichment Analysis (GSEA)

Parametric Analysis of Gene Set Enrichment (PAGE)
Kim et al, 2005 BMC Bioinformatics

Parametric Analysis of Gene Set Enrichment (PAGE)

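$Z = \dfrac{S_m - \mu}{\sigma / m^{0.5}}$, where $S_m$ is the mean statistic of the m genes in the gene set, and $\mu$ and $\sigma$ are the mean and standard deviation of the statistic across all genes.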
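The Z score is simple enough to compute directly. In this sketch the function name `page_z` is hypothetical, and using the population standard deviation (`pstdev`) for σ is an assumption; the slide does not say which estimator is used.

```python
from statistics import mean, pstdev

def page_z(all_stats, set_stats):
    """Z = (S_m - mu) / (sigma / sqrt(m)) for one gene set."""
    mu = mean(all_stats)               # mean statistic over all genes
    sigma = pstdev(all_stats)          # spread of the statistic over all genes
    m = len(set_stats)                 # gene set size
    return (mean(set_stats) - mu) * m ** 0.5 / sigma

all_stats = [-1, 1] * 50               # toy statistics: mean 0, SD 1
z = page_z(all_stats, [1, 1, 1, 1])    # (1 - 0) * sqrt(4) / 1 = 2.0
```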

A simple method in Bioconductor
geneSetTest (limma): test whether a set of genes is enriched for differential expression.
Usage: geneSetTest(selected,statistics,alternative="mixed",type="auto",ranks.only=TRUE,nsim=10000)
The test statistic used for the gene-set-test is the mean of the statistics in the set. If ranks.only is TRUE, then only the ranks of the statistics are used. In this case the p-value is obtained from a Wilcoxon test. If ranks.only is FALSE, then the p-value is obtained by simulation using nsim randomly selected sets of genes.
Argument: alternative = “mixed” or “either”: fundamentally different questions.
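The simulation route (ranks.only = FALSE) can be mimicked in a few lines to show the logic. This is only a sketch of the idea, not limma's implementation: the function name `gene_set_test_sim` is hypothetical, it tests a single one-sided question (is the set mean unusually high?), and the "+1" correction in the p-value is a choice made for this example.

```python
import random
from statistics import mean

def gene_set_test_sim(selected, stats, nsim=2000, seed=0):
    """Simulation p-value for the mean statistic of a gene set.

    selected: indices of the genes in the set.
    stats: one statistic per gene (e.g. t-statistics).
    """
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = mean(stats[i] for i in selected)
    m = len(selected)
    # Count random gene sets of the same size whose mean is at least as large
    hits = sum(
        1 for _ in range(nsim)
        if mean(rng.sample(stats, m)) >= observed
    )
    return (hits + 1) / (nsim + 1)

stats = [float(i) for i in range(100)]              # toy statistics
p_top = gene_set_test_sim([95, 96, 97, 98, 99], stats)
p_bottom = gene_set_test_sim([0, 1, 2, 3, 4], stats)
```

A set made of the five largest statistics gets a very small p-value, while a set made of the five smallest gets p = 1 for this one-sided question.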

Wilcoxon test

Analysis of Gene Networks

Large Protein Interaction Network
[Figure sequence: within the large protein interaction network, the networks of interest regulated in Sample #1, Sample #2, and Sample #3 are highlighted in turn.]

Additional Notes on Experimental Design

Old-School Experimental Design: Randomization

Biological Replicates
Replicates in a mouse model: Dissection of tissue Biological Replicates RNA Isolation Amplification Technical Replicates Probe labelling Hybridization

Common question in experimental design
Should I pool mRNA samples across subjects in an effort to reduce the effect of biological variability (or cost)?

Two simple designs
The following two designs have roughly the same cost:
  3 individuals, 3 arrays
  Pool of three individuals, 3 technical replicates
To a statistician the second design seems obviously worse. But I found it hard to convince many biologists of this. What about 3 pools of 3 animals on individual arrays?

Cons of Pooling Everything
You cannot measure within-class variation
Therefore, no population inference is possible
Mathematical averaging is an alternative way of reducing variance*
Pooling may have non-linear effects
You cannot take the log before you average: E[log(X+Y)] ≠ E[log(X)] + E[log(Y)]
You cannot detect outliers
*If the measurements are independent and identically distributed
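The point about logs and pooling can be checked numerically. In this sketch pooling is idealised as averaging the raw intensities before the log is taken, which is an assumption about how a physical pool behaves; the function names and the intensity values are made up.

```python
from math import log2

def pooled_log(values):
    """log2 of the averaged raw intensities (what an idealised pool measures)."""
    return log2(sum(values) / len(values))

def mean_of_logs(values):
    """Average of the individual log2 intensities (mathematical averaging)."""
    return sum(log2(v) for v in values) / len(values)

intensities = [2.0, 8.0]               # two individuals, raw scale
pooled = pooled_log(intensities)       # log2(5), about 2.32
averaged = mean_of_logs(intensities)   # (1 + 3) / 2 = 2.0
```

The two numbers agree only when all individuals have the same intensity; otherwise the pooled value is pulled toward the individual with the highest expression.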

Cons specific to microarrays
Different genes have dramatically different biological variances. Not measuring this variance will result in genes with larger biological variance having a better chance of being considered more important.

Higher variance: larger fold change
We compute fold change for each gene (Y axis). From 12 individuals we estimate gene-specific variance (X axis). If we pool we never see this variance.

Remember this? The variability in expression measures depends on the gene.

Useful Books:
  “Statistical analysis of gene expression microarray data” – Speed
  “Analysis of gene expression data” – Parmigiani
  “Bioinformatics and computational biology solutions using R” – Irizarry
