Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman.

Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman

Outline Normalization Expression variation Modeling the log Fold change Complex designs Shrinkage and empirical Bayes (limma) Multiple testing (False Discovery Rate)

Measuring expression

Platforms Microarrays RNAseq Common:  Need for normalization  Batch effects

Why normalization Some experimental factors cannot be completely controlled  Amount of material  Amount of degradation  Print tip differences  Quality of hybridization Effects are systematic Cause variation between samples and between batches

What is normalization? Normalization = An attempt to get rid of unwanted systematic variation by statistical means Note 1: this will never completely succeed Note 2: this may do more harm than good Much better, but often impossible Better control of the experimental conditions

How do normalization methods work? General approach 1.Assume: data from an ideal experiment would have characteristic A E.g. mean expression is equal for each sample Note: this is an assumption! 2.If the data do not have characteristic A, change the data such that the data now do have characteristic A E.g. Multiply each sample’s expression by a factor

Example: quantile normalization Assume: “Most probes are not differentially expressed” “As many probes are up and downregulated” Reasonable consequence: The distribution of the expression values is identical for each sample Normalization: Make the distribution of expression values identical for each sample

Quantile normalization in practice Choose a target distribution Typically the average of the measured distributions All samples will get this distribution after normalization Quantile normalization:  Replace the ith largest expression value in each sample by the ith largest value in the target distribution Consequence:  Distribution of expressions the same between samples  Expressions for specific genes may differ

Less radical forms of normalization Make the means per sample the same Make the medians the same Make the variances the same Loess curve smoothing Same idea, but less change to the data

Overnormalizing Normalizing can remove or reduce true biological differences  Example: global increase in expression Normalization can create differences that are not there  Example: almost global increase in expression Usually: normalization reduces unwanted variation

Batch effects Differences between batches are even stronger than between samples in the same batch Note: batch effects at several stages Normalization is not sufficient to remove batch- effects Methods available (comBat) but not perfect Best: avoid batch effects if possible

Confounding by batch Take care of batch-effects in experimental design Problem: confounding of effect of interest by batch effects Example: Golub data Solution: balance or randomize

Expression variation

Differential expression Two experimental conditions  Treated versus untreated Two distinct phenotypes  Tumor versus normal tissue Which genes can reliably be called differentially expressed? Also: continuous phenotypes  Which gene expressions are correlated with phenotype?

Variation in gene expression Technical variation  Variation due to measurement technique  Variability of measured expression from experiment to experiment on the same subject Biological variation  Variation between subjects/samples  Variability of “true” expression between different subjects Total variation  Sum of technical and biological variation

Reliable assessment Two samples always have different expression Maybe even a high fold change Due to random biological and technical variation Reliable assessment of differential expression: Show: fold change found cannot be explained by random variation

Assessment of differential expression Two interrelated aspects: Fold change:  How large is the expression difference found? P-value:  How sure are we that a true difference exists?

LIMMA: Linear models for gene expression

Modeling variation How does gene expression depend on experimental conditions? Can often be well modeled with linear models Limma:  linear models for microarray analysis  Gordon Smyth, W. and E. Hall Institute, Australia

Multiplicative scale effects Assumption: effects on gene expression work in a multiplicative way (“fold change”) Example: treatment increases gene expression of gene MMP8 by a factor 2  “2-fold increase” Treatment decreases gene expression of gene MMP8 by a factor 2  “2-fold decrease”

Multiplicative scale errors Assumption: variation on gene expression works in a multiplicative way A 2-fold increase by chance is just as likely as a 2-fold decrease by chance When true expression is 4, measuring 8 is as likely as measuring 2

Working on the log scale When effects are multiplicative, log-transform! Usual in microarray analysis: log to base 2 Remember: log(ab) = log(a)+log(b)  2 fold increase = +1 to log expression  2 fold decrease = -1 to log expression Log scale makes multiplicative effects symmetric  ½ and 2 are not symmetric around 1 (= no change)  -1 and +1 are symmetric around 0 (= no change)

A simple linear model Example: treated and untreated samples Model separately for each gene Log Expression of gene 1: E1 E1 = a + b * Treatment + error a: intercept = average untreated logexpression b: slope = treatment effect

Modeling all genes simultaneously E1 = a1 + b1 * Treatment + error E2 = a2 + b2 * Treatment + error … E20,000 = a20,000 + b20,000 * Treatment + error Same model, but Separate intercept and slope for each gene And separate sd sigma1, sigma2, … of error

Estimates and standard errors Gene 1: Estimates for a1, b1 and sigma1 Estimate of treatment effect of gene 1 b1 is the estimated log fold change standard error s.e.(b1) depends on sigma1 Regular t-test for H0: b1=0: T = b1/s.e.(b1) Can be used to calculate p-values. Just like regular regression, only 20,000 times

Back to original scale Log scale regression coefficient b1 Average log fold change Back to a fold change: 2^b1  b1= 1 becomes fold change 2  b1 = -1 becomes fold change 1/2

Confounders Other effects may influence gene expression Example: batch effects Example: sex or age of patients In a linear model we can adjust for such confounders

Flexibility of the linear model Earlier: E1 = a1 + b1 * Treatment + error Generalize: E1 = a1 + b1 * X + c1 * Y + d1 + Z + error Add as many variables as you need.

Variance shrinkage

Empirical Bayes So far: each gene on its own 20,000 unrelated models Limma: exchange information between genes “Borrowing strength” By empirical Bayes arguments

Estimating variance For each gene a variance is estimated Small sample size: variance estimate is unreliable  Too small for some genes  Too large for others Variance estimated too small: false positives Variance estimated too large: low power

Large and small estimated variance Gene with low variance estimate  Likely to have low true variance  But also: likely to have underestimated variance Gene with high variance estimate  Likely to have high true variance  But also: likely to have overestimated variance Limma’s idea:  Use information from other genes to assess whether variance is over/underestimated

True and estimated variance

Variance model Limma has a gene variance model All gene’s variances are drawn at random from an inverse gamma distribution Based on this model:  Large variances are shrunk downwards  Small variances are shrunk upwards

Effect of variance shrinkage Genes with large fold change and large variance  More power  More likely to be significant Genes with small fold change and small variance  Less power  Less likely to be significant

Limma and sample size Shrinkage of limma only effective for small sample size (< 10 samples/group) Added information of other genes becomes negligeable if sample size gets large Large samples: Doing limma is the same as doing regression per gene

Differential expression in RNAseq

RNAseq data: counts Gene idY1Y2Y3Y4Y5Y6Y7Y8Y9Y10 ENSG0000 0110514691781015810131165108701 ENSG0000 0086015115528688146845985860 ENSG0000 01158082851904672953455323694734235 ENSG0000 01697405021843631954032622253321363 ENSG0000 02158690700000200 ENSG0000 026160920317620251582318231 ENSG0000 01697444885294705051137373139235171921 ENSG0000 02158641000000000

Modelling count data Distinguish three types of variation  Biological variation  Technical variation  Count variation Count variation is important for low- expressed genes Generally biological variation most important

Overdispersion Modelling count data: two stages 1.Model how gene expression varies from sample to sample 2.Model how the observed count varies by repeated sequencing of the same sample Stage 2 is specific for RNAseq

Two approaches Approach 1: Model the count variation and the between-sample variation  edgeR  Deseq Approach 2: Normalize the count data and model only the biological variation  Voom + limma Approach 3: Model count variation only  Popular but very wrong!

Multiple testing

20,000 p-values Fitting 20,000 linear models Some variance shrinkage Result:  20,000 fold changes  20,000 p-values Which ones are truly differentially expressed?

Multiple testing Doing 20,000 tests: risk false positive 20,000 times If 5% of null hypotheses is significant, expect 1,000 significant by pure chance How to make sure you can really trust the results?

Bonferroni Classical way of doing multiple testing Call K the number of tests performed Bonferroni: significant = p-value < 0.05/K “Adjusted p-value”  Multiply all p-values by K, compare with 0.05

Advantages of Bonferroni Familywise error control  =Probability of making any type I error < 0.05 With 95% chance, list of differentially expressed genes has no errors Very strict Easy to do

Disadvantages of Bonferroni Very strict  “No” false positives  Many false negatives It is not a big problem to have a few false positives Do validation experiments later

False discovery rate (Benjamini and Hochberg) FDR = expected proportion of false discoveries among all discoveries Control of FDR at 0.05 means in the long run experiments average about 5% type I errors among the reported genes Percentage: longer lists of genes are allowed to have more errors

Benjamini and Hochberg by hand 1. Order the p-values small to large Example: 0.0031, 0.0034, 0.02, 0.10, 0.65 2. Multiply the k-th p-value by m/k, where m is the number of p-values, so 0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5 which becomes 0.0155, 0.0085, 0.033, 0.125, 0.65 3. If the p-values are no longer in increasing order, replace each p-value by the smallest p-value that is later in the list. In the example, we replace 0.0155 by 0.0085. The final Benjamini-Hochberg adjusted p-values become 0.0085, 0.0085, 0.033, 0.125, 0.65

FDR warnings FDR is susceptible to cheating How to cheat with FDR? Add many tests of known false null hypotheses… Result: reject more of the other null hypotheses

Example limma results

Conclusion

Testing for differentially expressed genes Repeated application of a linear model Include all factors in the model that may influence gene expression Limma: additional step “borrowing strength” Don’t forget to correct for multiple testing!

Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman.

Similar presentations

Presentation on theme: "Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman.

Similar presentations

Presentation on theme: "Statistical analysis of expression data: Normalization, differential expression and multiple testing Jelle Goeman."— Presentation transcript:

Similar presentations

About project

Feedback