Differential Gene Expression Xiaole Shirley Liu STAT115 / STAT215
Identification of Diagnostic Genes: classical study of cancer subtypes, Golub et al. (1999)
Differential Expression: the naïve method is fold change, Avg(X) / Avg(Y). Note on scale. Natural scale: MAS4, MAS5, dChip. Log scale: RMA (log2); the difference of averages is a log fold change, so convert back with 2^x to get a ratio.
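A minimal sketch of the scale issue, assuming hypothetical expression values for one gene and log2-scale values like those RMA reports. The function names are illustrative only. On the log scale the back-transformed value is a ratio of geometric means, which matches the natural-scale ratio here only because every tumor value is exactly four times its normal counterpart.

```python
import math

def fold_change_natural(x, y):
    """Fold change for natural-scale values: Avg(X) / Avg(Y)."""
    return (sum(x) / len(x)) / (sum(y) / len(y))

def fold_change_from_log2(x_log2, y_log2):
    """Fold change for log2-scale values: 2 ** (difference of averages)."""
    diff = sum(x_log2) / len(x_log2) - sum(y_log2) / len(y_log2)
    return 2.0 ** diff

# Hypothetical values for one gene, three replicates per group.
tumor = [800.0, 1000.0, 1200.0]
normal = [200.0, 250.0, 300.0]
fc_natural = fold_change_natural(tumor, normal)

# The same data on the log2 scale.
tumor_log2 = [math.log2(v) for v in tumor]
normal_log2 = [math.log2(v) for v in normal]
fc_log = fold_change_from_log2(tumor_log2, normal_log2)

print(fc_natural, fc_log)
```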
Fold Change Problems: fold change does not give a confidence level for differential expression. Is there a better statistical test?
Test Normality: compare the data with the normal distribution using a QQ plot. Normal: t-test. Non-normal: non-parametric test.
Wilcoxon Rank Sum Test: rank all data in the gene's row and sum the ranks of each group, T_T or T_C. Significance can also be calculated by permutation. E.g. with 10 normal and 10 cancer samples, Min(T) = 55 and Max(T) = 155; how significant is T = 150? Check a U table (U is a transformation of T) for statistical significance. The test is non-parametric, so it has less power, especially with few samples.
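The 10-versus-10 example can be checked by brute-force counting instead of a U table. The sketch below assumes no ties when computing the exact null distribution and uses hypothetical function names. It counts how many ways 10 of the ranks 1..20 can sum to at least the observed T.

```python
from math import comb

def rank_sum(group, other):
    """Sum of the ranks of `group` within the pooled data (average ranks for ties)."""
    pooled = sorted(group + other)

    def avg_rank(v):
        first = pooled.index(v) + 1      # 1-based rank of the first occurrence
        count = pooled.count(v)          # number of tied values
        return first + (count - 1) / 2.0

    return sum(avg_rank(v) for v in group)

def upper_tail_p(t_obs, n_group, n_total):
    """Exact P(T >= t_obs) under H0 (no ties), by counting rank subsets."""
    max_sum = n_total * (n_total + 1) // 2
    # ways[k][s] = number of k-subsets of the ranks processed so far with sum s
    ways = [[0] * (max_sum + 1) for _ in range(n_group + 1)]
    ways[0][0] = 1
    for r in range(1, n_total + 1):
        for k in range(min(r, n_group), 0, -1):   # descending k: each rank used once
            for s in range(r, max_sum + 1):
                ways[k][s] += ways[k - 1][s - r]
    extreme = sum(ways[n_group][s] for s in range(t_obs, max_sum + 1))
    return extreme / comb(n_total, n_group)

min_T = sum(range(1, 11))     # ranks 1..10
max_T = sum(range(11, 21))    # ranks 11..20
p_150 = upper_tail_p(150, 10, 20)
print(min_T, max_T, p_150)
```

Only 19 of the 184756 possible rank assignments give T of 150 or more, so the one-sided p-value is about 1e-4.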
Linear Model for Differential Expression: Y_ijk = m_j + a_ij + error_ijk. A separate model is fitted for each gene j. m_j is the mean expression of gene j over the entire experiment (RMA expression index). a_ij is the deviation of the mean of the ith condition from the overall mean, with Σ_i a_ij = 0. k indexes a specific sample. For 3 replicates (mutant) versus 3 replicates (wildtype control), we care whether a_mu - a_wt = 0 (null hypothesis H0).
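For a single gene, the terms of this model can be estimated directly from condition means. The sketch below assumes a balanced design (the same number of replicates per condition) and hypothetical log-scale values. The function name `fit_one_gene` is illustrative only.

```python
def fit_one_gene(groups):
    """Estimate m_j and the a_ij for one gene.

    groups: dict mapping condition name -> list of replicate values.
    Assumes a balanced design, so the overall mean is the mean of condition means.
    """
    cond_means = {c: sum(v) / len(v) for c, v in groups.items()}
    mu = sum(cond_means.values()) / len(cond_means)        # overall mean m_j
    a = {c: m - mu for c, m in cond_means.items()}          # deviations a_ij
    return mu, a

# Hypothetical gene: 3 mutant replicates, 3 wildtype replicates.
gene = {"mutant": [9.0, 9.5, 10.0], "wildtype": [7.0, 7.5, 8.0]}
mu, a = fit_one_gene(gene)

print(mu, a)
print(sum(a.values()))                 # the deviations sum to zero
print(a["mutant"] - a["wildtype"])     # the quantity tested against 0 under H0
```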
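Ordinary t-tests: t = (a_mu - a_wt) / (c × s_g), the estimated difference divided by its standard error, where c is based on the sample sizes in the two conditions and s_g is the standard deviation estimate for the gene. How to determine s_g?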
Variance Estimates: if the variance is the same across treatments, use the standard t-test; if the variance differs between treatments, use Welch's t-test. Big |t|, small p: reject H0.
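A minimal sketch of the two statistics, using only the Python standard library and hypothetical replicate values. It returns the t statistic and degrees of freedom. Turning these into a p-value requires the t distribution, which is left out here.

```python
import math
from statistics import mean, variance

def student_t(x, y):
    """Standard two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2

def welch_t(x, y):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical log-scale values for one gene.
mutant = [9.0, 9.5, 10.0]
wildtype = [7.0, 7.5, 8.0]
print(student_t(mutant, wildtype))
print(welch_t(mutant, wildtype))
```

With equal group sizes and equal sample variances, as in this example, the two statistics coincide; they diverge when one group is much noisier than the other.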
Variance Stabilization: variance estimates are unreliable when the sample size is small (e.g. 2-3 replicates in each condition). Significance Analysis of Microarrays (SAM): modified t statistic that increases s_g based on the s_g of other genes on the array (e.g. the lowest 5th percentile of s_g). LIMMA (Smyth 2004): empirical Bayes, borrows information from all genes.
LIMMA: Design Matrix. Specifies the RNA samples used on the arrays.
> Mat
        Treat1 Treat2 Control
Sample1      1      0       0
Sample2      1      0       0
Sample3      1      0       0
Sample4      0      1       0
Sample5      0      1       0
Sample6      0      1       0
Sample7      0      0       1
Sample8      0      0       1
Sample9      0      0       1
LIMMA: Contrast Matrix. Specifies which comparisons are of interest.
> contrast
        Treat1-Control Treat2-Control
Treat1               1              0
Treat2               0              1
Control             -1             -1
The flexibility of the general linear model lets it consider many different conditions: Y_ijk = m_j + a_ij + error_ijk
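To show how the design and contrast matrices combine, here is a plain-Python sketch for one gene. It is not limma's API. It relies on the fact that, for a one-hot design like the one on the design matrix slide, the least-squares coefficients are the per-group means. The expression values and function names are hypothetical.

```python
# Rows: Sample1..Sample9; columns: Treat1, Treat2, Control.
design = [
    [1, 0, 0], [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 1],
]
# Rows: Treat1, Treat2, Control; columns: Treat1-Control, Treat2-Control.
contrast = [
    [1, 0],
    [0, 1],
    [-1, -1],
]

def group_means(design, y):
    """Least-squares coefficients for a one-hot design: the mean of each group."""
    n_coef = len(design[0])
    sums = [0.0] * n_coef
    counts = [0] * n_coef
    for row, value in zip(design, y):
        for j, indicator in enumerate(row):
            if indicator:
                sums[j] += value
                counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]

def apply_contrasts(contrast, beta):
    """Compute each contrast as a weighted sum of the coefficients."""
    n_contrasts = len(contrast[0])
    return [
        sum(contrast[i][k] * beta[i] for i in range(len(beta)))
        for k in range(n_contrasts)
    ]

# Hypothetical log-scale expression of one gene across the nine samples.
y = [10.0, 10.5, 9.5, 8.0, 8.5, 7.5, 7.0, 7.5, 6.5]
beta = group_means(design, y)
effects = apply_contrasts(contrast, beta)
print(beta, effects)
```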
LIMMA: Variance Shrinkage. Smooth the gene-wise variances towards a common (typical) value in a graduated way by borrowing information from all the genes, while allowing flexibility for individual genes. Assume each gene's variance follows an inverse gamma distribution: large variances are shrunk down, small variances are shrunk up.
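The shrinkage can be written as a weighted average of the gene's own variance and the prior variance, following the moderated variance in Smyth (2004). In the sketch below the prior variance `s0_sq` and prior degrees of freedom `d0` are hypothetical fixed numbers. LIMMA estimates both from the variances of all genes, which is not reproduced here.

```python
def moderated_variance(s2, df, s0_sq, d0):
    """Shrink a gene-wise variance s2 (with df degrees of freedom)
    towards the prior variance s0_sq (with d0 prior degrees of freedom)."""
    return (d0 * s0_sq + df * s2) / (d0 + df)

# Hypothetical prior (LIMMA would estimate these from all genes).
s0_sq, d0 = 0.25, 4.0
# Hypothetical gene-wise variances, each with 4 residual degrees of freedom.
gene_vars = [0.01, 0.25, 1.0]
shrunk = [moderated_variance(s2, 4.0, s0_sq, d0) for s2 in gene_vars]
print(shrunk)
```

The small variance moves up towards the prior, the large one moves down, and a gene already at the typical value stays put. A larger `d0` pulls every gene harder towards the common value.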
Multiple Hypotheses Testing: how many differential genes to report?
Multiple Hypotheses Testing: we test differential expression for every gene at a p-value cutoff, e.g. 0.01. For ~20K genes on the array, potentially 0.01 x 20K = 200 genes are wrongly called. H0: no differential expression; H1: differential expression. Rejecting H0 means calling a gene differentially expressed. We should control the family-wise error rate or the false discovery rate.
Family-Wise Error Rate: P(at least one false rejection) ≤ α, equivalently P(no false rejection) ≥ 1 - α. Bonferroni correction: to control the family-wise error rate for testing m hypotheses at level α, we need to control the false rejection rate for each individual test at α/m. If α is 0.05, for 20K genes the p-value cutoff is 0.05/20K = 2.5E-6. This is too conservative for selecting differentially expressed genes.
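The arithmetic behind the multiple testing problem and the Bonferroni cutoff is easy to reproduce. The sketch below uses the numbers from the slides. The helper `bonferroni_reject` and the example p-values are illustrative only.

```python
m = 20000                      # genes tested
p_cutoff = 0.01                # per-gene p-value cutoff without correction
expected_false_calls = p_cutoff * m   # expected wrong calls if no gene is differential

alpha = 0.05                   # desired family-wise error rate
bonferroni_cutoff = alpha / m  # per-gene cutoff under Bonferroni

def bonferroni_reject(pvals, alpha):
    """Reject H0 for each p-value at or below alpha / (number of tests)."""
    cutoff = alpha / len(pvals)
    return [p <= cutoff for p in pvals]

print(expected_false_calls, bonferroni_cutoff)
print(bonferroni_reject([0.001, 0.02, 0.3, 0.04], alpha))
```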
False Discovery Rate
                              # not rejected   # rejected   Total
                              (not called)     (called)
# H0 (two groups similar)     U                V            m0
# H1 (two groups different)   T                S            m1
Total                         m - R            R            m
V: type I errors, false positives. T: type II errors, false negatives. FDR = V / R = FP / all called.
False Discovery Rate: less conservative than the family-wise error rate. Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α. Assume the p-values from the different tests are independent. Plot all m genes ranked by p-value (x = rank, y = p-value). Draw the line y = x α / m, x = 1…m. Find the highest-ranked gene that falls on or below the line and call it together with every gene ranked before it.
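The procedure can be sketched in a few lines of Python. The p-values are hypothetical and were chosen so that the smallest one lies above the line. It is still called, because a gene ranked after it falls below the line.

```python
def benjamini_hochberg(pvals, alpha):
    """Return a list of booleans: True where the gene is called at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # gene indices, smallest p first
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:             # on or below the line y = x * alpha / m
            k_max = rank                               # remember the largest such rank
    called = [False] * m
    for idx in order[:k_max]:                          # call everything up to that rank
        called[idx] = True
    return called

# Hypothetical p-values for four genes; the thresholds are 0.0125, 0.025, 0.0375, 0.05.
pvals = [0.9, 0.021, 0.02, 0.03]
calls = benjamini_hochberg(pvals, alpha=0.05)
print(calls)
```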
FDR Threshold [Figure: p-value (y-axis) plotted against genes ranked by p-value (x-axis, index / m), with the threshold line x α / m]
Q-value. Teaser: what is the p-value distribution if there are no differential genes? Storey & Tibshirani, PNAS, 2003: empirically derived q-value. Every p-value has its corresponding q-value (FDR).
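A simplified sketch of how a q-value can be attached to every p-value. It estimates the proportion of true nulls, `pi0`, from the p-values above a single fixed threshold `lam`. This is a simplification: Storey & Tibshirani smooth the estimate over a range of thresholds. The p-values and the function name are hypothetical.

```python
def q_values(pvals, lam=0.5):
    """Return (pi0, q) where q[i] is the q-value of pvals[i].

    pi0 is estimated from the fraction of p-values above lam, on the
    assumption that those come almost entirely from true nulls.
    """
    m = len(pvals)
    pi0 = min(1.0, sum(p > lam for p in pvals) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # walk from largest p to smallest
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvals[idx] / rank)
        q[idx] = running_min                     # enforce monotone q-values
    return pi0, q

# Hypothetical p-values for eight genes.
pvals = [0.001, 0.01, 0.02, 0.04, 0.3, 0.6, 0.7, 0.9]
pi0, q = q_values(pvals)
print(pi0, q)
```

The running minimum guarantees that a gene with a smaller p-value never gets a larger q-value. If no p-value exceeds `lam`, this crude estimate gives `pi0 = 0`, so a real analysis needs a more careful estimator.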
Practical Use of FDR: a very useful concept in most genomics and high-throughput studies. P-value and FDR are monotonically related: a smaller p-value never has a larger FDR. Common FDR cutoffs are 1%, 5% and 10%; results are often also filtered by fold change. The number of calls gives a rough estimate of signal to noise and experimental quality. For expression, most people are comfortable with ~500-2000 differentially expressed genes.
Summary. Differential expression: fold change; t-test on normally distributed data; LIMMA uses a hierarchical model to stabilize gene-wise variance. FDR: adjust for multiple hypotheses testing; FWER is conservative; Benjamini-Hochberg; q-value.