1
Differential Gene Expression
Xiaole Shirley Liu STAT115 / STAT215
2
Identification of Diagnostic Genes
Classical study of cancer subtypes Golub et al. (1999)
3
Differential Expression
Naïve method: fold change, Avg(X) / Avg(Y)
Note on scale:
Natural scale: MAS4, MAS5, dChip
Log scale: RMA (log2 values), so back-transform with 2^x before taking the ratio
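The note on scale can be sketched in a few lines. This is a minimal illustration, assuming made-up expression values for a single gene and hypothetical helper names (fold_change_natural, log2_fold_change); it only shows that a difference of mean log2 values is a log2 fold change, and that back-transforming it gives a ratio of geometric means.

```python
import math

def fold_change_natural(x, y):
    """Fold change Avg(X) / Avg(Y) for values on the natural scale."""
    return (sum(x) / len(x)) / (sum(y) / len(y))

def log2_fold_change(log_x, log_y):
    """Difference of mean log2 values, i.e. the log2 fold change."""
    return sum(log_x) / len(log_x) - sum(log_y) / len(log_y)

# Hypothetical expression values for one gene (not from the slides).
tumor = [400.0, 800.0, 1600.0]
normal = [100.0, 200.0, 400.0]

fc = fold_change_natural(tumor, normal)

# The same data as a log-scale method would report it.
lfc = log2_fold_change([math.log2(v) for v in tumor],
                       [math.log2(v) for v in normal])

# Back-transform with 2**x; this is the ratio of geometric means,
# which need not equal the ratio of arithmetic means in general.
fc_from_log = 2 ** lfc
```

On log-scale data it is usually simpler to report the log2 fold change directly rather than converting back.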
4
Fold Change Problems
Does not give confidence of differential expression
Better statistical test?
5
Test Normality
Normal distribution: check with a QQ plot
Normal: t-test
Non-normal: non-parametric test
6
Wilcoxon Rank Sum Test
Rank all data in a row and sum the ranks within each group: T_T or T_C
Significance can also be calculated by permutation
E.g. 10 normal and 10 cancer: Min(T) = 55, Max(T) = 155; what is the significance of T = 150?
Check the U table (U is a transformation of T) for statistical significance
Non-parametric; less power with fewer samples
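The 10-versus-10 example can be checked directly. The sketch below is an illustration, not the slide author's code: rank_sum is a hypothetical helper, and the significance of T = 150 is obtained by enumerating every way of assigning 10 of the 20 ranks to one group, which is the exact version of the permutation idea when there are no ties.

```python
from itertools import combinations
from math import comb

def rank_sum(group_a, group_b):
    """Sum of the ranks of group_a in the pooled data (average ranks for ties)."""
    pooled = sorted(group_a + group_b)
    positions = {}
    for idx, v in enumerate(pooled, start=1):   # 1-based ranks
        positions.setdefault(v, []).append(idx)
    avg_rank = {v: sum(p) / len(p) for v, p in positions.items()}
    return sum(avg_rank[v] for v in group_a)

n = 10  # samples per group, as in the slide

# Smallest and largest possible rank sums for one group.
t_min = sum(range(1, n + 1))             # ranks 1..10
t_max = sum(range(n + 1, 2 * n + 1))     # ranks 11..20

# Exact one-sided null probability P(T >= 150): count the rank
# assignments that are at least as extreme, out of C(20, 10).
extreme = sum(1 for s in combinations(range(1, 2 * n + 1), n)
              if sum(s) >= 150)
p_one_sided = extreme / comb(2 * n, n)
```

With real data, rank_sum(cancer, normal) gives the observed T to compare against this null distribution.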
7
Linear Model for Differential Expression
$Y_{ijk} = \mu_j + \alpha_{ij} + \mathrm{error}_{ijk}$
Separate model for each gene $j$.
$\mu_j$ is the mean expression for gene $j$ over the entire experiment (RMA expression index).
$\alpha_{ij}$ is the deviation of the mean of the $i$th condition from the overall mean, with $\sum_i \alpha_{ij} = 0$.
$k$ is a specific sample.
For 3 replicates (mutant) versus 3 replicates (wildtype control), we care whether $\alpha_{\mathrm{mu}} - \alpha_{\mathrm{wt}} = 0$ (null hypothesis $H_0$).
8
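Ordinary t-tests
t = (difference in condition means) / (c · s_g), where c is based on the sample sizes in the two conditions and s_g is the standard deviation estimate for gene g
How to determine s_g?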
9
Variance Estimates
Same variance across treatments: standard (pooled-variance) t-test
Different variance in different treatments: Welch t-test
Big |t|, small p: reject H0
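The two variance choices lead to two textbook statistics. The following is a minimal sketch with hypothetical replicate values for one gene; pooled_t and welch_t are illustrative names, and only the statistic and its degrees of freedom are computed (turning them into a p-value needs the t distribution, which is left out here).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Standard two-sample t statistic assuming equal variances."""
    n1, n2 = len(x), len(y)
    # Pooled variance estimate across the two conditions.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    c = sqrt(1 / n1 + 1 / n2)          # depends only on the sample sizes
    return (mean(x) - mean(y)) / (c * sqrt(sp2)), n1 + n2 - 2

def welch_t(x, y):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical log2 expression for 3 mutant and 3 wildtype replicates.
mutant = [8.1, 8.4, 8.6]
wildtype = [7.0, 7.2, 7.1]

t_pooled, df_pooled = pooled_t(mutant, wildtype)
t_welch, df_welch = welch_t(mutant, wildtype)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; they diverge when the group sizes and variances are both unequal.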
10
Variance Stabilization
Problem with estimating variance when the sample size is small (e.g. 2-3 replicates in each condition)
Significance Analysis of Microarrays (SAM): modified t-statistic, increase s_g based on s_g of other genes on the array (e.g. lowest 5th percentile of s_g)
LIMMA (Smyth 2004): empirical Bayes, borrow information from all genes
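The effect of inflating s_g can be illustrated with a SAM-style statistic, diff / (s_g + s0). This is a rough sketch under stated assumptions: the per-gene values are invented, low_percentile and sam_like_statistic are hypothetical helpers, and s0 is simply taken near the lowest 5th percentile of s_g as the slide suggests (the published method tunes this constant from the data).

```python
def low_percentile(values, frac=0.05):
    """Value at roughly the given lower fraction of the sorted data."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, int(frac * len(ordered)))
    return ordered[k]

def sam_like_statistic(diff, s_g, s0):
    """Mean difference divided by an inflated scatter estimate s_g + s0."""
    return diff / (s_g + s0)

# Hypothetical per-gene scatter estimates and mean differences.
s_values = [0.02 * (i + 1) for i in range(20)]   # 0.02, 0.04, ..., 0.40
diffs = [0.1] * 20

s0 = low_percentile(s_values)                    # assumed fudge constant

ordinary = [d / s for d, s in zip(diffs, s_values)]
moderated = [sam_like_statistic(d, s, s0) for d, s in zip(diffs, s_values)]

# Genes with a tiny s_g lose the most: their ordinary statistic was
# large only because the variance happened to be underestimated.
shrink_ratio = [m / o for m, o in zip(moderated, ordinary)]
```

The ratio in shrink_ratio rises towards 1 as s_g grows, so well-estimated, high-variance genes are barely affected.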
11
LIMMA: Design Matrix
Specifies RNA samples used on arrays
> Mat
[Design matrix: one row per sample (nine samples); columns Treat1, Treat2, Control; cell values not shown]
12
LIMMA: Contrast Matrix
Specifies which comparisons are of interest
> contrast
          Treat1-Control  Treat2-Control
Treat1          1               0
Treat2          0               1
Control        -1              -1
Flexibility of the general linear model: can consider many different conditions
$Y_{ijk} = \mu_j + \alpha_{ij} + \mathrm{error}_{ijk}$
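How the design and contrast matrices work together can be sketched without any library. This is an illustration only: it assumes three replicates per group (the actual sample layout is not given), uses invented expression values for one gene, and relies on the fact that with a one-column-per-group design the least-squares coefficients are just the group means. The contrast weights follow from the comparison names Treat1-Control and Treat2-Control.

```python
# Assumed layout: three replicates each of Treat1, Treat2, Control.
levels = ["Treat1", "Treat2", "Control"]
groups = ["Treat1"] * 3 + ["Treat2"] * 3 + ["Control"] * 3

# Design matrix: one row per sample, one 0/1 column per group.
design = [[1 if g == lv else 0 for lv in levels] for g in groups]

# Contrast matrix: weights on (Treat1, Treat2, Control) per comparison.
contrasts = {
    "Treat1-Control": [1, 0, -1],
    "Treat2-Control": [0, 1, -1],
}

def fit_group_means(y, design):
    """Least-squares coefficients for a one-hot design = group means."""
    coefs = []
    for j in range(len(design[0])):
        vals = [yi for yi, row in zip(y, design) if row[j] == 1]
        coefs.append(sum(vals) / len(vals))
    return coefs

def apply_contrasts(coefs, contrasts):
    """Estimated effect for each comparison of interest."""
    return {name: sum(w * c for w, c in zip(weights, coefs))
            for name, weights in contrasts.items()}

# Hypothetical log2 expression of one gene across the nine samples.
y = [9.0, 9.2, 9.1, 7.9, 8.1, 8.0, 8.0, 8.2, 8.1]

coefs = fit_group_means(y, design)
effects = apply_contrasts(coefs, contrasts)
```

Adding another comparison, for example Treat1-Treat2, only needs one more set of weights ([1, -1, 0]); the fit itself does not change.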
13
LIMMA: Empirical Bayes Variance Shrinkage
Smooth gene-wise variance towards a common (typical) value in a graduated way by borrowing information from all the genes, but allow flexibility for individual genes
Assume each gene's variance follows an inverse gamma distribution: large variances are shrunk down, small variances are shrunk up
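The shrinkage described above has a simple closed form in the empirical Bayes setup of Smyth (2004): the moderated variance is a weighted average of the gene's own variance and a prior variance, weighted by their degrees of freedom. The sketch below assumes the prior values s2_0 and d_0 are already known (in practice they are estimated from all genes on the array), and the numbers are invented.

```python
def moderated_variance(s2_g, d_g, s2_0, d_0):
    """Weighted average of the gene-wise and prior variances."""
    return (d_0 * s2_0 + d_g * s2_g) / (d_0 + d_g)

# Assumed prior: typical variance 0.05 carrying 4 degrees of freedom.
s2_0, d_0 = 0.05, 4.0

# Hypothetical gene-wise variances, each with 4 residual degrees of
# freedom (e.g. 3 replicates vs 3 replicates).
d_g = 4.0
gene_variances = [0.001, 0.05, 0.5]

shrunk = [moderated_variance(s2, d_g, s2_0, d_0) for s2 in gene_variances]
# Small variances move up, large ones move down, and a gene already
# at the typical value stays where it is.
```

A larger d_0 means the genes agree more closely on a common variance, so each individual estimate is pulled harder towards it; the moderated statistic then uses d_0 + d_g degrees of freedom.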
14
Multiple Hypotheses Testing
How many differential genes to report?
15
Multiple Hypotheses Testing
We test differential expression for every gene at a p-value cutoff, e.g. 0.01
For ~20K genes on the array, potentially 0.01 x 20K = 200 genes wrongly called
H0: no differential expression; H1: differential expression
Reject H0: call a gene differentially expressed
Should control the family-wise error rate or the false discovery rate
16
Family-Wise Error Rate
P(at least one false rejection) < α
P(no false rejection) > 1 - α
Bonferroni correction: to control the family-wise error rate for testing m hypotheses at level α, we need to control the false rejection rate for each individual test at α/m
If α is 0.05, for 20K genes, the p-value cutoff is 0.05 / 20K = 2.5E-6
Too conservative for differentially expressed gene selection
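The arithmetic on these two slides is easy to reproduce. A small sketch follows, with a hypothetical helper bonferroni_reject and an invented list of p-values; it contrasts the expected number of false calls at an uncorrected cutoff with the per-test cutoff that Bonferroni requires.

```python
m = 20_000          # number of genes tested
per_test = 0.01     # uncorrected p-value cutoff
alpha = 0.05        # target family-wise error rate

# If no gene is truly differential, this many are expected to pass
# the uncorrected cutoff by chance.
expected_false_calls = per_test * m

# Bonferroni: test each gene at alpha / m.
bonferroni_cutoff = alpha / m

def bonferroni_reject(pvals, alpha):
    """True for each hypothesis rejected at family-wise level alpha."""
    cutoff = alpha / len(pvals)
    return [p <= cutoff for p in pvals]

# Hypothetical p-values for five genes (cutoff is 0.05 / 5 = 0.01).
calls = bonferroni_reject([0.0004, 0.009, 0.02, 0.04, 0.5], alpha)
```

Only the first two hypothetical genes survive here, even though four of the five have p < 0.05, which is the conservativeness the slide refers to.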
17
False Discovery Rate
                              # not rejected   # rejected   Total
                              (not called)     (called)
H0: two groups similar        U                V            m0
H1: two groups different      T                S            m1
Total                         m - R            R            m
V: type I errors, false positives
T: type II errors, false negatives
FDR = E[V / R]: the expected fraction of false positives among all called genes
18
False Discovery Rate
Less conservative than family-wise error rate
Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α*
Assume all the p-values from different tests are independent
Plot all m genes by rank (x) against p-value (y)
Draw the line y = x α* / m, x = 1…m
Call all genes up to the last one that falls below the line
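The ranking-and-line procedure above can be written out as a short function. This is a minimal sketch of the Benjamini-Hochberg step-up rule with a hypothetical function name and invented p-values; note that it calls every gene up to the largest rank whose p-value is under the line, even if some earlier genes sit above it.

```python
def benjamini_hochberg(pvals, alpha):
    """Return a list of booleans: True where the gene is called."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by p-value

    # Largest rank k with p_(k) <= k * alpha / m (0 if none qualifies).
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank

    called = [False] * m
    for i in order[:k_max]:
        called[i] = True
    return called

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.06, 0.074, 0.205, 0.212, 0.216]
called = benjamini_hochberg(pvals, alpha=0.05)

# Step-up behaviour: the last gene is under the line, so all are called,
# although the second and third ranked genes are above it.
step_up = benjamini_hochberg([0.01, 0.04, 0.03, 0.04], alpha=0.05)
```

For comparison, Bonferroni at the same level would use a single cutoff of 0.05 / 10 = 0.005 for the ten-gene example and call only the first gene.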
19
FDR Threshold
[Figure: genes ranked by p-value; x-axis: index / m; y-axis: p-value; threshold line x α* / m]
20
Q-value
Teaser: what is the p-value distribution if there are no differential genes?
Storey & Tibshirani, PNAS, 2003
Empirically derived q-value
Every p-value has its corresponding q-value (FDR)
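One way to see how every p-value gets a q-value is the simplified sketch below. It is not the full Storey and Tibshirani procedure: the proportion of true nulls, pi0, is estimated at a single assumed tuning value lam = 0.5 rather than by smoothing over many values, the function name q_values is hypothetical, and the p-values are invented.

```python
def q_values(pvals, lam=0.5):
    """Simplified q-values: pi0 from one lambda, then a running minimum."""
    m = len(pvals)

    # Estimated fraction of true nulls, from the p-values above lam.
    # (If no p-value exceeds lam this degenerates to pi0 = 0.)
    pi0 = min(1.0, sum(1 for p in pvals if p > lam) / (m * (1.0 - lam)))

    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvals[i] / rank)
        q[i] = running_min
    return q

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.06, 0.074, 0.205, 0.6, 0.9]
q = q_values(pvals)
```

Setting pi0 to 1 in this sketch gives the Benjamini-Hochberg adjusted p-values, so the q-value can be read as a less conservative version of the same idea.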
21
Practical Use of FDR
Very useful concept in most genomics or high-throughput studies
P-value and FDR are monotonically related
Common FDR cutoffs: 1%, 5%, 10%; also filter by fold change
Gives a rough estimate of signal / noise and experimental quality
For expression, most people are comfortable with ~ differentially expressed genes
22
Summary Differential Expression
Fold change
t-test on normally distributed data
LIMMA uses a hierarchical model to stabilize gene-wise variance
FDR: adjust for multiple hypotheses testing
FWER: conservative
Benjamini-Hochberg
q-value