Differential Expression and False Discovery Rate: Revisiting the Princeton Stem Cell Data
GPBA Workshop, Oct. 15, 2003
Gene Expression Analysis of Hematopoietic Stem Cells and Committed Progenitors
Hongxian He (1), Gregory Grant (1), Lyle Ungar (2), Natalia B. Ivanova (3), Ihor R. Lemischka (3), Chris Stoeckert (1)
(1) Center for Bioinformatics, University of Pennsylvania
(2) School of Engineering and Applied Science, University of Pennsylvania
(3) Department of Molecular Biology, Princeton University
Model for Hematopoiesis
Hematopoiesis: the process by which blood cells are formed from hematopoietic stem cells (HSCs).
HSC properties:
–Pluripotent: they can differentiate into a number of different blood cell types.
–Self-renewing: they can divide to replenish themselves in their pluripotent (undifferentiated) state.
Sample Preparation: Cell Purification
[Diagram: purification scheme for the two hematopoietic fractions]
Hematopoietic fraction from mouse fetal liver: HSC, LCP, MBC
Hematopoietic fraction from mouse adult bone marrow: LT-HSC, ST-HSC, LCP, MBC
Marker profiles shown in the diagram: Lin+; Lin-; Lin- AA4.1+ c-Kit+ Sca-1+; Lin- AA4.1+ c-Kit+ Sca-1-; Lin- c-Kit+ Sca-1-; Lin- c-Kit+ Sca-1- Rho low; Lin- c-Kit+ Sca-1- Rho high
LT-HSC: long-term hematopoietic stem cell
ST-HSC: short-term hematopoietic stem cell
LCP: lineage-committed progenitor
MBC: mature blood cell
Dataset:
–Each hematopoietic population was hybridized to Affymetrix MG-U74A, B, and C v2 chips.
–All populations have 2 biological replicates, except for the bone marrow LT-HSC and ST-HSC populations, which have 4 replicates each.
–Available at
Questions:
–What genes are expressed in each population?
–What genes are differentially expressed between any two successive populations?
–What is the minimum set of genes whose expression patterns are limited to and correlated with a specific population?
–What are the genes that play functional roles in regulating HSC proliferation and differentiation?
–What is the set of genes that are shared (and/or different) among different datasets using the same analysis approach?
Absolute Expression Analysis
[Diagram: FL HSC sample 1 and FL HSC sample 2 are each processed with MAS 5.0 to give one p-value per gene (Gene 1: p1, Gene 2: p2, ..., Gene N: pN). The B-H FDR method is applied to each replicate to give Gene list 1 and Gene list 2. The final gene list is the intersection of the two.]
Our goal is to identify genes expressed in HSCs and successive populations and to know the degree of confidence in the results.
MAS 5.0 P/A Calls
Now based on Wilcoxon's one-sided signed rank test
–Take the differences between PM and MM (16 differences).
–Rank the absolute differences from smallest to largest (1-16). Sign the ranks according to whether each difference was positive or negative.
–If there is no real difference, expect the sum of positive ranks to be about equal to the sum of negative ranks (about 68 each, since the ranks 1 to 16 sum to 136). If the signal is really different, expect the sum of negative ranks to be close to 0 or 1.
–The p-value (from a table) of seeing a smaller sum of ranks is used to make the calls: P (present) for p below the lower cutoff, M (marginal) for p from the lower cutoff up to 0.06, A (absent) for p > 0.06.
Note: about 12,000 such calls are made per chip.
–The probability of making a false present call is 0.05 for the test on a single gene, but with 12,000 tests about 600 spurious calls are expected just by chance (0.05 × 12,000).
–This is a multiple testing correction issue.
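The ranking steps above can be sketched as follows. This is a minimal illustration that assumes the test is applied directly to the PM - MM differences; the MAS 5.0 detection algorithm itself works on a transformed score and is not reproduced here. The function name is hypothetical.

```python
def signed_rank_p_value(diffs):
    """Return (W+, exact one-sided p-value) for a list of paired differences.

    W+ is the sum of ranks of the positive differences. The p-value is the
    probability, when signs are assigned at random, of a W+ at least as large
    as the one observed (equivalently, a sum of negative ranks at least as small).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Tied absolute differences share the average of their ranks.
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)

    # Exact null distribution by dynamic programming over doubled ranks
    # (doubling keeps averaged ranks integral).
    doubled = [int(round(2 * r)) for r in ranks]
    total = sum(doubled)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    target = int(round(2 * w_plus))
    return w_plus, sum(counts[target:]) / 2 ** n


# 16 probe pairs, all with PM > MM: the strongest possible evidence of presence.
w, p = signed_rank_p_value([float(k) for k in range(1, 17)])
print(w, p)
```

With all 16 differences positive the sum of negative ranks is 0, which is the "really different" case on the slide.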
Multiple Testing Correction
Family-wise Error Rate (FWER)
–Control the probability of making any false positive call at the desired significance level.
–Conservative methods such as the Bonferroni correction: divide the significance threshold by the number of calls (or genes).
–For 12,000 genes, each gene needs a p-value threshold of 0.05/12,000 to assure that the probability of making any false present call is 0.05.
False Discovery Rate (FDR)
–FDR = expected (# false predictions / # total predictions)
–Control the proportion of false positive calls among all positive calls at the desired level.
–Shifts the focus to predicted positives and accepts that some will be wrong. An FDR of 0.05 means that out of 100 predicted positives, about 5 are expected to be wrong.
–The FWER will be very high, but that is not the quantity of interest.
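The arithmetic behind these statements is short enough to check directly. The sketch below uses the slide's own numbers (12,000 calls per chip, a single-test level of 0.05); the function names are illustrative only.

```python
def expected_false_calls(alpha, n_tests):
    """Expected number of false positives if every test is run at level alpha."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_discoveries(fdr, n_predictions):
    """Expected number of wrong calls among the predicted positives."""
    return fdr * n_predictions


print(expected_false_calls(0.05, 12_000))     # uncorrected calls on one chip
print(bonferroni_threshold(0.05, 12_000))     # per-gene Bonferroni threshold
print(expected_false_discoveries(0.05, 100))  # FDR of 0.05 on 100 predictions
```

The Bonferroni threshold comes out several orders of magnitude below 0.05, which is why the slide calls the method conservative.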
Benjamini-Hochberg FDR
–Step-up method (sort p-values in decreasing order and evaluate until the first success meeting the cut-off)
–Cut-off of 0.02 FDR (only 2 expected false positives in 100 predictions)
For the A chip (mostly known genes), this meant a cutoff of around
–3000 to 4000 genes pass as expressed
–60 to 80 of these are wrong (i.e., about of 12,000)
The FDR was chosen to maximize the number of genes passing the cutoff while keeping the number of false calls small.
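A minimal sketch of the Benjamini-Hochberg step-up rule, assuming the standard form in which the k-th smallest of m p-values is compared with k × FDR / m. Scanning from the largest p-value down to the first success, as the slide describes, selects the same set as finding the largest passing rank, which is what the code does. The function name is hypothetical.

```python
def benjamini_hochberg(p_values, fdr):
    """Return the sorted indices of the tests called significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_pass = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: remember the largest rank whose p-value meets its cut-off.
        if p_values[idx] <= rank * fdr / m:
            n_pass = rank
    # Everything ranked at or below that point passes, even if an
    # individual p-value missed its own cut-off along the way.
    return sorted(order[:n_pass])


print(benjamini_hochberg([0.01, 0.045, 0.02, 0.005, 0.5], fdr=0.05))
print(benjamini_hochberg([0.02, 0.021, 0.022], fdr=0.05))
```

In the second call the smallest p-value misses its own cut-off but is still accepted because a larger one passes; that is the behavior that distinguishes a step-up procedure from a simple per-rank threshold.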
Results of Absolute Expression Analysis
Table 1. Number of genes expressed in each population from bone marrow
Columns: mBM.LTHSC, mBM.STHSC, mBM.LCP, mBM.MBC
Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total
Table 2. Number of genes expressed in each population from fetal liver
Columns: mFL.HSC, mFL.LCP, mFL.MBC
Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total
Results of Absolute Expression Analysis (cont.)
Table 3. Number of genes uniquely expressed in each population from bone marrow
Columns: mBM.LTHSC, mBM.STHSC, mBM.LCP, mBM.MBC
Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total
Table 4. Number of genes uniquely expressed in each population from fetal liver
Columns: mFL.HSC, mFL.LCP, mFL.MBC
Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total
Differential Expression Analysis (analysis 2)
Bone Marrow Sample: LT-HSC, ST-HSC, LCP, MBC
Fetal Liver Sample: HSC, LCP, MBC
Our goal is to identify genes differentially expressed between HSCs and successive populations and to know the degree of confidence in the results.
Differential Expression Analysis
[Diagram: LT-HSC vs. ST-HSC. For each group, MAS 5.0 gives an expression matrix with one row per gene (Gene 1 ... Gene N) and one column per replicate (e(1) ... e(4)). PaGE, SAM, and LPE + BH each produce a gene list (Gene lists 1 to 3). The final gene list is the intersection of the three.]
MAS 5.0 Expression Level Estimate
Goals were to replace AvgDiff with a metric that produces only positive values and is little affected by outliers.
Tukey's biweight and imputation of stray signal
–Substitute negative PM-MM values with predicted values based on other probe pairs.
–Instead of a straight average, use an average of values weighted by distance from the median. (This can be done iteratively, but only one step is used.) Those furthest from the median contribute the least to the average.
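The weighted average described above can be sketched as a one-step Tukey biweight. The tuning constant c = 5 and the small epsilon are assumptions taken from commonly cited descriptions of the algorithm, not from these slides, and the imputation of negative PM-MM values is not shown.

```python
import statistics


def one_step_tukey_biweight(values, c=5.0, epsilon=1e-4):
    """Robust average: weights shrink with distance from the median.

    c and epsilon are assumed defaults, not values stated in the slides.
    """
    center = statistics.median(values)
    # Median absolute deviation sets the scale for "far from the median".
    mad = statistics.median([abs(v - center) for v in values])
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        weight = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # outliers get weight 0
        weighted_sum += weight * v
        weight_total += weight
    return weighted_sum / weight_total


probe_values = [1.0, 2.0, 3.0, 4.0, 100.0]
print(statistics.mean(probe_values))          # pulled up by the outlier
print(one_step_tukey_biweight(probe_values))  # stays near the bulk of the data
```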
Differential Expression Methods Used
–PaGE (Patterns from Gene Expression)
–SAM (Significance Analysis of Microarrays)
–LPE (Local Pooled-Error model test)
FDR Using Permutations
An alternative to the B-H FDR is to use a permutation method.
A single-gene statistic is chosen
–for PaGE it is simply the ratio of expression levels
–for SAM it is a variant of the t-statistic.
The experiments are permuted between the groups and a permutation estimate of the null distribution of the statistic is compiled.
This distribution is then used to estimate what cutoff value for the single-gene p-values will achieve the desired FDR, based on the number of genes expected to pass the cut-off by chance and the observed number of genes that pass the cut-off.
[Figure labels: Permuted, Observed]
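One way to read the last paragraph is as a ratio: genes expected to pass by chance (averaged over permutations) divided by genes observed to pass. The sketch below implements that reading on per-gene statistics where larger means more extreme. It is an illustration under that assumption, not the exact estimator used by PaGE or SAM, and all names are hypothetical.

```python
def estimate_fdr(observed_stats, permuted_stats, cutoff):
    """Estimated FDR at a cutoff: chance passes divided by observed passes."""
    n_observed = sum(1 for s in observed_stats if s >= cutoff)
    if n_observed == 0:
        return 0.0
    # Average, over permutations, of how many genes pass the cutoff by chance.
    n_by_chance = sum(
        sum(1 for s in run if s >= cutoff) for run in permuted_stats
    ) / len(permuted_stats)
    return min(1.0, n_by_chance / n_observed)


def smallest_cutoff_for_fdr(observed_stats, permuted_stats, target_fdr):
    """Lowest observed statistic usable as a cutoff while meeting the target FDR."""
    for cutoff in sorted(set(observed_stats)):
        if estimate_fdr(observed_stats, permuted_stats, cutoff) <= target_fdr:
            return cutoff
    return None


observed = [5.0, 4.0, 0.5, 0.2]             # one statistic per gene
permuted = [[1.0, 0.3, 0.2, 0.1],           # same genes, permuted labels (run 1)
            [4.5, 0.2, 0.1, 0.3]]           # run 2
print(estimate_fdr(observed, permuted, cutoff=3.0))
print(smallest_cutoff_for_fdr(observed, permuted, target_fdr=0.3))
```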
Using Permutations to Get p-values
Permutations randomize associations
–Use these associations to recalculate the statistics for each permutation.
–Find the likelihood of the observed statistic based on the distribution of statistics from the permuted samples.
p-value = fraction of statistics generated from the permutations that are greater than or equal to the observed statistic.
Permuting columns preserves any gene dependencies but scrambles samples.
Example: permute columns to c,d,e,f;a,b,g,h
–Original: LT-HSC = a b c d; ST-HSC = e f g h
–Permuted: LT-HSC = c d e f; ST-HSC = a b g h
Example of Permutation for Eight Samples in Two Groups (4, 4)
Step 1: Permute the sample columns (e.g., swap sample a in group 1 with sample e in group 2). Calculate the t-statistic.
Note: permuting columns avoids the assumption of gene independence. This is important when considering more than one gene.
Step 2: Repeat step 1 for all possible permutations (70 in this case).
Step 3: Use the 70 t-statistics to get the distribution.
Step 4: Compare the starting t-statistic to the distribution to get the p-value (the fraction of permuted t-statistics at least as large as the observed t-statistic d).
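Steps 1 to 4 can be sketched for a single gene as follows. The sketch assumes an unequal-variance t-statistic and counts the unpermuted arrangement among the 70, so the smallest attainable p-value is 1/70. Function names are illustrative, and a real analysis would permute whole columns across all genes at once.

```python
import itertools
import math
import statistics


def t_statistic(group1, group2):
    """Two-sample t-statistic (assumes each group has nonzero variance)."""
    se = math.sqrt(statistics.variance(group1) / len(group1)
                   + statistics.variance(group2) / len(group2))
    return (statistics.mean(group1) - statistics.mean(group2)) / se


def permutation_p_value(group1, group2):
    """Return (one-sided p-value, number of relabelings enumerated)."""
    pooled = list(group1) + list(group2)
    observed = t_statistic(group1, group2)
    permuted = []
    # Steps 1-2: every way of choosing which samples are labelled "group 1".
    for chosen in itertools.combinations(range(len(pooled)), len(group1)):
        chosen_set = set(chosen)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        permuted.append(t_statistic(g1, g2))
    # Steps 3-4: fraction of permuted statistics at least as large as observed.
    n_extreme = sum(1 for t in permuted if t >= observed)
    return n_extreme / len(permuted), len(permuted)


lt_hsc = [10.0, 11.0, 12.0, 13.0]  # made-up expression values for one gene
st_hsc = [1.0, 2.0, 3.0, 4.0]
print(permutation_p_value(lt_hsc, st_hsc))
```

With complete separation between the groups, only the original labelling reaches the observed statistic, so the p-value is the floor of 1/70.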
Generating Permutations
4, 4 (a,b,c,d;e,f,g,h) gives 70 permutations
–No swap (1 permutation):
a,b,c,d;e,f,g,h
–Swap 1 (16 permutations):
b,c,d,e;a,f,g,h a,c,d,e;b,f,g,h a,b,d,e;c,f,g,h a,b,c,e;d,f,g,h
b,c,d,f;a,e,g,h etc…
b,c,d,g;a,e,f,h etc…
b,c,d,h;a,e,f,g etc…
–Swap 2 (36 permutations):
c,d,e,f;a,b,g,h b,d,e,f;a,c,g,h b,c,e,f;a,d,g,h a,d,e,f;b,c,g,h a,c,e,f;b,d,g,h a,b,e,f;c,d,g,h
c,d,e,g;a,b,f,h etc…
c,d,e,h;a,b,f,g etc…
c,d,f,g;a,b,e,h etc…
c,d,f,h;a,b,e,g etc…
c,d,g,h;a,b,e,f etc…
–Swap 3 (16 permutations):
d,e,f,g;a,b,c,h c,e,f,g;a,b,d,h b,e,f,g;a,c,d,h a,e,f,g;b,c,d,h
d,e,f,h;a,b,c,g etc…
d,e,g,h;a,b,c,f etc…
d,f,g,h;a,b,c,e etc…
–Swap 4 (1 permutation):
e,f,g,h;a,b,c,d
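The counts per swap level can be verified by enumeration. The sketch below treats each relabeling as a choice of four samples for group 1 and counts how many of them came from group 2; the function name is hypothetical.

```python
import itertools
from collections import Counter


def count_relabelings_by_swaps(group1, group2):
    """Map 'number of samples swapped in from group 2' to number of relabelings."""
    labels = list(group1) + list(group2)
    original = set(group1)
    counts = Counter()
    for new_group1 in itertools.combinations(labels, len(group1)):
        n_swapped = len(set(new_group1) - original)
        counts[n_swapped] += 1
    return dict(counts)


counts = count_relabelings_by_swaps("abcd", "efgh")
print(counts)
print(sum(counts.values()))
```

Each level k contributes C(4,k) × C(4,k) relabelings (choose k to leave group 1 and k to enter it), which gives the 1, 16, 36, 16, 1 breakdown on the slide.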
SAM confidence
SAM stands for "Significance Analysis of Microarrays." Tusher et al., PNAS
SAM controls the FDR.
–Note: a p-value of 0.05 is considered marginal, but a false-discovery rate as high as 0.50 might even be desirable.
SAM uses a variant of the t-statistic
–Has a fudge factor s_0 (a small positive constant) to limit the effect of high variation at low intensities.
$$d(g) = \frac{x_1(g) - x_2(g)}{s(g) + s_0}$$
where s(g) = standard error for gene g
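A minimal sketch of the d(g) statistic for one gene. It assumes that x_1(g) and x_2(g) are group means and that s(g) is the usual pooled two-sample standard error; both are assumptions about details the slide leaves implicit. The value of s_0 is passed in, since the slide does not say how it is chosen.

```python
import math
import statistics


def sam_d(x1, x2, s0):
    """d(g) = (mean1 - mean2) / (s(g) + s0) for one gene.

    s(g) is taken to be the pooled standard error of the difference in means
    (an assumption; the slide only calls it "standard error for gene g").
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    sum_sq = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = math.sqrt((1 / n1 + 1 / n2) * sum_sq / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


# A gene with no replicate variance: a plain t-statistic would divide by zero,
# while the fudge factor keeps d(g) finite.
print(sam_d([5.0, 5.0], [1.0, 1.0], s0=0.5))
# For an ordinary gene, a larger s0 shrinks d(g) toward zero.
print(sam_d([3.0, 5.0], [1.0, 3.0], s0=0.0), sam_d([3.0, 5.0], [1.0, 3.0], s0=1.0))
```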
The SAM Interface
PaGE
PaGE stands for Patterns from Gene Expression.
–Goal is to compare patterns across more than 2 groups to look at co-regulation. t-statistics are not really applicable to describing co-regulation.
–PaGE was developed by our group at Penn! Manduchi et al., Bioinformatics
PaGE also focuses on the FDR.
–PaGE takes a minimum confidence level as a parameter and finds all genes that exceed this confidence.
–Each gene is reported with its own confidence. FDR = 1 - Confidence
PaGE uses ratios of means: B/A, C/A, D/A
where A, B, C, and D are group means for each gene and A is the reference group.
Use permutations to generate the random distribution of ratios.
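The ratio statistic and the confidence relation are simple enough to write down. The sketch below only illustrates those two definitions for one gene; it is not the PaGE implementation, and the function names and values are hypothetical.

```python
import statistics


def mean_ratios(reference, groups):
    """Ratio of each group's mean to the reference group's mean, for one gene."""
    reference_mean = statistics.mean(reference)
    return {name: statistics.mean(values) / reference_mean
            for name, values in groups.items()}


def confidence_from_fdr(fdr):
    """Confidence as used on the slide: FDR = 1 - Confidence."""
    return 1.0 - fdr


# Made-up replicate values for one gene; A is the reference group.
ratios = mean_ratios(reference=[2.0, 2.0],
                     groups={"B": [4.0, 4.0], "C": [1.0, 1.0], "D": [2.0, 2.0]})
print(ratios)
print(confidence_from_fdr(0.2))
```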
Local Pooled Error
Uses the z-test
–$z = (\mathrm{median}_1 - \mathrm{median}_2) / \sigma_{\mathrm{pooled}}$
–Use medians instead of means
–Variance from pools of genes
Pool genes to combine variance in signals
–Divide genes up into quantiles (e.g., percentiles) based on average (log) expression value over replicated arrays and estimate the variance for genes within each quantile.
–Use the local pooled variance for each gene comparison. With few replicates, this provides a better estimate of the true variance in measurement.
–Assuming a normal distribution, use the z-test to get p-values.
Determine significance using multiple testing correction (B-H FDR).
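A simplified sketch of the pooling idea, assuming equal-sized quantile bins and a pooled standard deviation formed as sqrt(var1/n1 + var2/n2) from the per-bin variances. The published LPE method differs in its details (for example in how the variance curve is smoothed and adjusted for medians), so this is an illustration of the steps above rather than the method itself. All names are hypothetical.

```python
import math
import statistics


def local_pooled_variance(replicates, n_bins):
    """Per-gene variance replaced by the mean variance of its expression-level bin."""
    n_genes = len(replicates)
    means = [statistics.mean(r) for r in replicates]
    variances = [statistics.variance(r) for r in replicates]
    order = sorted(range(n_genes), key=lambda g: means[g])
    pooled = [0.0] * n_genes
    for b in range(n_bins):
        members = order[b * n_genes // n_bins:(b + 1) * n_genes // n_bins]
        if members:
            bin_variance = statistics.mean([variances[g] for g in members])
            for g in members:
                pooled[g] = bin_variance
    return pooled


def lpe_z_scores(group1, group2, n_bins=4):
    """z = (median1 - median2) / pooled SD, one value per gene."""
    v1 = local_pooled_variance(group1, n_bins)
    v2 = local_pooled_variance(group2, n_bins)
    z_scores = []
    for g, (r1, r2) in enumerate(zip(group1, group2)):
        pooled_sd = math.sqrt(v1[g] / len(r1) + v2[g] / len(r2))
        z_scores.append((statistics.median(r1) - statistics.median(r2)) / pooled_sd)
    return z_scores


def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Eight genes, two replicates per group (log scale, made-up values).
group1 = [[float(k), k + 0.2] for k in range(1, 9)]
group2 = [list(r) for r in group1]
group2[7] = [10.0, 10.2]  # only the last gene differs between groups
z = lpe_z_scores(group1, group2)
print([round(v, 2) for v in z])
print(two_sided_p(z[7]))
```

The resulting p-values would then go through the B-H FDR step, as the slide states.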
Differential Expression Analysis
[Diagram: HSC vs. LCP. For each group, MAS 5.0 gives an expression matrix with one row per gene (Gene 1 ... Gene N) and one column per replicate (e(1) ... e(4)). PaGE, SAM, and LPE + BH each produce a gene list (Gene lists 1 to 3). The final gene list is the intersection of the three.]
Results of Differential Expression Analysis
Bone Marrow Sample
–LT-HSC vs. ST-HSC: 13 up-regulated in LT-HSC, 73 up-regulated in ST-HSC
–ST-HSC vs. LCP: 25 up-regulated in ST-HSC, 28 up-regulated in LCP
–LCP vs. MBC: 0 up-regulated in LCP, 219 up-regulated in MBC
Fetal Liver Sample
–HSC vs. LCP: 9 up-regulated in HSC, 1 up-regulated in LCP
–LCP vs. MBC: 108 up-regulated in LCP, 437 up-regulated in MBC
Comparison of current study with the original analysis in Ivanova et al., Science, 2002
Absolute expression analysis
This study:
–Wilcoxon's signed rank test for Presence/Absence call (MAS 5.0)
–Multiple testing adjustment using FDR method
–Consensus call from replicates
Ivanova et al.:
–Empirical Presence/Absence call (MAS 4.0)
–No multiple testing adjustment
–Consensus call from replicates
Differential expression analysis
This study:
–Statistical methods for differential expression analysis
–Consensus result from 3 methods
–Each two successive populations were compared
Ivanova et al.:
–Simple fold change thresholding
–All populations were compared to MBC population
Cluster analysis
This study:
–SOM algorithm to automatically identify patterns
–BM and FL samples were analyzed separately
Ivanova et al.:
–Assignment to biologically predefined patterns using simple correlation
–BM and FL samples were analyzed together
More Information on Affymetrix
The 2003 Affymetrix GeneChip Microarray Low-Level Workshop
–workshop.affx
–Many slide presentations and posters available for download.
Bioconductor
–Open source effort using the R statistical packages.
Cluster Analysis (analysis 3)
Goal: look for genes with similar expression patterns across different hematopoietic populations.
Method: Self-Organizing Maps (SOM)
Distance measure: 1 - Pearson correlation coefficient
Gene selection: analysis of variance (ANOVA) to select informative genes with significant expression level change across populations.
Table: Number of genes selected for having significant expression level change over populations in different samples
Columns: BM, FL
Rows: MG-U74Av, MG-U74Bv, MG-U74Cv
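The distance measure named above can be sketched as follows for two expression profiles across the same populations. The profiles are made-up values, the function name is hypothetical, and the SOM training itself is not shown.

```python
import math


def pearson_distance(profile_a, profile_b):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    covariance = sum((a - mean_a) * (b - mean_b)
                     for a, b in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum((a - mean_a) ** 2 for a in profile_a))
    norm_b = math.sqrt(sum((b - mean_b) ** 2 for b in profile_b))
    return 1.0 - covariance / (norm_a * norm_b)


# Expression across LT-HSC, ST-HSC, LCP, MBC (illustrative numbers only).
rising = [1.0, 2.0, 3.0, 4.0]
rising_scaled = [10.0, 20.0, 30.0, 40.0]
falling = [4.0, 3.0, 2.0, 1.0]
print(pearson_distance(rising, rising_scaled))
print(pearson_distance(rising, falling))
```

Because the correlation ignores scale, genes with the same pattern at different absolute levels end up close together, which is the property wanted when clustering by expression pattern.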
Figure: Plot of expression profiles for genes in one cluster generated by SOM analysis for bone marrow sample. The centroid is indicated by the red line.
Clusters for Bone Marrow Sample