Differential Expression and False Discovery Rate: Revisiting the Princeton Stem Cell Data. GPBA Workshop, Oct. 15, 2003.



Gene Expression Analysis of Hematopoietic Stem Cells and Committed Progenitors
Hongxian He (1), Gregory Grant (1), Lyle Ungar (2), Natalia B. Ivanova (3), Ihor R. Lemischka (3), Chris Stoeckert (1)
1. Center for Bioinformatics, University of Pennsylvania
2. School of Engineering and Applied Science, University of Pennsylvania
3. Department of Molecular Biology, Princeton University

Model for Hematopoiesis
Hematopoiesis: the process in which blood cells are formed from hematopoietic stem cells (HSC).
HSC properties:
– pluripotent: they can differentiate into a number of different blood cell types.
– self-renewing: they can divide to replenish themselves in their pluripotent (undifferentiated) state.

Sample Preparation: Cell Purification
[Figure: purification scheme. The hematopoietic fractions from mouse fetal liver and from mouse adult bone marrow are sorted on surface markers (Lin, AA4.1, c-Kit, Sca-1, Rho low/high) into HSC (LT-HSC and ST-HSC for bone marrow), LCP and MBC populations.]
LT-HSC: long-term hematopoietic stem cell
ST-HSC: short-term hematopoietic stem cell
LCP: lineage-committed progenitor
MBC: mature blood cell

Dataset
Each hematopoietic population was hybridized to Affymetrix MG-U74A, B and C v2 chips. All populations have 2 biological replicates, except for the bone marrow LT-HSC and ST-HSC populations, which have 4 replicates each. Available at
Questions:
– What genes are expressed in each population?
– What genes are differentially expressed between any two successive populations?
– What is the minimum set of genes whose expression patterns are limited to and correlated with a specific population?
– What are the genes that play functional roles in regulating HSC proliferation and differentiation?
– What is the set of genes that are shared (and/or different) among different datasets using the same analysis approach?

Absolute Expression Analysis (analysis 1)
Our goal is to identify genes expressed in HSCs and successive populations and to know the degree of confidence in the results.
[Figure: workflow. For each replicate (e.g., FL HSC sample 1 and FL HSC sample 2), MAS 5.0 gives a p-value for every gene (Gene 1 ... Gene N). The B-H FDR method is applied to each replicate's p-values to produce gene list 1 and gene list 2. The final gene list is the intersection of the two.]

MAS 5.0 P/A Calls
Now based on Wilcoxon’s one-sided signed rank test
– Take the differences between PM and MM (16 differences).
– Rank the absolute differences from smallest to largest (1-16). Sign the ranks according to whether the differences were positive or negative. If there is no real difference, expect the sum of positive ranks to be about equal to the sum of negative ranks (about 68 each, since the ranks 1 to 16 sum to 136). If really different, expect the sum of negative ranks to be close to 0 or 1.
– The p-value (from a table) of seeing a smaller sum of ranks is used to make the calls: P (present) for p-values below the lower threshold, M (marginal) for p-values between that threshold and 0.06, A (absent) for p > 0.06.
Note: Making about 12,000 such calls per chip.
– The probability of making a false present call is 0.05 for the test on a single gene, but with 12,000 tests there will be about 600 spurious calls just by chance!
– This is a multiple testing correction issue.
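The ranking steps above can be sketched in a few lines. This is an illustration only, not the MAS 5.0 implementation: the function names are hypothetical, the exact p-value assumes no tied or zero differences, and the lower call threshold of 0.04 is an assumed value (only the 0.06 threshold appears in the slide).

```python
def signed_rank_sums(pm, mm):
    """Return (sum of positive ranks, sum of negative ranks) for PM - MM."""
    diffs = [p - m for p, m in zip(pm, mm) if p != m]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Group tied absolute differences and give them their average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def exact_lower_tail_p(w_minus, n):
    """P(sum of negative ranks <= w_minus) under the null, for n untied pairs."""
    # counts[s] = number of subsets of the ranks {1..n} whose sum is s.
    counts = [0] * (n * (n + 1) // 2 + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(len(counts) - 1, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[: int(w_minus) + 1]) / 2 ** n


def detection_call(p, lower=0.04, upper=0.06):
    """Map a p-value to P/M/A; `lower` is an assumed threshold."""
    if p < lower:
        return "P"
    if p <= upper:
        return "M"
    return "A"


# Hypothetical probe set in which every PM exceeds its MM.
pm = [201 + i for i in range(16)]
mm = [200] * 16
w_plus, w_minus = signed_rank_sums(pm, mm)
p_value = exact_lower_tail_p(w_minus, len(pm))
print(w_plus, w_minus, p_value, detection_call(p_value))
```

When all 16 differences are positive, the negative rank sum is 0 and only one of the 2^16 equally likely sign patterns is that extreme, which is why the p-value is so small.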

Multiple Testing Correction
Family-wise Error Rate (FWER)
– Control the probability of making any false positive call at the desired significance level.
– Conservative methods such as the Bonferroni correction: divide the significance level by the number of calls (or genes). For 12,000 genes, the p-value threshold for each gene becomes 0.05/12,000 to assure that the probability of making any false present call is at most 0.05.
False Discovery Rate (FDR)
– FDR = expected (# false predictions / # total predictions)
– Control the proportion of false positive calls among all positive calls at the desired level. This shifts the focus to the predicted positives and accepts that some will be wrong. An FDR of 0.05 means that out of 100 predicted positives, about 5 are expected to be wrong. The FWER will be very high, but that is not important here.
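The arithmetic behind the two error rates can be checked directly. This is a small sketch using the 12,000 calls and the 0.05 level quoted above; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


def expected_false_positives(per_test_alpha, n_tests):
    """Expected number of false calls if every null test uses per_test_alpha."""
    return per_test_alpha * n_tests


def false_discovery_proportion(n_false, n_called):
    """Observed share of wrong calls among all positive calls."""
    return n_false / n_called if n_called else 0.0


print(bonferroni_threshold(0.05, 12000))       # per-gene threshold under Bonferroni
print(expected_false_positives(0.05, 12000))   # spurious calls with no correction
print(false_discovery_proportion(5, 100))      # 5 wrong calls out of 100 predictions
```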

Benjamini-Hochberg FDR
– Step-up method: sort the p-values in decreasing order and evaluate each one until the first success meeting the cut-off; that p-value and all smaller ones pass.
– Cut-off of 0.02 FDR (only 2 expected false positives in 100 predictions). For A chip (mostly known genes), this meant a cutoff of around
– 3000 to 4000 genes pass as expressed
– 60 to 80 of these are wrong (i.e., about of 12,000)
FDR chosen to maximize the number of genes passing the cutoff yet keeping the number of false calls small.
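A minimal sketch of the step-up rule described above, written in plain Python. The function name and the example p-values are hypothetical; the rule compares the p-value at ascending rank k with k·q/m, where q is the FDR level and m the number of tests.

```python
def benjamini_hochberg(pvalues, q):
    """Return a list of booleans: True where the test passes at FDR level q."""
    m = len(pvalues)
    # Walk from the largest p-value down, as in the step-up description.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    for position, index in enumerate(order):
        k = m - position  # rank of this p-value in ascending order
        if pvalues[index] <= k * q / m:
            # First success: this p-value and every smaller one pass.
            passed = set(order[position:])
            return [i in passed for i in range(m)]
    return [False] * m


# Hypothetical p-values for ten genes.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(example, q=0.05)
print(sum(calls), "of", len(example), "pass")
```

In the workflow described earlier, a function like this would be applied to each replicate's p-values separately, and the passing gene lists would then be intersected.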

Results of Absolute Expression Analysis
[Table 1. Number of genes expressed in each population from bone marrow. Columns: mBM.LTHSC, mBM.STHSC, mBM.LCP, mBM.MBC. Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total.]
[Table 2. Number of genes expressed in each population from fetal liver. Columns: mFL.HSC, mFL.LCP, mFL.MBC. Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total.]

Results of Absolute Expression Analysis (cont.)
[Table 3. Number of genes uniquely expressed in each population from bone marrow. Columns: mBM.LTHSC, mBM.STHSC, mBM.LCP, mBM.MBC. Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total.]
[Table 4. Number of genes uniquely expressed in each population from fetal liver. Columns: mFL.HSC, mFL.LCP, mFL.MBC. Rows: MG-U74Av, MG-U74Bv, MG-U74Cv, Total.]

Differential Expression Analysis (analysis 2)
Bone Marrow Sample: LT-HSC, ST-HSC, LCP, MBC
Fetal Liver Sample: HSC, LCP, MBC
Our goal is to identify genes differentially expressed between HSCs and successive populations and to know the degree of confidence in the results.

Differential Expression Analysis
[Figure: workflow for LT-HSC vs. ST-HSC. MAS 5.0 gives expression estimates e(1) to e(4) for Gene 1 ... Gene N in each of the two populations (4 replicates each). PaGE, SAM and LPE + BH each produce a gene list (gene lists 1, 2 and 3); the final gene list is their intersection.]

MAS 5.0 Expression Level Estimate
Goals were to replace AvgDiff with a metric that produces only positive values and is little affected by outliers.
Tukey’s biweight and imputation of stray signal
– Substitute negative PM-MM values with predicted values based on other probe pairs.
– Instead of a straight average, use an average of values weighted by distance from the median. (This can be done iteratively, but only one step is done.) Those furthest from the median contribute the least to the average.
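The weighted average described above can be sketched as a one-step Tukey biweight. The tuning constant `c = 5` and the small `eps` that avoids division by zero are assumptions made for this illustration, and the function name is hypothetical; this is not the MAS 5.0 code itself.

```python
from statistics import median


def one_step_tukey_biweight(values, c=5.0, eps=1e-4):
    """Average of values, down-weighting those far from the median."""
    center = median(values)
    # Median absolute deviation sets the scale for "far from the median".
    mad = median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)
        # Bisquare weight: 1 at the median, falling to 0 at |u| >= 1.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical probe-pair values with one outlier.
signal = [1.0, 2.0, 3.0, 4.0, 100.0]
print(one_step_tukey_biweight(signal))   # the outlier gets zero weight
print(sum(signal) / len(signal))         # the plain mean is dragged upward
```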

Differential Expression Methods Used
– PaGE (Patterns from Gene Expression)
– SAM (Significance Analysis of Microarrays)
– LPE (Local Pooled-Error model test)

FDR Using Permutations
An alternative to the B-H FDR is to use a permutation method.
A single-gene statistic is chosen
– for PaGE it is simply the ratio of expression levels
– for SAM it is a variant of the t-statistic.
The experiments are permuted between the groups and a permutation estimate of the null distribution of the statistic is compiled.
This distribution is then used to estimate what cutoff value for the single-gene statistic will achieve the desired FDR, based on the number of genes expected to pass the cut-off by chance and the observed number of genes that pass the cut-off.
[Figure: permuted vs. observed distributions of the statistic.]
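The ratio described above (genes expected to pass by chance divided by genes observed to pass) can be sketched as follows. This is a simplified illustration rather than the PaGE or SAM procedure; the function names and the example statistics are hypothetical.

```python
def permutation_fdr(observed, permuted, cutoff):
    """Estimate the FDR at a cutoff.

    observed: per-gene statistics from the real group labels.
    permuted: list of per-gene statistic lists, one per permutation.
    """
    n_observed = sum(1 for s in observed if s >= cutoff)
    if n_observed == 0:
        return 0.0
    # Average number of genes passing the cutoff across permutations.
    by_chance = sum(sum(1 for s in perm if s >= cutoff) for perm in permuted) / len(permuted)
    return min(1.0, by_chance / n_observed)


def smallest_cutoff_for_fdr(observed, permuted, target_fdr):
    """Scan candidate cutoffs upward and return the first meeting the target."""
    for cutoff in sorted(set(observed)):
        if permutation_fdr(observed, permuted, cutoff) <= target_fdr:
            return cutoff
    return None


# Hypothetical statistics for five genes and two permutations.
observed_stats = [5.0, 4.0, 3.0, 1.0, 0.5]
permuted_stats = [[1.2, 0.8, 0.3, 0.2, 0.1], [3.5, 0.9, 0.4, 0.2, 0.1]]
print(permutation_fdr(observed_stats, permuted_stats, 3.0))
print(smallest_cutoff_for_fdr(observed_stats, permuted_stats, 0.2))
```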

Using Permutations to Get p-values
Permutations randomize associations
– Use these associations to recalculate the statistics for each permutation.
– Find the likelihood of the observed statistic based on the distribution of statistics from the permuted samples.
p-value = fraction of statistics generated from the permutations that are greater than or equal to the observed statistic.
Permuting columns preserves any gene dependencies but scrambles samples.
Example: permute columns to c,d,e,f;a,b,g,h
– Original: LT-HSC = a b c d; ST-HSC = e f g h
– Permuted: LT-HSC = c d e f; ST-HSC = a b g h

Example of Permutation for Eight Samples in Two Groups (4, 4)
Step 1: permute the sample columns (e.g., swap sample a in group 1 with sample e in group 2). Calculate the t-statistic. Note: permuting columns avoids the assumption of gene independence. This is important when considering more than one gene.
Step 2: repeat step 1 for all possible permutations (70 in this case).
Step 3: use the 70 t-statistics to get the distribution.
Step 4: compare the starting t-statistic to the distribution to get the p-value (the fraction of permuted t-statistics at least as large as the observed t-statistic).
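The four steps can be sketched for a single gene as follows. The t-statistic here is the unequal-variance (Welch) form, which is an assumption made for this illustration, and the function names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def t_statistic(group1, group2):
    """Welch-style t-statistic; assumes each group has non-zero variance."""
    standard_error = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / standard_error


def permutation_p_value(values, n1):
    """One-sided p-value from all ways of assigning n1 samples to group 1."""
    observed = t_statistic(values[:n1], values[n1:])
    indices = range(len(values))
    permuted = []
    for chosen in combinations(indices, n1):
        group1 = [values[i] for i in chosen]
        group2 = [values[i] for i in indices if i not in chosen]
        permuted.append(t_statistic(group1, group2))
    # Fraction of regroupings whose statistic is at least the observed one.
    p_value = sum(1 for t in permuted if t >= observed) / len(permuted)
    return p_value, len(permuted)


# Hypothetical expression values: samples a-d (group 1), then e-h (group 2).
expression = [10.0, 11.0, 12.0, 13.0, 1.0, 2.0, 3.0, 4.0]
p, n_permutations = permutation_p_value(expression, 4)
print(p, n_permutations)
```

With only 70 distinct regroupings, the smallest attainable p-value is 1/70, which is one reason the methods in this talk pool information across genes.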

Generating Permutations
4, 4 (a,b,c,d;e,f,g,h) gives 70 permutations
– No swap (1): a,b,c,d;e,f,g,h
– Swap 1 (16): b,c,d,e;a,f,g,h  a,c,d,e;b,f,g,h  a,b,d,e;c,f,g,h  a,b,c,e;d,f,g,h; b,c,d,f;a,e,g,h etc…; b,c,d,g;a,e,f,h etc…; b,c,d,h;a,e,f,g etc…
– Swap 2 (36): c,d,e,f;a,b,g,h  b,d,e,f;a,c,g,h  b,c,e,f;a,d,g,h  a,d,e,f;b,c,g,h  a,c,e,f;b,d,g,h  a,b,e,f;c,d,g,h; c,d,e,g;a,b,f,h etc…; c,d,e,h;a,b,f,g etc…; c,d,f,g;a,b,e,h etc…; c,d,f,h;a,b,e,g etc…; c,d,g,h;a,b,e,f etc…
– Swap 3 (16): d,e,f,g;a,b,c,h  c,e,f,g;a,b,d,h  b,e,f,g;a,c,d,h  a,e,f,g;b,c,d,h; d,e,f,h;a,b,c,g etc…; d,e,g,h;a,b,c,f etc…; d,f,g,h;a,b,c,e etc…
– Swap 4 (1): e,f,g,h;a,b,c,d
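The counts per number of swaps (1, 16, 36, 16, 1) can be reproduced by enumerating every way of choosing four of the eight samples for the first group. A short sketch, with hypothetical variable names:

```python
from collections import Counter
from itertools import combinations

samples = "abcdefgh"
original_group1 = set("abcd")

assignments = []
swap_counts = Counter()
for new_group1 in combinations(samples, 4):
    new_group2 = tuple(s for s in samples if s not in new_group1)
    assignments.append((new_group1, new_group2))
    # A "swap" is a sample in the new group 1 that came from the original group 2.
    n_swapped = sum(1 for s in new_group1 if s not in original_group1)
    swap_counts[n_swapped] += 1

print(len(assignments))             # total number of distinct regroupings
print(sorted(swap_counts.items()))  # regroupings per number of swaps
```

Each count is the number of ways to pick the samples leaving group 1 times the number of ways to pick their replacements from group 2, so k swaps give C(4,k) × C(4,k) regroupings.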

SAM confidence
SAM stands for “Significance Analysis of Microarrays.” Tusher et al PNAS
SAM controls the FDR.
– Note: A p-value of .05 is considered marginal. But a false-discovery rate as high as .50 might even be desirable.
SAM uses a variant of the t-statistic
– Has a fudge factor s_0 (a small positive constant) to limit the effect of high variation at low intensities.

$$d(g) = \frac{\bar{x}_1(g) - \bar{x}_2(g)}{s(g) + s_0}$$

where $\bar{x}_1(g)$ and $\bar{x}_2(g)$ are the mean expression levels of gene g in the two groups and s(g) is the standard error for gene g.
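A minimal sketch of the statistic for one gene. The pooled standard error used for s(g) and the value passed as `s0` are assumptions made for this illustration (SAM itself chooses the fudge factor from the data), and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean


def sam_d(x1, x2, s0):
    """Relative difference d(g) = (mean1 - mean2) / (s(g) + s0) for one gene."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Pooled sum of squared deviations across both groups.
    sum_squares = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = sqrt((1 / n1 + 1 / n2) * sum_squares / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


# Hypothetical expression values for one gene in two groups.
group_a = [5.0, 6.0, 7.0, 8.0]
group_b = [1.0, 2.0, 3.0, 4.0]
print(sam_d(group_a, group_b, s0=0.0))  # ordinary t-like statistic
print(sam_d(group_a, group_b, s0=1.0))  # shrunk toward zero by the fudge factor
```

Because s_0 is added to every gene's denominator, genes with a very small s(g) (often those at low intensity) no longer get inflated statistics.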

The SAM Interface

PaGE
PaGE stands for Patterns from Gene Expression.
– Goal is to compare patterns across more than 2 groups to look at co-regulation. t-statistics are not really applicable to describing co-regulation.
– PaGE was developed by our group at Penn! Manduchi et al. Bioinformatics
PaGE also focuses on the FDR.
– PaGE takes a minimum confidence level as a parameter, and finds all genes which exceed this confidence.
– Each gene is reported with its own confidence. FDR = 1 - Confidence
PaGE uses ratios of means: B/A, C/A, D/A, where A, B, C and D are group means for each gene and A is the reference group.
Use permutations to generate the random distribution of ratios.

Local Pooled Error
Uses the z-test
– $z = (\mathrm{median}_1 - \mathrm{median}_2) / \sigma_{\mathrm{pooled}}$
– Use medians instead of means.
– Variance from pools of genes.
Pool genes to combine variance in signals
– Divide genes up into quantiles (e.g., percentiles) based on average (log) expression value over replicated arrays and estimate the variance for genes within each quantile.
– Use the local pooled variance for each gene comparison. With few replicates, this provides a better estimate of the true variance in measurement.
– Assuming a normal distribution, use the z-test to get p-values.
Determine significance using multiple testing correction (B-H FDR).
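The pooling idea and the z-test can be sketched as follows. This is a simplified illustration of the steps listed above, not the published LPE implementation: the equal-sized bins, the two-sided p-value and all function names are assumptions.

```python
from math import erfc, sqrt
from statistics import mean


def pool_variances(avg_expression, variances, n_bins):
    """Replace each gene's variance with the mean variance of its intensity bin."""
    order = sorted(range(len(avg_expression)), key=lambda i: avg_expression[i])
    pooled = [0.0] * len(order)
    for b in range(n_bins):
        # Genes are split into n_bins groups of (nearly) equal size by intensity.
        members = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if members:
            local = mean([variances[i] for i in members])
            for i in members:
                pooled[i] = local
    return pooled


def lpe_z(median1, median2, sigma_pooled):
    """z = (median1 - median2) / sigma_pooled, as in the formula above."""
    return (median1 - median2) / sigma_pooled


def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return erfc(abs(z) / sqrt(2))


# Hypothetical per-gene average log expression and raw variances.
pooled = pool_variances([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 10.0, 20.0], n_bins=2)
print(pooled)
print(two_sided_p(lpe_z(8.0, 6.0, 0.5)))
```

The resulting p-values would then go through the B-H correction, as stated in the last line of the slide.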

Differential Expression Analysis
[Figure: workflow for HSC vs. LCP. MAS 5.0 gives expression estimates e(1) to e(4) for Gene 1 ... Gene N in each of the two populations. PaGE, SAM and LPE + BH each produce a gene list (gene lists 1, 2 and 3); the final gene list is their intersection.]

Results of Differential Expression Analysis
Bone Marrow Sample
– LT-HSC vs. ST-HSC: 13 up-regulated in LT-HSC, 73 up-regulated in ST-HSC
– ST-HSC vs. LCP: 25 up-regulated in ST-HSC, 28 up-regulated in LCP
– LCP vs. MBC: 0 up-regulated in LCP, 219 up-regulated in MBC
Fetal Liver Sample
– HSC vs. LCP: 9 up-regulated in HSC, 1 up-regulated in LCP
– LCP vs. MBC: 108 up-regulated in LCP, 437 up-regulated in MBC

Comparison of current study with the original analysis in Ivanova et al., Science, 2002

Absolute expression analysis
This study:
- Wilcoxon’s signed rank test for Presence/Absence call (MAS 5.0)
- Multiple testing adjustment using FDR method
- Consensus call from replicates
Ivanova et al.:
- Empirical Presence/Absence call (MAS 4.0)
- No multiple testing adjustment
- Consensus call from replicates

Differential expression analysis
This study:
- Statistical methods for differential expression analysis
- Consensus result from 3 methods
- Each two successive populations were compared
Ivanova et al.:
- Simple fold change thresholding
- All populations were compared to MBC population

Cluster analysis
This study:
- SOM algorithm to automatically identify patterns
- BM and FL samples were analyzed separately
Ivanova et al.:
- Assignment to biologically predefined patterns using simple correlation
- BM and FL samples were analyzed together

More Information on Affymetrix
The 2003 Affymetrix GeneChip Microarray Low-Level Workshop – workshop.affx
– Many slide presentations and posters available for download.
Bioconductor –
– Open source effort using the R statistical packages.

Cluster Analysis (analysis 3)
Goal: look for genes with similar expression patterns across different hematopoietic populations.
Method: Self-Organizing Maps (SOM)
Distance measure: (1 – Pearson correlation coefficient)
Gene selection: analysis of variance (ANOVA) to select informative genes with significant expression level change across populations.
[Table: Number of genes selected for having significant expression level change over populations in different samples. Columns: BM, FL. Rows: MG-U74Av, MG-U74Bv, MG-U74Cv.]
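The distance measure named above can be sketched as follows. The function name and the example profiles are hypothetical; a profile that is constant across populations has no defined correlation and is not handled here.

```python
from math import sqrt


def pearson_distance(x, y):
    """Distance = 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - covariance / (spread_x * spread_y)


# Hypothetical profiles across four populations.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [10.0, 20.0, 30.0, 40.0]  # same shape, different scale
gene_3 = [4.0, 3.0, 2.0, 1.0]      # opposite trend
print(pearson_distance(gene_1, gene_2))
print(pearson_distance(gene_1, gene_3))
```

Because the measure depends only on the shape of the profile, genes with the same pattern at different absolute expression levels end up close together, which suits the goal of finding shared patterns across populations.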

Figure: Plot of expression profiles for genes in one cluster generated by SOM analysis for bone marrow sample. The centroid is indicated by the red line.

Clusters for Bone Marrow Sample