Multiple Testing in Microarray Data Analysis Mi-Ok Kim.


Outline
1. Hypothesis Testing
2. Issues in Multiple Testing in Microarray Analysis
   1) Type I Error
   2) Power
   3) P-values
3. Permutation

1. Hypothesis Testing
$H_0$: null hypothesis vs. $H_1$: alternative hypothesis
$T$: test statistic; $C$: critical value
If $|T| > C$, $H_0$ is rejected. Otherwise $H_0$ is retained.
Ex) $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \neq \mu_2$, with $T = (\bar{x}_1 - \bar{x}_2) / \text{pooled SE}$
If $|T| > z_{1-\alpha/2}$, $H_0$ is rejected at the significance level $\alpha$; here $C = z_{1-\alpha/2}$.

1. Hypothesis Testing
                 Result
Truth      Retained         Rejected
$H_0$      correct          Type I error
$H_1$      Type II error    correct
Type I error rate = probability of a false positive ($\alpha$: significance level)
Type II error rate = probability of a false negative
Power = 1 - Type II error rate
P-value: $p = \inf\{\alpha \mid H_0 \text{ is rejected at the significance level } \alpha\}$

2. Issues in Multiple Comparison
Q: Given $n$ treatments, which pairs of treatments are significantly different? (simultaneous testing)
cf) Is treatment A different from treatment B? (a single test)
Ex) $n$ treatment means $\mu_1, \dots, \mu_n$
$H_{ij}: \mu_i = \mu_j$ where $i \neq j$, with $T_{ij} = (\bar{x}_i - \bar{x}_j) / \text{pooled SE}$
Type I error when $k$ independent comparisons are each tested at the 0.05 significance level one by one: $1 - (0.95)^k$
Inflated Type I error, ex) $k = 10$: $1 - (0.95)^{10} \approx 0.40$
Remedy: Bonferroni method, which tests each comparison at the level $\alpha / (\text{number of comparisons})$
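The inflation and the Bonferroni remedy can be checked numerically. This is a minimal sketch that assumes the comparisons are independent; the function names are illustrative and not from the slides.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection among n_tests independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(alpha: float, n_tests: int) -> float:
    """Per-comparison level used by the Bonferroni method."""
    return alpha / n_tests


# Ten comparisons, each tested at 0.05 with no correction.
inflated = familywise_error(0.05, 10)

# Same ten comparisons, each tested at the Bonferroni level 0.05 / 10.
per_test = bonferroni_level(0.05, 10)
controlled = familywise_error(per_test, 10)

print(f"uncorrected: {inflated:.4f}")
print(f"Bonferroni per-test level: {per_test:.4f}, family-wise: {controlled:.4f}")
```

With the correction, the family-wise rate stays just under the nominal 0.05.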

3. Issues in Multiple Testing in Microarray Analysis
Goal: the identification of differentially expressed genes.
Ex) a study of differential gene expression in tumor biopsy specimens from leukemia patients (ALL / AML) that includes 6,817 genes and 30 samples
Data matrix: rows = genes ($m$), columns = samples ($n$)
$H_j$: the $j$th gene is not differentially expressed
Simultaneously test the $m$ null hypotheses $H_j$, $j = 1, \dots, m$, to determine which hypotheses to reject while controlling a suitably defined Type I error rate and maximizing power.

3-1) Type I Error Rates
                 Result
Truth      # retained    # rejected    Total
$H_0$      $U$           $V$           $m_0$
$H_1$      $T$           $S$           $m_1$
Total      $m - R$       $R$           $m$
Per-comparison error rate (PCER) = $E(V) / m$
Per-family error rate (PFER) = $E(V)$
Family-wise error rate (FWER) = $\Pr(V \geq 1)$
False discovery rate (FDR) = $E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$

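3-1) Type I Error Rates
Under the complete null hypothesis, each $H_j$ has Type I error rate $\alpha_j$.
PCER = $E(V) / m = (\alpha_1 + \dots + \alpha_m) / m$
PFER = $E(V) = \alpha_1 + \dots + \alpha_m$
FWER = $\Pr(V \geq 1) = 1 - \Pr(\text{no } H_j,\ j = 1, \dots, m, \text{ is rejected})$
FDR = $E(V / R)$ = FWER, because $V = R$ under the complete null, so $Q = 1$ whenever $R > 0$
PCER = $(\alpha_1 + \dots + \alpha_m) / m \leq \max(\alpha_1, \dots, \alpha_m) \leq$ FWER = FDR $\leq$ PFER = $\alpha_1 + \dots + \alpha_m$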

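3-1) Type I Error Rate
Assume $H_j$, $j = 1, \dots, m$, with test statistics $T_j$, $j = 1, \dots, m$, which have a multivariate normal distribution with mean $\mu = (\mu_1, \dots, \mu_m)$ and identity covariance matrix, so the tests are independent.
Order the hypotheses so that $H_1, \dots, H_{m_0}$ are the true nulls.
Let $R_j = I(H_j \text{ is rejected})$, let $r_j$ be the observed value of $R_j$, and let $\alpha_j = \Pr(R_j = 1)$.
PFER = $\sum_{j=1}^{m_0} \alpha_j$
PCER = $\sum_{j=1}^{m_0} \alpha_j / m$
FWER = $1 - \prod_{j=1}^{m_0} (1 - \alpha_j)$
FDR = $\sum_{r_1=0}^{1} \cdots \sum_{r_m=0}^{1} \frac{\sum_{j=1}^{m_0} r_j}{\sum_{j=1}^{m} r_j} \prod_{j=1}^{m} \alpha_j^{r_j} (1 - \alpha_j)^{1 - r_j}$, with the convention $0/0 = 0$
Under the complete null ($m_0 = m$) the sums and products run over all $m$ hypotheses.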
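For a handful of independent tests, the four rates can be evaluated exactly by enumerating every rejection pattern $(r_1, \dots, r_m)$. The sketch below assumes that the first `m0` hypotheses are the true nulls and that `alphas[j]` is the rejection probability of hypothesis $j$; the function name and example values are illustrative only.

```python
from itertools import product
from math import prod


def error_rates(alphas, m0):
    """Exact PCER, PFER, FWER and FDR for independent tests.

    alphas[j] is Pr(H_j is rejected); hypotheses 0..m0-1 are the true nulls
    (an ordering assumed for this sketch).
    """
    m = len(alphas)
    pfer = sum(alphas[:m0])
    pcer = pfer / m
    fwer = 1.0 - prod(1.0 - a for a in alphas[:m0])
    fdr = 0.0
    for r in product((0, 1), repeat=m):      # every rejection pattern
        n_rejected = sum(r)
        if n_rejected == 0:
            continue                          # Q = 0 when nothing is rejected
        prob = prod(a if rj else 1.0 - a for a, rj in zip(alphas, r))
        fdr += sum(r[:m0]) / n_rejected * prob
    return {"PCER": pcer, "PFER": pfer, "FWER": fwer, "FDR": fdr}


# Complete null: four true nulls, each rejected with probability 0.05.
complete_null = error_rates([0.05] * 4, m0=4)

# Two true nulls plus two false nulls that are rejected with probability 0.8.
mixed = error_rates([0.05, 0.05, 0.8, 0.8], m0=2)

print(complete_null)
print(mixed)
```

Under the complete null the FDR and FWER coincide. Once some hypotheses are false, the FDR drops below the FWER because true rejections dilute the false ones.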

3-2) Strong vs. Weak Control
Expectations and probabilities are conditional on which hypotheses are true.
Strong control: control of the Type I error rate under any combination of true and false hypotheses, i.e. for any value of $m_0$
Weak control: control of the Type I error rate only when all the null hypotheses are true, i.e. under the complete null hypothesis $\bigcap_{j=1}^{m} H_j$
In the microarray setting, where it is very unlikely that no genes are differentially expressed, strong control of the Type I error rate is particularly important.

3-3) Power
Within the class of multiple testing procedures that control a given Type I error rate at an acceptable level, maximize power, that is, minimize a suitably defined Type II error rate.
Any-pair power: $\Pr(S \geq 1)$ = the probability of rejecting at least one false null hypothesis
Per-pair power: average power = $E(S) / m_1$
All-pair power: $\Pr(S = m_1)$ = the probability of rejecting all false null hypotheses
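The three power definitions can differ a lot in practice. The Monte Carlo sketch below makes several assumptions that are not in the slides: only the $m_1$ false nulls are simulated, each test statistic is an independent $N(\text{effect}, 1)$ draw, and each test uses an unadjusted two-sided level `alpha`.

```python
import random
from statistics import NormalDist


def simulate_power(m1=5, effect=2.5, alpha=0.05, n_sim=2000, seed=0):
    """Estimate any-pair, per-pair (average) and all-pair power by simulation."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided z critical value
    any_hits = 0
    all_hits = 0
    total_s = 0
    for _ in range(n_sim):
        # S = number of false nulls rejected in this simulated experiment.
        s = sum(abs(rng.gauss(effect, 1.0)) > cutoff for _ in range(m1))
        any_hits += s >= 1
        all_hits += s == m1
        total_s += s
    return {
        "any_pair": any_hits / n_sim,
        "per_pair": total_s / (n_sim * m1),
        "all_pair": all_hits / n_sim,
    }


power = simulate_power()
print(power)
```

In every simulated experiment $I(S = m_1) \leq S / m_1 \leq I(S \geq 1)$, so the estimates always satisfy all-pair $\leq$ per-pair $\leq$ any-pair.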

3-4) Multiple Testing Procedures Based on P-values That Control the Family-wise Error Rate
For a single hypothesis $H_1$: $p_1 = \inf\{\alpha \mid H_1 \text{ is rejected at the significance level } \alpha\}$
If $p_1 < \alpha$, $H_1$ is rejected. Otherwise $H_1$ is retained.
Adjusted p-values for multiple testing ($p^*$): $p_j^* = \inf\{\alpha \mid H_j \text{ is rejected at FWER} = \alpha\}$
If $p_j^* < \alpha$, $H_j$ is rejected. Otherwise $H_j$ is retained.
Three types: single-step, step-down and step-up procedures

3-4-1) Single-Step Procedure
For strong control of FWER:
single-step Bonferroni adjusted p-values: $p_j^* = \min(m p_j, 1)$
single-step Sidak adjusted p-values: $p_j^* = 1 - (1 - p_j)^m$
For weak control of FWER:
single-step minP adjusted p-values: $p_j^* = \Pr(\min_{1 \leq k \leq m} P_k \leq p_j \mid \text{complete null})$
single-step maxT adjusted p-values: $p_j^* = \Pr(\max_{1 \leq k \leq m} |T_k| \geq C_j \mid \text{complete null})$, where $C_j = |t_j|$ is the observed statistic
Under subset pivotality, weak control = strong control
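The two closed-form single-step adjustments are one-liners. A minimal sketch follows, with illustrative function names and made-up raw p-values; the minP and maxT versions need the joint null distribution and are not covered here.

```python
def bonferroni_adjust(pvalues):
    """Single-step Bonferroni adjusted p-values: min(m * p_j, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def sidak_adjust(pvalues):
    """Single-step Sidak adjusted p-values: 1 - (1 - p_j)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


raw = [0.001, 0.012, 0.03, 0.2, 0.6]   # hypothetical unadjusted p-values
bonf = bonferroni_adjust(raw)
sidak = sidak_adjust(raw)

for p, b, s in zip(raw, bonf, sidak):
    print(f"raw={p:.3f}  Bonferroni={b:.4f}  Sidak={s:.4f}")
```

The Sidak values are never larger than the Bonferroni ones, so Bonferroni is the slightly more conservative of the two.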

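3-4-2) Step-Down Procedure
Order the observed unadjusted p-values such that $p_{r_1} \leq p_{r_2} \leq \dots \leq p_{r_m}$, and let $H_{r_1}, H_{r_2}, \dots, H_{r_m}$ be the correspondingly ordered hypotheses.
Holm's procedure: $j^* = \min\{ j \mid p_{r_j} > \alpha / (m - j + 1) \}$; reject $H_{r_j}$ for $j = 1, \dots, j^* - 1$
Adjusted step-down p-values:
Holm: $p_{r_j}^* = \max_{k = 1, \dots, j} \{ \min( (m - k + 1)\, p_{r_k}, 1) \}$
Sidak: $p_{r_j}^* = \max_{k = 1, \dots, j} \{ 1 - (1 - p_{r_k})^{m - k + 1} \}$
minP: $p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \min_{l \in \{r_k, \dots, r_m\}} P_l \leq p_{r_k} \mid \text{complete null}) \}$
maxT: $p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \max_{l \in \{r_k, \dots, r_m\}} |T_l| \geq C_{r_k} \mid \text{complete null}) \}$, with hypotheses ordered by decreasing observed $|t_j|$ and $C_{r_k} = |t_{r_k}|$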
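Holm's adjusted p-values are a running maximum over the sorted raw p-values. The sketch below uses an illustrative function name and made-up p-values, and returns the adjusted values in the original gene order.

```python
def holm_adjust(pvalues):
    """Step-down Holm adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])   # indices r_1, ..., r_m
    adjusted = [0.0] * m
    running_max = 0.0
    for k, idx in enumerate(order):          # k = 0 corresponds to rank 1
        candidate = min((m - k) * pvalues[idx], 1.0)
        running_max = max(running_max, candidate)   # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


raw = [0.03, 0.001, 0.6, 0.012, 0.2]         # hypothetical unadjusted p-values
holm = holm_adjust(raw)
rejected = [j for j, p in enumerate(holm) if p <= 0.05]

print(holm)
print("rejected at FWER 0.05:", rejected)
```

Rejecting every hypothesis whose adjusted p-value is at most $\alpha$ gives the same set as walking down the thresholds $\alpha / (m - j + 1)$ and stopping at the first failure.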

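3-4-3) Step-Up Procedure
Order the observed unadjusted p-values such that $p_{r_1} \leq p_{r_2} \leq \dots \leq p_{r_m}$, and let $H_{r_1}, H_{r_2}, \dots, H_{r_m}$ be the correspondingly ordered hypotheses.
Hochberg's procedure: $j^* = \max\{ j \mid p_{r_j} \leq \alpha / (m - j + 1) \}$; reject $H_{r_j}$ for $j = 1, \dots, j^*$
Adjusted step-up p-values: $p_{r_j}^* = \min_{k = j, \dots, m} \{ \min( (m - k + 1)\, p_{r_k}, 1) \}$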
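The step-up adjustment uses the same multipliers as Holm's method but takes a running minimum from the largest p-value downward. The sketch below uses an illustrative function name and made-up p-values chosen so that the step-up rule rejects everything while the step-down rule would reject nothing.

```python
def hochberg_adjust(pvalues):
    """Step-up (Hochberg) adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])   # indices r_1, ..., r_m
    adjusted = [0.0] * m
    running_min = 1.0
    for k in range(m - 1, -1, -1):           # start from the largest p-value
        idx = order[k]
        candidate = min((m - k) * pvalues[idx], 1.0)
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


raw = [0.03, 0.04, 0.02]                     # hypothetical unadjusted p-values
hochberg = hochberg_adjust(raw)
rejected = [j for j, p in enumerate(hochberg) if p <= 0.05]

print(hochberg)
print("rejected at level 0.05:", rejected)
```

Here the largest p-value, 0.04, already meets its threshold $\alpha / 1 = 0.05$, so all three hypotheses are rejected. Holm's step-down rule stops at the first comparison because $0.02 > 0.05 / 3$.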

3-5) Resampling Method
Data matrix: rows = genes, columns = samples
Bootstrap or permutation based methods
Estimate the joint distribution of the test statistics under the complete null hypothesis by permuting the columns of the gene expression data matrix.
For the $b$th permutation, $b = 1, \dots, B$, compute the test statistics $t_{1,b}, \dots, t_{m,b}$.
$p_j^* = \sum_{b=1}^{B} I(|t_{j,b}| \geq C_j) / B$, where $C_j = |t_j|$ is the observed statistic for gene $j$
Ex) Golub et al. (1999)
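The permutation scheme can be sketched with the standard library alone. The unpooled two-sample t-type statistic, the toy data, and the function names are all assumptions made for illustration; shuffling the group labels is equivalent to permuting the columns of the data matrix.

```python
import random
from statistics import mean, variance


def t_stat(row, labels):
    """Two-sample t-type statistic with an unpooled standard error."""
    a = [x for x, g in zip(row, labels) if g == 0]
    b = [x for x, g in zip(row, labels) if g == 1]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def permutation_pvalues(data, labels, n_perm=500, seed=0):
    """Per-gene permutation p-values: share of permutations with |t_jb| >= |t_j|."""
    rng = random.Random(seed)
    observed = [abs(t_stat(row, labels)) for row in data]
    counts = [0] * len(data)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)                # same relabeling for every gene
        for j, row in enumerate(data):
            if abs(t_stat(row, shuffled)) >= observed[j]:
                counts[j] += 1
    return [c / n_perm for c in counts]


# Toy matrix: rows are genes, columns are 10 samples (5 per group).
labels = [0] * 5 + [1] * 5
data = [
    [1.0, 1.2, 0.9, 1.1, 1.3, 4.0, 4.2, 3.9, 4.1, 4.3],   # clearly shifted gene
    [2.0, 2.5, 1.8, 2.2, 2.4, 2.1, 2.3, 1.9, 2.6, 2.1],   # no real difference
]
pvals = permutation_pvalues(data, labels)
print(pvals)
```

Because one relabeling is applied to all genes at once, the permuted statistics keep the correlation structure between genes. The maxT-type adjustments rely on this property.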

3-5) Resampling Method
Efron et al. (2000) and Tusher et al. (2001)
Compute a test statistic $t_j$ for each gene $j$ and define the order statistics $t_{(j)}$ such that $t_{(1)} \geq t_{(2)} \geq \dots \geq t_{(m)}$.
For each permutation $b$, $b = 1, \dots, B$, compute the test statistics and define the order statistics $t_{(1),b} \geq t_{(2),b} \geq \dots \geq t_{(m),b}$.
From the permutations, estimate the expected value (under the complete null) of the order statistics by $t^*_{(j)} = \sum_{b=1}^{B} t_{(j),b} / B$.
Form a Q-Q plot of the observed $t_{(j)}$ vs. the expected $t^*_{(j)}$.
Efron et al.: for a fixed threshold $\Delta$, genes with $|t_{(j)} - t^*_{(j)}| \geq \Delta$ are called differentially expressed
Tusher et al.: for a fixed threshold $\Delta$, let $j^* = \max\{ j : t_{(j)} - t^*_{(j)} \geq \Delta,\ t^*_{(j)} > 0 \}$
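The expected order statistics and the fixed-threshold rule attributed to Efron et al. can be sketched as follows. This is an illustration under stated assumptions, not the published method: a plain difference in group means stands in for the statistics used in the cited papers, the data are simulated, and the function names are hypothetical. The Tusher et al. rule is left out because the transcript breaks off before completing it.

```python
import random
from statistics import mean


def mean_diff(row, labels):
    """Stand-in statistic: group 1 mean minus group 0 mean."""
    a = [x for x, g in zip(row, labels) if g == 0]
    b = [x for x, g in zip(row, labels) if g == 1]
    return mean(b) - mean(a)


def expected_order_stats(data, labels, n_perm=200, seed=0):
    """Average the sorted permutation statistics position by position."""
    rng = random.Random(seed)
    sums = [0.0] * len(data)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stats = sorted((mean_diff(row, shuffled) for row in data), reverse=True)
        for j, t in enumerate(stats):
            sums[j] += t
    return [s / n_perm for s in sums]


def call_genes(data, labels, delta, n_perm=200, seed=0):
    """Genes whose observed order statistic is at least delta from its expectation."""
    observed = sorted(
        ((mean_diff(row, labels), j) for j, row in enumerate(data)), reverse=True
    )
    expected = expected_order_stats(data, labels, n_perm, seed)
    return [j for (t, j), e in zip(observed, expected) if abs(t - e) >= delta]


# Simulated data: 20 genes x 8 samples, gene 0 shifted by 10 in group 1.
rng = random.Random(1)
labels = [0] * 4 + [1] * 4
data = [[rng.gauss(0.0, 0.5) for _ in labels] for _ in range(20)]
data[0] = [x + 10.0 * g for x, g in zip(data[0], labels)]

called = call_genes(data, labels, delta=4.0)
print("called genes:", called)
```

Plotting the observed order statistics against the expected ones gives the Q-Q plot described above; genes far from the diagonal are the ones this rule picks out.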