Strong Control of the Familywise Type I Error Rate in DNA Microarray Analysis Using Exact Step-Down Permutation Tests Peter H. Westfall Texas Tech University.

Microarrays [Figure: glass slide (1 cm^2) carrying ~6,500 genes; each spot holds a different cDNA sequence.]

Example
Group 1: Acute Myeloid Leukemia (AML), $n_1 = 11$
Group 2: Acute Lymphoblastic Leukemia (ALL), $n_2 = 27$
Data:
OBS   TYPE   G1   G2   G3   …
1     AML    (gene expression levels)
2     AML    …    …    …    …
…
11    AML
12    ALL    …    …
…
38    ALL

Testing for 7000 Gene Expression Levels Goal: Test $H_{0i}: F_{\mathrm{ALL},i} = F_{\mathrm{AML},i}$ for $i = 1, \dots, 7000$. Here, “F” denotes the cdf. Many choices for test statistics. Multiplicity problem: If tests are done at $\alpha = .05$, and there are 6600 equivalent genes, then about .05 * 6600 = 330 of them will be wrongly determined “non-equivalent.”
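The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the familywise error calculation assumes independent tests, which real gene expression data will not satisfy.

```python
def expected_false_positives(alpha: float, n_true_nulls: int) -> float:
    """Expected number of true nulls rejected when each is tested at level alpha."""
    return alpha * n_true_nulls


def fwer_if_independent(alpha: float, n_true_nulls: int) -> float:
    """P(at least one false rejection), ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_true_nulls


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the familywise error rate at most alpha."""
    return alpha / n_tests


# Numbers from the slide: alpha = .05 and 6600 equivalent genes out of 7000.
print(expected_false_positives(0.05, 6600))
print(fwer_if_independent(0.05, 6600))
print(bonferroni_threshold(0.05, 7000))
```

Without any adjustment a false discovery is essentially certain, which is what motivates the closed testing approach on the following slides.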

Closed Testing to Control False Discoveries Let $S = \{1, 2, \dots, 7000\}$ (gene labels). Let $K = \{i_1, \dots, i_k\} \subseteq S$ denote a particular subset. The Closed Testing Procedure:
1. Test $H_{0K}: F_{\mathrm{ALL},K} = F_{\mathrm{AML},K}$ for each $K \subseteq S$, using a valid $\alpha$-level test for each.
2. Reject $H_{0i}: F_{\mathrm{ALL},i} = F_{\mathrm{AML},i}$ if $H_{0K}$ is rejected for all $K \supseteq \{i\}$.
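The two steps of the closed testing procedure can be sketched by brute force for a handful of hypotheses. The sketch below is an illustration under stated assumptions: the local test is a Bonferroni test on each subset (a valid level-alpha test, but not the permutation test used in this talk), and the function names and p-values are hypothetical.

```python
from itertools import combinations


def bonferroni_local_test(pvals, subset, alpha):
    """Assumed local test: reject H_0K when min p in K is at most alpha / |K|."""
    return min(pvals[i] for i in subset) <= alpha / len(subset)


def closed_testing(pvals, alpha=0.05, local_test=bonferroni_local_test):
    """Brute-force closed testing: feasible only for a small number of hypotheses."""
    m = len(pvals)
    all_subsets = [s for k in range(1, m + 1) for s in combinations(range(m), k)]
    # Step 1: test every intersection hypothesis H_0K at level alpha.
    rejected = {s for s in all_subsets if local_test(pvals, s, alpha)}
    # Step 2: reject H_0i only if every subset K containing i was rejected.
    return [all(s in rejected for s in all_subsets if i in s) for i in range(m)]


# Hypothetical raw p-values for four genes.
print(closed_testing([0.010, 0.020, 0.030, 0.400]))
```

With a Bonferroni local test the decisions coincide with Holm's step-down method; replacing `local_test` changes the procedure but not the two-step logic.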

Theorem: CTP Strongly Controls FWE Proof: Suppose $H_{0j_1}, \dots, H_{0j_m}$ all are true (unknown to you which ones). You may reject at least one only when you reject the intersection $H_{0j_1} \cap \dots \cap H_{0j_m}$. Thus, FWE = $P(\text{reject at least one of } H_{0j_1}, \dots, H_{0j_m} \mid H_{0j_1}, \dots, H_{0j_m} \text{ all are true}) \le P(\text{reject } H_{0j_1} \cap \dots \cap H_{0j_m} \mid H_{0j_1}, \dots, H_{0j_m} \text{ all are true}) = \alpha$.

Exact Tests for Composite Hypotheses $H_{0K}$ Use the permutation distribution of $\min_{i \in K} p_i$, where $p_i = 2P(T_{38-2} > |t_i|)$ and $t_i$ is the two-sample $t$ statistic for gene $i$. The p-value for $H_{0K}$ is the proportion of the 38!/(27!11!) permutations for which $\min_{i \in K} P_i^* \le \min_{i \in K} p_i$. Note: Exact despite “massively singular” covariance matrix!
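A minimal sketch of this exact test on a toy data set follows. Two assumptions are made for illustration: the data values are invented, and the statistic is max |t| rather than min p. Because every gene shares the same degrees of freedom, $p_i$ is the same decreasing function of $|t_i|$ for all genes, so the event "min p* is at most min p" is the same as "max |t*| is at least max |t|", and no t-distribution cdf is needed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def max_abs_t(data, group1_rows, genes):
    """Largest |t| over the genes in the subset, for one group assignment."""
    g1 = set(group1_rows)
    stats = []
    for g in genes:
        x = [row[g] for r, row in enumerate(data) if r in g1]
        y = [row[g] for r, row in enumerate(data) if r not in g1]
        stats.append(abs(pooled_t(x, y)))
    return max(stats)


def exact_subset_pvalue(data, group1_rows, genes):
    """Proportion of all group relabelings whose max |t| reaches the observed one."""
    observed = max_abs_t(data, group1_rows, genes)
    hits = total = 0
    for relabeled in combinations(range(len(data)), len(group1_rows)):
        total += 1
        if max_abs_t(data, relabeled, genes) >= observed - 1e-12:
            hits += 1
    return hits / total


# Invented toy data: 7 samples (rows) by 3 genes (columns); rows 0-2 form group 1.
data = [
    [9.1, 5.0, 3.3], [9.8, 6.1, 2.7], [10.4, 4.2, 3.9],
    [1.2, 5.5, 3.1], [2.0, 4.8, 2.5], [2.9, 6.3, 3.6], [3.5, 5.1, 3.0],
]
group1_rows = (0, 1, 2)
print(exact_subset_pvalue(data, group1_rows, genes=(0, 1, 2)))
```

Full enumeration works here because 7!/(3!4!) = 35; for 38!/(27!11!) relabelings one would sample permutations instead, as the MULTTEST slides describe.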

A Slight Problem... There are $2^{7000} - 1$ subsets K to be tested. This might take a while...
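To get a feel for the size of the problem, the number of nonempty subsets of 7000 gene labels can be computed exactly with Python's arbitrary-precision integers. This is a quick illustration; the variable names are arbitrary.

```python
n_genes = 7000

# Every nonempty subset K of the gene labels is one intersection hypothesis.
n_subsets = 2 ** n_genes - 1

# Count the decimal digits of that number rather than printing it in full.
n_digits = len(str(n_subsets))
print(n_digits)
```

A count with more than two thousand digits rules out brute-force enumeration, which is why a shortcut is needed.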

A Fantastic Simplification You need only test 7000 of the subsets! Why? Because $P(\min_{i \in K} P_i^* \le c) \le P(\min_{i \in K'} P_i^* \le c)$ when $K \subseteq K'$. Significance for most lower order subsets is determined by significance of higher order subsets.

Illustration with Four Genes
$H_{\{1234\}}$: $\min p = .0121$, $p_{\{1234\}} = .0379$
$H_{\{123\}}$: $\min p = .0121$, $p_{\{123\}} < .0379$
$H_{\{124\}}$: $\min p = .0121$, $p_{\{124\}} < .0379$
$H_{\{134\}}$: $\min p = .0121$, $p_{\{134\}} < .0379$
$H_{\{234\}}$: $\min p = .0142$, $p_{\{234\}} = .0351$
$H_{\{12\}}$: $\min p = .0121$, $p_{\{12\}} < .0379$
$H_{\{13\}}$: $\min p = .0121$, $p_{\{13\}} < .0379$
$H_{\{14\}}$: $\min p = .0121$, $p_{\{14\}} < .0379$
$H_{\{23\}}$: $\min p = .0142$, $p_{\{23\}} < .0351$
$H_{\{24\}}$: $\min p = .0142$, $p_{\{24\}} < .0351$
$H_{\{34\}}$: $\min p = .0191$, $p_{\{34\}} = .0355$
$H_1$: $p_1 = p_{\{1\}} < .0379$
$H_2$: $p_2 = p_{\{2\}} < .0351$
$H_3$: $p_3 = p_{\{3\}} = .1991$
$H_4$: $p_4 = p_{\{4\}} < .0355$
(Start at bottom.)

MULTTEST PROCEDURE Tests only the needed subsets (7000, not $2^{7000} - 1$). Samples from the permutation distribution. Only one sample is needed, not 7000 distinct samples: the joint distribution of minP is identical under $H_K$ and $H_S$. (Called the “subset pivotality” condition by Westfall and Young, 1993.)
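The shortcut (one subset per gene, one shared set of permutations) can be sketched as a step-down minP adjustment. This is a minimal illustration, not PROC MULTTEST's implementation: `stepdown_minp_adjusted`, `observed_p`, and `perm_p` are hypothetical names, and `perm_p` is assumed to hold the raw p-values already recomputed for every gene under each sampled permutation.

```python
def stepdown_minp_adjusted(observed_p, perm_p):
    """Step-down minP adjusted p-values from a single set of permutations.

    observed_p[i] is the raw p-value of gene i; perm_p[b][i] is the raw
    p-value of gene i recomputed under permutation b (assumed precomputed).
    """
    m, n_perm = len(observed_p), len(perm_p)
    order = sorted(range(m), key=lambda i: observed_p[i])  # most significant first
    adjusted = [0.0] * m
    running_min = [float("inf")] * n_perm

    # Walk from the least to the most significant gene, so that running_min[b]
    # is the minimum over exactly the genes that are "still in play" when
    # gene i is tested: the nested subsets needed by the closed procedure.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        for b in range(n_perm):
            running_min[b] = min(running_min[b], perm_p[b][i])
        adjusted[i] = sum(v <= observed_p[i] for v in running_min) / n_perm

    # Enforce monotonicity: a gene cannot be rejected unless every more
    # significant gene has been rejected.
    previous = 0.0
    for i in order:
        adjusted[i] = max(adjusted[i], previous)
        previous = adjusted[i]
    return adjusted


# Invented numbers: 3 genes, 5 permutations.
observed_p = [0.01, 0.04, 0.30]
perm_p = [
    [0.50, 0.60, 0.70],
    [0.005, 0.80, 0.90],
    [0.20, 0.03, 0.95],
    [0.40, 0.45, 0.10],
    [0.90, 0.02, 0.25],
]
print(stepdown_minp_adjusted(observed_p, perm_p))
```

Reusing the same permutations for every nested subset is what the subset pivotality condition justifies; the cost is one pass over the genes per permutation rather than one resampling run per subset.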

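PROC MULTTEST code
/* Step-down permutation (minP) adjustment using 200,000 resamples */
proc multtest noprint out=adjp holm hoc stepperm n=200000;
   class type;                      /* AML or ALL */
   test mean (gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;
/* Keep genes whose raw p-value is at most .0005, smallest first */
proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;
run;
proc print;
   var _var_ raw_p stppermp;
run;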

PROC MULTTEST Output (50 minutes for 200,000 samples)

Imbalance Issues
Use of Student's t statistics does result in an exact, closed multiple testing procedure, but...
There is imbalance: less power for gene types that are highly kurtotic than for normally distributed types.
Solutions:
Use exact unadjusted p-values
– Already available for binary data
– Computational difficulties otherwise
Rank-transform the data prior to analysis

Rank Transform for Better Balance
/* Replace each gene's expression values by their ranks */
proc rank;
   var gene1-gene7123;
run;
proc multtest noprint out=adjp holm hoc stepperm n=200000;
   class type;                      /* AML or ALL */
   test mean (gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;
proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;
run;
proc print;
   var _var_ raw_p stppermp;
run;
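For readers without SAS, the rank transform step can be sketched in plain Python. The assumptions here are that tied values receive the average of their ranks and that each gene (column) is ranked separately across all samples; the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks 1..n of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_transform(data):
    """Rank every column (gene) of a samples-by-genes table independently."""
    ranked_columns = [average_ranks(list(col)) for col in zip(*data)]
    return [list(row) for row in zip(*ranked_columns)]


print(average_ranks([10.0, 3.0, 3.0, 7.0]))
print(rank_transform([[5, 100], [1, 300], [3, 200]]))
```

After the transform every gene has essentially the same permutation distribution (up to ties), which is the sense in which the ranked analysis is better balanced.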

Rank Transformed Results

Comparing ALL and AML for Gene [Figure: expression level of the gene (axis label GENE) plotted by TYPE (ALL, AML).]

Is Better Balance Good?
Maybe not: imbalance induces a more powerful multiple testing procedure
– Bonferroni multiplier implicitly reduced through imbalance
– Serendipity!

Summary
The Westfall-Young method is an exact, closed testing method, despite large p, small n
Detected genes are “honestly significant”
Robust (nonparametric)