SRCOS Summer Research Conf, 2011: Multiple Testing Under Dependency, with Applications to Genomic Data Analysis. Zhi Wei, Department of Computer Science, New Jersey Institute of Technology.

Presentation transcript:

SRCOS Summer Research Conf, 2011
Multiple Testing Under Dependency, with Applications to Genomic Data Analysis
Zhi Wei, Department of Computer Science, New Jersey Institute of Technology
Joint work with Wenguang Sun

Slide 2: Intro
Differential Expression (DE) Analysis
Microarray: measures the activity of thousands of genes simultaneously
DE (differentially expressed) vs. EE (equally expressed) genes
Notation:
– X_i ∈ {DE, EE}
– Y_i = (y_i1, …, y_im; y_i(m+1), …, y_i(m+n)), with m and n replicates for the two conditions, respectively
[Figure: expression data for a single gene under Phenotype 1 and Phenotype 2]

Slide 3: Microarray Time Course Data: A Motivating Example
Microarray time course (MTC) data are commonly collected for studying biological processes.
Tian, Nowak, and Brasier (2005), "A TNF-Induced Gene Expression Program Under Oscillatory NF-kappaB Control"
– studies the biological process by which the cytokine tumor necrosis factor (TNF) initiates tissue inflammation
Dataset: time-course microarray experiment profiling gene activities at 0hr, 1hr, 3hr, and 6hr after inhibition of the NF-kB transcription factor
To find:
– (1) "Early response genes" that were differentially expressed (DE) less than 1hr after the NF-kB inhibition
– (2) "Middle response genes" that were DE at 3hr but showed no response prior to that
– (3) "Late response genes" that were DE at 6hr but showed no response prior to that
– (4) "Biphasic genes" that were DE at both 1hr and 6hr, but not at 3hr

Slide 4: Existing solutions
Solution 1: Find TDE genes, with one state X_i for the whole series [Y_i1, Y_i2, …, Y_iT]
– Moderated likelihood ratio statistic and Hotelling T^2 (Tai and Speed 06)
– Functional data analysis: Luan and Li (04), Storey et al. (05), Hong and Li (06)
– Too general
Solution 2: Find DE genes at each time point, with one state X_it for each observation Y_it
– T-test at each time point
– HMM, Yuan and Kendziorski (06)
– Too detailed
Something in between?

Slide 5: Set-wise multiple testing
Conceptualize the gene selection problem as a set-wise multiple testing problem
– Test DE vs EE at each time point for each gene
– Combine the tests across all time points as a set for DE patterns of interest; that is, each gene corresponds to a set of hypotheses
Several issues in set-wise testing
– (i) multiplicity: how to control testing errors (such as the false discovery rate) at the set level when thousands of genes are considered simultaneously
– (ii) optimality: how to optimally combine the testing results within a set so that sensitivity is maximized while multiplicity is controlled
– (iii) dependency: how to account for and exploit the dependency of the tests within a set, e.g. the high temporal correlation in MTC data

Slide 6: Related work
Multiplicity issue for simultaneous set-wise tests
– Formally addressed by Benjamini & Heller (Biometrics, 2008)
    Test DE vs EE at each time point individually
    Combine the p-values at each time point into a pooled p-value by Simes
    The BH procedure is applied to the Simes-combined p-values
– Partial conjunction tests
    Are all T hypotheses in the set true? (Conjunction null)
    Are at least u out of T hypotheses in the set false? (Partial conjunction alternative)
– Limitations
    Incapable of capturing sophisticated DE patterns, e.g., early response genes
    Not optimal because dependency is ignored, although the FDR at the set level (FSR) is controlled
Testing hypotheses under dependency (Sun & Cai, JRSSB, 2009)
– One HMM to account for dependency of hypotheses
– Local Index of Significance (LIS) statistics
– Valid and optimal
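The BH-Simes pipeline described on this slide can be sketched in a few lines of Python. This is an illustration of the conjunction-null case only, not the authors' code: the function names `simes` and `benjamini_hochberg` and the toy p-values are assumptions made for the example.

```python
def simes(pvalues):
    """Simes combined p-value for one set: min over k of T * p_(k) / k."""
    ordered = sorted(pvalues)
    T = len(ordered)
    return min(T * p / (k + 1) for k, p in enumerate(ordered))


def benjamini_hochberg(pvalues, alpha):
    """Indices rejected by the BH step-up procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under the BH line.
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Toy data (made up): one row of per-time-point p-values per gene.
gene_pvalues = [
    [0.001, 0.20, 0.03],
    [0.40, 0.60, 0.90],
    [0.02, 0.01, 0.015],
    [0.30, 0.04, 0.70],
]
combined = [simes(p) for p in gene_pvalues]
rejected = benjamini_hochberg(combined, alpha=0.1)
```

Each gene contributes a single pooled p-value, so the temporal ordering of the time points plays no role in this baseline; that is the dependency the slide says is ignored.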

Slide 7: Characterizing Gene Temporal Expr. Patterns in a Set-wise Multiple Testing Framework
Setup for MTC data:
– Each gene has two possible states at each time point: equally expressed (EE) and differentially expressed (DE), denoted by 0 and 1, respectively.
– Consider P sets of hypotheses {(H_i1, ..., H_iT) : i = 1, ..., P} for testing EE versus DE, where P is the number of genes and T the number of time points.
Pattern representation
– 2^T atomic patterns η ∈ {0,1}^T, e.g., 11100, etc. for T = 5
– Partition the 2^T patterns η into the null set Θ_0 and the non-null set Θ_1, the patterns of interest
– 2^(2^T) − 1 partitions → powerful temporal pattern characterization
    Conjunction: Θ_0 = {(η_1, ..., η_T) ∈ {0,1}^T : Σ_t η_t = 0}
    Partial conjunction: Θ_0 = {(η_1, ..., η_T) ∈ {0,1}^T : Σ_t η_t < u}
    "Late response genes", DE after time t: Θ_1 = {(η_1, ..., η_T) ∈ {0,1}^T : η_1 = … = η_t = 0 and η_{t+1} + … + η_T > 0}
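The pattern sets on this slide are small enough to enumerate directly. The sketch below lists the 2^T atomic patterns and applies the three membership rules; the function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def atomic_patterns(T):
    """All 2^T binary patterns eta in {0,1}^T (0 = EE, 1 = DE)."""
    return list(product((0, 1), repeat=T))


def is_conjunction_null(eta):
    """Conjunction null: EE at every time point."""
    return sum(eta) == 0


def is_partial_conjunction_null(eta, u):
    """Partial conjunction null: fewer than u DE time points."""
    return sum(eta) < u


def is_late_response(eta, t):
    """Non-null pattern: EE at times 1..t and DE at least once afterwards."""
    return sum(eta[:t]) == 0 and sum(eta[t:]) > 0


T = 5
patterns = atomic_patterns(T)
late = [eta for eta in patterns if is_late_response(eta, t=2)]
pc_null = [eta for eta in patterns if is_partial_conjunction_null(eta, u=2)]
```

Any other pattern of interest is defined the same way, as a predicate on η that splits the 2^T patterns into Θ_0 and Θ_1.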

Slide 8: HMM
The state sequence of gene i over time is a binary vector X_i = (x_i1, ..., x_iT), where x_it = 1 indicates that gene i is DE at time t and x_it = 0 otherwise.
Assume X_i ∈ {0,1}^T is distributed as a Markov chain
– The transition probability can be either homogeneous or inhomogeneous
Assume X_i ⊥ X_j for i ≠ j
Emission probability f(Y_it | X_it) ~ Gamma-Gamma model (Newton et al. 2001, Kendziorski et al. 2003)
[Diagram: hidden states X_i1 → X_i2 → … → X_iT, each X_it emitting the observation Y_it]
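Taken together, these assumptions imply the usual HMM factorization of the joint distribution. The block below writes it out as a sketch. The symbols π (initial distribution) and a^{(t)} (transition probabilities, possibly time-dependent) are notation assumed here, and the observations are taken to be conditionally independent given the states, which is the standard HMM assumption rather than something stated on the slide.

```latex
\[
P(X_i, Y_i) \;=\; \pi_{x_{i1}}\, f(y_{i1} \mid x_{i1})
  \prod_{t=2}^{T} a^{(t)}_{x_{i,t-1}\, x_{it}}\, f(y_{it} \mid x_{it}),
\qquad
P(X_1,\dots,X_P,\, Y_1,\dots,Y_P) \;=\; \prod_{i=1}^{P} P(X_i, Y_i).
\]
```

The second identity is the independence assumption across genes; in the homogeneous case the superscript (t) on the transition probabilities is dropped.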

Slide 9: HMM-based set-wise testing
Use the EM algorithm to estimate all HMM parameters: initial prob., transition prob., emission prob. (Gamma-Gamma model)
Define a binary vector ϑ = (ϑ_1, ..., ϑ_p) ∈ {0,1}^p, where ϑ_i = 1 if X_i ∈ patterns of interest and ϑ_i = 0 otherwise.
Define the generalized local index of significance (GLIS): GLIS_i = P(ϑ_i = 0 | Y_i), the posterior probability that gene i is null
[Diagram: hidden states X_i1 → X_i2 → … → X_iT, each X_it emitting the observation Y_it]
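For small T, a GLIS-type statistic can be computed by brute force: sum the posterior probabilities of all null patterns. The sketch below does that, assuming the HMM parameters have already been estimated and that the emission densities f(y_t | x_t) are passed in as numbers; it does not implement the Gamma-Gamma model or the EM step. The names `pattern_posteriors` and `glis` are hypothetical, and a practical implementation would use forward-backward recursions rather than enumeration.

```python
from itertools import product


def pattern_posteriors(pi, A, lik):
    """Posterior P(X = eta | Y) for every eta in {0,1}^T, by enumeration.

    pi[s]     : initial probability of state s
    A[r][s]   : transition probability from state r to state s
    lik[t][s] : emission density f(y_t | x_t = s), supplied by the caller
    """
    T = len(lik)
    joint = {}
    for eta in product((0, 1), repeat=T):
        p = pi[eta[0]] * lik[0][eta[0]]
        for t in range(1, T):
            p *= A[eta[t - 1]][eta[t]] * lik[t][eta[t]]
        joint[eta] = p
    total = sum(joint.values())
    return {eta: p / total for eta, p in joint.items()}


def glis(pi, A, lik, is_null):
    """Posterior probability that the gene's pattern lies in the null set."""
    post = pattern_posteriors(pi, A, lik)
    return sum(p for eta, p in post.items() if is_null(eta))


# Illustrative parameter values; the likelihood values are made up.
pi = (0.95, 0.05)
A = ((0.95, 0.05), (0.2, 0.8))
conjunction_null = lambda eta: sum(eta) == 0
glis_weak = glis(pi, A, [(1.0, 0.1)] * 3, conjunction_null)    # data favor EE
glis_strong = glis(pi, A, [(0.1, 1.0)] * 3, conjunction_null)  # data favor DE
```

A small value means the data make a null pattern unlikely for that gene, so genes are ranked from smallest to largest.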

Slide 10: HMM-based set-wise testing (cont'd)
GLIS testing procedure
– Denote by GLIS_(1), ..., GLIS_(p) the ordered GLIS values and by H_(1), ..., H_(p) the corresponding hypotheses
– Let l = max{ i : (1/i) Σ_{j=1}^{i} GLIS_(j) ≤ α }, where α is the nominal FSR level
– Then reject all H_(i), for i = 1, 2, ..., l
– Valid and asymptotically optimal for set-wise multiple testing when the HMM parameter estimate is consistent (Sun and Wei, JASA, 2011)
The GLIS statistic can be interpreted as the probability of a gene being null (i.e., not exhibiting a pattern of interest) given the observed expression data
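The step-up rule can be sketched as follows. The sketch assumes the cutoff l is the largest i for which the running average of the i smallest GLIS values stays at or below the nominal level α, which is the form used by LIS-type procedures. The function name and the toy values are illustrative.

```python
def glis_procedure(glis_values, alpha):
    """Indices of the hypotheses rejected by the step-up rule.

    Sort the GLIS values in increasing order and reject the l smallest,
    where l is the largest i whose running average is at most alpha.
    """
    order = sorted(range(len(glis_values)), key=lambda i: glis_values[i])
    running_sum = 0.0
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += glis_values[idx]
        if running_sum / rank <= alpha:
            cutoff = rank
    return sorted(order[:cutoff])


# Toy GLIS values for five genes (made up).
values = [0.01, 0.90, 0.05, 0.30, 0.02]
selected = glis_procedure(values, alpha=0.1)
```

In this toy example the gene with GLIS 0.30 is still selected, because the rule bounds the average of the rejected values (0.095 here) rather than each value individually.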

Slide 11: Simulation
Setup
– Number of subjects under both conditions: n_1 = n_2 = 10
– Number of genes: P = 2000
– Number of time points: 6
– Number of replications: 200
– Nominal FSR level: 0.1
– HMM: π = (0.95, 0.05), A_t = (a_00, 1−a_00; 1−a_11, a_11), emission parameters (α_t, α_0t, ν_t)
    Dependency: fix the emission parameters at (10, 1, 0.5) and a_00 = 0.95; vary a_11 from 0.2 to 0.98
    Gene expression signal: fix A_t = (0.95, 0.05; 0.2, 0.8) and the emission parameters at (α_t, 1, 0.5); vary α_t
Patterns tested: conjunction and partial conjunction
Comparison
– BH-Simes
– Viterbi-based decision rule: select the gene if its most probable state sequence η is in Θ_1
Evaluation: FSR (False Set Rate), MSR (Missed Set Rate) = 1 − sensitivity
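The two evaluation criteria can be estimated from simulated data as sketched below. The sketch assumes the FSR is estimated by averaging, over replications, the proportion of selected genes that are truly null, and the MSR by averaging the proportion of truly non-null genes that are missed. The function name and the toy inputs are hypothetical.

```python
def set_error_proportions(truth, rejected):
    """False and missed set proportions for one simulated replication.

    truth[i] is 1 if gene i truly shows a pattern of interest, else 0.
    rejected is a collection of selected gene indices.
    """
    rejected = set(rejected)
    false_rejections = sum(1 for i in rejected if truth[i] == 0)
    true_rejections = len(rejected) - false_rejections
    non_null = sum(truth)
    # Convention: both proportions are 0 when their denominator is 0.
    fsp = false_rejections / len(rejected) if rejected else 0.0
    msp = (non_null - true_rejections) / non_null if non_null else 0.0
    return fsp, msp


# One toy replication: genes 0, 2, 3 are non-null; genes 0, 1, 2 are selected.
fsp, msp = set_error_proportions([1, 0, 1, 1, 0, 0], [0, 1, 2])

# Averaging over replications gives the estimated FSR and MSR.
replications = [([1, 0, 1, 1, 0, 0], [0, 1, 2]), ([1, 1, 0, 0], [0, 1])]
results = [set_error_proportions(t, r) for t, r in replications]
fsr = sum(f for f, _ in results) / len(results)
msr = sum(m for _, m in results) / len(results)
```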

Slide 12: Conjunction Test Results

Slide 13: Partial Conjunction Test Results

Slide 14: Ranking efficiency
A_k = (0.95, 0.05; 0.2, 0.8) and emission parameters (α_k, 32, 4)
[Figure legend: solid = oracle GLIS; dashed = data-driven GLIS; dotted = BH-Simes]

Slide 15: Application
Calvano et al. (Nature 2005): A network-based analysis of systemic inflammation in humans.
Goal: Identify genes that respond to endotoxin
Dataset: gene expression in whole blood leukocytes of 4 healthy persons, measured immediately before and at 2, 4, 6, 9, and 24 h after the intravenous administration of bacterial endotoxin; 4 healthy persons without treatment serve as controls.

Slide 16: Pattern Design
Genes perturbed in response to treatments
Sequentially perturbed genes
Early and late response genes
[Pattern diagrams, including an OR combination, are not reproduced in the transcript]

Slide 17: HMM Fitting

Slide 18: Early Response Genes (DE from 2nd time point)

Slide 19: Late Response Genes (DE after 2nd time point)

Slide 20: Published Genome-Wide Associations through 3/2009
398 published GWA at p < 5 x
Source: NHGRI GWA Catalog

Slide 21: HMM-dependent (set-wise) hypothesis test in Genome-wide Association Studies (GWAS)
Individual tests (Wei, Sun et al., Bioinformatics, 2009)
– N chromosomes → N HMMs; emission prob.: Normal mixtures
– How to pool the different HMMs together to obtain global multiplicity control
– Improvement: more reproducible, higher sensitivity
Set-wise tests (Wang, Wei, and Sun, Statistics and Its Interface, 2010)
– Testing a genomic region

Slide 22: From HMM to Markov Random Field
An undirected graph, where each node represents a random variable X_i and each edge represents a dependency
Joint distribution function
Markov blanket and conditional independence
Inference
– Exact inference is NP-complete
– Approximation techniques: MCMC, loopy belief propagation, iterated conditional modes (ICM)
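The formulas behind the "joint distribution" and "Markov blanket" items are not in the transcript. For orientation, the generic textbook form for an MRF with positive clique potentials is sketched below. The notation (cliques C, potentials ψ_C, neighbor set N(i)) is an assumption here and may differ from what the slide showed.

```latex
\[
P(X = x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z \;=\; \sum_{x'} \prod_{C \in \mathcal{C}} \psi_C(x'_C)
\]
\[
P\bigl(X_i \mid X_{-i}\bigr) \;=\; P\bigl(X_i \mid X_{N(i)}\bigr)
\]
```

The normalizing constant Z sums over all joint configurations, which is what makes exact inference and likelihood evaluation expensive on general graphs.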

Slide 23: Microarray DE analysis
Discrete Markov Random Field Model for Latent Differential Expression States (Wei and Li, Bioinformatics, 2007)
p genes on a network
Gene expression Y_i = (Y_i1, …, Y_im; Y_i(m+1), …, Y_i(m+n)) for Phenotype 1 and Phenotype 2
Networks built from KEGG: 1668 nodes (genes), 8011 edges
Framework: f(X|Y) ∝ f(Y|X) · Pr(X)
– Pr(X) ~ Markov random field (MRF) model
– Emission prob. f(Y|X) ~ Gamma-Gamma model
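One simple way to approximate the posterior mode under a framework of this kind is iterated conditional modes (ICM). The sketch below is an illustration under assumptions, not the model from the cited paper: the prior is a hypothetical Ising-style form with a per-node weight `gamma` for the DE state and a reward `beta` for each neighbor in the same state, and the log emission densities are passed in as plain numbers.

```python
def icm(neighbors, log_lik, gamma, beta, n_sweeps=10):
    """Iterated conditional modes for binary states on a graph.

    neighbors[i]  : list of indices adjacent to node i
    log_lik[i][s] : log f(y_i | x_i = s), supplied by the caller
    gamma, beta   : hypothetical prior weights (see the text above)
    """
    # Start from the state each node prefers on its own data.
    x = [0 if ll[0] >= ll[1] else 1 for ll in log_lik]
    for _ in range(n_sweeps):
        changed = False
        for i, nbrs in enumerate(neighbors):
            scores = []
            for s in (0, 1):
                agree = sum(1 for j in nbrs if x[j] == s)
                scores.append(log_lik[i][s] + gamma * s + beta * agree)
            best = 0 if scores[0] >= scores[1] else 1
            if best != x[i]:
                x[i] = best
                changed = True
        if not changed:
            break  # local optimum reached
    return x


# Three genes on a chain; the middle gene has weak evidence on its own.
neighbors = [[1], [0, 2], [1]]
log_lik = [[-3.0, 0.0], [-1.0, -1.2], [-3.0, 0.0]]
with_network = icm(neighbors, log_lik, gamma=0.0, beta=1.0)
without_network = icm(neighbors, log_lik, gamma=0.0, beta=0.0)
```

With `beta = 0` each gene is classified from its own data; with `beta > 0` the middle gene borrows strength from its two DE neighbors and flips to DE.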

Slide 24: MRF in Genome-wide association studies (GWAS)
GWAS data with network structure constructed based on LD dependency (Li, Wei, and Maris, Biostatistics, 2010)
Framework: f(X|Y) ∝ f(Y|X) · Pr(X)
– X, disease association state: Pr(X) ~ Markov random field (MRF) model
– Y, genotype counts: emission prob. f(Y|X) ~ Dirichlet-Trinomial model

Slide 25: MTC Data
Hidden Spatial-Temporal MRF Model (Wei and Li, Annals of Applied Statistics, 2008)
Models both regulatory (gene network) and temporal (time) dependency.
– Y_gt: gene expression of gene g at time t
– X_gt = 0 or 1: gene state
– f(Y_gt | X_gt): emission probability
– Spatial neighbors (genes g−1, g+1): regulatory dependency
– Temporal neighbors (times t−1, t+1): temporal dependency

Slide 26: Challenges in MRF
Is the MRF-structured hypothesis test valid and optimal?
Is the MRF-structured set-wise hypothesis test valid and optimal?
Unlike for HMMs, model fitting and parameter estimation for MRFs are much more challenging
– The EM algorithm for MRF is NP-hard
– (Generalized) belief propagation algorithm

Slide 27: References
Wei Z and Li H (2007): A Markov random field model for network-based analysis of genomic data. Bioinformatics, 23:
Wei Z and Li H (2008): A spatial-temporal MRF model for network-based analysis of microarray time course gene expression data. Annals of Applied Statistics, 2:
Sun W and Cai T (2009): Large-scale multiple testing under dependence. Journal of the Royal Statistical Society, Series B, 71,
Wei Z, Sun W, Wang K and Hakonarson H (2009): Multiple Testing in Genome-Wide Association Studies via Hidden Markov Models. Bioinformatics, 25:
Li H, Wei Z and Maris J (2010): A Hidden Markov Random Field Model for Genome-wide Association Studies. Biostatistics, 11:
Wang W, Wei Z and Sun W (2010): Simultaneous Set-Wise Testing Under Dependence, with Applications to Genome-Wide Association Studies. Statistics and Its Interface, 3(4):
Sun W and Wei Z (2011): Multiple Testing for Pattern Identification, with Applications to Microarray Time Course Experiments. Journal of the American Statistical Association, 106(493):73–88