Statistical Methods for Analyzing Ordered Gene Expression Microarray Data Shyamal D. Peddada Biostatistics Branch National Inst. Environmental Health Sciences.

Presentation transcript:

Statistical Methods for Analyzing Ordered Gene Expression Microarray Data Shyamal D. Peddada Biostatistics Branch National Inst. Environmental Health Sciences (NIH) Research Triangle Park, NC

An outline
– Ordered gene expression data
– Common experimental designs
– A review of some statistical methods
– An example
– Demonstration of ORIOGEN, a software package for ordered gene expression data

Some examples of ordered gene expression data
Comparison of gene expression by:
– various stages of cancer: Normal - Hyperplasia - Adenoma - Carcinoma
– tumor size: New tumor - Middle size - Large tumor (with necrosis)
– dose of a chemical (dose-response study)
– duration of exposure to a chemical (time-course experiments)
– dose & duration

Some commonly used experimental designs
Experimental unit: tissues/cells/animals
Single chemical/treatment
– Dose-response study
– Time-course study: a single dose with responses obtained at multiple time points after treatment; or experimental units treated at multiple time points using the same dose
– Dose-response x time-course study: multiple doses at multiple time points
Multiple chemicals/treatments

Possible objectives
– Investigate changes in gene expression across certain biologically relevant categories, e.g. Hyperplasia to Adenoma to Carcinoma, or “early time point” to “late time point” after exposure to a chemical.
– Identify/cluster genes with similar expression profiles over time/dose.

Correlation coefficient based methods
Correlation coefficient based methods match genes with similar observed patterns of expression across dose/time points.
(Figure: Gene 1, Gene 2)

Correlation coefficient based methods
A number of variations on this general principle exist in the literature. Here we outline some prominent ones.
A. Chu et al. (Science, 1998):
– Pre-select a set of biologically relevant patterns of gene expression over time.
– Identify a sample of about 3 to 8 genes for each pattern.
– Compute the correlation coefficient of each candidate gene in the microarray data with the above pre-selected genes.
– Assign each candidate gene to the cluster with the highest correlation coefficient.
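The assignment rule on this slide can be sketched in a few lines of Python. This is only an illustration of the general idea, not Chu et al.'s implementation: the pattern names, the seed profiles, the candidate profile, and the choice to take the single best-correlated seed gene (rather than, say, an average over a pattern's seeds) are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def assign_to_pattern(candidate, seeds_by_pattern):
    """Return (pattern, r) for the seed profile most correlated with candidate."""
    best_pattern, best_r = None, -2.0
    for pattern, seeds in seeds_by_pattern.items():
        for seed in seeds:
            r = pearson(candidate, seed)
            if r > best_r:
                best_pattern, best_r = pattern, r
    return best_pattern, best_r

# Hypothetical seed profiles over five time points (illustrative numbers only).
seeds_by_pattern = {
    "early_up": [[0.1, 1.0, 1.2, 1.1, 1.0], [0.0, 0.9, 1.0, 1.0, 0.9]],
    "late_up": [[0.0, 0.1, 0.0, 0.8, 1.1], [0.1, 0.0, 0.2, 0.9, 1.0]],
}
candidate = [0.2, 0.1, 0.3, 1.0, 1.4]  # hypothetical candidate gene
pattern, r = assign_to_pattern(candidate, seeds_by_pattern)
```

With only five time points per gene, the correlations driving this assignment rest on very few observations, which is the uncertainty the later slides return to.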

Correlation coefficient based methods …
B. Kerr and Churchill (PNAS, 2001): They correctly recognized the uncertainty associated with Chu et al.’s clustering algorithm. Hence they proposed a bootstrap methodology to evaluate Chu et al.’s clusters.
C. Heyer et al. (Genome Research, 1999): Rather than using the standard correlation coefficient between genes, they employ a jackknife version that is more robust to outliers. Unlike Chu et al.’s strategy, they classify genes on the basis of pairwise correlation coefficients.
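To see why a jackknife version of the correlation resists outliers, here is a minimal Python sketch. It assumes one common formulation, the minimum of the leave-one-out Pearson correlations; whether this matches Heyer et al.'s definition in every detail is an assumption, and the two profiles are made up for the illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all leave-one-out versions of (x, y)."""
    leave_one_out = []
    for k in range(len(x)):
        xk = x[:k] + x[k + 1:]  # drop observation k from both profiles
        yk = y[:k] + y[k + 1:]
        leave_one_out.append(pearson(xk, yk))
    return min(leave_one_out)

# Hypothetical profiles: unrelated except for one shared outlier at the end.
x = [0.1, -0.2, 0.0, 0.2, -0.1, 5.0]
y = [-0.1, 0.1, 0.2, -0.2, 0.0, 5.0]
plain = pearson(x, y)                  # dominated by the single outlier
robust = jackknife_correlation(x, y)   # the outlier no longer decides the value
```

Here the ordinary correlation is close to 1 purely because of the last point, while the jackknife value is negative once that point is left out.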

Correlation coefficient based methods …
Strengths
– Familiarity among biologists
– Easy to compute and interpret (although it is often misinterpreted too!)
Weaknesses
– Non-linearity in the data can lead to misinterpretation.
– Outliers and influential observations can affect the numerical value of the correlation coefficient.
– Heterogeneity between genes can also affect the numerical value of the correlation coefficient.
– It is also important to note that the correlation coefficient is typically estimated on the basis of a very small number of points.

Regression based procedures Basic assumption among these methods: The “conditions” are numerical, e.g. dose or time

Polynomial regression Liu et al. (BMC Bioinformatics, 2005)
For each gene, Liu et al. fitted a quadratic regression model of expression on dose or time. They assign each gene to a particular cluster depending on the sign and statistical significance of the regression parameters. If none of the regression coefficients is significant for a gene, the gene is declared unimportant.
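As a rough illustration of the per-gene fit, the sketch below fits the generic quadratic model y = b0 + b1*x + b2*x^2 by ordinary least squares and labels the gene by the signs of b1 and b2. The model form, the function names, and the data are assumptions for the example (the slide's own equation is not in the transcript), and the significance tests that Liu et al. use for clustering are deliberately left out.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2; returns [b0, b1, b2]."""
    # Normal equations (X'X) b = X'y for the design columns 1, x, x^2.
    powers = [sum(xi ** p for xi in x) for p in range(5)]
    a = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented 3x4 matrix.
    m = [row + [r] for row, r in zip(a, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(m[i][j] * b[j] for j in range(i + 1, 3))
        b[i] = (m[i][3] - s) / m[i][i]
    return b

def sign_label(b1, b2):
    """Sign pattern of the linear and quadratic terms (no significance test)."""
    return ("+" if b1 > 0 else "-", "+" if b2 > 0 else "-")

# Hypothetical gene measured at five evenly spaced time points.
time = [0, 1, 2, 3, 4]
expr = [1.0, 2.5, 3.0, 2.5, 1.0]  # generated from 1 + 2*t - 0.5*t**2
b0, b1, b2 = fit_quadratic(time, expr)
label = sign_label(b1, b2)
```

A positive linear term with a negative quadratic term, as here, corresponds to a rise followed by a decline over the observed range.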

Polynomial regression Liu et al. (BMC Bioinformatics, 2005)
Strengths:
– Biologists are reasonably familiar with quadratic regression analysis.
– Regression coefficients are easy to interpret.
– For a small number of doses or time points, and for evenly spaced doses, a quadratic model may be a reasonable approximation.
– Easy-to-use Excel-based software is available.

Polynomial regression Liu et al. (BMC Bioinformatics, 2005)
Two major limitations, because the method is fully parametric:
1. Departures from the quadratic model are common; in such cases the quadratic model may not be correct.
2. The normality assumption need not be valid.
(Figure with axis label: Time)

“Semi-parametric” regression methods
Several authors have tried a semi-parametric regression approach to gene expression data, e.g.
– de Hoon et al. (Bioinformatics, 2002)
– Bar-Joseph et al. (PNAS, 2003; Bioinformatics, 2004)
– Luan and Li (Bioinformatics, 2003)
– Storey et al. (PNAS, 2005)

Storey et al. (2005)
Basic idea: For each gene, they fit a mixed effects model with a B-spline basis. This methodology is largely based on Brumback and Rice (JASA, 1998). The statistical significance of each gene is evaluated using an F-like test statistic, with the P-value (q-value) determined by bootstrap.

Storey et al. (2005)
Strengths:
– It is semi-parametric.
– A user-friendly software package called EDGE is available.
Limitations:
– It does not perform well for “threshold” patterns of gene expression.
– The “conditions” should be numerical.
– Unequal dose or time spacing can have an impact on the performance of the procedure.

Order Restricted Inference for Ordered Gene ExpressioN (ORIOGEN) Peddada et al. (Bioinformatics, 2003, 2005) Simmons and Peddada (Bioinformation, 2007)

Temporal Profile / Dose Response
The pattern of the (unknown) mean expression of a gene over time (dose) is known as the temporal profile (dose response) of the gene.
– ORIOGEN uses mathematical (in)equalities to describe a profile.

Some Examples Null profile:

Examples Continued … Up-down profile with maximum at 3 hours

Examples Continued … Non-increasing profile Cyclical profile
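The profile definitions on these example slides were shown as formulas that are not in the transcript. As a hedged illustration, the named profiles can be written as follows, where the notation is an assumption: μ_1, …, μ_T denote the unknown mean expression of a gene at the T ordered time points (or doses), and i and j are the positions of the turning points.

```latex
\begin{align*}
\text{Null profile:} \quad & \mu_1 = \mu_2 = \cdots = \mu_T \\
\text{Non-increasing profile:} \quad & \mu_1 \ge \mu_2 \ge \cdots \ge \mu_T \\
\text{Up-down profile with maximum at } i: \quad & \mu_1 \le \cdots \le \mu_i \ge \mu_{i+1} \ge \cdots \ge \mu_T \\
\text{Cyclical profile:} \quad & \mu_1 \le \cdots \le \mu_i \ge \cdots \ge \mu_j \le \cdots \le \mu_T
\end{align*}
```

An "up-down profile with maximum at 3 hours" would then correspond to taking i as the index of the 3-hour time point.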

ORIOGEN Step 1 (Profile specification): Pre-specify the shapes of profiles of interest.

Some Examples Of Pre-specified Profiles

ORIOGEN … Step 2 (profile fitting): Fit each pre-specified profile to each gene using the estimation procedure described in: Hwang and Peddada (1994, Ann. of Stat.)

A Brief Description Of The Estimation Procedure …

Definitions
– Linked parameters: Two parameters are said to be linked if the inequality between them is known a priori.
– Nodal parameter: A parameter is said to be nodal if it is linked to all parameters in the graph.
For any given profile, the estimation always starts at the nodal parameter.

Pool Adjacent Violators Algorithm (PAVA)
Hypothesis:
(Figure panels: Observed data; Isotonized data (PAVA))
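For readers unfamiliar with PAVA, a minimal sketch of the standard algorithm for a non-decreasing (isotonic) fit is given below. It is the textbook pooling procedure, not necessarily the exact code path used in ORIOGEN; the function names and the optional weights argument are choices made for this illustration.

```python
def pava(y, weights=None):
    """Weighted least-squares non-decreasing fit to y (pool adjacent violators)."""
    if weights is None:
        weights = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points pooled].
    blocks = []
    for value, w in zip(y, weights):
        blocks.append([float(value), float(w), 1])
        # Pool backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand each block back to one fitted value per original point.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

def pava_decreasing(y, weights=None):
    """Non-increasing fit, obtained by negating, fitting, and negating back."""
    return [-v for v in pava([-v for v in y], weights)]

isotonized = pava([1.0, 3.0, 2.0, 4.0])   # the violating pair (3, 2) is pooled
antitonized = pava_decreasing([1.5, 2.25])  # both values pooled to their mean
```

Adjacent values that violate the hypothesized order are replaced by their (weighted) average, and the pooling is repeated until no violations remain.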

Estimation: The General Idea is the only nodal parameter

Estimation Continued … From this sub-graph we estimate 1 and

A Measure of “Goodness-of-fit”: Norm
Step 3: Determine the norm of a gene corresponding to each temporal profile. This is defined as the maximum (studentized) difference between estimates corresponding to linked parameters (Peddada et al., 2001, Biometrics).

An Example
Observed data: 1, 1.5, 2, 2.5, 1.5, 2.25
Two pre-specified temporal profiles: (a), (b)

Example Continued …
Fit under profile (a): 1, 1.5, 2.25, 2.25, 1.875,
Fit under profile (b): 1, 1.5, 2, 2.5, 1.875, 1.875

Example Continued …
Norm for profile (a) = 1.25
Norm for profile (b) = 1.5
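The two norms can be reproduced from the fitted values with a short Python sketch, under several assumptions that the transcript does not confirm: that profiles (a) and (b) are up-down profiles peaking at the third and fourth time points respectively (inferred from the fitted values, since the profile figures are missing), that the example uses plain rather than studentized differences, and that linked pairs are those lying on the same arm of the profile. Only the five fitted values that survive in the transcript are used for profile (a).

```python
def updown_norm(fitted, peak):
    """Largest unstudentized difference between fitted values of linked
    parameters for an up-down profile whose maximum is at index `peak`."""
    rising = fitted[: peak + 1]  # ordered (non-decreasing) up to the peak
    falling = fitted[peak:]      # ordered (non-increasing) from the peak on
    return max(max(rising) - min(rising), max(falling) - min(falling))

# Fit under profile (b), as given on the slide; peak index 3 is an assumption.
fit_b = [1, 1.5, 2, 2.5, 1.875, 1.875]
norm_b = updown_norm(fit_b, peak=3)

# Fit under profile (a): only five values appear in the transcript, so the
# sixth is left out rather than guessed; peak index 2 is an assumption.
fit_a_partial = [1, 1.5, 2.25, 2.25, 1.875]
norm_a = updown_norm(fit_a_partial, peak=2)
```

Under these assumptions the sketch returns 1.5 for profile (b) and 1.25 for profile (a), matching the values on the slide.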

“Best Fitting” Profile
Step 4: Identify the profile with the largest norm. In the example, profile (b) has a larger norm than profile (a). Hence profile (b) is a better fit than (a).

Statistical Significance
Step 5: The P-value for statistical significance is obtained using a bootstrap methodology.
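The details of the bootstrap are not in the transcript, so the following Python sketch shows only the generic idea of a bootstrap P-value under the null profile (all means equal). The resampling scheme (pooling all observations of a gene and redrawing groups of the original sizes), the stand-in statistic (range of the group means, not the ORIOGEN norm), the function names, and the data are all assumptions for illustration.

```python
import random

def range_of_means(groups):
    """Stand-in test statistic: largest minus smallest group mean."""
    means = [sum(g) / len(g) for g in groups]
    return max(means) - min(means)

def bootstrap_p_value(groups, statistic, n_boot=2000, seed=0):
    """Bootstrap P-value for `statistic` under the null of equal group means."""
    rng = random.Random(seed)
    observed = statistic(groups)
    # Under the null profile every time point has the same mean, so pool the
    # observations and redraw groups of the original sizes with replacement.
    pooled = [v for g in groups for v in g]
    exceed = 0
    for _ in range(n_boot):
        resampled = [[rng.choice(pooled) for _ in g] for g in groups]
        if statistic(resampled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_boot + 1)

# Hypothetical gene: three time points, three replicates each.
groups = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 3.2, 2.8]]
p_value = bootstrap_p_value(groups, range_of_means)
```

In an ORIOGEN-style analysis the statistic would be the norm of the best-fitting profile, recomputed on every bootstrap sample.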

Illustration …

MCF-7 breast cancer cells treated with 17β-estradiol (Lobenhofer et al., 2002, Mol. Endocrin.).
Gene expression was measured after 1 hr, 4 hrs, 12 hrs, 24 hrs, 36 hrs and 48 hrs of treatment.
# of genes on each chip =
# of samples at each time point = 8

Available software
– Linear Regression Method (Liu et al., 2005)
– EDGE (Storey et al., 2005)
– EPIG (Chao et al., 2008)
– ORIOGEN (Peddada et al., 2006)

Concluding remarks
Columns: Methodology | Freely available software | Applicable to ordinal “conditions” | Repeated measures and correlated data | Model assumptions
Linear Regression: Yes | No | Linear regression
EPIG: Yes | No | ?
EDGE: Yes | No | Yes | No
ORIOGEN: Yes | No

Some open problems
ORIOGEN is potentially subject to Type III error. How do we control FDR and Type III error?
How to deal with
– dependent samples?
– covariates?
Order restricted inference in the context of mixed effects linear models.

Acknowledgments
– Leping Li
– David Umbach
– Clare Weinberg
– Ed Lobenhofer
– Cynthia Afshari
Software developers at Constella Group
– (late) John Zajd
– Shawn Harris