Empirical Bayes Analysis of Variance Component Models for Microarray Data S. Feng, 1 R.Wolfinger, 2 T.Chu, 2 G.Gibson, 3 L.McGraw 4 1. Department of Statistics,

Presentation transcript:

Empirical Bayes Analysis of Variance Component Models for Microarray Data. S. Feng (1), R. Wolfinger (2), T. Chu (2), G. Gibson (3), L. McGraw (4). 1. Department of Statistics, North Carolina State University; 2. SAS Institute, Cary, NC 27513; 3. Department of Genetics, North Carolina State University; 4. Department of Genetics and Development, Cornell University.

Microarray: Genome-wide gene expression. 1. Originally just a term in engineering: millions of small electrodes arrayed on a (silicon) slide. 2. Introduced to genetics/genomics in 1996: ... thousands of DNA sequences arrayed on a glass slide, so that genome-wide gene expression profiles can be investigated simultaneously. At JSM 2004: more than 30 sections and more than 100 statistical papers/posters.

Two major types of microarray: oligonucleotide array and cDNA array. Ref: Nature Cell Biology, Aug, v3 (8). More refs: Nature Reviews Genetics, May, v5 (5).

Oligonucleotide Array. Each gene is represented by 1 probe set, and each probe set consists of PM/MM probe pairs. The analysis is “per gene”. The PM/MM effects are considered fixed effects. Refs: Chu, T., Weir, B., and Wolfinger, R. (02, 04); Lipshutz et al., 1999, Nature Genetics, 21(1).

Statistical problems in microarray analysis (see also: Terry Speed homepage):
- Planning of experiments: design, sample sizes.
- Quality control: variation in RNA samples, variation among arrays.
- Background subtraction.
- Normalization.
- Significance analysis.
- Multiple testing: p-values.
- Variable selection.
- Discrimination.
- Clustering of samples and of genes.
- Time course experiments.
- Clinical trials (Merck, GSK ...).
- Gene networks, pathways.
Cross-cutting themes: supervised vs. non-supervised analysis; statistical computing.

Significance Analysis (challenge). An example: oligonucleotide array (supervised). Data layout: one row per gene (Gene 1, Gene 2, Gene 3, ..., Gene n) and one column per treatment (Trt 1, Trt 2, ..., Trt k). Question: for the contrasts of interest (e.g., Trt1 vs. Trt2), which genes are differentially (significantly) expressed? This is a common problem in genome-wide studies, the “large p, small n” problem. Small n: the number of replications is small, so statistical power is low. Large p: the number of features (genes, probes, bio-markers ...) is large, which creates multiple-testing problems. Computation ...?

Data from McGraw Lab., Cornell Univ. Research interest: to investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies. Design (diagram): males from line1, line2, line3, line4, line5 (5 Trt) are each mated to females; the female flies are killed and mRNA is prepared (random effect 1; diagram labels (1), (2), (3), (4), ..., (9), (10)); the preps are hybridized to Chip1, Chip2, ..., Chip19, Chip20 (random effect 2); each chip measures ~15000 genes, and for each gene there are PM and MM probes (fixed effect); (... random effect 3).

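Data from McGraw Lab., Cornell. Indices: i = treatment (Trt1, Trt2, Trt3, Trt4, Trt5); j = prep (Prep1, Prep2); k = chip (Chip1, Chip2); l = probe (1, 2, 3, ..., 19, 20). Total: 5 x 2 x 2 x 20 = 400 data points $y_{gijkl}$ for each gene g (Gene 1, Gene 2, ...). Linear mixed model (for each gene g): $Y_{gijkl} = G_g + (G \times \text{trt})_{gi} + (G \times \text{Probe})_{gl} + (G \times \text{trt} \times \text{prep})_{gij} + (G \times \text{trt} \times \text{prep} \times \text{chip})_{gijk} + \gamma_{gijkl}$. Variance components: $\sigma_{gij}$, $\sigma_{gijk}$ and $\sigma_{gijkl}$, for the random terms with the matching subscripts.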
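To make the nested layout concrete, here is a minimal sketch that simulates the 400 data points of one gene under the model above. It is an illustration only, not the authors' code: the function name `simulate_gene`, the default of zero probe effects, and the treatment of each variance component as the variance of a normal random effect are assumptions. The treatment mean passed in stands for $G_g + (G \times \text{trt})_{gi}$.

```python
import math
import random


def simulate_gene(trt_means, var_prep, var_chip, var_resid,
                  probe_effects=None, n_prep=2, n_chip=2, n_probe=20, seed=0):
    """Simulate one gene: treatments x preps x chips x probes (hypothetical helper).

    trt_means     : one mean per treatment (gene effect plus gene-by-treatment effect)
    var_prep      : variance of the prep random effect (indices g, i, j)
    var_chip      : variance of the chip random effect (indices g, i, j, k)
    var_resid     : residual variance (indices g, i, j, k, l)
    probe_effects : fixed probe effects; assumed zero if not given
    """
    rng = random.Random(seed)
    if probe_effects is None:
        probe_effects = [0.0] * n_probe
    rows = []
    for i, mu in enumerate(trt_means):
        for j in range(n_prep):
            # one draw per prep, shared by every chip and probe nested in it
            prep = rng.gauss(0.0, math.sqrt(var_prep))
            for k in range(n_chip):
                # one draw per chip, shared by every probe on that chip
                chip = rng.gauss(0.0, math.sqrt(var_chip))
                for l in range(n_probe):
                    resid = rng.gauss(0.0, math.sqrt(var_resid))
                    y = mu + probe_effects[l] + prep + chip + resid
                    rows.append((i, j, k, l, y))
    return rows


# Five treatments with equal means; variance values are illustrative inputs.
data = simulate_gene([0.0] * 5, var_prep=0.01, var_chip=0.015, var_resid=0.072)
print(len(data))  # 5 x 2 x 2 x 20 = 400
```

The nesting of the loops mirrors the subscripts: a prep effect is drawn once per (i, j), a chip effect once per (i, j, k), and a residual once per observation.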

Significantly Expressed Genes: by SGA
Number of significantly expressed genes for the 10 possible contrasts:
Contrast         Bonferroni (.05)   F.D.R. (.05)
Trt1 vs. Trt2    0                  0
Trt1 vs. Trt3    0                  0
Trt1 vs. Trt4    0                  0
Trt1 vs. Trt5    0                  0
Trt2 vs. Trt3    0                  0
Trt2 vs. Trt4    0                  0
Trt2 vs. Trt5    0                  0
Trt3 vs. Trt4    0                  0
Trt3 vs. Trt5    0                  0
Trt4 vs. Trt5    0                  0
Possible reasons: 1. ... lower level analysis ... 2. Large p: multiple testing problems (15000 x 10 tests); FWR vs. FDR? (Not addressed in this study.) 3. Small n: low power in each single test (poorly estimated VC; small d.f. in testing ...). This study tries to improve the power of each single test.
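The two multiplicity corrections named in the table can be sketched as follows. The slides do not say which FDR procedure was used; the Benjamini-Hochberg step-up rule below is an assumption, and the p-values are made up for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / (number of tests)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule (assumed here as the FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # find the largest rank k with p_(k) <= k * alpha / m
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            k_max = rank
    # reject the k_max smallest p-values
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values for five tests.
pvals = [0.001, 0.012, 0.02, 0.03, 0.5]
print(bonferroni(pvals))          # [True, False, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, True, False]
```

With 15000 x 10 tests the Bonferroni cutoff becomes 0.05 / 150000, which shows why low-power single tests can yield no significant genes at all.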

Our Idea: Taking advantage of “large p”. Perform SGA (by mixed model) as a pilot analysis and obtain VC estimates for every gene: Gene 1 (VC1, VC2, VC3); Gene 2 (VC1, VC2, VC3); ... (Plot: estimated VC1 in black, estimated VC2 in red, estimated VC3 in blue. Note: not the “density” of each VC.) This plot does contain useful “global” information on each VC (range, “HDR” ...). The “global” information is taken as the “prior”.

Our Empirical Bayes Approach. A 7-step algorithm:
1. Apply SGA to get VC estimates;
2. Transform to the “ANOVA Components (AC)”;
3. Apply Jeffreys prior (non-informative);
4. Fit an Inverted Gamma (IG) to each AC (prior density);
5. Derive the posterior density (and the posterior estimate) of each AC;
6. Transform the posterior estimate of AC back to VC (reverse step 2);
7. Mixed model analysis: fix the VC values at their posterior estimates (the EB estimator of VC), and approximate the test statistic by the standard normal distribution.
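The core of steps 4 and 5 can be sketched for a single variance. This is a simplified stand-in, not the authors' implementation: it skips the VC/AC transformation (steps 2 and 6), fits the inverted gamma by the method of moments, and uses the textbook conjugate update for a variance estimate with `df` degrees of freedom. The function names and the input values are hypothetical.

```python
import statistics


def fit_inverse_gamma(values):
    """Method-of-moments fit of IG(shape, scale) to per-gene estimates.

    Uses mean = scale / (shape - 1) and
    variance = scale**2 / ((shape - 1)**2 * (shape - 2)).
    """
    m = statistics.mean(values)
    v = statistics.variance(values)
    shape = m * m / v + 2.0
    scale = m * (shape - 1.0)
    return shape, scale


def eb_posterior_mean(s2, df, shape, scale):
    """Posterior mean of a variance under an IG(shape, scale) prior.

    Assumes df * s2 / sigma^2 is chi-square with df degrees of freedom, so the
    posterior is IG(shape + df / 2, scale + df * s2 / 2).
    """
    return (scale + 0.5 * df * s2) / (shape + 0.5 * df - 1.0)


# Hypothetical per-gene estimates of one variance, each based on 5 d.f.
estimates = [0.05, 0.08, 0.06, 0.12, 0.07, 0.09, 0.04, 0.10]
shape, scale = fit_inverse_gamma(estimates)
shrunk = [eb_posterior_mean(s2, df=5, shape=shape, scale=scale) for s2 in estimates]
for raw, post in zip(estimates, shrunk):
    print(f"{raw:.3f} -> {post:.4f}")
```

The posterior mean is a weighted average of the gene's own estimate and the prior mean, so every estimate is pulled toward the “global” value; the weight on the prior grows as `df` shrinks.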

Real Data Example: Cornell Data
Significance test: number of significant genes, S.G.A. vs. E.B.
Contrast        Bonferroni (.05)       False Discovery Rate (.05)
                S.G.A.     E.B.        S.G.A.     E.B.
Trt1 vs. Trt    …          …           …          …
Trt1 vs. Trt    …          …           …          …
Trt1 vs. Trt    …          …           …          …
Trt1 vs. Trt    …          …           …          …
Trt2 vs. Trt    …          …           …          …
Trt2 vs. Trt    …          …           …          …
Trt2 vs. Trt    …          …           …          …
Trt3 vs. Trt    …          …           …          …
Trt3 vs. Trt    …          …           …          …
Trt4 vs. Trt    …          …           …          …

Simulation Studies. Design: the structure mimics the true data, and the parameters are set to the values estimated from the true data set. For the 3 VC: $\sigma_{gij} = 0.01$, $\sigma_{gijk} = 0.015$, $\sigma_{gijkl} = 0.072$. Genes are simulated, among which 500 are “significantly expressed” and 5000 are “non-significantly expressed”, with Trt means:
                               Trt1   Trt2   Trt3   Trt4   Trt5
Significantly expressed        …      …      …      …      …
Non-significantly expressed    0      0      0      0      0

Simulation Results (1)
EB estimator vs. REML estimator: bias, variance and MSE.
            VC1 (0.01)         VC2 (0.015)        VC3 (0.072)
            SGA       EB       SGA       EB       SGA       EB
Bias        1.3x…     …        …         …        …         …x10^-5
Variance    1.6x…     …        …         …        …         …x10^-6
MSE         1.6x…     …        …         …        …         …x10^-6
The bias, variance and MSE of EB are only fractions of those of SGA.
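For reference, the three summary measures in the table can be computed from simulated estimates as in this short sketch. The function name and the example estimates are hypothetical; only the true value 0.01 (VC1) comes from the slide.

```python
def bias_variance_mse(estimates, truth):
    """Bias, variance and MSE of a set of estimates of one known true value."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - truth
    # divide by n (not n - 1) so that MSE = variance + bias**2 holds exactly
    variance = sum((e - mean) ** 2 for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return bias, variance, mse


# Hypothetical estimates of VC1, whose true value in the simulation is 0.01.
vc1_estimates = [0.012, 0.009, 0.011, 0.008, 0.013]
bias, variance, mse = bias_variance_mse(vc1_estimates, truth=0.01)
print(bias, variance, mse)
```

Because MSE decomposes into variance plus squared bias, a shrinkage estimator can lower the MSE whenever its reduction in variance outweighs the bias it introduces.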

Simulation Results (2): The null distribution of the t test statistic: 1. SGA (red; expected to be a t distribution with df = 5); 2. EB with df = 30 (blue); 3. EB with df = 1000 (green); 4. Truth (black; expected to be the standard normal distribution).
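A small Monte Carlo sketch illustrates why the null distribution matters: a statistic built on a variance estimated with 5 d.f. has much heavier tails than one built on the true variance. The function name, the cutoff 1.96, the seed and the number of simulations are illustrative assumptions, not values from the slides.

```python
import math
import random


def null_tail_fractions(n_sim=20000, df=5, cutoff=1.96, seed=1):
    """Fraction of null statistics beyond +/- cutoff, for known vs. estimated variance."""
    rng = random.Random(seed)
    exceed_known = 0
    exceed_estimated = 0
    for _ in range(n_sim):
        z = rng.gauss(0.0, 1.0)  # statistic when the variance is known
        # chi-square draw with df degrees of freedom, built from squared normals
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        t = z / math.sqrt(chi2 / df)  # statistic when the variance is estimated
        exceed_known += abs(z) > cutoff
        exceed_estimated += abs(t) > cutoff
    return exceed_known / n_sim, exceed_estimated / n_sim


frac_normal, frac_t = null_tail_fractions()
print(frac_normal, frac_t)
```

Using a normal cutoff on a statistic that really follows a t distribution with few d.f. would inflate the test size, which is why the slide checks the EB null distributions against the standard normal.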

Simulation Results (3): Test Size and Power Calculation
Size (SGA, EB (30), EB (1000), Truth): …
Power, with % of the Truth power in parentheses:
Contrast (Trt. diff)   SGA          EB (30)          EB (1000)        Truth
T1 vs. T2 (0.15)       … (66.7%)    0.172 (98.9%)    0.176 (101.1%)   0.174 (100%)
T1 vs. T3 (0.30)       … (70.1%)    0.558 (93.6%)    …                0.596 (100%)
T1 vs. T4 (0.45)       … (76.5%)    0.870 (96.2%)    0.894 (98.9%)    0.904 (100%)
T1 vs. T5 (0.60)       … (91.8%)    0.964 (99.4%)    0.972 (100.2%)   0.970 (100%)
T2 vs. T3 (0.15)       … (67.7%)    0.164 (88.2%)    0.188 (101.1%)   0.186 (100%)
T2 vs. T4 (0.30)       … (68.3%)    0.530 (93.3%)    0.554 (97.5%)    0.568 (100%)
T2 vs. T5 (0.45)       … (83.4%)    0.772 (95.8%)    0.796 (98.6%)    0.806 (100%)
T3 vs. T4 (0.15)       … (81.4%)    0.194 (100%)     0.202 (104.1%)   0.194 (100%)
T3 vs. T5 (0.30)       … (78.2%)    0.446 (91.8%)    0.468 (96.3%)    0.486 (100%)
T4 vs. T5 (0.15)       … (75.0%)    0.144 (90.0%)    0.156 (97.5%)    0.160 (100%)
Mean                   … %          94.72%           99.53%           100%
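The quantities in this table can be computed along the following lines. This is a sketch under assumptions: the helper names, the cutoff 1.96 and the example statistics are hypothetical, while the two power values in the last lines are taken from the T1 vs. T2 row.

```python
def rejection_rate(stats, critical_value):
    """Fraction of test statistics whose absolute value exceeds the cutoff."""
    return sum(abs(t) > critical_value for t in stats) / len(stats)


def relative_power(power, truth_power):
    """Power as a percentage of the power of the 'Truth' test, to one decimal."""
    return round(100.0 * power / truth_power, 1)


# Hypothetical statistics: size is the rejection rate among null genes,
# power is the rejection rate among truly differentially expressed genes.
null_stats = [0.3, -2.4, 1.1, -0.7, 0.2, 1.5, -1.0, 0.6, 2.1, -0.4]
alt_stats = [2.5, 1.2, 3.1, 0.8, 2.2]
size = rejection_rate(null_stats, 1.96)   # 2 of 10 exceed the cutoff
power = rejection_rate(alt_stats, 1.96)   # 3 of 5 exceed the cutoff
print(size, power)

# Check against the T1 vs. T2 row: EB (30) power 0.172, Truth power 0.174.
print(relative_power(0.172, 0.174))  # 98.9
```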

Discussion. Why does the EB estimator “beat” the REML estimator? The prior density contains “truth” information. However, the prior is estimated from the data: with “large p” the prior is likely to be good; with “small p” the prior may not be good, so the EB estimator can be biased and a gain in MSE is not guaranteed. Q: How to control (large p vs. small p)? (A controlling system for the EB method: determine the shrinkage process to get the maximum gain in MSE.)

Applications. Especially designed for microarray-type data: microarray (cDNA, oligonucleotide); proteomics. Extension to general data sets (mixed model) is possible once the controlling system is built (in the near future).