Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays Department of Biostatistics, University of North Carolina,

Presentation transcript:

Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays. William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright. Department of Biostatistics, University of North Carolina, Chapel Hill; Division of Human Cancer Genetics, Ohio State University.

Measuring gene expression with the Affymetrix GeneChip. Perfect Match (PM): 25 bases complementary to a region of the gene. Mismatch (MM): the middle base is different. Diagram labels: coding portion of gene X, polyA. cRNA from sample mRNA is put on the chip; the intensity of binding reflects gene expression.

Reproducibility of Probe Sensitivities Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.

The Li-Wong Model (Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001). Li-Wong Full (LWF); Li-Wong Reduced (LWR); identifiability constraint. Annotations: ith array, jth probe pair, total no. probe pairs, expression, sensitivities.
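For reference, a sketch of the model as it is usually written following the cited Li and Wong paper. The symbols \nu_j (probe baseline) and \alpha_j (mismatch sensitivity) are assumed from that paper's notation rather than taken from this transcript, and \varepsilon_{ij} stands for a generic independent error term.

```latex
\begin{align*}
\text{LWF:}\quad & MM_{ij} = \nu_j + \theta_i \alpha_j + \varepsilon_{ij}, \\
                 & PM_{ij} = \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \varepsilon_{ij}, \\
\text{LWR:}\quad & PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \\
\text{Identifiability constraint:}\quad & \sum_{j=1}^{J} \phi_j^2 = J .
\end{align*}
```

Here i indexes arrays, j indexes probe pairs, J is the total number of probe pairs, \theta_i is the expression of the gene on array i, and \phi_j are the probe sensitivities.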

How to compare gene expression indexes? We get maximum likelihood estimates for θ using either full data (LWF) or reduced data (LWR). The Affymetrix software computes: Average Difference (AD); Log-Average (LA). The log-average might perform particularly poorly. Note that if terms are small and error variance is small,
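A minimal Python sketch of the four indexes for a single probe set, under stated assumptions: AD and LA are written in their plain untrimmed forms (the exact trimming and scaling used by the Affymetrix software are not reproduced), and the reduced Li-Wong fit uses alternating least squares, which coincides with maximum likelihood only if the errors are normal with constant variance. The function names and the toy data are hypothetical.

```python
import numpy as np

def average_difference(pm, mm):
    """Untrimmed AD: mean of PM - MM over probe pairs, one value per array."""
    return (pm - mm).mean(axis=1)

def log_average(pm, mm):
    """Untrimmed LA: mean of log(PM / MM) over probe pairs, one value per array."""
    return np.log(pm / mm).mean(axis=1)

def fit_lwr(pm, mm, n_iter=50):
    """Least-squares fit of PM - MM = theta_i * phi_j with sum(phi**2) = J."""
    y = pm - mm                      # arrays in rows, probe pairs in columns
    n_probes = y.shape[1]
    phi = np.ones(n_probes)          # starting value already satisfies the constraint
    for _ in range(n_iter):
        theta = y @ phi / (phi @ phi)            # update expression given sensitivities
        phi = theta @ y / (theta @ theta)        # update sensitivities given expression
        phi *= np.sqrt(n_probes) / np.linalg.norm(phi)  # re-impose sum(phi**2) = J
    theta = y @ phi / (phi @ phi)    # final expression estimates for the scaled phi
    return theta, phi

# Toy, noise-free data: 3 arrays x 3 probe pairs (all values are made up).
theta_true = np.array([100.0, 200.0, 400.0])
phi_true = np.array([0.5, 1.0, 1.5])
phi_true *= np.sqrt(phi_true.size) / np.linalg.norm(phi_true)
mm = np.full((3, 3), 50.0)
pm = mm + np.outer(theta_true, phi_true)

theta_hat, phi_hat = fit_lwr(pm, mm)
ad = average_difference(pm, mm)
la = log_average(pm, mm)
```

On this toy data the fit recovers theta_true, while AD equals theta_i times the mean sensitivity, which illustrates why AD needs a correction factor before it can be compared with the Li-Wong estimates.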

We gain insight by assuming the Li-Wong model is true. Then what are the consequences? For large sample sizes, the probe-specific parameters will be well-estimated.

Compare LW estimators directly: Comparing to AD is tricky, but with a correction factor AD is also an unbiased estimate of θ:

This also gives insight into “perfect match only” analyses: RE(full, PM-only)= and Furthermore, PM-only is always at least twice as efficient as LWR

Empirical Comparisons. We propose that an expression index is "good" if it has a high correlation with the underlying true expression (which is usually unknown). This correlation can be estimated using a specially designed mixing experiment. If r is the correlation coefficient between the measured index and true expression, the "relative efficiency" of two indexes can be estimated from r, as a ratio of their r²/(1−r²) values.

Suppose the true underlying gene expression for a given gene is θ. Consider two indices of gene expression, each an unbiased estimate of θ. And we have

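Can we estimate this relative efficiency? Suppose we could do a regression of an index on θ. The ratio of explained to residual variance in the model can be shown to be r²/(1−r²), and similarly for the other index, so the relative efficiency can be estimated by the ratio of the two r²/(1−r²) values.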
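A short derivation of the r²/(1−r²) step, under assumed notation: \hat\theta_A and \hat\theta_B are hypothetical names for the two indexes, e_A and e_B are their errors (taken to be uncorrelated with \theta), and relative efficiency is taken to mean the ratio of error variances.

```latex
\begin{align*}
\hat\theta_A &= \theta + e_A, \qquad \operatorname{Cov}(\theta, e_A) = 0, \\
r_A^2 &= \frac{\operatorname{Var}(\theta)}{\operatorname{Var}(\theta) + \operatorname{Var}(e_A)}
\;\Longrightarrow\;
\frac{r_A^2}{1 - r_A^2} = \frac{\operatorname{Var}(\theta)}{\operatorname{Var}(e_A)}, \\
\mathrm{RE}(A, B) &= \frac{\operatorname{Var}(e_B)}{\operatorname{Var}(e_A)}
= \frac{r_A^2 / (1 - r_A^2)}{r_B^2 / (1 - r_B^2)} .
\end{align*}
```

The unknown \operatorname{Var}(\theta) cancels in the last ratio, so only the two correlations are needed.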

Can we estimate r without ever knowing true expressions θ? Yes, with a specially designed mixing experiment. We seek two contrasting conditions in which many genes will be differentially expressed.

Experimental Design (flowchart). Cell culture: human fibroblasts (GM 08330), 20% FBS, 5 passages. Serum starvation (0.1% FBS) and serum stimulation (20% FBS), with time labels 48h and 24h. RNA extraction: harvest total RNA from Stimulated and Starved cells; add bacterial control genes (Dap, Thr, Lys, Phe); produce the 50:50 group; produce duplicates each day for 3 days (6 replicates for each condition). Synthesize cDNA, cRNA; fragment; add hybridization control genes (BioB, BioC, BioD, Cre); hybridize to HuGeneFL. Data reduction: gene expression indexes.

BIN1 expression (plot; groups: Stim, 50:50, Starved). True expression in the 50:50 group = average of Stim and Starved.

BIN1 expression (plot; groups Stim, 50:50, Starved coded as 1, 2, 3).

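Note that, because true expression in the 50:50 group is the average of Stim and Starved, true expression is linear in X, so the correlation of an index with true expression equals (up to sign) its correlation with X, where X = 1, 2, 3 (say) for Stim, 50:50, Starved, respectively.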
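A minimal Python sketch of how this could be computed for one gene, assuming 18 arrays coded X = 1, 2, 3 with 6 replicates each. The index values below are made up, and the function names are hypothetical; in the deck the per-gene ratios are summarized by their median over genes.

```python
import numpy as np

def efficiency_score(index_values, x):
    """r^2 / (1 - r^2), where r is the correlation of an index with the group code x."""
    r = np.corrcoef(index_values, x)[0, 1]
    return r**2 / (1.0 - r**2)

def relative_efficiency(index_a, index_b, x):
    """Estimated efficiency of index A relative to index B for one gene."""
    return efficiency_score(index_a, x) / efficiency_score(index_b, x)

# Group codes for 6 Stim, 6 50:50 and 6 Starved arrays.
x = np.repeat([1.0, 2.0, 3.0], 6)

# Made-up gene: expression falls linearly from Stim to Starved.
signal = 100.0 - 20.0 * x

# Deterministic "noise" that sums to zero within each group, so it is
# exactly uncorrelated with x; index B is four times noisier than index A.
noise = np.tile([1.0, -1.0, 2.0, -2.0, 0.5, -0.5], 3)
index_a = signal + noise
index_b = signal + 4.0 * noise

re_ab = relative_efficiency(index_a, index_b, x)
```

Because index B has four times the error standard deviation of index A, its error variance is 16 times larger, and the estimated relative efficiency comes out as 16.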

Mean probe intensity per array (plot; groups: Stim, 50:50, Starved). Overall intensity is higher in Stimulated.

Coefficients of variation for assay (individual probes) and gene expression indexes

Correlation matrix of 18 arrays as a colorized image for each expression index (panels: LWF, AD, LWR, LA; arrays grouped as Stim, 50:50, Starved).

Comparing Models: Cluster Analysis. Dendrograms of the 18 arrays (Strv 1-6, 50:50 1-6, Stim 1-6) for each index: Affymetrix Log Ave, Full Model, Reduced Model, Affymetrix Ave Diff.

Relative Efficiency. Table of Median(r²/(1−r²)) for LWF, LWR, AD and LA, unscaled and scaled.

Correlation of duplicate measurements of 149 genes: LWF median r = .74; LWR median r = .43; AD median r = .08; LA median r = .17.

Number of unexpressed genes (plot; groups: Stim, 50:50, Starved). Only 0.2% of the LW estimates are negative. The 50:50 group has the fewest negative estimates. Could this indicate very few unexpressed genes?

A conservative approach to estimating the number of unexpressed genes. Let U denote the number of unexpressed genes. Genes are ranked according to expression index. This is useful if we can get a random sample of unexpressed genes. Figure labels: unexpressed population; gene expression index.

We use the spiked-out bacterial control genes as a sample of "unexpressed" genes. The 4 genes are represented 3 times each (different portions of mRNA), for a total of 12 probe sets. Based on this reasoning, we estimate that greater than 88% of the genes are expressed, even in the Starved samples.

Rank of expression index variance across the 6 Stimulated arrays versus rank of index mean (panels: AD, LWF; highlighted: genes truly absent in the Stim group). Very low estimated expression for truly absent genes when using LWF.

Present/absent calls. We use the statistic to declare genes present/absent (absolute call). We find the vast majority of genes on the array appear to be present. For the spiked in/out genes, we find vastly improved present/absent calling using LW estimates.

ROC curve for the spiked in/out genes (curves: LWF-Z, LWR-Z, Untrimmed AD, Untrimmed LA, LA, AD, Absolute Call).
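A minimal Python sketch of how such a ROC curve can be traced from a present/absent score, assuming each spiked gene has a known true status and a numeric score where larger means "more likely present". The scores below are made up, and the function names are hypothetical.

```python
import numpy as np

def roc_points(scores, is_present):
    """False and true positive rates as the calling threshold is lowered."""
    scores = np.asarray(scores, dtype=float)
    is_present = np.asarray(is_present, dtype=bool)
    thresholds = np.sort(np.unique(scores))[::-1]   # strictest threshold first
    tpr = [(scores[is_present] >= t).mean() for t in thresholds]
    fpr = [(scores[~is_present] >= t).mean() for t in thresholds]
    # Prepend the (0, 0) corner: nothing is called present.
    return np.array([0.0] + fpr), np.array([0.0] + tpr)

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up scores: three spiked-in (present) and three spiked-out (absent) genes.
scores = [5.0, 4.0, 3.0, 0.5, 0.2, -0.1]
truth = [True, True, True, False, False, False]

fpr, tpr = roc_points(scores, truth)
auc = area_under_curve(fpr, tpr)
```

In this toy case every present gene scores above every absent gene, so the curve passes through the top-left corner and the area is 1; an index with poorer calling would give a curve closer to the diagonal.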

Variability in estimates (panels: Full Model, Reduced Model; axes: log(variance) and log(mean); groups: Stim, 50:50, Starved).

Conclusions. Model-based estimators are superior to simple averaging. The full model is superior to the reduced model. This does not necessarily mean that the mismatch probes are a good idea, but if they are present we should use them. We have demonstrated this using both analytic considerations and experimental data. A carefully designed experiment can be used to address many issues. Many more genes may be expressed than previously thought.

Other issues / future work. Spiking genes might be used to calibrate and normalize arrays. The relationship between variance and mean of expression indexes may be useful in planning experiments. Our data may be useful for future work, especially in producing indexes that are resistant to probe saturation. All primary data, this Powerpoint presentation and a preprint are available at