Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach Daniel Holder, Bill Pikounis, Richard Raubertas, Vladimir Svetnik,

Presentation transcript:

Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach
Daniel Holder, Bill Pikounis, Richard Raubertas, Vladimir Svetnik, and Keith Soper
Biometrics Research, Merck Research Laboratories

SAFER
Scale Matters
Additive Fits (probes and chips)
Experimental-Unit Variability
Robustness and Resistance

Goals of Data Analysis
Which genes have we detected?
Which genes have changed?
–Which genes change together?
Prerequisites
–Quantify transcript abundance (“gene expression index”)
–Quantify precision
–Assess quality

Our Data Analysis Method
Normalize chips for overall fluorescence (based on MM)*
Transform data (linear-log hybrid scale)
Fit probe-specific model using all chips (highly resistant to outliers)*
Normalize for chip bias (scatterplot smooth)*
Assess differences (include between-EU variability, e.g., ANOVA)*
* offers opportunities for QC

Fig 1: Hybrid Transformation (knot at c=20). Curves: f(x)=x, f(x)=c*ln(x/c)+c, and f(x)=hybrid(0,c). Axes: x vs. f(x).

Linear-log Hybrid Scale
\[
f(x) =
\begin{cases}
a & \text{if } x < a \\
x & \text{if } a \le x < c \\
c \ln(x/c) + c & \text{if } x \ge c
\end{cases}
\]
Typically choose a = 0
Value of c chosen for additivity
Improved homogeneity of variance
For low-expression genes, compare differences, not ratios
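
A minimal Python sketch of the linear-log hybrid transform, assuming the piecewise definition on this slide with the slide's typical a = 0 and the c = 20 knot from Fig 1 as defaults. The function name `hybrid` is illustrative and not taken from the authors' software.

```python
import math

def hybrid(x, a=0.0, c=20.0):
    """Linear-log hybrid scale: floor at a, identity on [a, c), log beyond c."""
    if x < a:
        return a                      # values below the floor are set to a
    if x < c:
        return x                      # linear segment
    return c * math.log(x / c) + c    # log segment, continuous at x = c

# Example: a negative PM - MM difference, a small one, and a large one
values = [-5.0, 10.0, 200.0]
transformed = [hybrid(v) for v in values]
```

The two branches meet at x = c with matching value and slope (both segments have slope 1 there), so small differences stay on the raw scale while large ones are compressed logarithmically.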

Probe Specific Effects
“Probe specific biases…are highly reproducible and predictable, and their adverse effect can be reduced by proper modeling and analysis methods” -Li and Wong (PNAS 2000)
Multiplicative model for PM - MM, for each probeset (i-th chip, j-th probe): \( PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij} \), with \(\theta_i\) the expression index for chip i and \(\phi_j\) the probe effect
–Resistance achieved by iteratively omitting extreme points (or chips) and refitting using least squares

Probe Specific Effects (Our Approach)
For each probeset, a resistant, additive fit to PM - MM*
–Use a fitting procedure that is highly resistant to extreme values (median polish)
* Since logs are undefined for non-positive values and unstable for small values, we use a linear-log hybrid scale
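
The additive fit can be illustrated with a short sketch of Tukey's median polish. This is a generic textbook version, not the authors' code: the function name `median_polish`, the fixed iteration count, and the toy table are assumptions made for illustration. Rows are taken to be chips and columns probes, with values already on the hybrid scale.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Decompose table[i][j] into grand + row_eff[i] + col_eff[j] + resid[i][j]."""
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    grand = 0.0
    row_eff = [0.0] * n_rows   # chip effects
    col_eff = [0.0] * n_cols   # probe effects
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Re-centre the column effects, moving their median into the grand term
        m = median(col_eff)
        grand += m
        col_eff = [v - m for v in col_eff]
        # Sweep column medians out of the residuals into the column effects
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Re-centre the row effects, moving their median into the grand term
        m = median(row_eff)
        grand += m
        row_eff = [v - m for v in row_eff]
    return grand, row_eff, col_eff, resid

# Toy 3-chip x 3-probe table: additive except for one aberrant cell (100)
toy = [[1, 2, 3],
       [2, 3, 4],
       [3, 4, 100]]
grand, chip_eff, probe_eff, resid = median_polish(toy)
```

In the toy table the aberrant cell ends up entirely in its residual, while the grand median and the chip and probe effects match the underlying additive pattern. That is the sense in which the fit is resistant to extreme values.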

Adjusting for Chip Bias
Initial centering of chips
Chip bias may depend on gene expression level
Plot chip effects vs. overall expression level (grand median) for each probeset
Omit probesets that appear to change
–Between-group |dev| / within-group |dev|
–Omit probesets in top 25%
Fit a resistant scatterplot smoother (loess)

Fig 4: Typical Chip Normalization Plot. Axes: grand median vs. chip effects* (hybrid scale). Data: 5 groups × 2 chips/group, 7.1K probesets.

Terry Speed questions
3. How do you tell that one approach to quantifying expression at the probe set level (e.g., SAFER) is better than another (e.g., dChip)?
Compare on data for which we ‘know’ the answer
–Spiking experiments (limited # genes)
–Validation (e.g., TaqMan)
–Create POS and NEG groups as best we can
How to compare (depends on downstream usage)
–repeatability
–e.g., signal to noise ⇛ t-statistic ⇛ p-value
–fold changes

Fibroblast/Adipocyte Mixing Expt
Mixture %’s (100/0, 75/25, 50/50, 25/75, 0/100)
3 chips/mix (15 chips total, Mg74A)
3 methods (SAFER, SAFER(log), dChip)
Create groups of probesets using 100/0 vs. 0/100
–POS (max p < 0.01, correct oligos, n=1049)
–NEG (incorrect oligos, n=2611)
–p-value from t-test (pooled variance, hybrid scale)
We will change the POS, NEG and p-value definitions on some of the later slides
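
As a reminder of what the pooled-variance t-test computes for one probeset, here is a small sketch. The function name `pooled_t` and the toy expression values are hypothetical. Only the t-statistic and degrees of freedom are computed, because the Python standard library has no t-distribution CDF for the p-value.

```python
import math
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance; returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = math.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
    return (mean(x) - mean(y)) / se, df

# Toy example: 3 chips per mixture group for one probeset (hybrid-scale values)
group_100_0 = [1.0, 2.0, 3.0]
group_0_100 = [4.0, 5.0, 6.0]
t_stat, df = pooled_t(group_100_0, group_0_100)
```

With 3 chips per mixture, each two-group comparison has 4 degrees of freedom.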

Fibroblast/Adipocyte Mixing Expt (2)
Performance based on 75/25 vs 25/75
–p-values from t-test (pooled variance, hybrid)
–for POS require same sign as 100/0 vs 0/100
–pos rate, false pos rate (FPR), pos rate vs FPR
Linearity?
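
The positive rate and ‘false’ positive rate at a given p-value cutoff could be tabulated roughly as below. This is a sketch under assumptions: the data layout (POS probesets as pairs of p-value and a same-sign flag, NEG probesets as bare p-values), the function names, and the toy numbers are all illustrative and not from the original analysis.

```python
def positive_rates(pos, neg_p, alpha):
    """Return (positive rate, false positive rate) at cutoff alpha.

    pos   : list of (p_value, same_sign) pairs for POS probesets, where
            same_sign says whether the 75/25 vs 25/75 difference has the
            same sign as the 100/0 vs 0/100 difference
    neg_p : list of p-values for NEG probesets
    """
    pos_hits = sum(1 for p, same_sign in pos if p <= alpha and same_sign)
    neg_hits = sum(1 for p in neg_p if p <= alpha)
    return pos_hits / len(pos), neg_hits / len(neg_p)

def rate_curve(pos, neg_p, alphas):
    """Points (FPR, positive rate) for a sequence of cutoffs."""
    points = []
    for alpha in alphas:
        pos_rate, fpr = positive_rates(pos, neg_p, alpha)
        points.append((fpr, pos_rate))
    return points

# Toy data: 4 POS probesets and 4 NEG probesets
pos = [(0.001, True), (0.02, True), (0.004, False), (0.3, True)]
neg = [0.005, 0.2, 0.6, 0.9]
curve = rate_curve(pos, neg, [0.01, 0.05, 1.0])
```

Sweeping the cutoff traces out a positive rate vs FPR curve, one curve per quantitation method.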

Fig 5: CDF for 0% vs 100% (all probesets), n = 12,654. Curves: SAFER, SAFER log, dChip.

Fig 6: CDFs for POS and NEG probesets. Panels: 0% vs 100% POS; 25% vs 75% POS; 0% vs 100% NEG; 25% vs 75% NEG. Curves: SAFER, SAFER log, dChip, uniform dist. POS: maxp < 0.01 (n = 1049). NEG: wrong sequence (n = 2611).

Fig 7: Positive Rate vs ‘False’ Positive Rate, 25% vs 75%. Curves: SAFER, SAFER log, dChip. POS: maxp < 0.01 (n = 1049). NEG: wrong seq. (n = 2611).

Fig 8: Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%. Curves: SAFER, SAFER log, dChip. POS: maxp < 0.01 (n = 1049). NEG: wrong seq. (n = 2611).

Fig 9: Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%, dChip p-values used for dChip. Curves: SAFER, SAFER log, dChip. POS: maxp < 0.01 (n = 1038). NEG: wrong seq. (n = 2611).

Fig 10: Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%. Curves: SAFER, SAFER log, dChip. POS: rank (dChip(p))

Fig 11: Boxplot of R² values for POS probesets. Methods: SAFER, SAFER(log), dChip. POS: maxp < 0.01 (n = 1049).

Fig 12: Boxplot of R² values for POS probesets, excluding the 100/0 and 0/100 groups. Methods: SAFER, SAFER(log), dChip. POS: maxp < 0.01 (n = 1049).

Terry Speed questions
1. Do you lose anything by not being able to down-weight non-performing probe pairs in the way Li & Wong can with their phi's (i.e., probe effect)?
Response: We don’t know.
Li & Wong vs. SAFER
Down-weighting non-performing probes seems like a good idea.
Is up-weighting ‘bright’ probes good? (variability, saturation)
It is possible to incorporate weighting in the polishing step.

Terry Speed questions
2. Is SAFER QC as thorough as Li & Wong's (in detecting aberrant chips, probe-sets, probe pairs)?
Response: QC is not as thorough, but:
Primary goal is to quantitate mRNA detection (and error).
Explicit QC methods aimed at avoiding the effects of aberrant arrays, probes, and individual observations are less important when resistant methods are used.
SAFER provides the same raw materials (fitted values and residuals) for QC as Li and Wong.
QC summaries can easily be made available.

Conclusions
For these data, it appears that the SAFER method performs better than dChip.
+ Better sensitivity (ROC curve)
+ Slightly better linearity
Caveat: This is one analysis of one dataset.

Acknowledgments
Biometrics Research
–Bert Gunter
Other
–David Gerhold (Pharmacology)
–John Thompson (Immunology)
–Eric Muise (Immunology)
–Karen Richards (Drug Metabolism)
–Jian Xu (Pharmacology)
–Yuhong Wang (Bioinformatics)

Backups

Example Median Polish (figure): a chip × probe table of intensities, shown with its grand median, probe effects, chip effects, and residuals.

Fig 2: Choose c using P-values from Tukey Non-additivity Test. P-values shown for Hybrid(0,1), Hybrid(0,20), Hybrid(0,40), and Raw Scale. Data: 5 groups × 2 chips/group, 7.1K probesets.

Fig 3: Within Group SD, Hybrid Scale. Axes: grand effect vs. within-group SD. Data: 5 groups × 2 chips/group, 7.1K probesets.

Fig 9: Between EU variability as a percentage of Total variability, 100*Var_Between/(Var_Between + Var_Within), vs. grand median. Panels: all probesets; probesets with mean > 50 (hybrid). P = known expressed; line = loess smooth. Data: 15 human livers × 2 chips/liver, 1.5K probesets.

dChip vs SAFER differences. Panels: 0% vs 100% (all probesets); 0% vs 100% (POS probesets); 25% vs 75% (all probesets); 25% vs 75% (POS probesets). POS: maxp < 0.01 (n = 1049).

Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%. Curves: SAFER, SAFER log, dChip. POS: maxp < 0.01 (n = 1049). NEG: wrong seq. & minp > 0.5 (n = 270).