Probe Level Analysis of Affymetrix™ Data

Probe Level Analysis of Affymetrix™ Data
Mark Reimers, NCI

Outline Design of Affy probesets Background Normalization Non-specific hybridization Estimation Comparison of Methods

Affymetrix GeneChip® Probe Arrays
[Figure: GeneChip probe array (1.28 cm), image of a hybridized probe array, and a hybridized probe cell with an oligonucleotide probe]
- Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
- Over 400,000 different probes complementary to genetic information of interest
- Target: single-stranded, fluorescently labeled DNA

Affymetrix Probe Design
[Figure: published gene sequence (5´ to 3´) covered by multiple (11-20) 25-base oligonucleotide probes, each as a Perfect Match / Mismatch pair]
Multiple 25-mer oligo probes are synthesized that represent the transcript. The probes are synthesized in pairs: a ‘Perfect Match’ (PM), complementary to the target, and a ‘Mismatch’ (MM), in which the 13th base is changed. The probes tend to be biased toward the 3’ end and vary in number from 11 to more than 20 per transcript. They are arranged as a perfect match with a corresponding homomeric mismatch. The individual probe pairs which represent a transcript are distributed about the array.
- PM is exactly complementary to the published sequence
- MM is changed on the 13th base

Chip Layout
- Typical chips are square: 640x640 (U95A), 712x712 (U133) or 1042x1042 (Plus2)
- Older chips placed all probes for one gene in a row
- Modern chips distribute probes according to sequence, not gene

Chip Nomenclature
- HGU133A: Human Genome, Unigene build 133, first chip
- PM: ‘perfect match’
- MM: ‘mismatch’
- Control sequence: sequence from an unrelated organism
- Signal: intensity; doesn’t translate directly to abundance
- Cross-hybridization: binding of sequences other than the target

Affymetrix Background Adjustment and Normalization

What’s the Issue?
- Background: some Affy chips show consistently higher values for the lowest signals (presumably absent) than others
- Background may vary over a chip
- Normalization: distribution of probe signals may differ between chips, independent of background adjustment
- PM and MM may be shifted differently

Probe Intensities in 23 Replicates

Approaches to Background
- Subtract common estimate of background
- Fit local background across chip and subtract (MAS 5.0)
- Consider background as random variable
- Use statistical theory to derive background correction

RMA ‘Bayesian’ BG Correction
- Each S = BG + Intensity + e
- BG randomly sampled from a Normal distribution
- Intensity randomly sampled from an exponential distribution
- Estimate mean and SD of the BG distribution by fitting values below the mode of the signal distribution
- Estimate Intensity, conditional on S, by integrating over possible values of BG
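When BG is normal and Intensity is exponential, the conditional estimate in the last step has a closed form. The following is a minimal Python sketch of that form as it is commonly written for RMA; it is not the code of any particular package. The names `bg_adjust`, `mu`, `sigma`, `alpha` and the example values are illustrative assumptions, and the parameters are taken as already estimated.

```python
import math

def _phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bg_adjust(s, mu, sigma, alpha):
    """Expected Intensity given observed signal s.

    Assumes BG ~ Normal(mu, sigma^2) and Intensity ~ Exponential(rate alpha);
    mu, sigma and alpha are hypothetical, pre-estimated parameters.
    """
    a = s - mu - sigma * sigma * alpha
    b = sigma
    num = _phi(a / b) - _phi((s - a) / b)
    den = _Phi(a / b) + _Phi((s - a) / b) - 1.0
    return a + b * num / den

# Illustrative values only: background around 100 with SD 20.
observed = [50.0, 120.0, 400.0, 1000.0]
adjusted = [bg_adjust(s, mu=100.0, sigma=20.0, alpha=0.01) for s in observed]
```

Signals below the background mean are mapped to small positive values rather than to negative ones, which is the practical advantage over plain subtraction.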

Approaches to Normalization
- Simple: find average of each chip; divide all values by chip average
- MAS5: trimmed mean
- Invariant set: find subset of probes in almost same rank order in each chip
- Quantile normalization: fit to average quantiles across experiment
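The first two approaches differ only in how the chip average is computed. A minimal Python sketch of scaling by a trimmed mean follows; the function names, the 2% trim fraction and the target value of 500 are assumptions chosen for illustration, not values stated in the slides.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def scale_normalize(chip, target=500.0, trim=0.02):
    """Rescale one chip so that its trimmed mean equals `target`."""
    factor = target / trimmed_mean(chip, trim)
    return [v * factor for v in chip]

# Toy chip with intensities 1..100; scaling to target 101 doubles every value.
chip = [float(v) for v in range(1, 101)]
scaled = scale_normalize(chip, target=101.0)
```

Setting `trim=0.0` reduces this to the simple divide-by-average approach, up to the choice of target.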

Probes on Different Chips
[Figure: plots of two Affymetrix chips against the experiment means]

MAS 5.0
- Plot probes from each chip against common base-line chip
- Fit regression line to middle 98% of probes

Invariant Set (Li-Wong) Method
- Select baseline chip X
- For each other chip Y:
  - Select probes p1, …, pK (K ~ 10000), such that p1 < p2 < … < pK in both chips
  - Fit running median through points { (x_p1, y_p1), …, (x_pK, y_pK) }
  - Repeat

Quantile Method (RMA)
- Distributions of probe intensities vary substantially among replicate chips
- This cannot be even approximately resolved by any linear transformation
- Drastic solution: ‘shoehorn’ all probe intensities into the same distribution
- Ideal distribution is taken as the average of all

Quantile Normalization
[Figure: density functions and cumulative distribution functions F1(x) (distribution of chip intensities) and F2(x) (reference distribution)]
- Formula: $x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$
- Assumes: gene distribution changes little
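With empirical distributions, mapping each value through F1 and then through the inverse of F2 amounts to replacing the value by the reference value of the same rank. A minimal Python sketch, assuming equal-length chips and taking the reference as the average of the sorted chips; the function name is hypothetical and ties are broken by position for simplicity.

```python
def quantile_normalize(chips):
    """Force every chip onto the average empirical distribution.

    `chips` is a list of equal-length lists of probe intensities.
    """
    n = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    # Reference distribution: mean of the r-th smallest value over all chips.
    reference = [sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n)]
    normalized = []
    for chip in chips:
        # Probe indices ordered by intensity on this chip.
        order = sorted(range(n), key=lambda idx: chip[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every chip contains exactly the same set of values; only their assignment to probes differs.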

Ratio-Intensity: Before

Ratio-Intensity: After

Critique of RMA Normalization
- Distribution of signals looks more like exponential on log scale
- No allowance for regional biases in BG
- Quantile normalization is very strong: highly expressed genes won’t be equal
  - Better to let higher end be roughly linear
- Requires much memory; could be implemented differently

Model-based Estimates for Affymetrix Raw Data

Many Probes for One Gene
[Figure: gene sequence (5´ to 3´) covered by multiple oligo probes, each as a Perfect Match / Mismatch pair]
Multiple 25-mer oligo probes are synthesized that represent the transcript. The probes are synthesized in pairs: a ‘Perfect Match’ (PM), complementary to the target, and a ‘Mismatch’ (MM), in which the 13th base is changed. The probes tend to be biased toward the 3’ end and vary in number from 11 to more than 20 per transcript. They are arranged as a perfect match with a corresponding homomeric mismatch. The individual probe pairs which represent a transcript are distributed about the array.
How to combine signals from multiple probes into a single gene abundance estimate?

Probe Variation
- Individual probes don’t agree on fold changes
- Probes for one gene may vary by two orders of magnitude on each chip
- CG content is the most important factor in signal strength
[Figure: signal from 16 probes along one gene on one chip]

Competing Models 2005
- GCOS (Affymetrix Microarray Suite 5.0): manufacturer’s software
- dChip: Li and Wong, HSPH
- Bioconductor affy package (RMA): Bolstad, Irizarry, Speed, et al.
  - Variants such as gcRMA, vsn
- Probe-level analyses: affyPLM, logit-t, …

Probe Measure Variation
- Typical probes are two orders of magnitude different!
- CG content is the most important factor
- RNA target folding also affects hybridization
[Figure: probe signal plot; axis label 3×10^4]

Principles of MAS 5 method
- First estimate background
  - bg = MM (if physically possible)
  - log(bg) = log(PM) - log(non-specific proportion) (if impossible)
  - Non-specific proportion = max(SB, e)
  - SB = Tukeybiweight(log(PM) - log(MM))
- Signal = Tukeybiweight(log(Adjusted PM))
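Both SB and Signal rely on the Tukey biweight, a robust average that down-weights values far from the median. A minimal Python sketch of the one-step version follows; the tuning constant `c = 5` and the small `eps` guard are assumptions taken from common descriptions of the method, not from the slides.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight estimate of location."""
    m = statistics.median(values)
    # Median absolute deviation from the median, used as the scale.
    s = statistics.median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)
        # Points more than c scale units from the median get zero weight.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

clean = tukey_biweight([1.0, 2.0, 3.0])
with_outlier = tukey_biweight([1.0, 2.0, 3.0, 4.0, 100.0])
```

In the second call the value 100 receives zero weight, so the estimate stays near the bulk of the data instead of being pulled toward the arithmetic mean of 22.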

Critique of MAS 5 principle
- Not clear what an average of different probes should mean
- Tukey bi-weight can be unstable when data cluster at either end, which is frequently the case here
- No ‘learning’ based on cross-chip performance of individual probes

Motivation for multi-chip models
[Figure: probe level data from spike-in study (log scale); note parallel trend of all probes. Courtesy of Terry Speed]

Linear Models
- Extension of linear regression
- Essential features:
  - Measurement errors independent of each other: ‘random noise’
    - Needs normalization to eliminate systematic variation
  - Noise levels comparable at different levels of signal
  - Small number of factors give predicted levels; they combine in a linear function or simple algebraic form

Model for Probe Signal
[Figure: chips 1 and 2 with target amounts a1, a2; probes 1, 2, 3 with affinities f1, f2, f3]
- Each probe signal is proportional to
  i) the amount of target sample: a
  ii) the affinity of the specific probe sequence to the target: f
- NB: high affinity is not the same as specificity
  - A probe can give high signal to its intended target and also to other transcripts

Multiplicative Model
- For each gene, a set of probes $p_1, \dots, p_k$
- Each probe $p_j$ binds the gene with efficiency $f_j$
- In each sample there is an amount $q_i$
- Probe intensity should be proportional to $f_j \times q_i$
- Always some noise!

Robust Statistics
- Outlier: a measure that is far beyond the typical random variation
  - Common in biological measures
  - 10-15% in Affy probe sets
- Robust methods try to fit the majority of data points
- Issue is to identify which points to down-weight or ignore
- Median is very robust, but inefficient
- Trimmed means are almost as robust and much more efficient

Robust Linear Models
- Criterion of fit
  - Least median squares
  - Sum of weighted squares
  - Least squares and throw out outliers
- Method for finding fit
  - High-dimensional search
  - Iteratively re-weighted least squares
  - Median polish

Why Robust Models for GeneChips?
- 10% - 15% of individual signals in a probe set deviate greatly from pattern
- Often outliers lie close together
- Causes:
  - Scratches
  - Proximity to heating elements
  - Uneven fluid flow

Li & Wong (dChip)
- Model: $PM_{ij} = q_i f_j + e_{ij}$
  - Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = q_i f_j + e_{ij}$, by analogy with Affy MAS 4
- Outlier removal:
  - Identify extreme residuals
  - Remove
  - Re-fit
  - Iterate
[Figure: fitting probes in one set on one chip. Dark blue: PM values; red: fitted values; light blue: probe SD]
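The PM-only model can be fitted by alternating least squares: hold the probe affinities fixed and solve for the sample amounts, then the reverse. The Python sketch below is an illustration of that idea only, not the dChip implementation; it omits the outlier-removal loop, and the function name, the iteration count and the constraint that the squared affinities sum to the number of probes are assumptions.

```python
import math

def fit_multiplicative(pm, n_iter=50):
    """Least-squares fit of pm[i][j] ~ q[i] * f[j] (i = chip, j = probe)."""
    n_chips, n_probes = len(pm), len(pm[0])
    f = [1.0] * n_probes
    q = [0.0] * n_chips
    for _ in range(n_iter):
        # Update sample amounts with affinities held fixed.
        ff = sum(x * x for x in f)
        q = [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff
             for i in range(n_chips)]
        # Update affinities with sample amounts held fixed.
        qq = sum(x * x for x in q)
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq
             for j in range(n_probes)]
        # Fix the arbitrary scale: sum of f[j]^2 equals the number of probes.
        scale = math.sqrt(sum(x * x for x in f) / n_probes)
        f = [x / scale for x in f]
        q = [x * scale for x in q]
    return q, f

# Noise-free toy probe set: 2 chips, 3 probes.
pm = [[2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
q, f = fit_multiplicative(pm)
```

The scale constraint is needed because multiplying every q by a constant and dividing every f by the same constant leaves the fitted values unchanged.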

Critique of Li-Wong model
- Model assumes that noise for all probes has the same magnitude
- All biological measurements exhibit intensity-dependent noise

Bolstad, Irizarry, Speed (RMA)
- For each probe set, take the log transform of $PM_{ij} = q_i f_j$, i.e. fit the model:
  $\mathrm{nlog}(PM_{ij}) = \log(q_i) + \log(f_j) + e_{ij}$
  where nlog() stands for logarithm after normalization
- Fit this additive model by iteratively re-weighted least squares or median polish
- Critique: assumes probe noise is constant (homoscedastic) on log scale
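Median polish fits an additive model by alternately sweeping out row medians and column medians until the residuals stop changing. A minimal Python sketch for one probe set follows, assuming rows are chips, columns are probes and the entries are already normalized log intensities; the function name, iteration limit and tolerance are hypothetical.

```python
import statistics

def median_polish(matrix, n_iter=10, tol=1e-6):
    """Fit matrix[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish."""
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians into the row (chip) effects.
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians into the column (probe) effects.
        col_med = [statistics.median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if max(abs(x) for x in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Toy log-scale probe set: 3 chips by 3 probes, exactly additive.
log_pm = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
overall, chip_eff, probe_eff, resid = median_polish(log_pm)
expression = [overall + e for e in chip_eff]
```

The per-chip expression estimate is the overall level plus the chip effect; the probe effects play the role of the log affinities.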

Comparison of Methods
- 20 replicate arrays: variance should be small
[Figure: standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level. Green: MAS5.0; black: Li-Wong; blue, red: RMA. Courtesy of Terry Speed]

Steady Improvement
- Affymetrix improves their model
  - PLIER is a multi-chip model
  - MAS P & A calls reasonable
- MAS 5.0 estimation does a reasonable job on probe sets that are bright
  - Abundant genes
- dChip and RMA do better on genes that are less abundant
  - Signalling proteins, transcription factors, etc.

Expression Comparison 1: MAS 4
[Figure: ratio-intensity plot comparing two chips from spike-in experiment. White dots represent unchanged genes; red numbers flag spike-in genes. Courtesy of Terry Speed]

Expression Comparison 2: MAS 5
[Figure: t-scores of changed genes against the theoretical t-distribution. Courtesy of Terry Speed]

Expression Comparison 3 – Li-Wong Courtesy of Terry Speed

Expression Comparison 4 - RMA Courtesy of Terry Speed

Comparison on Real Data
- These results are based on samples with 14 spike-ins: not realistic complexity
- Choe et al. (Genome Biology 2005) produced a spike-in data set with realistic complexity and found that MAS5 PM correction worked well
- Comparisons of biological variation vs technical variation in replicated samples suggest RMA defaults work best

Mix and Match Methods in affy
- Background: rma, mas
- Normalization: quantile, constant, …
- PM-correction: none,
- Model: median polish, mas

Estimates <- expresso(cel.data, bgcorrect.method = "mas", normalize.method = "quantiles", …

gcRMA: Estimating Non-specific Hybridization
- Each probe has its own characteristic cross-hybridizations (NSH)
- Mismatch is not a good estimate of NSH
- GC content may predict NSH reasonably well