Affymetrix and BioConductor

Johannes Freudenberg Cincinnati Children’s Hospital Medical Center

Overview
- Affymetrix' high-density oligonucleotide microarrays
  - GeneChip® technology
  - Terminology
  - Spotted vs. Affymetrix arrays
- Preprocessing Affymetrix microarray data using BioConductor
  - Overview
  - Background correction
  - Normalization
  - PM correction
  - Summarization
  - Many preprocessing strategies available – which one to use?
- Affymetrix & Limma
9/21/2018 Johannes Freudenberg, CCHMC

Affymetrix GeneChip® (= 0.5 inch) (Dudoit et al., 2002)

Affymetrix GeneChip® technology
(Lipshutz et al., 1999)

Affymetrix GeneChip® technology (2)
Original publication (Lockhart et al., 1996); Affymetrix

Affymetrix GeneChip® technology (3)
(Lipshutz et al., 1999)

Terminology
- Probe: an oligonucleotide of 25 bases, i.e., a 25-mer.
- Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., a gene).
- Mismatch (MM): same as PM but with a base change at the middle (13th) position (purine <-> pyrimidine transversion: G <-> C, A <-> T). The MM probe is designed to measure non-specific binding and background noise.
- Probe pair: a (PM, MM) pair.
- Probe set: a collection of 11 to 20 probe pairs related to a common gene or EST.
- AffyID: an identifier for a probe set (e.g., "A28102_at").
- CEL file: text (or binary) file containing the raw probe intensities for a single chip, created by the MAS (GCOS) software.
- MAS: Microarray Suite, an Affymetrix software package.
- GCOS: GeneChip Operating Software, the newer Affymetrix software.
Adapted from Dudoit et al., 2002
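The PM/MM relationship described above can be sketched in a few lines of base R. The function name mm_from_pm and the example sequence are made up for illustration; the only rule taken from the slide is that the 13th base is replaced by its complement.

```r
# Sketch: derive a mismatch (MM) probe from a perfect match (PM) probe by
# replacing the middle (13th) base with its complement (A<->T, G<->C).
# mm_from_pm and the example sequence are hypothetical, for illustration only.
mm_from_pm <- function(pm) {
  stopifnot(all(nchar(pm) == 25))
  middle <- substr(pm, 13, 13)
  substr(pm, 13, 13) <- chartr("ACGT", "TGCA", middle)
  pm
}

pm_probe <- paste0(strrep("ATGC", 3), "A", strrep("ATGC", 3))  # 25-mer, middle base A
mm_probe <- mm_from_pm(pm_probe)
```

Here mm_probe differs from pm_probe only at position 13.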

Affymetrix vs. other technologies
Feature          | Affymetrix            | Other technologies
Probe synthesis  | In-situ synthesized   | Spotted cDNA; oligos attached to beads
Probe length     | 25-mers               | 50-mers; varying lengths
Probe density    | High density          | Low density
Probes per gene  | Different probe pairs | One probe; multiple identical probes
Labeling         | Biotin                | Cy3, Cy5, etc.
Hybridization    | One target per spot   | Competitive

What’s the evidence?
- Expectation/hope(?): measured hybridization intensity is proportional to the number of mRNA transcripts
- A spike-in experiment seems to confirm this
- However:
  - Only ten genes
  - How were these genes selected?
(Lockhart et al., 1996)

Affymetrix chips – Summary
Biological sample -> extracted mRNA -> labeled mRNA -> hybridization, scanning & image processing -> CEL files -> data (pre-)processing

Preprocessing for Affymetrix GeneChips®
Starting point: CEL files
Steps shown on the slide: within-chip and cross-chip adjustment, sequence-specific background correction, and within-probe-set aggregation of intensity values
Two “standard” methods:
- MAS 5.0 (now GCOS/GDAS) by Affymetrix
- RMA by the Speed group (UC Berkeley)

affy – BioConductor library
“extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe level data”
Contains functions to load, store, plot, and preprocess Affymetrix microarray data
Many other related packages: affycomp, affydata, affyPLM, annaffy, gcrma, makecdfenv, matchprobes, simpleaffy, vsn, …

affy – BioConductor library
… if you haven't already done that:
- Download the BioConductor install script: source(" (or source("
- Run the script: biocLite() (or getBioC())
- Load the “affy” library:
    library(affy)
- Load our example data (3+3 subset from Bhattacharjee et al., 2001). Run the following script to download the six CEL files (*): source("
- Load the CEL files into R:
    harvard.rawData <- ReadAffy()
- If you prefer to pick the files interactively:
    harvard.rawData <- ReadAffy(widget = TRUE)
(*) Note: You may have to change your working directory to a path where you have write permission
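On recent Bioconductor releases the biocLite() and getBioC() installers named on this slide are no longer available; packages are installed through the BiocManager package instead. A minimal sketch, assuming BiocManager (which the slides do not mention) and a hypothetical helper name install_affy:

```r
# Sketch: install affy and friends with BiocManager.
# Assumption: BiocManager is the current replacement for biocLite()/getBioC();
# install_affy is a hypothetical helper, not part of any package.
install_affy <- function(pkgs = c("affy", "gcrma", "limma")) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# install_affy()   # run once; needs network access
# library(affy)
```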

What is an AffyBatch object?
Take a first look at the experiment data
    harvard.rawData
Experiment data is stored in an AffyBatch object
    slotNames(harvard.rawData)
- "cdfName" – GeneChip version
- "nrow", "ncol" – size of the chip (usually 640x640)
- "exprs" – expression matrix, contains all PM and MM intensities of the experiment
- "phenoData" – experiment annotation
- "description", "annotation" – more annotation slots
- …

Accessing an AffyBatch object
Extracting the expression matrix
    exprs(harvard.rawData)       (*)
    intensity(harvard.rawData)   (*)
PM values
    pm(harvard.rawData)[1:10, ]
MM values
    mm(harvard.rawData)[1:10, ]
Probe names
    probeNames(harvard.rawData)[1:10]
Gene names
    geneNames(harvard.rawData)[1:10]
A probe set
    probeset(harvard.rawData, "100_g_at")
The type of GeneChip
    cdfName(harvard.rawData)
…
(*) Don’t try this at home… (the full matrix is very large)

How does my data look? – Diagnostic plot I
Plot an image of an array
    image(harvard.rawData[, 1])

How does my data look? – Diagnostic plot II
Plot the intensity distribution
    hist(harvard.rawData, main = "Harvard data")

How does my data look? – Diagnostic plot III
MvA plots
    MAplot(harvard.rawData)
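For readers new to MvA plots, the two axes can be computed by hand in base R: M is the log-ratio and A the average log-intensity of two arrays. The sketch below uses made-up intensities and a hypothetical helper ma_values; as far as I know, affy's MAplot compares each array against a median reference array rather than a single second chip.

```r
# Sketch: M (log-ratio) and A (average log-intensity) for two intensity vectors.
# ma_values and the example intensities are hypothetical.
ma_values <- function(x, y) {
  lx <- log2(x)
  ly <- log2(y)
  data.frame(A = (lx + ly) / 2, M = lx - ly)
}

ma <- ma_values(x = c(8, 16), y = c(2, 16))
# plot(ma$A, ma$M, xlab = "A", ylab = "M"); abline(h = 0)
```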

How does my data look? Plot a probe set
    plot(probeset(harvard.rawData, geneNames(harvard.rawData)[1])[[1]])

How does my data look? Plot a probe set
    par(mfrow = c(2, 3))
    barplot(probeset(harvard.rawData, geneNames(harvard.rawData)[1])[[1]])

MAS 5.0 - Background correction
Intended to correct for optical effects
- Divide the array into K zones (default K = 16)
- The lowest 2% of the intensities in zone k are used to compute the background bZk and noise nZk of zone k
- The background b(x, y) of cell (x, y) is the weighted sum of all bZk
- The noise n(x, y) is computed likewise
- Background-corrected intensity: A(x, y) = max(I(x, y) – b(x, y), 0.5 * n(x, y))
(Affymetrix, 2002)
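A simplified base R sketch of two pieces of this procedure: the per-zone statistics from the lowest 2% of intensities, and the final floor at half the noise. It assumes the zone background is the mean and the zone noise the standard deviation of those lowest values, and it skips the distance-weighted smoothing across zones, so it is not the full MAS 5.0 algorithm. Function names are hypothetical.

```r
# Sketch of two steps of MAS 5.0-style background correction (simplified).
# Assumption: zone background = mean, zone noise = sd of the lowest 2% of
# intensities; the weighting of zones by distance is omitted.
zone_background <- function(zone_intensities, frac = 0.02) {
  k <- max(2, floor(frac * length(zone_intensities)))  # need >= 2 values for sd
  lowest <- sort(zone_intensities)[seq_len(k)]
  c(b = mean(lowest), n = sd(lowest))
}

# A(x, y) = max(I(x, y) - b(x, y), 0.5 * n(x, y))
mas5_adjust <- function(I, b, n, noise_frac = 0.5) {
  pmax(I - b, noise_frac * n)
}

zone_stats <- zone_background(1:100)
adjusted <- mas5_adjust(c(100, 30), b = 40, n = 10)
```

The second intensity (30) lies below the background (40), so it is floored at 0.5 * n instead of going negative.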

    harvard.mas5BG <- bg.correct(harvard.rawData, "mas")
    hist(harvard.mas5BG, main = "Harvard data after MAS 5.0 BG correction")

    par(mfrow = c(2, 3))
    MAplot(harvard.mas5BG)

RMA Background correction
S: observed PM intensity
Model: S is the sum of the “true” signal X and the background signal Y
$S = X + Y$, where $X \sim \mathrm{Exp}(\alpha)$ and $Y \sim N(\mu, \sigma^2)$ are independent random variables
(Speed, 2002)

RMA Background correction
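Correct for background by replacing S with $E(X \mid S = s)$
To do that, estimate $\alpha$, $\mu$, $\sigma$ from the data
Let $a = s - \mu - \sigma^2\alpha$ and $b = \sigma$; then
$E(X \mid S = s) = a + b\,\dfrac{\phi(a/b) - \phi((s-a)/b)}{\Phi(a/b) + \Phi((s-a)/b) - 1}$
where $\phi$ and $\Phi$ are the standard normal density and distribution function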
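The conditional expectation can be evaluated directly in base R. The sketch below assumes the commonly published closed form for the exponential-plus-normal model, with alpha the rate of the exponential signal, and it takes alpha, mu, and sigma as given; how affy estimates them from the data is not shown. The function name and parameter values are made up.

```r
# Sketch of the RMA background adjustment E(X | S = s).
# Assumptions: alpha is the rate of the exponential signal; alpha, mu, sigma
# are already estimated. rma_bg_adjust is a hypothetical name.
rma_bg_adjust <- function(s, alpha, mu, sigma) {
  a <- s - mu - sigma^2 * alpha
  b <- sigma
  a + b * (dnorm(a / b) - dnorm((s - a) / b)) /
    (pnorm(a / b) + pnorm((s - a) / b) - 1)
}

s <- c(50, 150, 1000)  # observed PM intensities (made up)
adjusted <- rma_bg_adjust(s, alpha = 0.01, mu = 100, sigma = 20)
```

Unlike a plain subtraction of the background mean, the adjusted values stay positive even for intensities below mu, and bright probes are reduced by roughly mu + sigma^2 * alpha.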

RMA - Background correction
    harvard.rmaBG <- bg.correct(harvard.rawData, "rma")
    hist(harvard.rmaBG, main = "Harvard data after RMA BG correction")

RMA - Background correction
    par(mfrow = c(2, 3))
    MAplot(harvard.rmaBG)

Why should we normalize?
… to remove chip effects

MAS 5.0 – Global Constant Normalization
Global constant normalization: a statistical parameter is used as a global (= per-chip) scaling factor, such as
- Sum
- Median, quantiles/percentiles
- Mean (also trimmed mean, asymmetric trimmed mean)
Normalization transforms the data so that afterwards the parameter is equal for all chips
Intensity-independent normalization(!)
MAS 5.0: scales the 2% trimmed mean of each chip to a target value m (default = 500) on the linear scale (as opposed to the log scale)
Note: In MAS 5.0, this step is done after summarization
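The scaling idea fits in a few lines of base R. This sketch rescales every chip (column) so that its 2% trimmed mean equals the target of 500; the function name and the simulated data are made up. As far as I know, affy's "constant" method by default scales to a reference chip rather than to a fixed target, so this illustrates the principle rather than that function.

```r
# Sketch: global constant normalization to a fixed target.
# Each column (chip) is multiplied by one factor so that its trimmed mean
# equals `target`. scale_to_target and the data are hypothetical.
scale_to_target <- function(x, target = 500, trim = 0.02) {
  factors <- target / apply(x, 2, mean, trim = trim)
  sweep(x, 2, factors, `*`)
}

set.seed(1)
raw <- cbind(chip1 = rexp(1000, rate = 1 / 200),
             chip2 = rexp(1000, rate = 1 / 350))
scaled <- scale_to_target(raw)
trimmed_means <- apply(scaled, 2, mean, trim = 0.02)
```

Because every intensity on a chip is multiplied by the same factor, the correction does not depend on intensity.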

    harvard.mas5norm <- normalize(harvard.mas5BG, "constant")
    hist(harvard.mas5norm, main = "Harvard data after MAS 5.0 BG & normalization")

    par(mfrow = c(2, 3))
    MAplot(harvard.mas5norm)

RMA – Quantile normalization
Assumption: the true intensity distributions are identical in all replicate samples
- Then all points lie on the diagonal in a Q-Q plot
- Works the same way in higher dimensions
If the observed intensity distributions differ, “force” them to be equal
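"Forcing" the distributions to be equal can be sketched in base R: sort each chip, average across chips rank by rank, and put those averages back in each chip's original order. This is a minimal version with no special handling of ties or missing values, and quantile_normalize is a hypothetical name rather than the function affy calls.

```r
# Sketch: quantile normalization of a probes x chips matrix.
# Assumes no missing values and ignores ties; quantile_normalize is hypothetical.
quantile_normalize <- function(x) {
  ord <- apply(x, 2, order)          # per-chip ordering of the probes
  target <- rowMeans(apply(x, 2, sort))  # mean of the k-th smallest values
  out <- x
  for (j in seq_len(ncol(x))) {
    out[ord[, j], j] <- target       # k-th smallest probe gets k-th target value
  }
  out
}

x <- cbind(c(5, 2, 3), c(4, 1, 6))
qn <- quantile_normalize(x)
```

After normalization every column contains exactly the same set of values, so the chips' intensity histograms coincide.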

    harvard.rmaNorm <- normalize(harvard.rmaBG, "quantiles")
    hist(harvard.rmaNorm, main = "Harvard data after RMA BG & normalization")

    par(mfrow = c(2, 3))
    MAplot(harvard.rmaNorm)

PM correction
Background correction is intended to adjust the signal for non-specific contributions such as
- unspecific binding
- cross-hybridization
- auto-fluorescence of the surface
The traditional approach cannot be used with high-density arrays
Therefore, Affymetrix invented MMs, which are designed to measure non-specific signal contributions
Original idea: “true” signal = PM - MM
Problems:
- PM < MM in approx. 30% of all probe pairs
- MM signals really are specific (but less sensitive)

MAS 5.0 – PM correction
In MAS 5.0, an “ideal” mismatch (IM) is computed:
- IM = MM if MM is “well-behaved” (i.e., MM < PM)
- IM = down-scaled MM if MM > PM but most other MMs of the corresponding probe set are “well-behaved”
- IM ≈ PM if most MMs are “defective”
The adjusted probe value is PM – IM
… for a more detailed description, refer to Affymetrix’ Statistical Algorithms Description Document
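For orientation, the three cases can be written as a piecewise rule. The version below is my recollection of the rule in the Statistical Algorithms Description Document, not a quotation; the symbols $SB_i$, $\tau_c$, $\tau_s$ and the default values are not on the slide and should be checked against that document. $SB_i$ denotes a robust (Tukey biweight) average of $\log_2(PM_{ij}/MM_{ij})$ over the probe pairs $j$ of probe set $i$.

```latex
IM_{ij} =
\begin{cases}
MM_{ij}, & MM_{ij} < PM_{ij} \\[4pt]
\dfrac{PM_{ij}}{2^{SB_i}}, & MM_{ij} \ge PM_{ij} \text{ and } SB_i > \tau_c \\[8pt]
\dfrac{PM_{ij}}{2^{\,\tau_c / \left(1 + (\tau_c - SB_i)/\tau_s\right)}}, & MM_{ij} \ge PM_{ij} \text{ and } SB_i \le \tau_c
\end{cases}
\qquad \text{(assumed defaults: } \tau_c = 0.03,\ \tau_s = 10\text{)}
```

In the last case the exponent is small and positive, so IM ends up just below PM, which matches the third bullet.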

MAS 5.0 – PM correction
    harvard.mas5PMcorr <- pmcorrect.mas(harvard.mas5norm)
    plot(log2(pm(harvard.mas5norm[, 6])), log2(harvard.mas5PMcorr[, 6]),
         pch = ".", xlab = "PM", ylab = "corrected probe value",
         main = "Harvard data after MAS 5.0 PM correction")

PM correction?
- MMs are greater than the corresponding PMs in approx. 30% of all probe pairs
- Therefore, RMA does not do PM correction
- RMA uses only PMs and ignores MMs

PM correction?
- MM signals are specific (but less sensitive)
- Hybridization behavior/labeling efficiency depends on the middle base
(Chudin et al., 2001)

PM correction!
- Intensities depend on the number of A, C, G, and T bases in the probe
- Intensities depend on the position of A, C, G, and T bases in the probe
- New model-based approach: GCRMA (Wu et al., 2004)

Summarization
Purpose: for every probe set (i.e., gene/EST), merge the 11 to 20 probe values into a single expression value
MAS 5.0
- Computes Tukey’s biweight (a robust weighted mean) for every probe set
- Does not “borrow” information from other arrays
RMA
- Performs median polish (a robust method to fit a linear model similar to a two-way ANOVA model)
- Information is “borrowed” from other arrays
- Note: Reported RMA expression values are on the log2 scale!
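Both summaries can be tried on toy data in base R. The sketch below assumes the usual one-step form of Tukey's biweight (tuning constant 5, small epsilon to avoid division by zero) and uses stats::medpolish for the median polish; the probe values are made up, and neither helper is the code affy runs internally.

```r
# Sketch 1: one-step Tukey biweight of the probe values of one probe set on
# one chip (MAS 5.0 applies it to log2 values). tukey_biweight is hypothetical.
tukey_biweight <- function(x, tune = 5, eps = 1e-4) {
  m <- median(x)
  s <- median(abs(x - m))                    # median absolute deviation
  u <- (x - m) / (tune * s + eps)
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)    # outliers get weight 0
  sum(w * x) / sum(w)
}
bw <- tukey_biweight(c(1, 2, 3, 4, 100))     # the outlier 100 is ignored

# Sketch 2: median polish on a probes x chips matrix of log2 PM values.
# Model: value = overall + probe effect + chip effect + residual.
log_pm <- outer(c(-1, 0, 1), c(8, 9, 10, 11), "+")  # 3 probes, 4 chips
fit <- medpolish(log_pm, trace.iter = FALSE)
chip_expression <- fit$overall + fit$col     # one value per chip
```

The biweight uses only the values of one chip, whereas the median polish estimates probe effects from all chips at once, which is the "borrowing" mentioned above.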

Summarization
    harvard.mas5expr <- computeExprSet(harvard.mas5norm, "mas", "mas")
    harvard.rmaExpr <- computeExprSet(harvard.rmaNorm, "pmonly", "medianpolish")
    # MAS 5.0 values are on the linear scale, so take log2 before plotting
    plot(log2(exprs(harvard.mas5expr)[, 1]), log2(exprs(harvard.mas5expr)[, 4]),
         xlab = "Adeno 1", ylab = "Normal 1", main = "MAS 5")
    # RMA values are already on the log2 scale; column 4 is the first normal sample
    plot(exprs(harvard.rmaExpr)[, 1], exprs(harvard.rmaExpr)[, 4],
         xlab = "Adeno 1", ylab = "Normal 1", main = "RMA")

expresso – does it all at once
Example:
    harvard.exprData <- expresso(harvard.rawData,
                                 bgcorrect.method = "rma",
                                 normalize.method = "constant",
                                 pmcorrect.method = "pmonly",
                                 summary.method = "avgdiff")
The result is an “expression set” holding one expression value per probe set and chip (an AffyBatch, by contrast, holds probe-level data)

Shortcuts
Some of the most widely used strategies have wrapper functions
MAS 5.0
    harvard.mas5 <- mas5(harvard.rawData)
RMA
    harvard.rma <- rma(harvard.rawData)
GCRMA
    library(gcrma)
    harvard.gcrma <- gcrma(harvard.rawData)

What preprocessing strategy to use?
… No one knows
But there is evidence that
- MAS 5.0 is not a good idea
- RMA is a much better alternative
- Other, model-based approaches work well (e.g., GCRMA, VSN)
You can evaluate your favorite strategy using benchmark data
- Spike-in experiments (Affymetrix, GeneLogic)
- Dilution experiment (GeneLogic)

affycomp
- affycomp is an R package to evaluate your favorite preprocessing strategy using publicly available benchmark data
- It’s also a website
- 69 different strategies submitted so far
Limitations of benchmark data(?)
- Unrealistically high data quality
- Unrealistic set-up
- “Over-fitting”(?)

Take-home messages
- Affymetrix microarrays are in-situ synthesized oligonucleotide arrays
- Each gene is represented by a set of perfect match probes and mismatch probes
- Raw probe intensities require preprocessing
  - Background correction (optional)
  - Normalization (recommended)
  - PM correction using MM probes (optional)
  - Summarization (required)
- BioConductor’s affy package offers the necessary tools for handling Affymetrix data
- It really does matter which strategy you use!

Affymetrix and Limma

Limma can handle Affy data
Just to double check the sample order…
    adeno1.CEL adeno2.CEL adeno3.CEL normal1.CEL normal2.CEL normal3.CEL
Load the library…
> library(limma)
Please read the user’s guide…
> limmaUsersGuide()

Need to define the regression model…
> design <- model.matrix(~ 1 + factor(c(0, 0, 0, 1, 1, 1)))
> colnames(design) <- c("adeno", "normVsCa")
> design
(a 6 x 2 matrix: the adeno column is 1 for all six chips; the normVsCa column is 0 for the three adenocarcinoma chips and 1 for the three normal chips)
- The first coefficient estimates the mean log-expression for adenocarcinoma and plays the role of an intercept
- The second coefficient estimates the difference between normal tissue and carcinoma (normal minus carcinoma)

Differentially expressed genes can be found by
> fit <- lmFit(harvard.rma, design)
> fit <- eBayes(fit)
> topTable(fit, number = 30, coef = "normVsCa", adjust.method = "BH")

lmFit – linear model fit
Fits gene expression data to a linear regression model such as
y = systematic effect + systematic effect + … + random effect
Goal is to separate random variation from systematic effects
Example: $\mathrm{expr}(\text{gene}_i) = \text{baseline expr}_i + \text{carcinoma effect}_i + \varepsilon_{ij}$, with $\varepsilon_{ij} \sim N(0, \sigma_i^2)$
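For a single gene, the fit that lmFit performs row by row can be reproduced with base R's lm.fit. The six log2 expression values below are made up; the design is the intercept-plus-group layout used for the six example chips (three adenocarcinoma, three normal).

```r
# Sketch: the per-gene linear model for one gene, using base R only.
# The expression values are hypothetical.
group <- factor(c(0, 0, 0, 1, 1, 1))          # 0 = adeno, 1 = normal
design <- model.matrix(~ 1 + group)
colnames(design) <- c("adeno", "normVsCa")

y <- c(7.9, 8.1, 8.0, 9.0, 9.2, 8.8)          # log2 expression of one gene
coefs <- lm.fit(design, y)$coefficients
```

The first coefficient equals the mean of the three adenocarcinoma values and the second equals the mean of the normal values minus that mean; whatever is left over is the random part.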

eBayes – empirical Bayes statistics for differential expression
Mario will talk about this in detail, but briefly:
- Underestimated variance leads to an overestimated t-statistic, which leads to false positives
- Overestimated variance leads to an underestimated t-statistic, which leads to false negatives
- eBayes improves the t-statistic by replacing the traditional variance estimate with a modified variance estimate that “borrows” information from other genes
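The "borrowing" can be made concrete with the standard shrinkage formula for the moderated variance: a weighted average of the gene's own variance and a prior variance shared by all genes, weighted by their degrees of freedom. In limma the prior variance and prior degrees of freedom are estimated from the data by eBayes; in this sketch they are simply made-up inputs, and moderated_variance is a hypothetical name.

```r
# Sketch: moderated (shrunken) variance,
#   s2_post = (df_prior * s2_prior + df * s2) / (df_prior + df).
# s2_prior and df_prior are made-up values here, not estimated from data.
moderated_variance <- function(s2, df, s2_prior, df_prior) {
  (df_prior * s2_prior + df * s2) / (df_prior + df)
}

s2 <- c(0.01, 1)   # one gene with a suspiciously small variance, one with a large one
s2_post <- moderated_variance(s2, df = 4, s2_prior = 0.25, df_prior = 4)
```

Both variances move toward the prior value of 0.25: the very small one is increased (fewer false positives) and the large one is decreased (fewer false negatives).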

Visualize top 300 genes
heatmap() visualizes intensities and clusters genes and samples
> m <- topTable(fit, number = 300, coef = "normVsCa", adjust.method = "BH")
> index <- rownames(m)   # assumes the row names of m are probe set IDs (current limma)
> heatmap(-exprs(harvard.rma)[index, ])

References
Affymetrix. Statistical Algorithms Description Document. 2002, Affymetrix, Inc.: Santa Clara, CA.
Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS. 98(24), November 2001.
BioConductor Vignettes.
BioConductor. Limma: Linear Models for Microarray Data User’s Guide.
Chudin E, Walker R, Kosaka A, Wu SX, Rabert D, Chang TK, Kreder DE. Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip arrays. Genome Biol. 2002;3(1):RESEARCH0005. Epub 2001 Dec 14.
Dudoit S, Gentleman R, Irizarry R, Yang YH. Introduction to DNA Microarray Technologies. Bioconductor Short Course, Winter 2002.
Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat Genet. 1999 Jan;21(1 Suppl):20-4.
Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996 Dec;14(13).
Speed T. (2002). Summarizing and comparing GeneChip® data. Presentation at Affymetrix User Meeting.
Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F. A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Johns Hopkins University, Dept. of Biostatistics Working Papers, Working Paper 1. May 28, 2004.