Presentation is loading. Please wait.

Presentation is loading. Please wait.

Affymetrix and BioConductor

Similar presentations


Presentation on theme: "Affymetrix and BioConductor"— Presentation transcript:

1 Affymetrix and BioConductor
Johannes Freudenberg Cincinnati Children’s Hospital Medical Center

2 Johannes Freudenberg, CCHMC
Overview Affymetrix' high-density oligonucleotide microarrays GeneChip® Technology Terminology Spotted vs. Affymetrix Arrays Preprocessing Affymetrix MA data using BioConductor Overview Background correction Normalization PM correction Summarization Many preprocessing strategies available – which one to use? Affymetrix & Limma 9/21/2018 Johannes Freudenberg, CCHMC

3 Johannes Freudenberg, CCHMC
Affymetrix GeneChip® (=0.5 inch) Dudoit et al., 2002 9/21/2018 Johannes Freudenberg, CCHMC

4 Affymetrix GeneChip® technology
Lipshutz et al., 1999 9/21/2018 Johannes Freudenberg, CCHMC

5 Affymetrix GeneChip® technology (2)
Original Publication (Lockhart et al., 1996) Affymetrix ( 9/21/2018 Johannes Freudenberg, CCHMC

6 Affymetrix GeneChip® technology (3)
Lipshutz et al., 1999 9/21/2018 Johannes Freudenberg, CCHMC

7 Johannes Freudenberg, CCHMC
Terminology Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer. Perfect match (PM): A 25-mer complementary to a reference sequence of interest (gene, ). Mismatch (MM): same as PM but with base change for the middle (13th) base (transversion purine <-> pyrimidine, G <->C, A <->T) . The purpose of the MM probe design is to measure non-specific binding and background noise. Probe-pair: a (PM,MM) pair. Probe set: a collection of probe-pairs (11 to 20) related to a common gene or EST. AffyID: an identifier for a probe-pair set (eg. “A28102_at”) CEL file: text (or binary) file containing raw probe intensities for a single chip created by MAS (GCOS) software MAS: Microarray Suite – Affymetrix software package GCOS: GeneChip operating software – new Affymetrix software Adapted from Dudoit et al., 2002 9/21/2018 Johannes Freudenberg, CCHMC

8 Affymetrix vs. other technologies
Feature Affymetrix Other technologies Probe synthesis - In-situ synthesized - Spotted cDNA - Oligos attached to beads Probes length - 25-mers - 50-mers - Varying lengths Probe density - High density - Low density Probes per gene different probe pairs - One probe - Multiple identical probes Labeling - Biotin Cy3, Cy5, etc. Hybridization - One target per spot Competitive 9/21/2018 Johannes Freudenberg, CCHMC

9 Johannes Freudenberg, CCHMC
What’s the evidence? Expectation/ Hope(?): measured hybridization intensity proportional to # of mRNA transcripts Spike-in experiment seems to confirm this However Only ten genes Selection of these genes? Lockhart et al., 1996 9/21/2018 Johannes Freudenberg, CCHMC

10 Affymetrix chips – Summary
Biological Sample Extracted mRNA Labeled mRNA Hybridization, scanning & image processing Data (pre-)processing CEL files 9/21/2018 Johannes Freudenberg, CCHMC

11 Preprocessing for Affymetrix GeneChips®
CEL files within-chip cross-chip sequence specific background correction within-probe set aggregation of intensity values Two “standard” methods MAS 5.0 (now GCOS/GDAS) by Affymetrix RMA by Speed group (UC Berkeley) 9/21/2018 Johannes Freudenberg, CCHMC

12 affy – BioConductor library
“extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe level data” ( Contains functions to Load Store Plot, and Preprocess Affymetrix microarray data Many other related packages affycomp, affydata, affyPLM, annaffy, gcrma, makecdfenv, matchprobes, simpleaffy, vsn, … 9/21/2018 Johannes Freudenberg, CCHMC

13 affy – BioConductor library
… if you haven‘t already done that Download BioConductor install script source(" (or source(" Run the script biocLite() (or getBioC()) Load the “affy” library library(affy) Load our example data (3+3 subset from Bhattacharjee et al., 2001) Run the following script to download the six CEL files (*) source(" Load CEL files into R harvard.rawData <- ReadAffy() If you prefer interactive harvard.rawData <- ReadAffy(widget = T) (*) Note: You may have to change your working directory to a path where you have writing permission 9/21/2018 Johannes Freudenberg, CCHMC

14 What is an AffyBatch object?
Take a first look at the experiment data harvard.rawData Experiment data is stored in an AffyBatch object slotNames(harvard.rawData) "cdfName" – GeneChip version "nrow", "ncol" – size of the chip (usually 640x640) "exprs" – expression matrix, contains all PM and MM intensities of the experiment "phenoData" – experiment annotation "description", "annotation" – more annotation slots 9/21/2018 Johannes Freudenberg, CCHMC

15 Accessing an AffyBatch object
Extracting the expression matrix exprs(harvard.rawData)(*) or intensity(harvard.rawData)(*) PM values pm(harvard.rawData)[1:10,] MM values mm(harvard.rawData)[1:10,] probe names probeNames(harvard.rawData)[1:10] gene names geneNames(harvard.rawData)[1:10] a probe set probeset(harvard.rawData, "100_g_at") the type of GeneChip cdfName(harvard.rawData) (*) Don’t try this at home… 9/21/2018 Johannes Freudenberg, CCHMC

16 How does my data look? – Diagnostic plot I
Plot an image of an array image(harvard.rawData[,1]) 9/21/2018 Johannes Freudenberg, CCHMC

17 How does my data look? – Diagnostic plot II
Plot the intensity distribution hist(harvard.rawData, main = "Harvard data") 9/21/2018 Johannes Freudenberg, CCHMC

18 How does my data look? – Diagnostic plot III
MvA plots MAplot(harvard.rawData) 9/21/2018 Johannes Freudenberg, CCHMC

19 Johannes Freudenberg, CCHMC
How does my data look? Plot a probe set plot(probeset(harvard.rawData, geneNames(harvard.rawData)[1])[[1]]) 9/21/2018 Johannes Freudenberg, CCHMC

20 Johannes Freudenberg, CCHMC
How does my data look? Plot a probe set par(mfrow = c(2, 3)) barplot(probeset(harvard.rawData,geneNames(harvard.rawData)[1])[[1]]) 9/21/2018 Johannes Freudenberg, CCHMC

21 MAS 5.0 - Background correction
Intended to correct for optical effects Divide array into K zones (default K = 16) Lowest 2% of the intensities in zone k are used to compute background bZk and noise nZk of zone k Background b(x, y) of cell (x, y) is the weighted sum of all bZk Noise n(x,y) is computed likewise Background corrected intensity A(x,y) = max(I(x,y) – b(x,y), 0.5*n(x,y)) (Affymetrix, 2002) 9/21/2018 Johannes Freudenberg, CCHMC

22 MAS 5.0 - Background correction
harvard.mas5BG <- bg.correct(harvard.rawData,"mas") hist(harvard.mas5BG, main = "Harvard data after MAS 5.0 BG correction") 9/21/2018 Johannes Freudenberg, CCHMC

23 MAS 5.0 - Background correction
par(mfrow = c(2,3)) MAplot(harvard.mas5BG) 9/21/2018 Johannes Freudenberg, CCHMC

24 RMA Background correction
S observed PM intensity Model: S sum of “true” signal X and background signal Y S = X + Y, where X ~ Exp(), Y~ N(,²) independent random variables X Y S + = (Speed, 2002) 9/21/2018 Johannes Freudenberg, CCHMC

25 RMA Background correction
E(X | S = s) S Correct for background by replacing S with E(X | S = s) To do that estimate , ,  from data Let a = s -  - ² b =  9/21/2018 Johannes Freudenberg, CCHMC

26 RMA - Background correction
harvard.rmaBG <- bg.correct(harvard.rawData,"rma") hist(harvard.rmaBG, main = "Harvard data after RMA BG correction") 9/21/2018 Johannes Freudenberg, CCHMC

27 RMA - Background correction
par(mfrow = c(2,3)) MAplot(harvard.rmaBG) 9/21/2018 Johannes Freudenberg, CCHMC

28 Why should we normalize?
… to remove chip effects 9/21/2018 Johannes Freudenberg, CCHMC

29 MAS 5.0 – Global Constant Normalization
Global constant normalization: Statistical parameters are used as global (= per-chip) scaling factor, such as: Sum Median, Quantiles/Percentiles Mean (also trimmed mean, asymmetric trimmed mean) Normalization transforms data – afterwards parameter is equal for all chips Intensity independent normalization(!) MAS 5.0: 2% trimmed mean m (default = 500) on the linear scale (as opposed to the log scale) Note: In MAS 5.0, this step is done after summarization 9/21/2018 Johannes Freudenberg, CCHMC

30 MAS 5.0 – Global Constant Normalization
harvard.mas5norm <- normalize(harvard.mas5BG, "constant") hist(harvard.mas5norm, main = "Harvard data after MAS 5.0 BG & normalization") 9/21/2018 Johannes Freudenberg, CCHMC

31 MAS 5.0 – Global Constant Normalization
par(mfrow = c(2,3)) MAplot(harvard.mas5norm) 9/21/2018 Johannes Freudenberg, CCHMC

32 RMA – Quantile normalization
Assumption: True intensity distributions identical in all replicate samples Then all points lie on the diagonal in a Q-Q plot Works the same way in higher dimensions If observed intensity distribution are different “force” them to be equal 9/21/2018 Johannes Freudenberg, CCHMC

33 RMA – Quantile normalization
harvard.rmaNorm <- normalize(harvard.rmaBG, "quantiles") hist(harvard.rmaNorm, main = "Harvard data after RMA BG & normalization") 9/21/2018 Johannes Freudenberg, CCHMC

34 RMA – Quantile normalization
par(mfrow = c(2,3)) MAplot(harvard.rmaNorm) 9/21/2018 Johannes Freudenberg, CCHMC

35 Johannes Freudenberg, CCHMC
PM correction Background correction intended to adjust signal for non-specific contributions such as unspecific binding cross-hybridization auto- fluorescence of the surface Traditional approach cannot be used with high-density arrays Therefore, Affymetrix invented MMs, which are designed to measure non-specific signal contributions Original idea: “true” signal = PM - MM Problems: PM < MM in approx. 30% of all probe pairs MM signals really are specific (but less sensitive) 9/21/2018 Johannes Freudenberg, CCHMC

36 Johannes Freudenberg, CCHMC
MAS 5.0 – PM correction In MAS 5.0 “ideal” mismatch (IM) is computed IM = MM if MM is “well-behaved” (i.e. MM < PM) IM = down-scaled MM if MM > PM but most other MMs of the corresponding probeset are “well-behaved” IM  PM if most MMs are “defective” Adjusted probe value is PM – IM … for a more detailed description refer to Affymetrix’ Statistical Algorithms Description Document 9/21/2018 Johannes Freudenberg, CCHMC

37 Johannes Freudenberg, CCHMC
MAS 5.0 – PM correction harvard.mas5PMcorr <- pmcorrect.mas(harvard.mas5norm) plot(log2(pm(harvard.mas5norm[,6])),log2(harvard.mas5PMcorr[,6]), pch = ".", xlab = "PM", ylab = "corrected probe value", main = "Harvard data after MAS 5.0 PM correction") 9/21/2018 Johannes Freudenberg, CCHMC

38 Johannes Freudenberg, CCHMC
PM correction? MMs are greater than corresponding PMs 30% of all probe pairs Therefore, RMA does not do PM correction RMA uses only PMs and ignores MMs 9/21/2018 Johannes Freudenberg, CCHMC

39 Johannes Freudenberg, CCHMC
PM correction? MM signals are specific (but less sensitive) Hybridization behavior/ labeling efficiency depends on middle base Chudin et al., 2001 9/21/2018 Johannes Freudenberg, CCHMC

40 Johannes Freudenberg, CCHMC
PM correction! Intensities depend on number of A, C, G, and T, respectively Intensities depend on position of A, C, G, and T, respectively Wu et al., 2004 New model based approach – GCRMA (Wu et al., 2004) 9/21/2018 Johannes Freudenberg, CCHMC

41 Johannes Freudenberg, CCHMC
Summarization Purpose for every probe set (i.e. gene/ EST), merge the 16 to 20 probe values into a single expression value MAS 5.0 Computes Tukey’s Bi-Weight (a robust weighted mean) for every probe set Does not “borrow” information from other arrays RMA Performs Median-Polish (a robust method to fit a linear model similar to a two-way ANOVA model) Information is “borrowed” from other arrays Note: Reported RMA expression values are on log2 scale! 9/21/2018 Johannes Freudenberg, CCHMC

42 Johannes Freudenberg, CCHMC
Summarization harvard.mas5expr<-computeExprSet(harvard.mas5norm,"mas","mas") harvard.rmaExpr <-computeExprSet(harvard.rmaNorm,"pmonly", "medianpolish") plot(log2(exprs(harvard.mas5expr)[,1]), log2(exprs(harvard.mas5expr)[,4]), xlab = "Adeno 1", ylab = "Normal 1", main = "MAS 5") plot(exprs(harvard.rmaExpr)[,1], exprs(harvard.rmaExpr)[,6], xlab = "Adeno 1", ylab = "Normal 1", main = "RMA") 9/21/2018 Johannes Freudenberg, CCHMC

43 expresso – does it all at once
GCRMA Example: harvard.exprData <- expresso(harvard.rawData, bgcorrect.method = "rma", normalize.method = "constant", pmcorrect.method = "pmonly", summary.method = "avgdiff") Result is an “expression set” which is a subclass of “AffyBatch” 9/21/2018 Johannes Freudenberg, CCHMC

44 Johannes Freudenberg, CCHMC
Shortcuts Some of the most widely used strategies have wrapper functions MAS 5.0 harvard.mas5 <- mas5(harvard.rawData) RMA harvard.rma <- rma(harvard.rawData) GCRMA library(gcrma) harvard.gcrma <- gcrma(harvard.rawData) 9/21/2018 Johannes Freudenberg, CCHMC

45 What preprocessing strategy to use?
… No one knows But there is evidence that MAS 5.0 is not a good idea RMA is a much better alternative Other, model-based approaches work well (e.g. GCRMA, VSN) You can evaluate your favorite strategy using benchmark data Spike-In experiments (Affymetrix, GeneLogic) Dilution experiment (GeneLogic) 9/21/2018 Johannes Freudenberg, CCHMC

46 Johannes Freudenberg, CCHMC
affycomp affycomp is an R package to evaluate your favorite preprocessing strategy using publicly available benchmark data It’s also a website ( 69 different strategies submitted so far Limitations of benchmark data(?) Unrealistically high data quality Unrealistic set-up “Over-fitting”(?) 9/21/2018 Johannes Freudenberg, CCHMC

47 Johannes Freudenberg, CCHMC
Take-home messages Affymetrix microarrays are in-situ synthesized oligonucleotide arrays Each gene is represented by a set of perfect match probes and mismatch probes Raw probe intensities require preprocessing Background correction (optional) Normalization (recommended) PM correction using MM probes (optional) Summarization (required) BioConductor’s affy package offers the necessary tools for handling Affymetrix data It really does matter which strategy you use! 9/21/2018 Johannes Freudenberg, CCHMC

48 Affymetrix and Limma

49 Limma can handle Affy data
Just to double check… > sample adeno1.CEL adeno2.CEL adeno3.CEL normal1.CEL normal2.CEL normal3.CEL Load the library… > library(limma) Please read the user’s guide… > limmaUsersGuide() 9/21/2018 Johannes Freudenberg, CCHMC

50 Limma can handle Affy data
Need to define regression model… > design < model.matrix(~1+factor(c(0, 0, 0, 1, 1, 1))) > colnames(design) <- c("adeno", "normVsCa") > design adeno normVsCa First coefficient estimates mean log-expression for adeno carcinoma and plays the role of an intercept Second coefficient estimates difference between carcinoma and normal tissue 9/21/2018 Johannes Freudenberg, CCHMC

51 Limma can handle Affy data
Differentially expressed genes can be found by > fit <- lmFit(harvard.rma, design) > fit <- eBayes(fit) > topTable(fit, 30, coef = "normVsCa", adjust = "BH") 9/21/2018 Johannes Freudenberg, CCHMC

52 lmFit – linear model fit
Fits gene expression data to a linear regression model such as y = sytematic effect sytematic effect … random effect Goal is to separate random variation from systematic effects Example expr(genei) = baseline expri + carcinoma effecti + εij εij ~ N(μi, σi) 9/21/2018 Johannes Freudenberg, CCHMC

53 lmFit – linear model fit
9/21/2018 Johannes Freudenberg, CCHMC

54 eBayes – empirical Bayes statistics for differential expression
Mario will talk about this in detail but briefly Underestimated variance leads to overestimated t-statistic which leads to false positives Overestimated variance leads to underestimated t-statistic which leads to false negatives eBayes improves t-statistic by replacing traditional variance estimate with modified variance estimate which “borrows” information from other genes 9/21/2018 Johannes Freudenberg, CCHMC

55 Johannes Freudenberg, CCHMC
Visualize top 300 genes heatmap() visualizes intensities and clusters genes and samples > m <- topTable(fit, 300, coef="normVsCa", adjust="BH") > index <- as.numeric(row.names(m)) > heatmap(-exprs( harvard.rma)[index,]) 9/21/2018 Johannes Freudenberg, CCHMC

56 Johannes Freudenberg, CCHMC
References Affymetrix, Statistical Algorithms Description Document. 2002, Affymetrix, Inc.: Santa Clara, CA. Bhattacharjee, A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS. 98 (24), , November 2001. BioConductor Vignettes. BioConductor. Limma: Linear Models for Microarray Data User’s Guide. Chudin E, Walker R, Kosaka A, Wu SX, Rabert D, Chang TK, Kreder DE. Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip arrays. Genome Biol. 2002;3(1):RESEARCH0005. Epub 2001 Dec 14. Dudoit S, Gentleman R, Irizarry R, Yang YH. Introduction to DNA Microarray Technologies. Bioconductor Short Course, Winter Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat Genet Jan; 21(1 Suppl):20-4. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol Dec;14(13): Speed, T. (2002). Summarizing and comparing GeneChip® data. Presentation at Affymetrix User Meting Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F. A Model Based Background Adjustment for Oligonucleotide Expression Arrays. May 28, Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 1. 9/21/2018 Johannes Freudenberg, CCHMC


Download ppt "Affymetrix and BioConductor"

Similar presentations


Ads by Google