Assessing expression data quality in high-density oligonucleotide arrays.


Outline
- GeneChip® technology
- Affymetrix QC recommendations
- Review of models for expression value estimation
- Assessing chip expression data quality

GeneChip® technology

Assay steps and basic data generation
- Prepare target: sample -> total RNA -> cDNA -> cRNA (labelled, amplified, cleaned, fragmented).
- Hybridize to chip.
- Scanning -> dat file, which contains one intensity per pixel.
- The 49 pixels per cell are summarized by the 75th percentile after removal of the outer perimeter pixels. This is the cell intensity, each cell corresponding to a probe.
- On the HG-U133 chip, each target is represented by a set of 11 pairs of PM:MM probes. The MM probe is obtained by complementing the middle base in the PM oligo and is meant to be an internal control, assumed to hybridize to nonspecific sequences about as effectively as its PM counterpart.
- Each PM probe is a 25-base oligo selected with the objective of achieving linearity between log intensity and log concentration.
How should the PM:MM intensities be combined into a measure of expression for the target?
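The two mechanical steps above (building an MM probe from a PM probe, and summarizing the pixels of one cell) can be sketched as follows. This is only an illustration: the function names are hypothetical, the cell is assumed to be a 7 x 7 pixel grid whose outer ring is discarded, and the percentile rule is the default one in Python's `statistics` module, which may differ from the scanner software.

```python
import statistics
from typing import List

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatch_probe(pm: str) -> str:
    """Return the MM probe: the PM sequence with its middle base complemented."""
    mid = len(pm) // 2  # index 12 (the 13th base) for a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def cell_intensity(pixels: List[List[float]]) -> float:
    """Summarize one cell: drop the outer ring of pixels, take the 75th percentile.

    Assumption: `pixels` is a square grid (e.g. 7 x 7) covering one probe cell.
    """
    inner = [v for row in pixels[1:-1] for v in row[1:-1]]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(inner, n=4)[2]
```

Dropping the perimeter keeps pixels that may straddle neighbouring cells out of the summary, and an upper percentile is less sensitive to dim edge pixels than the mean would be.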

Sample preparation

Probe Arrays (HG-U133 update)
[Figure: GeneChip probe array (1.28 cm) and image of a hybridized probe array. Each 18 µm probe cell holds millions of copies of a specific oligonucleotide probe; the array carries >500,000 different complementary probes; a single-stranded, labeled RNA target hybridizes to the probe cell. Compliments of D. Gerhold.]

Affymetrix QC recommendations

Starting RNA quality
- Gels detect 18S and 28S ribosomal RNA.
- Ratio of 260/280 absorbance values.
- Gel electrophoresis patterns at other stages of preparation are used to make qualitative assessments of the RNA samples.

Sample quality assessment by gel electrophoresis
- For total RNA, look for 18S and 28S bands (not shown here).
- For cDNA, a good sample will produce a smear extending from top to bottom of the gel.
- Unfragmented cRNA will also produce a smear running down the gel.
- Fragmented cRNA should appear as a blob at the bottom of the gel, indicating that the cRNA has been successfully fragmented to pieces about 50 bp in length.

Next slide from Vanderbilt MicroArray Shared Resource web site

Affymetrix standards for post hybridization and scanning quality assessment – Visual inspection of image
Visual inspection of image: dat file (50MB), cel file:
- B2 hybridization – checkerboard and array name
- Quality of features – discrete squares with pixels of slightly varying intensity
- General inspection – scratches (ignored), bright SAPE residue (masked out)
- Grid alignment

Affymetrix standards for post hybridization and scanning quality assessment – examination of quality report.
Array quality metrics:
- Raw Q (Noise): the degree of pixel-to-pixel variation among the probe cells used to calculate the background, i.e. the average, over background cells (lowest 2 percent), of the standard error of the cell pixel intensities. Between 1.5 and 3.0 is OK. Use scaled noise to get consistency between arrays.
- Scaling factor ~ 100 / (2% trimmed mean of intensities, not logged). Should be kept below 10. Key is consistency across arrays.
- Background ~ average of cell intensities in the lowest 2 percent, by region, with smoothing. No range. Key is consistency.
- Percent present calls (i.e. is PM > MM?). Typical range is 20-50%.
Note – all these quantities, including noise, can be extracted from the cel file.
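A minimal sketch of the scaling factor and of the rule-of-thumb checks listed above. The function names are hypothetical, the target value of 100 and the ranges are those quoted on the slide, and trimming 2% from each tail is an assumption about how the trimmed mean is defined.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping a fraction `trim` of the values from each tail (assumed)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def scaling_factor(intensities, target=100.0):
    """Multiplier that brings the trimmed mean intensity of a chip to `target`."""
    return target / trimmed_mean(intensities)

def qc_flags(raw_q, scale, percent_present):
    """Return a list of metrics falling outside the ranges quoted above."""
    flags = []
    if not 1.5 <= raw_q <= 3.0:
        flags.append("raw Q outside 1.5-3.0")
    if scale >= 10:
        flags.append("scaling factor not below 10")
    if not 20 <= percent_present <= 50:
        flags.append("percent present outside 20-50%")
    return flags
```

Since the slide stresses consistency across arrays, in practice these values would be compared between chips of the same experiment rather than judged one chip at a time.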

Affymetrix standards - Examination of spikes and poly A controls
- Hybridization controls: bioB, bioC, bioD and cre. “At 1.5 pM bioB should be called Present 70% of the time. … the others should be called present 100% of the time with increasing Signal value (bioC, bioD, and cre, resp.)” Check that bioC, representing the minimum specification of detection, is present.
- Poly A controls: dap, lys, phe, thr, tryp. Used to monitor wet lab work. Sense strand cRNAs synthesized from the control genes can be added to samples prior to the reverse transcription step to monitor target synthesis and labeling efficiencies. Antisense cRNA transcripts can be added to the target cRNA sample to monitor the amplification and labelling steps.
- Housekeeping/Control Genes: GAPDH, beta-Actin, ISGF-3 (STAT1). Examine the 3’ and 5’ signal intensity ratios of control probe sets (GAPDH, Beta Actin): “A 1:1 molar ratio of the 3’ and 5’ transcript regions will not necessarily give a signal ratio of 1.”
All controls appear on the chip in both sense strand (_st) and antisense strand (_at) versions, and all have probe sets chosen from the 5’, M and 3’ ends of the target transcript.

Affymetrix standards - Examination of other spike ins or control probe sets: Normalization Control Set: 100 probe sets replicated on both A and B arrays (new to HG-U133) – these are a set of genes found to be called present with low MAS4 signal variability in a large set of tissues. Linearity and sensitivity of amplification as quantified using spike-in bacterial cRNA. Replicate analysis and reproducibility.

Probe Set Positions for GAPDH and Beta Actin

Length of target sequences on HG-U133A chip

Examination of dat file

Chip dat file - full

Chip dat file – checkered board – oligo B2

Chip dat file – checkered board – close up

Chip dat file – checkered board – close up w/ grid

Examination of cel file

Chip cel file – checkered board

Chip cel file – checkered board – close up w/ grid

Chip cel file – PM - MM

Limitations of standard QC metrics and procedure
- Link between these metrics and the numbers we care about is missing.
- Quality of data gauged from spike-ins requiring special processing may not represent the quality of the rest of the data on the chip – risk of QCing the chip QC process itself, but not the gene expression data.
- Good end-point data quality assessment is needed to assess the validity of these indirect data quality assessments.

Review of models for gene expression value estimation

MAS 5 (Microarray Suite 5 by Affymetrix)
Expression measures are derived as follows in Affymetrix’ Microarray Suite 5.0:
- A background correction is applied to the probe intensities.
- For each probe set, the log expression is estimated by means of a one-step Tukey biweight average of $\log(PM_j - MM_j^*)$, where $MM_j^*$ is an MM value modified to ensure that it does not exceed the PM value. This is equivalent to robustly estimating the parameter $\mu$ in the model $\log(PM_j - MM_j^*) = \mu + e_j$.
- To compare expression measures across chips, expression values are normalized by a multiplicative scaling factor. This is equivalent to shifting the expression values on the log scale.
See Affy technical description [1].
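The one-step Tukey biweight average can be sketched as below. This is an illustration rather than Affymetrix's code: the tuning constant c = 5 and the small offset eps = 0.0001 are assumed values, base-2 logs are an assumption, and the adjusted mismatch values MM* are taken as given since the slide does not spell out the rule used to compute them.

```python
import math
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (c and eps are assumed constants)."""
    m = statistics.median(values)
    s = statistics.median([abs(v - m) for v in values])  # unscaled MAD
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)
        # Bisquare weight: smooth down-weighting, zero beyond |u| = 1.
        weights.append((1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mas5_style_log_signal(pm, mm_star):
    """Robust average of log2(PM_j - MM*_j) over the probe pairs of one probe set.

    Assumes every MM*_j is already strictly below its PM_j.
    """
    logs = [math.log2(p - m) for p, m in zip(pm, mm_star)]
    return tukey_biweight(logs)
```

Probe pairs far from the median receive zero weight, so one aberrant pair among the 11 does not move the estimate, which is the sense in which the parameter is "robustly" estimated.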

RMA
The Robust Multichip Average is an expression measure obtained from analysing a set of chips in the following way:
- A background correction is applied to probe intensities [3].
- A probe intensity normalization vector is computed from the set of chips, and the intensities of each chip are normalized to this vector [4].
- For one probe set, the logs of the background-corrected and normalized probe intensities, $Y_{jk}$ say, are modelled as the sum of a probe effect and a chip effect: $Y_{jk} = \beta_j + \alpha_k + \epsilon_{jk}$, where $k$ indexes chips and $j$ indexes probes.
- To produce the RMA expression values, the model is fitted robustly and the estimated parameters $\alpha_k$ are used as the estimates of log expression for each chip.
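The normalization step can be illustrated with a quantile normalization sketch, which is how the "normalization vector" of reference [4] is read here. The function name is hypothetical and ties are broken by position, a simplification.

```python
def quantile_normalize(chips):
    """Give every chip the same intensity distribution.

    chips: list of equal-length lists, one list of probe intensities per chip.
    """
    n = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    # Normalization vector: mean across chips of the r-th smallest intensity.
    reference = [sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by the reference at its rank
        normalized.append(out)
    return normalized
```

For example, `quantile_normalize([[5, 2, 3], [4, 1, 6]])` maps both chips onto the reference vector `[1.5, 3.5, 5.5]` while preserving the rank order of probes within each chip.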

RMA vs MAS5
- Background correction is different: Affymetrix removes a fixed amount with some local adjustment; RMA uses a model which results in an intensity-dependent background correction.
- Normalization is at probe level and intensity dependent.
- Multichip analysis enables the estimation of probe effects.
- RMA expression values have been shown to be highly reproducible and to detect changes in target mRNA concentration with great sensitivity [5, 6].
Our main interest here is in the use of model fit results for quality assessment. The size of the residuals from a fit indicates the quality of the fit and the variability of the parameter estimates. These can be summarized and visualized in various ways to provide chip expression data quality indicators.

Affymetrix public dataset - The spike-in design below is repeated 3 times with chips from different lots. (One large sample prepared from pancreas poly(A)+ mRNA.)

The model fits – ex 1

The model fits – ex 2

LS fit
If we assume all measurements are equally precise, we obtain the simplified error model $Y_{jk} = \beta_j + \alpha_k + \epsilon_{jk}$, with $\epsilon_{jk} \sim \text{iid } N(0, \sigma^2)$.
This model is commonly fitted by LS, with parameter estimates:
- $b_j = \bar{Y}_{j\cdot}$ – the mean of the observations for probe $j$ (centred on the grand mean when the probe effects are constrained to sum to zero)
- $a_k = \bar{Y}_{\cdot k}$ – the mean of the observations for chip $k$
- $s^2 = \sum_{j,k} r_{jk}^2 / (n - J - K + 1)$
Under this model, parameters have estimated standard errors $SE(a_k) = s/\sqrt{J}$ and $SE(b_j) = s/\sqrt{K}$, i.e. every chip expression value has the same estimated variability.
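The LS fit above amounts to row and column means of the probe-by-chip table. The sketch below assumes a complete table `y[j][k]` (probe j, chip k) for one probe set and centres the probe effects on the grand mean so that fitted values are `a[k] + b[j]`; the function name and return layout are illustrative.

```python
import math

def ls_fit(y):
    """Least squares fit of y[j][k] = b[j] + a[k] + error for one probe set."""
    J, K = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (J * K)
    a = [sum(y[j][k] for j in range(J)) / J for k in range(K)]  # chip means
    b = [sum(row) / K - grand for row in y]                     # centred probe means
    resid = [[y[j][k] - a[k] - b[j] for k in range(K)] for j in range(J)]
    # Residual variance on n - J - K + 1 degrees of freedom, with n = J * K.
    s2 = sum(r * r for row in resid for r in row) / (J * K - J - K + 1)
    s = math.sqrt(s2)
    return {"chip": a, "probe": b, "resid": resid, "s": s,
            "se_chip": s / math.sqrt(J), "se_probe": s / math.sqrt(K)}
```

Note that `se_chip` is a single number shared by all chips, which is exactly why the LS fit cannot single out a poor chip by its standard error.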

Robust fit
The least squares fit provides optimal (unbiased, asymptotically minimum variance) estimates when the model is true, but the LS estimates soon lose their good properties under slight departures from the assumed model. Robust fitting procedures have been devised to produce estimates which are good under the assumed model and remain so under slight departures from these assumptions.
A commonly used robust fitting procedure is iteratively reweighted least squares, in which an initial guess at the fit is followed by a sequence of weighted LS fits, with the weights derived from the previous fit as follows:
- Estimate the scale: $S = \mathrm{mad}(res)$
- Weights are given by: $w_{jk} = \mathrm{psi.huber}(|r_{jk}| / S)$
- Weighted LS fit estimates are given by: $b_j = \sum_k w_{jk} Y_{jk}$, $a_k = \sum_j w_{jk} Y_{jk}$
- Estimated standard errors: $SE(a_k) = S / \sqrt{\sum_j w_{jk}}$, $SE(b_j) = S / \sqrt{\sum_k w_{jk}}$
- Unscaled standard errors: $\mathrm{Unscaled.SE}(a_k) = 1 / \sqrt{\sum_j w_{jk}}$, $\mathrm{Unscaled.SE}(b_j) = 1 / \sqrt{\sum_k w_{jk}}$
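The reweighting loop can be sketched as follows. Several choices here are assumptions rather than the presenter's implementation: the Huber tuning constant k = 1.345, the 1.4826 factor on the MAD, taking the MAD around zero, using explicitly normalized weighted means, and replacing each full weighted LS solve by one backfitting sweep (chip effects, then probe effects).

```python
import math
import statistics

def huber_weight(u, k=1.345):
    """Huber weight: 1 for |u| <= k, k/|u| beyond (k is an assumed default)."""
    return 1.0 if abs(u) <= k else k / abs(u)

def robust_fit(y, n_iter=10):
    """IRLS sketch for y[j][k] = b[j] + a[k] + error (probe j, chip k)."""
    J, K = len(y), len(y[0])
    w = [[1.0] * K for _ in range(J)]  # first pass is an unweighted (LS) fit
    a, b, scale = [0.0] * K, [0.0] * J, 0.0
    for _ in range(n_iter):
        # One weighted backfitting sweep: chip effects, then probe effects.
        a = [sum(w[j][k] * (y[j][k] - b[j]) for j in range(J)) /
             sum(w[j][k] for j in range(J)) for k in range(K)]
        b = [sum(w[j][k] * (y[j][k] - a[k]) for k in range(K)) / sum(w[j])
             for j in range(J)]
        shift = sum(b) / J             # keep probe effects centred on zero
        b = [bj - shift for bj in b]
        a = [ak + shift for ak in a]
        resid = [[y[j][k] - a[k] - b[j] for k in range(K)] for j in range(J)]
        # Scale estimate: MAD of residuals around zero, times 1.4826.
        scale = 1.4826 * statistics.median([abs(r) for row in resid for r in row])
        if scale == 0.0:               # exact fit: nothing left to reweight
            break
        w = [[huber_weight(r / scale) for r in row] for row in resid]
    unscaled = [1.0 / math.sqrt(sum(w[j][k] for j in range(J))) for k in range(K)]
    return {"chip": a, "probe": b, "scale": scale, "weights": w,
            "unscaled_se": unscaled, "se": [scale * u for u in unscaled]}
```

Unlike the LS fit, the standard error now differs from chip to chip: a chip whose probes are down-weighted has a smaller sum of weights and therefore a larger unscaled SE, which is what the quality measures later in the talk exploit.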

ψ Huber function

Why should we use the robust fit? Image artefacts (scratches, bubbles, uneven hybridization, glare in scan) being a common occurrence, the gross error model is more realistic than the iid Normal model. Because of cross-hybridisation, and other reasons, probes within probe sets do not all respond the same way – the robust fitting procedure will go with the majority of the probes. The proof of the benefits of robustly fitting the model will be in the pudding (but that is not to be tasted today) For QC purposes, it is essential to use a robust fitting procedure in order to let the outliers speak out.

Assessing chip expression data quality

Chip expression data quality assessment
Having fitted models at the probe set level across a set of chips, we want to derive some chip-specific quantities to be used as indicators of overall chip expression data quality.
1. Look at the set of residuals for a chip over all probe sets, one residual per probe. Compare these batches of residuals across chips. Chips with a large number of bad probes will have larger residuals – look at the IQR.
2. First summarize the residuals into a probe set SE for the expression value on each chip, then compare batches of SEs between chips.
3. The SEs in 2 are a heterogeneous mix – can use batches of unscaled SEs to compare chips.
4. Can normalize further by rescaling by the median chip unscaled SE.
All the above produce a batch of numbers for each chip. Need to have one, or a few numbers, per chip. Start with median of set in 3.
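A minimal sketch of these chip-level summaries, assuming the residuals and unscaled SEs have already been obtained from the model fits. The function names are hypothetical, and "rescaling by the median chip unscaled SE" is read here as dividing each probe set's unscaled SEs by their median across chips.

```python
import statistics

def iqr(values):
    """Interquartile range of a batch of residuals for one chip."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def normalized_unscaled_se(unscaled_se):
    """unscaled_se[p][k]: unscaled SE of probe set p on chip k.

    Each row is divided by its median across chips, so that a typical
    chip scores about 1 for every probe set.
    """
    out = []
    for row in unscaled_se:
        med = statistics.median(row)
        out.append([v / med for v in row])
    return out

def chip_medians(table):
    """One number per chip: the median over probe sets (rows) of each column."""
    return [statistics.median(col) for col in zip(*table)]
```

A chip whose median normalized unscaled SE sits well above 1 is one on which probes were systematically down-weighted relative to the other chips in the set.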

Data Picture 1.

Data Picture 2.

Analyzing chips one at a time
Some will want to analyze chips one at a time – either because they have too few, or in some cases, too many, to analyze in batches. This is easily done, probe set by probe set:
- Subtracting probe effects from cell intensities (properly normalized) gives probe effect corrected intensities: $Z_j = Y_j - b_j$.
- We can then get a probe set summary by summarizing these robustly: $a = T(Z)$, where $T$ is the median, trimmed mean or other robust summary (note that we only have 11 points here).
- Subtracting the probe set summary from the probe effect corrected intensities produces a set of residuals: $r_j = Z_j - a$.
- The residuals can be turned into weights using the estimate of scale from the fitted model: $w_j = \mathrm{psi.huber}(r_j / S)$.
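The four steps above translate almost line by line into code. In this sketch the probe effects `b` and the scale `S` are assumed to come from an earlier multi-chip fit, the median is used for `T`, and the Huber constant k = 1.345 is an assumed default; the function name is hypothetical.

```python
import statistics

def single_chip_summary(y, probe_effects, scale, k=1.345):
    """Summarize one probe set on one chip.

    y: normalized log intensities of the probes on this chip
    probe_effects: b_j from a previously fitted model (assumed available)
    scale: S, the residual scale from that fitted model
    """
    z = [yj - bj for yj, bj in zip(y, probe_effects)]   # Z_j = Y_j - b_j
    a = statistics.median(z)                            # a = T(Z)
    r = [zj - a for zj in z]                            # r_j = Z_j - a
    # Huber weights: 1 within k * scale of the summary, k * scale / |r| beyond.
    w = [1.0 if abs(rj) <= k * scale else k * scale / abs(rj) for rj in r]
    return a, r, w
```

The resulting weights can feed the same chip-level summaries as in the multi-chip case, so a new chip can be scored against stored probe effects without refitting the whole batch.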

Chip expression data quality assessment example
- Probe level data images: residual chip pseudo-images; weight chip pseudo-images
- Probe level data: boxplot of residuals; bar of 10th percentile of weights distribution
- Probe set level data: boxplot of SEs; boxplot of unscaled SEs; boxplot of normalized unscaled SEs

Data set used for illustration
Case study 1 – 24 chips with common pancreas RNA preparation (part of Affymetrix Latin square experiment). We first look at the RMA-derived diagnostics, case by case. Then we look at control and housekeeping genes and 3’-5’ bias, by analysis (to make comparisons between the sources of RNA easier).

Case Study 1 – 24 Chips from Affy Latin square experiment

Images of weights

Images of positive residuals

Probe level summaries

Probe set level summaries

Probe set level summaries - LR

Now look at control/housekeeping fragments in Affymetrix chips. Note that all chips were produced with a common source of RNA. Here we look at a sample of 8.

Log RMA expression for hybridization controls – Affy pancreas samples

Log RMA expression for Poly A Controls – Affy pancreas samples

Log RMA expression for housekeeping controls – Affy pancreas samples

Look at 3’ to 5’ trend in probe intensities over entire chip (next slides)

logPM.gc.norm – Affy data

References
1. New Statistical Algorithms for Monitoring Gene Expression on GeneChip® Probe Arrays, Affymetrix technical report.
2. Array Design for the GeneChip® Human Genome U133 Set, Affymetrix technical note.
3. Discussion on Background, Ben Bolstad.
4. Bolstad BM, et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics Jan 22;19(2).
5. Irizarry, R. et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, Vol. 31, No. 4 e15 (Available online *).
6. Irizarry, R. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, in press.
7. * EAz2cYYbEWQrE&keytype=ref&siteid=nar