Pre-processing in DNA microarray experiments Sandrine Dudoit PH 296, Section 33 13/09/2001.

Slides:



Advertisements
Similar presentations
MicroArray Image Analysis Robin Liechti
Advertisements

Experimental Design and Differential Expression Class web site: Statistics for Microarrays.
MicroArray Image Analysis
MicroArray Image Analysis Robin Liechti
Microarray Normalization
Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.
Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, The Walter and Eliza Hall Institute of Medical.
Mathematical Statistics, Centre for Mathematical Sciences
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Normalization of Microarray Data - how to do it! Henrik Bengtsson Terry Speed
TIGR Spotfinder: a tool for microarray image processing
Sandrine Dudoit1 Microarray Experimental Design and Analysis Sandrine Dudoit jointly with Yee Hwa Yang Division of Biostatistics, UC Berkeley
Getting the numbers comparable
Statistics for Microarrays
The second-simplest cDNA microarray data analysis problem Terry Speed, UC Berkeley Fred Hutchinson Cancer Research Center March 9, 2001.
Normalization for cDNA Microarray Data Yee Hwa Yang, Sandrine Dudoit, Percy Luu and Terry Speed. SPIE BIOS 2001, San Jose, CA January 22, 2001.
DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.
Preprocessing Methods for Two-Color Microarray Data
Normalization Class web site: Statistics for Microarrays.
Gene Expression Data Analyses (2)
Low Level Statistics and Quality Control Javier Cabrera.
Normalization of 2 color arrays Alex Sánchez. Dept. Estadística Universitat de Barcelona.
Cluster Analysis Class web site: Statistics for Microarrays.
Image Analysis Class web site: Statistics for Microarrays.
Gene expression and the transcriptome I. Genomics and transcriptome After genome sequencing and annotation, the second major branch of genomics is analysis.
A robust neural networks approach for spatial and intensity-dependent normalization of cDNA microarray data A.L. Tarca, J.E.K. Cooke and J. MacKay Presented.
Corrections and Normalization in microarrays data analysis
Scanning and image analysis Scanning -Dyes -Confocal scanner -CCD scanner Image File Formats Image analysis -Locating the spots -Segmentation -Evaluating.
Analysis of microarray data
B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical.
Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.
1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.
(4) Within-Array Normalization PNAS, vol. 101, no. 5, Feb Jianqing Fan, Paul Tam, George Vande Woude, and Yi Ren.
Preprocessing of cDNA microarray data Lecture 19, Statistics 246, April 1, 2004.
Image Quantitation in Microarray Analysis More tomorrow...
Scanning and Image Processing -by Steve Clough. GSI Lumonics cDNA microarrays use two dyes with well separated emission spectra such as Cy3 and Cy5 to.
The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis.
Gene expression and the transcriptome I
CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.
The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis.
Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research.
Statistical Analyses of Microarray Data Rafael A. Irizarry Department of Biostatistics
DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4.
CDNA Microarrays MB206.
Panu Somervuo, March 19, cDNA microarrays.
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
WORKSHOP SPOTTED 2-channel ARRAYS DATA PROCESSING AND QUALITY CONTROL Eugenia Migliavacca and Mauro Delorenzi, ISREC, December 11, 2003.
Analysis of Microarray Data Analysis of images Preprocessing of gene expression data Normalization of data –Subtraction of Background Noise –Global/local.
Department of Statistics, University of California, Berkeley, and Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research.
1 Pre-processing - Normalization Databases Statistics for Microarray Data Analysis – Lecture 2 The Fields Institute for Research in Mathematical Sciences.
Introduction to Statistical Analysis of Gene Expression Data Feng Hong Beespace meeting April 20, 2005.
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.
Pre-processing in DNA microarray experiments Sandrine Dudoit, Robert Gentleman, Rafael Irizarry, and Yee Hwa Yang Bioconductor short course Summer 2002.
Henrik Bengtsson Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden Plate Effects in cDNA Microarray Data.
Microarray Data Pre-Processing
Microarray hybridization Usually comparative – Ratio between two samples Examples – Tumor vs. normal tissue – Drug treatment vs. no treatment – Embryo.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
(1) Normalization of cDNA microarray data Methods, Vol. 31, no. 4, December 2003 Gordon K. Smyth and Terry Speed.
Pre-processing DNA Microarray Data Sandrine Dudoit, Robert Gentleman, Rafael Irizarry, and Yee Hwa Yang Bioconductor Short Course Winter 2002 © Copyright.
Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006.
The second-simplest cDNA microarray data analysis problem Terry Speed, UC Berkeley Bioinformatic Strategies For Application of Genomic Tools to Environmental.
Henrik Bengtsson Mathematical Statistics Centre for Mathematical Sciences Lund University Plate Effects in cDNA Microarray Data.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Lecture 2 – Pre-processing and Normalization José Luis Mosquera Computational Lab on Microarrays Data Analysis Special Topics in Computer Science Institute.
Normalization Methods for Two-Color Microarray Data
Image Processing for cDNA Microarray Data
Getting the numbers comparable
Normalization for cDNA Microarray Data
Presentation transcript:

Pre-processing in DNA microarray experiments Sandrine Dudoit PH 296, Section 33 13/09/2001

Joint work with Yee Hwa (Jean) Yang Terry Speed. R packages –Spot, for image analysis; –SMA, “Statistics for Microarray Analysis”, for normalization and much more.

Biological question Differentially expressed genes Sample class prediction etc. Testing Biological verification and interpretation Microarray experiment Estimation Experimental design Image analysis Normalization Clustering Discrimination R, G 16-bit TIFF files (Rfg, Rbg), (Gfg, Gbg)

Image analysis

The raw data from a microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes. Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Steps in image analysis 1. Addressing. Estimate location of spot centers. 2. Segmentation. Classify pixels as foreground (signal) or background. 3. Information extraction. For each spot on the array and each dye signal intensities; background intensities; quality measures.

Spot Software package. Spot, built on the freely available software package R. Batch automatic addressing. Segmentation. Seeded region growing (Adams & Bischof 1994): adaptive segmentation method, no restriction on the size or shape of the spots. Information extraction –Foreground. Mean of pixel intensities within a spot. –Background. Morphological opening: non-linear filter which generates an image of the estimated background intensity for the entire slide. Spot quality measures.

Addressing 4 by 4 grids Automatic addressing within the same batch of images. Other problems: -- Mis-registration -- Rotation -- Skew in the array Estimate translation of grids. Estimate row and column positions. Foreground and background grids.

Segmentation Adaptive segmentation, SRGFixed circle segmentation Spots usually vary in size and shape.

Segmentation — Small spot — Not circular Results from SRG

Seeded region growing Adaptive segmentation method. Requires the input of seeds, either individual pixels or groups of pixels, which control the formation of the regions into which the image will be segmented. Here, based on fitted foreground and background grids from the addressing step. The decision to add a pixel to a region is based on the absolute gray-level difference of that pixel’s intensity and the average of the pixel values in the neighboring region. Done on combined red and green images.

Information extraction Spot foreground intensities Rfg Gfg –mean or median of foreground pixel intensities. Spot background intensities Rbg Gbg –local; –morphological opening; –constant (global); –none. Quality information –spot quality; –slide quality. Signal Background

Local background ---- GenePix ---- QuantArray ---- ScanAnalyze

Morphological opening Image is probed with a structuring element, here, a square with side length about twice the spot to spot distance. Morphological opening: erosion followed by dilation. Erosion (Dilation): the eroded (dilated) value at a pixel x is the minimum (maximum) value of the image in the window defined by the structuring element when its origin is at x. Done separately for the red and green images. Produces image of estimated background for the entire slide.

Background matters Morphological openingLocal background

Quality measures Spot quality –Brightness: foreground/background ratio; –Uniformity: variation in pixel intensities and ratios of intensities; –Morphology: area, perimeter, circularity. Slide quality –Percentage of spots with no signal; –Range of intensities; –Distribution of spot signal area, etc. How to use quality measures in subsequent analyses?

Normalization

Identify and remove systematic sources of variation in the measured fluorescence intensities, other than differential expression, for example –different labeling efficiencies of the dyes; –different amounts of Cy3- and Cy5- labeled mRNA; –different scanning parameters; –sector/print-tip, spatial, or plate effects, etc. Necessary for within and between slides comparisons of expression levels.

Normalization The need for normalization can be seen most clearly in self-self hybridizations where the same mRNA sample is labeled with the Cy3 and Cy5 dyes. The imbalance in the red and green intensities is usually not constant across the spots within and between arrays, and can vary according to overall spot intensity, location, plate origin, etc. These factors should be considered in the normalization.

Single-slide data display Usually: R vs. G log 2 R vs. log 2 G. Preferred M = log 2 R - log 2 G vs. A = (log 2 R + log 2 G)/2. An MA-plot amounts to a 45 o counterclockwise rotation of a log 2 R vs. log 2 G plot followed by scaling.

Self-self hybridization log 2 R vs. log 2 GM vs. A

Normalization Within-slides –Location normalization. –Scale normalization. –Which genes to use? Paired-slides (dye-swap) –Self-normalization. Between-slides.

Apo AI experiment Goal. Identify genes with altered expression in the livers of apo AI knock-out mice compared to inbred C57Bl/6 control mice. 8 treatment (trt) mice and 8 control (ctl) mice. 16 hybridizations: target mRNA from each of the 16 mice is labeled with Cy5, pooled mRNA from control mice is labeled with Cy3. Probes: ~6,000 cDNAs, including 200 related to lipid metabolism.

Target mRNA samples: R = apo A1 KO mouse liver G = pooled control C57Bl/6 mouse liver

MA plot M = log 2 R - log 2 G, A = (log 2 R + log 2 G)/2 log 2 R vs. log 2 G M vs. A

Location normalization log 2 R/G log 2 R/G –l(intensity, location, plate, …) Global normalization. Normalization function l is constant across the spots and equal to the mean or median of the log-ratios M. Adaptive normalization. Normalization function l depends on a number of predictor variables, such as spot intensity, location, plate origin. The normalization function can be obtained by robust locally weighted regression of the log-ratios M on the predictor variables.

Location normalization Variable selection. Many different smoothers are available. E.g. lowess and loess smoothers. For local regression, choices must be made regarding: the weight function, the parametric family that is fitted locally, the bandwidth, and the fitting criterion. As with any other normalization approach, these choices represent a modeling of the data.

Location normalization Intensity dependent normalization. Regression of M on A. Intensity and print-tip dependent normalization. Same as above, for each print-tip separately. Spatial normalization. Regression of M on 2D-coordinates. Other variables: time of printing, plate, etc.

Global median normalization First panel: density estimates for log 2 G and log 2 R. Second panel: MA plot after median normalization.

Intensity dependent normalization

Intensity + print-tip dependent normalization Before

Intensity + print-tip dependent normalization After

Effect of location normalization Before normalization After intensity + print-tip normalization

Scale normalization The log-ratios M from different print-tips or plates may exhibit different spreads and some scale adjustment may be necessary. log 2 R/G (log 2 R/G –l)/s Can use a robust estimate of scale like the median absolute deviation (MAD) MAD = median | M – median(M) |.

Effect of location + scale normalization

Comparing different normalization methods

Which genes to use? All genes on the array. Constantly expressed genes (housekeeping). Controls –Spiked controls (e.g. plant genes); –Genomic DNA titration series. Rank invariant set.

Follow-up experiment 50 distinct clones with largest absolute t-statistics from apo AI experiment + 72 other clones. Spot each clone 8 times. Two hybridizations with dye-swap: Slide 1: trt  red, ctl  green. Slide 2: trt  green, ctl  red.

Follow-up experiment

Self-normalization Slide 1, M = log 2 (R/G) - l Slide 2, M’ = log 2 (R’/G’) - l’ Combine by subtracting the normalized log-ratios: [ (log 2 (R/G) - l) - (log 2 (R’/G’) - l’) ] / 2  [ log 2 (R/G) + log 2 (G’/R’) ] / 2  [ log 2 (RG’/GR’) ] / 2 provided l = l’. Assumption: the normalization functions are the same for the two slides.

Checking the assumption MA plot for slides 1 and 2

Result of self-normalization (M - M’)/2 vs. (A + A’)/2

Summary Case 1. Only a few genes are expected to change. Within-slides –Location: intensity + print-tip dependent lowess normalization. –Scale: for each print-tip-group, scale by MAD. Between-slides –An extension of within-slide scale normalization. Case 2. Many genes expected to change. –Paired-slides: Self-normalization. –Use of controls or known information.

cDNA gene expression data Genes mRNA samples Gene expression level of gene i in mRNA sample j = log 2 ( Red intensity / Green intensity) sample1sample2sample3sample4sample5 … Data on p genes for n samples

Reading for next Monday Robust local regression W. S. Cleveland. (1979). Robust localy weighted regression and smoothing scatterplots. Journal of the American Statistical Association. 74: Experimental design M. K. Kerr & G. A. Churchill. (2001). Experimental Design for Gene Expression Microarrays. Biostatistics. 2: