Corrections and Normalization in microarrays data analysis


Corrections and Normalization in microarrays data analysis Mauro Delorenzi

Acknowledgments
Uni. Cal. Statistics Berkeley / WEHI Bioinformatics: Terry Speed (Berkeley / WEHI), Yee Hwa Yang (Berkeley), Sandrine Dudoit (Stanford), Ingrid Lönnstedt (Uppsala), Yongchao Ge (Berkeley), Natalie Thorne (WEHI), Mauro Delorenzi (WEHI).
Most slides were taken from our collection.
Collaborations with: Peter Mac CI, Melb.; Brown-Botstein lab, Stanford; Matt Callow (LBNL); CSIRO Image Analysis Group.

Biological question (gene regulation, class prediction) -> Experimental design -> Microarray experiment -> 16-bit TIFF files -> Image analysis -> (Rfg, Rbg), (Gfg, Gbg) -> Normalization -> R, G -> Estimation, Testing, Clustering, Discrimination -> Biological verification and interpretation

[Figure: cDNA microarray workflow. cDNA clones (probes): PCR product amplification, purification, printing (0.1 nl/spot). mRNA (target): hybridise target to microarray. Scanning: excitation by laser 1 and laser 2, emission. Overlay images and normalise, then analysis.]

Scanner's Spots
Part of the image of one channel, false-coloured on a scale running from white (very high) and red (high) through yellow and green (medium) to blue (low) and black.

Gene Expression Data
Gene expression data on p genes for n samples (rows: genes; columns: slides).

Gene   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1        0.46      0.30      0.80      1.51      0.90    ...
2       -0.10      0.49      0.24      0.06      0.46    ...
3        0.15      0.74      0.04      0.10      0.20    ...
4       -0.45     -1.03     -0.79     -0.56     -0.32    ...
5       -0.06      1.06      1.35      1.09     -1.09    ...

Each entry, for example the value 1.09 for gene 5 in slide 4, is a gene expression level = Log2(Red intensity / Green intensity).
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.

Some statistical questions
- Image analysis: addressing, segmenting, quantifying
- Normalisation: within and between slides
- Quality: of images, of spots, of (log) ratios
- Which genes are (relatively) up/down regulated?
- Assigning p-values to tests / confidence to results
- Planning of experiments: design, sample size
- Discrimination and allocation of samples
- Clustering, classification: of samples, of genes
- Selection of genes relevant to any given analysis
- Analysis of time course, factorial and other special experiments
- ... and more

I. The simplest problem is identifying differentially expressed genes using one slide.
- This is a common enough hope.
- Efforts are frequently successful.
- It is not hard to do by eye.
- The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future.

Objectives
Important aspects of a statistical analysis include:
- Tentatively separating systematic sources of variation ("artefacts"), which bias the results, from random sources of variation ("noise"), which hide the truth.
- Removing the former and quantifying the latter.
- Identifying and dealing with the most relevant source of variation in subsequent analyses.
Only if this is done can we hope to make more or less valid probability statements about the confidence in the results.
Every correction is a new source of variability. There is a trade-off between gains and losses. The best method depends on the characteristics of the data, and these can vary.

Typical Statistical Approach
Measured value = real value + systematic errors + noise
Corrected value = real value + noise
Analysis of corrected value => (unbiased) CONCLUSIONS
Estimation of noise => quality of CONCLUSIONS, statistical significance (level of confidence) of the conclusions

Step 1: Background Correction
Image analysis => Rfg, Rbg, Gfg, Gbg (fg = foreground, bg = background).
For each spot on the slide we calculate:
Red intensity = R = Rfg - Rbg
Green intensity = G = Gfg - Gbg
M = Log2(Red intensity / Green intensity)
Subtraction of background values (additive background model, with the background assumed to be locally constant ...).
Sources of background: probe sticking non-specifically to the slide, irregular or dirty slide surface, dust, noise in the scanner measurement.
Not included: real cross-hybridisation and non-specific hybridisation to the probe.
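The background subtraction and log-ratio of Step 1 can be sketched as follows. This is a minimal illustration, not the authors' software: the spot values are made up, and returning None for spots whose corrected intensity is not positive is an assumption, since the slide does not say how such spots are handled.

```python
import math

def background_correct(fg, bg):
    """Additive background model: corrected intensity = foreground - background."""
    return fg - bg

def log_ratio(rfg, rbg, gfg, gbg):
    """Return M = log2(R / G), or None if a corrected intensity is not positive."""
    r = background_correct(rfg, rbg)
    g = background_correct(gfg, gbg)
    if r <= 0 or g <= 0:
        # Assumption: such spots are flagged rather than imputed.
        return None
    return math.log2(r / g)

# Hypothetical spots, each given as (Rfg, Rbg, Gfg, Gbg).
spots = [
    (1200.0, 200.0, 700.0, 200.0),  # R = 1000, G = 500
    (450.0, 150.0, 750.0, 150.0),   # R = 300,  G = 600
    (180.0, 200.0, 900.0, 150.0),   # R < 0 after subtraction
]
m_values = [log_ratio(*spot) for spot in spots]
```

The third spot shows why background subtraction needs care: a background estimate above the foreground leaves no valid log-ratio.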

The intensity pairs (R, G) are highly processed data, and the methods of image processing and background correction of the laser scan images can have a large impact. Before applying normalisation, inference, cluster analysis and the like, it is important to identify and remove systematic sources of variation, such as those due to different labelling efficiencies and scanning properties of the two dyes, or to spatial inhomogeneities. With many different users and protocols, the portion of the variation due to systematic effects can vary substantially. There are many sources of systematic variation which affect the measured gene expression levels. Normalisation is the term used to describe the process of removing such variation. Until the variation is properly accounted for or modelled, there is no question of the system being in statistical control and hence no basis for a statistical model to describe chance variation.

Step 2: An M vs A (MVA) Plot
M = log(R/G) = log R - log G
A = (log R + log G) / 2
[Plot legend: lowess curve, blanks, positive controls (spotted in varying concentrations), negative controls.]
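As a small worked example of the two MVA coordinates, the sketch below computes M and A for a few intensity pairs. The base-2 logarithm is an assumption carried over from the other slides (this slide writes only "log"), and the intensity values are made up.

```python
import math

def ma_values(r, g):
    """Return (M, A) for positive, background-corrected intensities r and g."""
    log_r = math.log2(r)
    log_g = math.log2(g)
    m = log_r - log_g        # log-ratio: difference between the channels
    a = (log_r + log_g) / 2  # average log-intensity of the spot
    return m, a

# Hypothetical (R, G) pairs.
pairs = [(1024.0, 256.0), (512.0, 512.0), (64.0, 256.0)]
ma = [ma_values(r, g) for r, g in pairs]
```

Plotting M against A amounts to rotating the log R vs log G scatter plot by 45 degrees, so a dye bias that varies with intensity shows up as a curved trend around M = 0.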

A reminder on logarithms

A numerical example

Why use an M vs A plot?
- Logs stretch out the region we are most interested in.
- Features of the data, such as intensity-dependent variation and dye bias, can be seen more clearly.
- Differentially expressed genes are more easily identified.
- Intuitive interpretation.

MVA plot: looking at data 1 Spot identifier Lowess curve S1.n. Control Slide: Dye Effect, Spread.

MVA plot: looking at data 2 S1.p. Normalised data. Spread.

MVA plot: looking at data 3 S4. A-dependent variability.

MVA plot: analysing data 4 S17. Saturation

MVA plot: looking at data 5: Unique effects of different scanners

Step 3: Normalisation - median
Assumption: changes roughly symmetric.
First panel: smoothed density of log2 G and log2 R.
Second panel: M vs A plot with the median set to zero.
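Median normalisation shifts every log-ratio on a slide by the same constant so that the median M becomes zero. A minimal sketch, with made-up M values:

```python
from statistics import median

def median_normalise(m_values):
    """Subtract the slide-wide median so that the normalised M values have median 0."""
    shift = median(m_values)
    return [m - shift for m in m_values]

# Hypothetical log-ratios from one slide with an overall shift towards red.
m = [0.9, 0.4, 0.5, 1.6, -0.1]
normalised = median_normalise(m)
```

Because a single constant is subtracted, this corrects an overall dye imbalance but cannot remove a bias that changes with intensity.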

Step 4: Normalisation - lowess Assumption: changes roughly symmetric at all intensities.
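The idea of intensity-dependent normalisation can be sketched with a simplified smoother. The code below is not full lowess: it is a plain local linear regression with tricube weights and no robustness iterations, written only to show the principle of fitting a curve c(A) to the MVA plot and replacing M by M - c(A). The function names, the `frac` parameter and the data are illustrative assumptions.

```python
def local_linear_fit(x, y, frac=0.5):
    """Fitted values from local linear regression with tricube weights.

    Simplified stand-in for lowess (no robustness iterations).
    """
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))  # neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def intensity_normalise(a, m, frac=0.5):
    """Subtract the fitted intensity-dependent trend c(A) from each M."""
    curve = local_linear_fit(a, m, frac)
    return [mi - ci for mi, ci in zip(m, curve)]

# Hypothetical slide with a pure dye-intensity trend and no real changes.
a = [6.0 + i for i in range(10)]
m = [0.5 + 0.1 * ai for ai in a]
normalised = intensity_normalise(a, m)
```

In this made-up example the trend is exactly linear, so the local fits reproduce it and the normalised values are essentially zero. On real data the fitted curve is driven by the bulk of the spots, which is why the method relies on changes being roughly symmetric at all intensities.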

A hypothetical quantitative model a. linear response

A realistic hypothetical quantitative model b. power function-response Median Effect Scale Effect Dye-Intensity Effect

Step 5: Normalisation - between groups
[Plot: log-ratios by print-tip group, after within-slide global lowess normalization.]
Likely to be a spatial effect.

Normalization between groups (ctd)
[Plot: log-ratios by print-tip group, after print-tip location and scale normalization.]

Effects of Location Normalisation (example) Before After

Taking varying scale into account
Step 6: Rescaling (Spread-Normalisation)
Assumption: all (print-tip) groups should have the same spread in M.
The true ratio is μij, where i indexes the (print-tip) groups and j indexes the spots.
The observed value is Mij, where Mij = ai * log(μij).
A robust estimate of ai is: [formula not preserved in this transcript]
Corrected values are calculated as: [formula not preserved in this transcript]
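The formulas for the robust estimate of ai and for the corrected values are missing from this transcript. The sketch below is therefore an assumption, not a reconstruction of the slide: it estimates ai as the median absolute deviation (MAD) of group i divided by the geometric mean of the MADs of all groups, and corrects by dividing each Mij by ai. It also assumes that each group has already been location-normalised and has a non-zero MAD. The data are made up.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation from the median (a robust measure of spread)."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

def scale_factors(groups):
    """Assumed estimator: a_i = MAD_i / geometric mean of all group MADs."""
    mads = [mad(g) for g in groups]
    geo_mean = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [s / geo_mean for s in mads]

def scale_normalise(groups):
    """Divide each group's M values by its scale factor a_i."""
    factors = scale_factors(groups)
    return [[m / a for m in g] for g, a in zip(groups, factors)]

# Two hypothetical print-tip groups with the same centre but different spread.
groups = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-4.0, -2.0, 0.0, 2.0, 4.0],
]
factors = scale_factors(groups)
rescaled = scale_normalise(groups)
```

Dividing by the geometric mean keeps the overall scale of the slide unchanged while equalising the spread between groups.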

Illustration: print-tip-group normalisation
Assumption: for every print-tip group, changes are roughly symmetric at all intensities.
[Figure: glass slide with an array of bound cDNA probes; 4x4 blocks = 16 pin groups.]

Step 7: Assessing Significance
MVA plot and critical curves: Newton's, Sapir & Churchill's and Chen's single-slide methods.

Other Approaches
These normalisation procedures are based on the assumption that spots are as likely to be higher in the first dye as in the second. They work well with a high number of independent spots. If the genes on the array were selected, or are few, another approach might be needed.
For the correction of dye effects we recommend using either:
- paired dye-swapped slides, and/or
- internal controls, as spikes or a dilution series.
In the second case, only the control spots, instead of all genes, are used to compute the corrections.
In the first case, the data from the two slides can be combined. Assuming identical dye-intensity interactions in the two slides, the effect is corrected by taking:
A = 0.5 (A1 + A2)
M = 0.5 (M1 - M2)
This procedure is called self-normalisation, as it is done spot by spot. A number of controls indicate whether it is working well. It also deals with some artifacts that cause some genes to be always higher in one dye than in the other.
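A short numerical sketch of self-normalisation with a dye-swapped pair. The numbers are made up: a spot with a true log-ratio of 1.0 and a dye bias of 0.3 that is assumed to be identical on both slides, as the text requires.

```python
def self_normalise(m1, a1, m2, a2):
    """Combine one spot from a dye-swapped slide pair.

    M = 0.5 * (M1 - M2) cancels a dye bias common to both slides;
    A = 0.5 * (A1 + A2) is the average log-intensity.
    """
    return 0.5 * (m1 - m2), 0.5 * (a1 + a2)

true_m = 1.0     # hypothetical true log-ratio
dye_bias = 0.3   # hypothetical bias, the same on both slides

m1 = true_m + dye_bias    # slide 1: sample in red, reference in green
m2 = -true_m + dye_bias   # slide 2: dyes swapped, so the true signal flips sign
m, a = self_normalise(m1, 9.2, m2, 9.0)
```

The bias enters both slides with the same sign while the true signal flips, so the difference of the two M values removes the bias and keeps the signal.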

II. The second simplest problem is identifying differentially expressed genes using replicated slides.
There are a number of different aspects:
- First, between-slide normalization; then
- What should we look at: averages, SDs, t-statistics, other summaries?
- How should we look at them?
- Can we make valid probability statements?

Selecting genes up/down regulated 1
[Figure: plots with axes M and t.]
Results from the Apo AI ko experiment.

Which genes are (relatively) up/down regulated? Selecting genes up/down regulated
Two samples, e.g. KO vs. WT or mutant vs. WT: T vs C on n slides.
Two samples with a reference (e.g. pooled control): T vs C* on n slides and C vs C* on n slides.
For two samples hybridised directly, for each gene form the t statistic:
t = (average of n trt Ms) / sqrt((SD of n trt Ms)^2 / n)
For two samples with a reference, for each gene form the t statistic:
t = (average of n trt Ms - average of n ctl Ms) / sqrt(((SD of n trt Ms)^2 + (SD of n ctl Ms)^2) / n)
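The two t statistics can be sketched for a single gene as follows. The code assumes equal numbers of treatment and control slides and that the factor 1/n applies to both variance terms; the M values are made up.

```python
import math
from statistics import mean, stdev

def one_sample_t(ms):
    """t for log-ratios from a direct comparison: mean(M) / sqrt(SD^2 / n)."""
    n = len(ms)
    return mean(ms) / math.sqrt(stdev(ms) ** 2 / n)

def two_sample_t(trt, ctl):
    """t for a reference design with n treatment and n control slides."""
    n = len(trt)  # assumption: len(trt) == len(ctl)
    pooled_var = (stdev(trt) ** 2 + stdev(ctl) ** 2) / n
    return (mean(trt) - mean(ctl)) / math.sqrt(pooled_var)

# Hypothetical log-ratios for one gene.
trt = [1.0, 2.0, 3.0]
ctl = [0.0, 1.0, 2.0]
t_direct = one_sample_t(trt)
t_reference = two_sample_t(trt, ctl)
```

With only a handful of slides per gene the SD estimates are unstable, so a gene can reach a large t simply because its SD happens to be tiny.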

Which genes have changed? When permutation testing is possible
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
2. For each gene form the t statistic:
   t = (average of 8 ko Ms - average of 8 ctl Ms) / sqrt(((SD of 8 ko Ms)^2 + (SD of 8 ctl Ms)^2) / 8)
3. Form a histogram of the 6,000 t values.
4. Do a normal Q-Q plot; look for values "off the line".
5. Permutation testing.
6. Adjust for multiple testing.
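The permutation step for one gene can be sketched by enumerating every way of relabelling the slides. To keep the example short it uses 4 ko and 4 ctl values (70 relabellings) instead of the 8 + 8 of the experiment (12,870 relabellings), and the data are made up. It gives an unadjusted p-value for a single gene only; the multiple-testing adjustment of step 6 is not shown.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def t_stat(a, b):
    """Two-sample t statistic for groups of equal size n."""
    n = len(a)
    return (mean(a) - mean(b)) / math.sqrt((stdev(a) ** 2 + stdev(b) ** 2) / n)

def permutation_p_value(ko, ctl):
    """Two-sided p-value: share of relabellings with |t| >= the observed |t|."""
    pooled = ko + ctl
    observed = abs(t_stat(ko, ctl))
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(ko)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance so ties with the observed value are counted.
        if abs(t_stat(a, b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical log-ratios for one gene: knock-out vs control slides.
ko = [1.1, 1.3, 0.9, 1.2]
ctl = [0.1, -0.2, 0.0, 0.3]
p = permutation_p_value(ko, ctl)
```

Here the two groups are completely separated, so only the observed labelling and its mirror image reach the observed |t|, and p is 2/70. The smallest attainable p-value is fixed by the number of relabellings, which is one reason replication matters.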

Histogram & qq plot ApoA1

Adjusted and Unadjusted p-values for the 50 genes with the largest absolute t-statistics.

Which genes have changed? When permutation testing is not possible
Our current approach is to use M-averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes. We hope in due course to calibrate B and use that as our main tool.
B: empirical Bayes log posterior odds ratio.

[Figure: plot panels relating the statistics M, t and B.]

Remarks for multi-array experiments
Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
Averages can be driven by outliers.
Ts can be driven by tiny variances.
B = LOR will, we hope:
- use information from all the genes
- combine the best of M. and T
- avoid the problems of M. and T

Some web sites:
Technical reports, talks, software etc.: http://www.stat.berkeley.edu/users/terry/zarray/Html/
Especially: Dudoit et al., “Statistical methods for …”; Yee Hwa Yang et al., “Normalization for cDNA Microarray Data”.
Statistical software R (“GNU’s S”): http://lib.stat.cmu.edu/R/CRAN/
Packages within the R environment:
-- Spot: http://www.cmis.csiro.au/iap/spot.htm
-- SMA (statistics for microarray analysis): http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html