The second-simplest cDNA microarray data analysis problem Terry Speed, UC Berkeley Bioinformatic Strategies For Application of Genomic Tools to Environmental.

Presentation transcript:

The second-simplest cDNA microarray data analysis problem Terry Speed, UC Berkeley Bioinformatic Strategies For Application of Genomic Tools to Environmental Health Research, March 5, 2001 NIEHS National Center for Toxicogenomics NCSU Bioinformatics Research Center

[Flow chart] Biological question -> Experimental design -> Microarray experiment -> Image analysis -> Normalization -> Estimation, Testing, Clustering, Discrimination -> Biological verification and interpretation
Example questions: differentially expressed genes, sample class prediction, etc.
Data passed along the chart: 16-bit TIFF files; (Rfg, Rbg), (Gfg, Gbg); R, G

Some motherhood statements
Important aspects of a statistical analysis include:
- Tentatively separating systematic from random sources of variation
- Removing the former and quantifying the latter, when the system is in control
- Identifying and dealing with the most relevant source of variation in subsequent analyses
Only if this is done can we hope to make more or less valid probability statements.

The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide
- This is a common enough hope
- Efforts are frequently successful
- It is not hard to do by eye
- The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future, and here’s why

An M vs. A plot
$M = \log_2(R/G)$, $A = \tfrac{1}{2}\log_2(R \cdot G)$
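As a minimal sketch of the M and A definitions on this slide, the following Python computes both values for a few spots. The function name `ma_values` and the example intensities are illustrative assumptions, not part of the talk.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot.

    M = log2(R / G) is the log-ratio; A = log2(R * G) / 2 is the
    average log-intensity. Both intensities must be positive.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take logs")
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


# Hypothetical (R, G) intensities for three spots.
spots = [(1024.0, 256.0), (512.0, 512.0), (256.0, 1024.0)]
ma = [ma_values(r, g) for r, g in spots]
```

Plotting M against A rather than log R against log G rotates the usual scatter plot by 45 degrees, so intensity-dependent trends in the log-ratio are easier to see.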

Background matters
[Figure panels: from Spot; from GenePix]

From the NCI60 data set (Stanford web site)
[Figure panels: no background correction; with background correction]

An experiment having within-slide replicates

Background makes a difference

Background method    Segmentation method    Exp1    Exp2
No background        S.nbg                  6       6
                     Gp.nbg                 7       6
                     SA.nbg                 6       6
                     QA.fix.nbg             7       6
                     QA.hist.nbg            7       6
                     QA.adp.nbg             14      14
Local surrounding    S.valley               17      21
                     GP                     11      11
                     SA                     12      14
                     QA.fix                 18      23
                     QA.hist                9       8
                     QA.adp                 27      26
Others               S.morph                9       9
                     S.const                14      14

Entries are medians of the SD of log2(R/G) for 8 replicated spots, multiplied by 100 and rounded to the nearest integer.

Normalisation - lowess
Global lowess (Matt Callow’s data, LBNL)
Assumption: changes roughly symmetric at all intensities.
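The idea of intensity-dependent normalisation can be sketched as follows. Note that this is not lowess: as a simpler stand-in it estimates the trend of M as a function of A with a running median over neighbouring spots, then subtracts it. The function name, the `half_window` parameter, and the example data are assumptions made for illustration.

```python
from statistics import median


def running_median_normalize(a_values, m_values, half_window=2):
    """Subtract an intensity-dependent trend from the log-ratios M.

    Spots are ranked by A; the trend at each spot is the median of M
    over the spots within `half_window` ranks on either side. This is
    a crude substitute for a lowess fit of M on A.
    """
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    normalized = [0.0] * len(m_values)
    for rank, idx in enumerate(order):
        lo = max(0, rank - half_window)
        hi = min(len(order), rank + half_window + 1)
        trend = median([m_values[j] for j in order[lo:hi]])
        normalized[idx] = m_values[idx] - trend
    return normalized


# Hypothetical slide where M drifts linearly with A (pure dye bias).
a = [10, 6, 14, 8, 12, 7, 13, 9, 11, 15]
m = [0.2 * (x - 10) for x in a]
m_norm = running_median_normalize(a, m)
```

Under the slide's assumption that changes are roughly symmetric at all intensities, the fitted trend reflects dye bias rather than biology, so removing it centres M near zero at every A. The windows at the two ends of the intensity range are one-sided, so the correction is less reliable there.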

From the NCI60 data set (Stanford web site)

Ngai lab, UCB

Tiago’s data from the Goodman lab, UCB

From the Ernest Gallo Clinic & Research Center

From Peter MacCallum Cancer Research Institute, Australia

Normalisation - print tip Assumption: For every print group, changes roughly symmetric at all intensities.

M vs A after print-tip normalisation

Normalization (ctd): another data set
After within-slide global lowess normalization. Likely to be a spatial effect.
[Figure: log-ratios by print-tip group]

Taking scale into account
Assumption: all print-tip groups have the same spread in M.
The true log-ratio is $\mu_{ij}$, where $i$ indexes print-tip groups and $j$ indexes spots.
The observed log-ratio is $M_{ij} = a_i \mu_{ij}$.
A robust estimate of $a_i$ is $\mathrm{MAD}_i = \operatorname{median}_j \{ |M_{ij} - \operatorname{median}_j(M_{ij})| \}$.
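A minimal sketch of scale normalization with the MAD, under some stated assumptions: each group is first centred at its own median (a simple stand-in for the print-tip lowess location step), and the rescaled values are multiplied by the geometric mean of the group MADs so that the overall units of M are preserved. That rescaling, the function names, and the example data are illustrative choices, not taken from the slide.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def scale_normalize(groups):
    """Give every group the same median (0) and the same MAD.

    `groups` maps a group label (e.g. a print-tip id) to its list of
    log-ratios M. Each group's MAD estimates its scale factor a_i.
    """
    mads = {g: mad(ms) for g, ms in groups.items()}
    if any(s == 0 for s in mads.values()):
        raise ValueError("a group has zero MAD and cannot be rescaled")
    # Assumption: use the geometric mean of the MADs as the common scale.
    common = math.exp(sum(math.log(s) for s in mads.values()) / len(mads))
    normalized = {}
    for g, ms in groups.items():
        center = median(ms)
        normalized[g] = [(m - center) * common / mads[g] for m in ms]
    return normalized


# Hypothetical log-ratios from two print-tip groups with different spread.
groups = {
    "tip1": [-1.0, 0.0, 1.0, 2.0, -2.0],
    "tip2": [-3.0, 0.0, 3.0, 6.0, -6.0],
}
scaled = scale_normalize(groups)
```

The MAD is used instead of the SD because a small number of genuinely differentially expressed genes should not inflate the scale estimate. The same function applies unchanged when the groups are slides rather than print-tip groups.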

Normalization (ctd): that same data set
After print-tip location and scale normalization. Incorporate quality measures.
[Figure: log-ratios by print-tip group]

Matt Callow’s Srb1 dataset (#5). Newton’s and Chen’s single-slide methods

Matt Callow’s Srb1 dataset (#8). Newton’s, Sapir & Churchill’s and Chen’s single-slide methods

Genomic DNA vs. Genomic DNA The approach of Roberts et al (Rosetta) Data from Bing Ren

The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides
There are a number of different aspects:
- First, between-slide normalization; then
- What should we look at: averages, SDs, t-statistics, other summaries?
- How should we look at them?
- Can we make valid probability statements?
A report on work in progress

Normalization (ctd): yet another data set
Between slides this time (10 here). Only small differences in spread are apparent. We often see much greater differences.
[Figure: log-ratios by slide]

The “NCI 60” experiments (no bg)

Taking scale into account
Assumption: all slides have the same spread in M.
The true log-ratio is $\mu_{ij}$, where $i$ indexes slides and $j$ indexes spots.
The observed log-ratio is $M_{ij} = a_i \mu_{ij}$.
A robust estimate of $a_i$ is $\mathrm{MAD}_i = \operatorname{median}_j \{ |M_{ij} - \operatorname{median}_j(M_{ij})| \}$.

Which genes are (relatively) up/down regulated?
Two samples, e.g. KO vs. WT or mutant vs. WT.
[Design diagram: T and C, n slides]
For each gene form the t statistic:
$t = \dfrac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\tfrac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$

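Which genes are (relatively) up/down regulated?
Two samples with a reference (e.g. pooled control).
[Design diagram: T vs. C* and C vs. C*, n slides each]
For each gene form the t statistic:
$t = \dfrac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\tfrac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$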
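The two per-gene t statistics (direct comparison, and comparison through a common reference) could be computed along these lines. This is a sketch assuming equal numbers of treatment and control slides; the function names and the example log-ratios are hypothetical.

```python
import math
from statistics import mean, stdev


def one_sample_t(ms):
    """t statistic for n log-ratios M from direct T vs. C hybridizations."""
    n = len(ms)
    return mean(ms) / math.sqrt(stdev(ms) ** 2 / n)


def two_sample_t(trt_ms, ctl_ms):
    """t statistic when treatment and control are each hybridized
    against a common reference on n slides."""
    n = len(trt_ms)
    if len(ctl_ms) != n:
        raise ValueError("this sketch assumes equal group sizes")
    diff = mean(trt_ms) - mean(ctl_ms)
    return diff / math.sqrt((stdev(trt_ms) ** 2 + stdev(ctl_ms) ** 2) / n)


# Hypothetical log-ratios for a single gene.
t_direct = one_sample_t([1.0, 2.0, 3.0])
t_reference = two_sample_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

With the same spread in each group, the reference design gives a smaller t for the same average difference, because two noisy averages are being compared instead of one.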

One factor: more than 2 samples
Samples: liver tissue from mice treated by cholesterol-modifying drugs.
Question 1: Find genes that respond differently between the treatment and the control.
Question 2: Find genes that respond similarly across two or more treatments relative to control.
[Design diagram: T1, T2, T3, T4 and C; x 2]

One factor: more than 2 samples
Samples: tissues from different regions of the mouse olfactory bulb.
Question 1: Find differences between different regions.
Question 2: Identify genes with pre-specified patterns across regions.
[Design diagram: T1, T2, T3, T4, T5, T6]

Two or more factors
6 different experiments at each time point. Dye swaps.
4 time points (30 minutes, 1 hour, 4 hours, 24 hours).
2 x 2 x 4 factorial experiment.
[Design diagram: ctl, OSM, EGF, OSM & EGF; x 4 times]

Which genes have changed? When permutation testing is possible
1. For each gene and each hybridisation (8 ko + 8 ctl), use $M = \log_2(R/G)$.
2. For each gene form the t statistic: $t = \dfrac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\tfrac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$
3. Form a histogram of 6,000 t values.
4. Do a normal Q-Q plot; look for values “off the line”.
5. Permutation testing.
6. Adjust for multiple testing.
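Steps 2, 5 and 6 can be sketched for a single gene as below. Two assumptions are worth flagging: the permutation p-value is estimated from random relabellings rather than full enumeration, and the multiple-testing step uses a Bonferroni adjustment purely as a simple example, since the slide does not say which adjustment was used. All names and data here are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def t_statistic(group_a, group_b):
    """Two-sample t statistic for equal-sized groups."""
    n = len(group_a)
    if n != len(group_b) or n < 2:
        raise ValueError("need two equal-sized groups of at least 2 values")
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt((stdev(group_a) ** 2 + stdev(group_b) ** 2) / n)
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(ko, ctl, n_perm=2000, seed=0):
    """Two-sided p-value from random relabellings of the slides."""
    rng = random.Random(seed)
    observed = abs(t_statistic(ko, ctl))
    pooled = list(ko) + list(ctl)
    k = len(ko)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:k], pooled[k:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never 0.
    return (hits + 1) / (n_perm + 1)


def bonferroni(p_values):
    """Assumed adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical log-ratios for one gene on 8 ko and 8 ctl slides.
ko = [-3.1, -2.9, -3.3, -2.8, -3.0, -3.2, -2.7, -3.0]
ctl = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.0]
p_raw = permutation_p_value(ko, ctl)
```

With 8 + 8 slides there are only 12,870 distinct relabellings, so the smallest attainable unadjusted p-value is limited. This is one reason the adjustment across roughly 6,000 genes has such a large effect on which genes remain significant.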

Histogram & qq plot ApoA1

Apo A1: Adjusted and Unadjusted p-values for the 50 genes with the largest absolute t-statistics.

Which genes have changed? Permutation testing not possible Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes. We hope in due course to calibrate B and use that as our main tool. We begin with the motivation, using data from a study in which each slide was replicated four times.

Results from 4 replicates

B=LOR compared

Results from the Apo AI ko experiment
[Figure legend: M, t, t ∩ M]


Empirical Bayes log posterior odds ratio

Results from SR-BI transgenic experiment
[Figure legend: M, B, t, M ∩ B, t ∩ B, t ∩ M ∩ B]


Extensions include dealing with
- Replicates within and between slides
- Several effects: use a linear model
- ANOVA: are the effects equal?
- Time series: selecting genes for trends

Rosetta once more: In vivo Binding Sites of Gal4p in Galactose
[Figure: un-enriched DNA (Cy3) vs. antibody-enriched DNA (Cy5); P < 0.001]

Summary (for the second simplest problem)
Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
Averages can be driven by outliers.
t-statistics can be driven by tiny variances.
B = LOR will, we hope
- use information from all the genes
- combine the best of M and t
- avoid the problems of M and t

Acknowledgments
UCB/WEHI: Yee Hwa Yang, Sandrine Dudoit, Ingrid Lönnstedt, Natalie Thorne, David Freedman
CSIRO Image Analysis Group: Michael Buckley, Ryan Lagerstrom
Ngai lab, UCB
Goodman lab, UCB
Peter Mac CI, Melb.
Ernest Gallo CRC
Brown-Botstein lab
Matt Callow (LBNL)
Bing Ren (WI)

Some web sites:
Technical reports, talks, software etc.
Statistical software R, “GNU’s S”
Packages within the R environment:
-- Spot
-- SMA (statistics for microarray analysis) /smacode.html

Factorial Design
[Diagram: nodes A1, P01, P04, A; arrows labelled Zone Effect and Age Effect]

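Different ways of estimating parameters, e.g. the Z effect
Factorial design, with diagram nodes (A1, P01, P04, A) labelled $\mu$, $\mu + a$, $\mu + z$ and $\mu + z + a + za$.
Direct estimate: $(\mu + z) - \mu = z$
Indirect estimate: $((\mu + a) - \mu) - ((\mu + a) - (\mu + z)) = a - (a - z) = z$
Other paths: $\dots = z$
How do we combine the information?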

Regression analysis
Define a matrix $X$ so that $E(M) = X\beta$.
Use the least squares estimate for $z$, $a$, $za$.
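To illustrate how a design matrix combines the direct and indirect estimates, here is a sketch that solves the normal equations $X^{T}X\,\hat{\beta} = X^{T}M$ for one gene. The six hybridizations, the sample labels S00, S10, S01, S11, and the parameter order (a, z, za) are assumptions made up for this example and are not the design used in the talk.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design is not full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(design, observed):
    """Least squares estimate: solve (X^T X) beta = X^T M."""
    columns = [list(col) for col in zip(*design)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
           for ci in columns]
    xtm = [sum(u * m for u, m in zip(ci, observed)) for ci in columns]
    return solve_linear_system(xtx, xtm)


# Hypothetical samples with means mu (S00), mu+a (S10), mu+z (S01),
# mu+z+a+za (S11). Each row is one slide; columns are (a, z, za).
X = [
    [1.0, 0.0, 0.0],   # S10 vs S00: a
    [0.0, 1.0, 0.0],   # S01 vs S00: z
    [1.0, 0.0, 1.0],   # S11 vs S01: a + za
    [0.0, 1.0, 1.0],   # S11 vs S10: z + za
    [1.0, 1.0, 1.0],   # S11 vs S00: a + z + za
    [-1.0, 1.0, 0.0],  # S01 vs S10: z - a
]
true_beta = [1.0, 2.0, 0.5]
M = [sum(x * b for x, b in zip(row, true_beta)) for row in X]  # noise-free
beta_hat = least_squares(X, M)
```

Each row of `X` encodes one path through the design. Least squares weights the direct and indirect paths automatically, which answers the question of how to combine the information.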

Looking at the effect of Z: log(zone 4 / zone 1)
[Figure panels: gene A; gene B]

Z effect
[Figure: axes Estimate and log2(SE); t = estimate / SE]

[Figure panels: Zone; Age; Zone × Age]

Top 50 genes from each effect
[Diagram: Age, Zone, Zone.Age interaction; one region labelled 19]

[Figures: further plots with legend entries M, B, t and their combinations]

Some statistical questions
- Image analysis: addressing, segmenting, quantifying
- Normalisation: within and between slides
- Quality: of images, of spots, of (log) ratios
- Which genes are (relatively) up/down regulated?
- Assigning p-values to tests/confidence to results.

Some statistical questions, ctd
- Planning of experiments: design, sample size
- Discrimination and allocation of samples
- Clustering, classification: of samples, of genes
- Selection of genes relevant to any given analysis
- Analysis of time course, factorial and other special experiments
... & much more

The “NCI 60” experiments (bg)