Statistical Analyses of Microarray Data Rafael A. Irizarry Department of Biostatistics
Outline: Scientific questions; Review of technology; Role of statistics; Two case studies
Scientific Questions: Expression; Differential expression; Expression patterns. “To understand gene function, it is helpful to know when and where it is expressed and…” “…under what circumstances the expression level is affected.” “…questions concerning functional pathways and how cellular components work together to regulate and carry out cellular processes.” Lipshutz et al. (1999) Nature Genetics, 21, pp
What do Microarrays do? Interrogate labeled nucleic acid samples. RNA samples: model systems, microdissections, cell lines, human tissue bank. Oligonucleotide barcodes: kanR, UPTAG, DOWNTAG.
How do they do it? Probes Labeled targets
cDNA Arrays. [Figure: cDNA clones (probes); PCR product amplification; purification; printing (0.1 nl/spot); microarray. mRNA target; hybridize target to microarray; scanning (excitation by laser 1 and laser 2, emission); analysis: overlay images and normalize.]
High Density Oligonucleotide Arrays. [Figure: GeneChip Probe Array, 1.28 cm; hybridized probe cell, 24 µm, with millions of copies of a specific oligonucleotide probe; >200,000 different complementary probes; single stranded, labeled RNA target; image of hybridized probe array. Compliments of D. Gerhold.]
Role of Statistics
[Figure: flow chart with the boxes Biological question (differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Image analysis; Quantify expression; Normalization; Estimation; Testing; Clustering; Discrimination; Biological verification and interpretation.]
Part of the image of one channel, false-coloured on a scale from white (v. high) and red (high) through yellow and green (medium) to blue (low) and black.
Does one size fit all?
Segmentation: limitation of the fixed circle method. SRG (seeded region growing) vs. Fixed Circle. Inside the boundary is spot (fg), outside is not.
Some local backgrounds. We use something different again: a smaller, less variable value. Single channel grey scale.
Quantification of Expression. For each spot on the slide we calculate Red intensity = Rfg − Rbg and Green intensity = Gfg − Gbg (fg = foreground, bg = background), and combine them in the log (base 2) ratio log2(Red intensity / Green intensity). We now have one differential expression value for each gene on each array.
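The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the talk: the function name `log2_ratio`, the example intensities, and the choice to return `None` for spots whose background-corrected intensity is not positive are all assumptions.

```python
import math

def log2_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(Red/Green) for one spot.

    Returns None when either corrected intensity is not positive,
    because the log ratio is undefined there. How such spots are
    handled is an assumption of this sketch.
    """
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical spots given as (Rfg, Rbg, Gfg, Gbg)
spots = [(1100, 100, 600, 100), (300, 100, 900, 100), (90, 100, 500, 100)]
ratios = [log2_ratio(*s) for s in spots]
```

A log ratio of 1 means the red channel is twice as bright as the green one after background correction, and a log ratio of −2 means it is four times dimmer.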
Top 2.5% of ratios red, bottom 2.5% of ratios green. The red-green ratios can be spatially biased.
Another example
Oligo Array Image Analysis. About 100 pixels per probe cell. These intensities are combined to form one number representing expression for the probe cell oligo.
Normalization at Probe Level
Dilution Experiment Data
PM MM
Default until 2002: GeneChip® software uses Avg.diff = (1/|A|) Σ_{j∈A} (PM_j − MM_j), with A a set of “suitable” pairs chosen by the software. A log ratio version is also used. For differential expression, Avg.diffs are compared between chips.
What is the evidence? Lockhart et al., Nature Biotechnology 14 (1996)
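As a rough sketch, Avg.diff can be read as the mean of PM − MM over the probe pairs in A. The slide does not say how the software selects the “suitable” pairs, so in the code below A is simply passed in by the caller. The function name `avg_diff`, the `suitable` parameter, and the intensities are hypothetical.

```python
def avg_diff(pm, mm, suitable=None):
    """Mean of PM - MM over the probe pairs whose indices are in `suitable`.

    `suitable` stands in for the set A on the slide. The rule the
    GeneChip software uses to pick A is not reproduced here; by
    default every probe pair is used (an assumption of this sketch).
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    indices = list(range(len(pm))) if suitable is None else list(suitable)
    if not indices:
        raise ValueError("the set of suitable pairs is empty")
    return sum(pm[j] - mm[j] for j in indices) / len(indices)

# Hypothetical probe set with four PM/MM pairs
pm = [1200.0, 950.0, 400.0, 5000.0]
mm = [300.0, 250.0, 380.0, 200.0]

all_pairs = avg_diff(pm, mm)                     # uses every pair
trimmed = avg_diff(pm, mm, suitable=[0, 1, 2])   # drops the outlying last pair
```

The two results differ by a factor of about three, which shows how strongly the summary depends on which pairs end up in A.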
Two case studies
Spike-In Experiments. Add concentrations (0.5 pM – 100 pM) of 11 foreign species cRNAs to the hybridization mixture. Set A: 11 control cRNAs were spiked in, all at the same concentration, which varied across chips. Set B: 11 control cRNAs were spiked in, all at different concentrations, which varied across chips. The concentrations were arranged in a 12x12 cyclic Latin square (with 3 replicates).
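To illustrate what a cyclic Latin square arrangement looks like, the sketch below builds a 12x12 square of concentration-level indices. It assumes rows stand for chips (or experimental conditions) and columns for spiked-in transcripts. The entries are level indices 0 to 11 rather than actual concentrations, since the slide gives only the 0.5 pM to 100 pM range. The function name is hypothetical.

```python
def cyclic_latin_square(n):
    """n x n cyclic Latin square: each row is the previous row shifted by one.

    Entry [row][col] is the index of the concentration level assigned
    to transcript `col` on chip `row` (row/column roles are an
    assumption of this sketch).
    """
    return [[(row + col) % n for col in range(n)] for row in range(n)]

square = cyclic_latin_square(12)

# Every level appears exactly once in each row and in each column.
rows_ok = all(sorted(row) == list(range(12)) for row in square)
cols_ok = all(sorted(col) == list(range(12)) for col in zip(*square))
```

With this layout every transcript is observed at every concentration level, and every chip contains every level exactly once.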
Set A: Probe Level Data (12 chips)
Spike-In B. [Table: columns Probe Set, Conc 1, Conc 2, Rank; rows BioB, BioB, BioC, BioB-M, BioDn, DapX, CreX, CreX, BioC, DapX, DapX-M.] Later we consider 23 different combinations of concentrations.
Observed Ranks. [Table: columns Gene, AvDiff, MAS 5.0, Li & Wong, AvLog(PM-BG); rows BioB, BioB, BioC, BioB-M (30363), BioDn, DapX, CreX, CreX, BioC, DapX, DapX-M.]
[Figure, panels A and B: transformation into deletion pool (kanR, UPTAG, DOWNTAG) with circular pRS416 or EcoRI linearized pRS416 (MCS, CEN/ARS, URA3, ttaa, aatt); NHEJ defective; select for Ura+ transformants; genomic DNA preparation; PCR; Cy5 labeled PCR products and Cy3 labeled PCR products; oligonucleotide array hybridization.]
Average Red and Green Scatter Plot
Average Red and Green MVA plot
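For readers unfamiliar with the MVA plot, the sketch below computes its coordinates under the usual convention: M = log2(R) − log2(G) on the vertical axis against A = (log2(R) + log2(G)) / 2 on the horizontal axis. The function name and the intensities are hypothetical, and skipping non-positive intensities is an assumption of this sketch.

```python
import math

def mva_coordinates(red, green):
    """Return (A, M) pairs for paired red and green intensities.

    M is the log2 ratio and A is the average log2 intensity.
    Pairs with a non-positive intensity are skipped because their
    logarithm is undefined.
    """
    points = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue
        log_r, log_g = math.log2(r), math.log2(g)
        points.append(((log_r + log_g) / 2, log_r - log_g))
    return points

# Hypothetical average red and green intensities for three features
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
points = mva_coordinates(red, green)
```

Compared with a plain red versus green scatter plot, this view puts the log ratio on its own axis, so intensity-dependent differences between the channels are easier to see.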
Histograms
QQ-Plot
Z-Scores
Average Red and Green MVA Plot
Average Red and Green Scatter Plot
Summary. Simple data exploration is a useful tool for quality assessment. Statistical thinking is helpful for interpretation. Statistical models may help find signals in noise.
Acknowledgements. UC Berkeley Stat: Ben Bolstad, Sandrine Dudoit, Terry Speed, Jean Yang. MBG (SOM): Jef Boeke, Siew-Loon Ooi, Marina Lee, Forrest Spencer. Biostatistics: Karl Broman, Leslie Cope, Carlo Colantuoni, Giovanni Parmigiani, Scott Zeger. Gene Logic: Francois Colin, Uwe Scherf’s Group. PGA: Tom Cappola, Skip Garcia, Joshua Hare. WEHI: Bridget Hobbs, Natalie Thorne.