Design of microarray gene expression profiling experiments Peter-Bram ’t Hoen
2 Lay-out
- Practical considerations
- Pooling
- Randomization
- One-color vs two-color
- Two-color hybridization designs
- Ratio-based vs intensity-based analysis
3 Think before you start
- research question
- choice of technology
- controls and replicates
Ref: Churchill, Nature Genetics Supplement 32: 490-5
4 Research question
- Limit your (initial) number of questions / conditions
  - choose the best time point for mRNA (regulation can differ from that of protein/activity)
  - pilots using RT-qPCR
- Experimental follow-up: what will you do with the data?
  - verification of differential gene expression
  - in vitro experiments to study mechanism
  - "in vivo" verification in tissue sections
5 Choice of technology
- What is affordable?
- Do a pilot to estimate the variance for your samples, experimental set-up and platform
- Calculate your power: what is the smallest effect size that you can detect?
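As a rough illustration of the power calculation, the sketch below uses a normal approximation for a two-group comparison of log2 expression values. The function name, the default significance level and power, and the pilot standard deviation of 0.5 are assumptions chosen for illustration only; a real calculation should also account for multiple testing across thousands of genes.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_log2_fc(sd, n_per_group, alpha=0.001, power=0.8):
    """Smallest true log2 fold change detectable with the given power.

    Normal approximation for a two-sided, two-sample comparison with
    equal group sizes. sd is the per-gene standard deviation of log2
    expression, e.g. estimated from a pilot (hypothetical value below).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = sd * sqrt(2 / n_per_group)
    return (z_alpha + z_power) * standard_error


if __name__ == "__main__":
    # Hypothetical pilot estimate: sd = 0.5 on the log2 scale.
    for n in (3, 5, 10):
        effect = min_detectable_log2_fc(sd=0.5, n_per_group=n)
        print(f"n = {n:2d} per group: log2 fold change >= {effect:.2f}")
```

Running it for a few replicate numbers shows how quickly the detectable effect shrinks as replicates are added, which helps to decide whether a platform is affordable for the question at hand.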
6 Controls
- positive: genes whose regulation is known
  - check on biological experiment & data analysis
- positive: spikes in mRNA and/or hyb mix
  - check labeling procedure and hybridization
  - detection range (sensitivity) and dynamic range
  - "landing lights" for gridding software
- negative controls: non-specific binding
  - check cross-hybridization: buffer, non-homologous DNA
7 Spikes
[Figure: spike RNAs are added to the test RNA and the reference RNA in known amounts (copies/cell); RCA, Cab, rbcL, LTP4 and LTP6 are spiked at a 2-fold change, XCP2, RPC1, NAC1, TIM and PRK at a 3-fold change; cDNA probes are synthesized and hybridized to an array containing the DNA controls]
8 Spikes Van de Peppel et al. EMBO Reports 4, 387 (2003)
10 Replicates
- Include sufficient replicates, based on pilot experiment
- Biological replicates are preferred over technical replicates
- Control experimental variables with possible unintended effects
  - genetic background
  - gender
  - age
11 Randomization
- Randomize samples with respect to experimental influences
  - experimenter
  - day of hybridization
  - batch of arrays
  - dye
  - etc.
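A minimal sketch of such a randomization, assuming the samples fall into named experimental groups and that the nuisance factor is the hybridization day (the same function could be used for experimenter or array batch). The function name and the sample labels are hypothetical.

```python
import random


def randomized_blocks(groups, days, seed=None):
    """Spread every experimental group evenly over hybridization days.

    groups: dict mapping group name -> list of sample identifiers.
    days:   list of day (or batch, experimenter) labels.
    Samples are shuffled within each group and then dealt round-robin
    over the days, so no group is confined to a single day.
    """
    rng = random.Random(seed)
    schedule = {day: [] for day in days}
    slot = 0
    for name in sorted(groups):
        members = list(groups[name])
        rng.shuffle(members)
        for sample in members:
            schedule[days[slot % len(days)]].append(sample)
            slot += 1
    for day in days:
        rng.shuffle(schedule[day])  # random processing order within a day
    return schedule


if __name__ == "__main__":
    example = {"control": ["c1", "c2", "c3", "c4"],
               "treated": ["t1", "t2", "t3", "t4"]}
    for day, samples in randomized_blocks(example, ["day1", "day2"], seed=1).items():
        print(day, samples)
```

Fixing the seed makes the assignment reproducible, so the schedule can be documented together with the experiment.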
12 Pooling
- Often done because of lack of sufficient amounts of RNA, but good amplification protocols are available
- Advantages: dampening of individual variation, may increase statistical power
- Generally not recommended:
  - outliers in the population may result in large and significant effects
  - information on the differences within the population is lost, and this is probably biologically relevant
  - in fact, it is an artificial way to increase the significance of your findings
13 Hybridization design
- One color: not many difficulties expected
- Two color: what to hybridize with what in which color?
  - Reference design
  - Paired design
  - Loop design
  - Mixed design
Read: Yang & Speed (2002). Design issues for cDNA microarray experiments. Nature Reviews Genetics 3,
14 Hybridization design: general issues
- Comparisons on the same array are more precise than comparisons on different arrays
  - Identify the most important comparisons
  - Hybridize those on the same slide
- Dye swap
  - A dye effect is always there
  - Balance designs with respect to dye (exception: some common reference designs)
15 Common reference vs direct hybridizations
- Direct (A and B together on each of two slides): if Var[log(A/B)] for one slide = s^2, then the variance of the average of the two measurements is s^2/2
- Common reference (A vs R on one slide, B vs R on the other): log(A/B) = log(A/R) - log(B/R), so Var[log(A/B)] = Var[log(A/R)] + Var[log(B/R)] = s^2 + s^2 = 2 s^2
16 More samples (A, B, C; 6 arrays)
- Loop: each pair is hybridized on two arrays, so each directly measured log-ratio has variance s^2/2
  - estimated log(A/B) = 2/3 measured log(A/B) + 1/3 {log(A/C) - log(B/C)}
  - assuming that all variances are equal: Var[log(A/B)] = 4/9 (s^2/2) + 1/9 (s^2) = 1/3 s^2
- Reference: each sample is hybridized on two arrays against R
  - Var[log(A/B)] = Var[log(A/C)] = Var[log(B/C)] = 0.5 s^2 + 0.5 s^2 = s^2
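The weights 2/3 and 1/3 follow from combining two independent estimates of log(A/B) in proportion to their precision (inverse variance). The sketch below checks the arithmetic with exact fractions. It assumes, as in the slide, 6 arrays with equal per-array variance s^2, and the function name is hypothetical.

```python
from fractions import Fraction


def combine(estimate_variances):
    """Inverse-variance weights and variance of the combined estimate.

    Each entry is the variance (in units of s^2) of one independent,
    unbiased estimate of the same quantity, here log(A/B).
    """
    precisions = [1 / Fraction(v) for v in estimate_variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    return weights, 1 / total


# Loop design, 6 arrays: every pair is hybridized twice.
direct = Fraction(1, 2)                     # variance of the averaged log(A/B)
indirect = Fraction(1, 2) + Fraction(1, 2)  # variance of log(A/C) - log(B/C)
weights, loop_variance = combine([direct, indirect])

# Reference design, 6 arrays: every sample is hybridized twice against R.
reference_variance = Fraction(1, 2) + Fraction(1, 2)

print("loop weights:", [str(w) for w in weights])
print("loop variance / s^2:", loop_variance)
print("reference variance / s^2:", reference_variance)
```

With the same number of arrays, the loop design estimates log(A/B) with one third of the variance of the reference design.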
17 Common reference vs direct hybridizations
- Theoretical considerations
  - A design is optimal when it minimizes the variance of the effect of interest
  - Look for designs leading to a small variance of log(A/B)
- Practical considerations
  - A common reference may be desirable when the experiment will be extended in the future or when many different conditions have to be compared
  - Choose a biologically relevant common reference (say, your control sample); in that case the ratios themselves are of interest and easier to interpret
18 Time-course designs
- Take 4 time points: T1, T2, T3, T4
- The best choice of design depends on the comparisons of interest and on the number of slides available
19 Time-course designs
- Using 3 slides: [design diagram over T1, T2, T3, T4]
- This is the best design to estimate changes relative to the initial time point: T2/T1, T3/T1, T4/T1
20 Time-course designs
- Using 3 slides: [design diagram over T1, T2, T3, T4]
- This is the best design to estimate relative changes between successive time points: T2/T1, T3/T2, T4/T3
21 Time-course designs
- Using 4 slides: [design diagram over T1, T2, T3, T4 and R]
- This is the reference design; all comparisons have equal precision
22 Time-course designs
- Using 4 slides: [design diagram over T1, T2, T3, T4]
- This is the loop design, balanced with respect to dye
- Distant comparisons have lower precision
23 Time-course designs
- Using 4 slides: [design diagram over T1, T2, T3, T4]
- This design also uses exactly 2 hybridizations per treatment and is balanced with respect to dye
- Most precise estimates: comparisons 1/2, 1/3, 2/4, 3/4 (numbers refer to time points T1 to T4)
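The precision statements on the 4-slide designs can be checked with a small least-squares calculation. The sketch below assumes that each slide yields log(red/green) with independent error variance s^2, and it reconstructs the slide layouts from the captions only (the diagrams themselves are not available), so the array lists in `DESIGNS` are assumptions. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def contrast_variance(arrays, samples, a, b):
    """Variance of the estimated log(a/b), in units of s^2.

    arrays: list of (red, green) sample pairs, one per slide.
    The first entry of `samples` serves as baseline, so the remaining
    samples are the parameters of an ordinary least-squares fit.
    """
    params = samples[1:]
    k = len(params)
    # Design matrix: +1 for the red sample, -1 for the green one.
    rows = [[(p == red) - (p == green) for p in params]
            for red, green in arrays]
    # Augmented normal equations [X'X | c], c being the contrast a - b.
    aug = [[Fraction(sum(r[i] * r[j] for r in rows)) for j in range(k)]
           + [Fraction((params[i] == a) - (params[i] == b))]
           for i in range(k)]
    # Gauss-Jordan elimination solves (X'X) z = c.
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    contrast = [(p == a) - (p == b) for p in params]
    return sum(c * aug[i][k] for i, c in enumerate(contrast))


TIMES = ["T1", "T2", "T3", "T4"]
DESIGNS = {  # (samples, arrays); layouts assumed from the slide captions
    "reference": (["R"] + TIMES, [(t, "R") for t in TIMES]),
    "loop": (TIMES, [("T1", "T2"), ("T2", "T3"), ("T3", "T4"), ("T4", "T1")]),
    "alternative": (TIMES, [("T1", "T2"), ("T2", "T4"), ("T4", "T3"), ("T3", "T1")]),
}

if __name__ == "__main__":
    for name, (samples, arrays) in DESIGNS.items():
        for a, b in combinations(TIMES, 2):
            print(f"{name:12s} {a}/{b}: {contrast_variance(arrays, samples, a, b)} s^2")
```

Under these assumptions the reference design gives the same variance for every comparison, whereas the two 4-slide loops give a smaller variance for directly hybridized pairs than for pairs that are two steps apart.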
24 Factorial designs
- Designs for studies which involve factors as explanatory variables
  - Age group
  - Gender
  - Cell line
  - Tumor type
25 Factorial designs
- Glonek & Solomon (2004)
- Admissible design: using the same number of arrays, no other design yields smaller variances for all parameters
Glonek et al. Biostatistics 5, (2004)
26 Factorial design; example
- Time: 0h, 24h
- Cell lines: I (non-leukaemic), II (leukaemic)
- Find genes differentially expressed at 24h but not at 0h: interaction between time and cell line
27 Factorial design; possible samples
- All combinations of factor levels
- In this case, 4 are possible: I,0; I,24; II,0; II,24
28 Factorial design: analysis model
- A (log-)linear model is used
- Experimental conditions correspond to parameter combinations (table from the slide not reproduced)
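For the 2 x 2 example (cell line x time), one common way to write such a model is sketched below. The symbols μ, α, β and (αβ) are assumptions chosen for illustration and need not match the notation of the original slide; y denotes the mean log-expression of a gene in the given sample.

```latex
\begin{aligned}
y_{\mathrm{I},0}   &= \mu \\
y_{\mathrm{I},24}  &= \mu + \beta \\
y_{\mathrm{II},0}  &= \mu + \alpha \\
y_{\mathrm{II},24} &= \mu + \alpha + \beta + (\alpha\beta) \\
(\alpha\beta) &= \left(y_{\mathrm{II},24} - y_{\mathrm{II},0}\right) - \left(y_{\mathrm{I},24} - y_{\mathrm{I},0}\right)
\end{aligned}
```

In this parametrisation α is the cell-line difference at 0h, β the time effect in cell line I, and the interaction (αβ) is the extent to which the time effect differs between the two cell lines.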
29 Factorial design; possible arrays
[Figure: the four samples I,0; I,24; II,0; II,24 with the six possible pairwise hybridizations numbered (1) to (6)]
30 Optimal admissible design
- Designs that are not worse than others, and for which the variance of the parameter of interest is (one of) the smallest
- In the example: we wish to find admissible designs for which the interaction term has one of the smallest variances
31 Glonek et al. Biostatistics 5, (2004)
32 Optimal admissible design Glonek et al. Biostatistics 5, (2004)
33 Factorial designs: conclusions
- The design with all pairwise comparisons is not the best in this case
- The best design can only be found with respect to a model
  - if the model does not fit the data well, the design choice may not be the best
  - make sure the model chosen is adequate
34 How to compare many different conditions efficiently?
- Common reference: not efficient
- Loop and mixed designs: not all comparisons have equal precision
GA Churchill, Nat Genet Dec;32 Suppl:490-5
35 Possible solution
- Randomized design
- Intensity-based rather than ratio-based calculations
- Requires:
  - independent hybridization of the two samples; no competition for binding sites
  - absence of large spot and array effects
- To be tested for each platform
36 Our favourite platform
- Spotted collection of 65-mer oligonucleotides (Sigma-Compugen collection)
- 22K
37 Design used to demonstrate independent hyb ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
38 Distribution of signal intensities is similar ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
39 Correlation of intensities is high [figure legend bins the correlation coefficient R; the surviving thresholds are 0.95 and 0.90] ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
40 Effect of addition of unlabelled target [panels: single target on microarray; two targets on microarray] ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
41 Correlation of ratios calculated from different hyb designs ’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
42 Intensity-based analysis
- Hybridizations of two targets on the array are independent
  - no saturation and no competition
- Intensity readings show high inter-array correlation
- Comparisons on the same array have the highest precision; all other comparisons have equal precision
’t Hoen et al. Nucleic Acids Res. 32:e41 (2004)
43 Example of randomized design Turk et al. FASEB J 20, (2006) Mouse models for muscular dystrophy
44 Our design
- Randomly assign samples to the arrays, avoiding co-hybridization of samples from the same group
- 2 biological replicates
- 4 technical replicates (dye-swap + replicate spotting)
Turk et al. FASEB J 20, (2006)
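A minimal sketch of how such a random assignment could be generated, assuming two-colour arrays and an even number of samples. The function names and the group labels in the example are hypothetical and are not taken from Turk et al.; replicate spotting is a property of the array and is not modelled here.

```python
import random


def pair_samples(sample_groups, seed=None, max_tries=10_000):
    """Randomly pair samples on two-colour arrays, never within a group.

    sample_groups: dict mapping sample identifier -> group label; the
    number of samples must be even. Uses rejection sampling: shuffle,
    pair neighbours, and retry if any pair shares a group.
    Returns a list of (cy3_sample, cy5_sample) tuples.
    """
    samples = sorted(sample_groups)
    if len(samples) % 2:
        raise ValueError("need an even number of samples")
    rng = random.Random(seed)
    for _ in range(max_tries):
        rng.shuffle(samples)
        pairs = list(zip(samples[0::2], samples[1::2]))
        if all(sample_groups[a] != sample_groups[b] for a, b in pairs):
            return pairs
    raise RuntimeError("no valid pairing found; check group sizes")


def with_dye_swaps(pairs):
    """Add a dye-swapped technical replicate for every array."""
    return [pair for a, b in pairs for pair in ((a, b), (b, a))]


if __name__ == "__main__":
    # Hypothetical example: 4 groups with 2 biological replicates each.
    groups = {f"{g}_{i}": g for g in ("wt", "modelA", "modelB", "modelC")
              for i in (1, 2)}
    for cy3, cy5 in with_dye_swaps(pair_samples(groups, seed=0)):
        print(f"Cy3: {cy3:9s} Cy5: {cy5}")
```

Rejection sampling is simple and adequate for a handful of groups; it fails with an error rather than looping forever when one group is so large that no valid pairing exists.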
45 Intensity-based analysis can go wrong Vinciotti et al. Bioinformatics 21: (2005)
47 Some guidelines
- First determine the main question, pointing out the effect of interest: log(A/B)
- Then choose the analysis model, so that the effect variance can be computed: Var[log(A/B)]
- Practical constraints: amount of RNA available, number of hybridizations, number of slides
- A good design measures the effect of interest as accurately as possible: small Var[log(A/B)]
49 Acknowledgements
Human and Clinical Genetics, LUMC
Judith Boer, Renée de Menezes, Rolf Turk, Ellen Sterrenburg, Johan den Dunnen, Gertjan van Ommen
Microarray facility: Leiden Genome Technology Center
50 Case study
- Two genetically modified zebrafish strains and one wild type
- Defects mainly in muscle development
- Apparent at hours of development; early death
- Question: which biological pathways are affected and responsible for defective myogenesis?
51 Possible platforms and budget
- Affymetrix (1-color): 500 euro per chip; variance for the ratio of two samples on two chips: s^2
- Home-spotted arrays (2-color): 100 euro per chip; variance for the ratio of two samples on one chip: 2 s^2
- Budget: 12,000 euro
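As a starting point for the discussion, the numbers above can be turned into a back-of-the-envelope comparison. The sketch below assumes three groups (two mutant strains plus wild type), an equal split of Affymetrix chips over the groups, and a two-colour design in which every array co-hybridizes one mutant with wild type. It also assumes that s^2 is the same on both platforms and that all arrays carry independent biological samples. These allocations and the function names are illustrative assumptions, not a recommended design.

```python
from fractions import Fraction

BUDGET = 12_000   # euro, from the slide
GROUPS = 3        # two mutant strains plus wild type


def one_colour(price=500, pair_variance=Fraction(1)):
    """Chips split equally over the groups (assumed allocation).

    pair_variance is the variance, in units of s^2, of a ratio formed
    from two single chips; averaging n chips per group divides it by n.
    Returns (chips bought, chips per group, variance of mutant vs wild type).
    """
    chips = BUDGET // price
    per_group = chips // GROUPS
    return chips, per_group, pair_variance / per_group


def two_colour_direct(price=100, array_variance=Fraction(2)):
    """Every array co-hybridizes one mutant with wild type (assumed design).

    array_variance is the variance, in units of s^2, of one within-array
    ratio; averaging m arrays per mutant divides it by m.
    Returns (arrays bought, arrays per mutant, variance of mutant vs wild type).
    """
    arrays = BUDGET // price
    per_mutant = arrays // (GROUPS - 1)
    return arrays, per_mutant, array_variance / per_mutant


if __name__ == "__main__":
    for label, result in (("Affymetrix", one_colour()),
                          ("home-spotted", two_colour_direct())):
        bought, replicates, variance = result
        print(f"{label}: {bought} arrays, {replicates} replicates, "
              f"Var[log(mutant/wt)] = {variance} s^2")
```

The calculation ignores whether enough animals and RNA are available for that many hybridizations, which is exactly the kind of practical constraint the case study asks about.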
52 Questions
- Isolation of specific compartments / whole animal lysates?
- Pooling?
- How many replicates?
- Which hybridization design?
- What is the variance of the most important comparisons?