Design of experiments and basic analysis: estimating and testing for differential expression. Statistics for Microarray Data Analysis – Lecture 3 The Fields Institute for Research in Mathematical Sciences May 25, 2002
Design of cDNA microarray experiments
Some aspects of design

Layout of the array
  Which cDNA sequences to print? Library; controls.
  Spatial positions.
Allocation of samples to the slides
  Different design layouts: A vs B (treatment vs control); multiple treatments; time series; factorial.
  Replication: number of hybridizations; use of dye swap in replication; different types of replicates (e.g. pooled vs unpooled material (samples)).
  Other considerations: physical limitations (the number of slides and the amount of material); extensibility (linking).

There are two broad aspects to designing a microarray experiment. The first is designing the array: which cDNA sequences to print, which library to spot, and which quality controls to include. This part of the design is more of a bioinformatics question. The second aspect is the allocation of samples to the slides: assigning dye labels to the samples and deciding which samples should be paired and hybridized on the same slide. Later in the talk we will see the design choices available in each experimental setting.

Another general issue that affects design choices is replication. The number and type of replicates often determine the precision and generalizability of the experiment. Extensibility refers to the ability to compare between essentially arbitrarily many sources of data.

In the interest of time, I will focus on how the precision of estimates varies across design layouts in four experimental settings.
Graphical representation
Natural design choice
(Figure: design graphs in which treatments T1, T2, ..., Tn-1, Tn, or T1, T2, T3, T4, are each hybridized against C or Ref)
Case 1: Meaningful biological control (C). Samples: liver tissue from four mice treated by cholesterol modifying drugs. Question 1: genes that respond differently between the Ti and C. Question 2: genes that respond similarly across two or more treatments relative to control.
Case 2: Use of a universal reference. Samples: different tumor samples. Question: to discover tumor subtypes.
In some cases, given the nature of the experiment and the material available, one design stands out as preferable to all others. For example, if we wish to study mRNA from cells, each treated by a different drug, and the primary comparisons of interest are those of the treated cells versus the untreated cells, then the appropriate design is clear: the untreated cells become a de facto reference, and all hybridizations involve one treated set of cells and the untreated cells. Remember that in a 2-color microarray system everything is a pairwise comparison: we cannot simply observe the effect of T1 or T2; rather, we observe the expression of T1 relative to something else. In this case, the expression of T1 relative to C is a natural choice. These are examples where the nature of the scientific question gives a natural design choice. With most experiments, a number of designs can be devised which seem suitable for use, and we need some principles for choosing one from the set of possibilities.
Finding differentially expressed genes
Wanted: tools to identify the genes whose expression levels are associated with a covariate or a response of interest. Examples include:
Qualitative covariates or factors: e.g. treatment, cell type, tumor class;
Quantitative covariates: e.g. dose, time;
Responses: e.g. survival, cholesterol level, weight;
Any combination of the above.
The simplest design question: Direct versus indirect comparisons
Two samples, e.g. KO vs. WT or mutant vs. WT.
Direct (T hybridized against C): estimate average(log(T/C)), variance σ²/2.
Indirect (T and C each hybridized against Ref): estimate log(T/Ref) – log(C/Ref), variance 2σ².
These calculations assume independence of replicates: the reality is not so simple.
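The two variances can be checked with a few lines of arithmetic. This is only a sketch under the slide's own assumptions: two slides per design, independent log ratios, and a common variance σ² (set to 1 here). The function name `variance_of_estimate` is made up for illustration.

```python
def variance_of_estimate(coefficients, sigma2=1.0):
    """Variance of sum(c_i * y_i) for independent y_i with common variance sigma2."""
    return sigma2 * sum(c * c for c in coefficients)

# Direct design: two slides of T vs C, estimate = (y1 + y2) / 2.
direct = variance_of_estimate([0.5, 0.5])

# Indirect design: one slide T vs Ref and one slide C vs Ref, estimate = y1 - y2.
indirect = variance_of_estimate([1.0, -1.0])

print("direct:", direct, "indirect:", indirect, "ratio:", indirect / direct)
```

With the same number of slides, the indirect estimate has four times the variance of the direct one, provided the independence assumption holds.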
Identifying differentially expressed genes with one slide
This is a common enough hope.
Efforts are frequently successful.
It is not hard to do by eye.
The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future. Why?
Single-slide methods Existing methods. Model dependent rules for deciding whether (R,G) corresponds to a differentially expressed gene. Amount to drawing two curves in the (R,G)-plane and calling a gene differentially expressed if its (R,G) falls outside the region between the two curves. At this time we do not know enough about the systematic and random variation within a microarray experiment to justify such strong modeling assumptions. n=1 may not be enough.
Single-slide methods, cont Existing methods differ in the distributional assumptions they make regarding (R,G). Chen et al. Each (R,G) is assumed to be normally and independently distributed with constant CV. Decision based on R/G only. Newton et al. Gamma-Gamma-Bernoulli hierarchical model for each (R,G). Roberts et al. Each (R,G) is assumed to be normally and independently distributed with variance depending linearly on the mean. Sapir & Churchill. Each log R/G is assumed to be distributed according to a mixture of normal and uniform distributions. Decision based on R/G only.
Matt Callow’s Srb1 dataset (#5). Newton’s and Chen’s single slide method
Identifying differentially expressed genes with replicated slides
Some aspects:
Between-slide normalization.
Summaries: averages and SDs, t, Mann-Whitney, Cox model score and F statistics, regression coefficients, and others.
How should we look at them?
Can we make valid probability statements?
Apo AI experiment
Goal: to identify genes with altered expression in the livers of Apo AI knock-out mice (K) compared to inbred C57Bl/6 control mice (C).
8 treatment mice (Ki) and 8 control mice (Ci).
16 hybridizations: liver mRNA from each of the 16 mice (Ki, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
Probes: ~6,000 cDNAs (genes), including 200 related to lipid metabolism.
(Figure: design graph with 8 K samples and 8 C samples, each hybridized against C*)
Data provided by Matt Callow, LBNL.
Identifying differentially expressed genes, cont
For each slide, we summarize each spot with M = log2(R/G). For each spot, call these k1, k2, …, k8, c1, c2, …, c8.
Statistics
a) average difference: mean(k1, …, k8) – mean(c1, …, c8);
b) t statistic: the average difference divided by its standard error;
c) B statistic: later…
d) Robust t: not today.
To identify differentially expressed genes:
a) Diagnostic plots: q-q plot, histogram.
b) Testing: p-values, adjusted p-values.
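As a rough illustration, statistics (a) and (b) for a single spot might be computed as below. The slide does not say which standard error is used, so the unequal-variance (Welch) form here is an assumption, and the log ratios are invented values, not Apo AI data.

```python
from math import sqrt
from statistics import mean, variance

def average_difference(k, c):
    """Mean log ratio of the knock-out slides minus that of the control slides."""
    return mean(k) - mean(c)

def two_sample_t(k, c):
    """Average difference over its standard error (Welch form, an assumption)."""
    se = sqrt(variance(k) / len(k) + variance(c) / len(c))
    return average_difference(k, c) / se

# Hypothetical M = log2(R/G) values for one spot on the 16 slides.
k = [-1.9, -2.2, -2.0, -1.8, -2.1, -2.3, -1.7, -2.0]
c = [0.1, -0.1, 0.0, 0.2, -0.2, 0.1, -0.1, 0.0]

print("average difference:", average_difference(k, c))
print("t statistic:", two_sample_t(k, c))
```

In practice the same two numbers would be computed for every spot on the array and then examined together.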
Histogram & normal q-q plot of t-statistics ApoA1
Why a normal q-q plot?
One of the things we want to do with our t-statistics is, roughly speaking, to identify the extreme ones. It is natural to rank them, but how extreme is extreme? Since the sample sizes here are not too small (two samples of 8 each gives 16 terms in the difference of the means), approximate normality is not an unreasonable expectation for the null marginal distribution. Converting ranked t's into a normal q-q plot is a great way to see the extremes: they are the ones that are "off the line", at one end or the other. This technique is particularly helpful when we have thousands of values. Of course we can't expect all differentially expressed genes to stand out as extremes: many will be masked by more extreme random variation, which is a big problem in this context.
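The coordinates of such a plot are easy to compute. The sketch below pairs the ranked t's with standard normal quantiles; the plotting positions (i - 0.5)/n are one common convention and are an assumption here, as is the function name.

```python
from statistics import NormalDist

def normal_qq_points(t_stats):
    """Pair each ranked t statistic with the matching standard normal quantile."""
    n = len(t_stats)
    ordered = sorted(t_stats)
    std_normal = NormalDist()  # mean 0, SD 1
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical t statistics for a handful of genes.
for q, t in normal_qq_points([0.3, -1.2, 0.8, -0.4, 9.5, 0.1, -0.7]):
    print(round(q, 2), t)
```

Plotting the second coordinate against the first gives the q-q plot; points far from the line through the bulk of the data are the candidate extremes.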
Useful plots of t-statistics
A more cautious approach These plots are useful, but we need to look at them more closely. Can we trust average effect sizes alone? Can we trust the t statistic alone? Here is evidence that the answer is no.
Results from 4 replicates of a different experiment
Points to note One set (green) has a high average M but also a high variance and a low t. Another (pale blue) has an average M near zero but a very small variance, leading to a large negative t. A third (dark blue) has a modest average M and a low variance, leading to a high positive t. A fourth (purple) has a moderate average M and a moderate variance, leading to a small t. Another pair (yellow, red) have moderate average Ms and middling variances, and moderately large ts. Does this happen with our Apo AI experiment?
Sets defined by cut-offs: from the Apo AI ko experiment (Figure: gene sets labelled M\t, t\M, t, M)
Results from the Apo AI ko experiment (Figure: gene sets labelled M\t, t\M, t, M)
Apo AI experiment: t vs average A (Figure: gene sets labelled M\t, t\M, t, M)
An empirical Bayes story
Using average M alone, we ignore useful information in the SD across replicates. Some large values are large because of outliers. Using t alone, we are liable to be misled by very small SDs. With thousands of genes, some SDs will be very small. Formal testing can sort out these issues for us, but if we simply want to rank, what should we rank on?
One approach (SAM) is to inflate the SDs slightly. Another approach can be based on the following empirical Bayes story. There are a number of variants.
Suppose that our M values are independently and normally distributed, and that a proportion p of genes are differentially expressed, i.e. have M's with non-zero means. Further, suppose that the variances and means of these are chosen jointly from inverse chi-squared and normal conjugate priors, respectively. Genes not differentially expressed have zero means and variances from the same inverse chi-squared distribution. The scale and d.f. parameters in the inverse chi-squared distribution are estimated from the data, as is a parameter c connecting the prior for the mean with that for the variances. We then look for the posterior probability that a given gene is differentially expressed, and find that it is an increasing function of the statistic B given on the next slide.
Empirical Bayes log posterior odds ratio (LOR)
Notice that for large n this is approximately t = M./s.
Comparison of different criteria These data come from the Srb1 transgenic mouse experiment with 8 replicates. See Table on next page.
Table: Sets of genes. A "1" indicates that the genes in the set are extreme* for that statistic.

M.  T   B   Comment
1   1   1   High in all three statistics: clearly differentially expressed.
1   1   0   No genes here: extreme for M. and T implies extreme for B!
1   0   1   False negatives in T: large M. and truly moderately high variance.
1   0   0   False positives in M.: large M. but variance too large to be trusted.
0   1   1   False negatives in M., but detected by T and B.
0   1   0   False positives in T: small M. but tiny variance.
0   0   1   False negatives in M. and T (high but not extreme): detected by B.
0   0   0   Not differentially expressed genes.

* |M.| > 0.5, |T| > 4.5, B > -2. These limits are chosen for illustration. Normally they would be slightly higher.
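Sorting genes into these eight sets is a matter of applying the three cut-offs. The sketch below uses the illustrative limits from the footnote; the gene names and statistic values are invented, and the function names are hypothetical.

```python
def extreme_flags(m_avg, t, b, m_cut=0.5, t_cut=4.5, b_cut=-2.0):
    """Return (M., T, B) indicators: 1 if the gene is extreme for that statistic."""
    return (int(abs(m_avg) > m_cut), int(abs(t) > t_cut), int(b > b_cut))

def group_by_flags(genes):
    """genes maps a gene name to its (M., T, B) values; returns flags -> gene names."""
    sets = {}
    for name, (m_avg, t, b) in genes.items():
        sets.setdefault(extreme_flags(m_avg, t, b), []).append(name)
    return sets

# Hypothetical statistics for four genes.
genes = {
    "g1": (1.2, 8.0, 3.0),     # extreme on all three
    "g2": (0.1, -6.0, -5.0),   # small M. but tiny variance: extreme T only
    "g3": (0.9, 2.0, -4.0),    # large M. but large variance: extreme M. only
    "g4": (0.0, 0.3, -7.0),    # nothing extreme
}
for flags, names in sorted(group_by_flags(genes).items(), reverse=True):
    print(flags, names)
```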
Summary
Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
Averages can be driven by outliers.
ts can be driven by tiny variances.
B = LOR will, we hope,
  use information from all the genes;
  combine the best of M. and t;
  avoid the problems of M. and t.
Ranking on B could be helpful.
Identifying differentially expressed genes K samples: Single factor experiment
Linear models In many situations we want to combine data from different experiments in a slightly more elaborate manner than simply averaging. One way of doing so is via (fixed effects) linear models, where we estimate certain quantities of interest which we call effects for each gene on our slide. Typically these estimates may be regarded as approximately normally distributed with common SD, and mean zero in the absence of any relevant differential expression. In such cases, the preceding two strategies: q-q plots, and various combinations of estimated effect (cf M.), standardized estimate (cf. t) both apply. We illustrate in a couple of cases.
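Design I and Design II
(Figure: design graphs for samples A, P and L; one design hybridizes each sample against a reference w, the other hybridizes the samples directly against each other)
Extending this to 3 samples, we base our comparison criterion on the precision of the different estimates. Again, all array experiments produce pairwise comparisons.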
Linear model analysis
Log ratios: y = (y1, y2, y3)
Parameters: b = (a – p, l – p), where a = log2 A, p = log2 P and l = log2 L
Model: E(y1) = p – a, E(y2) = l – p, E(y3) = a – l
(Figure: direct design with hybridizations among A, L and P)
As all measurements are paired comparisons, only the differences between the effects are estimable, and the contrasts of interest are thus the pairwise differences. With design matrix X, the variances of the estimates (in units of σ²) are the diagonal elements of (X'X)^-1.
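For this three-slide direct design the calculation can be written out explicitly. The design matrix below is read off the model equations (y1 = -(a - p), y2 = l - p, y3 = (a - p) - (l - p)); the helper functions are a small pure-Python sketch written for illustration, not the lecture's software, and σ is set to 1.

```python
from fractions import Fraction

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(m):
    """Gauss-Jordan inverse of a small square matrix, in exact fractions."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def contrast_variance(cov, c):
    """Variance of the contrast c'b, i.e. c' (X'X)^-1 c with sigma = 1."""
    n = len(c)
    return sum(c[i] * cov[i][j] * c[j] for i in range(n) for j in range(n))

# Rows: y1 = p - a, y2 = l - p, y3 = a - l; columns: (a - p), (l - p).
X = [[-1, 0], [0, 1], [1, -1]]
cov = inverse(matmul(transpose(X), X))
pairwise = {"a-p": [1, 0], "l-p": [0, 1], "a-l": [1, -1]}
variances = {name: contrast_variance(cov, c) for name, c in pairwise.items()}
average_variance = sum(variances.values()) / len(variances)
print(variances, average_variance)
```

Each pairwise comparison comes out with variance 2/3, which is the 0.67 quoted for the direct design.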
Design             I(a) Common reference   I(b) Common reference   II Direct comparison
Number of slides   N = 3                   N = 6                   N = 3
Ave. variance      2                                               0.67

(Figure: design graphs for samples A, P and L, with reference w in the common reference designs)
Here we compare a few design choices; for presentation we set σ to 1. For example, holding the number of slides constant, the average variance over all 3 pairwise comparisons is 2 for the common reference design and 0.67 for the direct design. Notice that we use 2 units of material in II and only 1 in I(a).
For k = 3, efficiency ratio (Design I(a) / Design II) = 3. In general, efficiency ratio = 2k / (k-1).
Design              I(a) Common reference   I(b) Common reference   II Direct comparison
Number of slides    N = 3                   N = 6                   N = 3
Units of material   A = B = C = 1           A = B = C = 2           A = B = C = 2
Ave. variance       2                       1                       0.67

(Figure: design graphs for samples A, P and L, with reference w in the common reference designs)
Alternatively, holding the number of units of material constant, the gain in efficiency is not so striking, and we save on slides.
For k = 3, efficiency ratio (Design I(b) / Design II) = 1.5. In general, efficiency ratio = k / (k-1).
Targets samples: K=6 A D P L V M
Estimation
Multiple direct comparisons between different samples (no common reference).
Different ways of estimating the same contrast, e.g. A compared to P:
Direct = A-P
Indirect = (A-M) + (M-P) or (A-D) + (D-P) or -(L-A) - (P-L)
(Figure: design graph on samples A, P, L, D, M and V; each arrow is a hybridization)
An alternative approach is to make multiple direct comparisons. As a reminder, we have 6 samples, and each arrow in the schematic represents a hybridization. We can estimate the same contrast, A compared to P, in two ways: directly, and indirectly through M, D or L. How do we combine this information?
Analysis using a linear model
Log ratios: y
Parameters: b = (a–l, p–l, d–l, v–l, m–l), where a = log2 A, p = log2 P, d = log2 D, v = log2 V, m = log2 M, l = log2 L
Model: E(y) = Xb
Ordinary least squares: estimate of b = (X'X)^-1 X'y
In practice, we use robust regression.
For every gene (spot) on the array, we use a linear model to find the least squares estimates of a–l, p–l, and so on. The model is fitted separately for each of the roughly 20,000 genes on the array. Estimates for other estimable contrasts follow in the usual way.
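To see how least squares combines direct and indirect information, here is a minimal sketch on a made-up sub-design of three hybridizations (A vs P, A vs M, M vs P) with invented log ratios. It uses plain ordinary least squares through the normal equations, not the robust regression mentioned above, and every function name is hypothetical.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def design_row(cy5, cy3, params, baseline):
    """Row of X for log2(cy5/cy3); parameters are (sample - baseline)."""
    row = [0] * len(params)
    if cy5 != baseline:
        row[params.index(cy5)] += 1
    if cy3 != baseline:
        row[params.index(cy3)] -= 1
    return row

def least_squares(hybridizations, y, params, baseline):
    """Ordinary least squares estimates via the normal equations X'X b = X'y."""
    X = [design_row(r, g, params, baseline) for r, g in hybridizations]
    p = len(params)
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return dict(zip(params, solve(xtx, xty)))

# Hypothetical hybridizations (Cy5 sample, Cy3 sample) and log ratios for one gene.
hybridizations = [("A", "P"), ("A", "M"), ("M", "P")]
y = [1.0, 0.4, 0.3]
estimates = least_squares(hybridizations, y, params=["A", "M"], baseline="P")
print(estimates)
```

For this small design the estimate of A - P works out to (2/3) x 1.0 + (1/3) x (0.4 + 0.3) = 0.9: the direct measurement gets twice the weight of the indirect path through M.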
Pairwise comparisons: effects vs average intensity red: genes used in clusters; blue: genes used to normalize
Contrasts
Because of the connectivity of our experiment, we can estimate all 15 different pairwise comparisons directly and/or indirectly. For every gene we thus have a pattern based on the 15 pairwise comparisons (15 simple contrasts).
(Figure: estimated pairwise comparisons for gene #15,228)
For example, gene 15228 is more active in the back of the bulb than in the front, with no difference between the left and the right. Looking at all pairwise comparisons makes it rather difficult to visualize the corresponding 3-D pattern, so we use a different set of contrasts.
Contrasts in another way
Instead of estimating pairwise comparisons between each of the six effects, we can come closer to estimating the effects themselves by doing so subject to the standard zero sum constraint (6 parameters, 5 d.f.). What we estimate for a, say, subject to this constraint, is in reality an estimate of a - 1/6(a + p + d + v + m + l). In effect we have created the whole-bulb reference in silico.
(Figure: estimated effects for gene #15,228)
In this example, gene 15228 is predicted by this analysis to be expressed in the posterior and dorsal portion of the bulb. To see what this prediction really means, we performed in situ hybridization on the bulb.
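Moving from baseline-relative estimates to zero-sum effects is a simple re-centring. The sketch below assumes the estimates are expressed relative to l (so l itself is 0) and subtracts the mean of all six; the numbers are invented and the function name is hypothetical.

```python
def zero_sum_effects(relative_to_baseline, baseline="l"):
    """Re-centre estimates of (sample - baseline) so the six effects sum to zero."""
    effects = dict(relative_to_baseline)
    effects[baseline] = 0.0  # the baseline compared with itself
    centre = sum(effects.values()) / len(effects)
    return {name: value - centre for name, value in effects.items()}

# Hypothetical estimates of a-l, p-l, d-l, v-l and m-l for one gene.
estimates = {"a": 0.2, "p": 1.4, "d": 1.0, "v": -0.4, "m": 0.2}
effects = zero_sum_effects(estimates)
print(effects)
```

Pairwise differences are unchanged by the re-centring (here p minus l is still 1.4), so nothing is lost relative to the 15 pairwise comparisons.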
Single factor experiment – time course
Possible designs:
All samples vs common pooled reference
All samples vs time 0
Direct hybridization between times
(Figure: design graphs labelled "Pooled reference" (Ref), "Compare to T1", and direct hybridizations t vs t+1, t vs t+2, t vs t+3)
Design choices in time series
Columns: t vs t+1 (T1T2, T2T3, T3T4); t vs t+2 (T1T3, T2T4); T1T4; Ave.
N=3
A) T1 as common reference: 1, 2; Ave 1.5
B) Direct hybridization: 3; Ave 1.67
N=4
C) Common reference
D) T1 as common ref + more: .67; Ave 1.06
E) Direct hybridization choice 1: .75; Ave .83
F) Direct hybridization choice 2
(Figure: design graphs on T1, T2, T3, T4 and Ref)
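The entries for the two N = 3 designs can be worked out by hand. The sketch below assumes design A hybridizes T2, T3 and T4 each against T1, and design B is the chain T1-T2, T2-T3, T3-T4. With no replicate slides or loops, each comparison is estimated along a single path, so its variance (in units of σ²) is the number of slides on that path. This shortcut does not apply to the N = 4 designs, which are not reproduced here.

```python
from collections import deque
from itertools import combinations

def path_length(edges, start, goal):
    """Number of slides on the path between two samples (breadth-first search)."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    seen, queue = {start: 0}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return seen[node]
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    raise ValueError("samples are not connected by hybridizations")

def pairwise_variances(edges, samples):
    """Variance of every pairwise comparison; valid only for tree-shaped designs."""
    return {(s, t): path_length(edges, s, t) for s, t in combinations(samples, 2)}

times = ["T1", "T2", "T3", "T4"]
design_a = [("T1", "T2"), ("T1", "T3"), ("T1", "T4")]  # assumed: T1 as common reference
design_b = [("T1", "T2"), ("T2", "T3"), ("T3", "T4")]  # assumed: consecutive times

var_a = pairwise_variances(design_a, times)
var_b = pairwise_variances(design_b, times)
print("A:", var_a, "average", sum(var_a.values()) / len(var_a))
print("B:", var_b, "average", sum(var_b.values()) / len(var_b))
```

Under these assumptions the averages come out at 1.5 for A and 1.67 for B, matching the Ave entries listed for those designs.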
Identifying differentially expressed genes 2 x 2 factorial experiment:
Examples of two factors, each with two levels
Example 1: Suppose we wish to study the joint effect of two drugs, A and B. 4 possible treatment combinations:
C: no treatment
A: drug A only
B: drug B only
A.B: both drug A and drug B
Example 2: Our interest is in comparing two strains of mice (mutant and wild-type) at two different times, postnatal and adult. 4 possible samples:
C: WT at postnatal
A: WT at adult (effect of time only)
B: MT at postnatal (effect of the mutation only)
A.B: MT at adult (effect of both time and the mutation)
One possible design: use C as a common reference
(Figure: A, B and A.B each hybridized against C, giving log ratios y1, y2 and y3)
y1 = log (A / C) = a + error
y2 = log (B / C) = b + error
y3 = log (AB / C) = a + b + ab + error
Estimate (ab) with y3 - y2 - y1.
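A short calculation shows the price of estimating the interaction this way. The sketch assumes one slide per comparison, independent errors and a common variance σ² (set to 1); the log ratios are invented and the function names are hypothetical.

```python
def linear_estimate(coefficients, y):
    """Value of the linear combination sum(c_i * y_i)."""
    return sum(c * v for c, v in zip(coefficients, y))

def variance_factor(coefficients):
    """Variance of the combination in units of sigma^2, for independent y_i."""
    return sum(c * c for c in coefficients)

main_effect_a = [1, 0, 0]    # a is estimated by y1 alone
interaction = [-1, -1, 1]    # ab is estimated by y3 - y2 - y1

y = [1.0, 0.5, 2.0]          # hypothetical log ratios y1, y2, y3 for one gene
print("a:", linear_estimate(main_effect_a, y), "variance", variance_factor(main_effect_a))
print("ab:", linear_estimate(interaction, y), "variance", variance_factor(interaction))
```

Under these assumptions the interaction estimate has three times the variance of a main effect, because it combines the errors of all three slides.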
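Statisticians recognise a factorial design
Expected log intensities: C: m; A: m + a; B: m + b; AB: m + a + b + ab.
(Figure: the four samples C, A, B and AB, with the hybridizations between them numbered 1 to 6)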
Analysis using a linear model
Log ratios: y
Parameters: b = (a, b, ab), where a and b are the main effects and ab is the interaction effect.
Model: E(y) = Xb
Ordinary least squares: estimate of b = (X'X)^-1 X'y
In practice, we use robust regression. Estimates for other estimable contrasts follow in the usual way.
(Figure: design graph on C, A, B and A.B)
Estimates of the a effect: log2(A/C) vs average A, where A = average log √(R*G)
(Figure: scatter plot with gene A and gene B highlighted)
Estimates of the a effect vs SE
t = (estimate of a) / SE
(Figure: t plotted against log2(SE))
2 x 2 factorial: design options
Designs I) to IV) on samples C, A, B and A.B range from indirect comparisons to a balance of direct and indirect comparisons. # Slides: N = 6.
Table entries are variances:
Main effect A: 0.5, 0.67, NA
Main effect B: 0.43, 0.3
Interaction A.B: 1.5, 1
The choice depends on the question of interest: interaction only; main effect only; or a combination of both.
More general n by m factorial experiment
2 factors, one with n levels and the other with m levels.
OE experiment (2 by 2): interested in the difference between zones, the difference between ages, and also the zone.age interaction.
Further experiment (2 by 3): only interested in genes where the difference between treatment and control changes with time.
(Figure: design graphs with treatment and control samples at times 0, 12 and 24)
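Expected values for the six samples, relative to WT.P1:
WT.P1: baseline
WT.P11: + a1
WT.P21: + a1 + a2
MT.P1: + b
MT.P11: + a1 + b + a1.b
MT.P21: + (a1 + a2) + b + (a1 + a2)b
(Figure: design graph on the six samples with hybridizations numbered 1 to 7)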
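Common reference approach
Each M is the log ratio of one sample against the common reference; expected values are given relative to M1, with a1, a2 the time effects and b the mutant effect:
M1 = Lc.WT.P1
M2 = Lc.WT.P11, expected value E(M1) + a1
M3 = Lc.WT.P21, expected value E(M1) + (a1 + a2)
M4 = Lc.MT.P1, expected value E(M1) + b
M5 = Lc.MT.P11, expected value E(M1) + a1 + b + a1.b
M6 = Lc.MT.P21, expected value E(M1) + (a1 + a2) + b + (a1 + a2).b
Estimate (a1.b) with M5 – M4 – M2 + M1.
Estimate (a1 + a2).b with M6 – M4 – M3 + M1.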
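The two estimates are differences of differences, and a quick numerical check confirms that they isolate the interaction terms. The sketch builds noise-free M values from invented effect sizes, using a1, a2 for the time effects and b for the mutant effect, and then applies the two formulas; all names and numbers are hypothetical.

```python
def expected_log_ratios(base, a1, a2, b, a1_b, a12_b):
    """Noise-free M1..M6 for WT.P1, WT.P11, WT.P21, MT.P1, MT.P11, MT.P21."""
    return {
        1: base,
        2: base + a1,
        3: base + (a1 + a2),
        4: base + b,
        5: base + a1 + b + a1_b,
        6: base + (a1 + a2) + b + a12_b,
    }

def interaction_estimates(m):
    """Difference-of-differences estimates of (a1.b) and (a1 + a2).b."""
    first = m[5] - m[4] - m[2] + m[1]
    second = m[6] - m[4] - m[3] + m[1]
    return first, second

# Hypothetical effect sizes for one gene.
m = expected_log_ratios(base=0.3, a1=0.5, a2=0.2, b=-0.4, a1_b=0.7, a12_b=1.1)
print(interaction_estimates(m))
```

The baseline and the main effects cancel in both combinations, leaving only the interaction terms (0.7 and 1.1 in this made-up example).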
References
V G Tusher et al. Proc. Nat. Acad. Sci. USA 98 (2001) 5116-5121.
S Dudoit et al. Statistica Sinica 12 (2002) 111-139.
I Lönnstedt & T P Speed. Statistica Sinica 12 (2002) 31-46.