Statistics for Differential Expression Naomi Altman Oct. 06.


1 Statistics for Differential Expression Naomi Altman Oct. 06

2 Some things to consider before we start: Model; Replication; Correlation / Independence; Treatments (conditions, varieties, ...).

3 Some things to consider before we start: Model. Using a statistical model sheds light on the analysis by quantifying features such as condition effects and sources of biological and experimental variation. A model can be written down before the data are collected, which clarifies how the data should be collected and analyzed. When an estimate of variability is available, the model can be used to determine an appropriate sample size.
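The sample-size remark can be made concrete with base R's `power.t.test`. This is only a sketch for the simplest two-treatment comparison; the within-condition standard deviation, the difference to detect, the significance level, and the power are all assumed values, not numbers from the slides.

```r
# Assumed values for illustration only (not from the slides):
sd_within <- 0.5    # within-condition SD of log2 intensity, e.g. from pilot data
delta     <- 1      # difference to detect: 1 on the log2 scale is a 2-fold change

# Solve for the number of independent biological replicates per treatment.
res <- power.t.test(delta = delta, sd = sd_within,
                    sig.level = 0.001, power = 0.9,
                    type = "two.sample")
n_per_group <- ceiling(res$n)
n_per_group
```

A stringent `sig.level` is used here because many genes are tested at once; a different threshold would change the answer.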

4 Some things to consider before we start: Replication. Statistical methods compare the differences among condition means to the variation within condition. The within-condition variation can only be estimated by replicating the condition. Technical replication (multiple probes in a probeset, or multiple hybridizations of the same sample) is often treated as if it had biological meaning, but it is not true replication.

5 Some things to consider before we start: Correlation / Independence. Observations are correlated because they are taken on the same individual, measured on the same array, or processed in the same replicate. Most simple analysis methods assume independence and hence must be modified to handle correlated data.

6 Some things to consider before we start: Treatments. What is interesting? What is the "action"? How many can we really handle?

7 2 treatments. We have already considered the simple case of 2 treatments using t-tests (or the permutation, bootstrap, or Wilcoxon versions of the tests). Which tests do we use, and when are they appropriate?

8 Tests for 2 treatments. Two-sample "t-tests" (and similar tests) require independent samples within and between the 2 treatments, i.e. (1) all RNA samples are biologically independent, and (2) each sample is hybridized to a different array. This covers single-channel arrays such as Affy, Nimblegen, and CodeLink, and 2-channel arrays with a reference sample in the same channel on each array (use M as the data).
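For a single gene, the two-sample tests named on this slide might be run as follows. The data are simulated, and the group means, spread, and sample sizes are assumptions chosen only for illustration.

```r
set.seed(1)
# Simulated log2 intensities for one gene: 5 independent biological samples
# per treatment, each hybridized to its own array (all numbers are made up).
trtA <- rnorm(5, mean = 8.0, sd = 0.4)
trtB <- rnorm(5, mean = 9.0, sd = 0.4)

t_equal <- t.test(trtA, trtB, var.equal = TRUE)  # classical two-sample t-test
t_welch <- t.test(trtA, trtB)                    # Welch version (unequal variances)
w_rank  <- wilcox.test(trtA, trtB)               # Wilcoxon rank-sum test

p_values <- c(t = t_equal$p.value, welch = t_welch$p.value,
              wilcoxon = w_rank$p.value)
p_values
```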

9 Tests for 2 treatments. The paired "t-test" (and similar tests) requires: (1) each array includes both treatments; (2) different arrays come from different biological samples; (3) there is no dye effect, or technical dye-swaps have been done and the technical replicates have been averaged.
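A short sketch of the paired case: the paired t-test on the two channels is the same as a one-sample t-test on the per-array log ratio M. The intensities below are simulated, and the array effect and noise levels are assumed values.

```r
set.seed(2)
# Simulated log2 intensities on 6 two-channel arrays (made-up numbers).
# Each array carries both treatments, so the two channels share an array effect.
array_effect <- rnorm(6, sd = 1)
red   <- 9.0 + array_effect + rnorm(6, sd = 0.3)   # treatment 1
green <- 8.5 + array_effect + rnorm(6, sd = 0.3)   # treatment 2
M <- red - green                                   # log2 ratio, one per array

paired  <- t.test(red, green, paired = TRUE)       # paired t-test on the channels
one_smp <- t.test(M, mu = 0)                       # one-sample t-test on M

# The two statistics agree, because the shared array effect cancels in M.
same_t <- isTRUE(all.equal(unname(paired$statistic), unname(one_smp$statistic)))
same_t
```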

10 Tests for 3 or more treatments with independent samples. Requires independent samples. (We cannot extend the paired-sample idea, because we do not have 3 or more channels on the array.) $H_0$: all the population means are equal. $H_A$: at least one of the means differs.

11 Tests for 3 or more treatments with independent samples. Examples: Cancers: several cancer types, with 1 sample per patient and several patients with each cancer. Genotypes: several genotypes of mice, with 1 sample per mouse and several mice per genotype. Drug: different doses applied to different individuals, with 1 sample per individual and several individuals per dose.

12 Tests for 3 or more treatments with independent samples. The F-test (like the t-test) assumes that the spreads are all approximately equal and that the populations are approximately normally distributed. The other versions of the test do not require normality. The test statistic is the ratio of the variance among the sample means to the pooled variance within the samples.

13 Tests for 3 or more treatments with independent samples: One-Way ANOVA. Suppose there are $T$ treatments, with $n_i$ observations from the $i$th treatment, and $N = n_1 + \dots + n_T$. Then $F^* = MS_{tr}/MSE$ has an F-distribution with $T-1$ and $N-T$ degrees of freedom when the null is true.
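The ratio $F^* = MS_{tr}/MSE$ can be computed by hand from the standard one-way ANOVA definitions of the two mean squares. The sketch below uses R's built-in `iris` data (the same data as the `aov` call on the next slide); the variable names are illustrative.

```r
y <- iris$Sepal.Length
g <- iris$Species

N      <- length(y)              # total number of observations
T_trt  <- nlevels(g)             # number of treatments
n_i    <- tapply(y, g, length)   # observations per treatment
ybar_i <- tapply(y, g, mean)     # treatment means

# Mean square for treatments: spread of the treatment means around the grand mean.
MStr <- sum(n_i * (ybar_i - mean(y))^2) / (T_trt - 1)
# Mean square error: pooled spread of observations around their own treatment mean.
MSE  <- sum((y - ave(y, g))^2) / (N - T_trt)

F_star  <- MStr / MSE
p_value <- pf(F_star, T_trt - 1, N - T_trt, lower.tail = FALSE)
round(c(MStr = MStr, MSE = MSE, F = F_star), 3)
```

These should agree with the Mean Sq and F value columns reported by `summary(aov(...))` for the same data.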

14 One-way ANOVA

    summary(aov(iris$Sepal.Length ~ iris$Species))
                  Df Sum Sq Mean Sq F value    Pr(>F)
    iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
    Residuals    147 38.956   0.265
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Permutation, bootstrap, and rank tests (the Kruskal-Wallis test) are readily extended to this situation.
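As a rough sketch of those extensions, the rank test is available as `kruskal.test` in base R, and a permutation version of the F-test can be built by shuffling the treatment labels. The number of permutations (999) and the helper name `f_stat` are arbitrary choices for illustration.

```r
set.seed(3)
y <- iris$Sepal.Length
g <- iris$Species

# Rank-based analogue of the one-way F-test.
kw <- kruskal.test(y ~ g)

# Permutation test: under H0 the labels are exchangeable, so shuffle them and
# recompute the F statistic each time.
f_stat <- function(y, g) summary(aov(y ~ g))[[1]][["F value"]][1]
f_obs  <- f_stat(y, g)
f_perm <- replicate(999, f_stat(y, sample(g)))

# Proportion of statistics (observed one included) at least as large as f_obs.
p_perm <- (1 + sum(f_perm >= f_obs)) / (1 + length(f_perm))
c(kruskal = kw$p.value, permutation = p_perm)
```

With 999 permutations the smallest attainable permutation p-value is 0.001, so more permutations are needed if very small p-values matter.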

15 More complex situations. Many microarray experiments do not fall into this simple situation because of correlation in the data arising from: biological correlation (same cell line, same individual, ...); use of 2-channel microarrays; multiple probes for the same gene. We may also have multifactor studies, e.g. 2 genotypes, control and exposed, time course. For these we use Linear Mixed Models.

16 Linear Models. It is useful to consider a model for the observed data (on a single probe or probeset): $Y = \log_2(\text{intensity}) = \mu + \alpha + \beta + \gamma + \dots + \text{error}$. $\mu$ is the mean over all the conditions and arrays. The error is random and is a mixture of measurement error and biological variability. The other terms are systematic deviations from the mean, due to the treatments, array effects, lab effects, etc.

17 Linear Models. e.g. comparison of liver and kidney tissue in male and female mice on 2-channel arrays with 3 replicate spots per gene, using 5 males and 5 females. $Y$ is the $\log_2(\text{intensity})$ in one channel for one spot. We need to remember that dye might have an effect.

18 Linear Models. Fixed effects are the conditions of interest in the experiment. Random effects are conditions which explain some of the noise in the model.

19 How does the model help us? Generally, differential expression analysis is looking for differences between treatments that are larger than expected by chance. The model helps us to understand the meaning of "by chance". The model also allows us to design our experiment to minimize the probability of chance observation of large differences.

20 How Does the Model Help Us?

Mean log2 intensity:

              Male   Female
    Liver      5.6      6.3
    Kidney     9.3     10.7

Two comparisons: the difference between male and female in liver, and the difference between liver and kidney in males. What is larger than expected by chance?

Suppose the arrays are: 5 arrays with male and female liver, and 5 arrays with male and female kidney.

Suppose instead the arrays are: 5 arrays with male liver and kidney, and 5 arrays with female liver and kidney.

21 The simplest model. 2 treatments on 2-channel arrays with independent biological samples, no dye effect, and no dye-swap. All of the data are independent. $M = \log_2(\text{Red}) - \log_2(\text{Green})$, and $M_i = \mu + \text{error}_i$. No differential expression implies $H_0: \mu = 0$. The F-test for this model is just $t^2$ from the paired t-test.
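The identity $F = t^2$ for this model can be checked numerically. The M values below are simulated under assumed parameters; the F statistic compares the residual sum of squares with $\mu$ fixed at 0 against the residual sum of squares with $\mu$ estimated by the sample mean.

```r
set.seed(4)
# Simulated log2 ratios, one per array (made-up numbers).
M <- rnorm(8, mean = 0.6, sd = 0.5)
n <- length(M)

t_stat <- unname(t.test(M, mu = 0)$statistic)

rss0 <- sum(M^2)                 # residual SS under H0: mu = 0
rss1 <- sum((M - mean(M))^2)     # residual SS with mu estimated by mean(M)
F_stat <- ((rss0 - rss1) / 1) / (rss1 / (n - 1))

# F on (1, n - 1) df equals the square of t on n - 1 df.
c(F = F_stat, t_squared = t_stat^2)
```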

22 One-Way "ANOVA". $Y_{ij} = \mu + \alpha_i + \text{error}_{ij}$. $\mu$ is the mean expression for the gene over the entire experiment. $\alpha_i$ is the deviation of the mean of the $i$th condition from the overall mean, with $\sum_i \alpha_i = 0$. The error variance should not depend on the condition.

23 More Complicated Models with Fixed Effects Only. $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$. We may have 2 or more factors, e.g. genotype and drug dose, genotype and time point, or treatment and dye. $\mu$ is the mean expression for the gene over the entire experiment. $\alpha_i$ is the deviation of the mean of the $i$th level of factor A from the overall mean, with $\sum_i \alpha_i = 0$. $\beta_j$ is the deviation of the mean of the $j$th level of factor B from the overall mean, with $\sum_j \beta_j = 0$. $(\alpha\beta)_{ij}$ is the deviation of the mean of the $ij$th combination of levels from $\mu + \alpha_i + \beta_j$, with $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$. The error variance should not depend on the condition.

24 More Complicated Models with Fixed Effects Only. [Figure: two panels, "Interaction among factors" and "No interaction among factors".]

25 More Complicated Models with Fixed Effects Only. $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$. Normal-theory ANOVA is readily extended to this situation, and more factors can be added. Permutation and bootstrap methods begin to get complicated, but can still be applied. Rank-based methods are available for 2 factors, but get complicated.
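A minimal sketch of the two-factor fixed-effects fit for one gene, using simulated data. The factor names (`genotype`, `dose`), effect sizes, and replicate count are assumptions made for illustration; no interaction is simulated, so the interaction line of the table tests a true null.

```r
set.seed(5)
# Hypothetical design: 2 genotypes x 3 doses, 4 independent samples per cell.
d <- expand.grid(rep = 1:4,
                 genotype = c("wt", "mut"),
                 dose = c("low", "mid", "high"))

alpha <- c(wt = -0.5, mut = 0.5)             # genotype effects, sum to 0
beta  <- c(low = -0.3, mid = 0, high = 0.3)  # dose effects, sum to 0
d$y <- unname(8 + alpha[as.character(d$genotype)] +
                beta[as.character(d$dose)] +
                rnorm(nrow(d), sd = 0.3))

# genotype * dose expands to both main effects plus their interaction.
fit <- aov(y ~ genotype * dose, data = d)
tab <- summary(fit)[[1]]
tab
```

The degrees of freedom follow the model: 1 for genotype, 2 for dose, 2 for the interaction, and 18 for error.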

26 Replicates that are not Independent. We often have replicates that are NOT independent: multiple spots for the same gene on an array; multiple arrays from the same RNA; multiple RNAs from the same tissue; multiple samples from the same individual; multiple labs; multiple "batches".

27 Replicates that are not Independent. e.g. a dye-swap experiment in which the dye-swaps are technical replicates (1 dye-swap pair per sample), there are 2 spots per gene on the array, and there are 2 or more treatments: $Y_{ijkst} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_s + \tau_t + \text{error}_{ijkst}$. $\mu$ is the mean expression for the gene over the entire experiment. $\alpha_i$ is the deviation of the mean of the $i$th treatment from the overall mean, with $\sum_i \alpha_i = 0$. $\beta_j$ is the deviation of the mean of the $j$th level of dye from the overall mean, with $\beta_r + \beta_g = 0$. $\gamma_k$ is the array effect, which induces a correlation between the 2 spots on the same array, $\gamma_k \sim N(0, \sigma_\gamma^2)$. $\delta_s$ is the spot effect, which induces a correlation between the 2 channels at the same spot, $\delta_s \sim N(0, \sigma_\delta^2)$. $\tau_t$ is the biological sample effect, which induces a correlation between the 2 arrays in the dye-swap pair, $\tau_t \sim N(0, \sigma_\tau^2)$.

28 Replicates that are not Independent The lack of independence can be modeled as a random effect. This is handled in a straightforward manner by ANOVA modeling but... all the other methods get MUCH more complicated. Much of the available software does not handle this very well.

29 Replicates that are not Independent. In some cases, we can return to fixed-effects models by averaging (but this loses power), e.g. technical replicates can be averaged and the averages can be used as if they were the primary data. This is much better than discarding technical replicates, but not as good as modeling them.
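The averaging approach might look like this for one gene. The layout (6 biological samples, 2 technical replicate hybridizations each), the column names, and the variance components are assumed for illustration.

```r
set.seed(6)
# Hypothetical layout: 3 biological samples per treatment, each hybridized twice.
d <- data.frame(
  sample    = rep(paste0("s", 1:6), each = 2),
  treatment = rep(c("A", "B"), each = 6)
)
bio <- rnorm(6, sd = 0.5)   # biological sample effects, shared by both technical replicates
d$y <- 8 + (d$treatment == "B") * 1 + rep(bio, each = 2) + rnorm(12, sd = 0.2)

# Average the technical replicates so that each biological sample contributes
# exactly one value, then analyze the averages as the primary data.
avg   <- aggregate(y ~ sample + treatment, data = d, FUN = mean)
t_avg <- t.test(y ~ treatment, data = avg, var.equal = TRUE)

# Degrees of freedom reflect the 6 biological samples, not the 12 hybridizations.
unname(t_avg$parameter)
```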

30 Replicates that are not Independent: Example. 2 conditions on a 2-channel array with replicate spots for each gene, and a dye-swap technical replicate, e.g. 2 genotypes of mouse; 3 mice per genotype; 1 mouse from each genotype on each array; 2 arrays from each pair of mice; 4 replicate spots per array. We will simplify by modeling $M$, rather than each channel. [Diagram: mouse pairs A1/B1, A2/B2, A3/B3.]

31 Replicates that are not Independent: Example. Effects: mouse pair; dye (or equivalently genotype); array. Pair and array are random; dye is fixed. We need to keep track of whether $M$ is R-G or A-B (the genotype difference). We do not need to include spot, as we are using $M$.

32 Replicates that are not Independent: Example. Data for 1 mouse pair ($m$): 2 arrays, with 4 spots per array, giving $M_{mdas}$, where $m$ is the mouse pair identifier (1, 2, 3), $d$ is the dye for genotype A (r, g), $a$ is the array (1-6, or 1, 2 within $m$), and $s$ is the spot (1-4 within array).

33 Replicates that are not Independent: Example. $M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$. Effects: $m$, mouse pair (random); $d$, dye (or equivalently, the genotype in the R channel); $a$, array (random); $t = 1, 2, 3, 4$ indexes the spots. The hypothesis of no genotype effect is $\mu = 0$. Notice that we have to be careful about the sign of $M$: if we code the effects in the way it is usually done for ANOVA, $M = A - B$, not $R - G$.

34 Replicates that are not Independent: Example. $M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$. Our estimate of $\mu$ is just the sample mean of $M$ over all the spots. But the SE of ave($M$) is not the usual sample SD divided by $\sqrt{n}$, because of the other effects.

35 Replicates that are not Independent: Example. $M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$. 3 mouse pairs, 6 arrays, 24 observations per gene.

36 What if we ignore the dependence? Compare with: [formula not preserved]. We would use: [formula not preserved]. The denominator of the ordinary t-test is much too small.
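A small simulation can illustrate why the ordinary t-test misleads here. It follows the example design (3 mouse pairs, 2 dye-swapped arrays per pair, 4 spots per array) with the true $\mu$ set to 0; the variance components and dye effect are assumed values, and the one-sample t-test on the 3 pair means is a simple stand-in for a full mixed-model analysis, not the method shown on the original slide.

```r
set.seed(7)
simulate_once <- function() {
  d <- expand.grid(spot = 1:4, array = 1:2, pair = 1:3)
  pair_eff  <- rnorm(3, sd = 0.5)[d$pair]                      # mouse-pair effects
  array_eff <- rnorm(6, sd = 0.3)[(d$pair - 1) * 2 + d$array]  # array effects
  dye_eff   <- ifelse(d$array == 1, 0.2, -0.2)                 # cancels over a dye-swap
  M <- pair_eff + array_eff + dye_eff + rnorm(24, sd = 0.1)    # true mu = 0

  c(naive = t.test(M)$p.value,                                 # 24 "independent" spots
    pairs = t.test(as.numeric(tapply(M, d$pair, mean)))$p.value)  # 3 pair means
}

p <- replicate(1000, simulate_once())
# Fraction of simulated genes declared significant at 0.05 although mu = 0.
reject_rate <- rowMeans(p < 0.05)
reject_rate
```

Under these assumptions the naive test rejects far more often than the nominal 5%, while the test on pair means stays close to it, at the cost of having only 2 degrees of freedom.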

