Differential Expressions: Multiple Treatments ANOVA Kruskal Wallis Factorial Set-up.

1 Differential Expressions: Multiple Treatments ANOVA Kruskal Wallis Factorial Set-up

2 Some Examples: 1. Is there a difference in the mean expression for three different conditions? 2. Is there a difference in the mean sugar content of five different brands of cereal? 3. Is there a difference in the mean Pb content of the three main lakes in Eastern WA (Coeur d’Alene, Liberty Lake and Newman Lake)? 4. Is there a difference in the cholesterol ratio among the four groups: young males, young females, older males and older females?

3 Models In each case we are interested in comparing multiple means to each other. MODEL: $Y = X\beta + \epsilon$, where $X$ is now categorical, so the design matrix takes values of 1 or 0. ONE-WAY ANOVA: $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$. TWO-WAY ANOVA: $Y_{ijk} = \mu + \beta_j + \tau_i + \epsilon_{ijk}$. TWO-WAY ANOVA with INTERACTIONS: $Y_{ijk} = \mu + \beta_j + \tau_i + \gamma_{ji} + \epsilon_{ijk}$.
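To make the "matrix of 1s and 0s" concrete, here is a minimal sketch of a one-way design matrix built in plain Python. The function name `one_way_design` and the treatment labels are hypothetical, chosen only for illustration. The matrix has an intercept column for the grand mean plus one indicator column per treatment.

```python
def one_way_design(labels):
    """Return (levels, X): X has an intercept column plus one 0/1 column per level."""
    levels = sorted(set(labels))
    X = [[1] + [1 if lab == lev else 0 for lev in levels] for lab in labels]
    return levels, X


# Hypothetical layout: three treatments, two observations each.
labels = ["A", "A", "B", "B", "C", "C"]
levels, X = one_way_design(labels)

print(["intercept"] + levels)
for row in X:
    print(row)
```

Note that this over-parameterized matrix is not of full column rank (the indicator columns sum to the intercept column), so in practice a constraint such as dropping one column or forcing the effects to sum to zero is needed before fitting.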

4 One-Way ANOVA Let us deal with one-way ANOVA for simplicity. The hypotheses of interest are $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ versus $H_a$: at least one mean is unequal. So we are interested in finding whether or not at least one of the treatments is different from another. We are also interested in identifying WHICH ones are different.

5 Example The logic behind ANOVA: we decide whether the means are the same or not based on variability, by comparing the variability between the treatment means with the variability within each treatment. Assume that we wish to compare three mean expression levels, each based on five replicate arrays.

6 Model-Based Approach to ANOVA Like multiple regression and simple linear regression, ANOVA can also be written in terms of a linear model. Cell-means model: $Y_{ij} = \mu_i + \epsilon_{ij}$ OR effects model: $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$, where $Y_{ij}$ is our observed data, $\mu$ is our overall mean (grand mean), $\tau_i$ is our effect from treatment $i$, and $\epsilon_{ij}$ are our error terms. Our assumption is that the $\epsilon_{ij}$ are independent and follow a normal distribution with mean 0 and variance $\sigma^2$.

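7 Hypotheses So the hypotheses we are testing are, in terms of the cell means, $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ versus $H_a$: at least one inequality; or, in terms of the treatment effects, $H_0: \tau_1 = \tau_2 = \dots = \tau_k\ (= 0)$ versus $H_a$: at least one inequality.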

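8 Partitioning the SS Here too, we divide the SSTotal into SSTotal = SSModel + SSError, with degrees of freedom $N-1$, $r-1$ and $N-r$ respectively, where $N$ is the total number of observations and $r$ is the number of treatments. $E(MSE) = \sigma^2$ and $E(MSModel) = \sigma^2 + \sum_i n_i \tau_i^2 / (r-1)$. Hence under the null, the second term of $E(MSModel)$ drops out and $E(MSModel)/E(MSE) = 1$. Also, Cochran’s theorem indicates that the error chi-square and the model chi-square are independent under the null, hence MSModel/MSE follows an F distribution with $(r-1, N-r)$ degrees of freedom and we can test the null using F critical points.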
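The partition can be checked numerically. The sketch below uses made-up data (three conditions, five replicates each, not taken from the slides) and a hypothetical helper `one_way_anova`; it computes the sums of squares, the mean squares and the F ratio using only the Python standard library.

```python
from statistics import mean


def one_way_anova(groups):
    """Partition SSTotal into SSModel + SSError and return the F ratio."""
    all_obs = [y for g in groups for y in g]
    N, r = len(all_obs), len(groups)
    grand = mean(all_obs)
    ss_model = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    ss_total = sum((y - grand) ** 2 for y in all_obs)
    ms_model = ss_model / (r - 1)   # df = r - 1
    ms_error = ss_error / (N - r)   # df = N - r
    return {
        "ss_model": ss_model, "ss_error": ss_error, "ss_total": ss_total,
        "ms_model": ms_model, "ms_error": ms_error, "F": ms_model / ms_error,
    }


# Made-up expression values: three conditions, five replicate arrays each.
groups = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [6, 7, 8, 9, 10]]
result = one_way_anova(groups)
print(result)
```

For these numbers SSModel = 70, SSError = 30 and SSTotal = 100, so F = 35 / 2.5 = 14 on (2, 12) degrees of freedom; the p-value would come from an F table or a statistics library.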

9 Follow-up Analysis Once we declare that there is an overall difference, we want to see where the differences lie. We could be interested in the following: 1. Comparing all pairs of treatments to each other. 2. Comparing some pre-chosen specific treatments to each other. 3. Comparing the treatments to a STANDARD treatment. 4. Comparing treatments to the BEST treatment. When we compare all pairs of t treatments, we have t(t-1)/2 comparisons in total. Hence, if we perform each comparison at Type I error level alpha (say 0.05), our overall Type I error becomes VERY large. So there are different methods for controlling the overall Type I error.
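A short calculation shows how quickly the overall error grows. This sketch assumes the comparisons are independent, which pairwise comparisons sharing treatments are not, so the numbers are only a rough guide; the function names are hypothetical.

```python
def n_pairs(t):
    """Number of pairwise comparisons among t treatments."""
    return t * (t - 1) // 2


def familywise_error(alpha, m):
    """Chance of at least one false rejection in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


alpha = 0.05
for t in (3, 5, 10):
    m = n_pairs(t)
    # alpha / m is the Bonferroni per-comparison level that keeps the overall rate <= alpha.
    print(t, m, round(familywise_error(alpha, m), 3), alpha / m)
```

With five treatments there are already 10 comparisons, and the overall error rate under independence is about 0.40 rather than 0.05.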

10 Multiple comparison (MC) methods 1. Fisher’s LSD (controls the per-comparison error rate) a. Essentially this is doing t(t-1)/2 pooled t tests (or confidence intervals) using the overall pooled variance, each at level alpha. b. Easier to find significances (liberal). c. Extremely high overall error rate for a large number of treatments. 2. Tukey’s HSD (controls the family-wise error rate) a. Essentially does t(t-1)/2 pooled t tests (or confidence intervals), each at a level lower than alpha, so that the overall error rate is alpha. b. Harder to find significances (conservative). c. This is the exact method and would be recommended by statisticians if the sample sizes for each treatment are equal. For unequal sample sizes, other methods in use are the Bonferroni method, the Tukey-Kramer method, the Scheffé method, etc. There are MANY methods for multiple comparisons, and this is a very active research area in statistics.
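The pooled t tests that these methods share can be sketched as follows. The data are made up and the helper names are hypothetical; the code only computes the t statistics with the overall pooled variance (the MSE). The choice of critical value, which is what distinguishes LSD, Bonferroni and Tukey, is left to a table or a statistics library.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pooled_mse(groups):
    """Overall pooled variance: SSError / (N - number of groups)."""
    ss_error = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_error = sum(len(g) for g in groups) - len(groups)
    return ss_error / df_error


def pairwise_t(groups):
    """Pooled t statistic for every pair of groups, keyed by (i, j)."""
    mse = pooled_mse(groups)
    stats = {}
    for i, j in combinations(range(len(groups)), 2):
        se = sqrt(mse * (1 / len(groups[i]) + 1 / len(groups[j])))
        stats[(i, j)] = (mean(groups[i]) - mean(groups[j])) / se
    return stats


# Made-up data: three treatments, five replicates each.
groups = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [6, 7, 8, 9, 10]]
t_stats = pairwise_t(groups)
for pair, t in t_stats.items():
    print(pair, round(t, 3))
```

Fisher's LSD would compare each |t| with the level-alpha t quantile on N - r degrees of freedom, while a Bonferroni correction would use level alpha divided by the number of pairs.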

11 Non-parametric Alternative Here we assume the same data structure as a one-way layout. The model: $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$. Here we do not assume underlying normality any more, but still assume equal variance and independence. The hypotheses: $H_0: \tau_1 = \tau_2 = \dots = \tau_k\ (= 0)$ versus $H_a$: at least one inequality.

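12 Procedure Rank all $N$ observations jointly from smallest to largest. Let $r_{ij}$ be the rank of $Y_{ij}$. For $i = 1, \dots, k$, define $R_i = \sum_j r_{ij}$, $\bar{R}_{i\cdot} = R_i / n_i$ and $\bar{R}_{\cdot\cdot} = (N+1)/2$. Compute $H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i (\bar{R}_{i\cdot} - \bar{R}_{\cdot\cdot})^2$. Reject $H_0$ if $H > h(\alpha, k, n_1, \dots, n_k)$, or if $H > \chi^2_{k-1,\alpha}$ (large-sample approximation).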
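The ranking procedure translates directly into code. The sketch below is a plain-Python illustration on made-up data, with hypothetical function names. Ties receive average ranks, but no tie correction is applied to H, which is an assumption beyond what the slide states.

```python
def midranks(values):
    """Joint ranks (1-based); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum_i n_i * (mean rank of group i - (N + 1) / 2)^2."""
    pooled = [y for g in groups for y in g]
    N = len(pooled)
    ranks = midranks(pooled)
    grand_mean_rank = (N + 1) / 2
    total, start = 0.0, 0
    for g in groups:
        n_i = len(g)
        mean_rank = sum(ranks[start:start + n_i]) / n_i
        total += n_i * (mean_rank - grand_mean_rank) ** 2
        start += n_i
    return 12 * total / (N * (N + 1))


# Made-up data: three fully separated groups of three observations.
groups = [[1.1, 2.3, 3.0], [4.2, 5.5, 6.1], [7.4, 8.8, 9.9]]
H = kruskal_wallis_h(groups)
print(H)
```

Here the mean ranks are 2, 5 and 8, giving H = 7.2. With samples this small the exact critical value h is preferable to the chi-square approximation on k - 1 = 2 degrees of freedom.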

13 Multiple Treatments in Microarrays For microarrays, most of the ideas from differential expression (DE) for 2 conditions extend fairly easily to multiple conditions. However, here we have multiplicity from two different aspects: the multiple conditions and the multiple genes. There does not appear to be any consensus on HOW to handle this. Most of what has been proposed uses empirical Bayes (EB) methods. However, one can perform ANOVA F tests or Kruskal-Wallis tests one gene at a time and rank the genes by the attribute of interest.
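The gene-at-a-time approach can be sketched as follows: compute a one-way ANOVA F statistic for each gene and sort the genes by it. The gene names and expression values are invented for illustration, and `f_statistic` is a hypothetical helper, not part of any package.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F ratio (MSModel / MSError) for one gene."""
    all_obs = [y for g in groups for y in g]
    N, r = len(all_obs), len(groups)
    grand = mean(all_obs)
    ms_model = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (r - 1)
    ms_error = sum((y - mean(g)) ** 2 for g in groups for y in g) / (N - r)
    return ms_model / ms_error


# Invented data: each gene has three conditions with three replicates.
expr = {
    "geneA": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    "geneB": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "geneC": [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
}

f_by_gene = {gene: f_statistic(groups) for gene, groups in expr.items()}
ranked = sorted(f_by_gene, key=f_by_gene.get, reverse=True)
print(ranked)
```

Ranking by F treats every gene independently; the empirical Bayes methods mentioned above instead pool variance information across genes.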

14 Linear Models Approach We will briefly discuss the linear model approach. Analyze all arrays together, combining information in an optimal way. Here we use combined estimation of precision. Extensible to arbitrarily complicated experiments. Design matrix: specifies the RNA targets used on the arrays. Contrast matrix: specifies which comparisons are of interest.
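To illustrate how a design and a contrast specification work together, here is a minimal sketch for one gene. It assumes a group-means parameterization, in which the design matrix has one 0/1 column per RNA target and no intercept, so the least-squares coefficients are simply the group means. The target names, values and helper functions are hypothetical and do not reproduce any package's API.

```python
from statistics import mean


def fit_group_means(values, labels, levels):
    """Least-squares coefficients for a 0/1 indicator design: one mean per level."""
    return [mean([v for v, lab in zip(values, labels) if lab == lev]) for lev in levels]


def apply_contrasts(coefs, contrasts):
    """Each contrast is a weight vector over the coefficients."""
    return {name: sum(w * c for w, c in zip(weights, coefs))
            for name, weights in contrasts.items()}


# Hypothetical single-gene log-expression values on six arrays.
levels = ["wt", "mut", "treated"]
labels = ["wt", "wt", "mut", "mut", "treated", "treated"]
values = [1.0, 3.0, 4.0, 6.0, 2.0, 2.0]

# Contrast "matrix": one row of weights per comparison of interest.
contrasts = {"mut-wt": [-1, 1, 0], "treated-wt": [-1, 0, 1]}

coefs = fit_group_means(values, labels, levels)
estimates = apply_contrasts(coefs, contrasts)
print(coefs, estimates)
```

Keeping the design (which targets are on which arrays) separate from the contrasts (which comparisons matter) means new comparisons can be added without refitting the model.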


16 Parallel inference for genes 10,000-40,000 linear models, one per gene. Curse of dimensionality: need to adjust for multiple testing, e.g., control the family-wise error rate (FWE) or the false discovery rate (FDR). Boon of parallelism: can borrow information from one gene to another.
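As one concrete way to adjust for multiple testing, the sketch below computes Bonferroni adjusted p-values (FWE control) and Benjamini-Hochberg adjusted p-values (FDR control). The slides do not name a specific FDR procedure, so Benjamini-Hochberg is an assumption here, and the p-values are made up.

```python
def bonferroni(pvals):
    """Family-wise adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Step-up FDR adjustment: p * m / rank, made monotone from the largest p down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.20]
fwe_adj = bonferroni(pvals)
fdr_adj = benjamini_hochberg(pvals)
print(fwe_adj)
print(fdr_adj)
```

The FDR-adjusted values are never larger than the Bonferroni ones, which is why FDR control is usually preferred when thousands of genes are tested.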


19 Estimating hyper-parameters Closed-form estimators with good properties are available: for $c_0$, in terms of quantiles of the $|\tilde{t}_g|$; for $s_0$ and $d_0$, in terms of the first two moments of $\log s^2$.

20 Within-array replicate spots Replicate spots of each gene on the same array; assume duplicates at regular spacing. Assume the spatial component of correlation between duplicates is the same for each gene. Estimate the spatial correlation from a consensus estimator across genes. Greatly improves estimation of precision.

21 How many genes are differentially expressed? Log-ratios don’t appear to be normally distributed, and this is hard to check. Log-ratios for different genes are correlated in an unknown way. The high level of multiple testing means that very small p-values are required, so distributional assumptions must hold in the extreme tail. Little opportunity for the usual CLT results to apply.

22 Ranking easier than testing If there were only one gene, a t-test would give a reliable p-value for judging whether the true log-ratio was zero. With many genes, computed p-values cannot be trusted (unless we have > 16 arrays). It is more realistic to rank the genes in order of evidence for differential expression.

23 LIMMA Package for R Linear models for microarray data. A software package for the R programming environment. Focus is differential expression, including: moderated t-statistics; methods for duplicate spots; classifying F-tests; stemmed heat diagrams. Available from www.bioconductor.org

24 Microarray time course experiments: types/features Typically short series: k = 4-10 time points for shorter series and 11-20 time points for longer series. Often irregularly spaced, with no or few (< 5) replications. Can be periodic, OR may have no particular pattern, as in developmental time courses.

25 Time Series Issues May be longitudinal, where mRNA samples at different times are extracted from the same unit (cell line, tissue or individual), but more commonly cross-sectional, where mRNA samples are from different units. Gene expression values at different time points may be correlated, especially in a longitudinal study, or when a common reference design is used for a cross-sectional study. At other times, the experimental design induces correlations in cross-sectional studies.

26 Issues… contd Two general types of hypotheses of interest: the one-sample (or one-class) problem: which genes are changing in time? And the 2 or >2 sample (or class) problem: which genes are changing differently in time across the samples (or classes)? Two broad types of mRNA samples: from cells or cell lines, which give reasonably repeatable responses within classes, and from whole organisms (mice, humans), where there is a lot of response variability within classes.

27 Analyzing time series data Generally, time is used as a FACTOR in microarray (MA) experiments with time series, and the contrasts of interest are defined. These are then tested using traditional ANOVA or EB methods.

