Lecture 6 Design Matrices and ANOVA and how this is done in LIMMA.

ANOVA: Some Examples
Is there a difference in the mean hourly wages for three different ethnic groups?
Is there a difference in the mean sugar content in five different brands of cereal?
Is there a difference between the mutant and wild-type versions of the organism?
Is there a dye effect, as well as a treatment effect?
For a time-course experiment, are there significant differences in gene expression at the different time points?

Model for ANOVA
The general linear model, which applies to ANOVA, regression, and ANCOVA, is written as:

Y   =   X β   +   ε
(n×1)  (n×p)(p×1)  (n×1)

This is the matrix formulation of the model.
Y: response vector (observed)
X: design matrix (observed)
β: parameter vector (to be estimated)
ε: error vector (unobserved randomness)
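As a quick numeric illustration (a Python/NumPy sketch rather than R, with made-up toy values for X and Y), the least-squares estimate of β solves the normal equations (X'X)β̂ = X'Y:

```python
import numpy as np

# Hypothetical toy data: n = 4 observations, p = 2 coefficients,
# two groups of two observations each.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])            # design matrix (n x p)
Y = np.array([3.0, 5.0, 10.0, 12.0])  # response vector (n x 1)

# Least-squares estimate of the parameter vector beta:
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # the two group means: [ 4. 11.]
```

With an indicator design matrix like this one, the estimated coefficients are simply the group means.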

How to Write a Design Matrix
Consider a data set where we are comparing 3 different fertilizers, A, B, and C. For each fertilizer we have two plots of land.
Data:

Plot  Fertilizer  Yield (tonnes)
1     A           12
2     A           15
3     B           21
4     B           18
5     C           10
6     C            9

Models: Cell Means Model
We can write this as: Y_ij = μ_i + ε_ij
This is the cell-means model.
The corresponding design matrix is:

X =
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1

Each row corresponds to a unit (a plot), and each column corresponds to a treatment (a fertilizer).
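To check the arithmetic, here is a Python/NumPy sketch (not limma) that fits the cell-means model to the fertilizer data above; the least-squares solution recovers the per-fertilizer mean yields:

```python
import numpy as np

# Cell-means design matrix for the fertilizer data:
# 3 treatments (A, B, C), 2 plots each; column i indicates fertilizer i.
X = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
Y = np.array([12, 15, 21, 18, 10, 9], dtype=float)

# Least-squares estimates of mu_1, mu_2, mu_3:
mu_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(mu_hat)  # per-fertilizer mean yields: [13.5 19.5  9.5]
```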

Model: Factor Effects Model
We can write this as: Y_ij = μ + τ_i + ε_ij
This is the factor-effects model: here we have an OVERALL mean μ, and the τ_i are the deviations of each treatment level/factor from the overall mean. We add the requirement that Σ τ_i = 0.
The corresponding design matrix is:

X =
1  1  0
1  1  0
1  0  1
1  0  1
1 -1 -1
1 -1 -1

Each row corresponds to a unit, each column to a parameter; the last treatment is expressed in terms of the other treatments (τ_3 = -τ_1 - τ_2).
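Again as a Python/NumPy sketch (the lecture's analyses use R), fitting the factor-effects coding to the same fertilizer data gives the grand mean and treatment deviations that sum to zero:

```python
import numpy as np

# Factor-effects design matrix for the fertilizer data: an intercept
# column plus sum-to-zero coding, so tau_3 = -(tau_1 + tau_2).
X = np.array([[1,  1,  0],
              [1,  1,  0],
              [1,  0,  1],
              [1,  0,  1],
              [1, -1, -1],
              [1, -1, -1]], dtype=float)
Y = np.array([12, 15, 21, 18, 10, 9], dtype=float)

mu, t1, t2 = np.linalg.solve(X.T @ X, X.T @ Y)  # (mu, tau_1, tau_2)
t3 = -(t1 + t2)                                  # implied by the constraint
print(mu, t1, t2, t3)
```

In this balanced design, μ is the grand mean (85/6 ≈ 14.17) and μ + τ_1 recovers the cell mean for fertilizer A (13.5), so the two parameterizations describe the same fit.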

Parameter Vectors
For the cell-means model: β' = (μ_1, μ_2, μ_3); H0: μ_1 = μ_2 = μ_3.
For the factor-effects model: β' = (μ, τ_1, τ_2); H0: τ_1 = τ_2 = 0 (equivalently, all τ_i = 0).

Usage
Most of statistics uses the factor-effects model, as it makes interpretation of the hypothesis easier: we test the null that all the treatment effects are 0. However, in LIMMA in R we will use the simpler cell-means model for design matrix construction, and we then need to define a contrast matrix.

LIMMA and Design Matrices
This is what the LIMMA documentation says about constructing design matrices: “The package limma uses an approach called linear models to analyse designed microarray experiments. This approach allows very general experiments to be analysed just as easily as a simple replicated experiment. The approach requires one or two matrices to be specified. The first is the design matrix which indicates in effect which RNA samples have been applied to each array. The second is the contrast matrix which specifies which comparisons you would like to make between the RNA samples. For very simple experiments, you may not need to specify the contrast matrix.”

More on Design Matrices
The philosophy of the approach is as follows. You have to start by fitting a linear model to your data which fully models the systematic part of your data. The model is specified by the design matrix. Each row of the design matrix corresponds to an array in your experiment and each column corresponds to a coefficient which is used to describe the RNA sources in your experiment. With Affymetrix or single-channel data, or with two-color data with a common reference, you will need as many coefficients as you have distinct RNA sources, no more and no less. With direct-design two-color data you will need one fewer coefficient than you have distinct RNA sources, unless you wish to estimate a dye effect for each gene, in which case the number of RNA sources and the number of coefficients will be the same. Any set of independent coefficients will do, provided they describe all your treatments. The main purpose of this step is to estimate the variability in the data; hence the systematic part needs to be modeled so it can be distinguished from random variation.

LIMMA: Contrasts
In practice, the requirement to have exactly as many coefficients as RNA sources is too restrictive in terms of the questions you might want to answer. You might be interested in more or fewer comparisons between the RNA sources. Hence the contrasts step is provided so that you can take the initial coefficients and compare them in as many ways as you want to answer your questions, regardless of how many or how few these might be.

Writing out Design and Contrast Matrices
Example 1: This is a one-factor ANOVA with 4 levels. The model is Y_ij = μ_i + ε_ij, i = 1,…,4, j = 1,…,3. Write out the contrast matrix if we were interested in comparing level 1 to level 2, and level 3 to the mean of levels 1 and 2.

Example 1: Design and Contrast Matrices
The design matrix has 12 rows (one per array) and columns μ1, μ2, μ3, μ4:

array   μ1  μ2  μ3  μ4
1-3      1   0   0   0
4-6      0   1   0   0
7-9      0   0   1   0
10-12    0   0   0   1

The contrast matrix C is chosen so that b = C'a, where a holds the fitted coefficients. For comparing level 1 to level 2, and level 3 to the mean of levels 1 and 2:

c1:    1    -1    0   0
c2:  -1/2  -1/2   1   0
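The contrast step is just a matrix product. A small Python/NumPy sketch with hypothetical fitted level means a shows how C' turns the coefficients into the two comparisons:

```python
import numpy as np

# Hypothetical fitted coefficients a: one mean per level (4 levels).
a = np.array([10.0, 14.0, 16.0, 9.0])

# Rows of C': level 1 vs level 2, then level 3 vs mean of levels 1 and 2.
Ct = np.array([[ 1.0, -1.0, 0.0, 0.0],
               [-0.5, -0.5, 1.0, 0.0]])

b = Ct @ a
print(b)  # [10-14, 16-(10+14)/2] = [-4.  4.]
```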

Example 2
This is a two-factor ANOVA with 3 levels for Factor A and 2 levels for Factor B. The model is Y_ij = α_i + β_j + ε_ij, i = 1,…,3, j = 1, 2. Write out the contrast matrix for comparing Factor A levels 2 and 3, and Factor B levels 1 and 2.

Example 2: Design and Contrast Matrix
The design matrix has one column per factor combination, in the order a1b1, a1b2, a2b1, a2b2, a3b1, a3b2:

array  a1b1  a1b2  a2b1  a2b2  a3b1  a3b2
1       1     0     0     0     0     0
2       0     1     0     0     0     0
3       0     0     1     0     0     0
4       0     0     0     1     0     0
5       0     0     0     0     1     0
6       0     0     0     0     0     1

Write out the contrast matrix for comparing:
Factor A, levels 2 and 3
Factor A, levels 1 and 3
Factor B, levels 1 and 2

Contrasts (each averaging over the levels of the other factor):
C1:   0     0    1/2   1/2  -1/2  -1/2
C2:  1/2   1/2    0     0   -1/2  -1/2
C3:  1/3  -1/3   1/3  -1/3   1/3  -1/3
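Here is the same computation as a Python/NumPy sketch, with hypothetical cell means for the six (A, B) combinations; each contrast row averages over the levels of the other factor before differencing:

```python
import numpy as np

# Hypothetical cell means in the column order
# a1b1, a1b2, a2b1, a2b2, a3b1, a3b2:
a = np.array([8.0, 10.0, 12.0, 14.0, 6.0, 4.0])

# Rows of C': A level 2 vs 3, A level 1 vs 3, B level 1 vs 2.
# Factor A contrasts average over B's 2 levels (weights 1/2);
# the Factor B contrast averages over A's 3 levels (weights 1/3).
Ct = np.array([[0.0,    0.0,   0.5,   0.5,  -0.5,  -0.5],
               [0.5,    0.5,   0.0,   0.0,  -0.5,  -0.5],
               [1/3,   -1/3,   1/3,  -1/3,   1/3,  -1/3]])

b = Ct @ a
print(b)  # [8.0, 4.0, -0.666...]
```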

Differential Expression for Factorial Designs: Design Matrices and Contrasts, Using R
Example, the Estrogen data set: let us consider the Estrogen data set, and look at how we use R to examine differential expression using design matrices.

Name      FileName      Target
Abs10.1   low10-1.cel   EstAbsent10
Abs10.2   low10-2.cel   EstAbsent10
Pres10.1  high10-1.cel  EstPresent10
Pres10.2  high10-2.cel  EstPresent10
Abs48.1   low48-1.cel   EstAbsent48
Abs48.2   low48-2.cel   EstAbsent48
Pres48.1  high48-1.cel  EstPresent48
Pres48.2  high48-2.cel  EstPresent48

Description of Experiment
There are 8 files in all, coming from a 2×2 factorial design: a design with 2 factors, each at 2 levels. The study measured changes in gene expression in breast cancer due to estrogen (two levels: presence and absence) at two time points (10 hr and 48 hr). The data for this experiment are available on the Bioconductor website.

Contrasts of Interest
It is of interest to compare:
1. the effect of estrogen at 10 hours (compare presence to absence at 10 hours),
2. the effect of estrogen at 48 hours (compare presence to absence at 48 hours),
3. the effect of time in the absence of estrogen (compare Absent 10 to Absent 48).

Targets File Method
There are different ways to do this in R. Let's use the Targets file method, as we did for the two-condition comparison before. First, put together a tab-delimited text file like the one above. I call it EstrogenTargets.txt; it describes a name, the file name, and the target containing the factor-level information.

Design Matrix Method
One way to do this in R (to me the simplest one in terms of design matrices) is to write a design matrix using the factor combinations, WITHOUT the intercept term. R (at least LIMMA) writes the design matrix as:

array     EstAbsent10  EstPresent10  EstAbsent48  EstPresent48
Abs10.1        1            0             0            0
Abs10.2        1            0             0            0
Pres10.1       0            1             0            0
Pres10.2       0            1             0            0
Abs48.1        0            0             1            0
Abs48.2        0            0             1            0
Pres48.1       0            0             0            1
Pres48.2       0            0             0            1

So our model is Y_g = X β_g + ε_g.

Contrast Matrix
Now, to define the contrasts we need the transformation b_g = C'a_g, so we define C' as:

C' =
-1   1   0   0
 0   0  -1   1
-1   0   1   0

This will define:
(EstPresent10 - EstAbsent10)
(EstPresent48 - EstAbsent48)
(EstAbsent48 - EstAbsent10)
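Putting the estrogen design and contrasts together as a Python/NumPy sketch (hypothetical coefficient values stand in for the per-gene fits that limma would produce):

```python
import numpy as np

# Design matrix for the eight estrogen arrays; columns in the order
# EstAbsent10, EstPresent10, EstAbsent48, EstPresent48.
design = np.array([[1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1],
                   [0, 0, 0, 1]], dtype=float)

# C' with one row per contrast of interest:
Ct = np.array([[-1, 1,  0, 0],   # EstPresent10 - EstAbsent10
               [ 0, 0, -1, 1],   # EstPresent48 - EstAbsent48
               [-1, 0,  1, 0]],  # EstAbsent48  - EstAbsent10
              dtype=float)

# Hypothetical fitted coefficients a_g (one mean per condition):
a_g = np.array([5.0, 7.0, 6.0, 9.0])
b_g = Ct @ a_g
print(b_g)  # [2. 3. 1.]
```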

In R Using the Targets File

library(limma)
targets <- readTargets("EstrogenTargets.txt")
design <- model.matrix(~ -1 + factor(targets$Target, levels=unique(targets$Target)))
colnames(design) <- unique(targets$Target)
numParameters <- ncol(design)
parameterNames <- colnames(design)
contrastMatrix <- matrix(c(-1,1,0,0, 0,0,-1,1, -1,0,1,0), nrow=ncol(design))

Using the Targets file is efficient if you know how R works, and you don't have to type in the matrix.

In R Using the Design Matrix Directly

design <- matrix(c(1,1,0,0,0,0,0,0,
                   0,0,1,1,0,0,0,0,
                   0,0,0,0,1,1,0,0,
                   0,0,0,0,0,0,1,1), nrow=8)
colnames(design) <- c("EstAbsent10","EstPresent10","EstAbsent48","EstPresent48")
contrastMatrix <- matrix(c(-1,1,0,0, 0,0,-1,1, -1,0,1,0), nrow=ncol(design))

Note that R fills matrices column by column, so each group of 8 values above becomes one column of the design matrix.

An Example for Optimal Designs
Suppose we have 12 arrays in a single-channel framework and 5 conditions that we want to compare. Because of the imbalance, it is harder to construct orthogonal designs here. Sometimes people use classes of designs that are already available and have properties like orthogonality. Designs in this class include Margolin designs (fewer than 6 conditions), Plackett-Burman designs, and other such designs.

Consider the following Margolin design, orthogonal for 6 conditions and 12 arrays; the design appears as the matrix A in the R code below.

What If I Have 5 Conditions?
We could drop one column and use the design matrix with the dropped column to preserve some optimality properties. The question is which column to drop. The following R code helps us decide whether to drop column 2, 3, or 4.

A <- matrix(c(1, 1, 1, 1, 1, 1, 1,
              1, 1,-1,-1,-1,-1, 1,
              1,-1, 1,-1,-1,-1, 1,
              1,-1,-1, 1, 1,-1, 1,
              1,-1,-1, 1,-1, 1, 1,
              1,-1,-1,-1, 1, 1, 1,
              1,-1,-1,-1,-1,-1,-1,
              1,-1, 1, 1, 1, 1,-1,
              1, 1,-1, 1, 1, 1,-1,
              1, 1, 1,-1,-1, 1,-1,
              1, 1, 1,-1, 1,-1,-1,
              1, 1, 1, 1,-1,-1,-1),
            nrow=12, byrow=TRUE)  # byrow=TRUE so each group of 7 is one row
> B <- t(A)
> C <- B %*% A
> D <- solve(C)
> det(D)
[1] ...e-07
> sum(diag(D))
[1] ...

> A1 <- A[,-2]
> A2 <- A[,-3]
> A4 <- A[,-4]
> A1t <- t(A1)
> A2t <- t(A2)
> A4t <- t(A4)
> a1ta1 <- A1t %*% A1
> a2ta2 <- A2t %*% A2
> a4ta4 <- A4t %*% A4
> b1 <- solve(a1ta1)
> b2 <- solve(a2ta2)
> b3 <- solve(a4ta4)
> aa1 <- sum(diag(b1))
> aa2 <- sum(diag(b2))
> aa4 <- sum(diag(b3))

Results from Dropping Columns
> aa1
[1] ...        (trace after dropping col 2)
> aa2
[1] ...        (trace after dropping col 3)
> aa4
[1] ...        (trace after dropping col 4)
> det(b1)
[1] ...e-06    (determinant after dropping col 2)
> det(b2)
[1] ...e-06    (determinant after dropping col 3)
> det(b3)
[1] ...e-06    (determinant after dropping col 4)
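The criteria being computed here are A-optimality (minimize the trace of (X'X)⁻¹) and D-optimality (minimize its determinant). A minimal Python/NumPy sketch with a small hypothetical orthogonal design shows the column-dropping comparison; in this toy case every dropped column gives the same criteria (because the design is fully orthogonal), whereas in the Margolin example above the candidates can differ:

```python
import numpy as np

# Hypothetical 4-run two-level design: intercept plus three
# mutually orthogonal +/-1 columns.
A = np.array([[1,  1,  1,  1],
              [1,  1, -1, -1],
              [1, -1,  1, -1],
              [1, -1, -1,  1]], dtype=float)

def criteria(X):
    """A-criterion (trace) and D-criterion (det) of (X'X)^-1."""
    inv = np.linalg.inv(X.T @ X)
    return np.trace(inv), np.linalg.det(inv)

# Evaluate each reduced design obtained by dropping one factor column.
results = {col: criteria(np.delete(A, col, axis=1)) for col in (1, 2, 3)}
for col, (tr, det) in results.items():
    print(col, tr, det)  # trace 0.75 and determinant 0.015625 for each
```

Smaller trace and determinant mean lower average and generalized variance of the coefficient estimates, which is why the R code above compares sum(diag(.)) and det(.) across the candidate reduced designs.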