Slide 1: Stats 760, Lecture 2 (© Department of Statistics 2001)
Slide 2: Agenda
- R formulation
- Matrix formulation
- Least squares fit
- Numerical details: QR decomposition
- R parameterisations: Treatment, Sum, Helmert
Slide 3: R formulation
- Regression model: y ~ x1 + x2 + x3
- Anova model: y ~ A + B (A, B factors)
- Model with both factors and continuous variables: y ~ A*B*x1 + A*B*x2
What do these mean? How do we interpret the output?
Slide 4: Regression model
Mean of observation: E(y_i) = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3
Estimate the β's by least squares, i.e. minimise
Σ_i (y_i - β_0 - β_1 x_i1 - β_2 x_i2 - β_3 x_i3)^2
Slide 5: Matrix formulation
Arrange the data into a matrix X and a vector y, then minimise
||y - Xβ||^2 = (y - Xβ)^T (y - Xβ)
Slide 6: Normal equations
The minimising b satisfies the normal equations
X^T X b = X^T y
Proof sketch: when β̂ satisfies the normal equations, ||y - Xb||^2 = ||y - Xβ̂||^2 + ||X(β̂ - b)||^2. The second term is non-negative, and zero when b = β̂.
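For illustration, here is a small R sketch (not from the original slides; the data are simulated) checking that solving the normal equations X^T X b = X^T y reproduces the coefficients lm reports:

## Simulated data; solve the normal equations directly and compare with lm()
set.seed(1)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + 0.5*x3 + rnorm(n)

X <- cbind(1, x1, x2, x3)            # design matrix with a column of 1's
b <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X b = X^T y
cbind(normal.eqns = b, lm = coef(lm(y ~ x1 + x2 + x3)))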
Slide 7: Solving the equations
We could calculate the matrix X^T X directly, but this is not very accurate (it is subject to roundoff errors). For example, when fitting polynomials this approach breaks down even for polynomials of fairly low degree.
It is better to use the "QR decomposition", which avoids calculating X^T X.
Slide 8: Solving the normal equations
Use the "QR decomposition" X = QR, where
- X is n x p and must have "full rank" (no column is a linear combination of the other columns)
- Q is n x p and "orthogonal" (i.e. Q^T Q = identity matrix)
- R is p x p and "upper triangular" (all elements below the diagonal are zero), with all diagonal elements positive, so its inverse exists
Slide 9: Solving using QR
X^T X = R^T Q^T Q R = R^T R
X^T y = R^T Q^T y
The normal equations reduce to R^T R b = R^T Q^T y.
Premultiply by the inverse of R^T to get R b = Q^T y.
This is a triangular system, easy to solve.
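A short sketch (again not from the slides; simulated data) of the same computation using R's qr(), qr.Q(), qr.R() and backsolve():

## Least squares via the QR decomposition (simulated data)
set.seed(1)
X <- cbind(1, x1 = rnorm(20), x2 = rnorm(20))
y <- drop(X %*% c(1, 2, -1) + rnorm(20))

qrX <- qr(X)
Q   <- qr.Q(qrX)                    # n x p orthogonal part
R   <- qr.R(qrX)                    # p x p upper triangular part
b   <- backsolve(R, t(Q) %*% y)     # solve R b = Q^T y
cbind(qr = b, lm = coef(lm(y ~ X - 1)))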
Slide 10: Solving a triangular system
For an upper triangular R, solve R b = z by back-substitution: the last equation gives b_p = z_p / r_pp; substituting back, each earlier component is
b_i = ( z_i - Σ_{j>i} r_ij b_j ) / r_ii
working from i = p-1 down to i = 1.
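A hand-rolled back-substitution in R, just to make the algorithm concrete (the matrix below is made up; in practice backsolve() does this for you):

## Back-substitution for an upper triangular system R b = z
back_sub <- function(R, z) {
  p <- length(z)
  b <- numeric(p)
  b[p] <- z[p] / R[p, p]                    # last equation first
  if (p > 1) {
    for (i in (p - 1):1) {                  # then work upwards
      b[i] <- (z[i] - sum(R[i, (i + 1):p] * b[(i + 1):p])) / R[i, i]
    }
  }
  b
}
R <- matrix(c(2, 0, 0, 1, 3, 0, 4, 5, 6), nrow = 3)   # upper triangular
z <- c(10, 11, 12)
cbind(back_sub(R, z), backsolve(R, z))       # the two columns should agree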
Slide 11: A refinement
We need Q^T y as well as R.
Solution: do the QR decomposition of the augmented matrix [X, y]. Its triangular factor contains R in the first p columns and a vector r = Q^T y at the top of the last column.
Thus, solve R b = r.
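A sketch of this refinement (simulated data, not from the slides): the triangular factor of [X, y] carries both R and r = Q^T y, and the signs cancel when solving, so the coefficients match lm's.

## QR of the augmented matrix [X, y] (simulated data)
set.seed(2)
X <- cbind(1, x = rnorm(15))
y <- drop(X %*% c(3, -2) + rnorm(15))

Raug <- qr.R(qr(cbind(X, y)))     # (p+1) x (p+1) upper triangular factor
p    <- ncol(X)
R    <- Raug[1:p, 1:p]
r    <- Raug[1:p, p + 1]          # equals Q^T y (up to consistent signs)
cbind(backsolve(R, r), coef(lm(y ~ X - 1)))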
Slide 12: What R has to do
When you run lm, R forms the matrix X from the model formula, then fits the model E(Y) = Xb.
Steps:
1. Extract X and Y from the data and the model formula
2. Do the QR decomposition
3. Solve the equations Rb = r
4. The solutions are the numbers reported in the summary
Slide 13: Forming X
When all variables are continuous, it's a no-brainer:
1. Start with a column of 1's
2. Add a column for each independent variable
It's a bit harder for factors.
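A quick illustration (made-up data, not from the slides) of the X that lm builds when all variables are continuous:

## Design matrix for continuous variables: a column of 1's plus the variables
d <- data.frame(y = rnorm(4), x1 = 1:4, x2 = c(2, 5, 3, 7))
model.matrix(y ~ x1 + x2, data = d)
##   (Intercept) x1 x2
## 1           1  1  2
## 2           1  2  5
## 3           1  3  3
## 4           1  4  7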
Slide 14: Factors: one-way anova
Consider the model y ~ a, where a is a factor with, say, 3 levels. In this case we
1. Start with a column of ones
2. Add a dummy variable for each level of the factor (3 in all), in the order of the factor levels
Problem: the matrix has 4 columns, but the first is the sum of the last 3, so the columns are not linearly independent.
Solution: reparametrize!
Slide 15: Reparametrizing
Let X_a be the last 3 columns (the 3 dummy variables).
Replace X_a by X_a C (i.e. X_a multiplied by C), where C is a 3 x 2 "contrast matrix" with the properties:
1. The columns of X_a C are linearly independent
2. The columns of X_a C are linearly independent of the column of 1's
In general, if a has k levels, C will be k x (k-1).
Slide 16: The "treatment" parametrization
Here C is the matrix
C = [ 0 0
      1 0
      0 1 ]
(You can see the matrix in the general case by typing contr.treatment(k) in R, where k is the number of levels.)
This is the default in R.
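For the 3-level case (not shown in the original slides), contr.treatment(3) should print the matrix above:

> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1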
Slide 17: Treatment parametrization (2)
The model is E[Y] = Xβ, where X has rows
1 0 0   (observations at level 1)
1 1 0   (observations at level 2)
1 0 1   (observations at level 3)
The effect of the reparametrization is to drop the first column of X_a, leaving the others unchanged.
Slide 18: Treatment parametrization (3)
Mean response at level 1 is β_0
Mean response at level 2 is β_0 + β_1
Mean response at level 3 is β_0 + β_2
Thus β_0 is interpreted as the baseline (level 1) mean.
The parameter β_1 is interpreted as the offset for level 2 (the difference between the level 2 and level 1 means).
The parameter β_2 is interpreted as the offset for level 3 (the difference between the level 3 and level 1 means).
Slide 19: The "sum" parametrization
Here C is the matrix
C = [  1  0
       0  1
      -1 -1 ]
(You can see the matrix in the general case by typing contr.sum(k) in R, where k is the number of levels.)
To get this in R you need to use the options function:
options(contrasts = c("contr.sum", "contr.poly"))
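For the 3-level case (not shown in the original slides), contr.sum(3) should give:

> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1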
Slide 20: Sum parametrization (2)
The model is E[Y] = Xβ, where X has rows
1  1  0   (observations at level 1)
1  0  1   (observations at level 2)
1 -1 -1   (observations at level 3)
The effect of this reparametrization is to drop the last column of X_a and to change the rows corresponding to the last level of a.
Slide 21: Sum parametrization (3)
Mean response at level 1 is β_0 + β_1
Mean response at level 2 is β_0 + β_2
Mean response at level 3 is β_0 - β_1 - β_2
Thus β_0 is interpreted as the average of the 3 means, the "overall mean".
The parameter β_1 is interpreted as the offset for level 1 (the difference between the level 1 mean and the overall mean).
The parameter β_2 is interpreted as the offset for level 2 (the difference between the level 2 mean and the overall mean).
The offset for level 3 is -β_1 - β_2.
Slide 22: The "Helmert" parametrization
Here C is the matrix
C = [ -1 -1
       1 -1
       0  2 ]
(You can see the matrix in the general case by typing contr.helmert(k) in R, where k is the number of levels.)
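For the 3-level case (not shown in the original slides), contr.helmert(3) should give:

> contr.helmert(3)
  [,1] [,2]
1   -1   -1
2    1   -1
3    0    2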
Slide 23: Helmert parametrization (2)
The model is E[Y] = Xβ, where X has rows
1 -1 -1   (observations at level 1)
1  1 -1   (observations at level 2)
1  0  2   (observations at level 3)
The effect of this reparametrization is to change all the rows and columns.
Slide 24: Helmert parametrization (3)
Mean response at level 1 is β_0 - β_1 - β_2
Mean response at level 2 is β_0 + β_1 - β_2
Mean response at level 3 is β_0 + 2 β_2
Thus β_0 is interpreted as the average of the 3 means, the "overall mean".
The parameter β_1 is interpreted as half the difference between the level 2 mean and the level 1 mean.
The parameter β_2 is interpreted as one third of the difference between the level 3 mean and the average of the level 1 and level 2 means.
Slide 25: Using R to calculate the relationship between β-parameters and means
Since E[Y] = Xβ, we have β = (X^T X)^-1 X^T E[Y].
Thus the matrix (X^T X)^-1 X^T gives the coefficients we need to find the β's from the μ's (the group means).
Slide 26: Example: one-way model
In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures. The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance.
Slide 27: Data
ratio  treatment
0.08   control         (+ 49 other control obs)
0.08   chloralhydrate  (+ 49 other chloralhydrate obs)
0.10   diazapan        (+ 49 other diazapan obs)
0.10   hydroquinone    (+ 49 other hydroquinone obs)
0.07   econidazole     (+ 49 other econidazole obs)
0.17   colchicine      (+ 49 other colchicine obs)
Slide 28: lm output
> cancer.lm <- lm(ratio ~ treatment, data = cancer.df)
> summary(cancer.lm)
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.26900    0.02037  13.207  < 2e-16 ***
treatmentcolchicine     0.17920    0.02881   6.221 1.69e-09 ***
treatmentcontrol       -0.03240    0.02881  -1.125    0.262
treatmentdiazapan       0.01180    0.02881   0.410    0.682
treatmenteconidazole   -0.00420    0.02881  -0.146    0.884
treatmenthydroquinone   0.04300    0.02881   1.493    0.137
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.144 on 294 degrees of freedom
Multiple R-Squared: 0.1903, Adjusted R-squared: 0.1766
F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12
Slide 29: Relationship between means and betas
> X <- model.matrix(cancer.lm)
> coef.mat <- solve(t(X) %*% X) %*% t(X)
> levels(cancer.df$treatment)
[1] "chloralhydrate" "colchicine" "control" "diazapan" "econidazole" "hydroquinone"
> cancer.df$treatment[c(1, 51, 101, 151, 201, 251)]
control chloralhydrate diazapan hydroquinone econidazole colchicine
Levels: chloralhydrate colchicine control diazapan econidazole hydroquinone
> round(50 * coef.mat[, c(1, 51, 101, 151, 201, 251)])
                      1 51 101 151 201 251
(Intercept)           0  1   0   0   0   0
treatmentcolchicine   0 -1   0   0   0   1
treatmentcontrol      1 -1   0   0   0   0
treatmentdiazapan     0 -1   1   0   0   0
treatmenteconidazole  0 -1   0   0   1   0
treatmenthydroquinone 0 -1   0   1   0   0
(One column is shown per treatment group; each group has 50 observations, hence the factor of 50. Each row shows how the corresponding β is built from the group means.)
Slide 30: Two factors: model y ~ a + b
To form X:
1. Start with a column of 1's
2. Add X_a C_a
3. Add X_b C_b
Slide 31: Two factors: model y ~ a * b
To form X:
1. Start with a column of 1's
2. Add X_a C_a
3. Add X_b C_b
4. Add X_a C_a : X_b C_b (every column of X_a C_a multiplied elementwise with every column of X_b C_b)
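A small illustration (made-up factors, not from the slides) of the columns model.matrix builds for a two-factor model with an interaction, using the default treatment contrasts:

## Design matrix for ~ a * b (treatment contrasts); output abbreviated
d <- expand.grid(a = factor(c("a1", "a2", "a3")), b = factor(c("b1", "b2")))
model.matrix(~ a * b, data = d)
##   (Intercept) aa2 aa3 bb2 aa2:bb2 aa3:bb2
## 1           1   0   0   0       0       0
## 2           1   1   0   0       0       0
## 3           1   0   1   0       0       0
## 4           1   0   0   1       0       0
## 5           1   1   0   1       1       0
## 6           1   0   1   1       0       1
## The last two columns are elementwise products of the a and b dummy columns.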
Slide 32: Two factors: example
Experiment to study weight gain in rats:
- The response is weight gain over a fixed time period
- This is modelled as a function of diet source (Beef, Cereal, Pork) and amount of feed (High, Low)
- See coursebook Section 4.4
Slide 33: Data
> diets.df
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5   107 Cereal   Low
6    49   Pork   Low
7   102   Beef  High
8    74 Cereal  High
9    79   Pork  High
10   76   Beef   Low
...
(60 observations in all)
Slide 34: Two factors: the model
If the (continuous) response depends on two categorical explanatory variables, we assume the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i-th level of A and the j-th level of B is μ_ij.
The other standard assumptions (equal variance, normality, independence) apply.
Slide 35: Diagrammatically...
               Source = Beef   Source = Cereal   Source = Pork
Level = High       μ_11            μ_12              μ_13
Level = Low        μ_21            μ_22              μ_23
Slide 36: Decomposition of the means
We usually want to split each "cell mean" into 4 terms:
- A term reflecting the overall baseline level of the response
- A term reflecting the effect of factor A (row effect)
- A term reflecting the effect of factor B (column effect)
- A term reflecting how A and B interact
Slide 37: Mathematically...
Overall baseline: μ_11 (the mean when both factors are at their baseline levels)
Effect of the i-th level of factor A (row effect): μ_i1 - μ_11 (the i-th level of A, at the baseline of B, expressed as a deviation from the overall baseline)
Effect of the j-th level of factor B (column effect): μ_1j - μ_11 (the j-th level of B, at the baseline of A, expressed as a deviation from the overall baseline)
Interaction: what's left over (see next slide)
Slide 38: Interactions
Each cell (except those in the first row and first column) has an interaction:
interaction = cell mean - baseline - row effect - column effect
If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B
- In mathematical terms, μ_ij - μ_i'j doesn't depend on j
Equivalently, the effect of changing levels of B is the same for all levels of A.
If the interactions are zero, the relationship between the factors and the response is simple.
Slide 39: Splitting up the mean: rats
The factors are level (amount of food) and source (diet).

Cell means:
              Beef   Cereal   Pork   Baseline col
High          100    85.9     99.5   100
Low           79.2   83.9     78.7   79.2
Baseline row  100    85.9     99.5   100

Split-up:
              Beef   Cereal   Pork   Row effect
High          *      *        *      *
Low           *      18.8     0      -20.8
Col effect    *      -14.1    -0.5   100

Cells marked * lie in the baseline (High) row or baseline (Beef) column; the 100 in the corner is the baseline. For example, the (Low, Cereal) cell mean splits up as
83.9 = 100 + (-20.8) + (-14.1) + 18.8  (baseline + row effect + column effect + interaction)
Slide 40: Fit model
> rats.lm <- lm(gain ~ source + level + source:level)
> summary(rats.lm)
Coefficients:
                        Estimate Std. Error  t value Pr(>|t|)
(Intercept)            1.000e+02  4.632e+00   21.589  < 2e-16 ***
sourceCereal          -1.410e+01  6.551e+00   -2.152  0.03585 *
sourcePork            -5.000e-01  6.551e+00   -0.076  0.93944
levelLow              -2.080e+01  6.551e+00   -3.175  0.00247 **
sourceCereal:levelLow  1.880e+01  9.264e+00    2.029  0.04736 *
sourcePork:levelLow   -3.052e-14  9.264e+00 -3.29e-15  1.00000
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 14.65 on 54 degrees of freedom
Multiple R-Squared: 0.2848, Adjusted R-squared: 0.2185
F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299
Slide 41: Fitting as a regression model
Note that with the treatment contrasts, this is equivalent to fitting a regression with dummy variables R2, C2, C3:
- R2 = 1 if the observation is in row 2 (level Low), zero otherwise
- C2 = 1 if the observation is in column 2 (source Cereal), zero otherwise
- C3 = 1 if the observation is in column 3 (source Pork), zero otherwise
The regression is Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3). (A sketch of this equivalence follows.)
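A brief sketch (not in the original slides) checking this equivalence in R, assuming the diets.df data frame from slide 33 has been read in:

## Dummy-variable regression vs the factor model (assumes diets.df is loaded)
R2 <- as.numeric(diets.df$level  == "Low")      # row 2: level Low
C2 <- as.numeric(diets.df$source == "Cereal")   # column 2: source Cereal
C3 <- as.numeric(diets.df$source == "Pork")     # column 3: source Pork

coef(lm(gain ~ R2 + C2 + C3 + I(R2 * C2) + I(R2 * C3), data = diets.df))
coef(lm(gain ~ source * level, data = diets.df))
## The estimates agree: R2 matches levelLow, C2 matches sourceCereal,
## C3 matches sourcePork, and the products match the interaction terms.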
Slide 42: Notations
For two factors A and B:
Baseline: μ = μ_11
A main effect: α_i = μ_i1 - μ_11
B main effect: β_j = μ_1j - μ_11
AB interaction: (αβ)_ij = μ_ij - μ_i1 - μ_1j + μ_11
Then μ_ij = μ + α_i + β_j + (αβ)_ij
Slide 43: Re-label cell means, in data order
               Source = Beef   Source = Cereal   Source = Pork
Level = High       μ_1             μ_2               μ_3
Level = Low        μ_4             μ_5               μ_6
Slide 44: Using R to interpret the parameters
> rats.df <- read.table(file.choose(), header = T)
> rats.lm <- lm(gain ~ source * level, data = rats.df)
> X <- model.matrix(rats.lm)
> coef.mat <- solve(t(X) %*% X) %*% t(X)
> round(10 * coef.mat[, 1:6])
                       1  2  3  4  5  6
(Intercept)            1  0  0  0  0  0
sourceCereal          -1  1  0  0  0  0
sourcePork            -1  0  1  0  0  0
levelLow              -1  0  0  1  0  0
sourceCereal:levelLow  1 -1  0 -1  1  0
sourcePork:levelLow    1  0 -1 -1  0  1
> rats.df[1:6, ]
  gain source level
1   73   Beef  High
2   98 Cereal  High
3   94   Pork  High
4   90   Beef   Low
5  107 Cereal   Low
6   49   Pork   Low
(The columns of coef.mat correspond to the cell means μ_1, ..., μ_6 of the previous slide, one representative observation per cell; each cell contains 10 observations, hence the factor of 10. The rows show how each β is built from the cell means.)
Slide 45: X matrix: details (first six rows)
  (Intercept) sourceCereal sourcePork levelLow sourceCereal:levelLow sourcePork:levelLow
1           1            0          0        0                     0                   0
2           1            1          0        0                     0                   0
3           1            0          1        0                     0                   0
4           1            0          0        1                     0                   0
5           1            1          0        1                     1                   0
6           1            0          1        1                     0                   1
The blocks are: the column of 1's, then X_a C_a (source), then X_b C_b (level), then X_a C_a : X_b C_b (the interaction columns).