Slide 1: Stats 760, Lecture 2 (© Department of Statistics 2001)
Slide 2: Agenda
- R formulation
- Matrix formulation
- Least squares fit
- Numerical details: QR decomposition
- R parameterisations: Treatment, Sum, Helmert
Slide 3: R formulation
- Regression model: y ~ x1 + x2 + x3
- Anova model: y ~ A + B (A, B factors)
- Model with both factors and continuous variables: y ~ A*B*x1 + A*B*x2
What do these mean? How do we interpret the output?
Slide 4: Regression model
Mean of observation: E(y_i) = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3
Estimate the β's by least squares, i.e. minimise
Σ_i (y_i - β_0 - β_1 x_i1 - β_2 x_i2 - β_3 x_i3)^2
Slide 5: Matrix formulation
Arrange the data into a matrix X and a vector y, then minimise
||y - Xβ||^2 = (y - Xβ)^T (y - Xβ)
Slide 6: Normal equations
The minimising b satisfies the normal equations
X^T X b = X^T y
Proof sketch: when β̂ satisfies the normal equations, ||y - Xb||^2 = ||y - Xβ̂||^2 + ||X(β̂ - b)||^2. The second term is non-negative, and zero when b = β̂.
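For illustration, here is a small R sketch (not from the original slides; the data are simulated) checking that solving the normal equations X^T X b = X^T y reproduces the coefficients lm reports:

## Simulated data; solve the normal equations directly and compare with lm()
set.seed(1)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + 0.5*x3 + rnorm(n)

X <- cbind(1, x1, x2, x3)            # design matrix with a column of 1's
b <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X b = X^T y
cbind(normal.eqns = b, lm = coef(lm(y ~ x1 + x2 + x3)))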
Slide 7: Solving the equations
We could calculate the matrix X^T X directly, but this is not very accurate (it is subject to roundoff errors). For example, when fitting polynomials this approach breaks down even for polynomials of fairly low degree.
It is better to use the "QR decomposition", which avoids calculating X^T X.
Slide 8: Solving the normal equations
Use the "QR decomposition" X = QR, where
- X is n x p and must have "full rank" (no column is a linear combination of the other columns)
- Q is n x p and "orthogonal" (i.e. Q^T Q = identity matrix)
- R is p x p and "upper triangular" (all elements below the diagonal are zero), with all diagonal elements positive, so its inverse exists
Slide 9: Solving using QR
X^T X = R^T Q^T Q R = R^T R
X^T y = R^T Q^T y
The normal equations reduce to R^T R b = R^T Q^T y.
Premultiply by the inverse of R^T to get R b = Q^T y.
This is a triangular system, easy to solve.
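A short sketch (again not from the slides; simulated data) of the same computation using R's qr(), qr.Q(), qr.R() and backsolve():

## Least squares via the QR decomposition (simulated data)
set.seed(1)
X <- cbind(1, x1 = rnorm(20), x2 = rnorm(20))
y <- drop(X %*% c(1, 2, -1) + rnorm(20))

qrX <- qr(X)
Q   <- qr.Q(qrX)                    # n x p orthogonal part
R   <- qr.R(qrX)                    # p x p upper triangular part
b   <- backsolve(R, t(Q) %*% y)     # solve R b = Q^T y
cbind(qr = b, lm = coef(lm(y ~ X - 1)))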
Slide 10: Solving a triangular system
For an upper triangular R, solve R b = z by back-substitution: the last equation gives b_p = z_p / r_pp; substituting back, each earlier component is
b_i = ( z_i - Σ_{j>i} r_ij b_j ) / r_ii
working from i = p-1 down to i = 1.
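A hand-rolled back-substitution in R, just to make the algorithm concrete (the matrix below is made up; in practice backsolve() does this for you):

## Back-substitution for an upper triangular system R b = z
back_sub <- function(R, z) {
  p <- length(z)
  b <- numeric(p)
  b[p] <- z[p] / R[p, p]                    # last equation first
  if (p > 1) {
    for (i in (p - 1):1) {                  # then work upwards
      b[i] <- (z[i] - sum(R[i, (i + 1):p] * b[(i + 1):p])) / R[i, i]
    }
  }
  b
}
R <- matrix(c(2, 0, 0, 1, 3, 0, 4, 5, 6), nrow = 3)   # upper triangular
z <- c(10, 11, 12)
cbind(back_sub(R, z), backsolve(R, z))       # the two columns should agree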
Slide 11: A refinement
We need Q^T y as well as R.
Solution: do the QR decomposition of the augmented matrix [X, y]. Its triangular factor contains R in the first p columns and a vector r = Q^T y at the top of the last column.
Thus, solve R b = r.
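A sketch of this refinement (simulated data, not from the slides): the triangular factor of [X, y] carries both R and r = Q^T y, and the signs cancel when solving, so the coefficients match lm's.

## QR of the augmented matrix [X, y] (simulated data)
set.seed(2)
X <- cbind(1, x = rnorm(15))
y <- drop(X %*% c(3, -2) + rnorm(15))

Raug <- qr.R(qr(cbind(X, y)))     # (p+1) x (p+1) upper triangular factor
p    <- ncol(X)
R    <- Raug[1:p, 1:p]
r    <- Raug[1:p, p + 1]          # equals Q^T y (up to consistent signs)
cbind(backsolve(R, r), coef(lm(y ~ X - 1)))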
Slide 12: What R has to do
When you run lm, R forms the matrix X from the model formula, then fits the model E(Y) = Xb.
Steps:
1. Extract X and Y from the data and the model formula
2. Do the QR decomposition
3. Solve the equations Rb = r
4. The solutions are the numbers reported in the summary
Slide 13: Forming X
When all variables are continuous, it's a no-brainer:
1. Start with a column of 1's
2. Add a column for each independent variable
It's a bit harder for factors.
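A quick illustration (made-up data, not from the slides) of the X that lm builds when all variables are continuous:

## Design matrix for continuous variables: a column of 1's plus the variables
d <- data.frame(y = rnorm(4), x1 = 1:4, x2 = c(2, 5, 3, 7))
model.matrix(y ~ x1 + x2, data = d)
##   (Intercept) x1 x2
## 1           1  1  2
## 2           1  2  5
## 3           1  3  3
## 4           1  4  7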
Slide 14: Factors: one-way anova
Consider the model y ~ a, where a is a factor with, say, 3 levels. In this case we
1. Start with a column of ones
2. Add a dummy variable for each level of the factor (3 in all), in the order of the factor levels
Problem: the matrix has 4 columns, but the first is the sum of the last 3, so the columns are not linearly independent.
Solution: reparametrize!
Slide 15: Reparametrizing
Let X_a be the last 3 columns (the 3 dummy variables).
Replace X_a by X_a C (i.e. X_a multiplied by C), where C is a 3 x 2 "contrast matrix" with the properties:
1. The columns of X_a C are linearly independent
2. The columns of X_a C are linearly independent of the column of 1's
In general, if a has k levels, C will be k x (k-1).
Slide 16: The "treatment" parametrization
Here C is the matrix
C = [ 0 0
      1 0
      0 1 ]
(You can see the matrix in the general case by typing contr.treatment(k) in R, where k is the number of levels.)
This is the default in R.
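For the 3-level case (not shown in the original slides), contr.treatment(3) should print the matrix above:

> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1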
Slide 17: Treatment parametrization (2)
The model is E[Y] = Xβ, where X has rows
1 0 0   (observations at level 1)
1 1 0   (observations at level 2)
1 0 1   (observations at level 3)
The effect of the reparametrization is to drop the first column of X_a, leaving the others unchanged.
Slide 18: Treatment parametrization (3)
Mean response at level 1 is β_0
Mean response at level 2 is β_0 + β_1
Mean response at level 3 is β_0 + β_2
Thus β_0 is interpreted as the baseline (level 1) mean.
The parameter β_1 is interpreted as the offset for level 2 (the difference between the level 2 and level 1 means).
The parameter β_2 is interpreted as the offset for level 3 (the difference between the level 3 and level 1 means).
Slide 19: The "sum" parametrization
Here C is the matrix
C = [  1  0
       0  1
      -1 -1 ]
(You can see the matrix in the general case by typing contr.sum(k) in R, where k is the number of levels.)
To get this in R you need to use the options function:
options(contrasts = c("contr.sum", "contr.poly"))
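For the 3-level case (not shown in the original slides), contr.sum(3) should give:

> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1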
Slide 20: Sum parametrization (2)
The model is E[Y] = Xβ, where X has rows
1  1  0   (observations at level 1)
1  0  1   (observations at level 2)
1 -1 -1   (observations at level 3)
The effect of this reparametrization is to drop the last column of X_a and to change the rows corresponding to the last level of a.
Slide 21: Sum parametrization (3)
Mean response at level 1 is β_0 + β_1
Mean response at level 2 is β_0 + β_2
Mean response at level 3 is β_0 - β_1 - β_2
Thus β_0 is interpreted as the average of the 3 means, the "overall mean".
The parameter β_1 is interpreted as the offset for level 1 (the difference between the level 1 mean and the overall mean).
The parameter β_2 is interpreted as the offset for level 2 (the difference between the level 2 mean and the overall mean).
The offset for level 3 is -β_1 - β_2.
Slide 22: The "Helmert" parametrization
Here C is the matrix
C = [ -1 -1
       1 -1
       0  2 ]
(You can see the matrix in the general case by typing contr.helmert(k) in R, where k is the number of levels.)
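For the 3-level case (not shown in the original slides), contr.helmert(3) should give:

> contr.helmert(3)
  [,1] [,2]
1   -1   -1
2    1   -1
3    0    2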
Slide 23: Helmert parametrization (2)
The model is E[Y] = Xβ, where X has rows
1 -1 -1   (observations at level 1)
1  1 -1   (observations at level 2)
1  0  2   (observations at level 3)
The effect of this reparametrization is to change all the rows and columns.
Slide 24: Helmert parametrization (3)
Mean response at level 1 is β_0 - β_1 - β_2
Mean response at level 2 is β_0 + β_1 - β_2
Mean response at level 3 is β_0 + 2 β_2
Thus β_0 is interpreted as the average of the 3 means, the "overall mean".
The parameter β_1 is interpreted as half the difference between the level 2 mean and the level 1 mean.
The parameter β_2 is interpreted as one third of the difference between the level 3 mean and the average of the level 1 and level 2 means.
Slide 25: Using R to calculate the relationship between β-parameters and means
Since E[Y] = Xβ, we have β = (X^T X)^-1 X^T E[Y].
Thus the matrix (X^T X)^-1 X^T gives the coefficients we need to find the β's from the μ's (the group means).
Slide 26: Example: one-way model
In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures. The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance.
Slide 27: Data
ratio  treatment
0.08   control         (+ 49 other control obs)
0.08   chloralhydrate  (+ 49 other chloralhydrate obs)
0.10   diazapan        (+ 49 other diazapan obs)
0.10   hydroquinone    (+ 49 other hydroquinone obs)
0.07   econidazole     (+ 49 other econidazole obs)
0.17   colchicine      (+ 49 other colchicine obs)
Slide 28: lm output
> cancer.lm <- lm(ratio ~ treatment, data = cancer.df)
> summary(cancer.lm)
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.26900    0.02037  13.207  < 2e-16 ***
treatmentcolchicine     0.17920    0.02881   6.221 1.69e-09 ***
treatmentcontrol       -0.03240    0.02881  -1.125    0.262
treatmentdiazapan       0.01180    0.02881   0.410    0.682
treatmenteconidazole   -0.00420    0.02881  -0.146    0.884
treatmenthydroquinone   0.04300    0.02881   1.493    0.137
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.144 on 294 degrees of freedom
Multiple R-Squared: 0.1903, Adjusted R-squared: 0.1766
F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12
Slide 29: Relationship between means and betas
> X <- model.matrix(cancer.lm)
> coef.mat <- solve(t(X) %*% X) %*% t(X)
> levels(cancer.df$treatment)
[1] "chloralhydrate" "colchicine" "control" "diazapan" "econidazole" "hydroquinone"
> cancer.df$treatment[c(1, 51, 101, 151, 201, 251)]
control chloralhydrate diazapan hydroquinone econidazole colchicine
Levels: chloralhydrate colchicine control diazapan econidazole hydroquinone
> round(50 * coef.mat[, c(1, 51, 101, 151, 201, 251)])
                      1 51 101 151 201 251
(Intercept)           0  1   0   0   0   0
treatmentcolchicine   0 -1   0   0   0   1
treatmentcontrol      1 -1   0   0   0   0
treatmentdiazapan     0 -1   1   0   0   0
treatmenteconidazole  0 -1   0   0   1   0
treatmenthydroquinone 0 -1   0   1   0   0
(One column is shown per treatment group; each group has 50 observations, hence the factor of 50. Each row shows how the corresponding β is built from the group means.)
Slide 30: Two factors: model y ~ a + b
To form X:
1. Start with a column of 1's
2. Add X_a C_a
3. Add X_b C_b
Slide 31: Two factors: model y ~ a * b
To form X:
1. Start with a column of 1's
2. Add X_a C_a
3. Add X_b C_b
4. Add X_a C_a : X_b C_b (every column of X_a C_a multiplied elementwise with every column of X_b C_b)
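A small illustration (made-up factors, not from the slides) of the columns model.matrix builds for a two-factor model with an interaction, using the default treatment contrasts:

## Design matrix for ~ a * b (treatment contrasts); output abbreviated
d <- expand.grid(a = factor(c("a1", "a2", "a3")), b = factor(c("b1", "b2")))
model.matrix(~ a * b, data = d)
##   (Intercept) aa2 aa3 bb2 aa2:bb2 aa3:bb2
## 1           1   0   0   0       0       0
## 2           1   1   0   0       0       0
## 3           1   0   1   0       0       0
## 4           1   0   0   1       0       0
## 5           1   1   0   1       1       0
## 6           1   0   1   1       0       1
## The last two columns are elementwise products of the a and b dummy columns.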
Slide 32: Two factors: example
Experiment to study weight gain in rats:
- The response is weight gain over a fixed time period
- This is modelled as a function of diet source (Beef, Cereal, Pork) and amount of feed (High, Low)
- See coursebook Section 4.4
Slide 33: Data
> diets.df
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5   107 Cereal   Low
6    49   Pork   Low
7   102   Beef  High
8    74 Cereal  High
9    79   Pork  High
10   76   Beef   Low
...
(60 observations in all)
Slide 34: Two factors: the model
If the (continuous) response depends on two categorical explanatory variables, we assume the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i-th level of A and the j-th level of B is μ_ij.
The other standard assumptions (equal variance, normality, independence) apply.
Slide 35: Diagrammatically...
               Source = Beef   Source = Cereal   Source = Pork
Level = High       μ_11            μ_12              μ_13
Level = Low        μ_21            μ_22              μ_23
Slide 36: Decomposition of the means
We usually want to split each "cell mean" into 4 terms:
- A term reflecting the overall baseline level of the response
- A term reflecting the effect of factor A (row effect)
- A term reflecting the effect of factor B (column effect)
- A term reflecting how A and B interact
Slide 37: Mathematically...
Overall baseline: μ_11 (the mean when both factors are at their baseline levels)
Effect of the i-th level of factor A (row effect): μ_i1 - μ_11 (the i-th level of A, at the baseline of B, expressed as a deviation from the overall baseline)
Effect of the j-th level of factor B (column effect): μ_1j - μ_11 (the j-th level of B, at the baseline of A, expressed as a deviation from the overall baseline)
Interaction: what's left over (see next slide)
Slide 38: Interactions
Each cell (except those in the first row and first column) has an interaction:
interaction = cell mean - baseline - row effect - column effect
If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B
- In mathematical terms, μ_ij - μ_i'j doesn't depend on j
Equivalently, the effect of changing levels of B is the same for all levels of A.
If the interactions are zero, the relationship between the factors and the response is simple.
Slide 39: Splitting up the mean: rats
The factors are level (amount of food) and source (diet).

Cell means:
              Beef   Cereal   Pork   Baseline col
High          100    85.9     99.5   100
Low           79.2   83.9     78.7   79.2
Baseline row  100    85.9     99.5   100

Split-up:
              Beef   Cereal   Pork   Row effect
High          *      *        *      *
Low           *      18.8     0      -20.8
Col effect    *      -14.1    -0.5   100

Cells marked * lie in the baseline (High) row or baseline (Beef) column; the 100 in the corner is the baseline. For example, the (Low, Cereal) cell mean splits up as
83.9 = 100 + (-20.8) + (-14.1) + 18.8  (baseline + row effect + column effect + interaction)
Slide 40: Fit model
> rats.lm <- lm(gain ~ source + level + source:level)
> summary(rats.lm)
Coefficients:
                        Estimate Std. Error  t value Pr(>|t|)
(Intercept)            1.000e+02  4.632e+00   21.589  < 2e-16 ***
sourceCereal          -1.410e+01  6.551e+00   -2.152  0.03585 *
sourcePork            -5.000e-01  6.551e+00   -0.076  0.93944
levelLow              -2.080e+01  6.551e+00   -3.175  0.00247 **
sourceCereal:levelLow  1.880e+01  9.264e+00    2.029  0.04736 *
sourcePork:levelLow   -3.052e-14  9.264e+00 -3.29e-15  1.00000
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 14.65 on 54 degrees of freedom
Multiple R-Squared: 0.2848, Adjusted R-squared: 0.2185
F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299
Slide 41: Fitting as a regression model
Note that with the treatment contrasts, this is equivalent to fitting a regression with dummy variables R2, C2, C3:
- R2 = 1 if the observation is in row 2 (level Low), zero otherwise
- C2 = 1 if the observation is in column 2 (source Cereal), zero otherwise
- C3 = 1 if the observation is in column 3 (source Pork), zero otherwise
The regression is Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3). (A sketch of this equivalence follows.)
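A brief sketch (not in the original slides) checking this equivalence in R, assuming the diets.df data frame from slide 33 has been read in:

## Dummy-variable regression vs the factor model (assumes diets.df is loaded)
R2 <- as.numeric(diets.df$level  == "Low")      # row 2: level Low
C2 <- as.numeric(diets.df$source == "Cereal")   # column 2: source Cereal
C3 <- as.numeric(diets.df$source == "Pork")     # column 3: source Pork

coef(lm(gain ~ R2 + C2 + C3 + I(R2 * C2) + I(R2 * C3), data = diets.df))
coef(lm(gain ~ source * level, data = diets.df))
## The estimates agree: R2 matches levelLow, C2 matches sourceCereal,
## C3 matches sourcePork, and the products match the interaction terms.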
Slide 42: Notations
For two factors A and B:
Baseline: μ = μ_11
A main effect: α_i = μ_i1 - μ_11
B main effect: β_j = μ_1j - μ_11
AB interaction: (αβ)_ij = μ_ij - μ_i1 - μ_1j + μ_11
Then μ_ij = μ + α_i + β_j + (αβ)_ij
Slide 43: Re-label cell means, in data order
               Source = Beef   Source = Cereal   Source = Pork
Level = High       μ_1             μ_2               μ_3
Level = Low        μ_4             μ_5               μ_6
Slide 44: Using R to interpret the parameters
> rats.df <- read.table(file.choose(), header = T)
> rats.lm <- lm(gain ~ source * level, data = rats.df)
> X <- model.matrix(rats.lm)
> coef.mat <- solve(t(X) %*% X) %*% t(X)
> round(10 * coef.mat[, 1:6])
                       1  2  3  4  5  6
(Intercept)            1  0  0  0  0  0
sourceCereal          -1  1  0  0  0  0
sourcePork            -1  0  1  0  0  0
levelLow              -1  0  0  1  0  0
sourceCereal:levelLow  1 -1  0 -1  1  0
sourcePork:levelLow    1  0 -1 -1  0  1
> rats.df[1:6, ]
  gain source level
1   73   Beef  High
2   98 Cereal  High
3   94   Pork  High
4   90   Beef   Low
5  107 Cereal   Low
6   49   Pork   Low
(The columns of coef.mat correspond to the cell means μ_1, ..., μ_6 of the previous slide, one representative observation per cell; each cell contains 10 observations, hence the factor of 10. The rows show how each β is built from the cell means.)
Slide 45: X matrix: details (first six rows)
  (Intercept) sourceCereal sourcePork levelLow sourceCereal:levelLow sourcePork:levelLow
1           1            0          0        0                     0                   0
2           1            1          0        0                     0                   0
3           1            0          1        0                     0                   0
4           1            0          0        1                     0                   0
5           1            1          0        1                     1                   0
6           1            0          1        1                     0                   1
The blocks are: the column of 1's, then X_a C_a (source), then X_b C_b (level), then X_a C_a : X_b C_b (the interaction columns).