© Department of Statistics 2001 Slide 1 Stats 760: Lecture 2

© Department of Statistics 2001 Slide 2 Agenda
- R formulation
- Matrix formulation
- Least squares fit
- Numerical details: QR decomposition
- R parameterisations: treatment, sum, Helmert

© Department of Statistics 2001 Slide 3 R formulation
Regression model: y ~ x1 + x2 + x3
Anova model: y ~ A + B (A, B factors)
Model with both factors and continuous variables: y ~ A*B*x1 + A*B*x2
What do these mean? How do we interpret the output?

© Department of Statistics 2001 Slide 4 Regression model
Mean of observation = β0 + β1x1 + β2x2 + β3x3
Estimate the β's by least squares, i.e. minimise the sum of squares
  Σi (yi - β0 - β1xi1 - β2xi2 - β3xi3)²

© Department of Statistics 2001 Slide 5 Matrix formulation
Arrange the data into a matrix X (one row per observation, one column per coefficient) and a vector y, so the model is E(y) = Xβ.
Then minimise ||y - Xβ||² = (y - Xβ)^T (y - Xβ)

© Department of Statistics 2001 Slide 6 Normal equations
The minimising b satisfies the normal equations X^T X b = X^T y.
Proof: ||y - Xb||² = ||y - Xβ̂||² + ||X(β̂ - b)||², where β̂ solves the normal equations; the second term is non-negative, and zero when b = β̂.
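A minimal sketch (with simulated data, so the variable names are illustrative, not from the lecture): the solution of the normal equations matches the coefficients lm() reports.

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + 0.5*x3 + rnorm(n)
X  <- cbind(1, x1, x2, x3)            # design matrix: intercept plus 3 predictors
b  <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X b = X^T y directly
cbind(b, coef(lm(y ~ x1 + x2 + x3)))  # the two columns agree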

© Department of Statistics 2001 Slide 7 Solving the equations
We could calculate the matrix X^T X directly, but this is not very accurate (subject to round-off errors). For example, when fitting polynomials this method breaks down even at quite low degree. Better to use the "QR decomposition", which avoids calculating X^T X.
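A small illustration of the round-off problem (a sketch, not from the original slides): the condition number of X^T X is roughly the square of that of X, so forming it magnifies rounding error; base R's kappa() estimates the condition number.

x  <- seq(1, 2, length.out = 50)
Xp <- outer(x, 0:6, "^")    # raw polynomial design: columns 1, x, x^2, ..., x^6
kappa(Xp)                   # already large
kappa(t(Xp) %*% Xp)         # roughly kappa(Xp)^2: much worse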

© Department of Statistics 2001 Slide 8 Solving the normal equations
Use the QR decomposition X = QR, where
- X is n x p and must have "full rank" (no column a linear combination of other columns)
- Q is n x p and "orthogonal" (i.e. Q^T Q = identity matrix)
- R is p x p and "upper triangular" (all elements below the diagonal zero), with all diagonal elements positive, so its inverse exists

© Department of Statistics 2001 Slide 9 Solving using QR
X^T X = R^T Q^T Q R = R^T R
X^T y = R^T Q^T y
The normal equations reduce to R^T R b = R^T Q^T y. Premultiply by the inverse of R^T to get
Rb = Q^T y
This is a triangular system, easy to solve.
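A sketch of the same computation with base R's QR tools, using the simulated X and y above (and assuming X has full column rank):

qrX <- qr(X)                      # Householder QR of the design matrix
R   <- qr.R(qrX)                  # the p x p upper-triangular factor
qty <- qr.qty(qrX, y)[1:ncol(X)]  # first p elements of Q^T y
b   <- backsolve(R, qty)          # solve the triangular system R b = Q^T y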

© Department of Statistics 2001 Slide 10 Solving a triangular system
Because R is upper triangular, the last equation involves only the last element of b: solve for it first, then substitute upwards one equation at a time (back-substitution).
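A minimal sketch of back-substitution (backsub is a hypothetical helper written out for clarity; base R's backsolve() does the same job in compiled code):

backsub <- function(R, z) {
  p <- length(z)
  b <- numeric(p)
  for (i in p:1) {                                       # start from the bottom row
    s <- z[i]
    if (i < p) s <- s - sum(R[i, (i+1):p] * b[(i+1):p])  # subtract the known terms
    b[i] <- s / R[i, i]                                  # divide by the diagonal element
  }
  b
}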

© Department of Statistics 2001 Slide 11 A refinement
We need Q^T y. Solution: do the QR decomposition of the augmented matrix [X, y]. The top-left p x p block of its R factor is R, and the first p elements of its last column give r = Q^T y. Thus, solve Rb = r.
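A sketch of this refinement (again for a full-rank X and response y already in the workspace):

p    <- ncol(X)
Raug <- qr.R(qr(cbind(X, y)))                        # QR decomposition of [X, y]
b    <- backsolve(Raug[1:p, 1:p], Raug[1:p, p + 1])  # solve R b = r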

© Department of Statistics 2001 Slide 12 What R has to do
When you run lm, R forms the matrix X from the model formula, then fits the model E(Y) = Xb. Steps:
1. Extract X and Y from the data and the model formula
2. Do the QR decomposition
3. Solve the equations Rb = r
4. The solutions are the numbers reported in the summary

© Department of Statistics 2001 Slide 13 Forming X
When all variables are continuous, it's a no-brainer:
1. Start with a column of 1's
2. Add columns corresponding to the independent variables
It's a bit harder for factors.

© Department of Statistics 2001 Slide 14 Factors: one-way anova
Consider the model y ~ a, where a is a factor having, say, 3 levels. In this case, we
1. Start with a column of ones
2. Add a dummy variable for each level of the factor (3 in all), in the order of the factor levels
Problem: the matrix has 4 columns, but the first is the sum of the last 3, so the columns are not linearly independent.
Solution: reparametrize!

© Department of Statistics 2001 Slide 15 Reparametrizing
Let X_a be the last 3 columns (the 3 dummy variables). Replace X_a by X_a C (i.e. X_a multiplied by C), where C is a 3 x 2 "contrast matrix" with the properties:
1. The columns of X_a C are linearly independent
2. The columns of X_a C are linearly independent of the column of 1's
In general, if a has k levels, C will be k x (k-1).

© Department of Statistics 2001 Slide 16 The "treatment" parametrization
Here C is the matrix
  0 0
  1 0
  0 1
(You can see the matrix in the general case by typing contr.treatment(k) in R, where k is the number of levels.) This is the default in R.
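For example, with k = 3 (the matrix on this slide):

> contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1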

© Department of Statistics 2001 Slide 17 Treatment parametrization (2)
The model is E[Y] = Xβ, where X has one block of rows per level:
  1 0 0   (observations at level 1)
  1 1 0   (observations at level 2)
  1 0 1   (observations at level 3)
The effect of the reparametrization is to drop the first column of X_a, leaving the others unchanged.

© Department of Statistics 2001 Slide 18 Treatment parametrization (3)
Mean response at level 1 is β0
Mean response at level 2 is β0 + β1
Mean response at level 3 is β0 + β2
Thus, β0 is interpreted as the baseline (level 1) mean. The parameter β1 is interpreted as the offset for level 2 (difference between levels 2 and 1), and β2 as the offset for level 3 (difference between levels 3 and 1).

© Department of Statistics 2001 Slide 19 The "sum" parametrization
Here C is the matrix
   1  0
   0  1
  -1 -1
(You can see the matrix in the general case by typing contr.sum(k) in R, where k is the number of levels.)
To get this in R, you need to use the options function: options(contrasts=c("contr.sum", "contr.poly"))
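Again with k = 3:

> options(contrasts = c("contr.sum", "contr.poly"))
> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1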

© Department of Statistics 2001 Slide 20 Sum parametrization (2)
The model is E[Y] = Xβ, where X has one block of rows per level:
  1  1  0   (observations at level 1)
  1  0  1   (observations at level 2)
  1 -1 -1   (observations at level 3)
The effect of this reparametrization is to drop the last column of X_a, and change the rows corresponding to the last level of a.

© Department of Statistics 2001 Slide 21 Sum parameterization (3)
Mean response at level 1 is β0 + β1
Mean response at level 2 is β0 + β2
Mean response at level 3 is β0 - β1 - β2
Thus, β0 is interpreted as the average of the 3 means, the "overall mean". The parameter β1 is interpreted as the offset for level 1 (difference between level 1 and the overall mean), and β2 as the offset for level 2 (difference between level 2 and the overall mean). The offset for level 3 is -β1 - β2.

© Department of Statistics 2001 Slide 22 The "Helmert" parametrization
Here C is the matrix
  -1 -1
   1 -1
   0  2
(You can see the matrix in the general case by typing contr.helmert(k) in R, where k is the number of levels.)
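And with k = 3:

> contr.helmert(3)
  [,1] [,2]
1   -1   -1
2    1   -1
3    0    2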

© Department of Statistics 2001 Slide 23 Helmert parametrization (2)
The model is E[Y] = Xβ, where X has one block of rows per level:
  1 -1 -1   (observations at level 1)
  1  1 -1   (observations at level 2)
  1  0  2   (observations at level 3)
The effect of this reparametrization is to change all the rows and columns.

© Department of Statistics 2001 Slide 24 Helmert parametrization (3)
Mean response at level 1 is β0 - β1 - β2
Mean response at level 2 is β0 + β1 - β2
Mean response at level 3 is β0 + 2β2
Thus, β0 is interpreted as the average of the 3 means, the "overall mean". The parameter β1 is interpreted as half the difference between the level 2 mean and the level 1 mean, and β2 as one third of the difference between the level 3 mean and the average of the level 1 and 2 means.

© Department of Statistics 2001 Slide 25 Using R to calculate the relationship between β-parameters and means
Since E[Y] = Xβ, applying the least-squares formula to the vector of means gives β = (X^T X)^-1 X^T μ. Thus, the matrix (X^T X)^-1 X^T gives the coefficients we need to find the β's from the μ's.

© Department of Statistics 2001 Slide 26 Example: One way model
In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures. The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance.

© Department of Statistics 2001 Slide 27 Data
ratio  treatment
0.08   control          (+ 49 other control obs)
0.08   chloralhydrate   (+ 49 other chloralhydrate obs)
0.10   diazapan         (+ 49 other diazapan obs)
0.10   hydroquinone     (+ 49 other hydroquinone obs)
0.07   econidazole      (+ 49 other econidazole obs)
0.17   colchicine       (+ 49 other colchicine obs)

© Department of Statistics 2001 Slide 28 > cancer.lm<-lm(ratio ~ treatment,data=cancer.df) > summary(cancer.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) < 2e-16 *** treatmentcolchicine e-09 *** treatmentcontrol treatmentdiazapan treatmenteconidazole treatmenthydroquinone Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: on 294 degrees of freedom Multiple R-Squared: , Adjusted R-squared: F-statistic: on 5 and 294 DF, p-value: 3.897e-12 lm output

© Department of Statistics 2001 Slide 29 Relationship between means and betas
X <- model.matrix(cancer.lm)
coef.mat <- solve(t(X)%*%X) %*% t(X)
> levels(cancer.df$treatment)
[1] "chloralhydrate" "colchicine" "control" "diazapan" "econidazole" "hydroquinone"
> cancer.df$treatment[c(1,51,101,151,201,251)]
[1] control chloralhydrate diazapan hydroquinone econidazole colchicine
Levels: chloralhydrate colchicine control diazapan econidazole hydroquinone
> round(50*coef.mat[,c(1,51,101,151,201,251)])
                      1 51 101 151 201 251
(Intercept)           0  1   0   0   0   0
treatmentcolchicine   0 -1   0   0   0   1
treatmentcontrol      1 -1   0   0   0   0
treatmentdiazapan     0 -1   1   0   0   0
treatmenteconidazole  0 -1   0   0   1   0
treatmenthydroquinone 0 -1   0   1   0   0
So the intercept is the average of the 50 chloralhydrate (baseline) observations, and each treatment coefficient is that treatment's average minus the chloralhydrate average.

© Department of Statistics 2001 Slide 30 Two factors: model y ~ a + b
To form X:
1. Start with a column of 1's
2. Add X_a C_a
3. Add X_b C_b

© Department of Statistics 2001 Slide 31 Two factors: model y ~ a * b
To form X:
1. Start with a column of 1's
2. Add X_a C_a
3. Add X_b C_b
4. Add X_a C_a : X_b C_b (every column of X_a C_a multiplied elementwise with every column of X_b C_b)
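A minimal sketch (with made-up two- and three-level factors) showing that model.matrix() builds exactly these columns under the default treatment contrasts, the interaction columns being elementwise products of the main-effect columns:

a <- factor(rep(c("a1", "a2"), each = 3))
b <- factor(rep(c("b1", "b2", "b3"), times = 2))
model.matrix(~ a * b)   # columns: (Intercept), aa2, bb2, bb3, aa2:bb2, aa2:bb3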

© Department of Statistics 2001 Slide 32 Two factors: example
Experiment to study weight gain in rats:
- The response is weight gain over a fixed time period
- This is modelled as a function of diet (Beef, Cereal, Pork) and amount of feed (High, Low)
- See coursebook Section 4.4

© Department of Statistics 2001 Slide 33 Data
> diets.df
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5       Cereal   Low
6    49   Pork   Low
7          Beef  High
8    74 Cereal  High
9    79   Pork  High
10         Beef   Low
... 60 observations in all

© Department of Statistics 2001 Slide 34 Two factors: the model
If the (continuous) response depends on two categorical explanatory variables, then we assume that the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i-th level of A and the j-th level of B is μij. The other standard assumptions (equal variance, normality, independence) apply.

© Department of Statistics 2001 Slide 35 Diagrammatically…
              Source = Beef   Source = Cereal   Source = Pork
Level = High      μ11             μ12               μ13
Level = Low       μ21             μ22               μ23

© Department of Statistics 2001 Slide 36 Decomposition of the means
We usually want to split each "cell mean" up into 4 terms:
- A term reflecting the overall baseline level of the response
- A term reflecting the effect of factor A (row effect)
- A term reflecting the effect of factor B (column effect)
- A term reflecting how A and B interact

© Department of Statistics 2001 Slide 37 Mathematically…
Overall baseline: μ11 (mean when both factors are at their baseline levels)
Effect of the i-th level of factor A (row effect): μi1 - μ11 (the i-th level of A, at the baseline of B, expressed as a deviation from the overall baseline)
Effect of the j-th level of factor B (column effect): μ1j - μ11 (the j-th level of B, at the baseline of A, expressed as a deviation from the overall baseline)
Interaction: what's left over (see next slide)

© Department of Statistics 2001 Slide 38 Interactions
Each cell (except those in the first row and column) has an interaction:
interaction = cell mean - baseline - row effect - column effect
If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B; in mathematical terms, μij - μi'j doesn't depend on j. Equivalently, the effect of changing levels of B is the same for all levels of A. If the interactions are zero, the relationship between the factors and the response is simple.

© Department of Statistics 2001 Slide 39 Splitting up the mean: rats
The factors are level (amount of food) and source (diet). Cell means:
              Beef   Cereal   Pork
High         100.0    85.9    99.5
Low           79.2    83.9    78.7
Split-up: baseline 100.0 (Beef, High); row effect for Low = -20.8; column effects: Cereal = -14.1, Pork = -0.5; interactions: Cereal:Low = 18.8, Pork:Low = 0.0.
For example, 83.9 = 100 + (-20.8) + (-14.1) + 18.8 (baseline + row effect + column effect + interaction).

© Department of Statistics 2001 Slide 40 Fit model
> rats.lm <- lm(gain ~ source + level + source:level)
> summary(rats.lm)
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           1.000e+02                     < 2e-16 ***
sourceCereal         -1.410e+01                            *
sourcePork           -5.000e-01
levelLow             -2.080e+01                           **
sourceCereal:levelLow 1.880e+01                            *
sourcePork:levelLow
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error:        on 54 degrees of freedom
Multiple R-Squared:       , Adjusted R-squared:
F-statistic: 4.3 on 5 and 54 DF, p-value:

© Department of Statistics 2001 Slide 41 Fitting as a regression model
Note that, using the treatment contrasts, this is equivalent to fitting a regression with dummy variables R2, C2, C3:
R2 = 1 if the observation is in row 2, zero otherwise
C2 = 1 if the observation is in column 2, zero otherwise
C3 = 1 if the observation is in column 3, zero otherwise
The regression is Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3)

© Department of Statistics 2001 Slide 42 Notations
For two factors A and B:
Baseline: μ = μ11
A main effect: αi = μi1 - μ11
B main effect: βj = μ1j - μ11
AB interaction: (αβ)ij = μij - μi1 - μ1j + μ11
Then μij = μ + αi + βj + (αβ)ij
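A sketch of this decomposition computed from the rat cell means of Slide 39 (mu is entered by hand here):

mu <- rbind(High = c(Beef = 100.0, Cereal = 85.9, Pork = 99.5),
            Low  = c(Beef =  79.2, Cereal = 83.9, Pork = 78.7))
base  <- mu[1, 1]                                    # baseline mu_11 (Beef, High)
alpha <- mu[, 1] - base                              # row (level) main effects
beta  <- mu[1, ] - base                              # column (source) main effects
ab    <- sweep(sweep(mu - base, 1, alpha), 2, beta)  # interactions: what's left over
base + alpha["Low"] + beta["Cereal"] + ab["Low", "Cereal"]   # recovers 83.9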

© Department of Statistics 2001 Slide 43 Re-label cell means, in data order
              Source = Beef   Source = Cereal   Source = Pork
Level = High      μ1              μ2                μ3
Level = Low       μ4              μ5                μ6

© Department of Statistics 2001 Slide 44 Using R to interpret parameters
> rats.df <- read.table(file.choose(), header=T)
> rats.lm <- lm(gain ~ source*level, data=rats.df)
> X <- model.matrix(rats.lm)
> coef.mat <- solve(t(X)%*%X) %*% t(X)
> round(10*coef.mat[,1:6])
                       1  2  3  4  5  6
(Intercept)            1  0  0  0  0  0
sourceCereal          -1  1  0  0  0  0
sourcePork            -1  0  1  0  0  0
levelLow              -1  0  0  1  0  0
sourceCereal:levelLow  1 -1  0 -1  1  0
sourcePork:levelLow    1  0 -1 -1  0  1
> rats.df[1:6,]
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5       Cereal   Low
6    49   Pork   Low
Each row shows how a β is built from the cell means (each cell mean being the average of its 10 observations): e.g. the intercept is the (Beef, High) mean μ1, and sourceCereal:levelLow = μ5 - μ2 - μ4 + μ1.

© Department of Statistics 2001 Slide 45 X matrix: details (first six rows)
  (Intercept) sourceCereal sourcePork levelLow sourceCereal:levelLow sourcePork:levelLow
1      1           0           0         0              0                   0
2      1           1           0         0              0                   0
3      1           0           1         0              0                   0
4      1           0           0         1              0                   0
5      1           1           0         1              1                   0
6      1           0           1         1              0                   1
Column blocks: the column of 1's; X_a C_a (sourceCereal, sourcePork); X_b C_b (levelLow); X_a C_a : X_b C_b (the interaction columns).