STATS 330, Lecture 17: Factors

Presentation transcript:

STATS 330: Lecture 17

Factors
- In the models discussed so far, all explanatory variables have been numeric
- Now we want to incorporate categorical variables into our models
- In R, categorical variables are called factors
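
As a quick illustration (an addition, not part of the original slides): character data can be coerced to a factor, and the result inspected. The vector x below is hypothetical.

> x <- c("slow", "slow", "medium", "fast")  # character data
> f <- factor(x)                            # coerce to a factor
> is.factor(f)
[1] TRUE
> levels(f)
[1] "fast"   "medium" "slow"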

Example
- Consider an experiment to measure the rate of metal removal in a machining process on a lathe
- The rate depends on the speed setting of the lathe (fast, medium or slow, a categorical measurement) and on the hardness of the material being machined (a continuous measurement)

Data
[Table: the data frame metal.df, 15 observations of hardness, setting and rate, with five runs at each of the slow, medium and fast settings. The numeric values have not survived in this transcript.]
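
For readers who want something runnable, here is a sketch of a data frame with the same shape. The hardness and rate numbers are invented purely for illustration; they are not the original data.

> metal.df <- data.frame(
+   hardness = rep(c(120, 140, 160, 180, 200), times = 3),   # invented values
+   setting  = factor(rep(c("slow", "medium", "fast"), each = 5)),
+   rate     = c(55, 60, 66, 71, 77,     # slow runs (invented)
+                68, 73, 80, 85, 91,     # medium runs (invented)
+                80, 86, 93, 98, 100))   # fast runs (invented)

Note that since R 4.0, character columns are no longer converted to factors automatically, so setting is wrapped in factor() explicitly here.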

[Plot: the data, rate versus hardness, plotted separately for each speed setting.]

Model
A model consisting of 3 parallel lines seems appropriate:

  Fast:   rate = α_F + β × hardness + ε
  Medium: rate = α_M + β × hardness + ε
  Slow:   rate = α_S + β × hardness + ε

Note the same slope β (i.e. parallel lines) but different intercepts.

Baseline version
We can regard the fast setting as a baseline and express the other settings as "baseline plus offsets":

  α_F = α          (baseline)
  α_M = α + δ_M    (offset δ_M for the medium line)
  α_S = α + δ_S    (offset δ_S for the slow line)

Baseline version (2)
We can then write the model as

  Fast:   rate = α + β × hardness + ε
  Medium: rate = α + δ_M + β × hardness + ε
  Slow:   rate = α + δ_S + β × hardness + ε

"Deviation from mean" version
Now let μ be the mean of α_F, α_M and α_S (the mean of the intercepts). Define

  a_F = α_F - μ    (offset of the "fast" line intercept)
  a_M = α_M - μ
  a_S = α_S - μ

"Deviation from mean" version (2)
Then

  Fast:   rate = μ + a_F + β × hardness + ε
  Medium: rate = μ + a_M + β × hardness + ε
  Slow:   rate = μ + a_S + β × hardness + ε

Thus μ is now the "average" intercept, and there are 3 offsets, one for each line. The 3 offsets add to zero. This is the form used in the Stage 2 course.
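
As an aside (not on the original slide), R can fit this sum-to-zero parametrization directly by replacing the default treatment contrasts with contr.sum. A minimal sketch, assuming metal.df as above:

> fit.sum <- lm(rate ~ C(setting, contr.sum) + hardness, data = metal.df)
> coef(fit.sum)   # (Intercept) is mu; two offsets are printed,
                  # and the third is minus their sum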

Dummy variables
Back to the baseline form: we can combine the 3 "baseline" equations into one by using "dummy variables". Define

  med  = 1 if setting = "medium" and 0 otherwise
  slow = 1 if setting = "slow" and 0 otherwise

Then we can write the model as

  rate = α + δ_M × med + δ_S × slow + β × hardness + ε
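
Incidentally (an addition, not from the slides), R constructs exactly these dummy variables internally; you can inspect them with model.matrix:

> model.matrix(~ setting + hardness, data = metal.df)
  # columns (Intercept), settingmedium, settingslow, hardness:
  # settingmedium and settingslow are the med and slow dummies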

Fitting
The model can be fitted as usual using lm:

> med <- ifelse(metal.df$setting == "medium", 1, 0)
> slow <- ifelse(metal.df$setting == "slow", 1, 0)
> summary(lm(rate ~ med + slow + hardness, data = metal.df))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     *
med                                           ***
slow                                     e-07 ***
hardness                                 e-09 ***

Fitting (2)
Thus the baseline ("fast") line has intercept given by the (Intercept) estimate, the "medium" line has intercept (Intercept) + (coefficient of med), and the "slow" line has intercept (Intercept) + (coefficient of slow).
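
A small sketch (an addition) of how those three intercepts can be extracted from the fitted model in R:

> fit1 <- lm(rate ~ med + slow + hardness, data = metal.df)
> b <- coef(fit1)
> c(fast   = b[["(Intercept)"]],
+   medium = b[["(Intercept)"]] + b[["med"]],
+   slow   = b[["(Intercept)"]] + b[["slow"]])   # the three intercepts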

[Plot: the three fitted parallel lines, with the fast line as the baseline and the offsets δ_M and δ_S marking the medium and slow lines.]

Fitting (3)
Making dummy variables is a pain. Fortunately R allows us to write

> summary(lm(rate ~ setting + hardness))

              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                       *
settingmedium                                   ***
settingslow                                e-07 ***
hardness                                   e-09 ***

and get the same result, provided the variable setting is a factor.
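
To see the coding R used here (an addition, not on the slide), inspect the factor's contrasts; this assumes setting is a factor with levels fast, medium, slow:

> contrasts(metal.df$setting)
       medium slow
fast        0    0
medium      1    0
slow        0    1

These are "treatment" contrasts: fast is the baseline, and the two columns are exactly the med and slow dummy variables.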

Factors
- Since the data for setting in the input data was character data, the variable setting was automatically recognized as a factor
- In fact the 3 settings were 1000, 1200 and 1400 rpm. What would happen if the input data had used these (numerical) values?
- Answer: the lm function would have assumed that setting was a continuous variable and fitted a plane, not 3 parallel lines.
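
A quick way to catch this mistake (an addition): check how R is treating a variable before fitting, here using the numeric rpm vector defined on the next slide:

> is.factor(metal.df$setting)   # TRUE: fitted as three levels
> is.factor(rpm)                # FALSE: numeric, fitted as a single slope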

Factors (2)
> rpm <- rep(c(1000, 1200, 1400), c(5, 5, 5))
> summary(lm(rate ~ rpm + hardness, data = metal.df))

Call:
lm(formula = rate ~ rpm + hardness, data = metal.df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-08 ***
rpm                                        e-07 ***
hardness                                   e-10 ***

When rpm = 1000, the fitted relationship is

  rate = (Intercept) + 1000 × (rpm coefficient) + (hardness coefficient) × hardness,

i.e. a single line in hardness whose intercept is fixed by the rpm value.

Factors (3)
[Table: fitted intercepts and slopes for the fast, medium and slow settings under the factor model and the non-factor (numeric rpm) model; the numeric entries have not survived.]

The non-factor model constrains the 3 intercepts to be equally spaced. OK for this data set, but not in general.

Factors (4)
To avoid this, we could recode the variable as character, or (easier) use the factor function to coerce the numerical data into a factor:

> rpm.as.factor <- factor(rpm)

Factors (5)
We can fit the "factor" model using the R code

> rpm.as.factor <- factor(rpm)
> summary(lm(rate ~ rpm.as.factor + hardness))

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  e-05 ***
rpm.as.factor1200                                 ***
rpm.as.factor1400                            e-07 ***
hardness                                     e-09 ***

These estimates are different!! What's going on??
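
Before the next slide explains, note (an addition, a sketch) that the two factor fits describe the same three parallel lines, just expressed relative to different baselines; the fitted values agree exactly:

> fit.setting <- lm(rate ~ setting + hardness, data = metal.df)
> fit.rpm     <- lm(rate ~ rpm.as.factor + hardness, data = metal.df)
> all.equal(fitted(fit.setting), fitted(fit.rpm))
[1] TRUE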

Levels
- The different values of a factor are called "levels"
- The levels of the factor setting are fast, medium, slow:

> levels(setting)
[1] "fast"   "medium" "slow"

- The levels of the factor rpm.as.factor are 1000, 1200, 1400:

> levels(rpm.as.factor)
[1] "1000" "1200" "1400"

Levels (2)
- By default, the levels are listed in alphabetical order
- The first level is selected as the baseline
- Thus, using setting, the baseline is "fast"; using rpm.as.factor, the baseline is "1000"
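
If all you want is a different baseline (an addition, not on the slide), the relevel function is a convenient shortcut:

> setting.slowbase <- relevel(metal.df$setting, ref = "slow")
> levels(setting.slowbase)
[1] "slow"   "fast"   "medium"

The reference level moves to the front and becomes the baseline; the other levels keep their relative order.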

Levels (3)
We can change the order of the levels using the factor function:

> rpm.newbaseline <- factor(rpm, levels = c("1400", "1200", "1000"))
> summary(lm(rate ~ rpm.newbaseline + hardness, data = metal.df))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                                            *
rpm.newbaseline1200                                  ***
rpm.newbaseline1000                             e-07 ***
hardness                                        e-09 ***

Non-parallel lines
What if the lines aren't parallel? Then the betas are different: the model becomes

  Fast:   rate = α_F + β_F × hardness + ε
  Medium: rate = α_M + β_M × hardness + ε
  Slow:   rate = α_S + β_S × hardness + ε

Baseline version for the betas
As before, we can regard the fast setting as a baseline and express the other settings as "baseline plus offsets":

  β_F = β          (baseline slope)
  β_M = β + γ_M    (offset γ_M for the medium line slope)
  β_S = β + γ_S    (offset γ_S for the slow line slope)

Baseline version for both parameters
We can then write the model as

  Fast:   rate = α + β × hardness + ε
  Medium: rate = (α + δ_M) + (β + γ_M) × hardness + ε
  Slow:   rate = (α + δ_S) + (β + γ_S) × hardness + ε

Dummy variables for both parameters
As before, we can combine these 3 equations into one by using "dummy variables". Define med and slow as before, and

  h.med  = hardness × med
  h.slow = hardness × slow

Then we can write the model as

  rate = α + δ_M × med + δ_S × slow + β × hardness + γ_M × h.med + γ_S × h.slow + ε
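
You can verify this equivalence in R (an addition, a sketch): build the interaction dummies by hand and check that the fit matches the setting * hardness formula introduced on the next slide:

> h.med  <- metal.df$hardness * med
> h.slow <- metal.df$hardness * slow
> fit.byhand <- lm(rate ~ med + slow + hardness + h.med + h.slow, data = metal.df)
> fit.star   <- lm(rate ~ setting * hardness, data = metal.df)
> all.equal(fitted(fit.byhand), fitted(fit.star))
[1] TRUE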

Fitting in R
The model formula for this non-parallel model is

  rate ~ setting + hardness + setting:hardness

or, even more compactly,

  rate ~ setting * hardness

> summary(lm(rate ~ setting * hardness))

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)
settingmedium
settingslow
hardness                                          e-07 ***
settingmedium:hardness
settingslow:hardness
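
The per-setting slopes can then be read off the coefficients (an addition, a sketch):

> b <- coef(lm(rate ~ setting * hardness, data = metal.df))
> c(fast   = b[["hardness"]],
+   medium = b[["hardness"]] + b[["settingmedium:hardness"]],
+   slow   = b[["hardness"]] + b[["settingslow:hardness"]])   # the three slopes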

Is the non-parallel model necessary?
This amounts to testing if γ_M and γ_S are zero, or, equivalently, if the parallel model

  rate ~ setting + hardness

is an adequate submodel of the non-parallel model

  rate ~ setting * hardness

As in Lecture 6, we use the anova function to compare the two models:

> model1 <- lm(rate ~ setting + hardness)
> model2 <- lm(rate ~ setting * hardness)
> anova(model1, model2)

Analysis of Variance Table

Model 1: rate ~ setting + hardness
Model 2: rate ~ setting * hardness
  Res.Df RSS Df Sum of Sq F Pr(>F)
1
2

Conclusion: since the F-value is small and the p-value is large, we conclude that the submodel (i.e. the parallel lines model) is adequate.
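
The same F-test can be obtained in one step (an addition, a sketch) by asking R to drop the interaction term from the larger model:

> drop1(model2, test = "F")
  # tests removing setting:hardness, giving the same F and p-value
  # as the anova(model1, model2) comparison above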