Announcements
There’s an in-class exam one week from today (4/30). It will not include ANOVA or regression. On Thursday, I will list the covered material and put practice questions, etc. on the web.
No office hours on Wed this week. (I’m out of town all day.)
See the web for the homework due this Thursday (4/25).
I read your proposals and made comments / suggestions on the ones that analyze their own data.
Multiple Regression
Cheese Example: In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed to determine the amount of acetic acid and hydrogen sulfide they contained. Overall scores for each cheese were obtained by combining the scores from several tasters. The goal is to predict the taste score based on the acetic acid and hydrogen sulfide content. (From Matt Wand)
Model: A simple model for taste is:
$\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\mathrm{H_2S}_i + \text{error}_i, \quad i = 1, \dots, n = 30$
Again the intercept and slopes are selected to minimize the error sum of squares:
$\text{SSE} = \{\text{taste}_1 - (b_0 + b_1\,\text{acetic}_1 + b_2\,\mathrm{H_2S}_1)\}^2 + \dots + \{\text{taste}_{30} - (b_0 + b_1\,\text{acetic}_{30} + b_2\,\mathrm{H_2S}_{30})\}^2$
Geometrically: The simple linear model estimated a line. A model with an intercept and 2 slopes estimates a surface (a plane). (see Matlab)
Note that you could add more predictors too…
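The least-squares idea above can be sketched in a few lines. This is only an illustration: the six data points are made up (they are not the cheese measurements), and NumPy's `lstsq` stands in for whatever Minitab does internally. The sketch fits b0, b1, b2 and then checks that nudging one slope away from the fitted value makes the SSE larger.

```python
import numpy as np

# Hypothetical data (NOT the cheese measurements): 6 samples, two predictors.
acetic = np.array([4.5, 5.0, 5.5, 6.0, 4.8, 5.7])
h2s = np.array([3.0, 5.5, 4.0, 9.0, 7.5, 6.0])
taste = np.array([12.0, 21.0, 18.0, 40.0, 30.0, 26.0])

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(acetic), acetic, h2s])

def sse(b):
    """Error sum of squares for coefficients b = (b0, b1, b2)."""
    resid = taste - X @ b
    return float(resid @ resid)

# Least squares picks the b that minimizes SSE.
b_hat, *_ = np.linalg.lstsq(X, taste, rcond=None)

print("b0, b1, b2 =", b_hat)
print("SSE at the least-squares fit:", sse(b_hat))
print("SSE after nudging b1 by 0.5: ", sse(b_hat + np.array([0.0, 0.5, 0.0])))
```

Any other choice of coefficients gives an SSE at least as large as the fitted one, which is what "selected to minimize the error sum of squares" means.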
Minitab: Stat: Regression: Regression
– Response is taste
– Predictors are acetic and h2s
Output:
The regression equation is
taste = … + … H2S + … acetic
Predictor   Coef   SE Coef   T   P
Constant
H2S
acetic
S = …   R-Sq = 40.6%   R-Sq(adj) = 36.2%
Analysis of Variance
Source   DF   SS   MS   F   P
Regression
Residual Error
Total
Minitab:
The regression equation is
taste = … + … H2S + … acetic
Predictor   Coef   SE Coef   T   P
Constant
H2S
acetic
T = Coef / SE Coef (this is the test statistic)
P-value is for the test: H0: Coef = 0, HA: Coef is not 0 (if p-value < $\alpha$, then reject H0)
$(1-\alpha)$ CI for Coef: Coef $\pm\; t_{\alpha/2,\,\text{df} = \text{error df}} \times$ SE Coef
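As a rough sketch of how the T column and a confidence interval follow from Coef and SE Coef: the coefficient and standard error below are hypothetical (they are not the Minitab values), and the critical value 2.052 is the table value of t(0.025, 27), using error df = 30 - 3 = 27 for the cheese model.

```python
def t_stat_and_ci(coef, se_coef, t_crit):
    """Return (T, lower, upper) for a single regression coefficient."""
    t_stat = coef / se_coef            # T = Coef / SE Coef
    half_width = t_crit * se_coef      # t_{alpha/2, error df} * SE Coef
    return t_stat, coef - half_width, coef + half_width

# Hypothetical coefficient and standard error, for illustration only.
# t_crit = 2.052 is the table value t_{0.025, 27} (95% CI, 27 error df).
T, lower, upper = t_stat_and_ci(coef=2.0, se_coef=0.5, t_crit=2.052)
print(f"T = {T:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

If the interval excludes 0, the two-sided test of H0: Coef = 0 rejects at the matching level.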
Minitab: This is a test of the “usefulness of regression”
Analysis of Variance
Source   DF   SS   MS   F   P
Regression
Residual Error
Total
The regression equation is
taste = … + … H2S + … acetic
Model is regression equation + error:
taste = … + … H2S + … acetic + error
MSE = $S^2$ = estimate of the variance of the error.
F stat = MSR / MSE (this is the test statistic)
P-value is for the test: H0: $\beta_1 = \beta_2 = 0$ (both slopes = 0), HA: at least one is not 0
Overall test of whether or not the regression is useful.
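The reported R-Sq and R-Sq(adj) can be cross-checked against the ANOVA logic, since F = MSR / MSE can be rewritten in terms of R-Sq. The sketch below assumes n = 30 cheeses and k = 2 slopes; the F value it prints is only approximate because it starts from the rounded R-Sq of 40.6%, and it is not a substitute for the F in the Minitab table.

```python
def f_and_adj_r2(r2, n, k):
    """Overall F statistic and adjusted R^2 from R^2, n observations, k slopes."""
    df_reg, df_err = k, n - k - 1
    # F = MSR / MSE = (SSR / df_reg) / (SSE / df_err); dividing through by SST
    # turns SSR into R^2 and SSE into 1 - R^2.
    f_stat = (r2 / df_reg) / ((1.0 - r2) / df_err)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_err
    return f_stat, adj_r2

f_stat, adj_r2 = f_and_adj_r2(r2=0.406, n=30, k=2)
print(f"F is roughly {f_stat:.2f}; adjusted R^2 = {adj_r2:.3f}")
```

The adjusted value comes out at 0.362, which agrees with the R-Sq(adj) = 36.2% on the output.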
Using the regression equation:
taste = … + … H2S + … acetic
If H2S = 3 and acetic = 5, then what is the expected taste score? (NOTE that this is not an extrapolation…)
For the value, just plug H2S = 3 and acetic = 5 into the equation.
For a “confidence interval” (CI): Stat: Regression: Regression, Options button: prediction interval for new obs (enter the values in the order in which the predictors appear in the regression equation)
New Obs   Fit   SE Fit   95.0% CI   95.0% PI
                         ( 10.60, 23.63)   ( , 44.53)
Prediction interval: wider than the CI, since prediction includes the “error” variability in addition to the variability in estimating the parameters.
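A small sketch of why the prediction interval is wider. Everything numeric here is hypothetical: the coefficients, SE Fit, and S are invented for illustration, and 2.052 is the table value of t(0.025, 27). The CI uses only SE Fit, while the PI adds the error variance S^2 under the square root.

```python
import math

def predict(b0, b_h2s, b_acetic, h2s, acetic):
    """Plug predictor values into the fitted equation."""
    return b0 + b_h2s * h2s + b_acetic * acetic

def ci_and_pi(fit, se_fit, s, t_crit):
    """CI for the mean response and PI for one new observation."""
    ci_half = t_crit * se_fit                          # parameter uncertainty only
    pi_half = t_crit * math.sqrt(s**2 + se_fit**2)     # plus "error" variability
    return (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)

# Hypothetical coefficients (NOT the Minitab estimates).
fit = predict(b0=-20.0, b_h2s=4.0, b_acetic=5.0, h2s=3.0, acetic=5.0)

# Hypothetical SE Fit and S; t_crit is the table value t_{0.025, 27}.
ci, pi = ci_and_pi(fit, se_fit=3.0, s=10.0, t_crit=2.052)
print("fit =", fit)
print("95% CI =", ci)
print("95% PI =", pi)
```

Both intervals are centered on the fit, so the PI always contains the CI.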
Dummy (or indicator) variables: When some predictor variables are categorical, regression can still be used. Dummy variables are used to indicate the category (here, the fabric) of each observation…
Regression Model for Burn Time Data
Burn time = ($\beta_1$ if fabric 1) + ($\beta_2$ if fabric 2) + ($\beta_3$ if fabric 3) + ($\beta_4$ if fabric 4) + error
or
$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \epsilon_i$ (x’s are “indicator variables”)
$x_{1i}$ = 1 if observation i is fabric 1 and 0 otherwise
$x_{2i}$ = 1 if observation i is fabric 2 and 0 otherwise
$x_{3i}$ = 1 if observation i is fabric 3 and 0 otherwise
$x_{4i}$ = 1 if observation i is fabric 4 and 0 otherwise
The $\beta$’s are fabric-specific means.
The model does not have an intercept. (Stat: Regression: Regression, Options: “Fit intercept” button)
An Equivalent Model:
$y_i = \beta_0 + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \epsilon_i$
$x_{2i}$ = 1 if observation i is fabric 2 and 0 otherwise
$x_{3i}$ = 1 if observation i is fabric 3 and 0 otherwise
$x_{4i}$ = 1 if observation i is fabric 4 and 0 otherwise
Fabric 1 mean = $\beta_0$
Fabric 2 mean = $\beta_0 + \beta_2$
Fabric 3 mean = $\beta_0 + \beta_3$
Fabric 4 mean = $\beta_0 + \beta_4$
This model does have an intercept.
$\beta_0$ is the mean for fabric 1.
The rest of the $\beta$’s are “offsets”.
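The two codings can be compared side by side. This sketch uses made-up burn times (4 fabrics with 4 observations each, not the class data) and NumPy's `lstsq`: the no-intercept model returns the four fabric means, while the intercept model returns the fabric 1 mean followed by offsets, and both give identical fitted values.

```python
import numpy as np

# Hypothetical burn times, 4 observations per fabric (NOT the class data).
fabric = np.repeat([1, 2, 3, 4], 4)
burn = np.array([17.0, 18.0, 16.0, 17.0,    # fabric 1, mean 17
                 11.0, 12.0, 10.0, 11.0,    # fabric 2, mean 11
                 10.0, 11.0,  9.0, 10.0,    # fabric 3, mean 10
                 12.0, 13.0, 11.0, 12.0])   # fabric 4, mean 12

# Indicator columns x_1..x_4: x_j = 1 if the observation is fabric j, else 0.
indicators = np.column_stack([(fabric == j).astype(float) for j in (1, 2, 3, 4)])

# Model without intercept: the coefficients are the fabric-specific means.
means, *_ = np.linalg.lstsq(indicators, burn, rcond=None)

# Model with intercept and indicators for fabrics 2-4 only:
# the intercept is the fabric 1 mean, the other coefficients are offsets from it.
X2 = np.column_stack([np.ones(len(burn)), indicators[:, 1:]])
coefs, *_ = np.linalg.lstsq(X2, burn, rcond=None)

print("fabric means:        ", means)
print("intercept + offsets: ", coefs)
print("fabric 2 mean from offsets:", coefs[0] + coefs[1])
```

Dropping the fabric 1 indicator when an intercept is present is what keeps the design matrix full rank; with all four indicators plus an intercept the columns would be linearly dependent.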
The regression equation is
Burn Time = … + … Fabric 2 + … Fabric 3 + … Fabric 4
Predictor   Coef   SE Coef   T   P
Constant
Fabric 2
Fabric 3
Fabric 4
S = …   R-Sq = 87.2%   R-Sq(adj) = 83.9%
Analysis of Variance (Note that this is the same as before!)
Source   DF   SS   MS   F   P
Regression
Residual Error
Total
95% CI’s for fabric means: (point estimate of mean) $\pm\; t_{0.025,12}\,\sqrt{\text{MSE}/4}$
Fabric 2: $(16.85 - 5.90) \pm t_{0.025,12}\,\sqrt{1.348/4} = 10.95 \pm t_{0.025,12}\,(0.5806)$
(0.5806 is the std dev of the estimate of $\beta_0 + \beta_2$)
(As usual, we’re assuming the errors are independent and normal with constant variance.)
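The fabric 2 interval can be carried through numerically. The sketch below takes 16.85, -5.90, and MSE = 1.348 from the worked line above, reading them as the intercept and the fabric 2 offset (an assumption, since the coefficient table is not reproduced here), and uses the table value t(0.025, 12) = 2.179.

```python
import math

# Values taken from the worked line; their roles are assumed as labeled.
constant = 16.85         # assumed: estimate of beta_0 (fabric 1 mean)
offset_fabric2 = -5.90   # assumed: estimate of beta_2 (fabric 2 offset)
mse = 1.348              # mean squared error from the ANOVA table
n_per_fabric = 4         # observations per fabric
t_crit = 2.179           # table value t_{0.025, 12}

mean_fabric2 = constant + offset_fabric2       # point estimate of beta_0 + beta_2
se_mean = math.sqrt(mse / n_per_fabric)        # std dev of that estimate
lower = mean_fabric2 - t_crit * se_mean
upper = mean_fabric2 + t_crit * se_mean

print(f"fabric 2 mean = {mean_fabric2:.2f}, SE = {se_mean:.4f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
```

The simple sqrt(MSE / 4) form works because each fabric mean is an average of 4 observations; with unequal group sizes the divisor would be that fabric's own count.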
Back to cheese
Suppose the cheeses come from two regions of Australia and we want to include that info in the model:
$\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\mathrm{H_2S}_i + \beta_3\,\text{Region}_i + \text{error}_i, \quad i = 1, \dots, n = 30$
$\text{Region}_i$ = 1 if the i-th sample comes from region 1 and 0 otherwise.
$\beta_3$ is the effect of region 1…
If $b_3$ is > 0, then region 1 tends to increase the mean score (and vice versa).
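To see what "effect of region 1" means in this model, compare two cheeses with the same acetic and H2S values that differ only in region. The coefficients below are hypothetical (no fit with Region is shown in these slides); the point is only that the difference in predicted mean taste equals b3 regardless of the other predictors.

```python
def predicted_taste(acetic, h2s, region, b):
    """Mean taste under the model; region is 1 for region 1 and 0 otherwise."""
    b0, b1, b2, b3 = b
    return b0 + b1 * acetic + b2 * h2s + b3 * region

# Hypothetical estimates (b0, b1, b2, b3), for illustration only.
b = (-20.0, 5.0, 4.0, 3.0)

in_region_1 = predicted_taste(acetic=5.0, h2s=3.0, region=1, b=b)
elsewhere = predicted_taste(acetic=5.0, h2s=3.0, region=0, b=b)
shift = in_region_1 - elsewhere

print("predicted taste, region 1:", in_region_1)
print("predicted taste, other region:", elsewhere)
print("difference (equals b3):", shift)
```

Geometrically, the dummy variable gives two parallel surfaces, one per region, separated vertically by b3.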