Stats 330: Lecture 19
© Department of Statistics 2012



Slide 2: Plan of the day
In today's lecture we look at some general strategies for choosing models when there are many continuous and categorical explanatory variables, and discuss an example.

Slide 3: General principle
For a problem with both continuous and categorical explanatory variables, the most general model fits a separate regression for each possible combination of the factor levels. That is, we allow the categorical variables to interact with each other and with the continuous variables.

Slide 4: Illustration
Two factors A and B, two continuous explanatory variables X and Z.
The general model is y ~ A*B*X + A*B*Z.
Suppose A has a levels and B has b levels, so there are a × b factor level combinations.
Each combination has a separate regression with 3 parameters:
– Constant term
– Coefficient of X
– Coefficient of Z
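A minimal sketch of this parameter count on simulated data (the names A, B, X, Z follow the slide; everything else here is hypothetical): with a = 2 and b = 3, the model y ~ A*B*X + A*B*Z has one intercept, one X slope and one Z slope for each of the a × b = 6 combinations, so 3 × 6 = 18 coefficients in all.

```r
# Simulated data: 2-level factor A, 3-level factor B, continuous X and Z.
set.seed(1)
n <- 300
df <- data.frame(
  A = factor(sample(c("a1", "a2"), n, replace = TRUE)),
  B = factor(sample(c("b1", "b2", "b3"), n, replace = TRUE)),
  X = rnorm(n),
  Z = rnorm(n)
)
df$y <- rnorm(n)  # pure noise; we only count parameters here

# Separate regression in X and Z for each of the a*b = 6 combinations:
fit <- lm(y ~ A * B * X + A * B * Z, data = df)
length(coef(fit))  # 3 parameters per combination: 18
```

The 18 coefficients appear in the output grouped exactly as the next two slides describe: intercept terms (Intercept, A, B, A:B), X terms (X, A:X, B:X, A:B:X) and Z terms (Z, A:Z, B:Z, A:B:Z).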

Slide 5: Illustration (cont)
There are a × b constant terms; we can arrange them in a table.
The table can be split into main effects and interactions, as in a two-way anova.
These are listed in the output as Intercept, A, B and A:B.

Slide 6: Illustration (cont)
There are a × b X-coefficients; we can also arrange them in a table.
Again, the table can be split into main effects and interactions, as in a two-way anova.
These are listed in the output as X, A:X, B:X and A:B:X. Ditto for Z.
If all the A:X, B:X and A:B:X interactions are zero, the coefficient of X is the same for all a × b regressions.

Slide 7: Model selection
In these situations the number of possible models is large, so we need variable selection techniques:
– anova
– stepwise methods
Don't include a high-order interaction unless you also include the corresponding lower-order interactions.

Slide 8: Caution
Sometimes we don't have enough data to fit a separate regression to each factor level combination (each combination needs at least one more data point than the number of continuous variables). In this case we drop the higher-order interactions, forcing the corresponding coefficients to have common values.

Slide 9: Example: risk factors for low birthweight
These data were collected at Baystate Medical Center, Springfield, Mass. during 1986, as part of a study to identify risk factors for low-birthweight babies. The response variable was birthweight, and data were collected on a variety of continuous and categorical explanatory variables.

Slide 10: Variables
age: mother's age in years, continuous
lwt: mother's weight in pounds, continuous
race: mother's race (1 = white, 2 = black, 3 = other), factor
smoke: smoking during pregnancy (1 = smoked, 0 = didn't smoke), factor
ht: history of hypertension (0 = No, 1 = Yes), factor
ui: presence of uterine irritability (0 = No, 1 = Yes), factor
bwt: birth weight in grams, continuous, response
Note: race, smoke, ht and ui must be declared as factors!
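The slides use a data frame called births.df without showing how it is built. As a hedged sketch, the same Baystate low-birthweight data ship with R's MASS package as birthwt (assumed equivalent here, apart from a few extra columns we drop); the essential step is declaring the categorical variables as factors before fitting:

```r
library(MASS)  # provides the birthwt data (189 observations)

births.df <- birthwt[, c("age", "lwt", "race", "smoke", "ht", "ui", "bwt")]

# race, smoke, ht and ui are stored as integers; declare them as factors
births.df$race  <- factor(births.df$race)   # 1 = white, 2 = black, 3 = other
births.df$smoke <- factor(births.df$smoke)  # 1 = smoked, 0 = didn't smoke
births.df$ht    <- factor(births.df$ht)     # history of hypertension
births.df$ui    <- factor(births.df$ui)     # uterine irritability

str(births.df)
```

If race were left as an integer, lm would fit a single slope for it instead of separate levels for black and other, which is why the slide flags the factor declaration so emphatically.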

Slide 11: Preliminary plots
[Plots of bwt against the explanatory variables; not reproduced in this transcript.]

Slide 12: Plotting conclusions
The plots suggest some relationships between bwt and the covariates:
– a slight relationship with lwt
– small effects due to the categorical variables
On to fitting models…

Slide 13: Factor level combinations
There are 2 continuous explanatory variables and 4 categorical explanatory variables: race (3 levels), smoke (2 levels), ht (2 levels) and ui (2 levels). There are 3 × 2 × 2 × 2 = 24 factor level combinations: 24 regressions in all!

Slide 14: Models
The most general model fits a separate regression surface to each of the 24 combinations. Assuming planes are appropriate, this means 24 × 3 = 72 parameters. There are 189 observations, so this is rather a lot of parameters (usually we want at least 5 observations per parameter). In fact, not all factor level combinations have enough data to fit a plane (each needs at least 3 points). The model fitting separate planes to each combination is
bwt ~ age*race*smoke*ht*ui + lwt*race*smoke*ht*ui
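The sparseness claim is easy to check directly. A sketch, again assuming births.df is built from MASS::birthwt with the four categorical variables declared as factors:

```r
library(MASS)
births.df <- birthwt[, c("age", "lwt", "race", "smoke", "ht", "ui", "bwt")]
births.df[c("race", "smoke", "ht", "ui")] <-
  lapply(births.df[c("race", "smoke", "ht", "ui")], factor)

# Count observations in each of the 24 factor level combinations.
# A combination with fewer than 3 points cannot support its own
# plane in age and lwt (intercept + 2 slopes).
counts <- with(births.df, table(race, smoke, ht, ui))
length(counts)   # 24 cells
sum(counts < 3)  # number of combinations too sparse for a separate plane
```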

Slide 15: Fitting
We can fit the model and use the anova function to reduce the number of variables:

> births.lm <- lm(bwt ~ age*race*smoke*ui*ht + lwt*race*smoke*ui*ht, data = births.df)
> anova(births.lm)

We can also use the step function with the forward option:

> null.lm <- lm(bwt ~ 1, data = births.df)
> step(null.lm, formula(births.lm), direction = "forward")

Slide 16: Results: anova
Analysis of Variance Table (the numeric Df, Sum Sq, Mean Sq, F and p-values were lost in transcription; significance codes are retained):

age
race        **
smoke       ***
ui          ***
ht          *
lwt         **
age:race
age:smoke   *
race:smoke
age:ui
race:ui
smoke:ui
age:ht      *
race:ht
smoke:ht
race:lwt
smoke:lwt

Slide 17: Results: anova (cont)
Analysis of Variance Table, continued (numeric values lost in transcription):

ui:lwt
ht:lwt
age:race:smoke
age:race:ui
age:smoke:ui
race:smoke:ui
age:race:ht
age:smoke:ht
race:smoke:lwt
race:ui:lwt
smoke:ui:lwt
race:ht:lwt
age:race:smoke:ui   *
race:smoke:ui:lwt
Residuals

Slide 18: Results: stepwise (forward/both)
Final step of the search (the numeric AIC, Df, Sum of Sq and RSS values were lost in transcription):

Step: AIC = …
bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke

Terms listed in the step table: race:smoke, ui:lwt, smoke:ht, ht:lwt, age, smoke:lwt, race:ht, race:lwt, ui.

Slide 19: Comparisons
Three models to compare:
– Full model
– Model indicated by anova (model 2): bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke
– Model chosen by stepwise (model 3): bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke

© Department of Statistics 2012 STATS 330 Lecture 19: Slide 20 ModelAdj R2R2 Param- eters AIC Full Model Model Additive model extractAIC(model3.lm)

Slide 21: Deleting?
Point 133 seems influential: it has a large covratio and HMD.
Refitting without point 133 now makes model 3 the best, so we will go with model 3.
We could also use a purely additive model (i.e. parallel planes), but its adjusted R² and AIC are slightly worse.
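A hedged sketch of the refit: fit model 3 from the stepwise search, then refit without observation 133 and compare AICs. The row number 133 follows the slides and is illustrative here, since births.df is again assumed to come from MASS::birthwt (whose row labels differ).

```r
library(MASS)
births.df <- birthwt[, c("age", "lwt", "race", "smoke", "ht", "ui", "bwt")]
births.df[c("race", "smoke", "ht", "ui")] <-
  lapply(births.df[c("race", "smoke", "ht", "ui")], factor)

# Model 3, as chosen by the stepwise search
model3.lm <- lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                data = births.df)

# Refit without the influential point (row 133, per the slides)
model3a.lm <- update(model3.lm, data = births.df[-133, ])

extractAIC(model3.lm)   # equivalent df and AIC with all 189 points
extractAIC(model3a.lm)  # ... and without point 133
```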

Slide 22: Summary of model 3
Coefficients (the numeric estimates, standard errors and t-values were lost in transcription; significance codes are retained):

(Intercept)    ***
ui1            ***
race2          **
race3          ***
smoke1         ***
ht1            **
lwt
ht1:lwt        *
race2:smoke1
race3:smoke1   *

Slide 23: Interpretation (cont)
Other things being equal:
– Uterine irritability is associated with lower birthweight.
– Smoking is associated with lower birthweight, but differently for different races; it lowers birthweight more for race 1 (white).
– Hypertension is associated with lower birthweight.
– Race is associated with birthweight: black lower than white, "other" lower than white.
– Higher mother's weight is associated with higher birthweight in the hypertension group.
These effects are significant but small compared with the residual variability.

Slide 24: Interpretation of interactions
Fitted smoking effect by race (the numeric values were lost in transcription):

              White   Black   Other
Smoke: No       …       …       …
Smoke: Yes      …       …       …

Slide 25: Diagnostics for model 2
Check for high-influence points etc. The diagnostics single out point 133!