Regression Transformations for Normality and to Simplify Relationships


U.S. Coal Mine Production – 2011. Source: www.eia.gov

Data Description

Coal mine production and labor effort for all mines producing over 100,000 short tons of coal in 2011.

Units: Mine (n = 691)
Response: Coal Production (100,000s of tons)
Predictor variables:
- Labor Effort (100,000s of hours)
- Surface Mine dummy (1 if surface mine, 0 if underground)
- Appalachia Region dummy (1 if yes, 0 if Interior or Western)
- Interior Region dummy (1 if yes, 0 if Appalachia or Western)
- MinePrep dummy (1 if mine & preparation plant, 0 if mine only)

Model 1 – Non-Transformed with Interactions

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -63.0381     5.7418 -10.979  < 2e-16 ***
labor                  21.5790     1.1099  19.443  < 2e-16 ***
surface                -8.9941     2.3124  -3.890 0.000110 ***
appalachia             65.9606     5.3749  12.272  < 2e-16 ***
interior               70.2902     6.1094  11.505  < 2e-16 ***
mineprep              -15.1459     3.7579  -4.030  6.2e-05 ***
I(labor * surface)      8.9305     0.7371  12.116  < 2e-16 ***
I(labor * appalachia) -21.2830     0.9139 -23.288  < 2e-16 ***
I(labor * interior)   -21.0385     1.0601 -19.846  < 2e-16 ***
I(labor * mineprep)     3.8508     0.6131   6.281  6.0e-10 ***
---
Residual standard error: 21.76 on 681 degrees of freedom
Multiple R-squared: 0.8915, Adjusted R-squared: 0.8901
F-statistic: 621.8 on 9 and 681 DF, p-value: < 2.2e-16
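Output of this form comes from fitting a linear model with explicit `I()` interaction terms. A minimal sketch follows; the original data file is not available, so the data frame here is synthetic and its column names are assumptions chosen to match the coefficient labels in the table above (the real region dummies are mutually exclusive, which this sketch does not enforce).

```r
# Synthetic stand-in for the coal-mine data; names mirror the output above.
set.seed(1)
n <- 691
coal <- data.frame(
  labor      = runif(n, 0.1, 10),   # 100,000s of hours
  surface    = rbinom(n, 1, 0.5),
  appalachia = rbinom(n, 1, 0.4),
  interior   = rbinom(n, 1, 0.3),
  mineprep   = rbinom(n, 1, 0.3)
)
coal$production <- with(coal, -63 + 21.6 * labor + rnorm(n, sd = 20))

# Model 1: main effects plus labor interactions, as in the summary above
fit1 <- lm(production ~ labor + surface + appalachia + interior + mineprep +
             I(labor * surface) + I(labor * appalachia) +
             I(labor * interior) + I(labor * mineprep),
           data = coal)
summary(fit1)   # 10 estimated coefficients, matching the table's rows
```

The `I()` wrapper tells R to compute the product literally rather than interpret `*` as formula shorthand for main effects plus interaction.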

Residual Plots – Not Pretty! The residual plots show clear violations of the normality and constant-variance assumptions.

Box-Cox Transformation of Y

Goal: transform Y toward normality via the Box-Cox (power) transformation:
Y(λ) = (Y^λ - 1)/λ for λ ≠ 0, and Y(λ) = ln(Y) for λ = 0.
Choose the power λ that minimizes the error sum of squares (equivalently, maximizes the normal likelihood), typically evaluated over the range (-2, +2).
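In R, this search over λ can be done with `boxcox()` from the MASS package (a recommended package that ships with R). A sketch on synthetic data whose response is log-normal, so the estimated λ should land near 0:

```r
library(MASS)   # provides boxcox()

set.seed(2)
x <- runif(300, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(300, sd = 0.4))   # log-normal response

fit <- lm(y ~ x)
# Profile the normal log-likelihood over a grid of powers lambda
bc <- boxcox(fit, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat   # near 0, pointing to the log transformation
```

For the coal data the same profile plot peaks at λ = 0, which is why the next slide adopts Y′ = ln(Y).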

Plot of log-Likelihood vs λ. Choose λ = 0 – the logarithmic transformation: Y′ = ln(Y)

Model with Y′ = ln(Y)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.55942    0.14293  17.906  < 2e-16 ***
labor                  0.13648    0.02763   4.940 9.86e-07 ***
surface               -0.07384    0.05756  -1.283      0.2    
appalachia            -2.03450    0.13380 -15.205  < 2e-16 ***
interior              -1.52129    0.15209 -10.003  < 2e-16 ***
mineprep               0.50231    0.09355   5.370 1.08e-07 ***
I(labor * surface)     0.15908    0.01835   8.670  < 2e-16 ***
I(labor * appalachia)  0.16475    0.02275   7.242 1.20e-12 ***
I(labor * interior)    0.16721    0.02639   6.336 4.28e-10 ***
I(labor * mineprep)   -0.12685    0.01526  -8.311 5.13e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5417 on 681 degrees of freedom
Multiple R-squared: 0.8079, Adjusted R-squared: 0.8054
F-statistic: 318.2 on 9 and 681 DF, p-value: < 2.2e-16

Residual Plots with Y’ = ln(Y)

The residual plots give evidence of a possibly nonlinear relation between ln(Y) and X, so we consider a power transformation of X.

Box-Tidwell Transformation of X

Goal: a power transformation of X that makes its relation with the (here, log-transformed) Y linear. Classify the predictors into those to be transformed (labor) and those not to be transformed (the regional and mine-type dummies). The transformation can be computed in R with the car package, along with a score test of whether the power equals 1 (i.e., no transformation needed):

> boxTidwell(logprod ~ labor, other.x = ~ surface + appalachia + interior + mineprep)
 Score Statistic  p-value  MLE of lambda
       -21.75547        0      0.2768753

The estimated power is approximately 0.25, so choose X′ = X^0.25 for labor (and for labor's interactions with the region and mine-type dummies).
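The `boxTidwell()` call above requires the car package. To illustrate the underlying idea in base R alone, a crude alternative (not the Box-Tidwell score algorithm itself) is a grid search over candidate powers, keeping the one that minimizes the residual sum of squares. The data here are synthetic with a true power of 0.25:

```r
set.seed(3)
x    <- runif(400, 0.5, 20)
logy <- 1 + 3 * x^0.25 + rnorm(400, sd = 0.3)   # true power is 0.25

# Fit y ~ x^p over a grid of powers p and record each model's RSS
powers <- seq(0.05, 1.5, by = 0.05)
rss    <- sapply(powers, function(p) deviance(lm(logy ~ I(x^p))))
best   <- powers[which.min(rss)]
best   # close to the true power 0.25
```

Box-Tidwell refines this idea by iterating a constructed-variable regression to a maximum-likelihood estimate of the power, and supplies the score test of power = 1 shown above.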

Full Model with Y′ = ln(Y) and L′ = L^0.25

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -1.28897    0.35303  -3.651 0.000281 ***
labor25                  3.12581    0.25251  12.379  < 2e-16 ***
surface                 -0.15534    0.15391  -1.009 0.313191    
appalachia              -0.93595    0.32755  -2.857 0.004402 ** 
interior                -0.49939    0.36439  -1.370 0.170987    
mineprep                -0.43924    0.20918  -2.100 0.036110 *  
I(labor25 * surface)     0.53234    0.13157   4.046  5.8e-05 ***
I(labor25 * appalachia) -0.18624    0.22728  -0.819 0.412831    
I(labor25 * interior)   -0.09431    0.25320  -0.372 0.709658    
I(labor25 * mineprep)    0.28679    0.14875   1.928 0.054266 .  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4508 on 681 degrees of freedom
Multiple R-squared: 0.8669, Adjusted R-squared: 0.8652
F-statistic: 493 on 9 and 681 DF, p-value: < 2.2e-16

Note that neither interaction of transformed labor with the regional dummies (appalachia, interior) appears important – refit a simpler model without them.

Reduced Model with Y′ = ln(Y) and L′ = L^0.25

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -1.04247    0.14758  -7.064 4.00e-12 ***
labor25                2.94855    0.10597  27.824  < 2e-16 ***
surface               -0.19196    0.14803  -1.297   0.1951    
appalachia            -1.19221    0.07681 -15.522  < 2e-16 ***
interior              -0.64264    0.08357  -7.690 5.14e-14 ***
mineprep              -0.48673    0.20247  -2.404   0.0165 *  
I(labor25 * surface)   0.56751    0.12508   4.537 6.74e-06 ***
I(labor25 * mineprep)  0.32409    0.14287   2.268   0.0236 *  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4504 on 683 degrees of freedom
Multiple R-squared: 0.8668, Adjusted R-squared: 0.8654
F-statistic: 634.9 on 7 and 683 DF, p-value: < 2.2e-16

Comparing the reduced and full models:

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    683 138.56                           
2    681 138.39  2   0.17023 0.4189  0.658

The partial F-test (F = 0.4189, p = 0.658) confirms that the 2 labor-by-region interactions can be dropped from the model.
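A model comparison table of this form is produced in R by passing the reduced and full fits to `anova()`. A self-contained sketch on synthetic data generated with no interaction effects, so the test should find the interactions dispensable (all variable names here are illustrative, not from the coal data):

```r
set.seed(4)
n  <- 300
x  <- runif(n, 0.5, 10)
d1 <- rbinom(n, 1, 0.5)
d2 <- rbinom(n, 1, 0.5)
# True model has no interaction terms
y  <- 1 + 3 * x^0.25 - d1 + 0.5 * d2 + rnorm(n, sd = 0.4)

full    <- lm(y ~ I(x^0.25) * (d1 + d2))   # adds the two interactions
reduced <- lm(y ~ I(x^0.25) + d1 + d2)

# Partial F-test on the 2 interaction degrees of freedom
anova(reduced, full)
```

The F statistic compares the drop in RSS from adding the interactions against the full model's residual mean square, exactly as in the 2-df test above.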

Residual Plots for Final Model