Regression Transformations for Normality and to Simplify Relationships U.S. Coal Mine Production – 2011 Source: www.eia.gov.

Slides:



Advertisements
Similar presentations
Linear regression models in R (session 1) Tom Price 3 March 2009.
Advertisements

Lecture 10 F-tests in MLR (continued) Coefficients of Determination BMTRY 701 Biostatistical Methods II.
BA 275 Quantitative Business Methods
5/18/ lecture 101 STATS 330: Lecture 10. 5/18/ lecture 102 Diagnostics 2 Aim of today’s lecture  To describe some more remedies for non-planar.
Regression with ARMA Errors. Example: Seat-belt legislation Story: In February 1983 seat-belt legislation was introduced in UK in the hope of reducing.
Some Terms Y =  o +  1 X Regression of Y on X Regress Y on X X called independent variable or predictor variable or covariate or factor Which factors.
Multiple Regression Predicting a response with multiple explanatory variables.
Zinc Data SPH 247 Statistical Analysis of Laboratory Data.
x y z The data as seen in R [1,] population city manager compensation [2,] [3,] [4,]
SPH 247 Statistical Analysis of Laboratory Data 1April 23, 2010SPH 247 Statistical Analysis of Laboratory Data.
Chapter 12 Multiple Regression
Regression Hal Varian 10 April What is regression? History Curve fitting v statistics Correlation and causation Statistical models Gauss-Markov.
Examining Relationship of Variables  Response (dependent) variable - measures the outcome of a study.  Explanatory (Independent) variable - explains.
Nemours Biomedical Research Statistics April 2, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.
7/2/ Lecture 51 STATS 330: Lecture 5. 7/2/ Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab,
MATH 3359 Introduction to Mathematical Modeling Linear System, Simple Linear Regression.
Crime? FBI records violent crime, z x y z [1,] [2,] [3,] [4,] [5,]
Checking Regression Model Assumptions NBA 2013/14 Player Heights and Weights.
Review Guess the correlation. A.-2.0 B.-0.9 C.-0.1 D.0.1 E.0.9.
How to plot x-y data and put statistics analysis on GLEON Fellowship Workshop January 14-18, 2013 Sunapee, NH Ari Santoso.
Regression Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
9 - 1 Intrinsically Linear Regression Chapter Introduction In Chapter 7 we discussed some deviations from the assumptions of the regression model.
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 1 Stats 330: Lecture 18.
BIOL 582 Lecture Set 19 Matrices, Matrix calculations, Linear models using linear algebra.
PCA Example Air pollution in 41 cities in the USA.
9/14/ Lecture 61 STATS 330: Lecture 6. 9/14/ Lecture 62 Inference for the Regression model Aim of today’s lecture: To discuss how we assess.
Lecture 15: Logistic Regression: Inference and link functions BMTRY 701 Biostatistical Methods II.
Analysis of Covariance Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
 Combines linear regression and ANOVA  Can be used to compare g treatments, after controlling for quantitative factor believed to be related to response.
7.1 - Motivation Motivation Correlation / Simple Linear Regression Correlation / Simple Linear Regression Extensions of Simple.
23-1 Analysis of Covariance (Chapter 16) A procedure for comparing treatment means that incorporates information on a quantitative explanatory variable,
Lecture 3: Inference in Simple Linear Regression BMTRY 701 Biostatistical Methods II.
23-1 Multiple Covariates and More Complicated Designs in ANCOVA (§16.4) The simple ANCOVA model discussed earlier with one treatment factor and one covariate.
Use of Weighted Least Squares. In fitting models of the form y i = f(x i ) +  i i = 1………n, least squares is optimal under the condition  1 ……….  n.
Regression and Analysis Variance Linear Models in R.
Collaboration and Data Sharing What have I been doing that’s so bad, and how could it be better? August 1 st, 2010.
Lecture 9: ANOVA tables F-tests BMTRY 701 Biostatistical Methods II.
Multiple Regression I 4/9/12 Transformations The model Individual coefficients R 2 ANOVA for regression Residual standard error Section 9.4, 9.5 Professor.
Regression Model Building LPGA Golf Performance
FACTORS AFFECTING HOUSING PRICES IN SYRACUSE Sample collected from Zillow in January, 2015 Urban Policy Class Exercise - Lecy.
1 Multiple Regression A single numerical response variable, Y. Multiple numerical explanatory variables, X 1, X 2,…, X k.
Exercise 1 The standard deviation of measurements at low level for a method for detecting benzene in blood is 52 ng/L. What is the Critical Level if we.
Lecture 11 Multicollinearity BMTRY 701 Biostatistical Methods II.
Tutorial 4 MBP 1010 Kevin Brown. Correlation Review Pearson’s correlation coefficient – Varies between – 1 (perfect negative linear correlation) and 1.
Lecture 7: Multiple Linear Regression Interpretation with different types of predictors BMTRY 701 Biostatistical Methods II.
Environmental Modeling Basic Testing Methods - Statistics III.
Lecture 6: Multiple Linear Regression Adjusted Variable Plots BMTRY 701 Biostatistical Methods II.
Determining Factors of GPA Natalie Arndt Allison Mucha MA /6/07.
Lecture 6: Multiple Linear Regression Adjusted Variable Plots BMTRY 701 Biostatistical Methods II.
Lecture 3 Linear Models II Olivier MISSA, Advanced Research Skills.
Linear Models Alan Lee Sample presentation for STATS 760.
Introduction to Multiple Regression Lecture 11. The Multiple Regression Model Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more.
EPP 245 Statistical Analysis of Laboratory Data 1April 23, 2010SPH 247 Statistical Analysis of Laboratory Data.
Stat 1510: Statistical Thinking and Concepts REGRESSION.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.Chap 14-1 Statistics for Managers Using Microsoft® Excel 5th Edition Chapter.
Tutorial 5 Thursday February 14 MBP 1010 Kevin Brown.
The Effect of Race on Wage by Region. To what extent were black males paid less than nonblack males in the same region with the same levels of education.
1 Analysis of Variance (ANOVA) EPP 245/298 Statistical Analysis of Laboratory Data.
1 Experimental Statistics - week 11 Chapter 11: Linear Regression and Correlation.
WSUG M AY 2012 EViews, S-Plus and R Damian Staszek Bristol Water.
Lecture 11: Simple Linear Regression
Résolution de l’ex 1 p40 t=c(2:12);N=c(55,90,135,245,403,665,1100,1810,3000,4450,7350) T=data.frame(t,N,y=log(N));T; > T t N y
CHAPTER 7 Linear Correlation & Regression Methods
Correlation and regression
Checking Regression Model Assumptions
Checking Regression Model Assumptions
Console Editeur : myProg.R 1
Regression with ARMA Errors
Regression Transformations for Normality and to Simplify Relationships
Linear regression Fitting a straight line to observations.
Presentation transcript:

Regression Transformations for Normality and to Simplify Relationships U.S. Coal Mine Production – 2011 Source:

Data Description Coal Mine Production and Labor Effort for all Mines Producing Over 100,000 short tons of Coal in 2011 Units: Mine (n = 691) Response: Coal Production (100,000s of tons) Predictor Variables:  Labor Effort (100,000s of Hours)  Surface Mine Dummy (1 if Surface Mine, 0 if Underground)  Appalachia Region Dummy (1 if Yes, 0 if Interior or Western)  Interior Region Dummy (1 if Yes, 0 if Appalachia or Western)  MinePrep Dummy (1 if Mine & Preparation Plant, 0 if Mine Only)

Model 1 – Non-Transformed with Interactions Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) < 2e-16 *** labor < 2e-16 *** surface *** appalachia < 2e-16 *** interior < 2e-16 *** mineprep e-05 *** I(labor * surface) < 2e-16 *** I(labor * appalachia) < 2e-16 *** I(labor * interior) < 2e-16 *** I(labor * mineprep) e-10 *** --- Residual standard error: on 681 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 9 and 681 DF, p-value: < 2.2e-16

Residual Plots – Not Pretty!

Box-Cox Transformation of Y Goal: Transform Y to Normality – Box-Cox Transformation (Power Transformation) Goal: Choose power that minimizes Error Sum of Squares (maximizes normal likelihood), typically evaluated over (-2,+2)

Plot of log-Likelihood vs Choose  – Logarithmic transformation: Y’ = ln(Y)

Model with Y’ = ln(Y) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) < 2e-16 *** labor e-07 *** surface appalachia < 2e-16 *** interior < 2e-16 *** mineprep e-07 *** I(labor * surface) < 2e-16 *** I(labor * appalachia) e-12 *** I(labor * interior) e-10 *** I(labor * mineprep) e-16 *** --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: on 681 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 9 and 681 DF, p-value: < 2.2e-16

Residual Plots with Y’ = ln(Y)

Evidence of possibly nonlinear relation between ln(Y) and X Consider power transformation of X

Box-Tidwell Transformation of X Goal: Power Transformation of X to make relation with (transformed, in this case) Y linear Classify variables as to be transformed (Labor), and variables not to be transformed (regional and mine type dummies) Can be computed in R with car package, along with a test of whether power = 1 (no transformation) > boxTidwell(logprod ~ labor, other.x=~surface + appalachia + interior + mineprep) Score Statistic p-value MLE of lambda Choose to make X’ = X 0.25 for labor (and labor interactions with regions and mine types

Full Model with Y’=ln(Y) and L’=L 0.25 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) *** labor < 2e-16 *** surface appalachia ** interior mineprep * I(labor25 * surface) e-05 *** I(labor25 * appalachia) I(labor25 * interior) I(labor25 * mineprep) Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: on 681 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: 493 on 9 and 681 DF, p-value: < 2.2e-16 Note that neither interaction of transformed labor and regional dummies (appalachia and interior) appear important – refit simpler model.

Reduced Model with Y’=ln(Y) and L’=L 0.25 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) e-12 *** labor < 2e-16 *** surface appalachia < 2e-16 *** interior e-14 *** mineprep * I(labor25 * surface) e-06 *** I(labor25 * mineprep) * --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: on 683 degrees of freedom Multiple R-squared: , Adjusted R-squared: F-statistic: on 7 and 683 DF, p-value: < 2.2e-16 Res.Df RSS Df Sum of Sq F Pr(>F) Drop the 2 interactions from the model

Residual Plots for Final Model