Polynomial Regression and Transformations STA 671 Summer 2008.

Review The estimated residuals e_1,…,e_n provide the best method for checking the assumptions. Remember that the model errors are assumed ε_i ~ N(0, σ); the estimated residuals should behave similarly. In a residual plot, you are looking for outliers, curvature, or changing variance. In this lecture we discuss polynomial regression and transformations, two separate methods. Both are possible solutions to curvature, and transformations have the added benefit that they sometimes address changing variance as well.
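The residual check described above takes only a few lines of code. The course materials use SAS; this is just a minimal numpy sketch with invented data.

```python
import numpy as np

# Invented data roughly following a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of Y = b0 + b1*X.
b1, b0 = np.polyfit(x, y, deg=1)

fitted = b0 + b1 * x
residuals = y - fitted   # the estimated residuals e_1, ..., e_n

# For a least-squares line with an intercept, the residuals sum to zero;
# checking assumptions means plotting them against x (or the fitted values)
# and looking for outliers, curvature, or changing variance.
```

If the assumptions hold, a plot of `residuals` against `x` should show an unstructured band around zero.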

Recall the Hooker data. There appears to be a small amount of curvature.

This curvature is seen more clearly in the residual plot.

Polynomial regression – one method for dealing with curvature. To account for curvature, we can perform something called “polynomial regression”, which consists of fitting a polynomial (typically a quadratic or cubic) instead of a line. Recall the linear model was Y_i = β_0 + β_1 X_i + ε_i. The quadratic model is Y_i = β_0 + β_1 X_i + β_2 X_i^2 + ε_i. The cubic model is Y_i = β_0 + β_1 X_i + β_2 X_i^2 + β_3 X_i^3 + ε_i. The higher the order of the polynomial, the more curvature it can account for.
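As a sketch of what fitting these three models involves, here is a minimal numpy version using an explicit design matrix; the data are invented for illustration and are not the Hooker data.

```python
import numpy as np

# Invented data with a mild quadratic trend plus fixed "noise".
x = np.array([180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0])
y = 0.01 * x**2 - 2.0 * x + 120.0 + np.array([0.3, -0.2, 0.1, 0.0, -0.1, 0.2, -0.3])

def fit_poly(x, y, order):
    """Least-squares coefficients b0..b_order for Y = b0 + b1*X + ... + error."""
    X = np.vander(x, order + 1, increasing=True)  # columns: 1, X, X^2, ...
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

linear = fit_poly(x, y, 1)     # Y = b0 + b1*X
quadratic = fit_poly(x, y, 2)  # adds b2*X^2
cubic = fit_poly(x, y, 3)      # adds b3*X^3
```

Because the models are nested, the cubic fit can never have a larger error sum of squares than the quadratic; that is exactly why order is chosen with p-values and residual plots rather than raw fit.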

Quadratic model accounts for the curvature The fitted quadratic equation has the form Pressure = b_0 + b_1·Temp + b_2·Temp^2. If the quadratic model is better than the linear model, what about a cubic?

A cubic model produces no visual improvement. The fitted equation has the form Pressure = b_0 + b_1·Temp + b_2·Temp^2 + b_3·Temp^3.

Which to choose, quadratic or cubic? In general, choose the LOWEST order polynomial possible (i.e. prefer linear to quadratic, and quadratic to cubic). This reflects 1) “Occam’s razor”, the principle that simpler models are preferred, and 2) the fact that the higher the order, the more parameters there are to estimate. Statistically, it’s easier to estimate a few parameters than many.

P-values for selecting order The regression output provides a formal method for selecting the order of the polynomial, and this method typically agrees with looking at the residual plot. The output provides a p-value for each term in the regression, but the p-value for the highest-order term is the ONLY one that is used.

Using p-values to select order
1. Begin by fitting the cubic model. If the cubic term is significant, use the cubic model (you can consider higher-order models, but we do not in STA 671).
2. If the cubic term is NOT significant, remove it and RERUN the model (p-values change depending on which terms are in the model), then check whether the quadratic term is significant.
3. If the quadratic term is not significant, remove it and RERUN the model, resulting in a linear regression.
4. If none of these models produces a reasonable residual plot, you may need another method.
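The backward-selection steps above can be sketched programmatically. The course output comes from SAS; the following is a hypothetical numpy/scipy illustration on simulated data, with the t-test for the highest-order term computed by hand.

```python
import numpy as np
from scipy import stats

def highest_order_pvalue(x, y, order):
    """Two-sided p-value for the highest-order coefficient of a degree-`order` fit."""
    X = np.vander(x, order + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    df = len(x) - (order + 1)
    mse = resid @ resid / df                 # estimate of sigma^2
    cov = mse * np.linalg.inv(X.T @ X)       # covariance matrix of the coefficients
    t = coef[-1] / np.sqrt(cov[-1, -1])
    return 2 * stats.t.sf(abs(t), df)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 1.0, x.size)  # truly quadratic

# Start with the cubic model; while the top term is not significant,
# drop it and RERUN (the p-values change with each refit).
order = 3
while order > 1 and highest_order_pvalue(x, y, order) > 0.05:
    order -= 1
print("selected polynomial order:", order)
```

On data like this, the quadratic term is essentially always retained, while the null cubic term clears the 5% bar and gets dropped about 95% of the time.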

For the boiling point data We first run the cubic model and obtain the following p-values. The p-value for the cubic term is not significant, so we remove the cubic term and RERUN the model (do NOT also remove the quadratic term based on its p-value in the cubic model).

Parameter Estimates (SAS output): rows for Intercept, Temperature, Temperature_2 (2nd power of TEMPERATURE), and Temperature_3 (3rd power of TEMPERATURE), with columns DF, Estimate, Standard Error, t Value, Pr > |t|, and Type I SS. The numeric entries did not survive transcription.

Quadratic model for boiling point data The quadratic model produces the following p-values. The quadratic term is significant AND we observe a reasonable residual plot, so we stop here. This is our final model.

Parameter Estimates (SAS output): rows for Intercept, Temperature, and Temperature_2 (2nd power of TEMPERATURE); each p-value appears as < .0001. The remaining numeric entries did not survive transcription.

What if no polynomial model produces a reasonable residual plot? If none of our polynomial models produces a reasonable residual plot, we need another method. One method to try is transforming the response variable. Transformations, like polynomial regression, can handle curvature, and in addition they have the potential to handle changing spread as well.

Example - the ethanol data The data come from an engine exhaust study. NOx is a measure of the exhaust from the engine, while E is a measure of the fuel/air mixture (high values are almost all fuel, low values are almost all air). A cubic model does not fit the data; a quadratic or linear model would do worse.

A cubic fit to the ethanol data (scatterplot and residual plot). The residual plot shows clear curvature.

Transformations Instead of fitting Y as the response variable, we fit a function of Y as the response variable. Thus, instead of Y_i = β_0 + β_1 X_i + ε_i, you can fit log(Y_i) = β_0 + β_1 X_i + ε_i, or sqrt(Y_i) = β_0 + β_1 X_i + ε_i, or cbrt(Y_i) = β_0 + β_1 X_i + ε_i, etc. This greatly expands the set of models you can fit. You can transform the X variable as well, but in the interest of time we do not discuss that in detail in STA 671.
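Mechanically, fitting a transformed response just means replacing the Y column before running the same least-squares fit. A minimal numpy sketch with invented data (an exact exponential trend, so that log(Y) is exactly linear in X):

```python
import numpy as np

x = np.linspace(1.0, 10.0, 30)
y = np.exp(0.8 + 0.3 * x)   # invented data: log(Y) = 0.8 + 0.3*X exactly

# Fit log(Y) = b0 + b1*X instead of Y = b0 + b1*X.
b1, b0 = np.polyfit(x, np.log(y), deg=1)   # recovers b0 = 0.8, b1 = 0.3

# The other transformations mentioned above work the same way:
b1_sqrt, b0_sqrt = np.polyfit(x, np.sqrt(y), deg=1)
b1_cbrt, b0_cbrt = np.polyfit(x, np.cbrt(y), deg=1)
```

Here only the log transform gives a perfect straight-line fit; the sqrt and cbrt responses are still exponential in X, so their residual plots would still show curvature.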

Transformations allow different error structures. A quadratic regression looks like Y_i = β_0 + β_1 X_i + β_2 X_i^2 + ε_i; at any particular X, the variance is the same. Taking the square root transformation, sqrt(Y_i) = β_0 + β_1 X_i + ε_i, means that Y_i = [β_0 + β_1 X_i + ε_i]^2 = β_0^2 + 2β_0β_1 X_i + β_1^2 X_i^2 + ε_i^2 + 2β_0 ε_i + 2β_1 X_i ε_i. There is a quadratic relationship between X and Y. Note the multiplication between X_i and ε_i; this allows the variance to change with X_i. Thus, in addition to handling curvature, transformations allow you to address changing variance.
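The algebra above, written out as a display:

```latex
\sqrt{Y_i} = \beta_0 + \beta_1 X_i + \varepsilon_i
\;\Longrightarrow\;
Y_i = (\beta_0 + \beta_1 X_i + \varepsilon_i)^2
    = \beta_0^2 + 2\beta_0\beta_1 X_i + \beta_1^2 X_i^2
      + 2(\beta_0 + \beta_1 X_i)\,\varepsilon_i + \varepsilon_i^2
```

Grouped this way, the error enters multiplied by (β_0 + β_1 X_i), so its contribution to the spread of Y_i grows with the mean of Y_i.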

Prototypical Data requiring transformation

After square root transformation

Which transformation? There are no hard and fast rules about which transformation to try, and no guaranteed method for finding a good one (for some data sets you never seem to find a great fit). Usually you have to proceed by trial and error, and remember that you can combine polynomial regression with transformation. Thus, for example, you can fit a cubic model in X to predict log(Y).

Some “typical” transformations If you have area data, a square root transform is often useful (it converts area to something proportional to a radius or length). Similarly, with volume data a cube root transformation may be appropriate. With financial data (incomes, etc.), a log transform may be appropriate. Logs change percentage increases into constant increases: if a unit increase in X results in a 10% increase in Y, it also results in a constant increase of log(1.1) ≈ 0.095 in log(Y).
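Logs turn a constant percentage change into a constant additive change: a 10% increase in Y adds log(1.1) ≈ 0.095 to log(Y), no matter how large Y is. A quick stdlib-only check:

```python
import math

# A 10% increase in Y adds the same constant, log(1.1), to log(Y),
# regardless of the starting value of Y.
for y in (100.0, 5000.0, 2_000_000.0):
    jump = math.log(1.1 * y) - math.log(y)
    print(round(jump, 4))   # prints 0.0953 each time
```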

A general strategy
1. Fit the raw data (X and Y) with a least-squares line. If you get a good residual plot, stop and be happy.
2. If not, try a polynomial regression (quadratic or cubic). If one of these fits, stop and be happy (remember, fit the smallest model possible).
3. If polynomial regression does not work, try transforming Y to log, sqrt, and cube root (i.e. perform three more regressions). Fit a cubic polynomial regression on each of these and choose the transformation that provides the best residual plot.
4. If none of those work, then regression might not be effective (there are more advanced techniques), or you may have to start transforming X as well. This becomes true trial and error; consult your friendly local statistician.

Back to the ethanol data. We can see from the scatterplot that E and NOX are not linearly related. We tried a cubic regression and that didn’t work. Now off to the transformations. We fit cubic regressions with log(Y), sqrt(Y), and cbrt(Y) as the response variables. We may be able to get satisfactory results with something less than cubic, but if cubic doesn’t work, the lower-order models won’t either, so we start with cubic models.

Square root transformation (scatterplot and residual plot). Still clear curvature.

Cube root transformation. Improved, but still some curvature.

Log transformation. Still some lack of fit, but best of the bunch.

The log transform is not perfect, but it is the best we can do right now (I encourage you to play with the data on your own). After choosing the log transformation on the basis of the best residual plot (and deciding it is “ok”, if certainly not a great residual plot), we look at the p-value for the cubic term to see if we can remove it. We can.

Parameter Estimates (SAS output): rows for Intercept, E, E_2 (2nd power of E), and E_3 (3rd power of E), with columns DF, Estimate, Standard Error, t Value, Pr > |t|, and Type I SS. The numeric entries did not survive transcription; the cubic term’s p-value is not significant.

Quadratic model for log(NOX) The quadratic model produces almost identical scatter and residual plots. The quadratic term is significant, so this is our final model.

Parameter Estimates (SAS output): rows for Intercept, E, and E_2 (2nd power of E); each p-value appears as < .0001. The remaining numeric entries did not survive transcription.

Extras There are more advanced ways of dealing with polynomial regression and transformations, which we do not address in STA 671. Polynomial regression can be extended to more general curved models, such as splines (piecewise polynomials with desirable smoothness properties). Transformations can be selected automatically using the Box-Cox procedure, which determines an appropriate exponent for transforming your data (with a tradeoff of some interpretability). Consult your friendly local statistician.