14-1 Transformations in Statistical Analysis

Outline:
- Assumptions of linear statistical models
- Types of transformations
- Alternatives to transformations

Model assumptions:
- Effect additivity
- Normality
- Homoscedasticity
- Independence
14-2 Order of Importance

Experimental analysis models (ANOVA):
1. Homoscedasticity
2. Normality
3. Additivity
4. Independence

Observational analysis models (Regression):
1. Additivity
2. Homoscedasticity
3. Normality
4. Independence

All four are so interrelated that which is “most” important may be immaterial!
14-3 Independence

When is this important?
- Measurements over time on the same individual: time series data (rainfall, temperature, etc.), repeated measures (split plots in time), growth curves.
- Measurements near each other in space: split-plot designs, spatial data.

How do I know it’s a problem?
- By design: how the data were collected.
- Temporal/spatial autocorrelation analysis.

Rectifying a dependence problem:
- Modify the type of model to be fitted to the data.
14-4 Homoscedasticity

How do I know I have a problem? Plot predicted (fitted) values versus residuals, and ask: what is the pattern of the spread in the residuals as the predicted values increase?
- Spread constant: acceptable.
- Spread increases: problem.
- Spread decreases then increases: problem.

[Figure: residual-versus-fitted plots illustrating constant, increasing, and decreasing-then-increasing spread]
14-5 What to do?
- Attempt a transformation.
- Weighted regression.
- Incorporate additional covariates.
- Non-linear regression.

Lack of homogeneity in regression: what to do if the spread of the residuals plotted versus X shows systematic structure? [Figure: residual patterns versus X] Need another x variable.
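The slides work in R and Minitab; purely as an illustrative sketch, here is weighted regression in Python with NumPy on hypothetical data whose noise spread grows with x. Weighting each point by the reciprocal of its (assumed known) variance downweights the noisy observations; the data-generating values 2.0 and 0.5 are invented for the example.

```python
import numpy as np

def wls_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x.

    Solves the weighted normal equations (X'WX) beta = X'Wy,
    with W = diag(w)."""
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# Heteroscedastic data: residual spread increases with x.
y = 2.0 + 0.5 * x + rng.normal(scale=0.1 * x)
# Weights = 1 / variance of each observation.
beta = wls_fit(x, y, w=1.0 / (0.1 * x) ** 2)
print(beta)  # close to the true (2.0, 0.5)
```

With correct weights, ordinary least squares applied to the reweighted problem recovers the coefficients efficiently despite the non-constant variance.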
14-6 Transforming the Response to Achieve Linearity

If a scatterplot of y versus x curves upward, proceed down on the scale (the ladder of powers, e.g. sqrt(y), log(y), 1/y) to choose a transformation.
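As a sketch of the idea above (in Python rather than the slides' R, with invented exponential-growth data): moving down the scale to log(y) turns an upward-curving relationship into an exactly linear one, which the correlation coefficient makes visible.

```python
import numpy as np

# Hypothetical data that curve upward: y grows exponentially in x.
x = np.linspace(1, 10, 50)
y = 0.8 * np.exp(0.4 * x)

# Linearity before and after transforming the response down the scale.
r_raw = np.corrcoef(x, y)[0, 1]          # noticeably below 1
r_log = np.corrcoef(x, np.log(y))[0, 1]  # exactly 1: log(y) is linear in x
print(round(r_raw, 3), round(r_log, 3))
```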
14-8 Handling Heterogeneity

Flowchart:
- Regression? If no (ANOVA): compute group means. If yes: fit the linear model and plot the residuals.
- Test for homoscedasticity: if accepted, OK; if rejected, transform the observations.
- Type of transformation: Box/Cox family, power family, or a traditional transformation.
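The Box/Cox branch of the flowchart can be sketched in Python with SciPy (the slides themselves use R/Minitab; the lognormal data here are invented). `scipy.stats.boxcox` picks the power parameter lambda of the family ((y**lam - 1)/lam, with log(y) at lam = 0) by maximum likelihood; for right-skewed lognormal data the estimate should land near 0, i.e. a log transform.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, positive observations (lognormal).
rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.8, size=500)

# MLE of the Box-Cox lambda; transformed data are returned as well.
y_transformed, lam = stats.boxcox(y)
print(round(lam, 2))  # near 0, i.e. the log transform
```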
14-9 Transformations to Achieve Normality

Flowchart:
- Regression? If no (ANOVA): estimate group means. If yes: fit the linear model.
- Are the residuals normal (Q-Q plot, formal tests)? If yes, OK; if no, transform or consider a different model.
14-10 Transformations to Achieve Normality

How can we determine if observations are normally distributed?
- Graphical examination: normal quantile-quantile plot (Q-Q plot); histogram or boxplot.
- Goodness-of-fit tests: Kolmogorov-Smirnov test; Shapiro-Wilk test; D’Agostino’s test.
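A minimal sketch of the formal tests in Python with SciPy (the course software is R/Minitab; the two samples below are simulated for illustration): the Shapiro-Wilk test should leave a normal sample unflagged while strongly rejecting a skewed exponential one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)

# Shapiro-Wilk: small p-value is evidence against normality.
_, p_norm = stats.shapiro(normal_sample)
_, p_skew = stats.shapiro(skewed_sample)
print(p_norm, p_skew)  # p_skew is essentially 0

# D'Agostino's test (skewness + kurtosis) gives a second opinion.
_, p_dag = stats.normaltest(skewed_sample)
print(p_dag)
```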
14-11 Non-normal! So what?
- Only very skewed distributions will have a marked effect on the significance level of the F-test for the overall model or model effects.
- Often the same transformations used to achieve homoscedasticity will produce more normal-looking observations (residuals).

Transformations to Achieve Model Simplicity
GOAL: to provide as simple a mathematical form as possible for the relationship between the response and the explanatory variables. This may require transforming both response and explanatory variables.
14-12 Alternative Models

Alternatives to regular least squares (low complexity), roughly in order of increasing complexity:
- Weighted least squares
- Generalized linear models
- Non-linear regression
- Non-parametric methods
14-13 Example: Predicting brain weight from body weight in mammals via SLR

Data are average brain weight (Y, g) and body weight (X, kg) for 62 species of mammals (2 omitted). Source: Allison & Cicchetti (1976), Science.

[Table: species (common name) with body and brain weights, including the arctic fox, owl monkey, horse, kangaroo, human, African elephant, Asian elephant, and chimpanzee; the tree shrew and red fox are marked "Omit".]
14-14 A scatterplot of the raw data is non-informative: most species have small weights compared to the elephants. Viewing only those mammals with body weight below 300 kg suggests transforming to a log scale to linearize the relationship.
14-15 On the log scale the scatterplot looks linear. The fitted regression equation is: [equation lost in extraction]. Body weight is a very significant predictor of brain weight (p-value < 0.0001). Also, R² = 0.922.
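The slide fits this model in R; the same log-log simple linear regression can be sketched in Python with NumPy. The data below are a synthetic stand-in (the real Allison & Cicchetti values are not reproduced here), generated from an assumed power law brain = a * body^b so that the fit on the log scale should recover the planted coefficients.

```python
import numpy as np

# Synthetic stand-in for the mammal data: brain = a * body^b with
# multiplicative noise, so log(brain) is linear in log(body).
rng = np.random.default_rng(3)
body = np.exp(rng.uniform(-3, 8, size=60))  # kg, shrew-sized to elephant-sized
brain = 10.0 * body ** 0.75 * np.exp(rng.normal(scale=0.3, size=60))  # g

# SLR on the log-log scale: log(brain) = b0 + b1 * log(body).
b1, b0 = np.polyfit(np.log(body), np.log(brain), 1)
print(round(b0, 2), round(b1, 2))  # near log(10) ~= 2.30 and 0.75
```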
14-16 The residual plot shows no obvious violations of the zero-mean and constant-variance assumptions. The Q-Q plot demonstrates that the normality assumption for the residuals is plausible. [Plot point labels: human, opossum]
14-17 Checking for influential observations (R)

> fm <- lm(log(y) ~ log(x))
> influence.measures(fm)
Influence measures of lm(formula = log(y) ~ log(x)):
   dfb.1.  dfb.lg..  dffit  cov.r  cook.d  hat  inf
[numeric values lost in extraction; observations flagged (*) as influential
included the shrew, Asian elephant, human, African elephant, opossum,
rhesus monkey, and brown bat]

In MTB: Stat > Regression > Regression > Storage
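The slide uses R's influence.measures; as a rough Python sketch (not R's full set of diagnostics), two of those measures, the hat (leverage) values and Cook's distances, can be computed directly from the hat matrix for a simple linear regression. The data are hypothetical, with one outlier planted so that its Cook's distance stands out.

```python
import numpy as np

def influence(x, y):
    """Hat (leverage) values and Cook's distances for y = b0 + b1*x."""
    X = np.column_stack([np.ones_like(x), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverages
    resid = y - H @ y                          # raw residuals
    p = X.shape[1]
    s2 = resid @ resid / (len(y) - p)          # residual variance estimate
    # Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    cooks = resid**2 / (p * s2) * h / (1 - h) ** 2
    return h, cooks

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=30)
y[7] += 8.0                                    # plant one gross outlier
h, cooks = influence(x, y)
print(int(np.argmax(cooks)))                   # the planted point dominates
```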
14-18 Decision: leave out man (he doesn’t really fit in with the rest of the mammals) and re-run the analysis.

Feature    Full Model                    Omit Human
R-Sq       [lost in extraction]          [lost in extraction]
Slope      [lost in extraction]          [lost in extraction]
p-value    < [lost in extraction]        < [lost in extraction]

Even though the results don’t change much, we will go with this last model: [equation lost in extraction].
14-19 This illustrates the idea of cross-validation in regression: it is often recommended that the data be split into two (equal?) portions, using one for model fitting and the other for model checking/verification.

Predicting the brain weights of the omitted mammals (R):

> xh <- x[-32]; yh <- y[-32]    # drop the human (observation 32)
> fmh <- lm(log(yh) ~ log(xh))
> new <- data.frame(xh = c(0.104, 4.235))
> predict(fmh, newdata = new, interval = "prediction")
  [fit/lwr/upr values lost in extraction]
> exp(predict(fmh, newdata = new, interval = "prediction"))
  [fit/lwr/upr values lost in extraction]

Exponentiate final results!

Mammal       Predicted Brain Wt      Prediction Interval    Actual Brain Wt
Tree Shrew   [lost in extraction]    (0.396, 5.667)         [lost in extraction]
Red Fox      [lost in extraction]    (6.359, [lost])        [lost in extraction]
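The R call predict(..., interval="prediction") above can be mirrored from first principles in Python (the data here are a hypothetical log-log fit, not the mammal values). The key point the slide stresses carries over: the interval is built on the log scale and exponentiated at the end. The formula used is the standard SLR prediction interval, fit ± t * s * sqrt(1 + 1/n + (x0 - x̄)²/Sxx).

```python
import numpy as np
from scipy import stats

# Hypothetical log-log fit (a stand-in for the mammal model).
rng = np.random.default_rng(5)
logx = rng.uniform(-2, 6, size=40)
logy = 2.1 + 0.75 * logx + rng.normal(scale=0.4, size=40)

n = len(logx)
b1, b0 = np.polyfit(logx, logy, 1)
resid = logy - (b0 + b1 * logx)
s = np.sqrt(resid @ resid / (n - 2))           # residual standard error
xbar = logx.mean()
sxx = ((logx - xbar) ** 2).sum()

def pred_interval(x0, level=0.95):
    """Prediction interval for a new response at log body weight x0."""
    fit = b0 + b1 * x0
    se = s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    t = stats.t.ppf(0.5 + level / 2, df=n - 2)
    return fit - t * se, fit + t * se

lo, hi = pred_interval(np.log(0.104))  # e.g. a tree-shrew-sized body weight
print(np.exp(lo), np.exp(hi))          # exponentiate final results!
```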
14-20 Predicting the brain weights of the omitted mammals (MTB). [Screenshot of the Minitab dialog; influence measures can be selected here.]
14-21 MTB output (with man)

The regression equation is
  lbrain = [coefficients lost in extraction] lbody

Predictor   Coef   SE Coef   T   P
Constant    [values lost in extraction]
lbody       [values lost in extraction]

S = [lost]   R-Sq = 92.2%   R-Sq(adj) = 92.0%

Analysis of Variance
Source           DF   SS   MS   F   P
Regression       [values lost in extraction]
Residual Error   [values lost in extraction]
Total            [values lost in extraction]

Unusual Observations
Obs   lbody   lbrain   Fit   SE Fit   Residual   St Resid
[values lost in extraction; flags shown were R and X]
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations
New Obs   Fit      SE Fit   95% CI               95% PI
1         [lost]   [lost]   (0.1249, [lost])     ([lost], [lost])
2         [lost]   [lost]   (3.0201, [lost])     ([lost], [lost])

Only available influence measures are: standard/student residuals; hat matrix; Cook’s distance; and DFFITS.