Transforming the data Modified from: Gotelli and Allison 2004. Chapter 8; Sokal and Rohlf 2000 Chapter 13

What is a transformation? It is a mathematical function applied to all the observations of a given variable: Y* = f(Y), where Y represents the original variable, Y* is the transformed variable, and f is a mathematical function that is applied to the data

Most are monotonic: monotonic functions do not change the rank order of the data, but they do change the relative spacing of the observations, and therefore affect the variance and shape of the probability distribution
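As a quick numeric illustration (a Python sketch, not part of the original slides): a monotonic transformation such as the logarithm preserves the rank order of the observations while compressing the spacing between the large values.

```python
import numpy as np

y = np.array([1.0, 10.0, 100.0, 1000.0])
y_log = np.log10(y)  # a monotonic transformation

# Rank order is unchanged...
print(np.argsort(y).tolist() == np.argsort(y_log).tolist())  # True

# ...but the relative spacing changes: raw gaps explode,
# log-scale gaps are uniform (approximately [1, 1, 1])
print(np.diff(y).tolist())
print(np.diff(y_log).tolist())
```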

There are two legitimate reasons to transform your data before analysis: (1) the patterns in the transformed data may be easier to understand and communicate than patterns in the raw data; (2) the transformation may be necessary for the analysis to be valid

They are often useful for converting curves into straight lines: The logarithmic function is very useful when two variables are related to each other by multiplicative or exponential functions

Logarithmic (X):

Example: Asi’s growth (50% each year)

year  weight
 1      10.0
 2      15.0
 3      22.5
 4      33.8
 5      50.6
 6      75.9
 7     113.9
 8     170.9
 9     256.3
10     384.4
11     576.7
12     865.0
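As a sketch (Python, not part of the original slides), reconstructing the table above shows why the logarithm is the right transformation here: constant 50% growth means the log-transformed weights increase by a constant step, log(1.5), so the curve becomes a straight line.

```python
import math

# Asi's weight: starts at 10.0 and grows 50% per year, for 12 years
weights = [10.0 * 1.5 ** (year - 1) for year in range(1, 13)]

# On the log scale, successive differences are constant: log(1.5)
log_w = [math.log(w) for w in weights]
steps = [b - a for a, b in zip(log_w, log_w[1:])]
print(all(math.isclose(s, math.log(1.5)) for s in steps))  # True
```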

Exponential:

Example: Species richness in the Galapagos Islands

Power:

Statistics and transformation Data to be analyzed using analysis of variance must meet two assumptions: The data must be homoscedastic: variances of treatment groups need to be approximately equal. The residuals, or deviations from the mean, must be normal random variables
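The slides do not show how to check these two assumptions in practice; a hedged sketch in Python using scipy (the group data here are made up for illustration) applies Levene's test for homoscedasticity and a Shapiro-Wilk test for normality of the residuals:

```python
import numpy as np
from scipy import stats

# Three hypothetical treatment groups (illustrative numbers only)
g1 = [4.1, 5.0, 4.8, 5.3, 4.6]
g2 = [6.2, 5.8, 6.5, 6.0, 6.3]
g3 = [5.1, 5.5, 4.9, 5.4, 5.2]

# Homoscedasticity: Levene's test (H0: the group variances are equal)
lev_stat, lev_p = stats.levene(g1, g2, g3)

# Normality: Shapiro-Wilk test on the deviations from the group means
residuals = np.concatenate([np.asarray(g) - np.mean(g) for g in (g1, g2, g3)])
sw_stat, sw_p = stats.shapiro(residuals)

print(f"Levene p = {lev_p:.3f}, Shapiro-Wilk p = {sw_p:.3f}")
```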

Let's look at an example. A single variate of the simplest type of ANOVA (completely randomized, single classification) decomposes as follows: Yij = μ + αi + εij. In this model the components are additive, with the error term εij distributed normally

However… We might encounter a situation in which the components are multiplicative in effect, where Yij = μ · αi · εij. If we fitted a standard ANOVA model, the observed deviations from the group means would lack normality and homoscedasticity

The logarithmic transformation We can correct this situation by transforming our model into logarithms. Wherever the mean is positively correlated with the variance, the logarithmic transformation is likely to remedy the situation and make the variance independent of the mean

We would obtain log Yij = log μ + log αi + log εij, which is additive and homoscedastic
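The algebra behind this step is simply that the log of a product is the sum of the logs, log(a·b·c) = log a + log b + log c; a one-line check (Python, illustrative numbers):

```python
import math

mu, alpha, eps = 2.0, 1.3, 0.9   # illustrative multiplicative components
y = mu * alpha * eps

# Taking logs turns the multiplicative model into an additive one
lhs = math.log(y)
rhs = math.log(mu) + math.log(alpha) + math.log(eps)
print(math.isclose(lhs, rhs))  # True
```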

The square root transformation It is used most frequently with count data. Such data are likely to be Poisson distributed rather than normally distributed. In the Poisson distribution the variance is the same as the mean. Transforming the variates to square roots generally makes the variances independent of the means for this type of data. When counts include zero values, it is desirable to code all variates by adding 0.5, i.e., to use √(Y + 0.5).
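A minimal sketch (Python, not from the original slides) of the coded square-root transform √(Y + 0.5) applied to counts that include zeros:

```python
import math

counts = [0, 1, 2, 4, 9, 16]  # hypothetical count data, including a zero

# Adding 0.5 before the square root keeps zero counts well-behaved
transformed = [math.sqrt(y + 0.5) for y in counts]

print(transformed[0] == math.sqrt(0.5))  # True: the zero maps to sqrt(0.5)
print([round(t, 3) for t in transformed])
```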

The Box-Cox transformation Often one does not have an a priori reason for selecting a specific transformation. Box and Cox (1964) developed a procedure for estimating the best transformation to normality within the family of power transformations: Y* = (Y^λ − 1)/λ for λ ≠ 0, and Y* = ln Y for λ = 0

The Box-Cox transformation The value of λ that maximizes the log-likelihood function L = −(v/2) ln s²T + (λ − 1)(v/n) Σ ln Y yields the best transformation to normality within the family of transformations. Here s²T is the variance of the transformed values (based on v degrees of freedom), and the second term involves the sum of the natural logarithms of the untransformed values

Box-Cox in R (for a vector of data Y):
> library(MASS)
> lamb <- seq(0, 2.5, 0.5)
> boxcox(Y ~ 1, lambda = lamb, plotit = TRUE)
> library(car)
> transform_Y <- bcPower(Y, 0.5)  # apply the lambda chosen from the plot (0.5 here as an example)
What do you conclude from this plot? Read more in Sokal and Rohlf 2000, page 417
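For readers working outside R, the same transformation is available in Python via scipy.stats.boxcox (a sketch, not from the original slides). Fixing λ recovers the familiar special cases, and leaving it unset estimates λ by maximum likelihood:

```python
import numpy as np
from scipy import stats

y = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # data must be strictly positive

# lambda = 0 is defined as the natural-log limit of (y**lam - 1) / lam
print(np.allclose(stats.boxcox(y, lmbda=0), np.log(y)))   # True
# lambda = 1 merely shifts the data by -1
print(np.allclose(stats.boxcox(y, lmbda=1), y - 1))       # True

# With lmbda unset, boxcox estimates lambda by maximum likelihood
y_t, lam_hat = stats.boxcox(y)
print(f"estimated lambda = {lam_hat:.2f}")
```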

The arcsine transformation Also known as the angular transformation, p* = arcsin √p. It is especially appropriate for percentages and proportions

The arcsine transformation [Figure: transformed data plotted against the original proportions.] It is appropriate only for data expressed as proportions
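A quick numeric sketch (Python, not from the original slides) of the angular transformation p* = arcsin √p on proportion data:

```python
import math

proportions = [0.0, 0.1, 0.5, 0.9, 1.0]  # data must lie in [0, 1]

# Angular transformation: arcsine of the square root, in radians
transformed = [math.asin(math.sqrt(p)) for p in proportions]

print(math.isclose(transformed[2], math.pi / 4))   # arcsin(sqrt(0.5)) = pi/4
print(math.isclose(transformed[-1], math.pi / 2))  # arcsin(sqrt(1)) = pi/2
```

Note that the transformation maps the closed interval [0, 1] onto [0, π/2], pulling apart proportions near 0 and 1 where the binomial variance is smallest.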

Since the transformations discussed are NON-LINEAR, confidence limits computed on the transformed scale and then changed back to the original scale will be asymmetrical

Evaluating Ecological Responses to Hydrologic Changes in a Payment-for-environmental-services Program on Florida Ranchlands Patrick Bohlen, Elizabeth Boughton, John Fauth, David Jenkins, Pedro Quintana-Ascencio, Sanjay Shukla and Hilary Swain G08K10487

Palaez Ranch Wetland Water Retention

Call:
glm(formula = mosqct ~ depth + depth^2, data = pointdata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
  -28.1    -27.4    -25.5    -19.5   6388.3

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  19.40214   42.98439   0.451    0.652
depth         1.11973    4.07803   0.275    0.784
depth^2      -0.03597    0.07907  -0.455    0.649

(Dispersion parameter for gaussian family taken to be 117582.7)

    Null deviance: 77787101  on 663  degrees of freedom
Residual deviance: 77722147  on 661  degrees of freedom
AIC: 9641.5

Number of Fisher Scoring iterations: 2

Call:
zeroinfl(formula = mosqct ~ depth + depth^2, data = pointdata, dist = "poisson", EM = TRUE)

Pearson residuals:
       Min         1Q     Median         3Q        Max
-6.765e-01 -5.630e-01 -5.316e-01 -4.768e-01  9.393e+05

Count model coefficients (poisson with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.8550639  0.0498470   37.22   <2e-16 ***
depth        0.4364981  0.0064139   68.06   <2e-16 ***
depth^2     -0.0139134  0.0001914  -72.68   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
(Intercept)  0.6400910  0.3274469   1.955  0.05061 .
depth        0.0846763  0.0371673   2.278  0.02271 *
depth^2     -0.0027798  0.0009356  -2.971  0.00297 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 1
Log-likelihood: -4.728e+04 on 6 Df
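The comparison above shows why a zero-inflated model suits these mosquito counts better than a Gaussian glm. The diagnostic idea can be sketched in Python with simulated data (all numbers here are hypothetical, not the ranch data): counts with structural zeros show far more zeros than a plain Poisson with the same mean would predict.

```python
import math
import random

random.seed(42)

mu = 3.0        # Poisson mean for the "count" process
p_zero = 0.3    # probability of a structural (excess) zero
n = 1000

def poisson_draw(lam):
    # Knuth's multiplication algorithm for a Poisson random variate
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Zero-inflated draws: with prob p_zero emit 0, otherwise a Poisson count
counts = [0 if random.random() < p_zero else poisson_draw(mu) for _ in range(n)]

observed_zero_frac = sum(c == 0 for c in counts) / n
poisson_zero_frac = math.exp(-mu)   # P(Y = 0) under a plain Poisson(mu)

print(f"observed zeros: {observed_zero_frac:.3f}, "
      f"plain Poisson predicts: {poisson_zero_frac:.3f}")
```

When the observed zero fraction greatly exceeds the Poisson prediction, as it does here by construction, a zero-inflated model (such as R's zeroinfl above) is the natural choice.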