Transformations: Getting Normal or Using the Linear Model

Two Reasons to Transform
–Variables do not fit a normal distribution and parametric tests are desired
–A relationship between two variables is non-linear, but a transformation would allow the use of linear regression

Non-Normal Data
Reasons real data can fail to follow a normal distribution:
–Errors in measurement are multiplicative rather than additive, e.g. ±2% rather than ±2 mm
–Constraints on the dimensions of an artifact feature are not symmetrical, e.g. point length must exceed haft length but can be as long as the material allows

Non-Normal Data (continued)
–Measurements are products rather than sums of other measurements, e.g. area, volume
–Counts follow binomial, Poisson, or negative binomial distributions, which are often asymmetrical unless sample sizes are large

Solutions
–Use non-parametric methods that do not depend on the normality of the data (increasingly easy to do)
–Use data transformations that shift the distribution to one that is approximately normal
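A minimal sketch of the two options in R, using an invented right-skewed variable (the data and group names are simulated for illustration, not from the course data sets):

set.seed(42)
groupA <- rlnorm(30, meanlog = 2.0, sdlog = 0.5)   # log-normal, right skewed
groupB <- rlnorm(30, meanlog = 2.3, sdlog = 0.5)
# Option 1: a non-parametric test that does not assume normality
wilcox.test(groupA, groupB)
# Option 2: transform toward normality, then use a parametric test
t.test(log(groupA), log(groupB))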

Transformation
–The goal is to change the spacing of the data: compress a long tail and draw out a flat one
–The transformation must preserve the order of the original data – we only change the spacing between data points
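As a quick illustration with invented values, a log transform compresses the long upper tail while leaving the observations in the same order:

x <- c(1, 2, 5, 20, 100, 1000)          # long tail to the right
log10(x)                                 # 0, 0.30, 0.70, 1.30, 2, 3 -- tail compressed
identical(order(x), order(log10(x)))     # TRUE: the rank order is unchanged
diff(x); diff(log10(x))                  # spacing before and after the transform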

Transformation
–Right-skewed data with many zeros cannot be transformed effectively, since nothing can stretch out observations that have the same value
–e.g. artifact counts by site or grid square are often Poisson distributed with many zeros
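A short simulated illustration (invented data): no order-preserving transformation can separate observations that are all exactly zero.

set.seed(1)
counts <- rpois(200, lambda = 0.5)   # Poisson counts with many zeros
table(counts)                        # a large spike of identical zero values
sqrt(counts)[counts == 0]            # still all zero after a square root
# log(counts) is not even usable at zero (log(0) is -Inf)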

An Example
–Using the DartPoints data set, we saw that Length was asymmetrical
–Plot the kernel density of Length with and without a log scale to see the difference
–To transform Length we would use: logLength <- log(DartPoints$Length)

# Density of Length on the original scale
plot(density(DartPoints$Length), main="Dart Point Length", xlab="Normal scale")
# The same density estimate drawn on a logarithmic x axis
plot(density(DartPoints$Length), main="Dart Point Length", xlab="Log scale", log="x")
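A related view, assuming DartPoints is already loaded as in the slides: estimate the kernel density of the logged values directly (not identical to re-scaling the axis, since here the density itself is computed on the log scale).

logLength <- log(DartPoints$Length)
plot(density(logLength), main="Dart Point Length", xlab="log(Length)")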

Common Transformations
Tail to the right:
–Natural or common (base 10) logarithm – no zero values
–Square root, cube root, etc. – zeros OK
–Inverse: -1/x, -1/x^2, etc. – no zero values
Tail to the left:
–Exponential: e^x, 10^x (low values)
–Square, cube, etc.
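A quick sketch of these transformations on an invented right-skewed vector (and a left-skewed one), going from weaker to stronger:

x <- c(0.5, 1, 2, 4, 8, 16, 64, 256)   # right skewed, no zeros
log(x)        # natural log
sqrt(x)       # square root (zeros would be acceptable here)
x^(1/3)       # cube root
-1/x          # inverse, negated to preserve the order of the values
-1/x^2        # stronger inverse
y <- c(1, 5, 8, 9.5, 9.9, 10)          # left skewed
exp(y)        # exponential
y^2           # square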

Other Transformations
–Arcsine (angular transformation) to handle proportions between 0 and 1
–Used for population studies of non-metric traits
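A minimal sketch, assuming the common arcsine square-root form for proportions (published non-metric trait studies use several closely related angular transformations, e.g. θ = arcsin(1 − 2p)):

p <- c(0.02, 0.10, 0.25, 0.50, 0.90)   # invented trait frequencies
asin(sqrt(p))                           # angular transformation, in radians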

Transforming to Linear
–By transforming variables before using linear regression we can fit non-linear equations
–In some cases we can express the fitted equation in terms of the original, untransformed variables

Polynomial
–y = a + b1*x + b2*x^2 + b3*x^3 + b4*x^4 + ...
–Create the polynomial terms yourself or use the function poly() within lm()
–Begin with a linear fit, then work up to quadratic, cubic, and so on until the new terms are no longer significant
–e.g. lm(y ~ x + I(x^2) + I(x^3))
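A hedged sketch of that build-up on invented data, using poly() inside lm() and nested-model F tests to decide when to stop:

set.seed(3)
x <- seq(1, 20, length.out = 50)
y <- 5 + 2*x - 0.1*x^2 + rnorm(50, sd = 2)   # quadratic signal plus noise
fit1 <- lm(y ~ poly(x, 1))
fit2 <- lm(y ~ poly(x, 2))
fit3 <- lm(y ~ poly(x, 3))
anova(fit1, fit2, fit3)   # add terms only while the new term is significant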

Power Function
–Log-log transformation: use log() to transform both the dependent and independent variables
–Compute the linear regression: log(y) = a + b * log(x)
–Back-transformed: y = A * x^b, where A = exp(a)
–If b = 1, this is the same as the linear model
–Requires x, y > 0
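A small sketch of a power-function fit on invented positive data, back-transforming the intercept to plot the curve on the original scale:

set.seed(4)
x <- runif(40, 1, 100)
y <- 2.5 * x^0.6 * exp(rnorm(40, sd = 0.1))   # multiplicative error
fit <- lm(log(y) ~ log(x))
A <- exp(coef(fit)[1])                        # A = exp(a)
b <- coef(fit)[2]
plot(x, y)
curve(A * x^b, from = 1, to = 100, add = TRUE)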

Exponential Function
–Semi-log transformation: use log() to transform the dependent variable (y > 0)
–Compute the linear regression: log(y) = a + b * x
–Back-transformed: y = A * e^(b*x), where A = exp(a)
–Fits data with asymptotes
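A similar sketch for the exponential (semi-log) model, again on invented data:

set.seed(5)
x <- seq(0, 10, length.out = 40)
y <- 3 * exp(-0.4 * x) * exp(rnorm(40, sd = 0.1))   # decay toward zero
fit <- lm(log(y) ~ x)
A <- exp(coef(fit)[1])
b <- coef(fit)[2]
plot(x, y)
curve(A * exp(b * x), from = 0, to = 10, add = TRUE)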

Inverse Function
–Reciprocal transformation: 1/x, where x ≠ 0
–Used for distance models – marriage, trade, and social interaction decline with distance
–Fits data with asymptotes
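A sketch of the reciprocal model on an invented distance variable (the variable names here are hypothetical):

set.seed(6)
distance <- runif(40, 1, 50)                      # no zero distances
contact <- 2 + 30/distance + rnorm(40, sd = 1)    # declines toward an asymptote near 2
fit <- lm(contact ~ I(1/distance))
summary(fit)   # the intercept estimates the asymptote at large distances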

Other Functions
–Logarithmic – no zeros in x: y = a + b * log(x)
–Square root – no negative values in x: y = a + b * sqrt(x)
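A brief sketch fitting both forms to invented data and comparing them:

set.seed(7)
x <- runif(40, 1, 100)              # positive, no zeros
y <- 1 + 4*log(x) + rnorm(40)
log_fit  <- lm(y ~ log(x))          # y = a + b * log(x)
sqrt_fit <- lm(y ~ sqrt(x))         # y = a + b * sqrt(x)
AIC(log_fit, sqrt_fit)              # lower AIC indicates the better fit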

Examples
–Human cranial capacity over the last 1.8 million years
–Number of Identified Specimens (NISP) and Minimum Number of Individuals (MNI) at Chucalissa (a Middle Mississippian site)

# BrainsCC.RData -- cranial capacity (BrainCC) by age in thousands of years (AgeKa)
# Explore logs with scatterplot
RegModel.1 <- lm(BrainCC~AgeKa, data=BrainsCC) # Rcmdr
summary(RegModel.1) # Rcmdr
BrainsCC$logAge <- with(BrainsCC, log(AgeKa)) # Rcmdr
BrainsCC$logBrain <- with(BrainsCC, log(BrainCC)) # Rcmdr
RegModel.2 <- lm(logBrain~logAge, data=BrainsCC) # Rcmdr -- power function (log-log)
summary(RegModel.2) # Rcmdr
RegModel.3 <- lm(BrainCC~logAge, data=BrainsCC) # Rcmdr -- logarithmic (log x only)
summary(RegModel.3) # Rcmdr
plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
abline(RegModel.1, lty=1, lwd=2, col="black")
x <- seq(10, 1800, 10)   # start above zero so log(x) is finite
logx <- log(x)
lines(x, exp(predict(RegModel.2, data.frame(logAge=logx))), lty=1, lwd=2, col="red")
lines(x, predict(RegModel.3, data.frame(logAge=logx)), lty=1, lwd=2, col="blue")
legend("topright", c("Linear", "Power", "Logarithmic"), lty=1, lwd=2,
       col=c("black", "red", "blue"))

# Polynomial fits of increasing order
LinearModel.4 <- lm(BrainCC ~ AgeKa + I(AgeKa^2), data=BrainsCC)
summary(LinearModel.4)
LinearModel.5 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3), data=BrainsCC)
summary(LinearModel.5)
LinearModel.6 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3) + I(AgeKa^4), data=BrainsCC)
summary(LinearModel.6)
plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
abline(RegModel.1, lty=1, lwd=2, col="black")
x <- seq(0, 1800, 10)
lines(x, predict(LinearModel.4, data.frame(AgeKa=x)), lty=1, lwd=2, col="red")
lines(x, predict(LinearModel.5, data.frame(AgeKa=x)), lty=1, lwd=2, col="blue")
lines(x, predict(LinearModel.6, data.frame(AgeKa=x)), lty=1, lwd=2, col="green")
legend("topright", c("Linear", "Quadratic", "Cubic", "Quartic"), lty=1, lwd=2,
       col=c("black", "red", "blue", "green"))

load("C:/Users/DCarlson/Documents/anth642/R/Data/Chucalissa.rda") #Rcmdr plot(mni~nisp, data=Chucalissa) RegModel.1 <- lm(mni~nisp, data=Chucalissa) #Rcmdr summary(RegModel.1) #Rcmdr abline(RegModel.1) plot(mni~nisp, data=Chucalissa, log="xy") # Plot log-log transform plot(mni~nisp, data=Chucalissa, log="y") # Plot semi-log transform Chucalissa$logMNI <- log(Chucalissa$mni) # Create logged variables Chucalissa$logNISP <- log(Chucalissa$nisp) plot(logMNI~logNISP, data=Chucalissa) RegModel.2 <- lm(logMNI~logNISP, data=Chucalissa) #Rcmdr summary(RegModel.2) #Rcmdr abline(RegModel.2) plot(mni~nisp, data=Chucalissa) # plot log-log equation on original data a2 <- exp(RegModel.2$coefficients[[1]]) # Convert a to exp(a) b2 <- RegModel.2$coefficients[[2]] a1 <- RegModel.1$coefficients[[1]] b1 <- RegModel.1$coefficients[[2]] curve(a2*x^b2, 0, 3250, add=TRUE) abline(RegModel.1, lty=3) text(locator(), as.expression(substitute(MNI == a*NISP^b, list(a=round(a2, 4), b=round(b2, 4)))), pos=2) text(locator(), as.expression(substitute(MNI == a+b*NISP, list(a=round(a1, 4), b=round(b1, 4)))), pos=4)