Correlation and Covariance
Overview
[Figure: chart types by variable type (Continuous, Categorical): histogram, scatter, boxplot. Predictor variable on the X-axis; outcome (dependent) variable on the Y-axis, here Height.]
Correlation
[Figure: Covariance is high: r ~ 1. Covariance is low: r ~ 0.]
Things to Know about the Correlation
It varies between -1 and +1.
0 = no relationship.
It is an effect size:
  ±.1 = small effect
  ±.3 = medium effect
  ±.5 = large effect
Coefficient of determination, r^2: by squaring the value of r you get the proportion of variance in one variable shared by the other.
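A minimal sketch of these points in R, using made-up values (the vectors and the `effect_label` helper are hypothetical, not part of the original slides):

```r
# Hypothetical paired measurements (illustrative values only).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r <- cor(x, y)       # Pearson correlation coefficient, always in [-1, 1]
r_squared <- r^2     # coefficient of determination: shared variance

# Hypothetical helper that applies the effect-size thresholds listed above.
effect_label <- function(r) {
  a <- abs(r)
  if (a >= 0.5) {
    "large"
  } else if (a >= 0.3) {
    "medium"
  } else if (a >= 0.1) {
    "small"
  } else {
    "negligible"
  }
}

print(r)
print(r_squared)
print(effect_label(r))
```

For these values r^2 is 0.6, so 60% of the variance in one variable is shared with the other.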
Variables
[Figure: independent variables (the X's: X1, X2, X3, X4) and the dependent variable (Y), here Height.]
Little Correlation
Correlation is For Linear Relationships
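A small sketch of why this matters, with made-up values: a perfect but curved relationship can still give a correlation of zero, because r only measures the linear part.

```r
# Hypothetical data: y is completely determined by x, but not linearly.
x <- -5:5
y <- x^2

r_curved <- cor(x, y)   # the symmetric curve has no linear trend
print(r_curved)

# A straight-line relationship on the same x values, for comparison.
r_linear <- cor(x, 2 * x + 1)
print(r_linear)
```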
Outliers Can Skew Correlation Values
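As an illustration with made-up numbers, a single extreme point can turn a weak correlation into what looks like a strong one:

```r
# Hypothetical data with almost no linear trend.
x <- 1:10
y <- c(5, 6, 5, 6, 5, 6, 5, 6, 5, 6)
r_without <- cor(x, y)

# Add one extreme observation far from the rest of the data.
x_out <- c(x, 30)
y_out <- c(y, 30)
r_with <- cor(x_out, y_out)

print(r_without)
print(r_with)
```

Plotting the data first (for example with `plot(x_out, y_out)`) is a simple way to spot such points before trusting r.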
Correlation and Regression Are Related
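One way to see the connection, sketched with made-up values: in a simple linear regression of y on x, the slope equals r multiplied by sd(y)/sd(x), and the model's R-squared equals r^2.

```r
# Hypothetical paired measurements (illustrative values only).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)                      # simple linear regression
slope <- unname(coef(fit)["x"])
r <- cor(x, y)

slope_from_r <- r * sd(y) / sd(x)     # slope recovered from the correlation
r2_from_fit <- summary(fit)$r.squared # R-squared reported by the model

print(c(slope = slope, slope_from_r = slope_from_r))
print(c(r_squared = r^2, r2_from_fit = r2_from_fit))
```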
Covariance
[Figure: Y plotted against X. Persons 2, 3, and 5 look to have deviations of similar magnitude from their means.]
Covariance
Calculate the error [deviation] between the mean and each subject's score for the first variable (x).
Calculate the error [deviation] between the mean and their score for the second variable (y).
Multiply these error values.
Add these values and you get the cross-product deviations.
The covariance is the average of the cross-product deviations:
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
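The steps above can be sketched in R as follows. The scores are made-up illustrative values, and the sum is divided by N - 1, which is also what R's built-in `cov()` does.

```r
# Hypothetical scores for five subjects on two variables.
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

dev_x <- x - mean(x)             # deviation of each x score from its mean
dev_y <- y - mean(y)             # deviation of each y score from its mean
cross_products <- dev_x * dev_y  # multiply the paired deviations
sum_cp <- sum(cross_products)    # sum of cross-product deviations

cov_manual <- sum_cp / (length(x) - 1)  # average, using N - 1
cov_builtin <- cov(x, y)                # should agree with the manual value

print(cov_manual)
print(cov_builtin)
```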
Covariance
[Figure: Age, Income, Education. Do they VARY the same way relative to their own means? Value shown: 2.47]
Limitations of Covariance
It depends upon the units of measurement. E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
One solution: standardize it (normalize the data). Divide by the standard deviations of both variables.
The standardized version of covariance is known as the correlation coefficient. It does not change when the units of measurement are rescaled (e.g. miles to kilometres).
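A sketch of the unit problem and the fix, using made-up scores chosen so that the covariance in miles is 4.25 (the conversion factor of 1.609 km per mile is an approximation):

```r
# Hypothetical scores, treated as if both were measured in miles.
x_miles <- c(5, 4, 4, 6, 8)
y_miles <- c(8, 9, 10, 13, 15)

km_per_mile <- 1.609
x_km <- x_miles * km_per_mile
y_km <- y_miles * km_per_mile

cov_miles <- cov(x_miles, y_miles)  # 4.25
cov_km <- cov(x_km, y_km)           # about 11: scaled by km_per_mile^2

# Standardize: divide the covariance by both standard deviations.
r_miles <- cov_miles / (sd(x_miles) * sd(y_miles))
r_km <- cov_km / (sd(x_km) * sd(y_km))

print(c(cov_miles = cov_miles, cov_km = cov_km))
print(c(r_miles = r_miles, r_km = r_km, r_builtin = cor(x_miles, y_miles)))
```

The covariance changes with the units, while the standardized value is the same in both and matches `cor()`.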
The Correlation Coefficient
Need inter-item/variable correlations >.30
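The slide does not say what the .30 threshold is needed for, so the following only sketches how such a check could be done. It uses R's built-in `mtcars` data as a stand-in; the choice of columns is arbitrary.

```r
# Stand-in data: three numeric columns from the built-in mtcars dataset.
vars <- mtcars[, c("mpg", "wt", "hp")]
r <- cor(vars)                  # correlation matrix
print(round(r, 2))

# Find each pair once (upper triangle) where |r| exceeds .30.
idx <- which(upper.tri(r) & abs(r) > 0.30, arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r = r[idx]
)
print(strong)
```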
Data Structures Framework (Source: Hadley Wickham)
Numeric Vector:
  a <- c(1,2,5.3,6,-2,4)
Character Vector:
  b <- c("one","two","three")
Matrix:
  y <- matrix(1:20, nrow=5, ncol=4)
Dataframe:
  d <- c(1,2,3,4)
  e <- c("red", "white", "red", NA)
  f <- c(TRUE,TRUE,TRUE,FALSE)
  mydata <- data.frame(d,e,f)
  names(mydata) <- c("ID","Color","Passed")
List:
  w <- list(name="Fred", age=5.3)
Correlation Matrix
Correlation and Covariance
Revisiting the Height Dataset
Galton: Height Dataset
  cor(heights)
  Error in cor(heights) : 'x' must be numeric
The cor() function does not handle Factors. Excel's CORREL() does not either.
Initial workaround: create a data.frame without the Factors:
  h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
Later we will RECODE the variable into 0 and 1.
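A hedged sketch of the same workaround on a small made-up data frame (`heights_demo` and its values are hypothetical, not the Galton data). Instead of listing columns by hand, it keeps whichever columns are numeric:

```r
# Hypothetical stand-in for the heights data, with one Factor column.
heights_demo <- data.frame(
  father = c(70, 68, 72, 66, 69),
  mother = c(64, 65, 63, 62, 66),
  height = c(71, 66, 72, 63, 68),
  gender = factor(c("M", "F", "M", "F", "M"))
)

# cor() on the full data frame fails because of the Factor column.
attempt <- try(cor(heights_demo), silent = TRUE)
failed <- inherits(attempt, "try-error")

# Keep only the numeric columns, then compute the correlation matrix.
numeric_cols <- sapply(heights_demo, is.numeric)
cor_matrix <- cor(heights_demo[, numeric_cols])

print(failed)
print(round(cor_matrix, 2))
```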
Histogram of Correlation Coefficients
Correlation Matrix: Both Types
  library(car)
  scatterplotMatrix(heights)
Zoom in on Gender
Correlation Matrix for Continuous Variables
  chart.Correlation(num2)
(from the PerformanceAnalytics package)
Categorical: Revisit Box Plot
Factors/Categorical variables work with boxplots; however, some functions are not set up to handle Factors.
Note there is an equation here: Y = mx + b
Correlation will depend on the spread of the distributions.
Manual Calculation: Note Stdev is Lower
Note that with 0 and 1 the deltas from the mean are small, so the standard deviation is lower, whereas the continuous variable has a lot of variation (spread).
Categorical: Recode!
Gender recoded as 0 = Female, 1 = Male.
cor() does not work with Factor variables.
Formula now works!
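A minimal sketch of the recoding step on made-up values. The vectors are hypothetical, and the coding 0 = Female, 1 = Male is an assumption about the slide's intent:

```r
# Hypothetical data: a Factor and a continuous variable.
gender <- factor(c("F", "M", "M", "F", "M", "F"))
height <- c(62, 70, 69, 64, 71, 63)

# Recode the Factor into a numeric 0/1 variable (assumed coding: F = 0, M = 1).
gender01 <- ifelse(gender == "M", 1, 0)

# cor() now accepts the recoded variable.
r_gender_height <- cor(gender01, height)
print(r_gender_height)

# The 0/1 variable has far less spread than the continuous one.
print(c(sd_gender01 = sd(gender01), sd_height = sd(height)))
```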
Correlation: Continuous & Discrete
More examples of cor.test()
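One such example, sketched with made-up values: `cor.test()` reports the same coefficient as `cor()` together with a significance test and a confidence interval. The data and the 0/1 coding are hypothetical.

```r
# Hypothetical data: a 0/1 coded variable and a continuous variable.
gender01 <- c(0, 1, 1, 0, 1, 0)
height <- c(62, 70, 69, 64, 71, 63)

result <- cor.test(gender01, height)   # Pearson correlation by default

r_est <- unname(result$estimate)       # the correlation coefficient
p_val <- result$p.value                # p-value for H0: correlation is 0
ci <- result$conf.int                  # 95% confidence interval for r

print(r_est)
print(p_val)
print(ci)
```

With so few observations the confidence interval is wide, so the estimate should be read with caution.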
Correlation Regression
Summary
[Table: choice of chart, summary statistic, and regression model by variable type (Continuous, Categorical) for the predictor variable (X-axis; e.g. Parents Height, Gender) and the outcome, dependent variable (Y-axis). Charts listed: histogram, scatter, boxplot, bar, pie, mosaic, cross table. Summary statistics: mean, median, standard deviation (continuous); proportions and frequencies of 0/1 (categorical). Regression models: linear regression, logistic regression.]