Correlation and Covariance

Presentation on theme: "Correlation and Covariance"— Presentation transcript:

1 Correlation and Covariance

2 Overview
Outcome, dependent variable (Y-axis): Height, continuous. Plot on its own as a histogram.
Predictor variable (X-axis): continuous predictors use a scatterplot; categorical predictors use a boxplot.

3 Independent Variables
[Diagram: outcome Y (Height) and the independent variables X1, X2, X3, X4.]

4 Correlation Matrix for Continuous Variables
PerformanceAnalytics package: chart.Correlation(num2)

5 Calculating ‘Error’
A deviation is the difference between the mean and an actual data point. Deviations can be calculated by taking each score and subtracting the mean from it: $\text{deviation}_i = x_i - \bar{x}$

6 Calculating ‘Error’

7 Use the Total Error?
Take the error between the mean and the data and add them?

Score   Mean   Deviation
1       2.6    -1.6
2       2.6    -0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4
Total           0

8 Sum of Squared Errors
We could add the deviations to find out the total error, but deviations cancel out because some are positive and others negative. Therefore, we square each deviation. If we add these squared deviations we get the sum of squared errors (SS). Why not just use the absolute value?

9 Sum of Squared Errors

Score   Mean   Deviation   Squared Deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total                      5.20

10 Standard Deviation
The variance is measured in units squared. This isn’t a very meaningful metric, so we take the square root. This is the standard deviation (s).

11 Variance
The sum of squares is a good measure of overall variability, but it depends on the number of scores. We calculate the average variability by dividing by the number of scores (n), or by n - 1 when estimating from a sample. This value is called the variance (s²).
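
The steps on slides 5 to 11 can be sketched in R. The five scores below are the ones consistent with the slide 9 table (mean 2.6, sum of squares 5.20); the variable names are illustrative only. One caveat worth checking against your own course notes: R's built-in var() and sd() divide by n - 1 rather than n.

```r
# Scores consistent with the slide 9 table (mean 2.6, SS 5.20)
scores <- c(1, 2, 3, 3, 4)

deviations  <- scores - mean(scores)  # each score minus the mean
total_error <- sum(deviations)        # positives and negatives cancel to ~0
ss          <- sum(deviations^2)      # sum of squared errors (SS)

n      <- length(scores)
var_n  <- ss / n                      # average squared deviation (divide by n)
var_n1 <- ss / (n - 1)                # sample variance, which is what var() returns
sd_n1  <- sqrt(var_n1)                # standard deviation, back in the original units

print(c(SS = ss, var_n = var_n, var_n1 = var_n1, sd = sd_n1))
```

Comparing var_n1 with var(scores) is a quick way to confirm which denominator a given tool uses.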

12 Same Mean, Different Standard Deviation

13 Temperature Variation Across Cities
[Temperature distributions by city: Austin, Las Vegas, San Diego, San Francisco, Tampa Bay. Y-axis: count of hours.]

14 Covariance
[Plot of X and Y for each person.] Persons 2, 3, and 5 appear to deviate from their means by similar magnitudes.

15 Covariance
Calculate the error [deviation] between the mean and each subject’s score for the first variable (x). Calculate the error [deviation] between the mean and their score for the second variable (y). Multiply these error values. Add these values and you get the cross-product deviations. The covariance is the average cross-product deviation, dividing by n - 1 for a sample (or n for a full population): $\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
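
The four steps on this slide can be sketched in R as follows. The x and y values are made-up illustrative data, not taken from the slides, and the n - 1 denominator is an assumption chosen to match R's built-in cov().

```r
# Illustrative data (hypothetical, not from the deck)
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

dev_x <- x - mean(x)                  # step 1: deviations for the first variable
dev_y <- y - mean(y)                  # step 2: deviations for the second variable
cross_products <- dev_x * dev_y       # step 3: multiply the paired deviations
sum_cp <- sum(cross_products)         # step 4: sum of cross-product deviations

# Average cross-product deviation, using the sample (n - 1) denominator
cov_manual <- sum_cp / (length(x) - 1)

print(cov_manual)   # 4.25
print(cov(x, y))    # built-in result, should agree with cov_manual
```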

16 Covariance
Do they VARY the same way relative to their own means?
[Table of Age, Income, and Education scores; reported result: 2.47.]

17 Limitations of Covariance
It depends upon the units of measurement. E.g., the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
One solution: standardize it (normalize the data). Divide by the standard deviations of both variables.
The standardized version of covariance is known as the correlation coefficient. It is unaffected by a change in the units of measurement.
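
A small R sketch of the units problem, under stated assumptions: the two "miles" vectors are made-up values chosen so that their covariance is 4.25, and 1.609 is used as an approximate miles-to-kilometres factor. Rescaling both variables multiplies the covariance by 1.609 squared, while the correlation does not change.

```r
# Hypothetical scores, both measured in miles
miles_x <- c(5, 4, 4, 6, 8)
miles_y <- c(8, 9, 10, 13, 15)

km_per_mile <- 1.609                  # approximate conversion factor
km_x <- miles_x * km_per_mile
km_y <- miles_y * km_per_mile

cov_miles <- cov(miles_x, miles_y)    # 4.25
cov_km    <- cov(km_x, km_y)          # about 11.0 (4.25 * 1.609^2)

# Standardize: divide the covariance by both standard deviations
r_manual <- cov_miles / (sd(miles_x) * sd(miles_y))
r_miles  <- cor(miles_x, miles_y)
r_km     <- cor(km_x, km_y)           # same value as r_miles

print(c(cov_miles = cov_miles, cov_km = cov_km, r_miles = r_miles, r_km = r_km))
```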

18 The Correlation Coefficient
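
Following the description on the previous slide (covariance divided by the standard deviations of both variables), the coefficient can be written as below. The notation is an assumption, since the slide's own formula is not in the transcript; s_x and s_y are taken to be sample standard deviations computed with the same n - 1 denominator as the covariance.

```latex
r = \frac{\mathrm{cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x \, s_y}
```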

19 Things to Know about the Correlation
It varies between -1 and +1; 0 = no linear relationship.
It is an effect size: ±.1 = small effect, ±.3 = medium effect, ±.5 = large effect.
Coefficient of determination, r²: by squaring the value of r you get the proportion of variance in one variable shared by the other.

20 Correlation
Covariance is high: r ~ 1
Covariance is low: r ~ 0

21 Correlation

22 Correlation
Need inter-item/variable correlations > .30

23 Data Structures
Numeric vector: a <- c(1,2,5.3,6,-2,4)
Character vector: b <- c("one","two","three")
Matrix: y <- matrix(1:20, nrow=5, ncol=4)
Dataframe:
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
List: w <- list(name="Fred", age=5.3)
Framework source: Hadley Wickham

24 Correlation Matrix

25 Correlation and Covariance

26 Revisiting the Height Dataset

27 Galton: Height Dataset
The cor() function does not handle factors (Excel's correl() does not either):
cor(heights)
Error in cor(heights) : 'x' must be numeric
Initial workaround: create a data.frame without the factors:
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
Later we will recode the variable as 0/1.
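
One way to build the factor-free data frame without naming each column is to keep only the numeric columns. This is a sketch under assumptions: the small heights data frame below is a hypothetical stand-in with made-up values and column names, not the Galton data used in the deck.

```r
# Hypothetical stand-in for the heights data (values and names are assumptions)
heights <- data.frame(
  father = c(78.5, 75.5, 75.0, 74.0, 73.0, 72.7),
  mother = c(67.0, 66.5, 64.0, 62.0, 67.0, 69.0),
  height = c(73.2, 73.5, 71.0, 70.5, 68.5, 72.0),
  gender = factor(c("M", "M", "F", "M", "F", "F"))
)

# cor(heights) stops with an error because gender is a factor
attempt <- try(cor(heights), silent = TRUE)

# Keep only the numeric columns, then correlate
numeric_cols <- sapply(heights, is.numeric)
h_numeric    <- heights[, numeric_cols]
cor_matrix   <- cor(h_numeric)

print(round(cor_matrix, 2))
```

Selecting by type rather than by name means the same lines keep working if more numeric columns are added later.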

28 Histogram of Correlation Coefficients
[Histogram; x-axis runs from -1 to +1.]

29 Correlations Matrix: Both Types
Zoom in on Gender:
library(car)
scatterplotMatrix(heights)

30 Correlation Matrix for Continuous Variables
PerformanceAnalytics package: chart.Correlation(num2)

31 Categorical: Revisit Box Plot
Correlation will depend on the spread of the distributions.
Note there is an equation here: Y = mx + b
Factors/categorical variables work with boxplots; however, some functions are not set up to handle factors.

32 Manual Calculation: Note Stdev is Lower
Note that with a 0/1 variable the deviations from the mean are small, so the standard deviation is lower, whereas the continuous variable has much more variation (spread).

33 Categorical: Recode!
Gender recoded as 0/1 (0 = Female). The formula now works!
@correl does not work with factor variables.
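
A minimal sketch of the recode in R. The gender and height values are made up for illustration; 0 = Female follows the slide, while coding the other level as 1 is an assumption.

```r
# Hypothetical data: a factor and a continuous variable
gender <- factor(c("M", "M", "F", "M", "F", "F"))
height <- c(73.2, 73.5, 69.0, 70.5, 65.5, 66.0)

# Recode the factor as a numeric 0/1 variable (0 = Female; 1 for the other level is assumed)
gender01 <- ifelse(gender == "F", 0, 1)

# cor() now accepts both inputs because both are numeric
r <- cor(gender01, height)

print(sd(gender01))   # spread of the 0/1 variable
print(sd(height))     # spread of the continuous variable
print(r)
```

Printing the two standard deviations side by side shows the point from the manual calculation slide: the 0/1 variable has far less spread than the continuous one.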

34 Correlation: Continuous & Discrete
More examples of cor.test()
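
As one more example, cor.test() from base R's stats package returns the correlation together with a significance test. The data below are made up for illustration; squaring the estimate gives the coefficient of determination described on slide 19.

```r
# Illustrative data (hypothetical, not from the deck)
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

result <- cor.test(x, y)              # Pearson correlation by default

r         <- unname(result$estimate)  # the correlation coefficient
r_squared <- r^2                      # proportion of shared variance
p_value   <- result$p.value           # test of the null hypothesis r = 0

print(round(c(r = r, r_squared = r_squared, p_value = p_value), 3))
print(result$conf.int)                # 95% confidence interval for r
```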

35 Overview
Too many variables are difficult to handle, and it takes computing power to process all that data.
Principal components analysis seeks to identify and quantify the underlying components by analyzing the original, observable variables.
In many cases, we can wind up working with just a few (on the order of, say, three to ten) principal components or factors instead of tens or hundreds of conventionally measured variables.

36 Principal Components Analysis
Which component explains the most variance?
[Diagram: observable variables X1, X2, X3 and component vectors Z1, Z2, Z3.]
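
The question of which component explains the most variance can be answered with base R's prcomp(). This is a sketch under assumptions: the built-in mtcars data and the five columns chosen are stand-ins, not the dataset used in the deck.

```r
# Built-in mtcars data, used purely as a stand-in for "many measured variables"
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Centre and scale so that no variable dominates because of its units
pca <- prcomp(vars, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
names(var_explained) <- colnames(pca$rotation)   # PC1, PC2, ...

print(round(var_explained, 3))          # proportion of variance per component
print(round(cumsum(var_explained), 3))  # cumulative proportion
```

prcomp() orders components by decreasing variance, so the first entry is always the component that explains the most.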

37 Principal Components Analysis

38 Principal Components

39 Correlation  Regression

