Correlation and Covariance

Name: Correlation and Covariance
Uploaded: 2017-07-13T07:52:20+00:00
Duration: PTM10S40
Channel: Susan Warner
Description: Correlation and Covariance

Correlation and Covariance

Overview
Continuous outcome, the dependent variable (Y-axis), e.g. Height: histogram.
Predictor variable (X-axis): scatter plot when the predictor is continuous, boxplot when it is categorical.

Independent Variables
Y (Height) is the dependent variable; X1, X2, X3, and X4 are the independent variables (the X's).

Correlation Matrix for Continuous Variables
PerformanceAnalytics package chart.Correlation(num2)

Calculating ‘Error’
A deviation is the difference between an actual data point and the mean. Deviations are calculated by taking each score and subtracting the mean from it: $\text{deviation}_i = x_i - \bar{x}$

Calculating ‘Error’

Use the Total Error?
Take the error (deviation) between the mean and each data point and add them up?
Score   Mean   Deviation
1       2.6    -1.6
2       2.6    -0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4
Total           0
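The cancelling of deviations can be checked directly in R. This is a minimal sketch that assumes the five scores are 1, 2, 3, 3, 4, which is the set consistent with the mean of 2.6 shown on the slide; the variable names are illustrative only.

```r
# Assumed scores (consistent with the slide's mean of 2.6)
scores <- c(1, 2, 3, 3, 4)

# Deviation = score minus the mean
deviations <- scores - mean(scores)
print(deviations)

# Adding the raw deviations: positives and negatives cancel
total_error <- sum(deviations)
print(total_error)  # zero, up to floating-point rounding
```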

Sum of Squared Errors
We could add the deviations to find the total error, but deviations cancel out because some are positive and others negative. Therefore, we square each deviation. Adding these squared deviations gives the sum of squared errors (SS): $SS = \sum_i (x_i - \bar{x})^2$. Why not just use the absolute value?

Sum of Squared Errors
Score   Mean   Deviation   Squared Deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total                      5.20
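The sum of squared errors can be reproduced in a few lines of R. The sketch below again assumes the scores 1, 2, 3, 3, 4 (the set that yields the mean of 2.6 and the total of 5.20); names such as `ss` are chosen here for illustration.

```r
# Assumed scores (consistent with mean 2.6 and SS 5.20)
scores <- c(1, 2, 3, 3, 4)

# Square each deviation so negatives no longer cancel positives
squared_deviations <- (scores - mean(scores))^2
print(squared_deviations)

# Sum of squared errors (SS)
ss <- sum(squared_deviations)
print(ss)  # approximately 5.2
```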

Standard Deviation
The variance is measured in units squared, which is not a very meaningful metric, so we take its square root. This is the standard deviation (s).

Variance
The sum of squares is a good measure of overall variability, but it depends on the number of scores. We calculate the average variability by dividing by the number of scores (n); for a sample, the divisor is n - 1, which is what R's var() uses. This value is called the variance (s²).
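A short R sketch of both quantities, assuming the same five scores (1, 2, 3, 3, 4). It shows the plain average of squared deviations next to the n - 1 version that R's built-in `var()` and `sd()` compute; the variable names are illustrative.

```r
# Assumed scores (consistent with mean 2.6 and SS 5.20)
scores <- c(1, 2, 3, 3, 4)
n <- length(scores)
ss <- sum((scores - mean(scores))^2)

# Average squared deviation, dividing by n
variance_n <- ss / n

# Sample variance, dividing by n - 1 (this is what var() returns)
variance_sample <- ss / (n - 1)

# Standard deviation: square root of the variance, back in the original units
sd_sample <- sqrt(variance_sample)

print(c(variance_n = variance_n,
        variance_sample = variance_sample,
        sd_sample = sd_sample))
```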

Same Mean, Different Standard Deviation

Temperature Variation Across Cities
[Figure: temperature distribution by city (Austin, Las Vegas, San Diego, San Francisco, Tampa Bay); axis label: Count of Hours]

Covariance
Plotting Y against X: persons 2, 3, and 5 appear to deviate from their means by similar magnitudes.

Covariance
Calculate the error (deviation) between the mean and each subject's score for the first variable (x). Calculate the error (deviation) between the mean and their score for the second variable (y). Multiply these error values for each subject. Add the products to get the sum of cross-product deviations. The covariance is the average cross-product deviation: $\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$
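The steps above can be sketched in R and compared with the built-in `cov()`. The data values below are illustrative (they are not taken from the slides) and were chosen so that the result comes to 4.25; the N - 1 divisor is the one `cov()` uses.

```r
# Illustrative data for five subjects (hypothetical values)
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

# Steps 1 and 2: deviations of each variable from its own mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Step 3: multiply the paired deviations (cross-products)
cross_products <- dev_x * dev_y

# Step 4: add them up and average, using N - 1
cov_manual <- sum(cross_products) / (length(x) - 1)

# Built-in equivalent
cov_builtin <- cov(x, y)

print(c(manual = cov_manual, builtin = cov_builtin))
```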

Covariance
Do they VARY the same way relative to their own means?
Age Income Education 7 4 3 1 8 6 5 2 9 2.47

Limitations of Covariance
It depends upon the units of measurement. E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11. One solution: standardize it! Normalize the data by dividing the covariance by the standard deviations of both variables. The standardized version of covariance is known as the correlation coefficient. It is unaffected by a linear change of units such as miles to kilometres.
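The unit dependence is easy to demonstrate in R. This sketch uses hypothetical distances in miles (chosen so the covariance is 4.25) and converts them with the standard factor of 1.609344 km per mile; the covariance scales by the square of that factor while the correlation stays the same.

```r
# Hypothetical paired measurements in miles
miles_x <- c(5, 4, 4, 6, 8)
miles_y <- c(8, 9, 10, 13, 15)

# Same scores converted to kilometres
km_per_mile <- 1.609344
km_x <- miles_x * km_per_mile
km_y <- miles_y * km_per_mile

# Covariance changes with the units ...
cov_miles <- cov(miles_x, miles_y)
cov_km <- cov(km_x, km_y)

# ... but the correlation coefficient does not
cor_miles <- cor(miles_x, miles_y)
cor_km <- cor(km_x, km_y)

print(c(cov_miles = cov_miles, cov_km = cov_km))
print(c(cor_miles = cor_miles, cor_km = cor_km))
```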

The Correlation Coefficient
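As a minimal sketch, the correlation coefficient can be written as r = cov(x, y) / (s_x * s_y), that is, the covariance divided by the product of the two standard deviations. The data below are hypothetical; the check against `cor()` assumes the default Pearson method.

```r
# Hypothetical paired scores
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

# Standardize the covariance by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```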

Things to Know about the Correlation
It varies between -1 and +1; 0 = no linear relationship.
It is an effect size: ±.1 = small effect, ±.3 = medium effect, ±.5 = large effect.
Coefficient of determination, r²: by squaring the value of r you get the proportion of variance in one variable shared by the other.
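The coefficient of determination can be illustrated by comparing r squared with the R-squared of a simple linear regression of one variable on the other; for a single predictor the two agree. The data are hypothetical and the variable names are illustrative.

```r
# Hypothetical paired scores
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

# Correlation and its square
r <- cor(x, y)
r_squared <- r^2

# R-squared from a simple linear regression of y on x
fit <- lm(y ~ x)
model_r_squared <- summary(fit)$r.squared

print(c(r = r, r_squared = r_squared, model_r_squared = model_r_squared))
```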

Correlation
Covariance is high relative to the spread of the two variables: r ≈ 1 (or -1 if the covariance is negative). Covariance is low: r ≈ 0.

Correlation

Correlation Need inter-item/variable correlations > .30

Framework Source: Hadley Wickham
Data Structures
Numeric Vector:
a <- c(1,2,5.3,6,-2,4)
Character Vector:
b <- c("one","two","three")
Matrix:
y <- matrix(1:20, nrow=5, ncol=4)
Dataframe:
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
List:
w <- list(name="Fred", age=5.3)

Correlation Matrix

Correlation and Covariance

Revisiting the Height Dataset

Galton: Height Dataset
The cor() function does not handle Factors (Excel correl() does not either):
cor(heights)
Error in cor(heights) : 'x' must be numeric
Initial workaround: create a data.frame without the Factors:
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
Later we will RECODE the variable into a 0, 1.
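A hedged sketch of the same workaround on a small stand-in data frame. `heights_demo` and its values are hypothetical (the Galton data are not reproduced here); instead of listing the numeric columns by hand, this version selects them with `is.numeric`, which is an alternative to the approach on the slide.

```r
# Hypothetical stand-in for the heights data, with one factor column
heights_demo <- data.frame(
  father = c(78.5, 75.5, 75.0, 69.0, 70.0, 68.0),
  mother = c(67.0, 66.5, 64.0, 66.0, 65.0, 63.0),
  gender = factor(c("M", "F", "M", "F", "M", "F")),
  height = c(73.2, 69.2, 71.0, 64.0, 70.5, 63.5)
)

# cor() fails while the factor column is present; capture the message
result <- tryCatch(cor(heights_demo),
                   error = function(e) conditionMessage(e))
print(result)

# Keep only the numeric columns, then compute the correlation matrix
numeric_cols <- sapply(heights_demo, is.numeric)
h2_demo <- heights_demo[, numeric_cols]
cor_matrix <- cor(h2_demo)
print(round(cor_matrix, 2))
```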

Histogram of Correlation Coefficients
[Figure: histogram of correlation coefficients; x-axis runs from -1 to +1]

Correlations Matrix: Both Types
Zoom in on Gender:
library(car)
scatterplotMatrix(heights)

Correlation Matrix for Continuous Variables
PerformanceAnalytics package chart.Correlation(num2)

Categorical: Revisit Box Plot
Correlation will depend on the spread of the distributions. Note there is an equation here: Y = mx + b. Factors (categorical variables) work with boxplots; however, some functions are not set up to handle Factors.

Manual Calculation: Note Stdev is Lower
Note that with 0 and 1 coding the deltas from the mean are small, so the standard deviation is lower, whereas the continuous variable has a lot of variation (spread).

Categorical: Recode!
Gender recoded as a 0/1 variable (0 = Female). Formula now works!
@correl does not work with Factor variables.
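A minimal sketch of the recoding step, using hypothetical data. It assumes the coding 0 = Female and 1 = Male, then shows that `cor()` accepts the recoded column and that the 0/1 variable has a smaller standard deviation than the continuous height variable.

```r
# Hypothetical data: gender as a factor, height as a continuous variable
heights_demo <- data.frame(
  gender = factor(c("M", "F", "M", "F", "M", "F")),
  height = c(73.2, 69.2, 71.0, 64.0, 70.5, 63.5)
)

# Recode the factor into 0/1 (assumed coding: 0 = Female, 1 = Male)
heights_demo$gender01 <- ifelse(heights_demo$gender == "F", 0, 1)

# cor() now works because both inputs are numeric
r_gender_height <- cor(heights_demo$gender01, heights_demo$height)
print(r_gender_height)

# The 0/1 variable has much less spread than the continuous one
sd_gender <- sd(heights_demo$gender01)
sd_height <- sd(heights_demo$height)
print(c(sd_gender = sd_gender, sd_height = sd_height))
```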

Correlation: Continuous & Discrete
More examples of cor.test()
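One possible `cor.test()` example, pairing a 0/1 recoded variable with a continuous one. The data are hypothetical; `cor.test()` defaults to the Pearson method and returns the estimate, a p-value, and a confidence interval.

```r
# Hypothetical data: 0/1 recoded gender and continuous height
gender01 <- c(1, 0, 1, 0, 1, 0)
height <- c(73.2, 69.2, 71.0, 64.0, 70.5, 63.5)

# Test whether the correlation differs from zero (Pearson by default)
test_result <- cor.test(gender01, height)

r_estimate <- unname(test_result$estimate)
p_value <- test_result$p.value
conf_interval <- test_result$conf.int

print(r_estimate)
print(p_value)
print(conf_interval)
```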

Overview
Too many variables are difficult to handle, and so is the computing power needed to process all that data. Principal components analysis seeks to identify and quantify the underlying components by analyzing the original, observable variables. In many cases, we can wind up working with just a few principal components or factors (on the order of, say, three to ten) instead of tens or hundreds of conventionally measured variables.

Principal Components Analysis
Which component explains the most variance?
[Diagram: observable variables X1, X2, X3 and component vectors Z1, Z2, Z3] Image Source:
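The question of which component explains the most variance can be explored with base R's `prcomp()`. This sketch uses the built-in `USArrests` dataset purely as an example (it is not part of the slides) and standardizes the variables first, so the analysis is based on correlations rather than covariances.

```r
# Principal components analysis on a built-in dataset (illustrative choice)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_explained, 3))

# Components are ordered by variance, so the first explains the most
top_component <- which.max(variance_explained)
print(top_component)

# Loadings: how each observed variable contributes to each component
print(round(pca$rotation, 2))
```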

Principal Components Analysis

Principal Components

Correlation → Regression
