Part 6: Correlation 6-1/49 Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics
Part 6: Correlation 6-2/49 Statistics and Data Analysis Part 6 – Correlation
Part 6: Correlation 6-3/49 Correlated Variables
Part 6: Correlation 6-4/49 Correlated Variables
Part 6: Correlation 6-5/49 Correlation Agenda: Two 'related' random variables; dependence and independence; conditional distributions; we're interested in correlation, but we have to look at covariance first; regression is correlation; correlated asset returns.
Part 6: Correlation 6-6/49 Probabilities for Two Events, A,B Marginal Probability = The probability of an event not considering any other events. P(A) Joint Probability = The probability that two events happen at the same time. P(A,B) Conditional Probability = The probability that one event happens given that another event has happened. P(A|B)
Part 6: Correlation 6-7/49 Probabilities: Inherited Color Blindness* Inherited color blindness has different incidence rates in men and women. Women usually carry the defective gene and men usually inherit it. Experiment: pick an individual at random from the population. CB = has inherited color blindness; MALE = gender, Not-Male = FEMALE.
  Marginal:    P(CB) = 2.75%, P(MALE) = 50.0%
  Joint:       P(CB and MALE) = 2.5%, P(CB and FEMALE) = 0.25%
  Conditional: P(CB|MALE) = 5.0% (1 in 20 men), P(CB|FEMALE) = 0.5% (1 in 200 women)
* There are several types of color blindness and large variation in the incidence across different demographic groups. These are broad averages that are roughly in the neighborhood of the true incidence for particular groups.
Part 6: Correlation 6-8/49 Dependent Events
                Color Blind
  Gender      No       Yes      Total
  Male        .4750    .0250    .5000
  Female      .4975    .0025    .5000
  Total       .9725    .0275    1.0000
P(Color blind, Male) = .0250, P(Male) = .5000, P(Color blind) = .0275. P(Color blind) x P(Male) = .0275 x .5000 = .01375, which is not equal to .0250. Gender and color blindness are not independent. Random variables X and Y are dependent if P_XY(X,Y) ≠ P_X(X) P_Y(Y).
Part 6: Correlation 6-9/49 Independent Events
                    Ace
  Heart      Yes=1          No=0     Total
  Yes=1      1/52           12/52    13/52 = 1/4
  No=0       3/52           36/52    39/52
  Total      4/52 = 1/13    48/52    52/52
P(Ace,Heart) = 1/52, P(Ace) = 1/13, P(Heart) = 1/4. P(Ace) x P(Heart) = (1/13)(1/4) = 1/52. Ace and Heart are independent. Random variables X and Y are independent if P_XY(X,Y) = P_X(X) P_Y(Y). "The joint probability equals the product of the marginal probabilities."
Part 6: Correlation 6-10/49 Dependent Random Variables Random variables are dependent if the occurrence of one affects the probability distribution of the other. If P(Y|X) changes when X changes, then the variables are dependent. If P(Y|X) does not change when X changes, then the variables are independent.
Part 6: Correlation 6-11/49 Conditional Probability. Prob(A | B) = P(A,B) / P(B). Prob(Color Blind | Male) = Prob(Color Blind, Male) / P(Male) = .025 / .50 = .05. [Table: color blindness by gender] What is P(Male | Color Blind)? A Theorem: For two random variables, P(X,Y) = P(X|Y) P(Y). P(Color blind, Male) = P(Color blind|Male) P(Male) = .05 x .5 = .025.
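The question posed on this slide, P(Male | Color Blind), can be worked out with the same definition of conditional probability. The following is a minimal sketch in Python that uses only the probabilities quoted on the color-blindness slides; the variable names are illustrative choices, not part of the lecture.

```python
# Probabilities quoted on the color-blindness slides.
p_male = 0.50
p_cb_and_male = 0.025
p_cb_and_female = 0.0025

# Marginal probability: sum the joint probabilities over gender.
p_cb = p_cb_and_male + p_cb_and_female

# Conditional probability: P(A | B) = P(A, B) / P(B).
p_cb_given_male = p_cb_and_male / p_male
p_male_given_cb = p_cb_and_male / p_cb

print(f"P(CB)        = {p_cb:.4f}")
print(f"P(CB | Male) = {p_cb_given_male:.4f}")
print(f"P(Male | CB) = {p_male_given_cb:.4f}")
```

Under these numbers P(Male | Color Blind) = .025 / .0275, roughly .91, which is very different from P(Color Blind | Male) = .05. The two conditional probabilities answer different questions.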
Part 6: Correlation 6-12/49 Conditional Distributions
  Marginal distribution of color blindness:          Color Blind .0275, Not Color Blind .9725
  Distribution among men (conditioned on Male):      Color Blind|Male .05, Not Color Blind|Male .95
  Distribution among women (conditioned on Female):  Color Blind|Female .005, Not Color Blind|Female .995
The distributions for the two genders are different. The variables are dependent.
Part 6: Correlation 6-13/49 Independent Random Variables. One card is drawn randomly from a deck of 52 cards.
                    Ace
  Heart      Yes=1    No=0     Total
  Yes=1      1/52     12/52    13/52
  No=0       3/52     36/52    39/52
  Total      4/52     48/52    52/52
P(Ace|Heart) = 1/13, P(Ace|Not-Heart) = 3/39 = 1/13, P(Ace) = 4/52 = 1/13. P(Ace) does not depend on whether the card is a heart or not. P(Heart|Ace) = 1/4, P(Heart|Not-Ace) = 12/48 = 1/4, P(Heart) = 13/52 = 1/4. P(Heart) does not depend on whether the card is an ace or not. A Theorem: For two independent random variables, P(X,Y) = P(X) P(Y). P(Ace, Heart) = P(Ace) P(Heart) = 1/13 x 1/4 = 1/52.
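The independence condition can be checked cell by cell. The sketch below, written in Python with exact fractions, builds the Ace/Heart joint distribution from the slide, sums it to get the marginals, and tests whether every joint probability equals the product of its marginals. The dictionary layout and names are assumptions made for illustration.

```python
from fractions import Fraction

# Joint distribution for one card drawn from a 52-card deck.
# Keys are (heart, ace) indicator pairs: 1 = yes, 0 = no.
joint = {
    (1, 1): Fraction(1, 52),
    (1, 0): Fraction(12, 52),
    (0, 1): Fraction(3, 52),
    (0, 0): Fraction(36, 52),
}

# Marginals: sum the joint probabilities over the other variable.
p_heart = {0: Fraction(0), 1: Fraction(0)}
p_ace = {0: Fraction(0), 1: Fraction(0)}
for (h, a), p in joint.items():
    p_heart[h] += p
    p_ace[a] += p

# Independence requires P(H, A) = P(H) P(A) in every cell.
independent = True
for (h, a), p in joint.items():
    if p != p_heart[h] * p_ace[a]:
        independent = False

print(p_heart[1], p_ace[1], independent)
```

Exact fractions avoid floating-point comparisons, so the equality test in each cell is exact.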
Part 6: Correlation 6-14/49 Covariation and Expected Value
  Pick 10,325 people at random from the population. Predict how many will be color blind: 10,325 x .0275 = 284
  Pick 10,325 MEN at random from the population. Predict how many will be color blind: 10,325 x .05 = 516
  Pick 10,325 WOMEN at random from the population. Predict how many will be color blind: 10,325 x .005 = 52
The expected number of color blind people, given gender, depends on gender. Color blindness covaries with gender.
Part 6: Correlation 6-15/49 Positive Covariation: The distribution of one variable depends on another variable. Distribution of fuel bills changes (moves upward) as the number of rooms changes (increases). The per capita number of cars varies (positively) with per capita income. The relationship varies by country as well.
Part 6: Correlation 6-16/49 Application – Legal Case Mix: Two kinds of cases show up each month, real estate (R) and financial (F) (sometimes together, usually separately). Two Related Random Variables*: R = real estate cases, F = financial cases.
                      Real Estate (R)
  Financial (F)    0      1      2      3      P(F)
  0                .02    .05    .05    .08    .20
  1                .03    .05    .08    .17    .33
  2                .04    .06    .09    .28    .47
  P(R)             .09    .16    .22    .53    1.00
The body of the table is the joint distribution. The bottom row is the marginal distribution for real estate cases and the right-hand column is the marginal distribution for financial cases.
* Adapted from example 4.16, p. 159 in your text
Part 6: Correlation 6-17/49 Legal Services Case Mix: Joint Probabilities. Joint discrete distribution Prob(F=f and R=r), with R = real estate cases and F = financial cases. Marginal probabilities are obtained by summing across or down. For example, summing across the F=0 row gives P(F=0) = .02 + .05 + .05 + .08 = .20, and summing down the R=0 column gives P(R=0) = .02 + .03 + .04 = .09.
Part 6: Correlation 6-18/49 Legal Services Case Mix: Conditional Probabilities. R = real estate cases, F = financial cases. Conditional probabilities are Prob(R=r and F=f)/P(F=f). These are the probabilities for R given the value of F. Read across the rows.
                      Real Estate (R)
  Financial (F)    0                1                2                3
  0                .02/.20 = .10    .05/.20 = .25    .05/.20 = .25    .08/.20 = .40
  1                .03/.33 = .10    .05/.33 = .15    .08/.33 = .24    .17/.33 = .51
  2                .04/.47 = .09    .06/.47 = .13    .09/.47 = .19    .28/.47 = .59
Part 6: Correlation 6-19/49 Conditional Distributions The probability distribution of Real estate cases (R) given Financial cases (F) varies with the number of Financial cases. The probability that (R=3)|F goes up as F increases from 0 to 2. This means that the variables are dependent.
Part 6: Correlation 6-20/49 Covariation in Legal Services. [Figure: the conditional distributions P(R|F) of real estate cases for Financial = 0, Financial = 1, and Financial = 2] How many real estate cases should the office expect if it knows (or predicts) the number of financial cases?
  E[R if F=0] = 0(.10) + 1(.25) + 2(.25) + 3(.40) = 1.95 (less than 2)
  E[R if F=1] = 0(.10) + 1(.15) + 2(.24) + 3(.51) = 2.16 (more than 2)
  E[R if F=2] = 0(.09) + 1(.13) + 2(.19) + 3(.59) = 2.28 (more than 2)
This is how R and F covary.
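The conditional means can also be computed directly from the joint probabilities, without rounding the conditional distributions first. The Python sketch below assumes the twelve joint probabilities that appear term by term in the covariance computation for the case-mix example; the list-of-rows layout is an illustrative choice.

```python
# Joint distribution P(F = f, R = r) for the case-mix example.
# Rows are F = 0, 1, 2; columns are R = 0, 1, 2, 3.
joint = [
    [0.02, 0.05, 0.05, 0.08],
    [0.03, 0.05, 0.08, 0.17],
    [0.04, 0.06, 0.09, 0.28],
]

cond_mean = {}
for f, row in enumerate(joint):
    p_f = sum(row)                 # marginal P(F = f): sum across the row
    cond = [p / p_f for p in row]  # conditional distribution P(R = r | F = f)
    total = 0.0
    for r, p in enumerate(cond):
        total += r * p             # E[R | F = f] = sum of r * P(R = r | F = f)
    cond_mean[f] = total
    print(f"F={f}: P(F)={p_f:.2f}  E[R|F]={cond_mean[f]:.4f}")
```

Because this version does not round the conditional probabilities to two decimals, the values for F = 1 and F = 2 come out slightly higher than the slide's 2.16 and 2.28, but the pattern is the same: the expected number of real estate cases rises with the number of financial cases.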
Part 6: Correlation 6-21/49 Covariation and Regression Financial Cases Expected Number of Real Estate Cases Given Number of Financial Cases This is the “regression of R on F”
Part 6: Correlation 6-22/49 (Linear) Regression of Bills on Rooms
Part 6: Correlation 6-23/49 Measuring How Variables Move Together: Covariance. Cov(X,Y) = E[(X - μ_X)(Y - μ_Y)] = Σ_X Σ_Y P(X,Y)(X - μ_X)(Y - μ_Y). Covariance can be positive or negative. The measure will be positive if it is likely that Y is above its mean when X is above its mean. It is usually denoted σ_XY.
Part 6: Correlation 6-24/49 Legal Services Case Mix Covariance. The two means are μ_R = 0(.09)+1(.16)+2(.22)+3(.53) = 2.19 and μ_F = 0(.20)+1(.33)+2(.47) = 1.27. Compute the covariance Σ_F Σ_R P(F,R)(F-1.27)(R-2.19):
  (0-1.27)(0-2.19)(.02) = +.055626
  (0-1.27)(1-2.19)(.05) = +.075565
  (0-1.27)(2-2.19)(.05) = +.012065
  (0-1.27)(3-2.19)(.08) = -.082296
  (1-1.27)(0-2.19)(.03) = +.017739
  (1-1.27)(1-2.19)(.05) = +.016065
  (1-1.27)(2-2.19)(.08) = +.004104
  (1-1.27)(3-2.19)(.17) = -.037179
  (2-1.27)(0-2.19)(.04) = -.063948
  (2-1.27)(1-2.19)(.06) = -.052122
  (2-1.27)(2-2.19)(.09) = -.012483
  (2-1.27)(3-2.19)(.28) = +.165564
  Sum = .0987
Part 6: Correlation 6-25/49 A Shortcut for Covariance
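The shortcut follows from expanding the product inside the expectation. The derivation below is a standard one, written in LaTeX; the notation (σ_XY, μ_X, μ_Y) follows the rest of these slides.

```latex
\begin{aligned}
\sigma_{XY} &= E\big[(X-\mu_X)(Y-\mu_Y)\big] \\
            &= E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X \mu_Y \\
            &= E[XY] - \mu_X \mu_Y ,
\qquad\text{where } E[XY] = \sum_X \sum_Y X\,Y\,P(X,Y).
\end{aligned}
```

In words: the covariance is the expected product minus the product of the expected values.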
Part 6: Correlation 6-26/49 Computing the Covariance Using the Shortcut. By the definition, Σ_F Σ_R [(F-1.27)(R-2.19) P(F,R)] = .0987, the sum of the twelve terms (f-1.27)(r-2.19)P(f,r). By the shortcut, the covariance is [Σ_F Σ_R FR P(F,R)] - [μ_F μ_R]:
  (0)(0)(.02) = 0
  (0)(1)(.05) = 0
  (0)(2)(.05) = 0
  (0)(3)(.08) = 0
  (1)(0)(.03) = 0
  (1)(1)(.05) = .05
  (1)(2)(.08) = .16
  (1)(3)(.17) = .51
  (2)(0)(.04) = 0
  (2)(1)(.06) = .12
  (2)(2)(.09) = .36
  (2)(3)(.28) = 1.68
  Sum = 2.88
Covariance = 2.88 - (1.27)(2.19) = 2.88 - 2.7813 = .0987
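Both routes to the covariance are easy to check numerically. The following Python sketch assumes the joint probabilities listed term by term on the covariance slides and computes the covariance from the definition and from the shortcut; variable names are illustrative.

```python
# Joint distribution P(F = f, R = r); rows are F = 0..2, columns are R = 0..3.
joint = [
    [0.02, 0.05, 0.05, 0.08],
    [0.03, 0.05, 0.08, 0.17],
    [0.04, 0.06, 0.09, 0.28],
]

# First pass: the two means and E[FR].
mu_f = mu_r = e_fr = 0.0
for f, row in enumerate(joint):
    for r, p in enumerate(row):
        mu_f += f * p
        mu_r += r * p
        e_fr += f * r * p

# Shortcut: Cov = E[FR] - mu_F * mu_R.
cov_shortcut = e_fr - mu_f * mu_r

# Definition: Cov = sum of (f - mu_F)(r - mu_R) P(f, r).
cov_definition = 0.0
for f, row in enumerate(joint):
    for r, p in enumerate(row):
        cov_definition += (f - mu_f) * (r - mu_r) * p

print(f"mu_F={mu_f:.2f} mu_R={mu_r:.2f} E[FR]={e_fr:.2f}")
print(f"shortcut={cov_shortcut:.4f} definition={cov_definition:.4f}")
```

The shortcut needs only one pass over the table once the means are known, which is why it is the more convenient form for hand computation.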
Part 6: Correlation 6-27/49 Independent Random Variables Have Zero Covariance. One card is drawn randomly from a deck of 52 cards. A = Ace, H = Heart.
                    A
  H          Yes=1    No=0     Total
  Yes=1      1/52     12/52    13/52
  No=0       3/52     36/52    39/52
  Total      4/52     48/52    52/52
E[H] = 1(13/52) + 0(39/52) = 1/4, E[A] = 1(4/52) + 0(48/52) = 1/13. Covariance = Σ_H Σ_A P(H,A)(H - μ_H)(A - μ_A):
  H=1, A=1: (1/52)(1 - 1/4)(1 - 1/13) = +36/52²
  H=0, A=1: (3/52)(0 - 1/4)(1 - 1/13) = -36/52²
  H=1, A=0: (12/52)(1 - 1/4)(0 - 1/13) = -36/52²
  H=0, A=0: (36/52)(0 - 1/4)(0 - 1/13) = +36/52²
  Sum = 0 !!
Part 6: Correlation 6-28/49 Covariance and Units of Measurement. Covariance takes the units of (units of X) times (units of Y). Consider Cov($Price of X, $Price of Y). Now, measure both prices in GBP, roughly $1.60 per £. The prices are divided by 1.60, and the covariance is divided by 1.60² = 2.56. This is an unattractive result.
Part 6: Correlation 6-29/49 Covariance and Scaling. We computed the covariance Cov(R,F) = .0987. What does the covariance mean? Suppose each real estate case requires 2 lawyers and each financial case requires 3 lawyers. Then the number of lawyers is N_R = 2R and N_F = 3F. Real estate lawyers N_R take the values 0 (was 0), 2 (was 1), 4 (was 2), 6 (was 3); financial lawyers N_F take the values 0 (was 0), 3 (was 1), 6 (was 2); the probabilities are unchanged. μ_NR = 0(.09)+2(.16)+4(.22)+6(.53) = 4.38 and μ_NF = 0(.20)+3(.33)+6(.47) = 3.81. The covariance of N_R and N_F will be 3(2)(.0987) = .5922. But the "relationship" is the same. We just changed the units of measurement.
Part 6: Correlation 6-30/49 Correlation is Units Free
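The standard definition of the correlation coefficient, and the reason it is free of units, can be written out as follows. This is a sketch in LaTeX of the usual textbook result, stated for positive scale factors a and b (a negative factor would flip the sign).

```latex
\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y},
\qquad -1 \le \rho_{XY} \le 1 .

% Rescaling both variables (a > 0, b > 0) leaves the correlation unchanged:
\mathrm{Corr}(aX,\,bY)
  = \frac{ab\,\mathrm{Cov}(X,Y)}{(a\,\sigma_X)(b\,\sigma_Y)}
  = \rho_{XY} .
```

The scale factors that inflate the covariance also inflate the standard deviations, so they cancel in the ratio.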
Part 6: Correlation 6-31/49 Correlation. μ_R = 2.19, μ_F = 1.27. Var(F) = 0²(.20)+1²(.33)+2²(.47) - 1.27² = .5971, standard deviation = .7727. Var(R) = 0²(.09)+1²(.16)+2²(.22)+3²(.53) - 2.19² = 1.0139, standard deviation = 1.0069. Covariance = .0987. Correlation = .0987 / (.7727 x 1.0069) ≈ .127.
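To see the units-free property at work, the correlation can be computed once for cases (R, F) and once for lawyers (N_R = 2R, N_F = 3F). The Python sketch below assumes the case-mix joint probabilities from the covariance slides; the helper `cov_and_corr` is a hypothetical name introduced here for illustration.

```python
import math

def cov_and_corr(x_values, y_values, joint):
    """Return (covariance, correlation) for a discrete joint distribution.

    joint[i][j] is P(X = x_values[i], Y = y_values[j]).
    """
    mu_x = mu_y = e_xx = e_yy = e_xy = 0.0
    for x, row in zip(x_values, joint):
        for y, p in zip(y_values, row):
            mu_x += x * p
            mu_y += y * p
            e_xx += x * x * p
            e_yy += y * y * p
            e_xy += x * y * p
    cov = e_xy - mu_x * mu_y            # shortcut formula for the covariance
    sd_x = math.sqrt(e_xx - mu_x ** 2)  # sqrt of Var(X) = E[X^2] - mu_X^2
    sd_y = math.sqrt(e_yy - mu_y ** 2)
    return cov, cov / (sd_x * sd_y)

# Rows are F = 0..2, columns are R = 0..3.
joint = [
    [0.02, 0.05, 0.05, 0.08],
    [0.03, 0.05, 0.08, 0.17],
    [0.04, 0.06, 0.09, 0.28],
]

# Cases: F in {0,1,2}, R in {0,1,2,3}.
cov_cases, rho_cases = cov_and_corr([0, 1, 2], [0, 1, 2, 3], joint)
# Lawyers: N_F = 3F, N_R = 2R, same probabilities.
cov_lawyers, rho_lawyers = cov_and_corr([0, 3, 6], [0, 2, 4, 6], joint)

print(f"cases:   cov={cov_cases:.4f}  corr={rho_cases:.4f}")
print(f"lawyers: cov={cov_lawyers:.4f}  corr={rho_lawyers:.4f}")
```

The covariance changes by the factor 3 x 2 = 6 when the units change, while the correlation is the same in both runs.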
Part 6: Correlation 6-32/49 Uncorrelated Variables Independence implies zero correlation. If the variables are independent, then the numerator of the correlation coefficient is zero.
Part 6: Correlation 6-33/49 Sums of Two Random Variables. Example 1: Total number of cases = F+R. Example 2: Personnel needed = 3F+2R. Find for sums: the expected value, and the variance and standard deviation. Application from finance: portfolio.
Part 6: Correlation 6-34/49 Math Facts 1 – Mean of a Sum Mean of a sum. The Mean of X+Y = E[X+Y] = E[X]+E[Y] Mean of a weighted sum Mean of aX + bY = E[aX] + E[bY] = aE[X] + bE[Y]
Part 6: Correlation 6-35/49 Mean of a Sum. μ_R = 2.19, μ_F = 1.27. What is the mean (expected) number of cases each month, R+F? E[R + F] = E[R] + E[F] = 2.19 + 1.27 = 3.46
Part 6: Correlation 6-36/49 Mean of a Weighted Sum. μ_R = 2.19, μ_F = 1.27. Suppose each real estate case requires 2 lawyers and each financial case requires 3 lawyers. Then N_R = 2R and N_F = 3F, and the mean number of lawyers is the mean of 2R+3F. E[2R + 3F] = 2E[R] + 3E[F] = 2(2.19) + 3(1.27) = 8.19 lawyers required.
Part 6: Correlation 6-37/49 Math Facts 2 – Variance of a Sum Variance of a Sum Var[x+y] = Var[x] + Var[y] +2Cov(x,y) Variance of a sum equals the sum of the variances only if the variables are uncorrelated. Standard deviation of a sum The standard deviation of x+y is not equal to the sum of the standard deviations.
Part 6: Correlation 6-38/49 Variance of a Sum. μ_R = 2.19, σ_R² = 1.0139, μ_F = 1.27, σ_F² = .5971, σ_RF = .0987. What is the variance of the total number of cases that occur each month? This is the variance of F+R = .5971 + 1.0139 + 2(.0987) = 1.8084. The standard deviation is 1.3448.
Part 6: Correlation 6-39/49 Math Facts 3 – Variance of a Weighted Sum. Var[ax+by] = Var[ax] + Var[by] + 2Cov(ax,by) = a²Var[x] + b²Var[y] + 2ab Cov(x,y). Also, Cov(x,y) is the numerator in ρ_xy, so Cov(x,y) = ρ_xy σ_x σ_y.
Part 6: Correlation 6-40/49 Variance of a Weighted Sum. Suppose each real estate case requires 2 lawyers and each financial case requires 3 lawyers. Then N_R = 2R and N_F = 3F. μ_R = 2.19, σ_R² = 1.0139, μ_F = 1.27, σ_F² = .5971, σ_RF = .0987, ρ_RF ≈ .127. What is the variance of the total number of lawyers needed each month? What is the standard deviation? This is the variance of 2R+3F = 2²(1.0139) + 3²(.5971) + 2(2)(3)(.0987) = 10.6139, where .0987 = σ_RF = ρ_RF σ_R σ_F. The standard deviation is the square root, 3.2579.
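The weighted-sum formulas can be checked against a brute-force calculation over the joint distribution. The Python sketch below assumes the case-mix joint probabilities and the values σ_R² = 1.0139 and σ_RF = .0987 quoted in these slides, together with σ_F² = .5971, which follows from the Var(F) formula on the Correlation slide; the weights a = 2 and b = 3 are the lawyers-per-case figures.

```python
# Rows are F = 0..2, columns are R = 0..3.
joint = [
    [0.02, 0.05, 0.05, 0.08],
    [0.03, 0.05, 0.08, 0.17],
    [0.04, 0.06, 0.09, 0.28],
]
a, b = 2, 3  # lawyers per real estate case, lawyers per financial case

# Direct route: enumerate N = aR + bF over every cell of the joint table.
mean_n = 0.0
e_n2 = 0.0
for f, row in enumerate(joint):
    for r, p in enumerate(row):
        n = a * r + b * f
        mean_n += n * p
        e_n2 += n * n * p
var_direct = e_n2 - mean_n ** 2

# Formula route: Var[aR + bF] = a^2 Var[R] + b^2 Var[F] + 2ab Cov(R, F).
var_r, var_f, cov_rf = 1.0139, 0.5971, 0.0987
var_formula = a ** 2 * var_r + b ** 2 * var_f + 2 * a * b * cov_rf

# Unweighted sum R + F is the special case a = b = 1.
var_sum = var_r + var_f + 2 * cov_rf

print(f"mean={mean_n:.2f}  var_direct={var_direct:.4f}  var_formula={var_formula:.4f}")
print(f"Var[R+F]={var_sum:.4f}")
```

Both routes give the same variance, which is the point of the math fact: the means, variances, and covariance are all that is needed, not the full joint table.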
Part 6: Correlation 6-41/49 Correlated Variables: Returns on Two Stocks* * Averaged yearly return
Part 6: Correlation 6-42/49 The two returns are positively correlated.
Part 6: Correlation 6-44/49 Application - Portfolio. You have $1000 to allocate between assets A and B. The yearly returns on the two assets are random variables r_A and r_B. The means of the two returns are E[r_A] = μ_A and E[r_B] = μ_B. The standard deviations (risks) of the returns are σ_A and σ_B. The correlation of the two returns is ρ_AB.
Part 6: Correlation 6-45/49 Portfolio You have $1000 to allocate to A and B. You will allocate proportions w of your $1000 to A and (1-w) to B.
Part 6: Correlation 6-46/49 Return and Risk. Your expected return on each dollar is E[w r_A + (1-w) r_B] = wμ_A + (1-w)μ_B. The variance of your return on each dollar is Var[w r_A + (1-w) r_B] = w²σ_A² + (1-w)²σ_B² + 2w(1-w)ρ_AB σ_A σ_B. The standard deviation is the square root.
Part 6: Correlation 6-47/49 Risk and Return: Example. Suppose you know μ_A, μ_B, ρ_AB, σ_A, and σ_B. (You have watched these stocks for over 6 years.) The mean and standard deviation are then just functions of w. I will then compute the mean and standard deviation for different values of w. For our Microsoft and Walmart example, σ_A = .114, σ_B = .086, and ρ_AB = .249. E[return] = wμ_A + (1-w)μ_B. SD[return] = sqrt[w²(.114²) + (1-w)²(.086²) + 2w(1-w)(.249)(.114)(.086)] = sqrt[.013w² + .0074(1-w)² + .0049w(1-w)]
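The mean and standard deviation as functions of w can be tabulated with a few lines of code. In the Python sketch below, the standard deviations (.114, .086) and the correlation (.249) are the values that appear in the SD formula on this slide; the two mean returns are hypothetical placeholders chosen only so the code runs, and the function name `portfolio_mean_sd` is likewise an illustrative assumption.

```python
import math

def portfolio_mean_sd(w, mu_a, mu_b, sd_a, sd_b, rho):
    """Mean and standard deviation of the return w*r_A + (1-w)*r_B."""
    mean = w * mu_a + (1 - w) * mu_b
    var = (w ** 2 * sd_a ** 2
           + (1 - w) ** 2 * sd_b ** 2
           + 2 * w * (1 - w) * rho * sd_a * sd_b)
    return mean, math.sqrt(var)

# Standard deviations and correlation as they appear in the slide's formula.
SD_A, SD_B, RHO = 0.114, 0.086, 0.249
# HYPOTHETICAL mean returns, for illustration only (not from the slides).
MU_A, MU_B = 0.010, 0.008

# Trace the risk-return curve as the share w in asset A goes from 0 to 1.
curve = []
for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mean, sd = portfolio_mean_sd(w, MU_A, MU_B, SD_A, SD_B, RHO)
    curve.append((w, mean, sd))
    print(f"w={w:.2f}  mean={mean:.4f}  sd={sd:.4f}")
```

At w = 0 and w = 1 the portfolio risk equals σ_B and σ_A respectively. For intermediate w the risk is below the weighted average of the two standard deviations because the correlation is less than 1.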
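Part 6: Correlation 6-48/49 [Figure: risk-return curve for different values of w. Risk = sqrt[w²σ_A² + (1-w)²σ_B² + 2w(1-w)ρ_AB σ_A σ_B] is on the horizontal axis and expected return = wμ_A + (1-w)μ_B is on the vertical axis. The endpoints of the curve are labeled w=1 and w=0.]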
Part 6: Correlation 6-49/49 Summary. Random variables: dependent and independent. Conditional probabilities change with the values of dependent variables. Covariation, and the covariance as a measure (the regression). Correlation as a units-free measure of covariation. Math results: mean of a weighted sum; variance of a weighted sum. Application to a portfolio problem.