Covariance
[Figure: a point (x, y) on x and y axes with x - x̄ > 0 and y - ȳ > 0]
Covariance
[Figure: a point (x, y) on x and y axes with x - x̄ < 0 and y - ȳ > 0]
Covariance: So what happens on balance?
- Below-average values of x paired with above-average values of y
- Above-average values of x paired with above-average values of y
- Below-average values of x paired with below-average values of y
- Above-average values of x paired with below-average values of y
Covariance: What happens on balance? Calculate the average of the products of the deviations, (x - x̄)(y - ȳ).
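Covariance averages the products of the paired deviations from the means. Here is a minimal Python sketch of that calculation. The data are made up for illustration (the slide's data are not shown), and `covariance` is a hypothetical helper name. It divides by n - 1, the usual sample convention; dividing by n instead gives the plain average of the products.

```python
# Hypothetical data, for illustration only (not the data behind the slide).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # A product is positive when x and y are on the same side of their means
    # and negative when they are on opposite sides.
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)]
    return sum(products) / (n - 1)

s_xy = covariance(x, y)
print(s_xy)
```

A positive result means the same-side pairs outweigh the opposite-side pairs on balance.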
Covariance Example: s_xy = 1.999
[Figure: scatter plot with axes labeled Wage and Aptitude]
Correlation: r_xy = 0.476
[Figure: scatter plot with axes labeled Wage and Aptitude]
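Correlation rescales covariance by the two standard deviations, r_xy = s_xy / (s_x s_y), so it always lies between -1 and 1. A minimal sketch, again with made-up data rather than the wage and aptitude data behind the slide; `correlation` is a hypothetical helper name.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    return s_xy / (s_x * s_y)

r_xy = correlation(x, y)
print(round(r_xy, 3))
```

Unlike covariance, the result does not change if x or y is measured in different units.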
Perfect Correlation
Fit That Line! Candidate lines: y = 2,500 + 1,800x; y = 10,000 + 1,000x; y = 13,000 + 750x
Fit That Line! y = 8,135 + 1,233x minimizes the sum of squared errors.
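A minimal sketch of what "minimizes the sum of squared errors" means, using made-up data (the data behind the slide are not shown). The helper names `fit_line` and `sum_squared_errors` are hypothetical. The fitted line is compared with one eyeballed candidate line.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(xs, ys))

intercept, slope = fit_line(x, y)
sse_fit = sum_squared_errors(x, y, intercept, slope)
sse_guess = sum_squared_errors(x, y, 0.0, 1.0)  # an eyeballed candidate line

print(intercept, slope, sse_fit, sse_guess)
```

No other straight line gives a smaller sum of squared errors on the same data than the fitted one.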
Word Problem: Students in a small class were polled by a researcher attempting to establish a relationship between hours of study in the week preceding a test and the result of the test. If you get data on hours studied and exam results, which variable is the dependent variable? Why?
Word Problem: y = 39.401 + 2.122x
Word Problem: Excel Regression Output (Data Analysis Add-In)

Regression Statistics
Multiple R          0.770
R Squared           0.594
Adj. R Squared      0.543
Standard Error     10.710
Obs.                   10

ANOVA
             df        SS        MS        F  Significance
Regression    1  1340.452  1340.452   11.686         0.009
Residual      8   917.648   114.706
Total         9  2258.100

            Coeff.  Std. Error  t stat  p value  Lower 95%  Upper 95%
Intercept   39.401      12.153   3.242    0.012     11.375     67.426
hours        2.122       0.621   3.418               0.691      3.554
Word Problem: Excel Regression Output (StatPad Add-In)

Regression analysis to predict score from hours.
The prediction equation is: Score = 39.401 + 2.122 hours

 0.594  R squared
10.710  Standard error of estimate
    10  Number of observations
11.686  F statistic
 0.009  P value

           Coeff  95% LowerCI  95% UpperCI  StdErr      t      p  Significant
Constant  39.401       11.375       67.426  12.153  3.242  0.012  Yes (p<0.05)
hours      2.122        0.691        3.554   0.621  3.418
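The summary numbers in this output are tied together by a few identities. The sketch below recomputes them from the sums of squares, the slope, and its standard error as reported for the word problem. The critical value 2.306 (two-sided 95%, 8 degrees of freedom) is taken from a standard t table rather than computed, and the variable names are illustrative. Small differences from the printed output come from rounding in the reported inputs.

```python
import math

# Values as reported in the regression output for the word problem.
ss_regression = 1340.452
ss_residual = 917.648
df_regression = 1
df_residual = 8
coef_hours = 2.122
se_hours = 0.621

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total            # share of variation explained
ms_residual = ss_residual / df_residual
f_stat = (ss_regression / df_regression) / ms_residual
std_error = math.sqrt(ms_residual)              # standard error of estimate
t_hours = coef_hours / se_hours                 # t statistic for the slope

# Two-sided 95% critical t value for 8 degrees of freedom (from a t table).
T_CRIT_8DF = 2.306
ci_low = coef_hours - T_CRIT_8DF * se_hours
ci_high = coef_hours + T_CRIT_8DF * se_hours

print(round(r_squared, 3), round(f_stat, 3), round(std_error, 3))
print(round(t_hours, 3), round(ci_low, 3), round(ci_high, 3))
```

With a single explanatory variable, the F statistic is the square of the slope's t statistic, which is why the two tests agree.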
The Nine Lives of Goldfish

Regression Statistics
Multiple R          0.671
R Squared           0.450
Adj. R Squared      0.340
Standard Error     45.214
Obs.                    7

ANOVA
             df         SS        MS      F  Significance
Regression    1   8360.048  8360.048  4.089         0.099
Residual      5  10221.667  2044.333
Total         6  18581.714

             Coeff.  Std. Error  t stat  p value  Lower 95%  Upper 95%
Intercept    91.500      22.607   4.047    0.010     33.387    149.613
filter      -69.833      34.533  -2.022            -158.603     18.936
Predicting Job Performance

Regression Statistics
R Squared           0.107
Adj. R Squared
Standard Error      1.955
Obs.                 3525

ANOVA
               df         SS       MS        F  Significance
Regression      3   1620.806  540.269  141.287         0.000
Residual     3521  13463.982    3.824
Total        3524  15084.788

            Coeff.  Std. Error   t stat  p value  Lower 95%  Upper 95%
Intercept    4.865       0.171   28.423               4.529      5.200
Age         -0.037       0.002  -20.263              -0.041     -0.034
Seniority    0.011       0.003    3.325    0.001      0.004      0.017
Cognitive   -0.032       0.033   -0.983    0.326     -0.097      0.032

Simple Regression: Perform = 3.956 - 0.022 age
Predicting Job Performance
Perform = 4.865 - 0.037 age + 0.011 seniority - 0.032 cognitive

Age  Seniority  Cognitive  Predicted Performance  Net Difference
 35         10          1                  3.626
 36         10          1                  3.589          -0.037
 45         10          1                  3.251
 46         10          1                  3.214          -0.037
 35         20          1                  3.731
 35         21          1                  3.742           0.011

Note the importance of ceteris paribus (all else constant).
Predicting Job Performance
Perform = 4.865 - 0.037 age + 0.011 seniority - 0.032 cognitive
[Figure: fitted line, holding seniority constant at 10 and cognitive constant at 1]
Predicting Job Performance
Perform = 4.865 - 0.037 age + 0.011 seniority - 0.032 cognitive
[Figure: fitted line, holding seniority constant at 20 and cognitive constant at -1]
With linear models, the particular values of the other variables don't matter, only that they are held constant.
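A minimal sketch of the ceteris paribus point, using the rounded coefficients from the fitted equation. The function names are hypothetical. Predicted levels computed this way can differ slightly from the slide's table, presumably because the table was built from unrounded coefficients (an assumption). The one-unit differences still equal the coefficients.

```python
def predict_performance(age, seniority, cognitive):
    """Fitted equation with the rounded coefficients shown on the slide."""
    return 4.865 - 0.037 * age + 0.011 * seniority - 0.032 * cognitive

def age_effect(age, seniority, cognitive):
    """Change in prediction from one more year of age, all else constant."""
    return (predict_performance(age + 1, seniority, cognitive)
            - predict_performance(age, seniority, cognitive))

# The age effect is the same wherever the other variables are held.
effects = [
    age_effect(35, 10, 1),
    age_effect(45, 10, 1),
    age_effect(35, 20, -1),
]
seniority_effect = predict_performance(35, 21, 1) - predict_performance(35, 20, 1)

print([round(e, 3) for e in effects])
print(round(seniority_effect, 3))
```

In a linear model without interactions, each coefficient is the change in the prediction per one-unit change in its own variable, whatever values the other variables are held at.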
Predicting Job Perf. With a Dummy Variable

Regression Statistics
R Squared           0.110
Adj. R Squared      0.109
Standard Error      1.953
Obs.                 3525

ANOVA
               df         SS       MS        F  Significance
Regression      4   1657.286  414.321  108.614         0.000
Residual     3520  13427.502    3.815
Total        3524  15084.788

                  Coeff.  Std. Error   t stat  p value  Lower 95%  Upper 95%
Intercept          4.820       0.172   28.096               4.484      5.156
Age               -0.037       0.002  -20.231              -0.041     -0.034
Seniority          0.010       0.003    3.271    0.001      0.004      0.017
Cognitive         -0.025       0.033   -0.756    0.450     -0.090      0.040
Structured int.    2.850       0.922    3.092               1.043      4.658

Structured Interview Dummy Variable: 1=yes, 0=no
Predicting Job Perf. With a Dummy Variable
Perform = 4.820 - 0.037 age + 0.010 seniority - 0.025 cognitive + 2.850 structured interview

Age  Seniority  Cognitive  Structured Interview  Predicted Performance  Net Difference
 35         10          1                     0                  3.600
 35         10          1                     1                  6.450           2.850
 45          5          2                     0                  3.155
 45          5          2                     1                  6.005           2.850

Dummy variable turns "on" and "off" with all else constant.
Predicting Job Perf. With a Dummy Variable
Perform = 4.820 - 0.037 age + 0.010 seniority - 0.025 cognitive + 2.850 structured interview
[Figure: fitted lines, holding seniority constant at 10 and cognitive constant at 1]
Predicting Job Perf. With a Dummy Variable
Note the new y-intercept.
[Figure: fitted lines with Seniority = 20, Cognitive = 0]
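A minimal sketch of how the dummy variable shifts the prediction, using the coefficients from the regression output with the structured interview dummy (intercept 4.820). The function name is hypothetical. Switching the dummy from 0 to 1 moves the whole line up by 2.850 without changing any slope.

```python
def predict_performance(age, seniority, cognitive, structured_interview):
    """Fitted equation; structured_interview is 1 for yes and 0 for no."""
    return (4.820 - 0.037 * age + 0.010 * seniority
            - 0.025 * cognitive + 2.850 * structured_interview)

# Same person, dummy off and on.
without_interview = predict_performance(35, 10, 1, 0)
with_interview = predict_performance(35, 10, 1, 1)

# Intercepts of the performance-versus-age line at seniority 20, cognitive 0.
intercept_off = predict_performance(0, 20, 0, 0)
intercept_on = predict_performance(0, 20, 0, 1)

print(round(without_interview, 3), round(with_interview, 3))
print(round(intercept_off, 3), round(intercept_on, 3))
```

The two lines are parallel: only the intercept differs, and it differs by the dummy's coefficient.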
Multiple Dummy Variables

  Source |       SS       df       MS              Number of obs =    3525
---------+------------------------------           F( 14,  3510) =  125.63
   Model |  5035.58483    14  359.684631           Prob > F      =  0.0000
Residual |  10049.2032  3510  2.86302087           R-squared     =  0.3338
---------+------------------------------           Adj R-squared =  0.3312
   Total |  15084.7881  3524  4.28058685           Root MSE      =   1.692

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |  -.0301543   .0016933   -17.808   0.000     -.0334742   -.0268344
seniorty |   .0016888    .002762     0.611   0.541     -.0037265     .007104
cognitve |   .0119113   .0286362     0.416   0.677     -.0442339    .0680565
strucint |   3.665569   .7995184     4.585   0.000      2.098001    5.233137
    job1 |   1.928286   .1277788    15.091   0.000      1.677758    2.178814
    job2 |    .426524   .1260009     3.385   0.001      .1794815    .6735664
    job3 |   .1407506   .1306411     1.077   0.281     -.1153896    .3968908
    job4 |   .2921016   .1347211     2.168   0.030      .0279621    .5562411
    job5 |  -1.069262   .1331017    -8.033   0.000     -1.330227   -.8082974
    job6 |  -1.179162   .1377497    -8.560   0.000     -1.449239   -.9090839
    job7 |  -1.304191   .1406734    -9.271   0.000     -1.580001   -1.028381
    job8 |  -.8530246   .1381293    -6.176   0.000     -1.123846   -.5822027
    job9 |  -.6652395   .1501504    -4.430   0.000     -.9596304   -.3708487
   job10 |  -1.012177   .1420816    -7.124   0.000     -1.290748   -.7336058
   _cons |   5.021799   .1643372    30.558   0.000      4.699593    5.344005
------------------------------------------------------------------------------

Note: job1-job10 are dummy variables representing 10 different job classes (job11 is the omitted reference category)
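A minimal sketch of how a set of dummies with an omitted reference category works, using the job1 to job10 coefficients from the output. The helper names `job_dummies` and `job_effect` are hypothetical. Each coefficient is the predicted difference from job class 11, all else constant, and the difference between two included classes is the difference of their coefficients.

```python
# Coefficients on job1..job10; job class 11 is the omitted reference.
JOB_COEFS = {
    1: 1.928286, 2: 0.426524, 3: 0.1407506, 4: 0.2921016, 5: -1.069262,
    6: -1.179162, 7: -1.304191, 8: -0.8530246, 9: -0.6652395, 10: -1.012177,
}

def job_dummies(job_class):
    """Return [job1, ..., job10] as 0/1 indicators; class 11 gives all zeros."""
    return [1 if job_class == j else 0 for j in range(1, 11)]

def job_effect(job_class):
    """Contribution of the job dummies to predicted performance."""
    dummies = job_dummies(job_class)
    return sum(JOB_COEFS[j] * d for j, d in zip(range(1, 11), dummies))

gap_1_vs_11 = job_effect(1) - job_effect(11)  # job1 relative to the reference
gap_1_vs_5 = job_effect(1) - job_effect(5)    # job1 relative to job5

print(round(gap_1_vs_11, 6), round(gap_1_vs_5, 6))
```

Choosing a different reference class would change every job coefficient but not the gaps between classes.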
Interaction Variables

  Source |       SS       df       MS              Number of obs =    3525
---------+------------------------------           F(  6,  3518) =  121.08
   Model |  2581.89927     6  430.316544           Prob > F      =  0.0000
Residual |  12502.8888  3518  3.55397635           R-squared     =  0.1712
---------+------------------------------           Adj R-squared =  0.1697
   Total |  15084.7881  3524  4.28058685           Root MSE      =  1.8852

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |      -.006   .0034204    -1.705   0.088     -.0125379    .0008743
seniorty |       .011   .0030589     3.559   0.000      .0048879    .0168827
cognitve |      -.005   .0318774    -0.167   0.867     -.0678283    .0571719
strucint |      2.129   .8937022     2.383   0.017      .3770909    3.881545
  manual |     -1.513   .2391962    -6.327   0.000     -1.982442   -1.044488
manl_age |      -.042    .004011   -10.439   0.000     -.0497349   -.0340066
   _cons |      6.009   .2354444    25.526   0.000      5.548275    6.471517
------------------------------------------------------------------------------

Note: manual is a dummy variable indicating a manual occupation; manl_age is age interacted with manual (i.e. manl_age = manual*age)
Interaction Variables
Note the different slopes, too.
[Figure: fitted lines with Seniority = 20, Cognitive = 0, StrucInt = 0]
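A minimal sketch of why the interaction term produces different slopes, using the rounded coefficients from the interaction output. The function name is hypothetical. The `manual` dummy shifts the intercept, and `manl_age` (manual*age) changes the age slope for manual occupations.

```python
def predict_performance(age, seniority, cognitive, strucint, manual):
    """Fitted equation with the age-by-manual interaction; manual is 0 or 1."""
    return (6.009 - 0.006 * age + 0.011 * seniority - 0.005 * cognitive
            + 2.129 * strucint - 1.513 * manual - 0.042 * manual * age)

def age_slope(manual):
    """Change in prediction per extra year of age, all else constant."""
    return (predict_performance(41, 20, 0, 0, manual)
            - predict_performance(40, 20, 0, 0, manual))

slope_nonmanual = age_slope(0)   # -0.006
slope_manual = age_slope(1)      # -0.006 + (-0.042) = -0.048

# Intercepts at Seniority=20, Cognitive=0, StrucInt=0.
intercept_nonmanual = predict_performance(0, 20, 0, 0, 0)
intercept_manual = predict_performance(0, 20, 0, 0, 1)

print(round(slope_nonmanual, 3), round(slope_manual, 3))
print(round(intercept_nonmanual, 3), round(intercept_manual, 3))
```

With the interaction in the model, the effect of age is no longer a single number: it depends on whether the occupation is manual.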
Another Interaction Variable Example

      Source |       SS       df       MS              Number of obs =   15321
-------------+------------------------------           F(  5, 15315) =  800.50
       Model |   804247599     5   160849520           Prob > F      =  0.0000
    Residual |  3.0773e+09 15315  200936.252           R-squared     =  0.2072
-------------+------------------------------           Adj R-squared =  0.2069
       Total |  3.8816e+09 15320  253367.252           Root MSE      =  448.26

------------------------------------------------------------------------------
    earnwkly |      Coef.
-------------+----------------------------------------------------------------
     married |    136.003
      female |   -169.837
       exper |      2.946
    parttime |   -227.716
      exp_pt |     -1.896
       _cons |    700.802
------------------------------------------------------------------------------

exper is potential labor market experience (age-educ-6)
parttime is a dummy variable indicating a part-time worker
exp_pt is exper interacted with parttime (i.e. exp_pt = exper*parttime)
Interaction Variables
[Figure: fitted lines with Married = 1, Female = 1]
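A minimal sketch of the weekly earnings equation for Married = 1, Female = 1, using the coefficients from the output above. The function name is hypothetical. Part-time status lowers the intercept, and the `exp_pt` interaction flattens the experience slope, so the full-time versus part-time gap widens with experience.

```python
def predict_earnings(married, female, exper, parttime):
    """Fitted weekly earnings; married, female, and parttime are 0/1 dummies."""
    return (700.802 + 136.003 * married - 169.837 * female
            + 2.946 * exper - 227.716 * parttime - 1.896 * exper * parttime)

# Intercepts (exper = 0) for married women, full-time versus part-time.
fulltime_intercept = predict_earnings(1, 1, 0, 0)
parttime_intercept = predict_earnings(1, 1, 0, 1)

# Experience slopes: change in earnings per extra year of experience.
fulltime_slope = predict_earnings(1, 1, 11, 0) - predict_earnings(1, 1, 10, 0)
parttime_slope = predict_earnings(1, 1, 11, 1) - predict_earnings(1, 1, 10, 1)

# Full-time minus part-time gap at 0 and at 10 years of experience.
gap_at_0 = fulltime_intercept - parttime_intercept
gap_at_10 = predict_earnings(1, 1, 10, 0) - predict_earnings(1, 1, 10, 1)

print(round(fulltime_intercept, 3), round(parttime_intercept, 3))
print(round(fulltime_slope, 3), round(parttime_slope, 3))
print(round(gap_at_0, 3), round(gap_at_10, 3))
```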
Adjusted R2

  Source |       SS       df       MS              Number of obs =    3525
---------+------------------------------           F( 14,  3510) =  125.63
   Model |  5035.58483    14  359.684631           Prob > F      =  0.0000
Residual |  10049.2032  3510  2.86302087           R-squared     =  0.3338
---------+------------------------------           Adj R-squared =  0.3312
   Total |  15084.7881  3524  4.28058685           Root MSE      =   1.692

------------------------------------------------------------------------------
 perform |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |  -.0301543   .0016933   -17.808   0.000     -.0334742   -.0268344
seniorty |   .0016888    .002762     0.611   0.541     -.0037265     .007104
cognitve |   .0119113   .0286362     0.416   0.677     -.0442339    .0680565
strucint |   3.665569   .7995184     4.585   0.000      2.098001    5.233137
    job1 |   1.928286   .1277788    15.091   0.000      1.677758    2.178814
    job2 |    .426524   .1260009     3.385   0.001      .1794815    .6735664
    job3 |   .1407506   .1306411     1.077   0.281     -.1153896    .3968908
    job4 |   .2921016   .1347211     2.168   0.030      .0279621    .5562411
    job5 |  -1.069262   .1331017    -8.033   0.000     -1.330227   -.8082974
    job6 |  -1.179162   .1377497    -8.560   0.000     -1.449239   -.9090839
    job7 |  -1.304191   .1406734    -9.271   0.000     -1.580001   -1.028381
    job8 |  -.8530246   .1381293    -6.176   0.000     -1.123846   -.5822027
    job9 |  -.6652395   .1501504    -4.430   0.000     -.9596304   -.3708487
   job10 |  -1.012177   .1420816    -7.124   0.000     -1.290748   -.7336058
   _cons |   5.021799   .1643372    30.558   0.000      4.699593    5.344005
------------------------------------------------------------------------------

Note: job1-job10 are dummy variables representing 10 different job classes (job11 is the omitted reference category)
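A minimal sketch of how adjusted R2 relates to R2, using the standard formula Adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. The inputs are the sums of squares from the output above (n = 3525, k = 14); `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for each added predictor."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

ss_model = 5035.58483
ss_total = 15084.7881

r2 = ss_model / ss_total
adj_r2 = adjusted_r2(r2, n_obs=3525, n_predictors=14)

print(round(r2, 4), round(adj_r2, 4))
```

With 3,525 observations the penalty for 14 predictors is small, so the two measures are close. In a small sample the gap can be much larger.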
Causality ? Workforce Optimization Sue Bostrom: Leadership on IT—What’s It Worth? September 10, 2001 “For those who still doubt that Internet-related investments will pay off, consider this: A PricewaterhouseCoopers study released earlier this year found that productivity gains in 2000 were 2.7 times greater for Internet-enabled companies than for businesses that have not leveraged the Web.” http://business.cisco.com/prod/tree.taf%3Fpublic_view=true&kbns=1&asset_id=66966.html
Causality: Reasons for an estimated statistical relationship
1. The explanatory variable is the direct cause of the response (dependent) variable
2. The response variable is causing a change in the explanatory variable (reverse causality)
3. The explanatory variable is a contributing, but not sole, cause of the response variable
4. Confounding variables may exist
5. Both variables may stem from a common cause
6. Both variables are changing over time
7. Coincidence
Source: Jessica M. Utts (1999) Seeing Through Statistics, 2nd ed., Pacific Grove, CA: Duxbury, p. 186.
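A small simulation can illustrate the common cause case from the list above. In this sketch the setup is entirely hypothetical: neither x nor y affects the other, but both are built from the same underlying variable z plus independent noise, and they still end up correlated.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 10_000

# z is the common cause; x and y each add their own independent noise.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

def correlation(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

r_xy = correlation(x, y)
print(round(r_xy, 2))
```

In this design the correlation should come out near 0.5 even though changing x would do nothing to y, which is why an estimated relationship alone does not establish causality.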