Copyright (c)Bani K. Mallick1 STAT 651 Lecture #21.

1 Copyright (c)Bani K. Mallick1 STAT 651 Lecture #21

2 Copyright (c)Bani K. Mallick2 Topics in Lecture #21: Correlation

3 Copyright (c)Bani K. Mallick3 Book Sections Covered in Lecture #21: Chapter 11.7

4 Copyright (c)Bani K. Mallick4 Lecture 20 Review: Leverage and Outliers. Outliers in linear regression are difficult to diagnose: whether a point is an outlier depends crucially on where its X value is. [Figure: scatterplot with fitted line.] A boxplot of Y alone would flag this point as an outlier, when in reality it fits the line quite well.

5 Copyright (c)Bani K. Mallick5 Lecture 20 Review: Outliers and Leverage. It is also the case that one observation can have a dramatic impact on the fit. [Figure: scatterplot with fitted line.] The slope of the line depends crucially on the value far to the right.

6 Copyright (c)Bani K. Mallick6 Lecture 20 Review: Outliers and Leverage. But outliers can occur. [Figure: scatterplot showing the fitted line with the outlier and the fitted line without the outlier.] This point is simply too high for its value of X.

7 Copyright (c)Bani K. Mallick7 Lecture 20 Review: Outliers and Leverage. A leverage point is an observation with a value of X that is outlying among the X values. An outlier is an observation of Y that seems not to agree with the main trend of the data. Outliers and leverage values can distort the fitted least squares line. It is thus important to have diagnostics to detect when disaster might strike.

8 Copyright (c)Bani K. Mallick8 Lecture 20 Review: Outliers and Leverage. We have three methods for diagnosing high leverage values and outliers: (1) leverage plots (for a single X, these are basically the same as boxplots of the X-space); (2) Cook’s distance, which measures how much the fitted line changes if the observation is deleted; (3) residual plots.

9 Copyright (c)Bani K. Mallick9 Correlation and Measures of Fit. You all know the word “correlation”, as in “Height and Weight are positively correlated”. Many of you may also have heard of R-squared, denoted by $R^2$. Both are measures of how well an independent variable predicts a dependent variable.

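10 Copyright (c)Bani K. Mallick10 Correlation and Measures of Fit. $R^2$ measures the fraction of variance explained by the least squares line. The relevant sums of squares are the total sum of squares of Y about its mean, the sum of squares explained by the fitted line, and the residual sum of squares. $R^2$ is the fraction of the total sum of squares explained by the fitted line.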
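
The slide's formulas did not survive in the transcript. In common textbook notation (an assumption here, not necessarily the notation used in the lecture), with fitted values $\hat{Y}_i$ and sample mean $\bar{Y}$, the sums of squares and $R^2$ can be written as:

```latex
\begin{align*}
\text{SS(Total)}      &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \\
\text{SS(Regression)} &= \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \\
\text{SS(Residual)}   &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \\
R^2 &= \frac{\text{SS(Regression)}}{\text{SS(Total)}}
     = 1 - \frac{\text{SS(Residual)}}{\text{SS(Total)}}.
\end{align*}
```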

11 Copyright (c)Bani K. Mallick11 Correlation and Measures of Fit. $R^2$ measures the fraction of variance explained by the least squares line. If Y and X are perfectly linearly related, then all the variation in Y is explained by the line, and thus $R^2 = 1$. If Y and X are completely independent, then the line explains nothing about Y, so $R^2 = 0$. However, Y and X can be perfectly related but not linearly, and $R^2$ is misleading in this case (see later on).
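
As a rough illustration of R-squared as a fraction of variance explained, the sketch below fits a least squares line by hand and divides the regression sum of squares by the total sum of squares. The function name `r_squared` and the small data sets are made up for this example and are not from the lecture.

```python
def r_squared(x, y):
    """R^2 of the least squares line of y on x: SS(regression) / SS(total)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    return ss_regression / ss_total


# Hypothetical data: a perfect line, and the same x with scattered y values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_line = [3.0 + 2.0 * xi for xi in x]
y_scattered = [2.0, 1.0, 4.0, 3.0, 5.0]

r2_line = r_squared(x, y_line)            # all variation explained by the line
r2_scattered = r_squared(x, y_scattered)  # only part of the variation explained
print(r2_line, r2_scattered)
```

For the perfect line the ratio is 1; for the scattered data the line explains 64% of the variation in Y.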

12 Copyright (c)Bani K. Mallick12 GPA and Height. Note that this is a fairly weak relationship, so little variance is explained, which suggests R-squared is near zero.

13 Copyright (c)Bani K. Mallick13 GPA and Height

14 Copyright (c)Bani K. Mallick14 Aortic Valve Area and Body Surface Area. Note that this is a stronger relationship, which suggests R-squared is higher.

15 Copyright (c)Bani K. Mallick15 AVA and BSA in Healthy Kids

16 Copyright (c)Bani K. Mallick16 Correlation and Measures of Fit. The (Pearson) correlation coefficient measures how well Y and X are linearly related. The correlation is always between -1 and +1.

17 Copyright (c)Bani K. Mallick17 Correlation and Measures of Fit. If the correlation = +1, then Y and X are perfectly positively linearly related. If the correlation = -1, then Y and X are perfectly negatively linearly related. If the correlation = 0, then Y and X are not linearly related.
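
The three cases above can be checked numerically. This is a minimal sketch using the usual sample Pearson formula; the helper name `pearson` and the data are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson(x, [2.0 * xi + 1.0 for xi in x])     # increasing line
r_down = pearson(x, [10.0 - 3.0 * xi for xi in x])  # decreasing line
r_scattered = pearson(x, [2.0, 1.0, 4.0, 3.0, 5.0]) # positive trend with scatter
print(r_up, r_down, r_scattered)
```

The two exact lines give +1 and -1, and the scattered data give 0.8. With a single X, the squared Pearson correlation equals the R-squared of the least squares line.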

18 Copyright (c)Bani K. Mallick18 Correlation and Measures of Fit. The (Spearman) correlation coefficient measures how well Y and X are monotonically related. To compute it, replace Y by its rank among the Y’s, replace X by its rank among the X’s, and then compute the (Pearson) correlation of the ranks. Why would someone do a Spearman correlation?

19 Copyright (c)Bani K. Mallick19 Correlation and Measures of Fit. The (Spearman) correlation coefficient measures how well Y and X are monotonically related. To compute it, replace Y by its rank among the Y’s, replace X by its rank among the X’s, and then compute the (Pearson) correlation of the ranks. Why would someone do a Spearman correlation? Because it is more robust to outliers, and it is not affected by monotone increasing transformations such as the logarithm, which leave the ranks unchanged.
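
The rank-then-correlate recipe can be sketched as follows. The helpers `ranks`, `pearson`, and `spearman` and the data are hypothetical; ties are given average ranks, which is a common convention but an assumption here since the slides do not discuss ties.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 2.5, 4.8, 5.1, 9.7]
log_y = [math.log(1.0 + yi) for yi in y]  # monotone transformation of y

rho_raw = spearman(x, y)
rho_log = spearman(x, log_y)
print(rho_raw, rho_log)                    # identical: the ranks did not change
print(pearson(x, y), pearson(x, log_y))    # Pearson generally changes
```

Because log(1 + y) preserves the ordering of the y values, both Spearman correlations are computed from exactly the same ranks.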

20 Copyright (c)Bani K. Mallick20 Correlation and Measures of Fit. Both types of correlations are easily obtained in SPSS. Go to “Analyze”, “Correlation” and enter all the variables that you want correlations for. You have to click on Spearman to get it; otherwise you get only Pearson. Confidence intervals for the population correlations are not included. SPSS demonstration using aortic data.

21 Copyright (c)Bani K. Mallick21 Correlation and Measures of Fit. The diagonals are meaningless: Y is perfectly correlated with Y.

Correlations (Pearson; N = 70 for each pair)
                            Body Surface Area   Aortic Valve Area   Log(1+Aortic Valve Area)
Body Surface Area           1.000               .866**              .873**
Aortic Valve Area           .866**              1.000               .982**
Log(1+Aortic Valve Area)    .873**              .982**              1.000
Sig. (2-tailed) is reported as .000 for the off-diagonal entries.
**. Correlation is significant at the 0.01 level (2-tailed).

22 Copyright (c)Bani K. Mallick22 Correlation and Measures of Fit. Note how the Spearman correlation of BSA and AVA is the same as the Spearman correlation of BSA and log(1+AVA): the logarithm is an increasing transformation, so it does not change the ranks.

23 Copyright (c)Bani K. Mallick23 Correlation and Measures of Fit. Both correlations are random variables, i.e., if you redo an experiment, you will get a different Pearson correlation. The population Pearson correlation is $\rho$. The estimated standard error of the sample Pearson correlation $r$ is $\sqrt{(1 - r^2)/(n - 2)}$.

24 Copyright (c)Bani K. Mallick24 Correlation and Measures of Fit. Null hypothesis of no linear relationship: $H_0: \rho = 0$. A $(1 - \alpha)100\%$ CI for the population Pearson correlation $\rho$ is $r \pm z_{\alpha/2}\sqrt{(1 - r^2)/(n - 2)}$. Since the population correlation must be between -1 and +1, you should restrict your interval to that range. Reject the null hypothesis if the interval does not include 0.

25 Copyright (c)Bani K. Mallick25 Correlation and Measures of Fit. Consider the aortic stenosis data for the healthy kids: n = 70, Pearson correlation = 0.866. The 95% CI is 0.747 to 0.985. What is the meaning of this interval?

26 Copyright (c)Bani K. Mallick26 Correlation and Measures of Fit. Consider the aortic stenosis data for the healthy kids: n = 70, Pearson correlation = 0.866. The 95% CI is 0.747 to 0.985. What is the meaning of this interval? We are 95% confident that the population Pearson correlation $\rho$ is between 0.747 and 0.985.
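
The interval on this slide can be reproduced numerically. The sketch below assumes the standard error sqrt((1 - r^2)/(n - 2)) and the normal critical value 1.96; both are inferred from the reported endpoints rather than read from the slide, whose formulas did not survive in the transcript. The function name `pearson_ci` is made up for this example.

```python
import math


def pearson_ci(r, n, z=1.96):
    """Approximate CI for the population Pearson correlation.

    Assumes se(r) = sqrt((1 - r^2) / (n - 2)) and a normal critical value z;
    the interval is clipped to the valid range [-1, +1].
    """
    se = math.sqrt((1.0 - r ** 2) / (n - 2))
    lower = max(-1.0, r - z * se)
    upper = min(1.0, r + z * se)
    return lower, upper


# Healthy kids in the aortic data: n = 70, r = 0.866 (values from the slide).
lower, upper = pearson_ci(0.866, 70)
print(round(lower, 3), round(upper, 3))  # 0.747 0.985

# The interval excludes 0, so the null hypothesis of no linear
# relationship would be rejected at the 5% level.
rejects_null = not (lower <= 0.0 <= upper)
```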

27 Copyright (c)Bani K. Mallick27 Some Warnings About Correlation. The Pearson correlation can be greatly affected by outliers and leverage values. This is why it is good to have the Spearman correlation as well.

28 Copyright (c)Bani K. Mallick28 Aortic Stenosis Data: Note the outlier in the Stenotic Kids

29 Copyright (c)Bani K. Mallick29 Some Warnings About Correlation. The Pearson correlation with the outlier in the stenotic kids is 0.477. It is 0.648 without the outlier. The Spearman correlations are 0.691 and 0.762 with and without the outlier. I can make correlations dance.

30 Copyright (c)Bani K. Mallick30 Some Warnings About Correlation. The correlations are 0.058 (left) and 1.00 (right)! Only one point differs: a high-leverage outlier. [Figure: made-up data with (left) and without (right) a high-leverage outlier; linear regression fits plotted against x.]

31 Copyright (c)Bani K. Mallick31 Some Warnings About Correlation. The Pearson correlation only measures linear association. If your relationship is not linear, then the Pearson correlation can be misleading.

32 Copyright (c)Bani K. Mallick32 Some Warnings About Correlation. Note the perfect quadratic relationship, yet the Pearson correlation = 0.
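
A perfect quadratic relationship with zero Pearson correlation can be checked directly. In this sketch the X values are chosen symmetric about zero (an assumption made for the example), so the positive and negative cross-products cancel exactly.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: Y is determined exactly by X, but not linearly.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

r_quadratic = pearson(x, y)
print(r_quadratic)  # zero: no linear trend despite a perfect relationship
```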

33 Copyright (c)Bani K. Mallick33 Construction Data

34 Copyright (c)Bani K. Mallick34 Construction Data

Correlations (Pearson; N = 447 for each pair)
                                   Age (Modified)   Base Pay (modified)   Log(Base Pay modified - $30,000)
Age (Modified)                     1.000            .178**                .120*
Base Pay (modified)                .178**           1.000                 .896**
Log(Base Pay modified - $30,000)   .120*            .896**                1.000
Sig. (2-tailed): Age vs. Base Pay .000; Age vs. Log(Base Pay modified - $30,000) .011; Base Pay vs. Log(Base Pay modified - $30,000) .000.
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

35 Copyright (c)Bani K. Mallick35 Construction Data

36 Copyright (c)Bani K. Mallick36 Armspan Data (Males)

37 Copyright (c)Bani K. Mallick37 Armspan Data (Males)

38 Copyright (c)Bani K. Mallick38 Armspan Data (Males) A 95% confidence interval for the population Pearson correlation is Meaning?

39 Copyright (c)Bani K. Mallick39 Armspan Data (Males)

