Describing Relationships: Scatterplots and Correlation.


1 Describing Relationships: Scatterplots and Correlation

2 2009-05-22 Ranking the states?
• The news is full of “rankings”:
  – Best/worst dressed
  – Sexiest woman/man
  – Best/worst colleges
  – Most expensive/cheapest cities to live in
• Although the College Board strongly discourages it, there is a list that ranks states by average SAT score
• The next slide shows a histogram of average SAT math scores for states

3 [Figure: histogram of state average SAT math scores]

4 Ranking the states? …
• Iowa leads with 602, and DC trails with 476 – out of 800
• Unusual shape:
  – One distinct peak (500-525)
  – One possible small peak (550-575)
• Maybe there are two things going on here?
• To unravel this we look at a scatterplot:
  – It plots the values of one variable against the values of another and presents a “cloud” of points
• Let’s look at the scatterplot of state average SAT math score vs. proportion of seniors taking the SAT …

5 [Figure: scatterplot of state average SAT math score vs. proportion of seniors taking the SAT]

6 Ranking the states? …
• If we ignore the point for Nevada (in the middle), we have two distinct groups:
  1. In the first group no more than 33% of seniors take the SAT, and their average score seems to be about 550
  2. In the second group about 70% of seniors take the SAT, and their average is much lower – near 500

7 Ranking the states? …
• So, the state average math SAT score seems to go down as the proportion of students who take the SAT goes up
  – States with very low proportions taking the SAT are states where most students take the ACT instead, and only the (better) students who apply to selective colleges take the SAT
• To understand a variable, we must often look at how it relates to other variables

8 Introduction
• Our topic today and in coming days is the relationship between two variables
• A main theme is that to understand the relationship between two variables, it is necessary to understand how it is affected by third, fourth or more variables lurking in the background
• For example, a medical study finds that short women are more prone to heart attacks than tall women
  – The study must take into account the effects of weight, exercise habits and diet on this relationship before concluding that height is an important factor on its own

9 Introduction …
• Most statistical studies examine data on more than one variable: so-called multiple-variable data
• Analyzing such data extends the techniques used on one- or two-variable data:
  – Plot the data, then add numerical descriptions
  – Look for overall patterns and deviations from those patterns
  – When the overall pattern is regular, there is sometimes a very brief, concise way to describe it

10 Scatterplots
• Scatterplots are the most common way to display the relationship between two quantitative variables
• The SAT scatterplot shows the relationship between state average SAT math score (response variable) and the proportion of students in the state who take the SAT (explanatory variable)
  – We want to see how the average changes as the proportion taking the test changes

11 Scatterplots …
• Always plot the explanatory variable (if there is one) on the horizontal axis and the response variable on the vertical axis

12 Example 1: Health and wealth
• The next slide shows a scatterplot of data from the World Bank
  – The response variable is life expectancy at birth
  – The explanatory variable is how rich a country is, measured by GDP
• We expect people in richer countries to live longer, and the scatterplot shows this
• But the relationship has an odd shape:
  – There is a lot of variability at the very lowest levels of GDP
  – As GDP increases a bit, so does life expectancy
  – Then beyond a GDP of about $15,000, there is no more improvement in life expectancy

13 [Figure: scatterplot of life expectancy at birth vs. GDP, World Bank data]

14 Example 1: Health and wealth …
• So, that’s the overall pattern
• But there are outliers:
  – Gabon, Equatorial Guinea and Sierra Leone all have slightly higher GDPs than their neighbors, but no better life expectancy
  – These countries have “extra” sources of income through natural resources that bump up their GDPs but appear to do little for health
• Possibly this is because most of this extra income goes to a select few individuals and does not affect the bulk of the population

15 Interpreting scatterplots

16 Interpreting scatterplots …
• In both scatterplots we have seen so far, there is a clear “direction” to the relationship:
  – For the average SAT scores, as the proportion taking the test increases, the average score goes down – this is a negative association
  – For life expectancy and GDP, as GDP increases so does life expectancy (to an extent) – this is a positive association
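
One rough way to check direction numerically, offered here as a sketch and not as the lecture's own method: sum the products of each point's deviations from the two means. A positive sum suggests a positive association, a negative sum a negative one. The function name and the numbers below are hypothetical, invented only for illustration; they are not the SAT or World Bank data.

```python
def association_direction(xs, ys):
    """Classify direction from the sign of the summed products of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"

# Hypothetical values: percent of seniors tested and average score.
pct_taking = [5, 10, 30, 60, 75]
avg_score = [590, 570, 540, 510, 500]

direction = association_direction(pct_taking, avg_score)
print(direction)  # negative
```

This only reports direction; it says nothing about form or strength, which still need the plot.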

17 Interpreting scatterplots …

18 Interpreting scatterplots …
• Each of the scatterplots so far has a clear form
  – For the SAT scores, there are two clusters of states and an overall negative association
  – For life expectancy, there is a clear curved relationship
• The strength of a relationship in a scatterplot is determined by how closely the points follow that form


20 [Table: lengths (cm) of the leg and upper arm bones of the Archaeopteryx fossils]

21 Example 2: Classifying fossils
• The table above contains lengths (cm) of the leg and upper arm bones of Archaeopteryx – an extinct animal thought to connect modern birds with dinosaurs
• Because the six known Archaeopteryx fossil skeletons are of such different sizes, there is a debate as to whether they are really all Archaeopteryx
• These data may help us sort this out – if there is a very strong association between the lengths of these two bones, it is likely that they all come from the same type of animal …

22 [Figure: scatterplot of the Archaeopteryx bone lengths]

23 Example 2: Classifying fossils …
• The scatterplot shows a very strong, positive, straight-line association
• The straight line is important because it is common and simple
• The association is strong because the points lie very close to the line
• It is positive because as the length of one bone increases so does the length of the other
  – The lengths of the two bones grow in proportion to each other
• This suggests the skeletons come from the same type of animal, because that type has a characteristic ratio of these lengths
• The skeletons differ in size because they come from young and old animals

24 Correlation
• A scatterplot displays the direction, form and strength of the relationship between two variables
• A straight-line relation is important because it is simple and common
• A straight-line relation is strong if the points lie close to the line, and weak if the points are widely scattered about the line
• The eye is not a good judge of strength …
• The following two scatterplots show the same data, just with different scales on the axes, but the one on the right looks stronger to the eye

25 [Figure: two scatterplots of the same data with different axis scales]

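26 Correlation …
• We need to follow our earlier strategy for data analysis and use a numerical measure of strength
• Correlation is the measure we use:
  – The correlation r measures the direction and strength of the straight-line relationship between two quantitative variables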

27 Correlation …
• Actually calculating the correlation r takes some work, work that is usually done by a calculator or computer
• We’re going to go through an example of how to calculate r, but we want to stay focused on understanding it rather than just on how to calculate it

28 Example 3: Calculating correlation
• Let’s use our fossil data: let the femur length be x and the humerus length be y; there are n = 5 fossils in the data set
  1. Step 1: Find the mean and standard deviation of both x and y. A calculator tells us:

29 Example 3: Calculating correlation …
  2. Step 2: Using the means and standard deviations from Step 1, calculate the standard scores for each x-value and each y-value:

30 Example 3: Calculating correlation …
  3. Step 3: The correlation is the average of the products of these standard scores. As with the standard deviation, we “average” by dividing by (n – 1), one fewer than the number of individuals
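
Steps 1 to 3 can be sketched in Python as follows. The data table is not part of this transcript, so the femur and humerus lengths below are an assumption: they are the values usually quoted for this textbook example, and they do reproduce the r = 0.994 cited on a later slide. The helper names (`mean`, `sample_sd`, `standard_scores`) are illustrative only.

```python
from math import sqrt

# ASSUMED data (cm): the slide's table is missing from the transcript.
femur = [38, 56, 59, 64, 74]      # x
humerus = [41, 63, 70, 72, 84]    # y

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Standard deviation using the (n - 1) divisor."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def standard_scores(values):
    """Convert each value to (value - mean) / standard deviation."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]

# Step 1: means and standard deviations
x_bar, s_x = mean(femur), sample_sd(femur)
y_bar, s_y = mean(humerus), sample_sd(humerus)

# Step 2: standard scores for every x and y
z_x = standard_scores(femur)
z_y = standard_scores(humerus)

# Step 3: "average" the products, dividing by n - 1
n = len(femur)
r = sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

print(round(x_bar, 1), round(s_x, 2))  # 58.2 13.2
print(round(y_bar, 1), round(s_y, 2))  # 66.0 15.89
print(round(r, 3))                     # 0.994
```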

31 Example 3: Calculating correlation …
• The formula is:
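
Written out from Steps 1 to 3, the formula would take the form below. The symbols are assumptions chosen to match the steps: x̄ and ȳ are the means, s_x and s_y the standard deviations (computed with the n – 1 divisor), and the sum runs over all n individuals.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor in parentheses is a standard score from Step 2, and the leading 1/(n – 1) is the “averaging” from Step 3.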

32 Understanding correlation
• Positive r indicates positive association between the variables
• Negative r indicates negative association between the variables
• The correlation r always falls between -1 and 1
  – Values of r near 0 indicate a weak straight-line relationship
  – The strength of the relationship increases as r moves away from 0 toward either -1 or 1
  – Values of r close to -1 or 1 indicate that the points lie very close to a straight line
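
A small sketch of the two extremes, using made-up numbers: points that lie exactly on a rising line should give r = 1, points exactly on a falling line r = -1, and a loosely scattered set something in between. The `correlation` helper is a hypothetical name that simply follows the standard-score recipe from Example 3.

```python
from math import sqrt

def correlation(xs, ys):
    """Average product of standard scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # exactly on a line with positive slope
falling = [10 - 3 * x for x in xs]   # exactly on a line with negative slope
scattered = [3, 1, 4, 1, 5]          # hypothetical, loosely related values

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
r_scattered = correlation(xs, scattered)
print(round(r_rising, 6), round(r_falling, 6))  # 1.0 -1.0
```

Because of floating-point rounding, the perfect-line results may differ from ±1 in the last few decimal places, hence the rounding before printing.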

33 Understanding correlation …
• Because r uses standard scores, r does not change when we change the units of measurement of x, y or both
  – r is a unitless number
• Correlation r ignores the distinction between explanatory and response variables – if we reverse them, we get the same value for r
• Correlation measures the strength of straight-line relationships only – not curved relationships
• Like the mean and standard deviation, r is strongly affected by outliers, so use it with caution
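
These four properties can be checked on a toy data set. Everything below is hypothetical: the numbers are invented for illustration, and `correlation` is an illustrative helper following the standard-score recipe, not a library function.

```python
from math import sqrt

def correlation(xs, ys):
    """Average product of standard scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]   # hypothetical measurements
ys = [2, 3, 5, 4, 6]
r_base = correlation(xs, ys)

# 1. Changing units (here, multiplying x by 2.54) leaves r unchanged.
r_rescaled = correlation([x * 2.54 for x in xs], ys)

# 2. Swapping explanatory and response variables leaves r unchanged.
r_swapped = correlation(ys, xs)

# 3. A perfectly curved relationship can still give r of zero.
curve_x = [-2, -1, 0, 1, 2]
r_curved = correlation(curve_x, [x ** 2 for x in curve_x])

# 4. A single outlier can change r drastically.
r_outlier = correlation(xs + [6], ys + [0])

print(round(r_base, 3), round(r_rescaled, 3), round(r_swapped, 3))  # 0.9 0.9 0.9
```

In this toy case the one added point at (6, 0) pulls r from 0.9 down to roughly zero, which is why a scatterplot should always accompany the number.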

34 Understanding correlation …
• The result of 0.994 that we just calculated for the fossils indicates a very strong positive relationship between femur length and humerus length
• Here are some example scatterplots and their r values to give you some practice at thinking about r

35-41 [Figures: example scatterplots with their r values]

42 Summary
• Most statistical studies examine relationships between two or more variables
• A scatterplot is a graph of the relationship between two quantitative variables
  – If you have response and explanatory variables, put the response on the vertical axis and the explanatory on the horizontal axis
• When you examine a scatterplot, look for:
  – Direction (positive or negative)
  – Form (straight or curved)
  – Strength
  – Outliers

43 Summary …
• The correlation r measures the direction and strength of straight-line relationships between quantitative variables
  – r is between -1 and 1
  – The sign of r shows whether the association is positive or negative
  – The value of r shows how strong the relationship is; stronger relationships have values closer to -1 or 1

