3. Relationships Scatterplots and correlation


1 3. Relationships Scatterplots and correlation
The Practice of Statistics in the Life Sciences Third Edition © 2014 W.H. Freeman and Company

2 Objectives (PSLS Chapter 3)
Relationships: Scatterplots and correlation
Bivariate data
Scatterplots
Interpreting scatterplots
Adding categorical variables to scatterplots
The correlation coefficient r
Facts about correlation

3 Bivariate data
For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable?
[Table: Student ID, Number of Beers, Blood Alcohol Content, for 16 students]
Here we have two quantitative variables recorded for each of 16 students: how many beers they drank and their resulting blood alcohol content (BAC).

4 Scatterplots
A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.
[Table: Student, Beers, BAC, for 16 students]
Note that the student ID is not a quantitative variable. It is only a reference number, a label identifying the individuals. This information does not appear on the graph. Do not use a scatterplot if you have a data set with only one variable. Sometimes, students make a scatterplot of one variable against the ID number; this is incorrect (again, an ID number is not a variable).

5 Explanatory and response variables
A response (dependent) variable measures an outcome of a study. An explanatory (independent) variable may explain or influence changes in a response variable. When there is an obvious explanatory variable, it is plotted on the x (horizontal) axis of the scatterplot.
[Scatterplot: x axis = number of beers (explanatory), y axis = BAC (response)]
We are looking at the effect of the number of beers on blood alcohol content. The response is the resulting blood alcohol content, and we want to see whether we can explain it by the number of beers drunk. Place the explanatory variable (beers) on the x axis and the response variable (BAC) on the y axis.

6 How to scale a scatterplot
The same data appear in all four plots. Both variables should be given a similar amount of space:
The plot is roughly square.
The points should occupy all the plot space (no blank space).

7 Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the "form"
and clear deviations from that pattern:
Outliers of the relationship
This chapter and the next focus on linear relationships.

8 Form: Linear relationship (top left), no relationship (top right), non-linear/curved relationship (bottom). The form of the relationship between 2 quantitative variables refers to the overall pattern.

9 Positive association: High values of one variable tend to occur together with high values of the other variable. Negative association: High values of one variable tend to occur together with low values of the other variable. Direction: Both relationships are linear; the one on the left is positive, and the one on the right is negative.

10 The strength of the relationship between 2 quantitative variables refers to how much variation, or scatter, there is around the main form. Strength: Both are linear relationships, both are negative, but the one on the left is strong and the one on the right is clearly weaker. Words like strong and weak are somewhat relative, so we measure the strength of a linear relationship with the linear correlation coefficient r.

11 An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship. Outlier: The point highlighted doesn’t quite fit the overall negative linear pattern. [We will see in chapters 4 and 23 that a regression outlier has a particularly large residual.]

12 Adding categorical variables to scatterplots
Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph. The graph compares the association between thorax length and longevity of male fruit flies that are allowed to reproduce (green) or not (purple). The pattern is similar in both groups (linear, positive association), but male fruit flies not allowed to reproduce tend to live longer than reproducing male fruit flies of the same size. We see that the pattern of purple dots lies above the pattern of green triangles. So we can conclude that not only do non-reproductive males tend to live longer than reproducing males, but this holds even when we take size (thorax length) into account.

13 Describe this relationship.
[Scatterplot: energy expended as a function of running speed for various treadmill inclines]
Ignoring incline, the scatterplot would look like a mess. However, for each incline, there is a very strong, positive, linear relationship between energy expenditure and speed. In addition, we find that the relationship between energy expenditure and speed is noticeably different for different inclines: More energy tends to be expended for a given running speed if the incline is steeper (uphill).
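The contrast between the pooled pattern and the per-incline patterns can be sketched numerically. The example below is only an illustration: the records are made-up values (not the treadmill data from the slide), and the helper name pearson_r is hypothetical. It computes r once for all points together and once within each incline group.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r: average product of standardized values (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up (speed, energy, incline) records, for illustration only.
records = [
    (6, 10.1, "flat"), (8, 12.9, "flat"), (10, 16.2, "flat"), (12, 18.8, "flat"),
    (6, 19.0, "uphill"), (8, 23.1, "uphill"), (10, 26.8, "uphill"), (12, 31.2, "uphill"),
]

# r for all points together, ignoring the categorical variable.
r_pooled = pearson_r([s for s, _, _ in records], [e for _, e, _ in records])

# r within each incline group.
groups = {}
for speed, energy, incline in records:
    xs, ys = groups.setdefault(incline, ([], []))
    xs.append(speed)
    ys.append(energy)
r_by_incline = {incline: pearson_r(xs, ys) for incline, (xs, ys) in groups.items()}

print("pooled:", round(r_pooled, 3))
for incline, r in r_by_incline.items():
    print(incline, round(r, 3))
```

With these made-up numbers, the within-group correlations are stronger than the pooled one, which mirrors the point of the slide: the categorical variable explains much of the apparent scatter.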

14 The correlation coefficient: r
The correlation coefficient is a measure of the direction and strength of a linear relationship. It is calculated using the mean and the standard deviation of both the x and y variables.
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
Learn to use technology to obtain r. Correlation can only be used to describe QUANTITATIVE variables. Categorical variables don't have means and standard deviations.
Time to swim: $\bar{x} = 35$, $s_x = 0.7$. Pulse rate: $\bar{y} = 140$, $s_y = 9.5$.
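As one way to "use technology to obtain r", the formula can be translated directly into a few lines of Python. This is a minimal sketch, not a prescribed method: the function name pearson_r is hypothetical, and the data are made-up values chosen for illustration (they are not the swim-time and pulse-rate measurements behind the slide).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r as the average product of standardized values (divisor n - 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1), matching s_x and s_y.
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sum the products of the standardized x and y values, then divide by n - 1.
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up illustrative data (NOT the data from the slide).
swim_time = [34.1, 34.5, 35.0, 35.4, 35.9, 36.2]
pulse = [152, 148, 139, 142, 131, 128]

r = pearson_r(swim_time, pulse)
print(round(r, 3))
```

Python 3.10 and later also provide statistics.correlation in the standard library, which computes the same quantity.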

15 r doesn’t distinguish explanatory and response variables
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
r treats x and y symmetrically. "Time to swim" is the explanatory variable here and belongs on the x axis. However, in either plot r is the same (r = −0.75).
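The symmetry can be checked by swapping the two arguments. In the sketch below, the helper pearson_r and the data are assumptions made for illustration (made-up values, not the slide's data); the point is only that both call orders give the same r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r: average product of standardized values (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up illustrative data.
swim_time = [34.1, 34.5, 35.0, 35.4, 35.9, 36.2]
pulse = [152, 148, 139, 142, 131, 128]

r_xy = pearson_r(swim_time, pulse)   # swim time as x, pulse as y
r_yx = pearson_r(pulse, swim_time)   # roles swapped

# The two products in each term are simply multiplied in the other order.
print(abs(r_xy - r_yx) < 1e-12)
```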

16 r has no unit
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
Here $(x_i - \bar{x})/s_x$ is the standardized value of x (unitless) and $(y_i - \bar{y})/s_y$ is the standardized value of y (unitless). Changing the units of variables does not change the correlation coefficient, r, because we get rid of all units when we standardize. In the example, r = −0.75 whatever units are used.
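A quick numerical check of unit invariance: the sketch below converts a made-up time variable from seconds to minutes and recomputes r. The helper pearson_r and all values are assumptions for illustration only.

```python
from math import isclose, sqrt

def pearson_r(xs, ys):
    """Correlation r: average product of standardized values (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up illustrative data: times in seconds, pulse in beats per minute.
time_seconds = [2046, 2070, 2100, 2124, 2154, 2172]
pulse = [152, 148, 139, 142, 131, 128]

# Same measurements expressed in minutes.
time_minutes = [t / 60 for t in time_seconds]

r_seconds = pearson_r(time_seconds, pulse)
r_minutes = pearson_r(time_minutes, pulse)

# Standardizing removes the units, so r agrees up to floating-point rounding.
print(isclose(r_seconds, r_minutes, rel_tol=1e-9))
```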

17 r ranges from −1 to +1
Strength is indicated by the absolute value of r.
Direction is indicated by the sign of r (+ or −).
r quantifies the strength and direction of a linear relationship between two quantitative variables. r is positive for positive linear relationships, and negative for negative linear relationships. The closer r is to zero, the weaker the linear relationship is. Beware that r has this meaning for linear relationships only.
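The warning that r describes linear relationships only can be illustrated with three small, made-up data sets: a perfect increasing line, a perfect decreasing line, and a perfect but curved (quadratic) relationship. The helper pearson_r is a hypothetical name; this is a sketch, not part of the original slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r: average product of standardized values (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [-3, -2, -1, 0, 1, 2, 3]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])    # perfect positive line
r_down = pearson_r(xs, [-x for x in xs])         # perfect negative line
r_curve = pearson_r(xs, [x ** 2 for x in xs])    # perfect parabola, symmetric in x

# Expect values close to +1, -1, and 0 (up to floating-point rounding).
print(round(r_up, 6), round(r_down, 6), round(abs(r_curve), 6))
```

The parabola is a perfectly predictable relationship, yet its r is essentially zero because there is no linear trend, which is why a scatterplot should always accompany r.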

18 r is not resistant to outliers
Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers. Just moving one point away from the linear pattern here weakens the correlation from −0.91 to −0.75 (closer to zero).
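The lack of resistance can be seen by moving a single point and recomputing r. The data below are made up for illustration, so the values will not match the −0.91 and −0.75 quoted on the slide; the helper pearson_r is a hypothetical name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r: average product of standardized values (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up data following a clear negative linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [9.8, 9.1, 8.2, 7.0, 6.1, 4.9, 4.2, 3.0]
r_before = pearson_r(xs, ys)

# Move only the last point far above the line (3.0 -> 9.0).
ys_moved = ys[:-1] + [9.0]
r_after = pearson_r(xs, ys_moved)

print(round(r_before, 3), round(r_after, 3))
```

One displaced point out of eight is enough to pull r noticeably toward zero.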

