Published by Brooke Weaver. Modified over 5 years ago.
1
Scatterplots, continued: Correlation and the regression line
2
Learning Objectives By the end of this lecture, you should be able to:
Describe the concept of correlation, including the definition of the correlation coefficient, ‘r’, and its properties.
Identify the name of the formula used to generate the regression line on a scatterplot.
Identify the key uses of the regression line.
3
‘r’ is called the Correlation Coefficient.
‘r’ tells you both the strength and direction of a linear relationship between two quantitative variables. ‘r’ is always a number between -1 and +1.
Recall:
Strength: how closely the points follow a straight line.
Direction: positive when individuals with higher X values tend to have higher values of Y; negative when higher X values tend to go with lower values of Y.
4
The correlation coefficient "r"
r is a number between -1 and +1.
The + and - signify the direction: ‘-’ = negative, ‘+’ = positive.
Values of r close to 0 (on either side) indicate weak relationships.
Values of r close to +1 or -1 indicate strong relationships.
The correlation coefficient is a very useful measure of both the direction and strength of a linear relationship.
The correlation coefficient doesn’t apply to non-linear relationships.
Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations!
5
Correlation only describes linear relationships
No matter how strong the association, r does not describe curved relationships. REMEMBER TO DRAW! Here is another good example of why: because r only makes sense in the context of linear relationships, you should plot your data first to make sure you have a linear relationship before going to the trouble of calculating r.
HOWEVER: You can sometimes transform a non-linear association to a linear form, for instance by taking the logarithm. You can then calculate a correlation using the transformed data. We may discuss transformation a little later on.
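To see both points in numbers, here is a minimal sketch. The data are made up for illustration, and `pearson_r` is a hypothetical helper written for this note (it uses the usual textbook definition of r); neither comes from the lecture.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation r: sum of products of standardized x and y, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up curved data: y is completely determined by x, but not by a line.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Made-up exponential growth: curved on the raw scale, straight after a log.
x_growth = [1, 2, 3, 4, 5, 6]
y_growth = [2 ** x for x in x_growth]
r_raw = pearson_r(x_growth, y_growth)
r_log = pearson_r(x_growth, [math.log(y) for y in y_growth])

print("parabola r:", round(r_curve, 6))
print("exponential r, raw:", round(r_raw, 3), " after log:", round(r_log, 3))
```

The parabola is a perfect association, yet its r is essentially 0, which is why the plot has to come first. For the growth data, r computed on log(y) is essentially 1 because the transformed points fall on a straight line.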
6
Calculation of the correlation coefficient "r"
STOP! Did you remember to plot your data? If the relationship does not look linear, then you should not be doing this type of analysis!
[Formula for r: shown as an image on the slide.]
[Table: swim time (s) and pulse rate (bpm) for each swim; the individual rows did not survive extraction.]
Time to swim: x̄ = 35, s_x = 0.7
Pulse rate: ȳ = 140, s_y = 9.5
(You do not need to memorize the formula for r!)
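The formula image did not survive in this transcript. The standard textbook definition of r, which is presumably what the slide showed (an assumption, since the original cannot be checked), uses exactly the means and standard deviations listed for the swim data:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Here n is the number of individuals, x̄ and ȳ are the means, and s_x and s_y are the standard deviations of the two variables. Each factor in parentheses is a standardized value, so r has no units and does not change when the units of x or y change.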
7
When variability (e.g. standard deviation) in one or both variables decreases, so that the points scatter less widely about the line, the correlation coefficient gets stronger (closer to +1 or -1).
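One way to read this slide is that less scatter about the same underlying line gives a stronger r. The sketch below uses made-up data: the same line y = 2x with a small and a large alternating "wiggle" added. `pearson_r` is a hypothetical helper using the usual textbook definition of r, not something from the lecture.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation r: sum of products of standardized x and y, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

xs = list(range(1, 11))
# Alternating +1 / -1 pattern stands in for scatter about the line (made up).
wiggle = [1 if i % 2 == 0 else -1 for i in range(len(xs))]

y_tight = [2 * x + 0.5 * w for x, w in zip(xs, wiggle)]  # small scatter
y_loose = [2 * x + 5.0 * w for x, w in zip(xs, wiggle)]  # large scatter

r_tight = pearson_r(xs, y_tight)
r_loose = pearson_r(xs, y_loose)
print("tight scatter r:", round(r_tight, 3))
print("loose scatter r:", round(r_loose, 3))
```

Both relationships are positive, but the tightly clustered points give an r much closer to +1 than the widely scattered ones.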
8
Example:
Identify the explanatory and response variables. Explanatory: number of powerboats (in 1000’s). Response: manatee deaths.
Describe the form, direction, and strength of the relationship. Linear, positive, moderate.
Estimate r. Perhaps about 0.6?
9
Recall from a previous discussion:
WHICH line is the “correct” regression line? Answer: We use a mathematical model called the “method of least squares” to select what we believe to be the “best” regression line.
10
Determining the best regression line
Recall that reasonable people can disagree on the “best line”. So how do we settle on one? There are several different mathematical techniques for choosing the “best” regression line. The most widely used technique (formula) by far is called the ‘least-squares method’. Your book explains a little bit about why this is so, but we won’t get into it in class. We won’t delve into how the least-squares method generates the line. For now, we’ll allow the software to do it for us.
What I do want you to appreciate: several techniques have been developed for choosing the line, and the method of least squares is the current favorite. The key point here is that in mathematics, different techniques/models/formulas are proposed for solving different kinds of problems. Sometimes, such as in this case, a particular model becomes widely accepted, and even the de facto standard.
11
Finding the best regression line using the least-squares method (You don’t need to memorize this)
The “least-squares” method comes up with the line for which the sum of the squared vertical (y) distances between the data points and the line is as small as possible.
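For the curious, here is a minimal sketch of that idea. The five data points are made up, and the function names are hypothetical; the slope and intercept come from the standard closed-form least-squares formulas, which in class the software applies for us. The sketch then checks that nudging the line in any direction makes the sum of squared vertical distances larger.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

# Try eight nearby lines: shift the intercept and/or tilt the slope a little.
nudged = [sum_squared_errors(xs, ys, a + da, b + db)
          for da in (-0.5, 0, 0.5)
          for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]

print("intercept:", round(a, 3), "slope:", round(b, 3))
print("least-squares line has the smallest sum:", all(best < s for s in nudged))
```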
12
The regression line
Okay, so we’ve used the least-squares method to settle on a single regression line. Now: what does that line tell us? A regression line is a straight line that describes how much a change in the explanatory variable will affect the response variable. E.g.: if you increase x by 5.3, how much will that change y? The regression line is of tremendous use to us in that it can be used as a model for making predictions. In addition, recall that identifying the regression line also helps us do things like identify outliers and other interesting data points such as influential points (later).
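In symbols (the names a for the intercept and b for the slope are notation assumed here, not taken from the slides), a straight-line model answers the "increase x by 5.3" question directly:

```latex
\hat{y} = a + b\,x
\qquad\Longrightarrow\qquad
\Delta\hat{y} = b\,\Delta x ,
\quad\text{so for } \Delta x = 5.3:\ \ \Delta\hat{y} = 5.3\,b
```

That is, the predicted change in y is just the slope times the change in x, no matter where on the line you start.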
13
Example: (powerboat registrations are plotted in 1000’s)
If the state of Florida decides to make 700,000 powerboat licenses available, how many manatee deaths do we anticipate? Answer: Using our regression line, we estimate about 48.
What if we decided to drop that number down to 500,000? Answer: Using our regression line, we estimate about 21.
Summary: We take data of a suspected relationship and draw a scatterplot. Assuming there does seem to be a linear relationship, we generate a regression line. From our regression line we generate a model for making predictions.
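The slide does not give the equation of the line, but the two quoted estimates are enough to sketch how prediction works. The code below recovers the straight line passing through those two rounded readings, (500, 21) and (700, 48), with x in thousands of boats. This is only the line implied by the rounded numbers on the slide, so treat the slope and intercept as approximations rather than the fitted values; the function name `predict_deaths` is hypothetical.

```python
# Two estimates read off the regression line in the example
# (x = powerboats in 1000's, y = predicted manatee deaths).
x1, y1 = 500, 21
x2, y2 = 700, 48

# Line through the two readings: an approximation implied by the slide's
# rounded estimates, not the actual least-squares output.
slope = (y2 - y1) / (x2 - x1)   # predicted deaths per additional 1000 boats
intercept = y1 - slope * x1

def predict_deaths(boats_in_thousands):
    """Predicted manatee deaths from the implied straight-line model."""
    return intercept + slope * boats_in_thousands

print("slope:", round(slope, 3))
print("prediction for 600,000 boats:", round(predict_deaths(600), 1))
```

By construction the function returns 48 at 700 and 21 at 500; any value in between, such as 600, falls on the same line.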
14
Questions:
How confident are you in this model? That is, assuming the data themselves are legitimate, do you think the predictive value is good?
There are a couple of points I would bring up here:
We have only 14 observations that were used to generate this model. Typically, the more observations, the better.
The strength of the relationship (i.e. ‘r’) looks decent but not terrific. When r is strong (close to +1 or -1), we have considerably more confidence in our predictions than when it is weak (closer to 0).
So, while I think this is a decent model for coming up with a preliminary analysis, I would not use it to make high-stakes decisions.