Looking at Scatterplots
Scatterplots may be the most common and most effective display for data. In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others. Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.
Looking at Scatterplots
What does each dot represent?
Explanatory and Response Variables
It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. This determination is made based on the roles played by the variables. The response variable goes on the y-axis (a response variable measures an outcome of a study). The explanatory variable goes on the x-axis (an explanatory variable explains or influences changes in a response variable). Sometimes there is no such distinction, and either variable can go on either axis.
Example: Each question below describes a relationship between two quantitative variables. Which variable should be plotted on the x-axis?
- How is the age of a teenager related to the average number of text messages they send and receive each day?
- Is the sales price of a townhouse related to the number of square feet in the townhouse?
- What is the relationship between a person's annual salary and their number of years of education?
- Is the amount of fat in a serving of cereal related to the amount of sugar in a serving?
Direction
When looking at scatterplots, we will look for direction, form, strength, and unusual features. Direction: A pattern that runs from the upper left to the lower right is said to have a negative direction. A trend running the other way has a positive direction.
Looking at Scatterplots (cont.)
The figure shows a negative direction between the years since 1970 and the prediction errors made by NOAA. As the years have passed, the predictions have improved (errors have decreased).
Looking at Scatterplots (cont.)
Form: If the relationship follows a straight line, it has a linear form. The points will appear as a cloud or swarm stretched out in a generally consistent, straight band.
Form (cont.) These relationships aren’t linear. They are neither positive nor negative. Nonlinear (curvilinear) form: the data points appear scattered about a smooth curve. We will use a curve to summarize the pattern in the data.
Strength: At one extreme, the points appear to follow a single stream (whether straight, curved, or bending all over the place). At the other extreme, the points appear as a vague cloud with no discernible trend or pattern.
Example: Describe the association shown in this scatterplot.
Looking at Scatterplots (cont.)
Unusual features: Look for the unexpected. Often the most interesting thing to see in a scatterplot is the thing you never thought to look for. One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot. Clusters or subgroups should also raise questions.
Correlation Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds): Here we see a positive association and a fairly straight form, although there seems to be a high outlier.
Correlation (cont.) How strong is the association between weight and height of Statistics students? If we had to put a number on the strength, we would not want it to depend on the units we used. A scatterplot of heights (in centimeters) and weights (in kilograms) doesn’t change the shape of the pattern:
Correlation (cont.) The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables.
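For reference, the standard formula behind r multiplies each point's z-scores and averages the products:

r = (1/(n − 1)) · Σ zₓ·zᵧ, where zₓ = (x − x̄)/sₓ and zᵧ = (y − ȳ)/sᵧ for each data point.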
Correlation Properties
Correlation has no units. Correlation does not change when the units of measurement of either one of the variables change. Correlation only measures the strength of a linear relationship between two variables. Correlation is heavily influenced by outliers. An outlier can make an otherwise small correlation look big, or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa).
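A quick sketch of the unit-invariance property, using made-up heights and weights (not the class data): converting inches to centimeters and pounds to kilograms leaves r unchanged.

```python
import numpy as np

# Made-up heights (inches) and weights (pounds)
heights_in = np.array([62, 65, 67, 68, 70, 72, 74])
weights_lb = np.array([120, 140, 150, 155, 165, 180, 190])

# Correlation in the original units
r_original = np.corrcoef(heights_in, weights_lb)[0, 1]

# A change of units is just a linear rescaling of each variable
heights_cm = heights_in * 2.54
weights_kg = weights_lb * 0.4536

r_converted = np.corrcoef(heights_cm, weights_kg)[0, 1]

print(r_original, r_converted)  # the same value: correlation has no units
```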
Correlation Properties
Correlation by itself is not enough to determine whether or not there is a relationship between the variables. Variables can have a strong association (in other words, a strong pattern) but still have a small correlation if the association isn't linear. For example, there is a strong association between the two variables below, but the correlation happens to be 0 because the relationship is not linear.
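A minimal sketch of that example: a perfect parabola is as strong an association as you can get, yet its correlation is 0 because the pattern is not linear.

```python
import numpy as np

x = np.arange(-5, 6)   # the integers -5 through 5, symmetric around 0
y = x ** 2             # a perfect, but nonlinear, association

r = np.corrcoef(x, y)[0, 1]
print(r)               # essentially 0: correlation misses the curved pattern
```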
Correlation Properties (cont.)
Correlation by itself is not enough to determine whether there is a linear relationship between the variables. Don’t assume the relationship is linear just because the correlation coefficient is high. Here the correlation is 0.979, but the relationship is actually bent.
Correlation Properties
Correlation treats x and y symmetrically: The correlation of x with y is the same as the correlation of y with x. The sign of a correlation coefficient gives the direction of the association. A positive correlation means a positive association. A negative correlation means a negative association. The magnitude of the correlation tells us the strength of a linear association.
Correlation Properties
Correlation is always between -1 and +1. Correlation can be exactly equal to -1 or +1, but this is unusual in real data because it would mean that all the data points fall exactly on a single straight line. A correlation: near zero corresponds to a weak linear association. near 1 corresponds to a strong positive linear association. near -1 corresponds to a strong negative linear association.
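A quick check of these boundary cases with made-up points: data that fall exactly on a line give r of exactly ±1, while scattered data give r near 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.corrcoef(x, 3 * x + 2)[0, 1])     # 1: all points on a rising line
print(np.corrcoef(x, -0.5 * x + 7)[0, 1])  # -1: all points on a falling line

y_scattered = np.array([3.1, 1.2, 4.0, 1.5, 3.3])  # no linear pattern
print(np.corrcoef(x, y_scattered)[0, 1])   # near 0: weak linear association
```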
Correlation ≠ Association
Don’t say “correlation” when you mean “association.” More often than not, people say correlation when they mean association. The word “correlation” should be reserved for measuring the strength and direction of the linear relationship between two quantitative variables.
Correlation ≠ Causation
Whenever we have a strong correlation, it is tempting to explain it by imagining that the predictor variable has caused the response variable to change. Scatterplots and correlation coefficients never prove causation. A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.
Example: In the early 1930s, the relationship between the human population (response variable) of Oldenburg, Germany, and the number of storks nesting in the town (explanatory variable) was investigated. The correlation coefficient turned out to be 0.97. Does this mean that storks bring babies? Can you give a possible explanation for this strong association?
Example There is a strong positive correlation between the foot length of K-12 students and reading scores. Can you give a possible explanation for this strong association?
Example Students who use tutors have lower test scores than students who don’t. Can you give a possible explanation for this strong association?
Linear Models Example:
According to a 1989 National Academy of Sciences Report, the recommended daily intake of calories for males between the ages of 7 and 15 can be calculated by the equation y = 125x, where x represents the boy's age and y represents the recommended calorie intake.
Linear Models: What is the recommended caloric intake for a 14-year-old boy?
Linear Models: What is the age of a boy whose recommended caloric intake is 2375 calories?
Linear Models: Interpret the slope of y = 125x in terms of rate of change.
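A worked pass through the three questions, using only the given model y = 125x:
- Age 14: y = 125(14) = 1750 calories per day.
- Intake of 2375 calories: 125x = 2375, so x = 2375/125 = 19 years old. (Note that 19 falls outside the stated 7-to-15 age range, so this is an extrapolation.)
- Slope: the rate of change is 125 calories per year of age, meaning each additional year of age raises the recommended daily intake by 125 calories.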
Topic 3.2 Fitting a Line
Learning Objectives: For a linear relationship, use the least squares regression line to summarize the overall pattern and to make predictions.
An Example: Fat Versus Protein
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:
The Linear Model
A linear model is an equation of a straight line through the data. The model won't be perfect, regardless of the line we draw. Some points will be above the line and some will be below. The estimate made from a model is the predicted value (denoted ŷ).
Residuals
The difference between the observed value and its associated predicted value is called the residual. To find a residual, we always subtract the predicted value from the observed one: residual = observed − predicted = y − ŷ. From the scatterplot, what is the predicted fat content for 37.5 g of protein? What is the actual fat content for 37.5 g of protein?
Residuals (cont.)
A negative residual means the predicted value is too big (an overestimate). A positive residual means the predicted value is too small (an underestimate).
“Best Fit” Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out. So, we can’t assess how well the line fits by adding up all the residuals. Similar to what we did with deviations, we square the residuals and add the squares. The smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest.
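A minimal sketch of the idea with made-up points: compute the residuals for a candidate line, square them, and add them up; the least squares line is the one that makes this sum as small as possible (np.polyfit finds it directly).

```python
import numpy as np

# Made-up data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse(a, b):
    """Sum of squared residuals for the candidate line y-hat = a + b*x."""
    residuals = y - (a + b * x)   # observed minus predicted
    return np.sum(residuals ** 2)

print(sse(0.0, 2.0))   # SSE for one candidate line
print(sse(0.2, 1.9))   # SSE for a nearby line: compare the fits

# The least squares line minimizes the SSE over all possible lines
b_best, a_best = np.polyfit(x, y, 1)
print(sse(a_best, b_best))   # smaller than any other line's SSE
```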
Sum of Squared Errors (SSE)
When we compare the sums of the areas of the yellow squares, the line on the left has a larger SSE than the line on the right, whose SSE is 43.9. So the line on the right fits the points better, but is it the best fit?
The Linear Model (cont.)
Remember from Algebra that a straight line can be written as y = mx + b. In Statistics we use a slightly different notation: ŷ = a + bx. We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values. The coefficient b is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient a is the y-intercept.
The Least Squares Line
In our model, we have a slope, b. The slope is built from the correlation and the standard deviations: b = r(s_y / s_x). The slope is always in units of y per unit of x. We also have an intercept, a. The intercept is built from the means and the slope: a = ȳ − b·x̄.
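A sketch with made-up data, checking these two formulas against a direct least squares fit:

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0])
y = np.array([3.0, 6.0, 7.5, 9.0, 11.0])

r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)      # sample standard deviations
s_y = np.std(y, ddof=1)

b = r * s_y / s_x            # slope: correlation times the ratio of SDs
a = y.mean() - b * x.mean()  # intercept: the line passes through (x-bar, y-bar)

slope, intercept = np.polyfit(x, y, 1)
print(b, slope)              # the same slope both ways
print(a, intercept)          # the same intercept both ways
```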
Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well. Its equation appears on the scatterplot. What is the predicted fat content for a BK Broiler chicken sandwich that has 30 g of protein?
Fat Versus Protein: An Example
Let's interpret the slope and y-intercept. Slope: for each additional gram of protein, the model predicts the fat content to change by b grams (the slope), in grams of fat per gram of protein. y-intercept: an item with 0 g of protein is predicted to have a grams of fat (the intercept), a value that may not be meaningful if 0 is outside the range of the data.
Extrapolation: Reaching Beyond the Data
Linear models give a predicted value for each case in the data. We cannot assume that a linear relationship in the data exists beyond the range of the data. Once we venture into new x territory, such a prediction is called an extrapolation. You’re better off not making extrapolations. If you must extrapolate into the future, at least don’t believe that the prediction will come true.
Straight Enough for Regression
The linear model assumes that the relationship between the variables is linear. A scatterplot will let you check that the assumption is reasonable. If the scatterplot is not straight enough, stop here: the variables must have a linear association or the model won't mean a thing.
Outliers in Regression
Watch out for outliers. Outlying points can dramatically change a regression model. Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables. The red regression line includes outliers. The blue one does not.
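A sketch of how much one point can matter, with made-up data: ten points with a clear positive trend, plus a single extreme point that drags the fitted slope negative.

```python
import numpy as np

# Ten points with a clear positive linear trend (slope near +1)
x = np.arange(1.0, 11.0)
noise = np.array([0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -0.3, 0.2, -0.1, 0.1])
y = 2.0 + x + noise

slope_clean, _ = np.polyfit(x, y, 1)

# Add one extreme outlier: large x, very small y
x_out = np.append(x, 30.0)
y_out = np.append(y, -40.0)

slope_outlier, _ = np.polyfit(x_out, y_out, 1)

print(slope_clean)     # close to +1
print(slope_outlier)   # negative: one point reversed the sign of the slope
```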
Let's talk residuals again
- Go to my website and download the cereal data.
- Create a scatterplot of sugar and fat.
- Add the LSRL and determine its equation.
- Go to Edit and click Save Residuals.
- Now graph each x-value against its residual value, as shown in the sketch below. If the residual plot shows no pattern, the line is a good line of best fit.
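A sketch of that workflow in Python; the file name and the "sugar" and "fat" column names are assumptions, so adjust them to match the downloaded data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names: change to match the course data
cereal = pd.read_csv("cereal.csv")
x = cereal["sugar"].to_numpy()
y = cereal["fat"].to_numpy()

# Fit the least squares regression line (LSRL)
b, a = np.polyfit(x, y, 1)
print(f"predicted fat = {a:.3f} + {b:.3f} * sugar")

# Save the residuals: observed minus predicted
residuals = y - (a + b * x)

# Residual plot: each x-value against its residual
plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("sugar")
plt.ylabel("residual")
plt.show()   # no pattern here suggests the line is a good fit
```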
How good is the line of best fit?
Two numeric measures: r² and sₑ.
r²: The value of r² is the proportion of the variation in the response variable that is explained by the least-squares regression line. For example, if r = 0.73, then r² ≈ 0.53, and we can say that our regression model explains 53% of the total variation in the response variable. That means the other 47% of the total variation is unexplained.
Try one: What if r = −0.3? (Then r² = 0.09, so the model would explain only 9% of the variation in the response variable.)
sₑ: sₑ = √(SSE / (n − 2)), where SSE is the sum of the squared errors (residuals) and n is the number of data points you have. You will never have to calculate this by hand. sₑ is very similar to a standard deviation: it measures the typical size of a residual.
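A sketch of how software computes sₑ, with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.8, 6.1, 8.3, 9.7, 12.2])

b, a = np.polyfit(x, y, 1)     # least squares slope and intercept
residuals = y - (a + b * x)

sse = np.sum(residuals ** 2)   # SSE: sum of the squared errors
n = len(x)

s_e = np.sqrt(sse / (n - 2))   # roughly the typical size of a residual
print(s_e)
```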