Published by Betty Marsh. Modified over 5 years ago.
1
Algebra Review The equation of a straight line y = mx + b
m is the slope – the change in y over the change in x – or rise over run. b is the y-intercept – the value where the line cuts the y axis. Before we look at the application of straight lines to statistics, let’s briefly review a little algebra.
3
Review y = 3x + 2
x = 0: y = 2 (y-intercept)
x = 3: y = 11
Change in y (+9) divided by the change in x (+3) gives the slope, 3.
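The rise-over-run calculation above can be checked with a few lines of Python. This is a minimal sketch; the helper names `line` and `slope_between` are made up for illustration and are not part of the original slides.

```python
def line(x, m=3, b=2):
    """The review line y = mx + b with slope m = 3 and y-intercept b = 2."""
    return m * x + b


def slope_between(p1, p2):
    """Slope of the line through two points: change in y over change in x."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)


p1 = (0, line(0))  # the y-intercept, (0, 2)
p2 = (3, line(3))  # the second point from the slide, (3, 11)
slope = slope_between(p1, p2)
print(slope)  # 3.0
```

Any two distinct points on the line give the same slope, which is what makes a straight line straight.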
4
Linear Regression Example: Tar (mg) and CO (mg) in cigarettes.
y, Response: CO (mg). x, Explanatory: Tar (mg). Cases: 25 brands of cigarettes. Linear regression uses data on two variables, the response variable y and the explanatory variable x, to estimate the slope and intercept of a line that comes close to the data points on the scatter plot.
5
Correlation Coefficient
Tar and CO: r = 0.9575. The calculation of the regression line depends on the value of the correlation coefficient. Because the correlation coefficient measures the strength of the linear association between two variables, it is not surprising that this coefficient plays an important role in determining the line that best fits the data.
6
Linear Regression There is a strong positive linear association between tar and CO. What is the equation of the line that models the relationship between tar and CO?
7
Linear Model The linear model is the equation of a straight line through the data. A point on the straight line through the data gives a predicted value of y, denoted $\hat{y}$. In Chapter 6 we talked about a Normal model to describe the distribution of a population. We are now going to consider a model for the relationship between two variables. The relationship will be modeled with a straight line.
8
Residual The difference between the observed value of y and the predicted value of y, $\hat{y}$, is called the residual. Residual = $y - \hat{y}$. The residual, the difference between an observed and a predicted value, is the basis for estimating this linear model.
9
Residual A residual indicates how far away a point is from a model line.
10
Line of “Best Fit” There are lots of straight lines that go through the data. The line of “best fit” is the line for which the sum of squared residuals is the smallest – the least squares line. But how do we draw the line? There are any number of straight lines that would “fit” the data. Which one do we choose?
11
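Line of “Best Fit” Least squares
slope $= r \, \dfrac{s_y}{s_x}$
intercept $= \bar{y} - \text{slope} \cdot \bar{x}$
Here $r$ is the correlation coefficient, $s_x$ and $s_y$ are the standard deviations of x and y, and $\bar{x}$ and $\bar{y}$ are their means.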
Where do these equations come from? Finding the smallest sum of squared residuals is a minimization problem. You can use calculus to solve that minimization problem. When you use calculus, you end up with two equations and two unknowns (the slope and the intercept). Solving for the slope and intercept gives you the equations above.
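The calculus step can be sketched as follows. This outline uses the symbols m (slope) and b (intercept) from the algebra review. The index i over the n data points and the function name S are notation introduced here for illustration.

```latex
% Sum of squared residuals as a function of the slope m and intercept b
S(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2

% Set both partial derivatives to zero: two equations, two unknowns
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \bigl( y_i - m x_i - b \bigr) = 0
\qquad
\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \bigl( y_i - m x_i - b \bigr) = 0

% The first equation gives the intercept in terms of the slope
b = \bar{y} - m \bar{x}

% Substituting into the second equation gives the slope
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = r \, \frac{s_y}{s_x}
```

The last equality follows from the definition of the correlation coefficient, which is why r appears in the slope formula.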
12
Summary of the Data Tar, x CO, y
13
Least Squares Estimates
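As a rough illustration of how least squares estimates are computed from summary statistics, the sketch below applies slope = r × s_y / s_x and intercept = ȳ − slope × x̄ to a small made-up data set. The numbers are hypothetical and are not the cigarette data. The function name `least_squares` is also an assumption for this example.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)        # correlation coefficient
    s_x = sqrt(sxx / (n - 1))        # sample standard deviation of x
    s_y = sqrt(syy / (n - 1))        # sample standard deviation of y
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept, r


# Hypothetical illustration data (NOT the tar/CO measurements)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept, r = least_squares(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

With real data, the same function would take the 25 tar values as `xs` and the 25 CO values as `ys`.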
14
Interpretations Slope: for every 1 mg increase in tar, the CO content increases, on average, by the slope of the least squares line (in mg). Intercept: there is no reasonable interpretation of the intercept in this context because one wouldn’t see a cigarette with 0 mg of tar.
15
Predicted CO = intercept + slope × Tar
16
Prediction Least squares line
17
Residual Tar, x = 16.0 mg; CO, y = 16.6 mg; Predicted, $\hat{y}$ = 15.56 mg
Calculation of a residual for a particular x, y pair.
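The residual for this brand follows directly from the observed and predicted values on the slide. A minimal sketch, with a hypothetical helper named `residual`:

```python
def residual(observed, predicted):
    """Residual = observed y minus predicted y."""
    return observed - predicted


# Values from the slide: Tar = 16.0 mg, observed CO = 16.6 mg, predicted CO = 15.56 mg
observed_co = 16.6
predicted_co = 15.56

res = residual(observed_co, predicted_co)
print(round(res, 2))  # 1.04
```

The residual is positive, so this brand sits about 1.04 mg above the least squares line: it has more CO than the model predicts for its tar content.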
18
Residuals Residuals help us see if the linear model makes sense.
Plot residuals versus the explanatory variable to see whether a linear model is appropriate. If the plot is a random scatter of points, then the linear model is the best we can do. A trend or curve in the plot suggests that a straight line is missing part of the relationship. We will see more of this in Chapter 10.
19
Because least squares residuals always sum to zero, some fall above zero and some below, usually in roughly equal numbers. What is important is the lack of any trend or curvature in the residuals.
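To see why the balance around zero is not the thing to look for, the sketch below fits a least squares line to deliberately curved, made-up data (y = x², not the cigarette data). The residuals still sum to zero, yet they trace a clear U shape. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical curved data: y = x^2, so a straight line is the wrong model
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("sum of residuals:", round(sum(residuals), 10))
for x, res in zip(xs, residuals):
    print(x, round(res, 2))
```

The residuals are positive at both ends and negative in the middle. That curvature, not the count of points above and below zero, is the signal that the linear model is inadequate.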
20
Interpretation of the Plot
The residuals appear to have a pattern. For values of Tar between 0 and 20, the residuals tend to increase. The brand with Tar = 30 appears to have a large residual.
21
$r^2$ or $R^2$ The square of the correlation coefficient gives the proportion of the variation in y that is accounted for, or explained, by the linear relationship with x. The square of the correlation coefficient is on a scale from 0 to 1, where 0 is no linear association and 1 is a perfect linear association.
22
Tar and CO: r = 0.9575, $r^2 = (0.9575)^2 = 0.917$ or 91.7%
91.7% of the variation in CO content can be explained by the linear relationship with Tar content.
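The sketch below first squares the correlation from the slide, then checks on a small made-up data set (not the cigarette data) that r² matches 1 − SSE/SST, the fraction of the variation in y that the least squares line accounts for. The function name `r_squared_two_ways` is an assumption for this example.

```python
def r_squared_two_ways(xs, ys):
    """Return (r squared, 1 - SSE/SST) for the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)   # SST: total variation in y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # SSE: variation left over in the residuals
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return r ** 2, 1 - sse / syy


# Value from the slide
print(round(0.9575 ** 2, 3))  # 0.917

# Hypothetical illustration data (NOT the tar/CO measurements)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
from_r, from_sums = r_squared_two_ways(xs, ys)
print(abs(from_r - from_sums) < 1e-12)
```

The two quantities agree for any least squares line with an intercept, which is what justifies reading r² as "variation explained."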
23
Regression Conditions
Quantitative variables – both variables should be quantitative. Linear model – does the scatter plot show a reasonably straight line? Outliers – watch out for outliers as they can be very influential.
24
Regression Cautions Beware of extraordinary points.
Don’t extrapolate beyond the data. Don’t infer x causes y just because there is a good linear model relating the two variables. Don’t choose a model based on $R^2$ alone.