Sections 3.1 - 3.3 Review
Bivariate data: the relationship between two variables
What three shapes are possible for a bivariate data relationship? Linear, curved, or no shape.
Shape: Linear
Shape: Curved
Shape: None
The line on the plot is the least squares regression line (LSRL), also called the regression line.
Two main reasons to fit a line to a set of data:
1) to find a summary or model that describes the relationship between two variables
2) to use the line to predict the value of y when you know the value of x
To make a reasonable prediction, what needs to be true about:
A) shape of the data? Linear.
B) strength of the relationship? The stronger, the better.
Usually, the independent variable, x, is on the horizontal axis and the dependent variable, y, is on the vertical axis.
Statistics, not Algebra! The variable on the x-axis is called the predictor or explanatory variable. The variable on the y-axis is called the predicted or response variable.
Which is correct? Year vs Minimum Wage or Minimum Wage vs Year?
Two types of predictions:
1) interpolation: making a prediction when the value of x falls within the range of the data
2) extrapolation: making a prediction when the value of x falls outside the range of the actual data
Interpolation is fairly safe. Extrapolation is risky, and more so the further the x-value lies outside the range of the actual data.
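The distinction between the two types of predictions can be sketched in a few lines of Python. This is only an illustration: the function name `prediction_type` and the x-values are made up and do not come from the course data.

```python
def prediction_type(x_new, x_data):
    """Classify a prediction at x_new as interpolation or extrapolation."""
    lo, hi = min(x_data), max(x_data)
    # Inside (or on the edge of) the observed x-range: interpolation.
    return "interpolation" if lo <= x_new <= hi else "extrapolation"

# Hypothetical explanatory values, made up for illustration only.
x_values = [10, 20, 30, 40, 50]

print(prediction_type(35, x_values))  # 35 lies inside 10..50
print(prediction_type(80, x_values))  # 80 lies outside 10..50
```

A check like this says nothing about how risky a particular extrapolation is; it only flags that the x-value is outside the range of the data.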
Prediction error: the difference between the actual value of y and the value of y predicted from a regression line. It is usually unknown, except for the points used to construct the regression line, whose prediction errors are called residuals.
Residual = observed value of y - predicted value of y
\( \text{Residual} = y - \hat{y} \)
Residual is the signed vertical distance from an observed data point to the regression line:
Positive if the point is above the line
Negative if the point is below the line
0 if the point is on the line
The least squares regression line, also called the least squares line or regression line, is the line for which the sum of the squared errors, or SSE, is as small as possible.
\( \text{SSE} = \sum (\text{residual})^2 \)
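As a rough illustration of residuals and SSE, the Python sketch below fits a least squares line to a small made-up data set (not the passenger jets data) and checks that nudging the line makes SSE larger. The closed-form expressions for the slope and intercept are the standard ones and are an addition here, since these notes use the calculator instead. All function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

def residuals(xs, ys, a, b):
    """Observed y minus predicted y for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum(e ** 2 for e in residuals(xs, ys, a, b))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3))      # intercept about 2.2, slope about 0.6
print(round(sse(xs, ys, a, b), 3))   # about 2.4

# Any other line has a larger SSE, for example one with a shifted intercept.
print(sse(xs, ys, a + 0.5, b) > sse(xs, ys, a, b))  # True
```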
Find the least squares line for this passenger jets data.
Put explanatory values in L1 and response values in L2.
STAT, CALC, 8: LinReg(a+bx)
LinReg(a+bx) L1, L2, Y1 (Y1 is needed if you want to show the LSRL on the graph)
To get Y1, go to VARS, Y-VARS, 1: Function, ENTER, 1: Y1, ENTER
LinReg
y = a + bx
a = 366.6666667
b = 16
r² = .9795918367 (turn Diagnostic On to see r² and r)
r = .9897433186
So, what is the equation for the LSRL? y = 367 + 16x. Is this it? No! Need the equation in context!
Cost = 367 + 16(seats)
Cost = 367 + 16(seats). Interpret the slope and y-intercept.
Slope: For each additional seat, the cost increases by about $16 per hour.
y-intercept: If a passenger jet had 0 seats, it would cost $367 per hour to operate.
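The slope and y-intercept interpretations can be checked by plugging numbers into the fitted equation. The sketch below uses the rounded equation from this example; the function name `predicted_cost` and the choice of 200 seats are hypothetical, and whether 200 seats falls inside the range of the original data is not known here.

```python
def predicted_cost(seats):
    """Predicted operating cost in dollars per hour: Cost = 367 + 16(seats)."""
    return 367 + 16 * seats

print(predicted_cost(200))                        # 367 + 16*200 = 3567
print(predicted_cost(201) - predicted_cost(200))  # one more seat adds 16
print(predicted_cost(0))                          # the y-intercept, 367
```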
Correlation: What do you recall about correlation?
Measures strength and direction of a linear relationship between two variables
Numerical value between -1 and 1, inclusive
How tightly packed the points of the scatterplot are about the LSRL
Correlation and slope always have the same sign
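These properties can be seen numerically. The Python sketch below computes r from its standard definition (not shown in these notes, so treat the formula as an addition) on made-up data, then flips the sign of y to show that r and the slope change sign together. The function names are hypothetical.

```python
import math

def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Slope b of the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
ys_flipped = [-y for y in ys]  # same scatter, opposite direction

r_up, b_up = correlation(xs, ys), slope(xs, ys)
r_down, b_down = correlation(xs, ys_flipped), slope(xs, ys_flipped)

print(round(r_up, 3), round(b_up, 3))      # both positive
print(round(r_down, 3), round(b_down, 3))  # both negative
```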
Sketch ellipse around points in scatterplot. If ellipse has points scattered throughout and points appear to follow a linear trend, then correlation is a reasonable measure of strength of the relationship.
No shape
Does a higher correlation mean the relationship is more like a line, less like a line, or neither? Neither, if correlation is misused: a high r does not by itself show that the relationship is linear.
r = 0.91 for this data but a linear model is not appropriate as growth is exponential.
Here r = 0.48. In spite of the scatter, a linear model is appropriate because there is no curvature in the pattern of data points.
Moral of this story: Always plot your data before deciding a linear model is appropriate for your data. Correlation is only meaningful if a linear model is appropriate for your data.
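A quick numerical check of this moral: the Python sketch below builds made-up data that grow exponentially (y doubles at each step; this is not the data from the slide) and computes r from its standard definition. The correlation comes out fairly high even though the relationship is clearly not linear.

```python
import math

def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Made-up exponential growth: y = 2^x for x = 0, 1, ..., 9.
xs = list(range(10))
ys = [2 ** x for x in xs]

r = correlation(xs, ys)
print(round(r, 2))  # fairly high, about 0.8

# For a linear pattern the step-to-step increases in y would be roughly
# constant; here each increase is double the previous one.
increases = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
print(increases)
```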
When the correlation is small in absolute value, what does it mean for the prediction error? The error in prediction will be larger than if the correlation were larger. A larger correlation (near 1 or -1) means the points are generally closer to the LSRL, and predictions using the line will be relatively close to the observed values.
True or false: A high correlation means that a change in the explanatory variable causes a change in the response variable. False. Correlation does not imply causation, as there may be a lurking variable involved.
r² is the coefficient of determination. It tells us the proportion of total variation in the y-variable that is “explained” by the variation in the x-variable.
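One way to see the "proportion of variation explained" idea is to compare the variation left over after fitting the line (SSE) with the total variation in y about its mean (written SST below; that name is an addition, not used in these notes). The Python sketch uses made-up data and hypothetical variable names, and computes the coefficient of determination as 1 - SSE/SST, which for a least squares line equals the square of the correlation.

```python
import math

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares line y = a + b*x.
b = sxy / sxx
a = y_bar - b * x_bar

# Variation left unexplained by the line, and total variation in y.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - sse / sst
r = sxy / math.sqrt(sxx * sst)

print(round(r_squared, 3))  # about 0.6
print(round(r ** 2, 3))     # same value, computed from r
```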
Enter the information about Fat and Calories for 7 kinds of pizza in the calculator. Find the LSRL equation, r, and r².
Calories = 112 + 14.9(fat), r = 0.908, r² = 0.824. Interpret slope, intercept, and r².
Slope: For each 1 gram increase in fat, the calories increase by about 14.9.
Intercept: If there were 0 grams of fat in a pizza, there would be 112 calories.
r² = 0.824: About 82% of the variation in calories among these brands of pizza can be attributed to fat content.
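The reported values can be sanity-checked with a little arithmetic. The Python sketch below uses only the rounded numbers from this example; the function name `predicted_calories` is hypothetical.

```python
r = 0.908
r_squared_reported = 0.824

# Squaring the reported r should reproduce the reported coefficient
# of determination, up to rounding.
print(round(r ** 2, 3))                # 0.824
print(round(r_squared_reported * 100)) # about 82 percent

def predicted_calories(fat_grams):
    """Predicted calories from Calories = 112 + 14.9(fat)."""
    return 112 + 14.9 * fat_grams

print(predicted_calories(0))  # the intercept, 112.0
```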
Both plots have a correlation of 0.26. For each plot, is fitting a regression line appropriate? Why or why not?
Left plot has strong curvature, so the LSRL is not appropriate. Right plot is linear, as the cloud of points is roughly elliptical.
Residual plots may help you uncover more detailed patterns. The ideal residual plot shows nearly random scatter with no obvious trends. This indicates that a line is a reasonable model for the trend in the original data.
This model looks nearly linear, but is a line a suitable model?
Residual plot dramatically reveals the trend is not as linear as first thought.
Curvature in residual plot mimics curvature in original scatterplot, which is harder to see. So line is not a good model for these data.
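The same effect can be reproduced without a graph. The Python sketch below fits a least squares line to made-up curved data (y = x², not the data from the slides) and prints the sign of each residual in order of x. A run of positives, then negatives, then positives is the curved pattern a residual plot would show, even though the correlation for these data is high. The slope and intercept formulas are the standard ones and all names are hypothetical.

```python
# Made-up curved data: y = x^2 for x = 1, ..., 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares line y = a + b*x.
b = sxy / sxx
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
resid = [y - (a + b * x) for x, y in zip(xs, ys)]

def sign(e):
    """'+' above the line, '-' below the line, '0' on the line."""
    return "+" if e > 0 else "-" if e < 0 else "0"

signs = [sign(e) for e in resid]
print(resid)
print("".join(signs))  # ++----++ : a clear curved pattern, not random scatter
```

With a line that fits well, the signs would be mixed with no obvious order.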
Create residual plot for this data.
Compute LSRL
To get RESID, select 2nd, LIST, 7: RESID
No obvious trends, so line is reasonable model for this data.
Questions?