Independent Variable, Dependent Variable, Scatterplot, Least Squares Regression Line (Trend Line), Positive Direction, Negative Direction, Causation, Correlation, No Correlation (Zero Correlation), Weak Correlation, Moderate Correlation, Strong Correlation, Monotonic Increasing, Monotonic Decreasing, Form: Linear, Form: Exponential
2 Variable Data Part 1
2 Variable Data: Independent Variables and Dependent Variables Two Variables Most statistical studies look at multiple variables. Often the studies try to show a relationship between one variable and another. When one variable affects another, one variable is referred to as the explanatory variable and the other as the response variable. Explanatory Variable: “An explanatory variable attempts to explain the observed outcomes.” Response Variable: A variable that measures an outcome of a study.
2 Variable Data: Independent Variables and Dependent Variables Example 1 A study looks at smoking and lung cancer. Which (if any) is the explanatory variable? Which (if any) is the response variable? Is smoking a quantitative or categorical variable? Is lung cancer a quantitative or categorical variable?
2 Variable Data: Independent Variables and Dependent Variables Example 2 A study looks at cavities and milk drinking. Which (if any) is the explanatory variable? Which (if any) is the response variable? Is cavities a quantitative or categorical variable? Is milk drinking a quantitative or categorical variable?
2 Variable Data: Independent Variables and Dependent Variables Example 3 A study looks at rainfall and SAT scores. Which (if any) is the explanatory variable? Which (if any) is the response variable? Is rainfall a quantitative or categorical variable? Are SAT scores a quantitative or categorical variable?
2 Variable Data: Independent Variables and Dependent Variables Scatterplots A good way to see whether there is a relationship between two quantitative variables is to use a scatterplot. On a scatterplot, we usually put what we think might be the explanatory variable on the x-axis and the response variable on the y-axis.
2 Variable Data: Scatterplots and Form Interpreting Scatterplots Form: Is there a pattern? linear, parabola, bell shaped Deviations from pattern: Are there areas where the data conform less to the pattern? Linear Parabolic
2 Variable Data: Scatterplots – Direction and Strength Interpreting Scatterplots Direction: If linear, is the data positively associated or negatively associated? Strength: Does the data conform tightly or loosely to the pattern? (Example plots: Moderate Strength, High Strength)
2 Variable Data: Constructing Scatterplots Scale both axes (make them uniform). Use a symbol to break from zero if needed. Label your axes. Make your plot large enough so that details can be seen. Use the entire grid if you’re using a grid (not a corner of it)
2 Variable Data: Constructing and Interpreting Scatterplots Form: Direction: Strength:
2 Variable Data Part 2
2 Variable Data: Constructing Least Squares Regression Lines Linear Regression Relationship between Water and Tomatoes: It would be great to be able to look at two-variable data and reduce it to a single equation that might help us make predictions. Example: “What would be the predicted number of Tomatoes Produced if we gave a plant 3.5 Gallons of water?” (Scatterplot: Tomato Production on the y-axis, Gallons of Water on the x-axis)
2 Variable Data: Constructing Least Squares Regression Lines Linear Regression Statisticians use a slightly different version of “slope-intercept” form. Algebraic form: y = mx + b (slope m, intercept b). Statistical form: ŷ = a + bx (intercept a, slope b).
2 Variable Data: Constructing Least Squares Regression Lines Linear Regression
Year:               1991  1992  1993  1994  1995  1996
Rainfall (inches):  12.3  14    11.6  10.5  9.8   13.2
Crop yield (tons):  43.1  43.0  40.7  38.9  36.2  43.2
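The rainfall table above can be turned into a regression line by hand, without a calculator. The following is a minimal sketch in Python, assuming the statistical form ŷ = a + bx with slope b = Sxy / Sxx and intercept a = ȳ - b·x̄; the function name `least_squares` is a hypothetical helper, not something from the slides.

```python
from statistics import mean

# Data from the slide: rainfall (inches) and crop yield (tons), 1991-1996.
rainfall = [12.3, 14, 11.6, 10.5, 9.8, 13.2]
crop_yield = [43.1, 43.0, 40.7, 38.9, 36.2, 43.2]

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

a, b = least_squares(rainfall, crop_yield)
print(f"y-hat = {a:.2f} + {b:.2f}x")

# Predicted crop yield for a year with 12 inches of rain.
predicted = a + b * 12
print(f"predicted yield at 12 inches: {predicted:.1f} tons")
```

The slope here is positive, so the model predicts a higher crop yield in wetter years, at least within the range of rainfall that was observed.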
2 Variable Data: Correlation Interpreting r
(Number line from -1 to 1 with tick marks at -1, -.75, -.5, -.25, .25, .5, .75, 1. Regions, left to right: Strong Negative, Moderate Negative, Weak Negative, No Correlation, Weak Positive, Moderate Positive, Strong Positive)
2 Variable Data Part 3
2 Variable Data: Correlation Relationship between ERA and Wins Is there a “correlation” between a baseball team’s “earned run average” and the number of wins? Is the association positive or negative? Is the association strong or weak? The symbol we use for correlation is r. (Scatterplot: Number of Wins / Season on the y-axis, ERA (Earned Run Average) on the x-axis)
2 Variable Data: Correlation r – “Correlation” Correlation is the measure of the association between the explanatory and response variables. The calculation of correlation is based on mean and standard deviation. Both mean and standard deviation are “not resistant measures.” That is, they cannot resist the influence of outliers. This means that r is also not resistant.
2 Variable Data: Correlation Facts about correlation: Both variables need to be quantitative. Because the data values are standardized, it does not matter which units the variables are in (as long as each variable is measured in the same unit throughout). The value of r is unitless (just like z-scores are unitless). Correlation is blind to the relationship between explanatory and response variables. Even though you may get an r value close to -1 or 1, you may not say that the explanatory variable causes the response variable. We will talk about this in detail in the second semester.
2 Variable Data: Finding the Equation, r, and r²
Using the TI-83 to calculate r
Find r (correlation) for this data set:
Student:        Sussy  Mike  Tommy  Sarah
Study Hours:    5      3     7      1
Grade on Test:  90%    85%   93%    82%
Procedure:
1. Put Study Hours into L1 and Grade on Test into L2.
2. Graph a scatterplot.
3. In CATALOG, select DiagnosticOn.
4. Under STAT, CALC, select LinReg(a+bx).
This procedure is super important. It gives you correlation (r) and can also store the linear regression line in Y1 so that you can see it on your graph.
r = .995: Strong Positive Correlation
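The calculator result above can be cross-checked by hand. Here is a small Python sketch, assuming the usual textbook formula in which r is the sum of the products of the z-scores of x and y divided by n - 1; the function name `correlation` is a hypothetical helper. It also converts hours to minutes to illustrate that r does not depend on units.

```python
from statistics import mean, stdev

# Data from the slide: study hours and test grades (percent).
hours = [5, 3, 7, 1]
grades = [90, 85, 93, 82]

def correlation(xs, ys):
    """Correlation r as the average product of z-scores (dividing by n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

r = correlation(hours, grades)
print(round(r, 3))

# Changing units (hours to minutes) rescales x but leaves the z-scores,
# and therefore r, unchanged.
minutes = [60 * h for h in hours]
r_minutes = correlation(minutes, grades)
print(round(r_minutes, 3))
```

Both printed values should agree with the r = .995 reported on the slide.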
2 Variable Data: Finding the Equation, r, and r²
Other facts about least-squares regression:
Make sure you know which is the explanatory (x) variable and which is the response (y) variable. Switching them gives a different regression line.
The regression line always goes through the point (x-bar, y-bar).
The correlation coefficient (r) describes the strength of the linear relationship.
Facts about r²:
r² is known as the “coefficient of determination.”
The square of the correlation (r²) is the fraction of the variation in the values of y that is explained by x.
___% (r²) of the variation of ______ (y) is explained by _____ (x).
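These facts can be checked numerically. The sketch below reuses the study-hours data from the TI-83 slide (hours 5, 3, 7, 1 and grades 90, 85, 93, 82) and a hypothetical helper `fit` that returns the intercept, slope, and r; it is an illustration under those assumptions, not the calculator's own routine.

```python
from statistics import mean

hours = [5, 3, 7, 1]
grades = [90, 85, 93, 82]

def fit(xs, ys):
    """Return (a, b, r) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return a, b, r

# Hours as explanatory (x), grades as response (y).
a, b, r = fit(hours, grades)
print(f"grade-hat = {a:.1f} + {b:.1f} * hours, r^2 = {r ** 2:.3f}")

# Fact: the line passes through (x-bar, y-bar).
passes_through_means = abs((a + b * mean(hours)) - mean(grades)) < 1e-9

# Fact: switching x and y gives a different line (but the same r).
a_sw, b_sw, r_sw = fit(grades, hours)
slope_if_same_line = 1 / b_sw   # would equal b if the two lines coincided
print(f"slope {b:.3f} vs. {slope_if_same_line:.3f} from the switched fit")
```

With these four students, about 99% of the variation in grade is explained by study hours, and the switched regression implies a slightly different slope.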
2 Variable Data: Finding the Equation, r, and r²
Checking for Understanding
Find the line of best fit and describe the association between the variables:
Age:              7   11  15  8   6   12  10
Height (inches):  51  56  61  50  49  53  63
2 Variable Data Part 4
r²: the “coefficient of determination.” In the regression of ERA vs. WINS, we find an r² value of .4512. We say “45% of the variation in WINS can be explained by ERA.”
Outliers and Influential Data An outlier is an observation outside the overall pattern. An observation is influential if it has a large effect on the regression line: removing the observation changes the line.
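The effect of an influential observation can be seen by fitting the line with and without it. The data below are made up for illustration (five points on the line y = 2x plus one point far to the right), and `least_squares` is a hypothetical helper; this is a sketch, not data from the course.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: the last point (20, 5) sits far from the others.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 5]

a_all, b_all = least_squares(xs, ys)
a_trim, b_trim = least_squares(xs[:-1], ys[:-1])   # drop the last point

print(f"with the point:    y-hat = {a_all:.2f} + {b_all:.2f}x")
print(f"without the point: y-hat = {a_trim:.2f} + {b_trim:.2f}x")
```

Removing the single far-right point changes the slope from nearly flat to 2, which is what makes that observation influential.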
Residuals
It is important to note that the observed values almost never match the predicted values exactly. The difference between the observed value and the predicted value has a special name: residual.
Residual = Observed - Predicted
Observed Value (y): at 5.3 ERA, 43 Wins
Predicted Value (ŷ): at 5.3 ERA, 67.03 Wins
Residual: 43 - 67.03 = -24.03
Residuals are negative when the observed value is below the predicted value. Residuals are positive when the observed value is above the predicted value.
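The residual calculation is a one-line subtraction. As a minimal sketch using the ERA example from this slide (the function names are hypothetical):

```python
def residual(observed, predicted):
    """Residual = Observed - Predicted."""
    return observed - predicted

def describe(res):
    """Say whether the observation sits above or below the regression line."""
    if res > 0:
        return "above the predicted value"
    if res < 0:
        return "below the predicted value"
    return "exactly on the line"

# Team with a 5.3 ERA: 43 wins observed, 67.03 wins predicted.
res = residual(43, 67.03)
print(round(res, 2), describe(res))
```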
Residual Plots You can plot the residuals to see whether there are any trends in the quality of the predictive model. This residual plot shows no tendencies: the model is equally good (or bad) throughout. This suggests that the original relationship is linear.
Not Linear because the residual plot is curved: Linear because the residual plot is well distributed:
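The difference between a curved and a well-distributed residual plot can also be seen in the numbers. The sketch below uses made-up data that follow y = x² (clearly not linear), fits a least squares line anyway, and lists the residuals; the helper name `least_squares` is hypothetical.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical curved data: y = x squared.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, res in zip(xs, residuals):
    print(x, round(res, 2))
```

The residuals start positive, dip negative in the middle, and end positive again. That U shape is the curved pattern that signals a line is not the right model, even though the residuals still sum to zero.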
Bear
Age:  4    7    12   10   6    9
kP:   145  166  221  201  153  182
1. Use your calculator to make a scatter plot.
2. Find r and r².
3. Plot the trend line.
4. Find the residual for the bear with age 6.
5. Plot a Residual Plot and verify the residual for age 6 p.bear
6. Is a linear trend the most appropriate? How do you know?
2 Variable Data Chapter Summary
Concepts to Remember: Making a scatterplot, and thus working with correlation and least squares regression, requires that both variables be quantitative. You assign the explanatory variable to x and the response variable to y. * Keep in mind, however, that in some circumstances you might be asked to assign them in the reverse order simply for the sake of doing a problem. It is also possible that there is no clear explanatory or response variable (e.g., rain and income), but you can still do least squares regression and calculate correlation! Correlation (r) is unitless and measures the strength and direction of the linear relationship between two variables. The sign (negative or positive) indicates direction, while the magnitude indicates how closely the points conform to a line. r² is also unitless and is never negative. r² means “the percent of variation of y that is explained by x.” This applies to residuals! Every least squares regression line passes through the point (mean of x, mean of y). The least squares regression line is a model that we use to predict outcomes.
Interpreting r
(Number line from -1 to 1 with tick marks at -1, -.75, -.5, -.25, .25, .5, .75, 1. Regions, left to right: Strong Negative, Moderate Negative, Weak Negative, No Correlation, Weak Positive, Moderate Positive, Strong Positive)
Concepts to Remember: Least Squares Regression is also called Linear Regression and uses the form ŷ = a + bx. A least squares regression line minimizes the sum of the squared residuals (the total area of the squares drawn on the residuals). Thus the name “least squares!” A cloudlike residual plot is ideal because it shows the points are scattered randomly above and below the least squares regression line. A curved residual plot indicates the original data was not linear and a linear model is not appropriate. Just as the mean and standard deviation are not resistant, meaning they cannot resist the pull of outliers, correlation (r) and the least squares regression line are not resistant. Outliers and influential observations can change r substantially (often pulling it toward zero) and can change the slope (b) and intercept (a) of the least squares regression line.
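The “least squares” idea can be checked directly: any other line should have a larger sum of squared residuals than the regression line. The sketch below reuses the study-hours data from the TI-83 slide and compares the fitted line with a few nudged alternatives; the helper names and the sizes of the nudges are arbitrary choices for illustration.

```python
from statistics import mean

hours = [5, 3, 7, 1]
grades = [90, 85, 93, 82]

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def sum_squared_residuals(a, b, xs, ys):
    """Total of (observed - predicted)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares(hours, grades)
best = sum_squared_residuals(a, b, hours, grades)

# Try lines with a slightly different intercept and/or slope.
nudged = [
    sum_squared_residuals(a + da, b + db, hours, grades)
    for da in (-1, 0, 1)
    for db in (-0.2, 0, 0.2)
    if (da, db) != (0, 0)
]

print(f"least squares line: {best:.2f}")
print(f"smallest nudged alternative: {min(nudged):.2f}")
```

Every nudged line does worse than the fitted one, which is the sense in which the regression line is the “best fit.”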
Monotonic Functions move in one direction as their argument (x) increases. Monotonic Increasing Functions preserve the order of the data: if a > b, then f(a) > f(b) [i.e., trends upward]. Monotonic Decreasing Functions reverse the order of the data: if a > b, then f(a) < f(b) [i.e., trends downward].
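These definitions translate into a simple check on a list of sample x-values. The sketch below only tests the points it is given, so it can suggest, but not prove, that a function is monotonic everywhere; the function names are hypothetical.

```python
def is_monotonic_increasing(f, xs):
    """True if f(a) > f(b) whenever a > b, for the sample points in xs."""
    pts = sorted(xs)
    return all(f(lo) < f(hi) for lo, hi in zip(pts, pts[1:]))

def is_monotonic_decreasing(f, xs):
    """True if f(a) < f(b) whenever a > b, for the sample points in xs."""
    pts = sorted(xs)
    return all(f(lo) > f(hi) for lo, hi in zip(pts, pts[1:]))

sample = range(-3, 4)

def exponential(x):
    return 2 ** x        # trends upward

def falling_line(x):
    return -3 * x        # trends downward

def parabola(x):
    return x * x         # falls, then rises: not monotonic

print(is_monotonic_increasing(exponential, sample))
print(is_monotonic_decreasing(falling_line, sample))
print(is_monotonic_increasing(parabola, sample), is_monotonic_decreasing(parabola, sample))
```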