Topic 13: Quantitative-Quantitative Association, Part 1
Introduction to linear regression
Finding the best fit line by least squares regression
Linear Regression and Outliers
Introduction to linear regression
Poverty vs. high school graduation rate
The scatterplot below shows the relationship between the high school graduation rate in all 50 US states and DC and the percentage of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).
Explanatory variable? % HS grad
Response variable? % in poverty
Relationship? Linear, negative, moderately strong
Quantifying the relationship
The correlation coefficient (r) describes the strength of the linear association between two quantitative variables.
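As a rough illustration of how r is computed, the sketch below applies the usual formula (the sum of cross-deviations divided by the square root of the product of the summed squared deviations). The function name `correlation` and the five data points are hypothetical and are not the state data from the scatterplot.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only (assumed, not from the slides).
hs_grad = [80.0, 84.0, 86.0, 88.0, 91.0]
poverty = [17.0, 14.0, 13.5, 11.0, 9.0]

r = correlation(hs_grad, poverty)
print(round(r, 3))  # negative, because poverty falls as graduation rises
```

Whatever the data, the result always lies between -1 and 1, which is why a value such as -1.5 can never be a correlation.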
Strong, Moderate, or Weak?
Interpreting r requires knowledge of one’s field. A value of r that implies a strong relationship in one field may not in another. The table below serves as a starting point until you learn more about your own field.
Strength      Positive association   Negative association
Very strong   0.8 to 1.0             -0.8 to -1.0
Strong        0.6 to 0.8             -0.6 to -0.8
Moderate      0.4 to 0.6             -0.4 to -0.6
Weak          0.2 to 0.4             -0.2 to -0.4
Very weak     0.0 to 0.2             0.0 to -0.2
Guessing the correlation
Which of the following is the best guess for the correlation coefficient between % in poverty and % HS grad?
(a) 0.6
(b) -0.75
(c) -0.1
(d) 0.02
(e) -1.5
Guessing the correlation
Which of the following is the best guess for the correlation between % in poverty and % female householder with no husband present?
(a) 0.1
(b) -0.6
(c) -0.4
(d) 0.9
(e) 0.5
Assessing the correlations
Which of the following has the strongest correlation, that is, the correlation coefficient closest to 1 or -1?
Finding the best fit line by least squares regression
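Residuals
A residual is the vertical distance from an observation to the line: the observed value minus the value the line predicts. Points above the line have positive residuals and points below it have negative residuals.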
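A minimal sketch of the residual calculation, assuming a line that is already given. The helper name `residuals`, the data, and the line y = 0 + 2x are all hypothetical choices made for illustration.

```python
def residuals(xs, ys, intercept, slope):
    """Observed y minus the y predicted by the line, for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical observations and a hypothetical candidate line y = 0 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

res = residuals(xs, ys, intercept=0.0, slope=2.0)
for x, e in zip(xs, res):
    # A positive residual means the point sits above the line.
    position = "above" if e > 0 else "below"
    print(x, round(e, 2), position)
```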
Method of least squares
We find the line that minimizes the sum of the squared residuals. Consider the GeoGebra applet.
Conditions for the least squares line:
(1) Linearity
(2) Nearly normal residuals
(3) Constant variability
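To make the minimization concrete, the sketch below uses the standard closed-form solution for a line with one explanatory variable (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) and then checks that nudging the line only increases the sum of squared residuals. The function names and the data are hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 5.8, 8.3, 9.6]

b0, b1 = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line, such as these nudged versions, has a larger sum of squares.
nudged = [
    sum_squared_residuals(xs, ys, b0 + 0.1, b1),
    sum_squared_residuals(xs, ys, b0 - 0.1, b1),
    sum_squared_residuals(xs, ys, b0, b1 + 0.05),
    sum_squared_residuals(xs, ys, b0, b1 - 0.05),
]
print(best, min(nudged))
```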
Conditions: (1) Linearity
The relationship between the explanatory and the response variable should be approximately linear.
Conditions: (2) Nearly normal residuals
The distribution of the residuals should be nearly normal.
Conditions: (3) Constant variability
The variability of the points around the least squares line should be roughly constant.
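These conditions are normally checked by looking at plots of the data and of the residuals. As a crude numeric companion to the constant variability condition, the sketch below fits a line and compares the spread of the residuals in the lower and upper halves of the x range. This half-split comparison is an illustrative heuristic, not a method from the slides, and the data are hypothetical values built to fan out.

```python
import statistics

# Hypothetical data whose scatter grows with x (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]
ys = [2.0 * x + e for x, e in zip(xs, noise)]

# Fit the least squares line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals, kept in order of increasing x (xs is already sorted).
res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Compare residual spread for small x against large x.
half = n // 2
lower_sd = statistics.pstdev(res[:half])
upper_sd = statistics.pstdev(res[half:])
print(round(lower_sd, 3), round(upper_sd, 3))
# A much larger spread on one side suggests non-constant variability.
```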
Checking conditions
What condition is this model obviously violating?
(a) Constant variability
(b) Linear relationship
(c) Normal residuals
Checking conditions
What condition is this model obviously violating?
(a) Constant variability
(b) Linear relationship
(c) Normal residuals
r²
The strength of the fit of a linear model is most commonly evaluated using r², the square of the correlation coefficient. It tells us what percent of the variability in the response variable is explained by the model. The remainder of the variability is due to other variables not included in the model or to inherent randomness in the data.
Interpretation of r²
r = -0.62, r² = 0.38
38% of the variability in the % of residents living in poverty among the 50 states and DC is explained by the model.
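The "percent of variability explained" reading can be checked numerically. For a least squares line, one minus the ratio of the residual sum of squares to the total sum of squares equals the square of r. The sketch below verifies this on hypothetical data and also squares the slide's r = -0.62.

```python
import math

# Hypothetical data with a negative association (not the state data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [9.5, 8.1, 8.4, 6.2, 6.9, 4.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variability in y

# Least squares line and its residual sum of squares.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)
r_squared_from_fit = 1 - sse / syy  # share of variability explained
r_squared_from_r = r ** 2           # square of the correlation

print(round(r_squared_from_fit, 4), round(r_squared_from_r, 4))

# The slide's numbers: squaring r = -0.62 gives about 0.38.
slide_r = -0.62
print(round(slide_r ** 2, 2))  # 0.38
```

Note that squaring discards the sign, so r² alone does not say whether the association is positive or negative.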
Linear Regression and Outliers
Outliers and direction of the association
Data are available on the surface temperature and light intensity of 47 stars in the star cluster CYG OB1.
Outliers and association strength
r = 0.08, r² = 0.0064
r = 0.79, r² = 0.6241
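A small sketch of how a few outlying points can change both the strength and the direction of a correlation. The data are hypothetical and are not the star cluster measurements: six points follow an upward trend, and two added points sit far to the left with high y values.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical main cluster with a clear upward trend.
main_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
main_y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

# Hypothetical outliers: low x but high y, against the trend.
outlier_x = [-5.0, -6.0]
outlier_y = [6.0, 7.0]

r_without = correlation(main_x, main_y)
r_with = correlation(main_x + outlier_x, main_y + outlier_y)

print("without outliers:", round(r_without, 2), round(r_without ** 2, 4))
print("with outliers:   ", round(r_with, 2), round(r_with ** 2, 4))
```

With these made-up values the two outliers are enough to flip the sign of r, which is why it is worth plotting the data before trusting a single correlation figure.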