Scatter Diagrams and Linear Correlation
Chapters 1-3 covered single-variable data; this section deals with two variables.
Examples of two variables: age of person vs. time to master a cell phone task, grade point average vs. time studying, grade point average vs. time playing video games, amount of smoking vs. rate of lung cancer.
Scatter diagram: (x, y) data plotted as individual points.
x: explanatory variable (independent)
y: response variable (dependent)
A scatterplot of y vs. x values shows the relationship between two quantitative variables measured on the same individual.
Scatter Diagrams and Linear Correlation
Look at the overall pattern. Are there any striking deviations (outliers)?
Describe the pattern by:
a) form (linear or curved)
b) direction: positively associated (positive slope) or negatively associated (negative slope)
c) strength: how closely the points follow the form
Examples: age of person vs. time to master a cell phone task, grade point average vs. time studying, grade point average vs. time playing video games, amount of smoking vs. rate of lung cancer.
Degrees of correlation
Scatter Diagrams and Linear Correlation
Tips for drawing a scatterplot:
Scale the axes: the intervals along each axis must be uniform, though the two axes can use different scales.
Label both axes.
Adopt a scale that uses the entire grid (do not compress the plot into one corner of the grid).
Scatter Diagrams and Linear Correlation
Correlation coefficient (r)
Assesses the strength and direction of the linear relationship between x and y.
Unitless; -1 ≤ r ≤ 1
r = -1 or r = 1: perfect correlation (all points lie exactly on a line)
The closer r is to 1 or -1, the better a line describes the relationship (better fit to the data).
r > 0: positive association between x and y
r < 0: negative association between x and y
x and y are interchangeable in calculating r.
r does not change if either (or both) variables change units (inches to cm, or °F to °C).
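The last two properties can be checked numerically. The sketch below is only an illustration: the data are made up, and the `correlation` helper is a hypothetical function that uses the sums-of-products form of r, which is algebraically equivalent to the standardized form.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r, computed from sums of deviation products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration only.
temp_f = [50, 59, 68, 77, 86]      # temperature in degrees Fahrenheit
height_in = [10, 12, 15, 16, 20]   # plant height in inches

# Unit changes: Fahrenheit to Celsius, inches to centimeters.
temp_c = [(f - 32) * 5 / 9 for f in temp_f]
height_cm = [h * 2.54 for h in height_in]

r_original = correlation(temp_f, height_in)
r_converted = correlation(temp_c, height_cm)   # same data, new units
r_swapped = correlation(height_in, temp_f)     # x and y exchanged

print(r_original, r_converted, r_swapped)
```

All three printed values should agree up to floating-point rounding.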
Linear and non-linear correlations
Scatter Diagrams and Linear Correlation
$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$
Using the TI-83: example p. 129 (number of police vs. muggings)
Cautions:
Association does not imply causation.
Lurking variables may play a role.
r is only meaningful for linear models.
Correlations between averages are usually higher than correlations between individual points.
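For readers without a TI-83, the formula for r can be followed step by step in a few lines of Python. This is a minimal sketch, not the textbook example: the data are made up, and the helper names (`mean`, `sample_sd`, `correlation`) are chosen here for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]            # hours studied
gpa = [2.0, 2.4, 2.9, 3.1, 3.6]    # grade point average

r = correlation(hours, gpa)
print(round(r, 3))
```

Points that fall exactly on a rising line give r = 1, and points exactly on a falling line give r = -1, which is a quick sanity check on any implementation.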
Scatter Diagrams and Linear Correlation
Facts:
r makes no distinction between the x and y variables; its value is unaffected by switching x and y.
Both x and y must be quantitative.
r is only good for linear relationships.
r is not resistant to outliers.
r is not a complete description of two-variable data; the means and standard deviations of x and y should also be reported.
HW: p. 131: 2, 4, 6, 8 a,b,c, 10 a,b,c, 12 a,b,c. For "c", use a calculator to compute r.
4.2 Least Squares Regression
A method for finding the line (best fit) that summarizes the relationship between two variables, x (explanatory) and y (response).
Use the line to predict the value of y for a given x.
There must be a specific response variable y and explanatory variable x (they cannot be switched, as they can for r).
4.2 Least Squares Regression
Least Squares Regression Line (LSRL)
Minimizes the sum of squared errors in the y-values.
Error = observed value - predicted value
Sum of squared errors: Σ(y - ŷ)² (y is the actual value; ŷ, called "y hat", is the predicted value)
The LSRL of y on x is the line that makes the sum of the squared vertical distances from the data points to the fitted line as small as possible.
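The "as small as possible" claim can be spot-checked by comparing the sum of squared errors for the fitted line against a few nearby lines. This sketch uses made-up data and the standard closed-form least-squares slope and intercept; the function names and the perturbation sizes are arbitrary choices for illustration.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors, sum of (y - y_hat)^2, for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Closed-form least-squares intercept a and slope b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.4, 2.9, 3.1, 3.6]

a, b = least_squares_line(hours, gpa)
best = sse(a, b, hours, gpa)

# SSE for lines with slightly different intercepts and slopes.
nearby = [sse(a + da, b + db, hours, gpa)
          for da in (-0.1, 0.0, 0.1)
          for db in (-0.05, 0.0, 0.05)
          if (da, db) != (0.0, 0.0)]

print(best, min(nearby))
```

Every perturbed line should give a larger sum of squared errors than the fitted line. This checks only a handful of neighbors, so it illustrates the property rather than proving it.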
4.2 Least Squares Regression
LSRL equation: ŷ = a + bx
ŷ: predicted value of y
Slope: b = r(s_y/s_x)
y-intercept: a = ȳ - b x̄
x̄ and ȳ are the means of the x and y data, respectively; the point (x̄, ȳ) is on the LSRL.
s_x and s_y are the standard deviations of the x and y data, respectively.
r is the correlation.
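The slope and intercept formulas translate directly into code. The sketch below assumes made-up data and hypothetical helper names (`lsrl`, `predict`); it builds the line from r, the two standard deviations, and the two means, then uses it for one prediction.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (a, b) for y_hat = a + b*x using b = r*(s_y/s_x), a = y_bar - b*x_bar."""
    b = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def predict(a, b, x):
    return a + b * x

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.4, 2.9, 3.1, 3.6]

a, b = lsrl(hours, gpa)
print("intercept a =", round(a, 3), "slope b =", round(b, 3))
print("predicted GPA at 3.5 hours:", round(predict(a, b, 3.5), 3))
```

Plugging the mean of x into `predict` should return the mean of y, which confirms that the point (x̄, ȳ) lies on the line.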
4.2 Least Squares Regression
TI-83: enter the data into L1 and L2 (x, y).
Use STAT CALC and select #8: LinReg(a+bx) to get the best-fit line.
Slope: important for interpreting the data; it is the rate of change of y for each unit increase in x.
Intercept: may not be practically important for the problem.
4.2 Least Squares Regression
Plotting the LSRL: use the formula ŷ = a + bx to find two points on the line, (x1, ŷ1) and (x2, ŷ2). Make sure x1 and x2 are near opposite ends of the data.
Influential observations and outliers
Influential: extreme in the x-direction; removing an influential point changes the LSRL significantly.
Outliers: extreme in the y-direction; removing one usually does not change the LSRL significantly.
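The difference between the two kinds of unusual points can be seen by refitting the line with one extra point added. This is a single made-up example, not a general rule: the extra points (15, 5) and (3, 12) are hypothetical, and the y-direction outlier is deliberately placed at the mean of x, where it shifts the intercept but leaves the slope alone.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

base_a, base_b = fit_line(xs, ys)

# Extra point that is extreme in the x-direction (far right, well off the trend).
infl_a, infl_b = fit_line(xs + [15], ys + [5.0])

# Extra point that is extreme in the y-direction (at the middle of the x range).
out_a, out_b = fit_line(xs + [3], ys + [12.0])

print("base slope:              ", round(base_b, 3))
print("with x-extreme point:    ", round(infl_b, 3))
print("with y-direction outlier:", round(out_b, 3))
```

In this example the x-extreme point pulls the slope far from its original value, while the y-direction outlier leaves the slope essentially unchanged.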
Coefficient of Determination
r²: coefficient of determination
r: describes the strength and direction of a straight-line relationship
r²: fraction of the variation in the values of y that is explained by the LSRL of y on x
r = 1, r² = 1: perfect correlation; 100% of the variation is explained by the LSRL
r = 0.7, r² = 0.49: about 49% of the variation in y is explained by the LSRL
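One way to make "fraction of variation explained" concrete is to compare r² with 1 - SSE/SST, where SSE is the sum of squared errors about the fitted line and SST is the total sum of squared deviations of y from its mean. The names SSE and SST are not used in these notes; they are standard labels adopted here for the sketch, and the data are made up.

```python
from math import sqrt

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.4, 2.9, 3.1, 3.6]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(gpa) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, gpa))
sxx = sum((x - x_bar) ** 2 for x in hours)
sst = sum((y - y_bar) ** 2 for y in gpa)    # total variation in y

r = sxy / sqrt(sxx * sst)                   # correlation

b = sxy / sxx                               # least-squares slope
a = y_bar - b * x_bar                       # least-squares intercept
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, gpa))  # unexplained variation

explained_fraction = 1 - sse / sst

print(round(r ** 2, 4), round(explained_fraction, 4))
```

The two printed values should match, since for a least-squares line the explained fraction of variation equals r².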
Residuals
Residual: the difference between the observed value and the predicted value
Residual = y - ŷ
The mean of the least-squares residuals is 0.
Residual plot: a scatterplot of the regression residuals against the explanatory variable (x)
Useful in assessing the fit of the regression line, i.e., do we have a straight line?
Uniform scatter: the relationship is linear.
Curved pattern: the relationship is not linear.
Spread that increases with x: prediction of y will be less accurate for larger x (spread that decreases with x: less accurate for smaller x).
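A small numerical example shows both the zero-mean property and what a curved residual pattern looks like. The sketch below fits a straight line to made-up data that actually follow y = x², so the line is the wrong model on purpose; the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def residuals(xs, ys):
    """Residual = observed y minus predicted y_hat for each data point."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up curved data (y = x squared), fitted with a straight line anyway.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

res = residuals(xs, ys)
mean_residual = sum(res) / len(res)

for x, e in zip(xs, res):
    print(x, round(e, 3))
print("mean residual:", round(mean_residual, 10))
```

The residuals are positive at both ends and negative in the middle, the kind of curved pattern in a residual plot that signals a non-linear relationship, even though their mean is still 0.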