Regression in 1D: Fit a line to data by minimizing squared residuals.

Overview:
- Calculating estimates of slope and intercept
- Outliers, high-leverage, and influential data points
- Inference in regression: coefficient of determination, standard error of the estimate, t-test for a relationship between x and y, confidence interval on the slope, confidence interval on y predictions given x
- Verifying the assumptions required for inference
- Transformations to linearize data
- Examples

Regression: estimating functional relationships between predictors and response.
- Assume a particular relationship exists (a line, for example)
- Find the parameters of the assumed functional form by minimizing in-sample error (the sum of squared residuals, for example)
- This separates the predictor-response relationship from noise in the data
- How much confidence can I have in my results? (inference) This requires additional assumptions about the noise in the data.

Example: fit y = ax + b to m data points. Find the unknowns a and b that minimize the sum of squared residuals. What is the objective function to be minimized? What are the equations that determine a and b?

Example: fit ax + b to m data points. Find the values of a and b that minimize the sum of squared residuals. (A worked summary follows below.)
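
A sketch of the answers to the two questions above (my own summary, written with the ⟨·⟩ average notation used on the weighted-regression slide near the end of this section):

    f(a,b) = Σk (a·xk + b − yk)²                 (objective function: sum of squared residuals)
    ∂f/∂a = 2 Σk xk (a·xk + b − yk) = 0   →   a⟨x²⟩ + b⟨x⟩ = ⟨xy⟩
    ∂f/∂b = 2 Σk (a·xk + b − yk) = 0      →   a⟨x⟩ + b = ⟨y⟩

where ⟨·⟩ denotes an average over the m data points, e.g. ⟨xy⟩ = (1/m) Σk xk·yk.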

Use the method on the previous slide to derive an expression for the best constant that fits a set of m data points.

b = ⟨yk⟩ expresses how much we know in the absence of "attributes". Example: {yk} = heights of students at WSU. Other attributes of students (weight, age, country of origin, etc.) might be "predictors" of height. For any attribute xk, the absolute value of the slope, |a|, of the linear regression tells us how important xk is as a predictor of yk. The sign of the slope tells us the direction of the correlation between yk and xk.

Matrix formulation of linear least squares: given a dataset {(tk, yk), k = 1,...,n} and a set of functions {fj(t), j = 1,...,m}, find the linear combination of functions that best fits the data.
- Define the matrix A where akj = fj(tk) (jth function evaluated at kth data point)
- Define the column vector b = [y1, y2,...,yn]ᵀ of response values
- Define the column vector w = [w1, w2,...,wm]ᵀ of weights in the linear combination of functions
- fit = Aw is the value of the fit at each data point
- r = fit − b is the deviation between fit and data at each data point
- Find the best choice of w by minimizing the sum of squared residuals between fit and data

Normal equations: let r = b − Aw and define f(w) = (‖r‖₂)² = rᵀr. Then
f(w) = (b − Aw)ᵀ(b − Aw) = bᵀb − 2wᵀAᵀb + wᵀAᵀAw
A necessary condition for w0 to be a minimum of f(w) is ∇f(w0) = 0, where ∇f is an m-vector with components equal to the partial derivatives of f(w) with respect to the weights w.
∇f(w) = 2AᵀAw − 2Aᵀb = 0  →  the optimal set of weights is a solution of the m×m symmetric system AᵀAw = Aᵀb, called the "normal" equations of the linear least squares problem.

Example: use the normal equations to fit a line y = w1·x + w0 to n data points. What are the functions {fj(t), j = 1, 2} that determine the matrix elements in this case? Build the A matrix (akj = fj(tk), jth function evaluated at kth data point). Build the b vector (column vector of measured y values). Construct and solve the normal equations.

This yields the same set of equations as obtained by the objective-function method; c1 is the y-intercept and c2 is the slope. (A minimal code sketch follows.)
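
A minimal MATLAB sketch of this recipe (my own variable names; assumes the column vectors x and y are given):

% Fit y = w1*x + w0 by the normal equations.
A = [x, ones(size(x))];    % columns: f1(t) = t and f2(t) = 1 evaluated at the data
b = y;                     % column vector of measured responses
w = (A'*A) \ (A'*b);       % solve the 2x2 normal equations A'*A*w = A'*b
slope = w(1); intercept = w(2);
% MATLAB's backslash on the rectangular system gives the same answer: w = A\b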

Polynomial regression of degree 1 with N data points: 1D linear regression by the linear-algebra approach. Solve VᵀVw = Vᵀy for w1 and w0. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Simple example: distance hiked vs. time (text p. 39). [Table of the 10 observations: index, x = time, y = distance, fit = 6 + 2x, error, and (error)².] Sum of squared errors (SSE) = 12.

Coefficient of determination: is the fit a better predictor than the average? SST = total sum of squares = Σi (yi − ȳ)², the squared deviation of the responses from their mean.

Sum of squares regression, SSR = Σi (ŷi − ȳ)², measures the variability from the mean response explained by the regression. Sum of squares error, SSE = Σi (yi − ŷi)², measures the variability in y from all other sources (usually noise) after the linear relationship between x and y has been accounted for. SST = SSR + SSE follows from the identity (yi − ȳ) = (ŷi − ȳ) + (yi − ŷi), because the cross term vanishes at the least-squares optimum.

Coefficient of determination: r² = SSR/SST. As SSE → 0, SSR → SST and r² → 1: a perfect fit. As SSR → 0, r² → 0: the fit is no better than the average. r² is interpreted as the fraction of response variation explained by the predictor. The correlation coefficient is r = ±√r², with sign equal to the sign of b1. For the hiker example: SST = 228, SSE = 12, SSR = 216, r² ≈ 0.95. Add the coefficient of determination to your linear fit code.

The standard error of the estimate: the mean square error is MSE = SSE/(n − m − 1), where m = number of predictors and n = number of observations; the standard error of the estimate, s = √MSE, is a "typical" residual or error in estimation. From the hiker distance vs. time dataset, m = 1, n = 10, SSE = 12 → s = √(12/8) ≈ 1.225 km: the linear regression estimate of hiking distance typically differs from the actual distance by about 1.2 km. Add the standard error of the estimate to your linear fit subroutine. (This and the following slides draw on Discovering Knowledge in Data: Data Mining Methods and Models, by Daniel T. Larose, © 2005 John Wiley & Sons, Inc.)
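
A sketch of the additions requested on the last two slides (continuing the variable names from the earlier sketch):

yfit = A*w;                    % fitted values at the data points
SSE  = sum((y - yfit).^2);     % error sum of squares
SST  = sum((y - mean(y)).^2);  % total sum of squares
SSR  = SST - SSE;              % regression sum of squares
r2   = SSR/SST;                % coefficient of determination
m = 1; n = numel(y);           % one predictor, n observations
s = sqrt(SSE/(n - m - 1));     % standard error of the estimate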

ANOVA table for simple linear regression: regression statistics are summarized in an analysis of variance (ANOVA) table, where m = total predictors and n = total observations. F is a test statistic used for "inference" (discussed later).

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square             F
Regression            SSR              m                    MSR = SSR/m             F = MSR/MSE
Error (or Residual)   SSE              n − m − 1            MSE = SSE/(n − m − 1)
Total                 SST              n − 1

High-leverage points: the leverage of the ith observation is hi = 1/n + (xi − x̄)² / Σj (xj − x̄)². As the distance of an x-value from the mean of the x-values increases, leverage increases, with 1/n ≤ hi ≤ 1.0. Observations with leverage greater than 2(m + 1)/n or 3(m + 1)/n are considered to have high leverage.

Standardized residuals and outliers: the standard error of the ith residual, for an observation with leverage hi, is s·√(1 − hi). The standardized residual is (yi − ŷi) / (s·√(1 − hi)). Generally, observations with |standardized residual| > 2 are flagged as outliers.

Example of an outlier calculation: suppose an 11th hiker traveled 20 km in 5 hours. Including the 11th observation changes the regression results slightly, giving b0 = 6.36, b1 = 2.00, s = 1.72. Compute the leverage, the standard error of the residual, and the standardized residual. Conclusion? (The numbers are filled in below.)
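
Filling in the numbers (my own calculation from the values above; the ten original times average x̄ = 5 hours, so the new point sits exactly at the mean):

    leverage: h11 = 1/11 + 0 ≈ 0.091
    standard error of the residual: s·√(1 − h11) = 1.72·√0.909 ≈ 1.64
    residual: 20 − (6.36 + 2.00·5) = 3.64
    standardized residual: 3.64/1.64 ≈ 2.2 > 2

Conclusion: the 11th hiker is flagged as an outlier.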

Example from the Cereals dataset: scatter plot of nutritional rating against sugars (text p. 34). Minitab (http://www.minitab.com/en-us/products/minitab/) reports a standardized residual for All-Bran Extra Fiber of 3.38. Outlier? Use your code to verify this result. The outlier: All-Bran Extra Fiber, sugars = 0, rating = 93.7, predicted rating = 59.44, residual = 34.26.

Influential observations: an influential observation significantly alters the regression parameters based on its absence/presence in the data set. An outlier may or may not be influential; a high-leverage point may or may not be influential. Example: an 11th hiker walks 39 km in 16 hours. This point is identified as a high-leverage point and likely has a strong influence on the slope. Cook's distance is a test for influential observations.

Cook's distance combines elements representing the outlier and the leverage: Di = [(yi − ŷi)² / ((m + 1)·s²)] · [hi / (1 − hi)²], where (yi − ŷi) is the ith residual, s is the standard error of the estimate, hi is the leverage of the ith observation, and m is the number of predictors. Example, Cook's distance for the 11th (5, 20) hiker: D11 = [3.64² / (2·1.72²)] · [0.0909 / (1 − 0.0909)²] ≈ 0.246.

More on Cook's distance: in general, influential observations have Cook's distance > 1.0. Cook's distance can also be compared against the F(m, n−m) distribution: points with CD greater than the 50th percentile of F(m, n−m) are considered influential. Example: the 11th hiker (5, 20), with Cook's distance = 0.2465, is not influential; it lies within the 37th percentile of F(1,10). Example: the hard-core hiker (16, 39) has high leverage, hi = 0.7007, and standardized residual = 0.46801; its CD = 0.2564 (similar to the previous example) shows it is not influential.

Example of an influential observation: data on an 11th hiker (10, 23) "pulls down" the regression line: the slope b1 decreases from 2.00 to 1.82. This hiker has leverage hi = 0.36019 and standardized residual = −1.70831. Cook's distance = 0.821457: not influential by the CD > 1 criterion, but it lies in the 62nd percentile of F(1,10), which makes it influential by the percentile criterion.

Percentiles of the F(1,10) distribution:
A      Fstat
.50    0.490
.62    0.82
.90    3.29
.95    4.96

Assignment 3, part 1: write code for fitting a line to data using the method on slide 6. Include calculation of the coefficient of determination using the notes on slides 13-15. Include calculation of the standard error of the estimate using the notes on slide 16. Include calculation of high-leverage data points using the notes on slide 18. Include calculation of outliers using the notes on slide 19. Include calculation of influential points using the notes on slide 23 and a lower bound of 1 on Cook's distance. (A possible skeleton for these diagnostics is sketched below.)
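
One possible skeleton for these diagnostics (a sketch under my own naming, with thresholds taken from the slides above; assumes column vectors x, y and the fitted w = [slope; intercept]):

% Diagnostics for a 1D linear fit.
n = numel(y);  m = 1;                              % observations, predictors
yfit = w(1)*x + w(2);                              % fitted values
SSE  = sum((y - yfit).^2);
s    = sqrt(SSE/(n - m - 1));                      % standard error of the estimate
xbar = mean(x);
h    = 1/n + (x - xbar).^2 / sum((x - xbar).^2);   % leverage of each observation
highLeverage = h > 2*(m + 1)/n;                    % high-leverage flag
stdRes  = (y - yfit) ./ (s*sqrt(1 - h));           % standardized residuals
outlier = abs(stdRes) > 2;                         % outlier flag
cooksD  = (y - yfit).^2 / ((m + 1)*s^2) .* h ./ (1 - h).^2;   % Cook's distance
influential = cooksD > 1;                          % lower bound of 1 on Cook's D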

Assignment 3, part 1, continued: fit a line to distance hiked vs. time using the data on slide 12. Report the coefficient of determination and the standard error of the estimate. Flag high-leverage points, outliers, and influential points. Check results against Table 2.6, text p. 47. Repeat with the 11th data point (5, 20) and report the differences from the results with 10 points. Repeat with the 11th data point (16, 39). Repeat with the 11th data point (10, 23). Fit a line to rating as a function of sugars from the Cereals dataset. Check results against Table 2.7, text p. 47.

Inference in regression, going beyond r² and s: assume that a linear relationship between predictor and response exists but is obscured by normally distributed noise with zero mean and a variance that is independent of the predictor and response values. To trust the results of "inference", these assumptions about the error must be verified. Each example in the population is a random variable. b0 and b1 are estimates of β0 and β1 obtained by minimizing the in-sample error expressed as the sum of squared residuals.

More details about the model assumptions, i.e., assumptions about the error in the data: (1) Zero-mean assumption: the error term ε is a random variable with mean E(ε) = 0. (2) Constant-variance assumption: the variance of ε is constant, regardless of the x-value. (3) Independence assumption: the values of ε are independent of each other. (4) Normality assumption: the error term ε is a normally distributed random variable. Summary: the εi are independent normal random variables, with mean 0 and constant variance.

Implications for y of the assumptions about ε: (1) From the zero-mean assumption: for each x, the mean of the y's (which contain different amounts of error) lies on the regression line. (2) From the constant-variance assumption: regardless of the x-value, the variance of the y's is constant. (3) From the independence assumption: for any x, the values of y are independent. (4) From the normality assumption: y is a normally distributed random variable.

Distributions of y at different values of x: observed y-values corresponding to predictor values x = 5, 10, and 15 are shown as samples from normal distributions with means β0 + β1x. The normal curves have exactly the same shape.

Regression models without inference: regression analysis can be applied in a purely descriptive manner, reporting r², s, high-leverage points, outliers, and influential data. These outputs are not based on assumptions about the error terms.

When do we need inference in regression? Suppose minimizing squared residuals leads to r² = 0.3. An r² value this small suggests that a linear relationship between predictor and response is not useful. Are we sure? Can a valid relationship between x and y exist when r² is small? Inference offers a systematic framework to assess the significance of the linear association between x and y.

Inference in regression, four inferential methods: (1) t-test for H0 that β1 = 0 (the attribute is not a predictor of the response); (2) confidence interval for the slope, β1; (3) confidence interval for the mean of the response, given an x-value; (4) prediction interval for a random response value, given an x-value.

t-test for a relationship between x and y: the least-squares estimate of the slope, b1, is a statistic. The sampling distribution of b1 has mean β1 and standard error σb1. An estimate of σb1 is sb1 = s / √(Σi (xi − x̄)²), where s is the standard error of the estimate. Add the calculation of sb1 to your "fit a line to data" code.

sb1 measures the variability of estimates of the slope. Small values of sb1 indicate that the estimate of the slope, b1, is precise; large values indicate that the estimate is unstable and the true value of the slope, β1, could be zero. The t-test is based on the statistic t = b1 / sb1. When the null hypothesis is true (β1 = 0), t = b1 / sb1 follows a t-distribution with n − 2 degrees of freedom. (A code sketch follows.)
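
A sketch of the slope standard error and t-statistic (tcdf is in MATLAB's Statistics and Machine Learning Toolbox; variable names continue the earlier sketches):

Sxx = sum((x - mean(x)).^2);
sb1 = s / sqrt(Sxx);                 % standard error of the slope estimate
t   = w(1) / sb1;                    % t-statistic for H0: beta1 = 0
p   = 2*(1 - tcdf(abs(t), n - 2));   % two-sided p-value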

Example from the Cereals dataset: your code should yield b1 = −2.4193 and sb1 = 0.2376, so the t-statistic is t = b1 / sb1 = −2.4193/0.2376 = −10.18. The probability of such an extreme t-value by chance alone (the p-value of the t-statistic) is very small (reported as 0.000 in the Minitab results on the next slide). Search the web for a "p-value of the t-statistic" calculator and verify this result. Reject the null hypothesis that β1 = 0.

t-test for a relationship between x and y. Example: apply the t-test to the regression of nutritional rating on sugar content. Minitab results:

The regression equation is Rating = 59.4 - 2.42 Sugars

Predictor   Coef      SE Coef   T        P
Constant    59.444    1.951     30.47    0.000
Sugars      -2.4193   0.2376    -10.18   0.000

S = 9.16160   R-Sq = 58.0%   R-Sq(adj) = 57.5%

Analysis of Variance
Source           DF   SS        MS       F        P
Regression       1    8701.7    8701.7   103.67   0.000
Residual Error   75   6295.1    83.9
Total            76   14996.8

Confidence interval for the slope of the regression line: from the p-value of the t-statistic, we have at least 95% confidence that sugar content is a predictor of the nutritional rating of cereals (we rejected the null hypothesis that the slope = 0). Now find a confidence interval on our estimate of the slope. The t-interval is based on the sampling distribution of b1: we are 100(1 − α)% confident that the true slope β1 lies within b1 ± tn−2 · sb1, where tn−2 is the appropriate percentile point of the t-distribution with n − 2 degrees of freedom (the 100(1 − α/2)th percentile, e.g. the 97.5th for a 95% interval).

Add a confidence interval for b1 to your code. Use interpolation on degrees of freedom for 95% confidence to get tdf,95% if df > 30. (If the Statistics Toolbox is available, tinv can replace the table lookup, as sketched below.)
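
A sketch of the interval calculation (tinv is in the Statistics and Machine Learning Toolbox and avoids interpolating a t-table):

tcrit   = tinv(0.975, n - 2);            % critical value for 95% confidence
slopeCI = w(1) + [-1, 1] * tcrit * sb1;  % [lower, upper] bounds on beta1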

Confidence interval for the slope, example from the Cereals dataset: b1 = −2.4193, sb1 = 0.2376. The t critical value for 95% confidence and n − 2 = 75 degrees of freedom is t75,95% ≈ 2.0. The slope estimate with its confidence interval is −2.4193 ± (2.0)(0.2376): we have 95% confidence that the true slope is between −2.8945 and −1.9441.

Confidence interval for the mean value of y given x: the regression equation estimates the value of the response variable for a given predictor value, but it does not provide a probability statement regarding the accuracy of that estimate. Probability statements about accuracy can be obtained from (1) a confidence interval for the mean value of y given x, and (2) a prediction interval for the response of a randomly chosen example.

Confidence interval for the mean value of y given x: ŷp ± tn−2,95% · s · √h(xp), where xp = the value of x for which the prediction is being made, ŷp = the regression result for x = xp, s = the standard error of the estimate, tn−2,95% = a percentile point of the t-distribution, and h(xp) = 1/n + (xp − x̄)² / Σi (xi − x̄)² is the leverage of xp. Add this calculation to your code.

Calculate a 95% confidence interval on the average distance traveled by hikers who walk 5 hours: xp = 5, ŷp = 16, s = 1.22474, n = 10, t8,95% = 2.306. Since xp equals the mean time x̄ = 5, the leverage is h(5) = 1/10, and the interval is 16 ± 2.306 · 1.22474 · √0.1 = 16 ± 0.89 km.

Prediction interval for the response of a randomly chosen example: distances walked by individual hikers are more variable than the mean of distances walked by a group of hikers, so estimates of a group's average hiking distance are more precise than estimates of an individual's hiking distance. In general, it is easier to predict the mean value of a variable than to predict its value for a randomly chosen example. In general, prediction intervals for a randomly chosen example are more useful to data miners than confidence intervals on mean values.

Prediction interval for the response of a randomly chosen example: ŷp ± tn−2,95% · s · √(1 + h(xp)). Note the similarity to the confidence interval for the mean value of y given x; at the same confidence level, prediction intervals are always wider than confidence intervals on means. Add this calculation to your code. (A sketch covering both intervals follows.)
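
A sketch of both intervals at a new predictor value xp (my own variable names, continuing the earlier sketches):

hp    = 1/n + (xp - mean(x))^2 / sum((x - mean(x)).^2);   % leverage of xp
yp    = w(1)*xp + w(2);                                   % point estimate at xp
tcrit = tinv(0.975, n - 2);
ciMean = yp + [-1, 1] * tcrit * s * sqrt(hp);       % CI on the mean response at xp
piInd  = yp + [-1, 1] * tcrit * s * sqrt(1 + hp);   % PI on an individual response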

Estimated distance traveled by a randomly selected hiker who walks for 5 hours (95% confidence): 16 ± 2.306 · 1.22474 · √(1 + 0.1) = 16 ± 2.96 km.

Summary of the CI and PI for hikers who walk 5 hours: we are 95% confident that the average distance traveled by hikers who walk 5 hours is between 15.11 and 16.89 km, and 95% confident that the distance traveled by a randomly chosen hiker who walks 5 hours is between 13.04 and 18.96 km.

Verifying regression assumptions; review of in-silico populations:
- Choose values of β0 and β1 and a range of predictor values xL < x < xU
- Choose 1,000,000 values of x uniformly distributed between xL and xU
- Choose 1,000,000 values of ε from a normal distribution with zero mean and given variance
- Generate 1,000,000 response values
- Randomly choose 100 records from the in-silico population as a sample dataset; generate estimates b0 and b1 of β0 and β1 by minimizing in-sample error
- Repeat with 100 randomly chosen datasets
- We then have 100 samples of b0 and b1 from which we can make statistical inference about the uncertainty in parameter estimates and prediction of the response
- From each dataset we have 100 residuals at the optimal values of b0 and b1 that mainly reflect the value of ε in each record
Linear regression is a model of data based on the assumption that the dataset has statistics like the 100 records drawn from our in-silico population. (A sketch of this experiment follows.)
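
A sketch of the in-silico experiment (the parameter values and noise level are my own example choices):

% Build an in-silico population and study the sampling distribution of b1.
beta0 = 6; beta1 = 2;                 % chosen "true" parameters
xL = 0; xU = 10; N = 1e6;
xpop = xL + (xU - xL)*rand(N, 1);     % uniform predictor values
ypop = beta0 + beta1*xpop + 0.5*randn(N, 1);   % zero-mean normal error (sd 0.5 assumed)
b1samples = zeros(100, 1);
for k = 1:100                         % 100 datasets of 100 records each
    idx = randi(N, 100, 1);
    A   = [xpop(idx), ones(100, 1)];
    wk  = A \ ypop(idx);              % least-squares fit to this sample
    b1samples(k) = wk(1);
end
histogram(b1samples)                  % empirical sampling distribution of the slope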

Verifying regression assumptions: linear regression is a model of data based on the assumption that the dataset has statistics like the 100 records drawn from our in-silico population. The only handle we have to test this assumption is the residuals at the optimum choice of parameters: normally distributed residuals are evidence for the validity of the model assumptions. In most cases, the residuals (errors) in a 1D linear regression model are due to effects on the response of attributes other than the predictor in the model.

Verifying regression assumptions: results from a linear-regression model cannot be trusted unless the assumptions of the model are verified against the distribution of the minimized residuals. Two graphical methods of verification are discussed in the text: (1) a normal probability plot of the residuals, and (2) a plot of standardized residuals against predicted values. You are only responsible for the second method. Add a scatter plot of standardized residuals vs. predicted response to your code. (A plotting sketch follows.)
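
A sketch of the requested plot (yline needs MATLAB R2018b or later; otherwise draw the zero line with plot):

scatter(yfit, stdRes)          % standardized residual vs. fitted value
xlabel('predicted response')
ylabel('standardized residual')
yline(0)                       % reference line at zero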

Review of standardized residuals: leverage of the ith observation, hi = 1/n + (xi − x̄)² / Σj (xj − x̄)²; standard error of the ith residual, s·√(1 − hi); standardized residual, (yi − ŷi) / (s·√(1 − hi)).

Plot standardized residuals against predicted values. Example: regression of distance vs. time for the hiker data set. A discernible pattern would indicate that the assumptions about the error are not true; there are too few data points to make a determination in this case.

Under the assumptions of linear regression models, residuals are an indication of the error in the data. Four types of standardized-residual vs. fit plots (panels A-D): (A) no pattern, suggesting the assumptions about the error are valid; (B) a functional relationship between errors at different levels of the response (i.e., not independent); (C) the variance of the error increases as the response increases; (D) the mean of the error increases as the response increases.

FYI, diagnostic tests to verify regression assumptions: the Anderson-Darling test (are the residuals normally distributed?); Bartlett's or Levene's test (do the residuals have constant variance?); the Durbin-Watson or runs test (are the residuals independent?). You are not responsible for these tests.

Example: Baseball data set, regression of the number of home runs against batting average. Players with fewer than 100 at-bats are excluded; the shortened data set has 209 records. Apply your linear fit code to this data set and test your result by comparison with Table 2.14, text p. 71. Make a scatter plot of standardized residual vs. fit and compare your plot to Figure 2.16, text p. 70.

Regression of home runs vs. batting average: t-stat = 7.9, p-value < 0.0005 (reported as 0.000). Even though r² is small, the p-value of the t-statistic indicates high confidence that the true slope is not zero. Can we believe these results? Apply the graphical methods above to explore the validity of the regression-model assumptions.

Graphical tests of the regression-model assumptions: the probability plot indicates that the distribution of residuals is right-skewed, so the normality assumption is violated. The plot of standardized residuals vs. fit shows a "funnel" pattern, so the constant-variance assumption is violated. Confidence limits on the true slope therefore cannot be trusted. Try to improve confidence in the regression by a transformation to ln(home runs).

Regression of ln(home runs) vs. batting average: my results differ slightly from Table 2.15, text p. 73, where b0 = −0.661 and b1 = 11.6; I have 201 records after eliminating cases with home runs = 0, while the text has 208 records. Can we have more confidence in these results, obtained after a natural-log transformation?

Graphical tests of regression-model assumptions: home runs vs. ln(home runs). Not perfect, but significantly improved.

Accepting that the model assumptions are valid: exponentiating my standard error of the estimate gives e^s = 2.31, a typical multiplicative error in the home runs predicted by regression on batting average (text: 1.96). My coefficient of determination is r² = 21.9% (text: 23.8%), indicating that batting average accounts for about 20% of the variability in ln(home runs) of players with more than 100 at-bats; other attributes, like size, strength, and number of at-bats, affect a player's ability to hit home runs. The correlation coefficient is r = +√0.219 ≈ 0.47: home runs have a weak positive correlation with batting average.

Accepting that the model assumptions are valid: with my b1 = 13.64 and sb1 = 1.826, the t-statistic is 7.47 (text: 8.04). The p-value, P(|t| > 7.47), is ≈ 0.000: better than 95% confidence that batting average is a predictor of home runs (i.e., confidence that the slope is not zero). My 95% confidence interval on the slope is (10.0, 17.2) (text: (8.73, 14.4)).

Accepting that the model assumptions are valid: with my regression line, a player with a 0.3 batting average is expected to hit e^2.64 ≈ 14.0 home runs. My 95% confidence interval on the mean number of home runs hit by players with a batting average of 0.3 is (e^2.4667, e^2.8387) = (11.8, 17.1); text: (e^2.6567, e^2.9545) = (14.25, 19.19). My 95% prediction interval on the number of home runs hit by a random player with a batting average of 0.3 is (2.72, 74.1), too wide to be useful; text: (e^1.4701, e^4.1411) = (4.35, 62.87), also too wide to be useful.

Accepting that the model assumptions are valid: I find 9 outliers in the dataset (text: 7), all with a low number of home runs.

Accepting that the model assumptions are valid: I find 7 high-leverage points in the dataset (text: 7).

Last part of Assignment 3: do the regression of ln(home runs) vs. batting average. Tell me if I screwed up.

California data set, an example of skewed data: the California data set includes census information for 858 towns/cities. Do towns with a high fraction of senior citizens tend to be small or large towns? In a scatter plot of the percentage over 64 against population, the values for large cities push the data against the y-axis; spread the data out with a ln(population) transformation.

Regression of % seniors vs. ln(population): the probability plot of standardized residuals shows deviation from a normal distribution, supported by the Anderson-Darling test, whose small p-value rejects the hypothesis that the residuals are normally distributed. The plot of standardized residuals vs. fit shows a "funnel" effect: the variance of the residuals is not constant. Try a regression of ln(% seniors) vs. ln(population).

Regression of ln(% seniors) vs. ln(population): the plot of standardized residuals vs. fits shows less of a "funnel" effect. Eight of the nine outliers with a low % of seniors are towns with military installations. Exclude the outliers and continue the analysis.

Transformations to achieve linearity: points vs. frequency in Scrabble® is non-linear. [Scatter plot of Scrabble point values vs. letter frequency; points labeled by letter groups such as A, I / N, R, T / L, S, U / G / B, C, M, P / F, H, V, W, Y / J, X / Q, Z / K / D.] Apply Frederick Mosteller and John Tukey's "bulging rules" to find a transformation that achieves linearity.

Transformations to achieve linearity: the shape of points vs. frequency is like the curve in the lower-left quadrant of the bulging-rule diagram, which suggests transformations of both x and y that move down the ladder of powers (powers to the left of t¹, i.e., below 1). A square root is not enough; try ln(t).

Transformations to achieve linearity: the regression of ln(points) vs. ln(frequency) has r² = 87.6%. The standard error of the estimate is s = 0.293745 on the log scale, i.e., e^s = 1.34 points as a typical multiplicative error. The fit estimates a letter with frequency 4 to be worth 1.72 points; the actual values are 1 or 2 points. "E" is flagged as an outlier.

Box-Cox transformations to achieve linearity: choose a mesh of λ values. For each λ, regress f(y) = (y^λ − 1)/λ on x if λ ≠ 0, or f(y) = ln(y) on x if λ = 0. Plot the sum of squared residuals vs. λ and use the value of λ that gives the smallest sum of squared residuals. (A sketch follows.)
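
A sketch of this search (my own loop; assumes column vectors x and y with y > 0, and compares raw residual sums across λ as the slide prescribes):

lambdas = linspace(-2, 2, 41);
ssr = zeros(size(lambdas));
A = [x, ones(size(x))];
for k = 1:numel(lambdas)
    lam = lambdas(k);
    if lam == 0
        fy = log(y);                   % Box-Cox limit at lambda = 0
    else
        fy = (y.^lam - 1)/lam;         % Box-Cox transform of the response
    end
    wk = A \ fy;                       % regress transformed y on x
    ssr(k) = sum((fy - A*wk).^2);      % sum of squared residuals at this lambda
end
plot(lambdas, ssr)                     % choose lambda at the minimum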

Data with error bars: more common in the physical than the social sciences. Data with small error should have more influence on the fit. Pure calculus approach: change the objective function to f(a,b) = Σk [(a·xk + b − yk)/Δyk]², weighting each squared residual by the inverse square of its uncertainty.

With m+1 data points indexed from zero, the same normal equations hold, with the angle brackets now denoting weighted averages (weights 1/Δyk²): a⟨x²⟩ + b⟨x⟩ = ⟨xy⟩ and a⟨x⟩ + b = ⟨y⟩.

Matrix formulation of weighted linear least squares:
- Let Δy be a column vector of uncertainties in the measured values
- w = 1./Δy is a column vector of the square roots of the weights
- W = diag(w) is a diagonal matrix
- V = the coefficient matrix of the unweighted least-squares problem (the Vandermonde matrix in the case of polynomial fitting)
- y = the column vector of observations
- A = W·V = the weighted coefficient matrix
- b = W·y = the weighted column vector of measured values
- Ax = b is the over-determined linear system for weighted linear least squares
- The normal equations AᵀAx = Aᵀb become (WV)ᵀWVx = (WV)ᵀWy (note that w = 1./Δy gets squared at this point)

Fit a weighted parabola to data on surface tension vs. temperature (a runnable version of this slide's outline; the V matrix, fit evaluation, and plot commands are filled in):

T  = [0, 10, 20, 30, 40, 80, 90, 95]';                    % temperatures (column vector)
S  = [68.0, 67.1, 66.4, 65.6, 64.6, 61.8, 61.0, 60.0]';   % surface tension
dS = [6, 2, 5, 3, 7, 8, 4, 1]';            % note the small uncertainty in the last point
weights = 1 ./ dS;                         % actually the square roots of the weights
W  = diag(weights);
V  = [T.^2, T, ones(size(T))];             % Vandermonde matrix for a parabola
weightedV = W*V;  weightedS = W*S;
x  = (weightedV'*weightedV) \ (weightedV'*weightedS);   % normal equations (weights squared here)
Sfit = V*x;                                % evaluate the fit at the data points
R = Sfit - S;                              % residuals
ssr = sum(R.^2);                           % sum of squared residuals
errorbar(T, S, dS, 'o'); hold on           % plot the data with error bars
plot(T, Sfit, '-'); hold off               % and the fit on the same axes

Plot using MATLAB's "errorbar" plotting function: the fit bends slightly to go through the last data point, which has the smallest uncertainty.

Tuning polynomial regression in 1D: the trend in the data seems to be parabolic, but could I get a better fit with a polynomial of higher degree? How do I know where to stop?

Polynomial regression of degree k with N data points: solve VᵀVw = Vᵀb for the k+1 coefficients w. Then yfit = Vw are the values of the fit at the data locations, and R = yfit − b are the residuals at the data points.

Find the best polynomial fit for sin(x) + normally distributed noise: divide the data into training and validation sets, fit polynomials of degree 1-8 to the training data, and look for the "elbow" in the validation error, which indicates the best degree of polynomial. Then use all the data to refine the coefficients of the chosen cubic. (A sketch of the experiment follows.)
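
A sketch of this experiment (the sample size, noise level, and split are my own choices):

x = linspace(0, 2*pi, 60)';
y = sin(x) + 0.2*randn(size(x));           % sin(x) plus normal noise
idx = randperm(60);
tr = idx(1:40);  va = idx(41:60);          % training / validation split
valErr = zeros(8, 1);
for d = 1:8
    p = polyfit(x(tr), y(tr), d);          % fit a degree-d polynomial to training data
    valErr(d) = mean((y(va) - polyval(p, x(va))).^2);   % validation error
end
plot(1:8, valErr, 'o-')                    % look for the elbow
% After choosing the degree, refit with all the data, e.g. p = polyfit(x, y, 3);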

Quiz #3 on 10-26-17; review for the quiz on 10-24-17. Look over the questions in the text at the end of chapter 2. Ask about anything you have doubts about. Be prepared to answer questions.

Review questions: (1) Seven items that can be reported from "descriptive" regression in 1D: b0, b1, r², s, outliers, high-leverage points, and influential points. (2) Four items that can be reported from inference on regression in 1D: a test of H0 that the slope = 0, a confidence interval on the slope, a confidence interval on the average response at a given predictor value, and a prediction interval on the response at a randomly chosen predictor value. (3) Why do we investigate the distribution of standardized residuals before considering inference on regression? (4) What do we conclude about the distribution of standardized residuals from plots with shapes A, B, C, and D?

Example of an influential observation: the 11th hiker (10, 23) has Cook's distance = 0.82, which is in the 62nd percentile of F(1,10). Explain, based on the table below. If the criterion for an influential observation is a Cook's distance in at least the 50th percentile of the F distribution, what is the threshold with 1 and 10 degrees of freedom?

Percentiles of the F(1,10) distribution:
A      Fstat
.50    0.490
.62    0.82
.90    3.29
.95    4.96

Matching exercise, regression terms and definitions (the leading letter of each definition is the answer key, i.e., the term it defines):

a. Influential observation    E >> Measures the typical difference between the predicted response value and the actual response value.
b. SSE                        H >> Represents the total variability in the values of the response variable alone, without reference to the predictor.
c. r²                         I >> An observation which has a very large standardized residual in absolute value.
d. Residual                   G >> Measures the strength of the linear relationship between two quantitative variables, with values ranging from −1 to 1.
e. s                          A >> An observation which significantly alters the regression parameters based on its presence or absence in the data set.
f. High leverage point        K >> Measures the level of influence of an observation, by taking into account both the size of the residual and the amount of leverage for that observation.
g. r                          B >> Represents an overall measure of the error in prediction resulting from the use of the estimated regression equation.
h. SST                        F >> An observation which is extreme in the predictor space, without reference to the response variable.
i. Outlier                    J >> Measures the overall improvement in prediction accuracy when using the regression as opposed to ignoring the predictor information.
j. SSR                        D >> The vertical distance between the predicted response and the actual response.
k. Cook's Distance            C >> The proportion of the variability in the response that is explained by the linear relationship between the predictor and response variables.

2.9, p. 88: a regression with very low r². What statistic in Table 2.11 (text p. 57) suggests we might get useful results from the regression? Hint: when the null hypothesis is true (β1 = 0), t = b1 / sb1 follows a t-distribution with n − 2 degrees of freedom. The test does not tell us the value of n.

2.12, p. 88: if H0 is true, a statistic that follows the F(1,5) distribution has a value of 5. Can I reject H0 with 95% confidence?

2.19, p. 90: based on the Minitab output (not reproduced in this transcript), what are b0, b1, r², r, s, sb1, the 95% confidence interval on the slope, etc.?

2.20, p. 90: based on a scatter plot of the data, which bulging rule should be applied to attempt a transformation to linearity?

A question from the textbook: "A colleague would like to use linear regression to predict whether or not customers will make a purchase, based on some predictor variable. What would you explain to your colleague?" Answer: "Because the decision to purchase is categorical (yes or no) rather than having a continuous response value, applying a linear regression model would not be appropriate in this situation." Actually, this is not entirely the case: regression is frequently used for classification.

How does minimizing the in-sample error, expressed as the sum of squared residuals, lead to the normal equations?