Multiple Linear Regression

1 Multiple Linear Regression
Simple linear regression examines the relationship between a single predictor and the response. Multiple regression models the relationship between a set of predictors and a single response. Multiple linear regression uses a plane or hyperplane to approximate the relationship between a set of predictors and a single response. Predictors and response are usually continuous; categorical predictors are not excluded, and the response can be class labels when regression is used for classification. Discovering Knowledge in Data: Data Mining Methods and Models, by Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

2 All linear models optimize a linear combination of attributes and a bias node with a value of one.
For convenience, we include the bias node as the x0 component of the attribute vector x; x0 always equals 1. x1, x2, …, xm are attributes (possible predictors of the response, y). It is most convenient to represent the linear combination of attributes and bias as the dot product wTx, where w0 is the weight of the bias node (the y-intercept of the regression line in 1D linear regression).
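The bias-augmented dot product can be sketched as follows (a minimal NumPy sketch; the attribute and weight values are made up for illustration):

```python
import numpy as np

# Attribute vector with the bias node x0 = 1 prepended (values are illustrative)
x = np.array([1.0, 3.0, 5.0])    # [x0, x1, x2]
w = np.array([2.0, 0.5, -1.0])   # [w0, w1, w2]; w0 is the intercept

# The linear model's output is the dot product wTx
y_hat = w @ x
print(y_hat)  # 2.0 + 0.5*3.0 - 1.0*5.0 = -1.5
```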

3 Polynomial Regression: degree 1 with N data points
Review: 1D linear regression (fit a line to data). The xk are attribute vectors with the bias node = 1; hence, V is renamed X. Solve XTXw = XTy for w1 and w0. yfit = Xw. Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © 2010 The MIT Press (V1.0)
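The degree-1 normal-equations solve can be sketched as below (NumPy standing in for the course's MATLAB; the five data points are made up and lie near y = 2x + 1):

```python
import numpy as np

# Made-up (x, y) data near the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix X: a bias column of ones next to the attribute column
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations XTXw = XTy for w = [w0, w1]
w = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ w   # fitted line evaluated at the data points
print(w)        # [intercept, slope], slope close to 2
```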

4 2D linear regression: fit a plane to data
Solve XTXw = XTy for w2, w1, and w0. Each row of the matrix X is a vector that combines x0 = 1 with the 2 attributes from a record in the dataset. yfit = Xw. In general, {(y1, x1), (y2, x2), …, (yN, xN)} is a dataset with N examples in which 2 attributes determine the value of y.

5 dD linear regression: fit a hyperplane to data
Solve XTXw = XTy for w0, w1, …, wd. A hyperplane is not easy to illustrate. In general, {(y1, x1), (y2, x2), …, (yN, xN)} is a dataset with N examples in which d attributes determine the value of y. Each row of the matrix X is a vector that combines x0 = 1 with the d attributes from a record in the dataset. yfit = Xw.

6 Regardless of the value of d, optimum weights are determined by minimizing in-sample error
expressed as the sum of squared residuals. Setting the gradient of the in-sample error to zero gives the “normal equations”, a linear system of size (d+1). To solve the normal equations, define A = XTX, b = XTy and solve Aw = b by any efficient method (w = A\b in MATLAB). yfit = Xw are the values of the fit at the data points; y − yfit are the residuals at the data points. Standardized residuals plotted as a function of yfit are a visual method for testing the assumptions about error in the data that underlie a regression model. Write a code for regression of rating vs sugars and fiber with records from the Cereals dataset.
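The exercise on this slide can be sketched as below. NumPy stands in for the course's MATLAB, and the six records are made-up stand-ins, not actual Cereals data:

```python
import numpy as np

# Made-up stand-in data (the real exercise uses the Cereals dataset)
sugars = np.array([6.0, 8.0, 5.0, 0.0, 3.0, 12.0])
fiber  = np.array([2.0, 0.0, 3.0, 4.0, 1.0, 0.0])
rating = np.array([45.0, 35.0, 50.0, 68.0, 48.0, 28.0])

# Design matrix with bias column; A = XTX, b = XTy (MATLAB: w = A\b)
X = np.column_stack([np.ones_like(sugars), sugars, fiber])
w = np.linalg.solve(X.T @ X, X.T @ rating)   # [b0, bs, bf]

y_fit = X @ w               # values of the fit at the data points
residuals = rating - y_fit  # residuals at the data points
print(w, residuals)
```

Because w solves the normal equations, the residual vector is orthogonal to every column of X, which is a convenient correctness check.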

7 MATLAB code for regression of rating vs sugars and fiber
My results for b0, bs, and bf are slightly different from the values on the next slide, probably due to a difference in datasets. Note: the code is easily changed to include more predictors of rating.

8 Example of Multiple Linear Regression: Cereals dataset, nutritional rating vs sugars and fiber
Write a code for multiple linear regression and test it using the Cereals dataset. Visualization is still possible in 2D but not very useful for showing the quality of fit. A 3D plot shows the tilt of the plane, which agrees with the signs of the optimum slopes of the 2 predictors.

9 Cereals dataset, nutritional rating vs sugars and fiber
A slice perpendicular to a predictor axis shows the regression line that results when one predictor is held constant. Projecting data into that plane gives a scatter plot that may be misleading regarding the quality of fit.

10 Residual equals vertical distance between data point and regression plane
Example: Spoon Size Shredded Wheat has x1 = 0 g sugars, x2 = 3 g fiber, and rating y. fit = b0 + bs(0) + bf(3), and residual = (y − fit).

11 As in 1D regression
Add calculation of R2 and s to your multivariate regression code. Compare to R2 = 80.8% and s = 6.24 for the regression of rating vs sugars and fiber. How are R2 and s interpreted in a regression model?
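One way to compute both diagnostics is sketched below: R2 is the fraction of variability in y explained by the fit, and s (the standard error of the estimate) gauges the typical size of a residual in units of y. The Cereals data are not included here, so the check uses a made-up perfect fit:

```python
import numpy as np

def r2_and_s(y, y_fit, m):
    """R2 and standard error of estimate s for a fit with m predictors."""
    n = len(y)
    sse = np.sum((y - y_fit) ** 2)      # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - sse / sst                # fraction of variability explained
    s = np.sqrt(sse / (n - m - 1))      # typical residual size, in y units
    return r2, s

# Made-up sanity check: a perfect fit gives R2 = 1 and s = 0
y = np.array([1.0, 2.0, 3.0, 4.0])
r2, s = r2_and_s(y, y, m=2)
print(r2, s)  # 1.0 0.0
```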

12 R2 always increases when an additional predictor is included
Where the new predictor is useful, R2 increases substantially; otherwise, R2 may increase by a small or negligible amount. Recall the simple linear regression of rating on sugars, where R2 = 58%. Therefore, adding fiber to the regression model accounts for an additional 80.8% − 58% = 22.8% of the variability in rating.

13 s can be lower or higher when more predictors are added
Recall the simple linear regression of rating on sugars, where s = 9.16; s = 6.24 for the multiple regression model with sugars and fiber. Therefore, including fiber in the model decreases s by (9.16 − 6.24) = 2.92 rating points. In general, the change in s reflects the usefulness of the new predictor: if useful, s decreases; if not useful, s may increase. Therefore, s is more suitable than R2 for determining whether an additional predictor should be added to the model.

14 As in 1D regression, outlier has standardized residual > 2
Add identification of outliers to your multivariate regression code, using hi = 1/n. Compare to the outliers identified in rating vs sugars and fiber: (8) Spoon Size Shredded Wheat, (41) Frosted Mini-Wheats, (76) Golden Crisp. All residuals are positive, indicating the actual nutritional rating is higher than the regression estimate, given sugars and fiber.
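Outlier flagging with the slide's leverage approximation hi = 1/n can be sketched as below (made-up y and fitted values; the last record is deliberately far from the fit):

```python
import numpy as np

def flag_outliers(y, y_fit, m):
    """Indices of records whose standardized residual exceeds 2 in magnitude.
    The leverage is approximated as h_i = 1/n, as on this slide."""
    n = len(y)
    resid = y - y_fit
    s = np.sqrt(np.sum(resid ** 2) / (n - m - 1))   # standard error of estimate
    std_resid = resid / (s * np.sqrt(1.0 - 1.0 / n))
    return np.where(np.abs(std_resid) > 2)[0]

# Made-up data: one point sits far above an otherwise tight fit
y     = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 16.0])
y_fit = np.array([1.1, 1.9, 3.0, 4.1, 4.9, 6.0])
print(flag_outliers(y, y_fit, m=1))  # flags index 5
```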

15 Multivariate regression based on same assumptions as 1D regression
Multivariate regression assumes the response in the population is described by y = β0 + β1x1 + β2x2 + … + βmxm + ε, where β0, β1, …, βm are model parameters whose true values remain unknown. Their values are estimated from a sample dataset using linear least squares. ε is the error in a record, which we assume is reflected in the residual of the regression estimate of that record.

16 As in 1D regression Four Error Term ε Assumptions:
(1) Zero Mean Assumption: the error term ε is a random variable with mean E(ε) = 0.
(2) Constant Variance Assumption: the variance of ε is constant, regardless of the values of x1, x2, …, xm.
(3) Independence Assumption: the values of ε are independent.
(4) Normality Assumption: the error term ε is a normally distributed random variable.

17 As in 1D regression Implications for Behavior of Response Variable y:
(1) Based on the Zero Mean Assumption: for each set of values of x1, x2, …, xm, the mean of the y’s lies on the regression surface.
(2) Based on the Constant Variance Assumption: regardless of the values of x1, x2, …, xm, the variance of the y’s is constant.

18 As in 1D regression (3) Based on: Independence Assumption
For any set of values of x1, x2, …, xm, the values of y are independent.
(4) Based on the Normality Assumption: y is a normally distributed random variable.
Summary: the values of yi in the population are independent normal random variables, with mean = β0 + β1x1 + β2x2 + … + βmxm and variance = σ2.

19 Inference in Multiple Regression
Five inferential methods:
(1) t-test for the relationship between the response y and a specific predictor xi, in the presence of the other predictors x1, x2, …, xi−1, xi+1, …, xm
(2) F-test for the significance of the entire regression equation
(3) Confidence interval for βi, the slope of the ith predictor
(4) Confidence interval for the mean of y, given a set of values for the predictors x1, x2, …, xm
(5) Prediction interval for a random value of the response y, given a set of values for the predictors x1, x2, …, xm

20 T-test for Relationship Between xi and y
Hypothesis test for the model: Ho: βi = 0; Ha: βi ≠ 0. Under Ho, the model omits the ith term; under Ha, it includes it. The only difference between the models is the absence/presence of the ith term. Under the null hypothesis, t = bi/sbi follows a t-distribution with n − m − 1 degrees of freedom, where sbi is the standard error of the slope of the ith predictor.
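The t-statistics can be computed as below. The standard errors come from the diagonal of s2(XTX)−1, the usual least-squares formula (not spelled out on this slide); the data are a made-up strong linear trend, so the slope's t-statistic comes out large:

```python
import numpy as np

def t_statistics(X, y):
    """Coefficients and t = b_i / s_{b_i}; X must include the bias column.
    Standard errors come from the diagonal of s^2 (X^T X)^{-1}."""
    n, p = X.shape                        # p = m + 1 (bias plus m predictors)
    w = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ w
    s2 = np.sum(resid ** 2) / (n - p)     # n - m - 1 degrees of freedom
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return w, w / se

# Made-up data with a strong trend: small noise around y = 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
X = np.column_stack([np.ones_like(x), x])
w, t = t_statistics(X, y)
print(t)  # the slope's t-statistic is large, so H0: slope = 0 would be rejected
```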

21 T-test for Relationship Between rating and sugars
Ho: β1 = 0, model: y = β0 + β2(fiber) + ε. Ha: β1 ≠ 0, model: y = β0 + β1(sugars) + β2(fiber) + ε. b1 = −2.2090 and sb1 = 0.1633. t-statistic t = b1/sb1 = −2.2090/0.1633 = −13.53. P-value = p(t-statistic at least as extreme as −13.53 by chance alone) ≈ 0.000. The null hypothesis is rejected with high confidence when the p-value is this small. We are confident the regression provides evidence of a linear relationship between rating and sugars, in the presence of fiber. Add calculation of sbs and the t-statistics to your code.

22 T-test for Relationship Between rating and fiber
Ho: β2 = 0, model: y = β0 + β1(sugars) + ε. Ha: β2 ≠ 0, model: y = β0 + β1(sugars) + β2(fiber) + ε. sb2 = 0.3032 and t-statistic t = b2/sb2 = 9.37. P-value = p(t-statistic at least as extreme as 9.37 by chance alone) ≈ 0.000. The null hypothesis is rejected with high confidence when the p-value is this small. We are confident the regression provides evidence of a linear relationship between rating and fiber, in the presence of sugars. Add calculation of sbf and the t-statistics to your code.

23 F-test for Significance of Overall Regression Model
Example: a model with three predictors x1, x2, and x3. Three separate t-tests examine the linear relationship between the response and each predictor when the other 2 are held constant. The F-test examines the linear relationship between the response and the whole set of predictors {x1, x2, x3}. Hypotheses for the F-test: Ho: β1 = β2 = … = βm = 0; Ha: at least one βi ≠ 0.

24 F-test for Significance of Overall Regression Model
The F-statistic is a ratio of two mean squares, MSR/MSE, where each mean square equals a sum of squares divided by its degrees of freedom. The ANOVA table is a standardized summary of regression statistics. Note that the F-statistic has different degrees of freedom associated with the numerator and denominator.
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F
Regression | SSR | m | MSR = SSR/m | MSR/MSE
Error (or Residual) | SSE | n − m − 1 | MSE = SSE/(n − m − 1) |
Total | SST | n − 1 | |
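The ANOVA decomposition SST = SSR + SSE and the resulting F-statistic can be sketched as below (made-up near-linear data, so the fit is strong and F is large):

```python
import numpy as np

def anova_f(y, y_fit, m):
    """F = MSR/MSE; each mean square is a sum of squares over its degrees of freedom."""
    n = len(y)
    ssr = np.sum((y_fit - y.mean()) ** 2)   # regression sum of squares, m dof
    sse = np.sum((y - y_fit) ** 2)          # error sum of squares, n - m - 1 dof
    return (ssr / m) / (sse / (n - m - 1))

# Made-up data close to y = 2x + 1; fit a line via the normal equations
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.1, 6.9])
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
F = anova_f(y, X @ w, m=1)
print(F)  # large F: confident of a linear relationship
```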

25 F-test for Significance of Overall Regression Model
Hypotheses for the F-test: Ho: β1 = β2 = … = βm = 0; Ha: at least one βi ≠ 0. If Ho is true, yhat ≈ ybar and SSR is small, but n >> m, so MSR ≈ MSE, and F is about 1. If Ha is true, SSR is at least as large as SSE, so MSR >> MSE, and F >> 1.

26 F-test for Significance of Overall Regression Model
The shape of the Fm, n−m−1 distribution varies somewhat depending on the degrees of freedom (m and n − m − 1). Nevertheless, if Fm, n−m−1 is large, the cumulative area to its left is near 1, and the area in the tail is small. Hence, the p-value for a large Fobs = MSR/MSE will be small. Therefore, reject Ho: we are confident of a linear relationship between the response and at least one predictor.

27 F-test for relationship between rating and {sugars, fiber}
Ho: β1 = β2 = 0, model: y = β0 + ε. Ha: at least one βi ≠ 0, where the model is y = β0 + β1(sugars) + ε, or y = β0 + β2(fiber) + ε, or y = β0 + β1(sugars) + β2(fiber) + ε. MSE = 38.9 and F = MSR/38.9. Degrees of freedom: m = 2 and n − m − 1 = 74. P-value = p(F2,74 > Fobs) ≈ 0. Reject Ho: we are confident of a linear relationship between rating and the set of predictors, sugars and fiber. Add calculation of the F-statistic to your multivariate code.

28 Confidence Interval for Particular Coefficient, βi
A 100(1 − alpha)% confidence interval for the coefficient βi: bi is the estimate of βi with standard error sbi; tcritical is based on n − m − 1 degrees of freedom and the desired confidence. We are 100(1 − alpha)% confident the true slope βi lies within bi ± tcritical · sbi. Example: construct a 95% confidence interval for βs, for sugars. bs = −2.2090 and sbs = 0.1633. T-critical value t74,95% ≈ 2.0. 95% confidence interval = −2.2090 ± (2.0)(0.1633) = (−2.54, −1.88). Add confidence intervals on the slopes to your multivariate code.
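The interval bi ± tcritical · sbi can be sketched directly, reusing this slide's worked numbers for the sugars coefficient:

```python
def coef_ci(b, sb, t_crit):
    """100(1 - alpha)% confidence interval b_i ± t_crit * s_{b_i}."""
    return b - t_crit * sb, b + t_crit * sb

# The slide's example: bs = -2.2090, sbs = 0.1633, t_{74, 95%} ~ 2.0
lo, hi = coef_ci(-2.2090, 0.1633, 2.0)
print(round(lo, 2), round(hi, 2))  # -2.54 -1.88
```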

29 Interpretation of Confidence Interval on βs
We are 95% confident βs lies between −2.54 and −1.88. Suppose a researcher claims rating falls two points for every additional gram of sugar, when fiber is held constant. Because −2.0 lies within the 95% confidence interval, we would not reject this assertion, with 95 percent confidence.

30 MATLAB code for analysis of the regression of rating vs sugars and fiber
Note: m = 2, n = number of rows. Note: hi is approximated by 1/n.

31 Assignment 4
Write a code for regression of rating vs sugars and fiber with records from the Cereals dataset. Report b0, bs, bf, R2, s, the F-statistic, outliers, sbs, sbf, t74,95%, and confidence intervals on bs and bf.

32 Confidence Interval for mean value of y, given x1, x2, …, xm
An extension of the 1D formula to the multivariate case is not given in the text (see Draper and Smith); you are not responsible for this diagnostic. In Table 3.1 (text p. 95), see “Values of Predictors for New Observations”, where the mean value of y is found for a cereal with sugars = 5 g and fiber = 5 g. We are 95% confident the mean rating for all cereals with sugars = 5 g and fiber = 5 g lies in (52.709, …).

33 Prediction Interval for Randomly Chosen Value of y, Given x1, x2, …, xm
Again, the formulas are not given and you are not responsible for them. In Table 3.1 (text p. 95), see “95% PI” for the Minitab results, giving the interval in which we are 95% confident the rating for a randomly chosen cereal with sugars = 5 g and fiber = 5 g lies. As expected, the prediction interval is wider than the confidence interval, given the same confidence level.

