Multiple regression analysis


Week 4: Multiple regression analysis

More general regression model

Consider one Y variable and k independent variables Xj, e.g. X1, X2, X3. The data consist of n tuples (yi, xi1, xi2, xi3). Scatter plots show linear association between Y and the X-variables. The observations on Y can be assumed to satisfy the following model:

yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + ei

where β0 + β1 xi1 + β2 xi2 + β3 xi3 is the prediction and ei is the data error.

Assumptions on the regression model

- The relationship between the Y-variable and the X-variables is linear.
- The error terms ei (measured by the residuals):
  - have zero mean (E(ei) = 0),
  - have the same standard deviation σe for each fixed x,
  - are approximately normally distributed (typically true for large samples!),
  - are independent (true if the sample is a S.R.S.).

Such assumptions are necessary to derive the inferential methods for testing and prediction (to be seen later)! WARNING: if the sample size is small (n < 50) and the errors are not normal, you can't use regression methods!
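
In practice these assumptions are checked by inspecting the residuals. Below is a minimal SAS sketch (assuming the cpu data set and model introduced later in these slides, and SAS/GRAPH's PROC GPLOT as used in the code slide at the end): it saves the residuals and fitted values, plots them to check the zero-mean and constant-spread assumptions, and runs normality tests.

proc reg data=cpu;
   model time = linet step device;
   output out=diag r=ehat p=yhat;   /* save residuals and fitted values */
run;
quit;

/* residuals vs fitted values: look for a patternless band around zero */
proc gplot data=diag;
   plot ehat*yhat;
run;

/* normality tests and moments for the residuals */
proc univariate data=diag normal;
   var ehat;
run;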

Parameter estimates

Suppose we have a random sample of n observations on Y and on p−1 independent X-variables. How do we estimate the values of the coefficients β's in the regression model of Y versus the X-variables?

Regression parameter estimates

The parameter estimates are those values of the β's that minimize the sum of the squared errors:

SSE = Σi [yi − (β0 + β1 xi1 + … + βk xik)]²

Thus the parameter estimates b0, b1, …, bk are those values of the β's that will make the model residuals as small as possible! The fitted model used to compute predictions for Y is

ŷ = b0 + b1 x1 + … + bk xk

Using Linear Algebra for model estimation (section 12.9)

Let β = (β0, β1, …, βp)ᵀ be the parameter vector for the regression model in p variables. The parameter estimates for the betas can be efficiently found using linear algebra as

b = (XᵀX)⁻¹ XᵀY

where X is the data matrix for the X-variables (with a leading column of 1s for the intercept) and Y is the data vector for the response variable. Hard to compute by hand – better use a computer! Here Xᵀ is the transpose of the matrix X and A⁻¹ denotes the inverse of a matrix A.
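
As an illustration, the matrix formula can be evaluated directly with SAS/IML. This is only a sketch, assuming SAS/IML is licensed and the cpu data set from the code slide at the end has been created; in real work PROC REG does this for you, and solve(X`*X, X`*Y) is numerically safer than an explicit inverse.

proc iml;
   use cpu;
   read all var {linet step device} into Z;   /* predictor columns */
   read all var {time} into Y;                /* response column */
   close cpu;
   X = j(nrow(Z), 1, 1) || Z;                 /* prepend a column of 1s for the intercept */
   b = inv(X`*X) * X` * Y;                    /* least-squares estimates (X'X)^(-1) X'Y */
   print b;
quit;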

EXAMPLE: CPU usage

A study was conducted to examine which factors affect CPU usage. A set of 38 processes written in a programming language was considered. For each process, data were collected on:

Y  = CPU usage (time) in seconds,
X1 = number of lines (linet), in thousands, generated by the process execution,
X2 = number of programs (step) forming the process,
X3 = number of mounted computer devices (device).

Problem: estimate the regression model of Y on X1, X2 and X3.

I) Exploratory data step

Are the associations between Y and the x-variables linear? Draw the scatter plot for each pair (Y, Xi).

[Scatter plots: CPU time vs. lines executed in process; CPU time vs. number of programs; CPU time vs. mounted devices.]

Do the plots show linearity?

PROC REG - SAS OUTPUT

The REG Procedure
Parameter Estimates

                                Parameter    Standard
Variable    Label    DF         Estimate     Error       t Value    Pr > |t|
Intercept   ---      1          0.00147      0.01071     0.14       0.8920
Linet       ---      1          0.02109      0.00271     7.79       <.0001
step        ---      1          0.00924      0.00210     4.41       <.0001
device      ---      1          0.01218      0.00288     4.23       0.0002

The fitted regression model is

ŷ = 0.00147 + 0.02109 linet + 0.00924 step + 0.01218 device

Fitted model

The fitted regression model is

ŷ = 0.00147 + 0.02109 linet + 0.00924 step + 0.01218 device

The estimated β values measure the changes in Y for changes in the X's. For instance, for each increase of 1000 lines executed by the process (keeping the other variables fixed), the CPU usage time will increase by 0.021 seconds. Fixing the other variables, what happens to the CPU time if I add another device? (See the sketch below.)
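
To answer that question numerically, here is a small sketch that plugs the estimates into the fitted model; the predictor values for the process are invented for illustration. Adding one device, everything else fixed, raises the predicted CPU time by exactly b3 = 0.01218 seconds.

data predict;
   b0=0.00147; b1=0.02109; b2=0.00924; b3=0.01218;  /* estimates from the SAS output */
   linet=10; step=5; device=2;                      /* a hypothetical process */
   yhat  = b0 + b1*linet + b2*step + b3*device;
   yhat2 = b0 + b1*linet + b2*step + b3*(device+1); /* one extra mounted device */
   diff  = yhat2 - yhat;                            /* equals b3 = 0.01218 seconds */
   put yhat= yhat2= diff=;
run;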

Interpretation of model parameters

In multiple regression, the coefficient of an X-variable measures the predicted change in Y for a unit increase in that X-variable while the other independent variables stay constant. For instance, β2 measures the change in Y for a unit increase of the variable X2 if the other x-variables X1 and X3 are fixed.

Are the estimated values accurate?

Residual standard deviation (pg. 632)
Testing effects of individual variables (pg. 652-655)

How do we measure the accuracy of the estimated parameter values? (page 632)

For a simple linear regression with one X, the standard deviations of the parameter estimates are defined as:

σ(b1) = σe / √Σ(xi − x̄)²
σ(b0) = σe √( 1/n + x̄² / Σ(xi − x̄)² )

They are both functions of the error standard deviation σe, regarded as a sort of standard deviation (spread) of the points around the line! σe is estimated by the residual standard deviation se (a.k.a. root mean square error), computed from the residuals yi − ŷi:

se = √( Σ(yi − ŷi)² / (n − 2) )

How do we interpret the residual standard deviation?

se is used as a coarse approximation of the prediction error for new y-values: the probable error in new predictions is ±2 se. se is also used in the formulas for the standard errors of the parameter estimates: these can be computed from the data and measure the noise in the parameter estimates.
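
In SAS, prediction limits of this kind can be requested directly; a short sketch, assuming the cpu data set (the CLI option of PROC REG prints 95% prediction intervals for individual observations):

proc reg data=cpu;
   model time = linet step device / cli;   /* 95% prediction limits for individual y-values */
run;
quit;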

For general regression models with k x-variables

For k predictors, the standard errors of the parameter estimates have a complicated form... but they still depend on the error standard deviation σe! The residual standard deviation or root mean square error is defined as

se = √( Σ(yi − ŷi)² / (n − (k+1)) )

where k+1 = number of parameters β's. This measures the precision of our predictions!
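
The definition can be verified by hand from the saved residuals; a minimal sketch, assuming the cpu data set (k = 3 x-variables):

proc reg data=cpu noprint;
   model time = linet step device;
   output out=resids r=ehat;       /* save the residuals */
run;
quit;

data _null_;
   set resids end=last;
   sse + ehat**2;                  /* accumulate the sum of squared residuals */
   if last then do;
      k = 3;                       /* number of x-variables */
      se = sqrt(sse / (_n_ - (k+1)));
      put se=;                     /* should match Root MSE = 0.03459 below */
   end;
run;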

The REG Procedure
Analysis of Variance

                            Sum of       Mean
Source            DF       Squares      Square    F Value    Pr > F
Model              3       0.59705     0.19902     166.38    <.0001
Error             34       0.04067     0.00120
Corrected Total   37       0.63772

Root MSE          0.03459    R-Square    0.9362
Dependent Mean    0.15710    Adj R-Sq    0.9306
Coeff Var        22.01536

The root mean square error for the CPU usage regression model is computed above: Root MSE = √(0.04067/34) = 0.03459. That gives an estimate of the error standard deviation, se = 0.03459.

Inference about regression parameters!

Regression estimates are affected by random error: the concepts of hypothesis testing and confidence intervals apply! The t-distribution is used to construct significance tests and confidence intervals for the "true" parameters of the population. Tests are often used to select those x-variables that have a significant effect on Y.

Tests on the slope for straight line regression

Consider the simple straight line case. A common test on the slope is the test of the hypothesis "X has no effect on Y", i.e. the slope is equal to zero! In statistical terms:

H0: β1 = 0   vs   Ha: β1 < 0 (X has a negative effect)
                  Ha: β1 ≠ 0 (X has a significant effect)
                  Ha: β1 > 0 (X has a positive effect)

The test is given by the t-statistic

t = b1 / s.e.(b1)

which has a t-distribution with n − 2 degrees of freedom!

Tests on the parameters in multiple regression

Assumptions on the data:
1. e1, e2, …, en are independent of each other.
2. The ei are normally distributed with mean zero and have common standard deviation σe.

Significance tests on the parameter βj test the hypothesis "Xj has no effect on Y", or in statistical terms:

H0: βj = 0   vs   Ha: βj < 0 (Xj has a negative effect on Y)
                  Ha: βj ≠ 0 (Xj has a significant effect on Y)
                  Ha: βj > 0 (Xj has a positive effect on Y)

The test is given by the t-statistic (computed by SAS)

t = bj / s.e.(bj)

which has a t-distribution with n − (k+1) degrees of freedom.

Tests in SAS

The test p-values for the regression coefficients are computed by PROC REG. SAS will produce the two-sided p-value. If your alternative hypothesis is one-sided (either > or <), then find the one-sided p-value by dividing the p-value computed by SAS by 2 (provided the estimate falls on the side specified by the alternative):

one-sided p-value = (two-sided p-value)/2
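
For instance, the t-test for the device coefficient can be reproduced by hand from the output below; a sketch, using PROBT, the t-distribution CDF in SAS:

data _null_;
   b = 0.01218;  se_b = 0.00288;        /* estimate and standard error for 'device' */
   n = 38;  k = 3;
   t  = b / se_b;                       /* t-statistic, about 4.23 */
   df = n - (k+1);                      /* 34 degrees of freedom */
   p_two = 2*(1 - probt(abs(t), df));   /* two-sided p-value, about 0.0002 */
   p_one = p_two / 2;                   /* one-sided p-value, if Ha matches the sign of t */
   put t= df= p_two= p_one=;
run;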

SAS Output

The REG Procedure
Parameter Estimates

                                Parameter    Standard
Variable    Label    DF         Estimate     Error       t Value    Pr > |t|
Intercept   ---      1          0.00147      0.01071     0.14       0.8920
Linet       ---      1          0.02109      0.00271     7.79       <.0001
step        ---      1          0.00924      0.00210     4.41       <.0001
device      ---      1          0.01218      0.00288     4.23       0.0002

The t-tests on each parameter value show that all the x-variables in the model are significant at the 5% level (p-values < 0.05). The null hypothesis of no effect can be rejected, and we conclude that there is a significant association between Y and each x-variable.

Test on the intercept β0

                                Parameter    Standard
Variable    Label    DF         Estimate     Error       t Value    Pr > |t|
Intercept   ---      1          0.00147      0.01071     0.14       0.8920

The test on the intercept says that the null hypothesis β0 = 0 should be accepted; the test p-value is 0.8920. This suggests that the model could have no intercept! Dropping the intercept is not recommended though, unless you know that Y = 0 when all the x-variables are equal to zero.

What do we do if a model parameter is not significant?

If the t-test on a parameter βj shows that its value is not significantly different from zero, we should refit the regression model without the x-variable corresponding to βj.
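
For example, if step had turned out not to be significant (it is significant here, so this is purely hypothetical), the refit would drop it from the MODEL statement:

/* hypothetical refit without 'step' */
proc reg data=cpu;
   model time = linet device;
run;
quit;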

SAS Code for the CPU usage data

data cpu;
   infile "C:\week5\cpudat.txt";
   input time line step device;
   linet = line/1000;
   label time   = "CPU time in seconds"
         line   = "lines in program execution"
         step   = "number of computer programs"
         device = "mounted devices"
         linet  = "lines in program (thousand)";
run;

/* Exploratory data analysis */

/* computes correlation values between all variables in the data set */
proc corr data=cpu;
run;

/* creates scatterplots of time vs linet, time vs step and time vs device, respectively */
proc gplot data=cpu;
   plot time*(linet step device);
run;

/* Regression analysis: fits the model to predict time using linet, step and device */
proc reg data=cpu;
   model time = linet step device;
   plot time*linet / nostat;
run;
quit;

If you want to fit a model with no intercept, use the following MODEL statement:

model time = linet step device / noint;