
Chapter 11 Multiple Linear Regression Group Project AMS 572

Group Members (from left to right): Yongjun Cheng, William Ho, Katy Sharpe, Renyuan Luo, Farahnaz Maroof, Shuqiang Wang, Cai Rong, Jianping Zhang, Lingling Wu, Yuanhua Li.

Overview
1-3 Multiple Linear Regression -- William Ho
4 Statistical Inference -- Katy Sharpe & Farahnaz Maroof
6 Topics in Regression Modeling -- Renyuan Luo & Yongjun Cheng
7 Variable Selection Methods & SAS -- Lingling Wu, Yuanhua Li & Shuqiang Wang
5, 8 Regression Diagnostics and Strategy for Building a Model -- Cai Rong
Summary -- Jianping Zhang

Multiple Linear Regression Intro William Ho

Multiple Linear Regression: Historical Background
Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables. Multiple linear regression extends the simple linear regression model to the case of two or more predictor variables. Gauss developed the method of least squares used in regression analysis; Francis Galton started using the term regression in his biology research; Karl Pearson and Udny Yule extended Galton's work to the statistical context. Example: a multiple regression analysis might show us that the demand for a product varies directly with changes in demographic characteristics (age, income) of a market area.

Probabilistic Model
$y_i$ is the observed value of the r.v. $Y_i$, which depends on fixed predictor values $x_{i1}, x_{i2}, \ldots, x_{ik}$ according to the following model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$
where $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters. The random errors $\varepsilon_i$ are assumed to be independent $N(0, \sigma^2)$ r.v.'s; then the $Y_i$ are independent $N(\mu_i, \sigma^2)$ r.v.'s with $\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$.

Fitting the Model
The least squares (LS) method is used to fit the model to the data. Specifically, LS provides estimates of the unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$ that minimize
$$Q = \sum_{i=1}^{n} \bigl[ y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \bigr]^2,$$
the sum of squared differences between the observed values $y_i$ and the corresponding points on the fitted surface. The LS estimates can be found by taking partial derivatives of $Q$ with respect to the unknown parameters and setting them equal to 0. The result is a set of simultaneous linear equations, usually solved by computer. The resulting solutions $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are the least squares (LS) estimates of $\beta_0, \beta_1, \ldots, \beta_k$, respectively.
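As a sketch of the "partial derivatives" step (standard calculus, not spelled out on the slide), with the convention $x_{i0} = 1$ for the intercept term:
$$\frac{\partial Q}{\partial \beta_j} = -2 \sum_{i=1}^{n} x_{ij}\,\bigl[ y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}) \bigr] = 0, \qquad j = 0, 1, \ldots, k.$$
These are the normal equations; their matrix form appears two slides below.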

Goodness of Fit of the Model
To evaluate the goodness of fit of the LS model, we use the residuals, defined by $e_i = y_i - \hat y_i$, where the $\hat y_i$ are the fitted values $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$. An overall measure of the goodness of fit is the error sum of squares $SSE = \sum_{i=1}^{n} e_i^2$. A few other definitions, similar to those in simple linear regression: total sum of squares $SST = \sum_{i=1}^{n} (y_i - \bar y)^2$ and regression sum of squares $SSR = SST - SSE$.

Coefficient of multiple determination:
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Values closer to 1 represent better fits; adding predictor variables never decreases $r^2$ and generally increases it. The multiple correlation coefficient is the positive square root of $r^2$ (only the positive square root is used); r is a measure of the strength of the association between the predictor variables and the one response variable.

Multiple Regression Model in Matrix Notation
The multiple regression model can be presented in a compact form by using matrix notation. Let
$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
be the n x 1 vectors of the r.v.'s $Y_i$, their observed values $y_i$, and the random errors $\varepsilon_i$, respectively. Let
$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}$$
be the n x (k + 1) matrix of the values of the predictor variables (the first column of 1's corresponds to the constant term $\beta_0$).

Let
$$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_k \end{bmatrix} \qquad \text{and} \qquad \hat{\boldsymbol{\beta}} = \begin{bmatrix} \hat\beta_0 \\ \vdots \\ \hat\beta_k \end{bmatrix}$$
be the (k + 1) x 1 vectors of unknown parameters and their LS estimates, respectively. The model can be rewritten as
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$
The simultaneous linear equations whose solution yields the LS estimates are the normal equations
$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}.$$
If the inverse of the matrix $\mathbf{X}'\mathbf{X}$ exists, then the solution is given by
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
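To make the matrix formula concrete, here is a minimal SAS/IML sketch (assuming SAS/IML is licensed; the dataset example115 and variables x1-x4, y follow the cement example used later in these slides):

proc iml;
  use example115;
  read all var {x1 x2 x3 x4} into X0;   /* n x k matrix of predictor values */
  read all var {y} into y;              /* n x 1 response vector            */
  close example115;
  n = nrow(X0);
  X = j(n, 1, 1) || X0;                 /* prepend a column of 1's          */
  b = solve(X` * X, X` * y);            /* solves the normal equations      */
  print b;                              /* LS estimates beta0, ..., betak   */
quit;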

11.4 Statistical Inference Katy Sharpe & Farahnaz Maroof

Determining the statistical significance of the predictor variables:
We test the hypotheses $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$. If we can't reject $H_0$, then the corresponding variable $x_j$ is not a useful predictor of y. Recall that each $\hat\beta_j$ is normally distributed with mean $\beta_j$ and variance $\sigma^2 v_{jj}$, where $v_{jj}$ is the jth diagonal entry of the matrix $\mathbf{V} = (\mathbf{X}'\mathbf{X})^{-1}$.

Deriving a pivotal quantity
Note that $\hat\beta_j \sim N(\beta_j, \sigma^2 v_{jj})$. Since the error variance $\sigma^2$ is unknown, we employ its unbiased estimate, which is given by
$$s^2 = MSE = \frac{SSE}{n - (k+1)}.$$
We know that $[n-(k+1)]\,s^2 / \sigma^2 \sim \chi^2_{n-(k+1)}$, and that $s^2$ and the $\hat\beta_j$ are statistically independent. Using these facts and the definition of the t-distribution, we obtain the pivotal quantity
$$T = \frac{\hat\beta_j - \beta_j}{s\sqrt{v_{jj}}} \sim t_{n-(k+1)}.$$

Derivation of a Confidence Interval for $\beta_j$
Note that
$$P\left( -t_{n-(k+1),\,\alpha/2} \leq \frac{\hat\beta_j - \beta_j}{s\sqrt{v_{jj}}} \leq t_{n-(k+1),\,\alpha/2} \right) = 1 - \alpha,$$
which yields the $100(1-\alpha)\%$ CI
$$\hat\beta_j \pm t_{n-(k+1),\,\alpha/2}\; s\sqrt{v_{jj}} = \hat\beta_j \pm t_{n-(k+1),\,\alpha/2}\; SE(\hat\beta_j).$$

Deriving the Hypothesis Test
Hypotheses: $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$. Setting
$$P(\text{Reject } H_0 \mid H_0 \text{ is true}) = \alpha,$$
we reject $H_0$ if
$$|t| = \frac{|\hat\beta_j|}{s\sqrt{v_{jj}}} > t_{n-(k+1),\,\alpha/2}.$$

Another Hypothesis Test
Now consider $H_0: \beta_j = 0$ for all $j = 1, \ldots, k$ vs. $H_1: \beta_j \neq 0$ for at least one j. When $H_0$ is true,
$$F = \frac{SSR/k}{SSE/[n-(k+1)]} = \frac{MSR}{MSE} \sim F_{k,\, n-(k+1)},$$
so F is our pivotal quantity for this test. Compute the p-value of the test, compare p to $\alpha$, and reject $H_0$ if $p \leq \alpha$. If we reject $H_0$, we know that at least one $\beta_j \neq 0$, and we refer to the previous test in this case.

The General Hypothesis Test
Consider the full model $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i$ (i = 1, 2, ..., n). Now consider the partial model $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_m x_{im} + \varepsilon_i$ (i = 1, 2, ..., n), with m < k. Hypotheses: $H_0: \beta_{m+1} = \cdots = \beta_k = 0$ vs. $H_1: \beta_j \neq 0$ for at least one $j = m+1, \ldots, k$. Reject $H_0$ when
$$F = \frac{(SSE_{\text{partial}} - SSE_{\text{full}})/(k-m)}{SSE_{\text{full}}/[n-(k+1)]} > f_{k-m,\; n-(k+1),\; \alpha}.$$

Predicting Future Observations
Let $\mathbf{x}^* = (1, x_1^*, \ldots, x_k^*)'$ and let $\mu^* = E(Y^*) = \mathbf{x}^{*\prime}\boldsymbol{\beta}$, estimated by $\hat Y^* = \mathbf{x}^{*\prime}\hat{\boldsymbol{\beta}}$. Our pivotal quantity becomes
$$T = \frac{\hat Y^* - \mu^*}{s\sqrt{\mathbf{x}^{*\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^*}} \sim t_{n-(k+1)}.$$
Using this pivotal quantity, we can derive a CI to estimate $\mu^*$:
$$\hat Y^* \pm t_{n-(k+1),\,\alpha/2}\; s\sqrt{\mathbf{x}^{*\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^*}.$$
Additionally, we can derive a prediction interval (PI) to predict $Y^*$:
$$\hat Y^* \pm t_{n-(k+1),\,\alpha/2}\; s\sqrt{1 + \mathbf{x}^{*\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^*}.$$

Topics in Regression Modeling Renyuan Luo

Multicollinearity
Def. The predictor variables are linearly dependent (or nearly so). This can cause serious numerical and statistical difficulties in fitting the regression model unless "extra" predictor variables are deleted.

How does multicollinearity cause difficulties?
If approximate multicollinearity happens:
1. $\mathbf{X}'\mathbf{X}$ is nearly singular, which makes $(\mathbf{X}'\mathbf{X})^{-1}$ numerically unstable. This is reflected in large changes in the magnitudes of the $\hat\beta_j$ with small changes in the data.
2. The matrix $(\mathbf{X}'\mathbf{X})^{-1}$ has very large elements. Therefore the variances $\sigma^2 v_{jj}$ of the $\hat\beta_j$ are large, which makes the $\hat\beta_j$ statistically nonsignificant.

Measures of Multicollinearity
1. The correlation matrix R: easy, but can't reflect linear relationships among more than two variables.
2. The determinant of R can be used as a measure of singularity of $\mathbf{X}'\mathbf{X}$.
3. Variance Inflation Factors (VIF): the diagonal elements of $R^{-1}$. VIF > 10 is regarded as unacceptable.
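As a quick illustration beyond the slide, PROC REG prints VIFs when the VIF option is added to the MODEL statement (dataset and variable names follow the later cement example):

proc reg data=example115;
  model y = x1 x2 x3 x4 / vif;   /* VIF > 10 flags problematic collinearity */
run;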

Polynomial Regression
A special case of the linear model:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon$$
Problems:
1. The powers of x, i.e., $x, x^2, \ldots, x^k$, tend to be highly correlated.
2. If k is large, the magnitudes of these powers tend to vary over a rather wide range. So k should be kept small, say $k \leq 5$.

Solutions
1. Centering the x-variable: replace x by $x - \bar x$. Effect: removes the nonessential multicollinearity in the data.
2. Furthermore, we can standardize: divide the centered variable by the standard deviation of x, i.e., use $(x - \bar x)/s_x$. Effect: helps alleviate the second problem (widely varying magnitudes of the powers).
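A minimal SAS sketch of the standardizing step, assuming an input dataset poly with a variable x (all names illustrative):

/* Center and scale x to mean 0, standard deviation 1, then build
   the (now less collinear) polynomial terms. */
proc standard data=poly mean=0 std=1 out=polystd;
  var x;
run;

data polystd2;
  set polystd;
  x2 = x**2;   /* quadratic term from the standardized x */
  x3 = x**3;   /* cubic term */
run;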

Dummy Predictor Variables
What do we do with categorical predictor variables?
1. If we have categories of an ordinal variable, such as the prognosis of a patient (poor, average, good), just assign numerical scores to the categories, e.g., poor = 1, average = 2, good = 3.

2. If we have a nominal variable with c >= 2 categories, use c - 1 indicator variables $x_1, \ldots, x_{c-1}$, called dummy variables, to code it: $x_j = 1$ for the jth category and 0 otherwise, so that $x_1 = \cdots = x_{c-1} = 0$ for the cth category.

Why don't we just use c indicator variables $x_1, \ldots, x_c$? If we use all c, there will be a linear dependency among them: $x_1 + x_2 + \cdots + x_c = 1$. This will cause multicollinearity.

Example
Suppose we have four years of quarterly sales data for a certain brand of soda cans. How can we model the time trend by fitting a multiple regression equation? Solution: we use time (the quarter number) as a predictor variable $x_1$. To model the seasonal trend, we use indicator variables $x_2, x_3, x_4$ for Winter, Spring, and Summer, respectively; for Fall, all three equal zero. That means: Winter = (1,0,0), Spring = (0,1,0), Summer = (0,0,1), Fall = (0,0,0). Then we have the model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon.$$
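A hedged SAS sketch of this dummy coding, assuming a dataset sales with variables quarter (1-16), season (character), and response y (all names illustrative); in SAS a logical comparison evaluates to 1 or 0, which builds the dummies directly:

data sales2;
  set sales;
  x1 = quarter;               /* time trend */
  x2 = (season = 'Winter');   /* 1 if Winter, else 0 */
  x3 = (season = 'Spring');
  x4 = (season = 'Summer');   /* Fall is the baseline: x2 = x3 = x4 = 0 */
run;

proc reg data=sales2;
  model y = x1 x2 x3 x4;
run;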

Topics in Regression Modeling Yongjun Cheng

Logistic Regression Model
In 1938, R. A. Fisher and Frank Yates suggested the logistic transform for analyzing binary data.

Why is it important?
The logistic regression model is the most popular model for binary data. It is generally used for binary response variables: Y = 1 (true, success, YES, etc.) or Y = 0 (false, failure, NO, etc.).

What is the Logistic Regression Model?
Consider a response variable Y = 0 or 1 and a single predictor variable x. We want to model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function:
$$\log \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \beta_0 + \beta_1 x.$$
This model may be rewritten as
$$P(Y=1 \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$

Some properties of the logistic model
E(Y|x) = P(Y=1|x) * 1 + P(Y=0|x) * 0 = P(Y=1|x) is bounded between 0 and 1 for all values of x. This is not true if we use the linear model P(Y=1|x) = $\beta_0 + \beta_1 x$. In analogy with ordinary regression, the regression coefficient $\beta_1$ has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x. For multiple predictor variables, the logistic regression model is
$$\log \frac{P(Y=1 \mid x_1, \ldots, x_k)}{1 - P(Y=1 \mid x_1, \ldots, x_k)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$
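Though the slides stop at the model, fitting it in SAS is short; this sketch assumes a dataset binresp with a 0/1 response y and a predictor x (names illustrative):

proc logistic data=binresp;
  model y(event='1') = x;   /* EVENT='1' makes SAS model P(Y=1) */
run;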

Standardized Regression Coefficients
Why do we need standardized regression coefficients? Consider the fitted regression equation for the linear regression model, $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$.
1. The magnitudes of the $\hat\beta_j$ can NOT be directly compared to judge the relative effects of the $x_j$ on y, because they depend on the units in which the variables are measured.
2. Standardized regression coefficients may be used to judge the importance of different predictors.

How to standardize regression coefficients?
Example: the industrial sales data, fitted with a linear model in two predictors. NOTE: one predictor's raw coefficient can be much smaller than another's and yet, once the coefficients are standardized, that predictor turns out to have a much larger effect on y, because the raw coefficients reflect the scales of the predictors.

Summary for the general case
Standardizing transform:
$$y_i^* = \frac{y_i - \bar y}{s_y}, \qquad x_{ij}^* = \frac{x_{ij} - \bar x_j}{s_{x_j}}$$
Standardized regression coefficients:
$$\hat\beta_j^* = \hat\beta_j \left( \frac{s_{x_j}}{s_y} \right), \qquad j = 1, \ldots, k.$$
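In practice these are rarely computed by hand; a sketch of the SAS route, where the STB option prints standardized estimates (dataset name follows the later cement example):

proc reg data=example115;
  model y = x1 x2 x3 x4 / stb;   /* adds a 'Standardized Estimate' column,
                                    i.e., the beta_j* of the slide above */
run;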

Variable Selection Methods: Stepwise Regression LingLing Wu

Variable selection methods
(1) Why do we need variable selection methods?
(2) How do we select variables?
* stepwise regression
* best subsets regression

Stepwise Regression
At each step we compare a (p-1)-variable model with a p-variable model, testing whether the added variable $x_p$ makes a significant contribution via the partial F-statistic
$$F_p = \frac{SSE_{p-1} - SSE_p}{SSE_p/[n-(p+1)]},$$
which is compared with a preset F-to-enter (or F-to-remove) critical value.

Partial correlation coefficients
The partial correlation between y and $x_p$, adjusting for the variables $x_1, \ldots, x_{p-1}$ already in the model, satisfies
$$r^2_{y x_p \cdot x_1, \ldots, x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}},$$
so a large partial F-statistic corresponds to a large partial correlation.

Stepwise regression algorithm
Start with no predictors. At each step: (1) enter the candidate variable with the largest partial F-statistic, provided it exceeds the F-to-enter threshold; (2) after each entry, recheck the variables already in the model and remove any whose partial F-statistic falls below the F-to-remove threshold; (3) stop when no variable can be entered or removed. The SAS sketch below shows how these thresholds are specified.
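In PROC REG the two thresholds are set as significance levels; for SELECTION=STEPWISE the SAS defaults are SLENTRY=0.15 and SLSTAY=0.15 (dataset and variables follow the next example):

proc reg data=example115;
  model y = x1 x2 x3 x4 / selection=stepwise slentry=0.15 slstay=0.15;
run;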

Example of SAS output

Variable Selection Methods: SAS Example Yuanhua Li, Jianping Zhang

Example 11.5 (pg. 416), 11.9 (pg. 431)
The following data give the heat evolved in calories during hardening of cement on a per-gram basis (y), along with the percentages of four ingredients: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4). (The 13 observations appear in the datalines of the SAS program below.)

SAS Program (stepwise selection is used)

* Hald cement data (n = 13), the dataset used by Example 11.5;
data example115;
input x1 x2 x3 x4 y;
datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;

proc reg data=example115;
model y = x1 x2 x3 x4 /selection=stepwise;
run;

Selected SAS output

The REG Procedure
Model: MODEL1
Dependent Variable: y
Stepwise Selection: Step 4

The parameter-estimates table at this final step (columns: Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F) lists the Intercept, x1, and x2, each significant with Pr > F < .0001, followed by bounds on the condition number.

SAS Output (cont)

All variables left in the model are significant at the 0.1500 level. No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection (columns: Step, Variable Entered, Variable Removed, Number Vars In, Partial R-Square, Model R-Square, C(p), F Value, Pr > F):
Step 1: x4 entered. Step 2: x1 entered. Step 3: x2 entered. Step 4: x4 removed, leaving x1 and x2 in the final model.

Variable Selection Methods: Best Subsets Regression Shuqiang Wang

Best Subsets Regression
In the stepwise regression algorithm, the final model is not guaranteed to be optimal in any specified sense. In best subsets regression, a subset of variables is chosen from the collection of all subsets of the k predictor variables that optimizes a well-defined objective criterion.

Best Subsets Regression
In stepwise regression, we get only one single final model. In best subsets regression, the investigator can specify a size (number of predictors) for the models to be compared.

Best Subsets Regression: Optimality Criteria
$r_p^2$-criterion: choose the subset of p predictors maximizing
$$r_p^2 = 1 - \frac{SSE_p}{SST}.$$
Adjusted $r_p^2$-criterion:
$$r_{p,\text{adj}}^2 = 1 - \frac{n-1}{n-(p+1)} \cdot \frac{SSE_p}{SST}.$$
$C_p$-criterion (recommended for its ease of computation and its ability to judge the predictive power of a model): the sample estimator, Mallows' $C_p$-statistic, is given by
$$C_p = \frac{SSE_p}{\hat\sigma^2} + 2(p+1) - n,$$
where $\hat\sigma^2$ is the MSE from the full model; subsets with $C_p$ small and close to p + 1 are preferred.

Best Subsets Regression Algorithm
Note that our problem is to find the optimum of a given criterion function over subsets of predictors. Options: (1) use the stepwise regression algorithm, replacing the partial F criterion with another criterion such as $C_p$; (2) enumerate all possible subsets and find the optimum of the criterion function directly. Other possibilities?

Best Subsets Regression & SAS
proc reg data=example115;
model y = x1 x2 x3 x4 /selection=stepwise;
run;
For the SELECTION= option, SAS has implemented 9 methods in total. For the best subsets approach, we have the following options:
Maximum R² Improvement (MAXR)
Minimum R² Improvement (MINR)
R² Selection (RSQUARE)
Adjusted R² Selection (ADJRSQ)
Mallows' C_p Selection (CP)
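For instance, a sketch of a Cp-based search on the same data; the BEST= option (which PROC REG supports for the RSQUARE/ADJRSQ/CP methods) limits the printout to the best few subsets of each size:

proc reg data=example115;
  model y = x1 x2 x3 x4 / selection=cp best=3;   /* three best subsets per size, ranked by Cp */
run;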

11.5, 11.8 Building A Multiple Regression Model Steps and Strategy By Rong Cai

Modeling is an iterative process. Several cycles of the steps may be needed before arriving at the final model. The basic process consists of seven steps.

Get Started and Follow the Steps
1. Categorization by usage
2. Collect the data
3. Explore the data
4. Divide the data
5. Fit candidate models
6. Select and evaluate
7. Select the final model

Step I: Decide the type of model needed, according to the intended usage. Main categories include:
* Predictive
* Theoretical
* Control
* Inferential
* Data Summary
Sometimes a model serves multiple purposes.

Step II: Collect the Data
Identify the predictor variables (X) and the response variable (Y). Data should be relevant and bias-free. Reference: Chapter 3.

Step III: Explore the Data
The linear regression model is sensitive to noise, so we should treat outliers and influential observations cautiously. Reference: Chapters 4 and 10.

Step IV: Divide the Data
Training set: for building the model. Test set: for checking it. How to divide? With a large sample, split half and half; with a small sample, keep the training set larger, with size > 16 (see the SAS sketch below).
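A minimal SAS sketch of a random half/half split, assuming a source dataset all (name illustrative); RANUNI with a fixed seed makes the division reproducible:

data train test;
  set all;
  if ranuni(20231) < 0.5 then output train;  /* seed chosen arbitrarily */
  else output test;
run;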

Step V: Fit Several Candidate Models, using the training set.

Step VI: Select and Evaluate a Good Model
Check for, and remedy, violations of the model assumptions (see the regression diagnostics below).

Step VII: Select the Final Model
Use the test set to compare competing models by cross-validating them.

Regression Diagnostics (Step VI)
Graphical analysis of residuals:
* Plot estimated errors vs. $x_i$ values. The differences between the actual $y_i$ and the predicted $\hat y_i$ are the estimated errors, called residuals.
* Plot a histogram or stem-and-leaf display of the residuals.
Purposes:
* Examine the functional form (linearity)
* Evaluate violations of assumptions
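A sketch of producing these plots in SAS: the OUTPUT statement saves residuals and fitted values, and PROC SGPLOT draws the scatter (dataset names illustrative; the two-predictor cement model is reused):

proc reg data=example115;
  model y = x1 x2;
  output out=diag r=resid p=fitted;  /* residuals and predicted values */
run;

proc sgplot data=diag;
  scatter x=fitted y=resid;  /* residuals vs. fitted values */
  refline 0 / axis=y;        /* horizontal reference line at zero */
run;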

Linear Regression Assumptions
* The mean of the probability distribution of the error is 0.
* The probability distribution of the error has constant variance.
* The probability distribution of the error is normal.
* The errors are independent.

Residual Plot for Functional Form (Linearity)
(Two panels: a curved residual pattern indicating an x² term should be added, vs. a random scatter indicating correct specification.)

Residual Plot for Equal Variance
(Two panels: a fan-shaped residual pattern indicating unequal variance, vs. a random scatter indicating correct specification.) Standardized residuals (each residual divided by the standard error of prediction) are typically used.

Residual Plot for Independence
(Two panels: a systematic residual pattern indicating errors that are not independent, vs. a random scatter indicating correct specification.)

Summary Jianping Zhang

Chapter Summary
* The multiple linear regression model
* How to fit the multiple regression model (LS fit)
* How to evaluate the goodness of fit
* How to select predictor variables
* How to use SAS to do multiple regression
* How to build a multiple regression model

Thank you! Happy holidays!