Regression Analysis, Part A: Basic Linear Regression Analysis and Estimation of Parameters. Read Chapters 3, 4, and 5 of Forecasting and Time Series: An Applied Approach.
Regression Analysis Modules
Part A – Basic Model & Parameter Estimation
Part B – Calculation Procedures
Part C – Inference: Confidence Intervals & Hypothesis Testing
Part D – Goodness of Fit
Part E – Model Building
Part F – Transformed Variables
Part G – Standardized Variables
Part H – Dummy Variables
Part I – Eliminating Intercept
Part J – Outliers
Part K – Regression Example #1
Part L – Regression Example #2
Part N – Non-linear Regression
Part P – Non-linear Example
Definition of Regression Analysis
- The derivation of an algebraic equation to describe the relationship of one or more variables (X1, X2, …, Xp) to one other variable (Y).
- The X's can be quantitative or qualitative.
- Y must be quantitative.
Types of Regression Analysis
- Linear Regression – the algebraic equation is linear in the parameters. Estimates of the parameters are very easy to derive.
- Non-Linear Regression – the algebraic equation is not linear in the parameters, and estimates of the parameters are more difficult to derive.
Note that an equation that is linear in the parameters can still produce a non-linear curve.
Other Regression Categorizations
- Simple Regression versus Multiple Regression – depends on whether the regression equation has one independent variable (X1) or more than one independent variable (X1, X2, …, Xp). The theory remains the same in both situations, but
1) the numerical computations are slightly more complex in the multiple regression case (a matrix formulation of the problem is needed), and
2) some sets of independent variables are computationally invalid (a condition called multicollinearity).
Other Regression Categorizations
- Cross-Sectional Regression versus Time Series Regression – depends on whether the dependent and independent variables are jointly "ordered" in time. The Yj and Xj's are recorded at a specific point in time, and Yj+k and the Xj+k's are recorded k time periods later. The theory remains the same in both situations. However, for time series data, prior values of Y and X may be used in the regression equation for future values of Y.
Notation for Regression Data
- The one Y variable is called the dependent variable or the response variable.
- The set of X variables are called the independent variables or explanatory variables.
- If there are p independent variables and a data sample of size N, the variables can be denoted as Yi and Xi,1, Xi,2, …, Xi,p for i = 1 to N.
Notation for Regression Equation
- The value of Yi for a given set of X values equals the mean value of Y, μi, at that set of X values plus some random error, εi: Yi = μi + εi. The 'i' in this case refers to the specific set of X values.
- Regression analysis determines an algebraic equation that can be used to predict μi.
Notation for Regression Equation (continued)
- The regression equation can be written as Y = a + bX1 + cX2 + … + hXp + ε; the estimates of the regression parameters (a, b, c, …, h) give the estimated regression equation Ŷ = a + bX1 + cX2 + … + hXp.
- An alternative notation that is frequently used is Y = β0 + β1X1 + β2X2 + … + βpXp + ε, which is estimated as Ŷ = b0 + b1X1 + b2X2 + … + bpXp.
Notation for Regression Equation (continued)
Sometimes a distinction is made between Y and X1, X2, …, Xp (the random variables) and Yi and Xi,1, Xi,2, …, Xi,p (the observed values in a sample). If this distinction is desired, the observed sample values are frequently denoted by the lower case yi and xi,1, xi,2, …, xi,p.
p is the number of variables; n is the number of observations.
Example of Regression Data (single independent variable).
Plotted Regression Equations.
Uses of Regression Analysis
1) Summarizes the data – seeing the data plotted on an X-Y chart is informative, and seeing the equation plotted on the graph may be even more valuable. The plotted curve summarizes the scatter of the points, and all observers are focused on the same interpretation of the data.
2) Allows predictions – inserting a set of Xi values in the right-hand side of the equation allows the predicted Yi to be calculated. Predictions made from the equation are better than reviewing historical data to find a value that was close to the desired Xi value and seeing what the corresponding Yi value was; the selected Yi may by chance have a large residual and not be representative of the majority of the data.
3) Interprets the coefficients – in many cases the regression parameters have a useful physical interpretation. The coefficient for Xi indicates how much change occurs in Y for each unit change in Xi.
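For example, a prediction from a fitted equation is just the right-hand side evaluated at new X values. A minimal sketch in Python, with made-up coefficients used only for illustration (not from any course data set):

# Hypothetical fitted equation: Yhat = 2.5 + 0.8*X1 - 1.2*X2 (illustrative coefficients only)
def predict(x1, x2):
    return 2.5 + 0.8 * x1 - 1.2 * x2

print(predict(10, 3))   # predicted Y at X1 = 10, X2 = 3  ->  6.9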
Regression Line – Fitting Procedure: Least Squares Criterion. The line is fit by choosing the parameters that minimize the sum of squared deviations of the observed Yi from the fitted values, SSE = Σ(Yi − Ŷi)². (Teaching Point)
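A minimal sketch of the least squares criterion in Python/NumPy, on made-up data: the fitted line is the pair (a, b) that makes the sum of squared errors as small as possible.

import numpy as np

# Made-up data used only to illustrate the criterion.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(a, b):
    """Sum of squared errors for the candidate line yhat = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

print(sse(0.0, 2.0))    # an arbitrary candidate line: SSE = 0.11
print(sse(0.14, 1.96))  # the least squares line for this data: SSE is smaller (about 0.09)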
Derivation of Regression Parameters (single independent variable).
Derivation of Regression Parameters (single independent variable, continued). For the fitted line Ŷ = a + bX, the least squares estimates are b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and a = Ȳ − bX̄.
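A minimal sketch of these closed-form estimates in Python/NumPy on made-up data; np.polyfit is used only as a cross-check.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope
a = y.mean() - b * x.mean()                                                 # intercept
print(a, b)                 # 0.14, 1.96 for this toy data
print(np.polyfit(x, y, 1))  # cross-check: returns [slope, intercept]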
Relationship between Regression Coefficients and Correlation Coefficient. For a single independent variable, the slope estimate and the sample correlation coefficient r are related by b = r(sY/sX), where sX and sY are the sample standard deviations of X and Y. (Teaching Point)
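A quick numerical check of this relationship in Python/NumPy, again on made-up data (a sketch, not part of the course material):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope estimate
r = np.corrcoef(x, y)[0, 1]                                                 # sample correlation
print(b, r * np.std(y, ddof=1) / np.std(x, ddof=1))                         # the two values agree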
Derivation of Regression Parameters (multiple independent variables). In matrix form the parameter estimates are b = (X'X)^-1(X'Y). Not shown is the fact that this algebraic manipulation is also the least squares solution. (Teaching Point)
Characteristics of the Parameter Estimates, b = (X'X)^-1(X'Y)
- Least Squares Estimates – matrix differentiation is outside the scope of the course, but it can be shown that b = (X'X)^-1(X'Y) minimizes SSE = ε'ε = (Y − Xβ)'(Y − Xβ).
- Unbiased, Minimum Variance Estimates.
- Maximum Likelihood Estimates – if the errors are independent and normally distributed, that is, εi ~ N(0, σ²).
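A minimal sketch of the matrix solution in Python/NumPy on simulated data. The normal-equations form below matches the slide; in practice a least squares solver (QR or SVD, as in np.linalg.lstsq) is numerically preferable to explicitly forming X'X.

import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)   # made-up data

X = np.column_stack([np.ones(n), x1, x2])      # a first column of ones gives the intercept
b = np.linalg.solve(X.T @ X, X.T @ y)          # b = (X'X)^-1 (X'Y) via the normal equations
print(b)
print(np.linalg.lstsq(X, y, rcond=None)[0])    # cross-check with a least squares solver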
Limitations on Regression Analysis
- Restriction – the basic model must be linear in the parameters.
- Restriction – the sample must be representative of the population for the inference/prediction.
- Assertion – the equation being fit to the data is a correct (valid) representation of the underlying process.
- Consequence – if there are a large number of Yi values for the same sets of Xi values, the regression equation will predict the mean Yi values; that is, the regression equation will pass through the points μi.
Assumptions underlying Regression
The required assumptions are:
1. The dependent variable is subject to error. This error is assumed to be a random variable with a mean of zero, E(ε) = 0.
2. The independent variables are error-free.
3. The predictors must be linearly independent, i.e., it must not be possible to express any predictor as a linear combination of the others. See Multicollinearity.
4. The errors are uncorrelated, that is, the variance-covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.
5. The variance of the errors is constant (homoscedasticity).
6. The errors follow a normal distribution.
(Teaching Points – will discuss later)
Multicollinearity Condition
Y predicted by X1, X2, X3, …, Xp:
- Correlation between Y and the X's – moderate correlation is good; large correlation is great.
- Correlation among the X's themselves – moderate correlation is a possible problem; large correlation is a disaster.
Multicollinearity Condition
A multicollinearity condition exists if the independent variables are linearly related. Example – using stature (standing height) and sitting height as independent variables when estimating weight.
Transformed data examples –
Definite multicollinearity condition:
- X1 and X2 = 2X1
- X1, X2 and X3 = X1 − X2
- X1, X2 and any X3 = aX1 + bX2
Potential (possible) multicollinearity:
- X1 and X2 = X1²
- X1 and X2 = X1^(1/2)
particularly if
1) N is small compared to p,
2) the original X's have moderate correlation, or
3) many transformations of one original variable are being used.
Basic Test for Multicollinearity
Calculate a correlation matrix for all of the independent variables (original variables and transformed variables). Any correlation greater than .9 represents a possible multicollinearity condition. A ρi,j > .97 is a highly likely multicollinearity condition.
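A minimal sketch of this check in Python/NumPy, on simulated data in which one column is deliberately constructed to be nearly a multiple of another:

import numpy as np

# Made-up design matrix: the third column is close to twice the first.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=(2, 40))
x3 = 2 * x1 + rng.normal(scale=0.05, size=40)
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)                 # correlation matrix of the independent variables
print(np.round(R, 3))
i, j = np.where(np.abs(np.triu(R, k=1)) > 0.9)   # flag pairs with |correlation| > .9
print(list(zip(i, j)))                           # expect the (x1, x3) pair to be flagged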
Sophisticated Test for Multicollinearity
Calculate the coefficient of determination p times, each time using the jth independent variable as the dependent variable and the p − 1 remaining variables as the independent variables (ignore the original dependent variable). Denote these partial R² values as Rj² for j = 1 to p. Calculate the Variance Inflation Factor and the Mean Variance Inflation Factor as VIFj = 1/(1 − Rj²) and Mean VIF = (1/p) Σ VIFj.
Potential multicollinearity condition:
- if the largest Rj² > .9
- if the largest VIFj > 10
- if the Mean VIF >> 1 (a very loose rule of thumb; it cannot be less than 1)
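A minimal sketch of the VIF calculation in Python/NumPy; the function and variable names are my own, and the data are simulated for illustration only.

import numpy as np

def vif(X):
    """Variance Inflation Factors; X is an n-by-p array of independent variables."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]                                   # regress the jth variable ...
        others = np.delete(X, j, axis=1)              # ... on the remaining p - 1 variables
        A = np.column_stack([np.ones(n), others])     # include an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))                  # VIF_j = 1 / (1 - Rj^2)
    return np.array(out)

# Example: the third column is nearly a linear combination of the first two,
# so the VIFs should be large.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 50))
x3 = 2 * x1 - x2 + rng.normal(scale=0.01, size=50)
print(vif(np.column_stack([x1, x2, x3])))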
Differences in Multicollinearity Tests
Correlation coefficients measure the multicollinearity between each pair of independent variables. VIF measures the multicollinearity between one independent variable and the remaining independent variables taken as a group. A multicollinearity condition does not necessarily mean that a variable should be removed from the database. It does indicate that caution should be exercised when fitting equations with these variables.
Multicollinearity Example
The results imply Size could be removed from the database: Corr(Size, Size²) = .997, Corr(Size, 1/Size) = −.985 and VIF(Size) = 47,669. The effect of Size is being accounted for by Size², 1/Size and, in general, all of the remaining variables. Similarly, Age^.9 could be removed from the database because it is being accounted for by Age and the cross product (Size)(Age).
Consequences of Extreme Multicollinearity
- The computer program may crash. The matrix being inverted will be singular and will cause a "divide by zero" error.
- The derived regression coefficients will be very sensitive to small changes in the data. The coefficients are unstable.
- The derived regression coefficients may have unexpected physical interpretations. For example, increasing 'Advertising' could incorrectly imply a reduction in 'Sales'.
- Confusion can occur when deriving the "best" regression model. Two variables may seem of marginal value when they are both in the regression equation; yet, if one of these variables is removed from the regression equation, the remaining variable becomes highly significant (desirable).
- The t-tests of the significance of the regression coefficients may be incorrect, and variables that are not needed may be retained in the regression equation.
(Teaching Point)
Skip's Approach to a Multicollinearity Condition
- I do not eliminate potential multicollinearity variables before doing the regression analysis.
- If the computer crashes, I eliminate the multicollinearity variables one at a time, by trial and error.
- If the computer does not crash, I continue my regression analysis. Usually, typical regression analysis procedures will by themselves eliminate multicollinearity variables.
- When I have a final model, I do a correlation and VIF analysis to make sure that a multicollinearity condition has not survived the analysis.
- Based on the sets of multicollinearity variables, I try different subsets of variables as the starting point in my regression analysis.