Basics of Regression Analysis
Determination of three performance measures
– Estimation of the effect of each factor
– Explanation of the variability
– Forecasting error
Two Predictor Variables
– Population regression model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e$, with $e$ following $N(0, \sigma_e)$
– Unknown parameters: $\beta_0$, $\beta_1$, $\beta_2$; $\sigma_e$
From Data to Estimates of Coefficients
– Principle: least squares
– Normal equation system
– Estimates of coefficients
– Mathematics
– Computing algorithm
Least Squares Method
– Simple regression
– Multiple regression
Matrix Computation for b
– Normal equation system: $(X^T X)\, b = X^T Y$ (see Text Appendix D.3)
– Solution for b: $b = (X^T X)^{-1} (X^T Y)$
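A minimal sketch of the matrix computation in plain Python, assuming two predictors and a small made-up data set. It solves the normal equation system by Gauss-Jordan elimination instead of forming the inverse explicitly; the function names (`solve`, `least_squares`) are illustrative and not taken from the text.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(x_rows, y):
    """x_rows: list of [x1, x2] rows. Returns [b0, b1, b2]."""
    X = [[1.0] + list(row) for row in x_rows]   # column of ones for the intercept
    Xt = transpose(X)
    XtX = matmul(Xt, X)                         # left side of the normal equations
    XtY = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, XtY)

# Made-up data generated without error from Y = 1 + 2 X1 + 3 X2
x_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]
b = least_squares(x_rows, y)
print(b)
```

Because the data contain no disturbance, the estimates should reproduce the generating coefficients up to rounding.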
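Standardized Regression Coefficients
– Definition: $b_k^{*} = b_k \, s_{x_k} / s_y$ for $k = 1, 2$, where $s_{x_k}$ and $s_y$ are the sample standard deviations of $X_k$ and $Y$
– $b_0^{*} = 0$
– Also called the beta coefficient
– Used to show the relative weights of the predictors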
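A short sketch of the computation, assuming the usual definition of a standardized (beta) coefficient as the raw coefficient multiplied by the ratio of the predictor's sample standard deviation to that of Y. The data and the raw coefficients are made up for illustration.

```python
import statistics

def standardized_coefficients(b, x_columns, y):
    """b: [b0, b1, ...]; x_columns: one list per predictor.
    Returns b_k * s_xk / s_y for each predictor (the intercept drops out)."""
    s_y = statistics.stdev(y)
    return [bk * statistics.stdev(col) / s_y for bk, col in zip(b[1:], x_columns)]

# Made-up data: y equals 1 + 2*x1 + 3*x2 exactly
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]
beta = standardized_coefficients([1.0, 2.0, 3.0], [x1, x2], y)
print(beta)
```

The resulting values are unit-free, which is what allows them to be compared across predictors.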
Estimation of $s_e$, the Standard Deviation of the Disturbance $e$
– Forecasting equation: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
– SS of residuals: $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
– Mean SS: $MSE = s_e^2 = SSE / (n-3)$
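Standard Error of Coefficients
– The variance matrix of $b$ (a $(K+1) \times 1$ vector) is $\mathrm{Var}(b) = \sigma_e^2 (X^T X)^{-1}$, estimated by $s_e^2 (X^T X)^{-1}$
– The standard error of each coefficient is the square root of the corresponding diagonal element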
The Variability Explained
– First, determine the base variability to be explained by the regression
– Unconditional mean model: $Y = \mu_y + e$, with $e$ following $N(0, \sigma_y)$
– LS fit of the model: Pred_Y = $\bar{Y}$
– SS of residuals: $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
– MSS (DF = n-1): $s_y^2 = SST / (n-1)$
The Variability Explained – cont.
– Second, subtract the variability still left after the regression
– In SS: $SSR = SST - SSE$
– In variance: $s_y^2 - s_e^2$, where $s_y^2 = SST/(n-1)$ and $s_e^2 = SSE/(n-3)$
Creating ANOVA Table
Regression Model        Unexplained Variability in SS   DF      Unexplained Variability in Variance (MSE)
Unconditional           SST                             n-1     SST / (n-1)
Conditional             SSE                             n-3     SSE / (n-3)
Variability Explained   SSR = SST - SSE                 2
Proportion Explained    R^2 = SSR / SST
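The entries of the ANOVA table can be sketched in a few lines of Python. The fitted values below are made up to keep the arithmetic easy to follow; they are not the output of an actual least squares fit, and SSR = SST - SSE is the regression sum of squares only when the fitted values come from a least squares fit with an intercept.

```python
def anova_table(y, y_hat, n_predictors=2):
    """Return the SS, DF and mean-square entries for a fitted regression."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)               # unconditional model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # conditional model
    df_total = n - 1
    df_error = n - n_predictors - 1                        # n - 3 for two predictors
    return {
        "SST": sst, "SSE": sse, "SSR": sst - sse,
        "DF_total": df_total, "DF_error": df_error, "DF_regression": n_predictors,
        "MS_total": sst / df_total, "MSE": sse / df_error,
        "R2": (sst - sse) / sst,
    }

y = [3.0, 5.0, 4.0, 8.0, 9.0]
y_hat = [3.5, 4.5, 4.5, 7.5, 9.0]   # made-up fitted values, for illustration only
table = anova_table(y, y_hat)
for name, value in table.items():
    print(name, value)
```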
Test of Significance
– F test of significance
– t-test of significance
  – Two sided alternative
  – One sided alternative
F-Test of Significance of the variability explained by the regression
– $H_0$: $\beta_1 = \beta_2 = 0$
– $H_a$: at least one coefficient is not 0
– P-value of F-stat = $P\{F(2, n-3) > \text{F-stat}\}$
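A sketch of the F test for the two-predictor case, assuming F-stat is the ratio of the regression mean square (SSR / 2) to the error mean square (SSE / (n-3)). With numerator DF equal to 2, the upper tail of F(2, m) has the closed form $(1 + 2F/m)^{-m/2}$, so no statistics library is needed here; for a general number of predictors a library routine would be the natural choice. The input numbers are made up.

```python
def f_test_two_predictors(ssr, sse, n):
    """Return (F-stat, p-value) for H0: beta1 = beta2 = 0."""
    df_den = n - 3
    f_stat = (ssr / 2) / (sse / df_den)
    # Exact upper-tail probability of F(2, df_den); valid only for numerator DF = 2
    p_value = (1 + 2 * f_stat / df_den) ** (-df_den / 2)
    return f_stat, p_value

# Made-up sums of squares for illustration
f_stat, p_value = f_test_two_predictors(ssr=25.8, sse=1.0, n=5)
print(f_stat, p_value)
```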
t-Test of Significance of a variable, $X_1$ (two sided)
– $H_0$: $\beta_1 = 0$
– $H_a$: $\beta_1 \neq 0$
– P-value of t-stat = $P\{|t(n-3)| > |\text{t-stat}|\}$
One Sided Test of Significance of a variable, $X_1$
– $H_0$: $\beta_1 = 0$
– $H_a$: $\beta_1 > 0$ (using the prior knowledge)
– P-value of t-stat = $P\{t(n-3) > \text{t-stat}\}$
Forecasting
– Point forecasting
– Sources of forecasting error
– Interval forecasting
Forecasting at $x_m$
– Data of X for regression
– Value of X for prediction
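Sources of Forecasting Error
– Data: $Y \mid x_m = \beta_0 + \beta_1 x_{1m} + \beta_2 x_{2m} + e_m$
– Forecast: $\hat{Y}_m = b_0 + b_1 x_{1m} + b_2 x_{2m}$
– Forecast error: $Y - \hat{Y}_m = (\beta_0 - b_0) + (\beta_1 - b_1) x_{1m} + (\beta_2 - b_2) x_{2m} + e_m$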
Computing Standard Errors
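Forecasting Performance Analysis
– $R^2_{pred} = 1 - PRESS / SST$
– $PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2$, the SS of the deleted residuals, where $\hat{y}_{i(i)}$ is the prediction of $y_i$ from the fit without observation $i$
– Sample splitting
  – Analysis sample ($n_1$)
  – Validation sample ($n_2$)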
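A sketch of PRESS computed directly from its definition: each observation is dropped in turn, the regression is refitted on the rest, and the dropped observation is predicted from that fit. The data are made up, and the helper names (`solve`, `fit`, `press_statistic`) are illustrative. Refitting n times is the most transparent approach, not the fastest.

```python
def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit(x_rows, y):
    """Least squares coefficients [b0, b1, ...] from the normal equations."""
    X = [[1.0] + list(r) for r in x_rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(b, x):
    return b[0] + sum(bk * xk for bk, xk in zip(b[1:], x))

def press_statistic(x_rows, y):
    """Sum of squared deleted residuals y_i - yhat_i(i)."""
    press = 0.0
    for i in range(len(y)):
        b = fit(x_rows[:i] + x_rows[i + 1:], y[:i] + y[i + 1:])  # fit without obs i
        press += (y[i] - predict(b, x_rows[i])) ** 2
    return press

def r2_pred(x_rows, y):
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - press_statistic(x_rows, y) / sst

# Made-up data: Y = 1 + 2 X1 + 3 X2 plus a small disturbance
x_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
noise = [0.5, -0.4, -0.3, 0.2, 0.1, -0.1]
y = [1 + 2 * x1 + 3 * x2 + e for (x1, x2), e in zip(x_rows, noise)]
press = press_statistic(x_rows, y)
r2p = r2_pred(x_rows, y)
print(press, r2p)
```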
Generalization to K Independent Variables
– Use n-K-1 in place of n-3 as the DF for t.
– Use K for the numerator DF and n-K-1 for the denominator DF for F.
Diagnostics
– Assumptions for disturbance
– Multi-collinearity
– Outliers and influential observations
Problematic Data Conditions
– Regression coefficients are sensitive to:
  – Highly collinear independent variables
  – Contamination by outliers and influential observations
Detecting Outliers and Influential Data
– Outliers
  – Leverage (X-space): distance from the mean
  – Tresid (Y-space): forecasting error
– Influential data
  – Idea: with / without comparison
  – Cook's D
  – Dfbetas
  – Dfits
Modeling Techniques
– Transformation of variables
  – Log
  – Others
– Using dummy variables
  – Symbolic representation
  – Dummy variables for qualitative variables
– Using scores for ordinal variables
– Selection of independent variables
  – Forecasting
  – Computer intensive
  – Analysis of correlation structure of independent variables
Dummy Variables
– DK = “If (X=k,1,0)”
– Can be used for nominal and also ordinal variables
– # of DK = c-1, where c is the number of categories
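A small sketch of the dummy coding rule, assuming the last category listed is treated as the baseline and therefore gets no column of its own. The variable `region` and its categories are made up.

```python
def dummy_columns(values, categories):
    """Return c-1 indicator columns; DK is 1 where X == k and 0 otherwise.
    The last category in `categories` is the baseline and is left out."""
    return {k: [1 if v == k else 0 for v in values] for k in categories[:-1]}

region = ["east", "west", "north", "east", "north"]   # hypothetical nominal variable
dummies = dummy_columns(region, ["east", "west", "north"])
print(dummies)
```

Three categories give two dummy columns; an observation in the baseline category has 0 in both.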
Using Scores for Ordinal Variable
– Scoring systems
  – 1, 2, 3, …, c
  – -2, -1, 0, 1, 2 (c odd)
Implications of Variable Selection
Selection of Variables - 1
– Backward elimination: start from all X's and remove variables by t-test
– Stepwise (forward) inclusion: best simple, best two variables, best … variables, each step chosen by max increase in $R^2$
– Final regression
Selection of Variables - 2
– All possible regressions with K independent variables
  – K simple
  – K(K-1)/2 two-variable
  – …
  – 1 K-variable
– Final regression
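A sketch of how the candidate models for all possible regressions can be enumerated, assuming K = 4 predictors with placeholder names. It only lists the subsets and counts them by size; fitting each one is left out.

```python
from itertools import combinations
from math import comb

def all_subsets(names):
    """Every non-empty subset of the predictors, smallest subsets first."""
    return [subset
            for size in range(1, len(names) + 1)
            for subset in combinations(names, size)]

names = ["X1", "X2", "X3", "X4"]          # placeholder predictor names
subsets = all_subsets(names)
counts = {size: comb(len(names), size) for size in range(1, len(names) + 1)}
print(len(subsets), counts)
```

The total is 2^K - 1 regressions, which is why this approach is computer intensive once K grows.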
Selection Criteria
– $R^2$
– Adj. $R^2$
– $R^2_{PRED}$
– $s_e$
– $C_p$
$C_p$ (p = # of coefficients)
– Select a combination with $C_p$ close to p
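A sketch of the criterion, assuming the usual definition of Mallows' $C_p$: the subset model's SSE divided by the MSE of the full model, minus (n - 2p), with p counting the intercept. The formula itself does not appear in the text above, and the numbers are made up.

```python
def mallows_cp(sse_subset, p, mse_full, n):
    """C_p = SSE_p / MSE_full - (n - 2p), where p counts the intercept."""
    return sse_subset / mse_full - (n - 2 * p)

n = 20
sse_full, p_full = 30.0, 5                 # made-up full model: 4 predictors + intercept
mse_full = sse_full / (n - p_full)
cp_full = mallows_cp(sse_full, p_full, mse_full, n)
cp_subset = mallows_cp(36.0, 3, mse_full, n)   # made-up subset with 2 predictors
print(cp_full, cp_subset)
```

By construction the full model always has $C_p$ equal to its own p, so the comparison is informative only for the smaller subsets.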
What to Look for in Good Regression?
– Remember the three functions of regression
  – Estimation of the effect of each X
  – Explaining the variability of Y
  – Forecasting
– Population regressions are assumptions
  – They need testing
– Data might be contaminated
Extensions For Other Variable Types of Y
Types of Variable
– Quantitative
  – Continuous
  – Discrete (counting)
– Qualitative
  – Ordinal
  – Nominal
Generalized Linear Models (GLM)
– Regression model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e$, with $e$ following $N(0, \sigma_e)$
– GLM formulation:
  1. Model for Y: Y is $N(\mu, \sigma_e)$
  2. Model for predictors (link function): $\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
Forecasting Counting Data
– Model for Y: Poisson distribution ($\mu$)
– Link function: $\log \mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
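A sketch of forecasting a count, assuming the log link that is the usual choice for a Poisson model, so that the mean is the exponential of the linear predictor. The coefficients are hypothetical, not estimates from any data.

```python
import math

def poisson_mean(beta, x):
    """Mean count under a log link: mu = exp(b0 + b1*x1 + b2*x2)."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(eta)

def poisson_pmf(k, mu):
    """P{Y = k} for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

beta = [0.0, 0.5, -0.5]                    # hypothetical coefficients
mu = poisson_mean(beta, [2.0, 2.0])        # linear predictor is 0 here
probs = [poisson_pmf(k, mu) for k in range(5)]
print(mu, probs)
```

The point forecast is the mean `mu`, and the probabilities give the forecast distribution over the possible counts.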