Exploring Ridge Regression
Case Study: Predicting Mortality Rates
Group E: Anna Cojocaru, Ellen Bryer, Irina Matcas
Project Overview and Data
Purpose: Understand the use of ridge regression in comparison with multiple linear regression (MLR).
Response variable: mortality rate.
Explanatory variables: socioeconomic, weather, and pollution-related factors.
Observational units: 60 US cities in 1963.
Data: McDonald, G.C. and Schwing, R.C. (1973), “Instabilities of regression estimates relating air pollution to mortality”.
We use this example to discuss ridge regression and explain why it is a good technique for this kind of data.
Problems with Data
Multicollinearity: several explanatory variables are highly correlated. Checked with a VIF (variance inflation factor) test.
Because pollution measurements are strongly correlated with one another, it is often valuable to investigate several highly correlated (non-orthogonal) variables simultaneously.
This study addresses the chronic effects of pollution, measured with hydrocarbons, nitrogen oxides (NOx), and sulfur dioxide.
Note: other pollutants are highly correlated with each of these, so establishing cause and effect is not the goal.
The study uses observational data because of sampling issues and because the distribution of all exposures cannot be characterized by a single variable.
The pollution data are calculated relative pollution potentials for each metropolitan area (amounts, which change by city, and dispersion factors, which are constant across cities), recorded together with weather factors.
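The VIF check mentioned above can be sketched in a few lines. This is a minimal illustration on synthetic stand-in data, not the McDonald and Schwing data; the variable names `hc`, `nox`, and `prec` and the helper `vif` are hypothetical. It relies on the fact that the VIF of predictor j, 1 / (1 - R_j²), equals the j-th diagonal entry of the inverse correlation matrix.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    # Standardize the columns so that Z'Z / n is the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = (Z.T @ Z) / X.shape[0]
    # The diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2).
    return np.diag(np.linalg.inv(corr))

# Synthetic stand-in data (an assumption for illustration, NOT the study data).
rng = np.random.default_rng(0)
n = 60
hc = rng.normal(size=n)
nox = hc + 0.1 * rng.normal(size=n)   # nearly collinear with hc
prec = rng.normal(size=n)             # unrelated to the other two

vifs = vif(np.column_stack([hc, nox, prec]))
print(dict(zip(["hc", "nox", "prec"], np.round(vifs, 1))))
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic multicollinearity; in this sketch the two nearly collinear columns land far above that level while the unrelated column stays close to 1.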
Results: Multiple Linear Regression
Best subsets using Mallows' Cp (3.62) selects: precipitation, average temperature in January, average temperature in July, years of education, percentage non-white, and sulfur dioxide pollution.
R² = 73.48%
If multicollinearity is a problem, then:
OLS regression won't give proper weight to the explanatory variables used as predictors.
OLS can yield coefficient estimates that are too large in magnitude or have the wrong sign.
Ridge regression reduces these distortions by adjusting the weights of the coefficients according to their stability. It is a way to look at relationships between variables when several are highly correlated.
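For readers unfamiliar with the selection criterion, here is a rough sketch of how Mallows' Cp can be computed for a candidate subset, using the usual form Cp = SSE_p / s² - n + 2p, where s² is the error variance estimated from the full model and p counts the subset's parameters including the intercept. The data are synthetic and the helper names `sse` and `mallows_cp` are made up for this example; this is not the computation behind the 3.62 reported above.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit that includes an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, subset):
    """Mallows' Cp for the predictors whose column indices are in `subset`."""
    n, m = X_full.shape
    s2 = sse(X_full, y) / (n - m - 1)   # error variance from the full model
    p = len(subset) + 1                 # parameters, counting the intercept
    return sse(X_full[:, subset], y) / s2 - n + 2 * p

# Synthetic data (an assumption): only columns 0 and 1 actually matter.
rng = np.random.default_rng(3)
n, m = 60, 5
X_full = rng.normal(size=(n, m))
y = 3.0 * X_full[:, 0] - 2.0 * X_full[:, 1] + rng.normal(size=n)

cp_full = mallows_cp(X_full, y, list(range(m)))  # equals m + 1 by construction
cp_true = mallows_cp(X_full, y, [0, 1])          # subset with the real signal
cp_poor = mallows_cp(X_full, y, [2, 3])          # subset missing the real signal
print(round(cp_full, 2), round(cp_true, 2), round(cp_poor, 2))
```

Subsets whose Cp is small and close to p are the ones best-subsets selection favors; a subset that omits important predictors produces a much larger Cp.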
Ridge Regression Technique
Ridge regression (RR) minimizes the RSS plus a penalty equal to k times the sum of the squared coefficients, rather than the RSS alone.
RR uses a ridge trace, a plot of the coefficient estimates against k, to show how correlation between predictors affects the estimates.
k is a tuning parameter that adjusts the coefficients of the predictors according to their stability.
k is a non-negative constant; with standardized predictors, the ridge trace is usually examined over [0, 1].
The optimal k gives more reliable coefficients.
If k = 0: the estimators are unbiased and identical to those determined by OLS.
If k > 0: the estimators are biased, shrunk toward zero relative to OLS, and have lower variance.
As k increases, the bias increases and the variance decreases. (Some references write the tuning parameter as λ.)
Ridge regression performs particularly well when there is a subset of true coefficients that are small or even zero. It doesn't do as well when all of the true coefficients are moderately large; however, in this case it can still outperform linear regression over a fairly narrow range of small k values.
Ridge regression mitigates the issue of multicollinearity.
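To make the role of k concrete, the sketch below computes the ridge estimate in closed form, (Z'Z + kI)^(-1) Z'y, where Z holds the predictors scaled so that Z'Z is their correlation matrix. The data are synthetic and the function name `ridge_coefficients` is an assumption for illustration; the coefficients are on the standardized scale, not the original units.

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Ridge estimate (Z'Z + k I)^(-1) Z'y on the correlation scale."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Center and scale so that Z'Z is the predictor correlation matrix.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))
    yc = y - y.mean()
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)

# Synthetic stand-in data (NOT the mortality data): x1 and x2 nearly collinear.
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

b_ols = ridge_coefficients(X, y, k=0.0)     # k = 0 reproduces the OLS fit
b_ridge = ridge_coefficients(X, y, k=0.14)  # k > 0 shrinks the estimates

print("k = 0.00:", np.round(b_ols, 3))
print("k = 0.14:", np.round(b_ridge, 3))
```

With k = 0 the function returns the ordinary least squares solution for the standardized predictors; for any k > 0 the overall length of the coefficient vector is smaller, which is the shrinkage described above.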
Applying Ridge to our Data
Determine the “optimal k”: k ≅ 0.14.
Perform variable selection using the ridge trace:
Eliminate stable variables with the least predictive power.
Eliminate the two highly unstable variables.
Eliminate further unstable variables (including July temperature).
Variables #12 and #13 are unstable, so they cannot hold their predictive power and need to be eliminated.
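A ridge trace is simply the set of coefficient estimates recomputed over a grid of k values. The sketch below tabulates one on synthetic data; the function name `ridge_trace`, the grid of 21 values, and the data are all assumptions for illustration and do not reproduce the trace used to choose k ≅ 0.14.

```python
import numpy as np

def ridge_trace(X, y, ks):
    """Return one row of standardized ridge coefficients per value of k."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Scale so that Z'Z is the predictor correlation matrix.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))
    yc = y - y.mean()
    ZtZ, Zty = Z.T @ Z, Z.T @ yc
    return np.array([np.linalg.solve(ZtZ + k * np.eye(p), Zty) for k in ks])

# Synthetic stand-in data (NOT the study data).
rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear pair
x3 = rng.normal(size=n)              # independent predictor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

ks = np.linspace(0.0, 1.0, 21)
trace = ridge_trace(X, y, ks)

# Print the trace as a table: one line per k, one column per coefficient.
for k, row in zip(ks, trace):
    print(f"k={k:.2f}  " + "  ".join(f"{b:8.3f}" for b in row))
```

Reading down a column shows how one coefficient moves as k grows. Coefficients that swing sharply or change sign at small k are the unstable ones, and the chosen k is typically the point where the traces settle down.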
Results: Ridge Regression
The ridge trace suggests: precipitation, average temperature in January, population density, years of education, percentage non-white, and sulfur dioxide pollution.
R² = 72.43%
MLR vs. Ridge Regression
Comparing Model Coefficients

Variable    OLS        Ridge
Intercept   1180.356   988.408
PREC        1.797      1.487
JANT        -1.484     -1.633
JULT        -2.355     n/a
EDUC        -13.619    -11.533
DENS        n/a        0.004
NONW        4.585      4.145
SO2         0.26       0.245

(n/a: the variable is not in that model.)
Ridge regression minimizes the risk of over-predicting, especially as the mortality rate increases.
Similarities: the coefficients of the variables shared by both models have the same signs.
Differences: most coefficient estimates in MLR are larger in magnitude than the ones estimated by ridge regression (due to multicollinearity); JANT is the exception.
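The comparison can be checked directly from the coefficients reported on this slide. The short sketch below takes the variables present in both models (with the sulfur dioxide coefficient labelled SO2) and computes the ratio of each ridge estimate to its OLS counterpart. Because the two models do not contain exactly the same variables, the ratios are only indicative.

```python
# Coefficients as reported in the comparison, for variables in both models.
ols = {"PREC": 1.797, "JANT": -1.484, "EDUC": -13.619, "NONW": 4.585, "SO2": 0.26}
ridge = {"PREC": 1.487, "JANT": -1.633, "EDUC": -11.533, "NONW": 4.145, "SO2": 0.245}

# Ratio > 0 means the signs agree; ratio < 1 means ridge is smaller in magnitude.
ratios = {name: ridge[name] / ols[name] for name in ols}

same_sign = all(r > 0 for r in ratios.values())
shrunk = [name for name, r in ratios.items() if 0 < r < 1]
grew = [name for name, r in ratios.items() if r > 1]

for name, r in ratios.items():
    print(f"{name}: ridge/OLS = {r:.3f}")
print("same signs:", same_sign)
print("smaller in ridge:", shrunk)
print("larger in ridge:", grew)
```

All five shared coefficients keep their sign, four of them are smaller in magnitude under ridge regression, and the January temperature coefficient is slightly larger.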