Logistic Regression KNN Ch. 14 (pp. 555-618) MINITAB User’s Guide SAS EM documentation
Regression Models with Binary Response Variable
In many applications the response variable has only two possible outcomes (0/1):
  In a study of liability insurance possession, using age of head of household, amount of liquid assets, and type of occupation of head of household as predictors, the response variable had two possible outcomes: household has liability insurance (=1) or household does not have liability insurance (=0).
  The financial status of a firm (sound status, headed toward insolvency) can be coded as 0/1.
  Blood pressure status (high blood pressure, not high blood pressure) can be coded as 0/1.
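Meaning of the Response Function for Binary Outcomes
Consider the simple linear regression model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad Y_i \in \{0, 1\},$$
for which $E\{Y_i\} = \beta_0 + \beta_1 X_i$. In this case, the expected response $E\{Y_i\}$ has a special meaning. Consider $Y_i$ to be a Bernoulli random variable:
$$P(Y_i = 1) = \pi_i, \qquad P(Y_i = 0) = 1 - \pi_i.$$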
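Meaning of the Response Function for Binary Outcomes
Using the definition of the expected value of a random variable, with $\pi_i$ denoting $P(Y_i = 1)$,
$$E\{Y_i\} = 1 \cdot \pi_i + 0 \cdot (1 - \pi_i) = \pi_i.$$
Therefore, the mean response $E\{Y_i\}$ is the probability that $Y_i = 1$ when the level of the predictor variable is $X_i$.
[Figure: E{Y} plotted against X, showing the response function E{Y} = β0 + β1X, with the level E{Y} = 1 marked on the vertical axis]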
Problems when Response Variable is Binary
1. Error terms are not normal: at each X level, the error cannot be normally distributed, since it takes only two possible values, depending on whether Y is 0 or 1.
2. Error variance is not constant: for a Bernoulli response the error variance equals E{Y}(1 - E{Y}), which is a function of X and therefore not constant.
3. Constraints on the response function: because the mean response is a probability, we need response functions that stay between 0 and 1, and a straight line does not do so over the whole range of X.
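The second and third problems can be seen with a few lines of arithmetic. The sketch below is illustrative only: the coefficients `b0` and `b1` are hypothetical values, not estimates from any data set in these slides.

```python
def linear_probability(x, b0, b1):
    """Mean response E{Y} = b0 + b1*x of a straight-line model for a 0/1 outcome."""
    return b0 + b1 * x


def bernoulli_error_variance(p):
    """Variance of a Bernoulli outcome whose success probability is p."""
    return p * (1.0 - p)


# Hypothetical coefficients, chosen only to illustrate the problems.
b0, b1 = 0.1, 0.2

for x in [0.0, 1.0, 2.0, 4.0, 5.0]:
    p = linear_probability(x, b0, b1)
    in_range = 0.0 <= p <= 1.0
    # The variance formula is only meaningful when p is a valid probability.
    var = bernoulli_error_variance(p) if in_range else None
    print(f"x={x}: E(Y)={p:.2f}, valid probability={in_range}, error variance={var}")
```

With these values the error variance changes from one X level to the next, and at x = 5 the fitted mean exceeds 1, which is not a valid probability.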
Link Functions
Cumulative distribution functions have a sigmoid shape that makes them useful as response functions for a regression model with a binary outcome. The inverse of such a distribution function, which maps the probability of the event onto the scale of the linear predictor, is called a link function. We want to choose the link function that best fits our data. Goodness-of-fit statistics can be used to compare fits obtained with different link functions.
Logistic Regression Assumption
[Figure: logit transformation]
Assumption: The logit transformation of the probabilities of the target value results in a linear relationship with the input variables.
Linear versus Logistic Regression
Linear Regression
  Target is an interval variable.
  Input variables have any measurement level.
  Predicted values are the mean of the target variable at the given values of the input variables.
Logistic Regression
  Target is a discrete (binary or ordinal) variable.
  Input variables have any measurement level.
  Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.
Interpretation of Parameter Estimates
The interpretation of the parameter estimates depends on:
  the link function
  the reference event (1 or 0)
  the reference factor levels (for numerical factors, the reference level is the smallest value)
The logit link function provides the most natural interpretation of the estimated coefficients:
  The odds of the reference event are the ratio of P(event) to P(not event).
  The estimated coefficient of a predictor (factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant.
Parametric Models
$$E(Y \mid X = x) = g(x; w)$$
Generalized Linear Model:
$$E(Y \mid X = x) = g(w_0 + w_1 x_1 + \dots + w_p x_p)$$
[Figure: training data, with axes w1 and w2]
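Logistic Regression Models
$$g^{-1}(p) = \operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$$
[Figure: logit(p), the log(odds), plotted against p from 0.0 to 1.0, with p = 0.5 marked; training data]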
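Solving the logit equation for p gives the predicted probability for a case. A minimal sketch follows; the function name and the weights used in the example are hypothetical, not fitted values.

```python
import math


def predict_probability(x, w0, w):
    """Return p such that log(p / (1 - p)) = w0 + w1*x1 + ... + wp*xp."""
    eta = w0 + sum(wj * xj for wj, xj in zip(w, x))  # linear predictor
    return 1.0 / (1.0 + math.exp(-eta))              # inverse of the logit


# Hypothetical weights and inputs, for illustration only.
w0, w = -1.5, [0.8, 0.25]
x = [1.0, 2.0]

p = predict_probability(x, w0, w)
print(p, math.log(p / (1.0 - p)))  # the second value recovers the linear predictor
```

When the linear predictor is 0 the predicted probability is 0.5, matching the midpoint of the logit curve.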
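Changing the Odds
$$\log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$$
A unit change in $x_1$ changes the probability from $p$ to $p'$:
$$\log\left(\frac{p'}{1 - p'}\right) = w_0 + w_1 (x_1 + 1) + \dots + w_p x_p = w_1 + w_0 + w_1 x_1 + \dots + w_p x_p = w_1 + \log\left(\frac{p}{1 - p}\right)$$
so the odds are multiplied by the odds ratio $\exp(w_1)$:
$$\frac{p'}{1 - p'} = \exp(w_1) \times \frac{p}{1 - p}$$
[Figure: training data]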
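The odds-ratio relationship can be checked numerically. The sketch below uses hypothetical weights and inputs; only the relationship between the two odds values matters.

```python
import math


def odds(x, w0, w):
    """Odds p / (1 - p) under a logistic model with intercept w0 and weights w."""
    eta = w0 + sum(wj * xj for wj, xj in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-eta))
    return p / (1.0 - p)


# Hypothetical weights and inputs, for illustration only.
w0, w = -1.0, [0.7, -0.3]
x = [2.0, 1.0]
x_plus_one = [3.0, 1.0]  # x1 increased by one unit, x2 held constant

ratio = odds(x_plus_one, w0, w) / odds(x, w0, w)
print(ratio, math.exp(w[0]))  # the two values agree up to rounding
```

The ratio does not depend on the starting value of x1 or on the other inputs, which is why exp(w1) can be reported as a single odds ratio per predictor.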
Regression diagnostics – Residual Analysis
The Home Equity Loan Case HMEQ Overview Determine who should be approved for a home equity loan. The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
HMEQ The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections). The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.
Original HMEQ data
Name      Model Role   Measurement Level   Description
BAD       Target       Binary              1=defaulted on loan, 0=paid back loan
REASON    Input        Binary              HomeImp=home improvement, DebtCon=debt consolidation
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
VALUE     Input        Interval            Value of current property
DEBTINC   Input        Interval            Debt-to-income ratio
YOJ       Input        Interval            Years at present job
DEROG     Input        Interval            Number of major derogatory reports
CLNO      Input        Interval            Number of trade lines
DELINQ    Input        Interval            Number of delinquent trade lines
CLAGE     Input        Interval            Age of oldest trade line in months
NINQ      Input        Interval            Number of recent credit inquiries
HMEQ: Modeling Goal The credit scoring model computes a probability of a given loan applicant defaulting on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.
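The decision rule amounts to a comparison against the threshold. A minimal sketch, in which the function name, the "approve" label for the remaining applicants, the probabilities, and the threshold value are all hypothetical:

```python
def recommend(default_probabilities, threshold):
    """Recommend rejection when the predicted default probability exceeds the threshold."""
    return [
        "reject" if p > threshold else "approve"  # "approve" is an assumed label
        for p in default_probabilities
    ]


# Hypothetical predicted probabilities of default and a hypothetical threshold.
scores = [0.05, 0.20, 0.65]
decisions = recommend(scores, threshold=0.5)
print(decisions)
```

Lowering the threshold rejects more applicants; where to set it depends on the relative cost of a default versus a lost good customer, which the slide leaves open.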
HMEQ: two added variables
For model comparison purposes, we added two variables:
  BEHAVIOR (good/bad), which precisely mirrors the 0/1 values in BAD, to see how a model predicts BAD perfectly when given insider information
  FLIPCOIN (Head/Tail), which is completely random, to see whether BAD can be predicted from random flips of a coin
Introducing SAS Enterprise Miner v.5.3 Enterprise-grade (and expensive!) Data Mining package Implemented Methodology: Sample-Explore-Modify-Model-Assess (SEMMA) Available Modeling Tools: Logistic Regression Many others, such as Decision Trees, Neural Networks, Clustering, Market-Basket, etc.
Analysis of HMEQ in SAS EM
Three Logistic Regression nodes were added to the analysis diagram. To compare them, a Compare node was added.
SAS EM 4.3: A more accessible version Accessible through base SAS at UNT CoB Start SAS 9.3. From the SAS menu bar, select Solutions > Analysis > Enterprise Miner
Logistic Regression results (all predictors)
Logistic Regression results (stepwise, final model)
Interpretation of Odds Ratio results
Predictors associated with higher odds of default on the loan (odds ratio estimate > 1): DEBTINC, DELINQ, DEROG, NINQ
Predictors associated with lower odds of default on the loan (odds ratio estimate < 1): CLNO, YOJ
Model Comparison
  Perfect Regression is, well, perfect.
  In Baseline Regression, 20% of the borrowers default, regardless of fitted value.
  Stepwise Regression is somewhere between the other two models.
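As a rough check on the 20% figure, suppose the Baseline Regression behaves like a model with no usable predictor (an assumption, not something the slides state). An intercept-only logistic model then assigns every applicant the overall default rate. The counts are those given in the HMEQ description (1,189 defaults among 5,960 loans); the variable names are illustrative.

```python
import math

n_loans = 5960      # loans in the HMEQ data set
n_defaults = 1189   # loans with BAD = 1

base_rate = n_defaults / n_loans

# For an intercept-only logistic model, the maximum likelihood estimate of the
# intercept is the logit of the sample proportion.
w0 = math.log(base_rate / (1.0 - base_rate))

# Every applicant receives the same fitted probability.
fitted_probability = 1.0 / (1.0 + math.exp(-w0))

print(round(base_rate, 4), round(w0, 4), round(fitted_probability, 4))
```

A model that cannot separate good from bad borrowers can do no better than this constant, which is why the default rate stays near 20% regardless of fitted value.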