Logistic Regression KNN Ch. 14 (pp. 555-618) MINITAB User’s Guide


Logistic Regression
- KNN Ch. 14 (pp. 555-618)
- MINITAB User’s Guide
- SAS EM documentation

Regression Models with Binary Response Variable
In many applications the response variable has only two possible outcomes, coded 0/1:
- In a study of liability insurance possession, using age of head of household, amount of liquid assets, and type of occupation of head of household as predictors, the response variable had two possible outcomes: household has liability insurance (=1) or household does not have liability insurance (=0).
- The financial status of a firm (sound status, headed toward insolvency) can be coded as 0/1.
- Blood pressure status (high blood pressure, not high blood pressure) can be coded as 0/1.

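Meaning of the Response Function for Binary Outcomes
Consider the simple linear regression model
\[ Y_i = b_0 + b_1 X_i + \varepsilon_i, \qquad Y_i \in \{0, 1\}. \]
In this case, the expected response $E\{Y_i\} = b_0 + b_1 X_i$ has a special meaning. Consider $Y_i$ to be a Bernoulli random variable:
\[ P(Y_i = 1) = \pi_i, \qquad P(Y_i = 0) = 1 - \pi_i. \]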

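Meaning of the Response Function for Binary Outcomes
Using the definition of expected value of a random variable,
\[ E\{Y_i\} = 1 \cdot P(Y_i = 1) + 0 \cdot P(Y_i = 0) = P(Y_i = 1). \]
Therefore, the mean response E{Yi} is the probability that Yi = 1 when the level of the predictor variable is Xi.
[Figure: the response function E{Y} = b0 + b1X plotted against X, with the level E{Y} = 1 marked on the vertical axis.]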

Problems when Response Variable is Binary
1. Error terms are not normal: at each X level, the error cannot be normally distributed, since it takes only two possible values, depending on whether Y is 0 or 1.
2. Error variance is not constant: the error variance equals E{Y}(1 - E{Y}), which is a function of X and therefore not constant.
3. Constraints on the response function: because the mean response is a probability, we need response functions that stay between 0 and 1, and a straight line cannot guarantee that.
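Problems 2 and 3 can be checked numerically. The sketch below fits an ordinary least-squares line to a small binary data set and inspects the fitted values. The data and the helper names `fit_simple_ols` and `bernoulli_variance` are made up for illustration and are not part of the slides.

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates (b0, b1) for the line E{Y} = b0 + b1*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def bernoulli_variance(p):
    """Variance of a 0/1 response whose probability of 1 is p."""
    return p * (1 - p)


# Hypothetical toy data (assumption, not from the slides).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_simple_ols(xs, ys)
fitted = [b0 + b1 * x for x in xs]

# Problem 3: the fitted "probabilities" at the extremes leave [0, 1].
print(min(fitted), max(fitted))

# Problem 2: the variance depends on the mean response, and hence on X.
for p in (0.1, 0.5, 0.9):
    print(p, bernoulli_variance(p))
```

With this toy data the smallest fitted value is below 0 and the largest is above 1, which is the constraint problem a linear response function runs into.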

Link Functions
Cumulative distribution functions have a sigmoid shape that makes them useful as the response function of a regression model with a binary outcome. The inverse of such a function, which maps the probability back to the linear predictor, is called a link function. We want to choose a link function that best fits our data. Goodness-of-fit statistics can be used to compare fits using different link functions.
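To make the sigmoid shape concrete, here is a minimal sketch of three commonly used response functions and the logit link. The slide does not say which links were compared, so the choice of logit, probit, and complementary log-log is an assumption, and the function names are illustrative.

```python
import math


def logit(p):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Logistic distribution function (inverse of the logit link)."""
    return 1 / (1 + math.exp(-eta))


def inv_probit(eta):
    """Standard normal distribution function (inverse of the probit link)."""
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))


def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1 - math.exp(-math.exp(eta))


# Each response function rises from near 0 to near 1 as the linear
# predictor eta increases, so predictions stay inside (0, 1).
for eta in (-3, -1, 0, 1, 3):
    print(eta, inv_logit(eta), inv_probit(eta), inv_cloglog(eta))
```

The logistic and normal curves are symmetric about a probability of 0.5, while the complementary log-log curve is not, which is one reason fits can differ between links.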

Logistic Regression Assumption
The logit transformation of the probabilities of the target value results in a linear relationship with the input variables.

Linear versus Logistic Regression
Aspect | Linear Regression | Logistic Regression
Target | Interval variable | Discrete (binary or ordinal) variable
Input variables | Any measurement level | Any measurement level
Predicted values | Mean of the target variable at the given values of the input variables | Probability of a particular level(s) of the target variable at the given values of the input variables

Interpretation of Parameter Estimates
The interpretation of the parameter estimates depends on:
- the link function
- the reference event (1 or 0)
- the reference factor levels (for numerical factors, the reference level is the smallest value)
The logit link function provides the most natural interpretation of the estimated coefficients:
- The odds of a reference event are the ratio of P(event) to P(not event).
- The estimated coefficient of a predictor (factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant.

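Parametric Models
\[ E(Y \mid X = x) = g(x; w) \]
Generalized Linear Model:
\[ g^{-1}\bigl(E(Y \mid X = x)\bigr) = w_0 + w_1 x_1 + \dots + w_p x_p \]
[Figure: parameter axes w1 and w2, with the model fitted to the training data.]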

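Logistic Regression Models
\[ g^{-1}(p) = \operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p \]
[Figure: logit(p), the log(odds), plotted against p from 0.0 to 1.0, with p = 0.5 marked; model fitted to the training data.]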

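Changing the Odds
\[ \log\left(\frac{p}{1 - p}\right) = w_0 + w_1 x_1 + \dots + w_p x_p \]
Increasing $x_1$ by one unit changes the probability from $p$ to $p'$:
\[ \log\left(\frac{p'}{1 - p'}\right) = w_0 + w_1 (x_1 + 1) + \dots + w_p x_p = w_1 + w_0 + w_1 x_1 + \dots + w_p x_p \]
so the odds are multiplied by the odds ratio:
\[ \frac{p' / (1 - p')}{p / (1 - p)} = \exp(w_1) \]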
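The odds-ratio identity can be verified numerically. The following is a small sketch with made-up weights and inputs (all values are hypothetical). It compares the odds before and after raising the first input by one unit.

```python
import math


def predicted_probability(intercept, weights, inputs):
    """p = 1 / (1 + exp(-(w0 + w1*x1 + ... + wp*xp)))."""
    eta = intercept + sum(w * x for w, x in zip(weights, inputs))
    return 1 / (1 + math.exp(-eta))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)


# Hypothetical fitted model (assumption, not from the slides).
w0 = -1.0
w = [0.7, -0.3]

x = [2.0, 1.0]
x_plus_one = [3.0, 1.0]  # first input raised by one unit, second unchanged

p = predicted_probability(w0, w, x)
p_prime = predicted_probability(w0, w, x_plus_one)

odds_ratio = odds(p_prime) / odds(p)
print(odds_ratio, math.exp(w[0]))  # the two values agree
```

The ratio does not depend on the starting values of the inputs, which is why a single odds ratio per predictor can summarize its effect.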

Regression diagnostics – Residual Analysis

The Home Equity Loan Case
HMEQ Overview
- Determine who should be approved for a home equity loan.
- The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.
- The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.

HMEQ
The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).
The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

Original HMEQ data
Name | Model Role | Measurement Level | Description
BAD | Target | Binary | 1=defaulted on loan, 0=paid back loan
REASON | Input | Binary | HomeImp=home improvement, DebtCon=debt consolidation
JOB | Input | Nominal | Six occupational categories
LOAN | Input | Interval | Amount of loan request
MORTDUE | Input | Interval | Amount due on existing mortgage
VALUE | Input | Interval | Value of current property
DEBTINC | Input | Interval | Debt-to-income ratio
YOJ | Input | Interval | Years at present job
DEROG | Input | Interval | Number of major derogatory reports
CLNO | Input | Interval | Number of trade lines
DELINQ | Input | Interval | Number of delinquent trade lines
CLAGE | Input | Interval | Age of oldest trade line in months
NINQ | Input | Interval | Number of recent credit inquiries

HMEQ: Modeling Goal
The credit scoring model computes the probability that a given loan applicant will default on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.
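The decision rule can be sketched in a few lines. The threshold value, the applicant labels, and the probabilities below are hypothetical, since the slides do not state which threshold was chosen.

```python
def recommend(prob_default, threshold):
    """Recommend rejection when the default probability exceeds the threshold."""
    return "reject" if prob_default > threshold else "approve"


# Hypothetical threshold and scored applicants (assumptions for illustration).
threshold = 0.20
scored_applicants = {"A": 0.05, "B": 0.35, "C": 0.20}

decisions = {name: recommend(p, threshold)
             for name, p in scored_applicants.items()}
print(decisions)
```

Because the rule rejects only probabilities in excess of the threshold, an applicant scored exactly at the threshold is approved. Lowering the threshold rejects more applicants, trading fewer defaults for fewer approved loans.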

HMEQ: two added variables
For model comparison purposes, we added two variables:
- BEHAVIOR (good/bad), which precisely mirrors the 0/1 values in BAD, to see how we can perfectly predict BAD using insider information
- FLIPCOIN (Head/Tail), which is completely random, to see if we can predict BAD using random flips of a coin

Introducing SAS Enterprise Miner v.5.3
- Enterprise-grade (and expensive!) data mining package
- Implemented methodology: Sample-Explore-Modify-Model-Assess (SEMMA)
- Available modeling tools: Logistic Regression, and many others, such as Decision Trees, Neural Networks, Clustering, Market-Basket, etc.

Analysis of HMEQ in SAS EM
Three Logistic Regression nodes were added to the Analysis Diagram. To compare them, a Compare node was added.

SAS EM 4.3: A more accessible version
- Accessible through base SAS at UNT CoB.
- Start SAS 9.3.
- From the SAS menu bar, select Solutions > Analysis > Enterprise Miner.

Logistic Regression results (all predictors)

Logistic Regression results (stepwise, final model)

Interpretation of Odds Ratio results
- Predictors associated with an increase in the probability of default on the loan (odds ratio estimate > 1): DEBTINC, DELINQ, DEROG, NINQ
- Predictors associated with a decrease in the probability of default on the loan (odds ratio estimate < 1): CLNO, YOJ

Model Comparison
- Perfect Regression is, well, perfect.
- In Baseline Regression, 20% of the borrowers default, regardless of fitted value.
- Stepwise Regression is somewhere between the other two models.
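The 20% figure for the baseline can be reproduced from the counts given in the HMEQ description. The sketch below assumes the baseline model carries no useful information, so that it behaves like an intercept-only logistic model. That is an interpretation, not something the slides state, and the function name is illustrative.

```python
import math

# Counts from the HMEQ description: 1,189 adverse outcomes in 5,960 loans.
n_loans = 5960
n_bad = 1189

base_rate = n_bad / n_loans

# For an intercept-only logistic model, the maximum likelihood intercept is
# the logit of the observed default rate (assumed stand-in for the baseline).
intercept = math.log(base_rate / (1 - base_rate))


def baseline_probability():
    """Predicted default probability, identical for every applicant."""
    return 1 / (1 + math.exp(-intercept))


print(round(base_rate, 4), baseline_probability())
```

Every applicant receives the same predicted probability, so no threshold can separate good borrowers from bad ones, which is why this model serves as the lower reference point in the comparison.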