Logistic Regression CSC 600: Data Mining Class 14

Today… Logistic Regression

Logistic Regression In standard linear regression, the response is a continuous variable: Y = \beta_0 + \beta_1 X + \epsilon. In logistic regression, the response is qualitative (categorical).

Logistic Regression There are lots of “prediction” problems with binary responses: success or failure; spam or “not spam”; Republican or Democrat; “thumbs up” or “thumbs down”; 1 or 0.

Can a Qualitative Response be Converted into a Quantitative Response? Iris Dataset: Qualitative Response: {Setosa, Virginica, Versicolor} Encode a quantitative response, e.g., Y = 1, 2, or 3 for the three classes, then fit a linear regression model using least squares.
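A minimal sketch of this (flawed) approach, assuming scikit-learn's bundled iris data; the integer codes 1, 2, 3 are an arbitrary illustrative choice:

from sklearn import datasets, linear_model

iris = datasets.load_iris()
X = iris.data                 # four numeric measurements per flower
y = iris.target + 1           # arbitrarily encode the three species as 1, 2, 3
reg = linear_model.LinearRegression().fit(X, y)
print(reg.predict(X[:5]))     # continuous outputs, not class labels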

Can a Qualitative Response be Converted into a Quantitative Response? What’s the problem? The encoding implies an ordering of the Iris classes: it asserts that the difference between Setosa and Virginica is the same as the difference between Virginica and Versicolor, and that the difference between Setosa and Versicolor is the greatest.

Can a Qualitative Response be Converted into a Quantitative Response? Two different encodings produce two different linear models, which would lead to different predictions for the same test instance.

Can a Qualitative Response be Converted into a Quantitative Response? Another example: Response: {Mild, Moderate, Severe} Encoding is fine if the gap between Mild and Moderate is about the same as the gap between Moderate and Severe.

Can a Qualitative Response be Converted into a Quantitative Response? In general, there is no natural way to convert a qualitative response with more than two levels into a quantitative response. A binary response is less problematic: encode, say, NoDefault = 0 and Default = 1, fit by least squares, and predict Default if ŷ ≥ 0.5, else NoDefault. But predicted values may lie outside the range [0, 1], so they are not probabilities.
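A sketch of the binary case, assuming a pandas DataFrame named default with a Yes/No 'default' column and a numeric 'balance' column (the column names follow the Default dataset used later in these slides):

import numpy as np
from sklearn import linear_model

y = (default['default'] == 'Yes').astype(int)      # NoDefault = 0, Default = 1
X = default['balance'].values.reshape(-1, 1)
reg = linear_model.LinearRegression().fit(X, y)
y_hat = reg.predict(X)
pred = np.where(y_hat >= 0.5, 'Default', 'NoDefault')
print(y_hat.min(), y_hat.max())                    # fitted values can fall outside [0, 1]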

Logistic Model Logistic regression models the probability that Y belongs to a particular category. For the Default dataset, model the probability of Default given Balance: p(balance) = \Pr(\text{default} = \text{Yes} \mid \text{balance}). Values of p(balance) will range between 0 and 1.

Logistic Model If p(default | balance) > 0.5, predict Default=Yes. … or, for a conservative company, lower the threshold and predict Default=Yes whenever p(default | balance) > 0.1.

Linear Model vs. Logistic Model

import matplotlib.pyplot as plt

# Straight-line model: y = B0 + B1 * x
B0 = 1.5
B1 = 0.25
x = list(range(-50, 51))
y = [B1 * xi + B0 for xi in x]
plt.plot(x, y, color='blue', linewidth=3)
plt.show()
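For comparison, a minimal sketch (not in the original slide) of the logistic curve on the same grid, using the same illustrative coefficients:

import numpy as np
import matplotlib.pyplot as plt

B0, B1 = 1.5, 0.25                           # same illustrative coefficients as above
xs = np.linspace(-50, 50, 200)
p = 1.0 / (1.0 + np.exp(-(B0 + B1 * xs)))    # logistic function: output stays in (0, 1)
plt.plot(xs, p, color='red', linewidth=3)
plt.show()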

Linear Model vs. Logistic Model With a linear model, we can always predict p(X) < 0 for some values of X and p(X) > 1 for others (unless X has a limited range). Logistic function: outputs between 0 and 1 for all values of X. (Many functions meet this criterion; logistic regression uses the logistic function, shown on the next slide.)

Logistic Function p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. Rearranging (algebra) gives the odds: \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}. Values of the odds close to 0 indicate very low probabilities; high odds indicate very high probabilities. For example, on average 1 in 5 people with odds of 1/4 will default, since p(X) = 0.2 gives odds of 0.2/(1 - 0.2) = 1/4. Taking the log (more algebra) gives the log-odds, or logit: \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X. “The logit is linear in X.”

Interpretation Linear: “B1 gives the average change in Y with a one-unit increase in X.” The relationship is constant (a straight line). Logistic: “Increasing X by one unit changes the log-odds by B1 (equivalently, multiplies the odds by e^{B1}).” The relationship is not a straight line: the amount that p(X) changes depends on the current value of X.

Estimating the Regression Coefficients (yikes, math stuff) Usually the method of maximum likelihood is used (instead of least squares). The reasoning is beyond the scope of this course; scikit-learn calculates the “best” coefficients automatically for us.
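For reference, a standard statement of what maximum likelihood maximizes (not shown on the original slide): the coefficients \hat{\beta}_0, \hat{\beta}_1 are chosen to maximize the likelihood

\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \left(1 - p(x_{i'})\right)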

“Default” Dataset Simulated toy dataset: 10,000 observations, 4 variables. Default: {Yes, No} – whether the customer defaulted on their debt. Student: {Yes, No} – whether the customer is a student. Balance: average credit card (CC) balance. Income: customer income.

“Default” dataset Since the p-value of the Balance coefficient is tiny, the association between Balance and the probability of Default is statistically significant. “Since B1 = 0.0047, an increase in balance is associated with an increase in the probability of default.”

from sklearn import linear_model

reg = linear_model.LogisticRegression()
# .values.reshape(-1, 1) turns the single column into the 2-D array scikit-learn expects
reg.fit(default['balance'].values.reshape(-1, 1), default['default'])

Coefficients: [[ 0.00478248]]
Intercept: [-9.46506555]

A one-unit increase in balance is associated with an increase in the log-odds of default by 0.0047 units.

“Default” dataset Making Predictions:
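A minimal sketch of making predictions with the fitted model reg from the previous slide; the balance values 1000 and 2000 are illustrative assumptions:

import numpy as np

new_balances = np.array([[1000.0], [2000.0]])
probs = reg.predict_proba(new_balances)   # columns ordered as in reg.classes_
print(reg.classes_)                       # e.g. ['No' 'Yes']
print(probs[:, 1])                        # estimated P(default = Yes | balance)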

“Default” dataset Do students have a higher chance of default?
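One way to check, sketched under the assumption that default['student'] holds Yes/No strings: fit a one-predictor model on a student dummy variable (this produces the student-only coefficients shown later under Model Comparison).

import pandas as pd
from sklearn import linear_model

# drop_first=True keeps only the 'student_Yes' column as the dummy
student_dummy = pd.get_dummies(default['student'], prefix='student', drop_first=True)
reg_student = linear_model.LogisticRegression()
reg_student.fit(student_dummy, default['default'])
print(reg_student.coef_, reg_student.intercept_)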

Multiple Logistic Regression How do we predict a binary response using multiple predictors? Generalize the logistic function to p predictors: p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. Maximum likelihood can again be used to estimate the p+1 coefficients.

“Default” dataset Full model using all three predictor variables. RESPONSE: Default {Yes, No} – whether the customer defaulted on their debt. PREDICTORS (dummy variable used for Student): Student {Yes, No} – whether the customer is a student; Balance – average CC balance; Income – customer income.
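A sketch of fitting the full model in scikit-learn, again assuming the default DataFrame from the earlier slides:

import pandas as pd
from sklearn import linear_model

# Rebuild the student dummy inline and join it with the numeric predictors
dummies = pd.get_dummies(default['student'], prefix='student', drop_first=True)
X = pd.concat([default[['balance', 'income']], dummies], axis=1)
reg_full = linear_model.LogisticRegression().fit(X, default['default'])
print(X.columns)                          # ['balance', 'income', 'student_Yes']
print(reg_full.coef_, reg_full.intercept_)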

Full Model The coefficient for the dummy variable student is negative, indicating that students are less likely to default than non-students.

Predictor Names: Index(['balance', 'income', 'student_Yes'], dtype='object')
Coefficients: [[ 4.07585063e-04 -1.25881574e-04 -2.51032533e-06]]
Intercept: [ -1.94180590e-06]

(Note: scikit-learn's LogisticRegression applies L2 regularization by default, which is why its coefficients differ from the unpenalized R glm fit below.)

> glm.fit <- glm(default ~ balance+income+student, data=Default, family=binomial)
> summary(glm.fit)

Call:
glm(formula = default ~ balance + income + student, family = binomial, data = Default)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-values associated with student and balance are very small, indicating that each of these variables is associated with the probability of Default.

Model Comparison Conflicting Results?

Student-only model (scikit-learn):
Predictor Names: student_Yes
Coefficients: [[ 0.38256903]]
Intercept: [-3.48496241]

Full model (scikit-learn):
Predictor Names: Index(['balance', 'income', 'student_Yes'], dtype='object')
Coefficients: [[ 4.07585063e-04 -1.25881574e-04 -2.51032533e-06]]
Intercept: [ -1.94180590e-06]

Student-only model (R):
Call: glm(formula = default ~ student, family = binomial, data = Default)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***

Full model (R):
Call: glm(formula = default ~ balance + income + student, family = binomial, data = Default)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Conflicting Results? How can student status be associated with an increase in the probability of default when it is the only predictor, but with a decrease in the probability of default once balance is also factored in?

Interpretation Default rate of students/non-students, averaged over all values of balance. Red = student, Blue = not a student. Overall, students default at a higher rate than non-students.

Interpretation Default rate of students/non-students, as a function of balance. Red = student, Blue = not a student. For a fixed value of balance, a student is less likely to default than a non-student.

Interpretation Variables Student and Balance are correlated. Students tend to hold higher levels of debt, which is associated with a higher probability of default.

A student is riskier for default than a non-student if no information about the student’s CC balance is available. But that student is less risky than a non-student with the same CC balance.

Interpretation is Key As in linear regression, results obtained using one predictor may be vastly different from results obtained using multiple predictors, especially when there is correlation among the predictors.

Logistic Regression: >2 Response Classes Yes, two-class logistic regression models have multiple-class extensions (e.g., multinomial logistic regression), but they are beyond the scope of this course for now. We’ll use other data mining and statistical techniques instead. (See Chapter 11 of Ledolter.)
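For the curious, a hedged sketch of scikit-learn's multiclass extension on the iris data (multinomial logistic regression, sometimes called softmax regression); parameter choices here are illustrative:

from sklearn import datasets, linear_model

iris = datasets.load_iris()
clf = linear_model.LogisticRegression(multi_class='multinomial', solver='lbfgs')
clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[:5]))   # predicted class labels, no ordering imposed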

References
Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
Data Science from Scratch, 1st edition, Grus
Data Mining and Business Analytics in R, 1st edition, Ledolter
An Introduction to Statistical Learning, 1st edition, James et al.
Discovering Knowledge in Data, 2nd edition, Larose et al.
Introduction to Data Mining, 1st edition, Tan et al.