Predicting Loan Defaults


Predicting Loan Defaults
Team 6: Sibi Rajendran, Li Li, Ernest Stephenson

Project Overview
- Lending Club is a U.S. peer-to-peer lending company founded in 2006.
- We used loan data from 2007-2011.
- Predicting loan status will help Lending Club lower its default rate and become more profitable.

Literature Review
- Previous problems with logistic regression: low accuracy and low specificity.
- Potential difficulty with KNN: high computation time.
- Original data set: 42,538 observations and 110 predictors.

Data Description & Wrangling
- Trimmed the predictors from 110 to 52 by dropping columns with more than 80% missing values (NA cutoff), as in the sketch below.
- Manually reduced the predictors from 52 to 18.
- Examples of deleted predictors: zip code, employment title, description, application type, tax liens, bankruptcies.
- Examples of retained, important predictors: loan amount, interest rate, employment length, homeownership, annual income, debt-to-income ratio, and number of open accounts.
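A minimal pandas sketch of the NA-cutoff step; the file name is hypothetical and the exact threshold handling is an assumption based on the slide:

import pandas as pd

# Load the 2007-2011 Lending Club data (file name is hypothetical)
loans = pd.read_csv("LoanStats_2007_2011.csv", low_memory=False)

# Keep only columns with at most 80% missing values (110 -> 52 here)
na_frac = loans.isna().mean()
loans = loans.loc[:, na_frac <= 0.80]

# The manual cut from 52 to 18 predictors would follow, e.g.:
# loans = loans[kept_18_columns]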

Feature Engineering
- Created a binary response variable "is_bad" indicating whether a loan went bad; the original loan status had 6 classes.
- Created a new feature "time_since_first_credit": the time elapsed between the borrower's earliest credit line and the loan issue date.
- Created "perc_recv": the amount of principal received to date divided by the loan amount (a potential target-leakage concern, since this is not known when the loan is issued).
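A sketch of the three engineered features in pandas, continuing from the wrangling sketch above; the raw column names (loan_status, earliest_cr_line, issue_d, total_rec_prncp, loan_amnt) follow the usual Lending Club schema, and the set of statuses mapped to "bad" is an assumption:

# Collapse the original 6 loan-status classes into a binary target
bad_statuses = {"Charged Off", "Default",
                "Late (31-120 days)", "Late (16-30 days)"}  # assumed mapping
loans["is_bad"] = loans["loan_status"].isin(bad_statuses).astype(int)

# Time elapsed between the earliest credit line and the loan issue date
issue = pd.to_datetime(loans["issue_d"])
first_credit = pd.to_datetime(loans["earliest_cr_line"])
loans["time_since_first_credit"] = (issue - first_credit).dt.days

# Share of principal received to date (note the leakage concern above)
loans["perc_recv"] = loans["total_rec_prncp"] / loans["loan_amnt"]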

Data Exploration

Prior Work on Modeling
- Logistic regression, naive Bayes, and SVM have been tried before, each with trade-offs among accuracy, sensitivity, and specificity.
- Averaging their predictions gives a benchmark: accuracy 60%, specificity 0.4, sensitivity 0.6.

Modeling, Validation and Tuning
- Split the data 70:30 into training and testing sets.
- Tuned most available hyperparameters through grid search and/or cross-validation.
- Objective: increase both accuracy and specificity, which is what makes the model profitable for LC.
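A hedged scikit-learn sketch of the split-and-tune workflow; the grid values are illustrative, and X is assumed to be already numerically encoded:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X = loans.drop(columns=["is_bad"])  # assumes categoricals are encoded
y = loans["is_bad"]

# 70:30 split, stratified to preserve the default rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Cross-validated grid search over a small illustrative grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [5, 10, None]},
    scoring="accuracy",
    cv=5)
grid.fit(X_train, y_train)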

Trees and Random Forests
- Default parameters gave good accuracy (80%) at the cost of specificity (0.12).
- Problem: a high false positive rate; the model fails to predict enough defaults.
- Solution: tune the class weights appropriately (say, 1:3) or adjust the decision threshold.
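A sketch of both fixes in scikit-learn; the label encoding (1 = default) and the 0.3 threshold are assumptions:

from sklearn.ensemble import RandomForestClassifier

# Weight misclassified defaults 3x as heavily as misclassified good loans
rf = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 1, 1: 3},  # the 1:3 weighting from the slide
    random_state=42)
rf.fit(X_train, y_train)

# Or keep default weights and lower the decision threshold instead
default_prob = rf.predict_proba(X_test)[:, 1]
preds = (default_prob > 0.3).astype(int)  # 0.3 is illustrative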

Important Features
Some of the most important features (time since first credit is one of our engineered features):
- Revolving balance
- Time since first credit
- Issue date
- Annual income
- Sub-grade

eXtreme Gradient Boosting
- Weak classifiers + boosting: each new tree is fit to improve on the errors of the trees before it, so the ensemble improves sequentially.
- Hyperparameters tuned: number of trees, learning rate, class scaling, maximum depth, and regularization parameters.
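A sketch with the xgboost scikit-learn wrapper showing where each tuned hyperparameter enters; the values shown are illustrative, not the project's tuned settings:

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,      # number of trees (boosting rounds)
    learning_rate=0.05,    # shrinkage applied to each round
    max_depth=4,           # depth of each weak tree
    scale_pos_weight=3,    # class scaling for the imbalanced target
    reg_alpha=0.0,         # L1 regularization
    reg_lambda=1.0,        # L2 regularization
    eval_metric="logloss")
xgb.fit(X_train, y_train)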

Comparison

Results

Algorithm       Accuracy  Sensitivity  Specificity
Decision Tree   62%       0.62         0.69
Random Forest   67%       0.66         0.65
XGBoost         70%       0.58         0.80

Strengths & Weaknesses
Strengths:
- Tree-based models train quickly and are flexible.
- They scale to large datasets.
- XGBoost, the best-performing model, is fast and has many customizable parameters.
Weaknesses:
- XGBoost and random forests are not easily interpretable.
- XGBoost can be difficult to tune.

Future Scope
- Include the FICO score variable.
- Include more data from later years.
- Do more feature engineering.
- Create a web app that takes in loan parameters and outputs a credit-risk probability.

Conclusion
- Tree-based models achieved relatively better accuracy than the other models we compared.
- When Lending Club uses the model, we want it to assign each loan a default probability so LC can make its own decision, rather than having the model automatically accept or deny loans.

Questions?