Predicting Loan Defaults


Predicting Loan Defaults
Team 6: Sibi Rajendran, Li Li, Ernest Stephenson

Project Overview
- Lending Club is a U.S. peer-to-peer lending company founded in 2006.
- We used loan data from 2007-2011.
- Predicting loan status will help Lending Club lower its default rate and become more profitable.

Literature Review
- Previous problems with logistic regression: low accuracy and low specificity.
- Potential difficulty with KNN: high computation time.
- Original data set: 42,538 observations and 110 predictors.

Data Description & Wrangling
- Trimmed the predictors from 110 to 52 by dropping columns with more than 80% missing values (NA cutoff), as in the sketch below.
- Manually reduced the predictors from 52 to 18.
- Examples of deleted predictors: zip code, employment title, description, application type, tax liens, bankruptcies.
- Examples of retained, important predictors: loan amount, interest rate, employment length, homeownership, annual income, debt-to-income ratio, and number of open accounts.
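A minimal pandas sketch of the NA-cutoff step; the file name is hypothetical and the exact threshold handling is an assumption based on the slide:

import pandas as pd

# Load the 2007-2011 Lending Club data (file name is hypothetical)
loans = pd.read_csv("LoanStats_2007_2011.csv", low_memory=False)

# Keep only columns with at most 80% missing values (110 -> 52 here)
na_frac = loans.isna().mean()
loans = loans.loc[:, na_frac <= 0.80]

# The manual cut from 52 to 18 predictors would follow, e.g.:
# loans = loans[kept_18_columns]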

Feature Engineering
- Created a binary response variable "is_bad" indicating whether a loan went bad; the original loan status had 6 classes.
- Created a new feature "time_since_first_credit": the time elapsed between the borrower's earliest credit line and the loan issue date.
- Created "perc_recv": the amount of principal received to date divided by the loan amount (a potential target-leakage concern, since this is not known when the loan is issued).
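A sketch of the three engineered features in pandas, continuing from the wrangling sketch above; the raw column names (loan_status, earliest_cr_line, issue_d, total_rec_prncp, loan_amnt) follow the usual Lending Club schema, and the set of statuses mapped to "bad" is an assumption:

# Collapse the original 6 loan-status classes into a binary target
bad_statuses = {"Charged Off", "Default",
                "Late (31-120 days)", "Late (16-30 days)"}  # assumed mapping
loans["is_bad"] = loans["loan_status"].isin(bad_statuses).astype(int)

# Time elapsed between the earliest credit line and the loan issue date
issue = pd.to_datetime(loans["issue_d"])
first_credit = pd.to_datetime(loans["earliest_cr_line"])
loans["time_since_first_credit"] = (issue - first_credit).dt.days

# Share of principal received to date (note the leakage concern above)
loans["perc_recv"] = loans["total_rec_prncp"] / loans["loan_amnt"]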

Data Exploration

Prior Work on Modeling
- Logistic regression, naive Bayes, and SVM have been tried before, each with trade-offs among accuracy, sensitivity, and specificity.
- Averaging their predictions gives a benchmark: accuracy 60%, specificity 0.4, sensitivity 0.6.

Modeling, Validation and Tuning
- Split the data 70:30 into training and testing sets.
- Tuned most available hyperparameters through grid search and/or cross-validation.
- Objective: increase both accuracy and specificity, which is what makes the model profitable for LC.
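A hedged scikit-learn sketch of the split-and-tune workflow; the grid values are illustrative, and X is assumed to be already numerically encoded:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X = loans.drop(columns=["is_bad"])  # assumes categoricals are encoded
y = loans["is_bad"]

# 70:30 split, stratified to preserve the default rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Cross-validated grid search over a small illustrative grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [5, 10, None]},
    scoring="accuracy",
    cv=5)
grid.fit(X_train, y_train)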

Trees and Random Forests
- Default parameters gave good accuracy (80%) at the cost of specificity (0.12).
- Problem: a high false positive rate; the model fails to predict enough defaults.
- Solution: tune the class weights appropriately (say, 1:3) or adjust the decision threshold.
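A sketch of both fixes in scikit-learn; the label encoding (1 = default) and the 0.3 threshold are assumptions:

from sklearn.ensemble import RandomForestClassifier

# Weight misclassified defaults 3x as heavily as misclassified good loans
rf = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 1, 1: 3},  # the 1:3 weighting from the slide
    random_state=42)
rf.fit(X_train, y_train)

# Or keep default weights and lower the decision threshold instead
default_prob = rf.predict_proba(X_test)[:, 1]
preds = (default_prob > 0.3).astype(int)  # 0.3 is illustrative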

Important Features
Some of the most important features (time since first credit is one of our engineered features):
- Revolving balance
- Time since first credit
- Issue date
- Annual income
- Sub-grade

eXtreme Gradient Boosting
- Weak classifiers + boosting: each new tree is fit to improve on the errors of the trees before it, so the ensemble improves sequentially.
- Hyperparameters tuned: number of trees, learning rate, class scaling, maximum depth, and regularization parameters.
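A sketch with the xgboost scikit-learn wrapper showing where each tuned hyperparameter enters; the values shown are illustrative, not the project's tuned settings:

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,      # number of trees (boosting rounds)
    learning_rate=0.05,    # shrinkage applied to each round
    max_depth=4,           # depth of each weak tree
    scale_pos_weight=3,    # class scaling for the imbalanced target
    reg_alpha=0.0,         # L1 regularization
    reg_lambda=1.0,        # L2 regularization
    eval_metric="logloss")
xgb.fit(X_train, y_train)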

Comparison

Results

Algorithm       Accuracy  Sensitivity  Specificity
Decision Tree   62%       0.62         0.69
Random Forest   67%       0.66         0.65
XGBoost         70%       0.58         0.80

Strengths & Weaknesses
Strengths:
- Tree-based models train quickly and are flexible.
- They scale to large datasets.
- XGBoost, the best-performing model, is fast and has many customizable parameters.
Weaknesses:
- XGBoost and random forests are not easily interpretable.
- XGBoost can be difficult to tune.

Future Scope
- Include the FICO score variable.
- Include more data from later years.
- Do more feature engineering.
- Create a web app that takes in loan parameters and outputs a credit-risk probability.

Conclusion
- Tree-based models achieved relatively better accuracy than the other models we compared.
- When Lending Club uses the model, we want it to assign each loan a default probability so LC can make its own decision, rather than having the model automatically accept or deny loans.

Questions?