1
Machine Learning Interpretability
Thuy Nguyen, Mihir Jain, Edward Adcock, Toby Alfred-Jones
© COPYRIGHT | Delta Capita | CONFIDENTIAL
2
Contents
01 Project objective
02 Use cases
03 Data overview
04 Neural Network model description
05 Machine Learning Interpretability
06 Results
07 Visualization
08 Next steps
3
01. Project objective
4
Project objective
- Delta Capita has developed a Neural Network model to determine the likelihood of mortgage default, given a set of information that represents an individual loan.
- The aim of the project is to develop knowledge extraction techniques that interpret the mortgage default model.
5
02. Use cases
6
Use cases
The ability to accurately predict loan default is especially useful in two domains:
- Mortgage-backed security investing
- Risk management
7
03. Data overview
- Overview of data
- Data features
- Data features (provided)
- Data features (created)
- Data features (added)
- Model preparation
8
Overview of data
Main dataset: Freddie Mac Single-Family Loans
- Time frame: 1999 – 2016
- Size: 15.3m unique loans with 326m monthly performance updates
- Types of loans: Default: 85k; Fully Paid: 15.2m
- Ratio of Default vs Fully Paid loans: 0.6 : 99.4
Additional datasets:
- Average National Mortgage Interest Rate: monthly national interest rate for standard mortgages from January 1999 to July 2016
- Housing Price Index Per State: monthly House Price Index in each U.S. state from January 1999 to July 2016
- Unemployment Rate Per State: seasonally adjusted unemployment rate for each U.S. state from January 1999 to July 2016
9
Data features
We split the following section into 3 parts:
1. Data features provided in the main dataset
2. Data features created from the main dataset
3. Data features added from external sources
We will later evaluate the model's performance on data from Feature Sets 1, 2 & 3 (combined).
10
Data features (provided)
Origination Features: assessment at the time of the loan application
Monthly Performance Update Features: evaluation of on-going loans on a monthly basis
Origination Features:
- Credit Score
- Original Unpaid Principal Balance
- First Payment Date
- Loan-To-Value Ratio
- First Time Home Buyer Flag
- Interest Rate
- Maturity Date
- Channel of origination of a loan
- Metropolitan Statistical Area
- Prepayment Penalty Mortgage Flag
- Mortgage Insurance Percentage
- Product Type
- Number of Units in a Property
- Property Type
- Occupancy Status
- Property State
- Combined Loan-To-Value Ratio
- Loan Purpose
- Debt-To-Income (DTI) Ratio
- Original Loan Term
- Number of Borrowers
Performance Features:
- Monthly Reporting Period
- Current Actual Unpaid Principal Balance
- Loan Age
- Remaining Months to Legal Maturity
- Current Interest Rate
11
Data features (created)
Based on the history of the current loan:
- Occurrences of each Loan Status (30-dd, 60-dd, 90-dd, foreclosed, etc.)
- Occurrences of each Loan Status in the last 12 months
- Percentage change between Last Balance and Current Balance
Based on the history of all loans:
- Number of Loans (active) per State/Zip-code
- Number of Loans (taken out) per State/Zip-code
- Number of Loans (taken out) per State/Zip-code in the last 12 months
- Default Rate per State/Zip-code
- Default Rate per State/Zip-code in the last 12 months
- Occurrences of ‘Paid Off’ & ‘Default’ per State/Zip-code
- Occurrences of ‘Paid Off’ & ‘Default’ per State/Zip-code in the last 12 months
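The loan-level history features could be derived along the following lines. This is a minimal sketch, not the project's pipeline: the record layout (a list of monthly dicts, oldest first, with the hypothetical keys `status` and `balance`) and both function names are assumptions made for illustration.

```python
from collections import Counter


def status_counts_last_12(history):
    """Count each loan status over the most recent 12 monthly records."""
    return Counter(record["status"] for record in history[-12:])


def balance_pct_change(history):
    """Percentage change between the previous and the current balance."""
    if len(history) < 2 or history[-2]["balance"] == 0:
        return 0.0  # not enough history to compute a change
    last, current = history[-2]["balance"], history[-1]["balance"]
    return 100.0 * (current - last) / last


# Hypothetical monthly performance records for one loan (oldest first).
history = [
    {"status": "current", "balance": 100_000.0},
    {"status": "30-dd", "balance": 99_500.0},
    {"status": "current", "balance": 99_000.0},
]

counts = status_counts_last_12(history)
change = balance_pct_change(history)
```

The state-level and zip-code-level features would follow the same pattern, grouping all loans by location before counting.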
12
Data features (added)
Economic features:
- Monthly Unemployment Rate per State
- Monthly Housing Price Index per State
- Monthly National Interest Rate
Extra features created from the added datasets:
- Difference between Current Interest Rate and National Interest Rate
- Number of Months that Mortgage Interest Rate is less than National Interest Rate
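The two extra features built from the added datasets reduce to simple arithmetic on aligned monthly series. The sketch below assumes the loan's rate and the national rate are available as two lists covering the same months; the function names and the sample rates are hypothetical.

```python
def rate_spread(loan_rate, national_rate):
    """Difference between the loan's current rate and the national rate."""
    return loan_rate - national_rate


def months_below_national(loan_rates, national_rates):
    """Number of months in which the loan's rate was below the national rate."""
    return sum(1 for loan, national in zip(loan_rates, national_rates)
               if loan < national)


# Hypothetical rates (in percent) for four consecutive months.
loan_rates = [6.0, 6.0, 6.0, 6.0]
national_rates = [5.5, 6.0, 6.25, 6.5]

spread_now = rate_spread(loan_rates[-1], national_rates[-1])
months_below = months_below_national(loan_rates, national_rates)
```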
13
Model preparation
Class imbalance:
- Under-sampling technique applied to the training set
- New ratio of Default vs Fully Paid loans: : 85
Categorical data: one-hot encoding
- For example, if the property is in New York, the value is 1, otherwise 0
Randomly shuffle the data
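The preparation steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: rows are plain dicts with hypothetical keys, and `n_majority` (how many majority-class rows to keep) is a hypothetical parameter, since the slide does not give a usable target size.

```python
import random


def undersample(rows, label_key, majority_label, n_majority, seed=0):
    """Keep all minority rows plus a random subset of the majority class,
    then shuffle the result. Apply to the training set only."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    kept = rng.sample(majority, min(n_majority, len(majority)))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector, one slot per category."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical, heavily imbalanced training rows.
rows = ([{"state": "NY", "label": "Default"} for _ in range(3)]
        + [{"state": "CA", "label": "Fully Paid"} for _ in range(97)])

balanced = undersample(rows, "label", "Fully Paid", n_majority=17)
encoded = one_hot("NY", ["CA", "NY", "TX"])
```

Under-sampling only the training set keeps the test set at the real-world class ratio, so reported metrics still reflect the original imbalance.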
14
04. Neural Network Model Description
- Model architecture
- Performance evaluation metrics
- Model performance
15
Model Architecture
We use a Neural Network to create the Mortgage Classification model.
- Model classes: Default or Fully Paid
- Model output: any value between 0 and 1, which represents the probability of Default
- Threshold value of 0.5 (a value above 0.5 predicts a Default loan)
Model architecture:
Layers                                  Number of layers   Number of neurons
Input layer (number of loan features)   1                  133*
Hidden layer                            2                  100 : 100
Output layer (number of classes)
* Of the 133 input features, only 64 are unique loan features.
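A forward pass through a network of this shape can be sketched in plain Python. Several details are assumptions, because the slide does not state them: ReLU activations in the hidden layers, a single sigmoid output unit (chosen because the output is described as one probability of Default), and random untrained weights used purely to show the data flow.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


rng = random.Random(0)
sizes = [133, 100, 100, 1]  # input, two hidden layers, assumed single output
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes, sizes[1:])]


def predict_proba(features):
    """Probability of Default for one loan's 133 input features."""
    hidden = features
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        hidden = dense(hidden, weights, biases, sigmoid if is_output else relu)
    return hidden[0]


loan = [rng.random() for _ in range(133)]  # hypothetical encoded loan
probability = predict_proba(loan)
label = "Default" if probability > 0.5 else "Fully Paid"
```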
16
Performance Evaluation Metrics
We use 4 performance metrics:
- Accuracy: overall classification accuracy
- True Positive Rate: classification accuracy of ‘Default’ loans
- True Negative Rate: classification accuracy of ‘Fully Paid’ loans
- AUC: (True Positive Rate + True Negative Rate) / 2
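The four metrics can be computed directly from the confusion-matrix counts. The sketch below follows the slide's own definition of AUC, (TPR + TNR) / 2, which coincides with what is often called balanced accuracy. It assumes labels are encoded as 1 for ‘Default’ and 0 for ‘Fully Paid’ and that both classes appear in `y_true`; the function name is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy, TPR, TNR and the slide's AUC for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tpr = tp / (tp + fn)  # share of Default loans classified correctly
    tnr = tn / (tn + fp)  # share of Fully Paid loans classified correctly
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tpr,
        "tnr": tnr,
        "auc": (tpr + tnr) / 2,
    }


# Hypothetical labels: 1 = Default, 0 = Fully Paid.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
metrics = evaluate(y_true, y_pred)
```

Because TPR and TNR are averaged with equal weight, this score is not inflated by the 0.6 : 99.4 class imbalance the way plain accuracy can be.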
17
Model performance
Using the data from Feature Sets 1, 2 & 3 (combined):
Performance Metric                              %
Accuracy                                        98.3
% Correct ‘Default’ (True Positive Rate)        98.5
% Correct ‘Fully Paid’ (True Negative Rate)     98.2
AUC                                             98.4
18
05. Machine Learning Interpretability
- Knowledge extraction
- TREPAN
- Distilling Soft Decision Tree
- LIME
19
Knowledge extraction
Problems:
- Neural Networks: high performance, but black-box
- Decision Tree: high representation, but low performance
- Combine Neural Networks & Decision Tree to create rules that are human-comprehensible
Methods:
- Global: TREPAN, Distilling Soft Decision Tree
- Local: LIME
20
TREPAN
Key features:
- The Neural Network serves as an oracle that returns class labels
- Constructs a model of the underlying distribution of the data
- Tree expansion: best-first expansion to increase fidelity
- Splitting tests: m-of-n
- Stopping criteria:
  - Global criteria: size of the tree, highest-fidelity tree
  - Local criterion: stopping the expansion of an individual node
Key metrics:
- Accuracy
- Fidelity
- Comprehensibility
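Two of the ideas above, the m-of-n splitting test and the fidelity metric, are small enough to sketch directly. This is not a TREPAN implementation: the oracle is a hand-written stand-in for the network, and the feature names and thresholds are hypothetical values chosen for illustration.

```python
def m_of_n(sample, conditions, m):
    """True when at least m of the n boolean conditions hold for the sample."""
    return sum(1 for condition in conditions if condition(sample)) >= m


def fidelity(tree_predict, oracle_predict, samples):
    """Share of samples on which the tree agrees with the oracle."""
    agree = sum(1 for s in samples if tree_predict(s) == oracle_predict(s))
    return agree / len(samples)


# Hypothetical 2-of-3 test on three loan features.
conditions = [
    lambda s: s["credit_score"] < 620,
    lambda s: s["ltv"] > 90,
    lambda s: s["dti"] > 43,
]


def oracle_predict(sample):
    """Stand-in for the neural network's class label."""
    return "Default" if sample["credit_score"] < 600 else "Fully Paid"


def tree_predict(sample):
    """A one-node 'tree' that uses the 2-of-3 test as its only split."""
    return "Default" if m_of_n(sample, conditions, 2) else "Fully Paid"


samples = [
    {"credit_score": 580, "ltv": 95, "dti": 30},
    {"credit_score": 590, "ltv": 80, "dti": 35},
    {"credit_score": 700, "ltv": 95, "dti": 50},
    {"credit_score": 720, "ltv": 70, "dti": 20},
]
score = fidelity(tree_predict, oracle_predict, samples)
```

Fidelity compares the tree with the network, whereas accuracy compares the tree with the true labels, so the two can differ.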
21
Distilling soft decision tree
Key features:
- Mimics the input-output function of the Neural Network
- Soft targets: true label, predictions of the Neural Network
- Trained with mini-batch gradient descent
- Uses learned filters to make hierarchical decisions
- Selects a particular static probability distribution over classes as output
Key metrics:
- Accuracy
- Comprehensibility: complexity of the tree
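The inference step of a soft decision tree can be sketched as follows: each inner node applies a learned filter (weights and bias) followed by a sigmoid to decide how strongly to go right, and the output is the static class distribution of the leaf with the highest path probability. Training is omitted; the tree layout (complete binary tree in heap order), the parameter values, and the class order (index 1 = Default) are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_tree_predict(x, inner, leaves):
    """inner: (weights, bias) per inner node, heap order; leaves: class logits.
    Returns (class distribution of the most probable leaf, its path probability)."""
    n_inner = len(inner)
    reach = {0: 1.0}  # node index -> probability of reaching that node
    best_leaf, best_prob = 0, -1.0
    for node in range(n_inner + len(leaves)):
        p_here = reach[node]
        if node < n_inner:
            weights, bias = inner[node]
            p_right = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            reach[2 * node + 1] = p_here * (1.0 - p_right)  # left child
            reach[2 * node + 2] = p_here * p_right          # right child
        elif p_here > best_prob:
            best_leaf, best_prob = node - n_inner, p_here
    return softmax(leaves[best_leaf]), best_prob


# Hypothetical depth-2 tree on two features: 3 inner nodes, 4 leaves.
inner = [([2.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([0.0, -1.0], 0.0)]
leaves = [[2.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

distribution, path_probability = soft_tree_predict([1.0, -2.0], inner, leaves)
```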
22
LIME
Key features:
- Creates a local linear model around the prediction
- Assigns weights to different features in the dataset
- Computes the class probability
- Predicts the class with the highest probability
Key metrics:
- Accuracy
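The local-surrogate idea can be sketched without any library: perturb the instance, query the black box, weight each perturbed point by its proximity to the instance, and fit a weighted linear model. This is a simplified LIME-style illustration, not the `lime` package's API; the kernel, the sampling scale, the black-box function, and all names are assumptions.

```python
import math
import random


def solve(a, b):
    """Solve a . x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain_locally(black_box, instance, n_samples=500, scale=0.5,
                    width=1.0, seed=0):
    """Fit a proximity-weighted linear model around `instance`.
    Returns (intercept, per-feature weights)."""
    rng = random.Random(seed)
    d = len(instance)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in instance]  # perturbed point
        dist2 = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        weight = math.exp(-dist2 / width ** 2)             # proximity kernel
        row = [1.0] + z
        y = black_box(z)
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    coefficients = solve(xtwx, xtwy)
    return coefficients[0], coefficients[1:]


def black_box(z):
    """Hypothetical stand-in for the network's probability of Default."""
    return 1.0 / (1.0 + math.exp(-(2.0 * z[0] - z[1])))


instance = [1.0, 0.5]
intercept, weights = explain_locally(black_box, instance)
local_probability = intercept + sum(w * v for w, v in zip(weights, instance))
label = "Default" if local_probability > 0.5 else "Fully Paid"
```

The sign and size of each weight indicate how the corresponding feature pushes this particular prediction; the weights describe the model only near `instance`, not globally.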
23
06. Results
- TREPAN
- Distilling Soft Decision Tree
- LIME
24
TREPAN
Uses 400 data points
Conditions:
- Maximum number of nodes: 10
- Minimum sample: 100
Model performance:
- Accuracy: 80%
- Fidelity: 88%
25
Distilling Soft Decision Tree
Uses 400 data points
Condition:
- Maximum tree depth: 10
Model performance:
- Accuracy: 95%
26
LIME
Uses 400 data points
Example: prediction of a loan for the 5th customer
27
07. Visualization
28
Visualization - Dashboard
29
Visualization - Dashboard
30
08. Next Steps
31
Next steps
Interpretability models:
- Use the entire dataset to validate all interpretability models
- Try to interpret other Machine Learning models such as Random Forest and SVM
Commercial products:
- Develop a front-end app that is easier to use for people with no data science background
- Make the tool work irrespective of the dataset or Python libraries
- Suggest recommendations based on the results of the interpretability models