Machine learning techniques for credit risk modeling in practice
Balaton Attila
OTP Bank, Analysis and Modeling Department
2017.02.23.
"Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data…" (Source: Wikipedia)
New, complex databases needed new modeling tools
Data sources: internal databases, GIRINFO, utility companies, retailers, social networks
Complex, large database to be analyzed: wide-ranging dataset about customer behavior
Machine learning: recognise connections between different variables
Powerful models: better GINI
Why Machine Learning: to mine new, large, complex datasets
[Figure: three panels titled "The actual phenomenon", "Traditional Stats" and "Machine Learning", each with regions labeled GOOD and BAD]
Description
  Traditional stats: fits a predetermined (linear, quadratic, logarithmic) function to the data
  Machine learning: algorithms do not use a predetermined function, so they can build a model that fits the data closely
Self-learning
  Traditional stats: not available
  Machine learning: possible to some extent (variable weights can be changed automatically); regular expert supervision needed
Dataset and complexity
  Traditional stats: adequate for well-structured databases; can't handle complex, poorly structured datasets
  Machine learning: works well with small or poorly structured datasets; recognizes complex patterns
Interpretation of results
  Traditional stats: easy to interpret the results and the effect of explanatory variables
  Machine learning: model interpretation requires expertise
Hardware capacity
  Traditional stats: less computationally intensive
  Machine learning: demands more computational power
Need for interpretability
[Figure: need for interpretability (completely understandable, well comprehensible, "black box") plotted against forecast timeframe (sec, min, hour, day, week, month, year)]
New risk models are a sub-project of the Bank's Digital Strategy
Timeline milestones: Jun. 2016, Dec. 2016, Jun. 2017, Dec. 2017
- OBR application models
- Internal development of a new scorecard using AMM techniques
- OBRU, OBR application models
- OTP HU application models
- Regular trainings about the alternative techniques
- Internal development of at least two scorecards with AMM
- Trainings for subsidiaries about AMM
- AMM belongs to the folklore during model development
- Subsidiaries test AMM techniques
- External teams support the validation process
- Establishing a Python server
- Involving the OTP HU Big Data environment
- Weblog and detailed transactional data for fraud prevention
Machine Learning in practice
Machine Learning techniques cannot replace the whole "classic" model development lifecycle
From this point on, model development is a data mining task that should follow the general CRISP-DM methodology. Its steps are:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
When I first came across something that could be called data mining, I naturally skipped the first three, boring-looking steps and went straight for the data with, if I remember correctly, a neural network, or whatever I had at hand. The result was something like the 42 in The Hitchhiker's Guide to the Galaxy: something came out, but it could not really be used. Every step of the process needs its due time and energy!
We tested the following Machine Learning techniques…
Random forest: "average" of a lot of random trees. Did not perform well for scorecards.
Support Vector Machine: the goal is to find the hyperplane with optimal separating power. The extended version with kernel functions had capacity issues, so it was only used as a supporting algorithm combined with regression.
Neural network: cannot be used for real-time decision making
Boosting techniques: supervised repetition of "weak classifiers" (decision trees, regression, …) leads to a strong classifier
Main types of boosting:
- AdaBoost: underweight well-classified and overweight misclassified elements in every round
- LogitBoost: special type of AdaBoost where the loss function is logistic
- Gradient Boosting Trees: special type of LogitBoost; the loss function is decreased along the gradient
Random Forest
AdaBoost
Base classifiers: C1, C2, …, CT
In step m, find the best Cm in a predefined class using weights wi
Error rate:
Importance of a classifier:
AdaBoost Weight update: Classification:
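The formulas for the error rate, classifier importance, weight update and final classification did not survive on these slides. The sketch below therefore assumes the common textbook formulation of AdaBoost with labels in {-1, +1}: weighted error ε_m, importance α_m = ½ ln((1 − ε_m)/ε_m), weights multiplied by exp(−α_m · y_i · C_m(x_i)) and renormalized, and classification by the sign of the α-weighted vote. The one-dimensional stump learner, the function names and the toy data are illustrative assumptions, not taken from the presentation.

```python
import math

def stump_predict(threshold, polarity, x):
    """Decision stump on one feature: `polarity` if x <= threshold, else -polarity."""
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the lowest-error stump."""
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            # Weights sum to 1, so the summed weight of the mistakes is the error rate.
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # assumed importance formula
        ensemble.append((alpha, threshold, polarity))
        # Overweight misclassified points, underweight well-classified ones.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Sign of the importance-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up): no single stump separates it, three boosted stumps do.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [classify(ensemble, x) for x in xs]
```

On this toy set the first stump misclassifies three points; their weights grow, so the next rounds concentrate on them.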
AdaBoost example from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
AdaBoost example
Compute ε, α and the weights of the instances in Round 2
Main AdaBoost idea and a new idea
"Shortcomings" are identified by high-weight data points
The new model (e.g. a stump) is fit irrespective of previous predictions
In the next iteration, learn just the residual of the present model F(x):
Fit a model to (x1; y1 – F(x1)); (x2; y2 – F(x2)); …; (xn; yn – F(xn))
Regression, no longer classification!
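The residual idea can be sketched in a few lines. This is only an illustration under stated assumptions: the weak learner is a least-squares regression stump on a single feature, and the helper names and the toy data are hypothetical, not from the presentation.

```python
def fit_regression_stump(xs, targets):
    """Least-squares stump: one threshold, one constant prediction per side."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost_residuals(xs, ys, rounds):
    """Each round fits a stump to (x_i; y_i - F(x_i)) and adds it to F."""
    stumps = []

    def F(x):
        return sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - F(x) for x, y in zip(xs, ys)]
        stumps.append(fit_regression_stump(xs, residuals))
    return F

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (made up): three plateaus that one stump cannot capture.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
mse_one_round = mean_squared_error(boost_residuals(xs, ys, rounds=1), xs, ys)
mse_ten_rounds = mean_squared_error(boost_residuals(xs, ys, rounds=10), xs, ys)
```

Each new stump only has to explain what the current model still gets wrong, which is why the training error cannot increase from one round to the next.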
LogitBoost: The additive logistic regression model
Logistic regression learns a linear combination of classifiers for the "log odds-ratio"
The logit transformation guarantees that for any F(x), p(x) is a probability in [0,1].
Inverting, we get:
Function of real label y – p(x) will be instance weight
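The two formulas on this slide were lost. A plausible reconstruction, assuming F(x) is the plain log odds (the original LogitBoost paper uses half the log odds instead, which only changes a factor of 2 in the exponent), is:

```latex
F(x) = \log \frac{p(x)}{1 - p(x)}
\qquad \Longleftrightarrow \qquad
p(x) = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}
```

Whatever real value F(x) takes, the right-hand expression stays strictly between 0 and 1, which is the guarantee the slide refers to.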
LogitBoost Algorithm
Step 1: Initialization
committee function:
initial probabilities:
Step 2: LogitBoost iterations
for m=1,2,...,M repeat:
A. Fitting the weak learner:
1. Compute working response and weights for i=1,...,n
2. Fit a stump by weighted regression
Optimization is no longer for error rate but for (root) mean squared error (RMSE)
LogitBoost Algorithm
B. Updating and classifier output
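The LogitBoost steps can be sketched as follows. The concrete formulas are missing from the slides, so this assumes the usual Newton-step form with labels in {0, 1}: committee function initialized to 0, initial probabilities ½, weights w_i = p_i(1 − p_i), working response z_i = (y_i − p_i)/w_i, and F treated as the log odds so that the fitted stump is added in full. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def fit_weighted_stump(xs, zs, ws):
    """Weighted least-squares stump; returns (threshold, left value, right value)."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        sse, sides = 0.0, []
        for on_left in (True, False):
            idx = [i for i, x in enumerate(xs) if (x <= threshold) == on_left]
            w_sum = sum(ws[i] for i in idx)
            mean = sum(ws[i] * zs[i] for i in idx) / w_sum
            sse += sum(ws[i] * (zs[i] - mean) ** 2 for i in idx)
            sides.append(mean)
        if best is None or sse < best[0]:
            best = (sse, threshold, sides[0], sides[1])
    return best[1:]

def logitboost(xs, ys, rounds):
    n = len(xs)
    F = [0.0] * n   # Step 1: committee function starts at zero
    p = [0.5] * n   # Step 1: initial probabilities
    stumps = []
    for _ in range(rounds):
        # Step 2.A.1: working response and weights
        ws = [max(pi * (1 - pi), 1e-10) for pi in p]
        zs = [(yi - pi) / wi for yi, pi, wi in zip(ys, p, ws)]
        # Step 2.A.2: fit a stump by weighted regression
        threshold, left, right = fit_weighted_stump(xs, zs, ws)
        stumps.append((threshold, left, right))
        # Step 2.B: update the committee function and the probabilities
        F = [f + (left if x <= threshold else right) for f, x in zip(F, xs)]
        p = [sigmoid(f) for f in F]
    return stumps

def predict_proba(stumps, x):
    """Probability of class 1 from the summed stump outputs."""
    return sigmoid(sum(left if x <= t else right for t, left, right in stumps))

# Toy data (made up): class 1 for larger x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
stumps = logitboost(xs, ys, rounds=3)
```

The weak learner here is fit by weighted squared error on the working response, not by error rate, which matches the remark about RMSE on the slide.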
Regression Tree, Regression Stump
Regression stump example:
TIME_FROM_FIRST_SEND <= 25.5
  true: 0.14627427718697633
  false: -0.14621897522944
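Written as code, the stump is a single comparison. The mapping of the two leaf values to the true and false branches is assumed from the order in which they appear on the slide, and the function name is hypothetical.

```python
def stump_score(time_from_first_send):
    """Regression stump: one split on TIME_FROM_FIRST_SEND, two leaf values."""
    if time_from_first_send <= 25.5:
        return 0.14627427718697633   # "true" branch (assumed mapping)
    return -0.14621897522944         # "false" branch (assumed mapping)

score_low = stump_score(10.0)
score_high = stump_score(40.0)
```

In a boosted model, many such small scores are summed to form the committee output.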
Gradient Boosting
As in LogitBoost: iterative prediction $y^*_m$
Residuals: $y^*_{m+1} = y^*_m + h(x)$, where $h(x)$ is a simple regressor, e.g. a stump or a shallow tree
New idea: optimize by gradient descent
If we minimize the mean squared error to the true values $y$, averaged over the training data, the derivative of $(y^* - y)^2$ can be computed and is proportional to the error $y^* - y$. The negative gradient is twice the residual:
\[ -\frac{d}{dF_j(x_i)} \Big( y_i - \sum_k F_k(x_i) \Big)^2 = 2 \Big( y_i - \sum_k F_k(x_i) \Big) = 2 \cdot \text{residual} \]
Well… this is just LogitBoost without the logistic loss function
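A quick numerical check of the squared-loss gradient, as a sketch: the negative derivative of the squared loss with respect to the current prediction should equal twice the residual. The values and helper names are arbitrary illustrations.

```python
def squared_loss(y, prediction):
    return (y - prediction) ** 2

def numeric_derivative(func, at, step=1e-6):
    """Central finite difference approximation of d func / d at."""
    return (func(at + step) - func(at - step)) / (2 * step)

y_true = 3.0
prediction = 1.0                      # current committee output for this instance
residual = y_true - prediction
gradient = numeric_derivative(lambda f: squared_loss(y_true, f), prediction)
negative_gradient = -gradient         # the direction gradient boosting fits next
```

Fitting the next weak learner to the negative gradient is therefore the same as fitting it to the residuals, up to the constant factor 2.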
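Other loss functions
Squared loss is overly sensitive to outliers
Absolute loss is more robust to outliers but is not differentiable at zero
\[ L(y; F) = |y - F(x)| \]
Huber loss
\[ L(y; F) = \begin{cases} \frac{1}{2}(y - F)^2 & \text{if } |y - F| \le \delta \\ \delta \left( |y - F| - \delta/2 \right) & \text{if } |y - F| > \delta \end{cases} \]
Negative gradient is
\[ -g(x_i) = -\frac{dL(y; F(x_i))}{dF(x_i)} = \begin{cases} y - F(x_i) & \text{if } |y - F(x_i)| \le \delta \\ \delta \, \mathrm{sign}(y - F(x_i)) & \text{if } |y - F(x_i)| > \delta \end{cases} \]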
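A minimal sketch of the Huber loss and its negative gradient for a single instance, following the two-case definition on this slide. The function names are illustrative.

```python
def huber_loss(y, f, delta):
    """Quadratic for small residuals, linear for large ones."""
    r = y - f
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - delta / 2)

def huber_negative_gradient(y, f, delta):
    """The residual itself when small, otherwise clipped to +/- delta."""
    r = y - f
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta

small = huber_negative_gradient(1.0, 0.5, delta=1.0)     # residual 0.5, kept as is
outlier = huber_negative_gradient(50.0, 0.0, delta=1.0)  # residual 50, clipped
```

Because the pseudo-residual of an outlier is clipped to ±δ, a single extreme point cannot dominate the fit of the next weak learner the way it does under squared loss.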
Two main Gradient Boosting Loss Functions
Deviance (as in Logistic Regression), with probability $\frac{1}{1 + e^{-F(x)}}$
Exponential (as in AdaBoost)
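The two losses can be written out for one instance as below. This is a sketch under assumed conventions: deviance is taken as the negative log-likelihood with labels in {0, 1} (some texts include an extra factor of 2), and the exponential loss uses labels in {-1, +1}. The function names are hypothetical.

```python
import math

def deviance(y01, f):
    """Negative log-likelihood of label y01 in {0, 1} given log odds f."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -(y01 * math.log(p) + (1 - y01) * math.log(1 - p))

def exponential_loss(y_pm, f):
    """AdaBoost-style loss for label y_pm in {-1, +1} and score f."""
    return math.exp(-y_pm * f)

# A confident, correct score is cheap; a confident, wrong score is expensive.
deviance_right = deviance(1, 2.0)
deviance_wrong = deviance(1, -2.0)
exp_right = exponential_loss(1, 2.0)
exp_wrong = exponential_loss(1, -2.0)
```

For badly misclassified instances the exponential loss grows much faster than the deviance, which is one reason it is more sensitive to noisy labels.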
OTP Python server
Red Hat Linux with Anaconda (2016Q2)
Jupyter notebook
More effective than localhost environments
Background mode
Web application
Application credit fraud prevention
SAS Fraud Solution – Logical system architecture
System architecture and working logic of OTPHU's fraud system
[Diagram labels. Hybrid approach: Automated Business Rules, Anomaly Detection, Predictive Modeling, Entity matching, Social Network Analysis. Alert Generation Process. Data: Application data, Historical data, Behavioural data. Steps: Entity matching, Network building, Scenario scores, Fraud scoring, Manual investigation. Outcomes: Approval, Rejection]
Investigation List of suspicious applications Scenario analysis Network analysis
Thank you for your attention!