By – Amey Gangal, Ganga Charan Gopisetty, Rakesh Sangameswaran

Machine Learning applications relevant to the Financial Services sector

Machine Learning - Understanding
Machine learning is the science of designing and applying algorithms that are able to learn from past cases. It uses algorithms that iterate over large data sets and analyze the patterns in the data, enabling machines to respond to situations for which they have not been explicitly programmed. It is used in spam detection, image recognition, product recommendation, predictive analytics, etc. A main aim of data scientists in applying ML is a significant reduction of human effort: even with modern analytics tools, it takes humans a long time to read, collect, categorize and analyze data. ML teaches machines to identify and gauge the importance of patterns in place of humans. Particularly for use cases where data must be analyzed and acted upon in a short amount of time, the support of machines allows humans to be more efficient and to act with confidence. Machine learning converts data-intensive, confusing information into a simple format that suggests actions to decision makers. A user further trains the ML system by continually adding data and experience. At its core, machine learning is therefore a three-part cycle: Train-Test-Predict. Optimizing this cycle makes predictions more accurate and more relevant to the specific use case.
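The Train-Test-Predict cycle can be sketched with scikit-learn. The synthetic data and the choice of logistic regression below are illustrative stand-ins, not part of the original slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for "past cases"
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Train: fit the model on historical data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Test: measure performance on held-out data
acc = accuracy_score(y_test, model.predict(X_test))

# Predict: score a new, unseen case
new_case = X_test[:1]
label = model.predict(new_case)[0]
```

Adding more data and refitting repeats the cycle; the held-out test score tells us whether the added experience actually improved the predictions.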

Use of Machine Learning in Insurance Claim Fraud
Insurance fraud covers the range of improper activities an individual may commit in order to obtain a favorable outcome from the insurance company. This could range from staging the incident and misrepresenting the situation, including the relevant actors and the cause of the incident, to inflating the extent of the damage caused. Potential situations include:
Covering up a situation that wasn't covered under insurance (e.g. drunk driving, performing risky acts, illegal activities)
Misrepresenting the context of the incident: transferring the blame in incidents where the insured party is at fault, or concealing a failure to take agreed-upon safety measures
Inflating the impact of the incident: increasing the estimate of the loss incurred, either by adding unrelated (faked) losses or by attributing increased cost to the losses

Process followed

Data Set - Sample

                          DS 1 -            DS 2 -        DS 3 -       DS 4 - Age of
                          Multiple Parties  For Insured   FIR Logged   the Vehicle
Number of Claims          8,627             562,275       595,360      15,420
Number of Attributes      34                59            62           n/a
Categorical Attributes    12                11            13           24
Normal Claims             8,537             561,902       595,141      14,497
Frauds Identified         90                373           219          913
Fraud Incidence Rate      1.04%             0.06%         0.03%        5.93%
Missing Values            11.36%            10.27%        0.00%        n/a
Number of Years of Data   10                3             n/a          n/a

BETTER DATA, BETTER RESULTS
Volume of Data: A fraud management solution needs access to a vast store of historical transaction data to help train its models and maximize the likelihood that it will uncover patterns of suspicious activity.
Richness of Data: It is not just the number of past transactions that counts; it is important to capture as much information about each transaction as possible. Pulling data from different sources can enhance data quality and fill gaps left by missing information.
Relevancy of Data: By collecting data from payment processors, businesses and major payment networks, it is possible to tap into a vast reservoir of "truth information": data on duplicate records, claim IDs found to be invalid, and manual reviews based on the actual outcomes of past transactions. This data is critical to distinguishing good transactions from bad ones.

Challenges Faced in Detection
The incidence of fraud is far lower than the total number of claims, and each fraud is unique in its own way.
Another challenge encountered in the machine learning process is handling missing values and categorical attributes. Missing data arises in almost all serious statistical analyses, and categorical attributes must be encoded numerically (e.g. the gender variable is transposed into two separate indicator columns, say male and female).
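Both preprocessing steps can be sketched with pandas and scikit-learn; the column names and the toy records here are hypothetical:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical claims data with a missing value and a categorical attribute
df = pd.DataFrame({
    "claim_amount": [1200.0, None, 800.0, 950.0],
    "gender": ["male", "female", "female", "male"],
})

# Missing values: impute the numeric column with its median
imputer = SimpleImputer(strategy="median")
df[["claim_amount"]] = imputer.fit_transform(df[["claim_amount"]])

# Categorical values: one-hot encode gender into two indicator columns
df = pd.get_dummies(df, columns=["gender"])
```

After this, the frame has no missing values and the single gender column has been replaced by gender_female and gender_male indicators, which most classifiers can consume directly.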

Machine Learning models which can be used
Machine learning algorithms learn from a dataset and make predictions on unseen data: they build a model from historical data in order to make predictions or decisions on new, unseen data.
Logistic Regression: Logistic regression measures the relationship between a dependent variable and one or more independent variables by estimating probabilities using a logit function. Instead of ordinary regression, the generalized linear model performs a binomial prediction.
Multivariate Normal Distribution: The multivariate normal distribution is a generalization of the univariate normal distribution to two or more variables.
Boosting: Boosting combines the outputs of many "weak" classifiers to produce a powerful "committee".
Bagging: Decision trees learn the input-output map of a supervised learning problem by expressing the map as a branching tree. The method does well on problems with complicated structure, but single trees are likely to suffer from high variance (or, if heavily pruned, high bias). This weakness can be mitigated through bagging, which averages trees fitted to bootstrap samples of the data.
Random Forest: Random forest tuning can be controlled using an "out-of-bag" (OOB) error estimate for each observation. This estimate is the average error from the trees whose bootstrap samples did not include that observation, and it is used to monitor the training.
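The out-of-bag idea can be seen in scikit-learn's RandomForestClassifier, where oob_score=True computes exactly this estimate; the imbalanced synthetic data below is only a stand-in for the claims datasets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic data standing in for claims (few frauds)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# With oob_score=True, each observation is scored only by the trees
# whose bootstrap samples left that observation out
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_accuracy = rf.oob_score_
```

Because the OOB estimate needs no separate validation split, it is a cheap way to compare forest settings (number of trees, tree depth) during tuning.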

Conclusion
The machine learning models discussed, applied to these datasets, should be able to identify most of the fraudulent cases with a low false positive rate, i.e. with reasonable precision. This enables the system to focus on new fraud scenarios and ensures that the models adapt to identify them.
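Precision (the share of flagged claims that are real frauds) and its companion recall can be computed directly with scikit-learn; the ten labels below are purely illustrative:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical outcomes for 10 claims (1 = fraud)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # model flags claims 6, 7, 8

precision = precision_score(y_true, y_pred)  # flagged claims that are frauds
recall = recall_score(y_true, y_pred)        # frauds that were caught
```

Here 2 of the 3 flagged claims are real frauds (precision 2/3), and 2 of the 3 real frauds are caught (recall 2/3); a low false positive rate corresponds to high precision.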

Prediction of Consumer Credit Risk
Because of the increasing number of companies and startups in the fields of microcredit and peer-to-peer lending, this project aims to build an efficient tool for peer-to-peer lending managers, so that they can easily and accurately assess the default risk of their clients. In order to restore trust in the financial system and to prevent credit and default crises from recurring, banks and other credit companies have recently tried to develop new models that assess the credit risk of individuals even more accurately.

Data Set – from Kaggle
Age of the borrower
Number of dependents in the family
Monthly income
Monthly expenditures divided by monthly gross income
Total balance on credit cards divided by the sum of credit limits
Number of open loans and lines of credit
Number of mortgage and real estate loans
Number of times the borrower has been 30-59 days past due but no worse in the last 2 years
Number of times the borrower has been 60-89 days past due but no worse in the last 2 years
Number of times the borrower has been 90 days or more past due
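Two of these attributes are ratios derived from raw borrower figures; a minimal sketch of that derivation, using hypothetical column names and values, might look like:

```python
import pandas as pd

# Hypothetical raw borrower records (names and figures are illustrative)
df = pd.DataFrame({
    "monthly_income": [5000.0, 3000.0],
    "monthly_expenditures": [2000.0, 2700.0],
    "credit_card_balance": [1500.0, 4500.0],
    "credit_limit_total": [10000.0, 5000.0],
})

# Monthly expenditures divided by monthly gross income
df["debt_ratio"] = df["monthly_expenditures"] / df["monthly_income"]

# Total balance on credit cards divided by the sum of credit limits
df["revolving_utilization"] = df["credit_card_balance"] / df["credit_limit_total"]
```

Expressing spending and card balances as ratios makes borrowers with very different incomes comparable, which is why the dataset stores them this way rather than as raw amounts.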

Machine Learning models which can be used
Logistic Regression: a classic model for this type of problem.
Classification and Regression Trees (CART): trees are particularly efficient for classification.
Random Forests: this model averages many deep decision trees trained on different parts of the training set, which aims at reducing the variance.
Gradient Boosting Trees (GBT): the gradient boosting algorithm improves the accuracy of a predictive function through incremental minimization of the error term. After the initial tree is grown, each subsequent tree in the series is fitted with the purpose of reducing the remaining error.
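The incremental error reduction behind GBT can be observed with scikit-learn's staged_predict, which scores the ensemble after each added tree; the synthetic data and parameter choices here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Each new tree in the series is fitted to reduce the remaining error
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=1).fit(X_tr, y_tr)

# Test accuracy after each stage, i.e. after 1, 2, ..., 100 trees
staged = [accuracy_score(y_te, pred) for pred in gbt.staged_predict(X_te)]
```

Plotting the staged scores typically shows accuracy climbing as trees are added and then flattening, which is how the number of trees and the learning rate are tuned in practice.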

Conclusion
The results fall clearly into two distinct groups of models. The first consists of Logit and CART; the second, the more sophisticated tree models, consists of Random Forest and Gradient Boosting Trees. By combining trees with the gradient boosting technique (the GBT model), we can implement a model with two principal features: first, its predictive power is very accurate; second, its small variance makes it much more reliable.