Published by Violet Cooper. Modified over 6 years ago.
1
Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector
Matt Thomson 17/11/2016
2
Outline
- Introduction
- Traditional Fraud Detection
- Assurance Scoring
- Machine Learning
- Business Rules
- Anomaly Detection
- Graph Links
3
Who am I?
Matt Thomson
- Senior Data Scientist at Capgemini
- PhD in Astrophysics (
- Several years' experience in fraud detection
Capgemini Big Data Analytics team
- ~100 Data Scientists, Big Data Engineers and Data Analysts
- Focus on Open Source and Big Data technologies to solve client problems
- Sponsor the meetup today!
4
Introduction to the Problem
- The public sector constantly works in an environment of reduced resources
- It wants to provide a better service, but with greater efficiency
- It is therefore very important that limited resources are focussed correctly
Assurance Scoring
- Use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted on the most risky
5
Hypothetical Example – 2016 Olympics tickets
- Imagine running the application process for selling tickets to the Olympics
- Avoid selling tickets to touts/resellers
- Vast majority of people applying for tickets are genuine
- Fraud detection with a big class imbalance problem (<0.1%)
- Avoid the approach of investigating each person applying
- Let's say we know from the 2012 Olympics which people ended up reselling their tickets: this is the training data
- Use ML to identify the 30% (say) least likely to be touts, who are fast tracked
- Investigators focus on the high risk
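The fast-track step above can be sketched in a few lines. This is only an illustration, assuming a model has already produced a risk score per applicant; the applicant IDs, the scores and the `split_by_risk` helper are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Tuple


def split_by_risk(scores: Dict[str, float],
                  fast_track_fraction: float = 0.30
                  ) -> Tuple[List[str], List[str]]:
    """Return (fast_tracked, for_review) given one risk score per applicant."""
    # Rank applicants from least risky to most risky.
    ranked = sorted(scores, key=scores.get)
    # Number of applicants to fast track, rounded down.
    n_fast = int(len(ranked) * fast_track_fraction)
    return ranked[:n_fast], ranked[n_fast:]


# Hypothetical model output: estimated probability that an applicant is a tout.
scores = {
    "app-01": 0.02, "app-02": 0.40, "app-03": 0.01, "app-04": 0.85,
    "app-05": 0.10, "app-06": 0.05, "app-07": 0.30, "app-08": 0.03,
    "app-09": 0.60, "app-10": 0.07,
}

fast_tracked, for_review = split_by_risk(scores)
print("Fast tracked:", fast_tracked)
print("Left for investigators, riskiest last:", for_review)
```

The 30% figure is the "(say)" value from the slide; in practice the fraction would presumably be chosen from the validation results and the investigators' capacity.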
6
Traditional Fraud Detection
1. Identify Historical Training Data
2. Feature Engineering
3. Model Training and Evaluation
4. Model Execution
5. Feedback
7
Assurance Scoring
- Focus on low-risk
- Allows resources to be better focussed
- Not limited to Machine Learning
- Built using Python! Pandas, Scikit-learn etc.
- Scala version using Spark MLlib
8
Assurance Scoring
9
POLE ‘Analytical’ Data Layer
- Disparate data sources form the Atomic Layer
- Atomic data is transformed and loaded into the POLE Layer
- POLE: Person, Object, Location, Event
10
POLE ‘Analytical’ Data Layer
POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
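One way to picture a POLE layer that holds entities plus their inter-linkages is a small in-memory structure. This is a sketch under assumptions: the class names, the `source` field and the relation labels are hypothetical, and the talk does not say how the real layer is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The four POLE entity types named on the slide.
POLE_TYPES = ("Person", "Object", "Location", "Event")


@dataclass(frozen=True)
class Entity:
    entity_id: str
    pole_type: str   # one of POLE_TYPES
    source: str      # hypothetical: the atomic-layer source it was loaded from

    def __post_init__(self) -> None:
        if self.pole_type not in POLE_TYPES:
            raise ValueError(f"unknown POLE type: {self.pole_type}")


@dataclass
class PoleLayer:
    entities: Dict[str, Entity] = field(default_factory=dict)
    links: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity

    def link(self, a: str, b: str, relation: str) -> None:
        # Only link entities that have already been loaded.
        if a not in self.entities or b not in self.entities:
            raise KeyError("both entities must exist before linking")
        self.links.append((a, b, relation))

    def neighbours(self, entity_id: str) -> List[str]:
        """IDs of entities directly linked to entity_id, in insertion order."""
        out = []
        for a, b, _ in self.links:
            if a == entity_id:
                out.append(b)
            elif b == entity_id:
                out.append(a)
        return out


layer = PoleLayer()
layer.add(Entity("p1", "Person", "applications"))
layer.add(Entity("e1", "Event", "applications"))
layer.add(Entity("l1", "Location", "address_register"))
layer.link("p1", "e1", "applied_for")
layer.link("p1", "l1", "lives_at")
print(layer.neighbours("p1"))
```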
12
Machine Learning Framework: Structure, flexibility, consistency
- Pipeline stages (from the slide diagram): Input Data; Manipulate, Explore Data; Feature extraction and selection; Vector Build; Transform; Model Building; Model Selection; Training, Validation, Test
- Variety of output files: logs, graphics, pickle models, etc.
- Testing: unit tests, monitoring tests and integration tests
13
Machine learning : Feature Engineering
SQL, Python Transform Explore Select Ask questions, validate Refine features Feature Extraction Data exploration Feature selection Historical Data
14
Machine Learning: Model Building
- Input: Selected features
- Split Datasets: Training, Validation, Test
- Build Models, with hyper-parameter tuning
- Outputs: Models, Training results, Validation results, Test results
- Compare Models
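The model-building flow above (split datasets, build models with hyper-parameter tuning, compare on validation, confirm on test) could look roughly like this with scikit-learn, which the deck names as part of the stack. The synthetic data, the two model families, the hyper-parameter values and the use of ROC AUC as the comparison metric are all assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for historical application data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Split datasets: 60% training, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Candidate models and hyper-parameter values (illustrative choices only).
candidates = {}
for c in (0.1, 1.0):
    candidates[f"logreg_C={c}"] = LogisticRegression(C=c, max_iter=1000)
for depth in (3, 6):
    candidates[f"forest_depth={depth}"] = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0)

# Train every candidate on the training set and score it on validation.
validation_auc = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores = model.predict_proba(X_val)[:, 1]
    validation_auc[name] = roc_auc_score(y_val, val_scores)

# Compare models on validation results, then evaluate the winner once on test.
best_name = max(validation_auc, key=validation_auc.get)
test_scores = candidates[best_name].predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(best_name, round(validation_auc[best_name], 3), round(test_auc, 3))
```

Keeping the test set out of model selection means the final test figure is not biased by the comparison step.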
15
Low risk? High risk? Depends on classifier’s threshold
- True positives: applications the model correctly classifies as high risk
- True negatives: applications the model correctly classifies as low risk
- False positives: applications the model scores as high risk but that are in fact low risk
- False negatives: applications the model scores as low risk but that are in fact high risk
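To make the threshold dependence concrete, the four counts can be computed for the same scores at two different thresholds. The scores and labels below are made up for illustration, and `confusion_counts` is a hypothetical helper.

```python
from typing import Dict, Sequence


def confusion_counts(scores: Sequence[float],
                     is_high_risk: Sequence[bool],
                     threshold: float) -> Dict[str, int]:
    """Count TP/TN/FP/FN when scores >= threshold are classed as high risk."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, actual in zip(scores, is_high_risk):
        predicted = score >= threshold
        if predicted and actual:
            counts["TP"] += 1   # correctly classified as high risk
        elif predicted and not actual:
            counts["FP"] += 1   # scored high risk, actually low risk
        elif not predicted and actual:
            counts["FN"] += 1   # scored low risk, actually high risk
        else:
            counts["TN"] += 1   # correctly classified as low risk
    return counts


# Made-up risk scores and the true outcome for six applications.
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
actual = [False, False, True, False, True, True]

low = confusion_counts(scores, actual, threshold=0.3)
high = confusion_counts(scores, actual, threshold=0.7)
print(low)   # {'TP': 3, 'TN': 2, 'FP': 1, 'FN': 0}
print(high)  # {'TP': 2, 'TN': 3, 'FP': 0, 'FN': 1}
```

Raising the threshold removes the false positive but lets one high-risk application through as low risk. For assurance scoring that false negative is the costly error, since fast-tracked cases are not investigated.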
17
Business Rules
- Identifying fraud has often been done using deterministic rules
- Look for transactions near a threshold or at the end of the day
- Primarily data queries on your feature vector
- Olympics example: anyone applying for more than £10,000 of tickets
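Deterministic rules of this kind amount to named predicates evaluated against each application's features. In the sketch below only the £10,000 limit comes from the slide; the field names, the "near the threshold" band and the end-of-day cut-off are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

Application = Dict[str, float]
Rule = Callable[[Application], bool]

# Each rule is a simple query on the feature vector.
RULES: Dict[str, Rule] = {
    # From the slide: more than £10,000 of tickets.
    "total_value_over_10000": lambda app: app["total_value_gbp"] > 10_000,
    # Assumed band for "near a threshold".
    "value_just_under_limit": lambda app: 9_500 <= app["total_value_gbp"] <= 10_000,
    # Assumed definition of "end of the day": submitted at 23:00 or later.
    "submitted_late_in_day": lambda app: app["hour_submitted"] >= 23,
}


def triggered_rules(app: Application) -> List[str]:
    """Names of all rules the application triggers, in rule order."""
    return [name for name, rule in RULES.items() if rule(app)]


applications = {
    "app-01": {"total_value_gbp": 120, "hour_submitted": 14},
    "app-02": {"total_value_gbp": 12_000, "hour_submitted": 10},
    "app-03": {"total_value_gbp": 9_900, "hour_submitted": 23},
}

flags = {app_id: triggered_rules(app) for app_id, app in applications.items()}
for app_id, names in flags.items():
    print(app_id, names)
```

Returning the rule names rather than a single boolean keeps the reason for a flag visible to investigators.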
19
Anomaly Detection
- Use the training data to create a baseline of applications by postcode (say)
- If a particular postcode has a larger than expected number of applications, then those cases are pushed into the high-risk bucket
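A baseline-by-postcode check can be sketched with simple counts. The talk does not say how "larger than expected" is defined, so the ratio test, the add-one smoothing for unseen postcodes and the minimum count below are assumptions, and the postcode labels are placeholders.

```python
from collections import Counter
from typing import Iterable, Set


def anomalous_postcodes(baseline: Iterable[str], current: Iterable[str],
                        ratio: float = 3.0, min_count: int = 5) -> Set[str]:
    """Postcodes with far more current applications than the baseline implies."""
    base_counts = Counter(baseline)
    cur_counts = Counter(current)
    base_total = sum(base_counts.values())
    cur_total = sum(cur_counts.values())

    flagged = set()
    for postcode, observed in cur_counts.items():
        # Baseline share of applications, with add-one smoothing so that a
        # postcode unseen in training does not get an expected count of zero.
        share = (base_counts[postcode] + 1) / (base_total + len(base_counts) + 1)
        expected = share * cur_total
        # Flag only when the count is both non-trivial and well above expected.
        if observed >= min_count and observed > ratio * expected:
            flagged.add(postcode)
    return flagged


# Placeholder postcode districts; counts are invented for the example.
baseline = ["ZZ1"] * 50 + ["ZZ2"] * 30 + ["ZZ3"] * 20
current = ["ZZ1"] * 48 + ["ZZ2"] * 28 + ["ZZ3"] * 4 + ["ZZ4"] * 20

flagged = anomalous_postcodes(baseline, current)
print(flagged)  # {'ZZ4'}
```

Applications from a flagged postcode would then be routed to the high-risk bucket regardless of their individual model score.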
21
Graph Links - Matching
- Key part of assurance scoring: bringing data together from disparate sources

Attribute        | Data Source 1 | Data Source 2
Name             | Matt Thomson  | Matthew Thosmon
Phone Number     |               |
Favourite Sport  | Football      | Cricket

Probability of Match: 80%
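A rough way to score whether two records refer to the same person is a weighted average of per-attribute string similarity. This sketch uses `difflib.SequenceMatcher` from the Python standard library; the attribute weights are assumptions, the phone numbers are left missing as in the slide, and the result is not expected to reproduce the 80% figure, whose underlying method the talk does not describe.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Assumed weights: identifying attributes count for more than preferences.
WEIGHTS = {"name": 0.6, "phone": 0.3, "favourite_sport": 0.1}


def similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Case-insensitive similarity in [0, 1], or None if a value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec1: Record, rec2: Record,
                weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted mean similarity over attributes present in both records."""
    total = 0.0
    weight_used = 0.0
    for attr, weight in weights.items():
        sim = similarity(rec1.get(attr), rec2.get(attr))
        if sim is None:
            continue  # skip attributes missing from either source
        total += weight * sim
        weight_used += weight
    return total / weight_used if weight_used else 0.0


source_1 = {"name": "Matt Thomson", "phone": None, "favourite_sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": None, "favourite_sport": "Cricket"}

score = match_score(source_1, source_2)
print(f"match score: {score:.2f}")
```

Pairs scoring above a chosen cut-off would become links in the graph, joining records from the separate sources into one entity.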
23
Further Details
- Come and find me!
- Assurance Scoring brochure:
- Blogs:
  - Introduction:
  - Integrating multiple techniques:
  - Machine Learning:
  - Many more on other topics
24
We’re Hiring!
- Data Science
- Big Data Engineer