
Fraud Detection Notes from the Field

Introduction
Dejan Sarka – Data Scientist
– MCT, SQL Server MVP
– 30 years of data modeling, data mining, and data quality
– 13 books
– ~10 courses

Agenda
– Conducting a DM Project
– Introduction
– Data Preparation and Overview
– Directed Models
– Undirected Models
– Measuring Models
– Continuous Learning
– Results

Conducting a DM Project (1)
A data mining project is about establishing a learning environment
Start with a proof-of-concept (POC) project
– 5–10 working days (depends on the problem)
– SMEs & IT pros available, data known & available
Time split
– Training: 20%
– Data overview & preparation: 60%
– Data mining modeling: 10%
– Model evaluation: 5%
– Presenting the results: 5%

Conducting a DM Project (2)
Real project
– 10–20 working days (depends on the problem)
– SMEs & IT pros available, data known & available
Time split (without actual deployment)
– Additional data overview & preparation: 20%
– Additional data mining modeling: 5%
– Model evaluation: 5%
– Presenting and explaining the results: 10%
– Defining deployment: 10%
– Establishing a model-measuring environment: 30%
– Additional training: 20%

Introduction to Fraud Analysis
Fraud detection = outlier analysis
Graphical methods
– OLAP cubes help in multidimensional space
Statistical methods
– Single variable – must know the distribution's properties
– Regression analysis – analysis of residuals
Data mining methods
– Directed models – must have existing frauds flagged
– Undirected models – sort data from most to least suspicious

Resolving Frauds
Blocking procedure
– Reject all suspicious rows
Consecutive (sequential) procedure
– Test the least likely outlier first, then the next most extreme outlier, etc.
Learning procedure
– Find frauds with directed mining methods and check all of them
– Test from most to least likely outliers with undirected mining methods and check a limited number of rows
– Measure the efficiency of the models and constantly change them

What Is an Outlier? (1)
About 95% (more precisely, 95.4%) of a normal distribution lies between the mean ± two standard deviations
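The rule above is easy to verify empirically; a minimal sketch in Python, assuming simulated standard-normal data (all numbers are synthetic):

```python
import random

# Simulate standard-normal data and check the two-standard-deviation rule.
random.seed(42)
values = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean = sum(values) / len(values)
sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

within = sum(1 for v in values if abs(v - mean) <= 2 * sd)
share = within / len(values)
print(f"Share within mean ± 2 SD: {share:.3f}")  # close to 0.954 for normal data
```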

Skewness and Kurtosis
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable
Kurtosis is a measure of the "peakedness" of the probability distribution of a real-valued random variable
Source of the graphs: Wikipedia
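Both moments can be computed directly from their definitions; a sketch using population (biased) moments, with made-up sample data:

```python
import math

def skewness(xs):
    # Population skewness: the third standardized moment.
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

def kurtosis(xs):
    # Excess kurtosis: fourth standardized moment minus 3 (0 for a normal).
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 4 for x in xs) / n - 3.0

symmetric = [1, 2, 3, 4, 5]       # made-up, perfectly symmetric sample
right_skewed = [1, 1, 1, 2, 10]   # made-up sample with a long right tail
print(skewness(symmetric))        # 0.0
print(skewness(right_skewed))     # positive: long right tail
```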

What Is an Outlier? (2)
Outliers must be at least two (three, four, …) standard deviations from the mean, right?
– Well, if the distribution is skewed, this is not true on one side
– If it has long tails on both sides (high kurtosis), it is not true on either side
– So how far from the mean should we go? There is no way to get a definite answer
Start from the other direction: how many rows can you inspect?
– One person can inspect ~10,000 rows per day
– Define outliers by the maximum number of rows you can inspect
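The capacity-driven definition above can be sketched as follows; the transaction amounts and the inspection budget are invented:

```python
# Rank rows by distance from the mean and inspect only as many as
# capacity allows, however far from the mean that turns out to be.
amounts = [120, 95, 110, 87, 130, 99, 101, 115, 2_500, 98, 105, 92]
budget = 3  # rows one clerk can inspect (hypothetical)

mean = sum(amounts) / len(amounts)
ranked = sorted(amounts, key=lambda a: abs(a - mean), reverse=True)
to_inspect = ranked[:budget]
print(to_inspect)
```

Note how the single extreme value drags the mean upward, so the rows ranked right after it are not necessarily the next-largest ones; this is exactly why a fixed standard-deviation cut-off is hard to defend on skewed data.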

Learning Procedure
Sure frauds reported by customers
– Flagged and used for training directed models
– Models predict frauds before customers report them
– Still, a clerk has to check before blocking
Some rows tested with no previous knowledge
– Patterns change, and directed models become obsolete
– Instead of random checks, use undirected models to sort data from most to least likely frauds
– Change and retrain directed models when obsolete
Learn when a directed model is obsolete through constant measuring

Two Examples
A big bank somewhere in the west…
– Frauds in on-line transactions
– Around 0.7% of rows reported and flagged as frauds
– Around 10,000 rows checked manually daily before the project
– OLAP cubes already in place
A small bank somewhere in the east…
– Frauds in credit card transactions
– Around 0.5% of rows reported and flagged as frauds
– No rows checked manually before the project
– OLAP cubes already in test

Data Preparation (1)
Problem: a low number of frauds in the overall data
– Oversampling: multiply the fraud rows
– Undersampling: randomly select a sample from the non-fraudulent rows
– Do nothing
What to use depends on the algorithm and the data
– Microsoft Neural Networks work best with about 50% frauds
– Microsoft Naïve Bayes works well with 10% frauds
– Microsoft Decision Trees work well with any percentage of frauds
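Naive over- and undersampling can be sketched like this; the row layout is invented, and the 50% target follows the neural-network rule of thumb above:

```python
import random

random.seed(7)
# Made-up data: 5 known frauds in ~1,000 rows.
frauds = [{"id": i, "fraud": True} for i in range(5)]
legit = [{"id": i, "fraud": False} for i in range(5, 1000)]

# Oversampling: replicate fraud rows until they make up ~50% of the training set.
oversampled = legit + frauds * (len(legit) // len(frauds))

# Undersampling: keep all frauds, randomly sample an equal number of legit rows.
undersampled = frauds + random.sample(legit, len(frauds))

print(len(oversampled), len(undersampled))
```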

Data Preparation (2)
Domain knowledge helps in creating useful derived variables
– Flag whether we have multiple transactions from different IPs and the same person in a defined time window
– Flag whether we have transactions from multiple persons and the same IP in a defined time window
– Are multiple persons using the same card / account?
– Is the amount of a transaction close to the maximum amount for this type of transaction?
– Time of day; is the day a holiday?
– Frequency of transactions in general
– Demographics and more
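The first derived variable above (same person, different IPs inside a time window) might be sketched like this; the field names and the 10-minute window are invented:

```python
from datetime import datetime, timedelta

# Made-up transactions: person, source IP, timestamp.
transactions = [
    {"person": "A", "ip": "1.1.1.1", "ts": datetime(2024, 1, 1, 10, 0)},
    {"person": "A", "ip": "2.2.2.2", "ts": datetime(2024, 1, 1, 10, 5)},
    {"person": "B", "ip": "3.3.3.3", "ts": datetime(2024, 1, 1, 10, 0)},
    {"person": "B", "ip": "3.3.3.3", "ts": datetime(2024, 1, 1, 11, 0)},
]

def multi_ip_flags(rows, window):
    # Flag a person if any of their transactions has another transaction
    # from a different IP within the time window.
    flags = {}
    for row in rows:
        ips = {r["ip"] for r in rows
               if r["person"] == row["person"]
               and abs(r["ts"] - row["ts"]) <= window}
        flags[row["person"]] = flags.get(row["person"], False) or len(ips) > 1
    return flags

print(multi_ip_flags(transactions, timedelta(minutes=10)))  # {'A': True, 'B': False}
```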

Data Overview (1)
Graphical presentation
– Excel
– OLAP cubes & Excel
– Graphical packages
Statistics
– Frequency distributions for discrete variables
– First four population moments
– Shape of the distribution – can discretize into equal-range buckets
– Statistical packages
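The equal-range discretization mentioned above, for eyeballing a distribution's shape, might look like this (all values are made up):

```python
from collections import Counter

# Made-up continuous variable; split its range into equal-width buckets.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
n_buckets = 4
lo, hi = min(values), max(values)
width = (hi - lo) / n_buckets

def bucket(v):
    # Clamp the maximum value into the last bucket.
    return min(int((v - lo) / width), n_buckets - 1)

histogram = Counter(bucket(v) for v in values)
print(sorted(histogram.items()))  # bucket index -> row count
```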

Data Overview (2)
SQL Server does not have enough tools out of the box
You can enrich the toolset with
– Transact-SQL queries
– CLR procedures and functions
– OLAP cubes
– Office Data Mining Add-ins
Nevertheless, for large projects it might be worth investing in 3rd-party tools
– SPSS
– Tableau
– And more

Demo
Enriching the SQL Server toolset with custom solutions for data overview

Directed Models (1)
Create multiple models on different datasets
– Different over- and undersampling options
– Play with different time windows
Measure the efficiency and robustness of the models
– Split the data into training and test sets
– Lift chart for efficiency
– Cross-validation for robustness
– We are especially interested in the true positive rate
– Use Naïve Bayes to check Decision Trees input variables
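The lift measurement behind the lift chart can be sketched as lift-at-k: how much denser known frauds are among the model's top-scored rows than in the data as a whole. The scores and labels below are invented (1 marks a known fraud):

```python
# (score, label) pairs from a hypothetical model on a test set.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.50, 0), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]

def lift_at_k(scored_rows, k):
    # Fraud rate in the top-k scored rows divided by the overall fraud rate.
    rows = sorted(scored_rows, key=lambda r: r[0], reverse=True)
    top_rate = sum(label for _, label in rows[:k]) / k
    base_rate = sum(label for _, label in scored_rows) / len(scored_rows)
    return top_rate / base_rate

print(lift_at_k(scored, 4))  # 0.75 / 0.4 = 1.875
```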

Directed Models (2)
How many models should we create?
– Unfortunately, there is only one answer: it depends
– Fortunately, models can be built quickly once the data is prepared
– Prepare as many as time allows
How many cases should we analyze?
– Enough to make the results "statistically significant"
– Rule of thumb: 15,000 to a maximum of 100,000

Demo
Directed models and evaluation of the models

Clustering (1)
Clustering is the algorithm for finding outliers
– Expectation-Maximization only
– K-Means is useless for this task
The Excel Data Mining Add-ins use Clustering for the Highlight Exceptions task
– Simple to use
– No way to influence the algorithm parameters
– No way to create multiple models and evaluate them
– Don't rely on the highlighted column

Clustering (2)
How many clusters fit the input data best?
– The better the model fits the data, the more likely it is that the outliers really are frauds
– Create multiple models and evaluate them
– There is no good built-in evaluation method
Possibilities
– Make the models predictive and use the Lift Chart
– MSOLAP_NODE_SCORE
– ClusterProbability mean & standard deviation
– Use the average entropy inside clusters to find the model with the best fit (with a potential correction factor)
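The average-entropy idea can be sketched as a weighted entropy over cluster contents; the cluster assignments are made up, and a lower value suggests a tighter fit:

```python
import math

def entropy(labels):
    # Shannon entropy of the label distribution inside one cluster.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def avg_cluster_entropy(clusters):
    # Weight each cluster's entropy by its share of all rows.
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * entropy(c) for c in clusters)

pure = [["a", "a", "a"], ["b", "b", "b"]]   # hypothetical well-separated model
mixed = [["a", "b", "a"], ["b", "a", "b"]]  # hypothetical poorly fitting model
print(avg_cluster_entropy(pure), avg_cluster_entropy(mixed))  # 0.0 vs. ~0.918
```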

Demo
Clustering models and evaluation of the models

Harvesting Results
Harvest the results with DMX queries
– Use ADOMD.NET to build them in an application
Actual results were very good
– Predicted 70% of actual frauds with Decision Trees
– Had 14% of all frauds in the first 10,000 rows with Clustering (20x lift over random selection!)
Be prepared for surprises!
– However, also use common sense
– Too many surprises typically mean bad models

Continuous Learning (2)
Measure the efficiency of the models
– Use Clustering models on data that was not predicted correctly by directed models to learn whether additional or new patterns exist
– Use OLAP cubes to measure efficiency over time
– For directed models, measure predicted vs. reported frauds
– For clustering models, measure the number of actual vs. suspected frauds in a limited data set
Fraud detection is never completely automatic
– Thieves learn, so we must learn as well!

Continuous Learning (3)
Flag reported frauds → Create directed models → Predict on new data → Cluster the rest of the data → Check with a control group → Refine models → Measure over time

Demo
Harvesting the results and measuring over time

Conclusion
Questions?
Thanks!
© 2012 SolidQ