Data Quality Models for High Volume Transaction Streams: A Case Study
Robert Grossman, Chris Curry, David Locke & Steve Vejcik (Open Data Group)
Joseph Bugajski (Visa International)
The Problem: Detect Significant Changes in Visa’s Payments Network
[Diagram: Account, Merchant, Issuing Bank, Acquiring Bank]
Visa Payment Network
- Over 1.59 billion Visa cards in circulation
- 6800 transactions per second (peak)
- Over 20,000 member banks
- Millions of merchants
The Challenge: Payments Data is Highly Heterogeneous
- Variation from cardholder to cardholder
- Variation from merchant to merchant
- Variation from bank to bank
Observe: If Data Were Homogeneous, Could Use Change Detection Model
- Sequence of events x[1], x[2], x[3], …
- Question: is the observed distribution different than the baseline distribution?
- Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests
[Figure: Observed Model vs. Baseline Model]
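The change detection idea above can be illustrated with a minimal sketch of a textbook one-sided CUSUM test for an upward shift in the mean. The slides do not give the exact statistic used in the project, so the function name `cusum_alerts` and the parameters `baseline_mean`, `slack`, and `threshold` are assumptions made for illustration only; the GLR test is not shown.

```python
def cusum_alerts(xs, baseline_mean, slack, threshold):
    """Textbook one-sided (upward) CUSUM over a sequence of observations.

    All parameter names are illustrative assumptions:
      baseline_mean: expected value of x under the baseline model
      slack:         allowance subtracted at each step so small noise decays
      threshold:     alert when the cumulative statistic exceeds this value
    Returns the indices at which an alert is raised.
    """
    s = 0.0
    alerts = []
    for i, x in enumerate(xs):
        # Accumulate evidence of an upward shift; never drop below zero.
        s = max(0.0, s + (x - baseline_mean - slack))
        if s > threshold:
            alerts.append(i)
            s = 0.0  # restart the statistic after raising an alert
    return alerts


# Example: a metric (say, a decline rate) sits at its baseline, then shifts up.
observed = [0.10] * 5 + [0.30] * 5
alerts = cusum_alerts(observed, baseline_mean=0.10, slack=0.05, threshold=0.4)
```

With these made-up numbers the statistic stays at zero while the data match the baseline and only crosses the threshold after several shifted observations, which is the behavior that makes CUSUM tolerant of isolated noisy values.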
Key Idea: Build Models, One for Each Cell in Data Cube
- Build separate model for each bank (1000+)
- Build separate model for each geographical region (6 regions)
- Build separate model for each different type of merchant (c. 800 types of merchants)
- For each distinct cube, establish separate baselines for each metric of interest (declines, etc.)
- Detect changes from baselines
- Modeling using Cubes of Models (MCM)
[Figure: data cube with axes Geospatial region, Type of Transaction, and Bank; 20,000+ separate baselines]
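One way to picture a cube of models is a dictionary keyed by the cell coordinates, holding one baseline per cell. The sketch below is an illustration under assumptions, not the project's implementation: the `Baseline` class, the `build_cube` function, and the event field names (`bank`, `region`, `mcc`, `declined`) are hypothetical, and the baseline is reduced to a running mean of a single metric.

```python
from collections import defaultdict


class Baseline:
    """Running mean of one metric (e.g. decline indicator) for one cell."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, value):
        self.n += 1
        self.total += value

    @property
    def mean(self):
        # An empty cell reports 0.0 rather than raising.
        return self.total / self.n if self.n else 0.0


def build_cube(events):
    """Estimate one baseline per (bank, region, merchant type) cell.

    The field names used here are hypothetical placeholders.
    """
    cube = defaultdict(Baseline)
    for e in events:
        key = (e["bank"], e["region"], e["mcc"])
        cube[key].update(e["declined"])
    return cube


# Made-up events: two cells, each with its own decline-rate baseline.
events = [
    {"bank": "A", "region": "EU", "mcc": 5411, "declined": 1},
    {"bank": "A", "region": "EU", "mcc": 5411, "declined": 0},
    {"bank": "B", "region": "US", "mcc": 5812, "declined": 0},
]
cube = build_cube(events)
```

Keeping the cells independent means each cell's observed stream can be compared against its own baseline, so heterogeneity across banks, regions, and merchant types does not mask a change within one cell.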
Greedy Meaningful/Manageable Balancing (GMMB) Algorithm
- Fewer alerts (alerts more manageable): to decrease alerts, remove breakpoint, order by number of decreased alerts, & select one or more breakpoints to remove
- More alerts (alerts more meaningful): to increase alerts, add breakpoint to split cubes, order by number of new alerts, & select one or more new breakpoints
[Figure: one model for each cell in data cube, with a breakpoint marked]
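The greedy balancing step above can be sketched as two loops that rank candidate breakpoints by their effect on the alert count. This is a rough illustration under stated assumptions, not the authors' GMMB code: the function names, the caller-supplied `count_alerts` callback, the numeric `target`, and the choice to change one breakpoint per iteration are all hypothetical (the slide allows selecting one or more at a time).

```python
def greedy_remove(breakpoints, count_alerts, target):
    """Remove breakpoints one at a time until alerts <= target.

    count_alerts(breakpoints) -> int is supplied by the caller (assumption).
    Each round removes the breakpoint whose removal decreases alerts most.
    """
    current = sorted(breakpoints)
    while current and count_alerts(current) > target:
        base = count_alerts(current)

        def decrease(b):
            return base - count_alerts([c for c in current if c != b])

        best = max(current, key=decrease)
        if decrease(best) <= 0:
            break  # no removal reduces the alert count any further
        current.remove(best)
    return current


def greedy_add(breakpoints, candidates, count_alerts, target):
    """Add candidate breakpoints one at a time until alerts >= target.

    Each round adds the candidate that produces the most new alerts.
    """
    current = sorted(breakpoints)
    remaining = [c for c in candidates if c not in current]
    while remaining and count_alerts(current) < target:
        base = count_alerts(current)

        def increase(b):
            return count_alerts(sorted(current + [b])) - base

        best = max(remaining, key=increase)
        if increase(best) <= 0:
            break  # no candidate yields additional alerts
        current = sorted(current + [best])
        remaining.remove(best)
    return current
```

In practice `count_alerts` would re-segment the data at the given breakpoints, re-estimate the per-cell baselines, and count the alerts they raise; here it is left abstract so the greedy ordering itself is the only thing on display.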
Augustus
- Open source Augustus data mining platform was used to:
  - Estimate baselines for over 15,000 separate segmented models
  - Score high volume operational data and issue alerts for follow up investigations
- Augustus is PMML compliant
- Augustus scales with:
  - Volume of data (Terabytes)
  - Real time transaction streams (15,000/sec+)
  - Number of segmented models (10,000+)
Some Results to Date
- System has been operational for 2.5 years
- ROI:
  - 5.1x in Year 1 (over 6 months)
  - 7.3x in Year 2 (12 months)
  - 10.0x in Year 3 (12 months)
- Currently estimating over 15,000 individual baseline models
- The system has issued alerts for:
  - Merchants using incorrect Merchant Category Code (MCC) - account testing
  - Sales channel variations
  - Incorrect use of merchant city name field
  - Incorrect coding of recurring payments
Summary
- Used new methodology (Modeling using Cubes of Models) for modeling large, highly heterogeneous data sets
- This project contributed in part to the development of a Baseline Model in the open source Augustus system
- Integrated system to generate alerts using Baseline Models with manual investigation process
- Project is generating over 10x ROI
Poster #20 in the Tuesday night Poster Session.