Hidden Decision Trees to Design Predictive Scores – Application to Fraud Detection
Vincent Granville, Ph.D., AnalyticBridge
October 27, 2009

Potential Applications
- Fraud detection, spam detection
- Web analytics
  - Keyword scoring/bidding (ad networks, paid search)
  - Transaction scoring (click, impression, conversion, action)
  - Click fraud detection
  - Web site scoring, ad scoring, landing page / advertiser scoring
  - Collaborative filtering (social network analytics)
  - Relevancy algorithms
- Text mining
  - Scoring and ranking algorithms
  - Infringement detection
  - User feedback: automated clustering

General Model
- Predictive scores: score = f(response)
- Response:
  - Odds of converting (Internet traffic data – hard/soft conversions)
  - CR (conversion rate)
  - Probability that the transaction is fraudulent
- Independent variables:
  - Rules
  - Highly correlated
- Traditional models: logistic regression, decision trees, naïve Bayes
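A minimal sketch of the traditional baseline named on this slide: a logistic regression over binary rule indicators. The rule names, weights, and intercept below are invented for illustration; in practice they would be fitted on the training set.

```python
import math

# Hypothetical rule weights (illustrative only; real weights are fitted
# by maximum likelihood on the training set).
WEIGHTS = {
    "rule_ip_blacklisted": -2.1,
    "rule_night_traffic": -0.7,
    "rule_known_advertiser": 1.3,
}
INTERCEPT = 0.5

def logistic_score(fired_rules):
    """Score a transaction from the set of binary rules it triggers:
    P(conversion) = sigmoid(intercept + sum of fired-rule weights)."""
    z = INTERCEPT + sum(WEIGHTS[r] for r in fired_rules)
    return 1.0 / (1.0 + math.exp(-z))
```

For example, a transaction triggering `rule_ip_blacklisted` scores lower than one triggering no rule at all.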

Hidden Decision Trees (HDT)
- One-to-one mapping between scores and features
- Feature = rule combination = node (in DT terminology)
- HDT fundamentals:
  - 40 binary rules → 2^40 potential features
  - Training set with 10MM transactions → at most 10MM observed features
  - 500 out of 10MM features account for 80% of all transactions
  - The top 500 features have strong CR predictive power
  - An alternate algorithm is required to classify the remaining 20% of transactions
  - Using neighboring top features to score the remaining 20% of transactions creates bias
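The feature-extraction step above can be sketched as follows. The encoding of a transaction as the set of rule names it triggers, and the 500-node cutoff, are illustrative choices, not prescribed by the slides.

```python
from collections import Counter

def top_features(transactions, max_nodes=500):
    """Reduce each transaction to its feature (the frozenset of binary
    rules it triggers), then keep the most frequent rule combinations.
    Per the slides, ~500 combinations typically cover ~80% of traffic.
    Returns the kept features and the fraction of transactions covered."""
    counts = Counter(frozenset(t) for t in transactions)
    top = counts.most_common(max_nodes)
    coverage = sum(n for _, n in top) / len(transactions)
    return [feature for feature, _ in top], coverage
```

Transactions whose feature falls outside the kept list are the ones routed to the fallback scorer rather than matched to a "neighboring" node, which would create bias.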

HDT Implementation
- Each top node (or feature) is a final node from a hidden decision tree
  - No need for tree pruning / splitting algorithms and criteria: HDT is straightforward, fast, and can rely on efficient hash tables (where key = feature, value = score)
- The top 500 nodes come from multiple hidden decision trees
- The remaining 20% of transactions are scored using an alternate methodology (typically, logistic regression)
- HDT is a hybrid algorithm
  - Blending multiple, small, easy-to-interpret, invisible decision trees (final nodes only) with logistic regression
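A minimal sketch of the hash table described above (key = feature, value = score), using each node's observed conversion rate as its score. That choice of score is one natural reading of the slides, not something they prescribe.

```python
def build_score_table(transactions, labels):
    """Hash table mapping feature -> score, where the score of a node
    is its observed conversion rate in the training set."""
    totals, conversions = {}, {}
    for rules, converted in zip(transactions, labels):
        feature = frozenset(rules)
        totals[feature] = totals.get(feature, 0) + 1
        conversions[feature] = conversions.get(feature, 0) + converted
    return {f: conversions[f] / totals[f] for f in totals}

def hdt_score(rules, table):
    """O(1) lookup: return the node score, or None when the transaction
    falls outside the top nodes and must go to the fallback scorer
    (typically logistic regression)."""
    return table.get(frozenset(rules))
```

No pruning or splitting criteria are needed: scoring a transaction is a single dictionary lookup.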

HDT: Score Blending
- The top 500 nodes provide a score S1, available for 80% of the transactions
- The logistic regression provides a score S2, available for 100% of the transactions
- Rescale S2 using the 80% of transactions that have both scores S1 and S2
  - Make S1 and S2 compatible on these transactions
  - Let S3 be the rescaled S2
- Transactions that can't be scored with S1 are scored with S3
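One way to implement the rescaling step above. The slides do not fix the method; matching the mean and standard deviation of S2 to S1 on the overlapping 80% is a simple assumed choice.

```python
import statistics

def rescale(s1_overlap, s2_overlap):
    """Fit an affine map S3 = a*S2 + b on transactions carrying both
    scores, so that S3 matches the mean and spread of S1 there."""
    m1, sd1 = statistics.mean(s1_overlap), statistics.pstdev(s1_overlap)
    m2, sd2 = statistics.mean(s2_overlap), statistics.pstdev(s2_overlap)
    a = sd1 / sd2 if sd2 else 1.0
    b = m1 - a * m2
    return lambda s2: a * s2 + b

def blended_score(s1, s2, to_s3):
    """Use the node score S1 when available, otherwise the rescaled S3."""
    return s1 if s1 is not None else to_s3(s2)
```

After rescaling, every transaction carries a score on the same scale, whether it landed in a top node or not.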

HDT History
- 2003: First version applied to credit card fraud detection
- 2006: Application to click scoring and click fraud detection
- 2008: More advanced versions to handle granular and very large data sets
  - Hidden forests: multiple HDTs, each one applied to a cluster of correlated rules
  - Hierarchical HDTs: the top structure, not just rule clusters, is modeled using HDTs
  - Non-binary rules (naïve Bayes blended with HDT)

HDT Nodes: Example
(Figure: Y-axis = CR, X-axis = score, square size = # observations)

HDT Nodes: Nodes = Segments
- 80% of transactions are found in the 500 top nodes
- Each large square – or segment – corresponds to a specific transaction pattern
- HDT nodes provide an alternate segmentation of the data
  - One large, medium-score segment corresponds to neutral traffic (triggering no rule)
  - Segments with very low scores correspond to specific fraud cases
- Within each segment, all transactions have the same score
- Usually provides a different segmentation than PCA and other analyses

Score Distribution: Example

Score Distribution: Interpretation
- Reverse bell curve
  - Scores below 425 correspond to un-billable transactions
  - Spike at the very bottom and very top of the score scale
  - 50% of all transactions have good scores
- Scorecard parameters
  - A drop of 50 points represents a 50% drop in conversion rate
  - Average score is 650
- Model improvement: from reverse bell curve to bell curve
  - Transaction quality vs. fraud detection
  - Anti-rules, score smoothing (will also remove score caking)
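The scorecard rule above (CR halves for every 50-point drop, anchored at the average score of 650) can be written out explicitly:

```python
def relative_cr(score, anchor=650.0, half_life=50.0):
    """Conversion rate relative to the anchor score:
    CR(score) = CR(anchor) * 2 ** ((score - anchor) / half_life),
    so each 50-point drop halves the conversion rate."""
    return 2.0 ** ((score - anchor) / half_life)
```

So a score of 600 corresponds to half the average conversion rate, and 700 to twice the average.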

GOF (Goodness-of-Fit) Scores vs. CR: Example

GOF: Interpretation
- Overall good fit
- Peaks could mean:
  - Bogus conversions
  - Residual noise
  - Model needs improvement (incorporate anti-rules)
- Valleys could mean:
  - Undetected conversions (cookie issue, time-to-conversion unusually high)
  - Residual noise
  - Model needs improvement

Conclusions
- Fast algorithm, easy to implement, handles large data sets efficiently, output is easy to interpret
- Non-parametric, robust
  - Risk of over-fitting is small if no more than the top 500 nodes are selected and ad-hoc cross-validation techniques are used to remove unstable nodes
  - Built-in mechanism to compute confidence intervals for scores
- HDT: a hybrid algorithm to detect multiple types of structures
  - Linear and non-linear structures
- Future directions
  - Hidden forests to handle granular data
  - Hierarchical HDTs