Cluster Analysis for Anomaly Detection. Sutapat Thiprungsri, Rutgers Business School, July 31, 2010.

Presentation transcript:


Contribution
– To demonstrate that cluster analysis can be used to build a model for anomaly detection in auditing.
– To provide a guideline and example for using cluster analysis in continuous auditing.

Cluster Analysis
– Clustering is an unsupervised learning technique.
– It groups data points so that points within a single group (cluster) are similar, while points in different groups are dissimilar.
Figure: An outline of the cluster analysis procedure (Kachigan, 1991).

Cluster Analysis: Application
Marketing
– Cluster analysis is used as a methodology for understanding market segments and buyer behavior. For example: B. Zafer et al. (2006), Ya-Yueh et al. (2003), Vicki et al. (1992), Rajendra et al. (1981), Lewis et al. (2006), Hua-Cheng et al. (2005).
– Market segmentation using cluster analysis has been examined in many industries, for instance finance and banking (Anderson et al., 1976; Calantone et al., 1978), automobiles (Kiel et al., 1981), education (Moriarty et al., 1978), consumer products (Sexton, 1974; Schaninger et al., 1980), and high technology (Green et al., 1968).

Cluster Analysis for Outlier Detection
– An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins, 1980).
– The literature finds outliers as a by-product of clustering algorithms (Ester et al., 1996; Zhang et al., 1996; Wang et al., 1997; Agrawal et al., 1998; Hinneburg and Keim, 1998; Guha et al.).
  – Distance-based outliers (Knorr and Ng, 1998, 1999; Ramaswamy et al., 2000)
  – Cluster-based outliers (Knorr and Ng, 1999; Jiang et al., 2001; He et al., 2003; Duan et al., 2009)
Research question: How can clustering models be applied to detect abnormal (fraudulent or erroneous) transactions in continuous auditing?

The Setting: Group Life Claims
Purpose
– To detect potential fraud or errors in the group life claims process by using clustering techniques.
Data
– Group life claims from a major insurance company for Q1 2009.
– Approximately 184,000 claims processed per year (~40,000 claims per quarter).

Group Life Claims Processing System (BIOS)
Flowchart elements:
– Claim receipt & setup (Note A)
– If clean form (meets standard requirements) and under $10,000, then auto-adjudicated
– Data entry & automated system review by claim reviewer
– Assign unique claim numbers
– Claim-level details run against plan business rules, state requirements, and plan options
– Record & payment processing (Note B)
– Claim examiner gathers additional information and approves if within approving authority $ limits
– Countersignature required for amounts over approving authority
– Reject/deny claim
– Treasury Workstation (TWS)
Group Life business units: Claims, Underwriting, Billing
Note A: The key elements of a claim, including the Employer’s Statement, Beneficiary Designation, and Enrollment Forms, are submitted via the online system. The claimant’s statement and death certificate are submitted on paper. All paper documents supporting the claims are imaged.
Note B: Payments are made to the beneficiary(ies) in one instance but can be made to multiple beneficiaries.

Clustering Procedure
Clustering algorithm: K-means
Clustering attributes:
– Percentage: total interest payment / total beneficiary payment.
  N_percentage = (percentage - MEAN) / STD
– AverageCLM_PMT: average number of days between the claim received date and the payment date (a weighted average is used because a claim can have multiple payment dates).
  N_AverageCLM_PMT = (AverageCLM_PMT - MEAN) / STD
– DTH_CLM: number of days between the death date and the claim received date.
  N_DTH_CLM = (DTH_CLM - MEAN) / STD
– AverageDTH_PMT: average number of days between the death date and the payment dates (a weighted average is used because a claim can have multiple payment dates).
  N_AverageDTH_PMT = (AverageDTH_PMT - MEAN) / STD
Timeline figure: DTH (death date), CLM (claim received date), PMT (payment date).
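The normalization and clustering steps on this slide can be sketched as follows. This is a minimal illustration, not the tool or settings used in the study: the claim values are invented, the population form of the standard deviation is an assumption (the slide does not say which form was used), and k_means is a plain Lloyd's-algorithm helper written for this example.

```python
import math
import random


def z_scores(values):
    """Normalize a column as (value - MEAN) / STD, like the N_* attributes."""
    mean = sum(values) / len(values)
    # Assumption: population standard deviation.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def k_means(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical claims, invented for illustration only.
percentage = [0.01, 0.02, 0.01, 0.03, 0.02, 0.40]
average_dth_pmt = [30.0, 35.0, 28.0, 40.0, 33.0, 400.0]

n_percentage = z_scores(percentage)
n_average_dth_pmt = z_scores(average_dth_pmt)
points = list(zip(n_percentage, n_average_dth_pmt))

labels, centroids = k_means(points, k=2)
print(labels)
```

In this toy data the last claim has an unusually high interest share and a long delay, so it ends up alone in its own cluster.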

Cluster centroids
Full data: 40080 instances. Centroid values for N_AverageDTH_PMT and N_percentage are not preserved in this transcript.
Clustered instances:
Cluster   Instances   Share
0          2523         6%
1            54         0%
2            84         0%
3           222         1%
4           295         1%
5            31         0%
6           768         2%
7         36103        90%
Attributes:
– N_AverageDTH_PMT: normalized average number of days between the death date and the payment dates (a weighted average is used because a claim can have multiple payment dates).
– N_Percentage: normalized total interest payment / total beneficiary payment.

Cluster 1: 54 claims; Cluster 2: 84 claims; Cluster 5: 31 claims

Cluster centroids
Full data: 40080 instances. Centroid values for N_AverageCLM_PMT, N_DTH_CLM, N_AverageDTH_PMT, and N_percentage are not preserved in this transcript.
Clustered instances:
Cluster   Instances   Share
0           510         1%
1           343         1%
2           194         0%
3            98         0%
4          3699         9%
5            30         0%
6          1275         3%
7           741         2%
8         32658        81%
9           286         1%
10           39         0%
11          110         0%
12           97         0%
Attributes:
– N_AverageCLM_PMT: normalized average number of days between the claim received date and the payment dates (a weighted average is used because a claim can have multiple payment dates).
– N_DTH_CLM: normalized number of days between the death date and the claim received date.
– N_AverageDTH_PMT: normalized average number of days between the death date and the payment dates (a weighted average is used because a claim can have multiple payment dates).
– N_Percentage: normalized total interest payment / total beneficiary payment.

Cluster 2: 194 claims; Cluster 3: 98 claims; Cluster 5: 30 claims; Cluster 10: 39 claims; Cluster 11: 110 claims; Cluster 12: 97 claims
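One way to pick out small clusters like those listed above is to flag every cluster whose share of all claims falls below a cutoff. The sketch below uses the cluster sizes from the two centroid slides. The 0.5% cutoff and the small_clusters helper are assumptions made for illustration; the slides do not state how the small clusters were chosen, although this cutoff happens to select the same clusters.

```python
def small_clusters(sizes, max_share=0.005):
    """Return {cluster index: size} for clusters below max_share of all claims.

    max_share is a hypothetical cutoff, not a value taken from the slides.
    """
    total = sum(sizes)
    return {
        cluster: n for cluster, n in enumerate(sizes) if n / total < max_share
    }


# Cluster sizes from the centroid slides (40080 claims in total).
two_attr_sizes = [2523, 54, 84, 222, 295, 31, 768, 36103]
four_attr_sizes = [510, 343, 194, 98, 3699, 30, 1275, 741,
                   32658, 286, 39, 110, 97]

flagged_two = small_clusters(two_attr_sizes)
flagged_four = small_clusters(four_attr_sizes)

print(flagged_two)   # {1: 54, 2: 84, 5: 31}
print(flagged_four)  # {2: 194, 3: 98, 5: 30, 10: 39, 11: 110, 12: 97}
```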

Distance-Based Outliers
– A distance-based outlier is a data object that lies far from the center of its cluster.
– A probability distribution over the clusters is calculated for each observation.
– Observations with a probability lower than 0.6 are identified as possible outliers.
Table columns: CLM_ID, Cluster0, Cluster1, Cluster2, Cluster… (values not preserved in this transcript)
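The thresholding rule above could look like the sketch below. The slide does not say how the probability distribution is computed or which probability is compared with 0.6. As an assumption, this example turns distances to the centroids into probabilities with a softmax over negative squared distances and flags a claim when its highest membership probability is below 0.6. The centroids and claim IDs are invented.

```python
import math


def membership_probabilities(point, centroids):
    """Softmax over negative squared distances (an assumed conversion)."""
    scores = [-math.dist(point, c) ** 2 for c in centroids]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def possible_outliers(points, centroids, threshold=0.6):
    """Return IDs whose best cluster-membership probability is below threshold."""
    flagged = []
    for claim_id, point in points.items():
        probs = membership_probabilities(point, centroids)
        if max(probs) < threshold:
            flagged.append(claim_id)
    return flagged


# Hypothetical normalized centroids and claims.
centroids = [(0.0, 0.0), (3.0, 3.0)]
claims = {"A": (0.1, -0.1), "B": (2.9, 3.2), "C": (1.5, 1.5)}

print(possible_outliers(claims, centroids))  # ['C']
```

Claim C sits midway between the two centroids, so its membership is split 0.5 / 0.5 and it falls below the 0.6 threshold.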

Distance-Based Outliers
Simple K-means: 2 attributes
Cluster   Outliers
0         154
1           0
2           6
3           9
4          22
5           2
6          36
7          96
Simple K-means: 4 attributes
Cluster   Outliers
0          31
1          21
2           7
3           2
4         205
5           2
6          49
7          46
8         157
9          11
10          0
11         12
12          4

Results: Distance-Based and Cluster-Based Outliers
– Cluster-based outliers can be used to identify clusters with smaller populations as outliers.
– Distance-based outliers can be used to identify specific observations within clusters as outliers.
Cluster analysis                      Cluster-based outliers   Distance-based outliers
Cluster analysis with 2 attributes    (not preserved)          (not preserved)
Cluster analysis with 4 attributes    568                      547
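The four-attribute totals of 568 and 547 can be checked by summing the per-cluster counts from the earlier slides, as in the short tally below. The two-attribute totals are computed the same way for comparison. They are derived here from the per-cluster counts and are not figures stated in this transcript.

```python
# Claims in the small clusters listed on the slides (cluster index: claims).
cluster_based_4 = {2: 194, 3: 98, 5: 30, 10: 39, 11: 110, 12: 97}
cluster_based_2 = {1: 54, 2: 84, 5: 31}

# Distance-based outliers per cluster, in cluster order 0, 1, 2, ...
distance_based_4 = [31, 21, 7, 2, 205, 2, 49, 46, 157, 11, 0, 12, 4]
distance_based_2 = [154, 0, 6, 9, 22, 2, 36, 96]

totals = {
    "4 attributes": (sum(cluster_based_4.values()), sum(distance_based_4)),
    # Derived for comparison only; not reported in the transcript.
    "2 attributes": (sum(cluster_based_2.values()), sum(distance_based_2)),
}

print(totals["4 attributes"])  # (568, 547)
print(totals["2 attributes"])  # (169, 325)
```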

Limitations
– Cluster analysis always generates clusters, regardless of the properties of the data set, so the interpretation of the results might not be clear.
– Identified anomalies will have to be verified.
Future Research
– More attributes related to other aspects of the claims could be used.
– Rule-based selection processes could be incorporated to help identify anomalies.
