Mining Data Semantics (MDS'2011) Workshop

Presentation transcript:

Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt | Cláudia Antunes, claudia.antunes@ist.utl.pt

CONTENTS Motivation and problem statement S2C+SNA methodology Case study Conclusions My talk is divided into four parts. I will start with a brief motivation for tax-payment fraud detection, which is addressed in our case study, and an introduction to the problem statement. Next, I will present a new methodology for fraud classification based on semi-supervised clustering and social network analysis. A case study will then be presented, with experimental results from applying the methodology to real tax-payment data. Finally, some relevant conclusions will be discussed. (45 s)

CONTENTS Motivation and problem statement S2C+SNA methodology Case study Conclusions I'll start with some general information on tax fraud detection. (10 s)

FRAUD DETECTION IN TAX PAYMENTS Fraud in Tax Payments Improper payments in taxes due to fraud, waste and abuse; Involves millions of possible fraud targets; Need for effective tools to prevent fraud, or at least to identify it in time; As already mentioned, the case study presented in this paper addresses fraud detection in tax payments. This kind of fraud is often associated with improper tax payments due to fraud, waste and abuse. Since it is not possible to investigate all operators (companies and taxpayers) involved in tax payments, the importance of focusing on the operators with a higher risk of committing fraud is clear. Effective tools to prevent fraud, or at least to identify it in time, are therefore essential. (45 s)

CHALLENGES IN FRAUD DETECTION Unbalanced nature of datasets Dataset dominated by non-fraud instances. Difficulty in labeling data Due to the cost of identifying and attesting fraud. False negatives Uncaught fraud can be labeled as non-fraud, which hampers the training process. Fraud detection is naturally impaired by three main issues: 1. The unbalanced nature of the datasets, since they are dominated by non-fraud instances. 2. The difficulty of labeling data, due to the cost of identifying and attesting fraud. 3. False negatives, since any uncaught fraud is generally labeled as non-fraud. This means that an unknown number of instances are incorrectly classified, which hampers the training process. (45 s)

CONTENTS Motivation and problem statement S2C+SNA methodology Case study Conclusions Right, let's move on to the S2C+SNA methodology…

S2C+SNA METHODOLOGY Semi-Supervised Clustering Social Network Analysis The proposed methodology results from the combination of semi-supervised clustering and social network analysis. (10 s)

WHY SEMI-SUPERVISED CLUSTERING? Labeled Data Ability to deal with a reduced amount of labeled data. Unlabeled Data Use unlabeled data to improve generalization. Why choose semi-supervised algorithms? Because they have the ability to deal with a reduced amount of labeled data and also make use of unlabeled data to improve generalization. Semi-supervised clustering uses both labeled and unlabeled data to build a classifier. (30 s)

WHY SOCIAL NETWORKS? Classification Source of valuable attributes for classification, based on entities' social relations. Fraud Fraud is perpetrated by people, who live in a society and have multiple social relations. We chose social networks to enrich our classification model with hidden patterns and relationships between entities, and also because fraud is perpetrated by people who live in a society and have multiple social relations. (30 s) (Indeed, organizations that collect taxes have data that can be used to determine the social network of each entity, and then use it to classify that entity better.)

DATA PREPARATION > DATASET This methodology assumes the existence of two datasets: a dataset with labeled and unlabeled instances, and social network data describing the interactions between these instances. (20 s)

DATA PREPARATION > SNOWBALL SAMPLING In order to discard unneeded components of the social network and optimize computational resources, the target population can be reached using snowball sampling: it is possible to locate target entities and ask them to name others who would be likely candidates for investigation. (25 s)
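The paper does not give code for this step; a minimal sketch of snowball sampling is a breadth-first expansion from the seed entities, keeping only the nodes reached within a fixed number of "waves" (the function name, the adjacency-dict representation and the `waves` parameter are illustrative assumptions, not the authors' implementation):

```python
from collections import deque

def snowball_sample(adjacency, seeds, waves=2):
    """Breadth-first 'snowball' expansion: start from the seed entities and
    add their neighbors wave by wave; the rest of the network is discarded."""
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, wave = frontier.popleft()
        if wave == waves:          # stop expanding past the last wave
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, wave + 1))
    return sampled

# Toy network: "A" is a known target; "D" is disconnected and gets dropped.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
print(snowball_sample(network, ["A"], waves=2))
```

With two waves the sample reaches the two-hop neighborhood of the seeds and discards everything else, which is exactly the resource-saving effect described above.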

DATA PREPARATION > BAD RANK Derived from PageRank and HITS Used by Google to detect web spam BadRank allows us to identify the risk associated with a member by analyzing their links to other "bad" members. In fraud detection, link analysis plays an important role, since fraud most often happens within a criminal network. To extract information from social networks we use the BadRank algorithm, which is derived from PageRank (from Google) and HITS. BadRank allows us to identify the risk associated with a member by analyzing their links to other "bad" members. (35 s)

DATA PREPARATION > BAD RANK (DEMO) Let's see how BadRank works. This figure represents the application of BadRank to a very simple criminal network of three organizations, where only one of them is known to be fraudulent. Using this known case of fraud in the initialization of BadRank can expose the other two organizations, since they will have a high BadRank score for having some kind of relation with the known fraudulent organization. In this way, BadRank can be used to spread fraud risk from a set of known fraudulent organizations (the seed set) to all their neighbors. (50 s)
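The demo can be reproduced with a small PageRank-style power iteration in which the personalization vector is the seed set of known fraudulent entities (this is a hedged sketch of the idea, not the paper's exact formulation; the damping factor and update rule are standard PageRank assumptions):

```python
def bad_rank(adjacency, seed_fraud, damping=0.85, iterations=50):
    """BadRank-style risk score: known fraud (seed_fraud, values in [0,1])
    is repeatedly spread along links, PageRank-style, so that neighbors of
    fraudulent entities accumulate a non-zero risk."""
    nodes = list(adjacency)
    scores = {n: seed_fraud.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # Risk flowing into n from every member m that links to n,
            # divided evenly among m's links.
            spread = sum(scores[m] / max(len(adjacency[m]), 1)
                         for m in adjacency if n in adjacency[m])
            new[n] = (1 - damping) * seed_fraud.get(n, 0.0) + damping * spread
        scores = new
    return scores

# The three-organization demo: only "A" is known fraud, yet "B" and "C"
# end up with high scores purely through their links to "A".
demo = bad_rank({"A": ["B", "C"], "B": ["A"], "C": ["A"]}, {"A": 1.0})
print(demo)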

DATA PREPARATION > BAD RANK The application of BadRank results in a new attribute that enriches the entity description used in the classification process. (20 s)

MODELING > SEMI-SUPERVISED CLUSTERING The most common semi-supervised algorithms studied in this paper are modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge. Typically, this knowledge can be incorporated: when the initial centroids are chosen (by seeding), as in Seeded-KMeans and Constrained-KMeans; or in the form of constraints that have to be satisfied when grouping similar objects (constrained algorithms), as in PCK-Means and MPCK-Means. (60 s)
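To make the seeding idea concrete, here is a minimal sketch of Seeded-KMeans: the only change from plain K-Means is that the initial centroids are the per-class means of the labeled seed instances (the function signature and the fixed iteration count are illustrative choices, not the authors' implementation):

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, iterations=20):
    """Seeded-KMeans sketch: labeled seeds fix the initial centroids;
    after that, assignment and update steps are plain K-Means on all data."""
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    # Seeding step: initial centroid c = mean of the seeds labeled c.
    centroids = np.array([X[seed_idx[seed_labels == c]].mean(axis=0)
                          for c in range(k)])
    for _ in range(iterations):
        # Assign every instance, labeled or not, to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids from the current assignment.
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    return assign
```

Constrained-KMeans differs only in that the seed instances are never reassigned away from their labeled cluster; the constrained variants (PCK-Means, MPCK-Means) instead penalize violated must-link/cannot-link constraints in the objective.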

MODELING > SEMI-SUPERVISED CLUSTERING The main goal of semi-supervised clustering is to assign labels to unlabeled instances, using the domain information carried by the labeled instances together with the unlabeled data. (20 s)

CONTENTS Motivation and problem statement S2C+SNA methodology Case study Conclusions Let's move on to the case study.

CASE STUDY Dataset: fraud in tax payments; 3000 instances; 50% fraud; 50% non-fraud. The dataset used in our case study contains real data on fraud in tax payments. It is important to note that, as in other fraud domains, the original dataset was unbalanced, with only 10% fraud instances. This is a common problem in fraud detection (known as a "skewed class distribution") and can be addressed by sampling and other techniques that transform the dataset into a more balanced one. Since the experiments presented in this work focus only on the problem of detecting fraud with small fractions of labeled data, a balanced dataset with equal numbers of fraud and non-fraud instances was extracted. (60 s)
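The paper does not say exactly how the balanced dataset was extracted; one common technique that matches the description is random undersampling of the majority class, sketched here (function name and fixed seed are hypothetical):

```python
import random

def balance_by_undersampling(instances, labels, seed=42):
    """Keep every fraud instance (label 1) and draw an equal number of
    non-fraud instances (label 0), yielding a 50/50 balanced dataset."""
    rng = random.Random(seed)
    fraud = [x for x, y in zip(instances, labels) if y == 1]
    non_fraud = [x for x, y in zip(instances, labels) if y == 0]
    sampled = rng.sample(non_fraud, len(fraud))  # majority-class subsample
    balanced = [(x, 1) for x in fraud] + [(x, 0) for x in sampled]
    rng.shuffle(balanced)
    return balanced
```

Applied to a population with 10% fraud, this keeps all fraud cases and discards most non-fraud cases, mirroring the 50/50 split of the 3000-instance dataset used in the experiments.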

EXPERIMENTS SETUP Since semi-supervised clustering can produce different accuracy results for different constraints, all the experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for each fraction of incorporated labeled instances. The results presented next report the best, worst and average accuracy obtained on these datasets. (30 s)
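The evaluation protocol above can be sketched in a few lines; `run_experiment` stands in for one full clustering run on a randomly drawn pre-labeled set (the helper names are hypothetical, not from the paper):

```python
import random

def accuracy(predicted, actual):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def best_worst_avg(run_experiment, n_runs=10, seed=0):
    """Repeat the experiment with n_runs different random pre-labeled sets
    and report the best, worst and average accuracy, as in the setup above."""
    rng = random.Random(seed)
    accs = [run_experiment(rng) for _ in range(n_runs)]
    return max(accs), min(accs), sum(accs) / len(accs)
```

Reporting best and worst alongside the average matters here because, as the next slides show, the constraint-based algorithms have high variance across different constraint sets.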

CLUSTERING RESULTS WITH AND WITHOUT THE BADRANK ATTRIBUTE This chart compares the average results of applying the semi-supervised algorithms with and without the BadRank attribute in the classification dataset. On the y-axis we have accuracy: the percentage of instances correctly predicted by the classification model. On the x-axis we have the percentage of labels incorporated into semi-supervised clustering and BadRank. From these results it is clear that with a small fraction of labeled data (about 15%) all semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means). As the fraction of labeled data grows, these algorithms react in different ways. Constrained-KMeans has the best performance compared to the other algorithms. PCK-Means and MPCK-Means do not reveal significant differences in accuracy without BadRank. With the incorporation of BadRank, the results show significant improvements in all experiments once 15% or more labeled data is used. (60 s + 45 s)

BEST AND WORST RESULTS WITHOUT BADRANK The best and worst results obtained without BadRank show that, although all algorithms improve their best results as more labeled instances become available, the constraint-based algorithms (MPCK-Means and PCK-Means) tend to see their worst results decline as the number of labeled instances used as constraints grows. (35 s)

BEST AND WORST RESULTS WITH BADRANK The best results with BadRank show a considerable increase for all semi-supervised algorithms once more than 15% of labeled instances are available. In the worst results, the declining trend of the constraint-based algorithms seems to be attenuated as more labeled instances become available. (40 s)

CONTENTS Motivation and problem statement S2C+SNA methodology Case study Conclusions Finally, let's draw some conclusions.

CONCLUSIONS With a small fraction of labeled instances, all the semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means). Constrained-KMeans has the best performance compared to the other semi-supervised algorithms. Semi-supervised clustering performs better when the classification dataset is enriched with social network analysis: with BadRank, the results show significant improvements in all experiments once 15% or more labeled instances are used. (45 s)

CONCLUSIONS This methodology can also be applied to other areas: where supervised information is very difficult to obtain; where social network analysis can provide important information about human entities, making visible patterns, linkages and connections that could not be discovered using only static (transactional) data. Churn detection is a good candidate for this methodology, considering that once someone at the center of a network (an influencer) decides to change provider, the other entities in the network are likely to follow. (40 s)

THE END QUESTIONS? Why the difference in clustering results? As stated in specific studies of semi-supervised clustering, the difference in performance observed for a fixed number of constraints is explained by two properties of the constraint set: informativeness and coherence. The first refers to "the amount of information in the constraint set that the algorithm cannot determine on its own", while the second relates to "the amount of agreement between the constraints in the set, given a distance metric" [14]. Admittedly, the dataset used in this study is a hard problem for clustering methods, due to the existence of overlapping clusters; but this is a feature of any fraud detection dataset. Why not supervised classification? Supervised classification algorithms assume the existence of a training dataset with an adequate number of pre-classified (labeled) instances to produce effective classification models. Unfortunately, in some cases supervised information is very hard to obtain, so there may not be enough labeled data to train a classification model. In these cases, semi-supervised clustering algorithms can be useful as a classification technique. What does "1% of pre-labeled instances incorporated into the algorithm" mean? Using, for instance, 1% of pre-labeled instances in an experiment means that 1% of the labeled instances of the dataset were incorporated into the respective semi-supervised clustering algorithm and that 1% of the fraud instances were used in the BadRank computation.