1 Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection
Mining Data Semantics (MDS'2011) Workshop, in conjunction with SIGKDD 2011, August 21-24, 2011, San Diego, CA, USA. João Botelho | Cláudia Antunes

2 CONTENTS Motivation and problem statement S2C+SNA methodology
Case study Conclusions My talk is divided into four parts. I will start with a brief motivation for tax-payment fraud detection, which is addressed in our case study, and an introduction to the problem statement. Next, I will present a new methodology for fraud classification based on semi-supervised clustering and social network analysis. A case study will then be presented, with experimental results from applying the methodology to real tax-payment data. Finally, some relevant conclusions will be discussed. (45 s)

3 CONTENTS Motivation and problem statement S2C+SNA methodology
Case study Conclusions I'll start with some general information on tax fraud detection. (10 s)

4 FRAUD DETECTION IN TAX PAYMENTS
Fraud in tax payments: improper tax payments due to fraud, waste, and abuse; involves millions of possible fraud targets; need for effective tools to prevent fraud, or at least to identify it in time. As I have already mentioned, the case study presented in this paper addresses fraud detection in tax payments. This kind of fraud is often associated with improper tax payments due to fraud, waste, and abuse. Since it is not possible to investigate all operators (companies and taxpayers) involved in tax payments, the importance of focusing on operators with a higher risk of committing fraud is clear. Effective tools to prevent fraud, or at least to identify it in time, are therefore very important. (45 s)

5 CHALLENGES IN FRAUD DETECTION
Unbalanced nature of datasets: datasets are dominated by non-fraud instances. Difficulty in labeling data, due to the cost of identifying and attesting fraud. False negatives: uncaught fraud can be labeled as non-fraud, which hampers the training process. Fraud detection is naturally impaired by three main issues: 1. The unbalanced nature of datasets, since they are dominated by non-fraud instances. 2. The difficulty of labeling data, due to the cost of identifying and attesting fraud. 3. False negatives, since uncaught fraud is generally labeled as non-fraud. This means that an unknown number of instances are incorrectly classified, which hampers the training process. (45 s)

6 CONTENTS Motivation and problem statement S2C+SNA methodology
Case study Conclusions Right, let's move on to the S2C+SNA methodology…

7 Solution Methodology
S2C+SNA METHODOLOGY Semi-Supervised Clustering Social Network Analysis The proposed methodology results from the combination of semi-supervised clustering and social network analysis. (10 s)

8 WHY SEMI-SUPERVISED CLUSTERING?
Labeled data: ability to deal with a reduced amount of labeled data. Unlabeled data: use of unlabeled data to improve generalization. Why choose semi-supervised algorithms? Because they have the ability to deal with a reduced amount of labeled data and also make use of unlabeled data to improve generalization. Semi-supervised clustering uses both labeled and unlabeled data to build a classifier. (30 s)

9 WHY SOCIAL NETWORKS? Why social networks?
Classification: a source of valuable attributes for classification, based on entities' social relations. Fraud: fraud is perpetrated by people who live in a society and have multiple social relations. Why social networks? We chose social networks to enrich our classification model with hidden patterns and relationships between entities, and also because fraud is perpetrated by people who live in a society and have multiple social relations. (30 s) (Indeed, organizations that collect taxes have data that can be used to determine the social network of each entity, and then use it to classify that entity better.)

10 DATA PREPARATION > DATASET
This methodology assumes the existence of two datasets: a dataset with labeled and unlabeled instances, and social network data (describing interactions between these instances). (20 s)

11 DATA PREPARATION > SNOWBALL SAMPLING
In order to discard irrelevant components of the social network and optimize computational resources, the target population can be reached using snowball sampling: it is possible to locate target entities and ask them to name others who would be likely candidates for investigation, as in the sketch below. (25 s)
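To make the sampling step concrete, here is a minimal Python sketch of wave-based snowball sampling over an adjacency map. The function and parameter names are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def snowball_sample(adjacency, seeds, waves=2):
    """Breadth-first snowball sample: start from seed entities and
    expand through their contacts for a fixed number of waves.
    `adjacency` maps each entity to the set of entities it interacts with.
    A minimal sketch; names and parameters are illustrative."""
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        entity, wave = frontier.popleft()
        if wave == waves:
            continue  # do not expand beyond the last wave
        for neighbor in adjacency.get(entity, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, wave + 1))
    return sampled

# Example: two waves outward from one known fraud case.
network = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"},
           "D": {"B", "E"}, "E": {"D"}}
print(snowball_sample(network, seeds={"A"}, waves=2))  # {'A', 'B', 'C', 'D'}
```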

12 DATA PREPARATION > BAD RANK
Derived from PageRank and HITS; used by Google to detect web spam. BadRank allows us to identify the risk associated with a member by analyzing their links to other "bad" members. In fraud detection, link analysis plays an important role, since fraud often happens within a criminal network. To extract information from social networks we use the BadRank algorithm, which is derived from PageRank (from Google) and HITS. BadRank allows us to identify the risk associated with a member by analyzing their links to other "bad" members. (35 s)

13 DATA PREPARATION > BAD RANK (DEMO)
Let's see how BadRank works. This figure represents the application of BadRank to a very simple criminal network between three organizations, where only one of them is known to be fraudulent. Using this known fraud case in the initialization of BadRank can expose the other two organizations, since they will have a high BadRank score for having some kind of relation with the known fraudulent organization. In this way, BadRank can be used to spread fraud risk from a set of known fraudulent organizations (the seed set) to all their neighbors; a minimal sketch of this propagation follows. (50 s)
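The figure itself is not reproduced here, but the propagation it illustrates can be sketched in a few lines of Python. This is a hedged sketch of the general seed-propagation idea (a personalized-PageRank-style iteration), not the paper's exact BadRank formula; all names are illustrative.

```python
def bad_rank(adjacency, seeds, damping=0.85, iterations=50):
    """Seed-personalized propagation in the spirit of BadRank:
    risk flows from known fraud entities to their neighbors.
    A sketch of the idea only; the paper's formulation may differ."""
    nodes = list(adjacency)
    seed = {n: (1.0 if n in seeds else 0.0) for n in nodes}
    score = dict(seed)
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # Risk received from each neighbor, split over that
            # neighbor's own links (undirected toy network).
            incoming = sum(score[m] / len(adjacency[m]) for m in adjacency[n])
            new[n] = (1 - damping) * seed[n] + damping * incoming
        score = new
    return score

# Toy network from the demo slide: three organizations, one known fraud.
network = {"org1": ["org2", "org3"], "org2": ["org1"], "org3": ["org1"]}
print(bad_rank(network, seeds={"org1"}))
# org2 and org3 inherit a high risk score from their link to org1
```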

14 DATA PREPARATION > BAD RANK
The application of BadRank results in a new attribute that enriches the entity description to be used in the classification process. (20 s)

15 MODELING > SEMI-SUPERVISED CLUSTERING
In the modeling phase we apply semi-supervised clustering. The most common semi-supervised algorithms studied in this paper are modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge. Typically, this knowledge is incorporated either when the initial centroids are chosen (by seeding), as in Seeded-KMeans and Constrained-KMeans, or in the form of constraints that have to be satisfied when grouping similar objects (constraint-based algorithms), as in PCK-Means and MPCK-Means. A minimal sketch of the seeding variant follows. (60 s)
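As a concrete illustration of the seeding idea, here is a simplified Seeded-KMeans sketch in Python: centroids are initialized from the labeled (seed) instances of each class, then standard K-Means runs over all instances. This is an assumption-laden illustration, not the authors' implementation.

```python
import numpy as np

def seeded_kmeans(X, y_seed, k=2, iterations=20):
    """Seeded-KMeans sketch: initialize each centroid as the mean of the
    labeled seed instances of that class, then run plain K-Means over
    all instances (seeds may be reassigned, unlike Constrained-KMeans).
    X: (n, d) feature matrix; y_seed: length-n array with class ids for
    labeled instances and -1 for unlabeled ones. Illustrative only."""
    centroids = np.stack([X[y_seed == c].mean(axis=0) for c in range(k)])
    for _ in range(iterations):
        # Assign every instance (labeled or not) to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    return assign, centroids

# Example: a few labeled seeds steer the clusters toward fraud/non-fraud.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_seed = np.full(100, -1)
y_seed[:3], y_seed[50:53] = 0, 1  # three labeled seeds per class
labels, _ = seeded_kmeans(X, y_seed)
```

Constrained-KMeans differs only in keeping the seed instances fixed in their labeled clusters during the assignment step.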

16 MODELING > SEMI-SUPERVISED CLUSTERING
The main goal of semi-supervised clustering is to assign labels to unlabeled instances, using both the domain information carried by the labeled instances and the structure of the unlabeled instances. (20 s)

17 CONTENTS Motivation and problem statement S2C+SNA methodology
Case study Conclusions Let's move on to the case study.

18 CASE STUDY Dataset: fraud in tax payments.
The dataset used in our case study contains real data on fraud in tax payments: 3000 instances, 50% fraud, 50% non-fraud. It is important to note that, as in other fraud domains, the original dataset was unbalanced, with only 10% fraud instances. This is a common problem in fraud detection (known as a skewed class distribution) and can be addressed by sampling and other techniques that transform the dataset into a more balanced one. Since the experiments presented in this work focus only on the problem of detecting fraud with small fractions of labeled data, a balanced dataset with an equal number of fraud and non-fraud instances was extracted, as in the sketch below. (60 s)
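One common way to obtain such a balanced dataset is random undersampling of the majority class; a minimal sketch, with the function name and interface assumed for illustration:

```python
import numpy as np

def balance_by_undersampling(X, y, rng=None):
    """Random undersampling to a 50/50 class ratio: keep all minority
    (fraud) instances and draw an equal number of majority (non-fraud)
    instances. An illustrative sketch of one common balancing technique."""
    rng = rng or np.random.default_rng(0)
    fraud = np.flatnonzero(y == 1)
    non_fraud = np.flatnonzero(y == 0)
    keep = rng.choice(non_fraud, size=len(fraud), replace=False)
    idx = rng.permutation(np.concatenate([fraud, keep]))
    return X[idx], y[idx]
```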

19 EXPERIMENT SETUP Since semi-supervised clustering can produce different accuracy results for different constraints, all the experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for different fractions of incorporated labeled instances. The results presented next report the best, worst, and average accuracy obtained on these datasets, following the loop sketched below. (30 s)
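A hedged sketch of what such an experiment loop might look like in Python; `run_algorithm` stands in for any of the four semi-supervised algorithms and is an assumed callback, not code from the paper, and the label fractions shown are illustrative.

```python
import numpy as np

def evaluate(X, y, run_algorithm,
             fractions=(0.01, 0.05, 0.15, 0.30), runs=10):
    """For each label fraction, draw `runs` random seed sets, run the
    semi-supervised clusterer, and record worst/best/average accuracy."""
    rng = np.random.default_rng(0)
    results = {}
    for frac in fractions:
        accs = []
        for _ in range(runs):
            labeled = rng.choice(len(y), size=int(frac * len(y)),
                                 replace=False)
            y_pred = run_algorithm(X, y, labeled)  # labels for all instances
            accs.append(float(np.mean(y_pred == y)))
        results[frac] = (min(accs), max(accs), float(np.mean(accs)))
    return results  # per fraction: (worst, best, average accuracy)
```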

20 CLUSTERING RESULTS WITH AND WITHOUT THE BADRANK ATTRIBUTE
This chart compares the average results of applying the semi-supervised algorithms with and without the BadRank attribute in the classification dataset. On the y-axis we have accuracy, which summarizes the percentage of instances correctly predicted by the classification model. On the x-axis we have the percentage of labels incorporated into semi-supervised clustering and BadRank. From these results it is clear that, with a small fraction of labeled data (about 15%), all semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means). When the fraction of labeled data grows, the algorithms react in different ways. Constrained-KMeans has the best performance compared with the other algorithms. PCK-Means and MPCK-Means do not reveal significant differences in accuracy without BadRank. With the incorporation of BadRank, the results show significant improvements in all experiments once more than 15% of labeled data is used.

21 BEST AND WORST RESULTS WITHOUT BADRANK
The best and worst results obtained without BadRank show that, although all algorithms improve their best results as more labeled instances are made available, the constraint-based algorithms (MPCK-Means and PCK-Means) tend to see their worst results decline as the number of labeled instances used as constraints grows. (35 s)

22 BEST AND WORST RESULTS WITH BADRANK
The best results with BadRank show a considerable increase for all semi-supervised algorithms once 15% of labeled instances are made available. In the worst results, the declining trend of the constraint-based algorithms seems to be attenuated as more labeled instances are made available. (40 s)

23 CONTENTS Motivation and problem statement S2C+SNA methodology
Case study Conclusions Finally, let's draw some conclusions.

24 CONCLUSIONS With a small fraction of labeled instances, all the semi-supervised algorithms obtain a significant improvement over unsupervised clustering (K-Means). Constrained-KMeans has the best performance compared with the other semi-supervised algorithms. Semi-supervised clustering performs better when the data is enriched with social network analysis: with BadRank, the results show significant improvements in all experiments once more than 15% of labeled instances are used. (45 s)

25 CONCLUSIONS This methodology can also be applied to other areas:
where supervised information is very difficult to obtain; and where social network analysis can provide important information about human entities, making visible patterns, linkages, and connections that could not be discovered using only static (transactional) data. Churn detection is a good candidate for this methodology, considering that once someone at the center of a network (an influencer) decides to change provider, the other entities in the network are likely to follow. (40 s)

26 END. QUESTIONS? Why the difference in clustering results?
As has been stated in specific studies of semi-supervised clustering, the difference in performance observed for a fixed number of constraints is explained by two properties of the constraint set: informativeness and coherence. The first refers to "the amount of information in the constraint set that the algorithm cannot determine on its own", while the second relates to "the amount of agreement between the constraints in the set, given a distance metric" [14]. Admittedly, the dataset used in this study poses a hard problem for clustering methods, due to the existence of overlapping clusters; but this is a feature of any fraud detection dataset. Why not supervised classification? Supervised classification algorithms assume the existence of a training dataset with an adequate number of pre-classified (labeled) instances to produce effective classification models. Unfortunately, in some cases supervised information is very hard to obtain, and consequently there may not be enough labeled data to train a classification model. In these cases, semi-supervised clustering algorithms can be useful as a classification technique. What does "1% of pre-labeled instances incorporated into the algorithm" mean? It means that 1% of the labeled instances in the dataset were incorporated into the respective semi-supervised clustering algorithm, and that 1% of the fraud instances were used in the BadRank computation.

