Detection of Financial Statement Fraud and Feature Selection Using Data Mining Techniques. Advisor: Prof. 徐立群. Students: 吳泰霖 (R16031145), 顏伶安 (R16034193). Jan 9, 2015.

1 Detection of Financial Statement Fraud and Feature Selection Using Data Mining Techniques. Advisor: Prof. 徐立群. Students: 吳泰霖 (R16031145), 顏伶安 (R16034193). Jan 9, 2015

2 1. Introduction Briefly speaking… Data mining supports the computer auditing department. Which technique has the strongest predictive power for financial statement fraud? Fraud costs US business more than $400 billion annually (Kirkos et al.).

3 2. Literature Review decision trees neural networks (NN) Bayesian belief networks case based reasoning fuzzy rule-based reasoning hybrid methods Aamodt and Plaza [1] and Kotsiantis et al. Further, Deshmukh and Talluru Calderon and Cheh support vector machines Cecchini et al. Huang et al. Sohl and Venkatachalam Cerullo and Cerullo Koskivaara Feroz et al. Aamodt and Plaza and Kotsiantis et al. Pacheco et al.

4 3. Methodology Dataset: 35 financial items of 202 companies from China (101 fraudulent and 101 non-fraudulent). Dataset processing: log transformation; ten-fold cross validation → improves the reliability of the results.
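The two preprocessing steps on this slide can be sketched as follows. The slide does not say which form of log transformation was used or whether the folds were stratified, so the signed log `sign(x) * log(1 + |x|)` and the per-class fold assignment below are assumptions, and the function names are hypothetical.

```python
import math
import random

def log_transform(rows):
    """Signed log transform, sign(x) * log(1 + |x|), applied to every value.

    Assumption: the signed form is used so that negative financial items
    remain defined; the slides only say "log transformation".
    """
    return [[math.copysign(math.log1p(abs(x)), x) for x in row] for row in rows]

def stratified_folds(labels, k=10, seed=0):
    """Split row indices into k folds with a similar class balance in each."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# 101 fraudulent (1) and 101 non-fraudulent (0) companies, as on the slide.
labels = [1] * 101 + [0] * 101
folds = stratified_folds(labels)

for held_out in range(10):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A classifier would be trained on train_idx and scored on test_idx here.
```

Each company is held out exactly once, so the ten fold-level scores can be averaged into one estimate per classifier.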

5 Without feature selection With feature selection

6 Support Vector Machines (SVM) The training examples that are closest to the maximum margin hyperplane are called support vectors. All other training examples are irrelevant for defining the binary class boundaries. Financial applications: credit rating, time series prediction, and insurance claim fraud detection.

7 Genetic Programming (GP) An extension of genetic algorithms (GA). Evolutionary computation → reproduction, crossover, mutation → find the best solution to the problem.

8 Multi-Layer Feedforward Neural Network (MLFF) MLFF is one of the most common NN structures: it is simple and effective, and has found a home in a wide assortment of machine learning applications.

9 Group Method of Data Handling (GMDH) An inductive learning algorithm for modeling complex systems. It is a self-organizing approach that tests increasingly complicated models and evaluates them using some external criterion on separate parts of the data sample.

10 Logistic Regression (LR) The response variable is taken to be dichotomous or binary, i.e., y_i = 0 or 1 for all i = 1, ..., n. LR is a commonly used approach for performing binary classification.

11 Probabilistic Neural Network (PNN) It is a pattern classification network based on the classical Bayes classifier, which is statistically an optimal classifier that seeks to minimize the risk of misclassification. Layers: input layer → pattern layer → summation layer → output layer.
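The four PNN layers can be illustrated with a minimal sketch. It assumes Gaussian kernels with a single smoothing parameter `sigma`, equal class priors, and equal misclassification costs; none of these settings are given on the slide, and `pnn_predict` is a hypothetical name.

```python
import math

def pnn_predict(x, train_rows, train_labels, sigma=1.0):
    """Classify x with a basic probabilistic neural network."""
    sums, counts = {}, {}
    # Input layer: x is passed to every pattern unit unchanged.
    for row, label in zip(train_rows, train_labels):
        # Pattern layer: one Gaussian kernel per training example.
        dist2 = sum((xi - ri) ** 2 for xi, ri in zip(x, row))
        activation = math.exp(-dist2 / (2 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    # Summation layer: average kernel activation per class,
    # which estimates the class-conditional density at x.
    scores = {c: sums[c] / counts[c] for c in sums}
    # Output layer: choose the class with the largest estimated density.
    return max(scores, key=scores.get)

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5).
rows = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = [0, 0, 1, 1]
print(pnn_predict([0.2, 0.3], rows, labels))
print(pnn_predict([5.5, 5.0], rows, labels))
```

In practice `sigma` would be tuned, since it controls how far each training example's influence reaches.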

12 4. Feature Selection In this study, we employed a feature selection phase by using the simple t-statistic technique. For a given feature, μ1 and μ2 are the sample means for fraudulent and non-fraudulent companies respectively, σ1 and σ2 are the corresponding standard deviations, and n1 and n2 are the corresponding numbers of samples.

13 The t-statistic values are computed for each feature, and the 18 features with the highest t-statistic are selected. A high t-statistic value means the feature discriminates strongly between fraudulent and non-fraudulent companies.
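The formula itself did not survive in this transcript. The sketch below assumes the standard two-sample form built from the symbols defined on the previous slide, t = |μ1 − μ2| / sqrt(σ1²/n1 + σ2²/n2), with sample standard deviations; the function names and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev

def t_statistic(fraud_values, nonfraud_values):
    """Two-sample t-statistic for one feature (assumed form, see lead-in)."""
    n1, n2 = len(fraud_values), len(nonfraud_values)
    mu1, mu2 = mean(fraud_values), mean(nonfraud_values)
    s1, s2 = stdev(fraud_values), stdev(nonfraud_values)
    # Note: this divides by zero if both groups have zero variance.
    return abs(mu1 - mu2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def top_features(fraud_rows, nonfraud_rows, k):
    """Return the indices of the k features with the highest t-statistic."""
    n_features = len(fraud_rows[0])
    scored = []
    for j in range(n_features):
        t = t_statistic([r[j] for r in fraud_rows],
                        [r[j] for r in nonfraud_rows])
        scored.append((t, j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]

# Hypothetical toy data: feature 0 separates the groups, feature 1 does not.
fraud = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
nonfraud = [[1.0, 1.2], [2.0, 1.8], [3.0, 1.4]]
print(top_features(fraud, nonfraud, k=1))  # → [0]
```

With the study's data, `k=18` and `k=10` would give the two reduced feature sets used in the later results.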

14 Top 18 items selected by t-statistic based feature selection.

15 5. Results and Discussion 35 financial items for 202 companies, evaluated in three settings: 1. without feature selection; 2. with feature selection (top 18); 3. with feature selection (top 10). Which classifier is best?

16 Result 1: without feature selection

17 Explanations Sensitivity (true positive rate) is the number of fraudulent companies that a model correctly predicts as fraudulent, divided by the total number of actual fraudulent companies. Specificity (true negative rate) is the number of non-fraudulent companies that a model correctly predicts as non-fraudulent, divided by the total number of actual non-fraudulent companies.
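Both measures follow directly from the counts of correct and incorrect predictions. The sketch below encodes fraudulent as 1 and non-fraudulent as 0, which is an assumed labeling, and uses hypothetical predictions.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels, 1 = fraudulent."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # fraud caught
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # clean, predicted clean
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # clean, flagged as fraud
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical example: 4 fraudulent and 4 non-fraudulent companies.
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(sens, spec)  # → 0.75 0.5
```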

18 Result 2: Top 18 by feature selection Results based on AUC indicated that GP yielded the highest accuracy, followed by PNN, which was marginally less accurate. We infer that the selected feature subset has high discriminatory power and that the left-over features contribute very little to the success of financial fraud detection.

19 Result 3: Top 10 by feature selection

20 Top performer per setting: without feature selection, PNN; with feature selection (top 18), GP; with feature selection (top 10), PNN. T-test against the remaining classifiers: are the differences significant?

21 Result 4: PNN vs. other classifiers (without feature selection) To find out whether the difference in average AUCs is statistically significant, we conducted a t-test between the top performer and each of the remaining classifiers. [Table: t-values compared with the critical value 1.73; five comparisons are checked (✓).] PNN is the best of all.
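A comparison of this kind can be sketched as a two-sample t-test on fold-level AUCs. The slide does not say which t-test variant was used; the pooled-variance form below is an assumption. The critical value 1.73 is consistent with a one-tailed 5% test at 18 degrees of freedom, which would match two sets of ten fold-level AUCs, but that reading is an inference. The AUC values are hypothetical.

```python
import math
from statistics import mean, variance

CRITICAL_VALUE = 1.73  # taken from the slide

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance (assumed variant)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical AUCs from ten folds for the top performer and one other classifier.
top_aucs   = [0.90, 0.92, 0.88, 0.91, 0.93, 0.89, 0.90, 0.92, 0.91, 0.90]
other_aucs = [0.80, 0.84, 0.78, 0.83, 0.82, 0.79, 0.81, 0.85, 0.80, 0.82]

t_value = pooled_t(top_aucs, other_aucs)
significant = t_value > CRITICAL_VALUE
print(round(t_value, 2), significant)
```

Repeating the comparison against each remaining classifier gives one t-value per row of the table.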

22 Result 5: GP vs. other classifiers (top 18 by feature selection) [Table: t-test results; three comparisons are checked (✓).] PNN is almost as good as GP.

23 Result 6: PNN vs. other classifiers (top 10 by feature selection) [Table: t-test results; three comparisons are checked (✓).] PNN, GP, and GMDH are good.

24 Conclusion and future research directions The results obtained in this study are better than those obtained in an earlier study on the same dataset. Results based on AUC indicated that PNN was the top performer, followed by GP, which yielded marginally lower accuracy in most cases.

25 The End

