Published by Bruce Carroll. Modified over 6 years ago.
1
International Conference on Mathematical Modelling and Computational Methods in Science and Engineering, Alagappa University, Karaikudi, India February 20-22, 2017
2
Detection of Outliers to enhance the performance of Decision tree algorithm
Vaishali R, MTech; Dr. P. Ilango, Professor. School of Computing Science and Engineering, VIT University. CT ENGG 029
3
Abstract
An efficient knowledge discovery process requires high-quality datasets. In practice, a dataset may suffer from noise such as missing data or wrongly entered data. Wrongly entered values may appear as outliers, and outliers affect the detection of patterns in the dataset. The Pima Indians Diabetes dataset is observed to contain outliers in attributes such as number of pregnancies, blood pressure, and insulin level, which may affect the accuracy of prediction. The proposed work focuses on detecting outliers in this dataset. Once detected, the outliers may be removed so that patterns are generated effectively, or they may be handled according to their criticality when used with the decision tree induction algorithm.
Keywords: data mining, knowledge discovery, outliers, Pima Indians Diabetes dataset, decision tree induction
4
Introduction
“An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.” - Frank E. Grubbs
5
Causes of Outliers in a dataset
Apparatus malfunction. Fraudulent behaviour. Human error. Natural deviations. Contamination.
6
Applications of Outlier Detection
Fraud detection. Medicine. Public health. Sports statistics. Detecting measurement errors.
7
Literature Survey
In [1], outliers are handled with distance-based and density-based approaches, and the performance of the Naïve Bayes and SVM classifiers is tested on the resulting dataset. The SVM classifier produces the highest accuracy after the diabetes data is handled with the density-based approach.
In [2], missing data in medical datasets are handled with three imputation methods, and the efficiency of simple K-means clustering is tested on the resulting datasets.
[3] describes the LOF (Local Outlier Factor) algorithm, which a filter applies to compute an "outlier" score for each instance in the data. The filter can use multiple cores/CPUs to speed up the LOF computation for large datasets, and its nearest neighbor search methods and distance functions are pluggable.
[4] describes feature bagging for outlier detection, which a filter applies to compute an "outlier" score for each instance in the data. The filter appends to the input data a new attribute called "BaggedLOF".
8
Problem Statement
The Pima Indians Diabetes dataset, obtained from the UCI Machine Learning Repository, contains outliers. The dataset description claims that there are no missing values, but certain attributes, such as age and blood pressure, are observed to have instance values of zero, which is not feasible in practice. These wrongly entered values add to the outliers in the dataset and degrade its quality for the knowledge discovery process.
The dataset also contains a pregnancy attribute for which the metadata provided is not sufficient to make decisions. It is unclear whether the number of times pregnant denotes only completed pregnancies or also includes aborted pregnancies. The highest value of this attribute is 17, which can be considered an outlier or invalid data.
9
Methods to Handle outliers
Interquartile Range (box plot). Local Outlier Factor (LOF). Isolation Forest. Angle-Based Outlier Detection. Imputation.
10
Outliers Detection Method – IQR
Interquartile Range: IQR = Q3 - Q1
Lower Inner Fence: Q1 - 1.5*IQR
Upper Inner Fence: Q3 + 1.5*IQR
Lower Outer Fence: Q1 - 3*IQR
Upper Outer Fence: Q3 + 3*IQR
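The fence rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the conventional multipliers of 1.5 for the inner fences and 3 for the outer fences, and the function names `iqr_fences` and `flag_outliers` are hypothetical. The quartile method ("inclusive") is also an assumption; other quartile conventions give slightly different fences.

```python
import statistics

def iqr_fences(values):
    """Return Q1, Q3, the IQR, and the inner/outer fences for a numeric column."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3.0 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

def flag_outliers(values):
    """Split values into mild outliers (between the inner and outer fences)
    and extreme outliers (beyond the outer fences)."""
    f = iqr_fences(values)
    extreme = [v for v in values
               if v < f["lower_outer"] or v > f["upper_outer"]]
    mild = [v for v in values
            if v not in extreme
            and (v < f["lower_inner"] or v > f["upper_inner"])]
    return mild, extreme

# Made-up pregnancy counts for illustration only (not taken from the dataset).
pregnancies = [0, 1, 1, 2, 2, 3, 3, 4, 5, 17]
mild, extreme = flag_outliers(pregnancies)
```

Values beyond the inner fences are usually treated as mild outliers and values beyond the outer fences as extreme outliers, which is the distinction a box plot draws.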
11
Outliers Detection Method - LOF
This algorithm computes an "outlier" score for each instance in the data. It can use multiple cores/CPUs to speed up the LOF computation for large datasets. Nearest neighbor search methods and distance functions are pluggable.
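To make the LOF score concrete, the following is a small pure-Python sketch of the computation described in [3]. It is not the filter used in the slides: it is single-threaded, uses brute-force Euclidean distances, and assumes there are no duplicate points. The function name `lof_scores` and the default `k=3` are illustrative choices.

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor for each point (brute force, Euclidean distance).

    Assumes len(points) > k and no duplicate points.
    Scores near 1 indicate inliers; scores well above 1 indicate outliers.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []     # distance from each point to its k-th nearest neighbor
    neighbors = []  # indices within the k-distance (ties included)
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]
        k_dist.append(kd)
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# A 3x3 grid of inliers plus one far-away point (made-up data).
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(10.0, 10.0)], k=3)
```

Because the score compares a point's density with that of its neighbors, a point in a sparse region next to a dense cluster receives a score well above 1.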
12
Outliers Detection Method – Bagged LOF
The Bagged LOF algorithm computes an "outlier" score for each instance in the data. The filter appends to the input data a new attribute called "BaggedLOF". The numBags parameter sets the number of bags to use (the number of runs of LOF, each using a different random subset of attributes). The output scores of Bagged LOF are clustered using the simple K-Means algorithm.
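Feature bagging can be sketched as repeated LOF runs on random attribute subsets whose scores are then combined. The sketch below makes several assumptions that the slides do not state: the subset size is drawn uniformly between d/2 and d-1 attributes (following the scheme in [4]), the per-bag scores are combined by averaging, and the names `bagged_lof` and `num_bags` are illustrative. The K-Means step on the resulting scores is not shown.

```python
import math
import random

def lof_scores(points, k):
    """Plain LOF (brute force); assumes no duplicate points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]
        k_dist.append(kd)
        neighbors.append([j for j in order if dist[i][j] <= kd])
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

def bagged_lof(rows, num_bags=10, k=3, seed=0):
    """Average LOF scores over num_bags random attribute subsets."""
    rng = random.Random(seed)
    d = len(rows[0])
    totals = [0.0] * len(rows)
    for _ in range(num_bags):
        # Assumption: subset size is uniform between d // 2 and d - 1.
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        features = rng.sample(range(d), size)
        projected = [tuple(r[f] for f in features) for r in rows]
        for i, score in enumerate(lof_scores(projected, k)):
            totals[i] += score
    return [t / num_bags for t in totals]

# Made-up data: ten inliers with distinct values in every attribute,
# plus one row that is far away in all attributes.
rows = [(float(i), float((3 * i) % 10), float((7 * i) % 10))
        for i in range(10)]
rows.append((50.0, 50.0, 50.0))
scores = bagged_lof(rows, num_bags=10, k=3)
```

Running LOF on different attribute subsets makes the combined score less sensitive to attributes that are noisy or irrelevant for a particular instance.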
13
Outliers Detection Method – Isolation forest
Implements the isolation forest method for anomaly detection. Note that this classifier is designed for anomaly detection; it is not designed for solving two-class or multi-class classification problems. The data is expected to have a class attribute with one or two values, which is ignored at training time. The distributionForInstance() method returns (1 - anomaly score) as the first element in the distribution; the second element (in the case of two classes) is the anomaly score.
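The idea behind the anomaly score can be illustrated with a compact pure-Python isolation forest along the lines of [5]. This is a sketch under stated assumptions, not the classifier referred to above: there is no class attribute, the function names are hypothetical, and the defaults of 100 trees and a subsample of 256 rows are illustrative. It needs at least two rows.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random attribute at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(row, node, depth=0):
    """Depth at which row lands, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    branch = "left" if row[node["feature"]] < node["split"] else "right"
    return path_length(row, node[branch], depth + 1)

def anomaly_scores(rows, num_trees=100, sample_size=256, seed=0):
    """Scores in (0, 1); values closer to 1 indicate likely anomalies."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(num_trees)]
    norm = average_path_length(sample_size)
    return [
        2.0 ** (-sum(path_length(r, t) for t in trees) / (num_trees * norm))
        for r in rows
    ]

# Made-up data: 30 points in the unit square plus one distant point.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(30)]
rows.append((10.0, 10.0))
scores = anomaly_scores(rows)
```

Anomalies tend to be separated from the rest of the data after only a few random splits, so a short average path length translates into a high score.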
14
Missing Value Imputation
The original dataset contains instance values of 0 for certain attributes. These wrongly entered values are outliers. We treat them as missing values by making the instance values empty, and the resulting dataset is then imputed with two different statistical methods.
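The zero-to-missing step and a subsequent imputation can be sketched as follows. The slides do not name the two statistical methods, so mean and median imputation are used here purely as an assumption; the function names and the sample values are also made up for illustration.

```python
import statistics

def zeros_to_missing(column):
    """Replace infeasible zero readings with None (treated as missing)."""
    return [None if v == 0 else v for v in column]

def impute(column, strategy="mean"):
    """Fill missing entries with the mean or median of the observed values.

    Assumption: mean and median stand in for the two unnamed
    statistical methods mentioned in the slides.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

# Made-up blood pressure readings; 0 marks a wrongly entered value.
blood_pressure = [72, 0, 64, 0, 80, 110]
with_missing = zeros_to_missing(blood_pressure)
mean_filled = impute(with_missing, "mean")
median_filled = impute(with_missing, "median")
```

This treatment only makes sense for attributes where zero is infeasible. An attribute such as number of pregnancies can legitimately be zero and should be left unchanged.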
15
Decision Tree algorithm
C4.5 decision tree (J48): a Java class for generating a pruned or unpruned C4.5 decision tree. C4.5 was proposed by Ross Quinlan in "C4.5: Programs for Machine Learning" (Morgan Kaufmann Publishers, San Mateo, CA, 1993).
Logistic decision trees: a classifier for building "functional trees", which are classification trees that can have logistic regression functions at the inner nodes and/or leaves. The algorithm can deal with binary and multi-class target variables, numeric and nominal attributes, and missing values. Functional trees were described by Joao Gama (2004); logistic model trees were proposed by Niels Landwehr, Mark Hall, and Eibe Frank (2005).
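C4.5 chooses its splits using the gain ratio, which is the information gain normalized by the entropy of the split itself. The slides do not show this computation, so the following is only an illustrative sketch for a binary split on a numeric attribute; the function names and the toy data are hypothetical, and pruning and missing-value handling are not covered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    n = len(labels)
    weights = (len(left) / n, len(right) / n)
    # Information gain: parent entropy minus weighted child entropy.
    gain = (entropy(labels)
            - weights[0] * entropy(left)
            - weights[1] * entropy(right))
    # Split information penalizes very unbalanced partitions.
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info

# Toy data: a threshold of 2.5 separates the two classes perfectly.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["neg", "neg", "pos", "pos"]
ratio = gain_ratio(values, labels, threshold=2.5)
```

Because the split threshold is chosen from the observed attribute values, extreme or wrongly entered values can shift the candidate thresholds, which is one way outliers may influence the induced tree.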
16
Results
17
Conclusion
J48 obtains a maximum accuracy of 75%.
The metadata for the pregnancy attribute in the diabetes dataset is not sufficient. Outliers affect the performance of a classifier.
18
References
[1] V. Mahalakshmi, M. Govindarajan, "Comparison of Outlier Detection Methods in Diabetes Data", International Journal of Computer Applications (0975 – 8887), Volume 155, No. 10, December 2016.
[2] T. Santhanam, M. S. Padmavathi, "Comparison of K-Means Clustering and Statistical Outliers in Reducing Medical Datasets", IEEE, ICSEMR 2014.
[3] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jorg Sander (2000). LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Record, 29(2).
[4] Aleksandar Lazarevic, Vipin Kumar: Feature Bagging for Outlier Detection. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York, NY, USA.
[5] Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou: Isolation Forest. In: ICDM, 2008.