International Conference on Mathematical Modelling and Computational Methods in Science and Engineering, Alagappa University, Karaikudi, India, February 20-22, 2017

Detection of Outliers to Enhance the Performance of the Decision Tree Algorithm
Vaishali R, MTech; Dr. P. Ilango, Professor
School of Computing Science and Engineering, VIT University
CT ENGG 029

ABSTRACT
An efficient knowledge discovery process requires quality datasets. A dataset may suffer from noise such as missing data and wrongly entered data, and the wrongly entered values may be outliers. Outliers certainly affect the detection of patterns in the dataset. It is observed that the Pima Indians Diabetes dataset contains outliers in attributes such as number of pregnancies, blood pressure, and insulin level, which may affect the accuracy of prediction. The proposed work focuses on the detection of outliers in the dataset. After detection, the outliers may be removed so that patterns can be generated effectively, or they may be handled according to their criticality with a decision tree induction algorithm.
Keywords: data mining, knowledge discovery, outliers, Pima Indians Diabetes dataset, decision tree induction

Introduction
“An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.” - Frank E. Grubbs

Causes of Outliers in a Dataset
- Apparatus malfunction
- Fraudulent behaviour
- Human error
- Natural deviations
- Contamination

Applications of Outlier Detection
- Fraud detection
- Medicine
- Public health
- Sports statistics
- Detecting measurement errors

Literature Survey
In [1], outliers are handled with various approaches based on distance and density, and the performance of the Naïve Bayes algorithm and SVM is tested on the resulting dataset; the highest accuracy is produced by the SVM classifier after handling the diabetes data with the density-based approach. In [2], missing data in medical datasets are handled with three imputation methods, and the efficiency of simple K-means clustering is tested on the obtained datasets. [3] proposes a filter that applies the LOF (Local Outlier Factor) algorithm to compute an "outlier" score for each instance in the data; it can use multiple cores/CPUs to speed up the LOF computation for large datasets, and the nearest-neighbour search methods and distance functions are pluggable. [4] proposes a feature bagging filter that detects outliers by computing an "outlier" score for each instance and appending a new attribute called "BaggedLOF" to the input data.

Problem Statement
The Pima Indians Diabetes dataset obtained from the UCI Machine Learning repository contains outliers. The dataset claims to have no missing values, yet certain attributes such as age and blood pressure contain instance values of zero, which is practically not feasible. These wrongly entered instance values add to the outliers of the dataset and degrade its quality for the knowledge discovery process. The dataset also contains a pregnancy attribute for which the metadata provided is not sufficient to make decisions: it is unclear whether the number of times pregnant denotes only completed pregnancies or also includes aborted pregnancies. The highest value of the number of times pregnant is 17, which can be considered an outlier or invalid data.
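To make the zero-value problem concrete, here is a minimal pandas sketch that counts zeros per attribute. The file name and column labels are assumptions (the UCI file ships without a header row):

```python
import pandas as pd

# Hypothetical local copy of the UCI Pima Indians Diabetes file (no header row).
cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "class"]
df = pd.read_csv("pima-indians-diabetes.csv", names=cols)

# A zero in most of the clinical attributes is physiologically impossible,
# so each such zero is effectively a disguised missing value or outlier.
print((df == 0).sum())
```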

Methods to Handle Outliers
- Interquartile Range (IQR) - box plot
- LOF - Local Outlier Factor
- Isolation Forest
- Angle-Based Outlier Detection
- Imputation

Outlier Detection Method - IQR
Interquartile Range: IQR = Q3 - Q1
Lower Inner Fence: Q1 - 1.5*IQR
Upper Inner Fence: Q3 + 1.5*IQR
Lower Outer Fence: Q1 - 3*IQR
Upper Outer Fence: Q3 + 3*IQR
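As an illustration of these fences, a minimal NumPy sketch; the blood-pressure readings below are made up for illustration only:

```python
import numpy as np

def iqr_fences(values):
    """Tukey's inner and outer fences from the interquartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return {"lower_inner": q1 - 1.5 * iqr, "upper_inner": q3 + 1.5 * iqr,
            "lower_outer": q1 - 3.0 * iqr, "upper_outer": q3 + 3.0 * iqr}

# Made-up blood-pressure readings, including suspicious zeros.
bp = np.array([72, 66, 64, 0, 40, 74, 50, 0, 70, 122])
f = iqr_fences(bp)
outliers = bp[(bp < f["lower_inner"]) | (bp > f["upper_inner"])]
print(f, outliers)
```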

Outlier Detection Method - LOF
This algorithm computes an "outlier" score for each instance in the data. It can use multiple cores/CPUs to speed up the LOF computation for large datasets, and the nearest-neighbour search methods and distance functions are pluggable.
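The description above matches Weka's LOF filter; as a rough scikit-learn analogue (the file path is an assumption carried over from the earlier sketch):

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical local copy of the dataset, as in the earlier sketch.
df = pd.read_csv("pima-indians-diabetes.csv", header=None)
X = df.iloc[:, :8].values   # the eight predictor attributes

lof = LocalOutlierFactor(n_neighbors=20, n_jobs=-1)  # n_jobs=-1 uses all cores
labels = lof.fit_predict(X)              # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_   # raw LOF score (higher = more outlying)
print((labels == -1).sum(), "instances flagged as outliers")
```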

Outlier Detection Method - Bagged LOF
The bagged LOF algorithm is also used to compute an "outlier" score for each instance in the data; the filter appends a new attribute called "BaggedLOF" to the input data. The numBags parameter gives the number of bags to use (the number of runs of LOF, each using a different random subset of attributes). The output scores of the bagged LOF are then clustered using the simple K-means algorithm.
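A minimal sketch of the feature-bagging idea, assuming the same hypothetical file: LOF is run on random attribute subsets (roughly half to all of the attributes per bag, following [4]), the scores are averaged, and K-means splits the scores into two groups as described on the slide:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv("pima-indians-diabetes.csv", header=None)  # hypothetical path
X = df.iloc[:, :8].values
rng = np.random.default_rng(42)

def bagged_lof_scores(X, num_bags=10, k=20):
    """Average LOF scores over random attribute subsets (feature bagging)."""
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(num_bags):
        size = int(rng.integers(d // 2, d)) + 1           # subset of d/2+1 .. d attributes
        subset = rng.choice(d, size=size, replace=False)
        lof = LocalOutlierFactor(n_neighbors=k).fit(X[:, subset])
        scores += -lof.negative_outlier_factor_
    return scores / num_bags

scores = bagged_lof_scores(X)
# Cluster the scores into 'normal' and 'outlier' groups, as on the slide.
groups = KMeans(n_clusters=2, n_init=10).fit_predict(scores.reshape(-1, 1))
```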

Outlier Detection Method - Isolation Forest
This implements the isolation forest method for anomaly detection. Note that this classifier is designed for anomaly detection, not for solving two-class or multi-class classification problems. The data is expected to have a class attribute with one or two values, which is ignored at training time. The distributionForInstance() method returns (1 - anomaly score) as the first element in the distribution; the second element (in the case of two classes) is the anomaly score.
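The distributionForInstance() convention above is Weka-specific; a hedged scikit-learn equivalent, where score_samples() plays a similar role:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("pima-indians-diabetes.csv", header=None)  # hypothetical path
X = df.iloc[:, :8].values

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
iso.fit(X)
labels = iso.predict(X)        # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)  # higher = more normal; Weka reports 1 - anomaly score
print((labels == -1).sum(), "instances flagged as anomalies")
```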

Missing Value Imputation
The original dataset contains instance values of 0 for certain attributes. These wrongly entered values are outliers; we treat them as missing values and set the affected instance values to empty. The dataset with missing values is then imputed with two different statistical methods.
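A minimal pandas sketch of this step. The slides do not name the two statistical methods, so mean and median substitution are used here purely as representative assumptions:

```python
import numpy as np
import pandas as pd

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "class"]
df = pd.read_csv("pima-indians-diabetes.csv", names=cols)  # hypothetical path

# Treat impossible zeros as missing values, as described on the slide.
suspect = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]
df[suspect] = df[suspect].replace(0, np.nan)

# Two simple statistical imputations (mean and median substitution; an assumption,
# since the slides do not specify which two methods were used).
df_mean = df.fillna(df.mean(numeric_only=True))
df_median = df.fillna(df.median(numeric_only=True))
```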

Decision Tree Algorithms
C4.5 Decision Tree: a Java class for generating a pruned or unpruned C4.5 decision tree, proposed by Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.
Logistic Decision Trees: a classifier for building 'functional trees', which are classification trees that can have logistic regression functions at the inner nodes and/or leaves. The algorithm can deal with binary and multi-class target variables, numeric and nominal attributes, and missing values. Joao Gama (2004), Functional Trees; Niels Landwehr, Mark Hall, Eibe Frank (2005), Logistic Model Trees.
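For illustration, a hedged scikit-learn sketch of training and evaluating a tree on the median-imputed data. Note that scikit-learn grows CART trees, not C4.5, so this is only a rough stand-in for Weka's J48; entropy splitting approximates C4.5's information-gain criterion:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "class"]
df = pd.read_csv("pima-indians-diabetes.csv", names=cols)  # hypothetical path

# Median-impute the impossible zeros first, as in the previous sketch.
suspect = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]
df[suspect] = df[suspect].replace(0, np.nan)
df = df.fillna(df.median(numeric_only=True))

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5, random_state=42)
acc = cross_val_score(tree, df[cols[:-1]], df["class"], cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {acc.mean():.3f}")
```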

Results

Conclusion
- J48 obtains a maximum accuracy of 75%.
- The metadata for the pregnancy attribute in the diabetes dataset is not sufficient.
- Outliers affect the performance of a classifier.

References
[1] V. Mahalakshmi, M. Govindarajan, "Comparison of Outlier Detection Methods in Diabetes Data", International Journal of Computer Applications (0975-8887), Volume 155, No. 10, December 2016.
[2] T. Santhanam, M. S. Padmavathi, "Comparison of K-Means Clustering and Statistical Outliers in Reducing Medical Datasets", IEEE, ICSEMR 2014.
[3] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander (2000), "LOF: Identifying Density-Based Local Outliers", ACM SIGMOD Record, 29(2):93-104.
[4] Aleksandar Lazarevic, Vipin Kumar, "Feature Bagging for Outlier Detection", in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York, NY, USA, 157-166, 2005.
[5] Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou, "Isolation Forest", in ICDM, 413-422, 2008.