Dealing with Noisy Data

1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Dealing with Noisy Data 1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Introduction The presence of noise in data is a common problem that produces several negative consequences in classification problems. The performance of the models built under such circumstances will heavily depend on the quality of the training data. Problems containing noise are complex, and accurate solutions are often difficult to achieve without using specialized techniques.

Dealing with Noisy Data 1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Identifying Noise The presence of noise in the data may affect the intrinsic characteristics of a classification problem: – Noise may create small clusters of instances of a particular class in parts of the instance space corresponding to another class, – remove instances located in key areas within a particular class, or – disrupt the boundaries of the classes and increase the overlapping among them.

Identifying Noise Noise is especially relevant in supervised problems, where it alters the relationship between the informative features and the measured output. – For this reason, noise has been especially studied in classification and regression. Robustness is the capability of an algorithm to build models that are insensitive to data corruptions and suffer less from the impact of noise. – The more robust an algorithm is, the more similar the models built from clean and noisy data are.

Identifying Noise Robustness is considered more important than performance results when dealing with noisy data – it allows one to know a priori the expected behavior of a learning method against noise in cases where the characteristics of noise are unknown

Identifying Noise Several approaches have been studied in the literature to deal with noisy data: – Robust learners: techniques characterized by being less influenced by noisy data. An example of a robust learner is the C4.5 algorithm. – Data polishing methods: their aim is to correct noisy instances prior to training a learner. – Noise filters: they identify noisy instances which can be eliminated from the training data.

Identifying Noise Noise is not the only problem that supervised ML techniques have to deal with. Complex and nonlinear boundaries between classes may also hinder the performance of classifiers. – It is often hard to distinguish between such overlapping and the presence of noisy examples. Relevant issues related to the degradation of performance: – Presence of small disjuncts – Overlapping between classes

Identifying Noise

Another interesting problem that has been pointed out is the higher or lower presence of examples located in the area surrounding class boundaries, which are called borderline examples. Misclassification often occurs near class boundaries, where overlapping usually occurs. Classifier performance degradation was strongly affected by – the quantity of borderline examples and – the presence of other noisy examples located farther outside the overlapping region.

Identifying Noise Safe examples are placed in relatively homogeneous areas with respect to the class label. Borderline examples are located in the area surrounding class boundaries, where the minority and majority classes overlap or where the examples lie very close to a complex boundary shape. Noisy examples are individuals from one class occurring in the safe areas of the other class.


Dealing with Noisy Data 1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Types of Noise A large number of components determine the quality of a data set. Among them, the class labels and the attribute values directly influence this quality. – The quality of the class labels refers to whether the class of each example is correctly assigned. – The quality of the attributes refers to their capability of properly characterizing the examples for classification purposes.

Types of Noise Based on these two information sources, two types of noise can be distinguished: – Class noise (also referred to as label noise) occurs when an example is incorrectly labeled. Two types of class noise can be distinguished: contradictory examples and misclassifications. – Attribute noise refers to corruptions in the values of one or more attributes: erroneous attribute values, missing or unknown attribute values, and incomplete attributes or “do not care” values.

Types of Noise Class noise is generally more harmful than attribute noise. Eliminating or correcting examples in data sets with class and attribute noise, respectively, may improve classifier performance. Attribute noise is more harmful in those attributes highly correlated with the class labels. Most of the works found in the literature focus only on class noise.

Noise Introduction Mechanisms Traditionally, the label noise introduction mechanism has not attracted much attention. However, the nature of the noise is becoming more and more important. Authors distinguish between three possible statistical models for label noise:

Noise Introduction Mechanisms a) In the simplest case, in which the noise procedure does not depend on either the true value of the class Y or the input attribute values X, the label noise is called noisy completely at random (NCAR).

Noise Introduction Mechanisms b) Things get more complicated in the noisy at random (NAR) model. Although the noise is independent of the inputs X, the true value of the class makes an example more or less prone to be noisy. This asymmetric labeling error can be produced by the different cost of extracting the true class, as for example in medical case-control studies, financial scoring and so on.

Noise Introduction Mechanisms c) The third and last noise model is noisy not at random (NNAR), where the input attributes somehow affect the probability of the class label being erroneous. The NNAR model is the most general case of class noise.

Noise Introduction Mechanisms In the case of attribute noise, the modeling described above can be extended and adapted: – When the noise appearance depends neither on the rest of the input features’ values nor on the class label, the NCAR noise model applies. – When the attribute noise depends on the true value of the attribute but not on the rest of the input values or the observed class label y, the NAR model is applicable. – In the last case (NNAR), the noise probability depends not only on the value of the feature itself but also on the rest of the input feature values.
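As a rough illustration of the three label-noise mechanisms above, the sketch below corrupts binary labels under NCAR, NAR and NNAR assumptions (Python/NumPy). The flip probabilities and the use of the first feature in the NNAR case are illustrative choices, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def ncar_noise(y, p=0.1):
    """NCAR: every label is flipped with the same probability p,
    independently of the true class and of the inputs."""
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

def nar_noise(y, p_per_class=(0.05, 0.20)):
    """NAR: the flip probability depends on the true class only
    (asymmetric labeling errors)."""
    p = np.array([p_per_class[c] for c in y])
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

def nnar_noise(X, y, base_p=0.05):
    """NNAR: the flip probability also depends on the inputs; here instances
    with a larger first feature are (arbitrarily) more likely to be mislabeled."""
    p = base_p + 0.3 * (X[:, 0] > np.median(X[:, 0]))
    flip = rng.random(len(y)) < p
    return np.where(flip, 1 - y, y)

# toy usage on a synthetic binary problem
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y_ncar, y_nar, y_nnar = ncar_noise(y), nar_noise(y), nnar_noise(X, y)
```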

Simulating the Noise of Real-World Data Sets Checking the effect of noisy data on the performance of classifier learning algorithms is necessary to improve their reliability, and it has motivated the study of how to generate and introduce noise into the data. Noise generation can be characterized by three main characteristics: – The place where the noise is introduced – The noise distribution – The magnitude of the generated noise values

Simulating the Noise of Real-World Data Sets The two types of noise considered, class and attribute noise, have been modeled using four different noise schemes: – Class noise is introduced using a uniform class noise scheme (randomly corrupting the class labels of the examples) and a pairwise class noise scheme (relabeling examples of the majority class with the second majority class). – Attribute noise is introduced using a uniform attribute noise scheme, where erroneous attribute values can be totally unpredictable, and a Gaussian attribute noise scheme, where the noise is a small variation with respect to the correct value.
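A minimal sketch of these four schemes, assuming numeric attributes and integer class labels; the noise level, the uniform ranges and the Gaussian spread are illustrative parameters, not values prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_class_noise(y, level=0.1):
    """Relabel a fraction `level` of the examples with a class chosen
    at random among the remaining classes."""
    y = y.copy()
    classes = np.unique(y)
    idx = rng.choice(len(y), size=int(level * len(y)), replace=False)
    for i in idx:
        y[i] = rng.choice(classes[classes != y[i]])
    return y

def pairwise_class_noise(y, level=0.1):
    """Relabel a fraction `level` of the majority-class examples
    with the second majority class."""
    y = y.copy()
    classes, counts = np.unique(y, return_counts=True)
    order = np.argsort(counts)
    majority, second = classes[order[-1]], classes[order[-2]]
    maj_idx = np.flatnonzero(y == majority)
    idx = rng.choice(maj_idx, size=int(level * len(maj_idx)), replace=False)
    y[idx] = second
    return y

def uniform_attribute_noise(X, level=0.1):
    """Replace a fraction `level` of each attribute's values by a value
    drawn uniformly from that attribute's [min, max] range."""
    X = X.astype(float)
    for j in range(X.shape[1]):
        idx = rng.choice(len(X), size=int(level * len(X)), replace=False)
        X[idx, j] = rng.uniform(X[:, j].min(), X[:, j].max(), size=len(idx))
    return X

def gaussian_attribute_noise(X, level=0.1, rel_std=0.05):
    """Perturb a fraction `level` of each attribute's values with a small
    Gaussian deviation relative to that attribute's spread."""
    X = X.astype(float)
    for j in range(X.shape[1]):
        idx = rng.choice(len(X), size=int(level * len(X)), replace=False)
        X[idx, j] += rng.normal(0.0, rel_std * X[:, j].std(), size=len(idx))
    return X
```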

Simulating the Noise of Real-World Data Sets Robustness: compare the performance of the classifiers learned with the original data set (without induced noise) with the performance of the classifiers learned using the noisy data set. Those classifiers learned from noisy data sets that are more similar (in terms of results) to the noise-free classifiers are the most robust ones.
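As a concrete (hypothetical) way of making this comparison, the snippet below trains the same decision tree on clean and on noise-injected training labels and reports the relative loss of accuracy; the data set, the learner and the loss measure are illustrative assumptions, not the slides' exact protocol.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# inject 20% uniform class noise into the training labels only
y_noisy = y_tr.copy()
idx = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_noisy[idx] = rng.integers(0, 3, size=len(idx))  # may occasionally keep the same label

acc_clean = accuracy_score(
    y_te, DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te))
acc_noisy = accuracy_score(
    y_te, DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy).predict(X_te))

# one possible robustness indicator: the smaller the relative loss, the more robust
rla = (acc_clean - acc_noisy) / acc_clean
print(f"clean={acc_clean:.3f}  noisy={acc_noisy:.3f}  relative loss={rla:.3f}")
```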

Dealing with Noisy Data 1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Noise Filtering at Data Level Noise filters are preprocessing mechanisms to detect and eliminate noisy instances in the training set. The result is a reduced training set which is then used as input to a classification algorithm. The separation of noise detection and learning has the advantage that noisy instances do not influence the design of the classifier.

Noise Filtering at Data Level Noise filters are generally oriented to detect and eliminate instances with class noise from the training data. The elimination of such instances has been shown to be advantageous. – The elimination of instances with attribute noise seems counterproductive: instances with attribute noise still contain valuable information in other attributes which can help to build the classifier. It is also hard to distinguish between noisy examples and true exceptions.

Noise Filtering at Data Level We will consider three noise filters designed to deal with mislabeled instances, as these are the most common noise type and the filters are among the most recent proposals. It should be noted that these three methods are ensemble-based and vote-based filters: collecting information from different models provides a better way of detecting mislabeled instances than relying on a single model.

Ensemble Filter The Ensemble Filter (EF) is a well-known filter in the literature. It uses a set of learning algorithms to create classifiers on several subsets of the training data, which then serve as noise filters for the training set. The authors originally proposed the use of C4.5, 1-NN and LDA.

Ensemble Filter The complete process carried out by EF is described below:
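The original slide listed these steps as a figure. As a hedged reconstruction of the usual EF procedure (train each learner via cross-validation, flag the instances misclassified by the committee, then drop them), the sketch below uses scikit-learn learners as stand-ins (CART instead of C4.5); the fold count and the voting threshold are configurable choices, not fixed by the slide.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def ensemble_filter(X, y, n_splits=4, scheme="majority"):
    """Flag (and drop) instances misclassified by a committee of learners.
    Each learner is trained on the other folds and predicts the held-out fold."""
    learners = [DecisionTreeClassifier(random_state=0),
                KNeighborsClassifier(n_neighbors=1),
                LinearDiscriminantAnalysis()]
    errors = np.zeros((len(y), len(learners)), dtype=bool)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        for j, clf in enumerate(learners):
            pred = clf.fit(X[train_idx], y[train_idx]).predict(X[test_idx])
            errors[test_idx, j] = pred != y[test_idx]
    votes = errors.sum(axis=1)
    if scheme == "consensus":          # every learner must disagree with the label
        noisy = votes == len(learners)
    else:                              # majority voting
        noisy = votes > len(learners) / 2
    keep = ~noisy
    return X[keep], y[keep], np.flatnonzero(noisy)
```

The filtered pair returned by ensemble_filter would then be used to train the final classifier.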

Cross-Validated Committees Filter The Cross-Validated Committees Filter (CVCF) is mainly based on performing a k-fold cross-validation (k-FCV) to split the full training data and on building classifiers using decision trees on each training subset. The authors of CVCF place special emphasis on using ensembles of decision trees such as C4.5.

Cross-Validated Committees Filter The basic steps of CVCF are the following:
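For contrast with EF, here is a sketch of the CVCF idea under the same assumptions: the committee members are decision trees built on the k-fold cross-validation training parts, and every member votes on the whole training set; the fold count and voting scheme are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cvcf_filter(X, y, n_splits=5, scheme="majority"):
    """Cross-Validated Committees Filter (sketch): each tree is built on one
    cross-validation training part and then votes on the *whole* training set."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    errors = np.zeros((len(y), n_splits), dtype=bool)
    for j, (train_idx, _) in enumerate(kf.split(X)):
        tree = DecisionTreeClassifier(random_state=j).fit(X[train_idx], y[train_idx])
        errors[:, j] = tree.predict(X) != y        # vote on every training instance
    votes = errors.sum(axis=1)
    noisy = votes == n_splits if scheme == "consensus" else votes > n_splits / 2
    keep = ~noisy
    return X[keep], y[keep]
```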

Iterative-Partitioning Filter The Iterative-Partitioning Filter (IPF) is a preprocessing technique based on the Partitioning Filter. IPF removes noisy instances in multiple iterations, until the number of noisy instances identified in each iteration is less than a percentage p of the size of the original training data set.

Iterative-Partitioning Filter The basic steps of each iteration are:
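The slide showed these steps as a figure; as a hedged sketch of the iterative loop, each tree below is built on one partition of the current data and votes on all current instances, and the loop stops once an iteration flags fewer than p times the original data size. The partition count, the stopping percentage p and the majority-vote scheme are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def ipf_filter(X, y, n_parts=5, p=0.01, max_iter=20):
    """Iterative-Partitioning Filter (sketch): repeat the partition/vote/remove
    step until an iteration identifies fewer than p * |original data| instances."""
    X_cur, y_cur = X.copy(), y.copy()
    stop_size = p * len(y)
    for _ in range(max_iter):
        kf = KFold(n_splits=n_parts, shuffle=True, random_state=0)
        errors = np.zeros((len(y_cur), n_parts), dtype=bool)
        # one tree per partition, each voting on the whole current training set
        for j, (_, part_idx) in enumerate(kf.split(X_cur)):
            tree = DecisionTreeClassifier(random_state=j).fit(X_cur[part_idx], y_cur[part_idx])
            errors[:, j] = tree.predict(X_cur) != y_cur
        noisy = errors.sum(axis=1) > n_parts / 2       # majority voting
        if noisy.sum() < stop_size:
            break
        X_cur, y_cur = X_cur[~noisy], y_cur[~noisy]
    return X_cur, y_cur
```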

More Filtering Methods Apart from the three aforementioned filtering methods, many more can be found in the specialized literature: – Detection based on thresholding of a measure: Measure, Least complex correct hypothesis – Classifier predictions based: Cost sensitive, SVM based, ANN based, Multi-classifier system, C4.5 – Voting filtering: Ensembles, Bagging, ORBoost, Edge analysis – Model influence: LOOPC, Single perceptron perturbation – Nearest neighbor based: CNN, BBNR, IB3, Tomek links, PRISM, DROP – Graph connections based: Gabriel graphs, Neighborhood graphs

Dealing with Noisy Data 1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Robust Learners Against Noise Filtering the data also has one major drawback: – some instances will be dropped from the data sets, even if they are valuable. We can use other approaches to diminish the effect of noise in the learned model. One powerful approach is to train not a single classifier but several ones.

Robust Learners Against Noise C4.5 has been considered a paradigmatic robust learner against noise. It is also true that classical decision trees have been considered sensitive to class noise as well. Solutions: – Carefully select an appropriate splitting criterion – Pruning This instability has made decision trees very suitable for ensemble methods.

Robust Learners Against Noise An ensemble is a system where the base learners are all of the same type, built to be as varied as possible. Many ensemble approaches exist, and their noise robustness has been tested. The two most classic approaches are bagging and boosting. Bagging obtains better performance than boosting when label noise is present. When the base classifiers are of different types, we talk of Multiple Classifier Systems (MCSs).

Robust Learners Against Noise We can separate the labeled instances into several “bags” or groups, each one containing only those instances belonging to the same class. This type of decomposition is well suited for classifiers that can only work with binary classification problems. – It has also been suggested that this decomposition can help to diminish the effects of noise. – The decomposition is expected to decrease the overlapping between the classes. – It limits the effect of noisy instances to their respective bags.
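A short sketch of the one-vs-one flavor of this idea (also referred to in the experimental summary that follows), using scikit-learn's OneVsOneClassifier around a decision tree; the data set and base learner are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each binary sub-problem ("bag" pair) only sees the instances of its two
# classes, so a mislabeled example can only disturb the sub-problems it enters.
ovo = OneVsOneClassifier(DecisionTreeClassifier(random_state=0))
baseline = DecisionTreeClassifier(random_state=0)

print("OVO      :", cross_val_score(ovo, X, y, cv=5).mean())
print("baseline :", cross_val_score(baseline, X, y, cv=5).mean())
```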

Dealing with Noisy Data 1.Introduction 2.Identifying Noise 3.Types of Noise 4.Noise Filtering 5.Robust Learners Against Noise 6.Experimental Comparative Analysis

Experimental Comparative Analysis Summary of some major studies: – Using filters diminishes the loss of accuracy as the noise level increases. – The complexity of the models obtained is also lower when using filters compared to no filtering at all. – The filtering efficacy depends on the characteristics of the data and can be anticipated. – Using MCSs improves the behavior of the classifier when compared to its individual components (the base classifiers). – Using a one-versus-one (OVO) decomposition improves the baseline classifiers in terms of accuracy when data is corrupted by noise, in all the noise schemes shown.