Mining with Noise Knowledge: Error Awareness Data Mining
Xindong Wu
Department of Computer Science, University of Vermont, USA; Hong Kong Polytechnic University; School of Computer and Information, Hefei University of Technology

Outline
1. Introduction
 Noise
 Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

Noise Is Everywhere
Random noise
 “a random error or variance in a measured variable” (Han & Kamber 2001)
 “any property of the sensed pattern which is not due to the true underlying model but instead to randomness in the world or the sensors” (Duda et al. 2000)
Structured noise
 Caused by systematic mechanisms, such as equipment failure or deceptive information

Noise Categories and Locations
Categorized by error type
 Erroneous values
 Missing values
Categorized by variable type (Zhu & Wu 2004)
 Independent variables: attribute noise
 Dependent variable: class noise

Existing Efforts (1): Learning with Random Noise
Data preprocessing techniques
 Identifying mislabeled examples (Brodley & Friedl 1999): noise filtering
 Erroneous attribute value detection (Teng 1999): attribute value prediction
 Missing attribute value acquisition (Zhu & Wu 2004, Zhu & Wu 2005): acquiring the most informative missing values
 Data imputation (Fellegi & Holt 1976): filling in missing values
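
The classification-filter idea behind Brodley & Friedl's noise filtering can be illustrated in a few lines. This is a minimal sketch, assuming scikit-learn and a stand-in dataset: an instance is flagged as suspicious when a cross-validated classifier disagrees with its recorded label. This is one common filtering criterion, not the paper's exact procedure.

```python
# Minimal classification-filter sketch in the spirit of Brodley & Friedl (1999):
# flag an instance as suspicious when a cross-validated classifier disagrees
# with its recorded label. Dataset and learner are illustrative stand-ins.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each instance is predicted by models that never saw it during training.
pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)

suspicious = np.flatnonzero(pred != y)  # candidate mislabeled examples
print(f"{suspicious.size} suspicious instances:", suspicious)
```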

Existing Efforts (2): Classifier Ensembling with Random Noise
 Bagging (Breiman 1996)
 Boosting (Freund & Schapire 1996)
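
As a concrete reference point, here is a minimal sketch of both methods using scikit-learn's implementations; the synthetic dataset, its 10% label-flip noise, and all parameters are illustrative assumptions, not the experimental setup reported later in this talk.

```python
# Bagging vs. boosting on a noisy synthetic task (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 injects roughly 10% random class noise into the labels.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)

# Bagging (Breiman 1996): base trees on bootstrap resamples, majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting (Freund & Schapire 1996): sequential learners reweight hard instances.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```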

Limitations
The design of current ensembling methods focuses only on making the base learners diverse
How can we learn from past noise-handling efforts to avoid future noise?

Outline
1. Introduction
 Noise
 Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

A System Framework for Noise-Tolerant Data Mining
 Error Identification and Instance Ranking
 Error Profiling and Reasoning
 Error-Tolerant Mining


Outline
1. Introduction
 Noise
 Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

Error Detection and Instance Ranking (AAAI-04)
Error detection
 Construct a suspicious instance subset
 Locate erroneous attribute values
Impact-sensitive ranking
 Rank suspicious instances based on the located erroneous attribute values and their impacts
[Flowchart: noisy dataset D → suspicious instance subset S → erroneous attribute detection → information-gain ratios → impact-sensitive weight for each attribute → overall impact value for each suspicious instance → impact-sensitive ranking and recommendation]
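
A hedged sketch of the ranking idea: weight each attribute by an information-gain-style score against the class, then score each suspicious instance by summing the weights of its flagged attribute values. The suspect set below is a hypothetical output of the error-detection step, not something computed by the AAAI-04 procedure itself.

```python
# Impact-sensitive ranking sketch: errors on attributes that carry more
# information about the class are more damaging, so instances whose flagged
# values fall on such attributes are ranked (and recommended) first.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Information-gain-style impact weight for each attribute.
weights = mutual_info_classif(X, y, random_state=0)

# Hypothetical output of error detection: instance -> flagged attribute indices.
flagged = {3: [0, 2], 17: [1], 42: [2, 3]}

impact = {i: weights[attrs].sum() for i, attrs in flagged.items()}
for i in sorted(impact, key=impact.get, reverse=True):
    print(f"instance {i}: overall impact {impact[i]:.3f}")
```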

Outline
1. Introduction
 Noise
 Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

Error Profiling with Structured Noise
There are unlimited types of structured noise, and it occurs in many studies
Objective
 Construct a systematic approach
 Study specific types of structured noise

Approach
 Rule learning
 Rule evaluation

Associative Noise (ICDM ’07)
Associative noise
 The error in one attribute is associated with the values of other attributes
 The stability of certain measures is conditioned on other attributes
 Intentionally planted false information
Model
 Assumptions: a noisy data set D and a purged data set D’
 Associative Corruption rules
 Errors occur only in feature attributes

Associative Profiling
Take the purged data set D’ as the base data set
For each corrupted attribute Ai in D, add Ai into D’ and label Ai as the class attribute
Learn a classification tree from D’
Obtain modification rules, e.g.
 If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
 If A2 = a21, A3 = a31, then A5 = a51 => A5 = a52

Associative Profiling Rules
Invert the obtained rules, e.g.
 If A1 = a11, A2 = a21, C = c1, then A5 = a51 => A5 = a52
 becomes: If A1 = a11, A2 = a21, C = c1, then A5 = a52 => A5 = a51
In D’, learn a Bayes learner L for attribute A’i
Evaluation
 Correct the noisy data set D2 with the help of L, obtaining the corrected data set D’2
 Does D’2 have higher quality than D2 in terms of supervised learning?
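
A minimal sketch of the profiling step on synthetic data: a decision tree learned on the purged set D’ predicts the corrupted attribute from the remaining attributes, and its predictions on the noisy set play the role of the modification rules. The data, the corruption pattern, and the choice of learner are all illustrative assumptions.

```python
# Associative-profiling sketch: label corrupted attribute A5 as the class,
# learn a tree from the purged data D', and apply it to the noisy data D2.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
D_prime = rng.integers(0, 3, size=(200, 5))            # purged data set D'
D_prime[:, 4] = (D_prime[:, 1] + D_prime[:, 2]) % 3    # A5 depends on A2, A3
target, others = 4, [0, 1, 2, 3]

D2 = D_prime.copy()                                    # noisy data set D2
mask = rng.random(200) < 0.2                           # ~20% of A5 values corrupted
D2[mask, target] = rng.integers(0, 3, size=mask.sum())

# Learn A5 from the other attributes on the purged data (the profiling model).
profiler = DecisionTreeClassifier(random_state=0)
profiler.fit(D_prime[:, others], D_prime[:, target])

# The tree's branches act as modification rules "A5 = a5x => A5 = a5y":
# wherever the prediction disagrees with D2, a correction is recommended.
recommended = profiler.predict(D2[:, others])
print("values to correct:", np.flatnonzero(recommended != D2[:, target]).size)
```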

Outline
1. Introduction
 Noise
 Existing Efforts in Noise Handling
2. A System Framework for Error-Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error-Tolerant Mining

Error-Tolerant Data Mining
 Get a set of diverse base training sets by re-sampling
 Unify error detection, correction and data cleansing for each base training set to improve its quality
 Ensemble the resulting classifiers

C2 Flowchart

Accuracy Enhancement
Three steps:
1. Locate noisy data in the given dataset
2. Recommend possible corrections
  Attribute prediction
  Construct a solution set
3. Select and perform one correction for each noisy instance
[Diagram labels: D, D’, S’, classifier T’]

Attribute Prediction
Switch each attribute Ai with the class label to train a classifier APi
 Ik: A1, A2, .., Ai, .., AN, C
 Ik: A1, A2, .., C, .., AN → Ai (classification algorithm yields APi)
Use APi to evaluate whether attribute Ai possibly contains any error
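
A minimal sketch of the swap, assuming categorical data encoded as integers and a naive Bayes learner as the classification algorithm; the data and learner here are stand-ins, not C2's fixed configuration.

```python
# Attribute-prediction sketch: for each A_i, train AP_i on the remaining
# attributes plus the class C, with A_i itself as the prediction target.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 4))   # attributes A1..A4 (integer codes)
y = rng.integers(0, 2, size=300)        # class label C

for i in range(X.shape[1]):
    # Swap: the feature vector becomes (A1, .., C, .., AN), the target A_i.
    features = np.column_stack([np.delete(X, i, axis=1), y])
    ap_i = CategoricalNB().fit(features, X[:, i])

    # A mismatch between AP_i's prediction and the stored value flags a
    # possibly erroneous attribute value.
    mismatch = ap_i.predict(features) != X[:, i]
    print(f"A{i + 1}: {mismatch.sum()} suspicious values")
```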

Construct a Solution Set
For instance Ik: A1, A2, …, Ai, C, the predictors AP1, AP2, …, APi produce Ik’: A1’, A2’, …, Ai’, C
Example solution set for instance Ik:
 {A1 --> A1’, Aj --> Aj’, {Ak1 --> Ak1’, Ak2 --> Ak2’}}
k = 3: maximum number of attribute value changes
[Diagram labels: D’, classifier T’, S]

Select and Perform Corrections
[Flowchart: resampling D into D1’, D2’, …, Dn’ with solution sets S1, S2, …, Sn → noise locating, detecting and correcting yields D1’’, D2’’, …, Dn’’ → classifier ensembling]
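
Putting the pieces together, here is a hedged sketch of the resample-cleanse-ensemble loop. The cleanse() body is a placeholder for the noise locating, attribute-prediction and correction-selection steps above, and everything else (base learner, vote rule) is an illustrative assumption rather than C2's exact configuration.

```python
# C2-style loop sketch: bootstrap the noisy set D, cleanse each resample
# D_i' into D_i'', train a base learner on each, and combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cleanse(X, y):
    """Placeholder for noise locating, detecting and correcting."""
    return X, y

def c2_style_predict(X, y, X_test, n_learners=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample D_i'
        Xc, yc = cleanse(X[idx], y[idx])             # cleansed set D_i''
        votes.append(DecisionTreeClassifier().fit(Xc, yc).predict(X_test))
    votes = np.vstack(votes)
    # Majority vote over the base learners (assumes integer class labels).
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Example usage on toy integer-coded data:
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 4))
y = rng.integers(0, 2, size=200)
print(c2_style_predict(X, y, X[:10]))
```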

Experimental Results
We integrate the Weka-3-4 packages into our system
We use the C4.5 classification tree
Real-world datasets from the UCI data repository
Attribute error corruption scheme:
 Erroneous attribute values are introduced into each attribute independently at noise level x × 100%
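
The corruption scheme translates directly into code. This sketch assumes integer-coded categorical attributes and replaces each value, independently per attribute, with a random value from that attribute's domain with probability x.

```python
# Noise-injection sketch for the corruption scheme described above.
import numpy as np

def corrupt(X, x, seed=0):
    """Corrupt each attribute value independently with probability x."""
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    domain_size = X.max(axis=0) + 1          # value domain per attribute
    for j in range(X.shape[1]):
        mask = rng.random(len(X)) < x        # which values of attribute j to corrupt
        Xn[mask, j] = rng.integers(0, domain_size[j], size=mask.sum())
    return Xn

# Example: about x * (1 - 1/domain) of the values actually change, since a
# random replacement can coincide with the original value.
X = np.random.default_rng(1).integers(0, 4, size=(1000, 6))
print((corrupt(X, 0.3) != X).mean())
```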

Results
Over the 40 trials: C2 won 34, Bagging won 4, and 2 were ties

Results: Monks3
[Charts: performance comparison on base learners and on the four methods, at 10% and 20% noise levels]

Results: Monks3
[Charts: performance comparison on base learners and on the four methods, at 30% and 40% noise levels]

Performance Discussions
C2 (ICDM 2006), Bagging and ACE (ICTAI 2005) all outperform the single classifier T
C2 outperforms Bagging in most trials
When the noise level is high, the accuracy enhancement module becomes less reliable
Improvement can be considered in the following aspects:
 Locating noisy data
 Recommending possible corrections
 Selecting and performing one correction

Concluding Remarks
A defining problem, and hence a long-term issue
 Data mining from large, noisy data sources
Handling different types of noise
 Structured noise: associative noise is one specific type, addressed by associative profiling
 Random noise: addressed by C2 (Corrective Classification)
Future work: how to combine noise profiling and noise-tolerant mining when the noise types are unknown?

References
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
2. R.O. Duda et al., Pattern Classification (2nd Edition), Wiley-Interscience, 2000.
3. Y. Zhang, X. Zhu, X. Wu and J.P. Bond, ACE: An Aggressive Classifier Ensemble with Error Detection, Correction and Cleansing, IEEE ICTAI 2005.
4. Y. Zhang, X. Zhu and X. Wu, Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners, ICDM 2006.
5. X. Zhu and X. Wu, Class Noise vs. Attribute Noise: A Quantitative Study of Their Impacts, Artificial Intelligence Review, 22(3-4), 2004.
6. Y. Zhang and X. Wu, Noise Modeling with Associative Corruption Rules, ICDM 2007.

Acknowledgements
Joint work with:
 Dr. Xingquan Zhu
 Yan Zhang
 Dr. Jeffrey Bond
Supported by:
 DOD (US)
 NSF (US)