1 Mining with Noise Knowledge: Error Awareness Data Mining
Xindong Wu
Department of Computer Science, University of Vermont, USA; Hong Kong Polytechnic University; School of Computer and Information, Hefei University of Technology

2 Outline
Tsinghua University, Beijing, January 15, 2008
1. Introduction
  - Noise
  - Existing Efforts in Noise Handling
2. A System Framework for Error Tolerant Data Mining
3. Error Detection and Instance Ranking
4. Error Profiling with Structured Noise
5. Error Tolerant Mining

3 Noise Is Everywhere
Random noise
  - "a random error or variance in a measured variable" (Han & Kamber 2001)
  - "any property of the sensed pattern which is not due to the true underlying model but instead to randomness in the world or the sensors" (Duda et al. 2000)
Structured noise
  - Caused by systematic mechanisms
      - Equipment failure
      - Deceptive information

4 Noise Categories and Locations
Categorized by type
  - Erroneous value
  - Missing value
Categorized by variable type (Zhu & Wu 2004)
  - Independent variable: attribute noise
  - Dependent variable: class noise

5 Existing Efforts (1): Learning with Random Noise
Data preprocessing techniques
  - Identifying mislabeled examples (Brodley & Friedl 1999): noise filtering
  - Erroneous attribute value detection (Teng 1999): attribute value prediction
  - Missing attribute value acquisition (Zhu & Wu 2004, Zhu & Wu 2005): acquiring the most informative missing values
  - Data imputation (Fellegi & Holt 1976): filling missing values

6 Existing Efforts (2): Classifier Ensembling with Random Noise
  - Bagging (Breiman 1996)
  - Boosting (Freund & Schapire 1996)

7 Limitations
  - The design of current ensembling methods focuses only on making diverse base learners
  - How to learn from past noise-handling efforts to avoid future noise?

9 A System Framework for Noise-Tolerant Data Mining
  - Error Identification and Instance Ranking
  - Error Profiling and Reasoning
  - Error-Tolerant Mining

12 Error Detection and Instance Ranking (AAAI-04)
Error Detection
  - Construct suspicious instance subset
  - Locate erroneous attribute values
Impact-Sensitive Ranking
  - Rank suspicious instances based on located erroneous attribute values and their impacts.
[Figure: flowchart with two stages. Error Detection nodes: Noisy Dataset D, Suspicious Instances Subset S, Erroneous Attribute Detection. Impact-sensitive Ranking nodes: Calculate Information-gain Ratios, Impact-sensitive Weight for Each Attribute, Overall Impact Value for Each Suspicious Instance, Impact-sensitive Ranking and Recommendation.]
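The ranking stage on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the AAAI-04 method itself: it assumes the impact-sensitive weight of an attribute is simply its information-gain ratio, and that an instance's overall impact value is the sum of the weights of its attributes flagged as erroneous. The function names and the `suspicious` input format are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, labels, attr):
    """Information-gain ratio of attribute index `attr` with respect to the class labels."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value, so it carries no information
    return (entropy(labels) - remainder) / split_info

def rank_suspicious(rows, labels, suspicious):
    """Rank suspicious instances by impact, highest first.

    suspicious: dict mapping instance index -> set of attribute indices
    flagged as erroneous (assumed to come from an earlier detection step).
    Assumption: weight = gain ratio, impact = sum of flagged weights.
    """
    weights = [gain_ratio(rows, labels, a) for a in range(len(rows[0]))]
    impact = {i: sum(weights[a] for a in attrs) for i, attrs in suspicious.items()}
    order = sorted(impact, key=impact.get, reverse=True)
    return order, impact
```

Instances whose flagged attributes matter most to the class label are recommended for inspection first; an error in an attribute with a gain ratio near zero contributes almost nothing to the ranking.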

14 Error Profiling with Structured Noise
  - Unlimited types of structured noise
  - Occurs in many studies
Objective
  - Construct a systematic approach
  - Study specific types of structured noise.

15 Approach
  - Rule Learning
  - Rule Evaluation

16 Associative Noise (ICDM '07)
Associative noise
  - The error in one attribute is associated with other attribute values
      - Stability of certain measures is conditioned on other attributes
      - Intentionally planted false information
Model
  - Assumptions
      - Noisy data set D, purged data set D'.
  - Associative Corruption Rules
  - Errors are only in feature attributes.

17 Associative Profiling
  - Take the purged data set D' as the base data set
  - For each corrupted attribute A_i in D, add A_i into D' and label A_i as the class attribute
  - Learn a classification tree from D'
  - Obtain modification rules
      - If A_1 = a_11, A_2 = a_21, C = c_1, then A_5 = a_51 => A_5 = a_52
      - If A_2 = a_21, A_3 = a_31, then A_5 = a_52 => A_5 = a_52
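A rough sketch of the profiling steps above, under loud assumptions: D and D' are taken to be row-aligned lists of dicts, and a majority-vote frequency table stands in for the classification tree the slide calls for. The names `build_profiling_set` and `extract_rules`, the `condition_attrs` argument, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter, defaultdict

def build_profiling_set(purged, noisy, attr):
    """Pair each purged row (from D') with the noisy value of `attr` (from D).

    The noisy value plays the role of the class attribute.
    Assumption: purged[i] and noisy[i] describe the same instance.
    """
    return [(dict(clean), dirty[attr]) for clean, dirty in zip(purged, noisy)]

def extract_rules(training, attr, condition_attrs, min_support=2):
    """Stand-in for tree learning: majority noisy value per condition pattern.

    Returns rules as (conditions, clean_value, noisy_value) triples, read as
    "if conditions hold, then attr = clean_value => attr = noisy_value".
    A rule with clean_value == noisy_value means the value is left unchanged.
    """
    table = defaultdict(Counter)
    for clean, noisy_value in training:
        key = tuple(clean[a] for a in condition_attrs) + (clean[attr],)
        table[key][noisy_value] += 1
    rules = []
    for key, counts in table.items():
        noisy_value, support = counts.most_common(1)[0]
        if support >= min_support:
            conditions = dict(zip(condition_attrs, key[:-1]))
            rules.append((conditions, key[-1], noisy_value))
    return rules
```

A real tree learner would also choose which attributes to condition on; here that choice is passed in by hand through `condition_attrs`.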

18 Associative Profiling Rules
Invert the obtained rules
  - Obtained: If A_1 = a_11, A_2 = a_21, C = c_1, then A_5 = a_51 => A_5 = a_52
  - Inverted: If A_1 = a_11, A_2 = a_21, C = c_1, then A_5 = a_52 => A_5 = a_51
In D', learn a Bayes learner L for attribute A'_i
Evaluation
  - Correcting noisy data set D_2 with the help of L
  - Corrected data set D'_2
  - Does D'_2 have a higher quality than data set D_2 in terms of supervised learning?
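Rule inversion and its use for correction can be illustrated as follows. The rule representation (a triple of conditions, value before, value after) and both function names are assumptions made for this sketch, and the Bayes learner L mentioned on the slide is left out entirely.

```python
def invert_rule(rule):
    """Turn a corruption rule (conditions, before, after) into a correction rule."""
    conditions, before, after = rule
    return (conditions, after, before)

def apply_rules(rows, attr, rules):
    """Apply the first matching rule to `attr` in each row (rows are dicts).

    Assumption: the conditioning attributes themselves are error-free,
    so a rule can be matched directly against the noisy row.
    """
    corrected = []
    for row in rows:
        fixed = dict(row)
        for conditions, before, after in rules:
            matches = all(fixed[a] == v for a, v in conditions.items())
            if matches and fixed[attr] == before:
                fixed[attr] = after
                break  # at most one correction per row
        corrected.append(fixed)
    return corrected
```

For example, inverting the rule `({"A1": "a11"}, "a51", "a52")` yields a correction rule that maps `A5 = a52` back to `a51` whenever `A1 = a11`.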

20 Error-Tolerant Data Mining
  - Get a set of diverse base training sets by resampling
  - Unify error detection, correction and data cleansing for each base training set to improve its quality
  - Classifier ensembling.
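The three steps on this slide map onto a short pipeline skeleton. This is a generic sketch, not the C2 system: `train` and `cleanse` are hypothetical callables supplied by the caller, bootstrap sampling is assumed as the resampling scheme, and plain majority voting is assumed for the ensemble.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def build_ensemble(data, train, cleanse=lambda d: d, n_models=5, seed=0):
    """Resample, cleanse each base training set, then train one model per set.

    train:   callable taking a list of examples, returning a predict function
    cleanse: callable taking and returning a list of examples; the default
             leaves the data untouched, which reduces the pipeline to bagging
    """
    rng = random.Random(seed)
    return [train(cleanse(bootstrap(data, rng))) for _ in range(n_models)]

def vote(models, x):
    """Majority vote over the base models' predictions for x."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

Keeping `cleanse` as a separate hook makes the contrast with plain bagging explicit: the only difference is what happens to each base training set before a base learner sees it.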

21 C2 Flowchart
[Figure: C2 flowchart; diagram content not preserved in the transcript.]

22 Accuracy Enhancement
Three Steps:
  1. Locate noisy data from the given dataset
  2. Recommend possible corrections
      - Attribute prediction
      - Construct solution set
  3. Select and perform one correction for each noisy instance
[Figure labels: D, D', S', Classifier T']

23 Attribute Prediction
Switch each attribute (A_i) with the class label to train a classifier AP_i
  I_k: A_1, A_2, ..., A_i, ..., A_N, C
  I_k: A_1, A_2, ..., C, ..., A_N, A_i  --> Classification Algorithm --> AP_i
Use AP_i to evaluate whether attribute A_i possibly contains any error
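The attribute/class swap can be shown concretely. In this sketch a majority-vote lookup table stands in for the classification algorithm (the slides report using C4.5 in experiments), instances are assumed to be tuples of discrete values, and all function names are hypothetical. Note that the predictor is evaluated on the same data it was trained on, which a real system would likely avoid.

```python
from collections import Counter, defaultdict

def swap_attribute_with_class(rows, labels, i):
    """Make attribute i the prediction target and put the class label in its slot."""
    features = [row[:i] + (label,) + row[i + 1:] for row, label in zip(rows, labels)]
    targets = [row[i] for row in rows]
    return features, targets

def train_lookup(features, targets):
    """Toy learner: most common target per exact feature tuple."""
    table = defaultdict(Counter)
    for f, t in zip(features, targets):
        table[f][t] += 1
    overall = Counter(targets).most_common(1)[0][0]

    def predict(f):
        # Fall back to the overall majority value for unseen feature tuples.
        return table[f].most_common(1)[0][0] if f in table else overall

    return predict

def suspicious_values(rows, labels, i):
    """Indices of instances whose value for attribute i disagrees with AP_i."""
    features, targets = swap_attribute_with_class(rows, labels, i)
    ap_i = train_lookup(features, targets)
    return [k for k, (f, t) in enumerate(zip(features, targets)) if ap_i(f) != t]
```

A disagreement between AP_i and the recorded value only marks the value as possibly erroneous; the prediction doubles as a candidate correction.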

24 Construct A Solution Set
  I_k:  A_1   A_2   ...  A_i   C
        AP_1  AP_2  ...  AP_i
  I_k': A_1'  A_2'  ...  A_i'  C
For example, solution set for instance I_k:
  {A_1 --> A_1', A_j --> A_j', {A_k1 --> A_k1', A_k2 --> A_k2'}}
k = 3: maximum attribute value changes.
[Figure labels: D', S, Classifier T']
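One way to enumerate such a solution set is to take every predicted change and combine them up to a maximum number of simultaneous changes. This is a sketch under assumptions: the `predictions` dict (attribute index to predicted value) is taken to come from the attribute predictors, the `max_changes` parameter plays the role of the slide's cap on attribute value changes, and how candidate solutions are later filtered is not modeled.

```python
from itertools import combinations

def solution_set(instance, predictions, max_changes=2):
    """Enumerate candidate corrections for one instance.

    instance:    tuple of attribute values
    predictions: dict attr index -> value predicted by that attribute's predictor
    Returns a list of dicts, each mapping attr index -> replacement value,
    covering every combination of 1..max_changes differing attributes.
    """
    changes = [(i, v) for i, v in sorted(predictions.items()) if instance[i] != v]
    solutions = []
    for size in range(1, max_changes + 1):
        for combo in combinations(changes, size):
            solutions.append(dict(combo))
    return solutions
```

The number of candidates grows combinatorially with the cap, which is presumably why the number of simultaneous changes is bounded.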

25 Select and Perform Corrections
[Figure: D is resampled into D_1', D_2', ..., D_n' with solution sets S_1, S_2, ..., S_n; noise locating, detecting, and correcting produces D_1'', D_2'', ..., D_n'', which feed classifier ensembling.]

26 Experimental Results
  - We integrate Weka-3-4 packages into our system
  - We use the C4.5 classification tree
  - Real-world datasets from the UCI data repository
  - Attribute error corruption scheme:
      - Erroneous attribute values are introduced into each attribute independently with noise level x × 100%.
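The corruption scheme can be sketched as below. The slide only says that errors are introduced into each attribute independently at noise level x; the choice to replace a value with a different one drawn uniformly from that attribute's observed domain is an assumption of this sketch, as is the function name.

```python
import random

def corrupt_attributes(rows, noise_level, seed=0):
    """Independently corrupt each attribute value with probability noise_level.

    rows: list of tuples of discrete attribute values (class label excluded).
    Assumption: a corrupted value is replaced by a different value chosen
    uniformly from the values observed for that attribute.
    """
    rng = random.Random(seed)
    domains = [sorted({row[i] for row in rows}) for i in range(len(rows[0]))]
    noisy = []
    for row in rows:
        new_row = list(row)
        for i, value in enumerate(row):
            alternatives = [v for v in domains[i] if v != value]
            if alternatives and rng.random() < noise_level:
                new_row[i] = rng.choice(alternatives)
        noisy.append(tuple(new_row))
    return noisy
```

With `noise_level=0.1`, roughly 10% of the values of every attribute are changed, and the original rows are left unmodified so the clean and noisy versions can be compared.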

27 Results
  - C2 won 34 trials
  - Bagging won 4 trials
  - Tied 2 trials

28 Results: Monks3
[Figure: two panels, "Performance Comparison on Base Learners" and "Performance Comparison on Four Methods", at noise levels 10% and 20%.]

29 Results: Monks3
[Figure: two panels, "Performance Comparison on Base Learners" and "Performance Comparison on Four Methods", at noise levels 30% and 40%.]

30 Performance Discussions
  - C2 (ICDM 2006), Bagging, and ACE (ICTAI 2005) outperform classifier T
  - C2 outperforms Bagging in most trials
  - When the noise level is high, the accuracy enhancement module is less reliable
  - We can consider improvements in the following aspects:
      - Locating noisy data
      - Recommending possible corrections
      - Selecting and performing one correction.

31 Concluding Remarks
A defining problem, hence a long-term issue
  - Data mining from large, noisy data sources
With different types of noise,
  - Structured noise
      - A specific type of structured noise: associative noise
      - Associative profiling
  - Random noise
      - C2: Corrective Classification
Future work: how to combine noise profiling and noise-tolerant mining with unknown noise types?

32 References
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
2. R.O. Duda, et al., Pattern Classification (2nd Edition), Wiley-Interscience, 2000.
3. Y. Zhang, X. Zhu, X. Wu and J.P. Bond, ACE: An Aggressive Classifier Ensemble with Error Detection, Correction and Cleansing, IEEE ICTAI 2005, pp. 310-317.
4. Y. Zhang, X. Zhu and X. Wu, Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners, ICDM 2006, pp. 1199-1204.
5. X. Zhu and X. Wu, Class Noise vs Attribute Noise: A Quantitative Study of Their Impacts, Artificial Intelligence Review, 22(2004), 3-4: 177-210.
6. Y. Zhang and X. Wu, Noise Modeling with Associative Corruption Rules, ICDM 2007, pp. 733-738.

33 Acknowledgements
Joint work with:
  - Dr. Xingquan Zhu
  - Yan Zhang
  - Dr. Jeffrey Bond
Supported by
  - DOD (US)
  - NSF (US)

