Class Noise and Supervised Learning in Medical Domains: The Effect of Feature Extraction

Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
Seppo Puuronen & Oleksandr Pechenizkiy, Dept. of CS and IS, University of Jyväskylä, Finland
Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland

IEEE CBMS’06: DM Track, Salt Lake City, Utah, USA, June 21-23, 2006
Outline
–DM and KDD background: KDD as a process, DM strategy; Supervised Learning (SL)
–Noise in data: types and sources of noise
–Feature Extraction (FE) approaches used: conventional Principal Component Analysis; class-conditional FE, parametric and non-parametric
–Experiment design: impact of class noise on SL and the effect of FE; dataset characteristics
–Results and Conclusions
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
[Diagram of the KDD process, annotated with the techniques used in this study: kNN, Naïve Bayes and C4.5 for data mining; PCA and LDA for feature transformation. Class noise is introduced into the training datasets.]
The task of classification
Given n training instances (x_i, y_i), where x_i is a vector of p attribute values and y_i is one of J class labels.
Goal: given a new instance x_0, predict its class y_0 (class membership of the new instance).
Examples: diagnosis of thyroid diseases; heart attack prediction, etc.
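The classification task above can be sketched with a minimal k-nearest-neighbour classifier, one of the three learners the paper later uses. This toy NumPy implementation and its data are illustrative only, not the WEKA setup used in the experiments:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    """Classify x0 by majority vote among its k nearest training instances."""
    dists = np.linalg.norm(X_train - x0, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority class

# Toy 2-class problem: two clusters in the plane
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0])))  # → 1
```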
Noise in data
Data may contain various types of errors: random or systematic.
–Random errors are often referred to as noise; some authors regard as noise both mislabeled examples and outliers – correctly labeled but relatively rare instances (exceptions).
Quality of a dataset in SL is characterized by two parts of each instance:
–the quality of the attributes indicates how well they characterize instances for classification purposes;
–the quality of the class labels indicates the correctness of the class label assignments.
Noise is correspondingly divided into two major categories:
–class noise (misclassifications or mislabeling): contradictory instances (instances with the same attribute values but different class labels, forming the so-called irreducible or Bayes error) and wrongly labeled instances;
–attribute noise (errors introduced to attribute values): erroneous attribute values, missing or so-called ‘don’t know’ values, and incomplete or so-called ‘don’t care’ values.
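Contradictory instances and the irreducible (Bayes) error can be illustrated with a tiny example; the numbers here are ours, chosen for illustration:

```python
import numpy as np

# Contradictory instances: identical attribute values, different class labels.
X = np.array([[1.0], [1.0], [1.0], [1.0]])
y = np.array([0, 0, 0, 1])   # one in four is labeled differently

# The best any classifier can do on x = 1.0 is predict the majority class,
# so the error cannot drop below the minority fraction (the Bayes error here).
majority = np.bincount(y).argmax()
bayes_error = np.mean(y != majority)
print(bayes_error)  # → 0.25
```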
Sources of class noise
The major factors that affect the number of mislabeled instances in a dataset:
–data-entry errors;
–errors of devices used for automatic classification;
–subjectivity and inadequacy of the information used to label each instance.
Domains in which medical experts may disagree are natural ones for subjective labeling errors:
–if the absolute ground truth is unknown, experts must label instances subjectively and mislabeled instances naturally appear;
–if an observation needs to be ranked according to disease severity.
The information used to label an instance may differ from the information to which the learning algorithm will have access:
–e.g. if an expert relies on visual input rather than the numeric values of the attributes.
The results of some tests (attribute values) may be unknown – impossible or too difficult to obtain:
–e.g. because of cost or time considerations.
Handling class noise
Noise-tolerant techniques – try to avoid overfitting the possibly noisy training set during SL:
–handle noise implicitly;
–the noise-handling mechanism is often embedded in the search heuristics and stopping criteria used in model construction; in post-processing such as decision-tree post-pruning; or in a model selection mechanism based e.g. on the MDL principle.
Filtering techniques – detect and eliminate noisy instances before SL:
–handle noise explicitly;
–the noise-handling mechanism is often implemented as a filter applied before SL;
–results in a reduced training set if the noisy instances are deleted rather than corrected;
–single-algorithm filters and ensemble filters.
A brief review of these approaches and their pros and cons can be found in the paper; we omit this discussion due to time constraints.
It is often hard to distinguish noise from exceptions (outliers) without the help of an expert, especially if the noise is systematic.
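As an illustration of a single-algorithm filter, here is a sketch of a classic edited-nearest-neighbour filter: an instance is flagged as suspicious when its nearest neighbour (computed leave-one-out) carries a different class label. This particular filter and the toy data are our illustration, not a method from the paper:

```python
import numpy as np

def enn_filter(X, y):
    """Edited-nearest-neighbour style filter: keep instances whose nearest
    neighbour (excluding the instance itself) has the same class label."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # leave-one-out: exclude self
        nn = np.argmin(d)
        if y[nn] == y[i]:                  # agreement -> likely clean
            keep.append(i)
    return np.array(keep)

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [1.1, 1.0], [0.2, 0.2]])
y = np.array([0, 0, 0, 1, 1, 1])           # last instance is mislabeled
print(enn_filter(X, y))                    # the mislabeled instance is dropped
```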
Focus of this study
To apply Feature Extraction (FE) techniques to reduce the effect of class noise on SL.
This approach fits the category of noise-tolerant techniques, as it helps to avoid overfitting implicitly within the learning techniques.
However, it also has some similarity with the filtering approach, as it clearly has a separate dimensionality-reduction phase undertaken before the SL process.
Brief background on the FE techniques used in this study – in the next few slides.
Feature Extraction
Feature extraction (FE) is a dimensionality-reduction technique that extracts a set of new features from the original features by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990).
Conventional Principal Component Analysis (PCA) is one of the most commonly used FE techniques; it extracts the axes on which the data shows the highest variability (Jolliffe 1986).
PCA has the following properties: (1) it maximizes the variance of the extracted features; (2) the extracted features are uncorrelated; (3) it finds the best linear approximation in the mean-square sense; (4) it maximizes the information contained in the extracted features.
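A minimal sketch of PCA via the eigendecomposition of the covariance matrix; the synthetic data and function name are ours:

```python
import numpy as np

def pca_transform(X, n_components):
    """Project X onto the directions of highest variance (principal axes)."""
    Xc = X - X.mean(axis=0)                     # centre the data
    cov = np.cov(Xc, rowvar=False)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending by variance
    W = eigvecs[:, order[:n_components]]        # top principal axes
    explained = eigvals[order][:n_components].sum() / eigvals.sum()
    return Xc @ W, explained

rng = np.random.default_rng(0)
# Elongated 2-D cloud: almost all variance lies along one direction
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.05 * rng.normal(size=(200, 1))])
Z, var_covered = pca_transform(X, 1)           # one component covers ~all variance
```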
FE example: “Heart Disease”
Extracted components are linear combinations of the original features, e.g.:
0.1·Age − 0.6·Sex − 0.73·RestBP − 0.33·MaxHeartRate
−0.01·Age + 0.78·Sex − 0.42·RestBP − 0.47·MaxHeartRate
−0.7·Age + 0.1·Sex − 0.43·RestBP + 0.57·MaxHeartRate
[Figure: variance covered by subsets of the extracted components – 60%, 67%, 87%, 100%.]
PCA- and LDA-based Feature Extraction
Experimental studies with these FE techniques and basic SL techniques: Tsymbal et al., FLAIRS’02; Pechenizkiy et al., AI’05.
Use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or deteriorates it.
No single superior technique, but nonparametric approaches are more stable across various dataset characteristics.
Experiment design
WEKA 3 environment: Data Mining Software in Java – http://www.cs.waikato.ac.nz/ml/weka/
10 medical datasets: next slide.
Classification algorithms: kNN, Naïve Bayes, C4.5.
Feature Extraction techniques: PCA, PAR, NPAR – 0.85 variance threshold.
Artificially imputed class noise: 0%–20%, in 2% steps.
Evaluation: accuracy averaged over 30 test runs of Monte-Carlo cross-validation for each sample; 30% used as the test set, 70% used to form a training set in which 0%–20% of instances have an artificially corrupted class label.
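The evaluation protocol above (30 Monte-Carlo runs, 70/30 split, noise injected only into training labels) can be sketched as follows; the 1-NN stand-in learner and the toy data are ours, not the WEKA kNN / Naïve Bayes / C4.5 used in the paper:

```python
import numpy as np

def nn1(X_tr, y_tr, x0):
    """1-nearest-neighbour classifier used as a stand-in learner."""
    return y_tr[np.argmin(np.linalg.norm(X_tr - x0, axis=1))]

def monte_carlo_accuracy(X, y, noise_rate, n_runs=30, test_frac=0.3, seed=1):
    """Mean accuracy over random 70/30 splits, flipping a fraction of the
    training labels before learning; test labels stay clean."""
    rng = np.random.default_rng(seed)
    n_classes = len(np.unique(y))
    accs = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        n_test = int(test_frac * len(X))
        test, train = idx[:n_test], idx[n_test:]
        y_tr = y[train].copy()
        flip = rng.choice(len(train), size=int(noise_rate * len(train)),
                          replace=False)
        for i in flip:  # move each chosen label to a different class
            y_tr[i] = (y_tr[i] + rng.integers(1, n_classes)) % n_classes
        preds = np.array([nn1(X[train], y_tr, x0) for x0 in X[test]])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(5, 0.3, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
acc_clean = monte_carlo_accuracy(X, y, 0.0)
acc_noisy = monte_carlo_accuracy(X, y, 0.2)   # class noise degrades accuracy
```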
Dataset characteristics

dataset        instances  features  classes
contractions   98         27        2
laryngeal1     213        16        2
laryngeal2     692        16        2
laryngeal3     353        16        3
rds            85         17        2
weaning        302        17        2
voice3         238        10        3
voice9         428        10        9

Further information on these datasets and the datasets themselves are available at http://www.informatics.bangor.ac.uk/~kuncheva/activities/real_data.htm
Classification Accuracy vs. Imputed Class Noise
Classification Error Increase due to Class Noise
[Panels: kNN (k-Nearest Neighbour), Naïve Bayes, C4.5 decision tree]
Typical behavior of SL with(out) FE
[Panels for the laryngeal1 dataset: k-Nearest Neighbour, Naïve Bayes, C4.5 decision tree]
Summary and Conclusions
Class noise affects SL on most of the considered datasets.
FE can significantly increase the accuracy of SL by producing a better feature space and fighting “the curse of dimensionality”.
In this study we showed that applying FE for SL decreases the negative effect of class noise in the data.
Directions of further research:
–comparison of FE techniques with other dimensionality-reduction and instance-selection techniques;
–comparison of FE with filter approaches for class noise elimination.
Contact Info
Mykola Pechenizkiy
Department of Mathematical Information Technology, University of Jyväskylä, FINLAND
E-mail: mpechen@cs.jyu.fi
www.cs.jyu.fi/~mpechen
THANK YOU!
MS PowerPoint slides of this and other recent talks and full texts of selected publications are available online at: http://www.cs.jyu.fi/~mpechen