
Slide 1: The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning
Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
Seppo Puuronen, Dept. of CS and IS, University of Jyväskylä, Finland
Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland
ACM SAC’06: DM Track, Dijon, France, April 23-27, 2006

Slide 2: Outline
- DM and KDD background: KDD as a process, DM strategy
- Supervised learning (SL): the curse of dimensionality and indirectly relevant features; feature extraction (FE) as dimensionality reduction
- Feature extraction approaches used: conventional Principal Component Analysis; class-conditional FE, parametric and non-parametric
- Sampling approaches used: random, stratified random, kd-tree-based selective
- Experiment design: the impact of sample reduction on FE for SL
- Results and conclusions

Slide 3: Knowledge Discovery as a Process
[Figure: the KDD process, after Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.]
Components used in this study: Naïve Bayes (learning), PCA and LDA (feature extraction), and instance selection (random, stratified, and kd-tree-based).

Slide 4: The Task of Classification
Given n training instances (x_i, y_i), where x_i are the values of p attributes and y_i is one of J classes, the goal is to predict the class membership y_0 of a new instance x_0.
Examples: diagnosis of thyroid diseases, heart attack prediction, etc.
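As a concrete illustration of this setting, here is a minimal sketch; it assumes scikit-learn's GaussianNB as a stand-in for the Naïve Bayes learner used later in the talk, with synthetic data in place of a real dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# n training instances with p features, labels from J classes
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))       # n = 100, p = 4
y_train = rng.integers(0, 2, size=100)    # J = 2 classes

clf = GaussianNB().fit(X_train, y_train)  # learn from (x_i, y_i)

x0 = rng.normal(size=(1, 4))              # a new instance x_0
y0 = clf.predict(x0)                      # its predicted class membership y_0
```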

Slide 5: Improvement of the Representation Space
- The curse of dimensionality: a drastic increase in computational complexity and classification error for data with a large number of dimensions.
- Indirectly relevant features.

Slide 6: Research Questions
[Figure: original features mapped to extracted features]
- How to construct a good representation space (RS) for SL?
- What is the effect of sample reduction on the performance of FE for SL?

Slide 7: FE Example on the "Heart Disease" Dataset
Extracted features as linear combinations of the original ones:
- 0.1·Age - 0.6·Sex - 0.73·RestBP - 0.33·MaxHeartRate
- -0.01·Age + 0.78·Sex - 0.42·RestBP - 0.47·MaxHeartRate
- -0.7·Age + 0.1·Sex - 0.43·RestBP + 0.57·MaxHeartRate
Variance covered (figures as shown on the slide): 100%, 87%, 60%, 67%.
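A minimal sketch of how such weight vectors arise from PCA, assuming plain NumPy and synthetic data in place of the actual Heart Disease dataset; the resulting weights are illustrative, not the ones on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))     # stand-in for Age, Sex, RestBP, MaxHeartRate

Xc = X - X.mean(axis=0)           # center the data
cov = np.cov(Xc, rowvar=False)    # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

order = np.argsort(eigvals)[::-1]            # sort by explained variance
components = eigvecs[:, order]               # columns = weight vectors
explained = eigvals[order] / eigvals.sum()   # fraction of variance covered

Z = Xc @ components[:, :2]        # project onto the top-2 extracted features
```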

Slide 8: PCA- and LDA-based Feature Extraction
- Experimental studies combining these FE techniques with basic SL techniques: Tsymbal et al., FLAIRS’02; Pechenizkiy et al., AI’05.
- Using class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while purely variance-based FE has no effect on the accuracy or even deteriorates it.
- There is no single superior technique, but the non-parametric approaches are more robust to varying dataset characteristics. (A sketch of the parametric class-conditional variant follows below.)
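The following sketch of parametric class-conditional FE assumes the standard Fisher criterion, between-class scatter S_b against within-class scatter S_w; it is a generic LDA-style illustration, not the exact formulation used in the paper.

```python
import numpy as np

def parametric_fe(X, y, k):
    """LDA-style extraction: directions maximizing between-class
    scatter (S_b) relative to within-class scatter (S_w)."""
    mean_all = X.mean(axis=0)
    p = X.shape[1]
    Sb = np.zeros((p, p))      # between-class scatter
    Sw = np.zeros((p, p))      # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)
        Sw += (Xc - mc).T @ (Xc - mc)
    # eigenvectors of Sw^{-1} S_b give the class-conditional directions
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return X @ eigvecs.real[:, order[:k]]
```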

Slide 9: What Is the Effect of Sample Reduction?
Sampling approaches used:
- Random sampling
- Stratified random sampling
- kd-tree-based sampling
- Stratified kd-tree-based sampling

Slide 10: Stratified Random Sampling
[Figure illustrating stratified random sampling]
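A minimal sketch of stratified random sampling under proportional allocation, i.e. drawing the same fraction from each class; the helper name and NumPy usage are illustrative, not code from the study.

```python
import numpy as np

def stratified_sample(X, y, fraction, rng=None):
    """Draw `fraction` of the instances from each class (proportional strata)."""
    rng = rng or np.random.default_rng()
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n_keep = max(1, round(fraction * len(idx)))  # at least one per class
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    keep = np.array(keep)
    return X[keep], y[keep]
```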

Slide 11: Stratified Sampling with kd-Tree-Based Selection
[Figure illustrating kd-tree-based selection]
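A sketch of kd-tree-based selective sampling, assuming the common scheme of splitting on the widest-spread dimension at its median and drawing a few instances from every leaf so that all regions of the data are represented; the exact scheme in the paper may differ.

```python
import numpy as np

def kdtree_sample(X, idx=None, leaf_size=16, per_leaf=4, rng=None):
    """Recursively split on the widest-spread dimension at its median;
    sample a few instances from each leaf so every region is covered."""
    rng = rng or np.random.default_rng()
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= leaf_size:
        take = min(per_leaf, len(idx))
        return list(rng.choice(idx, size=take, replace=False))
    dim = np.argmax(np.ptp(X[idx], axis=0))   # widest-spread dimension
    median = np.median(X[idx, dim])
    left = idx[X[idx, dim] <= median]
    right = idx[X[idx, dim] > median]
    if len(left) == 0 or len(right) == 0:     # degenerate split: stop here
        take = min(per_leaf, len(idx))
        return list(rng.choice(idx, size=take, replace=False))
    return (kdtree_sample(X, left, leaf_size, per_leaf, rng)
            + kdtree_sample(X, right, leaf_size, per_leaf, rng))
```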

Slide 12: Experiment Design
- WEKA environment; 10 UCI datasets.
- SL: Naïve Bayes.
- FE: PCA, PAR, NPAR, with a 0.85 (85%) variance coverage threshold.
- Sampling: RS, stratified RS, kd-tree, stratified kd-tree.
- Evaluation: accuracy averaged over 30 test runs of Monte-Carlo cross-validation for each sample size. 20% of each dataset is held out as the test set; from the remaining 80%, a training sample of 10%-100% is selected with one of the four sampling approaches: RS, stratified RS, kd-tree, stratified kd-tree. (A sketch of this evaluation loop follows below.)
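A sketch of that evaluation protocol, assuming hypothetical helpers sample_fn (one of the four sampling approaches) and extract_fn (one of the FE methods), with scikit-learn's GaussianNB standing in for WEKA's Naïve Bayes; the structure follows the slide, not the actual experiment scripts.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def monte_carlo_accuracy(X, y, sample_fn, extract_fn, p, runs=30, rng=None):
    """30 random 80/20 splits: sample fraction p of the training part,
    extract features on the sample, fit Naive Bayes, average test accuracy."""
    rng = rng or np.random.default_rng(0)
    accs = []
    for _ in range(runs):
        perm = rng.permutation(len(X))
        n_test = int(0.2 * len(X))
        test, train = perm[:n_test], perm[n_test:]
        Xs, ys = sample_fn(X[train], y[train], p)      # select 10%-100%
        Z_train, Z_test = extract_fn(Xs, ys, X[test])  # fit FE on the sample
        clf = GaussianNB().fit(Z_train, ys)
        accs.append((clf.predict(Z_test) == y[test]).mean())
    return float(np.mean(accs))
```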

Slide 13: Accuracy Results
- If the sample size p ≥ 20%, NPAR outperforms the other methods; if p ≥ 30%, NPAR outperforms the others even when they use p = 100%.
- The best p for NPAR depends on the sampling method: RS and stratified RS, p = 70%; kd-tree, p = 80%; stratified kd-tree, p = 60%.
- PCA performs worst when p is relatively small, especially with stratification and kd-tree indexing.
- PAR and Plain behave similarly with every sampling approach.
- In general, for p > 30% the different sampling approaches have very similar effects.

Slide 14: Results, kd-Tree Sampling with and without Stratification
Stratification improves kd-tree sampling with respect to FE for SL. The left figure shows the difference in NB accuracy between RS and kd-tree-based sampling (RS - kd-tree); the right figure shows the difference between RS and kd-tree-based sampling with stratification (RS - stratified kd-tree).

Slide 15: Summary and Conclusions
- FE techniques can significantly increase the accuracy of SL by producing a better feature space and fighting "the curse of dimensionality".
- With large datasets, only a part of the instances is selected for SL; we analyzed the impact of this sample reduction on the process of FE for SL.
- The results of our study show that:
  - it is important to take into account both class information and information about the data distribution when the sample size to be selected is small;
  - the type of sampling approach matters much less when a large proportion of the instances remains for FE and SL;
  - the NPAR approach extracts good features for SL from small numbers of instances (except in the RS case), in contrast to the PCA and PAR approaches.
- Limitations of our experimental study:
  - fairly small datasets, although we expect the comparative behavior of the sampling and FE techniques won't change dramatically;
  - experiments only with Naïve Bayes; it is not obvious that the comparative behavior of the techniques would be similar with other SL techniques;
  - no analysis of complexity issues, of the selected instances and the number of extracted features, or of the effect of noise in attributes and class information.

Slide 16: Contact Info
Mykola Pechenizkiy
Department of Mathematical Information Technology, University of Jyväskylä, Finland
E-mail: mpechen@cs.jyu.fi
Tel. +358 14 2602472; Mobile: +358 44 3851845; Fax: +358 14 2603011
www.cs.jyu.fi/~mpechen
THANK YOU!
MS PowerPoint slides of this and other recent talks and full texts of selected publications are available online at http://www.cs.jyu.fi/~mpechen

Slide 17: Extra Slides

Slide 18: Dataset Characteristics
[Table of characteristics of the 10 UCI datasets in the original slides]

Slide 19: Framework for DM Strategy Selection
Reference: Pechenizkiy, M., 2005. DM strategy selection via empirical and constructive induction (DBA’05).

Slide 20: Meta-Learning

