On Combining Principal Components with Parametric LDA-based Feature Extraction for Supervised Learning (kNN, Naïve Bayes and C4.5)
Mykola Pechenizkiy, Seppo Puuronen, Department of Computer Science, University of Jyväskylä, Finland
Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland
ADMKD'05, Tallinn, Estonia, September 15-16, 2005

Slide 2: Contents
DM and KDD background
– KDD as a process
– DM strategy
Classification
– Curse of dimensionality and indirectly relevant features
– Dimensionality reduction: Feature Selection (FS) and Feature Extraction (FE)
Feature Extraction for Classification
– Conventional PCA
– Class-conditional FE: parametric and non-parametric
– Combining principal components (PCs) and linear discriminants (LDs)
Experimental Results
– 3 FE strategies, 3 classifiers, 21 UCI datasets
Conclusions and Further Research

Slide 3: What is Data Mining
Data mining, or knowledge discovery, is the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad, KDD'96).
Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases (John 1997).
It lies at the intersection of many fields: statistics, AI, machine learning, databases, neural networks, pattern recognition, econometrics, etc.

Slide 4: Knowledge discovery as a process
(Figure of the KDD process from Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.)

Slide 5: The task of classification
Given J classes, n training observations and p features, i.e. n training instances (x_i, y_i) where x_i is a vector of attribute values and y_i is the class label, the goal is to predict the class y_0 of a new instance x_0.
(The slide figure shows a training set feeding a classifier, which assigns the class membership of the new instance.)
Examples: prognosis of breast cancer recurrence, diagnosis of thyroid diseases, heart attack prediction, etc.

Slide 6: Goals of Feature Extraction
Improvement of the representation space.

Slide 7: Constructive Induction
Feature extraction (FE) is a dimensionality reduction technique that extracts a subset of new features from the original set by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990).

Slide 8: Feature selection or transformation?
Features can be (and often are) correlated:
– FS techniques that just assign weights to individual features are insensitive to interacting or correlated features.
Data is often not homogeneous:
– For some problems a feature subset may be useful in one part of the instance space and at the same time useless or even misleading in another part of it.
– Therefore, it may be difficult or even impossible to remove irrelevant and/or redundant features from a data set and leave only the useful ones by means of feature selection.
That is why transformation of the given representation before weighting the features is often preferable.

Slide 9: FE for Classification

Slide 10: Principal Component Analysis
PCA extracts a lower-dimensional space by analyzing the covariance structure of multivariate statistical observations. The main idea is to determine features that explain as much of the total variation in the data as possible with as few features as possible.
PCA has the following properties:
(1) it maximizes the variance of the extracted features;
(2) the extracted features are uncorrelated;
(3) it finds the best linear approximation;
(4) it maximizes the information contained in the extracted features.

Slide 11: The Computation of the PCA

Slide 12: The Computation of the PCA
1) Calculate the covariance matrix S from the input data.
2) Compute the eigenvalues and eigenvectors of S and sort them in descending order with respect to the eigenvalues.
3) Form the actual transition matrix by taking the predefined number of components (eigenvectors).
4) Finally, multiply the original feature space by the obtained transition matrix, which yields a lower-dimensional representation.
The necessary cumulative percentage of variance explained by the principal axes is commonly used as a threshold that defines the number of components to be chosen.
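A minimal NumPy sketch of these four steps (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def pca_transform(X, n_components):
    """Project data onto the leading principal components via
    covariance eigen-decomposition (steps 1-4 above)."""
    # 1) Covariance matrix of the mean-centered input data
    X_centered = X - X.mean(axis=0)
    S = np.cov(X_centered, rowvar=False)
    # 2) Eigenvalues/eigenvectors of S, sorted in descending order of eigenvalue
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 3) Transition matrix: keep the predefined number of components
    W = eigvecs[:, :n_components]
    # 4) Map the original feature space to the lower-dimensional one
    return X_centered @ W, eigvals
```

The cumulative-variance threshold mentioned above corresponds to choosing n_components so that eigvals[:n_components].sum() / eigvals.sum() exceeds the desired percentage.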

Slide 13: Feature transformation example, the "Heart Disease" data set
The extracted features are linear combinations of the original ones, for example:
0.1·Age − 0.6·Sex − 0.73·RestBP − 0.33·MaxHeartRate
−0.01·Age + 0.78·Sex − 0.42·RestBP − 0.47·MaxHeartRate
−0.7·Age + 0.1·Sex − 0.43·RestBP + 0.57·MaxHeartRate
(The slide figure also reports the cumulative percentage of variance covered: 60%, 67%, 87%, and 100%.)

Slide 14: PCA for Classification
Figure: PCA for classification; a) effective work of PCA, b) the case where an irrelevant principal component was chosen from the classification point of view.
PCA gives high weights to features with higher variability, regardless of whether they are useful for classification or not.
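A small synthetic illustration of case (b), assuming NumPy and scikit-learn are available (the data and variable names are made up for this example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
irrelevant = rng.normal(scale=10.0, size=n)       # high variance, carries no class information
discriminative = rng.normal(scale=1.0, size=n)    # low variance
labels = (discriminative > 0).astype(int)         # the class depends only on the low-variance axis
X = np.column_stack([irrelevant, discriminative])

pca = PCA(n_components=1).fit(X)
print(pca.components_[0])                  # ~ ±[1, 0]: PCA keeps the irrelevant direction
z = pca.transform(X).ravel()
print(abs(np.corrcoef(z, labels)[0, 1]))   # close to 0: the retained feature is useless for classification
```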

Slide 15: Class-conditional Eigenvector-based FE
The usual decision is to use some class separability criterion based on a family of functions of scatter matrices: the within-class, the between-class, and the total covariance matrices.
Simultaneous diagonalization algorithm:
1. Transformation of X to Y: Y = Λ^(−1/2) Φ^T X, where Λ and Φ are the eigenvalue and eigenvector matrices of S_W.
2. Computation of S_B in the obtained Y space.
3. Selection of the m eigenvectors of S_B that correspond to the m largest eigenvalues.
4. Computation of the new feature space Z = Ψ^T Y, where Ψ is the set of selected eigenvectors.
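A NumPy sketch of these four steps, assuming the within-class and between-class covariance matrices S_W and S_B are already available (their computation is sketched after the next slide; function and variable names are illustrative):

```python
import numpy as np

def simultaneous_diagonalization(X, S_W, S_B, m):
    """Class-conditional eigenvector-based FE (steps 1-4 above)."""
    # 1) Whitening transformation Y = Λ^(-1/2) Φ^T X, with Λ, Φ taken from S_W
    lam, Phi = np.linalg.eigh(S_W)
    T = Phi @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12)))  # columns give Φ Λ^(-1/2)
    Y = X @ T
    # 2) Between-class covariance expressed in the Y space
    S_B_y = T.T @ S_B @ T
    # 3) Keep the m eigenvectors of S_B (in Y space) with the largest eigenvalues
    lam_b, Psi = np.linalg.eigh(S_B_y)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:m]]
    # 4) New feature space Z = Ψ^T Y
    return Y @ Psi
```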

Slide 16: Parametric Eigenvalue-based FE
The within-class covariance matrix shows the scatter of samples around their respective class expected vectors:
S_W = Σ_{i=1..c} Σ_{j=1..n_i} (x_j^(i) − m^(i)) (x_j^(i) − m^(i))^T
The between-class covariance matrix shows the scatter of the expected vectors around the mixture mean:
S_B = Σ_{i=1..c} n_i (m^(i) − m) (m^(i) − m)^T
where c is the number of classes, n_i is the number of instances in class i, x_j^(i) is the j-th instance of the i-th class, m^(i) is the mean vector of the instances of the i-th class, and m is the mean vector of all the input data.
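A plain NumPy sketch of these two matrices (normalization conventions vary and do not affect the extracted directions; names are illustrative):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_W) and between-class (S_B) covariance matrices."""
    classes = np.unique(y)
    p = X.shape[1]
    m = X.mean(axis=0)                       # mean vector of all the input data
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in classes:
        Xi = X[y == c]                       # instances of class i
        m_i = Xi.mean(axis=0)                # class mean vector m^(i)
        S_W += (Xi - m_i).T @ (Xi - m_i)     # scatter around the class mean
        d = (m_i - m).reshape(-1, 1)
        S_B += len(Xi) * (d @ d.T)           # scatter of class means around the mixture mean
    return S_W, S_B
```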

Slide 17: Nonparametric Eigenvalue-based FE
It tries to increase the number of degrees of freedom in the between-class covariance matrix by measuring the between-class covariances on a local basis; the k-nearest neighbor (kNN) technique is used for this purpose.
The coefficient w_ik is a weighting coefficient that shows the importance of each summand: it assigns more weight to those elements of the matrix that involve instances lying near the class boundaries, which are more important for classification.
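An illustrative two-class-style sketch of this idea: the between-class scatter is accumulated per sample against the local mean of its k nearest neighbours from the other class(es), with a boundary-emphasizing weight in the spirit of Fukunaga's nonparametric discriminant analysis (the exact weighting used in the paper may differ; all names here are illustrative):

```python
import numpy as np

def nonparametric_between_scatter(X, y, k=5, alpha=1.0):
    """Locally computed between-class scatter with boundary weighting."""
    n, p = X.shape
    S_b = np.zeros((p, p))
    for i in range(n):
        same = X[y == y[i]]
        other = X[y != y[i]]
        # distances to own-class neighbours (excluding the sample itself) and to the other class
        d_same = np.sort(np.linalg.norm(same - X[i], axis=1))[1:k + 1]
        d_other_all = np.linalg.norm(other - X[i], axis=1)
        nn_other = other[np.argsort(d_other_all)[:k]]     # k nearest neighbours from the other class
        d_other = np.sort(d_other_all)[:k]
        # weight close to 0.5 near the class boundary, small far away from it
        w = min(d_same[-1] ** alpha, d_other[-1] ** alpha) / (
            d_same[-1] ** alpha + d_other[-1] ** alpha)
        diff = (X[i] - nn_other.mean(axis=0)).reshape(-1, 1)
        S_b += w * (diff @ diff.T)
    return S_b / n
```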

Slide 18: S_B, Parametric vs. Nonparametric
Figure: differences in the between-class covariance matrix calculation for the nonparametric (left) and parametric (right) approaches in the two-class case.

Slide 19: Combining PCs and LDs for SL
Improvement of the parametric class-conditional LDA-based approach by adding a few principal components (PCs) to the linear discriminants (LDs) for further supervised learning (SL).
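A minimal scikit-learn sketch of the combined representation, using LinearDiscriminantAnalysis as a stand-in for the parametric class-conditional FE described above (the paper's experiments used the WEKA library; the function name and defaults here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def lda_plus_pca_features(X_train, y_train, X_test, n_pcs=3):
    """Concatenate the LDs (at most #classes - 1 of them) with a few leading PCs."""
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    pca = PCA(n_components=n_pcs).fit(X_train)
    Z_train = np.hstack([lda.transform(X_train), pca.transform(X_train)])
    Z_test = np.hstack([lda.transform(X_test), pca.transform(X_test)])
    return Z_train, Z_test

# Any base learner can then be trained on the combined features, e.g. 3-NN:
# Z_tr, Z_te = lda_plus_pca_features(X_tr, y_tr, X_te)
# accuracy = KNeighborsClassifier(n_neighbors=3).fit(Z_tr, y_tr).score(Z_te, y_te)
```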

Slide 20: Experimental Settings
21 data sets with different characteristics taken from the UCI machine learning repository.
3 classifiers: 3-nearest neighbor classification (3NN), the Naïve Bayes (NB) learning algorithm, and C4.5 decision tree learning (C4.5).
– The classifiers were used from the WEKA library with their default settings.
3 approaches with each classifier:
– PCA with the classifier
– Parametric LDA with the classifier
– PCA+LDA with the classifier
After PCA we took the 3 main PCs. We took all the LDs (features extracted by parametric LDA), as their number was always equal to #classes − 1.
Multiple test runs of Monte-Carlo cross-validation were made for each data set to evaluate the classification accuracy. In each run, the training set/test set split was 70%/30%, using stratified random sampling to keep the class distributions approximately the same.
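A sketch of this evaluation protocol in scikit-learn terms (the paper used WEKA; the number of runs and the 3-NN learner below are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def monte_carlo_accuracy(X, y, n_runs, test_size=0.3, seed=0):
    """Monte-Carlo cross-validation: repeated stratified 70/30 splits, averaging test accuracy."""
    accuracies = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + run)
        clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
        accuracies.append(clf.score(X_te, y_te))
    return float(np.mean(accuracies))
```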

Slide 22: Ranking of the FE approaches
Ranking of the FE approaches according to the results on the 21 UCI data sets: kNN.

Slide 23: Ranking of the FE approaches
Ranking of the FE approaches according to the results on the 21 UCI data sets: Naïve Bayes.

Slide 24: Ranking of the FE approaches
Ranking of the FE approaches according to the results on the 21 UCI data sets: C4.5.

Slide 25: Accuracy of classifiers, averaged over the 21 datasets.

Slide 26: State transition diagram

Slide 27: Effect of combining PCs with LDs according to the state transition diagrams
Figure: PAR+PCA vs PCA and PAR+PCA vs PAR transitions for kNN, NB, and C4.5 (the per-classifier counts are shown in the slide figure).

Slide 28: When the combination of PCs and LDs is practically useful.

Slide 29: Conclusions
"The curse of dimensionality" is a serious problem in ML/DM/KDD:
– classification accuracy decreases and processing time increases dramatically;
– FE is a common way to cope with this problem: before applying a learning algorithm, the space of instances is transformed into a new space of lower dimensionality, trying to preserve the distances among instances and the class separability.
A classical approach that takes class information into account is Fisher's LDA:
– it tries to minimize the within-class covariance and to maximize the between-class covariance in the extracted features;
– it is well studied and commonly used, and often provides informative features for classification, but
– it extracts no more than #classes − 1 features, and
– it often fails to provide reasonably good classification accuracy even with fairly simple datasets where the intrinsic dimensionality exceeds that number.
A number of ways to solve this problem have been considered:
– many approaches suggest non-parametric variations of LDA (rather time-consuming), which lead to greater numbers of extracted features;
– dataset partitioning and local FE.

Slide 30: Conclusions (cont.)
In this paper we consider an alternative way to improve LDA-based FE for classification: combining the extracted LDs with a few PCs.
Our experiments with the combination of LDs and PCs have shown that the discriminating power of the LDA features can be improved by PCs for many datasets and learning algorithms.
The best performance is exhibited with C4.5. A possible explanation for the good behaviour with C4.5 is that decision trees use implicit feature selection, and thus implicitly select the LDs and/or PCs useful for classification out of the combined set of features, discarding the less relevant and duplicate ones. Moreover, this feature selection is local.

Slide 31: Thank You!
Mykola Pechenizkiy, Department of Computer Science and Information Systems, University of Jyväskylä, Finland
Acknowledgments: the ADMKD reviewers, the COMAS Graduate School of the University of Jyväskylä, Science Foundation Ireland, the WEKA software library, and the UCI datasets.