Why reduce the number of features?
Having D features, we want to reduce their number to n, where n << D.
Benefits:
- lower computational complexity
- improved classification performance
Danger: possible loss of information
Basic approaches to DR
Feature extraction: a transform t: R^D -> R^n creates a new feature space; the features lose their original meaning.
Feature selection: selection of a subset of the original features.
Principal Component Transform (Karhunen-Loeve)
PCT belongs to feature extraction; t is a rotation y = Tx, where T is the matrix of eigenvectors of the original covariance matrix C_x.
PCT creates D new uncorrelated features y, with C_y = T' C_x T.
The n features with the highest variances are kept.
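A minimal sketch of the PCT in Python/NumPy, assuming the samples are stored as an (N, D) array; the function and variable names are illustrative, not a reference implementation.

```python
import numpy as np

def pct(X, n):
    """Project D-dimensional samples X (shape N x D) onto the n principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    C_x = np.cov(Xc, rowvar=False)          # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C_x)  # eigendecomposition of the symmetric C_x
    order = np.argsort(eigvals)[::-1]       # sort eigenvectors by decreasing variance
    T = eigvecs[:, order[:n]]               # keep the n eigenvectors with largest variance
    Y = Xc @ T                              # new, uncorrelated features for every sample
    return Y, T, eigvals[order]

# Example: reduce 6 correlated features to 2 principal components
X = np.random.randn(500, 6) @ np.random.randn(6, 6)
Y, T, variances = pct(X, n=2)
print(Y.shape)  # (500, 2)
```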
Applications of the PCT
- “optimal” data representation, energy compaction
- visualization and compression of multimodal images
PCT of multispectral images
Example: a satellite image with B, G, R, near-IR, IR, and thermal-IR bands.
Why is PCT bad for classification purposes?
PCT evaluates the contribution of individual features solely by their variance, which may be very different from their discrimination power.
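A small synthetic sketch of this pitfall (illustrative NumPy code, data made up for the example): all the class-discriminating information sits in a low-variance feature, so the first principal component points along the high-variance but useless feature and would be kept, while the discriminative feature would be discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
# Feature 1: large variance, identical in both classes (no discrimination power).
# Feature 2: small variance, but the class means differ (all the discrimination power).
class1 = np.column_stack([rng.normal(0, 10, 300), rng.normal(0.0, 0.5, 300)])
class2 = np.column_stack([rng.normal(0, 10, 300), rng.normal(2.0, 0.5, 300)])

X = np.vstack([class1, class2])
C_x = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C_x)
pc1 = eigvecs[:, np.argmax(eigvals)]
print("first principal component:", np.round(pc1, 2))  # ~[1, 0]: keeps the useless feature
```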
Separability problem
Dimensionality reduction for classification purposes (here illustrated on a two-class problem) must consider the discrimination power of the individual features. The goal is to maximize the “distance” between the classes.
An Example: 3 classes, 3D feature space, reduction to 2D (figure: a projection with high discriminability vs. a projection with low discriminability).
DR via feature selection
Two things are needed:
- a discriminability measure (Mahalanobis distance, Bhattacharyya distance), e.g. MD_12 = (m_1 - m_2)(C_1 + C_2)^(-1)(m_1 - m_2)'
- a selection strategy
Feature selection is thus an optimization problem.
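A short NumPy sketch of the two-class Mahalanobis distance above, used here to score individual features; the synthetic data and names are illustrative only.

```python
import numpy as np

def mahalanobis_12(X1, X2):
    """Two-class Mahalanobis distance MD_12 for samples X1, X2 given as (N, D) arrays."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    C1 = np.atleast_2d(np.cov(X1, rowvar=False))  # class covariance matrices
    C2 = np.atleast_2d(np.cov(X2, rowvar=False))
    d = m1 - m2
    return float(d @ np.linalg.inv(C1 + C2) @ d)

# Example: rank the individual features of two synthetic classes by their 1-D MD
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0, 0.0], 1.0, size=(200, 3))
X2 = rng.normal([0, 2, 0.5], 1.0, size=(200, 3))
scores = [mahalanobis_12(X1[:, [j]], X2[:, [j]]) for j in range(X1.shape[1])]
print(np.argsort(scores)[::-1])  # features ordered by decreasing discriminability
```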
Feature selection strategies
Optimal:
- full search, complexity D! / ((D-n)! n!) candidate subsets
- branch & bound
Sub-optimal:
- direct selection (optimal if the features are not correlated)
- sequential selection (SFS, SBS; see the sketch below)
- generalized sequential selection (SFS(k), Plus k minus m, floating search)
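A minimal sketch of sequential forward selection (SFS), assuming the mahalanobis_12() helper from the previous sketch is available; the greedy loop and names are illustrative.

```python
import numpy as np

def sfs(X1, X2, n):
    """Greedily select n features that maximize the two-class Mahalanobis distance."""
    D = X1.shape[1]
    selected = []                         # indices of the chosen features
    while len(selected) < n:
        best_j, best_score = None, -np.inf
        for j in range(D):
            if j in selected:
                continue
            cand = selected + [j]         # try adding feature j to the current subset
            score = mahalanobis_12(X1[:, cand], X2[:, cand])
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)           # keep the single best addition
    return selected
```

SBS works analogously in the backward direction; the floating variants additionally allow a previously selected feature to be discarded, which mitigates the nesting effect of plain SFS/SBS.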
A priori knowledge in feature selection
The above discriminability measures (MD, BD) assume normally distributed classes; they can be misleading or inapplicable otherwise.
Crucial questions in practical applications:
- Can the class-conditional distributions be assumed to be normal?
- What happens if this assumption is wrong?
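A hedged synthetic illustration of this warning (reusing the mahalanobis_12() helper from the earlier sketch): two perfectly separable but non-Gaussian classes with identical means, for which the mean-based Mahalanobis distance comes out near zero and wrongly suggests the classes cannot be separated.

```python
import numpy as np

rng = np.random.default_rng(1)
angle = rng.uniform(0, 2 * np.pi, 500)
inner = np.column_stack([0.5 * np.cos(angle), 0.5 * np.sin(angle)])  # class 1: small circle
outer = np.column_stack([3.0 * np.cos(angle), 3.0 * np.sin(angle)])  # class 2: large circle

print(mahalanobis_12(inner, outer))  # ~0, although the two classes never overlap
```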
A two-class example (figure: Class 1 vs. Class 2; in one case feature x2 is selected, in the other feature x1 is selected).
Conclusion
- PCT is optimal for representation of “one-class” data (visualization, compression, etc.).
- PCT should not be used for classification purposes; use feature selection methods based on a proper discriminability measure.
- If you still use PCT before classification, be aware of possible errors.