Feature Grouping-Based Fuzzy-Rough Feature Selection Richard Jensen, Neil Mac Parthaláin, Chris Cornelis
Outline Motivation/Feature Selection (FS) Rough set theory Fuzzy-rough feature selection Feature grouping Experimentation
The problem: too much data The amount of data is growing exponentially – Staggering 4300% annual growth in global data Therefore, there is a need for FS and other data reduction methods – Curse of dimensionality: a problem for machine learning techniques The complexity of the problem is vast – (e.g. the powerset of features for FS)
Feature selection Remove features that are: – Noisy – Irrelevant – Misleading Task: find a subset that – Optimises a measure of subset goodness – Has small/minimal cardinality In rough set theory, this is a search for reducts – Much research in this area
Rough set theory (RST) For a subset of features P, a concept X is approximated through the equivalence classes [x]_P: Lower approximation: $\underline{P}X = \{x \mid [x]_P \subseteq X\}$ Upper approximation: $\overline{P}X = \{x \mid [x]_P \cap X \neq \emptyset\}$ [Figure: set X with its lower and upper approximations built from the equivalence classes induced by P]
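To make the approximations concrete, here is a minimal Python sketch (illustrative only, not from the slides) that computes the crisp lower and upper approximations of a concept X from the equivalence classes induced by a feature subset P:

```python
def equivalence_classes(universe, key):
    """Partition the universe into equivalence classes; key(x) returns the
    tuple of values that object x takes on the feature subset P."""
    classes = {}
    for x in universe:
        classes.setdefault(key(x), set()).add(x)
    return list(classes.values())

def approximations(universe, key, X):
    """Crisp lower and upper approximations of concept X w.r.t. subset P."""
    lower, upper = set(), set()
    for eq in equivalence_classes(universe, key):
        if eq <= X:      # [x]_P entirely inside X -> certainly in X
            lower |= eq
        if eq & X:       # [x]_P overlaps X        -> possibly in X
            upper |= eq
    return lower, upper

# Toy data: four objects, two features; the concept X = {1, 3} is rough here
values = {1: (0, 'a'), 2: (0, 'a'), 3: (1, 'b'), 4: (1, 'b')}
low, up = approximations(values, key=lambda x: values[x], X={1, 3})
print(low, up)   # set() {1, 2, 3, 4}: X cannot be defined exactly from these features
```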
Rough set feature selection By considering more features, concepts become easier to define…
Rough set theory Problems: – Rough set methods (usually) require data discretization beforehand – Extensions require thresholds, e.g. tolerance rough sets – Also, no flexibility in the approximations: objects either belong fully to the lower (or upper) approximation, or not at all
Fuzzy-rough sets Extends rough set theory – Use of a fuzzy tolerance relation instead of crisp equivalence – Approximations are fuzzified – Collapses to traditional RST when data is crisp New definitions: Fuzzy upper approximation: $\mu_{R_P \uparrow X}(x) = \sup_{y \in U} T(\mu_{R_P}(x, y), \mu_X(y))$ Fuzzy lower approximation: $\mu_{R_P \downarrow X}(x) = \inf_{y \in U} I(\mu_{R_P}(x, y), \mu_X(y))$ where T is a t-norm and I a fuzzy implicator
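As a concrete illustration, below is a minimal sketch assuming the Łukasiewicz implicator, the minimum t-norm, and a precomputed fuzzy tolerance relation R for the feature subset P (the actual connectives used in FRFS may differ):

```python
import numpy as np

def fuzzy_lower(R, mu_X, implicator=lambda a, b: np.minimum(1.0, 1.0 - a + b)):
    """mu_{R_P down X}(x) = inf_y I(R(x, y), mu_X(y)).
    R: (n, n) fuzzy tolerance relation on the objects for feature subset P.
    mu_X: (n,) membership of each object in concept X."""
    return np.min(implicator(R, mu_X[None, :]), axis=1)

def fuzzy_upper(R, mu_X, tnorm=np.minimum):
    """mu_{R_P up X}(x) = sup_y T(R(x, y), mu_X(y))."""
    return np.max(tnorm(R, mu_X[None, :]), axis=1)

# Toy example: 3 objects, crisp concept X = {0, 1}
R = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
mu_X = np.array([1.0, 1.0, 0.0])
print(fuzzy_lower(R, mu_X))   # degree to which each object certainly belongs to X
print(fuzzy_upper(R, mu_X))   # degree to which each object possibly belongs to X
```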
Fuzzy-rough feature selection Search for reducts – Minimal subsets of features that preserve the fuzzy lower approximations for all decision concepts Traditional approach – Greedy hill-climbing algorithm used – Other search techniques have been applied (e.g. PSO) Problems – Complexity is problematic for large data (e.g. over several thousand features) – No explicit handling of redundancy
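The traditional greedy hill-climbing search can be sketched as follows; `dependency` is assumed to be a subset-quality function (e.g. the fuzzy-rough dependency built from the lower approximations), and the stopping rule shown (no further improvement) is one common choice rather than the definitive FRFS criterion:

```python
def greedy_hill_climb(features, dependency, tol=1e-6):
    """Greedy forward selection: repeatedly add the single feature that most
    increases the subset-quality measure, stopping when no candidate improves it."""
    subset = []
    best = dependency(subset)
    improved = True
    while improved:
        improved, best_feat = False, None
        for f in features:
            if f in subset:
                continue
            score = dependency(subset + [f])
            if score > best + tol:        # strictly better than anything seen so far
                best, best_feat, improved = score, f, True
        if best_feat is not None:
            subset.append(best_feat)
    return subset

# Toy usage with a stand-in quality function (counts distinct "useful" features)
useful = {"a", "c"}
print(greedy_hill_climb(["a", "b", "c"], lambda S: len(set(S) & useful)))  # ['a', 'c']
```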
Feature grouping Idea: we don't need to consider all features – Features that are highly correlated with each other carry the same or similar information – Therefore, we can group these and work on a group-by-group basis This paper: based on greedy hill-climbing – Group-then-rank approach Relevancy and redundancy handled by – Correlation: similar features grouped together (redundancy) – Internal ranking by correlation with the decision feature (relevancy)
Forming groups of features [Figure: data → correlation measure (threshold τ) → feature groups → internally-ranked feature groups; feature–feature correlation above τ captures redundancy, ranking within each group by correlation with the decision feature captures relevancy] A sketch of this grouping stage follows.
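One possible realisation of the grouping stage (an illustrative reconstruction using Pearson correlation; the paper's actual measure and thresholding may differ):

```python
import numpy as np

def form_groups(X, y, tau=0.8):
    """Group features whose absolute pairwise correlation exceeds tau (redundancy),
    then rank each group by absolute correlation with the decision y (relevancy).
    X: (n_samples, n_features) array; y: (n_samples,) numeric decision feature."""
    n_feats = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))           # feature-feature correlations
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_feats)])
    groups, assigned = [], set()
    for j in range(n_feats):
        if j in assigned:
            continue
        members = [k for k in range(n_feats) if k not in assigned and corr[j, k] >= tau]
        assigned.update(members)
        # internal ranking: most relevant feature first
        groups.append(sorted(members, key=lambda k: relevance[k], reverse=True))
    # order the groups by the relevance of their top-ranked feature
    groups.sort(key=lambda g: relevance[g[0]], reverse=True)
    return groups
```

Lowering τ produces fewer, larger groups (coarser granularity); raising it produces many small groups, approaching the traditional per-feature search.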
Selecting features [Figure: feature subset search and selection – a search mechanism and subset evaluation interact to produce the selected subset(s)]
Fuzzy-rough feature grouping
Initial experimentation Setup: – 10 datasets ( features) – 3 classifiers – Stratified 5 x 10-fold cross-validation Performance evaluation in terms of – Subset size – Classification accuracy – Execution time FRFG compared with – Traditional greedy hill-climber (GHC) – GA & PSO (200 generations, population size: 40)
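The evaluation protocol can be reproduced along these lines (a sketch using scikit-learn on stand-in data; the original experiments used Weka classifiers such as JRip and IBk):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the experiments this would be a dataset reduced by FRFG/GHC/GA/PSO
X_sel, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)  # stratified 5 x 10-fold CV
clf = KNeighborsClassifier(n_neighbors=3)                               # an IBk (k=3) analogue
scores = cross_val_score(clf, X_sel, y, cv=cv)
print(scores.mean(), scores.std())
```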
Results: average subset size
Results: classification accuracy (JRip; IBk, k=3)
Results: execution times (s)
Conclusion FRFG – Motivation: reduce computational overhead; improve handling of redundancy – Group-then-rank approach – Parameter τ determines the granularity of grouping – Weka implementation available: Future work – Automatic determination of parameter τ – Experimentation with much larger data, other FS methods, etc. – Clustering of features – Unsupervised selection?
Thank you!
Simple example Dataset of six features After initialisation, the following groups are formed: F1, F2, F3, F4, etc. Within each group, the internal rank determines relevance: e.g. f4 is more relevant than f3 Ordering of groups presented to the greedy hill-climber: F = {F4, F1, F3, F5, F2, F6}
Simple example... First group to be considered: F4 – Feature f4 is ranked highest, so it is preferable over the others – Add it to the current (initially empty) subset R – Evaluate M(R + {f4}): if this scores better than the current best evaluation, store f4 and set the current best evaluation = M(R + {f4}) – The set of features which appear in F4 ({f1, f4, f5}) is added to the set Avoids – The next group considered is the next one in the ordering whose elements do not appear in Avoids: F1 – And so on… (the selection loop is sketched below)
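Putting the walkthrough together, one plausible reading of the selection loop (assuming internally-ranked groups as produced above and a subset-quality function M, e.g. the fuzzy-rough dependency; not the authors' exact implementation):

```python
def frfg_select(groups, M, tol=1e-6):
    """Group-then-rank selection: consider groups in order, add a group's
    top-ranked feature if it improves the subset quality M, and skip later
    groups whose members have already been covered (the 'Avoids' set)."""
    R, avoids = [], set()
    best = M(R)
    for group in groups:
        if any(f in avoids for f in group):   # group already represented
            continue
        top = group[0]                        # internally top-ranked feature
        score = M(R + [top])
        if score > best + tol:
            R.append(top)
            best = score
        avoids.update(group)                  # all members of this group are now covered
    return R
```

Because each group is examined at most once and only its top-ranked member is evaluated, the number of subset evaluations is bounded by the number of groups rather than the number of features, which is where the reduction in computational overhead comes from.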