Feature Subset Selection using Minimum Cost Spanning Trees
Mike Farah
Supervisor: Dr. Sid Ray
Outline
- Introduction: pattern recognition and feature subset selection
- Current methods
- Proposed method
- IFS
- Results
- Conclusion
Introduction: Pattern Recognition
- The classification of objects into groups, learned from a small sample of objects
- Example: apples and strawberries
  - Classes: apples and strawberries
  - Features: colour, size, weight, texture
- Applications: character recognition, voice recognition, oil mining, weather prediction, …
Introduction: Pattern Recognition
- Pattern representation
  - Measuring and recording features: size, colour, weight, texture, …
- Feature set reduction
  - Reducing the number of features used, either by selecting a subset or by transforming the features
- Classification
  - The resulting features are used to classify unknown objects
Introduction: Feature subset selection
- Can be split into two processes: feature subset searching and the criterion function
- Feature subset searching
  - It is not usually feasible to try all feature subset combinations exhaustively: d features give 2^d - 1 non-empty subsets, already over a million for d = 20
- Criterion function
  - The main issue in feature subset selection (Jain et al. 2000)
  - The focus of our research
Current methods
- Euclidean distance
  - Statistical properties of the classes are not considered
- Mahalanobis distance
  - Variances and covariances of the classes are taken into account
- (a sketch of both distance criteria follows below)
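As a concrete illustration, here is a minimal sketch of the two distance-based criteria for a two-class problem, assuming the distances are measured between the class means; the function names and toy data are ours, not the project's code.

```python
# A sketch (not the project's implementation) of the two distance-based
# criterion functions for a two-class problem: the Euclidean distance between
# class means ignores class scatter, while the Mahalanobis distance scales it
# by the pooled covariance of the classes.
import numpy as np

def euclidean_criterion(X1, X2):
    """Distance between class means; the spread of the classes is ignored."""
    return float(np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0)))

def mahalanobis_criterion(X1, X2):
    """Distance between class means scaled by the pooled covariance."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    pooled = ((len(X1) - 1) * np.cov(X1, rowvar=False) +
              (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    return float(np.sqrt(diff @ np.linalg.inv(pooled) @ diff))

# Toy usage: two Gaussian classes over a candidate two-feature subset.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 3.0], size=(50, 2))
X2 = rng.normal([1.0, 1.0], [1.0, 3.0], size=(50, 2))
print(euclidean_criterion(X1, X2), mahalanobis_criterion(X1, X2))
```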
Limitations of Current Methods
Friedman and Rafsky’s two sample test
- A minimum spanning tree approach for determining whether two sets of data originate from the same source
- An MST is built across the pooled data from the two sources, and edges connecting samples of different data sets are removed
- If many edges are removed, the two sets of data are likely to originate from the same source
Friedman and Rafsky’s two sample test
- The method can be used as a criterion function
  - An MST is built across the sample points
  - Edges connecting samples of different classes are removed
- A good subset is one that provides discriminatory information about the classes; therefore, the fewer edges removed, the better (see the sketch below)
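The sketch below shows an MST-based criterion in the spirit of Friedman and Rafsky's test: build an MST over all samples, restricted to a candidate feature subset, and count the edges joining samples of different classes. The helper names and toy data are illustrative, not taken from the project.

```python
# MST cross-class edge count as a criterion function: fewer cross-class edges
# suggests the candidate feature subset separates the classes well.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def cross_class_edges(X, y, feature_subset):
    """Number of MST edges connecting samples with different class labels."""
    D = squareform(pdist(X[:, feature_subset]))   # pairwise Euclidean distances
    mst = minimum_spanning_tree(D).tocoo()        # n-1 edges as a sparse matrix
    return int(np.sum(y[mst.row] != y[mst.col]))

# Toy usage: compare two candidate subsets of a 3-feature data set.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 1, (60, 2)), rng.normal(0, 1, (60, 1))])
X[:30, 0] += 4                                    # feature 0 separates the classes
y = np.array([0] * 30 + [1] * 30)
print(cross_class_edges(X, y, [0, 1]))            # informative subset: few cross edges
print(cross_class_edges(X, y, [2]))               # noise feature only: many cross edges
```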
Limitations of Friedman and Rafsky’s technique
Our Proposed Method
- Use both the number of edges and the edge lengths to judge the suitability of a subset
- A good subset will have:
  - a large number of short edges connecting samples of the same class
  - a small number of long edges connecting samples of different classes
Our Proposed Method
- We experimented with the plain average edge length and a weighted average; the weighted average was expected to perform better (one possible formulation is sketched below)
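The exact scoring used in the project is not reproduced in these slides; the sketch below only illustrates one plausible way edge counts and edge lengths could be combined, under the assumption stated above (reward many short within-class edges, penalise long and frequent between-class edges). The function name and the formula are ours.

```python
# One hypothetical edge-length-based criterion over the MST of the samples.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def edge_length_criterion(X, y, feature_subset):
    """Higher is better: within-class MST edges should be short and numerous,
    between-class edges long and rare (one possible formulation, not the
    project's exact score)."""
    D = squareform(pdist(X[:, feature_subset]))
    mst = minimum_spanning_tree(D).tocoo()
    same = y[mst.row] == y[mst.col]
    within, between = mst.data[same], mst.data[~same]
    if between.size == 0:
        return np.inf                     # no cross-class edges: perfect separation
    if within.size == 0:
        return 0.0                        # every edge crosses classes: worst case
    # ratio of mean edge lengths, weighted by the ratio of edge counts
    return (between.mean() / within.mean()) * (within.size / between.size)
```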
IFS - Interactive Feature Selector
- Developed to allow users to experiment with various feature selection methods
- Automates the execution of experiments
- Allows visualisation of data sets and results
- Extensible: developers can easily add criterion functions, feature selectors and classifiers to the system
IFS - Screenshot
Experimental Framework

Data set              No. Samples   No. Feats   No. Classes
Iris                  150           4           3
Crab                  200           7           2
Forensic Glass        214           9           7
Diabetes              332           8           2
Character Synthetic   750           7           5
Experimental Framework
- Spearman’s rank correlation
  - A good criterion function will correlate well with the classifier: subsets it ranks highly should achieve high accuracy (see the sketch below)
- Subset chosen
  - The final subsets selected by the criterion functions are compared with the optimal subset chosen by the classifier
- Time
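A minimal sketch of the rank-correlation evaluation, assuming that for each candidate subset we already have a criterion score and a classifier accuracy; the variable names and numbers below are illustrative only.

```python
# Spearman's rank correlation between criterion scores and classifier accuracy.
from scipy.stats import spearmanr

criterion_scores    = [0.91, 0.40, 0.75, 0.62, 0.20]   # one value per candidate subset
classifier_accuracy = [0.95, 0.55, 0.80, 0.70, 0.35]   # e.g. K-NN accuracy on the same subsets

rho, p_value = spearmanr(criterion_scores, classifier_accuracy)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
# rho close to 1 means subsets ranked highly by the criterion also achieve
# high classification accuracy, i.e. the criterion is a good surrogate.
```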
Forensic glass data set results
Synthetic data set
Algorithm completion times
Algorithm complexities
- K-NN
- MST criterion functions
- Mahalanobis distance
- Euclidean distance
Conclusion
- MST-based approaches generally achieved higher accuracy and rank correlation, particularly with the K-NN classifier
- The criterion function based on Friedman and Rafsky’s two sample test performed best
Conclusion
- MST approaches are closely related to the K-NN classifier
- The Mahalanobis criterion function is suited to data sets with Gaussian distributions and strong feature interdependence
- Future work:
  - Construct a classifier based on K-NN that gives closer neighbours higher priority (a generic sketch follows)
  - Improve IFS
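The slide only states the idea; the sketch below is a generic distance-weighted K-NN vote, not the classifier the project plans to build, and all names in it are ours.

```python
# Distance-weighted K-NN: each of the k nearest neighbours votes with weight
# 1/distance, so closer neighbours get higher priority.
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
    """Predict the class of sample x from its k nearest training neighbours."""
    d = np.linalg.norm(X_train - x, axis=1)        # distances to all training samples
    nearest = np.argsort(d)[:k]                    # indices of the k nearest neighbours
    weights = 1.0 / (d[nearest] + eps)             # closer neighbours weigh more
    classes = np.unique(y_train[nearest])
    votes = [weights[y_train[nearest] == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]
```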