Feature Selection for Multi-Purpose Predictive Models: a Many-Objective Task

Alan P. Reynolds*, David W. Corne and Michael J. Chantler
School of Mathematical and Computer Sciences (MACS), Heriot-Watt University, Edinburgh, Scotland
*A.Reynolds@hw.ac.uk

Introduction

Feature subset selection (the elimination of features from a data set to reduce costs and improve the performance of machine learning algorithms) has been treated previously as a multiobjective optimization problem, minimizing complexity while maximizing accuracy (e.g. [1]) or maximizing sensitivity and specificity [2]. We show how attempting to satisfy each potential user of the resulting data or application leads us to consider the problem as having infinitely many objectives.

Two objectives for binary classification

Binary classification is the art of creating a model, or classifier, that predicts, based on an item's features, whether the item belongs to a class of interest (positive) or not (negative). A classifier (e.g. a spam filter) is trained on items with predetermined class labels and then used to predict for unlabelled items. Counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are used to create the following two objectives:

- Sensitivity: the proportion of those items in the 'positive' class that are correctly identified, i.e. TP / (TP + FN).
- Specificity: the proportion of those items in the 'negative' class that are correctly identified, i.e. TN / (TN + FP).

Classification algorithms

Some classification algorithms simply generate a class prediction. Others generate a probability that the item belongs to the class of interest. This is converted into a class prediction by setting a probability threshold. By allowing the threshold to vary between 0 and 1, we obtain a range of classifiers with different trade-offs between sensitivity and specificity. In a sense the algorithm is multiobjective.

Feature subset selection

Why reduce the number of features in a data set?

- Improve algorithm efficiency and speed up the learning process.
- Produce simpler classifiers that can be more easily comprehended.
- Reduce the cost of obtaining or generating the data.
- Prevent over-fitting.

In the 'wrapper' approach to feature selection, feature subset quality is estimated by applying a simple classification algorithm. Here we wrap a 'multiobjective' classification algorithm, e.g. Naïve Bayes. The performance of a single feature set is given by a graph of specificity against sensitivity (see Fig. 1).

Fig. 1: The performance characteristics of a single feature subset, evaluated using a 'multiobjective' classifier. The objective trade-off preferences of three users are shown.

Infinitely many objectives for feature selection!

We wish to perform feature subset selection to produce a reduced data set to be used in multiple applications or in a single application used by multiple users, such as a texture search engine. How do we optimize feature subset quality when different users may require different trade-offs between sensitivity and specificity?

Answer: we maximize performance for each potential user, or equivalently, we maximize specificity at every value of sensitivity. This is an uncountable set of objectives!

A bad idea?

It has been suggested that dominance based algorithms such as NSGA II [3] perform poorly with more than 4 objectives [4]. Can NSGA II be successfully applied to a problem with an infinite set of objectives? There are reasons for hope:

- In practice, the graph is piecewise horizontal: the number of unique objectives is bounded by the number of items in the training set that are in the class of interest.
- A feature set with a good value of specificity at a sensitivity of 0.5 is likely to have a good value of specificity at a sensitivity of 0.48: objectives are highly correlated.
- Modifications to the dominance relation can be made to further improve algorithm convergence (see 'Modified dominance' below).

Modified dominance

Using standard dominance, if feature set A is to dominate set B, the sensitivity-specificity graph for A must be at least as high as that for B at all points, and higher at at least one point. Our modified dominance relation considers the areas between the two graphs, as shown in Fig. 2.

Fig. 2: Feature subset A dominates feature subset B if there is an orange area but no green area (standard dominance) or if the orange area divided by the green area exceeds a given dominance threshold.

A dominance threshold of 1 results in maximization of the dominated area. An infinite threshold produces the standard dominance relation. Varying the threshold between these values allows us to control the strength of the dominance relation.

Results and conclusions

A variant of NSGA II was applied to feature selection from three data sets. Here we show results from the Ionosphere data set, using Naïve Bayes as the core classification algorithm and a dominance factor of 5. Recall that the quality of each solution is represented by a graph, so the non-dominated set gives a set of graphs. For clarity, we present the envelope of these graphs, i.e. only points non-dominated with respect to sensitivity and specificity.

Fig. 3: Performance on the Ionosphere data set: training data.

Fig. 4: Performance on the Ionosphere data set: test data.

In conclusion, we have demonstrated that feature selection may be successfully approached as a problem with an infinite set of objectives. This approach is most useful when the resulting feature set or application is to be used by many users with different sensitivity-specificity trade-off preferences.

References

[1] Oliveira, L.S., Sabourin, R., Bortolozzi, F. and Suen, C.Y.: A methodology for feature selection using multi-objective genetic algorithms for handwritten digit string recognition. International Journal of Pattern Recognition and Artificial Intelligence 17(6), 903-929 (2003)
[2] Emmanouilidis, C.: Evolutionary multi-objective feature selection and ROC analysis with application to industrial machinery fault diagnosis. In: Evolutionary Methods for Design, Optimisation and Control (2002)
[3] Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In: PPSN 2000, LNCS 1917, 849-858 (2000)
[4] Hughes, E.J.: Evolutionary many-objective optimisation: Many once or one many? In: Proc. 2005 IEEE Congress on Evolutionary Computation (CEC 2005) 1, 222-227 (2005)
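
Illustrative sketch of the modified dominance relation

The area-based dominance test described in Fig. 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that each feature subset's graph is stored as a list of specificity values sampled at the same evenly spaced sensitivity levels, which fits the observation that the graph is piecewise horizontal. The function names `difference_areas` and `dominates` and the example curves are hypothetical. The comparison is strict because the caption says the ratio must exceed the threshold.

```python
import math
from typing import Sequence, Tuple


def difference_areas(spec_a: Sequence[float],
                     spec_b: Sequence[float]) -> Tuple[float, float]:
    """Return (area where A lies above B, area where B lies above A).

    Assumption: both curves give specificity at the same evenly spaced
    sensitivity levels, so every sample covers a strip of equal width.
    """
    if len(spec_a) != len(spec_b) or len(spec_a) == 0:
        raise ValueError("curves must be non-empty and of equal length")
    width = 1.0 / len(spec_a)
    a_above = sum(max(a - b, 0.0) for a, b in zip(spec_a, spec_b)) * width
    b_above = sum(max(b - a, 0.0) for a, b in zip(spec_a, spec_b)) * width
    return a_above, b_above


def dominates(spec_a: Sequence[float],
              spec_b: Sequence[float],
              threshold: float = math.inf) -> bool:
    """Area-based dominance test for two specificity-vs-sensitivity curves.

    A dominates B if A is better somewhere and either B is better nowhere
    (standard dominance) or the ratio of the two areas exceeds `threshold`.
    """
    a_above, b_above = difference_areas(spec_a, spec_b)
    if a_above == 0.0:
        return False          # A is nowhere better than B
    if b_above == 0.0:
        return True           # standard dominance
    return a_above / b_above > threshold


# Hypothetical curves: A is better at two sensitivity levels, B at one.
curve_a = [1.0, 0.9, 0.8, 0.5]
curve_b = [1.0, 0.8, 0.7, 0.6]

print(dominates(curve_a, curve_b, threshold=1.0))  # area ratio is about 2
print(dominates(curve_a, curve_b, threshold=5.0))
print(dominates(curve_a, curve_b))                 # infinite threshold
```

With an infinite threshold the test reduces to standard dominance. With a threshold of 1 it compares the areas under the two graphs, consistent with the remark that this setting maximizes the dominated area.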