Genetic-Algorithm-Based Instance and Feature Selection
Instance Selection and Construction for Data Mining, Ch. 6
H. Ishibuchi, T. Nakashima, and M. Nii
Abstract
A GA-based approach for selecting a small number of instances (and features) from a given data set in a pattern classification problem. The aim is to improve the classification ability of a nearest neighbor classifier by searching for an appropriate reference set.
Genetic Algorithm: Coding and Fitness Function
Coding: a binary string of length (n + m)
- a_i : inclusion or exclusion of the i-th feature
- s_p : inclusion or exclusion of the p-th instance
Fitness function: maximize g(S) while minimizing |F| and |P|
- |F| : number of selected features
- |P| : number of selected instances
- g(S) : classification performance with the reference set S
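A minimal sketch of this coding and fitness, assuming the three objectives are combined as a weighted sum with the weights Wg, WF, WP listed in the parameter slide; the function names and the exact scalarized form are illustrative, not taken verbatim from the chapter.

```python
import numpy as np

def decode(string, n, m):
    """Split a binary string of length n + m into the feature mask a and the instance mask s."""
    a = string[:n]          # a_i = 1 -> i-th feature is selected
    s = string[n:n + m]     # s_p = 1 -> p-th instance is in the reference set S
    return a, s

def fitness(string, n, m, g, Wg=5.0, WF=1.0, WP=1.0):
    """Assumed weighted-sum form: maximize g(S) while minimizing |F| and |P|."""
    a, s = decode(string, n, m)
    F_size = int(np.sum(a))   # |F|: number of selected features
    P_size = int(np.sum(s))   # |P|: number of selected instances
    return Wg * g(a, s) - WF * F_size - WP * P_size
```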
Genetic Algorithm: Performance Measures
First performance measure gA(S): the number of correctly classified training instances
- Minimize |P| subject to gA(S) = m
Second performance measure gB(S): a leave-one-out variant; when an instance x_q is included in the reference set, x_q is not selected as its own nearest neighbor when it is classified.
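A sketch of how gA(S) and gB(S) can be computed with a 1-NN classifier restricted to the selected features and instances; the function name and its arguments are hypothetical, and gB(S) is treated as the leave-one-out variant described above.

```python
import numpy as np

def nn_correct_count(X, y, a, s, leave_one_out=False):
    """Count correctly classified training instances with a 1-NN classifier
    using only the selected features (a) and reference instances (s).
    leave_one_out=False gives gA(S); True gives the gB(S) variant, where an
    instance in the reference set is not allowed to be its own nearest neighbor."""
    feat = np.flatnonzero(a)          # indices of selected features F
    ref = np.flatnonzero(s)           # indices of reference instances P
    if len(feat) == 0 or len(ref) == 0:
        return 0
    correct = 0
    for q in range(len(X)):
        d = np.linalg.norm(X[ref][:, feat] - X[q, feat], axis=1)
        if leave_one_out and s[q] == 1:
            d[ref == q] = np.inf      # exclude x_q from its own neighbor search
        nearest = ref[np.argmin(d)]
        correct += int(y[nearest] == y[q])
    return correct
```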
Genetic Algorithm: Procedure
- Initialization of the population
- Genetic operations: iterate the following procedure Npop/2 times to generate Npop new strings
  - Randomly select a pair of strings
  - Apply uniform crossover
  - Apply a mutation operator
- Generation update: select the Npop best strings from the 2Npop current and generated strings
- Termination test
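A minimal sketch of the loop described above, assuming random pairing of parents, random binary initialization, and a fitness callback such as the one sketched earlier; the mutate hook and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_crossover(p1, p2):
    """Exchange each bit between the two parents independently with probability 0.5."""
    mask = rng.random(len(p1)) < 0.5
    c1, c2 = p1.copy(), p2.copy()
    c1[mask], c2[mask] = p2[mask], p1[mask]
    return c1, c2

def run_ga(fitness_fn, length, n_pop=50, n_gen=500, mutate=None):
    """Minimal GA loop: random parent pairs, uniform crossover, mutation,
    and a generation update that keeps the Npop best of the 2Npop strings."""
    pop = rng.integers(0, 2, size=(n_pop, length))
    for _ in range(n_gen):
        offspring = []
        for _ in range(n_pop // 2):
            i, j = rng.choice(n_pop, size=2, replace=False)   # random pair of strings
            c1, c2 = uniform_crossover(pop[i], pop[j])
            offspring += [mutate(c1), mutate(c2)] if mutate else [c1, c2]
        combined = np.vstack([pop, np.array(offspring)])
        scores = np.array([fitness_fn(s) for s in combined])
        pop = combined[np.argsort(-scores)[:n_pop]]           # keep Npop best of 2Npop
    return pop[0]
```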
Numerical Example
Biased Mutation
An effective way to decrease the number of selected instances is to bias the mutation probability: in the biased mutation, a much larger probability is assigned to the mutation from s_p = 1 to s_p = 0 than to the mutation from s_p = 0 to s_p = 1.
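A sketch of such a biased bit-flip mutation, using the rates listed in the parameter slide (Pm = 0.01 for feature bits, Pm(1 -> 0) = 0.1 and Pm(0 -> 1) = 0.01 for instance bits); the function name and array layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_mutation(string, n, pm_feature=0.01, pm_1to0=0.1, pm_0to1=0.01):
    """Flip feature bits with a single rate, but use a much larger rate for
    instance bits going 1 -> 0 than 0 -> 1, which pushes |P| downward."""
    s = string.copy()
    # feature part: ordinary bit-flip mutation
    flip = rng.random(n) < pm_feature
    s[:n][flip] ^= 1
    # instance part: asymmetric (biased) mutation rates
    inst = s[n:]
    rates = np.where(inst == 1, pm_1to0, pm_0to1)
    flip = rng.random(len(inst)) < rates
    inst[flip] ^= 1
    return s
```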
Data Sets: 2 artificial + 4 real
- Normal distributions with small overlap
- Normal distributions with large overlap
- Iris data
- Appendicitis data
- Cancer data
- Wine data
Parameter Specifications
- Population size: 50
- Crossover probability: 1.0
- Mutation probability: Pm = 0.01 for feature selection; Pm(1 -> 0) = 0.1 and Pm(0 -> 1) = 0.01 for instance selection
- Stopping condition: 500 generations
- Weight values: Wg = 5, WF = 1, WP = 1
- Performance measure: gA(S) or gB(S)
- 30 trials for each data set
Performance on Training Data
Performance on Test Data
- Leave-one-out procedure (iris & appendicitis)
- 10-fold cross-validation (cancer & wine)
Effect of Feature Selection
Effect on NN
Some Variants