GA-Based Feature Selection and Parameter Optimization for Support Vector Machine Cheng-Lung Huang, Chieh-Jen Wang Expert Systems with Applications, Volume 31, Issue 2, pp , 2006.
Introduction Support vector machines (SVM) were first suggested by Vapnik (1995) for classification. SVM classifies data with different class labels by determining a set of support vectors that outline a hyperplane in the feature space. The kernel function transforms the training data vectors into the feature space. SVM has been used in a range of problems, including pattern recognition (Pontil and Verri 1998), bioinformatics (Brown et al. 1999), and text categorization (Joachims 1998).
Problems When using SVM, we confront two problems: how to set the best parameters for SVM, and how to choose the input attributes for SVM.
Feature Selection Feature selection is used to identify a powerfully predictive subset of fields within the database and to reduce the number of fields presented to the mining process. It affects several aspects of pattern classification: 1. The accuracy of the learned classification algorithm 2. The time needed to learn a classification function 3. The number of examples needed for learning 4. The cost associated with the features
SVM Parameters Setting Proper parameter settings can improve the classification accuracy of SVM. The parameters that should be optimized include the penalty parameter C and the kernel function parameters, which differ from kernel to kernel. The grid algorithm is an alternative way to find the best C and gamma; however, it is time consuming and does not perform well.
Research Purposes The objective of this research is to optimize the parameters and the feature subset simultaneously, without degrading the classification accuracy of SVM. Genetic algorithms (GA) have the potential to generate both the feature subset and the SVM parameters at the same time.
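To make the cost of the grid approach concrete, here is a minimal sketch of an exhaustive search over C and gamma. The power-of-two grid and the `evaluate` callback are assumptions for illustration and are not taken from the slides; in practice `evaluate` would train an SVM and return its cross-validated accuracy.

```python
from itertools import product


def grid_search(evaluate, c_exponents, gamma_exponents):
    """Try every (C, gamma) pair on the grid and keep the best score."""
    best_c, best_gamma, best_score = None, None, float("-inf")
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = evaluate(c, gamma)
        if score > best_score:
            best_c, best_gamma, best_score = c, gamma, score
    return best_c, best_gamma, best_score


# Hypothetical stand-in for "train an SVM and return its accuracy".
def toy_score(c, gamma):
    return -((c - 8.0) ** 2) - ((gamma - 0.5) ** 2)


# Assumed grid: C = 2^-5 ... 2^15 and gamma = 2^-15 ... 2^3, in steps of 2^2.
best_c, best_gamma, best_score = grid_search(
    toy_score, range(-5, 16, 2), range(-15, 4, 2)
)
```

The number of evaluations is the product of the two grid sizes (110 here), and every evaluation is a full training run, which is why the approach becomes time consuming.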
An Overview of This Paper
Support Vector Machine (SVM) Support vector machine (SVM) is a new technique for data classification, first suggested by Vapnik in 1995. SVM uses a separating hyperplane to distinguish data belonging to two or more classes, and so addresses the classification problem in data mining.
Separating Hyperplane
Slack Variable
Penalty Parameter The slack variable accounts for the cost of overlapping error. Consequently, the objective function must be revised by using the penalty parameter C.
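For reference, the standard soft-margin objective that this description corresponds to can be written as below. The symbols w, b, ξ_i, x_i, y_i and n follow the usual textbook notation and are assumptions here, not notation confirmed by the slides.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

A larger C penalizes margin violations more heavily, while a smaller C tolerates more overlap in exchange for a wider margin.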
Non-Linear Classifier
Kernel Function Three kernel functions are listed: polynomial, RBF, and sigmoidal.
Genetic Algorithm Genetic algorithms (GA) are a general adaptive optimization search methodology based on a direct analogy to Darwinian natural selection.
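As an illustration, the three kernels can be sketched in their commonly used forms. The parameter names (`degree`, `coef0`, `gamma`, `kappa`, `delta`) and their default values are assumptions chosen for the example, not values from the slides.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # K(x, y) = (x . y + coef0)^degree
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    # K(x, y) = tanh(kappa * (x . y) - delta)
    return math.tanh(kappa * dot(x, y) - delta)
```

With the RBF kernel, gamma is the only kernel parameter, so together with C there are two values to tune.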
Wrapper Model of Feature Selection
Chromosomes Design The chromosome comprises three parts: a bit string that represents the value of parameter C, a bit string that represents the value of parameter γ, and a bit string that represents the selected features.
Genotype to Phenotype The bit strings for parameters C and γ are the genotype, which must be transformed into the phenotype. Value: the phenotype value; P_min: the minimum value of the parameter (user defined); P_max: the maximum value of the parameter (user defined); D: the decimal value of the bit string; L: the length of the bit string
Fitness Function Design W_A: weight of the SVM classification accuracy; SVM_accuracy: SVM classification accuracy; W_F: weight of the features; C_i: cost of feature i; F_i: "1" means that feature i is selected, "0" means that feature i is not selected
System Flows for GA-based SVM (1) Data preprocessing: scaling (2) Converting genotype to phenotype (3) Feature subset (4) Fitness evaluation (5) Termination criteria (6) Genetic operation
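A minimal sketch of the decoding step, assuming the usual linear mapping that matches the definitions above: Value = P_min + (P_max - P_min) / (2^L - 1) × D. The function name `decode` and the example ranges are hypothetical.

```python
def decode(bits, p_min, p_max):
    """Map a bit string (list of 0/1) linearly onto [p_min, p_max]."""
    length = len(bits)                                # L
    if length == 0:
        raise ValueError("bit string must not be empty")
    decimal = int("".join(str(b) for b in bits), 2)   # D
    return p_min + (p_max - p_min) * decimal / (2 ** length - 1)


# Example: a 4-bit gene decoded into an assumed range of [1, 100].
lowest = decode([0, 0, 0, 0], 1, 100)    # all zeros map to P_min
highest = decode([1, 1, 1, 1], 1, 100)   # all ones map to P_max
```

The bit-string length L sets the resolution: with L bits the range is divided into 2^L - 1 equal steps.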
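One fitness form consistent with these definitions is fitness = W_A × SVM_accuracy + W_F × (Σ C_i × F_i)^(-1), which rewards accuracy and penalizes the total cost of the selected features. The sketch below assumes that form; the default weights and the handling of an empty feature subset are illustrative assumptions.

```python
def fitness(svm_accuracy, feature_mask, feature_costs, w_a=0.8, w_f=0.2):
    """Weighted sum of accuracy and inverse total cost of selected features."""
    total_cost = sum(c * f for c, f in zip(feature_costs, feature_mask))
    if total_cost == 0:
        # Assumption: a chromosome that selects no features is treated as unfit.
        return 0.0
    return w_a * svm_accuracy + w_f / total_cost


# With equal accuracy, the smaller feature subset scores higher.
small_subset = fitness(0.9, [1, 0, 0], [1, 1, 1])
full_subset = fitness(0.9, [1, 1, 1], [1, 1, 1])
```

When every feature has cost 1, the second term reduces to W_F divided by the number of selected features.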
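Steps (1) and (3) can be sketched as follows. The slides say only "scaling", so the [0, 1] min-max range is an assumption, and both function names are hypothetical.

```python
def min_max_scale(rows):
    """Scale every column of a list-of-lists dataset to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def apply_feature_mask(rows, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


scaled = min_max_scale([[1, 10], [3, 10], [5, 10]])
subset = apply_feature_mask([[1, 2, 3], [4, 5, 6]], [1, 0, 1])
```

Scaling keeps features with large numeric ranges from dominating those with small ranges, and the mask is the feature part of the chromosome applied to the data.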
Figure of System Flows
Experimental Dataset
Columns: No.; Names; #Classes; #Instances; Nominal features; Numeric features; Total features
1. German (credit card)
2. Australian (credit card)
3. Pima-Indian diabetes
4. Heart disease (Statlog Project)
5. Breast cancer (Wisconsin)
6. Contraceptive Method Choice
7. Ionosphere
8. Iris
9. Sonar
10. Statlog project: vehicle
11. Vowel
Experiments Description To guarantee that the present results are valid and can be generalized for making predictions on new data, k-fold cross-validation is used. This study used k = 10, meaning that all of the data are divided into ten parts, each of which takes a turn as the testing data set.
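The fold rotation can be sketched with plain index lists. This is a simplified illustration: samples are assigned to folds in an interleaved order, and shuffling and class stratification are omitted, which is an assumption rather than the procedure used in the study.

```python
def k_fold_indices(n_samples, k=10):
    """Return k (train_indices, test_indices) pairs; each fold is tested once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(25, k=10)
```

Each sample appears in exactly one test fold, so averaging the ten test scores uses every sample once for evaluation.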
Accuracy Calculation For the binary-target datasets, accuracy is reported as the positive hit rate (sensitivity), the negative hit rate (specificity), and the overall hit rate. For the multiple-class datasets, accuracy is reported only as the average hit rate.
Accuracy Calculation Sensitivity is the proportion of cases with the positive class that are classified as positive: P(T+|D+) = TP / (TP + FN). Specificity is the proportion of cases with the negative class that are classified as negative: P(T-|D-) = TN / (TN + FP). Overall hit rate is the overall accuracy, calculated as (TP + TN) / (TP + FP + FN + TN).
Confusion matrix (rows: Predicted (or Test); columns: Target (or Disease)):
Predicted +: Target + is True Positive (TP); Target - is False Positive (FP)
Predicted -: Target + is False Negative (FN); Target - is True Negative (TN)
Accuracy Calculation The SVM_accuracy term in the fitness function is measured by Sensitivity × Specificity for datasets with two classes (positive or negative), and by the overall hit rate for datasets with multiple classes.
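A short sketch of these measures computed from confusion-matrix counts, taking the overall hit rate as correct predictions over all cases. The function names are hypothetical, and the code assumes both classes are present so that no denominator is zero.

```python
def binary_hit_rates(tp, fp, fn, tn):
    """Return (sensitivity, specificity, overall hit rate)."""
    sensitivity = tp / (tp + fn)                # positive hit rate
    specificity = tn / (tn + fp)                # negative hit rate
    overall = (tp + tn) / (tp + fp + fn + tn)   # overall hit rate
    return sensitivity, specificity, overall


def svm_accuracy_binary(tp, fp, fn, tn):
    """SVM_accuracy for two-class datasets: sensitivity * specificity."""
    sensitivity, specificity, _ = binary_hit_rates(tp, fp, fn, tn)
    return sensitivity * specificity


rates = binary_hit_rates(tp=40, fp=10, fn=10, tn=40)
product_score = svm_accuracy_binary(tp=40, fp=10, fn=10, tn=40)
```

Using the product means a classifier that ignores one class scores zero, however high its overall hit rate on an imbalanced dataset.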
GA Parameter Setting Chromosome representation: binary code; Population size: 500; Crossover rate: 0.7 (one-point crossover); Mutation rate: 0.02; Selection: roulette wheel; Replacement: elitism
W_A and W_F Weight W_A and W_F can influence the experimental result through the fitness function. The higher W_A is, the higher the classification accuracy is. The higher W_F is, the smaller the number of features is.
Fold #4 of German Dataset Curve Diagram
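These settings can be put together in a compact GA loop. This is a sketch under stated assumptions: elitism is read as carrying the single best chromosome into the next generation, fitness values are assumed non-negative (as roulette wheel selection requires), chromosomes need at least two bits, and the generation count, seed and toy fitness are illustrative choices.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick an individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with the given probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(fitness_fn, n_bits, pop_size=500, crossover_rate=0.7,
           mutation_rate=0.02, generations=50, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness_fn(ind) for ind in population]
        best_i = max(range(pop_size), key=fitnesses.__getitem__)
        next_population = [population[best_i][:]]  # elitism: keep the best
        while len(next_population) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_population.append(mutate(c1, mutation_rate, rng))
            if len(next_population) < pop_size:
                next_population.append(mutate(c2, mutation_rate, rng))
        population = next_population
    fitnesses = [fitness_fn(ind) for ind in population]
    best_i = max(range(pop_size), key=fitnesses.__getitem__)
    return population[best_i], fitnesses[best_i]


# Toy fitness (count of 1 bits) standing in for the SVM-based fitness.
best_bits, best_fit = run_ga(sum, n_bits=12, pop_size=30, generations=20)
```

In the real system the fitness call would decode C and γ from the chromosome, apply the feature mask, train the SVM, and score the result, so it dominates the running time.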
Experimental Results for German Dataset
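Results summary (GA-based approach vs. grid search)
Columns: Names; Number of original features; Number of selected features; GA-based approach: average positive hit rate, average negative hit rate, average overall hit rate (%); Grid algorithm: average positive hit rate, average negative hit rate, average overall hit rate (%); p-value for Wilcoxon testing
German: original features 24; selected features 13; p-value marked *
Australian: original features 14; selected features 3; p-value marked *
diabetes: original features 8; selected features 3.7
Heart disease: original features 13; selected features 5.4; p-value marked *
breast cancer: original features 10; selected features 1
Contraceptive: original features 9; selected features 5.4±0.53; GA-based positive and negative hit rates N/A; GA-based overall hit rate 71.22±4.15; grid positive and negative hit rates N/A; grid overall hit rate 53.53; p-value marked *
ionosphere: original features 34; selected features 6; p-value marked *
iris: original features 4; selected features 1±0; GA-based positive and negative hit rates N/A; GA-based overall hit rate 100±0; grid positive and negative hit rates N/A; grid overall hit rate 97.37; p-value marked *
sonar: original features 60; selected features 15; p-value marked *
vehicle: original features 18; selected features 9.2±1.4; GA-based positive and negative hit rates N/A; GA-based overall hit rate 84.06±3.54; grid positive and negative hit rates N/A; grid overall hit rate 83.33
Vowel: original features 13; selected features 7.8±1; GA-based positive and negative hit rates N/A; GA-based overall hit rate 99.3±0.82; grid positive and negative hit rates N/A; grid overall hit rate 95.95; p-value marked *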
ROC curve for fold #4 of German Credit Dataset
Average AUC for Datasets Columns: GA-based approach; Grid algorithm. Datasets: German, Australian, diabetes, Heart disease, breast cancer, Contraceptive, ionosphere, iris, sonar, vehicle, Vowel
Conclusion We proposed a GA-based strategy to select the feature subset and to set the parameters for SVM classification. We conducted two experiments to evaluate the classification accuracy of the proposed GA-based approach with the RBF kernel and of the grid search method on 11 real-world datasets from the UCI database. In general, compared with the grid search approach, the proposed GA-based approach achieves good accuracy with fewer features.
Thank You Q & A