FEATURE WEIGHTING THROUGH A GENERALIZED LEAST SQUARES ESTIMATOR J.M. Sotoca (Pattern Recognition in Information Systems, PRIS, 2003)
Feature selection process with validation
[Diagram: original set of features -> Selection -> subset of features -> Evaluation -> goodness of the subset -> Stopping criterion (no: back to Selection; yes: selected subset -> Validation)]
Selection: the search for a subset of features to evaluate.
Evaluation: measures the goodness of the subset under examination.
Stopping criterion: is this the best subset? The criterion can be a threshold on a fixed number of features to select, or the maximisation of the classification accuracy or of the a posteriori probability for a given classification rule. If the answer is no, a new subset of features is searched.
Validation: the training set, restricted to the chosen subset of features, is used to classify a test set.
Filter and wrapper methods
[Diagram. Filter approach: set of input variables -> subset selection algorithm -> learning algorithm. Wrapper approach: set of input variables -> subset selection algorithm, with subset evaluation by the learning algorithm -> learning algorithm.]
There are two groups of feature selection methods.
Filtering methods: the selected subset is independent of the learning method that will use the selected features. These methods yield a relevance ranking of the feature set, from which a subset of the best features is chosen.
Wrapper methods: the evaluation function is based on the same learning algorithm that will later be used for learning on the domain represented by the selected features.
Validation: weighting-selection
The machine learning literature distinguishes two degrees of feature relevance.
Strongly relevant: removing a strongly relevant feature adds ambiguity and generally decreases the classifier performance.
Weakly relevant: the effect of eliminating a weakly relevant feature depends on which other attributes are removed. A feature that is neither strongly nor weakly relevant can be considered irrelevant.
Validation: the goodness of the weights can be shown through the NN rule using weighted distances. This is a validation criterion for the ordering and the quality of the features.
Feature weighting: we obtain a set of weights expressing the degree of relevance of the features.
Feature selection: only the most relevant features are kept, by using a binary weights vector (that is, assigning a value of 1 to relevant features and 0 to irrelevant features).
Filter methods: filter methods yield a relevance ranking. The data set in the figure has 40 continuous features, of which 19 are irrelevant. The features are sorted by weight and the feature with the lowest weight is progressively discarded, following an SBS (Sequential Backward Selection) scheme. The dotted line represents the accuracy when a weight of 1 is assigned to the selected features and 0 to the removed features (feature selection). The solid line represents the accuracy when the weight obtained by ReliefF is assigned to the retained features and 0 to the discarded features (feature weighting on the features not eliminated). Eliminating the irrelevant features and some weakly relevant features improves the classification accuracy. However, removing a strongly relevant feature decreases the classifier performance.
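The validation idea above (NN rule with weighted distances, where a binary weights vector amounts to feature selection) can be sketched as follows. This is only an illustration: the function name, the toy data and the choice of multiplying each squared difference by its weight are assumptions, not taken from the slides.

```python
import numpy as np

def loo_accuracy_weighted_1nn(X, y, weights):
    """Leave-one-out accuracy of the 1-NN rule under a weighted squared distance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = np.asarray(weights, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences, shape (n, n, d)
    dist = np.sum(w * diff ** 2, axis=2)   # assumed form: sum_i w_i * (difference_i)^2
    np.fill_diagonal(dist, np.inf)         # an instance cannot be its own neighbour
    nearest = np.argmin(dist, axis=1)      # index of the nearest neighbour of each instance
    return float(np.mean(y[nearest] == y))

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is large-scale noise.
X_demo = [[0.0, 0.0], [0.1, 5.0], [1.0, 0.1], [1.1, 5.1]]
y_demo = [0, 0, 1, 1]

acc_unweighted = loo_accuracy_weighted_1nn(X_demo, y_demo, [1.0, 1.0])
acc_selected = loo_accuracy_weighted_1nn(X_demo, y_demo, [1.0, 0.0])
```

With the binary vector [1, 0] the noisy feature is ignored, which is the feature selection case; real-valued weights in the same call give the feature weighting case.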
Comparison of feature weighting methods
Nearest hit: for each instance x, the nearest neighbour with the same class.
Nearest miss: for each instance x, the nearest neighbour with a different class.
ReliefF algorithm (Kononenko, 1994): for each feature, and over m instances drawn at random from the TS, this algorithm calculates the difference between the nearest miss and the nearest hit. ReliefF is an extension of Relief to multi-class data sets.
Class Weighted-L2 (CW_L2) (Paredes and Vidal, 2000): this method obtains a set of weights (one weight per attribute and class) by gradient-descent minimisation of an appropriate criterion function based on the ratio of the nearest-hit distance to the nearest-miss distance.
[Speaker note: change this slide.]
Relief_F: chooses instances in the training set and, for each of them, finds its nearest neighbour of the same class (near hit) and its nearest neighbour of a different class (near miss). A feature is more relevant if it distinguishes between an instance and its near miss, and less relevant if it distinguishes between an instance and its near hit.
GLS (Generalized Least Squares): we use generalized least squares to minimise a criterion function in order to obtain a set of weights that captures the order of relevance of the feature set. We have compared distance-based methods to observe their behaviour. GLS behaves similarly to Relief on data sets with irrelevant features and obtains better results than Relief when all attributes have similar relevance.
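The nearest-hit and nearest-miss idea can be illustrated with a minimal sketch of the basic Relief update. It is a simplification, not the ReliefF algorithm itself: it uses a single nearest hit and a single nearest miss per instance, and the function name, the range normalisation and the toy data are assumptions made for the example. It also assumes every class has at least two instances.

```python
import numpy as np

def relief_weights(X, y, n_samples=None, seed=0):
    """Basic Relief: reward features that differ at the nearest miss, penalise those that differ at the nearest hit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                          # constant features: avoid division by zero
    Xn = (X - X.min(axis=0)) / span                # scale every feature to [0, 1]
    m = n if n_samples is None else min(n_samples, n)
    rng = np.random.default_rng(seed)
    picks = rng.choice(n, size=m, replace=False)   # m instances drawn at random from the TS
    w = np.zeros(d)
    for i in picks:
        dist = np.sum((Xn - Xn[i]) ** 2, axis=1)
        dist[i] = np.inf                           # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest neighbour of the same class
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest neighbour of a different class
        w += np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])
    return w / m

# Toy data (hypothetical): feature 0 is relevant, feature 1 is noise.
X_demo = [[0.0, 0.0], [0.1, 5.0], [1.0, 0.1], [1.1, 5.1]]
y_demo = [0, 0, 1, 1]
w_demo = relief_weights(X_demo, y_demo)
```

On this toy set the relevant feature receives a positive weight and the noisy one a negative weight, which is the ranking behaviour a filter method relies on.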
The Generalized Least Squares (GLS)
Initialisation: wi = 1.0; n = (d x K) + 2 is the number of observations for each instance x. Qll is set to the identity matrix, assuming isotropic error in the observations.
In each iteration t, do:
Calculate the matrices A, B and Qww = B Qll B^T, and the vector of residual functions W.
Calculate the new weights wt:
Until the residual or the leave-one-out error rate is minimum.
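As an illustration of one iteration, the sketch below applies the textbook generalised least squares correction for a linearised model with parameter Jacobian A, observation Jacobian B, cofactor matrix Qll and residual vector W. Treating this as the update intended on the slide is an assumption, and the function name and the linear toy problem are hypothetical.

```python
import numpy as np

def gls_step(w, A, B, Qll, W):
    """One linearised GLS correction of the parameters w (textbook form, assumed here)."""
    Qww = B @ Qll @ B.T                              # cofactor matrix of the residual functions
    Qww_inv = np.linalg.inv(Qww)
    N = A.T @ Qww_inv @ A                            # normal matrix
    delta = -np.linalg.solve(N, A.T @ Qww_inv @ W)   # correction to the parameters
    return w + delta

# Toy check (hypothetical): residual functions W = A w - y, so dW/dy = -I and Qll = I.
A_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_obs = np.array([1.0, 3.0, 5.0])
w0 = np.ones(2)                                      # initialisation wi = 1.0
B_demo = -np.eye(3)
Qll_demo = np.eye(3)                                 # isotropic error in the observations
w1 = gls_step(w0, A_demo, B_demo, Qll_demo, A_demo @ w0 - y_obs)
```

For this linear toy problem a single step already reaches the ordinary least squares solution; with a non-linear criterion, A, B and W would be recomputed at every iteration until the stopping rule is met.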
A class-intensity-based model
Class intensity: the sum of the influences of each neighbour pk, with class label c(pk), on an instance x of the training set (TS). This influence is the inverse of the squared distance D:
w: weights vector (the parameters of the model).
Observations vector in the TS: formed by the set of d x K differences that take part in the neighbourhood, where K is the number of neighbours and d is the number of dimensions.
The class charge C is defined as follows:
A class-intensity-based model
The squared criterion distance D can be expressed as follows:
where max(xi) and min(xi) are the maximum and minimum values of feature i.
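A minimal sketch of how a class intensity of this kind could be computed is given below. Several details are assumptions made only for illustration: the distance is taken as a weighted sum of squared differences normalised by the feature range, the class charge is taken as +1 for a neighbour of the same class and -1 otherwise, and a small eps guards against a zero distance. All function and variable names are hypothetical.

```python
import numpy as np

def criterion_distance(x, p, w, fmin, fmax):
    """Assumed form: sum_i w_i * ((x_i - p_i) / (max(x_i) - min(x_i)))^2."""
    x, p, w = (np.asarray(v, dtype=float) for v in (x, p, w))
    fmin = np.asarray(fmin, dtype=float)
    fmax = np.asarray(fmax, dtype=float)
    span = np.where(fmax > fmin, fmax - fmin, 1.0)   # range of each feature
    return float(np.sum(w * ((x - p) / span) ** 2))

def class_intensity(x, label, neighbours, neighbour_labels, w, fmin, fmax, eps=1e-12):
    """Sum over the neighbours of charge / D, with an assumed charge of +1 or -1."""
    total = 0.0
    for p, c in zip(neighbours, neighbour_labels):
        charge = 1.0 if c == label else -1.0         # assumption: sign encodes class agreement
        total += charge / (criterion_distance(x, p, w, fmin, fmax) + eps)
    return total

# Toy example (hypothetical): two neighbours at the same normalised distance from x.
x_demo = [0.0, 0.0]
nbrs = [[1.0, 0.0], [0.0, 2.0]]
fmin_demo, fmax_demo = [0.0, 0.0], [2.0, 4.0]
d_demo = criterion_distance(x_demo, nbrs[0], [1.0, 1.0], fmin_demo, fmax_demo)
e_same = class_intensity(x_demo, 0, nbrs, [0, 0], [1.0, 1.0], fmin_demo, fmax_demo)
e_mixed = class_intensity(x_demo, 0, nbrs, [0, 1], [1.0, 1.0], fmin_demo, fmax_demo)
```

Under these assumptions the intensity is largest when all neighbours share the class of x, and it cancels out when equally distant neighbours disagree.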
Feature Weight Estimation
For each instance x ∈ TS, the criterion function to minimise is:
where Ex1 is the class intensity in the current iteration, computed with the weights w, and Ex2 is the class intensity when all neighbours have the same class label, computed with wa, the weights vector obtained by the model in the previous iteration. The model parameters w = {w1,...,wd}, in the d-dimensional feature space, capture the relevance of the features.
Feature Weight Estimation
The observations vector is the set of all differences indexed by k = 1,...,K and i = 1,...,d. We also add Ex1 and Ex2 to our observations on the instance x, which gives the (d x K) + 2 observations per instance. The vector of residual functions is defined as follows:
Description of data sets
The main characteristics are summarised in the table (the number of irrelevant features is given in brackets). Six artificial databases (Led+17, Monk 1-3, Waveform and Waveform+40) have been chosen to evaluate performance under controlled conditions.

Data set      Features   Classes   Instances
Led+17        24 (17)    10        2000
Waveform      21         3         5000
Waveform+40   40 (19)
Monk1         6 (3)      2         556
Monk2         6                    601
Monk3                              494
Diabetes      8                    768
Glass         9                    214
Heart         13                   270
Vowel                              528
Vehicle       18         4         848
Wine                               178
Empirical Results
Validation with the k-NN classification rule. The label (wi = 1.0) denotes non-weighted k-NN classification. The first five columns correspond to the results obtained with the 1-NN rule, while the last columns are those of the best k-NN classifiers (1 ≤ k ≤ 21).
Learning capability
[Figures: effect of TS size in the Led+17 database; effect of TS size in the Monk2 database]
Validation: learning ability. We study the effect of using training sets of different sizes on the classification accuracy of the feature weighting methods.
Right: a binary data set with 24 features, of which 17 are irrelevant. Both algorithms (ReliefF and GLS) find the 7 relevant features and need only around 100 prototypes to reach the optimal classification.
Left: a categorical data set with 6 features, all of them relevant, although there are small differences in relevance between features. The experiments with this and other data sets suggest that, when all attributes are relevant, GLS learns faster than ReliefF.
Concluding remarks
A new feature weighting method has been introduced. It basically consists of minimising a criterion function through generalised least squares (GLS). The behaviour of the proposed GLS algorithm is similar to that of the well-known ReliefF approach. Regarding the learning rate of the ReliefF and GLS models, both obtain good results in the presence of irrelevant attributes, while GLS obtains better results when all attributes are relevant.
Further work
Movement of the set of observed data. Detection of outliers. Simultaneous fit of multiple models. Feature selection by class.