
1 Boris Babenko, Department of Computer Science and Engineering, University of California, San Diego. Semi-supervised and Unsupervised Feature Scaling

2 Abstract Feature selection is an important problem in machine learning. In high-dimensional data it is critical to weed out noisy features, which have no discriminatory power, before applying standard learning algorithms. This problem has been studied extensively in the supervised setting; however, the literature on semi-supervised and unsupervised feature selection is sparse. Furthermore, while feature selection algorithms pick a discrete subset of features, it is a more challenging and interesting problem to assign a weight to each feature based on its discriminatory power. In this sense, feature selection can be thought of as a sub-problem of feature scaling in which the weights are binary. In this project I propose two simple feature scaling algorithms – one for the semi-supervised setting, and one for the unsupervised setting.

3 Introduction Although feature scaling/selection is reminiscent of dimensionality reduction, it is a fundamentally different problem. Dimensionality reduction methods such as PCA search for directions that explain most of the variance in the data, which is clearly not appropriate for feature selection: there is no clear correlation between variance and discriminatory power (see Fig. 1). Also, methods like PCA return a brand new set of directions, rather than a subset or scaling of the original features. Figure 1: The y axis explains most of the variance in the data; nevertheless, the x axis clearly has more power to discriminate between the three classes.
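To make the point of Figure 1 concrete, here is a minimal sketch (not from the slides; the dataset and numbers are invented for illustration) in which the direction of maximum variance carries no class information:

    # The first principal component aligns with the noisy, uninformative axis.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n = 300
    # x separates three classes; y is high-variance noise shared by all classes
    x = np.concatenate([rng.normal(-2, 0.2, n), rng.normal(0, 0.2, n), rng.normal(2, 0.2, n)])
    y = rng.normal(0, 10.0, 3 * n)
    X = np.column_stack([x, y])

    pca = PCA(n_components=1).fit(X)
    print(pca.components_)  # approximately [0, 1]: the high-variance but useless y axis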

4 Previous Work Supervised Feature Selection  Filter methods: do not depend on the classifier that is subsequently used; they rely purely on the data. Methods include the popular RELIEF (Kira and Rendell) and FOCUS (Almuallim and Dietterich) algorithms.  Wrapper methods: "wrap" around the classifier algorithm that is used; the classifier is used to evaluate particular feature subsets. The popular AdaBoost (Freund and Schapire) method is analogous to a wrapper method if one considers weak classifiers that are decision stumps (which operate on only one feature).  Many of these methods perform feature selection by first computing feature scales and then applying some sort of threshold. Unsupervised Feature Selection  This field has been sparsely studied. Existing approaches include genetic algorithms, which have no performance guarantees, and methods that make assumptions about the class distributions (e.g. that the classes are Gaussian or multinomial). Semi-supervised Feature Selection  Semi-supervised feature selection remains largely untouched territory. Semi-supervised learning has become increasingly popular in recent years, but surprisingly there is practically no literature on feature selection in this context.
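For reference, a minimal sketch of the RELIEF-style weighting idea mentioned above, assuming a binary-labeled dataset; this illustrates the general scheme rather than the exact variant of Kira and Rendell:

    import numpy as np

    def relief_weights(X, y, n_iter=100, rng=None):
        """Features whose values differ more across classes than within a class get larger weights."""
        if rng is None:
            rng = np.random.default_rng(0)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iter):
            i = rng.integers(n)
            xi, yi = X[i], y[i]
            dists = np.abs(X - xi).sum(axis=1)   # L1 distance to every point
            dists[i] = np.inf                    # never pick the point itself
            same_idx = np.where(y == yi)[0]
            diff_idx = np.where(y != yi)[0]
            hit = X[same_idx[np.argmin(dists[same_idx])]]   # nearest same-class point
            miss = X[diff_idx[np.argmin(dists[diff_idx])]]  # nearest other-class point
            w += np.abs(xi - miss) - np.abs(xi - hit)
        return w / n_iter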

5 Illustrations Figure 2: An illustration of the three different learning settings: supervised, semi-supervised, and unsupervised (the panels show x1 vs. x2 scatter plots for each setting).

6 A Simple Algorithm for Feature Scaling A simple greedy algorithm for feature selection starts with an empty set and keeps adding the feature that most improves some criterion function score. The running time of this algorithm is O(Nn) criterion evaluations, where n is the number of features and N is the size of the selected feature subset. This is computationally very expensive, and it is not clear how it could be extended to feature scaling. Instead, score each feature independently (see the sketch below).
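A sketch of the two strategies just described; criterion is a hypothetical placeholder for whatever subset-quality score is used:

    import numpy as np

    def greedy_forward_selection(X, y, criterion, n_select):
        """O(N*n) criterion evaluations: all n features are scanned for each of the N picks."""
        selected = []
        remaining = list(range(X.shape[1]))
        for _ in range(n_select):
            best_j = max(remaining, key=lambda j: criterion(X[:, selected + [j]], y))
            selected.append(best_j)
            remaining.remove(best_j)
        return selected

    def independent_feature_scores(X, y, criterion):
        """Cheaper alternative: score each feature on its own; the scores can serve
        directly as continuous feature weights rather than a binary subset."""
        return np.array([criterion(X[:, [j]], y) for j in range(X.shape[1])])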

7 Estimating the Discriminative Power of a Feature Semi-supervised case  Note: for this study, assume k = 2 and that there are equal numbers of +1-labeled and -1-labeled points.  Project the data (both labeled and unlabeled) onto feature j and use k-means to cluster it. Let C_1 and C_2 be the two clusters, C_ip be the set of +1 points in cluster i, and C_in be the set of -1 points in cluster i. Unsupervised case  Use a stability measure: project the data onto feature j, run k-means N times, and record the positions of the centroids. Let c_1, c_2, ..., c_N be the vectors of centroid positions, one per run.
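The transcript does not spell out the exact scoring formulas, so the following is a hedged reconstruction: a cluster-purity-style score for the semi-supervised case, and a centroid-recording loop for the unsupervised stability measure (the final aggregation of the recorded centroids is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def semi_supervised_score(xj, labels):
        """xj: 1-D projection onto feature j; labels: +1 / -1 for labeled points, 0 for unlabeled.
        k = 2 is assumed, as in the slides."""
        assign = KMeans(n_clusters=2, n_init=10).fit_predict(xj.reshape(-1, 1))
        score = 0.0
        for c in (0, 1):
            pos = np.sum((assign == c) & (labels == +1))   # |C_ip|
            neg = np.sum((assign == c) & (labels == -1))   # |C_in|
            # assumed purity measure: reward clusters dominated by one label
            score += abs(pos - neg)
        return score

    def unsupervised_score(xj, n_runs=10):
        """Stability measure: run k-means n_runs times on feature j and record
        the centroid positions c_1 ... c_N, one vector per run."""
        centroids = []
        for r in range(n_runs):
            km = KMeans(n_clusters=2, n_init=1, init="random", random_state=r)
            km.fit(xj.reshape(-1, 1))
            centroids.append(np.sort(km.cluster_centers_.ravel()))  # sort for consistent ordering
        centroids = np.asarray(centroids)
        # assumed aggregation of the recorded positions (one plausible reading of the slide)
        return centroids.var(axis=0).sum()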

8 Testing on a Synthetic Dataset The first evaluation of the algorithms was done on a synthetic data set. The dataset contained two classes, with four valuable features and two features of pure noise; it is shown in Figure 3. The resulting feature weights were – Relief: 0.0086, 0.2647, 0.5809, 0.1458, 0.0000, 0.0000; Semi-supervised Algorithm: 0.2500, 0.2500, 0.2500, 0.2500, 0, 0; Unsupervised Algorithm: 25.4548, 139.6496, 0.0191, 165.1732, 0, 0. Figure 3: Synthetic data set.
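A comparable synthetic setup can be generated as follows; this is only a stand-in, since the slide does not give the exact generation procedure (scikit-learn's make_classification is used here):

    from sklearn.datasets import make_classification

    # Two classes, four informative features, two pure-noise features.
    X, y = make_classification(
        n_samples=500, n_features=6,
        n_informative=4, n_redundant=0, n_repeated=0,
        n_classes=2, shuffle=False, random_state=0,
    )
    # With shuffle=False the informative columns come first, so the last two
    # columns carry no class information, mirroring the slide's setup.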

9 Semi-supervised Case: Experimental Results on OCR Data Figure 4: The semi-supervised algorithm was tested on the OCR dataset with an SVM classifier (panels: recognizing characters '4' and '9', and recognizing characters '7' and '9'). First, the performance of the SVM was measured on the original data. Then, the data was passed through RELIEF (as described in [1]), a popular supervised feature selection algorithm, and the performance of the SVM was measured again. Lastly, the data was passed through the semi-supervised feature scaling algorithm described here, and the performance of the SVM was measured. For each training set size, 100 trials were run and the mean error was computed. The SVMlight package by Joachims was used as the SVM implementation.
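A sketch of the evaluation protocol described in the caption, using scikit-learn's SVC as a stand-in for SVMlight; the helper name and the weight-by-multiplication step are assumptions:

    import numpy as np
    from sklearn.svm import SVC

    def error_rate(X_tr, y_tr, X_te, y_te, weights=None):
        """Optionally rescale every feature by its weight before training."""
        if weights is not None:
            X_tr, X_te = X_tr * weights, X_te * weights
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        return np.mean(clf.predict(X_te) != y_te)

    # Per the caption: average this error over many random trials for each
    # training-set size, with and without the feature weights applied.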

10 Semi-supervised Case: Visualizing Results Figure 5: A nice property of working with the OCR data set is that each feature corresponds to a pixel, so it is easy to visualize which pixels are important for discriminating between certain digits. The images show the feature weights produced by the semi-supervised algorithm and by supervised feature selection (the Relief algorithm) for the '4' vs. '9' and '7' vs. '9' tasks, with 20 and 100 training points; dark pixels correspond to highly weighted features.
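A minimal sketch of this visualization, assuming square OCR images (a 16x16 layout is assumed here) so that each feature weight maps back to one pixel:

    import numpy as np
    import matplotlib.pyplot as plt

    def show_weights(weights, side=16):
        img = np.asarray(weights).reshape(side, side)
        plt.imshow(img, cmap="gray_r")   # darker pixels = higher weight
        plt.axis("off")
        plt.show()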

11 Unsupervised Case: Experimental Results on OCR Data
                               '4' vs '9'   '7' vs '9'   '5' vs '6'
 % Error w/ feature scaling    25.1932      33.1128      25.9585
 % Error w/o feature scaling   46.2491      40.8571      8.3823
Figure 6: Unsupervised algorithm results.

12 Limitations & Future Work Limitations:  The methods assume features are independent.  The unsupervised algorithm seems to worsen classification results when the two classes are already well separated. Possible extensions:  Locate highly correlated features as a pre-processing step (see the sketch below).  Use some sort of information-theoretic approach to evaluate the discriminatory power of features.
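A hedged sketch of the first proposed extension, flagging highly correlated feature pairs as a pre-processing step (the 0.95 threshold is an arbitrary choice for illustration):

    import numpy as np

    def correlated_feature_pairs(X, threshold=0.95):
        corr = np.corrcoef(X, rowvar=False)          # feature-by-feature correlation matrix
        d = corr.shape[0]
        return [(i, j) for i in range(d) for j in range(i + 1, d)
                if abs(corr[i, j]) > threshold]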

13 References

