Max-Margin Classification of Data with Absent Features
by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008
Presented by Chunping Wang
Machine Learning Group, Duke University
July 3, 2008
Outline
- Introduction
- Standard SVM
- Max-Margin Formulation for Missing Features
- Three Algorithms
- Experimental Results
- Conclusions
Introduction (1)
Pattern of missing features:
- due to measurement noise or corruption: existing but unknown
- due to the inherent properties of the instances: non-existing
Example 1: Two subpopulations of instances (animals and buildings) with few overlapping features (body parts, architectural aspects).
Example 2: In a web-page task, one useful feature of a given page may be the most common topic of other sites that point to it; however, this particular page may have no such parents.
Introduction (2)
Common methods for handling missing features (assume the features exist but their values are unknown):
- Single imputation: zeros, mean, kNN
- Imputation by building probabilistic generative models
Proposed method (assume the features are structurally absent):
- Each data instance resides in a lower-dimensional subspace of the feature space, determined by its own existing features.
- Maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.
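Standard SVM (1)
Binary classification:
- real-valued predictors $x_i \in \mathbb{R}^d$
- binary response $y_i \in \{-1, +1\}$
A classifier can be defined as $f(x) = \mathrm{sign}(w^\top x + b)$, based on the linear function $w^\top x + b$.
Parameters: $w \in \mathbb{R}^d$, $b \in \mathbb{R}$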
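Standard SVM (2)
- Functional margin for each instance: $y_i (w^\top x_i + b)$
- Geometric margin for each instance: $\rho_i = y_i (w^\top x_i + b) / \|w\|$
- Geometric margin of a hyperplane: $\rho = \min_i \rho_i$
SVM: maximize the geometric margin by fixing the functional margin to 1, i.e.,
$\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
- $\xi_i$'s: slack variables
- $C$: cost
This is a quadratic program (QP).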
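As a small numeric illustration of the margin definitions on this slide, the sketch below computes functional and geometric margins for a toy data set. The data, the weight vector, and the helper names are made up for illustration and are not taken from the paper.

```python
import math

def functional_margin(w, b, x, y):
    """y * (w . x + b): positive when x lies on the correct side."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    """Functional margin divided by ||w||: signed distance to the hyperplane."""
    return functional_margin(w, b, x, y) / math.sqrt(sum(wj * wj for wj in w))

# Toy data (hypothetical): two points on either side of the line 3*x1 + 4*x2 = 0.
w, b = [3.0, 4.0], 0.0
samples = [([1.0, 1.0], +1), ([-2.0, -1.0], -1)]

margins = [geometric_margin(w, b, x, y) for x, y in samples]
hyperplane_margin = min(margins)  # geometric margin of the hyperplane
```

Rescaling `w` and `b` by the same positive constant changes the functional margins but not the geometric ones, which is why the SVM is free to fix the functional margin to 1.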
Max-Margin Formulation for Missing Features (1)
A 2-D case with missing data:
- margin in the subspace
- margin in the full feature space
When the margin of an instance with missing features is measured in the full feature space, it is underestimated.
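The 2-D picture can be checked numerically. The sketch below assumes that the full-space margin is obtained by treating the absent feature as zero and dividing by the full norm of w, while the subspace margin divides by the norm of w restricted to the existing features. Absent features are marked with `None`; the numbers and function names are illustrative assumptions.

```python
import math

def margin_full_space(w, b, x, y):
    """Treat absent features (None) as zeros and divide by the full norm ||w||."""
    score = sum(wj * xj for wj, xj in zip(w, x) if xj is not None) + b
    return y * score / math.sqrt(sum(wj * wj for wj in w))

def margin_in_subspace(w, b, x, y):
    """Divide by the norm of w restricted to the features that exist in x."""
    present = [(wj, xj) for wj, xj in zip(w, x) if xj is not None]
    score = sum(wj * xj for wj, xj in present) + b
    return y * score / math.sqrt(sum(wj * wj for wj, _ in present))

w, b = [3.0, 4.0], 0.0
x, y = [2.0, None], +1   # the second feature is structurally absent

full = margin_full_space(w, b, x, y)   # 6 / 5
sub = margin_in_subspace(w, b, x, y)   # 6 / 3
```

Because the restricted norm can never exceed the full norm, the full-space value is never larger in magnitude than the subspace value.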
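Max-Margin Formulation for Missing Features (2)
- Instance margin: $\rho_i(w) = y_i (w^{(i)\top} x_i + b) / \|w^{(i)}\|$, where $w^{(i)}$ keeps only the entries of $w$ whose features exist in $x_i$
- Optimization problem: $\max_{w,b}\ \min_i \rho_i(w)$
- The problem is non-convex in $w$
- The norm $\|w^{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization
It is difficult to solve this optimization problem directly.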
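A side-by-side comparison with the standard case may make the difficulty clearer. The notation $w^{(i)}$ (the entries of $w$ whose features exist in $x_i$) is assumed here from the slide's description of instance-specific subspaces.

```latex
% Standard SVM: the norm is shared by all instances and factors out of the min.
\max_{w,b}\ \min_i \frac{y_i \,(w^\top x_i + b)}{\|w\|}
  \;=\; \max_{w,b}\ \frac{1}{\|w\|}\ \min_i\ y_i \,(w^\top x_i + b)

% Absent features: each instance has its own norm, so no common factor exists.
\max_{w,b}\ \min_i \frac{y_i \,\big(w^{(i)\top} x_i + b\big)}{\|w^{(i)}\|}
```

In the first line the functional margin can be fixed to 1 and the problem reduces to minimizing $\|w\|$; in the second line that step is not available.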
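Three Algorithms (1)
A convex formulation for the linearly separable case
- Introduce a lower bound $\gamma$ for the instance margins: maximize $\gamma$ subject to $y_i (w^{(i)\top} x_i + b) \ge \gamma \|w^{(i)}\|$ for all $i$
- For a given $\gamma$, this is a second-order cone program (SOCP), which is convex and can be solved efficiently.
- To find the optimal $\gamma$, do a bisection search over $\gamma$.
Unfortunately, extending it to the non-separable case is difficult.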
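The bisection loop can be sketched independently of the solver. In the sketch below the margin lower bound is called `gamma`, and the SOCP feasibility check is replaced by a hypothetical stand-in: a brute-force search over a grid of unit-norm 2-D weight vectors, with the bias omitted for simplicity. A real implementation would call an SOCP solver at that point; all names and data here are illustrative assumptions.

```python
import math

def restricted_margin(w, x, y):
    """Instance margin y * (w^(i) . x) / ||w^(i)|| over features present in x."""
    present = [(wj, xj) for wj, xj in zip(w, x) if xj is not None]
    norm = math.sqrt(sum(wj * wj for wj, _ in present))
    if norm == 0.0:
        return -math.inf  # treat a degenerate restriction as infeasible
    return y * sum(wj * xj for wj, xj in present) / norm

def make_grid_oracle(data, n_angles=360):
    """Hypothetical stand-in for an SOCP feasibility solver (2-D, no bias)."""
    candidates = [[math.cos(2 * math.pi * k / n_angles),
                   math.sin(2 * math.pi * k / n_angles)] for k in range(n_angles)]
    def is_feasible(gamma):
        # Feasible if some candidate w gives every instance a margin >= gamma.
        return any(all(restricted_margin(w, x, y) >= gamma for x, y in data)
                   for w in candidates)
    return is_feasible

def bisect_gamma(is_feasible, lo, hi, tol=1e-3):
    """Largest feasible gamma in [lo, hi], assuming feasibility is monotone."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            lo = mid   # mid is achievable, search higher
        else:
            hi = mid   # mid is too ambitious, search lower
    return lo

# Toy separable data (hypothetical); None marks a structurally absent feature.
data = [([2.0, None], +1), ([1.0, 1.0], +1),
        ([-1.0, -1.0], -1), ([None, -2.0], -1)]
gamma_star = bisect_gamma(make_grid_oracle(data), lo=0.0, hi=3.0)
```

Bisection is valid because any (w, b) that achieves a margin of `gamma` also achieves every smaller margin, so feasibility is monotone in `gamma`.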
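Three Algorithms (2)
Average norm: a convex approximation for the non-separable case
- Define an average norm shared by all instances and use it in place of each instance-specific norm $\|w^{(i)}\|$
- This gets rid of the instance dependence
- The non-separable case is then handled with slack variables, as in the standard SVM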
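Three Algorithms (3)
Geometric margin: an exact non-convex approach for the non-separable case
- Define $s_i = \|w^{(i)}\| / \|w\|$
- Non-separable case: $\min_{w,b,\xi,s}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $\tfrac{1}{s_i}\, y_i (w^{(i)\top} x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $s_i = \|w^{(i)}\| / \|w\|$
- This is a QP for a given set of $s_i$'s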
Three Algorithms (4)
Geometric margin: the exact non-convex approach for the non-separable case (pseudo-code)
- Convergence is not guaranteed.
- Cross-validation is used to choose an early stopping point.
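One plausible reading of the iterative scheme is sketched below. It assumes that the procedure alternates between solving the problem for fixed per-instance scalings s_i and updating s_i = ||w^(i)|| / ||w||, where w^(i) is w restricted to the features present in instance i. This is not the authors' pseudo-code: the QP step is replaced by plain subgradient descent on the equivalent hinge-loss objective, and the initialization s_i = 1, the number of rounds, C, and the step size are all assumptions.

```python
import math

def dot_present(w, x):
    """w^(i) . x_i: inner product over the features present (not None) in x."""
    return sum(wj * xj for wj, xj in zip(w, x) if xj is not None)

def norm_present(w, x):
    """||w^(i)||: norm of w restricted to the features present in x."""
    return math.sqrt(sum(wj * wj for wj, xj in zip(w, x) if xj is not None))

def solve_fixed_s(data, s, C=1.0, steps=2000, lr=0.01):
    """Stand-in for the QP step: subgradient descent on
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w^(i).x_i + b) / s_i)."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(steps):
        gw, gb = list(w), 0.0                 # gradient of 0.5*||w||^2
        for (x, y), si in zip(data, s):
            if y * (dot_present(w, x) + b) / si < 1.0:   # hinge term is active
                for j, xj in enumerate(x):
                    if xj is not None:
                        gw[j] -= C * y * xj / si
                gb -= C * y / si
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def geometric_margin_svm(data, rounds=5):
    """Alternate the fixed-s problem with the update s_i = ||w^(i)|| / ||w||.
    `rounds` plays the role of the early stopping point."""
    s = [1.0] * len(data)
    for _ in range(rounds):
        w, b = solve_fixed_s(data, s)
        norm_w = math.sqrt(sum(wj * wj for wj in w))
        if norm_w == 0.0:
            break
        # The small floor avoids division by zero in the next round.
        s = [max(norm_present(w, x) / norm_w, 1e-6) for x, _ in data]
    return w, b, s

# Toy data (hypothetical); None marks a structurally absent feature.
data = [([2.0, None], +1), ([1.0, 1.0], +1),
        ([-1.0, -1.0], -1), ([None, -2.0], -1)]
w, b, s = geometric_margin_svm(data)
```

In practice `rounds` would be chosen by cross-validation, since the alternation is not guaranteed to converge.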
Experimental Results (1)
Methods compared:
- Zero. Missing values were set to zero.
- Mean. Missing values were set to the average value of the feature over all data.
- Flag. Additional features (“flags”) were added, explicitly denoting whether a feature is missing for a given instance.
- kNN. Missing features were set to the mean value obtained from the K nearest-neighbor instances.
- EM. A Gaussian mixture model (GMM) is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample.
- Averaged norm (avg |w|). Proposed approximate convex approach.
- Geometric margin (geom). Proposed exact non-convex approach.
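The three simplest baselines can be sketched in a few lines. The sketch assumes that missing entries are marked with `None`, that the mean is taken over the observed values of each feature, and that the flag variant fills missing values with zero before appending the indicators; the slides do not specify these details. kNN and EM imputation are not shown.

```python
def impute_zero(rows):
    """Zero baseline: replace every missing value with 0."""
    return [[0.0 if v is None else v for v in row] for row in rows]

def impute_mean(rows):
    """Mean baseline: replace missing values with the per-feature mean
    of the observed values (0 if a feature is never observed)."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def add_flags(rows):
    """Flag baseline: zero-fill, then append one 0/1 indicator per feature
    (1 means the feature was missing for that instance)."""
    return [[0.0 if v is None else v for v in row] +
            [1.0 if v is None else 0.0 for v in row] for row in rows]

# Toy data (hypothetical): three instances, two features.
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
zero_filled = impute_zero(rows)
mean_filled = impute_mean(rows)
flagged = add_flags(rows)
```

All three baselines assign a value to every entry, which is the step the proposed methods avoid.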
Experimental Results (2)
- UCI data sets (missing at random): remove 90% of the features of each sample at random.
- Digits 5 & 6 from MNIST: remove a patch covering 25% of the pixels, with the location of the patch sampled uniformly.
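The two removal schemes can be mimicked as follows. The sketch assumes that removed entries are marked with `None`, that the patch is square and lies fully inside the image, and that the image is 28-by-28 (so a 14-by-14 patch covers 25% of the pixels); the function names and the seed are illustrative.

```python
import random

def remove_random_features(x, fraction, rng):
    """Mark a random `fraction` of the entries of x as missing (None)."""
    n_remove = round(fraction * len(x))
    drop = set(rng.sample(range(len(x)), n_remove))
    return [None if j in drop else v for j, v in enumerate(x)]

def remove_square_patch(image, side, rng):
    """Mark a side-by-side patch as missing; its top-left corner is
    sampled uniformly among positions that keep the patch inside the image."""
    rows, cols = len(image), len(image[0])
    r0 = rng.randrange(rows - side + 1)
    c0 = rng.randrange(cols - side + 1)
    return [[None if r0 <= r < r0 + side and c0 <= c < c0 + side else v
             for c, v in enumerate(row)] for r, row in enumerate(image)]

rng = random.Random(0)
sample = remove_random_features([1.0] * 20, 0.9, rng)   # 18 of 20 removed
image = [[0.5] * 28 for _ in range(28)]                  # placeholder image
masked = remove_square_patch(image, 14, rng)             # 196 of 784 removed
```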
Experimental Results (3)
Visual object recognition
Task: to determine whether an automobile is present in a given image.
Feature extraction pipeline:
1. Local edge information is fed to a generative model.
2. The model gives the likelihood of patches to match each of 19 landmarks.
3. A threshold selects up to 10 candidate patches (21-by-21 pixels) for each landmark.
4. PCA keeps the first 10 principal components of each patch.
5. These are concatenated into a feature vector (up to 1900 features).
If the number of candidates for a given landmark is less than ten, we consider the rest to be structurally absent.
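The final concatenation step, and how structurally absent entries arise from it, can be sketched as below. The sketch assumes a fixed layout of 19 landmarks times 10 candidate slots times 10 PCA coefficients (1900 entries) and marks unfilled slots with `None`; the function name and the example input are hypothetical.

```python
N_LANDMARKS, MAX_CANDIDATES, N_COMPONENTS = 19, 10, 10

def build_feature_vector(candidates_per_landmark):
    """candidates_per_landmark[l] is a list of up to MAX_CANDIDATES descriptors,
    each a list of N_COMPONENTS PCA coefficients. Unfilled slots become None."""
    features = []
    for descriptors in candidates_per_landmark:
        for slot in range(MAX_CANDIDATES):
            if slot < len(descriptors):
                features.extend(descriptors[slot])
            else:
                features.extend([None] * N_COMPONENTS)  # structurally absent
    return features

# Hypothetical image: landmark 0 has 3 candidates, every other landmark has none.
example = [[[0.1] * N_COMPONENTS for _ in range(3)]]
example += [[] for _ in range(N_LANDMARKS - 1)]

vec = build_feature_vector(example)
n_present = sum(v is not None for v in vec)
```

Every image gets a vector of the same length, but the set of existing entries differs from image to image, which is the setting the max-margin formulation targets.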
Experimental Results (4)
An example image: the best 5 candidates matched to the front windshield landmark.
Experimental Results (5)
Experimental Results (6)
Metabolic pathway reconstruction
A fragment of the full metabolic pathway network:
- Arrows: chemical reactions
- Purple-boxed names: enzymes
Experimental Results (7)
Three types of neighborhood relations between enzyme pairs:
- Linear chains (ARO7, PHA2)
- Forks (TRP2, ARO7): same input, different outputs
- Funnels (ARO9, PHA2): same output, different inputs
One feature vector (representing an enzyme) consists of three blocks:
- features for the linear chain neighbor
- features for the fork neighbor
- features for the funnel neighbor
A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors; e.g., PHA2 does not have a neighbor of type fork.
Experimental Results (8)
Task: to identify whether a candidate enzyme is in the right “neighborhood”.
Data creation:
- Positive samples: from the reactions with known enzymes (in the right “neighborhood”).
- Negative samples: for each positive sample, replace the true enzyme with a random impostor, and calculate the features in such a wrong “neighborhood”. The impostor was chosen uniformly from the set of other enzymes.
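The negative-sample construction can be sketched as follows. Only the impostor substitution is shown; computing the features of the resulting wrong neighborhood is left out. The dictionary layout, the reaction identifier, and the seed are hypothetical; the enzyme names are the ones used as examples on the previous slide.

```python
import random

def make_negative(positive, enzymes, rng):
    """Replace the true enzyme of a positive sample with an impostor
    chosen uniformly from the set of other enzymes."""
    others = [e for e in enzymes if e != positive["enzyme"]]
    impostor = rng.choice(others)
    return {"reaction": positive["reaction"], "enzyme": impostor, "label": -1}

enzymes = ["ARO7", "PHA2", "TRP2", "ARO9"]
positive = {"reaction": "r1", "enzyme": "PHA2", "label": +1}  # "r1" is a placeholder

rng = random.Random(0)
negative = make_negative(positive, enzymes, rng)
```

Each positive sample yields one negative sample at the same position in the network, so the two classes are balanced by construction.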
Experimental Results (9)
Conclusions
1. The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain.
2. The authors directly classified instances by skipping the non-existing features, rather than filling them with hypothetical values.
3. The proposed model was competitive with a range of single imputation approaches when tested in missing-at-random (MAR) settings.
4. One variant (geometric margin) significantly outperformed other methods in two real problems with non-existing features.