Considering Cost Asymmetry in Learning Classifiers, by Bach, Heckerman and Horvitz. Presented by Chunping Wang, Machine Learning Group, Duke University, May 21, 2007.
Outline: Introduction; SVM with Asymmetric Cost; SVM Regularization Path (Hastie et al., 2005); Path with Cost Asymmetry; Results; Conclusions.
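Introduction (1): Binary classification. A classifier can be defined as $\operatorname{sign}(f(x))$, based on a linear decision function $f(x) = w^\top x + b$, with real-valued predictors $x \in \mathbb{R}^d$, binary response $y \in \{-1, +1\}$, and parameters $(w, b)$.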
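Introduction (2): Two types of misclassification: a false negative, with cost $C_+$, and a false positive, with cost $C_-$. Expected cost: $C_+$ times the probability of a false negative plus $C_-$ times the probability of a false positive. Expressed in terms of the 0-1 loss function, this is the real loss, but it is non-convex and non-differentiable.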
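The asymmetric 0-1 cost can be estimated on a labeled sample as in the minimal sketch below. The function names, the toy scores, and the convention that a score of exactly zero is classified as positive are assumptions made for illustration.

```python
def zero_one_loss(u):
    """0-1 loss on a signed margin u: 1 if u < 0, else 0 (assumed tie convention)."""
    return 1.0 if u < 0 else 0.0


def empirical_asymmetric_cost(scores, labels, c_pos, c_neg):
    """Average asymmetric 0-1 cost on a sample.

    scores: values of the decision function f(x_i)
    labels: +1 or -1
    c_pos:  cost of a false negative (a positive example scored below zero)
    c_neg:  cost of a false positive (a negative example scored at or above zero)
    """
    total = 0.0
    for s, y in zip(scores, labels):
        if y == 1:
            total += c_pos * zero_one_loss(s)
        else:
            total += c_neg * zero_one_loss(-s)
    return total / len(labels)


# Toy data: one false negative (score -0.5, label +1)
# and one false positive (score 0.3, label -1).
scores = [2.0, -0.5, 0.3, -1.2]
labels = [1, 1, -1, -1]
cost = empirical_asymmetric_cost(scores, labels, c_pos=5.0, c_neg=1.0)
print(cost)
```

With these costs a false negative is five times as expensive as a false positive, so the single false negative dominates the total.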
Introduction (3): Convex loss functions serve as surrogates for the 0-1 loss function (for training purposes).
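To make the idea of a surrogate concrete, the sketch below tabulates three common convex losses next to the 0-1 loss. The choice of hinge, square and (base-2 scaled) logistic losses is an assumption for illustration; each of them upper-bounds the 0-1 loss at every margin value.

```python
import math


def zero_one(u):
    return 1.0 if u < 0 else 0.0


def hinge(u):
    return max(0.0, 1.0 - u)


def square(u):
    return (1.0 - u) ** 2


def logistic(u):
    # Divided by log(2) so that the loss equals 1 at u = 0.
    return math.log(1.0 + math.exp(-u)) / math.log(2.0)


margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
print("margin  0-1   hinge  square  logistic")
for u in margins:
    print(f"{u:6.1f}  {zero_one(u):3.0f}  {hinge(u):6.2f}  {square(u):6.2f}  {logistic(u):8.3f}")
```

Unlike the 0-1 loss, these functions are convex in the margin, which is what makes training tractable.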
Introduction (4): Empirical cost given n labeled data points. Objective function: the empirical cost plus a regularization term, so that training is governed by two quantities, the regularization and the cost asymmetry. Motivation: since convex surrogates of the 0-1 loss function are used for training, the cost asymmetries for training and testing are mismatched; it is therefore worth looking efficiently at many training asymmetries even if the testing asymmetry is given.
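SVM with Asymmetric Cost (1): The hinge loss is $\phi(u) = (1-u)_+$, where $(a)_+ = \max(a, 0)$. SVM with asymmetric cost: the hinge loss of each example is weighted by the cost of its class, $C_+$ for positive examples and $C_-$ for negative examples.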
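A common way to write the SVM with class-dependent costs is the soft-margin primal below. This is a sketch in assumed notation ($w$, $b$ for the linear decision function, slack variables $\xi_i$, and a per-example cost $C_i$); the scaling of the regularization term is also an assumption.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} C_i\,\xi_i
\quad \text{s.t.} \quad
\xi_i \ge 0, \quad \xi_i \ge 1 - y_i\,(w^\top x_i + b), \quad i = 1,\dots,n,
\qquad
C_i =
\begin{cases}
C_+ & \text{if } y_i = +1,\\
C_- & \text{if } y_i = -1.
\end{cases}
```

At the optimum each slack equals the hinge loss of its example, $\xi_i = (1 - y_i(w^\top x_i + b))_+$, so the sum is the cost-weighted hinge loss.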
SVM with Asymmetric Cost (2): The Lagrangian is formed with dual variables for the constraints of the primal problem, and the Karush-Kuhn-Tucker (KKT) conditions characterize the optimal solution.
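For reference, here is a sketch of the Lagrangian and KKT conditions for the standard soft-margin primal $\min \frac{1}{2}\|w\|^2 + \sum_i C_i \xi_i$ subject to $\xi_i \ge 0$ and $\xi_i \ge 1 - y_i(w^\top x_i + b)$, with $C_i = C_+$ for positive and $C_- $ for negative examples. The notation ($\alpha_i$, $\beta_i$ for the dual variables) is assumed.

```latex
% Lagrangian with dual variables \alpha_i \ge 0 and \beta_i \ge 0
L(w, b, \xi, \alpha, \beta)
  = \frac{1}{2}\|w\|^2 + \sum_{i} C_i\,\xi_i
  + \sum_{i} \alpha_i \bigl(1 - y_i (w^\top x_i + b) - \xi_i\bigr)
  - \sum_{i} \beta_i\,\xi_i

% Stationarity
w = \sum_{i} \alpha_i\, y_i\, x_i, \qquad
\sum_{i} \alpha_i\, y_i = 0, \qquad
\alpha_i + \beta_i = C_i

% Complementary slackness
\alpha_i \bigl(1 - y_i (w^\top x_i + b) - \xi_i\bigr) = 0, \qquad
\beta_i\,\xi_i = 0
```

Together with $\alpha_i, \beta_i \ge 0$, the third stationarity condition gives the box constraint $0 \le \alpha_i \le C_i$.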
SVM with Asymmetric Cost (3): The dual problem is a quadratic optimization problem for a given cost structure $(C_+, C_-)$, so solving it separately over the whole $(C_+, C_-)$ space would be computationally intractable. Following the SVM regularization path algorithm (Hastie et al., 2005), the authors work with (1)-(3) and the KKT conditions instead of the dual problem.
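The dual of the standard soft-margin SVM with per-example costs has the familiar form below. It is given as a sketch in assumed notation (dual variables $\alpha_i$, and $C_i = C_+$ or $C_-$ according to the label of example $i$), for the linear case.

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i\, \alpha_j\, y_i\, y_j\, x_i^\top x_j
\quad \text{s.t.} \quad
0 \le \alpha_i \le C_i, \qquad \sum_{i=1}^{n} \alpha_i\, y_i = 0
```

The costs enter only through the upper bounds $C_i$, which is why each choice of $(C_+, C_-)$ defines a separate quadratic program.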
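SVM Regularization Path (1): In the SVM regularization path, the cost is symmetric and thus the search is along a single axis. Define active sets of data points: Margin (M): $y_i f(x_i) = 1$; Left of margin (L): $y_i f(x_i) < 1$; Right of margin (R): $y_i f(x_i) > 1$. KKT conditions: the dual variable of a point in L sits at its upper bound, that of a point in R is zero, and that of a point in M lies in between.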
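The three active sets can be computed from the margins $y_i f(x_i)$ as in the sketch below. The tolerance `tol` and the one-dimensional toy classifier are assumptions for illustration; a numerical tolerance is needed because exact equality with 1 rarely holds in floating point.

```python
def active_sets(margins, tol=1e-8):
    """Partition example indices by the value of y_i * f(x_i).

    M: on the margin        (y_i f(x_i) == 1, up to tol)
    L: left of the margin   (y_i f(x_i) <  1)
    R: right of the margin  (y_i f(x_i) >  1)
    """
    M, L, R = [], [], []
    for i, m in enumerate(margins):
        if abs(m - 1.0) <= tol:
            M.append(i)
        elif m < 1.0:
            L.append(i)
        else:
            R.append(i)
    return M, L, R


# Toy one-dimensional example with f(x) = w * x + b.
w, b = 1.0, 0.0
xs = [1.0, 0.5, 3.0, -1.0, -0.2, -2.5]
ys = [1, 1, 1, -1, -1, -1]
margins = [y * (w * x + b) for x, y in zip(xs, ys)]

M, L, R = active_sets(margins)
print("M =", M, "L =", L, "R =", R)
```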
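SVM Regularization Path (2): Initialization (equal class sizes, $n_+ = n_-$). For a sufficiently large regularization parameter $\lambda$ (that is, $C$ is very small), all the points are in L and their dual variables remain at the upper bound. As $\lambda$ decreases, one or more positive and negative examples hit the margin simultaneously.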
SVM Regularization Path (3): A critical condition determines when the first two points hit the margin. Initialization (unequal class sizes, $n_+ \neq n_-$): this initial condition stays the same, except for the definition of the starting solution.
SVM Regularization Path (4): The path: as $\lambda$ decreases, the dual variables $\alpha_i$ change only for points on the margin (M), until one of the following events happens: a point from L or R enters M; a point in M leaves the set to join either R or L. Considering only the points on the margin gives a linear system for the change in their $\alpha_i$. Therefore, the $\alpha_i$ for points on the margin proceed linearly in $\lambda$, and the function $f(x)$ changes in a piecewise-inverse manner in $\lambda$.
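The piecewise behaviour between two consecutive events can be written in the form used by Hastie et al. for the symmetric-cost path. The notation below is a sketch with assumed symbols: $\lambda_\ell$ is the value of the regularization parameter at the most recent event, $\alpha_j^\ell$ and $f^\ell$ are the solution at that event, $b_j$ are the slopes obtained from the linear system on the margin points, and $h^\ell$ is a function of $x$ that stays fixed on the segment.

```latex
% For \lambda_{\ell} > \lambda > \lambda_{\ell+1} and points j on the margin:
\alpha_j(\lambda) = \alpha_j^{\ell} - (\lambda_{\ell} - \lambda)\, b_j

% Decision function on the same segment:
f(x) = \frac{\lambda_{\ell}}{\lambda}\,\bigl[f^{\ell}(x) - h^{\ell}(x)\bigr] + h^{\ell}(x)
```

The first line is linear in $\lambda$; the second depends on $\lambda$ only through $1/\lambda$, which is the piecewise-inverse behaviour.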
SVM Regularization Path (5): Update the regularization parameter; update the active sets and solutions. Stopping condition: in the separable case, we terminate when L becomes empty; in the non-separable case, we terminate when no possible event remains at a positive value of the regularization parameter.
Path with Cost Asymmetry (1): Exploration in the 2-d space $(C_+, C_-)$. Path initialization: start from situations in which all points are in L. Then follow the updating procedure of the 1-d case along a line: the regularization changes while the cost asymmetry is fixed. Among all the classifiers, find the best one given the user's cost function. Paths starting from
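The final selection step, choosing the best classifier for a given user cost, can be sketched as follows. The candidate error rates, the class prior `p_pos`, and all names are hypothetical; in practice the candidates would be the classifiers collected along the paths and the rates would be measured on held-out data.

```python
def expected_cost(fn_rate, fp_rate, p_pos, c_pos, c_neg):
    """Expected cost from per-class error rates.

    fn_rate: fraction of positives classified as negative
    fp_rate: fraction of negatives classified as positive
    p_pos:   proportion of positive examples
    c_pos:   cost of a false negative; c_neg: cost of a false positive
    """
    return c_pos * p_pos * fn_rate + c_neg * (1.0 - p_pos) * fp_rate


def best_classifier(candidates, p_pos, c_pos, c_neg):
    """Return the candidate with the lowest expected cost."""
    return min(
        candidates,
        key=lambda c: expected_cost(c["fn_rate"], c["fp_rate"], p_pos, c_pos, c_neg),
    )


# Hypothetical classifiers obtained with different training asymmetries.
candidates = [
    {"name": "A", "fn_rate": 0.30, "fp_rate": 0.05},
    {"name": "B", "fn_rate": 0.15, "fp_rate": 0.15},
    {"name": "C", "fn_rate": 0.05, "fp_rate": 0.40},
]

costly_fn = best_classifier(candidates, p_pos=0.5, c_pos=4.0, c_neg=1.0)
costly_fp = best_classifier(candidates, p_pos=0.5, c_pos=1.0, c_neg=4.0)
print(costly_fn["name"], costly_fp["name"])
```

The same pool of classifiers yields different winners as the user's costs change, which is the reason for keeping the whole pool rather than a single trained model.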
Path with Cost Asymmetry (2): Producing ROC curves. Collecting the solutions along R lines in the $(C_+, C_-)$ plane, we can build three ROC curves.
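As a point of reference, the simplest ROC curve is obtained from a single fixed classifier by shifting its intercept. The sketch below does only that, on toy scores that are assumptions for illustration; it does not pool classifiers trained at different asymmetries.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate) for one scorer.

    Each distinct score is used as a threshold, which is equivalent to
    shifting the intercept of the decision function. An example is
    classified as positive when its score is at or above the threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == -1)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy decision-function values and labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, -1, 1, 1, -1]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```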
Results (1): For 1000 testing asymmetries, three methods are compared: “one” – take the testing asymmetry as the training cost asymmetry; “int” – vary the intercept of “one” and build an ROC curve, then select the optimal classifier; “all” – select the optimal classifier from the ROC curve obtained by varying both the training asymmetry and the intercept. A nested cross-validation is used: the outer cross-validation produces overall accuracy estimates for the classifier; the inner cross-validation selects the optimal classifier parameters (training asymmetry and/or intercept).
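The nested cross-validation structure can be sketched as below. The fold counts, the contiguous fold layout, and the callback `fit_and_score` (which would train on one index set and return a cost on another, lower being better) are hypothetical; the dummy callback at the end only exercises the control flow.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for j in range(k):
        size = n // k + (1 if j < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv(n, outer_k, inner_k, fit_and_score, param_grid):
    """Outer loop estimates performance; inner loop selects the parameter.

    fit_and_score(train_idx, eval_idx, param) is a user-supplied callback
    returning a cost (lower is better).
    """
    outer_costs = []
    for test_idx in k_fold_indices(n, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]

        def inner_cost(param):
            # Average validation cost over the inner folds of the training part.
            total = 0.0
            for fold in k_fold_indices(len(train_idx), inner_k):
                val_idx = [train_idx[i] for i in fold]
                val_set = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val_set]
                total += fit_and_score(fit_idx, val_idx, param)
            return total / inner_k

        best_param = min(param_grid, key=inner_cost)
        # The outer test fold is never seen during parameter selection.
        outer_costs.append(fit_and_score(train_idx, test_idx, best_param))
    return sum(outer_costs) / outer_k


# Dummy callback: the cost depends only on the parameter, so 0.3 is selected.
def dummy_fit_and_score(train_idx, eval_idx, param):
    assert not set(train_idx) & set(eval_idx)  # no leakage between the two sets
    return abs(param - 0.3)


estimate = nested_cv(10, outer_k=5, inner_k=3,
                     fit_and_score=dummy_fit_and_score,
                     param_grid=[0.1, 0.3, 0.5])
print(estimate)
```

Here `param` stands in for whatever is being selected, such as the training asymmetry and/or the intercept.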
Results (2)
Conclusions: An efficient algorithm is presented to build ROC curves by varying the training cost asymmetries for SVMs. The main contribution is generalizing the SVM regularization path (Hastie et al., 2005) from a 1-d axis to a 2-d plane. Because a convex surrogate is used, using the testing asymmetry for training leads to a non-optimal classifier. Results show the advantages of considering more training asymmetries.