Kernel Classifiers from a Machine Learning Perspective (Section 2)
Jin-San Yang
Biointelligence Laboratory
School of Computer Science and Engineering, Seoul National University
The Basic Setting

Definition 2.1 (Learning problem): finding the unknown (functional) relationship h between objects x and targets y, based on a sample z = ((x_1, y_1), ..., (x_m, y_m)) of size m. For a given object x, evaluate the conditional distribution and decide on the class y by

    $h(x) = \mathrm{argmax}_{y \in Y}\, P_{Y|X=x}(y)$

Problems with estimating $P_Z$ based on the given sample alone:
- We cannot predict classes for a new object not seen in the sample.
- We need to constrain the set of possible mappings from objects to classes.

Definition 2.2 (Features and feature space): a feature map $\phi: X \to K$ sends each object x to a vector $\phi(x) = (\phi_1(x), \dots, \phi_n(x))$ in the feature space K.

Definition 2.4 (Linear functions and linear classifiers): $f_w(x) = \langle \phi(x), w \rangle$ and, for binary classification, $h_w(x) = \mathrm{sign}(f_w(x))$.
- Similar objects are mapped to similar classes via linearity.
- A linear classifier is unaffected by the scale of the weight vector w, and hence w is assumed to be of unit length.
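As a minimal sketch of Definition 2.4, the snippet below implements $h_w(x) = \mathrm{sign}(\langle \phi(x), w \rangle)$; the feature map `phi` and the weight vector are toy assumptions, not part of the slides.

```python
import numpy as np

def linear_classifier(w, phi, x):
    """Binary linear classifier h_w(x) = sign(<phi(x), w>)."""
    return np.sign(np.dot(phi(x), w))

# Hypothetical feature map: a scalar object is mapped to (x, x^2).
phi = lambda x: np.array([x, x**2])
w = np.array([1.0, -0.5])
w /= np.linalg.norm(w)   # unit-length weight, as the slide assumes
print(linear_classifier(w, phi, 3.0))   # -> -1.0
```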
F is isomorphic to W: each linear function $f_w$ corresponds to exactly one weight vector w, so the task of learning reduces to finding the best classifier in the hypothesis space F.

Desirable properties for a measure of the goodness of a classifier:
- It depends on the unknown $P_Z$.
- It makes the optimization task computationally easier.
- It is pointwise w.r.t. the object-class pairs (due to the independence of the samplings).

Expected risk: $R[f] = \mathbf{E}_{XY}[l(f(X), Y)]$ for a loss function $l$.

Example 2.7 (Cost matrices): in classifying handwritten digits, the 0-1 loss function is inappropriate, since there are approximately 10 times more "no pictures of 1" than "pictures of 1"; a cost matrix assigns different losses to the different kinds of misclassification.
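As a sketch of Example 2.7, the snippet below scores predictions with a hypothetical 2x2 cost matrix instead of the 0-1 loss; the labels, the cost values, and the sample-average approximation of the expected risk are illustrative assumptions.

```python
import numpy as np

# Hypothetical cost matrix: rows = true class, cols = predicted class.
# Missing a rare "picture of 1" is made 10x as costly, echoing the
# digit example; the exact numbers are invented for illustration.
C = np.array([[0.0, 1.0],     # true class 0: "no picture of 1"
              [10.0, 0.0]])   # true class 1: "picture of 1"

def cost_loss(y_true, y_pred):
    return C[y_true, y_pred]

def average_risk(ys_true, ys_pred):
    # Sample average approximating R[f] = E[l(f(X), Y)].
    return np.mean([cost_loss(t, p) for t, p in zip(ys_true, ys_pred)])

print(average_risk([0, 1, 1, 0], [0, 0, 1, 1]))   # -> (0+10+0+1)/4
```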
Remark 2.8 (Geometrical picture): linear classifiers, parameterized by the weight w, correspond to hyperplanes passing through the origin in feature space K.

[Figure: the hypothesis space W and the feature space K views of a linear classifier]
Learning by Risk Minimization

Definition 2.9 (Learning algorithm): a mapping $A: \bigcup_{m=1}^{\infty} (X \times Y)^m \to F$ (where X is the object space, Y the output space, and F the hypothesis space). Note that we have no knowledge of the function (or of $P_Z$) to be optimized.

Definition 2.10 (Generalization error): $R[A, z] = R[A(z)] - \inf_{f \in F} R[f]$.

Definition 2.11 (Empirical risk): the empirical risk functional over F, or training error of f, is defined as $R_{\mathrm{emp}}[f, z] = \frac{1}{m} \sum_{i=1}^{m} l(f(x_i), y_i)$.
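A minimal sketch of Definition 2.11: the 0-1 loss, the classifier `f`, and the two-point sample `z` below are hypothetical; the function simply averages the loss over the training sample.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    return float(y_pred != y_true)

def empirical_risk(f, sample, loss=zero_one_loss):
    """Training error R_emp[f, z] = (1/m) * sum_i loss(f(x_i), y_i)."""
    return np.mean([loss(f(x), y) for x, y in sample])

# Toy sample z and classifier f, invented for illustration.
z = [(np.array([1.0, 2.0]), 1), (np.array([-1.0, 0.5]), -1)]
f = lambda x: int(np.sign(x[0]))
print(empirical_risk(f, z))   # fraction of misclassified training points
```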
The (Primal) Perceptron Algorithm

When $(x_i, y_i)$ is misclassified by the linear classifier, i.e. $y_i \langle x_i, w_t \rangle \le 0$, the update step $w_{t+1} = w_t + y_i x_i$ changes $y_i \langle x_i, w_t \rangle$ into $y_i \langle x_i, w_t \rangle + \lVert x_i \rVert^2$, and thus attracts the hyperplane toward the misclassified point.
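A minimal sketch of the primal perceptron under the usual assumptions (labels in {-1, +1}, mistake-driven updates, a cap on the number of epochs since convergence is guaranteed only for linearly separable samples); the toy data is invented for illustration.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Primal perceptron: w <- w + y_i * x_i on each mistake.

    X: (m, n) array of feature vectors, y: labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(x_i, w) <= 0:   # misclassified (or on boundary)
                w += y_i * x_i              # attract hyperplane toward x_i
                mistakes += 1
        if mistakes == 0:                   # consistent with the sample
            break
    return w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```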
Definition 2.12 (Version space): the set of all classifiers consistent with the training sample. Given the training sample z and a hypothesis space H,

    $V(z) = \{h \in H \mid \forall i: h(x_i) = y_i\}$

For linear classifiers, the set of consistent (unit-length) weight vectors is called the version space:

    $W(z) = \{w \in W \mid \forall i: y_i \langle x_i, w \rangle > 0\}$
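The membership test below transcribes the linear version-space condition directly; the Monte Carlo estimate of the fraction of consistent unit weights, and the reused toy sample, are purely illustrative and not part of the slides.

```python
import numpy as np

def in_version_space(w, X, y):
    """True iff y_i * <x_i, w> > 0 for all i, i.e. w is in W(z)."""
    return bool(np.all(y * (X @ w) > 0))

# Monte Carlo glimpse of the version space: fraction of random
# unit-length weights consistent with the toy sample above.
rng = np.random.default_rng(0)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
ws = rng.normal(size=(10_000, 2))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
print(np.mean([in_version_space(w, X, y) for w in ws]))
```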
Regularized Risk Functional

Drawbacks of minimizing the empirical risk:
- ERM makes the learning task an ill-posed one: a slight variation of the training sample can cause a large deviation of the expected risk (overfitting).

Regularization is one way to overcome this problem (see the sketch after this list):
- Introduce a regularizer $\Omega[f]$ a priori and minimize the regularized risk $R_{\mathrm{reg}}[f, z] = R_{\mathrm{emp}}[f, z] + \lambda \Omega[f]$, $\lambda > 0$.
- This restricts the space of solutions to compact subsets of the (originally overly large) space F, which can be achieved by requiring the set $\{f \mid \Omega[f] \le s\}$ to be compact for each positive number s.
- If we decrease $\lambda$ for increasing sample size in the right way, it can be shown that the regularization method leads to the minimizer of the expected risk.
- $\lambda \to 0$ minimizes only the empirical risk; $\lambda \to \infty$ minimizes only the regularizer.
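The slides do not fix a particular loss or regularizer; as one concrete instance, the sketch below minimizes the squared loss with $\Omega[f_w] = \lVert w \rVert^2$ (ridge regression), whose minimizer has a closed form. All data and $\lambda$ values are assumptions.

```python
import numpy as np

def regularized_risk_minimizer(X, y, lam):
    """Minimize (1/m)*||Xw - y||^2 + lam*||w||^2 in closed form:
    w = (X^T X + lam*m*I)^{-1} X^T y."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
for lam in (0.0, 0.1, 10.0):    # lam -> 0: pure ERM; large lam: w -> 0
    print(lam, regularized_risk_minimizer(X, y, lam))
```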
Structural risk minimization (SRM), by Vapnik:
- Define a structuring of the hypothesis space F into nested subsets $F_1 \subseteq F_2 \subseteq \dots$ of increasing complexity.
- In each hypothesis space, empirical risk minimization is performed.
- SRM returns the classifier with the smallest guaranteed risk, which combines the training error with the complexity value of its hypothesis space.

Maximum-a-posteriori (MAP) estimation:
- Interpret the empirical risk as the negative log-probability of the training sample z for a classifier f.
- The MAP estimate is the mode of the posterior density, i.e. it maximizes the posterior.
- The choice of regularizer is comparable to the choice of the prior probability in the Bayesian framework and reflects prior knowledge.
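A toy illustration of the SRM idea, with nested hypothesis spaces given by polynomial feature degree and a crude additive complexity penalty standing in for a proper guaranteed-risk bound; the data, the penalty form, and its weight are all assumptions, not Vapnik's actual bound.

```python
import numpy as np

def srm_select(X, y, degrees=(1, 2, 3, 4), penalty=0.05):
    """Toy SRM over nested classes F_1 within F_2 within ...:
    pick the degree minimizing training error + penalty * complexity."""
    best = None
    for d in degrees:
        Phi = np.vander(X, d + 1)                 # features up to degree d
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        err = np.mean(np.sign(Phi @ w) != y)      # 0-1 training error
        score = err + penalty * d                 # surrogate guaranteed risk
        if best is None or score < best[0]:
            best = (score, d)
    return best[1]

X = np.array([-2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1])
print(srm_select(X, y))   # degree chosen by the toy SRM criterion
```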