Features, Kernels, and Similarity functions Avrim Blum Machine learning lunch 03/05/07
Suppose you want to… use learning to solve some classification problem. E.g., given a set of images, learn a rule to distinguish men from women. The first thing you need to do is decide what you want as features. Or, for algs like SVM and Perceptron, can use a kernel function, which provides an implicit feature space. But then what kernel to use? Can Theory provide any help or guidance?
Plan for this talk Discuss a few ways theory might be of help: Algorithms designed to do well in large feature spaces when only a small number of features are actually useful. So you can pile a lot on when you don’t know much about the domain. Kernel functions. Standard theoretical view, plus new one that may provide more guidance. Bridge between “implicit mapping” and “similarity function” views. Talk about quality of a kernel in terms of more tangible properties. [work with Nina Balcan] Combining the above. Using kernels to generate explicit features.
A classic conceptual question How is it possible to learn anything quickly when there is so much irrelevant information around? Must there be some hard-coded focusing mechanism, or can learning handle it?
A classic conceptual question Let’s try a very simple theoretical model. Have n boolean features. Labels are + or −. Assume distinction is based on just one feature. How many prediction mistakes do you need to make before you’ve figured out which one it is? Can take majority vote over all possibilities consistent with data so far. Each mistake crosses off at least half. $O(\log n)$ mistakes total. $\log n$ is good: doubling n only adds 1 more mistake. Can’t do better (consider $\log n$ random strings with random labels. Whp there is a consistent feature in hindsight).
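The majority-vote scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `halving_learner` and the 0/1 encoding of the −/+ labels are assumptions made here.

```python
def halving_learner(n, examples):
    """Majority vote over all single-feature hypotheses still consistent with the data.

    examples: iterable of (x, label), where x is a tuple of n bits and
    label is 0 or 1 (standing in for - and +), with label == x[target]
    for some unknown index target.
    Returns the number of prediction mistakes made.
    """
    alive = set(range(n))  # candidate features not yet contradicted
    mistakes = 0
    for x, label in examples:
        votes_for_one = sum(x[i] for i in alive)
        prediction = 1 if 2 * votes_for_one >= len(alive) else 0
        if prediction != label:
            mistakes += 1
        # Cross off every candidate that disagrees with the true label.
        # On a mistake, at least half of the candidates voted wrongly,
        # so at least half are removed.
        alive = {i for i in alive if x[i] == label}
    return mistakes
```

Because the true feature is never crossed off and every mistake removes at least half of the remaining candidates, the mistake count stays at or below $\log_2 n$ however many examples are seen.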
A classic conceptual question What about more interesting classes of functions (not just target a single feature)?
Littlestone’s Winnow algorithm [MLJ 1988] Motivated by the question: what if target is an OR of $r \ll n$ features (e.g., $x_4 \lor x_7 \lor x_{10}$)? Majority vote scheme over all $n^r$ possibilities would make $O(r \log n)$ mistakes but totally impractical. Can you do this efficiently? Winnow is simple efficient algorithm that meets this bound. More generally, if exists LTF such that positives satisfy $w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge c$, negatives satisfy $w_1 x_1 + w_2 x_2 + \dots + w_n x_n \le c - \gamma$ (with $W = \sum_i |w_i|$), then # mistakes = $O((W/\gamma)^2 \log n)$. E.g., if target is “k of r” function, get $O(r^2 \log n)$. Key point: still only log dependence on n.
Littlestone’s Winnow algorithm [MLJ 1988] How does it work? Balanced version: Maintain weight vectors $w^+$ and $w^-$. Initialize all weights to 1. Classify based on whether $w^+ \cdot x$ or $w^- \cdot x$ is larger. (Have $x_0 \ne 0$.) If make mistake on positive x, for each $x_i = 1$, $w_i^+ = (1+\epsilon) w_i^+$, $w_i^- = (1-\epsilon) w_i^-$. And vice-versa for mistake on negative x. Other properties: Can show this approximates maxent constraints. In other direction, [Ng’04] shows that maxent with $L_1$ regularization gets Winnow-like bounds.
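The balanced update rule described on this slide might look as follows in Python. This is a sketch under stated assumptions: the function name `balanced_winnow`, the step size `eps=0.5`, the ±1 label encoding, the tie-breaking toward the negative class, and the convention that `x[0]` is an always-on coordinate are all choices made here for illustration.

```python
def balanced_winnow(examples, n, eps=0.5):
    """One online pass of balanced Winnow over (x, label) pairs.

    x is a tuple of n bits, with x[0] == 1 assumed as an always-on
    coordinate; label is +1 or -1.
    Returns (w_plus, w_minus, mistakes).
    """
    w_plus = [1.0] * n   # all weights start at 1
    w_minus = [1.0] * n
    mistakes = 0
    for x, label in examples:
        score_plus = sum(w * xi for w, xi in zip(w_plus, x))
        score_minus = sum(w * xi for w, xi in zip(w_minus, x))
        prediction = 1 if score_plus > score_minus else -1  # ties go to -1
        if prediction == label:
            continue
        mistakes += 1
        for i in range(n):
            if x[i] != 1:
                continue  # only active features are updated
            if label == 1:   # mistake on a positive: promote w+, demote w-
                w_plus[i] *= 1 + eps
                w_minus[i] *= 1 - eps
            else:            # mistake on a negative: the reverse
                w_plus[i] *= 1 - eps
                w_minus[i] *= 1 + eps
    return w_plus, w_minus, mistakes
```

The updates are multiplicative and touch only the coordinates that were on in the misclassified example, which is what distinguishes this from the additive Perceptron update.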
Practical issues On batch problem, may want to cycle through data, each time with smaller $\epsilon$. Can also do margin version: update if just barely correct. If want to output a likelihood, natural is $e^{w^+ \cdot x} / [e^{w^+ \cdot x} + e^{w^- \cdot x}]$. Can extend to multiclass too. William & Vitor have paper with some other nice practical adjustments.
Winnow versus Perceptron/SVM Winnow is similar at high level to Perceptron updates. What’s the difference? Suppose data is linearly separable by $w \cdot x = 0$ with $|w \cdot x| \ge \gamma$. For Perceptron, mistakes/samples bounded by $O((L_2(w) L_2(x)/\gamma)^2)$. For Winnow, mistakes/samples bounded by $O((L_1(w) L_\infty(x)/\gamma)^2 \log n)$. For boolean features, $L_\infty(x) = 1$. $L_2(x)$ can be $\sqrt{n}$. If target is sparse, examples dense, Winnow is better. E.g., x random in $\{0,1\}^n$, $f(x) = x_1$. Perceptron: $O(n)$ mistakes. If target is dense (most features are relevant) and examples are sparse, then Perceptron wins.
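To get a feel for the two bounds, the expressions inside the $O(\cdot)$ can be evaluated directly for a given separator and data set. The helper below is a hypothetical illustration (the name `mistake_bound_shapes` is not from the talk); it drops all constants, takes the margin to be the smallest $|w \cdot x|$ over the examples, and the usage note assumes ±1-valued features so that a separator through the origin exists.

```python
import math

def mistake_bound_shapes(w, xs):
    """Evaluate the Perceptron and Winnow bound expressions, constants dropped.

    w: separator weights; xs: list of example vectors.
    gamma is taken as the smallest |w . x| over the examples.
    """
    n = len(w)
    gamma = min(abs(sum(wi * xi for wi, xi in zip(w, x))) for x in xs)
    l2_w = math.sqrt(sum(wi * wi for wi in w))
    l1_w = sum(abs(wi) for wi in w)
    l2_x = max(math.sqrt(sum(xi * xi for xi in x)) for x in xs)
    linf_x = max(max(abs(xi) for xi in x) for x in xs)
    perceptron = (l2_w * l2_x / gamma) ** 2
    winnow = (l1_w * linf_x / gamma) ** 2 * math.log(n)
    return perceptron, winnow
```

With a one-feature target and dense ±1 examples, the Perceptron expression grows like n while the Winnow expression grows like log n; with an all-ones target and one-hot examples the comparison reverses.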
OK, now on to kernels…
Generic problem Given a set of images [images omitted], want to learn a linear separator to distinguish men from women. Problem: pixel representation no good. One approach: Pick a better set of features! But seems ad-hoc. Instead: Use a Kernel! $K(x, y) = \phi(x) \cdot \phi(y)$. $\phi$ is implicit, high-dimensional mapping. Perceptron/SVM only interact with data through dot-products, so can be “kernelized”. If data is separable in $\phi$-space by large $L_2$ margin, don’t have to pay for it.
Kernels E.g., the kernel $K(x,y) = (1 + x \cdot y)^d$ for the case of n=2, d=2, corresponds to the implicit mapping: [Figure: points labeled O and X plotted in the original $(x_1, x_2)$ plane and in the mapped $(z_1, z_2, z_3)$ space.]
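For n=2, d=2 the implicit mapping can be written out and checked numerically. The sketch below uses one standard explicit feature map for $(1 + x \cdot y)^2$, namely $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2)$; it is not necessarily the coordinate system drawn in the slide’s figure, and the function names are chosen here for illustration.

```python
import math

def poly_kernel(x, y, d=2):
    """K(x, y) = (1 + x . y)^d, computed without any explicit mapping."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** d

def phi(x):
    """An explicit feature map whose dot product equals (1 + x . y)^2 for n = 2."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

For any two points `x` and `y` in the plane, `poly_kernel(x, y)` and `dot(phi(x), phi(y))` agree up to floating-point rounding, which is the sense in which the kernel computes a dot product in the larger space.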
Kernels Perceptron/SVM only interact with data through dot-products, so can be “kernelized”. If data is separable in $\phi$-space by large $L_2$ margin, don’t have to pay for it. E.g., $K(x,y) = (1 + x \cdot y)^d$: (n-diml space) $\to$ ($n^d$-diml space). E.g., $K(x,y) = e^{-(x-y)^2}$. Conceptual warning: You’re not really “getting all the power of the high dimensional space without paying for it”. The margin matters. E.g., K(x,y)=1 if x=y, K(x,y)=0 otherwise. Corresponds to mapping where every example gets its own coordinate. Everything is linearly separable but no generalization.
Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?
Focus on batch setting Assume examples drawn from some probability distribution: Distribution D over x, labeled by target function c. Or distribution P over $(x, \ell)$. Will call P (or (c,D)) our “learning problem”. Given labeled training data, want algorithm to do well on new data.
Something funny about theory of kernels On the one hand, operationally a kernel is just a similarity function: $K(x,y) \in [-1,1]$, with some extra requirements. [here I’m scaling to $|\phi(x)| = 1$] And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand. But Theory talks about margins in implicit high-dimensional $\phi$-space. $K(x,y) = \phi(x) \cdot \phi(y)$.
I want to use ML to classify protein structures and I’m trying to decide on a similarity fn to use. Any help? It should be pos. semidefinite, and should result in your data having a large margin separator in implicit high-diml space you probably can’t even calculate.
Umm… thanks, I guess.
Something funny about theory of kernels Theory talks about margins in implicit high-dimensional $\phi$-space. $K(x,y) = \phi(x) \cdot \phi(y)$. Not great for intuition (do I expect this kernel or that one to work better for me?). Can we connect better with idea of a good kernel being one that is a good notion of similarity for the problem at hand? Motivation [BBV]: If margin $\gamma$ in $\phi$-space, then can pick $\tilde{O}(1/\gamma^2)$ random examples $y_1, \dots, y_n$ (“landmarks”), and do mapping $x \to [K(x,y_1), \dots, K(x,y_n)]$. Whp data in this space will be apx linearly separable.
Goal: notion of “good similarity function” that… 1. Talks in terms of more intuitive properties (no implicit high-diml spaces, no requirement of positive-semidefiniteness, etc). 2. If K satisfies these properties for our given problem, then has implications to learning. 3. Is broad: includes usual notion of “good kernel” (one that induces a large margin separator in $\phi$-space). If so, then this can help with designing the K. [Recent work with Nina, with extensions by Nati Srebro]
Proposal satisfying (1) and (2): Say have a learning problem P (distribution D over examples labeled by unknown target f). Sim fn $K:(x,y) \to [-1,1]$ is $(\epsilon,\gamma)$-good for P if at least a $1-\epsilon$ fraction of examples x satisfy: $E_{y \sim D}[K(x,y) \mid \ell(y) = \ell(x)] \ge E_{y \sim D}[K(x,y) \mid \ell(y) \ne \ell(x)] + \gamma$. Q: how could you use this to learn?
How to use it At least a $1-\epsilon$ prob mass of x satisfy: $E_{y \sim D}[K(x,y) \mid \ell(y) = \ell(x)] \ge E_{y \sim D}[K(x,y) \mid \ell(y) \ne \ell(x)] + \gamma$. Draw $S^+$ of $O((1/\gamma^2) \ln(1/\delta^2))$ positive examples. Draw $S^-$ of $O((1/\gamma^2) \ln(1/\delta^2))$ negative examples. Classify x based on which gives better score. Hoeffding: for any given “good x”, prob of error over draw of $S^+, S^-$ at most $\delta^2$. So, at most $\delta$ chance our draw is bad on more than $\delta$ fraction of “good x”. With prob $\ge 1-\delta$, error rate $\le \epsilon + \delta$.
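The “which gives better score” rule can be sketched as comparing the average similarity of x to the positive sample against its average similarity to the negative sample. The function name `similarity_classifier`, the ±1 output encoding, and breaking ties toward the positive class are assumptions of this sketch; `K` is any similarity function with values in [-1, 1].

```python
def similarity_classifier(K, s_plus, s_minus):
    """Return a rule that labels x by its average similarity to each sample.

    K: similarity function K(x, y) with values in [-1, 1].
    s_plus, s_minus: the drawn positive and negative examples.
    """
    def classify(x):
        avg_plus = sum(K(x, y) for y in s_plus) / len(s_plus)
        avg_minus = sum(K(x, z) for z in s_minus) / len(s_minus)
        return 1 if avg_plus >= avg_minus else -1
    return classify
```

The two averages are empirical estimates of the two conditional expectations in the definition, so for a “good x” they are separated by about $\gamma$ once the samples are large enough for Hoeffding to apply.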
But not broad enough $K(x,y) = x \cdot y$ has good separator but doesn’t satisfy defn. (half of positives are more similar to negs than to typical pos) [Figure: two groups of positives (+) and one group of negatives (−), with a 30° angle marked.]
But not broad enough Idea: would work if we didn’t pick y’s from top-left. Broaden to say: OK if $\exists$ large region R s.t. most x are on average more similar to $y \in R$ of same label than to $y \in R$ of other label. (even if don’t know R in advance) [Figure: two groups of positives (+) and one group of negatives (−), with a 30° angle marked.]
Broader defn… Say $K:(x,y) \to [-1,1]$ is an $(\epsilon,\gamma)$-good similarity function for P if exists a weighting function $w(y) \in [0,1]$ s.t. at least $1-\epsilon$ frac. of x satisfy: $E_{y \sim D}[w(y)K(x,y) \mid \ell(y) = \ell(x)] \ge E_{y \sim D}[w(y)K(x,y) \mid \ell(y) \ne \ell(x)] + \gamma$. Can still use for learning: Draw $S^+ = \{y_1, \dots, y_n\}$, $S^- = \{z_1, \dots, z_n\}$, $n = \tilde{O}(1/\gamma^2)$. Use to “triangulate” data: $x \to [K(x,y_1), \dots, K(x,y_n), K(x,z_1), \dots, K(x,z_n)]$. Whp, exists good separator in this space: $w = [w(y_1), \dots, w(y_n), -w(z_1), \dots, -w(z_n)]$.
Broader defn… Say $K:(x,y) \to [-1,1]$ is an $(\epsilon,\gamma)$-good similarity function for P if exists a weighting function $w(y) \in [0,1]$ s.t. at least $1-\epsilon$ frac. of x satisfy: $E_{y \sim D}[w(y)K(x,y) \mid \ell(y) = \ell(x)] \ge E_{y \sim D}[w(y)K(x,y) \mid \ell(y) \ne \ell(x)] + \gamma$. Whp, exists good separator in this space: $w = [w(y_1), \dots, w(y_n), -w(z_1), \dots, -w(z_n)]$. So, take new set of examples, project to this space, and run your favorite linear separator learning algorithm.* *Technically bounds are better if adjust definition to penalize examples more that fail the inequality badly…
Broader defn… Algorithm Draw $S^+ = \{y_1, \dots, y_d\}$, $S^- = \{z_1, \dots, z_d\}$, $d = O((1/\gamma^2) \ln(1/\delta^2))$. Think of these as “landmarks”. Use to “triangulate” data: $x \to [K(x,y_1), \dots, K(x,y_d), K(x,z_1), \dots, K(x,z_d)]$. Guarantee: with prob. $\ge 1-\delta$, exists linear separator of error $\le \epsilon + \delta$ at margin $\gamma/4$. Actually, margin is good in both $L_1$ and $L_2$ senses. This particular approach requires wasting examples for use as the “landmarks”. But could use unlabeled data for this part.
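The landmark mapping and the separator whose existence is guaranteed can be sketched as below. The names `landmark_features` and `weighted_separator` are hypothetical. In practice the weighting function w is not known, which is why the slides say to run a linear-separator learner in the new space; the default `weight` of 1 used here is only a placeholder so the example is self-contained.

```python
def landmark_features(K, s_plus, s_minus):
    """Return the map x -> [K(x,y_1),...,K(x,y_d), K(x,z_1),...,K(x,z_d)]."""
    landmarks = list(s_plus) + list(s_minus)
    def embed(x):
        return [K(x, y) for y in landmarks]
    return embed

def weighted_separator(s_plus, s_minus, weight=lambda y: 1.0):
    """The vector [w(y_1),...,w(y_d), -w(z_1),...,-w(z_d)].

    weight stands in for the unknown weighting function w(y) in [0, 1];
    the constant default is an assumption made for illustration.
    """
    return [weight(y) for y in s_plus] + [-weight(z) for z in s_minus]

def linear_score(w, features):
    """Dot product of the separator with the landmark features of one example."""
    return sum(wi * fi for wi, fi in zip(w, features))
```

A point is labeled by the sign of `linear_score(w, embed(x))`. With the weight fixed at 1 and equal-size samples this is the same decision as comparing average similarities; a learned w can instead down-weight unhelpful landmarks.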
Interesting property of definition An $(\epsilon,\gamma)$-good kernel [at least $1-\epsilon$ fraction of x have margin $\ge \gamma$] is an $(\epsilon',\gamma')$-good sim fn under this definition. But our current proofs suffer a penalty: $\epsilon' = \epsilon + \epsilon_{\text{extra}}$, $\gamma' = \gamma^3 \epsilon_{\text{extra}}$. So, at qualitative level, can have theory of similarity functions that doesn’t require implicit spaces. Nati Srebro has improved to $\gamma^2$, which is tight, + extended to hinge-loss.
Approach we’re investigating With Nina & Mugizi: Take a problem where original features already pretty good, plus you have a couple reasonable similarity functions $K_1, K_2, \dots$ Take some unlabeled data as landmarks, use to enlarge feature space: $K_1(x,y_1), K_2(x,y_1), K_1(x,y_2), \dots$ Run Winnow on the result. Can prove guarantees if some convex combination of the $K_i$ is good.
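The feature-enlargement step can be sketched as appending one new coordinate per (landmark, similarity function) pair to the original features. The function name `enlarged_features` is an assumption; the ordering follows the slide ($K_1(x,y_1), K_2(x,y_1), K_1(x,y_2), \dots$). Any rescaling the downstream learner might need is left out of this sketch.

```python
def enlarged_features(x, similarity_fns, landmarks):
    """Original features of x followed by K_1(x,y_1), K_2(x,y_1), K_1(x,y_2), ...

    x: sequence of original feature values.
    similarity_fns: list of functions K_i(x, y).
    landmarks: unlabeled examples y_1, y_2, ... used as reference points.
    """
    extra = [K(x, y) for y in landmarks for K in similarity_fns]
    return list(x) + extra
```

The resulting vector has `len(x) + len(landmarks) * len(similarity_fns)` coordinates, so the feature count can grow quickly; that is the regime where a learner with only log dependence on the number of features is attractive.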
Open questions This view gives some sufficient conditions for a similarity function to be useful for learning, but doesn’t have direct implications for using it directly in an SVM, say. Can one define other interesting, reasonably intuitive, sufficient conditions for a similarity function to be useful for learning?