Kernels, Margins, and Low-dimensional Mappings [NIPS 2007 Workshop on Topology Learning] Maria-Florina Balcan, Avrim Blum, Santosh Vempala
Generic problem
Given a set of images, we want to learn a linear separator to distinguish men from women. Problem: the pixel representation is no good.
Old-style advice: pick a better set of features! But this seems ad hoc, not scientific.
New-style advice: use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping. Feels more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dimensional space; don't pay for it if a large-margin separator exists.
E.g., K(x,y) = (x·y + 1)^m: φ maps the n-dimensional space into an n^m-dimensional space.
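To make the kernel example concrete, here is a minimal Python sketch (illustrative, not from the slides) checking that the black-box polynomial kernel with m = 2 agrees with the dot product under one explicit degree-2 feature map; phi_degree2 is one particular choice of such a map:

```python
import numpy as np

def K(x, y, m=2):
    # black-box kernel: K(x,y) = (x·y + 1)^m
    return (np.dot(x, y) + 1.0) ** m

def phi_degree2(x):
    # one explicit feature map realizing K with m = 2:
    # all pairwise products x_i x_j, the coordinates scaled by sqrt(2), and a constant 1
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2.0) * x, [1.0]])

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
# same value, but K never builds the roughly n^2-dimensional representation
assert np.isclose(K(x, y), phi_degree2(x) @ phi_degree2(y))
```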
Claim: Can view the new method as a way of conducting the old method.
Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D]:
Claim: Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D,c], then this is a good feature set [∃ an almost-as-good separator].
"You give me a kernel, I give you a set of features." We do this using the idea of random projection...
E.g., sample z_1,...,z_d from D; given x, define x_i = K(x,z_i).
Implications:
Practical: an alternative to kernelizing the algorithm.
Conceptual: view the kernel as a (principled) way of doing feature generation, i.e., as a similarity function, rather than as the "magic power of an implicit high-dimensional space".
Basic setup, definitions
Instance space X, distribution D, target c. Use P = (D,c). K(x,y) = φ(x)·φ(y).
P is separable with margin γ in φ-space if ∃ w (|w| = 1) s.t. Pr_{(x,ℓ)∈P}[ ℓ (w·φ(x)/|φ(x)|) < γ ] = 0.
Error ε at margin γ: replace the "0" with "ε".
Goal: use K to get a mapping to a low-dimensional space.
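Written out, the two definitions above are (a LaTeX restatement of the slide's notation):

```latex
% P = (D,c) is separable with margin \gamma in \phi-space if there exists w, |w| = 1, with
\[
  \Pr_{(x,\ell) \sim P}\!\left[\ \ell\,\frac{w \cdot \phi(x)}{|\phi(x)|} < \gamma \right] \;=\; 0,
\]
% and P has error \varepsilon at margin \gamma if the probability above is at most \varepsilon.
```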
One idea: the Johnson-Lindenstrauss lemma
If P is separable with margin γ in φ-space, then with probability ≥ 1−δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε. [Arriaga-Vempala]
If the projection vectors are r_1, r_2, ..., r_d, then we can view the features as x_i = φ(x)·r_i.
Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
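A minimal sketch of this random-projection idea (illustrative; the function name is mine): it assumes the explicit φ-vectors are available as rows of an array, which is exactly the drawback noted above, and uses i.i.d. Gaussian projection directions r_1,...,r_d:

```python
import numpy as np

def jl_features(phi_X, d, seed=0):
    """Random linear projection of explicit phi-space vectors down to R^d.
    phi_X: array of shape (n_samples, N) whose rows are phi(x).
    Feature i of x is phi(x)·r_i for a random Gaussian direction r_i."""
    rng = np.random.default_rng(seed)
    N = phi_X.shape[1]
    R = rng.normal(size=(N, d)) / np.sqrt(d)   # columns are r_1,...,r_d (scaled)
    return phi_X @ R
```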
3 methods (from simplest to best)
1. Draw d examples z_1,...,z_d from D. Use F(x) = (K(x,z_1),...,K(x,z_d)). [So "x_i" = K(x,z_i).] For d = (8/ε)[1/γ² + ln(1/δ)], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with a log dependence on 1/ε, rather than linear; so we can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.
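A sketch of method 1 (Mapping #1), treating K as a black box; K and landmarks are placeholder names, and any standard linear-separator learner can then be run on the transformed data:

```python
import numpy as np

def mapping1(K, landmarks):
    """Mapping #1: with z_1,...,z_d drawn from D, map x to F(x) = (K(x,z_1),...,K(x,z_d))."""
    def F(x):
        return np.array([K(x, z) for z in landmarks])
    return F

# usage sketch: draw d unlabeled points from D as landmarks, with
# d ≈ (8/eps) * (1/gamma**2 + log(1/delta)) per the bound above,
# then learn a linear separator on the vectors F(x) of the labeled examples.
```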
Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z_1,...,z_d from D for d ≥ (8/ε)[1/γ² + ln(1/δ)], whp (≥ 1−δ) there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
Proof: Let S = the examples drawn so far. Assume |w| = 1 and |φ(z)| = 1 for all z. Let w_in = proj(w, span(S)) and w_out = w − w_in. Say w_out is large if Pr_z( |w_out·φ(z)| ≥ γ/2 ) ≥ ε; else small.
If small, then we are done: w' = w_in. Else, the next z has probability at least ε of improving S: |w_out|² ← |w_out|² − (γ/2)². This can happen at most 4/γ² times. □
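The update step in the proof, written out (using the slide's assumptions |w| = 1 and |φ(z)| = 1): since w_out is orthogonal to span(S), only the component φ(z)_⊥ of φ(z) orthogonal to span(S) matters, and re-projecting w onto span(S ∪ {z}) gives

```latex
\[
  |w_{\mathrm{out}}^{\mathrm{new}}|^2
  \;=\; |w_{\mathrm{out}}|^2 - \left(\frac{w_{\mathrm{out}} \cdot \phi(z)}{|\phi(z)_{\perp}|}\right)^{\!2}
  \;\le\; |w_{\mathrm{out}}|^2 - \left(\frac{\gamma}{2}\right)^{\!2},
\]
% using |\phi(z)_\perp| \le |\phi(z)| = 1 and |w_{\mathrm{out}} \cdot \phi(z)| \ge \gamma/2.
% Since |w_{\mathrm{out}}|^2 \le |w|^2 = 1 at the start, at most 4/\gamma^2 such updates can occur.
```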
So....
If we draw z_1,...,z_d from D for d = (8/ε)[1/γ² + ln(1/δ)], then whp there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
So, for some w' = α_1 φ(z_1) + ... + α_d φ(z_d), Pr_{(x,ℓ)∈P}[ sign(w'·φ(x)) ≠ ℓ ] ≤ ε.
But notice that w'·φ(x) = α_1 K(x,z_1) + ... + α_d K(x,z_d).
⇒ The vector (α_1,...,α_d) is an ε-good separator in the feature space x_i = K(x,z_i).
But the margin is not preserved, because the lengths of the target and of the examples are not preserved.
How to preserve the margin? (Mapping #2)
We know ∃ w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2. So, given a new x, we just want to do an orthogonal projection of φ(x) into that span (this preserves dot products and decreases |φ(x)|, so it only increases the margin).
Run K(z_i,z_j) for all i,j = 1,...,d. Get the matrix M. Decompose M = UᵀU. (Mapping #2) = (Mapping #1) U⁻¹. □
Mapping #2, Details
Draw a set S = {z_1,...,z_d} of d = (8/ε)[1/γ² + ln(1/δ)] unlabeled examples from D.
Run K(x,y) for all x,y ∈ S; get the matrix M(S) = (K(z_i,z_j))_{z_i,z_j ∈ S}.
Place S into d-dimensional space based on K (or M(S)).
[Figure: the points z_1, z_2, z_3 of X mapped to F_2(z_1), F_2(z_2), F_2(z_3) in R^d, with K(z_i,z_j) = F_2(z_i)·F_2(z_j), e.g. K(z_1,z_1) = |F_2(z_1)|².]
Mapping #2, Details, cont.
What do we do with new points? Extend the embedding F_1 to all of X: consider F_2: X → R^d defined as follows: for x ∈ X, let F_2(x) ∈ R^d be the point of smallest length such that F_2(x)·F_2(z_i) = K(x,z_i) for all i ∈ {1,..., d}.
This mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z_1),...,φ(z_d)).
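A sketch of Mapping #2 in Python (illustrative; the small jitter term for numerical stability is an added assumption, not part of the slides). With M = UᵀU obtained from a Cholesky factorization, F_2(x) = U⁻ᵀ F_1(x) satisfies F_2(x)·F_2(z_i) = K(x,z_i) for every landmark, which is the smallest-length solution described above whenever M has full rank:

```python
import numpy as np

def mapping2(K, landmarks, jitter=1e-10):
    """Mapping #2: embed the landmarks via M = U^T U, then map any x to the point
    F2(x) in R^d with F2(x)·F2(z_i) = K(x, z_i) for all i (i.e., the orthogonal
    projection of phi(x) onto span(phi(z_1),...,phi(z_d)), written in coordinates)."""
    Z = list(landmarks)
    d = len(Z)
    M = np.array([[K(zi, zj) for zj in Z] for zi in Z])    # Gram matrix M(S)
    L = np.linalg.cholesky(M + jitter * np.eye(d))         # M ≈ L L^T, i.e. U = L^T
    def F2(x):
        k = np.array([K(x, z) for z in Z])                 # Mapping #1 features (K(x,z_1),...,K(x,z_d))
        return np.linalg.solve(L, k)                       # F2(x) = U^{-T} F1(x)
    return F2
```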
How to improve the dimension?
The current mapping (F_2) gives d = (8/ε)[1/γ² + ln(1/δ)]. Johnson-Lindenstrauss gives d_1 = O((1/γ²) log[1/(εδ)]). Nice because we can have d_1 ≪ 1/ε.
Answer: just combine the two... Run Mapping #2, then do a random projection down from that. This gives the desired dimension (# of features), though the sample complexity remains as in Mapping #2.
[Figure: the composed mappings — φ: X → R^N, F_2: X → R^d, and JL: R^d → R^{d_1}; the positive and negative examples remain linearly separable at each stage.]
Mapping #3
Do JL(Mapping #2(x)).
JL says: fix y, w. A random projection M down to a space of dimension O((1/γ²) log(1/δ')) will, with probability ≥ 1−δ', preserve the margin of y up to ±γ/4. Use δ' = εδ.
⇒ For all y, Pr_M[failure on y] < εδ, ⇒ Pr_{D,M}[failure on y] < εδ, ⇒ Pr_M[fail on prob mass ≥ ε] < δ.
So we get the desired dimension (# of features), though the sample complexity remains as in Mapping #2.
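A sketch of Mapping #3 (illustrative), composing Mapping #2 with a JL-style random projection from R^d down to R^{d_1}:

```python
import numpy as np

def mapping3(F2, d, d1, seed=0):
    """Mapping #3: F3(x) = JL(F2(x)), a random linear projection from R^d to R^{d1}."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, d1)) / np.sqrt(d1)   # random projection matrix
    def F3(x):
        return F2(x) @ R
    return F3
```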
Lower bound (on the necessity of access to D)
For an arbitrary black-box kernel K, one can't hope to convert to a small feature space without access to D.
Consider X = {0,1}^n, a random X' ⊂ X of size 2^{n/2}, and D = uniform over X'. Let c be an arbitrary function (so learning is hopeless). But we have this magic kernel K(x,y) = φ(x)·φ(y):
φ(x) = (1, 0) if x ∉ X'.
φ(x) = (−1/2, √3/2) if x ∈ X', c(x) = pos.
φ(x) = (−1/2, −√3/2) if x ∈ X', c(x) = neg.
P is separable with margin √3/2 in φ-space. But without access to D, all attempts at running K(x,y) will give an answer of 1.
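The construction above as a code sketch (the oracles in_Xprime and c are hypothetical stand-ins for the random set X' and the arbitrary target): any query K(x,y) on points outside X' returns 1, which is why a learner without samples from D gains nothing from the kernel.

```python
import math

def make_magic_kernel(in_Xprime, c):
    """Kernel from the lower-bound construction: phi(x) = (1, 0) if x not in X',
    (-1/2, +sqrt(3)/2) if x in X' and positive, (-1/2, -sqrt(3)/2) if negative."""
    def phi(x):
        if not in_Xprime(x):
            return (1.0, 0.0)
        sign = 1.0 if c(x) > 0 else -1.0
        return (-0.5, sign * math.sqrt(3.0) / 2.0)
    def K(x, y):
        (a1, a2), (b1, b2) = phi(x), phi(y)
        return a1 * b1 + a2 * b2
    return K
```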
Open Problems
For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog to JL, without needing access to D? Or can one at least reduce the sample complexity (use fewer accesses to D)?
Can one extend the results (e.g., Mapping #1: x ↦ [K(x,z_1),...,K(x,z_d)]) to more general similarity functions K? It is not exactly clear what the theorem statement would look like.