On Kernels, Margins, and Low-dimensional Mappings (or: Kernels versus Features)
Nina Balcan (CMU), Avrim Blum (CMU), Santosh Vempala (MIT)
Generic problem
Given a set of images, we want to learn a linear separator to distinguish men from women. Problem: the pixel representation is no good.
Old-style advice: Pick a better set of features! But this seems ad hoc, not scientific.
New-style advice: Use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping. Sounds more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dimensional space; don't pay for it if there exists a large-margin separator.
E.g., K(x,y) = (x·y + 1)^m. φ: (n-dim'l space) → (n^m-dim'l space).
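To make the "implicit mapping" concrete, here is a small sketch (not from the talk) verifying in Python that the degree-2 polynomial kernel equals a dot product over an explicit monomial feature map; the function name poly2_features and the coefficient choices are illustrative, one of several equivalent expansions.

```python
import numpy as np
from itertools import combinations

def poly2_features(x):
    """Explicit phi for the degree-2 polynomial kernel K(x,y) = (x.y + 1)^2.
    Features: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i<j), chosen so that
    poly2_features(x) . poly2_features(y) = (x.y + 1)^2."""
    feats = [1.0]
    feats += list(np.sqrt(2.0) * x)
    feats += [xi * xi for xi in x]
    feats += [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(feats)

x, y = np.random.default_rng(0).normal(size=(2, 4))
print(np.dot(poly2_features(x), poly2_features(y)))  # matches...
print((np.dot(x, y) + 1.0) ** 2)                     # ...the kernel value
```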
Main point of this work: we can view the new method as a way of conducting the old method.
Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
Claim: we can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D,c], then this is a good feature set [∃ an almost-as-good separator].
"You give me a kernel, I give you a set of features."
E.g., sample z_1,...,z_d from D. Given x, define x_i = K(x,z_i).
Implications:
Practical: an alternative to kernelizing the algorithm.
Conceptual: view the kernel as a (principled) way of doing feature generation; view it as a similarity function, rather than as the "magic power of an implicit high-dimensional space".
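A minimal sketch of this feature-generation idea in Python (the helper name empirical_kernel_features and the toy data are illustrative, not from the paper): treat the kernel as a black box and use K(x, z_i) against sampled landmark points as explicit coordinates.

```python
import numpy as np

def empirical_kernel_features(K, landmarks):
    """Map x to F(x) = (K(x, z_1), ..., K(x, z_d)) for landmark points
    z_1,...,z_d sampled from the input distribution D."""
    return lambda x: np.array([K(x, z) for z in landmarks])

# Toy usage with a polynomial kernel; the Gaussian data just stands in for D.
rng = np.random.default_rng(0)
K = lambda x, y: (np.dot(x, y) + 1.0) ** 3
data = rng.normal(size=(100, 5))
landmarks = data[rng.choice(len(data), size=20, replace=False)]
F = empirical_kernel_features(K, landmarks)
print(F(data[0]).shape)   # -> (20,): an explicit 20-dimensional feature vector
```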
Basic setup, definitions
Instance space X, distribution D, target c. Use P = (D,c).
K(x,y) = φ(x)·φ(y).
P is separable with margin γ in φ-space if ∃ w such that Pr_{(x,ℓ)∼P}[ℓ(w·φ(x)) < γ] = 0 (normalizing |w| = 1, |φ(x)| = 1).
[Figure: points of P = (D,c), positives and negatives separated by w with margin γ.]
Error ε at margin γ: replace the "0" with "ε".
Goal: use K to get a mapping to a low-dimensional space.
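A LaTeX rendering of the separability condition on this slide, under its normalizations (a transcription with the symbols γ, ε, φ, ℓ restored):

```latex
% P = (D,c) is separable with margin \gamma in \phi-space if
\exists\, w,\ |w| = 1,\ |\phi(x)| = 1:\qquad
  \Pr_{(x,\ell)\sim P}\bigl[\ell\,(w\cdot\phi(x)) < \gamma\bigr] = 0
% "error \varepsilon at margin \gamma": replace the 0 on the right-hand side by \varepsilon.
```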
Idea: the Johnson-Lindenstrauss lemma
If P is separable with margin γ in φ-space, then with probability 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε. [AV]
[Figure: P = (D,c) with positive and negative regions.]
If the random vectors are r_1, r_2, ..., r_d, then we can view these as features x_i = φ(x)·r_i.
Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
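As a sketch of the projection being described (illustrative only; scalings and constants vary across statements of the JL lemma), here is a random Gaussian projection whose coordinates are exactly the features x_i = φ(x)·r_i. Note it presumes explicit access to the φ-space vectors, which is exactly the obstacle raised above.

```python
import numpy as np

def random_projection(dim_in, dim_out, rng):
    """Draw r_1,...,r_d (d = dim_out) as Gaussian directions; the projected
    features of a vector v are v . r_i, rescaled by 1/sqrt(d)."""
    R = rng.normal(size=(dim_in, dim_out)) / np.sqrt(dim_out)
    return lambda v: v @ R

rng = np.random.default_rng(1)
project = random_projection(dim_in=10_000, dim_out=200, rng=rng)
v = rng.normal(size=10_000)          # stands in for phi(x)
print(project(v).shape)              # -> (200,)
```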
3 methods (from simplest to best)
1. Draw d examples z_1,...,z_d from D. Use F(x) = (K(x,z_1),..., K(x,z_d)). [So, "x_i" = K(x,z_i).] For d = (8/ε)[1/γ² + ln(1/δ)], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with a log dependence on 1/ε, rather than linear. So, we can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for a generic K, but it may be possible for natural K.
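For a feel of the dimensions involved, here is an illustrative calculator for the sample sizes quoted above (the JL bound is big-O, so the constant 1 below is an arbitrary choice made purely for illustration):

```python
import numpy as np

def d_mappings_1_and_2(eps, gamma, delta):
    """d = (8/eps) * (1/gamma^2 + ln(1/delta)), as on this slide."""
    return int(np.ceil((8.0 / eps) * (1.0 / gamma**2 + np.log(1.0 / delta))))

def d_jl(eps, gamma, delta):
    """O((1/gamma^2) * log(1/(eps*delta))); hidden constant set to 1 here."""
    return int(np.ceil((1.0 / gamma**2) * np.log(1.0 / (eps * delta))))

print(d_mappings_1_and_2(eps=0.05, gamma=0.1, delta=0.01))  # linear in 1/eps
print(d_jl(eps=0.05, gamma=0.1, delta=0.01))                # only log in 1/eps
```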
Actually, the argument is pretty easy... (though we did try a lot of things first that didn’t work...)
Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z_1,...,z_d ∈ D for d ≥ (8/ε)[1/γ² + ln(1/δ)], whp (1-δ) there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
Proof: Let S = examples drawn so far. Assume |w| = 1, |φ(z)| = 1 ∀ z.
w_in = proj(w, span(S)), w_out = w − w_in.
Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
If small, then done: w' = w_in.
Else, the next z has probability at least ε of improving S: |w_out|² ← |w_out|² − (γ/2)².
This can happen at most 4/γ² times. ∎
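A worked restatement of the potential argument in this proof (a sketch following the slide; the bookkeeping behind the ln(1/δ) term is only indicated):

```latex
% Each time the drawn z satisfies |w_{\mathrm{out}}\cdot\phi(z)| \ge \gamma/2,
% adding z to S shrinks the part of w lying outside the span:
\|w_{\mathrm{out}}\|^2 \;\leftarrow\; \|w_{\mathrm{out}}\|^2 - (\gamma/2)^2 .
% Since \|w_{\mathrm{out}}\|^2 \le \|w\|^2 = 1, this can happen at most
4/\gamma^2 \ \text{times},
% and while w_{\mathrm{out}} is "large" each fresh z triggers a shrink with
% probability \ge \varepsilon, so d = (8/\varepsilon)[1/\gamma^2 + \ln(1/\delta)]
% draws suffice with probability 1-\delta.
```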
So...
If we draw z_1,...,z_d ∈ D for d = (8/ε)[1/γ² + ln(1/δ)], then whp there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
So, for some w' = α_1 φ(z_1) + ... + α_d φ(z_d), Pr_{(x,ℓ)∈P}[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
But notice that w'·φ(x) = α_1 K(x,z_1) + ... + α_d K(x,z_d).
⇒ the vector (α_1,...,α_d) is an ε-good separator in the feature space x_i = K(x,z_i).
But the margin is not preserved, because of the lengths of the target and the examples.
How to preserve the margin? (mapping #2)
We know ∃ w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
So, given a new x, just do an orthogonal projection of φ(x) into that span. (This preserves the dot product and decreases |φ(x)|, so it only increases the margin.)
Run K(z_i,z_j) for all i,j = 1,...,d. Get the matrix M. Decompose M = U^T U.
(Mapping #2) = (mapping #1) U^{-1}.
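A minimal NumPy sketch of this projection step; the Cholesky factorization and the small jitter regularizer are implementation choices for illustration, not part of the talk.

```python
import numpy as np

def mapping2(K, landmarks, jitter=1e-10):
    """Project phi(x) orthogonally onto span(phi(z_1),...,phi(z_d)) using
    only kernel evaluations: build M[i,j] = K(z_i, z_j), factor M = U^T U,
    and map x to (mapping #1 features of x) @ U^{-1}."""
    d = len(landmarks)
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    # Cholesky gives lower-triangular L with M = L L^T; take U = L^T so M = U^T U.
    U = np.linalg.cholesky(M + jitter * np.eye(d)).T
    def F2(x):
        f1 = np.array([K(x, z) for z in landmarks])   # mapping #1
        return np.linalg.solve(U.T, f1)               # equals f1 @ inv(U)
    return F2
```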
How to improve the dimension?
The current mapping gives d = (8/ε)[1/γ² + ln(1/δ)]. Johnson-Lindenstrauss gives d = O((1/γ²) log[1/(εδ)]).
JL is nice because we can have ε ≪ 1/d. Good if the algorithm wants the data to be perfectly separable. (Learning a separator of margin γ can be done in time poly(1/γ), but if no perfect separator exists, minimizing error is NP-hard.)
Answer: just combine the two...
[Figure: linearly separable X's and O's in R^N are mapped by F_1 into R^d, then by a JL projection into R^{d_1}, remaining linearly separable.]
Mapping #3
Do JL(mapping2(x)).
JL says: fix y, w. A random projection M down to a space of dimension O((1/γ²) log(1/δ')) will, with probability (1-δ'), preserve the margin of y up to ±γ/4.
Use δ' = εδ.
⇒ For all y, Pr_M[failure on y] < εδ,
⇒ Pr_{D,M}[failure on y] < εδ,
⇒ Pr_M[failure on more than an ε fraction of the probability mass] < δ.
So, we get the desired dimension (# of features), though the sample complexity remains as in mapping #2.
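A self-contained sketch combining the two steps (the jitter term and the 1/sqrt(dim_out) scaling are illustrative implementation choices):

```python
import numpy as np

def mapping3(K, landmarks, dim_out, rng, jitter=1e-10):
    """Mapping #3 (sketch): mapping #2 followed by a JL-style random
    projection, so the final number of features is dim_out rather than
    d = len(landmarks)."""
    d = len(landmarks)
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    U = np.linalg.cholesky(M + jitter * np.eye(d)).T     # M = U^T U
    R = rng.normal(size=(d, dim_out)) / np.sqrt(dim_out)  # JL directions
    def F3(x):
        f1 = np.array([K(x, z) for z in landmarks])       # mapping #1
        f2 = np.linalg.solve(U.T, f1)                      # mapping #2
        return f2 @ R                                      # mapping #3
    return F3
```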
Lower bound (on the necessity of access to D)
For an arbitrary black-box kernel K, we can't hope to convert to a small feature space without access to D.
Consider X = {0,1}^n, a random X' ⊂ X of size 2^{n/2}, and D = uniform over X'. c = an arbitrary function (so learning is hopeless).
But we have this magic kernel K(x,y) = φ(x)·φ(y):
φ(x) = (1, 0) if x ∉ X'.
φ(x) = (-1/2, √3/2) if x ∈ X', c(x) = pos.
φ(x) = (-1/2, -√3/2) if x ∈ X', c(x) = neg.
P is separable with margin √3/2 in φ-space. But, without access to D, all attempts at running K(x,y) will give an answer of 1.
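A toy sketch of this construction in code (the set X' and the labeling c below are stand-ins, and the names are illustrative): unless both query points happen to land in X', which a D-oblivious algorithm essentially never achieves when X' is a random 2^{n/2}-subset of 2^n points, every kernel evaluation returns 1.

```python
import numpy as np

def make_lower_bound_kernel(X_prime, c):
    """K(x,y) = phi(x) . phi(y) with
       phi(x) = (1, 0)              if x not in X'
       phi(x) = (-1/2, +sqrt(3)/2)  if x in X' and c(x) = +1
       phi(x) = (-1/2, -sqrt(3)/2)  if x in X' and c(x) = -1."""
    s = np.sqrt(3.0) / 2.0
    def phi(x):
        if x not in X_prime:
            return np.array([1.0, 0.0])
        return np.array([-0.5, s if c(x) > 0 else -s])
    return lambda x, y: float(np.dot(phi(x), phi(y)))

# Toy check on n = 4 bits.
X_prime = {(0, 1, 1, 0), (1, 0, 0, 1)}
c = lambda x: +1 if sum(x) % 2 == 0 else -1
K = make_lower_bound_kernel(X_prime, c)
print(K((0, 0, 0, 0), (1, 1, 1, 1)))   # 1.0 -- looks uninformative
print(K((0, 1, 1, 0), (1, 0, 0, 1)))   # same label here, so 1.0; opposite labels would give -0.5
```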
Open problems
For specific, natural kernels, like K(x,y) = (1 + x·y)^m:
Is there an efficient (probability distribution over) mapping that is good for any P = (c,D) for which the kernel is good? I.e., an efficient analog of JL for these kernels.
Or, at least, can these mappings be constructed with less sample complexity (fewer accesses to D)?