1
Learning intersections and thresholds of halfspaces Adam Klivans (MIT/Harvard) Ryan O’Donnell (MIT) Rocco Servedio (Harvard)
2
Learning
We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means:
- a function f in C is selected, and also a probability distribution D over {+1,−1}^n
- the learning algorithm gets access to random labeled examples (x, f(x)), where the x’s are drawn from D
- goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.
3
Learning example Example: C is the class of all conjunctions of variables. Perhaps the concept selected is x_1 AND x_2 AND x_4. One might then see a handful of labeled examples; what is a learning algorithm for this class?
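The standard answer is the elimination algorithm: start with all 2n literals and delete every literal falsified by a positive example. Below is a minimal sketch I am adding for concreteness; the function name `learn_conjunction` and the toy examples are my own illustration, not from the slides.

```python
# Elimination algorithm for learning conjunctions over {+1,-1}^n.
# Examples are pairs (x, label) with x a +/-1 tuple and label in {+1,-1}.
def learn_conjunction(examples, n):
    # Candidate literals: (i, +1) means "x_i must be +1", (i, -1) means "x_i must be -1".
    literals = {(i, s) for i in range(n) for s in (+1, -1)}
    for x, label in examples:
        if label == +1:
            # A positive example eliminates every literal it falsifies.
            literals = {(i, s) for (i, s) in literals if x[i] == s}
    # Hypothesis: x is positive iff it satisfies every surviving literal.
    return lambda x: +1 if all(x[i] == s for (i, s) in literals) else -1

# Toy run: target concept is x_1 AND x_2 AND x_4 (positions 0, 1, 3 when 0-indexed).
examples = [((+1, +1, -1, +1, -1), +1),
            ((+1, -1, +1, +1, +1), -1),
            ((+1, +1, +1, +1, -1), +1)]
h = learn_conjunction(examples, n=5)
print(h((+1, +1, -1, +1, -1)))   # +1: satisfies every surviving literal
```

With more positive examples, the surviving literals shrink toward exactly those of the target conjunction.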
4
Halfspaces Let h be a hyperplane in R^n: h = {x : ∑_{i=1}^{n} w_i x_i = θ}. h naturally induces a boolean function f : {+1,−1}^n → {+1,−1}, f(x) = sgn(∑_{i=1}^{n} w_i x_i − θ). We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (w_i ≡ 1, θ = 0).
5
Learning halfspaces Learning halfspaces is a very old problem; dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62]. The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89]. Indeed, this works over any distribution on R^n, including those singling out {+1,−1}^n.
6
Learning halfspaces Basic idea: given a bunch of examples, find a halfspace which classifies them correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm. Consider the coefficients of a hypothesis halfspace to be unknowns a_1, …, a_n, θ. Each example induces a linear constraint: e.g., the positive example (+1,+1,−1,+1,−1,−1) induces a_1 + a_2 − a_3 + a_4 − a_5 − a_6 > θ. Solve the LP.
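Here is a hedged sketch of that LP (my own illustration, not the authors’ code): the helper `fit_halfspace`, the margin of 1 standing in for the strict inequality, and the toy data are all assumptions of the sketch; it uses scipy’s `linprog`.

```python
# Find a halfspace sgn(<a, x> - theta) consistent with labeled +/-1 examples via an LP.
import numpy as np
from scipy.optimize import linprog

def fit_halfspace(X, y):
    """X: m x n matrix of +/-1 examples; y: length-m vector of +/-1 labels."""
    m, n = X.shape
    # Unknowns: (a_1, ..., a_n, theta).  For each example require
    #   y_i * (<a, x_i> - theta) >= 1   <=>   -y_i*<a, x_i> + y_i*theta <= -1.
    A_ub = np.hstack([-y[:, None] * X, y[:, None]])
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1))   # free variables
    return (res.x[:n], res.x[n]) if res.success else None

# Toy data labeled by the target halfspace sgn(2*x_1 + x_2 - x_3).
X = np.array([[+1, +1, +1], [+1, -1, -1], [-1, -1, -1], [-1, -1, +1], [-1, +1, +1]])
y = np.sign(X @ np.array([2, 1, -1]))
a, theta = fit_halfspace(X, y)
print(np.sign(X @ a - theta))   # matches y on the training examples
```

Feasibility is all that matters here, so the objective is identically zero; any feasible point is a consistent hypothesis, and Occam’s Razor does the rest.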
7
Learning intersections of halfspaces The next logical extension of this, and a very important one, is learning intersections of halfspaces. Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas… Learning them is also an important problem for computer vision and the study of perceptrons. But very little is known.
8
Prior work
- [Baum91]: poly time algorithm for an intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).
- [BlumKannan, Vempala97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere; not relevant for boolean halfspaces.
- [KwekPitt98]: gave a polynomial time alg., but requires membership queries; also not relevant for boolean halfspaces.
9
Our results Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}^n is learnable under the uniform distribution to accuracy 1−ε in time n^{O(k²/ε²)}. This is polynomial time if k = O(1), ε = Ω(1). (Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)
10
Our results Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time n^{O(k log k log W)}/ε. So if the weights are polynomially bounded, one can learn an intersection of logarithmically many halfspaces in quasipolynomial time.
11
More results
| Function | Halfspaces | Distrib. | Time |
| any fcn. of k | weight W | any | n^{O(k² log k log W)}/ε |
| weight-k threshold (e.g., intersection of k) | weight W | any | n^{O(k log k log W)}/ε |
| intersection of k | weight W | any | n^{O(√W log k)}/ε |
| read-once intersection of k | arbitrary | uniform | n^{O((log(k)/ε)²)} |
| read-once majority of k | arbitrary | uniform | n^{Õ((log(k)/ε)⁴)} |
12
Sketch of techniques For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations. For uniform distribution results: show that functions of halfspaces have low noise sensitivity. Both conclusions imply learning results generically.
13
Talk outline
Plan for the rest of the talk:
1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch other arbitrary-distribution results.)
2. Prove the n^{O(k²/ε²)} bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch other uniform-distribution results.)
14
Polynomial threshold functions
A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f: f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.
- every boolean halfspace is a degree-1 PTF for itself
- every boolean function has a degree-n PTF
By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^{O(d)}/ε.
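To make this reduction concrete, here is a minimal sketch I am adding (my illustration of the [KS01]-style idea, not the paper’s algorithm as stated): expand each example into all monomials of degree ≤ d and run the same consistency LP over the expanded features. The names `expand`, `fit_ptf`, and the XOR toy case are assumptions of the sketch.

```python
# Learn a class with degree-d PTFs: lift examples to monomial features, then LP.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def expand(X, d):
    """Map each +/-1 example to its monomials x_S = prod_{i in S} x_i for |S| <= d."""
    n = X.shape[1]
    monomials = [S for r in range(d + 1) for S in combinations(range(n), r)]
    return np.array([[np.prod(x[list(S)]) for S in monomials] for x in X])

def fit_ptf(X, y, d):
    Z = expand(X, d)                      # n^{O(d)} features
    m, N = Z.shape
    # Find coefficients c with y_i * <c, z_i> >= 1 for every lifted example.
    res = linprog(c=np.zeros(N), A_ub=-y[:, None] * Z, b_ub=-np.ones(m),
                  bounds=[(None, None)] * N)
    return (lambda x: np.sign(expand(x[None, :], d)[0] @ res.x)) if res.success else None

# Toy case: XOR of 2 bits has no degree-1 PTF, but -x_1*x_2 is a degree-2 PTF for it.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])
y = np.array([-1, +1, +1, -1], dtype=float)
h = fit_ptf(X, y, d=2)
print([h(x) for x in X])   # recovers the labels
```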
15
PTFs for intersections of halfspaces
Suppose f and g are hyperplanes, f(x) = ∑ w_i x_i − θ, g(x) = ∑ w_i′ x_i − θ′. We would like a PTF for sgn(f) ∧ sgn(g).
Failed attempt 1: try f(x)·g(x): it is >0 if f(x)>0 and g(x)>0, but it is also >0 if f(x)<0 and g(x)<0.
Failed attempt 2: try f(x)+g(x): it is >0 if f(x)>0 and g(x)>0, but it can also be >0 when f(x)>0 and g(x)<0 (if f(x) is large enough).
16
PTFs for intersections of halfspaces The solution: apply a (polynomial?) function to f and g to make them look more like their sign. Assume the w_i are integers with ∑|w_i| < W. Then for all x ∈ {+1,−1}^n, f(x), g(x) ∈ [−W,−1] ∪ [1,W]. Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].
17
BRS’s sgn-approximator
p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)²
Q(x) = (p(−x) − p(x)) / (p(−x) + p(x))
Q is a rational function of degree O(log k log W) such that:
Q(x) ∈ [1, 1+1/k] for x ∈ [1,W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W,−1].
18
PTFs for intersections of halfspaces Now given weight-W halfspaces h_1, …, h_k, sgn(Q(h_1(x)) + … + Q(h_k(x)) − (k−½)) is a rational function which sign-represents the intersection. Once taken to a common denominator, it has degree O(k log k log W). Easy to get a polynomial: sgn(p/q) = sgn(pq). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time n^{O(k log k log W)}.
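To see the construction work on a tiny case, here is a numerical check I am adding (not from the talk): two hypothetical halfspaces on 5 bits with odd integer weights and θ = 0, so every value h_i(x) lies in [−W,−1] ∪ [1,W] with W = 32, combined via the p and Q from the previous slide with threshold k − ½ = 3/2.

```python
# Check that sgn(Q(h1) + Q(h2) - 3/2) equals the intersection of two halfspaces.
from itertools import product

def p(x):
    # The univariate polynomial from the previous slide (W = 32).
    return (x - 1) * (x - 2)**2 * (x - 4)**2 * (x - 8)**2 * (x - 16)**2 * (x - 32)**2

def Q(x):
    # BRS-style rational sgn-approximator.
    return (p(-x) - p(x)) / (p(-x) + p(x))

def sgn(v):
    return 1 if v > 0 else -1

# Two toy weight-bounded halfspaces h_i(x) = sum_j w_ij x_j (odd weights, so h_i(x) != 0).
w1, w2 = [3, 1, 1, 1, 1], [1, -1, 1, 3, -1]

for x in product([+1, -1], repeat=5):
    h1 = sum(w * xi for w, xi in zip(w1, x))
    h2 = sum(w * xi for w, xi in zip(w2, x))
    truth = 1 if (sgn(h1) == 1 and sgn(h2) == 1) else -1   # the intersection (AND)
    ptf = sgn(Q(h1) + Q(h2) - 1.5)                          # k - 1/2 with k = 2
    assert ptf == truth, (x, h1, h2)
print("sgn(Q(h1) + Q(h2) - 3/2) agrees with the intersection on all 32 inputs")
```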
19
Talk outline
Plan for the talk:
1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.
20
Noise sensitivity Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε. defn: The noise sensitivity of f is: NS_ε(f) = Pr[f(x) ≠ f(y)].
21
Noise sensitivity examples Let f be a projection to one bit, f(x_1, …, x_n) = x_1. Then NS_ε(f) = ε. Suppose f depends on only k bits. Then NS_ε(f) ≤ kε. PARITY is the most noise-sensitive function: NS_ε(PARITY_n) = ½ − ½(1−2ε)^n.
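These examples are easy to check empirically. Here is a quick Monte Carlo estimator I am adding (the function `noise_sensitivity` and the parameters are illustrative, not from the talk), compared against the PARITY formula above.

```python
# Estimate NS_eps(f): sample x uniformly, flip each bit with probability eps, compare.
import random

def noise_sensitivity(f, n, eps, trials=200_000):
    disagree = 0
    for _ in range(trials):
        x = [random.choice([+1, -1]) for _ in range(n)]
        y = [-xi if random.random() < eps else xi for xi in x]
        disagree += (f(x) != f(y))
    return disagree / trials

parity = lambda x: 1 if x.count(-1) % 2 == 0 else -1
n, eps = 10, 0.1
print(noise_sensitivity(parity, n, eps))     # Monte Carlo estimate
print(0.5 - 0.5 * (1 - 2 * eps) ** n)        # closed form from the slide, ~0.446
```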
22
Noise sensitivity – study and apps.
- [Benjamini-Kalai-Schramm-98]: percolation, low-level circuit complexity
- [Kahn-Kalai-Linial-88]: random walks on the hypercube
- [Håstad-97]: probabilistically checkable proofs
- [Bshouty-Jackson-Tamon-99]: learning theory under noise
- [O-02]: Yao’s XOR Lemma, average case hardness of NP
- [Bourgain-02, Kindler-Safra-02, FKRSS-02]: study of juntas, Fourier analysis of boolean functions
23
Low noise sensitivity ⇒ fast learning We show that if the noise sensitivity of all f in C is uniformly bounded: NS_ε(f) ≤ α(ε), then C is learnable under the uniform distribution in time n^{O(1)/α^{−1}(ε/3)}. Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.
24
Proof of NS-learning connection
Actually, the intuition is wrong. Here is the proper proof sketch:
Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NS_ε(f) = ½ − ½ ∑_S (1−2ε)^{|S|} f̂(S)² and a Markov-style inequality.
Low-level Fourier concentration ⇒ efficient uniform distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
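Here is a small brute-force check of that formula I am adding (my own sketch, on MAJ_3; all names are illustrative): compute every Fourier coefficient directly and compare with the exact noise sensitivity.

```python
# Verify NS_eps(f) = 1/2 - 1/2 * sum_S (1-2*eps)^{|S|} * fhat(S)^2 for MAJ_3.
from itertools import product, combinations
from math import prod

n, eps = 3, 0.1
maj = lambda x: 1 if sum(x) > 0 else -1
cube = list(product([+1, -1], repeat=n))

def fhat(S):
    # Fourier coefficient fhat(S) = E_x[ f(x) * prod_{i in S} x_i ].
    return sum(maj(x) * prod(x[i] for i in S) for x in cube) / len(cube)

rhs = 0.5 - 0.5 * sum((1 - 2 * eps) ** len(S) * fhat(S) ** 2
                      for r in range(n + 1) for S in combinations(range(n), r))

# Exact NS_eps: average over x and over every flip pattern with its probability.
lhs = 0.0
for x in cube:
    for flips in product([0, 1], repeat=n):
        y = tuple(-xi if b else xi for xi, b in zip(x, flips))
        weight = prod(eps if b else 1 - eps for b in flips) / len(cube)
        lhs += weight * (maj(x) != maj(y))

print(lhs, rhs)   # the two quantities agree up to floating point
```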
25
Noise sensitivity of halfspaces
| Function | NS_ε | proof |
| one boolean halfspace | O(√ε) | Y. Peres, ’98 |
| any function of k halfspaces | O(k√ε) | union bound |
| read-once intersection of k halfspaces | O(√ε log k) | difficult probabilistic analysis |
| read-once majority of k halfspaces | Õ((ε log k)^¼) | |
26
Consequences Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NS_ε(f) ≤ α(ε). α^{−1}(ε/3) = O(ε²/k²). Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^{O(k²/ε²)}.
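Spelling out the inversion, with an explicit constant c standing in for the O(·): setting c·k·√γ = ε/3 and solving for γ gives γ = α^{−1}(ε/3) = ε²/(9c²k²) = O(ε²/k²), so the generic running time n^{O(1)/α^{−1}(ε/3)} from the earlier slide becomes n^{O(k²/ε²)}.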
27
Noise sensitivity of a halfspace We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε). Suppose the halfspace is f = sgn(∑ w_i x_i − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the x_i’s are selected randomly from {+1,−1} and the sum is formed; then each x_i is flipped independently with probability ε. We want to show that the probability that the two sums (before and after the flips) land on opposite sides of 0 – call this a “flop”, with probability P – is O(√ε).
28
Noise sensitivity of a halfspace With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.) We now model the problem thus: Pick signs x_i at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X_1 = ∑_{i=1…k} w_i x_i, X_2 = ∑_{i=k+1…2k} w_i x_i, etc.
29
Noise sensitivity of a halfspace Write S = X_1 + … + X_{n/k} for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S′ = S − X_1, so the sum before flipping is S′ + X_1, and the sum after flipping is S′ − X_1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S′| < |X_1|.
30
Noise sensitivity of a halfspace
sgn(X_1) and S′ are independent, so: Pr[sgn(X_1) ≠ sgn(S′)] = ½.
sgn(X_1) and |X_1| are independent, so: Pr[sgn(X_1) ≠ sgn(S′) | |S′| > |X_1|] = ½.
If |S′| > |X_1| then sgn(S) = sgn(S′), so: Pr[sgn(X_1) ≠ sgn(S) | |S′| > |X_1|] = ½.
Equivalently: Pr[sgn(X_1) ≠ sgn(S) & no flop] = ½(1−P).
On a flop, |S′| < |X_1|, so sgn(S) = sgn(X_1); hence: Pr[sgn(X_1) ≠ sgn(S)] = ½(1−P).
Rearranging: P = 2 E[½ − I[sgn(X_1) ≠ sgn(S)]].
31
Noise sensitivity of a halfspace Of course, there was nothing special about block X_1 as opposed to any other block. So in fact, P = 2 E[½ − I[sgn(X_i) ≠ sgn(S)]] for all i = 1…n/k. Write τ = sgn(S), σ_i = sgn(X_i), and average over i: P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]].
32
Noise sensitivity of a halfspace P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]]. The quantity inside the expectation is some random variable, a number which is either ½ − (k/n) ∑_i I[1 ≠ σ_i] or ½ − (k/n) ∑_i I[−1 ≠ σ_i], depending on whether τ = +1 or τ = −1. If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise: P ≤ 2 E[|½ − (k/n) ∑_i I[σ_i = 1]| + |½ − (k/n) ∑_i I[σ_i = −1]|].
33
Noise sensitivity of a halfspace P ≤ 2 E[ |½ − ε ∑_{i=1…1/ε} I[σ_i = 1]| + |½ − ε ∑_{i=1…1/ε} I[σ_i = −1]| ] (recall k/n = ε and there are n/k = 1/ε blocks). But the σ_i’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable – i.e., O(√ε).
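As a sanity check of the √ε scaling (my own addition, reusing the same sampling idea as the earlier estimator; `ns_majority` and the parameters are illustrative), one can estimate NS_ε of the majority function for several ε and watch NS_ε/√ε stay roughly constant.

```python
# Empirically, NS_eps(MAJ_n) / sqrt(eps) is roughly constant as eps varies.
import math, random

def ns_majority(n, eps, trials=50_000):
    disagree = 0
    for _ in range(trials):
        x = [random.choice([+1, -1]) for _ in range(n)]
        y = [-xi if random.random() < eps else xi for xi in x]
        disagree += (sum(x) > 0) != (sum(y) > 0)
    return disagree / trials

n = 51   # odd, so the sum is never 0
for eps in (0.01, 0.04, 0.16):
    ns = ns_majority(n, eps)
    print(f"eps={eps:.2f}  NS={ns:.3f}  NS/sqrt(eps)={ns / math.sqrt(eps):.2f}")
```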
34
Extensions This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows. To get the extended learning algorithms, one must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased – say, the probability of +1 is p < ½ – then: NS_ε(h) ≤ min{2p, C p (ε log(1/p))^{½}}.
35
Talk outline
Plan for the talk:
1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.
36
Open technical challenges
- Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!)
- Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^½)?
37
The huge open problem It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!