1
Yi Wu (CMU). Joint work with Vitaly Feldman (IBM), Venkat Guruswami (CMU), and Prasad Raghavendra (MSR).
2
Overview Introduction Main Result Proof Idea Conclusion
3
Introduction
4
Motivating Example: The Spam Problem. [Table of example emails with attributes "10 Million", "Lottery", "Cheap", "Pharmacy", "Junk" and a label "SPAM" / "NOT SPAM".] Learning: use the data seen so far to generate rules for future prediction.
5
The General Learning Framework. There is an unknown probability distribution D over {0,1}^n (0: no, 1: yes); examples drawn from D are labeled by an unknown function f: {0,1}^n -> {+,-}. [Figure: labeled examples drawn from D.] After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) ≠ f(x)].
6
What does learnable mean? Performance: the learning algorithm outputs a high-accuracy hypothesis with high probability. Efficiency: the algorithm has polynomial running time. This is called the PAC learning model.
7
Concept Class. If the target function f can be arbitrary, we have no way of learning it without seeing all the examples. So we assume that f comes from some simple concept (function) class, such as conjunctions, halfspaces, decision trees, decision lists, low-degree polynomials, neural networks, etc.
8
Learning a Concept Class C. There is an unknown distribution D over {0,1}^n; examples from D are labeled by an unknown function f: {0,1}^n -> {+,-} in the concept class C. [Figure: labeled examples drawn from D.] After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) ≠ f(x)].
9
Conjunctions (Monomials). The Spam Problem: a conjunction rule such as "10 Million = yes" AND "Lottery = yes" AND "Pharmacy = yes". [Same example table as before.]
10
Halfspaces (Linear Threshold Functions). The Spam Problem: a halfspace rule such as sign("10 Million = yes" + 2·"Lottery = yes" + "Pharmacy = yes" − 3.5). [Same example table as before.]
11
Relationship: Halfspaces and Conjunctions. (x_1 AND x_2 AND ... AND x_n) = sgn(x_1 + x_2 + ... + x_n − n + 0.5).
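A quick brute-force check of the identity above, written as a minimal Python sketch (not part of the slides): it verifies that the conjunction x_1 AND ... AND x_n equals sgn(x_1 + ... + x_n − n + 0.5) on every 0/1 input for a small n.

from itertools import product

n = 4
for x in product((0, 1), repeat=n):
    conj = all(x)                          # x_1 AND x_2 AND ... AND x_n
    half = (sum(x) - n + 0.5) > 0          # sgn(...) is "+" exactly when the sum exceeds n - 0.5
    assert conj == half
print("identity holds on all", 2 ** n, "inputs")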
12
How do we learn the concept class of conjunctions? There is an unknown distribution D over {0,1}^n; examples from D are labeled by an unknown conjunction f: {0,1}^n -> {0,1}. Algorithm: 1. Draw some examples. 2. Use linear programming to find a halfspace consistent with all the examples. By well-known theory (VC dimension), for any D a random sample of O(n/ε) examples yields a (1−ε)-accurate hypothesis with high probability.
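A minimal sketch of step 2, assuming SciPy is available (not code from the talk): a feasibility LP that finds a halfspace consistent with perfectly labeled examples. The function name and the margin-1 formulation are illustrative choices; for data perfectly labeled by a conjunction, a strictly separating halfspace exists, so this LP is feasible after scaling.

import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """X: (m, n) 0/1 example matrix; y: (m,) labels in {+1, -1}.
    Returns (w, b) with y_i * (w . x_i + b) >= 1 for all i, or None if no
    consistent halfspace exists (e.g. when the labels are noisy)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    c = np.zeros(n + 1)                               # feasibility only: objective 0
    # y_i (w . x_i + b) >= 1   <=>   -y_i (w . x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.success else None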
13
Learning conjunctions from perfectly labeled random examples is easy! ...but not very realistic. Real-world data probably doesn't come with a guarantee that examples are labeled perfectly according to a conjunction. Linear programming is brittle: noisy examples can easily result in there being no consistent hypothesis. This motivates the study of noisy variants of learning conjunctions.
14
Learning Conjunctions under Noise. There is an unknown distribution D over {0,1}^n examples, and there is some conjunction with (1−ε) accuracy. Goal: find a hypothesis with good accuracy (as good as 1−ε? or just better than 50%?). This is also called the "agnostic" noise model.
15
Another interpretation of the noise model: there is an unknown distribution D over {0,1}^n, and examples from D are perfectly labeled by an unknown conjunction f: {0,1}^n -> {+,-}. After the examples are received, an ε fraction of them is corrupted. If only an ε fraction of the data is corrupted, can we still find a good hypothesis?
16
Previous Work (Positive). No noise: conjunctions are learnable [Val84, BHW87, Lit88, Riv87]. Random noise: conjunctions are learnable with random noise [Kea93].
17
Previous Work (Negative) [Fel06, FGKP09]: for any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a (1−ε) fraction of the data, or (ii) no conjunction is (1/2+ε)-consistent with the data. In other words, it is NP-hard to learn a 51%-accurate conjunction even if there exists a conjunction consistent with 99% of the examples.
18
Weakness of the Previous Result. We might still be able to learn conjunctions by outputting a hypothesis from a larger class of functions. E.g., [Lit88] uses the Winnow algorithm, which outputs a halfspace, and the linear programming approach above also outputs a halfspace.
19
Main Result
20
For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a (1−ε) fraction of the data, or (ii) no halfspace is (1/2+ε)-consistent with the data. In other words, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.
21
Why halfspaces? Halfspaces are at the heart of many learning algorithms, both in practice and in computational learning theory: Perceptron, Winnow, SVM (without a kernel), any linear classifier, etc. Consequence: we cannot learn conjunctions under even a little bit of noise using any of the above algorithms!
22
If we are learning-algorithm designers: to obtain an efficient halfspace-based learning algorithm for conjunctions, we need either to restrict the distribution of the examples or to limit the noise.
23
Proof Idea
24
First Simplification: learning conjunctions = learning disjunctions. Why? Notice that ¬(x_1 AND x_2 AND ... AND x_n) = (¬x_1 OR ¬x_2 OR ... OR ¬x_n). So if we have a good algorithm for learning disjunctions, we can apply it to the example-label pairs (¬x, ¬f(x)).
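A tiny sketch of this simplification (hypothetical function names, not code from the talk): a conjunction learner built from any disjunction learner by negating both the examples and the labels.

def learn_conjunction(examples, learn_disjunction):
    """examples: list of (x, y) with x a 0/1 tuple and y in {'+', '-'};
    learn_disjunction: any routine returning a hypothesis on 0/1 tuples."""
    negate = lambda x: tuple(1 - b for b in x)
    flipped = [(negate(x), '-' if y == '+' else '+') for x, y in examples]
    h = learn_disjunction(flipped)                    # hypothesis for the negated data
    return lambda x: '-' if h(negate(x)) == '+' else '+'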
25
We will prove a simpler theorem: it is NP-hard to tell whether (i) there exists a disjunction consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data. In other words, it is NP-hard to learn a 60/88-accurate halfspace with threshold 0 even if there exists a disjunction consistent with 61/88 of the examples.
26
Halfspace with threshold 0: f(x) = sgn(w_1 x_1 + w_2 x_2 + ... + w_n x_n). Assuming sgn(0) = "−", the disjunction x_1 OR x_2 OR ... OR x_n is the threshold-0 halfspace sgn(x_1 + x_2 + ... + x_n).
27
Q: How can we prove a problem is hard? A: Reduction from a known hard problem.
28
Reduction from the Max Cut problem. Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Example graph on vertices 1, 2, 3, 4.]
29
Reduction from the Max Cut problem. Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Example graph on vertices 1, 2, 3, 4: this partition has Cut = 2.]
30
Reduction from the Max Cut problem. Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Example graph on vertices 1, 2, 3, 4: this partition has Cut = 3.]
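A small brute-force illustration of the Max Cut objective (not from the slides): enumerate all vertex partitions and count crossing edges. The edge set below is hypothetical, since the exact edges of the 4-vertex example graph are not recoverable from the transcript.

from itertools import product

def max_cut(n, edges):
    best = 0
    for side in product((0, 1), repeat=n):            # side[i-1] = part containing vertex i
        crossing = sum(side[i - 1] != side[j - 1] for i, j in edges)
        best = max(best, crossing)
    return best

print(max_cut(4, [(1, 2), (2, 3), (3, 4), (1, 3)]))   # hypothetical edge set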
31
Starting Point of the Reduction. The following is a theorem from [Has01]. Theorem: Given a graph G(V, E), let Opt(G) = (size of the maximum cut) / (number of edges). It is NP-hard to tell apart the following two cases: 1) Opt(G) > 17/22; 2) Opt(G) < 16/22.
32
The reduction: a polynomial-time reduction maps a graph G to a distribution on labeled examples, e.g. (0,1,0,1,1,0): +, (1,1,1,0,1,0): −, (0,1,1,1,1,0): −, (0,1,1,0,1,1): −, (0,1,1,1,1,0): −. Finding a good cut corresponds to finding a good hypothesis.
33
Desired Properties of the Reduction. If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis => Good Cut).
34
With such a reduction: we know it is NP-hard to tell apart the two cases 1) Opt(G) > 17/22 and 2) Opt(G) < 16/22. Therefore it is NP-hard to tell whether (i) there exists a disjunction consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data.
35
The reduction: given a graph G on n vertices, we generate points in n dimensions. P_i: the example that is 0 at every position except the i-th coordinate, which is 1. P_ij: the example that is 0 at every position except the i-th and j-th coordinates. For example, when n = 4: P_1 = (1,0,0,0), P_2 = (0,1,0,0), P_3 = (0,0,1,0), P_12 = (1,1,0,0), P_23 = (0,1,1,0).
36
The reduction: for each edge (i, j) in G, generate 4 examples: (P_i, −), (P_j, −), and two copies of (P_ij, +) (so the "out of 4" counts below add up).
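The example-generating step can be written as a short sketch (hypothetical names, not code from the paper; the duplication of (P_ij, +) reflects the 4-examples-per-edge count used in the accuracy arithmetic later in the talk):

def point(n, coords):
    """0/1 vector of length n with 1s exactly at the given 1-indexed coordinates."""
    return tuple(int(i in coords) for i in range(1, n + 1))

def reduce_maxcut_to_learning(n, edges):
    """For each edge (i, j): (P_i, '-'), (P_j, '-'), and (P_ij, '+') twice."""
    examples = []
    for i, j in edges:
        examples += [(point(n, {i}), '-'), (point(n, {j}), '-')]
        examples += [(point(n, {i, j}), '+')] * 2
    return examples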
37
The Reduction. For the edge (1,2), add: (1,0,0,0): −, (0,1,0,0): −, (1,1,0,0): +.
38
The Reduction. [The slide shows the full set of labeled examples generated from the 4-vertex example graph: for each edge (i, j), the examples (P_i, −), (P_j, −), (P_ij, +).]
39
Desired Properties of the Reduction. If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis => Good Cut).
40
Proof of Good Cut => Good Hypothesis. Claim: if Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples. Proof: Opt(G) > 17/22 means there is a partition of the vertices of G into (S, S̄) such that a 17/22 fraction of the edges crosses the cut. The disjunction over the variables in S is correct on a 61/88 fraction of the examples. Why?
41
Good Cut => Good Hypothesis. The partition {1,3} vs {2,4} is a good cut, so the disjunction x_1 OR x_3 is a good hypothesis. [The slide shows the generated examples for the 4-vertex example graph with their labels.]
42
The Reduction. For the edge (1,2): (1,0,0,0): −, (0,1,0,0): −, (1,1,0,0): +. If only x_1 is in the disjunction, 3 out of 4 are correct.
43
The Reduction. For the edge (1,2): (1,0,0,0): −, (0,1,0,0): −, (1,1,0,0): +. If only x_2 is in the disjunction, 3 out of 4 are correct.
44
The Reduction. For the edge (1,2): (1,0,0,0): −, (0,1,0,0): −, (1,1,0,0): +. If x_1 and x_2 are both in the disjunction, 2 out of 4 are correct.
45
The Reduction. For the edge (1,2): (1,0,0,0): −, (0,1,0,0): −, (1,1,0,0): +. If neither x_1 nor x_2 is in the disjunction, 2 out of 4 are correct.
46
The Big Picture. If we choose the disjunction x_1 OR x_3: for edges in the cut, 3 out of 4 examples are satisfied; for edges not in the cut, 2 out of 4 examples are satisfied. [The slide shows the generated examples for the 4-vertex example graph.]
47
Therefore we have proved: if a partition (S, S̄) cuts a 17/22 fraction of the edges, then the corresponding disjunction is consistent with a (1/2) + (1/4)(17/22) = 61/88 fraction of the examples.
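Continuing the generator sketch from the earlier slide (it reuses reduce_maxcut_to_learning and a hypothetical edge set), here is a numeric sanity check of this formula: the disjunction over a vertex set S is correct on exactly 1/2 + (1/4)·(fraction of edges crossing the cut) of the generated examples.

def disjunction_accuracy(examples, S):
    hits = sum((('+' if any(x[i - 1] for i in S) else '-') == y) for x, y in examples)
    return hits / len(examples)

edges = [(1, 2), (2, 3), (3, 4), (1, 3)]              # hypothetical 4-vertex graph
S = {1, 3}
examples = reduce_maxcut_to_learning(4, edges)
cut_fraction = sum((i in S) != (j in S) for i, j in edges) / len(edges)
assert abs(disjunction_accuracy(examples, S) - (0.5 + cut_fraction / 4)) < 1e-9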
48
Desired Properties of the Reduction. If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If there is a halfspace with threshold 0 that has accuracy 60/88, then there is a cut containing a 16/22 fraction of the edges (Good Hypothesis => Good Cut).
49
Proof of Good Hypothesis => Good Cut. Claim: if there is a halfspace with threshold 0 that has accuracy 60/88, then there is a cut containing a 16/22 fraction of the edges. Proof: Suppose some halfspace sgn(w_1 x_1 + w_2 x_2 + ... + w_n x_n) has accuracy 60/88. Assign vertex i to a side of the partition according to sgn(w_i). The resulting cut contains at least a 16/22 fraction of the edges. Why?
50
Good Hypothesis => Good Cut. For the edge (1,2): the halfspace is correct on (1,0,0,...,0) (−) iff w_1 ≤ 0; correct on (0,1,0,...,0) (−) iff w_2 ≤ 0; correct on (1,1,0,...,0) (+) iff w_1 + w_2 > 0. At most 3 out of the 4 examples are satisfied, and 3 out of 4 only when 1. w_1 > 0, w_2 ≤ 0, or 2. w_2 > 0, w_1 ≤ 0.
51
To finish the proof: 60/88 = (1/4)(16/22) + 1/2. Therefore, if the halfspace has accuracy at least 60/88, at least a 16/22 fraction of the edges (i, j) must have w_i and w_j of different signs, i.e., the cut is at least 16/22.
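The rounding step in this proof can be sketched in a few lines (hypothetical names, not code from the paper): place each vertex i according to the sign of w_i and measure the resulting cut; by the per-edge case analysis above, the cut fraction is at least 4·(accuracy − 1/2).

def cut_from_halfspace(w, edges):
    side = [wi > 0 for wi in w]                       # vertex i goes to the side sgn(w_i)
    crossing = sum(side[i - 1] != side[j - 1] for i, j in edges)
    return crossing / len(edges)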
52
What we prove in the talk: it is NP-hard to tell whether (i) there exists a conjunction consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data. That is, it is NP-hard to find a 60/88-accurate halfspace with threshold 0 even if there exists a conjunction consistent with 61/88 of the examples.
53
Main result in the paper: for any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a (1−ε) fraction of the data, or (ii) no halfspace is (1/2+ε)-consistent with the data. That is, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.
54
To get the better hardness, we start from a problem called Label Cover, for which it is NP-hard to tell apart: i) Opt > 0.99; ii) Opt < 0.01.
55
Sketch of the proof: Label Cover → "smooth" Label Cover → gadget (dictatorship testing) → hardness of learning conjunctions. Technical tools: the Berry-Esseen theorem and the critical index.
56
Conclusion Even weak learning of noisy conjunctions by halfspaces is NP-hard. To obtain an efficient halfspace-based learning algorithm for conjunctions, we need either to restrict the distribution of the examples or limit the noise.
57
Future Work. Prove: for any ε > 0, given a set of training examples, even if there is a conjunction consistent with a (1−ε) fraction of the data, it is NP-hard to find a degree-d polynomial threshold function that is (1/2+ε)-consistent with the data. Why low-degree PTFs? They correspond to SVMs with a polynomial kernel, and they can be used to learn conjunctions/halfspaces agnostically under the uniform distribution.