
1 Learning DNF Formulas: the current state of affairs, and how you can win $1000
Ryan O’Donnell, including joint work with: Nader Bshouty – Technion, Elchanan Mossel – UC Berkeley, Rocco Servedio – Columbia

2 Learning theory
Computational learning theory deals with an algorithmic problem: given random labeled examples from an unknown function f : {0,1}^n → {0,1}, try to find a hypothesis h : {0,1}^n → {0,1} which is good at predicting the labels of future examples.

3 Valiant’s “PAC” model [Val84]
“PAC” = Probably Approximately Correct. A learning problem is identified with a “concept class” C, which is a set of functions (“concepts”) f : {0,1}^n → {0,1}.
– Nature/the adversary chooses one particular f ∈ C and a probability distribution D on inputs.
– The learning algorithm takes as inputs ε and δ, and also gets random examples ⟨x, f(x)⟩ with x drawn from D.
– Goal: with probability 1 − δ, output a hypothesis h which satisfies Pr_{x←D}[h(x) ≠ f(x)] < ε.
– Efficiency: the running time of the algorithm, counting time 1 for each example; hopefully poly(n, 1/ε, 1/δ).
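The protocol above is easy to mock up. Here is a minimal Python sketch of the example oracle and of the error quantity a PAC learner must minimize (the names draw_examples and estimate_error are ours, purely for illustration):

```python
import random

def draw_examples(f, sample_x, m):
    """The PAC example oracle: each call returns a pair (x, f(x)) with x drawn
    from the distribution D, here represented by the sampler `sample_x`."""
    return [(x, f(x)) for x in (sample_x() for _ in range(m))]

def estimate_error(h, f, sample_x, trials=10_000):
    """Monte Carlo estimate of Pr_{x<-D}[h(x) != f(x)] -- the quantity a PAC
    learner must drive below eps with probability at least 1 - delta."""
    return sum(h(x) != f(x) for x in (sample_x() for _ in range(trials))) / trials

# Example: D = uniform distribution on {0,1}^n.
n = 10
uniform = lambda: tuple(random.randint(0, 1) for _ in range(n))
```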

4 Example – learning conjunctions
As an example, we present an algorithm ([Val84]) for learning the concept class of conjunctions – i.e., C is the set of all AND functions.
– Start with the hypothesis h = x1 ∧ x2 ∧ ∙∙∙ ∧ xn.
– Draw O((n/ε) log(1/δ)) examples. Whenever you see a positive example ⟨x, 1⟩ – say one whose zero coordinates are x3, x5, x8 – you know those coordinates can’t be in the target AND; delete them from the hypothesis.
It takes a little reasoning to show this works, but it does.
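A minimal sketch of this elimination algorithm, assuming (as in the hypothesis above) that only positive literals are used; the helper name learn_conjunction is ours:

```python
import random

def learn_conjunction(examples, n):
    """Valiant-style elimination for learning conjunctions of variables.

    Start with h = x1 AND x2 AND ... AND xn and delete every variable that is 0
    in some positive example; the survivors form the hypothesis AND."""
    hypothesis = set(range(n))            # indices still believed to be in the target
    for x, label in examples:
        if label == 1:                    # positive example
            hypothesis -= {i for i in hypothesis if x[i] == 0}
    return hypothesis                     # h(x) = AND of x[i] for i in hypothesis

# Toy usage: target is x0 AND x2 on n = 5 variables, uniform examples.
n = 5
target = lambda x: int(x[0] == 1 and x[2] == 1)
examples = []
for _ in range(200):                      # cf. the O((n/eps) log(1/delta)) sample bound
    x = tuple(random.randint(0, 1) for _ in range(n))
    examples.append((x, target(x)))
print(learn_conjunction(examples, n))     # typically prints {0, 2}
```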

5 Learning DNF formulas
Probably the most important concept class we would like to learn is DNF formulas: e.g., the set of all functions like f = (x1 ∧ x2 ∧ x6) ∨ (¬x1 ∧ x3) ∨ (x4 ∧ x5 ∧ x7 ∧ x8). (We actually mean poly-sized DNF: the number of terms should be n^{O(1)}, where n is the number of variables.) Why so important?
– a natural form of knowledge representation for people
– historical reasons: considered by Valiant, who called the problem “tantalizing” and “apparently [simple]”
– yet it has proved a great challenge over the last 20 years
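Concretely, a DNF is just an OR of ANDs of literals; the sketch below evaluates the example formula above (the choice of which literal is negated follows our reconstruction of the slide and is otherwise incidental):

```python
def f(x):
    """The example DNF above; x is a sequence of at least 8 bits, 0-indexed
    (so x[0] is the slide's x1, and so on)."""
    term1 = x[0] and x[1] and x[5]             # x1 AND x2 AND x6
    term2 = (not x[0]) and x[2]                # (NOT x1) AND x3
    term3 = x[3] and x[4] and x[6] and x[7]    # x4 AND x5 AND x7 AND x8
    return int(bool(term1 or term2 or term3))

print(f([1, 1, 0, 0, 0, 1, 0, 0]))  # 1: the first term is satisfied
```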

6 Talk overview In this talk I will:
1. Describe variants of the PAC model, and the fastest known algorithms for learning DNF in each of them.
2. Point out roadblocks to improving these results, and present some open problems which are simple, concrete, and lucrative.
I will not:
– discuss every known model
– consider the important problem of learning restricted DNF – e.g., monotone DNF, O(1)-term DNF, etc.

7 The original PAC model
The trouble with this model is that, despite Valiant’s initial optimism, PAC-learning DNF formulas appears to be very hard. The fastest known algorithm is due to Klivans and Servedio [KS01], and runs in time exp(n^{1/3} log^2 n).
Technique: they show that for any DNF formula there is a polynomial in x1, …, xn of degree at most n^{1/3} log n which is positive whenever the DNF is true and negative whenever the DNF is false. Linear programming can then be used to find a hypothesis consistent with every example, in time exp(n^{1/3} log^2 n).
Note: consider the model, more difficult than PAC, in which the learner is forced to output a hypothesis which is itself a DNF. In this case the problem is NP-hard.
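The second step – finding a low-degree sign-representing polynomial consistent with the sample – can be phrased as a feasibility linear program over all monomials up to the given degree. Below is a toy sketch of just that step (the helper names monomial_features and fit_ptf are ours; it assumes numpy/scipy and omits the sample-size/generalization argument of [KS01]):

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def monomial_features(x, degree):
    """All multilinear monomials of the {0,1}-vector x up to `degree`, plus the constant 1."""
    n = len(x)
    feats = [1.0]
    for d in range(1, degree + 1):
        for S in combinations(range(n), d):
            feats.append(float(all(x[i] for i in S)))
    return feats

def fit_ptf(examples, degree):
    """Feasibility LP: find weights w with y * <w, phi(x)> >= 1 for every labeled
    example, i.e. a degree-`degree` polynomial threshold function consistent with
    the sample.  Returns w, or None if no such polynomial exists."""
    Phi = np.array([monomial_features(x, degree) for x, _ in examples])
    y = np.array([1.0 if label else -1.0 for _, label in examples])
    A_ub = -(y[:, None] * Phi)                 # -y * phi(x) . w <= -1
    b_ub = -np.ones(len(examples))
    res = linprog(c=np.zeros(Phi.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * Phi.shape[1], method="highs")
    return res.x if res.success else None

# Hypothesis: h(x) = sign(<w, monomial_features(x, degree)>).  With degree on the
# order of n^{1/3} log n, [KS01] guarantee the LP is feasible for any poly-size DNF;
# the number of monomials (hence the running time) is then roughly exp(n^{1/3} log^2 n).
```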

8 Distributional issues
One aspect of PAC learning that makes it very difficult is that the adversary gets to pick a different probability distribution over examples for every concept: it can put all of the probability weight on the most “difficult” examples. A very commonly studied, easier model is Uniform Distribution learning, in which the adversary must always use the uniform distribution on {0,1}^n. Under the uniform distribution, DNF can be learned in quasipolynomial time…

9 Uniform Distribution learning
In 1990, Verbeurgt [Ver90] observed that, under the uniform distribution, any term in the target DNF which is longer than log(n/ε) is essentially always false, and thus irrelevant. This fairly easily leads to an algorithm for learning DNF under uniform in quasipolynomial time: roughly n^{log n}. In 1993, Linial, Mansour, and Nisan [LMN93] introduced a powerful and sophisticated method of learning under the uniform distribution based on Fourier analysis. In particular, their algorithm can learn depth-d circuits in time roughly n^{log^d n}. Fourier analysis proved very important for subsequent work on learning DNF under the uniform distribution.
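The cutoff rests on a one-line calculation: a term with t literals is satisfied by a uniformly random input with probability 2^{-t}, so terms longer than log2(s/ε) each fire with probability below ε/s, and a union bound over the s terms lets us discard them all. A tiny sketch (the function name is ours):

```python
import math

def term_length_cutoff(num_terms, eps):
    """Smallest term length t such that a single term of that length is satisfied
    by a uniform random input with probability < eps / num_terms; by a union bound,
    deleting all longer terms changes the DNF on < eps fraction of inputs."""
    return math.ceil(math.log2(num_terms / eps))

# e.g. a DNF with 10,000 terms and eps = 0.01: only terms of length <= ~20 matter.
print(term_length_cutoff(10_000, 0.01))   # 20
```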

10 Membership queries
Still, DNF are not known to be learnable in polynomial time even under the uniform distribution. Another common way to make learning easier is to allow the learner to make membership queries: in addition to getting random examples, the learner may ask for the value of f on any input it likes. In some sense this strays from the traditional model of learning, in which the learner is passive; however, it is a natural, commonly studied, and important model. Angluin and Kharitonov [AK91] showed that (under cryptographic assumptions) membership queries do not help in PAC-learning DNF.

11 Uniform distribution with queries
However, membership queries are helpful in learning under the uniform distribution. Uniform Distribution learning with membership queries is the easiest model we’ve seen thus far, and in it there is finally some progress! Mansour [Man92], building on earlier work of Kushilevitz and Mansour [KM93], gave a Fourier-based algorithm in this model learning DNF in time n^{log log n}. Finally, in 1994 Jackson [Jac94] produced the celebrated “Harmonic Sieve,” a novel algorithm combining Fourier analysis with Freund and Schapire’s “boosting” to learn DNF in polynomial time.

12 Random Walk model Recently we’ve been able to improve on this last result. [Bshouty-Mossel-O-Servedio-03] considers a natural, passive model of learning of difficulty intermediate between Uniform Distribution learning and Uniform Distribution learning with membership queries: In the Random Walk model, examples are not given i.i.d. as usual, but are instead generated by a standard random walk on the hypercube. The learner’s hypothesis is evaluated under the uniform distribution. It can be shown that DNF are also learnable in polynomial time in this model. The proof begins with Jackson’s algorithm and does some extra Fourier analysis.
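The example stream in this model is easy to picture: a standard random walk on the hypercube flips one coordinate at a time. Here is a minimal sketch of such an example generator (one common convention; the precise walk in [BMOS03], e.g. whether steps are lazy, may differ in details):

```python
import random

def random_walk_examples(f, n, num_steps):
    """Yield (x, f(x)) pairs along a random walk on {0,1}^n: start at a uniformly
    random point, then flip one uniformly random coordinate per step."""
    x = [random.randint(0, 1) for _ in range(n)]
    yield tuple(x), f(x)
    for _ in range(num_steps):
        i = random.randrange(n)          # coordinate to flip
        x[i] ^= 1
        yield tuple(x), f(x)

# Usage: stream correlated (non-i.i.d.) examples of a target f to a learner.
for x, label in random_walk_examples(lambda x: x[0] ^ x[1], n=6, num_steps=3):
    print(x, label)
```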

13 The current picture
Learning model                               Time                       Source
PAC learning (distributional)                2^{O(n^{1/3} log^2 n)}     [KS01]
Uniform Distribution                         n^{O(log n)}               [Ver90]
Random Walk                                  poly(n)                    [BMOS03]
Uniform Distribution + Membership queries    poly(n)                    [Jac94]
(The models get easier going down the table.)

14 Poly time under uniform?
Perhaps the biggest open problem in DNF learning (in all of learning theory?) is whether DNF can be learned in polynomial time under the uniform distribution. One might ask: what is the current stumbling block? Why can’t we seem to do better than n^{log n} time? The answer is that we’re stuck on the junta-learning problem.
Definition: a k-junta is a function on n bits which happens to depend on only k of them. (The other n − k coordinates are irrelevant.)

15 Learning juntas
Since every boolean function on k bits has a DNF of size 2^k, the set of all log(n)-juntas is a subset of the set of all polynomial-size DNF formulas. Thus to learn DNF under uniform in polynomial time, we must be able to learn log(n)-juntas under uniform in polynomial time. The problem of learning k-juntas dates back to Blum [B94] and Blum and Langley [BL94]. There is an extremely naive algorithm running in time n^k – essentially, test all possible sets of k variables to see if they are the junta. However, even getting an n^{(1−Ω(1))k} algorithm took some time…
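A minimal sketch of that naive n^k-time search (with enough uniform examples, only sets containing the true junta will look consistent with high probability; the helper name find_junta is ours):

```python
from itertools import combinations

def find_junta(examples, n, k):
    """Try every size-k set of coordinates; return the first one on which the
    observed labels are consistent, i.e. no two examples agree on those
    coordinates but disagree in label.  Time: C(n, k) * |examples| * k ~ n^k."""
    for S in combinations(range(n), k):
        table = {}                                   # partial truth table on S
        consistent = True
        for x, label in examples:
            key = tuple(x[i] for i in S)
            if table.setdefault(key, label) != label:
                consistent = False
                break
        if consistent:
            return S, table
    return None
```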

16 Learning juntas
[Mossel-O-Servedio-03] gave an algorithm for learning k-juntas under the uniform distribution which runs in time n^{.704k}. The technique involves trading off different polynomial representations of boolean functions. This is not much of an improvement for the important case k = log n, but it at least demonstrates that the n^k barrier can be broken. It would be a gigantic breakthrough to learn k-juntas in time n^{o(k)}, or even to learn ω(1)-juntas in polynomial time.

17 An open problem
The junta-learning problem is a beautiful and simple-to-state problem: an unknown and arbitrary function f : {0,1}^n → {0,1} is selected, which depends on only some k of the bits. The algorithm gets access to uniformly random examples ⟨x, f(x)⟩. With probability 1 − δ the algorithm should output at least one bit (equivalently, all k bits) upon which f depends. The algorithm’s running time is considered to be of the form n^α · poly(n, 2^k, 1/δ), and α is the important measure of complexity. Can one get α << k?

18 Cash money
Avrim Blum has put up CA$H for anyone who can make progress on this problem.
$1000 – Solve the problem for k = log n or k = log log n, in polynomial time.
$500 – Solve the problem in polynomial time when the function is known to be MAJORITY on log n bits XORed with PARITY on log n bits.
$200 – Find an algorithm running in time n^{.499k}.
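For experimentation, the $500 target is easy to write down. Below is one natural reading of it: MAJORITY of k hidden bits XOR PARITY of k other hidden bits, with k = log2 n. Which 2k coordinates are relevant would of course be hidden from the learner, and the tie-breaking convention for MAJORITY is our choice:

```python
import random

def maj_xor_parity(x, k):
    """MAJORITY of x[0..k-1] XOR PARITY of x[k..2k-1]; a 2k-junta on n >= 2k bits.
    (The relevant coordinates are the first 2k only for readability.)"""
    maj = int(sum(x[:k]) > k / 2)        # ties count as 0 -- one arbitrary convention
    par = sum(x[k:2 * k]) % 2
    return maj ^ par

# Uniform examples of this 2*log2(n)-junta, with n = 16, k = 4:
n, k = 16, 4
for _ in range(3):
    x = [random.randint(0, 1) for _ in range(n)]
    print(x, maj_xor_parity(x, k))
```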

