Presentation transcript:

Learning DNF Formulas: the current state of affairs, and how you can win $1000. Ryan O’Donnell, including joint work with Nader Bshouty (Technion), Elchanan Mossel (UC Berkeley), and Rocco Servedio (Columbia).

Learning theory. Computational learning theory deals with an algorithmic problem: given random labeled examples from an unknown function f : {0,1}^n → {0,1}, try to find a hypothesis h : {0,1}^n → {0,1} which is good at predicting the labels of future examples.

Valiant’s “PAC” model [Val84]. “PAC” = Probably Approximately Correct:
- The learning problem is identified with a “concept class” C, which is a set of functions (“concepts”) f : {0,1}^n → {0,1}.
- Nature/adversary chooses one particular f ∈ C and a probability distribution D on inputs.
- The learning algorithm takes as inputs ε and δ, and also gets random examples ⟨x, f(x)⟩ with x drawn from D.
- Goal: with probability 1−δ, output a hypothesis h which satisfies Pr_{x←D}[h(x) ≠ f(x)] < ε.
- Efficiency: running time of the algorithm, counting time 1 for each example; hopefully poly(n, 1/ε, 1/δ).
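To make the interface concrete, here is a minimal Python sketch of the example oracle and the error measure (the names example_oracle, empirical_error, and draw_uniform are my own, not part of the model):

```python
import random

def example_oracle(f, draw_x):
    """One call to the example oracle: return a labeled example <x, f(x)> with x drawn from D."""
    x = draw_x()
    return x, f(x)

def empirical_error(h, f, draw_x, trials=10000):
    """Estimate Pr_{x<-D}[h(x) != f(x)] by sampling; the PAC goal is to make this < eps."""
    return sum(h(x) != f(x) for x in (draw_x() for _ in range(trials))) / trials

# One possible distribution D: uniform over {0,1}^n, with inputs represented as bit tuples.
n = 8
draw_uniform = lambda: tuple(random.randint(0, 1) for _ in range(n))
```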

Example – learning conjunctions. As an example, we present an algorithm [Val84] for learning the concept class of conjunctions, i.e., C is the set of all AND functions.
- Start with the hypothesis h = x1 ∧ x2 ∧ ∙∙∙ ∧ xn.
- Draw O((n/ε) log(1/δ)) examples. Whenever you see a positive example, e.g., ⟨11010110, 1⟩, you know that the zero coordinates (in this case x3, x5, x8) can’t be in the target AND; delete them from the hypothesis.
It takes a little reasoning to show this works, but it does.
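A minimal Python sketch of this elimination algorithm (my own code; `oracle` is assumed to be a zero-argument function returning one labeled pair (x, f(x)) with x a tuple of n bits):

```python
import math

def learn_conjunction(oracle, n, eps, delta):
    """Elimination algorithm for learning conjunctions of (unnegated) variables, as described above:
    start with the AND of all n variables and delete any variable seen to be 0 in a positive example."""
    hypothesis = set(range(n))                     # indices still believed to be in the AND
    m = math.ceil((n / eps) * math.log(1 / delta)) + 1
    for _ in range(m):
        x, label = oracle()
        if label == 1:                             # positive example
            hypothesis -= {i for i in range(n) if x[i] == 0}
    return lambda z: int(all(z[i] == 1 for i in hypothesis))
```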

Learning DNF formulas. Probably the most important concept class we would like to learn is DNF formulas: e.g., the set of all functions like f = (x1 ∧ x2 ∧ x6) ∨ (¬x1 ∧ x3) ∨ (x4 ∧ x5 ∧ x7 ∧ x8). (We actually mean poly-sized DNF: the number of terms should be n^{O(1)}, where n is the number of variables.) Why so important?
- natural form of knowledge representation for people
- historical reasons: considered by Valiant, who called the problem “tantalizing” and “apparently [simple]”
- yet it has proved a great challenge over the last 20 years
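For concreteness, here is a tiny evaluation of the example formula above (my own illustration; the particular assignment is made up):

```python
def f(x):
    """Evaluate the example DNF above; x[1..8] hold the 0/1 values of x1..x8 (index 0 unused)."""
    term1 = x[1] and x[2] and x[6]
    term2 = (not x[1]) and x[3]
    term3 = x[4] and x[5] and x[7] and x[8]
    return int(bool(term1 or term2 or term3))

# The assignment x1..x8 = 0,1,1,0,0,1,0,1 satisfies the second term, so f evaluates to 1.
print(f([0, 0, 1, 1, 0, 0, 1, 0, 1]))   # -> 1
```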

Talk overview. In this talk I will:
1. Describe variants of the PAC model, and the fastest known algorithms for learning DNF in each of them.
2. Point out roadblocks for improving these results, and present some open problems which are simple, concrete, and lucrative.
I will not:
- discuss every known model
- consider the important problem of learning restricted DNF – e.g., monotone DNF, O(1)-term DNF, etc.

The original PAC model. The trouble with this model is that, despite Valiant’s initial optimism, PAC-learning DNF formulas appears to be very hard. The fastest known algorithm is due to Klivans and Servedio [KS01], and runs in time exp(n^{1/3} log^2 n). Technique: they show that for any DNF formula, there is a polynomial in x1, …, xn of degree at most n^{1/3} log n which is positive whenever the DNF is true and negative whenever the DNF is false. Linear programming can be used to find a hypothesis consistent with every example in time exp(n^{1/3} log^2 n). Note: consider the model, more difficult than PAC, in which the learner is forced to output a hypothesis which itself is a DNF. In this case, the problem is NP-hard.
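The linear-programming step can be sketched as follows (my own illustration using scipy; the function names are made up, and a real run in the spirit of [KS01] would use degree d around n^{1/3} log n, which is only feasible here for very small n):

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def monomials(n, d):
    """All variable subsets of size at most d, i.e. the monomials of a degree-d multilinear polynomial."""
    return [S for k in range(d + 1) for S in combinations(range(n), k)]

def fit_sign_matching_poly(examples, n, d):
    """Find coefficients c_S so that p(x) = sum_S c_S * prod_{i in S} x_i is >= 1 on every
    positive example and <= -1 on every negative one (a feasibility LP, as described above)."""
    mons = monomials(n, d)
    A_ub, b_ub = [], []
    for x, label in examples:
        row = [float(np.prod([x[i] for i in S])) if S else 1.0 for S in mons]
        if label == 1:
            A_ub.append([-v for v in row]); b_ub.append(-1.0)   # encodes p(x) >= 1
        else:
            A_ub.append(row); b_ub.append(-1.0)                 # encodes p(x) <= -1
    res = linprog(c=np.zeros(len(mons)), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * len(mons))
    return (mons, res.x) if res.success else None
```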

Distributional issues. One aspect of PAC learning that makes it very difficult is that the adversary gets to pick a different probability distribution for the examples for every concept: thus the adversary can pick a distribution which puts all the probability weight on the most “difficult” examples. A very commonly studied easier model of learning is called Uniform Distribution learning. Here, the adversary must always use the uniform distribution on {0,1}^n. Under the uniform distribution, DNF can be learned in quasipolynomial time…

Uniform Distribution learning. In 1990, Verbeurgt [Ver90] observed that, under the uniform distribution, any term in the target DNF which is longer than log(n/ε) is essentially always false, and thus irrelevant. This fairly easily leads to an algorithm for learning DNF under uniform in quasipolynomial time: roughly n^{log n}. In 1993, Linial, Mansour, and Nisan [LMN93] introduced a powerful and sophisticated method of learning under the uniform distribution based on Fourier analysis. In particular, their algorithm could learn depth-d circuits in time roughly n^{log^d n}. Fourier analysis proved very important for subsequent learning of DNF under the uniform distribution.
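The basic “low-degree” version of this Fourier method can be sketched as follows (my own code and naming; examples are uniform random pairs (x, f(x)) with x a 0/1 tuple and f(x) ∈ {0,1}):

```python
from itertools import combinations

def chi(S, x):
    """The parity character chi_S(x) = (-1)^(sum of x_i over i in S) for a 0/1 vector x."""
    return -1 if sum(x[i] for i in S) % 2 else 1

def low_degree_learn(examples, n, d):
    """Low-degree algorithm in the spirit of [LMN93]: estimate the Fourier coefficients of f
    (viewed as +/-1-valued via y = (-1)^f(x)) on all sets of size <= d, then predict with the
    sign of the truncated Fourier expansion."""
    sets = [S for k in range(d + 1) for S in combinations(range(n), k)]
    m = len(examples)
    coeff = {S: sum((-1) ** label * chi(S, x) for x, label in examples) / m for S in sets}

    def hypothesis(x):
        value = sum(coeff[S] * chi(S, x) for S in sets)
        return 0 if value >= 0 else 1       # sign +1 corresponds to label 0
    return hypothesis
```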

Membership queries. Still, DNF are not known to be learnable in polynomial time even under the uniform distribution. Another common way to make learning easier is to allow the learner membership queries. By this we mean that in addition to getting random examples, the learner is allowed to ask for the value of f on any input it wants. In some sense this begins to stray away from the traditional model of learning, in which the learner is passive. However, it’s a natural, commonly studied, important model. Angluin and Kharitonov [AK91] showed that membership queries do not help in PAC-learning DNF.

Uniform distribution with queries. However, membership queries are helpful in learning under the uniform distribution. Uniform Distribution learning with membership queries is the easiest model we’ve seen thus far, and in it there is finally some progress! Mansour [Man92], building on earlier work of Kushilevitz and Mansour [KM93], gave an algorithm in this model learning DNF in time n^{log log n}; it is Fourier based. Finally, in 1994 Jackson [Jac94] produced the celebrated “Harmonic Sieve,” a novel algorithm combining Fourier analysis and Freund and Schapire’s “boosting” to learn DNF in polynomial time.

Random Walk model. Recently we’ve been able to improve on this last result. [Bshouty-Mossel-O-Servedio-03] considers a natural, passive model of learning of difficulty intermediate between Uniform Distribution learning and Uniform Distribution learning with membership queries: in the Random Walk model, examples are not given i.i.d. as usual, but are instead generated by a standard random walk on the hypercube. The learner’s hypothesis is evaluated under the uniform distribution. It can be shown that DNF are also learnable in polynomial time in this model. The proof begins with Jackson’s algorithm and does some extra Fourier analysis.
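A sketch of how Random Walk examples might be generated (my own code; I assume the convention of flipping one uniformly random coordinate per step, glossing over details such as lazy versus non-lazy walks):

```python
import random

def random_walk_examples(f, n, steps):
    """Yield labeled examples <x, f(x)> where x follows a random walk on the hypercube:
    start uniformly at random, then flip one uniformly random coordinate at each step."""
    x = [random.randint(0, 1) for _ in range(n)]
    yield tuple(x), f(tuple(x))
    for _ in range(steps):
        x[random.randrange(n)] ^= 1        # flip one random coordinate
        yield tuple(x), f(tuple(x))
```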

The current picture. Models are listed from hardest (top) to easiest (bottom), with the best known time and its source:
- PAC learning (distributional): 2^{O(n^{1/3} log^2 n)} [KS01]
- Uniform Distribution: n^{O(log n)} [Ver90]
- Random Walk: poly(n) [BMOS03]
- Uniform Distribution + Membership queries: poly(n) [Jac94]

Poly time under uniform? Perhaps the biggest open problem in DNF learning (in all of learning theory?) is whether DNF can be learned in polynomial time under the uniform distribution. One might ask: what is the current stumbling block? Why can’t we seem to do better than n^{log n} time? The answer is that we’re stuck on the junta-learning problem. Definition: A k-junta is a function on n bits which happens to depend on only k bits. (All other n − k coordinates are irrelevant.)
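Concretely (a small illustration with made-up coordinates):

```python
def make_junta(n, relevant, g):
    """A k-junta on n bits: f depends only on the coordinates in `relevant`, through an
    arbitrary function g of those k bits; the other n - k coordinates are irrelevant."""
    return lambda x: g(tuple(x[i] for i in relevant))

# Example: a 3-junta on 100 bits, the majority of coordinates 4, 17, and 62.
maj3 = lambda bits: int(sum(bits) >= 2)
f = make_junta(100, (4, 17, 62), maj3)
```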

Learning juntas. Since every boolean function on k bits has a DNF of size 2^k, it follows that the set of all log(n)-juntas is a subset of the set of all polynomial-size DNF formulas. Thus to learn DNF under uniform in polynomial time, we must be able to learn log(n)-juntas under uniform in polynomial time. The problem of learning k-juntas dates back to Blum [B94] and Blum and Langley [BL94]. There is an extremely naive algorithm running in time n^k – essentially, test all possible sets of k variables to see if they are the junta. However, even getting an n^{(1-Ω(1))k} algorithm took some time…
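The naive approach can be sketched as follows (my own code; with enough uniform random examples, only a true set of relevant coordinates is likely to remain consistent):

```python
from itertools import combinations

def naive_junta_learn(examples, n, k):
    """Try every set S of k coordinates (about n^k candidates) and check whether the labels are
    consistent with some function of the bits in S alone, i.e. no two examples agree on S but
    carry different labels."""
    for S in combinations(range(n), k):
        table, consistent = {}, True
        for x, label in examples:
            key = tuple(x[i] for i in S)
            if table.setdefault(key, label) != label:
                consistent = False
                break
        if consistent:
            return S            # candidate set of relevant variables
    return None
```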

Learning juntas. [Mossel-O-Servedio-03] gave an algorithm for learning k-juntas under the uniform distribution which runs in time n^{0.704k}. The technique involves trading off different polynomial representations of boolean functions. This is not much of an improvement for the important case of k = log n. However, at least it demonstrates that the n^k barrier can be broken. It would be a gigantic breakthrough to learn k-juntas in time n^{o(k)}, or even to learn ω(1)-juntas in polynomial time.

An open problem. The junta-learning problem is a beautiful and simple-to-state problem: an unknown and arbitrary function f : {0,1}^n → {0,1} is selected, which depends only on some k of the bits. The algorithm gets access to uniformly randomly chosen examples ⟨x, f(x)⟩. With probability 1−δ the algorithm should output at least one bit (equivalently, all k bits) upon which f depends. The algorithm’s running time is considered to be of the form n^α · poly(n, 2^k, 1/δ), and α is the important measure of complexity. Can one get α ≪ k?

Cash money. Avrim Blum has put up CA$H for anyone who can make progress on this problem:
- $1000 – Solve the problem for k = log n or k = log log n, in polynomial time.
- $500 – Solve the problem in polynomial time when the function is known to be MAJORITY on log n bits XORed with PARITY on log n bits.
- $200 – Find an algorithm running in time n^{0.499k}.