LEARNING UNDER THE UNIFORM DISTRIBUTION – Toward DNF – Ryan O'Donnell, Microsoft Research, January 2006.


Re: How to make $1000! A Grand of George W.'s. A Hundred Hamiltons. A Cool Cleveland.

The junta learning problem
f : {−1,+1}^n → {−1,+1} is an unknown Boolean function; f depends on only k ≪ n of its bits.
We may generate examples ⟨x, f(x)⟩, where x is drawn uniformly at random.
Task: identify the k relevant variables; identify f exactly; or just identify one relevant variable.
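
To make the setup concrete, here is a minimal Python sketch of the example oracle. The particular junta (a majority vote over three hidden bits) and all names are illustrative choices of mine, not part of the talk.

    import random

    n, k = 20, 3
    relevant = sorted(random.sample(range(n), k))   # the hidden relevant coordinates

    def f(x):
        # an illustrative k-junta: majority vote over the k relevant bits (values in {-1,+1})
        return 1 if sum(x[i] for i in relevant) > 0 else -1

    def draw_example():
        x = [random.choice([-1, +1]) for _ in range(n)]
        return x, f(x)

    examples = [draw_example() for _ in range(2000)]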

Run-time efficiency
Information-theoretically: need only ≈ 2^k · log n examples.
Algorithmically: seem to need n^{Ω(k)} time steps.
Naive algorithm: time n^k.
Best known algorithm: time ≈ n^{0.704k} [Mossel-O-Servedio 04].
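
A hedged sketch of what the naive time-n^k algorithm can look like: enumerate all size-k candidate sets of relevant variables and keep those consistent with the sample (reusing the examples list from the sketch above; the helper name is mine).

    from itertools import combinations

    def consistent_subsets(examples, n, k):
        # keep each size-k set S such that no two examples agree on S but carry different labels
        good = []
        for S in combinations(range(n), k):
            seen = {}
            if all(seen.setdefault(tuple(x[i] for i in S), y) == y for x, y in examples):
                good.append(S)
        return good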

How to get the money
Learning (log n)-juntas in poly(n) time gets you $1000.
Learning (log log n)-juntas in poly(n) time gets you $1000.
Learning ω(1)-juntas in poly(n) time gets you $200.
The case k = log n is a subproblem of the problem of learning polynomial-size DNF under the uniform distribution.

Algorithmic attempts
For each x_i, measure its empirical correlation with f(x): E[f(x)·x_i]. (Time: ≈ n.)
Different from 0 ⇒ x_i must be relevant.
Converse false: x_i can be influential but uncorrelated. (E.g., k = 4, f = "exactly 2 out of the 4 bits are +1".)
So try measuring f's correlation with pairs of variables: E[f(x)·x_i·x_j]. (Time: ≈ n^2.)
Different from 0 ⇒ both x_i and x_j must be relevant.
Still might not work. (E.g., k ≥ 3, f = parity on the k bits.)
So try measuring correlation with all triples of variables… (Time: ≈ n^3.)
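
The first two attempts on this slide amount to estimating low-order correlations from the sample. A small Python sketch (the function names, and the reuse of the earlier examples list, are my assumptions):

    from itertools import combinations

    def single_correlations(examples, n):
        # empirical estimates of E[f(x)·x_i] for every i  (n statistics)
        m = len(examples)
        return [sum(y * x[i] for x, y in examples) / m for i in range(n)]

    def pair_correlations(examples, n):
        # empirical estimates of E[f(x)·x_i·x_j] for every pair  (~n^2 statistics)
        m = len(examples)
        return {(i, j): sum(y * x[i] * x[j] for x, y in examples) / m
                for i, j in combinations(range(n), 2)}

Any estimate bounded away from 0 certifies that the variable(s) involved are relevant; as the slide notes, parity-like juntas defeat all such low-order checks.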

A result
In time n^d, you can check correlation with all d-bit functions. What kind of Boolean function on k bits could be uncorrelated with all functions on d or fewer bits? (Well, parities on more than d bits, for example…)
[Mossel-O-Servedio 04] proves a structure theorem about such functions: they must be expressible as parities of small-size ANDs. One can then apply a parity-learning algorithm in that case. End result: an algorithm running in time roughly n^{0.704k}.
Uniform-distribution learning results are often implied by structural results about Boolean functions.
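
Learning a parity from uniform examples, the subroutine invoked here, boils down to solving a linear system over GF(2). Below is a minimal Gauss-Jordan sketch of that standard technique, assuming the target is exactly a parity of an unknown set S; it is not the [Mossel-O-Servedio 04] algorithm itself, and the names are mine.

    def learn_parity(examples, n):
        # Convert ±1 values to GF(2): +1 -> 0, -1 -> 1, so products of signs become XOR.
        rows = [[(1 - xi) // 2 for xi in x] + [(1 - y) // 2] for x, y in examples]
        pivot_col = {}                # row index (after swapping) -> its pivot column
        rank = 0
        for col in range(n):
            pivot = next((i for i in range(rank, len(rows)) if rows[i][col] == 1), None)
            if pivot is None:
                continue              # no information about this column yet; more examples would fix it
            rows[rank], rows[pivot] = rows[pivot], rows[rank]
            for i in range(len(rows)):
                if i != rank and rows[i][col] == 1:
                    rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
            pivot_col[rank] = col
            rank += 1
        # Setting free variables to 0, each pivot row determines one coordinate of S.
        return sorted(pivot_col[i] for i in range(rank) if rows[i][n] == 1)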

PAC Learning
There is an unknown f : {−1,+1}^n → {−1,+1}. The algorithm gets i.i.d. examples ⟨x, f(x)⟩ (in general PAC learning, x comes from an unknown distribution; in this talk, from the Uniform Distribution).
Task: learn. Given ε, find a hypothesis function h which is (w.h.p.) ε-close to f.
Goal: running-time efficiency.

Running-time efficiency
The more complex f is, the more time it's fair to allow. Fix some measure of complexity or size, s = s(f), e.g., the size of the smallest DNF formula for f.
Goal: run in time poly(n, 1/ε, s).
Often we focus on fixing s = poly(n) and learning in poly(n) time.

The junta problem
Fits into this formulation (slightly strangely): ε is fixed to 0 (equivalently, to 2^{−k}), and the measure of size is 2^{(# of relevant variables)}, i.e., s = 2^k.
[Mossel-O-Servedio 04] had running time essentially n^{0.704 log_2 s}.
Even under this extremely conservative notion of size, we don't know how to learn in poly(n) time for s = poly(n).

complexity measure s            fastest known algorithm
DNF size                        n^{O(log s)}             [V 90]
2^{# of relevant variables}     n^{0.704 log_2 s}        [MOS 04]
depth-d circuit size            n^{O(log^{d-1} s)}       [LMN 93, H 02]
Decision Tree size              n^{O(log s)}             [EH 89]

Assuming factoring is hard, time n^{log^{Ω(d)} s} is necessary for depth-d circuits, even with queries. [K 93]
Any algorithm that works in the Statistical Query model requires time n^{Ω(k)}. [BF 02]

What to do?
1. Give the Learner extra help:
Queries: the Learner can ask for f(x) at any x. ⇒ Can learn DNF in time poly(n, s). [Jackson 94]
More structured data: examples are not i.i.d. but are generated by a standard random walk, or examples come in pairs ⟨x, f(x)⟩, ⟨x', f(x')⟩, where x and x' share a > ½ fraction of coordinates. ⇒ Can learn DNF in time poly(n, s). [Bshouty-Mossel-O-Servedio 05]

What to do? (rest of the talk)
2. Give up on trying to learn all functions.
Rest of the talk: focus on just learning monotone functions.
f is monotone: changing a −1 to a +1 in the input can only make f go from −1 to +1, not the reverse.
Long history in PAC learning [HM91, KLV94, KMP94, B95, BT96, BCL98, V98, SM00, S04, JS05, ...].
f has DNF size s and is monotone ⇒ f has a size-s monotone DNF (one with no negated variables).

Why does monotonicity help?
1. More structured.
2. You can identify the relevant variables.
Fact: if f is monotone, then f depends on x_i iff it has positive correlation with x_i; i.e., E[f(x)·x_i] > 0.
Proof sketch: if f is monotone, its variables have only nonnegative correlations; in fact E[f(x)·x_i] equals the influence of x_i on f, which is nonzero exactly when x_i is relevant.
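
For a monotone target, the relevance test above is just a thresholded correlation estimate. A sketch (the threshold value 0.05 and the function name are arbitrary illustrative choices):

    def likely_relevant_variables(examples, n, threshold=0.05):
        # For monotone f, E[f(x)·x_i] > 0 exactly when x_i is relevant, so keep the
        # coordinates whose empirical correlation clears a small threshold.
        m = len(examples)
        corr = [sum(y * x[i] for x, y in examples) / m for i in range(n)]
        return [i for i in range(n) if corr[i] > threshold]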

Monotone case (compare with the any-function table above)
complexity measure s            fastest known algorithm
DNF size                        poly(n, s^{log s})       [Servedio 04]
2^{# of relevant variables}     poly(n, 2^k) = poly(n, s)
depth-d circuit size            ?
Decision Tree size              poly(n, s)               [O-Servedio 06]

Learning Decision Trees
Non-monotone (general) case.
Structural result: every size-s decision tree (# of leaves = s) is ε-close to a decision tree of depth d := log_2(s/ε).
Proof: truncate to depth d. The probability that a random input would use any particular longer path is ≤ 2^{−d} = ε/s; there are at most s such paths, so use the union bound.

Learning Decision Trees
Structural result: any depth-d decision tree can be expressed as a degree-d (multilinear) polynomial over ℝ.
Proof: given a root-to-leaf path in the tree, e.g. "x_1 = +1, x_3 = −1, x_6 = +1, output +1", there is a degree-≤ d expression in the variables which is 0 if the path is not followed and equals the path's output if it is followed; for this path, (+1)·(1+x_1)/2·(1−x_3)/2·(1+x_6)/2. Now just add these up over all paths.
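
A sketch of that construction in Python, representing a multilinear polynomial as a map from monomials (sets of variable indices) to coefficients. The 0-based indices and helper names are my own; this simply multiplies out the (1 + b·x_i)/2 factors described in the proof.

    from collections import defaultdict

    def path_polynomial(path, output):
        # path: list of (variable index i, required value b in {-1,+1}); returns the
        # polynomial output * prod_j (1 + b_j*x_{i_j})/2 as {frozenset of indices: coefficient}
        poly = defaultdict(float)
        poly[frozenset()] = float(output)
        for i, b in path:
            new = defaultdict(float)
            for mono, c in poly.items():
                new[mono] += c / 2                       # the "1" half of (1 + b*x_i)/2
                new[frozenset(mono | {i})] += c * b / 2  # the "b*x_i" half
            poly = new
        return poly

    def tree_polynomial(paths_with_outputs):
        # sum the per-path polynomials; e.g. the slide's path is ([(0, +1), (2, -1), (5, +1)], +1)
        total = defaultdict(float)
        for path, output in paths_with_outputs:
            for mono, c in path_polynomial(path, output).items():
                total[mono] += c
        return total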

Learning Decision Trees
Corollary: every size-s decision tree is ε-close to a degree-log_2(s/ε) multilinear polynomial.
Least-squares polynomial regression (the "Low-Degree Algorithm"): draw a bunch of data and try to fit it with a degree-d multilinear polynomial over ℝ. Minimizing the L_2 error is a linear least-squares problem over ≈ n^d unknowns (the coefficients).
⇒ learn size-s DTs in time poly(n^d) = poly(n^{log s}).
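
A hedged numpy sketch of the least-squares step: build one design-matrix column per monomial of degree ≤ d, solve for the coefficients, and predict with the sign of the fitted polynomial. The function names are mine; this is only the shape of the Low-Degree Algorithm, not a tuned implementation.

    from itertools import combinations
    import numpy as np

    def monomials(n, d):
        return [S for r in range(d + 1) for S in combinations(range(n), r)]

    def fit_low_degree(examples, n, d):
        # design matrix: one column per monomial of degree <= d, evaluated on each sample point
        monos = monomials(n, d)
        X = np.array([[np.prod([x[i] for i in S]) for S in monos] for x, _ in examples])
        y = np.array([label for _, label in examples], dtype=float)
        coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        def hypothesis(x):
            value = sum(c * np.prod([x[i] for i in S]) for S, c in zip(monos, coef))
            return 1 if value >= 0 else -1
        return hypothesis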

Learning monotone Decision Trees
[O-Servedio 06]:
1. Structural theorem on DTs: for any size-s decision tree (not necessarily monotone), the sum of the n degree-1 correlations is at most √(log s).
2. Easy fact we've seen: for monotone functions, variable correlations = variable influences.
3. Theorem of [Friedgut 96]: if the total influence of f is at most t, then f essentially has at most 2^{O(t)} relevant variables.
4. Folklore Fourier analysis fact: if the total influence of f is at most t, then f is close to a degree-O(t) polynomial.

Learning monotone Decision Trees
Conclusion: if f is monotone and has a size-s decision tree, then it has essentially only 2^{O(√(log s))} relevant variables and essentially only degree O(√(log s)).
Algorithm: identify the essentially relevant variables (by correlation estimation), then run the Polynomial Regression algorithm up to degree O(√(log s)), but only over those relevant variables.
Total time: poly(n) · (2^{O(√(log s))})^{O(√(log s))} = poly(n, s).
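
Putting the two steps together as described, reusing likely_relevant_variables() and fit_low_degree() from the earlier sketches. The degree and threshold are parameters one would set at the ≈ √(log s) scales above; this is an illustration of the algorithm's shape under my earlier assumptions, not the paper's implementation.

    def learn_monotone_dt(examples, n, degree, threshold=0.05):
        # Step 1: candidate relevant variables via degree-1 correlation estimates.
        rel = likely_relevant_variables(examples, n, threshold)
        # Step 2: polynomial regression of the given degree over only those variables.
        projected = [([x[i] for i in rel], y) for x, y in examples]
        h_small = fit_low_degree(projected, len(rel), degree)
        return lambda x: h_small([x[i] for i in rel])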

Open problem
Learn monotone DNF under uniform in polynomial time!
A source of help: there is a poly-time algorithm for learning almost all randomly chosen monotone DNF of size up to n^3. [Servedio-Jackson 05]
Structured monotone DNF (monotone DTs) are efficiently learnable. Typical-looking monotone DNF are efficiently learnable (at least up to size n^3). So… all monotone DNF are efficiently learnable?
I think this problem is great because it is:
a) Possibly tractable.
b) Possibly true.
c) Interesting to complexity theory people.
d) Would close the book on learning monotone functions under uniform!