An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution Jeffrey C. Jackson Presented By: Eitan Yaakobi Tamar Aizikowitz

2 Presentation Outline
Introduction
Algorithms We Use
  - Estimating Expected Values
  - Hypothesis Boosting
  - Finding Weak-approximating Parity Functions
Learning DNF With Respect to Uniform
  - Existence of Weak Approximating Parity Functions for every f, D
  - Nonuniform Weak DNF Learning
  - Strongly Learning DNF

3 Introduction DNF is weakly learnable with respect to the uniform distribution, as shown by Kushilevitz and Mansour. We show that DNF is weakly learnable with respect to a certain class of nonuniform distributions. We then use a method based on Freund's boosting algorithm to produce a strong learner with respect to the uniform distribution.

4 Algorithms We Use Our learning algorithm makes use of several previously known algorithms. The following is a short reminder of these algorithms.

5 Estimating Expected Values The AMEAN Algorithm efficiently estimates the expectation of a bounded random variable. It is based on Hoeffding's inequality: let X_1, ..., X_m be independent random variables with X_i ∈ [a, b] and E[X_i] = μ for all i; then
Pr[ | (1/m) Σ_i X_i − μ | ≥ λ ] ≤ 2 exp( −2λ²m / (b−a)² ).

6 The AMEAN Algorithm
Input: a random variable X ∈ [a, b]; the width b − a; λ, δ > 0
Output: μ' such that Pr[ | E[X] − μ' | ≤ λ ] ≥ 1 − δ
Running time: O( (b−a)² log(δ⁻¹) / λ² )
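A minimal sketch of an AMEAN-style estimator in Python. The random variable is assumed to be given as a zero-argument sampling function, and the sample-size formula is the standard Hoeffding bound rather than the paper's exact constant.

```python
import math
import random

def amean(draw, a, b, lam, delta):
    """Estimate E[X] to within +/- lam with probability at least 1 - delta.

    `draw` is a zero-argument function returning one sample of X in [a, b].
    By Hoeffding's inequality, averaging
    m = ceil((b - a)^2 * ln(2/delta) / (2 * lam^2)) independent samples suffices.
    """
    m = math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * lam ** 2))
    return sum(draw() for _ in range(m)) / m

# Example: estimate the bias of a coin to within 0.05 with confidence 0.99.
mu_hat = amean(lambda: 1.0 if random.random() < 0.3 else 0.0, 0.0, 1.0, 0.05, 0.01)
```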

7 Hypothesis Boosting Our algorithm is based on boosting weak hypotheses into a final strong hypothesis. We use a boosting method very similar to Freund’s boosting algorithm. We refer to Freund’s original algorithm as F1.

8 The F1 Boosting Algorithm
Input: positive ε, δ, and γ; a (½ − γ)-approximate PAC learner for a representation class ℱ; EX(f, D) for some f in ℱ and any distribution D
Output: with probability at least 1 − δ, an ε-approximation for f with respect to D
Running time: polynomial in n, s, γ⁻¹, ε⁻¹, and log(δ⁻¹)

9 The Idea Behind F1 (1) The algorithm generates a series of weak hypotheses h_i. h_0 is a weak approximator for f with respect to the distribution D. Each subsequent h_i is a weak approximator for f with respect to the distribution D_i.

10 The Idea Behind F1 (2) Each distribution D_i focuses weight on those areas where slightly more than half of the hypotheses generated so far were incorrect. The final hypothesis h is a majority vote over all the h_i's.

11 The Idea Behind F1 (3) If a sufficient number of weak hypotheses is generated, then h will be an ε-approximator for f with respect to the distribution D. Freund showed that (1/2)γ⁻² ln(ε⁻¹) weak hypotheses suffice.
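To make this structure concrete, here is a schematic boosting loop in Python. It only illustrates the shape of F1 (reweight the sample, call the weak learner, take a majority vote); the reweighting rule used here is a simple placeholder, not Freund's actual boost-by-majority weights, and the example/label interface is an assumption. The round count is the bound quoted on the slide.

```python
import math

def f1_style_boost(weak_learner, examples, labels, gamma, eps):
    """Schematic of the F1 boosting structure (placeholder reweighting).

    weak_learner(dist) must return a hypothesis h with error at most
    1/2 - gamma under the distribution `dist` over the sample.
    Labels are assumed to be -1/+1.  The final hypothesis is a majority
    vote over the weak hypotheses, as in F1.
    """
    n_rounds = math.ceil(0.5 * gamma ** -2 * math.log(1.0 / eps))
    hypotheses = []
    for _ in range(n_rounds):
        # Concentrate weight on points where roughly half of the hypotheses
        # generated so far are wrong (placeholder rule; F1 prescribes
        # specific binomial weighting factors).
        weights = []
        for x, y in zip(examples, labels):
            correct = sum(1 for h in hypotheses if h(x) == y)
            margin = correct - len(hypotheses) / 2.0
            weights.append(math.exp(-max(margin, 0.0)))
        total = sum(weights)
        dist = [w / total for w in weights]
        hypotheses.append(weak_learner(dist))

    def majority(x):
        return 1 if sum(h(x) for h in hypotheses) >= 0 else -1
    return majority
```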

12 Finding Weak-approximating Parity Functions In order to use the boosting algorithm, we need to be able to generate weak approximators for our DNF f with respect to the distributions D_i. Our algorithm is based on the Weak Parity (WP) algorithm of Kushilevitz and Mansour.

13 The WP Algorithm Finds the large Fourier coefficients of a Boolean function f on {0,1}^n using a membership oracle for f.

14 The WP' Algorithm (1) Our learning algorithm will need to find the large coefficients of a non-Boolean function. The basic WP algorithm can be extended to the WP' algorithm, which works for non-Boolean f as well. WP' gives us a weak approximator for a non-Boolean f with respect to the uniform distribution.

15 The WP' Algorithm (2)
Input: MEM(f) for f : {0,1}^n → ℝ; θ, δ > 0; n; L_∞(f)
Output: with probability at least 1 − δ, WP' outputs a set S such that for all A: if |f̂(A)| ≥ θ then A ∈ S
Running time: polynomial in n, θ⁻¹, log(δ⁻¹), and L_∞(f)
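For intuition, here is a rough Python sketch of the prefix-refinement idea behind WP/WP' (the Kushilevitz-Mansour search): estimate, for each prefix α, the total squared Fourier weight of coefficients extending α, and keep only prefixes whose weight could still hide a coefficient of magnitude θ. The sample size, the pruning slack, and the oracle interface (a function on n-bit tuples) are illustrative assumptions; for a non-Boolean g the sample size and threshold would also have to scale with L_∞(g).

```python
import random

def km_find_large_coeffs(mem, n, theta, sample_size=2000):
    """Sketch of a Kushilevitz-Mansour / WP'-style search for indices A with
    |f_hat(A)| >= theta, given a membership oracle `mem` on n-bit tuples."""
    def chi(alpha, y):
        # chi_alpha(y) = (-1)^{<alpha, y>}
        return -1 if sum(a & b for a, b in zip(alpha, y)) % 2 else 1

    def weight_estimate(alpha):
        # Estimate sum over suffixes beta of f_hat(alpha.beta)^2, which equals
        # E_x[ f_alpha(x)^2 ] with f_alpha(x) = E_y[ f(y x) chi_alpha(y) ],
        # y ranging over the first len(alpha) bits.
        k = len(alpha)
        total = 0.0
        for _ in range(sample_size):
            x = tuple(random.getrandbits(1) for _ in range(n - k))
            y1 = tuple(random.getrandbits(1) for _ in range(k))
            y2 = tuple(random.getrandbits(1) for _ in range(k))
            total += mem(y1 + x) * chi(alpha, y1) * mem(y2 + x) * chi(alpha, y2)
        return total / sample_size

    # Prefix refinement: extend surviving prefixes one bit at a time, keeping
    # only those whose remaining squared weight could still hide a coefficient
    # of magnitude theta (the theta^2 / 2 slack is illustrative).
    candidates = [()]
    for _ in range(n):
        candidates = [alpha + (b,)
                      for alpha in candidates for b in (0, 1)
                      if weight_estimate(alpha + (b,)) >= theta ** 2 / 2]
    return candidates
```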

16 Learning DNF with Respect to Uniform We now show the main result: DNF is learnable with respect to the uniform distribution. We begin by showing that for every DNF f and distribution D there exists a parity function that weakly approximates f with respect to D. We use this to produce an algorithm for weakly learning DNF with respect to certain nonuniform distributions. Finally, we show that this weak learner can be boosted into a strong learner with respect to the uniform distribution.

17 Existence of Weak Approximating Parity Functions for every f, D (1) For every DNF f and every distribution D there exists a parity function that weakly approximates f with respect to D. The more difficult case is when E_D[f] ≈ 0; if |E_D[f]| is bounded away from 0, the constant function (the empty parity) is already a weak approximator.

18 Existence of Weak Approximating Parity Functions for every f, D (2) Let f be a DNF such that E_D[f] ≈ 0, and let s be the number of terms in f. Let T(x) be the {−1,+1}-valued function equivalent to the term of f that is best correlated with f with respect to D.

19 Existence of Weak Approximating Parity Functions for every f, D (3)

20 Existence of Weak Approximating Parity Functions for every f, D (4)
T is a term of f ⇒ Pr_D[ T(x) = f(x) | f(x) = −1 ] = 1.
There are s terms in f and T is the one best correlated with f ⇒ Pr_D[ T(x) = f(x) | f(x) = 1 ] ≥ 1/s.
Since E_D[f] ≈ 0, it follows that Pr_D[ T(x) = f(x) ] ≥ ½(1 + 1/s), and hence E_D[f·T] ≥ 1/s.
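Written out, the chain on this slide (using Pr_D[f = 1] ≈ Pr_D[f = −1] ≈ 1/2, which follows from the slide-18 assumption E_D[f] ≈ 0, and treating the approximation as equality for readability):

```latex
\begin{align*}
\Pr_D[T = f]
  &= \Pr_D[T = f \mid f = -1]\,\Pr_D[f = -1] + \Pr_D[T = f \mid f = 1]\,\Pr_D[f = 1] \\
  &\ge 1 \cdot \tfrac{1}{2} + \tfrac{1}{s} \cdot \tfrac{1}{2}
   = \tfrac{1}{2}\Bigl(1 + \tfrac{1}{s}\Bigr), \\
\mathbf{E}_D[f \cdot T]
  &= \Pr_D[T = f] - \Pr_D[T \ne f]
   = 2\Pr_D[T = f] - 1 \;\ge\; \tfrac{1}{s}.
\end{align*}
```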

21 Existence of Weak Approximating Parity Functions for every f, D (5) T can be represented using its Fourier transform: T(x) = Σ_A T̂(A)·χ_A(x). Since E_D[f·T] = Σ_A T̂(A)·E_D[f·χ_A] and Σ_A |T̂(A)| is small for a single term, some parity χ_A (possibly negated) must satisfy E_D[f·χ_A] = Ω(1/s).

22 Nonuniform Weak DNF Learning (1) We have shown that for every DNF f and every distribution D there exists a parity function that is a weak approximator for f with respect to D. How can we find such a parity function? We want an algorithm that, given a threshold θ and a distribution D, finds a parity whose correlation with f under D is at least on the order of θ.

23 Nonuniform Weak DNF Learning (2) Define g(x) = 2^n·f(x)·D(x). Then for every parity χ_A, ĝ(A) = E_uniform[g·χ_A] = Σ_x D(x)·f(x)·χ_A(x) = E_D[f·χ_A]. So a parity well correlated with f under D corresponds exactly to a large Fourier coefficient of g.

24 Nonuniform Weak DNF Learning (3)
We have reduced the problem of finding a well-correlated parity to finding a large Fourier coefficient of g. g is not Boolean, therefore we use WP'.
Invocation: WP'(n, MEM(g), θ, L_∞(g), δ), where MEM(g)(x) = 2^n · MEM(f)(x) · D(x).

25 The WDNF Algorithm (1) We define a new algorithm: Weak DNF (WDNF). WDNF finds the large Fourier coefficients of g(x) = 2^n·f(x)·D(x), thereby finding a parity that is well correlated with f with respect to the distribution D. WDNF makes use of the WP' algorithm for finding the Fourier coefficients of the non-Boolean g.

26 The WDNF Algorithm (2)
Proof sketch: let g(x) = 2^n·f(x)·D(x).
Output, with probability at least 1 − δ: a parity χ_A with ĝ(A) = E_D[f·χ_A] = Ω(s⁻¹).
Running time: polynomial in n, s, log(δ⁻¹), and L_∞(2^n·D).

27 The WDNF Algorithm (3)
Input: EX(f, D); MEM(f); an oracle for D; δ > 0
Output: with probability at least 1 − δ, a parity function h (possibly negated) such that E_D[f·h] = Ω(s⁻¹)
Running time: polynomial in n, s, log(δ⁻¹), and L_∞(2^n·D)
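A sketch of how WDNF could wire these pieces together, reusing the km_find_large_coeffs sketch given after slide 15. The threshold value and the evaluable-D(x) interface are assumptions for illustration, and confidence handling is omitted.

```python
def wdnf(mem_f, dist, n, s):
    """Sketch of WDNF: find a parity well correlated with f under D.

    mem_f(x) : membership oracle for f, returning a value in {-1, +1}
    dist(x)  : an evaluable oracle returning D(x)
    The key step is building g(x) = 2^n * f(x) * D(x) and handing it to the
    WP'-style coefficient search, since g_hat(A) = E_D[f * chi_A].
    """
    theta = 1.0 / (2 * s + 1)  # illustrative threshold; the target correlation is Omega(1/s)

    def mem_g(x):
        return (2 ** n) * mem_f(x) * dist(x)

    # km_find_large_coeffs is the WP'-style sketch given earlier; it returns
    # candidate indices A whose Fourier coefficient of g is large, i.e.
    # parities well correlated with f under D.
    return km_find_large_coeffs(mem_g, n, theta)
```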

28 The WDNF Algorithm (4) WDNF is polynomial in L_∞(g) = L_∞(2^n·D).
⇒ If D is at most poly(n, s, ε, δ⁻¹) / 2^n, then WDNF runs in time polynomial in the usual parameters. Such a D is referred to as polynomially-near uniform.
⇒ WDNF weakly learns DNF with respect to any polynomially-near-uniform distribution D.

29 Strongly Learning DNF We define the Harmonic Sieve (HS) algorithm. HS is an application of the F1 boosting algorithm to the weak learner given by WDNF. The main difference between HS and F1 is the need to supply WDNF with an oracle for the distribution D_i at each stage of boosting.

30 The HS Algorithm (1)
Input: EX(f, D); MEM(f); an oracle for D; s; ε, δ > 0
Output: with probability at least 1 − δ, a hypothesis h that is an ε-approximator of f with respect to D
Running time: polynomial in n, s, ε⁻¹, log(δ⁻¹), and L_∞(2^n·D)

31 The HS Algorithm (2) For WDNF to work, and to work efficiently, two requirements must be met: an oracle for the distribution must be provided to the learner, and the distribution must be polynomially-near uniform. We show how to simulate an approximate oracle D_i' that can be provided to the weak learner instead of an exact one. We then show that the distributions D_i are in fact polynomially-near uniform.

32 Simulating D_i (1) Each boosting distribution D_i is defined as a pointwise weight divided by a normalizing sum over all inputs. To provide an exact oracle we would need to compute this denominator, which could potentially take exponentially long. Instead, we estimate its value using AMEAN.
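As a rough illustration, the simulation can be sketched as follows, reusing the amean routine from the earlier sketch. It assumes the base distribution D is uniform (the case HS needs) and that the stage weights lie in [0, 1]; the `weight` function is a placeholder for F1's actual per-point weighting, not the paper's exact definition.

```python
import random

def simulate_dist_oracle(weight, n, lam, delta):
    """Sketch of the D_i' simulation over the uniform base distribution.

    The exact oracle D_i(x) = weight(x) / sum_y weight(y) needs the full
    normalizing sum; instead we estimate E_uniform[weight] with AMEAN, so
    the returned oracle equals c_i * D_i(x) for some constant c_i close to 1.
    """
    def draw_weight():
        x = tuple(random.getrandbits(1) for _ in range(n))
        return weight(x)

    # amean is the estimator sketched after slide 6; weights assumed in [0, 1].
    c_hat = amean(draw_weight, 0.0, 1.0, lam, delta)

    def d_prime(x):
        # sum_y weight(y) = 2^n * E_uniform[weight] ~ 2^n * c_hat
        return weight(x) / (c_hat * 2 ** n)

    return d_prime
```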

33 Simulating D_i (2) The simulated oracle D_i' returns the weight of x divided by the AMEAN estimate of the normalizer. Thus D_i'(x) = c_i·D_i(x), where the constant c_i is the ratio of the true normalizer to its estimate.

34 Implications of Using D_i' Note that g_i' = 2^n·f·D_i' = 2^n·f·c_i·D_i = c_i·g_i.
⇒ Multiplying the distribution oracle by a constant is like multiplying all the Fourier coefficients of g_i by the same constant.
⇒ The relative sizes of the coefficients stay the same.
⇒ WDNF will still be able to find the large coefficients, and the running time is not adversely affected.

35 Bound on Distributions D_i It can be shown that, for each i, D_i is bounded by a polynomial in L_∞(D) and ε⁻¹.
⇒ If D is polynomially-near uniform, then each D_i is also polynomially-near uniform.
⇒ HS strongly learns DNF with respect to the uniform distribution.

36 Summary DNF can be weakly learned with respect to polynomially-near-uniform distributions using the WDNF algorithm. The HS algorithm strongly learns DNF with respect to the uniform distribution by boosting the WDNF weak learner.