Adam Tauman Kalai, Georgia Tech; Yishay Mansour, Google and Tel-Aviv University; Elad Verbin, Tsinghua University. On Agnostic Boosting and Parity Learning.

Presentation transcript:

On Agnostic Boosting and Parity Learning
Adam Tauman Kalai, Georgia Tech; Yishay Mansour, Google and Tel-Aviv University; Elad Verbin, Tsinghua University

Defs
- Agnostic learning = learning with adversarial noise
- Boosting = turning a weak learner into a strong learner
- Parities = parities of subsets of the bits: f:{0,1}^n → {0,1}, e.g. f(x) = x_1 ⊕ x_3 ⊕ x_7

Outline
1. Agnostic boosting: turning a weak agnostic learner into a strong agnostic learner
2. A 2^{O(n/log n)}-time algorithm for agnostically learning parities over any distribution
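For concreteness, here is a minimal Python sketch of the definitions above (all names and the specific adversary are illustrative assumptions, not from the talk): a parity of a subset of the bits, and agnostic examples whose labels come from an arbitrary function g that need not be any parity.

```python
import random

def parity(x, S):
    """Parity of the bits of x indexed by S; S = {0, 2, 6} encodes x_1 xor x_3 xor x_7."""
    return sum(x[i] for i in S) % 2

def agnostic_examples(n, g, m):
    """m examples (x, g(x)) with x uniform over {0,1}^n; in the agnostic setting
    g is arbitrary (adversarial), not necessarily a parity."""
    xs = (tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m))
    return [(x, g(x)) for x in xs]

# Example: g agrees with the parity on S except when the first bit is 0 (adversarial corruption).
n, S = 8, {0, 2, 6}
g = lambda x: parity(x, S) ^ (1 - x[0])
sample = agnostic_examples(n, g, 5)
```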

Agnostic boosting
- Weak learner: for any noise rate < ½, produces a better-than-trivial hypothesis.
- Agnostic Booster: runs the weak learner as a black box.
- Strong learner: produces an almost-optimal hypothesis.

Learning with Noise
- Learning without noise: well understood*
- Learning with random noise: well understood* (SQ model)
- Learning with agnostic noise: it's, like, a really hard model!!!
* up to well-studied open problems (i.e. we know where we're stuck)

Agnostic Learning: some known results classGround distributionnotes Halfspaces Kalai, Klivans, Mansour, Servedio] uniform, log-concave Parities [Feldman, Gopalan, Khot, Ponnuswami] uniform2 O(n/logn) Decision Trees [Gopalan, Kalai, Klivans] uniformwith MQ Disjunctions Kalai, Klivans, Mansour, Servedio] all distributions2 O(√n) ???all distributions

Agnostic Learning: some known results
(same table as the previous slide)
Due to hardness, or lack of tools??? Agnostic boosting: a strong tool that makes it easier to design algorithms.

Why care about agnostic learning?
- More relevant in practice.
- Impossibility results might be useful for building cryptosystems.
- Non-noisy learning ≈ CSP; agnostic learning ≈ MAX-CSP.

Noisy learning
Setting: f:{0,1}^n → {0,1} from class F; the algorithm gets labeled samples where x is drawn from distribution D.
- No noise: the learning algorithm should approximate f up to error ε.
- Random noise (η% noise): the learning algorithm should approximate f up to error ε.
- Adversarial (≈ agnostic) noise: an adversary may corrupt an η-fraction of the labels, yielding g; the learning algorithm should approximate g up to error η + ε.

Agnostic learning (geometric view)
(figure: the class F as a region, g a point at distance opt from F, and a ball of radius opt + ε around g)
Parameters: the class F and a metric. Input: an oracle for g. Goal: return some element of the ball of radius opt + ε around g; in PROPER LEARNING the output must in addition belong to F.
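A minimal sketch of this picture (the names are mine): opt is the distance from g to the nearest member of F under the disagreement metric, and success means landing in the ball of radius opt + ε around g.

```python
def err(h, g, points):
    """Disagreement between h and g on a finite sample (the metric in the picture)."""
    return sum(h(x) != g(x) for x in points) / len(points)

def opt_distance(F, g, points):
    """opt = distance from g to the closest function in the class F."""
    return min(err(f, g, points) for f in F)

def agnostic_success(h, F, g, points, eps):
    """The goal: h lies in the ball of radius opt + eps around g.
    (Proper learning would additionally require h to be a member of F.)"""
    return err(h, g, points) <= opt_distance(F, g, points) + eps
```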

Agnostic boosting: definition
Weak learner: given a distribution D and examples labeled by g with opt ≤ ½ − γ, w.h.p. outputs a hypothesis h with err_D(g,h) ≤ ½ − γ.

Agnostic boosting
Weak learner: given a distribution D and examples labeled by g with opt ≤ ½ − γ, w.h.p. outputs h with err_D(g,h) ≤ ½ − γ.
Agnostic Booster: draws a sample s from g, runs the weak learner poly(1/γ, 1/ε) times, and w.h.p. outputs h' with err_D(g,h') ≤ opt + ε.

Agnostic boosting ( ,  )-weak learner err D (g,h)· ½ -  w.h.p. h opt · ½ -  g Agnostic Booster w.h.p. h’ Sample s from g Runs weak learner poly(  times err D (g,h’) · opt +  +  DD

Agnostic boosting
- Weak learner: for any noise rate < ½, produces a better-than-trivial hypothesis.
- Strong learner: produces an almost-optimal hypothesis.

"Approximation Booster" Analogy
- Weak algorithm: a poly-time MAX-3-SAT algorithm that, whenever opt = 7/8 + ε, produces a solution with value 7/8 + ε/100.
- Boosted algorithm: a MAX-3-SAT algorithm that produces a solution with value opt − ε, with running time poly(n, 1/ε).

Gap
(figure: the interval from 0 to 1, with ½ marked)
No hardness gap close to ½; the booster gives no gap anywhere (an additive PTAS).

Agnostic boosting
- New analysis for the Mansour-McAllester booster: uses branching programs; the nodes are weak hypotheses.
- Previous agnostic boosting: Ben-David + Long + Mansour, and Gavinsky, defined agnostic boosting differently; their results cannot be used for our application.

Booster (figure)
Start with a single node labeled h_1: on input x, follow the edge h_1(x)=1 or h_1(x)=0 to a leaf outputting 1 or 0.

Booster: split step (figure)
Each leaf of the current program induces a different distribution on examples. Run the weak learner on the distribution reaching a leaf, obtaining h_2 (for the h_1(x)=1 leaf) or h_2' (for the h_1(x)=0 leaf), replace that leaf by a node labeled with the new weak hypothesis, and choose the "better" of the two options.

Booster: split step (figure)
After further splits, the branching program has internal nodes h_1, h_2, h_3, with leaves labeled 0 and 1.

Booster: split step (figure)
Splitting continues: h_4 is added, and so on.

Booster: merge step (figure)
Merge two leaves if they are "similar", keeping the branching program small.

Booster: merge step (figure)
The branching program after the merge.

Booster: another split step (figure)
A further split adds h_5, and so on.

Booster: final result (figure)
A branching program whose internal nodes are weak hypotheses and whose leaves output 0 or 1; to classify x, follow the path determined by the weak hypotheses' answers.
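A minimal sketch of the final hypothesis as a data structure (class and field names are my own, for illustration): a branching program whose internal nodes query weak hypotheses and whose leaves output a label.

```python
class Node:
    """Internal node of the branching program: queries a weak hypothesis h and
    branches on its 0/1 answer. Leaves carry a fixed output label instead."""
    def __init__(self, h=None, child0=None, child1=None, label=None):
        self.h, self.child0, self.child1, self.label = h, child0, child1, label

def classify(root, x):
    """Follow the path determined by the weak hypotheses until a leaf is reached."""
    node = root
    while node.label is None:
        node = node.child1 if node.h(x) == 1 else node.child0
    return node.label

# Example: the two-node program from the first split-step picture —
# route by h1, then (on the h1(x)=1 branch) by h2.
h1 = lambda x: x[0]
h2 = lambda x: x[1]
leaf0, leaf1 = Node(label=0), Node(label=1)
root = Node(h=h1, child0=leaf0, child1=Node(h=h2, child0=leaf0, child1=leaf1))
print(classify(root, (1, 1)))   # -> 1
```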

Agnostically learning parities

Application: Parity with Noise

                  | Uniform distribution                                           | Any distribution
Random noise      | 2^{O(n/log n)} [Blum Kalai Wasserman]                          |
Agnostic learning | 2^{O(n/log n)} [Feldman Gopalan Khot Ponnuswami], via Fourier  | 2^{O(n/log n)}, this work*

* non-proper learner: the hypothesis is a circuit with 2^{O(n/log n)} gates.
Feldman et al. give a black-box reduction to the random-noise case; we give a direct result.
Theorem: for every ε, there is a weak learner that, for noise rate ½ − ε, produces a hypothesis which is wrong on a ½ − (2ε)^n/2 fraction of the space. Running time 2^{O(n/log n)}.

Corollary: learners for many classes (without noise)
- Can learn, without noise, any class with a "guaranteed correlated parity" in time 2^{O(n/log n)}, e.g. DNF. Any others?
- A weak parity learner that runs in 2^{O(n^0.32)} time would beat the best algorithm known for learning DNF.
- Good evidence that parity with noise is hard: efficient cryptosystems [Hopper-Blum, Blum-Furst et al., and many others]?

Idea of the weak agnostic parity learner
Main idea:
1. Take a learner which resists random noise (BKW).
2. Add randomness to its behavior until you get a weak agnostic learner.
"Between two evils, I pick the one I haven't tried before" – Mae West
"Between two evils, I pick uniformly at random" – CS folklore

Summary
Problem: it is difficult, but perhaps possible, to design agnostic learning algorithms.
Proposed solution: agnostic boosting.
Contributions:
1. A right(er) definition for a weak agnostic learner
2. Agnostic boosting
3. Learning parity with noise in the hardest noise model
4. Entertaining STOC '08 participants

Open Problems
1. Find other applications for agnostic boosting.
2. Improve PwN algorithms: get a proper learner for parity with noise; reduce PwN with agnostic noise to PwN with random noise.
3. Get evidence that PwN is hard: prove that if parity with noise is easy then FACTORING is easy. $128 reward!

May the parity be with you! The end.

Sketch of weak parity learner

Weak parity learner (figure)
Sample labeled points from the distribution and an unlabeled query point x; we want to guess f(x). Bucket the samples according to their last 2n/log n bits, XOR points within each bucket to zero out those bits, and pass the results to the next round.

Weak parity learner (figure)
Last round: the remaining bits sum to 0, so about √n of the original vectors combine to sum = 0; XORing their labels gives a guess for f(x).

Weak parity learner (figure)
Last round: √n vectors with sum = 0 give a guess for f(x). By symmetry, the probability of a mistake equals the fraction of mistakes, and the claim (via Cauchy-Schwarz) is that this fraction is bounded below ½, as in the theorem's ½ − (2ε)^n/2 bound.
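For intuition, here is a rough, simplified Python sketch of one BKW-style round of the bucketing step described above (my own simplification, not the talk's exact algorithm): bucket the labeled points by a block of bits, then XOR within each bucket so that block becomes zero for the next round.

```python
from collections import defaultdict

def xor_vec(a, b):
    return tuple(ai ^ bi for ai, bi in zip(a, b))

def bkw_round(examples, block):
    """One round: bucket labeled examples (x, y) by the bits of x indexed by 'block',
    then XOR each example in a bucket with a pivot example, zeroing those bits."""
    buckets = defaultdict(list)
    for x, y in examples:
        buckets[tuple(x[i] for i in block)].append((x, y))
    next_round = []
    for group in buckets.values():
        pivot_x, pivot_y = group[0]
        for x, y in group[1:]:
            next_round.append((xor_vec(x, pivot_x), y ^ pivot_y))
    return next_round
```

The actual weak agnostic learner adds randomness on top of a routine like this, as described on the "idea" slide above.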

Intuition behind two main parts

Intuition behind Boosting

(figure) Correctly classified points: decrease weight. Misclassified points: increase weight.

Intuition behind Boosting (figure)
Decrease the weight of correctly classified points, increase the weight of misclassified points. Run, reweight, run, reweight, ...; take a majority vote of the hypotheses. Algorithmic and efficient; via the Yao-von Neumann Minimax Principle.
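As a reminder of the classical boosting loop sketched in this picture — run, reweight, repeat, then take a (weighted) majority — here is a minimal AdaBoost-style sketch in Python (generic textbook boosting, not the branching-program booster of this talk):

```python
import math

def boost(examples, weak_learner, rounds):
    """examples: list of (x, y) with y in {-1, +1};
    weak_learner(examples, weights) returns a hypothesis h with h(x) in {-1, +1}."""
    weights = [1.0 / len(examples)] * len(examples)
    hypotheses = []
    for _ in range(rounds):
        h = weak_learner(examples, weights)
        err = sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
        err = min(max(err, 1e-9), 1 - 1e-9)
        alpha = 0.5 * math.log((1 - err) / err)
        # Decrease the weight of correctly classified points, increase it on mistakes.
        weights = [w * math.exp(-alpha if h(x) == y else alpha)
                   for w, (x, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
        hypotheses.append((alpha, h))
    # Final hypothesis: weighted majority vote of the weak hypotheses.
    return lambda x: 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
```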