Massive Online Teaching to Bounded Learners Brendan Juba (Harvard) Ryan Williams (Stanford)

Teaching

Massive Online Teaching. Classical setting [Goldman-Kearns, Shinohara-Miyano]: the teacher sends labeled examples (x, f(x)), x ∈ {0,1}^n, for a target concept f ∈ C, f: {0,1}^n → {0,1}, to an arbitrary "consistent (proper) learner" whose hypothesis g must lie in C. This work: the learner is instead a bounded-complexity "consistent learner", possibly improper, i.e., its hypothesis g need not lie in C.

THIS WORK We design strategies for teaching consistent learners of bounded computational complexity.

A hard concept: the examples (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1), (0,0,0,0) illustrate a concept that requires all 2^n examples. Proposition: teaching this concept to the class of consistent learners with linear-size AC^0 circuits requires sending all 2^n examples (the learners' initial hypotheses may be arbitrary). For this reason we primarily focus on the number of mistakes made during learning, not the number of examples sent.
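
The examples shown suggest the singleton-or-empty class over {0,1}^n, which is also the class used in the lower bound later in the talk; for concreteness, a hedged reconstruction of that class:

\[
C \;=\; \{\, f_a : a \in \{0,1\}^n \,\} \cup \{ f_\emptyset \},
\qquad
f_a(x) \;=\; \begin{cases} 1 & \text{if } x = a,\\ 0 & \text{otherwise,}\end{cases}
\qquad
f_\emptyset \;\equiv\; 0 .
\]

Intuitively, for any point x not yet sent, some consistent learner in the class may still believe the target is the singleton f_x and label x with 1, so every one of the 2^n examples must be sent.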

WE SHOW
1) The state complexity of the learners controls the optimal number of mistakes in teaching.
2) It also controls the optimal length of an individually tailored sequence of examples.
3) The strategy establishing (1) can be partially derandomized (to a polynomial-size seed), but full derandomization implies strong circuit lower bounds.

I. The model (cont'd)
II. Theorems: Teaching state-bounded learners
III. Theorems: Derandomizing teaching

Bounded consistent learners. A learner is given by a pair of bounded functions, EVAL and UPDATE. Its state is a string σ ∈ {0,1}^{s(n)}; on an example x ∈ {0,1}^n its current hypothesis g is given by EVAL(σ, x) = g(x), and on receiving the label f(x) it moves to the state σ' = UPDATE(σ, x, f(x)), where we require EVAL(σ', x) = f(x). Consistent: the learner is correct on all previously seen examples, for every f ∈ C. We consider all bounded, C-consistent pairs (EVAL, UPDATE).
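
A minimal sketch of this interface in Python; the function names and the representation of inputs as integers are our illustrative choices, not the paper's:

# Illustrative rendering of the (EVAL, UPDATE) model; all names below are
# ours, not the paper's.  The state sigma is an s(n)-bit string, and inputs
# x in {0,1}^n are represented as integers in [0, 2**n) for simplicity.

def EVAL(sigma, x):
    """The hypothesis g encoded by state sigma, evaluated at x; returns 0 or 1."""
    raise NotImplementedError  # depends on the particular learner

def UPDATE(sigma, x, label):
    """Returns a new state sigma' after seeing the labeled example (x, label);
    the model requires EVAL(sigma', x) == label."""
    raise NotImplementedError  # depends on the particular learner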

I. The model
II. Theorems: Teaching state-bounded learners
III. Theorems: Derandomizing teaching

Uniform random examples are good. Theorem: under uniform random examples, any consistent s(n)-state-bounded learner identifies the concept with probability 1-δ after O(s(n)(n + log 1/δ)) mistakes. Corollary: every consistent learner with S(n)-size (bounded fan-in) circuits identifies the concept after O(S(n)^2 log S(n)) mistakes on an example sequence drawn from the uniform distribution (with high probability).
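
A sketch of the random-example teaching strategy itself; the function name and the assumption that the learner follows the EVAL/UPDATE interface sketched earlier are ours:

import random

def teach_uniform(EVAL, UPDATE, sigma, f, n, num_examples):
    """Send uniform random labeled examples and count the learner's mistakes.
    EVAL, UPDATE and the initial state sigma follow the interface sketched
    earlier; f is the target concept, a 0/1 function on integers in [0, 2**n)."""
    mistakes = 0
    for _ in range(num_examples):
        x = random.randrange(2 ** n)     # uniform example from {0,1}^n
        if EVAL(sigma, x) != f(x):       # the learner's current hypothesis errs on x
            mistakes += 1
        sigma = UPDATE(sigma, x, f(x))   # learner updates; must now be correct on x
    return mistakes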

A lower bound. Theorem: there is a consistent learner for the singleton/empty class with s-bit states (and O(sn)-size AC^0 circuits) that makes s-1 mistakes on the empty concept, for 2^n ≥ s ≥ n. Idea: divide {0,1}^n into s-1 intervals; the learner's state initially indicates whether each interval should be labeled 0 or (by default) 1. It switches to a singleton hypothesis on a positive example, and switches off the corresponding interval on each negative example. Compare this s-1 mistake lower bound with the O(sn) mistake upper bound above.
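
A sketch of that interval learner as a concrete instance of the EVAL/UPDATE interface above; the state encoding, names, and integer representation of inputs are our illustrative choices:

# Illustrative sketch of the interval learner; the state encoding and names
# are ours.  Strings in {0,1}^n are read as integers in [0, 2**n) and split
# into s-1 roughly equal intervals, each initially labeled 1.

def initial_state(n, s):
    width = max(1, (2 ** n) // (s - 1))
    return ((True,) * (s - 1), None, width)   # (interval flags, singleton point, width)

def EVAL(sigma, x):
    flags, singleton, width = sigma
    if singleton is not None:                 # committed to a singleton hypothesis
        return 1 if x == singleton else 0
    i = min(x // width, len(flags) - 1)
    return 1 if flags[i] else 0               # "on" intervals are labeled 1 by default

def UPDATE(sigma, x, label):
    flags, singleton, width = sigma
    if label == 1:                            # positive example: switch to the singleton {x}
        return (flags, x, width)
    i = min(x // width, len(flags) - 1)       # negative example: switch that interval off
    return (flags[:i] + (False,) + flags[i + 1:], None, width)

# On the empty concept (label 0 everywhere), any sequence that brings this
# learner to the all-zero hypothesis must hit every interval, forcing one
# mistake per interval: s-1 mistakes in total.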

Main Lemma: after s(n) mistakes, the fraction of {0,1}^n that the learner could ever label incorrectly shrinks below ¾ of its previous size, with high probability. Proof sketch: suppose not. Since the learner is consistent, the mistakes must all fall within a subset S of at most ¼ of W, the initial set of examples the learner could label incorrectly. Under the uniform distribution, an example hits S with probability < ¼, conditioned on hitting W.

Main Lemma (continued): each mistake must come from W (by definition); to reach the given state, the mistakes must all hit S. The ≥ s(n) draws from W all fall into S with probability < (¼)^{s(n)}, so we reach that state with ≥ ¾ of W remaining with probability < (½)^{2s(n)}. A union bound over the (≤ 2^{s(n)}) states with ≥ ¾ of W remaining completes the argument. For custom (individually tailored) sequences: sample from W directly.
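
Putting numbers on the union bound (our reconstruction of the calculation):

\[
\Pr\bigl[\exists\ \text{state with} \ge \tfrac34 \text{ of } W \text{ remaining after } s(n) \text{ mistakes}\bigr]
\;\le\; 2^{s(n)} \cdot \bigl(\tfrac14\bigr)^{s(n)} \;=\; 2^{-s(n)} .
\]

Each round of shrinking W below ¾ of its size costs s(n) mistakes, and O(n + log 1/δ) rounds suffice to shrink W (initially of size at most 2^n) to nothing with overall failure probability δ, which on our reading gives the O(s(n)(n + log 1/δ)) mistake bound of the theorem.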

RECAP. Theorem: under uniform random examples, any consistent s(n)-state-bounded learner identifies the concept with probability 1-δ after O(s(n)(n + log 1/δ)) mistakes. Theorem: every deterministic consistent s(n)-state-bounded learner has a sequence of examples of length O(s(n)·n) after which it identifies the concept.

I. The model
II. Theorems: Teaching state-bounded learners
III. Theorems: Derandomizing teaching

Derandomization. Our strategy uses ≈ n^2·2^n random bits. The learner takes examples given by blocks of n uniform-random bits at a time and stores only s(n) bits between examples. ☞ Nisan's pseudorandom generator (for small-space computation) should apply!

Using Nisan's generator. Theorem: Nisan's generator produces a sequence of O(n·2^n) examples from a seed of length O((n + log 1/δ)(s(n) + n + log 1/δ)) such that, with probability 1-δ, an s(n)-state-bounded learner identifies the concept and makes at most O(s(n)(n + log 1/δ)) mistakes. Idea: consider a machine A' that simulates the learner A and counts its mistakes using n more bits; the "good event" for A can then be defined in terms of the states of A'.
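
As an instantiation of the stated bound (our arithmetic): for constant δ and a polynomial state bound s(n) ≥ n, the seed length is

\[
O\bigl((n + \log\tfrac1\delta)(s(n) + n + \log\tfrac1\delta)\bigr) \;=\; O\bigl(n \cdot s(n)\bigr) \;=\; \mathrm{poly}(n)
\]

bits, compared with the ≈ n^2·2^n truly random bits used by the original strategy.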

Using Nisan's generator still requires poly(n) random bits. Can we construct a low-mistake sequence deterministically? Theorem: suppose there is an EXP algorithm that, for every polynomial S(n) and all sufficiently large n, produces a sequence such that every S(n)-size circuit learner makes fewer than 2^n mistakes in learning the empty concept. Then EXP ⊄ P/poly. Idea: there is then an EXP learner that switches to the next singleton in the sequence as long as the examples remain on the sequence. This learner makes 2^n mistakes on that sequence, so it cannot be implemented by S(n)-size circuits for any polynomial S(n); hence it is not in P/poly.
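
Spelling out the contradiction as we read it:

\[
\mathrm{EXP} \subseteq \mathrm{P/poly}
\;\Longrightarrow\; \text{the EXP learner has } S(n)\text{-size circuits for some polynomial } S(n)
\;\Longrightarrow\; \text{it makes fewer than } 2^n \text{ mistakes on the sequence},
\]

contradicting the 2^n mistakes it actually makes; so the assumed EXP algorithm implies EXP ⊄ P/poly.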

WE SHOWED
1) The state complexity of the learners controls the optimal number of mistakes in teaching.
2) It also controls the optimal length of an individually tailored sequence of examples.
3) The strategy establishing (1) can be partially derandomized (to a polynomial-size seed), but full derandomization implies strong circuit lower bounds.

Open problems:
- O(sn) mistake upper bound vs. Ω(s) mistake lower bound: what is the right bound?
- Can we weaken the consistency requirement?
- Can we establish EXP ⊄ ACC by constructing deterministic teaching sequences for ACC?
- Does EXP ⊄ P/poly conversely imply deterministic algorithms for teaching?
