Batch vs. online learning (transductive [Littlestone89]). Sham Kakade and Adam Kalai, Toyota Technological Institute (TTI).

Presentation transcript:

Batch vs. online learning
(batch: i.i.d. examples; online: transductive [Littlestone89])
Sham Kakade and Adam Kalai
Toyota Technological Institute (TTI)

Batch learning vs. online learning

Family of functions F (e.g. halfspaces).

Batch learning: a distribution D over X × {–,+}; the learner draws i.i.d. labeled examples and outputs a hypothesis h.
  Agnostic model [Kearns, Schapire, Sellie 94]: Alg. H maps (x_1,y_1),…,(x_n,y_n) to h ∈ F.
  Def. H learns F if, for every D, E[err(h)] ≤ min_{f ∈ F} err(f) + n^(-c), and H runs in time poly(n).
  ERM = "best on data".

Online learning: (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+} is an arbitrary sequence; on each round i the learner commits to h_i using only the past examples, predicts h_i(x_i), and then sees y_i.
  Goal: err(alg) ≤ min_{f ∈ F} err(f) + ε.

(Figures on the slide: a halfspace h separating – and + points drawn from D; the online points x_1 (–), x_2 (+), x_3 (+), x_4 (–), x_5 (+), x_6 (+) arriving one at a time, with hypotheses h_1, h_2, h_3, … updated along the way.)
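To make the comparison concrete, here is a minimal Python sketch (my illustration, not from the slides) of the online protocol and the regret quantity err(alg) - min over f in F of err(f); the learner interface (predict/update) and the finite class F are hypothetical placeholders.

# Minimal sketch (not from the slides) of the online protocol and the
# regret err(alg) - min_{f in F} err(f). The learner interface
# (predict/update) and the finite class F are hypothetical placeholders.

def online_regret(learner, F, sequence):
    """Run `learner` over an arbitrary sequence of (x, y) pairs and
    compare its error rate with the best fixed hypothesis in F."""
    n = len(sequence)
    mistakes = 0
    for x, y in sequence:
        y_hat = learner.predict(x)   # commit to h_i(x_i) before seeing y_i
        mistakes += (y_hat != y)
        learner.update(x, y)         # then the true label is revealed
    err_alg = mistakes / n
    err_best = min(sum(f(x) != y for x, y in sequence) / n for f in F)
    return err_alg - err_best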

Transductive online learning [Ben-David, Kushilevitz, Mansour 95]

Transductive equivalent: the unlabeled set {x_1, x_2, …, x_n} is given in advance; the labeled sequence (x_1,y_1),…,(x_n,y_n) ∈ X × {–,+} is still arbitrary.
"Proper" learning: each prediction is made by some h_(i) ∈ F.
Analogous definition: Alg. H maps (x_1,y_1),…,(x_{i-1},y_{i-1}) to h_i ∈ F.
H learns F if, for every (x_1,y_1),…,(x_n,y_n): E[err(H)] ≤ min_{f ∈ F} err(f) + n^(-c), and H runs in time poly(n).
  Goal: err(alg) ≤ min_{f ∈ F} err(f) + ε.   ERM = "best on data".

Our results

Theorem 1. In the online transductive setting, H_HERM requires only one ERM computation per sample. (H_HERM = Hallucination + ERM.)

Theorem 2. The following are equivalent (for proper learning):
- F is agnostically learnable
- ERM agnostically learns F (ERM can be done efficiently and VC(F) is finite)
- F is online transductively learnable
- H_HERM online transductively learns F

Online ERM algorithm

Choose h_i ∈ F with minimal errors on (x_1,y_1),…,(x_{i-1},y_{i-1}):
  h_i = argmin_{f ∈ F} |{ j < i : f(x_j) ≠ y_j }|

Bad example: F = {–,+}^X with X = {(0,0)} (the two constant hypotheses), and the labels alternate:
  x_1 = (0,0), y_1 = –    h_1(x) = +
  x_2 = (0,0), y_2 = +    h_2(x) = –
  x_3 = (0,0), y_3 = –    h_3(x) = +
  x_4 = (0,0), y_4 = +    h_4(x) = –
  …
Plain online ERM is wrong on every round (sucks).
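To see the failure concretely, the following sketch (an illustration I added, with ties broken toward "+" so that h_1 = "+") replays the alternating-label sequence against plain online ERM over the two constant hypotheses and counts a mistake on every round.

# Hypothetical reconstruction of the slide's bad example for plain
# online ERM (follow the leader): F = {always "-", always "+"},
# X = {(0,0)}, and the adversary alternates labels -,+,-,+,...

F = {"+": lambda x: "+", "-": lambda x: "-"}

def erm(history):
    """Pick the hypothesis in F with fewest mistakes on the history,
    breaking ties toward '+' (so h_1 = '+', as on the slide)."""
    def num_mistakes(name):
        return sum(F[name](x) != y for x, y in history)
    return min(["+", "-"], key=num_mistakes)   # '+' wins ties

history, total_mistakes = [], 0
labels = ["-", "+", "-", "+", "-", "+"]
for y in labels:
    x = (0, 0)
    h = erm(history)                    # h_i is chosen from the past only
    total_mistakes += (F[h](x) != y)    # then y_i is revealed
    history.append((x, y))

print(total_mistakes, "mistakes out of", len(labels))  # ERM errs every round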

Online ERM algorithm: stability

Choose h_i ∈ F with minimal errors on (x_1,y_1),…,(x_{i-1},y_{i-1}):
  h_i = argmin_{f ∈ F} |{ j < i : f(x_j) ≠ y_j }|

Online "stability" lemma [Kalai, Vempala 01]:
  err(ERM) ≤ min_{f ∈ F} err(f) + Pr_{i ∈ {1,…,n}}[h_i ≠ h_{i+1}]

Proof: by induction on n = #examples. Easy!
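Here is one way to spell out the argument behind the lemma (my reconstruction, in mistake counts rather than error rates; the slide only says "proof by induction"):

% Reconstruction (not verbatim from the slide) of the be-the-leader
% induction behind the online stability lemma.
% Step 1, by induction on n: the one-step-lookahead leader h_{i+1} is
% at least as good in hindsight as any fixed f in F:
\sum_{i=1}^{n} \mathbf{1}\!\left[h_{i+1}(x_i) \ne y_i\right]
  \;\le\; \min_{f \in \mathcal{F}} \sum_{i=1}^{n} \mathbf{1}\!\left[f(x_i) \ne y_i\right]

% Step 2: h_i and h_{i+1} can disagree on x_i only on rounds where the
% leader switches:
\sum_{i=1}^{n} \mathbf{1}\!\left[h_i(x_i) \ne y_i\right]
  \;\le\; \sum_{i=1}^{n} \mathbf{1}\!\left[h_{i+1}(x_i) \ne y_i\right]
  \;+\; \#\{\, i : h_i \ne h_{i+1} \,\}

% Dividing by n gives the slide's statement:
\operatorname{err}(\mathrm{ERM}) \;\le\; \min_{f \in \mathcal{F}} \operatorname{err}(f)
  \;+\; \Pr_{i \in \{1,\dots,n\}}\!\left[h_i \ne h_{i+1}\right]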

Online HERM algorithm (pictured on the slide: James Hannan)

Inputs: Σ = {x_1, x_2, …, x_n} (the unlabeled set), integer R.
  For each x ∈ Σ, hallucinate r_x^+ copies of (x,+) and r_x^– copies of (x,–), with r_x^+ and r_x^– drawn uniformly at random from {1, 2, …, R}.
  On round i, choose h_i ∈ F that minimizes errors on the hallucinated data plus (x_1,y_1),…,(x_{i-1},y_{i-1}).

Stability: for every i, Pr over r_{x_i}^+, r_{x_i}^– that h_i ≠ h_{i+1} is at most 1/R.

Online "stability" lemma + hallucination cost ⇒ Theorem 1 (with R = n^(1/4)). HERM requires only one ERM computation per example.
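Below is a Python sketch of my reading of the HERM procedure on this slide. The ERM black box, the unlabeled set Sigma, and the label stream are parameters; R defaulting to n^(1/4) follows the slide, while tie-breaking and bookkeeping details are my own assumptions.

import random

def herm_online(erm, Sigma, stream, R=None):
    """Hallucination + ERM (HERM), a sketch of the slide's procedure.

    erm(examples) -> a hypothesis in F minimizing mistakes on `examples`
                     (the ERM black box; called once per round).
    Sigma         -> the unlabeled set {x_1, ..., x_n}, known in advance.
    stream        -> iterable of (x_i, y_i) pairs revealed online.
    """
    n = len(Sigma)
    if R is None:
        R = max(1, round(n ** 0.25))      # R = n^(1/4), as on the slide

    # Hallucinate, for each x, a random number of '+' and '-' copies.
    hallucinated = []
    for x in Sigma:
        hallucinated += [(x, "+")] * random.randint(1, R)
        hallucinated += [(x, "-")] * random.randint(1, R)

    seen, mistakes = [], 0
    for x, y in stream:
        h = erm(hallucinated + seen)      # one ERM call per example
        mistakes += (h(x) != y)           # predict with h_i, then see y_i
        seen.append((x, y))
    return mistakes / max(1, len(seen))

The random copy counts act as Hannan-style perturbations: they make consecutive ERM solutions agree with high probability, which is exactly what the stability lemma needs.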

Being more adaptive (shifting bounds)

Within the full sequence (x_1,y_1),…,(x_n,y_n), compete with the best f ∈ F on every window (x_i,y_i),…,(x_{i+W},y_{i+W}).
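As a concrete reading of "doing well on every window" (my illustration, not the slides' algorithm), the shifting target can be written as a per-window regret check:

def worst_window_regret(predictions, sequence, F, W):
    """For every length-W window of the sequence, compare the algorithm's
    mistakes (from its recorded predictions) with the best fixed f in F
    on that window, and return the largest gap. Illustration only."""
    n = len(sequence)
    worst = 0.0
    for start in range(0, n - W + 1):
        window = sequence[start:start + W]
        preds = predictions[start:start + W]
        alg_mistakes = sum(p != y for p, (x, y) in zip(preds, window))
        best_mistakes = min(sum(f(x) != y for x, y in window) for f in F)
        worst = max(worst, (alg_mistakes - best_mistakes) / W)
    return worst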

Related work

- Inequivalence of batch and online learning in the noiseless setting
  - the ERM black box there is noiseless
  - the separation is for computational reasons!
- Inefficient algorithm for online transductive learning:
  - list all ≤ (n+1)^VC(F) labelings of {x_1,…,x_n} (Sauer's lemma)
  - run weighted majority over them

[Ben-David, Kushilevitz, Mansour 95]   [Blum 90, Balcan 06]   [Littlestone, Warmuth 92]
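For contrast with HERM, here is a hedged sketch of the inefficient baseline on this slide: treat each of the at most (n+1)^VC(F) labelings of {x_1,…,x_n} induced by F as an expert and run randomized weighted majority over them. Enumerating the labelings is the part that makes this inefficient, so `labelings` is simply taken as an input here; the learning rate eta is my own choice.

import random

def weighted_majority_over_labelings(labelings, stream, eta=0.5):
    """Randomized weighted majority [Littlestone, Warmuth 92] with the
    labelings of {x_1,...,x_n} induced by F as experts.

    labelings -> list of dicts mapping each x to its predicted label
                 (at most (n+1)^VC(F) of them, by Sauer's lemma).
    stream    -> iterable of (x, y) pairs revealed online.
    """
    weights = [1.0] * len(labelings)
    mistakes = 0
    for x, y in stream:
        # Predict with an expert sampled proportionally to its weight.
        i = random.choices(range(len(labelings)), weights=weights)[0]
        mistakes += (labelings[i][x] != y)
        # Multiplicatively penalize every expert that errs on (x, y).
        for j, lab in enumerate(labelings):
            if lab[x] != y:
                weights[j] *= (1.0 - eta)
    return mistakes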

Conclusions

- An algorithm for removing the i.i.d. assumption, efficiently, using unlabeled data.
- An interesting way to use unlabeled data online, reminiscent of bootstrap/bagging.
- Adaptive version: can do well on every window.
- Find the "right" algorithm/analysis.