Online Learning Kernels

Presentation transcript:

Online Learning Kernels Seminar on Foundations of Data Science Prof. Haim Kaplan, TAU Matan Hasson

Online Model: binary classification.

Online Model

Online Model. There is no distribution; the examples come from an "adversary". Examples: a game show, where fewer mistakes mean more money; email, labeled "important" vs. "not important".

Online Model. At time $t = 1, 2, 3, \ldots$: get $x_t \in X$ and predict $\ell_t$; then get $c^*(x_t)$; if $c^*(x_t) \neq \ell_t$, set $M = M + 1$. Goal: minimize the total number of mistakes $M$ (a mistake bound, since we may never actually reach $M$ mistakes; the count depends on the order of the examples). Here too there is a hypothesis class.
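To make the protocol concrete, here is a minimal sketch (not part of the slides) of the online loop: the learner commits to a prediction, only then sees the true label, and $M$ counts the mistakes. The `predict` and `update` callbacks are hypothetical stand-ins for an arbitrary online learner.

```python
# Sketch of the online protocol: the learner commits to a prediction for x_t,
# only then sees the true label c*(x_t), and M counts the mistakes.
def run_online(stream, predict, update):
    """stream yields (x_t, c*(x_t)); predict/update are the learner's hooks."""
    mistakes = 0
    for x_t, label_t in stream:
        guess = predict(x_t)          # predict before seeing the label
        if guess != label_t:          # c*(x_t) != l_t
            mistakes += 1             # M = M + 1
        update(x_t, label_t)          # the learner may adapt after feedback
    return mistakes
```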

Online vs Batch: the settings differ in input, data, output, and goal. In the online setting there is no single output hypothesis, since we never know when the learner will err, and the requirement of a fixed distribution is dropped, which suggests online is harder. Which is harder? TBD…

Ex: Disjunction. Recall: $x = a_1 a_2 \ldots a_d \in \{0,1\}^d$, $\mathcal{H}_{dis} = \{\, h_I = \bigvee_{i \in I} a_i \mid I \subseteq \{1,\ldots,d\} \,\}$, $c^* = h_{I^*}$. Online algorithm (start from the disjunction of all variables; on a mistaken negative example, delete every variable that is 1 in it), illustrated for $d = 5$, $c^*(x) = a_1 \vee a_4$:
$h_0 = a_1 \vee a_2 \vee a_3 \vee a_4 \vee a_5$
$x_1 = (0,1,0,1,0)$, positive, no mistake: $h_1 = a_1 \vee a_2 \vee a_3 \vee a_4 \vee a_5$
$x_2 = (0,1,0,0,0)$, negative, mistake, drop $a_2$: $h_2 = a_1 \vee a_3 \vee a_4 \vee a_5$
$x_3 = (0,1,0,0,1)$, negative, mistake, drop $a_5$: $h_3 = a_1 \vee a_3 \vee a_4$
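A short sketch of this learner, assuming the rule illustrated above (start from the disjunction of all variables and, on a mistaken negative example, drop every variable set to 1 in it); each mistake removes at least one variable, so at most $d$ mistakes occur. The class and method names are illustrative.

```python
# Sketch of the online learner for monotone disjunctions over {0,1}^d.
# The hypothesis is the set of variable indices still included in the disjunction.
class DisjunctionLearner:
    def __init__(self, d):
        self.literals = set(range(d))          # h_0 = a_1 v ... v a_d

    def predict(self, x):                      # x is a 0/1 sequence
        return int(any(x[i] for i in self.literals))

    def update(self, x, label):
        # Mistakes can only happen on negative examples; drop every variable
        # that fired there, so it can never cause the same mistake again.
        if label == 0 and self.predict(x) == 1:
            self.literals -= {i for i in range(len(x)) if x[i]}

# The example from the slide: d = 5, c*(x) = a_1 v a_4.
learner = DisjunctionLearner(5)
for x, y in [((0, 1, 0, 1, 0), 1), ((0, 1, 0, 0, 0), 0), ((0, 1, 0, 0, 1), 0)]:
    guess = learner.predict(x)
    learner.update(x, y)
```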

Ex: Disjunction. Theorem 5.7: For any deterministic algorithm $A$ there exist a sequence of examples $\sigma$ and $c^* \in \mathcal{H}_{dis}$ such that $M_A(\sigma, c^*) \ge d$. Proof sketch: present $A$ with the unit vectors $(1,0,0,\ldots,0), (0,1,0,\ldots,0), (0,0,1,\ldots,0), \ldots, (0,0,0,\ldots,1)$ and label each one opposite to $A$'s prediction. Since $A$ is deterministic we did not cheat by constructing $c^*$ on the fly: some disjunction, e.g. $c^*(x) = a_1 \vee a_3$, is consistent with all the labels, and $A$ errs on every one of the $d$ examples.

The Halving Algorithm. Maintain the version space $\mathcal{V}_t = \{\, h \in \mathcal{H} : h \text{ is consistent with the examples seen so far} \,\}$ and predict by majority vote over $\mathcal{V}_t$. Every mistake removes at least half of the version space, so $M \le \log_2(|\mathcal{H}|)$. (Useful when running time is less important.)
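A sketch of the algorithm, assuming $\mathcal{H}$ is small enough to enumerate and that the target is in $\mathcal{H}$: predict by majority vote over the consistent hypotheses, so every mistake at least halves the version space.

```python
from collections import Counter

# Sketch of the Halving Algorithm: V_t = hypotheses consistent so far,
# predict by majority vote; every mistake at least halves |V_t|.
def halving(hypotheses, stream):
    version_space = list(hypotheses)           # V_0 = H (assumed enumerable)
    mistakes = 0
    for x, label in stream:
        votes = Counter(h(x) for h in version_space)
        guess = votes.most_common(1)[0][0]     # majority prediction
        if guess != label:
            mistakes += 1                      # the majority was wrong
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes
```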

The Perceptron Algorithm. Linear separator: $x \in \mathbb{R}^d$, $\ell \in \{+1, -1\}$. A separating hyperplane is given by $w^* \in \mathbb{R}^d$, $b^* \in \mathbb{R}$ with $\ell = \mathrm{sign}(x^T w^* + b^*)$, the hyperplane itself being $x^T w^* + b^* = 0$. The algorithm is efficient and receives the points one at a time.

The Perceptron Algorithm. Assumptions: $b^* = 0$ (not a real restriction: append a constant 1 to $x$ and let the last coordinate of $w$ play the role of $b$), and a margin $\gamma = \frac{1}{\|w^*\|}$, i.e. $x^T w^* \ge 1$ for positive $x$ and $x^T w^* \le -1$ for negative $x$, so no points lie inside the margin: $\frac{|x^T w^*|}{\|w^*\|} \ge \gamma = \frac{1}{\|w^*\|}$. Goal: minimal number of mistakes $M$.

The Perceptron Algorithm:
$w = 0$
For $t = 1, 2, 3, \ldots$: given $x_t$, predict $\ell_t = \mathrm{sign}(x_t^T w)$; if $c^*(x_t) \neq \ell_t$, then if $c^*(x_t) > 0$ set $w \leftarrow w + x_t$, and if $c^*(x_t) < 0$ set $w \leftarrow w - x_t$.
Over time the updates become more and more gentle relative to the accumulated $w$.
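A minimal numpy sketch of exactly this update rule (the examples are assumed to be numpy arrays); since labels are in $\{+1, -1\}$, the two mistake cases collapse to $w \leftarrow w + c^*(x_t)\, x_t$.

```python
import numpy as np

# Sketch of the online Perceptron from the slide: predict sign(x^T w), and on a
# mistake add x_t (positive example) or subtract it (negative example).
def perceptron(stream, d):
    w = np.zeros(d)
    mistakes = 0
    for x_t, label in stream:                  # label = c*(x_t) in {+1, -1}
        guess = 1 if x_t @ w > 0 else -1       # ties broken toward -1
        if guess != label:
            w += label * x_t                   # w <- w + x_t  or  w <- w - x_t
            mistakes += 1
    return w, mistakes
```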

The Perceptron Algorithm. Theorem 5.8: On any sequence $x_1, x_2, \ldots$, if there exists a linear separator $w^*$ of margin $\gamma = \frac{1}{\|w^*\|}$, then $M \le \left(\frac{R}{\gamma}\right)^2 = R^2 \|w^*\|^2$, where $R = \max_t \|x_t\|$. Intuition: the bound is scale-invariant, since scaling the points up by a factor of 100 lets $w^*$ shrink by a factor of 100; the margin gives the algorithm room to maneuver. (Proof on the board.)

Inseparable Data. What if $w^*$ is not quite perfect?

Inseparable Data. What if $w^*$ is not quite perfect? Hinge loss: for a positive example, $\max(0,\, 1 - x_t^T w^*)$; for a negative example, $\max(0,\, 1 + x_t^T w^*)$; and $L_{hinge}(w^*, S) = \sum_{x \in S} L_{hinge}(w^*, x)$. When the margin assumption is broken, the hinge loss measures how far the point would have to be moved to satisfy it; the corresponding distance is $\frac{1}{\|w^*\|} - \frac{x_t^T w^*}{\|w^*\|}$.

Inseparable Data. Theorem 5.9: On any sequence $S = x_1, x_2, \ldots$, $M_{perceptron} \le \min_{w^*} \left[ \left(\frac{R}{\gamma}\right)^2 + 2\, L_{hinge}(w^*, S) \right]$.

Inseparable Data. SVM (Support Vector Machine): given $S = x_1, x_2, \ldots, x_n$, solve the convex optimization
$\min\; c\|w\|^2 + \sum_i s_i$
s.t. $x_i^T w \ge 1 - s_i$ for every positive $x_i$, $x_i^T w \le -1 + s_i$ for every negative $x_i$, and $s_i \ge 0$.
At the optimum, $s_i = \max(0,\, 1 - x_i^T w)$ for positive $x_i$ and $s_i = \max(0,\, 1 + x_i^T w)$ for negative $x_i$, so $s_i$ is exactly $L_{hinge}(w, x_i)$: the SVM minimizes $c\,\frac{1}{\gamma^2} + L_{hinge}(w, S)$.
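For illustration only (a sketch, not the constrained program above): the equivalent unconstrained objective $c\|w\|^2 + \sum_i L_{hinge}(w, x_i)$ can be evaluated directly and minimized by plain subgradient steps. Labels are assumed to be in $\{+1, -1\}$, which merges the positive and negative cases, and the function names are illustrative.

```python
import numpy as np

# Sketch of the unconstrained SVM objective c*||w||^2 + sum_i hinge(w, x_i),
# with hinge(w, x_i) = max(0, 1 - y_i * x_i^T w) for labels y_i in {+1, -1}.
def svm_objective(w, X, y, c):
    slacks = np.maximum(0.0, 1.0 - y * (X @ w))    # s_i = L_hinge(w, x_i)
    return c * np.dot(w, w) + slacks.sum()

def svm_subgradient_step(w, X, y, c, lr=0.01):
    # Subgradient: 2*c*w, minus y_i * x_i for every example violating the margin.
    violated = (y * (X @ w)) < 1.0
    grad = 2 * c * w - (y[violated][:, None] * X[violated]).sum(axis=0)
    return w - lr * grad
```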

Kernel Functions. What if every $w^*$ has high hinge loss? Previously we were willing to absorb some mistakes; here we question the linear-separation tool itself, yet do not want to give it up. Start from the identity map $\phi(a_1, a_2) = (a_1, a_2)$.

Kernel Functions. What if every $w^*$ has high hinge loss? Map the data to a richer space, e.g. $\phi(x) = (x, x^2)$.

Kernel Functions. What if every $w^*$ has high hinge loss? Work in a $\phi$-space: $\phi: \mathbb{R}^d \to \mathbb{R}^N$ with $N \gg d$. In principle any data can be separated: for $S = \{x_1, x_2, \ldots\}$ map $\phi(x_i) = e_i \in \mathbb{R}^{|S|}$ and set $w_i = c^*(x_i)$; then $\phi(x_i)^T w = e_i^T w = w_i = c^*(x_i)$, so every $S$ becomes linearly separable by mapping to unit vectors.

Kernel Functions. The Perceptron Algorithm in $\phi$-space:
$w = 0$
For $t = 1, 2, 3, \ldots$: given $x_t$, predict $\ell_t = \mathrm{sign}(\phi(x_t)^T w)$; if $c^*(x_t) \neq \ell_t$, then if $c^*(x_t) > 0$ set $w \leftarrow w + \phi(x_t)$, and if $c^*(x_t) < 0$ set $w \leftarrow w - \phi(x_t)$.
Computational problem? The Kernel Trick: $w_t = \sum_{i=1}^{t} \alpha_i \phi(x_i)$ with $\alpha_i \in \{-1, 0, 1\}$, so $\phi(x_t)^T w_{t-1} = \sum_{i=1}^{t-1} \alpha_i\, \phi(x_t)^T \phi(x_i) = \sum_{i=1}^{t-1} \alpha_i\, K(x_t, x_i)$.

Kernel Functions. The Perceptron Algorithm, kernelized (start with all $\alpha_i = 0$):
For $t = 1, 2, 3, \ldots$: given $x_t$, predict $\ell_t = \mathrm{sign}\!\left(\sum_{i=1}^{t-1} \alpha_i K(x_t, x_i)\right)$; if $c^*(x_t) \neq \ell_t$, set $\alpha_t = 1$ if $c^*(x_t) > 0$ and $\alpha_t = -1$ if $c^*(x_t) < 0$.
This is computationally efficient, since an inner product is just a number. The same trick works for SVM.
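A sketch of the kernelized update, assuming labels in $\{+1, -1\}$ and a kernel supplied as a Python callable; only the mistaken examples and their $\alpha_i$ are stored, and prediction never forms $\phi(x)$ explicitly.

```python
import numpy as np

# Sketch of the kernelized Perceptron: store (x_i, alpha_i) only for mistakes,
# and predict via sign(sum_i alpha_i * K(x_t, x_i)).
def kernel_perceptron(stream, kernel):
    support, alphas = [], []                   # the x_i with alpha_i != 0
    for x_t, label in stream:                  # label = c*(x_t) in {+1, -1}
        score = sum(a * kernel(x_t, x_i) for a, x_i in zip(alphas, support))
        guess = 1 if score > 0 else -1
        if guess != label:                     # mistake: set alpha_t = +/-1
            support.append(x_t)
            alphas.append(label)
    return support, alphas

def poly_kernel(x, z, c=1.0, k=2):             # (c + x^T z)^k, see the next slide
    return (c + np.dot(x, z)) ** k

def rbf_kernel(x, z, c=0.5):                   # e^{-c ||x - z||^2}, the Gaussian kernel
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-c * np.dot(diff, diff))
```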

Kernel Functions. Polynomial Kernel: $K(x, x') = (c + x^T x')^k$, $c \ge 0$, $k \ge 1$, corresponding to a $\phi$-space of dimension $N \approx d^k$ (for $k = 2$: quadratic separators in the original space, such as ellipses, parabolas, and hyperbolas). For $c = 1$, $d = 2$, $k = 2$:
$K(x, x') = (1 + a_1 a_1' + a_2 a_2')^2 = 1 + 2 a_1 a_1' + 2 a_2 a_2' + a_1^2 a_1'^2 + 2 a_1 a_2 a_1' a_2' + a_2^2 a_2'^2 = \phi(x)^T \phi(x')$
with $\phi(x) = (1, \sqrt{2}\, a_1, \sqrt{2}\, a_2, a_1^2, \sqrt{2}\, a_1 a_2, a_2^2) \in \mathbb{R}^6$ and $\phi(x') = (1, \sqrt{2}\, a_1', \sqrt{2}\, a_2', a_1'^2, \sqrt{2}\, a_1' a_2', a_2'^2) \in \mathbb{R}^6$.
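A quick numerical sanity check (illustration only) that the explicit 6-dimensional map above reproduces the kernel value for $c = 1$, $d = 2$, $k = 2$:

```python
import numpy as np

# Verify that (1 + x^T x')^2 equals phi(x)^T phi(x') for the explicit map above.
def phi(a):
    a1, a2 = a
    return np.array([1, np.sqrt(2) * a1, np.sqrt(2) * a2,
                     a1**2, np.sqrt(2) * a1 * a2, a2**2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = (1 + x @ xp) ** 2           # kernel evaluated directly in R^2
rhs = phi(x) @ phi(xp)            # inner product in the 6-dimensional phi-space
assert np.isclose(lhs, rhs)
```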

Kernel Functions. Gaussian Kernel (RBF, Radial Basis Function): $K(x, x') = e^{-c\|x - x'\|^2}$, $c = \frac{1}{2\sigma^2}$. Similarity decreases exponentially with the squared distance; the feature space is obtained via a Taylor expansion.

Kernel Functions. Gaussian Kernel (RBF): $K(x, x') = e^{-c\|x - x'\|^2}$, $c = \frac{1}{2\sigma^2}$; similarity decreases exponentially with the squared distance. The $\phi$-space has infinite dimension: by the Taylor expansion,
$e^{-c\|x - x'\|^2} = e^{-c\|x\|^2} e^{-c\|x'\|^2} e^{2c\, x^T x'} = e^{-c\|x\|^2} e^{-c\|x'\|^2} \sum_{j=0}^{\infty} \frac{(2c\, x^T x')^j}{j!}$.

Kernel Functions. Theorem 5.10: Suppose $K_1, K_2$ are kernel functions. Then
1. $c K_1$ for $c \ge 0$,
2. $K_1 + K_2$,
3. $K_1 K_2$,
4. $K_3(x, x') = f(x) f(x') K_1(x, x')$ for any $f: \mathbb{R}^d \to \mathbb{R}$
are all legal kernels. (Proof if time permits.)

Online vs Batch. Which is harder? Online < Batch? No! (the disjunction lower bound). Online > Batch, i.e. does a batch problem have an online solution? Yes! An online algorithm with a mistake bound can be converted into a batch learner via Random Stopping or Controlled Testing.

Online to Batch: Random Stopping. Given an online algorithm $A$ with mistake bound $M$: run it on a sample $S$ with $|S| = \frac{M}{\epsilon}$, stop at a uniformly random time $1 \le t \le |S|$, and return $h_t$. (Running until no mistakes occur would not work, since we never know when that happens.) Theorem 5.11: $\mathbb{E}[err_{\mathcal{D}}(h_t)] \le \epsilon$.

Online to Batch. Proof of Theorem 5.11: let $X_t$ be the indicator that $A$ makes a mistake on $x_t$. For every $S$, $\sum_{t=1}^{|S|} X_t \le M$, hence $\mathbb{E}_{S \sim \mathcal{D}^{|S|}}\!\left[\sum_{t=1}^{|S|} X_t\right] \le M$ and $\frac{1}{|S|} \sum_{t=1}^{|S|} \mathbb{E}[X_t] \le \frac{M}{|S|} = \epsilon$. For a uniformly random $t$, $\mathbb{E}[err_{\mathcal{D}}(h_t)] = \mathbb{E}[X_t] \le \epsilon$.
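A sketch of the conversion, under the assumption of a hypothetical learner interface (`process` runs one online step, `current_hypothesis` returns $h_t$) and a `sample_fn` that draws labeled examples from $\mathcal{D}$:

```python
import random

# Sketch of Random Stopping (Theorem 5.11): run the online learner A on fresh
# samples from D and return the hypothesis it holds at a uniform random time t.
def online_to_batch_random_stop(learner, sample_fn, M, eps):
    n = int(M / eps)                       # |S| = M / eps
    stop_at = random.randint(1, n)         # uniform 1 <= t <= |S|
    for _ in range(stop_at - 1):           # process x_1, ..., x_{t-1}
        x, label = sample_fn()
        learner.process(x, label)          # one online step: predict, see label, update
    return learner.current_hypothesis()    # this is h_t
```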

Online to Batch: Controlled Testing. Given an online algorithm $A$ with mistake bound $M$: define $\delta_i = \frac{\delta}{(i+2)^2}$, so that $\sum_{i=0}^{\infty} \delta_i = \left(\frac{\pi^2}{6} - 1\right)\delta \le \delta$. For $t = 1, 2, 3, \ldots$: sample $n_t = \frac{1}{\epsilon} \log\!\left(\frac{1}{\delta_t}\right)$ examples; if $h_t$ makes no mistake on them, return $h_t$; otherwise feed $A$ one of the mistakes $x_t$ and obtain $h_{t+1}$. The procedure stops because the number of mistakes is bounded. Theorem 5.12: the algorithm halts after $O\!\left(\frac{M}{\epsilon} \log\frac{M}{\delta}\right)$ examples, with $\Pr[err_{\mathcal{D}}(h_t) \le \epsilon] \ge 1 - \delta$.
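A sketch of controlled testing with the same assumed interface; the hypothesis returned by `current_hypothesis` is taken to be callable, and `max_rounds` is only a safety guard:

```python
import math

# Sketch of Controlled Testing (Theorem 5.12): test the current hypothesis h_t
# on n_t fresh samples; if it makes no mistake, return it, otherwise feed one
# mistake to the online learner and move on to h_{t+1}.
def online_to_batch_controlled_test(learner, sample_fn, eps, delta, max_rounds=100_000):
    for t in range(1, max_rounds + 1):
        delta_t = delta / (t + 2) ** 2                    # the delta_t sum stays below delta
        n_t = math.ceil(math.log(1.0 / delta_t) / eps)    # n_t = (1/eps) log(1/delta_t)
        h_t = learner.current_hypothesis()                # assumed callable: h_t(x) -> label
        counterexample = None
        for _ in range(n_t):
            x, label = sample_fn()
            if h_t(x) != label:
                counterexample = (x, label)
                break
        if counterexample is None:
            return h_t                                    # passed the test: err <= eps w.h.p.
        learner.process(*counterexample)                  # feed the mistake, get h_{t+1}
    raise RuntimeError("mistake bound exceeded max_rounds")  # should not happen
```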