The Power of Selective Memory. Shai Shalev-Shwartz, joint work with Ofer Dekel and Yoram Singer. Hebrew University, Jerusalem.

Presentation transcript:

Power of Selective Memory. Slide 1 The Power of Selective Memory Shai Shalev-Shwartz Joint work with Ofer Dekel, Yoram Singer Hebrew University, Jerusalem

Power of Selective Memory. Slide 2 Outline Online learning, loss bounds, etc.; the hypothesis space: prediction suffix trees (PSTs); margin of prediction and hinge loss; an online learning algorithm; trading margin for depth of the PST; automatic calibration; a self-bounded online algorithm for learning PSTs.

Power of Selective Memory. Slide 3 Online Learning For t = 1, 2, ...: get an instance x_t; predict a target based on x_t; get the true target y_t and suffer a loss; update the prediction mechanism.
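The slide above describes the standard online learning loop. A minimal sketch follows; the `predictor` interface, the `stream` of (instance, target) pairs, and the `loss` callable are illustrative assumptions, not part of the original slides.

```python
# A minimal sketch of the online protocol; names are illustrative assumptions.
def online_loop(predictor, stream, loss):
    """Standard online protocol: predict, observe the true target, suffer loss, update."""
    cumulative_loss = 0.0
    for x_t, y_t in stream:
        y_hat = predictor.predict(x_t)        # predict a target based on x_t
        cumulative_loss += loss(y_hat, y_t)   # get the true target and suffer loss
        predictor.update(x_t, y_t)            # update the prediction mechanism
    return cumulative_loss
```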

Power of Selective Memory. Slide 4 Analysis of Online Algorithm Relative loss bounds (external regret): for any fixed hypothesis h, the cumulative loss of the online algorithm is bounded by the cumulative loss of h plus a term that depends on the complexity of h.

Power of Selective Memory. Slide 5 Prediction Suffix Tree (PST) Each hypothesis is parameterized by a triplet whose central component is a context function g, which assigns a weight to each context (a suffix of the observed symbol sequence).

Power of Selective Memory. Slide 6 PST Example

Power of Selective Memory. Slide 7 Margin of Prediction The margin of a prediction is the signed confidence y_t h(x_t); the hinge loss penalizes predictions whose margin falls below the required threshold (see the definitions sketched below).
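For reference, a sketch of the standard definitions, assuming labels in {+1, -1} and a unit margin threshold (the exact threshold used in the slides is not visible in the transcript):

```latex
\gamma_t = y_t \, h_t(x_t), \qquad
\ell_t   = \max\bigl\{0,\; 1 - y_t \, h_t(x_t)\bigr\}.
```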

Power of Selective Memory. Slide 8 Complexity of hypothesis Define the complexity of a hypothesis as the squared norm of its context function g. We can also extend g to assign a (zero) weight to every context outside the tree, and get an equivalent hypothesis over the unbounded-depth suffix tree with the same complexity.

Power of Selective Memory. Slide 9 Algorithm I: Learning Unbounded-Depth PST Init: start with an empty tree and a zero context function. For t = 1, 2, ...: get x_t and predict; get y_t and suffer loss; set the update step size; update the weight vector; update the tree by adding the suffixes of the current sequence. (A code sketch follows below.)
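Since the exact step size, per-depth scaling, and weight representation are not visible in the transcript, the following is only a sketch of an Algorithm I-style learner: the PST is stored as a dictionary from suffix strings to weights, the prediction is a weighted sum over all suffixes of the history, and a loss-scaled additive update grows the suffix path on loss rounds. The 2**(-k/2) depth scaling is an assumption.

```python
from collections import defaultdict

class OnlinePST:
    """Sketch of Algorithm I: an unbounded-depth PST learner.
    The context function g maps suffix strings to weights; the per-depth
    scaling 2**(-k/2) and the loss-scaled additive update are illustrative
    assumptions, not the exact rule from the slides."""

    def __init__(self):
        self.g = defaultdict(float)  # context function: suffix string -> weight

    def score(self, history):
        # Weighted sum of the context values of all suffixes of the history.
        return sum(self.g[history[-k:]] * 2.0 ** (-k / 2.0)
                   for k in range(1, len(history) + 1))

    def step(self, history, y):
        """Process one round; history is the symbol sequence seen so far, y is +1/-1."""
        y_hat = self.score(history)
        loss = max(0.0, 1.0 - y * y_hat)          # hinge loss of the prediction
        if loss > 0.0:                            # update only when loss is suffered
            for k in range(1, len(history) + 1):  # grow/update the suffix path
                self.g[history[-k:]] += loss * y * 2.0 ** (-k / 2.0)
        return 1 if y * y_hat <= 0 else 0         # 1 if a prediction mistake was made
```

For example, `OnlinePST().step("+-+", -1)` processes one round in which the observed history is "+-+" and the true label is -1.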

Power of Selective Memory. Slides 10-19 Example (animation) The algorithm of the previous slide is run step by step on a short sequence of +/- labels; each frame shows the current prediction, the revealed label, and the updated tree, which grows deeper as more of the sequence is observed (the tree figures are omitted in this transcript).

Power of Selective Memory. Slide 20 Analysis Let (x_1, y_1), ..., (x_T, y_T) be a sequence of examples, and assume the instances are appropriately bounded. Let h be an arbitrary fixed hypothesis with context function g, and let L(h) be the cumulative loss of h on the sequence of examples. Then the number of prediction mistakes of Algorithm I is bounded in terms of L(h) and the complexity of h.

Power of Selective Memory. Slide 21 Proof Sketch Define a per-round progress measure (the decrease in the distance between the current context function and the competing hypothesis). Derive an upper bound on the total progress and a lower bound on the progress made on each update round; combining the upper and lower bounds gives the bound in the theorem.

Power of Selective Memory. Slide 22 Proof Sketch (Cont.) Where does the lower bound come from? For simplicity, assume that an update is performed on round t. Define a Hilbert space of context functions. The context function g_{t+1} is the projection of g_t onto the half-space of context functions that attain the required margin on round t, where f is the function induced by the suffixes of the current sequence.
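As a hedged reconstruction of this projection step, written in the standard passive-aggressive form and assuming a margin threshold of 1 (f_t denotes the suffix-induced function and l_t the hinge loss of g_t on round t):

```latex
g_{t+1}
  = \operatorname*{argmin}_{g \,:\, y_t \langle g, f_t \rangle \ge 1} \|g - g_t\|^2
  = g_t + \frac{\ell_t}{\|f_t\|^2}\, y_t \, f_t .
```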

Power of Selective Memory. Slide 23 Example revisited The following hypothesis has a cumulative loss of 2 and a complexity of 2. Therefore, the number of mistakes is bounded above by 12. (Tree figure omitted.)

Power of Selective Memory. Slide 24 Example revisited The following hypothesis has a cumulative loss of 1 and a complexity of 4. Therefore, the number of mistakes is bounded above by 18. But this tree is very shallow. Problem: the tree we actually learned is much deeper!

Power of Selective Memory. Slide 25 Geometric Intuition

Power of Selective Memory. Slide 26 Geometric Intuition (Cont.) Let's force g_{t+1} to be sparse by "canceling" the new coordinate.

Power of Selective Memory. Slide 27 Geometric Intuition (Cont.) Now we can show that the required progress guarantee still holds, up to the margin sacrificed by this sparsification (see the next slide).

Power of Selective Memory. Slide 28 Trading margin for sparsity We obtained that if the margin we give up is much smaller than the margin itself, we can still get a loss bound. Problem: what happens if the margin is very small, so that even this sacrifice is too large? Solution: tolerate small margin errors! Conclusion: if we tolerate small margin errors, we can get a sparser tree.

Power of Selective Memory. Slide 29 Automatic Calibration Problem: the value of the margin parameter is unknown in advance. Solution: use the data itself to estimate it! More specifically, the algorithm maintains a data-dependent threshold; if we keep the sacrificed margin below this threshold, we still get a mistake bound.

Power of Selective Memory. Slide 30 Algorithm II: Learning Self Bounded-Depth PST Init: start with an empty tree and a zero context function. For t = 1, 2, ...: get x_t and predict; get y_t and suffer loss. If the margin is large enough, do nothing! Otherwise: set a depth bound d_t, and update w and the tree as in Algorithm I, but only up to depth d_t. (A code sketch follows below.)
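A sketch of the self-bounded modification, built on the Algorithm I sketch above. The specific rule used here for the depth bound d_t (growing with the square root of the number of loss rounds) is an illustrative assumption, since the exact formula is not legible in the transcript.

```python
import math

class SelfBoundedPST(OnlinePST):
    """Sketch of Algorithm II: on loss rounds, update only contexts up to
    a data-dependent depth d_t; the depth schedule below is an assumption."""

    def __init__(self):
        super().__init__()
        self.loss_rounds = 0

    def step(self, history, y):
        y_hat = self.score(history)
        loss = max(0.0, 1.0 - y * y_hat)
        if loss == 0.0:
            return 0                      # margin is large enough: do nothing
        self.loss_rounds += 1
        # Illustrative depth schedule: grows slowly with the number of loss rounds.
        d_t = int(math.ceil(math.sqrt(self.loss_rounds))) + 1
        for k in range(1, min(d_t, len(history)) + 1):
            self.g[history[-k:]] += loss * y * 2.0 ** (-k / 2.0)
        return 1 if y * y_hat <= 0 else 0
```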

Power of Selective Memory. Slide 31 Analysis – Loss Bound Let (x_1, y_1), ..., (x_T, y_T) be a sequence of examples, and assume the instances are appropriately bounded. Let h be an arbitrary fixed hypothesis with context function g, and let L(h) be the cumulative loss of h on the sequence of examples. Then the number of prediction mistakes of Algorithm II is bounded in terms of L(h) and the complexity of h.

Power of Selective Memory. Slide 32 Analysis – Bounded depth Under the previous conditions, the depth of all the trees learned by the algorithm is bounded above by a data-dependent quantity, rather than growing with the length of the sequence.

Power of Selective Memory. Slide 33 Example revisited Performance of Algorithm II on the example sequence: only 3 mistakes; the last PST is of depth 5; the margin is 0.61 (after normalization); the margin of the max-margin tree (of infinite depth) is quoted for comparison, but its value is missing from the transcript.

Power of Selective Memory. Slide 34 Conclusions Discriminative online learning of PSTs; loss bound; trading margin for sparsity; automatic calibration. Future work: experiments; feature selection and extraction; support-vector selection.