Online Learning. Your guide: Avrim Blum, Carnegie Mellon University. [Machine Learning Summer School 2012]

Recap: "No-regret" algorithms for repeated decisions. The algorithm has N options; the world chooses a cost vector. Can view this as a matrix (maybe with an infinite number of columns). At each time step, the algorithm picks a row and life picks a column. The algorithm pays the cost (or gets the benefit) of the action chosen, and gets the whole column as feedback (or just its own cost/benefit, in the "bandit" model). Goal: do nearly as well as the best fixed row in hindsight.

Randomized Weighted Majority (RWM): after each round, multiply expert i's weight by (1 − ε c_i^t), scaling so costs lie in [0,1]. Guarantee: E[cost] ≤ OPT + 2(OPT · log n)^{1/2}. Since OPT ≤ T, this is at most OPT + 2(T log n)^{1/2}. So regret per time step ≤ 2(T log n)^{1/2}/T → 0.
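For concreteness, here is a minimal RWM sketch (the function name and interface are mine, not from the slides); with ε tuned to roughly (log n / OPT)^{1/2} it achieves the guarantee above:

```python
import numpy as np

def rwm(costs, eps):
    """Randomized Weighted Majority on a T x n matrix of costs in [0,1].

    Returns the algorithm's expected cost and the cost of the best
    fixed expert in hindsight. A sketch, not a tuned implementation.
    """
    T, n = costs.shape
    w = np.ones(n)                      # one weight per expert/row
    expected_cost = 0.0
    for t in range(T):
        p = w / w.sum()                 # play expert i with probability p[i]
        expected_cost += p @ costs[t]   # expected cost this round
        w *= (1 - eps) ** costs[t]      # multiplicative update: w_i *= (1-eps)^{c_i^t}
    opt = costs.sum(axis=0).min()       # best fixed row in hindsight
    return expected_cost, opt
```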

[ACFS02]: applying RWM to bandits. What if you only get your own cost/benefit as feedback? Use RWM as a subroutine to get an algorithm with cumulative regret O((TN log N)^{1/2}) [average regret O(((N log N)/T)^{1/2})]. We will do a somewhat weaker version of their analysis (same algorithm, but not as tight a bound). For fun, we'll talk about it in the context of online pricing…

Online pricing. Say you are selling lemonade (or a cool new software tool, or bottles of water at the World Cup). For t = 1, 2, …, T: the seller sets a price p_t; a buyer arrives with valuation v_t; if v_t ≥ p_t, the buyer purchases and pays p_t, else doesn't; repeat. Assume all valuations ≤ h. Goal: do nearly as well as the best fixed price in hindsight. View each possible price as a different row/expert.
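One standard way to make "each price is an expert" concrete, sketched below under the assumption that valuations are at least 1 (the slide doesn't fix a discretization), is to use candidate prices in powers of (1+ε) up to h; the best candidate then earns at least a 1/(1+ε) fraction of what the best fixed price earns.

```python
def price_experts(h, eps):
    """Candidate prices 1, (1+eps), (1+eps)^2, ..., up to h.

    Any fixed price p <= h has a candidate within a (1+eps) factor
    below it. Roughly log(h)/eps experts in total.
    """
    prices, p = [], 1.0
    while p <= h:
        prices.append(p)
        p *= 1 + eps
    return prices
```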

Multi-armed bandit problem: Exponential Weights for Exploration and Exploitation (Exp3) [Auer, Cesa-Bianchi, Freund, Schapire]. n = #experts. Exp3 runs RWM as a subroutine: RWM outputs a distribution p^t; play arm i drawn from q^t = (1 − γ)p^t + γ·(uniform); observe gain g_i^t; feed RWM the estimated gain vector ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0). Note each entry of ĝ^t is at most nh/γ.
1. RWM believes its gain is p^t · ĝ^t = p_i^t (g_i^t/q_i^t) ≡ g_RWM^t.
2. So Σ_t g_RWM^t ≥ ÔPT/(1+ε) − O(ε^{-1} (nh/γ) log n), where ÔPT = max_j Σ_t ĝ_j^t.
3. Our actual expected gain is g_i^t = g_RWM^t (q_i^t/p_i^t) ≥ g_RWM^t (1 − γ).
4. E[ÔPT] ≥ OPT, because E[ĝ_j^t] = (1 − q_j^t)·0 + q_j^t (g_j^t/q_j^t) = g_j^t, so E[max_j Σ_t ĝ_j^t] ≥ max_j E[Σ_t ĝ_j^t] = OPT.

Multi-armed bandit problem, conclusion (setting γ = ε): E[Exp3] ≥ OPT/(1+ε)² − O(ε^{-2} nh log n) [Auer, Cesa-Bianchi, Freund, Schapire]. Balancing the two terms gives O((OPT·nh·log n)^{2/3}) in the bound because of the ε^{-2}; with more care in the analysis, the dependence can be reduced to ε^{-1}, giving O((OPT·nh·log n)^{1/2}).
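A minimal Exp3 sketch matching the scheme above (gain_fn and the scaling constant are illustrative assumptions; the slides don't fix an interface):

```python
import numpy as np

def exp3(gain_fn, n, T, eps, gamma, h=1.0):
    """Exp3 sketch: RWM run on estimated gains, with gamma-uniform exploration.

    gain_fn(t, i) -> gain of arm i at time t, in [0, h]; we query it only
    for the arm actually pulled (bandit feedback).
    """
    w = np.ones(n)
    total_gain = 0.0
    for t in range(T):
        p = w / w.sum()                      # RWM's distribution p^t
        q = (1 - gamma) * p + gamma / n      # q^t = (1-gamma) p^t + gamma * uniform
        i = np.random.choice(n, p=q)         # pull one arm
        g = gain_fn(t, i)                    # the only feedback we get
        total_gain += g
        ghat = np.zeros(n)
        ghat[i] = g / q[i]                   # unbiased estimate: E[ghat_j] = g_j
        # RWM update on gains; entries of ghat are at most n*h/gamma,
        # so scale them into [0,1] before exponentiating.
        w *= (1 + eps) ** (ghat * gamma / (n * h))
    return total_gain
```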

A natural generalization (going back to the full-information setting, thinking about paths…): what if we also want that on rainy days, we do nearly as well as the best route for rainy days? And on Mondays, do nearly as well as the best route for Mondays? More generally, we have N "rules" ("on Mondays, use path P"). Goal: simultaneously, for each rule i, guarantee to do nearly as well as it on the time steps in which it fires. For all i, want E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε^{-1} log N), where cost_i(X) = cost of X on the time steps where rule i fires. Can we get this?

A natural generalization. This generalization is especially natural in machine learning for combining multiple if-then rules. E.g., document classification. Rule: "if <word> appears, then predict <label>"; e.g., if the document contains "football", then classify it as sports. So, if 90% of documents containing "football" are about sports, we should have error ≤ 11% on them. This is the "specialists" or "sleeping experts" problem. Assume we have N rules, explicitly given. For all i, want E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε^{-1} log N), where cost_i(X) = cost of X on the time steps where rule i fires.

A simple algorithm and analysis (all on one slide). Start with all rules at weight 1. At each time step, among the rules i that fire, select one with probability p_i ∝ w_i. Update weights: if a rule didn't fire, leave its weight alone; if it did fire, raise or lower it depending on its performance compared to the weighted average:
r_i = [Σ_j p_j cost(j)]/(1+ε) − cost(i)
w_i ← w_i (1+ε)^{r_i}
So, if rule i does exactly as well as the weighted average, its weight drops a little; its weight increases if it does better than the weighted average by more than a (1+ε) factor. This ensures the sum of weights doesn't increase. Final w_i = (1+ε)^{E[cost_i(alg)]/(1+ε) − cost_i(i)}, and since no weight can exceed the total (which never rises above N), the exponent is ≤ ε^{-1} log N. So E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε^{-1} log N).
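Here is the one-slide algorithm as a minimal sketch; awake_fn and cost_fn are assumed interfaces, not from the slides:

```python
import numpy as np

def sleeping_experts(T, n, awake_fn, cost_fn, eps):
    """Sleeping-experts sketch: awake_fn(t) gives the indices of rules
    firing at time t; cost_fn(t, i) gives rule i's cost in [0,1] at time t."""
    w = np.ones(n)
    for t in range(T):
        awake = np.array(awake_fn(t))
        if awake.size == 0:
            continue                           # no rule fires this round
        p = w[awake] / w[awake].sum()          # pick among firing rules, prob ∝ weight
        chosen = np.random.choice(awake, p=p)  # the rule we actually follow
        costs = np.array([cost_fn(t, i) for i in awake])
        r = (p @ costs) / (1 + eps) - costs    # r_i = [Σ_j p_j cost(j)]/(1+eps) − cost(i)
        w[awake] *= (1 + eps) ** r             # sleeping rules keep their weight
    return w
```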

Lots of uses. Can combine multiple if-then rules. Can combine multiple learning algorithms: back to driving, say we are given N "conditions" to pay attention to (is it raining? is it a Monday? …). Create N rules: "if the day satisfies condition i, then use the output of Alg_i", where Alg_i is an instantiation of an experts algorithm run on just the days satisfying that condition. Simultaneously, for each condition i, we do nearly as well as Alg_i, which itself does nearly as well as the best path for condition i.

Adapting to change. What if we want to adapt to change, i.e., do nearly as well as the best recent expert? For each expert, instantiate a copy that wakes up on day t, for each 0 ≤ t ≤ T−1. Then our cost over the previous t days is at most (1+ε)·(cost of the best expert over those t days) + O(ε^{-1} log(NT)). (Not the best possible bound, since it has an extra log(T), but not bad.)

Summary. Algorithms for online decision-making with strong guarantees on performance compared to the best fixed choice. Application: play a repeated game against an adversary, performing nearly as well as the best fixed strategy in hindsight. Can apply even with very limited feedback. Application: which way to drive to work, with feedback only about your own paths; online pricing, even with only buy/no-buy feedback.

More general forms of regret:
1. "Best expert" or "external" regret: given n strategies, compete with the best of them in hindsight.
2. "Sleeping expert" or "regret with time-intervals": given n strategies and k properties, let S_i be the set of days satisfying property i (the sets might overlap); want to simultaneously achieve low regret over each S_i.
3. "Internal" or "swap" regret: like (2), except that S_i = the set of days in which we chose strategy i.

Internal/swap regret. E.g., each day we pick one stock to buy shares in. We don't want regret of the form "every time I bought IBM, I should have bought Microsoft instead." Formally, regret is measured with respect to the optimal function f: {1,…,n} → {1,…,n} such that every time you played action j, it plays f(j); see the formula below.
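In symbols (a standard way to write the definition; the slide states it in words): if we played actions a_1, …, a_T against cost vectors c^1, …, c^T, then

```latex
\text{swap-regret} \;=\; \sum_{t=1}^{T} c^t(a_t)
\;-\; \min_{f:\{1,\dots,n\}\to\{1,\dots,n\}} \;\sum_{t=1}^{T} c^t\big(f(a_t)\big).
```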

Weird… why care? "Correlated equilibrium": a distribution over entries in the matrix such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate. E.g., the Shapley game (a rock-paper-scissors variant in which ties are penalized; row's payoff listed first):

      R        P        S
R   -1,-1    -1,1     1,-1
P    1,-1    -1,-1    -1,1
S   -1,1      1,-1   -1,-1

In general-sum games, if all players have low swap-regret, then the empirical distribution of play is an approximate correlated equilibrium.
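Formally (a standard definition, not spelled out on the slide): a distribution P over joint action profiles a = (a_i, a_{-i}) is an ε-correlated equilibrium if for every player i and every swap function f_i,

```latex
\mathbb{E}_{a \sim P}\!\left[\,u_i\big(f_i(a_i),\,a_{-i}\big)\,\right]
\;\le\; \mathbb{E}_{a \sim P}\!\left[\,u_i(a)\,\right] \;+\; \varepsilon .
```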

Internal/swap regret, contd. Algorithms for achieving low regret of this form: Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine. We will present the method of [BM05], showing how to convert any "best expert" algorithm into one achieving low swap regret.

Can convert any "best expert" algorithm A into one achieving low swap regret. Idea: instantiate one copy A_j responsible for our expected regret over the times we play j. Each copy A_j outputs a distribution q_j; stack these as the rows of a matrix Q and play p = pQ. This lets us view p_j either as the probability we play action j, or as the probability we play algorithm A_j. When the cost vector c arrives, give A_j the scaled feedback p_j c. A_j then guarantees Σ_t (p_j^t c^t) · q_j^t ≤ min_i Σ_t p_j^t c_i^t + [regret term], which we can rewrite as Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].

Summing over j: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term]. The left-hand side is our total cost, since we play p = pQ. On the right-hand side, for each j we can move our probability to its own best i = f(j), so our total cost is within n·[regret term] of the best swap function f.
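The one non-obvious computational step is finding p with p = pQ, where row j of Q is A_j's current distribution; since Q is row-stochastic, p is its stationary distribution. A minimal power-iteration sketch (function name mine; assumes the iteration converges, which holds e.g. when Q has all entries positive):

```python
import numpy as np

def master_distribution(Q, iters=1000, tol=1e-12):
    """Solve p = pQ for a row-stochastic matrix Q by power iteration.

    Q[j] is the distribution q_j output by copy A_j; the fixed point p
    lets us read p_j as either the probability of playing action j or
    the probability of following algorithm A_j.
    """
    n = Q.shape[0]
    p = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(iters):
        p_next = p @ Q
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next
```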

Itinerary
Stop 1: Minimizing regret and combining advice.
– Randomized Weighted Majority / Multiplicative Weights algorithm
– Connections to game theory
Stop 2: Extensions.
– Online learning from limited feedback (bandit algorithms)
– Algorithms for large action spaces, sleeping experts
Stop 3: Powerful online LTF algorithms.
– Winnow, Perceptron
Stop 4: Powerful tools for using these algorithms.
– Kernels and similarity functions
Stop 5: Something completely different.
– Distributed machine learning

Transition… So far, we have been examining problems of selecting among choices/algorithms/experts given to us from outside. We now turn to the design of online algorithms for learning over data described by features.

A typical ML setting. Say you want a computer program to help you decide which messages are urgent and which can be dealt with later. We might represent each message by n features (e.g., return address, keywords, header info, etc.). On each message received, you make a classification and then later find out if you messed up. Goal: if there exists a "simple" rule that works (is perfect? has low error?), then our algorithm does well.

Simple example: disjunctions. Suppose the features are boolean: X = {0,1}^n, and the target is an OR function, like x_3 ∨ x_9 ∨ x_12. Can we find an online strategy that makes at most n mistakes? (Assume a perfect target.) Sure:
– Start with h(x) = x_1 ∨ x_2 ∨ … ∨ x_n.
– Invariant: {vars in h} ⊇ {vars in f}.
– Mistake on a negative: throw out the vars in h that are set to 1 in x. This maintains the invariant and decreases |h| by at least 1.
– No mistakes on positives (by the invariant). So at most n mistakes total.
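A minimal sketch of this elimination algorithm (the interface is mine; examples is any stream of labeled 0/1 vectors consistent with some OR):

```python
def learn_disjunction(examples, n):
    """Elimination learner for a perfect OR target (the slide's algorithm).

    examples: iterable of (x, y) with x an n-tuple of 0/1 and y a bool.
    Returns the final hypothesis (a set of variable indices) and the
    number of mistakes, which is at most n.
    """
    h = set(range(n))              # start with h(x) = x_1 v x_2 v ... v x_n
    mistakes = 0
    for x, y in examples:
        pred = any(x[i] for i in h)
        if pred != y:
            mistakes += 1                     # can only be a mistake on a negative,
            h = {i for i in h if not x[i]}    # so drop every var set to 1 in x
    return h, mistakes
```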

Simple example: disjunctions. Suppose the features are boolean: X = {0,1}^n, and the target is an OR function, like x_3 ∨ x_9 ∨ x_12. Can we find an online strategy that makes at most n mistakes? (Assume a perfect target.) Compare to the "experts" setting:
– Could define 2^n experts, one for each OR function.
– #mistakes ≤ log(#experts).
– The direct strategy above is much more efficient…
– …but requires some expert to be perfect.

Simple example: disjunctions. The algorithm above makes at most n mistakes, and no deterministic algorithm can do better: an adversary can present the examples e_1 = (1,0,…,0), e_2 = (0,1,0,…,0), …, e_n in turn, answering the opposite of each prediction; any such answer sequence is consistent with some disjunction (the OR of the variables whose examples were labeled positive), so n mistakes are forced.

Simple example: disjunctions. But what if we believe only r out of the n variables are relevant? I.e., in principle, we should be able to get only O(log n^r) = O(r log n) mistakes. Can we do it efficiently?

Winnow algorithm. Winnow, for learning a disjunction of r out of n variables (e.g., f(x) = x_3 ∨ x_9 ∨ x_12):
h(x): predict positive iff w_1 x_1 + … + w_n x_n ≥ n. Initialize w_i = 1 for all i.
– Mistake on a positive: w_i ← 2w_i for all i with x_i = 1.
– Mistake on a negative: w_i ← 0 for all i with x_i = 1.

Winnow algorithm, contd. Thm: Winnow makes at most O(r log n) mistakes. Proof: Each mistake on a positive doubles at least one relevant weight (and note that relevant weights are never set to 0), and each relevant weight can double at most 1 + log n times before exceeding n; so there are at most r(1 + log n) mistakes on positives. Each mistake on a positive adds < n to the total weight; each mistake on a negative removes at least n from it. So #(mistakes on negatives) ≤ 1 + #(mistakes on positives). That's it!
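A minimal sketch of the slide's Winnow (interface mine):

```python
import numpy as np

def winnow(examples, n):
    """Winnow for learning an OR of r of n boolean variables, as on the
    slide: double on-bit weights on a mistaken positive, zero them on a
    mistaken negative.

    examples: iterable of (x, y) with x a 0/1 numpy array of length n.
    Makes O(r log n) mistakes on data consistent with some disjunction.
    """
    w = np.ones(n)
    mistakes = 0
    for x, y in examples:
        pred = (w @ x) >= n          # predict positive iff w.x >= n
        if pred != y:
            mistakes += 1
            if y:                    # mistake on positive: double weights of on-bits
                w[x == 1] *= 2
            else:                    # mistake on negative: zero weights of on-bits
                w[x == 1] = 0
    return w, mistakes
```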

A generalization. Winnow for learning a linear separator with non-negative integer weights, e.g., 2x_3 + 4x_9 + x_10 + x_12 ≥ 5.
h(x): predict positive iff w_1 x_1 + … + w_n x_n ≥ n. Initialize w_i = 1 for all i.
– Mistake on a positive: w_i ← w_i(1+ε) for all i with x_i = 1.
– Mistake on a negative: w_i ← w_i/(1+ε) for all i with x_i = 1.
– Use ε = O(1/W), where W = sum of the weights in the target.
Thm: Winnow makes at most O(W² log n) mistakes.

Winnow for general LTFs. More generally, one can show the following. Suppose ∃ w* such that w*·x ≥ c on positive x and w*·x ≤ c − γ on negative x. Then the mistake bound is O((L_1(w*)/γ)² log n). Multiply by L_1(X) if the features are not {0,1}.

Perceptron algorithm. An even older and simpler algorithm, with a bound of a different form. Suppose ∃ w* such that w*·x ≥ γ on positive x and w*·x ≤ −γ on negative x. Then the mistake bound is O((L_2(w*)·L_2(x)/γ)²), i.e., it depends on the L_2 margin of the examples.

Perceptron algorithm. Thm: Suppose the data is consistent with some LTF w*·x > 0, where we scale so that L_2(w*) = 1 and L_2(x) ≤ 1, and let γ = min_x |w*·x|. Then #mistakes ≤ 1/γ².
Algorithm: Initialize w = 0 and predict positive iff w·x > 0.
– Mistake on a positive: w ← w + x.
– Mistake on a negative: w ← w − x.

Perceptron algorithm. Example (run the algorithm on these points): −(0,1), −(1,1), +(1,0).
Algorithm: Initialize w = 0 and predict positive iff w·x > 0. Mistake on a positive: w ← w + x; mistake on a negative: w ← w − x.
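A minimal Perceptron sketch, run on the example above; the labels for the three points are my reading of the transcript (the slide's figure isn't preserved):

```python
import numpy as np

def perceptron(examples, dim, passes=10):
    """Perceptron as on the slide: w = 0; predict positive iff w.x > 0;
    w += x on a mistaken positive, w -= x on a mistaken negative."""
    w = np.zeros(dim)
    for _ in range(passes):          # repeat until a clean pass (or give up)
        clean = True
        for x, y in examples:
            pred = (w @ x) > 0
            if pred != y:
                clean = False
                w = w + x if y else w - x
        if clean:
            break
    return w

# The slide's example, with labels assumed from the transcript:
# (0,1) and (1,1) negative, (1,0) positive.
data = [(np.array([0., 1.]), False),
        (np.array([1., 1.]), False),
        (np.array([1., 0.]), True)]
w = perceptron(data, dim=2)          # converges to a separator such as (1, -1)
```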

Analysis. Thm: Suppose the data is consistent with some LTF w*·x > 0, where ‖w*‖ = 1 and γ = min_x |w*·x| (after scaling so all ‖x‖ ≤ 1). Then #mistakes ≤ 1/γ².
Proof: Consider |w·w*| and ‖w‖.
– Each mistake increases |w·w*| by at least γ: (w + x)·w* = w·w* + x·w* ≥ w·w* + γ.
– Each mistake increases w·w by at most 1: (w + x)·(w + x) = w·w + 2(w·x) + x·x ≤ w·w + 1, since w·x ≤ 0 on a mistake.
So after M mistakes, γM ≤ |w·w*| ≤ ‖w‖ ≤ M^{1/2}, giving M ≤ 1/γ².

What if there is no perfect separator? In this case, a mistake could cause |w·w*| to drop; the impact of such a point is the magnitude of x·w*, measured in units of γ. So: Mistakes(Perceptron) ≤ 1/γ² + O(total amount, in units of γ, by which you would have to move the points so that all are correct by margin γ). The proof is the same as above, only now accounting for the rounds in which |w·w*| decreases.

What if there is no perfect separator, contd. Note that γ was not part of the algorithm, so the mistake bound of the Perceptron is at most the minimum of the above expression over γ. Equivalently, mistake bound ≤ min_{w*} [ ‖w*‖² + O(hinge loss(w*)) ].