Online Learning in Complex Environments

Presentation transcript:

Online Learning in Complex Environments Aditya Gopalan (ece @ iisc) MLSIG, 30 March 2015 (joint work with Shie Mannor, Yishay Mansour)

Machine Learning: algorithms/systems for learning to “do stuff” … with data/observations of some sort. – R. E. Schapire. Learning by experience = Reinforcement Learning (RL): (Action 1, Reward 1), (Action 2, Reward 2), … Act based on past data; future data/rewards depend on the present action; maximize some notion of utility/reward. Data is gathered interactively.

(Simple) Multi-Armed Bandit: N “arms” or actions (items in a recommender system, transmission frequencies, trades, …); each arm is an unknown probability distribution with its own parameter and mean (think Bernoulli).
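
A minimal sketch of this setting in Python; the `BernoulliBandit` class and its example arm means are illustrative assumptions, not from the talk.

```python
import random

class BernoulliBandit:
    """N-armed bandit: each arm pays 1 with its own (unknown) probability."""
    def __init__(self, true_means):
        self.true_means = true_means      # hidden from the learner
        self.n_arms = len(true_means)

    def pull(self, arm):
        # i.i.d. 0/1 reward from the chosen arm
        return 1 if random.random() < self.true_means[arm] else 0

# Hypothetical example instance with 3 arms
env = BernoulliBandit([0.3, 0.5, 0.7])
```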

(Simple) Multi-Armed Bandit: at each time t = 1, 2, 3, …, play an arm and collect an i.i.d. “reward” (ad clicks, data rate, profit, …). Play awhile …

Performance Metrics: total (expected) reward at time T; regret (the shortfall relative to always playing the best arm); probability of identifying the best arm; risk aversion: (mean – variance) of reward; …
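
For reference, the standard expected-regret definition these metrics refer to, written in notation of our choosing (the slide's own formulas did not survive transcription); here μ* is the best arm's mean and I_t the arm played at round t.

```latex
% Expected regret after T rounds
R(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{I_t}\right],
\qquad \mu^{*} = \max_{i} \mu_i .
```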

Motivation, Applications: clinical trials (the original application), Internet advertising, A/B testing, comment scoring, cognitive radio, dynamic pricing, sequential investment, noisy function optimization, adaptive routing/congestion control, job scheduling, bidding in auctions, crowdsourcing, learning in games.

Upper Confidence Bound (UCB) algorithm [AuerEtAl’02]. Idea 1: consider the variance of the estimates! Idea 2: be optimistic under uncertainty! Play the arm maximizing its empirical mean plus a confidence-width bonus.
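
A minimal sketch of the UCB1 index rule from [AuerEtAl’02]; the exploration term sqrt(2 ln t / n_i) and the play-each-arm-once initialization are the standard choices, assumed here rather than read off the slide.

```python
import math

def ucb1_choose(counts, means, t):
    """Pick the arm maximizing empirical mean + sqrt(2 ln t / n_i)."""
    for arm, n in enumerate(counts):
        if n == 0:                      # play every arm once before using the index
            return arm
    scores = [means[a] + math.sqrt(2 * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])
```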

UCB Performance [AuerEtAl’02]: after t plays, UCB’s expected reward falls short of always playing the best arm by only a logarithmic-in-t term, so the per-round regret vanishes as t becomes large — learning.
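
For reference, the logarithmic regret bound for UCB1 from Auer, Cesa-Bianchi & Fischer (2002), stated in standard notation (Δ_i = μ* − μ_i); the slide's own formula did not survive transcription.

```latex
% UCB1 expected regret after T plays (Auer, Cesa-Bianchi, Fischer 2002)
\mathbb{E}[R(T)] \;\le\;
\sum_{i:\Delta_i>0} \frac{8\ln T}{\Delta_i}
\;+\; \Big(1+\frac{\pi^{2}}{3}\Big)\sum_{i:\Delta_i>0} \Delta_i .
```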

Variations on the Theme. The (idealized) assumption in MAB is that all arms’ rewards are independent of each other; often there is more structure/coupling. Variation 1: Linear Bandits [DaniEtAl’08, …] — each arm is a vector, and playing an arm at time t yields a reward that is linear in that vector (plus noise), where the underlying weight vector is unknown.

Variation 1 (continued): the regret after T time steps scales with the dimension of the arm vectors rather than with the number of arms.

Variation 1 (continued): e.g., binary vectors representing paths in a graph, or a collection of subsets of a ground set (budgeted ad display), with the unknown vector representing per-edge/per-element cost/utility.

Variation 1 (continued): the LinUCB algorithm — build a point estimate (least-squares estimate) and a confidence region (ellipsoid) around it, then play the most optimistic action w.r.t. this ellipsoid.
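
A minimal sketch of a LinUCB-style rule as just described (ridge least-squares estimate plus an ellipsoidal optimism bonus); the bonus scale `alpha` and the regularizer `reg` are assumed hyperparameters, not values from the talk.

```python
import numpy as np

class LinUCB:
    """Least-squares estimate + optimism over a confidence ellipsoid."""
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.A = reg * np.eye(dim)      # regularized design matrix
        self.b = np.zeros(dim)
        self.alpha = alpha              # confidence-width scale (assumed)

    def choose(self, arms):             # arms: list of d-dimensional feature vectors
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b      # ridge / least-squares point estimate
        def optimistic_value(x):
            x = np.asarray(x, dtype=float)
            return x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(range(len(arms)), key=lambda i: optimistic_value(arms[i]))

    def update(self, x, reward):
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x)
        self.b += reward * x
```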

Variations on the Theme. Variation 2: Non-parametric or X-Armed Bandits [Agrawal’95, Kleinberg’04, …] — noisy rewards from a mean-reward function in some smooth function class (e.g., Lipschitz); UCB-style algorithms based on hierarchical/adaptive discretization.

What would a “Bayesian” do? The Thompson Sampling algorithm — a “fake Bayesian’s” approach to bandits, prehistoric [Thompson’33]. Maintain a “prior” distribution for each arm’s mean. At each round: draw a random sample from each arm’s current distribution, play the best arm assuming the sampled means are the true means, then update the played arm’s distribution to the “posterior” via Bayes’ rule. Repeat.
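
A minimal sketch of this loop for Bernoulli arms with conjugate Beta priors (Beta(1,1) priors are the usual default and are assumed here), reusing the `BernoulliBandit` environment sketched earlier.

```python
import random

def thompson_bernoulli(env, horizon, n_arms):
    """Beta-Bernoulli Thompson Sampling: sample, play greedily, update posterior."""
    alpha = [1.0] * n_arms              # Beta(1,1) priors (assumed)
    beta = [1.0] * n_arms
    for _ in range(horizon):
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])   # act as if samples were true
        reward = env.pull(arm)          # 0/1 reward
        alpha[arm] += reward            # conjugate Bayes update
        beta[arm] += 1 - reward
```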

What we know: [Thompson 1933], [Ortega-Braun 2010]; [Agr-Goy 2011, 2012]: optimal for standard MAB, linear contextual bandits; [Kaufmann et al. 2012, 2013]: standard MAB optimality; [Russo-VanRoy 2013], [Bubeck-Liu 2013]: purely Bayesian setting – Bayesian regret. But the analysis doesn’t generalize: specific conjugate priors, no closed form for complex bandit feedback, e.g., MAX.

TS for Linear Bandits Idea: Use same Least Squares estimate as Lin-UCB, but sample from (multivariate Gaussian) posterior and act greedily! (no need to optimize over an ellipsoid) Shipra Agrawal, Navin Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML 2013.
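
A minimal sketch of that idea, assuming the common Gaussian-posterior form (sample theta from N(theta_hat, v^2 A^{-1}) around the same least-squares statistics as LinUCB and act greedily); the scale `v` and the ridge regularizer are assumed hyperparameters.

```python
import numpy as np

class LinearTS:
    """Thompson Sampling for linear payoffs: sample theta from a Gaussian posterior."""
    def __init__(self, dim, v=1.0, reg=1.0):
        self.A = reg * np.eye(dim)
        self.b = np.zeros(dim)
        self.v = v                       # posterior scale (assumed hyperparameter)

    def choose(self, arms):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        theta = np.random.multivariate_normal(theta_hat, self.v**2 * A_inv)
        return max(range(len(arms)),
                   key=lambda i: np.asarray(arms[i], dtype=float) @ theta)

    def update(self, x, reward):
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x)
        self.b += reward * x
```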

More Generally – “Complex Bandits”: N basic arms, but the actions played and the feedback observed can be complex functions of them.

e.g. Subset Selection

e.g. Job Scheduling

e.g. Ranking

General Thompson Sampling: imagine a ‘fictitious’ prior distribution over all parameters; sample a set of parameters from the prior; assume the sample is the truth and play BestAction(sample); get the reward and update the prior to the posterior (Bayes’ theorem); repeat.

Thompson Sampling Algorithm: the space of all basic parameters carries a fictitious prior measure (e.g., the uniform distribution over that space). At each time: SAMPLE a parameter by the current prior; PLAY the best complex action given the sample; GET the reward; UPDATE the prior to the posterior.
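
A minimal sketch of this loop for a discrete parameter space; `best_action`, `likelihood`, and `play` are hypothetical problem-specific callbacks for illustration, not functions from the paper.

```python
import random

def general_thompson_sampling(params, prior, best_action, likelihood, play, horizon):
    """General TS over a discrete parameter space with a fictitious prior.

    params:      list of candidate basic parameters (the discrete space)
    prior:       prior weights, same length as params
    best_action: best_action(theta) -> complex action optimal if theta were true
    likelihood:  likelihood(feedback, action, theta) -> P(feedback | action, theta)
    play:        play(action) -> feedback observed from the real environment
    """
    weights = list(prior)
    for _ in range(horizon):
        theta = random.choices(params, weights=weights, k=1)[0]   # SAMPLE
        action = best_action(theta)                               # PLAY
        feedback = play(action)                                   # GET
        # UPDATE: pointwise Bayes rule over the discrete parameter space
        weights = [w * likelihood(feedback, action, p) for w, p in zip(weights, params)]
        total = sum(weights)
        weights = [w / total for w in weights]
```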

Main Result: a gap/problem-dependent regret bound. Under any “reasonable” discrete prior and a finite number of actions, with probability at least 1 − δ, the regret is governed by an “information complexity” term that captures the structure of the complex bandit and can be much smaller than the number of actions; it is the solution to an optimization problem in “path space” (with an LP interpretation), and the result holds for general feedback. Previously: only LINEAR complex bandits [DaniEtAl’08, Abbasi-YadkoriEtAl’11, Cesa-Bianchi-Lugosi’12]. * Aditya Gopalan, Shie Mannor & Yishay Mansour, “Complex Bandit Problems and Thompson Sampling”, ICML 2014.

Example 1: “Semi-bandit”. Pick size-K subsets of arms and observe all K chosen rewards. Semi-bandit regret bound: under a reasonable prior, with probability at least 1 − δ, the number of actions is combinatorially large, but the regret bound is far smaller.

Example 2: MAX feedback. Pick size-K subsets of arms and observe only the MAX of the K chosen rewards. Regret bound under the MAX feedback structure: under a reasonable prior, with probability at least 1 − δ, the bound is again much smaller than the number of actions — a significant saving.

Numerics (play subsets, see MAX): UCB is still exploring (linear-regret region)!

Markov Decision Process: states, actions, transition probabilities, rewards. Special case: the multi-armed bandit.

Markov Decision Process: example with 3 states and 2 actions (source: Wikipedia).
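
Since computing the optimal policy of a known (or sampled) MDP recurs below, here is a minimal tabular value-iteration sketch; the discount factor and tolerance are assumed defaults, not values from the talk.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P[a][s][s'] = transition prob., R[a][s] = expected reward; returns greedy policy."""
    n_actions, n_states = len(P), len(P[0])
    V = np.zeros(n_states)
    while True:
        Q = np.array([[R[a][s] + gamma * np.dot(P[a][s], V) for s in range(n_states)]
                      for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=0), V_new    # greedy policy, value function
        V = V_new
```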

Online Reinforcement Learning: suppose the true MDP parameter is unknown to the decision maker a priori. The decision maker must “LEARN the optimal policy” — what action to take in each state — to maximize long-run reward, or equivalently, to minimize regret.

Online Reinforcement Learning. Tradeoff: explore the state space, or exploit existing knowledge to design a good current policy? Upper-confidence-based approaches build confidence intervals per state-action pair and act optimistically: Rmax [Brafman-Tennenholtz 2001], UCRL2 [JakschEtAl 2007]. Key idea: maintain estimates plus high-confidence sets for the transition probability and reward of every (state, action) pair. This is “wasteful” if transitions/rewards have structure/relations.

Parameterized MDP – Queueing System: a single queue with N states in discrete time; Bernoulli arrivals (with a common arrival parameter) at every state; 2 actions {FAST, SLOW} with (Bernoulli) service rates. Assume the service rates are known; the uncertainty is in the arrival parameter only.

Thompson Sampling: draw an MDP instance ~ prior over possible MDPs; compute the optimal policy for that instance (value iteration, linear programming, simulation, …); play the action prescribed by the optimal policy in the current state; repeat indefinitely. In fact, consider the following variant (TSMDP): designate a marker state and divide time into visits to the marker state (epochs). At each epoch: sample an MDP ~ current posterior, compute the optimal policy for the sampled MDP, play that policy until the end of the epoch, and update the posterior using the epoch’s samples.
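
A minimal sketch of the TSMDP variant above for a finite family of candidate MDPs; the environment interface (`env.state`, `env.step`, `env.transition_prob`) and the `solve_policy` oracle (e.g., a wrapper around the value-iteration sketch earlier) are illustrative assumptions, not the paper's code.

```python
import math
import random

def tsmdp(candidates, prior, solve_policy, env, marker_state, horizon):
    """TSMDP sketch: re-sample an MDP and re-plan only on visits to a marker state.

    candidates:   list of candidate MDP models
    prior:        prior weights over candidates
    solve_policy: solve_policy(mdp) -> policy, a mapping state -> action
    env:          real environment with .state, .step(a) -> (next_state, reward), and
                  .transition_prob(mdp, s, a, s_next)   (assumed interface for the update)
    """
    weights = list(prior)
    policy = solve_policy(random.choices(candidates, weights=weights, k=1)[0])
    epoch = []                                   # (s, a, s_next) transitions in this epoch
    for _ in range(horizon):
        s = env.state
        a = policy[s]
        s_next, _reward = env.step(a)
        epoch.append((s, a, s_next))
        if s_next == marker_state:               # end of epoch: Bayes update, then re-sample
            weights = [w * math.prod(env.transition_prob(m, si, ai, sj) for si, ai, sj in epoch)
                       for w, m in zip(weights, candidates)]
            total = sum(weights)
            weights = [w / total for w in weights]
            policy = solve_policy(random.choices(candidates, weights=weights, k=1)[0])
            epoch = []
```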

Numerics – Queueing System

Main Result [G.-Mannor’15]: under a suitably “nice” prior, with probability at least (1 − δ), TSMDP gives regret B + C log(T) in T rounds, where B depends on δ and the prior, and C depends only on δ, the true model and, more importantly, the “effective dimension” of the parameter space. Implication: provably rapid learning if the effective dimensionality of the MDP is small, e.g., a queueing system with a single scalar uncertain parameter.

Future Directions: continuum-armed bandits? Risk-averse decision making; misspecified models (e.g., “almost-linear bandits”); relaxing the epoch structure? Relaxing ergodicity assumptions? Infinite state/action spaces; function approximation for state/policy representations?

Thank you