Monte-Carlo Planning: Policy Improvement. Alan Fern.


1 Monte-Carlo Planning: Policy Improvement Alan Fern

2 Monte-Carlo Planning
- Often a simulator of a planning domain is available or can be learned from data
[Figure: example domains, including fire & emergency response and conservation planning]

3 Large Worlds: Monte-Carlo Approach
- Often a simulator of a planning domain is available or can be learned from data
- Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator
[Figure: the planner sends an action to the world simulator / real world and receives a state and reward in return]

4 MDP: Simulation-Based Representation
A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n, generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r
  - stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e. a simulator)
  - stochastically returns a state s' given inputs s and a
  - the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial state function I
  - stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language! (A toy sketch of such an interface follows below.)
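As a concrete illustration, a simulation-based MDP could be exposed as a small programmatic interface like the following. This is a minimal sketch with a made-up toy domain; the class name, the 3-state chain, and the reward/transition details are assumptions for illustration, not part of the lecture.

import random

class SimulatedMDP:
    """Hypothetical simulation-based MDP: a tiny 3-state chain with 2 actions."""

    states = [0, 1, 2]
    actions = ["stay", "advance"]

    def I(self):
        # Stochastically return an initial state.
        return random.choice([0, 1])

    def R(self, s, a):
        # Stochastically return a bounded reward given inputs s and a.
        base = 1.0 if s == 2 else 0.0
        return base + random.uniform(-0.1, 0.1)

    def T(self, s, a):
        # Stochastically return a next state s' drawn according to Pr(s'|s,a).
        if a == "advance" and random.random() < 0.8:
            return min(s + 1, 2)
        return s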

5 Outline
- You already learned how to evaluate a policy given a simulator
  - Just run the policy multiple times for a finite horizon and average the rewards
- In the next two lectures we'll learn how to use the simulator to select good actions in the real world

6 Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits)
  - A basic tool for other algorithms
- Monte-Carlo Policy Improvement
  - Policy rollout
  - Policy Switching
- Monte-Carlo Tree Search
  - Sparse Sampling
  - UCT and variants
(Today: the single-state case and Monte-Carlo policy improvement)

7 Single State Monte-Carlo Planning
- Suppose the MDP has a single state s and k actions
- Can sample rewards of actions using calls to the simulator
- Sampling action a is like pulling a slot machine arm with random payoff function R(s,a)
[Figure: the multi-armed bandit problem; arms a1, ..., ak at state s with payoffs R(s,a1), ..., R(s,ak)]

Multi-Armed Bandits  We will use bandit algorithms as components for multi-state Monte-Carlo planning  But they are useful in their own right  Pure bandit problems arise in many applications  Applicable whenever:  We have a set of independent options with unknown utilities  There is a cost for sampling options or a limit on total samples  Want to find the best option or maximize utility of our samples

Multi-Armed Bandits: Examples
- Clinical trials
  - Arms = possible treatments
  - Arm pulls = application of a treatment to an individual
  - Rewards = outcome of the treatment
  - Objective = determine the best treatment quickly
- Online advertising
  - Arms = different ads/ad types for a web page
  - Arm pulls = displaying an ad upon a page access
  - Rewards = click-throughs
  - Objective = find the best ad quickly (i.e., maximize clicks)

10 Simple Regret Objective
- Different applications suggest different types of bandit objectives
- Today, minimizing simple regret will be the objective
- Simple regret minimization (informal): quickly identify an arm with close-to-optimal expected reward
[Figure: the multi-armed bandit problem; arms a1, ..., ak at state s with payoffs R(s,a1), ..., R(s,ak)]

11 Simple Regret Objective: Formal Definition
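A standard way to state the simple-regret objective, using the usual bandit notation (the symbols here are assumptions rather than taken from the slide), is:

% Let \mu_i = E[R(s,a_i)] be the expected reward of arm a_i, \mu^* = \max_i \mu_i,
% and let j_n be the arm recommended after n total pulls. The simple regret is
\mathrm{SReg}_n \;=\; \mu^{*} - \mu_{j_n},
% and the objective is to make the expected simple regret E[SReg_n]
% (or the probability that SReg_n exceeds a tolerance \epsilon) small as quickly as possible.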

12 UniformBandit Algorithm (or Round Robin)
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19).
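A minimal sketch of the UniformBandit (round-robin) idea: pull every arm the same number of times, then recommend the arm with the best empirical mean. The pull function and parameter names below are assumptions for illustration.

import random
from statistics import mean

def uniform_bandit(pull, k, n):
    # Pull each of the k arms n times in round-robin order, then
    # recommend the arm with the highest empirical mean reward.
    samples = {arm: [] for arm in range(k)}
    for _ in range(n):
        for arm in range(k):
            samples[arm].append(pull(arm))
    return max(range(k), key=lambda arm: mean(samples[arm]))

# Toy usage: arm 2 has the highest expected reward and is usually recommended.
true_means = [0.2, 0.5, 0.8]
best_arm = uniform_bandit(lambda a: random.gauss(true_means[a], 1.0), k=3, n=200)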

13 Can we do better?
Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
The approach from this paper is often more effective than UniformBandit in practice.

14 Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits)
  - A basic tool for other algorithms
- Monte-Carlo Policy Improvement
  - Policy rollout
  - Policy Switching
- Monte-Carlo Tree Search
  - Sparse Sampling
  - UCT and variants
(Today: Monte-Carlo policy improvement)

15 Policy Improvement via Monte-Carlo
- Now consider a very large multi-state MDP
- Suppose we have a simulator and a non-optimal policy
  - E.g., the policy could be a standard heuristic or based on intuition
- Can we somehow compute an improved policy?
[Figure: the planner (world simulator + base policy) sends an action to the real world and receives a state and reward in return]

16 Policy Improvement Theorem
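The theorem on this slide can be summarized in the standard form below (stated for a generic MDP rather than copied from the slide, so treat the exact formulation as an assumption): if π' acts greedily with respect to the Q-function of π, then π' is at least as good as π everywhere.

% Policy improvement: define the greedy policy
\pi'(s) \;=\; \arg\max_{a \in A} Q^{\pi}(s,a) \qquad \forall s \in S,
% then
V^{\pi'}(s) \;\ge\; V^{\pi}(s) \qquad \forall s \in S.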

17 Policy Improvement via Bandits
[Figure: a bandit at state s whose arms are the actions a1, ..., ak; pulling arm ai returns a sample of SimQ(s,ai,π,h)]
How to implement SimQ?

18 Policy Improvement via Bandits
SimQ(s,a,π,h)
  q = R(s,a); s = T(s,a)                // simulate a in s
  for i = 1 to h-1
    q = q + R(s,π(s)); s = T(s,π(s))    // simulate h-1 steps of policy π
  return q
[Figure: from state s, one trajectory under π per action; the sum of rewards along the trajectory for action ai is SimQ(s,ai,π,h)]

19 Policy Improvement via Bandits
SimQ(s,a,π,h)
  q = R(s,a); s = T(s,a)                // simulate a in s
  for i = 1 to h-1
    q = q + R(s,π(s)); s = T(s,π(s))    // simulate h-1 steps of policy π
  return q
- Simply simulate taking a in s and following the policy for h-1 steps, returning the discounted sum of rewards
- The expected value of SimQ(s,a,π,h) is Qπ(s,a,h)
- So averaging across multiple runs of SimQ quickly converges to Qπ(s,a,h)
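A direct Python transcription of the SimQ pseudocode, assuming a simulator object with R(s, a) and T(s, a) as on slide 4 (the function name, the mdp argument, and the optional discount factor gamma are assumptions; the slide's pseudocode sums undiscounted rewards).

def sim_q(mdp, s, a, policy, h, gamma=1.0):
    # Sample SimQ(s, a, pi, h): take action a in s, then follow policy
    # for h-1 steps, returning the (optionally discounted) sum of rewards.
    q = mdp.R(s, a)
    s = mdp.T(s, a)
    discount = gamma
    for _ in range(h - 1):
        a_pi = policy(s)
        q += discount * mdp.R(s, a_pi)
        s = mdp.T(s, a_pi)
        discount *= gamma
    return q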

20 Policy Improvement via Bandits
[Figure: the bandit at state s with arms a1, ..., ak; pulling arm ai returns a sample of SimQ(s,ai,π,h)]

21 UniformRollout
[Figure: from state s, w SimQ(s,ai,π,h) trajectories are run for each action ai, giving samples q_i1, ..., q_iw; each trajectory simulates taking action ai and then following π for h-1 steps]
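A minimal sketch of UniformRollout built on sim_q above: run w SimQ trajectories per action, average them, and act greedily on the estimates (function and parameter names are assumptions).

from statistics import mean

def uniform_rollout(mdp, s, actions, policy, h, w, gamma=1.0):
    # UniformRollout[pi, h, w](s): estimate Q^pi(s, a, h) for every action
    # by averaging w SimQ samples, then return the action with the best estimate.
    estimates = {
        a: mean(sim_q(mdp, s, a, policy, h, gamma) for _ in range(w))
        for a in actions
    }
    return max(estimates, key=estimates.get)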

22 [Figure: as slide 21, but with a varying number of SimQ trajectories per action (e.g. u samples for a1, v samples for a2)]
- Allocates a non-uniform number of trials across actions (focuses on more promising actions)

23 Executing Rollout in Real World
[Figure: at each state in the real-world state/action sequence, a policy rollout is run using simulated experience and the selected action (e.g. a2, then ak) is executed in the real world]
How much time does each decision take?

24 Policy Rollout: # of Simulator Calls
- Total of n SimQ calls, each using h calls to the simulator and policy
- Total of hn calls to the simulator and to the policy (dominates the time to make a decision)
[Figure: the SimQ(s,ai,π,h) trajectories from state s; each simulates taking action ai and then following π for h-1 steps]

25 Practical Issues: Accuracy
In general, larger values of the horizon h and the number of trajectories are better for accuracy, but this increases decision time.

26 Practical Issues: Speed
There are three ways to speed up decision-making time:
1. Use a faster policy

27 Practical Issues: Speed
There are three ways to speed up decision-making time:
1. Use a faster policy
2. Decrease the number of trajectories n
Decreasing trajectories:
- If n is small compared to the number of actions k, then performance could be poor since actions don't get tried very often
- One way to get away with a smaller n is to use an action filter
- Action filter: a function f(s) that returns a subset of the actions in state s that rollout should consider
  - You can use your domain knowledge to filter out obviously bad actions
  - Rollout decides among the remaining actions returned by f(s)
  - Since rollout only tries actions in f(s), we can use a smaller value of n

28 Practical Issues: Speed
There are three ways to speed up either rollout procedure:
1. Use a faster policy
2. Decrease the number of trajectories n
3. Decrease the horizon h
Decreasing the horizon h:
- If h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate
- Can get away with a smaller h by using a value-estimation heuristic
- Heuristic function: a function v(s) that returns an estimate of the value of state s
- SimQ is adjusted to run the policy for h steps, ending in some state s', and returns the sum of rewards up until s' added to the estimate v(s') (see the sketch below)
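One way to realize item 3 is a small change to sim_q: stop after the h simulated steps and add a heuristic estimate v(s') at the state where the simulation stopped. The heuristic v is assumed to be supplied by the user; everything else follows the earlier sketch.

def sim_q_with_heuristic(mdp, s, a, policy, h, v, gamma=1.0):
    # Like sim_q, but after h steps add the heuristic estimate v(s')
    # of the value of the state where the rollout was cut off.
    q = mdp.R(s, a)
    s = mdp.T(s, a)
    discount = gamma
    for _ in range(h - 1):
        a_pi = policy(s)
        q += discount * mdp.R(s, a_pi)
        s = mdp.T(s, a_pi)
        discount *= gamma
    return q + discount * v(s)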

29 Multi-Stage Rollout
- A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting at policy π
- We can use more computation time to yield multiple iterations of policy improvement by nesting calls to Rollout
- Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement
- Can nest this arbitrarily
- Gives a way to use more time in order to improve performance (see the sketch below)
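A sketch of nesting under the same assumptions: since Rollout[pi, h, w] is itself a policy (a function from states to actions), it can be passed back in as the base policy to get a second stage of improvement.

def rollout_policy(mdp, actions, policy, h, w, gamma=1.0):
    # Return Rollout[pi, h, w] as a new policy: a function that maps a
    # state s to the action chosen by uniform rollout over the base policy.
    return lambda s: uniform_rollout(mdp, s, actions, policy, h, w, gamma)

# Two-stage rollout: the rollout policy of the "rollout policy of pi".
# Every simulated step of the outer rollout now invokes a full inner rollout,
# which is what makes the cost grow so quickly with the number of stages.
# two_stage = rollout_policy(mdp, actions,
#                            rollout_policy(mdp, actions, base_pi, h, w),
#                            h, w)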

30 Multi-Stage Rollout
[Figure: trajectories of SimQ(s,ai,Rollout[π,h,w],h) from state s]
- Each step requires nh simulator calls for the Rollout policy
- Two stage: compute the rollout policy of the "rollout policy of π"
- Requires (nh)^2 calls to the simulator for 2 stages
- In general, exponential in the number of stages

31 Example: Rollout for Solitaire [Yan et al. NIPS'04]
Multiple levels of rollout can pay off, but are expensive.

Player               Success Rate   Time/Game
Human Expert         36.6%          20 min
(naïve) Base Policy  13.05%         0.021 sec
1 rollout            31.20%         0.67 sec
2 rollout            47.6%          7.13 sec
3 rollout            56.83%         1.5 min
4 rollout            60.51%         18 min
5 rollout            70.20%         1 hour 45 min

32 Rollout in 2-Player Games
[Figure: rollout from state s with actions a1, ..., ak and SimQ samples q_11, ..., q_kw, annotated with the two players p1 and p2]

33 Another Useful Technique: Policy Switching
- Suppose you have a set of base policies {π1, π2, ..., πM}
- Also suppose that the best policy to use can depend on the specific state of the system, and we don't know how to select among them
- Policy switching is a simple way to select which policy to use at a given step via a simulator

34 Another Useful Technique: Policy Switching
[Figure: a bandit at state s whose arms are the base policies π1, π2, ..., πM; pulling arm i returns a sample of Sim(s,πi,h)]
- The stochastic function Sim(s,π,h) simply samples the h-horizon value of π starting in state s
- Implement it by simply simulating π starting in s for h steps and returning the discounted total reward
- Use a bandit algorithm to select the best policy, then select the action chosen by that policy

35 PolicySwitching
PolicySwitch[{π1, π2, ..., πM},h,n](s)
  1. Define a bandit with M arms giving rewards Sim(s,πi,h)
  2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials
  3. Return action πi*(s)
[Figure: Sim(s,πi,h) trajectories from state s; each simulates following πi for h steps, and the values v_i1, ..., v_iw are the discounted cumulative rewards]
(A Python sketch follows below.)
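A minimal policy-switching sketch: Sim(s, pi, h) is implemented as an h-step simulation of pi, and the bandit step here is plain uniform sampling across the M policies (using uniform allocation is an assumption; any bandit algorithm can be substituted). The simulator interface matches the earlier sketches.

from statistics import mean

def sim(mdp, s, policy, h, gamma=1.0):
    # Sim(s, pi, h): sample the h-horizon (discounted) value of pi starting in s.
    total, discount = 0.0, 1.0
    for _ in range(h):
        a = policy(s)
        total += discount * mdp.R(s, a)
        s = mdp.T(s, a)
        discount *= gamma
    return total

def policy_switch(mdp, s, policies, h, n, gamma=1.0):
    # PolicySwitch[{pi_1..pi_M}, h, n](s): spread the n trials across the
    # policies, pick the one whose sampled h-horizon value from s is best,
    # and return that policy's action.
    trials = max(1, n // len(policies))
    values = [mean(sim(mdp, s, pi, h, gamma) for _ in range(trials))
              for pi in policies]
    best = max(range(len(policies)), key=lambda i: values[i])
    return policies[best](s)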

36 Executing Policy Switching in Real World
[Figure: at each state in the real-world state/action sequence, policy switching is run over simulated experience with the policies π1, ..., πM, and the selected policy's action (e.g. π2(s), then the policy chosen at the next state applied to s') is executed in the real world]

37 Policy Switching: Quality
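The quality statement that slide 42 alludes to can be written as follows, in an idealized form that ignores sampling error (so treat the exact inequality as an assumption rather than the slide's own wording): in a single-agent MDP, the switching policy is at least as good as every policy in the set at every state.

% Policy switching over \Pi = \{\pi_1, \dots, \pi_M\}: define
\pi_{\mathrm{sw}}(s) \;=\; \pi_{i^{*}(s)}(s), \qquad i^{*}(s) \;=\; \arg\max_{i} V^{\pi_i}(s),
% then
V^{\pi_{\mathrm{sw}}}(s) \;\ge\; \max_{i} V^{\pi_i}(s) \qquad \forall s \in S.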

Policy Switching in 2-Player Games

MaxiMin Policy Switching
[Figure: from the current state s, the game simulator is used to build a game matrix over policy pairs]
- Each entry gives the estimated value (for the max player) of playing a policy pair against one another
- Each value is estimated by averaging across w simulated games

MaxiMin Switching
[Figure: the game matrix built from the current state s via the game simulator]
- Can switch between policies based on the state of the game!
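A sketch of the game-matrix construction and the maximin choice, assuming a two-player game simulator play(s, pi_max, pi_min, h) that returns the max player's payoff after h simulated moves (that simulator interface, and the function names, are assumptions).

from statistics import mean

def maximin_switch(s, max_policies, min_policies, play, h, w):
    # Build the game matrix G[i][j] = estimated value (for the max player)
    # of playing max_policies[i] against min_policies[j] from state s,
    # averaging each entry over w simulated games; then pick the max-player
    # policy with the best worst-case value and return its action.
    G = [[mean(play(s, pi, pj, h) for _ in range(w)) for pj in min_policies]
         for pi in max_policies]
    best_row = max(range(len(max_policies)), key=lambda i: min(G[i]))
    return max_policies[best_row](s)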

42 Policy Switching: Quality
- MaxiMin policy switching will often do better than any single policy in practice
- The theoretical guarantees for basic MaxiMin policy switching are quite weak
  - Tweaks to the algorithm can fix this
- For single-agent MDPs, policy switching is guaranteed to improve over the best policy in the set