
1 Monte-Carlo Planning: Policy Improvement Alan Fern

2 Monte-Carlo Planning
- Often a simulator of a planning domain is available or can be learned from data
[Images: Fire & Emergency Response; Conservation Planning]

3 Large Worlds: Monte-Carlo Approach
- Often a simulator of a planning domain is available or can be learned from data
- Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator
[Diagram: the planner sends an action to the World Simulator (standing in for the Real World) and receives the next state + reward]

4 MDP: Simulation-Based Representation
- A simulation-based representation gives: S, A, R, T, I:
  - finite state set S (|S| = n, generally very large)
  - finite action set A (|A| = m, assumed to be of reasonable size)
  - Stochastic, real-valued, bounded reward function R(s,a) = r
    - Stochastically returns a reward r given inputs s and a
  - Stochastic transition function T(s,a) = s’ (i.e. a simulator)
    - Stochastically returns a state s’ given inputs s and a
    - Probability of returning s’ is dictated by Pr(s’ | s,a) of the MDP
  - Stochastic initial state function I
    - Stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language! (See the sketch below.)
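To make this concrete, here is a minimal Python sketch of a simulation-based representation. Only the R/T/I interface comes from the slide; the class name MDPSimulator and the toy two-state dynamics are hypothetical.

import random

class MDPSimulator:
    # Minimal sketch of a simulation-based MDP representation.
    def __init__(self):
        self.states = [0, 1]      # S (tiny here; very large in practice)
        self.actions = [0, 1]     # A

    def I(self):
        # Stochastically return an initial state.
        return random.choice(self.states)

    def R(self, s, a):
        # Stochastically return a bounded, real-valued reward.
        return random.random() if a == s else 0.0

    def T(self, s, a):
        # Stochastically return a next state s' ~ Pr(s' | s, a).
        return a if random.random() < 0.8 else random.choice(self.states)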

5 Outline
- You already learned how to evaluate a policy given a simulator
  - Just run the policy multiple times for a finite horizon and average the rewards
- In the next two lectures we’ll learn how to select good actions

6 Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits)
  - A basic tool for other algorithms
- Monte-Carlo Policy Improvement
  - Policy rollout
  - Policy Switching
- Monte-Carlo Tree Search
  - Sparse Sampling
  - UCT and variants
Today: the single-state case and Monte-Carlo policy improvement

7 Single State Monte-Carlo Planning
- Suppose the MDP has a single state and k actions
- Can sample rewards of actions using calls to the simulator
- Sampling action a is like pulling a slot machine arm with a random payoff function R(s,a)
[Diagram: state s with arms a_1, a_2, …, a_k and payoffs R(s,a_1), R(s,a_2), …, R(s,a_k): the Multi-Armed Bandit Problem]

Multi-Armed Bandits
- We will use bandit algorithms as components for multi-state Monte-Carlo planning
- But they are useful in their own right
- Pure bandit problems arise in many applications
- Applicable whenever:
  - We have a set of independent options with unknown utilities
  - There is a cost for sampling options or a limit on total samples
  - Want to find the best option or maximize utility of our samples

Multi-Armed Bandits: Examples
- Clinical Trials
  - Arms = possible treatments
  - Arm Pulls = application of treatment to an individual
  - Rewards = outcome of treatment
  - Objective = find best treatment quickly (debatable)
- Online Advertising
  - Arms = different ads/ad-types for a web page
  - Arm Pulls = displaying an ad upon a page access
  - Rewards = click-through
  - Objective = find best ad quickly (to maximize clicks)

10 Simple Regret Objective
- Different applications suggest different types of bandit objectives
- Today, minimizing simple regret will be the objective
- Simple Regret: quickly identify an arm with high reward (in expectation)
[Diagram: state s with arms a_1, a_2, …, a_k and payoffs R(s,a_1), R(s,a_2), …, R(s,a_k): the Multi-Armed Bandit Problem]

11 Simple Regret Objective: Formal Definition
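In the standard formulation (following the pure-exploration line of work cited on the next slide), after n arm pulls the algorithm recommends an arm $j_n$, and the simple regret is the expected gap to the best arm:

$$\mathrm{SR}_n = \max_i \mathbb{E}[R(s, a_i)] - \mathbb{E}[R(s, a_{j_n})]$$

The objective is to drive $\mathrm{SR}_n$ toward zero as quickly as possible.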

12 UniformBandit Algorithm (or Round Robin)
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19).
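UniformBandit, as the name suggests, pulls the arms in round-robin fashion and recommends the one with the best empirical mean. A minimal sketch, assuming the hypothetical MDPSimulator above:

def uniform_bandit(sim, s, actions, n):
    # Cycle through the arms uniformly, collecting one reward sample per pull,
    # then recommend the arm with the highest empirical mean.
    pulls = {a: [] for a in actions}
    for t in range(n):
        a = actions[t % len(actions)]
        pulls[a].append(sim.R(s, a))
    return max(actions, key=lambda a: sum(pulls[a]) / max(len(pulls[a]), 1))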

13 Can we do better?
Tolpin, D., & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
- Their approach is often seen to be more effective in practice as well.

14 Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits)
  - A basic tool for other algorithms
- Monte-Carlo Policy Improvement
  - Policy rollout
  - Policy Switching
- Monte-Carlo Tree Search
  - Sparse Sampling
  - UCT and variants
Today: the single-state case and Monte-Carlo policy improvement

15 Policy Improvement via Monte-Carlo
- Now consider a very large multi-state MDP
- Suppose we have a simulator and a non-optimal policy
  - E.g. the policy could be a standard heuristic or based on intuition
- Can we somehow compute an improved policy?
[Diagram: the planner sends an action to the World Simulator + Base Policy (standing in for the Real World) and receives the next state + reward]

16 Policy Improvement Theorem
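The classical statement being invoked here, in standard form: if $\pi'$ acts greedily with respect to the Q-function of $\pi$,

$$\pi'(s) = \arg\max_a Q^{\pi}(s,a),$$

then $V^{\pi'}(s) \ge V^{\pi}(s)$ for all states $s$, with strict improvement in at least one state unless $\pi$ is already optimal.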

17 Policy Improvement via Bandits
- Idea: define a stochastic function SimQ(s,a,π,h) that we can implement and whose expected value is Q^π(s,a,h)
- Then use a bandit algorithm to select (approximately) the action with the best Q-value
[Diagram: state s with arms a_1, a_2, …, a_k, where pulling arm a_i draws one sample of SimQ(s,a_i,π,h)]
How to implement SimQ?

18 Policy Improvement via Bandits

SimQ(s, a, π, h):
  r = R(s, a)           // simulate taking action a in s
  s = T(s, a)
  for i = 1 to h-1:     // then simulate h-1 steps of policy π
    r = r + R(s, π(s))
    s = T(s, π(s))
  return r

- Simply simulate taking a in s and following the policy for h-1 steps, returning the discounted sum of rewards
- The expected value of SimQ(s,a,π,h) is Q^π(s,a,h), which can be made arbitrarily close to Q^π(s,a) by increasing h
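A direct Python rendering of the pseudocode, assuming the hypothetical MDPSimulator sketched earlier; the gamma parameter adds optional discounting (gamma = 1 recovers the plain sum above).

def sim_q(sim, s, a, policy, h, gamma=1.0):
    # Sample the h-horizon Q-value of (s, a) under `policy`;
    # the expected value of this function is Q^pi(s, a, h).
    r = sim.R(s, a)               # simulate taking action a in s
    s = sim.T(s, a)
    discount = gamma
    for _ in range(h - 1):        # then follow the policy for h-1 steps
        act = policy(s)
        r += discount * sim.R(s, act)
        s = sim.T(s, act)
        discount *= gamma
    return r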

19 Policy Improvement via Bandits
The same SimQ procedure, viewed as trajectory sampling:
[Diagram: from state s, one trajectory under π per action a_1, a_2, …, a_k; the sum of rewards along the trajectory for a_i is one sample of SimQ(s,a_i,π,h)]

20 Policy Improvement via Bandits
[Diagram: state s with arms a_1, a_2, …, a_k, sampled via SimQ(s,a_1,π,h), SimQ(s,a_2,π,h), …, SimQ(s,a_k,π,h)]

21 UniformRollout
[Diagram: from state s, w SimQ(s,a_i,π,h) trajectories per action a_i, giving samples q_i1, q_i2, …, q_iw; each trajectory simulates taking action a_i and then following π for h-1 steps]
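A minimal sketch of UniformRollout in the same hypothetical setting, reusing sim_q: draw w SimQ samples per action and act greedily on the averages.

def uniform_rollout(sim, s, actions, policy, h, w):
    # Estimate Q^pi(s, a, h) with w samples per action,
    # then return the action with the best average.
    def avg_q(a):
        return sum(sim_q(sim, s, a, policy, h) for _ in range(w)) / w
    return max(actions, key=avg_q)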

22
- Allocates a non-uniform number of trials across actions (focuses on more promising actions)
[Diagram: from state s, an uneven number of SimQ samples per action: q_11 … q_1u for a_1, q_21 … q_2v for a_2, down to a single sample q_k1 for a_k]

23 Executing Rollout in Real World
[Diagram: the real-world state/action sequence on top; at each real state, a policy rollout is run in the simulator (simulated experience) to choose the next action, e.g. a_2, then a_k, …]

24 Policy Rollout: # of Simulator Calls
- Total of n SimQ calls, each using h calls to the simulator
- Total of hn calls to the simulator
[Diagram: from state s, the SimQ(s,a_i,π,h) trajectories for each action; each simulates taking a_i and then following π for h-1 steps]

25 Multi-Stage Rollout
- A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting at policy π
- We can use more computation time to yield multiple iterations of policy improvement via nesting calls to Rollout
- Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement
- Can nest this arbitrarily
- Gives a way to use more time in order to improve performance
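Nesting is natural to express in code, because a rollout defines a new policy (a function of state). A sketch using the hypothetical helpers above:

def rollout_policy(sim, actions, base_policy, h, w):
    # Return the one-step-improved policy Rollout[base_policy, h, w].
    def improved(s):
        return uniform_rollout(sim, s, actions, base_policy, h, w)
    return improved

# Two iterations of improvement, Rollout[Rollout[pi, h, w], h, w]:
# one_stage = rollout_policy(sim, actions, pi, h, w)
# two_stage = rollout_policy(sim, actions, one_stage, h, w)
# Each action chosen by two_stage triggers roughly (nh)^2 simulator calls.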

26 Practical Issues
- There are three ways to speed up either rollout procedure:
  1. Use a faster policy
  2. Decrease the number of trajectories n
  3. Decrease the horizon h
- Decreasing Trajectories:
  - If n is small compared to the # of actions k, then performance could be poor, since actions don’t get tried very often
  - One way to get away with a smaller n is to use an action filter
  - Action Filter: a function f(s) that returns a subset of the actions in state s that rollout should consider
    - You can use your domain knowledge to filter out obviously bad actions
    - Rollout decides among the rest

27 Practical Issues
- There are three ways to speed up either rollout procedure:
  1. Use a faster policy
  2. Decrease the number of trajectories n
  3. Decrease the horizon h
- Decreasing Horizon h:
  - If h is too small compared to the “real horizon” of the problem, then the Q-estimates may not be accurate
  - One way to get away with a smaller h is to use a value estimation heuristic
  - Heuristic function: a heuristic function v(s) returns an estimate of the value of state s
  - SimQ is adjusted to run the policy for h steps, ending in some state s’, and to return the sum of rewards plus the estimate v(s’) (see the sketch below)
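A sketch of that adjustment in the same hypothetical setting, with a user-supplied heuristic v:

def sim_q_with_heuristic(sim, s, a, policy, h, v, gamma=1.0):
    # Like sim_q, but truncates at horizon h and bootstraps with
    # the heuristic value estimate v(s') for the final state.
    r = sim.R(s, a)
    s = sim.T(s, a)
    discount = gamma
    for _ in range(h - 1):
        act = policy(s)
        r += discount * sim.R(s, act)
        s = sim.T(s, act)
        discount *= gamma
    return r + discount * v(s)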

28 Multi-Stage Rollout
- Two stage: compute the rollout policy of the “rollout policy of π”
- Requires (nh)^2 calls to the simulator for 2 stages
- In general, exponential in the number of stages
[Diagram: from state s, trajectories of SimQ(s,a_i,Rollout[π,h,w],h) for each action; each step of the Rollout policy itself requires nh simulator calls]

29 Example: Rollout for Solitaire [Yan et al. NIPS’04]
- Multiple levels of rollout can pay off, but it is expensive

Player                Success Rate    Time/Game
Human Expert          36.6%           20 min
(naïve) Base Policy   13.05%          0.021 sec
1 rollout             31.20%          0.67 sec
2 rollout             47.6%           7.13 sec
3 rollout             56.83%          1.5 min
4 rollout             60.51%          18 min
5 rollout             70.20%          1 hour 45 min

30 Rollout in 2-Player Games
[Diagram: from state s, w SimQ trajectories per action a_i with samples q_i1 … q_iw; along each trajectory the two players p1 and p2 alternate moves]

31 Another Useful Technique: Policy Switching
- Suppose you have a set of base policies {π_1, π_2, …, π_M}
- Also suppose that the best policy to use can depend on the specific state of the system, and we don’t know how to select among them
- Policy switching is a simple way to select which policy to use at a given step, via a simulator

32 Another Useful Technique: Policy Switching
- The stochastic function Sim(s,π,h) simply samples the h-horizon value of π starting in state s
- Implement by simply simulating π starting in s for h steps and returning the discounted total reward
- Use a bandit algorithm to select the best policy, and then select the action chosen by that policy
[Diagram: state s with arms π_1, π_2, …, π_M, sampled via Sim(s,π_1,h), Sim(s,π_2,h), …, Sim(s,π_M,h)]

33 PolicySwitching

PolicySwitch[{π_1, π_2, …, π_M}, h, n](s):
  1. Define a bandit with M arms, where pulling arm i returns Sim(s, π_i, h)
  2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials
  3. Return action π_i*(s)

[Diagram: state s with arms π_1, π_2, …, π_M; w Sim(s,π_i,h) trajectories per policy yield discounted cumulative reward samples v_i1, v_i2, …, v_iw; each trajectory simulates following π_i for h steps]
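A minimal sketch with uniform sampling playing the role of "your favorite bandit algorithm"; sim_v is a hypothetical stand-in for Sim(s, π, h).

def sim_v(sim, s, policy, h, gamma=1.0):
    # Sample the h-horizon (discounted) value of `policy` starting in s.
    r, discount = 0.0, 1.0
    for _ in range(h):
        a = policy(s)
        r += discount * sim.R(s, a)
        s = sim.T(s, a)
        discount *= gamma
    return r

def policy_switch(sim, s, policies, h, n):
    # Pull each policy-arm about n/M times, pick the best empirical mean,
    # and return that policy's action in s.
    trials = max(n // len(policies), 1)
    means = [sum(sim_v(sim, s, pi, h) for _ in range(trials)) / trials
             for pi in policies]
    best = max(range(len(policies)), key=lambda i: means[i])
    return policies[best](s)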

34 Executing Policy Switching in Real World
[Diagram: the real-world state/action sequence on top; at each real state, a policy-switching rollout over π_1, π_2, …, π_M is run in the simulator (simulated experience), returning e.g. π_2(s) in state s and π_k(s’) in the next state s’]

35 Policy Switching: Quality
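The guarantee usually stated under this heading: with exact values, the switching policy is at least as good as every base policy in the set,

$$V^{\pi_{\mathrm{switch}}}(s) \ge \max_{i \in \{1,\dots,M\}} V^{\pi_i}(s) \quad \text{for all } s,$$

where $\pi_{\mathrm{switch}}(s) = \pi_{i^*}(s)$ with $i^* = \arg\max_i V^{\pi_i}(s)$; with Monte-Carlo estimates in place of exact values, the bound holds only approximately.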

Policy Switching in 2-Player Games

Minimax Policy Switching
- Build a game matrix using the game simulator, starting from the current state s
- Each entry gives the estimated value (for the max player) of playing a policy pair against one another
- Each value is estimated by averaging across w simulated games
[Diagram: game matrix of max-player policies vs. min-player policies]
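A sketch of the matrix construction and minimax selection; simulate_game(s, p_max, p_min, w), which returns the max player's average value over w simulated games, is an assumed helper.

def minimax_policy_switch(s, max_policies, min_policies, simulate_game, w):
    # Build the game matrix, then pick the max-player policy with the
    # best worst-case value against the min player's policy set.
    matrix = [[simulate_game(s, p_max, p_min, w) for p_min in min_policies]
              for p_max in max_policies]
    best = max(range(len(max_policies)), key=lambda i: min(matrix[i]))
    return max_policies[best](s)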
