Monte-Carlo Planning: Policy Improvement Alan Fern
Monte-Carlo Planning
Often a simulator of a planning domain is available or can be learned from data.
Example domains: Conservation Planning, Fire & Emergency Response
Large Worlds: Monte-Carlo Approach
Often a simulator of a planning domain is available or can be learned from data.
Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator.
[Diagram: the planner sends an action to the world simulator / real world and receives the resulting state and reward]
MDP: Simulation-Based Representation
A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n, generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r: stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e. a simulator): stochastically returns a state s' given inputs s and a; the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial state function I: stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!
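To make this concrete, here is a minimal Python sketch of what such a simulation-based representation might look like. The class, its toy chain-world dynamics, and the method names (initial_state, reward, transition) are illustrative assumptions, not part of the slides or any particular framework.

    import random

    class SimulatorMDP:
        """Sketch of a simulation-based MDP: a toy chain world (made up for illustration)."""
        def __init__(self, num_states=10):
            self.num_states = num_states
            self.actions = ["left", "right"]       # finite action set A

        def initial_state(self):                   # I: sample from the initial state distribution
            return random.randrange(self.num_states)

        def reward(self, s, a):                    # R(s,a): stochastic, bounded reward
            return 1.0 if s == self.num_states - 1 else random.uniform(0.0, 0.1)

        def transition(self, s, a):                # T(s,a): sample s' ~ Pr(. | s,a)
            step = 1 if a == "right" else -1
            if random.random() < 0.1:              # 10% chance the move "slips"
                step = -step
            return min(max(s + step, 0), self.num_states - 1)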
Outline
You already learned how to evaluate a policy given a simulator: just run the policy multiple times for a finite horizon and average the rewards.
In the next two lectures we'll learn how to select good actions.
Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits): a basic tool for other algorithms  [Today]
- Monte-Carlo Policy Improvement: policy rollout, policy switching  [Today]
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
Single State Monte-Carlo Planning
Suppose the MDP has a single state s and k actions.
We can sample rewards of actions using calls to the simulator.
Sampling action a is like pulling a slot machine arm with random payoff function R(s,a).
[Diagram: state s with arms a1, a2, ..., ak and payoffs R(s,a1), R(s,a2), ..., R(s,ak)]
This is the Multi-Armed Bandit Problem.
Multi-Armed Bandits
We will use bandit algorithms as components for multi-state Monte-Carlo planning, but they are useful in their own right: pure bandit problems arise in many applications.
Applicable whenever:
- we have a set of independent options with unknown utilities
- there is a cost for sampling options or a limit on total samples
- we want to find the best option or maximize the utility of our samples
Multi-Armed Bandits: Examples
Clinical Trials
- Arms = possible treatments
- Arm pulls = application of a treatment to an individual
- Rewards = outcome of the treatment
- Objective = find the best treatment quickly (debatable)
Online Advertising
- Arms = different ads/ad-types for a web page
- Arm pulls = displaying an ad upon a page access
- Rewards = click-throughs
- Objective = find the best ad quickly (i.e. maximize clicks)
Simple Regret Objective
Different applications suggest different types of bandit objectives. Today minimizing simple regret will be the objective.
Simple Regret Minimization (informal): quickly identify an arm with close to optimal expected reward.
[Diagram: state s with arms a1, a2, ..., ak and payoffs R(s,a1), R(s,a2), ..., R(s,ak)]
Simple Regret Objective: Formal Definition
Protocol: at time step n, based on all prior observations:
- Pick an "exploration" arm $a_n$, then pull it and observe reward $r_n$.
- Pick an "exploitation" arm index $j_n$ that currently looks best (if the algorithm is stopped at time n it returns $j_n$).
($a_n$, $j_n$ and $r_n$ are random variables.)
Let $R^*$ be the expected reward of the truly best arm.
Expected Simple Regret $E[\mathrm{SReg}_n]$: the difference between $R^*$ and the expected reward of the arm $j_n$ selected by our strategy at time n:
$E[\mathrm{SReg}_n] = R^* - E[R(a_{j_n})]$
UniformBandit Algorithm (or Round Robin)
[Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.]
UniformBandit Algorithm:
- At round n, pull the arm with index (n mod k) + 1.
- At round n, return (if asked) the arm with the largest average reward, i.e. $j_n$ is the index of the arm with the best average so far.
Theorem: The expected simple regret of UniformBandit after n arm pulls is upper bounded by $O(e^{-cn})$ for a constant c.
This bound is exponentially decreasing in n! So even this simple algorithm has a provably small simple regret.
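A minimal sketch of UniformBandit in Python follows. The interface is an assumption made for illustration: pull(i) is any function that returns a sampled reward for arm i (e.g. one call to the simulator's reward function).

    def uniform_bandit(pull, k, n):
        """Round-robin arm pulling; after n pulls, return the index of the arm
        with the best empirical mean (the exploitation arm j_n)."""
        totals = [0.0] * k
        counts = [0] * k
        for t in range(n):
            i = t % k                          # pull arms in round-robin order
            totals[i] += pull(i)
            counts[i] += 1
        return max(range(k),
                   key=lambda i: totals[i] / counts[i] if counts[i] else float("-inf"))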
Can we do better?
[Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.]
Algorithm ε-GreedyBandit (parameter 0 < ε < 1):
- At round n, with probability ε pull the arm with the best average reward so far; otherwise pull one of the other arms uniformly at random.
- At round n, return (if asked) the arm with the largest average reward.
Theorem: The expected simple regret of ε-Greedy with ε = 0.5 after n arm pulls is upper bounded by $O(e^{-cn})$ for a constant c that is larger than the constant for UniformBandit (this holds for "large enough" n).
It is often more effective than UniformBandit in practice.
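A corresponding sketch of ε-GreedyBandit, under the same hypothetical pull(i) interface as above:

    import random

    def epsilon_greedy_bandit(pull, k, n, eps=0.5):
        """With probability eps pull the empirically best arm, otherwise pull a
        random other arm; return the arm with the best empirical mean."""
        totals = [0.0] * k
        counts = [0] * k
        def mean(i):
            # untried arms look infinitely good so each arm is tried at least once
            return totals[i] / counts[i] if counts[i] > 0 else float("inf")
        for _ in range(n):
            best = max(range(k), key=mean)
            if random.random() < eps or k == 1:
                i = best
            else:
                i = random.choice([j for j in range(k) if j != best])
            totals[i] += pull(i)
            counts[i] += 1
        return max(range(k),
                   key=lambda i: totals[i] / counts[i] if counts[i] else float("-inf"))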
Monte-Carlo Planning Outline
- Single State Case (multi-armed bandits): a basic tool for other algorithms  [Today]
- Monte-Carlo Policy Improvement: policy rollout, policy switching  [Today]
- Monte-Carlo Tree Search: sparse sampling, UCT and variants
Policy Improvement via Monte-Carlo
Now consider a very large multi-state MDP. Suppose we have a simulator and a non-optimal policy (e.g. the policy could be a standard heuristic or based on intuition).
Can we somehow compute an improved policy?
[Diagram: the world simulator plus base policy sends an action to the real world and receives the resulting state and reward]
Policy Improvement Theorem
Definition: The Q-value function $Q^\pi(s,a,h)$ gives the expected future reward of starting in state s, taking action a, and then following policy $\pi$ until the horizon h; i.e., how good it is to execute $\pi$ after taking action a in state s.
Define: $\pi'(s) = \arg\max_a Q^\pi(s,a,h)$
Theorem [Howard, 1960]: For any non-optimal policy $\pi$, the policy $\pi'$ is strictly better than $\pi$.
So if we can compute $\pi'(s)$ at any state we encounter, then we can execute an improved policy.
Can we use bandit algorithms to compute $\pi'(s)$?
Policy Improvement via Bandits
[Diagram: state s with arms a1, ..., ak and payoffs SimQ(s,a1,π,h), SimQ(s,a2,π,h), ..., SimQ(s,ak,π,h)]
Idea: define a stochastic function SimQ(s,a,π,h) that we can implement and whose expected value is Qπ(s,a,h).
Then use a bandit algorithm to select (approximately) the action with the best Q-value (i.e. the action π'(s)).
How do we implement SimQ?
Policy Improvement via Bandits
SimQ(s,a,π,h)
  q = R(s,a)                  // simulate taking a in s
  s = T(s,a)
  for i = 1 to h-1            // simulate h-1 steps of the policy
    q = q + R(s, π(s))
    s = T(s, π(s))
  return q
[Diagram: from state s, each action ai starts a trajectory under π; the sum of rewards along the trajectory is SimQ(s,ai,π,h)]
Policy Improvement via Bandits
SimQ(s,a,π,h)
  q = R(s,a)                  // simulate taking a in s
  s = T(s,a)
  for i = 1 to h-1            // simulate h-1 steps of the policy
    q = q + R(s, π(s))
    s = T(s, π(s))
  return q
Simply simulate taking a in s and then following the policy for h-1 steps, returning the sum of rewards (discounted, if the MDP uses a discount factor).
The expected value of SimQ(s,a,π,h) is Qπ(s,a,h), so averaging across multiple runs of SimQ quickly converges to Qπ(s,a,h).
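The pseudocode above translates directly into Python. This sketch assumes the hypothetical simulator interface (reward, transition) from the earlier SimulatorMDP example and an undiscounted sum of rewards:

    def sim_q(sim, s, a, policy, h):
        """One Monte-Carlo sample of SimQ(s,a,pi,h): take action a in s, then
        follow `policy` for h-1 steps, returning the sum of sampled rewards."""
        q = sim.reward(s, a)
        s = sim.transition(s, a)
        for _ in range(h - 1):
            act = policy(s)
            q += sim.reward(s, act)
            s = sim.transition(s, act)
        return q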
Policy Improvement via Bandits
[Diagram: state s with arms a1, ..., ak and payoffs SimQ(s,a1,π,h), SimQ(s,a2,π,h), ..., SimQ(s,ak,π,h)]
Now apply your favorite bandit algorithm for simple regret:
- UniformRollout: use UniformBandit. Parameters: number of trials n and horizon/height h.
- ε-GreedyRollout: use ε-GreedyBandit. Parameters: number of trials n, ε, and horizon/height h (ε = 0.5 is often a good choice).
UniformRollout
Each action is tried roughly the same number of times (approximately w = n/k times).
[Diagram: from state s, w SimQ(s,ai,π,h) trajectories per action ai; each simulates taking action ai then following π for h-1 steps, giving samples q_i1, q_i2, ..., q_iw of SimQ(s,ai,π,h)]
ε-GreedyRollout
Allocates a non-uniform number of trials across actions (focuses on more promising actions).
For ε = 0.5 we might expect it to be better than UniformRollout for the same value of n.
[Diagram: from state s, action a1 receives u samples, a2 receives v samples, ak receives a single sample, etc.]
A sketch of both rollout procedures follows.
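The sketch below combines the earlier hypothetical sim_q, uniform_bandit, and epsilon_greedy_bandit helpers; it is an illustrative composition, not a reference implementation of any particular library.

    def uniform_rollout(sim, s, actions, policy, h, n):
        """UniformRollout: each action gets roughly n/k SimQ samples via the
        round-robin bandit; return the action with the best average SimQ value."""
        pull = lambda i: sim_q(sim, s, actions[i], policy, h)
        return actions[uniform_bandit(pull, len(actions), n)]

    def epsilon_greedy_rollout(sim, s, actions, policy, h, n, eps=0.5):
        """epsilon-GreedyRollout: allocate SimQ trials non-uniformly so that more
        promising actions receive more samples."""
        pull = lambda i: sim_q(sim, s, actions[i], policy, h)
        return actions[epsilon_greedy_bandit(pull, len(actions), n, eps)]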
Executing Rollout in Real World
[Diagram: real-world state/action sequence; at each real-world state we run a policy rollout over the actions a1, ..., ak using simulated experience and then execute the selected action]
How much time does each decision take?
Policy Rollout: # of Simulator Calls
[Diagram: SimQ(s,ai,π,h) trajectories for each action ai; each simulates taking action ai then following π for h-1 steps]
A total of n SimQ calls, each using h calls to the simulator and the policy.
So there are hn total calls to the simulator and to the policy, which dominates the time to make a decision.
Practical Issues: Accuracy
Selecting the number of trajectories n:
- n should be at least as large as the number of available actions (so each is tried at least once)
- In general n needs to be larger as the randomness of the simulator increases (so each action gets tried a sufficient number of times)
- Rule of thumb: start with n set so that each action can be tried approximately 5 times, then see the impact of decreasing/increasing n
Selecting the height/horizon h of trajectories:
- A common option is to select h to be the same as the horizon of the problem being solved. Suggestion: set h = -1 in our framework, which will run all trajectories until the simulator hits a terminal state.
- Using a smaller value of h can sometimes be effective if enough reward is accumulated to give a good estimate of Q-values
- In general, larger values are better, but this increases time
Practical Issues: Speed
There are three ways to speed up decision-making time for either rollout procedure:
1. Use a faster policy
2. Decrease the number of trajectories n
3. Decrease the horizon h
Decreasing the number of trajectories n:
- If n is small compared to the number of actions k, then performance could be poor since actions don't get tried very often
- One way to get away with a smaller n is to use an action filter
- Action filter: a function f(s) that returns a subset of the actions in state s that rollout should consider. You can use your domain knowledge to filter out obviously bad actions. Rollout decides among the remaining actions returned by f(s), so it can use a smaller value of n.
Decreasing the horizon h:
- If h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate
- We can get away with a smaller h by using a value estimation heuristic
- Heuristic function: a heuristic function v(s) returns an estimate of the value of state s. SimQ is adjusted to run the policy for h steps ending in state s' and to return the sum of rewards up until s' added to the estimate v(s'). A sketch of this adjusted SimQ follows.
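Here is a sketch of the truncated-horizon SimQ variant. The heuristic v and the simulator interface are illustrative assumptions carried over from the earlier sketches.

    def sim_q_with_heuristic(sim, s, a, policy, h, v):
        """Run SimQ for h steps and bootstrap the remainder of the trajectory
        with a heuristic value estimate v(s') for the final state s'."""
        q = sim.reward(s, a)
        s = sim.transition(s, a)
        for _ in range(h - 1):
            act = policy(s)
            q += sim.reward(s, act)
            s = sim.transition(s, act)
        return q + v(s)   # add the heuristic estimate of the remaining value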
Multi-Stage Rollout
A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting at policy π.
We can use more computation time to obtain multiple iterations of policy improvement by nesting calls to Rollout:
Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement.
We can nest this arbitrarily, which gives a way to use more time in order to improve performance.
Multi-Stage Rollout
Two stage: compute the rollout policy of the "rollout policy of π".
[Diagram: from state s, trajectories of SimQ(s,ai,Rollout[π,h,w],h) for each action ai; each step of a trajectory invokes the rollout policy, which itself requires nh simulator calls]
Requires (nh)^2 calls to the simulator for 2 stages.
In general the cost is exponential in the number of stages.
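A short sketch of nesting, built on the hypothetical uniform_rollout helper above: a rollout policy is just a function of state, so it can serve as the base policy of another rollout.

    def rollout_policy(sim, actions, base_policy, h, n):
        """Rollout[pi,h,n] as a policy: at each state, return the action chosen
        by one rollout-based policy improvement step over base_policy."""
        def improved(s):
            return uniform_rollout(sim, s, actions, base_policy, h, n)
        return improved

    # Two-stage (nested) rollout: the outer rollout's base policy is itself a
    # rollout policy, so each outer SimQ step triggers an inner rollout,
    # giving roughly (nh)^2 simulator calls per decision.
    # pi1 = rollout_policy(sim, actions, base_policy, h, n)
    # pi2 = rollout_policy(sim, actions, pi1, h, n)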
Example: Rollout for Solitaire [Yan et al. NIPS'04]

Player          Success Rate   Time/Game
Human Expert    36.6%          20 min (naïve)
Base Policy     13.05%         0.021 sec
1 rollout       31.20%         0.67 sec
2 rollout       47.6%          7.13 sec
3 rollout       56.83%         1.5 min
4 rollout       60.51%         18 min
5 rollout       70.20%         1 hour 45 min

Multiple levels of rollout can pay off, but they are expensive.
Rollout in 2-Player Games
SimQ simply uses the base policy π to select moves for both players until the horizon.
Rollout is therefore biased toward playing well against π. Is this OK?
[Diagram: from state s, actions a1, ..., ak; trajectories alternate moves of players p1 and p2, giving samples q11, q12, ..., qkw]
Another Useful Technique: Policy Switching
Suppose you have a set of base policies {π1, π2, ..., πM}.
Also suppose that the best policy to use can depend on the specific state of the system, and we don't know how to select it.
Policy switching is a simple way to select which policy to use at a given step via a simulator.
Another Useful Technique: Policy Switching
[Diagram: state s with arms π1, π2, ..., πM and payoffs Sim(s,π1,h), Sim(s,π2,h), ..., Sim(s,πM,h)]
The stochastic function Sim(s,π,h) simply samples the h-horizon value of π starting in state s: implement it by simulating π starting in s for h steps and returning the discounted total reward.
Use a bandit algorithm to select the best policy, then select the action chosen by that policy.
PolicySwitching
PolicySwitch[{π1, π2, ..., πM},h,n](s):
- Define a bandit with M arms giving rewards Sim(s,πi,h)
- Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials
- Return action πi*(s)
[Diagram: from state s, w Sim(s,πi,h) trajectories per policy πi; each simulates following πi for h steps, giving discounted cumulative rewards v_i1, v_i2, ..., v_iw]
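A sketch of Sim and PolicySwitch, reusing the earlier hypothetical simulator interface and uniform_bandit helper (the discount factor gamma is an added illustrative parameter):

    def sim_value(sim, s, policy, h, gamma=1.0):
        """Sim(s,pi,h): simulate pi from s for h steps and return the
        (optionally discounted) total reward."""
        total, discount = 0.0, 1.0
        for _ in range(h):
            a = policy(s)
            total += discount * sim.reward(s, a)
            s = sim.transition(s, a)
            discount *= gamma
        return total

    def policy_switch(sim, s, policies, h, n):
        """PolicySwitch[{pi_1..pi_M},h,n](s): treat each policy as a bandit arm,
        pick the best-looking arm, and return that policy's action."""
        pull = lambda i: sim_value(sim, s, policies[i], h)
        return policies[uniform_bandit(pull, len(policies), n)](s)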
Executing Policy Switching in Real World
[Diagram: real-world state/action sequence; at each real-world state we run a policy rollout over the policies π1, π2, ..., πk using simulated experience and then execute the action πi*(s') chosen by the selected policy]
Policy Switching: Quality
Let π_ps denote the ideal switching policy, which always picks the best policy index at any state.
Theorem: For any state s, $\max_i V^{\pi_i}(s) \le V^{\pi_{ps}}(s)$.
The value of the switching policy is at least as good as the best single policy in the set, and it will often perform better than any single policy in the set.
For the non-ideal case, where the bandit algorithm only picks approximately the best arm, we can add an error term to the bound.
Policy Switching in 2-Player Games
Suppose we have two sets of policies, one for each player:
- Max policies (us): {π1, π2, ..., πM}
- Min policies (them): {φ1, φ2, ..., φN}
These policy sets will often be the same when the players have the same action sets.
The policies encode our knowledge of what the possible effective strategies in the game might be, but we might not know exactly when each strategy will be most effective.
Minimax Policy Switching
Build a game matrix C(s) for the current state s using the game simulator.
Each entry C_{i,j}(s) gives the estimated value (for the max player) of playing a policy pair against one another.
Each value is estimated by averaging across w simulated games.
MaxiMin Switching
Build the game matrix for the current state s using the game simulator, then choose the
MaxiMin policy $\pi_{i^*}$ where $i^* = \arg\max_i \min_j C_{i,j}(s)$
and select action $\pi_{i^*}(s)$.
This can switch between policies based on the state of the game!
MaxiMin Switching
Parameters in the library implementation:
- Policy sets: {π1, π2, ..., πM} and {φ1, φ2, ..., φN}
- Sampling width w: number of simulations per policy pair
- Height/horizon h: horizon used for simulations
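A minimal sketch of maximin policy switching. The call game_sim.play(s, pi, phi, h), which is assumed to simulate one h-horizon game between the two policies from state s and return the max player's total reward, is a hypothetical interface invented for illustration.

    def maximin_switch(game_sim, s, max_policies, min_policies, h, w):
        """Estimate the game matrix C[i][j] by averaging w simulated games of
        (pi_i vs phi_j) from state s, then act with the max policy that has the
        best worst-case estimated value."""
        M, N = len(max_policies), len(min_policies)
        C = [[0.0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                total = sum(game_sim.play(s, max_policies[i], min_policies[j], h)
                            for _ in range(w))
                C[i][j] = total / w
        i_star = max(range(M), key=lambda i: min(C[i]))   # argmax_i min_j C[i][j]
        return max_policies[i_star](s)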