Discretization
Pieter Abbeel, UC Berkeley EECS



Markov Decision Process
Assumption: the agent gets to observe the state.
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Markov Decision Process (S, A, T, R, H)
Given:
S: set of states
A: set of actions
T: S × A × S × {0, 1, ..., H} → [0, 1], T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
R: S × A × S × {0, 1, ..., H} → ℝ, R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
H: horizon over which the agent will act
Goal: find π: S × {0, 1, ..., H} → A that maximizes the expected sum of rewards, i.e., max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ].

Value Iteration
Idea: V*_i(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps.
Algorithm:
Start with V*_0(s) = 0 for all s.
For i = 1, ..., H, for all states s ∈ S:
V*_i(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]
Action selection: π*_i(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]
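A minimal sketch of this finite-horizon value iteration for a tabular MDP, assuming time-invariant transition and reward arrays T and R (the slides allow time-dependent T_t, R_t); the array shapes and names are illustrative assumptions, not part of the slides:

    import numpy as np

    def value_iteration(T, R, H):
        """Finite-horizon value iteration for a tabular MDP.
        T: (S, A, S) array, T[s, a, s2] = P(s_{t+1} = s2 | s_t = s, a_t = a)
        R: (S, A, S) array, R[s, a, s2] = reward for that transition
        Returns V[i] (value with i steps to go) and greedy policies pi[i]."""
        S, A, _ = T.shape
        V = np.zeros((H + 1, S))               # V[0](s) = 0 for all s
        pi = np.zeros((H, S), dtype=int)
        for i in range(1, H + 1):
            # Q[s, a] = sum_{s2} T[s, a, s2] * (R[s, a, s2] + V[i-1][s2])
            Q = np.einsum('sax,sax->sa', T, R + V[i - 1][None, None, :])
            V[i] = Q.max(axis=1)
            pi[i - 1] = Q.argmax(axis=1)       # best action with i steps to go
        return V, pi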

Continuous State Spaces
S = a continuous set.
Value iteration becomes impractical, as it requires computing, for all states s ∈ S:
V*_i(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]

Markov chain approximation to continuous state space dynamics model ("discretization")
Original MDP: (S, A, T, R, H)
Discretized MDP:
Grid the state space: the vertices are the discrete states.
Reduce the action space to a finite set. Sometimes this is not needed:
- when the Bellman back-up can be computed exactly over the continuous action space;
- when we know only certain controls are part of the optimal policy (e.g., when we know the problem has a "bang-bang" optimal solution).
Transition function: see the next few slides.

Discretization Approach A: Deterministic Transition onto Nearest Vertex (0th Order Approximation)
Discrete states: {ξ1, ..., ξ6}.
[Figure: from a discrete state under action a, the probability mass of the next-state distribution (e.g., 0.1, 0.3, 0.4, 0.2) is assigned to the nearest vertices ξ1, ..., ξ6.]
Similarly define transition probabilities for all ξi.
⇒ Discrete MDP just over the states {ξ1, ..., ξ6}, which we can solve with value iteration.
If a (state, action) pair can result in infinitely many (or very many) different next states: sample next states from the next-state distribution.
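A sketch of how such a discretized transition model could be estimated by sampling next states and snapping each sample onto its nearest vertex; the helper sample_next_state and the array layout are illustrative assumptions, not part of the slides:

    import numpy as np

    def build_nearest_vertex_T(vertices, actions, sample_next_state, n_samples=1000):
        """Approach A: deterministic transition onto the nearest vertex.
        vertices: (K, d) array of discrete states xi_1, ..., xi_K.
        sample_next_state(s, a): draws one next state from the continuous dynamics.
        Returns T_bar with T_bar[i, a, j] ~ P(nearest vertex of s' is xi_j | s = xi_i, a)."""
        K = len(vertices)
        T_bar = np.zeros((K, len(actions), K))
        for i, xi in enumerate(vertices):
            for a_idx, a in enumerate(actions):
                for _ in range(n_samples):
                    s_next = np.asarray(sample_next_state(xi, a))
                    # snap the sampled next state onto the nearest discrete state
                    j = np.argmin(np.linalg.norm(vertices - s_next, axis=1))
                    T_bar[i, a_idx, j] += 1.0 / n_samples
        return T_bar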

Discretization Approach B: Stochastic Transition onto Neighboring Vertices (1st Order Approximation)
Discrete states: {ξ1, ..., ξ12}.
[Figure: the next state s' reached from s under action a lies inside a cell of the grid ξ1, ..., ξ12; the transition is represented stochastically onto the neighboring vertices with interpolation weights pA, pB, pC.]
If the dynamics are stochastic: repeat the procedure to account for all possible transitions and weight accordingly.
The decomposition need not be triangular; other ways of selecting the neighbors that contribute can be used. "Kuhn triangulation" is a particular choice that allows for efficient computation of the weights pA, pB, pC, also in higher dimensions.
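As a concrete non-triangular instance of spreading a next state over neighboring vertices, here is a sketch of bilinear interpolation weights on a regular 2D grid; the function name and grid parameters are illustrative assumptions (the returned grid index pairs would still need to be mapped to flat discrete-state indices):

    import numpy as np

    def bilinear_weights(s, origin, h, shape):
        """Spread a continuous 2D state s over the 4 surrounding grid vertices.
        origin: position of vertex (0, 0); h: grid spacing; shape: (nx, ny) vertices.
        Returns ((ix, iy), weight) pairs; the weights are >= 0 and sum to 1, so they
        can serve directly as transition probabilities onto the vertices."""
        u = (np.asarray(s, dtype=float) - np.asarray(origin, dtype=float)) / h
        cell = np.clip(np.floor(u).astype(int), 0, np.asarray(shape) - 2)
        ix, iy = int(cell[0]), int(cell[1])
        fx = float(np.clip(u[0] - ix, 0.0, 1.0))   # fractional position inside the cell
        fy = float(np.clip(u[1] - iy, 0.0, 1.0))
        return [((ix,     iy),     (1 - fx) * (1 - fy)),
                ((ix + 1, iy),     fx * (1 - fy)),
                ((ix,     iy + 1), (1 - fx) * fy),
                ((ix + 1, iy + 1), fx * fy)]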

Discretization: Our Status
Have seen two ways to turn a continuous state-space MDP into a discrete state-space MDP.
When we solve the discrete state-space MDP, we find a policy and a value function for the discrete states. They are optimal for the discrete MDP, but typically not for the original MDP.
Remaining questions:
- How to act when in a state that is not in the discrete state set?
- How close to optimal are the obtained policy and value function?

How to Act (i): 0-step Lookahead
For a non-discrete state s, choose the action based on the policy at nearby discrete states.
Nearest neighbor: take the action of the nearest discrete state, π(s) = π(ξNN(s)).
(Stochastic) interpolation: sample a nearby discrete state ξi with probability given by its interpolation weight, and take π(ξi).

How to Act (ii): 1-step Lookahead
Use the value function V found for the discrete MDP.
Nearest neighbor: π(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V(ξNN(s')) ]
(Stochastic) interpolation: π(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + Σ_i p_i(s') V(ξi) ]
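A sketch of this 1-step lookahead action selection with an interpolated value function; the helpers sample_next_state, reward, and interp_weights (mapping a state to (discrete-state index, weight) pairs) are assumptions standing in for the dynamics model and the chosen interpolation scheme:

    import numpy as np

    def one_step_lookahead(s, actions, V, sample_next_state, reward, interp_weights,
                           n_samples=100):
        """Pick the action maximizing an estimate of E[R(s, a, s') + V_hat(s')],
        where V_hat interpolates the discrete-MDP value function V at s'."""
        best_a, best_q = None, -np.inf
        for a in actions:
            q = 0.0
            for _ in range(n_samples):                   # Monte Carlo over next states
                s_next = sample_next_state(s, a)
                v_hat = sum(w * V[idx] for idx, w in interp_weights(s_next))
                q += (reward(s, a, s_next) + v_hat) / n_samples
            if q > best_q:
                best_a, best_q = a, q
        return best_a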

How to Act (iii): n-step Lookahead
Think about how you could do this for n-step lookahead.
Why might large n not be practical in most cases?

Discretization: Example 3
Discrete states: {ξ1, ..., ξ6}.
[Figure: the state space is gridded into six regions, one per discrete state ξ1, ..., ξ6.]
After entering a region, the state gets uniformly reset to any state from that region. [Chow and Tsitsiklis, 1991]

Discretization: Example 3 (ctd)
Discretization results in an MDP similar to the one in example 2.
Main difference: the transition probabilities are computed based on a region rather than on the discrete states.
Can this be thought of as value function approximation? Not entirely clear to me.

Example: Double Integrator, Minimum Time
Continuous-time dynamics: \ddot{q} = u, with |u| ≤ 1.
Objective: reach the origin in minimum time.
Can be solved analytically: the optimal policy is bang-bang; the control system should accelerate maximally towards the origin until a critical point, at which it should hit the brakes in order to come perfectly to rest at the origin. [See Tedrake 6.6.3 for further details.]
TODO: incorporate the analytical solution of the differential equation into the experiments [rather than forward Euler integration].
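A minimal forward-Euler rollout of the double integrator under the bang-bang switching-curve policy sketched above (switching curve q = -0.5 \dot{q} |\dot{q}|); the step size and tolerance follow the simulation settings mentioned a few slides below, the start state is an arbitrary illustrative choice, and, as the TODO notes, the analytical ODE solution would be preferable to forward Euler:

    import numpy as np

    def bang_bang_u(q, qdot):
        """Minimum-time policy for qddot = u with |u| <= 1."""
        sigma = q + 0.5 * qdot * abs(qdot)     # sign of the switching function
        if sigma > 0:
            return -1.0
        if sigma < 0:
            return 1.0
        return -float(np.sign(qdot))           # on the switching curve: brake to rest

    def simulate(q0, qdot0, dt=0.01, t_max=20.0, tol=0.01):
        """Forward Euler rollout until |q| and |qdot| are both within tol of zero."""
        q, qdot, traj = q0, qdot0, [(q0, qdot0)]
        for _ in range(int(t_max / dt)):
            if abs(q) < tol and abs(qdot) < tol:
                break
            u = bang_bang_u(q, qdot)
            q, qdot = q + dt * qdot, qdot + dt * u   # Euler step of q' = qdot, qdot' = u
            traj.append((q, qdot))
        return np.array(traj)

    trajectory = simulate(1.0, 0.0)            # start at rest, one unit from the origin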

Example: Double Integrator, Minimum Time: Optimal Solution

Example: Double Integrator, Minimum Time
[Figure panels: optimal; Kuhn triangulation, h = 1; Kuhn triangulation, h = 0.1; Kuhn triangulation, h = 0.02. Note: overlay plots?]

Resulting Cost, Kuhn Triangulation
[Note: make into a 2D plot; attach code with slides.]
Green = continuous-time optimal policy for the minimum-time problem.
For the simulation we used dt = 0.01 and a goal area within 0.01 of zero for q and \dot{q}. This results in the continuous-time optimal policy not being exactly optimal for the discrete-time case.

Example: Double Integrator, Quadratic Cost
Dynamics: \ddot{q} = u.
Cost function: quadratic in the state and the control input.

0th Order Interpolation, 1-Step Lookahead for Action Selection: Trajectories
[Figure panels: optimal; nearest neighbor, h = 1; nearest neighbor, h = 0.1; nearest neighbor, h = 0.02; dt = 0.1. Note: overlay plots?]

0th Order Interpolation, 1-Step Lookahead for Action Selection: Resulting Cost

1st Order Interpolation, 1-Step Lookahead for Action Selection: Trajectories
[Figure panels: optimal; Kuhn triangulation, h = 1; Kuhn triangulation, h = 0.1; Kuhn triangulation, h = 0.02. Note: overlay plots?]

1st Order Interpolation, 1-Step Lookahead for Action Selection: Resulting Cost

Discretization Quality Guarantees
Typical guarantees:
Assume smoothness of the cost function and the transition model.
Then, for h → 0, the discretized value function will approach the true value function.
To obtain a guarantee about the resulting policy, combine the above with a general result about MDPs: a one-step lookahead policy based on a value function V that is close to V* attains value close to V*.

Quality of Value Function Obtained from Discrete MDP: Proof Techniques
Chow and Tsitsiklis, 1991: show that one discretized back-up is close to one "complete" back-up, then show that the sequence of back-ups is also close.
Kushner and Dupuis, 2001: show that sample paths in the discrete stochastic MDP approach sample paths in the continuous (deterministic) MDP [also proofs for the stochastic continuous case, a bit more complex].
Function approximation based proof (see later slides for what is meant by "function approximation"). Great descriptions: Gordon, 1995; Tsitsiklis and Van Roy, 1996.

Example result (Chow and Tsitsiklis, 1991)

Value Iteration with Function Approximation
Provides an alternative derivation and interpretation of the discretization methods we have covered in this set of slides:
Start with V*_0(ξ) = 0 for all ξ.
For i = 1, ..., H, for all states ξ in the discrete state set:
V*_i(ξ) = max_a Σ_{s'} T(ξ, a, s') [ R(ξ, a, s') + V̂*_{i-1}(s') ],
where V̂*_{i-1} is the function-approximation extension of V*_{i-1} from the discrete states to all of S:
0th order function approximation: V̂*_{i-1}(s') = V*_{i-1}(ξNN(s')) (nearest neighbor).
1st order function approximation: V̂*_{i-1}(s') = Σ_j p_j(s') V*_{i-1}(ξj) (interpolation over neighboring vertices).
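A sketch of this discretized value iteration, where each back-up evaluates the previous value function at off-grid next states through the interpolation (function approximation) step; sample_next_state, reward, and interp_weights are the same assumed helpers as in the earlier sketches and are not defined in the slides:

    import numpy as np

    def discretized_value_iteration(vertices, actions, sample_next_state, reward,
                                    interp_weights, H, n_samples=100):
        """Value iteration over the discrete states; the choice of interp_weights
        (nearest neighbor vs. barycentric) gives 0th vs. 1st order approximation."""
        K = len(vertices)
        V = np.zeros(K)                              # V_0 = 0 for all discrete states
        for _ in range(H):
            V_new = np.empty(K)
            for i, xi in enumerate(vertices):
                q_best = -np.inf
                for a in actions:
                    q = 0.0
                    for _ in range(n_samples):       # Monte Carlo estimate of the back-up
                        s_next = sample_next_state(xi, a)
                        v_hat = sum(w * V[j] for j, w in interp_weights(s_next))
                        q += (reward(xi, a, s_next) + v_hat) / n_samples
                    q_best = max(q_best, q)
                V_new[i] = q_best
            V = V_new
        return V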

Discretization as Function Approximation
0th order function approximation builds a piecewise constant approximation of the value function.
1st order function approximation builds a piecewise linear (over "triangles") approximation of the value function.

Kuhn Triangulation
Allows efficient computation of the vertices participating in a point's barycentric coordinate system and of the convex interpolation weights (a.k.a. the barycentric coordinates).
See Munos and Moore, 2001 for further details.
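A sketch of the standard construction on a regular grid, following the description above: sort the fractional coordinates of the query point to identify the simplex of the Kuhn triangulation that contains it, then read off the barycentric weights from the sorted coordinates. The function name and grid representation are illustrative assumptions:

    import numpy as np

    def kuhn_weights(s, origin, h):
        """Barycentric coordinates of a point under the Kuhn triangulation of a grid.
        s: d-dimensional continuous state; origin: grid origin; h: grid spacing.
        Returns (grid_index_tuple, weight) pairs for the d+1 vertices of the simplex
        containing s; the weights are nonnegative and sum to 1."""
        u = (np.asarray(s, dtype=float) - np.asarray(origin, dtype=float)) / h
        base = np.floor(u).astype(int)              # base vertex of the containing cell
        frac = u - base                             # fractional coordinates in [0, 1)
        order = np.argsort(-frac)                   # axes sorted by decreasing fraction
        x_sorted = np.append(frac[order], 0.0)      # pad so the last difference works out

        vertices = [tuple(base)]
        weights = [1.0 - x_sorted[0]]
        v = base.copy()
        for k in range(len(u)):
            v = v.copy()
            v[order[k]] += 1                        # walk up the simplex one axis at a time
            vertices.append(tuple(v))
            weights.append(x_sorted[k] - x_sorted[k + 1])
        return list(zip(vertices, weights))

For example, kuhn_weights([0.3, 0.7], origin=[0.0, 0.0], h=1.0) returns the three vertices (0, 0), (0, 1), (1, 1) with weights 0.3, 0.4, 0.3, which indeed interpolate back to (0.3, 0.7).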

Kuhn triangulation (from Munos and Moore)

Continuous Time
One might want to discretize time in a variable way, such that one discrete-time transition roughly corresponds to a transition into a neighboring grid point/region.
Discounting: δt depends on the state and action.
See, e.g., Munos and Moore, 2001 for details.
Note: numerical methods research refers to this connection between the time and space discretizations as the CFL (Courant-Friedrichs-Lewy) condition. Searching for this term will give you more background information.
1-nearest-neighbor tends to be especially sensitive to having the correct match between the time and space scales. [Indeed, with a mismatch between time and space, 1-nearest-neighbor might end up mapping many states to only transition to themselves, no matter which action is taken.]
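One simple way to respect this CFL-style matching is to tie the time step to the grid spacing and the local speed of the dynamics, as in this sketch; the dynamics function f and the bounds are illustrative assumptions:

    import numpy as np

    def variable_time_step(f, s, a, h, dt_max=0.1, eps=1e-6):
        """Pick dt(s, a) so that one transition moves roughly one grid cell:
        ||f(s, a)|| * dt ~ h. The per-transition discount is then gamma ** dt."""
        speed = np.linalg.norm(f(s, a))
        return min(dt_max, h / (speed + eps))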

Nearest neighbor quickly degrades when the time and space scales are mismatched
[Figure panels: dt = 0.01; dt = 0.1; h = 0.1; h = 0.02. Note: overlay plots?]