Factored Approaches for MDP & RL (Some Slides taken from Alan Fern’s course)

Presentation transcript:

Factored Approaches for MDP & RL (Some Slides taken from Alan Fern’s course)

Factored MDP/RL Representations
– States are made of features (Boolean vs. continuous)
– Actions modify the features (probabilistically)
  – Representations include Probabilistic STRIPS, 2-time-slice Dynamic Bayes Nets, etc.
– Reward and value functions
  – Representations include ADDs, linear weighted sums of features, etc.
Advantages
– Specification is far easier
– Inference: novel lifted versions of value and policy iteration become possible
  – Bellman backup directly in terms of ADDs
  – Policy gradient approach, where you search directly in the policy space
– Learning: generalization possibilities
  – Q-learning etc. now directly update the factored representation (e.g. the weights of the features), giving implicit generalization
  – Approaches such as FF-HOP can recognize and reuse common substructure

Problems with transition systems
– Transition systems are a great conceptual tool for understanding the differences between the various planning problems
– However, direct manipulation of transition systems tends to be too cumbersome
  – The explicit graph corresponding to a transition system is often very large
  – The remedy is to provide “compact” representations for transition systems
– Start by explicating the structure of the “states”, e.g. states specified in terms of state variables
– Represent actions not as incidence matrices but as functions specified directly in terms of the state variables
  – An action works in any state where some state variables have certain values; when it works, it changes the values of certain (other) state variables

State Variable Models
– The world is made up of states, which are defined in terms of state variables
  – Variables can be Boolean (or multi-valued or continuous)
– States are complete assignments over the state variables
  – So k Boolean state variables can represent how many states? (2^k)
– Actions change the values of the state variables
  – Applicability conditions of actions are also specified as partial assignments over state variables

Blocks world
State variables: Ontable(x), On(x,y), Clear(x), hand-empty, holding(x)
Stack(x,y)
  Prec: holding(x), clear(y)
  Eff: on(x,y), ~clear(y), ~holding(x), hand-empty
Unstack(x,y)
  Prec: on(x,y), hand-empty, clear(x)
  Eff: holding(x), ~clear(x), clear(y), ~hand-empty
Pickup(x)
  Prec: hand-empty, clear(x), ontable(x)
  Eff: holding(x), ~ontable(x), ~hand-empty, ~clear(x)
Putdown(x)
  Prec: holding(x)
  Eff: ontable(x), hand-empty, clear(x), ~holding(x)
Initial state: complete specification of T/F values to state variables
  – By convention, variables with F values are omitted
Goal state: a partial specification of the desired state variable/value combinations
  – Desired values can be both positive and negative
Init: Ontable(A), Ontable(B), Clear(A), Clear(B), hand-empty
Goal: ~clear(B), hand-empty
– All the actions here have only positive preconditions, but this is not necessary
– STRIPS ASSUMPTION: if an action changes a state variable, this must be explicitly mentioned in its effects
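As a rough illustration of the operators on this slide, the sketch below encodes a state as the set of true variables and an action as precondition, add and delete sets. The `Action` type and the helper names `pickup`, `stack` and `apply_action` are hypothetical, and only positive preconditions are handled; the effect lists are copied from the slide as given.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # variables that must be true before the action
    add: FrozenSet[str]     # variables the action makes true
    delete: FrozenSet[str]  # variables the action makes false


def pickup(x: str) -> Action:
    return Action(
        f"Pickup({x})",
        frozenset({"hand-empty", f"clear({x})", f"ontable({x})"}),
        frozenset({f"holding({x})"}),
        frozenset({f"ontable({x})", "hand-empty", f"clear({x})"}),
    )


def stack(x: str, y: str) -> Action:
    return Action(
        f"Stack({x},{y})",
        frozenset({f"holding({x})", f"clear({y})"}),
        frozenset({f"on({x},{y})", "hand-empty"}),
        frozenset({f"clear({y})", f"holding({x})"}),
    )


def apply_action(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """STRIPS assumption: only variables named in the effects change."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


# Init from the slide; variables with value F are simply absent from the set.
init = frozenset({"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "hand-empty"})
s1 = apply_action(init, pickup("A"))
s2 = apply_action(s1, stack("A", "B"))

# Goal from the slide: ~clear(B), hand-empty
goal_holds = "clear(B)" not in s2 and "hand-empty" in s2
```

Each action touches only the handful of variables it names, which is the source of the compactness discussed next.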

Why is the STRIPS representation more compact than explicit transition systems?
– In explicit transition systems, actions are represented as state-to-state transitions, where each action is represented by an incidence matrix of size |S|x|S|
– In the state-variable model, actions are represented only in terms of the state variables whose values they care about and whose values they affect
– Consider a state space of 1024 states. It can be represented by log2(1024) = 10 Boolean state variables. If an action needs variable v1 to be true and makes v7 false, it can be represented by just 2 bits (instead of a 1024x1024 matrix)
  – Of course, if the action has a complicated mapping from states to states, in the worst case the action representation will be just as large
  – The assumption being made here is that actions affect only a small number of state variables
[Figure labels: Sit. Calc, STRIPS rep, Transition rep; First order, Rel/Prop, Atomic]

Factored Representations for MDPs: Actions
– Actions can be represented directly in terms of their effects on the individual state variables (fluents)
  – Write a Bayes network relating the values of the fluents in the state before and after the action
  – Bayes networks representing fluents at different time points are called “Dynamic Bayes Networks”; we look at 2TBNs (2-time-slice dynamic Bayes nets)
  – The CPTs of the BNs can be represented compactly too!
– Go further by using the STRIPS assumption
  – Fluents not affected by the action are not represented explicitly in the model
  – Called the Probabilistic STRIPS Operator (PSO) model

Action CLK

Factored Representations: Reward, Value and Policy Functions
– Reward functions can be represented in factored form too. Possible representations include
  – Decision trees (made up of fluents)
  – ADDs (algebraic decision diagrams)
– Value functions are like reward functions (so they too can be represented similarly)
– The Bellman update can then be done directly using factored representations

SPUDD’s use of ADDs

Direct manipulation of ADDs in SPUDD

Ideas for Efficient Algorithms
– Use heuristic search (and reachability information)
  – LAO*, RTDP
– Use execution and/or simulation
  – “Actual execution”: reinforcement learning (the main motivation for RL is to “learn” the model)
  – “Simulation”: simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)
– Use “factored” representations
  – Factored representations for actions, reward functions, values and policies
  – Directly manipulating factored representations during the Bellman update

Probabilistic Planning
– The competition (IPPC)
– The action language
  – PPDDL was based on PSO
  – A new standard, RDDL, is based on 2-TBN

Not ergodic

Reducing Heuristic Computation Cost by Exploiting Factored Representations
– The heuristic computed for a state might give us an idea about the heuristic value of other “similar” states
  – Similarity can be determined in terms of the state structure
– Exploit the overlapping structure of heuristics for different states
  – E.g. the SAG idea for McLUG
  – E.g. the triangle tables idea for plans (cf. Kolobov)

A Plan is a Terrible Thing to Waste
– Suppose we have a plan: s0, a0, s1, a1, s2, a2, s3, …, an, sG
  – This tells us not just the estimated value of s0, but also of s1, s2, …, sn
  – So we don’t need to compute the heuristic for them again
– Is that all?
  – If we have states and actions in a factored representation, then we can explain exactly which aspects of si are relevant for the plan’s success
  – The “explanation” is a proof of correctness of the plan
    – It can be based on regression (if the plan is a sequence) or on a causal proof (if the plan is partially ordered)
    – The explanation will typically be just a subset of the literals making up the state
  – That means the plan suffix from si may be relevant in many more states that are consistent with that explanation

Triangle Table Memoization
– Use triangle tables / memoization
[Figure: a problem over blocks A, B and C]
– If the above problem is solved, then we don’t need to call FF again for the one below
[Figure: a problem over blocks A and B]

Explanation-based Generalization (of Successes and Failures)
– Suppose we have a plan P that solves a problem [S, G]
– We can first find out which aspects of S this plan actually depends on
  – Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
  – Now you can memoize this plan for just that subset of S
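One way to picture the explanation step for a sequential plan is goal regression. The sketch below is a minimal version under stated assumptions: positive literals only, STRIPS-style actions, and hypothetical names (`Action`, `regress`, `explain`, `plan_applies`). It is meant as an illustration of the idea, not as the procedure used by any particular planner.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def regress(condition: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """Literals that must hold before `action` so `condition` holds after it."""
    if condition & action.delete:
        raise ValueError(f"{action.name} deletes a literal that is needed later")
    return (condition - action.add) | action.pre


def explain(plan: List[Action], goal: Iterable[str]) -> List[FrozenSet[str]]:
    """conditions[i] is what the plan suffix starting at step i depends on."""
    conditions = [frozenset(goal)]
    for action in reversed(plan):
        conditions.insert(0, regress(conditions[0], action))
    return conditions


pickup_a = Action("Pickup(A)",
                  frozenset({"hand-empty", "clear(A)", "ontable(A)"}),
                  frozenset({"holding(A)"}),
                  frozenset({"ontable(A)", "hand-empty", "clear(A)"}))
stack_ab = Action("Stack(A,B)",
                  frozenset({"holding(A)", "clear(B)"}),
                  frozenset({"on(A,B)", "hand-empty"}),
                  frozenset({"clear(B)", "holding(A)"}))

conditions = explain([pickup_a, stack_ab], {"on(A,B)"})
relevant = conditions[0]  # the subset of S the whole plan depends on


def plan_applies(state: FrozenSet[str]) -> bool:
    """The memoized plan can be reused in any state containing `relevant`."""
    return relevant <= state
```

Here the explanation does not mention where B sits, so the same plan can be reused whether B is on the table or on another block.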

Relaxations for Stochastic Planning
– Determinizations can also be used as a basis for heuristics to initialize V for value iteration [mGPT; GOTH etc]
– Heuristics come from relaxation
– We can relax along two separate dimensions:
  – Relax –ve interactions: consider +ve interactions alone, using relaxed planning graphs
  – Relax uncertainty: consider determinizations
  – Or a combination of both!

Solving Determinizations
– If we relax –ve interactions
  – Then compute a relaxed plan
    – Admissible if an optimal relaxed plan is computed
    – Inadmissible otherwise
– If we keep –ve interactions
  – Then use a deterministic planner (e.g. FF/LPG)
    – Inadmissible unless the underlying planner is optimal

Dimensions of Relaxation
[Figure: planners placed along two axes, uncertainty and negative interactions, with increasing consideration along each; labelled points: Relaxed Plan Heuristic, McLUG, FF/LPG]
– Reducing uncertainty: bound the number of stochastic outcomes (stochastic “width”)
– Limited width stochastic planning?

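Dimensions of Relaxation
[Table: rows give the –ve interactions considered; entries in each row are listed in order of increasing uncertainty considered (None, Some, Full)]
None: Relaxed Plan; McLUG
Some: (no entries)
Full: FF/LPG; Limited width stochastic planning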

Expressiveness v. Cost
[Figure: node expansions v. heuristic computation cost; axes: nodes expanded, computation cost; labelled points: h = 0, McLUG, FF-Replan, FF, limited width stochastic planning]

– Factored TD and Q-learning
– Policy search (has to be factored)

Large State Spaces
– When a problem has a large state space we can no longer represent the V or Q functions as explicit tables
– Even if we had enough memory
  – Never enough training data!
  – Learning takes too long
– What to do?
[Slides from Alan Fern]

Function Approximation
– Never enough training data!
  – Must generalize what is learned from one situation to other “similar” new situations
– Idea:
  – Instead of using a large table to represent V or Q, use a parameterized function
    – The number of parameters should be small compared to the number of states (generally exponentially fewer parameters)
  – Learn the parameters from experience
  – When we update the parameters based on observations in one state, our V or Q estimate will also change for other similar states
    – I.e. the parameterization facilitates generalization of experience

Linear Function Approximation
– Define a set of state features $f_1(s), \ldots, f_n(s)$
  – The features are used as our representation of states
  – States with similar feature values will be considered to be similar
– A common approximation is to represent V(s) as a weighted sum of the features (i.e. a linear approximation):
  $\hat{V}_\theta(s) = \theta_0 + \theta_1 f_1(s) + \theta_2 f_2(s) + \cdots + \theta_n f_n(s)$
– The approximation accuracy is fundamentally limited by the information provided by the features
– Can we always define features that allow for a perfect linear approximation?
  – Yes. Assign each state an indicator feature (i.e. the i’th feature is 1 iff the i’th state is present, and $\theta_i$ represents the value of the i’th state)
  – Of course, this requires far too many features and gives no generalization

Example
– Consider a grid problem with no obstacles and deterministic actions U/D/L/R (49 states)
– Features for state s = (x, y): $f_1(s) = x$, $f_2(s) = y$ (just 2 features)
  $V(s) = \theta_0 + \theta_1 x + \theta_2 y$
– Is there a good linear approximation?
  – Yes: $\theta_0 = 10$, $\theta_1 = -1$, $\theta_2 = -1$ (note that the upper right is the origin)
  – $V(s) = 10 - x - y$ subtracts the Manhattan distance from the goal reward

But What If We Change Reward …
  $V(s) = \theta_0 + \theta_1 x + \theta_2 y$
– Is there a good linear approximation?
  – No


But What If…
  $V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z$
– Include a new feature z
  – $z = |3 - x| + |3 - y|$
  – z is the distance to the goal location
– Does this allow a good linear approximation?
  – Yes: $\theta_0 = 10$, $\theta_1 = \theta_2 = 0$, $\theta_3 = -1$
– Feature engineering…
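The grid examples can be checked numerically. The sketch below assumes a 7x7 grid with coordinates 0..6, a goal reward of 10, and a true value equal to the goal reward minus the Manhattan distance to the goal; the changed-reward case is assumed to place the goal at (3, 3), matching the feature z. All function and variable names are hypothetical.

```python
GOAL_REWARD = 10
STATES = [(x, y) for x in range(7) for y in range(7)]  # 49 states


def true_value(x, y, goal):
    """Goal reward minus Manhattan distance to the goal (assumed value function)."""
    gx, gy = goal
    return GOAL_REWARD - (abs(gx - x) + abs(gy - y))


def linear_value(theta, features):
    """Weighted sum of features; features[0] is the constant 1 paired with theta_0."""
    return sum(t * f for t, f in zip(theta, features))


def max_error(theta, feature_fn, goal):
    return max(abs(linear_value(theta, feature_fn(x, y)) - true_value(x, y, goal))
               for x, y in STATES)


def xy_features(x, y):
    return (1, x, y)


def xyz_features(x, y):
    return (1, x, y, abs(3 - x) + abs(3 - y))


# Goal at the origin: V(s) = 10 - x - y is exact.
err_corner = max_error((10, -1, -1), xy_features, goal=(0, 0))

# Goal moved to (3, 3): the same weights on (1, x, y) are now badly off.
err_center_xy = max_error((10, -1, -1), xy_features, goal=(3, 3))

# Adding the distance feature z restores an exact fit with theta_3 = -1.
err_center_z = max_error((10, 0, 0, -1), xyz_features, goal=(3, 3))
```

The second check only shows that one particular weight vector fails. No weights on (1, x, y) can be exact for the centered goal, because the value falls by 1 when stepping from (3, 3) to either (4, 3) or (2, 3), which would require the weight on x to be both -1 and +1.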

Linear Function Approximation
– Define a set of features $f_1(s), \ldots, f_n(s)$
  – The features are used as our representation of states
  – States with similar feature values will be treated similarly
  – More complex functions require more complex features
– Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well
  – How can we do this?
  – Use TD-based RL and somehow update the parameters based on each experience

TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE)
3. Update estimated model
4. Perform TD update for each parameter
5. Goto 2
– What is a “TD update” for a parameter?

Aside: Gradient Descent
– Given a function $f(\theta_1, \ldots, \theta_n)$ of n real values $\theta = (\theta_1, \ldots, \theta_n)$, suppose we want to minimize f with respect to $\theta$
– A common approach to doing this is gradient descent
– The gradient of f at point $\theta$, denoted by $\nabla_\theta f(\theta)$, is an n-dimensional vector that points in the direction where f increases most steeply at point $\theta$
– Vector calculus tells us that $\nabla_\theta f(\theta)$ is just the vector of partial derivatives
  $\nabla_\theta f(\theta) = \left[ \frac{\partial f(\theta)}{\partial \theta_1}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right]$
– We can decrease f by moving in the negative gradient direction
– (This will be used again with graphical model learning)
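A minimal sketch of gradient descent, assuming a fixed step size `alpha` and a caller-supplied gradient function. The quadratic objective and all names are made up for illustration.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=200):
    """Repeatedly move theta a small step in the negative gradient direction."""
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - alpha * g_i for t, g_i in zip(theta, g)]
    return theta


def grad_f(theta):
    """Gradient of the example objective f = (theta_1 - 3)^2 + (theta_2 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_min = gradient_descent(grad_f, [0.0, 0.0])  # approaches (3, -1)
```

With this objective each step shrinks the distance to the minimizer by a constant factor, so a fixed step size is enough; harder objectives typically need a step-size schedule.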

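Aside: Gradient Descent for Squared Error
– Suppose that we have a sequence of states and target values for each state
  – E.g. produced by the TD-based RL loop
– Our goal is to minimize the sum of squared errors between our estimated function and each target value:
  $E = \sum_j E_j$, with $E_j = \frac{1}{2} \left( \hat{V}_\theta(s_j) - v(s_j) \right)^2$
  where $E_j$ is the squared error of example j, $\hat{V}_\theta(s_j)$ is our estimated value for the j’th state, and $v(s_j)$ is the target value for the j’th state
– After seeing the j’th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by:
  $\theta_i \leftarrow \theta_i - \alpha \frac{\partial E_j}{\partial \theta_i} = \theta_i - \alpha \left( \hat{V}_\theta(s_j) - v(s_j) \right) \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}$
  where $\alpha$ is the learning rate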

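Aside: continued
– For a linear approximation function:
  $\hat{V}_\theta(s) = \theta_0 + \theta_1 f_1(s) + \cdots + \theta_n f_n(s)$, so $\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i} = f_i(s_j)$ (this derivative depends on the form of the approximator)
– Thus the update becomes:
  $\theta_i \leftarrow \theta_i + \alpha \left( v(s_j) - \hat{V}_\theta(s_j) \right) f_i(s_j)$
– For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule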

TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE), transitioning from s to s’
3. Update estimated model
4. Perform TD update for each parameter:
   $\theta_i \leftarrow \theta_i + \alpha \left( v(s) - \hat{V}_\theta(s) \right) f_i(s)$
5. Goto 2
– What should we use for the “target value” v(s)?
  – Use the TD prediction based on the next state s’: $v(s) = R(s) + \gamma \hat{V}_\theta(s')$, where $\gamma$ is the discount factor
  – This is the same as the previous TD method, only with approximation
– Note that we are generalizing w.r.t. possibly faulty data (the neighbor’s value may not be correct yet)
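The TD update for a linear approximator can be sketched as below. The update uses the TD prediction R(s) + gamma * V(s') as the target and moves each weight in proportion to its feature. The four-state chain, the indicator features (which reduce to the tabular case, chosen here only so the result is easy to check) and all names are assumptions for illustration; action selection and the model update are left out.

```python
GAMMA = 0.9     # discount factor (assumed value)
ALPHA = 0.5     # learning rate (assumed value)
N_STATES = 4    # hypothetical chain 0 -> 1 -> 2 -> 3 -> terminal


def features(s):
    """Indicator features, one per state (hypothetical choice)."""
    return [1.0 if i == s else 0.0 for i in range(N_STATES)]


def v_hat(theta, s):
    return sum(t * f for t, f in zip(theta, features(s)))


def td_update(theta, s, reward, s_next):
    """theta_i <- theta_i + alpha * (R(s) + gamma * V(s') - V(s)) * f_i(s)."""
    next_value = 0.0 if s_next is None else v_hat(theta, s_next)
    td_error = reward + GAMMA * next_value - v_hat(theta, s)
    return [t + ALPHA * td_error * f for t, f in zip(theta, features(s))]


theta = [0.0] * N_STATES
for _ in range(200):                      # 200 passes along the chain
    for s in range(N_STATES):
        reward = 1.0 if s == N_STATES - 1 else 0.0   # R(s): reward only in state 3
        s_next = s + 1 if s + 1 < N_STATES else None  # None marks termination
        theta = td_update(theta, s, reward, s_next)
```

Early updates to state 2 use a stale estimate for state 3, which is the "possibly faulty data" the slide warns about; the estimates settle as the later states are corrected.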

TD-based RL for Linear Approximators
1. Start with initial parameter values
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE)
3. Update estimated model
4. Perform TD update for each parameter
5. Goto 2
– Step 2 requires a model to select the greedy action
  – For applications such as Backgammon it is easy to get a simulation-based model
  – For others it is difficult to get a good model
– But we can do the same thing for model-free Q-learning

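Q-learning with Linear Approximators
  $\hat{Q}_\theta(s, a) = \theta_0 + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)$ (features are a function of states and actions)
1. Start with initial parameter values
2. Take action a according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE), transitioning from s to s’
3. Perform TD update for each parameter:
   $\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a) \right) f_i(s, a)$
4. Goto 2
– For both Q and V, these algorithms aim at the closest linear approximation to the optimal Q or V; convergence is not guaranteed in general (Q-learning with function approximation can diverge)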
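A small sketch of Q-learning with a linear approximator follows. The chain environment, the indicator features over (state, action) pairs, the epsilon-greedy exploration with a fixed epsilon (not GLIE) and the reward-on-arrival convention are all assumptions made for the example, and every name is hypothetical.

```python
import random

N_STATES, GOAL = 4, 3
ACTIONS = (-1, +1)  # move left, move right


def features(s, a):
    """Indicator feature per (state, action) pair (hypothetical choice)."""
    f = [0.0] * (N_STATES * len(ACTIONS))
    f[s * len(ACTIONS) + ACTIONS.index(a)] = 1.0
    return f


def q_hat(theta, s, a):
    return sum(t * f for t, f in zip(theta, features(s, a)))


def step(s, a):
    """Deterministic chain; reward 1 on reaching the goal, which ends the episode."""
    s_next = min(max(s + a, 0), GOAL)
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL


def choose_action(theta, s, eps, rng):
    """Epsilon-greedy with random tie-breaking."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(q_hat(theta, s, a) for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q_hat(theta, s, a) == best])


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(episodes):
        s, done, n_steps = 0, False, 0
        while not done and n_steps < 100:
            a = choose_action(theta, s, eps, rng)
            s_next, r, done = step(s, a)
            best_next = 0.0 if done else max(q_hat(theta, s_next, b) for b in ACTIONS)
            td_error = r + gamma * best_next - q_hat(theta, s, a)
            theta = [t + alpha * td_error * f
                     for t, f in zip(theta, features(s, a))]
            s, n_steps = s_next, n_steps + 1
    return theta


theta = q_learning()
greedy = [max(ACTIONS, key=lambda a: q_hat(theta, s, a)) for s in range(GOAL)]
```

No model is needed to pick the greedy action here, since the maximization is over the learned Q estimates themselves.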

Example: Tactical Battles in Wargus
– Wargus is a real-time strategy (RTS) game
  – Tactical battles are a key aspect of the game
– RL task: learn a policy to control n friendly agents in a battle against m enemy agents
  – The policy should be applicable to tasks with different sets and numbers of agents
[Figure: example battle scenarios of different sizes]

Policy Gradient Ascent
– Let $\rho(\theta)$ be the expected value of policy $\pi_\theta$
  – $\rho(\theta)$ is just the expected discounted total reward for a trajectory of $\pi_\theta$
  – For simplicity, assume each trajectory starts at a single initial state
– Our objective is to find a $\theta$ that maximizes $\rho(\theta)$
– Policy gradient ascent tells us to iteratively update the parameters via:
  $\theta \leftarrow \theta + \alpha \nabla_\theta \rho(\theta)$
– Problem: $\rho(\theta)$ is generally very complex, and it is rare that we can compute a closed form for its gradient
– We will instead estimate the gradient based on experience

Gradient Estimation
– Concern: computing or estimating the gradient of discontinuous functions can be problematic
– For our example parametric policy, is $\rho(\theta)$ continuous?
– No
  – There are values of $\theta$ where arbitrarily small changes cause the policy to change
  – Since different policies can have different values, this means that changing $\theta$ can cause a discontinuous jump in $\rho(\theta)$

Example: Discontinuous $\rho(\theta)$
– Consider a problem with initial state s and two actions a1 and a2
  – a1 leads to a very large terminal reward R1
  – a2 leads to a very small terminal reward R2
– Fixing $\theta_2$ to a constant, we can plot the ranking assigned to each action by Q and the corresponding value $\rho(\theta)$
[Figure: plots against $\theta_1$, with $\rho(\theta)$ taking the values R1 and R2]
– Discontinuity in $\rho(\theta)$ when the ordering of a1 and a2 changes

Probabilistic Policies
– We would like to avoid policies that change drastically with small parameter changes, leading to discontinuities
– A probabilistic policy $\pi_\theta$ takes a state as input and returns a distribution over actions
  – Given a state s, $\pi_\theta(s, a)$ returns the probability that $\pi_\theta$ selects action a in s
  – Aka mixed policy (not needed for optimality…)
– Note that $\rho(\theta)$ is still well defined for probabilistic policies
  – Now the uncertainty of trajectories comes from both the environment and the policy
  – Importantly, if $\pi_\theta(s, a)$ is continuous relative to changing $\theta$, then $\rho(\theta)$ is also continuous relative to changing $\theta$
– A common form for probabilistic policies is the softmax function, or Boltzmann exploration function:
  $\pi_\theta(s, a) = \frac{\exp(\hat{Q}_\theta(s, a))}{\sum_{a'} \exp(\hat{Q}_\theta(s, a'))}$
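A minimal sketch of a softmax (Boltzmann) policy over linear action scores, contrasted with a greedy policy. The two-action feature table and the function names are hypothetical; the maximum score is subtracted before exponentiating only for numerical stability and does not change the probabilities.

```python
import math


def score(theta, feats):
    """Linear score: weighted sum of the features for one (state, action) pair."""
    return sum(t * f for t, f in zip(theta, feats))


def softmax_policy(theta, feats_by_action):
    """Return {action: probability} for one state."""
    scores = {a: score(theta, f) for a, f in feats_by_action.items()}
    top = max(scores.values())
    exps = {a: math.exp(q - top) for a, q in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def greedy_policy(theta, feats_by_action):
    return max(feats_by_action, key=lambda a: score(theta, feats_by_action[a]))


# Hypothetical features for two actions in a single state.
FEATS = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}

# A tiny parameter change flips the greedy choice outright ...
greedy_before = greedy_policy([-0.01, 0.0], FEATS)
greedy_after = greedy_policy([0.01, 0.0], FEATS)

# ... but moves the softmax probabilities only slightly.
p_before = softmax_policy([-0.01, 0.0], FEATS)["a1"]
p_after = softmax_policy([0.01, 0.0], FEATS)["a1"]
```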

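Empirical Gradient Estimation
– Our first approach to estimating $\nabla_\theta \rho(\theta)$ is to simply compute empirical gradient estimates
– Recall that $\theta = (\theta_1, \ldots, \theta_n)$, so we can compute the gradient by empirically estimating each partial derivative
– So for small $\epsilon$ we can estimate the partial derivatives by
  $\frac{\partial \rho(\theta)}{\partial \theta_i} \approx \frac{\rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \epsilon, \theta_{i+1}, \ldots, \theta_n) - \rho(\theta)}{\epsilon}$
– This requires estimating n+1 values: $\rho(\theta)$ and $\rho(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n)$ for $i = 1, \ldots, n$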

Empirical Gradient Estimation
– How do we estimate the quantities $\rho(\theta)$ and $\rho(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n)$?
– For each set of parameters, simply execute the policy for N trials/episodes and average the values achieved across the trials
– This requires a total of N(n+1) episodes to get a gradient estimate
  – For stochastic environments and policies, N must be relatively large to get good estimates of the true value
  – Often we want to use a relatively large number of parameters
  – Often it is expensive to run episodes of the policy
– So while this can work well in many situations, it is often not a practical approach computationally
– Better approaches try to use the fact that the stochastic policy is differentiable
  – Can get the gradient by just running the current policy multiple times
– (Doable without permanent damage if there is a simulator)
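The empirical (finite-difference) estimate can be sketched as follows: estimate the policy value at the current parameters and at each perturbed parameter vector by averaging N episodes, then divide the differences by epsilon. The single-step problem with two actions, the softmax policy over per-action parameters, and the particular values of epsilon and N are assumptions for the example.

```python
import math
import random

REWARDS = (1.0, 0.0)  # hypothetical single-step problem: rewards of a1 and a2


def run_episode(theta, rng):
    """Sample one action from a softmax over theta and return its reward."""
    top = max(theta)
    weights = [math.exp(t - top) for t in theta]
    action = rng.choices(range(len(theta)), weights=weights)[0]
    return REWARDS[action]


def estimate_rho(theta, n_episodes, rng):
    """Average return over n_episodes runs of the policy."""
    return sum(run_episode(theta, rng) for _ in range(n_episodes)) / n_episodes


def empirical_gradient(theta, eps, n_episodes, rng):
    """Finite differences; costs n_episodes * (len(theta) + 1) episodes in total."""
    base = estimate_rho(theta, n_episodes, rng)
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += eps
        grad.append((estimate_rho(shifted, n_episodes, rng) - base) / eps)
    return grad


grad = empirical_gradient([0.0, 0.0], eps=0.5, n_episodes=20000,
                          rng=random.Random(0))
```

Even in this two-parameter problem the estimate consumes 60,000 episodes, and the noise in each difference is amplified by the division by epsilon, which is why the approach scales poorly with the number of parameters.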

Applications of Policy Gradient Search
– Policy gradient techniques have been used to create controllers for difficult helicopter maneuvers
  – For example, inverted helicopter flight
– A planner called FPG also “won” the 2006 International Planning Competition
  – If you don’t count FF-Replan

Policy Gradient Recap
– When policies have much simpler representations than the corresponding value functions, direct search in policy space can be a good idea
  – Allows us to design complex parametric controllers and optimize details of parameter settings
– For the baseline algorithm, the gradient estimates are unbiased (i.e. they will converge to the right value) but have high variance
  – Can require a large N to get reliable estimates
– OLPOMDP can trade off bias and variance via the discount parameter [Baxter & Bartlett, 2000]
– Can be prone to finding local maxima
  – Many ways of dealing with this, e.g. random restarts

Gradient Estimation: Single Step Problems
– For stochastic policies it is possible to estimate $\nabla_\theta \rho(\theta)$ directly from trajectories of just the current policy $\pi_\theta$
  – Idea: take advantage of the fact that we know the functional form of the policy
– First consider the simplified case where all trials have length 1
  – For simplicity, assume each trajectory starts at a single initial state and the reward depends only on the action choice
  – $\rho(\theta)$ is just the expected reward of the action selected by $\pi_\theta$:
    $\rho(\theta) = \sum_a \pi_\theta(s_0, a) R(a)$
    where $s_0$ is the initial state and $R(a)$ is the reward of action a
– The gradient of this becomes
  $\nabla_\theta \rho(\theta) = \sum_a \nabla_\theta \pi_\theta(s_0, a) R(a)$
– How can we estimate this by just observing the execution of $\pi_\theta$?

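Gradient Estimation: Single Step Problems
– Rewriting:
  $\nabla_\theta \rho(\theta) = \sum_a \pi_\theta(s_0, a) \frac{\nabla_\theta \pi_\theta(s_0, a)}{\pi_\theta(s_0, a)} R(a) = \sum_a \pi_\theta(s_0, a) \, g(s_0, a) \, R(a)$
  where $g(s_0, a) = \nabla_\theta \log \pi_\theta(s_0, a)$ (we can get a closed form for $g(s_0, a)$)
– The gradient is just the expected value of $g(s_0, a) R(a)$ over execution trials of $\pi_\theta$
  – Can estimate by executing $\pi_\theta$ for N trials and averaging samples:
    $\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} g(s_0, a_j) R(a_j)$
    where $a_j$ is the action selected by the policy on the j’th episode
  – Only requires executing $\pi_\theta$ for a number of trials that need not depend on the number of parameters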

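Gradient Estimation: General Case
– So for the case of length-1 trajectories we got:
  $\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} g(s_0, a_j) R(a_j)$
– For the general case, where trajectories have length greater than one and the reward depends on state, we can do some work and get:
  $\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} g(s_{j,t}, a_{j,t}) R_j(s_{j,t})$
  where N is the number of trajectories of the current policy, $T_j$ is the length of trajectory j, $s_{j,t}$ is the t’th state of the j’th episode, $a_{j,t}$ is the t’th action of episode j, and $R_j(s_{j,t})$ is the observed total reward in trajectory j from step t to the end
– The derivation of this is straightforward but messy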

How to interpret gradient expression?
  $\nabla_\theta \rho(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} g(s_{j,t}, a_{j,t}) R_j(s_{j,t})$
  – $g(s_{j,t}, a_{j,t})$: direction to move the parameters in order to increase the probability that the policy selects $a_{j,t}$ in state $s_{j,t}$
  – $R_j(s_{j,t})$: total reward observed after taking $a_{j,t}$ in state $s_{j,t}$
– So the overall gradient is a reward-weighted combination of individual gradient directions
  – For large $R_j(s_{j,t})$, it will increase the probability of $a_{j,t}$ in $s_{j,t}$
  – For negative $R_j(s_{j,t})$, it will decrease the probability of $a_{j,t}$ in $s_{j,t}$
– Intuitively, this increases the probability of taking actions that are typically followed by good reward sequences
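The reward-weighted sum can be computed from stored trajectories roughly as below. Each trajectory is assumed to be a list of (state, action, reward) triples, `g` is a caller-supplied function returning the gradient direction for a state-action pair, and the reward weight is the undiscounted reward from step t to the end of the trajectory. The function name and the toy `g` are hypothetical.

```python
def policy_gradient_estimate(trajectories, g, n_params):
    """Average over trajectories of sum_t g(s_t, a_t) * (reward from t to end)."""
    total = [0.0] * n_params
    for trajectory in trajectories:
        reward_to_go = 0.0
        # Walk backwards so reward_to_go accumulates the rewards from step t on.
        for state, action, reward in reversed(trajectory):
            reward_to_go += reward
            direction = g(state, action)
            total = [acc + d * reward_to_go for acc, d in zip(total, direction)]
    return [x / len(trajectories) for x in total]


def toy_g(state, action):
    """Toy one-parameter direction: +1 for action 0, -1 for action 1."""
    return [1.0 if action == 0 else -1.0]


trajectories = [
    [(0, 0, 1.0), (1, 1, 2.0)],  # rewards to go: 3.0, then 2.0
    [(0, 1, 0.0)],               # reward to go: 0.0
]
estimate = policy_gradient_estimate(trajectories, toy_g, n_params=1)
```

In the first trajectory the action followed by total reward 3.0 outweighs the one followed by 2.0, so the estimate points towards making action 0 more likely.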

Basic Policy Gradient Algorithm
– Repeat until stopping condition:
  1. Execute $\pi_\theta$ for N trajectories while storing the state, action, reward sequences
  2. $\Delta \leftarrow \frac{1}{N} \sum_{j=1}^{N} \sum_{t=1}^{T_j} g(s_{j,t}, a_{j,t}) R_j(s_{j,t})$
  3. $\theta \leftarrow \theta + \alpha \Delta$
– One disadvantage of this approach is the small number of updates per amount of experience
  – It also requires a notion of trajectory rather than an infinite sequence of experience
– Online policy gradient algorithms perform updates after each step in the environment (and often learn faster)

Online Policy Gradient (OLPOMDP)
– Repeat forever:
  1. Observe state s
  2. Draw action a according to distribution $\pi_\theta(s)$
  3. Execute a and observe reward r
  4. $e \leftarrow \beta e + g(s, a)$ ;; discounted sum of gradient directions
  5. $\theta \leftarrow \theta + \alpha r e$
– Performs a policy update at each time step and executes indefinitely
  – This is the OLPOMDP algorithm [Baxter & Bartlett, 2000]

Interpretation
– Repeat forever:
  1. Observe state s
  2. Draw action a according to distribution $\pi_\theta(s)$
  3. Execute a and observe reward r
  4. $e \leftarrow \beta e + g(s, a)$ ;; discounted sum of gradient directions
  5. $\theta \leftarrow \theta + \alpha r e$
– Step 4 computes an “eligibility trace” e
  – Discounted sum of gradients over previous state-action pairs
  – Points in the direction of parameter space that increases the probability of taking more recent actions in more recent states
– For positive rewards, step 5 will increase the probability of recent actions; for negative rewards, it will decrease it
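The online loop can be sketched on a toy problem: keep a trace e that is decayed by beta and incremented by g(s, a) at each step, then move the parameters by alpha * r * e. The single-state, two-action environment, the softmax policy with one parameter per action, and the values of alpha, beta and the step count are assumptions for the example; a fixed number of steps stands in for "repeat forever".

```python
import math
import random


def softmax(theta):
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def olpomdp_bandit(rewards, steps=5000, alpha=0.05, beta=0.5, seed=0):
    """Online policy gradient on a one-state problem with fixed action rewards."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    trace = [0.0] * len(rewards)          # eligibility trace e
    for _ in range(steps):
        probs = softmax(theta)            # steps 1-2: (single) state, draw action
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rewards[action]          # step 3: execute and observe reward
        # g(s, a) for a softmax with one parameter per action: indicator minus prob.
        g = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
        trace = [beta * e + g_i for e, g_i in zip(trace, g)]         # step 4
        theta = [t + alpha * reward * e for t, e in zip(theta, trace)]  # step 5
    return theta


final_probs = softmax(olpomdp_bandit((1.0, 0.0)))
```

With these settings the probability of the rewarded action should end up close to 1, although the exact trajectory depends on the random seed.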

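Computing the Gradient of Policy
– Both algorithms require computation of $g(s, a) = \nabla_\theta \log \pi_\theta(s, a)$
– For the Boltzmann distribution with linear approximation we have:
  $\pi_\theta(s, a) = \frac{\exp(\hat{Q}_\theta(s, a))}{\sum_{a'} \exp(\hat{Q}_\theta(s, a'))}$
  where $\hat{Q}_\theta(s, a) = \theta_0 + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)$
– Here the partial derivatives needed for g(s,a) are:
  $\frac{\partial \log \pi_\theta(s, a)}{\partial \theta_i} = f_i(s, a) - \sum_{a'} \pi_\theta(s, a') f_i(s, a')$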
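For a Boltzmann policy over linear scores, each component of g(s, a), the gradient of log pi(s, a), is the feature of the chosen action minus the policy-weighted average of that feature over all actions. The sketch below computes this and compares it with a central finite difference of log pi as a sanity check. The feature table, parameter values and names are hypothetical, and the constant term theta_0 is omitted because it cancels in the softmax.

```python
import math


def policy(theta, feats_by_action):
    """Softmax over linear scores; returns {action: probability}."""
    scores = {a: sum(t * x for t, x in zip(theta, f))
              for a, f in feats_by_action.items()}
    top = max(scores.values())
    exps = {a: math.exp(q - top) for a, q in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def grad_log_policy(theta, feats_by_action, action):
    """g_i = f_i(s, a) - sum over a' of pi(s, a') * f_i(s, a')."""
    probs = policy(theta, feats_by_action)
    expected = [sum(probs[b] * feats_by_action[b][i] for b in feats_by_action)
                for i in range(len(theta))]
    return [feats_by_action[action][i] - expected[i] for i in range(len(theta))]


def numeric_grad(theta, feats_by_action, action, h=1e-5):
    """Central finite difference of log pi, used only as a check."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((math.log(policy(up, feats_by_action)[action])
                     - math.log(policy(down, feats_by_action)[action])) / (2 * h))
    return grad


FEATS = {"a1": [1.0, 2.0], "a2": [0.5, -1.0]}  # hypothetical features in one state
THETA = [0.3, -0.2]
analytic = grad_log_policy(THETA, FEATS, "a1")
numeric = numeric_grad(THETA, FEATS, "a1")
```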