
1 Reinforcement Learning (2) Bob Durrant School of Computer Science University of Birmingham (Slides: Dr Ata Kabán)

2 Recall
– Policy: what to do
– Reward: what is good
– Value: what is good because it predicts reward
– Model: what follows what
– Reinforcement Learning: learning what to do from interactions with the environment

3 Recall: Markov Decision Process
– r_t and s_{t+1} depend only on the current state (s_t) and action (a_t)
– Goal: get as much eventual reward as possible, no matter from which state you start off.

4 Today’s lecture
– We recall the formalism for the deterministic case
– We reformulate the formalism for the non-deterministic case
– We learn about:
  – Bellman equations
  – Optimal policy
  – Policy iteration
  – Q-learning in a nondeterministic environment

5 “What to learn” Learn an action policy so that it maximises the expected eventual reward from each state. Learn this from interaction examples, i.e. data of the form ((s,a),r).

6 Notations used: We assume we are at a time t. Recall & summarize the other notations as well:
– Policy: π. Remember, in the deterministic case π(s) is an action. In the non-deterministic case π(s) is a random variable, i.e. we can only talk about π(a|s), which is the probability of doing action a in state s.
– State values under a policy: V^π(s)
– Values of state-action pairs (i.e. Q-values): Q(s,a)
– State transitions: the next state depends on the current state and current action. ‘The state that deterministically follows s if action a is taken’: δ(s,a). Now the state transitions may also be probabilistic: then, ‘the probability that s’ follows s if action a is taken’ is p(s’|s,a).
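(Aside, not on the original slide: one concrete way to hold these objects in code is as nested dictionaries. The state and action names below are invented purely for illustration.)

```python
# Hypothetical representation of the notation above (all names are made up).
# Policy pi(a|s): probability of taking action a in state s.
pi = {
    "s1": {"left": 0.5, "right": 0.5},   # stochastic policy in s1
    "s2": {"left": 1.0},                 # deterministic in s2: pi(s2) = "left"
}

# Transition model p(s'|s,a): probability that s' follows s when a is taken in s.
p = {
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},   # non-deterministic transition
    ("s1", "left"):  {"s1": 1.0},
    ("s2", "left"):  {"s1": 1.0},              # deterministic: delta(s2, "left") = s1
}
```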

7 State value function How much reward can I expect to accumulate from state s if I follow policy π?

V^π(s) = Σ_a π(a|s) [ r(s,a) + γ Σ_{s’} p(s’|s,a) V^π(s’) ]

This is called the Bellman equation. It is a linear system with a unique solution.
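Because the equation is linear in the state values, a small MDP can be evaluated by solving the linear system directly. Below is a minimal sketch in Python, assuming an invented 3-state chain (its rewards and transitions are illustrative, not taken from the slides).

```python
import numpy as np

# Policy evaluation by solving V = R + gamma * P V, i.e. V = (I - gamma*P)^(-1) R.
# Invented chain: s1 -> s2 -> s3 (absorbing), reward 100 earned when leaving s2.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],        # state-to-state transition probabilities under the policy
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 100.0, 0.0])       # expected immediate reward in each state under the policy

V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)                              # [ 90. 100.   0.]
```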

8 Example Compute V* for π*:
V*(s6) = 100 + 0.9*0 = 100
V*(s5) = 0 + 0.9*100 = 90
V*(s4) = 0 + 0.9*90 = 81
Compute V^α for the random policy π_α:
V^α(s6) = 0.5*(100 + 0.9*0) + 0.5*(0 + 0.9*0) = 50
V^α(s5) = 0.66*(0 + 0.9*50) + 0.33*(0 + 0.9*0) = 30
V^α(s4) = 0.5*(0 + 0.9*30) + 0.5*(0 + 0.9*0) = 13.5
Etc…
If computed for all states, then start again and keep iterating till the values converge. Btw, has p(s’|s,a) disappeared?
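A small sketch of the iterative computation above. The grid world from the slide is not reproduced here; instead I assume a simple chain s4 → s5 → s6 → goal with reward 100 on the final step, which is consistent with the values 81, 90, 100 worked out for π*.

```python
# Iterative evaluation of V* for the optimal (deterministic) policy on an
# assumed chain s4 -> s5 -> s6 -> goal; only the final step pays reward 100.
gamma = 0.9
next_state = {"s4": "s5", "s5": "s6", "s6": "goal", "goal": "goal"}
reward     = {"s4": 0.0,  "s5": 0.0,  "s6": 100.0, "goal": 0.0}

V = {s: 0.0 for s in next_state}
for _ in range(100):                  # keep sweeping until the values converge
    V = {s: reward[s] + gamma * V[next_state[s]] for s in next_state}

print(V)                              # approximately {'s4': 81, 's5': 90, 's6': 100, 'goal': 0}
```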

9 What is good about it? For any policy, the Bellman equation has a unique solution, so it has a unique solution for the optimal policy as well. Denote this by V*. The optimal policy is the one we want to learn. Denote it by π*. If we could learn V*, then with one look-ahead we could compute π*:

π*(s) = argmax_a [ r(s,a) + γ Σ_{s’} p(s’|s,a) V*(s’) ]

Q: How to learn V*? It depends on π*, which is unknown as well.
A: Iterate and improve on both in each iteration until converging to V* and an associated π*. This is called (generalised) policy iteration.

10 Generalised Policy Iteration [figure: geometric illustration]
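A minimal sketch of the alternation, on an invented 2-state, 2-action MDP (all numbers are illustrative): evaluate the current policy exactly, act greedily with respect to its value function, and repeat until the policy stops changing.

```python
import numpy as np

# Generalised policy iteration on an invented 2-state, 2-action MDP.
gamma = 0.9
n_states = 2
p = np.array([[[0.8, 0.2], [0.0, 1.0]],   # p[s, a] = distribution over next states
              [[1.0, 0.0], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                 # r[s, a] = expected immediate reward
              [5.0, 2.0]])

policy = np.zeros(n_states, dtype=int)              # start from an arbitrary policy
while True:
    # Policy evaluation: solve V = r_pi + gamma * P_pi V for the current policy.
    P_pi = p[np.arange(n_states), policy]           # row s is p(. | s, pi(s))
    r_pi = r[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to V (one-step look-ahead).
    q = r + gamma * np.einsum('san,n->sa', p, V)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):          # stable policy, so stop
        break
    policy = new_policy

print(policy, V)
```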

11 Before going on – what is the meaning of ‘optimal policy’ more exactly? A policy π is said to be better than another policy π’ if V^π(s) ≥ V^π’(s) for all states s. In an MDP there always exists at least one policy which is at least as good as all others. This is called the ‘optimal policy’. Any policy which is greedy with respect to V* is an optimal policy.

12 What is bad about it? We need to know the model of the environment for doing policy iteration.
– i.e. we need to know the state transitions (‘what follows what’)
We need to be able to look one step ahead
– i.e. to ‘try out’ all actions in order to choose the best one
– In some applications this is not feasible
– In some others it is
– Can you think of any examples? A fierce battle? An Internet crawler?

13 Looking ahead… [figure: backup diagram]

14 ‘The other route:’ Action value function How much eventual reward can I expect to get if taking action a from state s?

Q(s,a) = r(s,a) + γ Σ_{s’} p(s’|s,a) max_{a’} Q(s’,a’)

This is also a Bellman equation with a unique solution.

15 What is good about it? If we know Q, look-ahead is not needed for following an optimal policy! i.e. if we know all action values then just do the best action from each state. Different implementations exist in the literature to improve efficiency. We will stick with turning the Bellman equation for action-value functions into an iterative update.
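One way to read “turning the Bellman equation into an iterative update” as code: sweep over all (s,a) pairs and replace Q(s,a) by the right-hand side of the equation. A sketch on an invented 2-state, 2-action MDP follows; note this is model-based Q-value iteration, not yet the model-free Q-learning used later.

```python
import numpy as np

# Model-based Q-value iteration on an invented 2-state, 2-action MDP.
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.0, 1.0]],   # p[s, a] = distribution over next states
              [[1.0, 0.0], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                 # r[s, a] = expected immediate reward
              [5.0, 2.0]])

Q = np.zeros((2, 2))
for _ in range(500):                      # sweep until the Q values stop changing
    Q = r + gamma * np.einsum('san,n->sa', p, Q.max(axis=1))

greedy_policy = Q.argmax(axis=1)          # acting greedily on Q needs no look-ahead
print(Q, greedy_policy)
```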

16 Simple example of updating Q Recall: Q(s,a) ← r + γ max_{a’} Q(δ(s,a), a’). [figure: grid world with states s1–s6; the immediate rewards and the Q values from a previous iteration are given on the arrows] ‘Simple’, i.e. observe this is a deterministic world.

17 Non-deterministic example: exercise. MDP model of a fierce battle. See the full text on your worksheet. Non-deterministic actions: the same action from the same state might have different consequences.

18 First iteration:
Q(L1,A) = 0 + 0.9*(0.7*50 + 0.3*0) = …
Q(L1,S) = 0 + 0.9*(0.5*(-50) + 0.5*0) = …
Q(L1,M) = …
…
What is your optimal action plan?
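For reference (this is not the battle MDP from the worksheet), the model-free Q-learning update for a nondeterministic environment replaces the expectation over next states by sampled transitions and uses a decaying learning rate. A minimal sketch on an invented ‘slippery chain’ environment:

```python
import random
from collections import defaultdict

# Q-learning in a nondeterministic world: an invented 'slippery chain' where
# trying to move right only succeeds 70% of the time.  State 2 is the goal.
gamma, num_episodes = 0.9, 5000
actions = ["stay", "go"]

def step(s, a):
    """Sampled transition: 'go' reaches s+1 with probability 0.7, else stay put."""
    s_next = s + 1 if (a == "go" and random.random() < 0.7) else s
    reward = 100.0 if (s_next == 2 and s != 2) else 0.0   # reward on entering the goal
    return s_next, reward

Q = defaultdict(float)                                    # Q[(s, a)], 0 to start with
visits = defaultdict(int)
for _ in range(num_episodes):
    s = 0
    while s != 2:
        a = random.choice(actions)                        # explore with random actions
        s_next, rwd = step(s, a)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]                      # decaying learning rate
        target = rwd + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next

print({k: round(v, 1) for k, v in Q.items()})
```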

19 Key points
– Learning by reinforcement
– Markov Decision Processes
– Value functions
– Optimal policy
– Bellman equations
– Methods and implementations for computing value functions
  – Policy iteration
  – Q-learning

