Reinforcement Learning Based on slides by Avi Pfeffer and David Parkes.

1 Reinforcement Learning Based on slides by Avi Pfeffer and David Parkes

2 Closed Loop Interactions [Diagram: the agent observes the environment through sensors, acts on it through actuators, and receives a reward signal.]

3 Reinforcement Learning Used when the mechanism (= model) is unknown, or when the mechanism is known but the model is too hard to solve

4 Basic Idea Select an action using some action selection process. If it leads to a reward, reinforce taking that action in the future. If it leads to a punishment, avoid taking that action in the future.

5 But It’s Not So Simple Rewards and punishments may be delayed –credit assignment problem: how do you figure out which actions were responsible? How do you choose an action? –exploration versus exploitation What if the state space is very large so you can’t visit all states?

6 Model-Based RL

7 Model-Based Reinforcement Learning Mechanism is an MDP Approach: –learn the MDP –solve it to determine the optimal policy Works when model is unknown, but it is not too large to store and solve

8 Learning the MDP We need to learn the parameters of the reward and transition models. We assume the agent plays every action in every state a number of times. Let R^a_i = total reward received for playing a in state i. Let N^a_i = number of times a was played in state i. Let N^a_ij = number of times j was reached when a was played in state i. Then R(i,a) = R^a_i / N^a_i and T^a_ij = N^a_ij / N^a_i.
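A minimal sketch of these counting estimates in Python; the data layout (a list of (i, a, r, j) experience tuples) is an assumption for illustration, not something specified on the slide:

```python
from collections import defaultdict

def estimate_mdp(experience):
    """Estimate R(i,a) and T^a_ij from a list of (i, a, r, j) experience tuples."""
    R_total = defaultdict(float)   # R^a_i: total reward received for playing a in i
    N = defaultdict(int)           # N^a_i: number of times a was played in i
    N_next = defaultdict(int)      # N^a_ij: number of times j was reached after (i, a)

    for (i, a, r, j) in experience:
        R_total[(i, a)] += r
        N[(i, a)] += 1
        N_next[(i, a, j)] += 1

    R = {(i, a): R_total[(i, a)] / n for (i, a), n in N.items()}        # R(i,a) = R^a_i / N^a_i
    T = {(i, a, j): n / N[(i, a)] for (i, a, j), n in N_next.items()}   # T^a_ij = N^a_ij / N^a_i
    return R, T
```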

9 Note Learning and solving the MDP need not be a one-off thing Instead, we can repeatedly solve the MDP to get better and better policies How often should we solve the MDP? –depends how expensive it is compared to acting in the world

10 Model-Based Reinforcement Learning Algorithm
Let π_0 be arbitrary
k ← 0
Experience ← ∅
Repeat:
  k ← k + 1
  Begin in state i
  For a while:
    Choose action a based on π_(k-1)
    Receive reward r and transition to j
    Experience ← Experience ∪ {(i, a, r, j)}
    i ← j
  Learn MDP M from Experience
  Solve M to obtain π_k
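A sketch of this loop, reusing the estimate_mdp sketch above and assuming hypothetical helpers choose_action (the exploration/exploitation rule discussed below) and solve_mdp (any MDP solver, e.g. value or policy iteration), plus an env.step(i, a) interface returning (reward, next_state):

```python
def model_based_rl(env, start_state, num_rounds, steps_per_round):
    """Sketch of the model-based RL loop: act for a while, then re-learn and re-solve the MDP.
    `env.step(i, a)` -> (reward, next_state), `choose_action`, and `solve_mdp` are
    assumed helpers (not from the slides); `estimate_mdp` is the counting sketch above."""
    policy = None                                   # π_0: arbitrary (None = act randomly at first)
    experience = []                                 # Experience ← ∅
    i = start_state
    for k in range(1, num_rounds + 1):              # k ← k + 1
        for _ in range(steps_per_round):            # "for a while"
            a = choose_action(i, policy)            # based on π_(k-1), plus some exploration
            r, j = env.step(i, a)                   # receive reward r, transition to j
            experience.append((i, a, r, j))         # Experience ← Experience ∪ {(i, a, r, j)}
            i = j
        model = estimate_mdp(experience)            # learn MDP M from Experience
        policy = solve_mdp(model)                   # solve M to obtain π_k
    return policy
```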

11 Credit Assignment How does model-based RL deal with the credit assignment problem? By learning the MDP, the agent knows which states lead to which other states Solving the MDP ensures that the agent plans ahead and takes the long run effects of actions into account So the problem is solved optimally

12 Action Selection The line in the algorithm "Choose action a based on π_(k-1)" is not specific How do we choose the action?

13 Action Selection The line in the algorithm "Choose action a based on π_(k-1)" is not specific How do we choose the action? Obvious answer: the policy tells us the action to perform But is that always what we want to do?

14 Exploration versus Exploitation Exploit: use your learning results to play the action that maximizes your expected utility, relative to the model you have learned Explore: play an action that will help you learn the model better

15 Questions When to explore? How to explore? –simple answer: play an action you haven't played much yet in the current state –more sophisticated: play an action that will probably lead you to part of the space you haven't explored much How to exploit? –we know the answer to this: follow the learned policy

16 Conditions for Optimality To ensure that the optimal policy will eventually be reached, we need to ensure that 1. Every action is taken in every state infinitely often in the long run 2. The probability of exploitation tends to 1

17 Possible Exploration Strategies: 1 Explore until time T, then exploit Why is this bad?

18 Possible Exploration Strategies: 1 Explore until time T, then exploit Why is this bad? –We may not explore long enough to get an accurate model –As a result, the optimal policy will not be reached

19 Possible Exploration Strategies: 1 Explore until time T, then exploit Why is this bad? –We may not explore long enough to get an accurate model –As a result, the optimal policy will not be reached But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy

20 Possible Exploration Strategies: 1 Explore until time T, then exploit Why is this bad? –We may not explore long enough to get an accurate model –As a result, the optimal policy will not be reached But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy Works well for learning from simulation and performing in the real world

21 Possible Exploration Strategies: 2 Explore with a fixed probability of p Why is this bad?

22 Possible Exploration Strategies: 2 Explore with a fixed probability of p Why is this bad? –Does not fully exploit when learning has converged to optimal policy

23 Possible Exploration Strategies: 2 Explore with a fixed probability of p Why is this bad? –Does not fully exploit when learning has converged to optimal policy When could this approach be useful?

24 Possible Exploration Strategies: 2 Explore with a fixed probability of p Why is this bad? –Does not fully exploit when learning has converged to optimal policy When could this approach be useful? –If world is changing gradually

25 Boltzmann Exploration In state i, choose action a with probability proportional to e^(Q(i,a)/T), i.e. P(a) = e^(Q(i,a)/T) / Σ_b e^(Q(i,b)/T), where Q(i,a) is the current estimate of the value of taking a in i T is a temperature High temperature: more exploration T should be cooled down to reduce the amount of exploration over time Sensitive to the cooling schedule
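A possible implementation of Boltzmann (softmax) action selection; q_values here stands for whatever action-value estimates the agent currently has for state i (an illustrative name, not from the slides):

```python
import math
import random

def boltzmann_action(q_values, temperature):
    """Choose an action with probability proportional to exp(value / T).
    `q_values` maps each action to the agent's current value estimate for it in state i."""
    actions = list(q_values)
    best = max(q_values.values())
    # subtract the max before exponentiating for numerical stability
    weights = [math.exp((q_values[a] - best) / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```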

26 Guarantee If: –every action is taken in every state infinitely often –probability of exploration tends to zero Then: –Model-based reinforcement learning will converge to the optimal policy with probability 1

27 Pros and Cons Pro: –makes maximal use of experience –solves model optimally given experience Con: –assumes model is small enough to solve –requires expensive solution procedure

28 R-Max Assume R(i,a) = R_max (the maximal possible reward) –called an optimism bias Assume any transition probabilities Solve and act optimally When N^a_i > c, update R(i,a) After each update, re-solve If you choose c properly, this converges to the optimal policy
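A rough sketch of the R-Max bookkeeping under these assumptions; the class layout and the handling of the c threshold are illustrative, and the full algorithm also keeps optimistic transitions for unknown pairs:

```python
from collections import defaultdict

class RMaxModel:
    """Illustrative R-Max bookkeeping (a sketch, not the full algorithm).
    Every (state, action) pair starts 'unknown' with the optimistic reward r_max;
    once it has been tried more than c times, empirical estimates are used instead."""

    def __init__(self, c, r_max):
        self.c, self.r_max = c, r_max
        self.N = defaultdict(int)          # N^a_i: visits to (i, a)
        self.R_sum = defaultdict(float)    # total reward observed for (i, a)
        self.N_next = defaultdict(int)     # N^a_ij: observed transitions

    def record(self, i, a, r, j):
        """Record one experience tuple; return True when (i, a) first becomes 'known',
        which is the signal to re-solve the MDP."""
        self.N[(i, a)] += 1
        self.R_sum[(i, a)] += r
        self.N_next[(i, a, j)] += 1
        return self.N[(i, a)] == self.c + 1

    def reward(self, i, a):
        """R(i, a): empirical mean if known, otherwise the optimistic r_max."""
        if self.N[(i, a)] > self.c:
            return self.R_sum[(i, a)] / self.N[(i, a)]
        return self.r_max
```

Whenever record signals that a pair has become known, the agent re-solves the MDP induced by these estimates (unknown pairs keep r_max) and acts greedily on the result.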

29 Model-Free RL

30 Monte Carlo Sampling If we want to estimate y = E_{x~D}[f(x)] we can –Generate random samples x_1,…,x_N from D –Estimate ŷ = (1/N) Σ_n f(x_n) –Guaranteed to converge to the correct estimate with sufficiently many samples –Requires keeping count of the number of samples Alternative, update a running average: –Generate random samples x_1,…,x_N from D –Estimate ŷ ← (1−α)ŷ + α f(x_n) after each sample
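A small sketch contrasting the batch estimate with the running-average update; the fixed step size α is an assumption here, and the point is only that it avoids storing a sample count (at the cost of some extra noise):

```python
import random

def mc_estimate(sample, f, n):
    """Batch Monte Carlo: draw n samples from D and average f over them."""
    return sum(f(sample()) for _ in range(n)) / n

def running_estimate(sample, f, n, alpha=0.05):
    """Incremental alternative: maintain a running average with a fixed step size,
    so no sample count needs to be stored."""
    y = 0.0
    for _ in range(n):
        y = (1 - alpha) * y + alpha * f(sample())
    return y

# Example: estimate E[x^2] for x uniform on [0, 1]; the true value is 1/3.
print(mc_estimate(random.random, lambda x: x * x, 100_000))
print(running_estimate(random.random, lambda x: x * x, 100_000))
```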

31 Randomized Policies A randomized policy π specifies a probability distribution over the action to take in each state Notation: π(i)(a) is the probability of taking action a in state i

32 Value of a Randomized Policy For a randomized policy π, the value V^π(i) is the expected discounted reward of following π starting from state i

33 Value of a Randomized Policy For a randomized policy π, the value V^π(i) is the expected discounted reward of following π starting from state i There are two random elements here: –which action the agent chooses, according to π –what state the world transitions to

34 Value of a Randomized Policy For a randomized policy π, the value V^π(i) is the expected discounted reward of following π starting from state i There are two random elements here: –which action the agent chooses, according to π –what state the world transitions to So: V^π(i) = Σ_a π(i)(a) [ R(i,a) + γ Σ_j T^a_ij V^π(j) ]

35 Estimating the Value of a Randomized Policy Fix a randomized policy π When starting in i, taking action a according to π, getting reward r and transitioning to j, we get a sample of r + γ V^π(j)

36 Estimating the Value of a Policy Fix a policy π When starting in state i, taking action a according to π, getting reward r and transitioning to j, we get a sample of r + γ V^π(j) So we can update V^π(i) ← (1−α)V^π(i) + α(r + γ V^π(j)) But where does V^π(j) come from? –Guess (this is called bootstrapping)

37 Improving Policies We learn V^π, but we want to know V* We then execute the policy π' that is optimal relative to this value function V^π We will then learn a new value function V^π' relative to the new policy π' We will improve the policy again, etc. This is policy iteration without a model

38 In Practice We don't wait until we have fully learned the value function to improve the policy We improve the policy constantly as we learn the value function Update rule: V(i) ← (1−α)V(i) + α(r + γ V(j))

39 Temporal Difference Algorithm
For each state i: V(i) ← 0
Begin in state i
Repeat:
  Apply action a based on current policy
  Receive reward r and transition to j
  V(i) ← (1−α)V(i) + α(r + γ V(j))
  i ← j
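A sketch of this loop in Python, assuming a policy(i) function and an env.step(i, a) interface returning (reward, next_state, done); these interfaces are illustrative, not from the slides:

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, start_state, num_steps, alpha=0.1, gamma=0.99):
    """TD(0) sketch: learn V for the given policy by bootstrapping from V(j).
    `policy(i)` -> action and `env.step(i, a)` -> (reward, next_state, done)
    are assumed interfaces, not from the slides."""
    V = defaultdict(float)                       # V(i) ← 0 for every state
    i = start_state
    for _ in range(num_steps):
        a = policy(i)                            # apply action based on current policy
        r, j, done = env.step(i, a)              # receive reward r, transition to j
        target = r if done else r + gamma * V[j]
        V[i] = (1 - alpha) * V[i] + alpha * target
        i = start_state if done else j           # i ← j (or restart the episode)
    return V
```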

40 Credit Assignment By linking values to those of the next state, rewards and punishments are eventually propagated backwards We wait until end of game and then propagate backwards in reverse order

41 But How Do We Learn to Act? To improve our policy, we need to have an idea of how good it is to use a different policy TD learns the value function –Similar to the value determination step of policy iteration, but without a model To improve, we need an estimate of the Q function: Q^π(i,a) = R(i,a) + γ Σ_j T^a_ij V^π(j)

42 TD for Control: SARSA
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
  Initialize s
  Choose a from s using policy derived from Q (e.g., ε-greedy)
  Repeat (for each step of episode):
    Take action a, observe r, s'
    Choose a' from s' using policy derived from Q (e.g., ε-greedy)
    Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') − Q(s,a)]
    s ← s'; a ← a'
  until s is terminal
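A sketch of the same SARSA pseudocode, again assuming an illustrative environment interface (env.reset() returning a state and env.step(s, a) returning (reward, next_state, done)):

```python
import random
from collections import defaultdict

def sarsa(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1, episodes=1000):
    """SARSA sketch following the pseudocode above.
    `env.reset()` -> state and `env.step(s, a)` -> (reward, next_state, done)
    are assumed interfaces, not from the slides."""
    Q = defaultdict(float)                                   # initialize Q(s, a) arbitrarily (0)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                                      # initialize s
        a = eps_greedy(s)                                    # choose a from s
        done = False
        while not done:                                      # until s is terminal
            r, s_next, done = env.step(s, a)                 # take action a, observe r, s'
            a_next = eps_greedy(s_next)                      # choose a' from s'
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # on-policy update toward Q(s', a')
            s, a = s_next, a_next                            # s ← s', a ← a'
    return Q
```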

43 Off-Policy vs. On-Policy On-policy learning: learn only the value of actions used in the current policy. SARSA is an example of an on-policy method. Of course, learning the value of the policy is combined with gradual change of the policy. Off-policy learning: can learn the value of a policy/action different from the one used – separating learning from control. Q-learning is an example: it learns about the optimal policy while using a different policy (e.g., an ε-greedy policy).

44 Q Learning Don't learn the model, learn the Q function directly Works particularly well when model is too large to store, to solve or to learn –size of model: O(|States|^2) –cost of solution by policy iteration: O(|States|^3) –size of Q function: O(|Actions|*|States|)

45 Q Learning Don't learn the model, learn the Q function directly Update rule: –On transitioning from i to j, taking action a, receiving reward r, update Q(i,a) ← (1−α)Q(i,a) + α(r + γ max_b Q(j,b)) Decision rule: –In state i, choose action that maximizes Q(i,a)

46 When Q Learning Works Works particularly well when model is too large to store, to solve or to learn –size of model: O(|States|^2) –cost of solution by policy iteration: O(|States|^3 * number of iterations) –size of Q function: O(|Actions|*|States|) Still does not work when state space is very large

47 Recursive Formulation of Q Function Q(i,a) = R(i,a) + γ Σ_j T^a_ij max_b Q(j,b)
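For reference, a sketch of evaluating this recursive formula with a known model — exactly the computation that requires T^a_ij, which Q-learning avoids (the dictionary layout is an illustrative assumption):

```python
def q_backup(R, T, max_Q_next, i, a, gamma=0.99):
    """One evaluation of Q(i,a) = R(i,a) + γ Σ_j T^a_ij max_b Q(j,b).
    Assumes R maps (i, a) -> reward, T maps (i, a) -> {j: probability}, and
    max_Q_next maps j -> max_b Q(j, b) (illustrative data layout)."""
    return R[(i, a)] + gamma * sum(p * max_Q_next[j] for j, p in T[(i, a)].items())
```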

48 Learning the Q Values We don't know T^a_ij and we don't want to learn it

49 Learning the Q Values We don't know T^a_ij and we don't want to learn it If only we knew that our future Q values were accurate… …every time we played a in state i and transitioned to j, receiving reward r, we would get a sample of r + γ max_b Q(j,b)

50 Learning the Q Values We don't know T^a_ij and we don't want to learn it If only we knew that our future Q values were accurate… …every time we played a in state i and transitioned to j, receiving reward r, we would get a sample of r + γ max_b Q(j,b) So we pretend that they are accurate –(after all, they get more and more accurate)

51 Q Learning Update Rule On transitioning from i to j, taking action a, receiving reward r, update Q(i,a) ← (1−α)Q(i,a) + α(r + γ max_b Q(j,b))

52 Q Learning Update Rule On transitioning from i to j, taking action a, receiving reward r, update Q(i,a) ← (1−α)Q(i,a) + α(r + γ max_b Q(j,b)) α is the learning rate Large α: –learning is quicker –but may not converge α is often decreased over the course of learning

53 Q Learning Algorithm
For each state i and action a: Q(i,a) ← 0
Begin in state i
Repeat:
  Choose action a based on the Q values for state i for all actions
  Receive reward r and transition to j
  Q(i,a) ← (1−α)Q(i,a) + α(r + γ max_b Q(j,b))
  i ← j
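A sketch of the full Q-learning loop with ε-greedy exploration; the environment interface is an assumption, as in the earlier sketches:

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1, episodes=1000):
    """Q-learning sketch: update Q(i, a) toward r + γ max_b Q(j, b).
    `env.reset()` -> state and `env.step(i, a)` -> (reward, next_state, done)
    are assumed interfaces, not from the slides."""
    Q = defaultdict(float)                                    # Q(i, a) ← 0
    for _ in range(episodes):
        i = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                     # explore
                a = random.choice(actions)
            else:                                             # exploit the current Q values
                a = max(actions, key=lambda b: Q[(i, b)])
            r, j, done = env.step(i, a)                       # receive reward r, transition to j
            best_next = 0.0 if done else max(Q[(j, b)] for b in actions)
            Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * (r + gamma * best_next)
            i = j
    return Q
```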

54 Choosing Which Action to Take Once you have learned the Q function, you can use it to determine the policy –in state i, choose action a that has highest estimated Q(i,a) But we need to combine exploitation with exploration –same methods as before

55 Guarantee If: –every action is taken in every state infinitely often –α is sufficiently small Then Q learning will converge to the optimal Q values with probability 1 If also: –the probability of exploration tends to zero Then Q learning will converge to the optimal policy with probability 1

56 Credit Assignment By linking Q values to those of the next state, rewards and punishments are eventually propagated backwards But may take a long time Idea: wait until end of game and then propagate backwards in reverse order

57 Q-learning (γ = 1) [Diagram: from S1, action a starts the chain S2 → S3 → S4 → S5 and action b starts the chain S6 → S7 → S8 → S9; all rewards are 0 except +1 on reaching S5 and −1 on reaching S9.] After playing aaaa: Q(S4,a) = 1, Q(S4,b) = 0; Q(S3,a) = 1, Q(S3,b) = 0; Q(S2,a) = 1, Q(S2,b) = 0; Q(S1,a) = 1, Q(S1,b) = 0. After playing bbbb: Q(S8,a) = 0, Q(S8,b) = −1; Q(S7,a) = 0, Q(S7,b) = 0; Q(S6,a) = 0, Q(S6,b) = 0; Q(S1,a) = 1, Q(S1,b) = 0.

58 Bottom Line Q learning makes optimistic assumption about the future Rewards will be propagated back in linear time, but punishments may take exponential time to be propagated But eventually, Q learning will converge to optimal policy

59 Temporal Difference (TD) Learning Learn the value function Either learn the mechanism, or assume the mechanism is known Update rule: –On transitioning from i to j, taking action a, receiving reward r, update V(i) ← (1−α)V(i) + α(r + γ V(j)) So far, good only for prediction –For control, i.e., to choose the action a that maximizes R(i,a) + γ Σ_j T^a_ij V(j), we need to know the transition function – not so good

60 How Do We Choose the Action? In Q learning, we chose the action a that maximized Q(i,a) But here, the value function does not mention an action, so how do we know which action is best?

61 How Do We Choose the Action? In Q learning, we chose the action a that maximized Q(i,a) But here, the value function does not mention an action, so how do we know which action is best? Answer: use knowledge of the MDP In state i, choose the action a that maximizes R(i,a) + γ Σ_j T^a_ij V(j) –this is why we need to learn the model!

62 How Do We Choose the Action? In Q learning, we chose the action a that maximized Q(i,a) But here, the value function does not mention an action, so how do we know which action is best? Answer: use knowledge of the MDP In state i, choose the action a that maximizes R(i,a) + γ Σ_j T^a_ij V(j) –this is why we need to learn the model! As usual, combine with exploration

63 TD Example (γ = 1) [Same chain of states as in the Q-learning example.] After playing aaaa: V(S4) = 1, V(S3) = 1, V(S2) = 1, V(S1) = 1. After playing bbbb: V(S8) = −1, V(S7) = −1, V(S6) = −1, V(S1) = −1.

64 Comparison Model-based best when MDP can be solved Temporal difference best when MDP is known or can be learned but cannot be solved Q learning best when MDP can’t be learned

65 When TD Learning Works When MDP is not hard to learn, but hard to solve Or… the mechanism is known, but the MDP is too hard to solve –size of model: O(|States|^2), but often model can be represented compactly –size of value function: O(|States|) Still does not work when state space is very large

67 Issue: Generalization What if state space is very large? Then we can't visit every state We need to generalize from states we have seen to states we haven't seen This is just like learning from a training set and generalizing to the future

68 State Space And Variables When we looked at reinforcement learning, state space was monolithic –e.g. in darts, just a number In many domains, state consists of a number of variables –e.g. in backgammon, number of pieces at each location Size of state space is exponential in number of variables We also need to consider continuous state spaces –e.g. helicopter

69 Value Function Approximation Define features X_1,…,X_n of the state

70 Value Function Approximation Define features X_1,…,X_n of the state Instead of learning V(s) for every state, learn an approximation V'(s) = f(X_1,…,X_n)

71 Value Function Approximation Define features X_1,…,X_n of the state Instead of learning V(s) for every state, learn an approximation V'(s) = f(X_1,…,X_n) that depends only on the features

72 Value Function Approximation Define features X_1,…,X_n of the state Instead of learning V(s) for every state, learn an approximation V'(s) = f(X_1,…,X_n) that depends only on the features Represent f compactly –e.g. using a neural network

73 Value Function Approximation Define features X_1,…,X_n of the state Instead of learning V(s) for every state, learn an approximation V'(s) = f(X_1,…,X_n) that depends only on the features Represent f compactly –e.g. using a neural network Works when state space is large but mechanism is known –e.g. backgammon

74 Q Function Approximation Define features X_1,…,X_n of the state

75 Q Function Approximation Define features X_1,…,X_n of the state Instead of learning Q(s,a) for every state, learn an approximation Q'(s,a)

76 Q Function Approximation Define features X_1,…,X_n of the state Instead of learning Q(s,a) for every state, learn an approximation Q'(s,a) that depends only on the features

77 Q Function Approximation Define features X_1,…,X_n of the state Instead of learning Q(s,a) for every state, learn an approximation Q'(s,a) that depends only on the features Represent it compactly –e.g. using a neural network

78 Q Function Approximation Define features X_1,…,X_n of the state Instead of learning Q(s,a) for every state, learn an approximation Q'(s,a) that depends only on the features Represent it compactly –e.g. using a neural network Works when state space is large and mechanism is unknown –e.g. helicopter

79 Value Function Approximation Update Rule On transitioning from i to j, taking action a, receiving reward r: Create a training instance in which –Inputs are the features of i –Output is r + γ V'(j) Run forward propagation and back propagation on this instance
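A sketch of one such update, where model is any supervised learner with assumed predict and train methods (e.g. a small neural network doing one forward and one backward pass per instance); these method names and the features helper are illustrative, not from the slides:

```python
def vfa_td_update(model, features, i, j, r, gamma=0.99):
    """One value-function-approximation training step (sketch).
    `model` is any supervised learner with assumed predict(x) and train(x, target)
    methods; `features(state)` returns the feature vector X_1, ..., X_n."""
    x = features(i)                                    # inputs: features of state i
    target = r + gamma * model.predict(features(j))    # output: r + γ V'(j)
    model.train(x, target)                             # fit this single training instance
```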

80 Basic Approach Define features that summarize the state –state represented by features X_1,…,X_n Assume that the value of a state approximately depends only on the features –V'(s) = f(x_1,…,x_n) Assume that f can be compactly represented Learn f from experience –how to learn such a function will be a major topic of this course

81 E.g. Samuel's Checkers Player Features: –x_1: number of black pieces on board –x_2: number of red pieces on board –x_3: number of black kings on board –x_4: number of red kings on board –x_5: number of black pieces threatened –x_6: number of red pieces threatened f(x_1,…,x_6) = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 + w_6x_6 w_1,…,w_6 are learnable parameters
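A minimal sketch of such a linear evaluation function, together with a TD-style weight update; the update rule shown is an illustrative assumption, not Samuel's exact training procedure:

```python
def linear_value(weights, x):
    """Samuel-style linear evaluation: f(x_1, ..., x_n) = w_1 x_1 + ... + w_n x_n."""
    return sum(w * xi for w, xi in zip(weights, x))

def td_weight_update(weights, x_i, x_j, r, alpha=0.01, gamma=1.0):
    """TD-style gradient step for the linear case (an illustrative rule):
    nudge f(x_i) toward the bootstrapped target r + γ f(x_j)."""
    error = (r + gamma * linear_value(weights, x_j)) - linear_value(weights, x_i)
    return [w + alpha * error * xi for w, xi in zip(weights, x_i)]
```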

82 E.g. Backgammon Input features are, for each space: –number of pieces on space –color of pieces, if any Network can learn high-level features, such as number of threatened exposed pieces TD-Gammon: a backgammon system trained by playing against itself using value function approximation –reached world champion level

83 Training Data Each time the agent transitions from i to j, taking action a and receiving reward r, we get an estimate v = r + γ V'(j) for V(i) Let the features of i be x_1,…,x_n We get a training instance ⟨(x_1,…,x_n), v⟩ We use this instance to update our model of f

85 Applications of MDPs, POMDPs and Reinforcement Learning TD-Gammon: world champion level backgammon player Robotics and control: e.g. helicopter Industrial: e.g. job shop scheduling Business: e.g. internet advertising Military: e.g. target identification Medical: e.g. testing and diagnosis

