1
Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002. Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
2
Reading Today: Reinforcement Learning
Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996).
For Markov Decision Processes: read 1st/2nd ed. AIMA Chapter 17, sections 1–4.
Optional reading: "Planning and Acting in Partially Observable Stochastic Domains" by L. Kaelbling, M. Littman and A. Cassandra, Elsevier (1998).
3
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement: Q values, Q learning, multi-step backups; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
4
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon. Situations: board configurations (~10^20). Actions: moves. Rewards: +100 if win, −100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself. Now roughly equal to the best human player.
5
Reinforcement Learning Problem
Given: repeatedly execute an action, observe the resulting state, and observe the reward. Learn an action policy π: S → A that maximizes lifetime reward r_0 + γ r_1 + γ² r_2 + … from any start state. Discount: 0 < γ < 1. Note: unsupervised learning, delayed reward. [Figure: agent–environment loop; the agent emits actions a_0, a_1, a_2, …; the environment returns states s_1, s_2, s_3, … and rewards r_0, r_1, r_2, ….] Goal: learn to choose actions that maximize lifetime reward r_0 + γ r_1 + γ² r_2 + …
6
How About Learning the Policy Directly?
π*: S → A. Fill out table entries for π* by collecting statistics on training pairs <s, a*>. Where does a* come from?
7
How About Learning the Value Function?
Have the agent learn the value function V^π*, denoted V*. Given the learned V*, the agent selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]. Problem: this works well if the agent knows the environment model, δ: S × A → S and r: S × A → ℝ. With no model, the agent can't choose an action from V*. With a model, it could compute V* via value iteration, so why learn it?
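As a concrete illustration of this one-step lookahead, here is a minimal Python sketch; the dictionaries delta[(s, a)] and r[(s, a)] holding the known model are illustrative names, not from the slides:

```python
# One-step lookahead on a learned V*, assuming a known deterministic model.
# delta[(s, a)] -> next state, r[(s, a)] -> immediate reward (illustrative names).
def greedy_policy_from_V(s, actions, V, delta, r, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]"""
    return max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
```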
8
How About Learning the Model as Well?
Have the agent learn δ and r from statistics on training instances <s_t, a_t, r_t+1, s_t+1>. Compute V* by value iteration: V_t+1(s) ← max_a [r(s,a) + γ V_t(δ(s,a))]. The agent then selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]. Problem: a viable strategy for many problems, but… When do you stop learning the model and compute V*? It may take a long time to converge on the model. We would like to continuously interleave learning and acting, but repeatedly computing V* is costly. How can we avoid learning the model and V* explicitly?
9
Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]. Key idea: define a function that encapsulates V*, δ and r: Q(s,a) = r(s,a) + γ V*(δ(s,a)). From the learned Q, the agent can choose an optimal action without knowing δ or r: π*(s) = argmax_a Q(s,a). V(s) = cumulative reward of being in s (and acting optimally thereafter). Q(s,a) = cumulative reward of being in s, taking action a (and acting optimally thereafter).
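By contrast, once a Q table has been learned, action selection needs no model at all. A minimal sketch (the table layout, keyed by (state, action) pairs, is an assumption):

```python
# Greedy action selection directly from a learned Q table -- no delta or r needed.
def greedy_policy_from_Q(s, actions, Q):
    """pi*(s) = argmax_a Q(s, a)"""
    return max(actions, key=lambda a: Q[(s, a)])
```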
10
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement: Q values, Q learning, multi-step backups; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
11
How Do We Learn Q?
Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t)). Idea: create an update rule similar to the Bellman equation and perform updates on training examples <s_t, a_t, r_t+1, s_t+1>: Q(s_t,a_t) ← r_t+1 + γ V*(s_t+1). How do we eliminate V*? Q and V* are closely related: V*(s) = max_a' Q(s,a'). Substituting Q for V*: Q(s_t,a_t) ← r_t+1 + γ max_a' Q(s_t+1,a'). This update is called a backup.
12
Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q. Initially: for each s, a, initialize the table entry Q̂(s,a) ← 0, and observe the initial state s_0. Do for all time t: Select an action a_t and execute it. Receive the immediate reward r_t+1. Observe the new state s_t+1. Update the table entry for Q̂(s_t,a_t) as follows: Q̂(s_t,a_t) ← r_t+1 + γ max_a' Q̂(s_t+1,a'). Continue from s_t+1.
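A minimal tabular sketch of this loop, assuming a hypothetical environment object exposing reset() and step(a) (returning next state, reward, and an end-of-episode flag); the exploration strategy here is uniformly random just to keep the sketch short:

```python
from collections import defaultdict
import random

def q_learning_deterministic(env, actions, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                     # every Q(s, a) starts at 0
    for _ in range(episodes):
        s = env.reset()                        # observe the initial state
        done = False
        while not done:
            a = random.choice(actions)         # select an action and execute it
            s_next, r, done = env.step(a)      # receive reward, observe new state
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next                         # continue from the new state
    return Q
```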
13
Example – Q Learning Update
[Figure: two grid-world states s1 and s2 joined by the action a_right; Q̂(s1, a_right) is currently 72, s2's outgoing Q̂ values are 63, 81, and 100, the immediate reward received is 0, and γ = 0.9.]
14
Example – Q Learning Update
[Figure: the same two states s1 and s2, with the a_right arc now relabeled 90.] Q̂(s1, a_right) ← r(s1, a_right) + γ max_a' Q̂(s2, a') = 0 + 0.9 × max{63, 81, 100} = 90. Note: if rewards are non-negative, then for all s, a, n: Q̂_n(s,a) ≤ Q̂_n+1(s,a) and 0 ≤ Q̂_n(s,a) ≤ Q(s,a).
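The single backup above can be checked directly; a few lines of Python reproducing the arithmetic:

```python
# gamma = 0.9, immediate reward 0, successor state s2 currently has Q values 63, 81, 100.
gamma, r = 0.9, 0
q_s2 = [63, 81, 100]
print(r + gamma * max(q_s2))   # 90.0
```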
15
Q-Learning Iterations: Episodic
Start at the upper left and move clockwise; table entries initially 0; γ = 0.8. Update rule: Q̂(s,a) ← r + γ max_a' Q̂(s',a'). [Figure: grid world with states s1–s6 and goal G; reaching G earns reward 10, all other rewards are 0. A table tracks Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) across episodes, all initially 0.]
17
Q-Learning Iterations
Start at the upper left and move clockwise; γ = 0.8. Update rule: Q̂(s,a) ← r + γ max_a' Q̂(s',a'). Episode 1, on the step that reaches the goal: Q̂(s4,W) ← r + γ max{Q̂(s5, loop)} = 10 + 0.8 × 0 = 10.
18
Q-Learning Iterations
Continuing the same run (γ = 0.8): Q̂(s4,W) ← 10 + 0.8 × 0 = 10. Next episode: Q̂(s3,S) ← r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8.
19
Q-Learning Iterations
Continuing (γ = 0.8): Q̂(s4,W) remains 10 and Q̂(s3,S) remains 8; in the next episode, Q̂(s2,E) ← r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4.
20
Q-Learning Iterations
After these episodes the table reads: Q̂(s4,W) = 10 + 0.8 × 0 = 10; Q̂(s3,S) = 0 + 0.8 × max{10, 0} = 8; Q̂(s2,E) = 0 + 0.8 × max{0, 8} = 6.4, with the earlier entries carried forward.
22
Example Summary: Value Iteration and Q-Learning
[Figures: the same grid world with goal G shown four ways: the r(s,a) values (100 on moves into G, 0 elsewhere), the resulting Q(s,a) values (72, 81, 90, 100), the V*(s) values (81, 90, 100), and one optimal policy.]
23
Exploration vs Exploitation
How do you pick actions as you learn? Greedy action selection: always select the action that currently looks best: π(s) = argmax_a Q̂(s,a). Probabilistic action selection: choose a with likelihood that grows with its current Q̂ value, e.g. P(a_i | s) = k^Q̂(s,a_i) / Σ_j k^Q̂(s,a_j) (for k > 1, larger k favors exploitation over exploration).
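A sketch of both selection rules, assuming the k**Q form of the probabilistic rule and a Q table keyed by (state, action); the constant k and the table layout are illustrative:

```python
import random

def greedy_action(s, actions, Q):
    """Always exploit: pick argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

def probabilistic_action(s, actions, Q, k=2.0):
    """Explore: choose a with probability proportional to k**Q(s, a); k > 1 favors exploitation."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```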
24
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement: Q values, Q learning, multi-step backups; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
25
TD(λ): Temporal Difference Learning (from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997). Q learning reduces the discrepancy between successive Q estimates. One-step time difference: Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_t+1, a). Why not two steps? Q^(2)(s_t,a_t) = r_t + γ r_t+1 + γ² max_a Q̂(s_t+2, a). Or n? Q^(n)(s_t,a_t) = r_t + γ r_t+1 + … + γ^(n−1) r_t+n−1 + γ^n max_a Q̂(s_t+n, a). Blend all of these: Q^λ(s_t,a_t) = (1 − λ) [Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + …].
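A sketch of these returns computed from a recorded episode; the trajectory lists, Q table layout, and the truncation of the λ-blend at n_max terms (rather than summing to infinity) are all assumptions:

```python
def n_step_return(rewards, states, Q, actions, t, n, gamma):
    """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped max_a Q at step t+n."""
    G = sum(gamma**i * rewards[t + i] for i in range(min(n, len(rewards) - t)))
    if t + n < len(states):                                  # bootstrap only if s_{t+n} exists
        G += gamma**n * max(Q[(states[t + n], a)] for a in actions)
    return G

def lambda_return(rewards, states, Q, actions, t, gamma, lam, n_max=50):
    """Q^lambda(s_t, a_t) = (1 - lambda) * sum_n lambda^(n-1) Q^(n)(s_t, a_t), truncated."""
    return (1 - lam) * sum(lam**(n - 1) * n_step_return(rewards, states, Q, actions, t, n, gamma)
                           for n in range(1, n_max + 1))
```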
26
Eligibility Traces. Idea: perform backups on the N previous data points as well as the most recent data point. Select which data to back up based on frequency of visitation, biasing toward frequently visited data with a geometric decay γ^(i−j). [Figure: plots over time t of the visits to a data point <s,a> and of the corresponding accumulating trace, which jumps on each visit and decays geometrically between visits.]
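One common way to implement this idea is an accumulating trace e(s, a) that is bumped on each visit and decays geometrically, with every traced pair sharing in each TD error. The sketch below is a Watkins-style Q(λ) step without the cutoff on exploratory actions, and assumes Q and e are defaultdict(float) tables:

```python
def trace_backup_step(Q, e, s, a, r, s_next, actions, gamma, lam, alpha):
    """Bump the trace for (s, a), then back up every traced pair by its share of the TD error."""
    td_error = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    e[(s, a)] += 1.0                          # accumulating trace: +1 on each visit
    for key in list(e):
        Q[key] += alpha * td_error * e[key]   # credit earlier (s, a) pairs by eligibility
        e[key] *= gamma * lam                 # geometric decay of the trace
```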
27
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs: value iteration, Q learning; Function Approximators; Model-based Learning; Summary
28
Nondeterministic MDPs
State transitions become probabilistic: δ(s,a,s') gives the probability of reaching s' when action a is taken in state s. Example: [Figure: four-state model with S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia; R marks the research path and D the development path; transition probabilities 0.9, 0.1 and 1.0 label the arcs.]
29
Nondeterministic Case
How do we redefine cumulative reward to handle nondeterminism? Define V and Q in terms of expected values: V^π(s_t) = E[r_t + γ r_t+1 + γ² r_t+2 + …] = E[Σ_i γ^i r_t+i]. Q(s_t,a_t) = E[r(s_t,a_t) + γ V*(δ(s_t,a_t))].
30
Value Iteration for Non-deterministic MDPs
V_1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Q_t(s, a) := r(s, a) + γ Σ_{s' in S} δ(s, a, s') V_t−1(s')
    end loop
    V_t(s) := max_a Q_t(s, a)
  end loop
until |V_t(s) − V_t−1(s)| < ε for all s in S
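A direct Python transcription of this loop, assuming the model is given as dictionaries r[(s, a)] (expected reward) and delta[(s, a)] mapping each successor state to its probability:

```python
def value_iteration(S, A, delta, r, gamma=0.9, eps=1e-6):
    """Value iteration for a (possibly) nondeterministic MDP."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): r[(s, a)] + gamma * sum(p * V[s2] for s2, p in delta[(s, a)].items())
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:   # values are "close enough"
            return V_new, Q
        V = V_new
```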
31
Q Learning for Nondeterministic MDPs
Q*(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_a' Q*(s',a'). Alter the training rule for the nondeterministic case: Q̂_n(s_t,a_t) ← (1 − α_n) Q̂_n−1(s_t,a_t) + α_n [r_t+1 + γ max_a' Q̂_n−1(s_t+1,a')], where α_n = 1/(1 + visits_n(s,a)). Convergence of Q̂ can still be proved [Watkins and Dayan, 92].
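A sketch of the altered training rule with the decaying learning rate α_n; the module-level tables are an illustrative layout, not part of the original algorithm statement:

```python
from collections import defaultdict

Q = defaultdict(float)       # current estimate Q_hat(s, a)
visits = defaultdict(int)    # how many times each (s, a) has been updated

def q_update_nondeterministic(s, a, r, s_next, actions, gamma=0.9):
    """Blend the old estimate with the new sample using alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```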
32
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
33
Function Approximation
[Figure: a function approximator maps inputs s and a to Q(s,a); training targets or errors are fed back to adjust it.] Function approximators: backprop neural network, radial basis function network, CMAC network, nearest neighbor / memory-based, decision tree; several of these are trained with gradient-descent methods.
34
Function Approximation Example: Adjusting Network Weights
[Figure: the approximator takes s and a, outputs Q(s,a), and is adjusted by targets or errors.] Function approximator: Q(s,a) = f(s,a,w). Update by gradient-descent Sarsa: w ← w + α [r_t+1 + γ Q(s_t+1,a_t+1) − Q(s_t,a_t)] ∇_w f(s_t,a_t,w), i.e. (target value − estimated value) times the gradient with respect to the weight vector, implemented with standard backprop.
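A sketch of the gradient-descent Sarsa update for the simplest case, a linear approximator Q(s,a) = w · φ(s,a) (so the gradient with respect to w is just φ); the feature function phi is an assumption, standing in for the slide's backprop network:

```python
import numpy as np

def sarsa_gradient_step(w, phi, s, a, r, s_next, a_next, gamma=0.9, alpha=0.01):
    """w <- w + alpha * [r + gamma*Q(s',a') - Q(s,a)] * grad_w Q(s,a), with Q(s,a) = w . phi(s,a)."""
    td_error = r + gamma * np.dot(w, phi(s_next, a_next)) - np.dot(w, phi(s, a))
    return w + alpha * td_error * phi(s, a)   # for a linear Q, grad_w Q(s, a) = phi(s, a)
```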
35
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon. Situations: board configurations (~10^20). Actions: moves. Rewards: +100 if win, −100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself. Now roughly equal to the best human player.
36
Example: TD-Gammon [Tesauro, 1995]
V(s) is the predicted probability of winning. On a win the training outcome is 1; on a loss it is 0. The TD error is V(s_t+1) − V(s_t). [Figure: a neural network with random initial weights; the raw board position (number of pieces at each point) feeds a layer of hidden units, which output V(s).]
37
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
38
Model-based Learning: Certainty-Equivalence Method
For every step: Use the new experience to update the model parameters (transitions and rewards). Solve the model for V and π, by value iteration or policy iteration. Use the resulting policy to choose the next action.
39
Learning the Model. For each state-action pair <s,a> visited, accumulate: the mean transition probability T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a), and the mean reward R(s, a).
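A sketch of the bookkeeping, with illustrative dictionary names; the estimates are just ratios of visit counts:

```python
from collections import defaultdict

seen = defaultdict(int)          # count of observed transitions (s, a) -> s'
tried = defaultdict(int)         # count of times (s, a) was executed
reward_sum = defaultdict(float)  # running total of rewards for (s, a)

def record(s, a, r, s_next):
    """Accumulate statistics for the certainty-equivalence model."""
    tried[(s, a)] += 1
    seen[(s, a, s_next)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability: times (s, a) led to s' over times (s, a) was tried."""
    return seen[(s, a, s_next)] / tried[(s, a)] if tried[(s, a)] else 0.0

def R(s, a):
    """Estimated mean reward for (s, a)."""
    return reward_sum[(s, a)] / tried[(s, a)] if tried[(s, a)] else 0.0
```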
40
Comparison of Model-based and Model-free methods
Temporal differencing / Q learning: only does computation for the states the system is actually in; good real-time performance; inefficient use of data. Model-based methods: compute the best estimates for every state on every time step; efficient use of data; terrible real-time performance. What is a middle ground?
41
Dyna: A Middle Ground [Sutton, Intro to RL, 97]
At each step, incrementally: update the model based on new data; update the policy based on new data; update the policy based on the updated model. Performance until optimal on a grid world: Q-learning: 531,000 steps, 531,000 backups. Dyna: 61,908 steps, 3,055,000 backups.
42
Dyna Algorithm. Given state s: Choose action a using the estimated policy.
Observe the new state s' and reward r. Update T and R of the model. Update V at s: V(s) ← max_a [R(s,a) + γ Σ_s' T(s,a,s') V(s')]. Perform k additional updates: pick k random previously visited states s_j and update each: V(s_j) ← max_a [R(s_j,a) + γ Σ_s' T(s_j,a,s') V(s')]. A sketch follows below.
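The sketch makes several assumptions: a hypothetical env.step(a) returning (next state, reward), the record/T/R helpers sketched in the model-learning slide, S given as a list of known states, V as a dict over S, and a uniformly random stand-in for the estimated policy:

```python
import random

def dyna_step(env, s, V, S, A, record, T, R, gamma=0.9, k=10):
    """Act once in the real world, update the model, then do 1 real + k simulated backups."""
    a = random.choice(A)                           # stand-in for the estimated policy
    s_next, r = env.step(a)                        # observe new state s' and reward r
    record(s, a, r, s_next)                        # update T and R of the model
    backup = lambda x: max(R(x, b) + gamma * sum(T(x, b, y) * V[y] for y in S) for b in A)
    V[s] = backup(s)                               # backup driven by real experience
    for sj in random.sample(S, min(k, len(S))):    # k additional backups on the learned model
        V[sj] = backup(sj)
    return s_next
```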
43
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
44
Ongoing Research: handling cases where the state is only partially observable; design of optimal exploration strategies; extending to continuous actions and states; learning and using a model δ: S × A → S; scaling up in the size of the state space: function approximators (neural net instead of table), generalization, macros, exploiting substructure; multiple learners: multi-agent reinforcement learning.
45
Markov Decision Processes (MDPs)
Model: a finite set of states S; a finite set of actions A; probabilistic state transitions δ(s,a); a reward for each state and action, R(s,a). Process: observe state s_t in S; choose action a_t in A; receive immediate reward r_t; the state changes to s_t+1. [Figure: deterministic example — a grid world with goal G; only legal transitions are shown; the transition into G has reward 10 and reward on unlabeled transitions is 0; below, the interaction sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3.]
46
Crib Sheet: MDPs by Value Iteration
Insight: optimal values can be calculated iteratively using dynamic programming. Algorithm: iteratively update values using Bellman's equation: V*_t+1(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))]. Terminate when values are "close enough": |V*_t+1(s) − V*_t(s)| < ε. The agent then selects the optimal action by one-step lookahead on V*: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))].
47
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q. Initially: for each s, a, initialize the table entry Q̂(s,a) ← 0, and observe the current state s. Do forever: Select an action a and execute it. Receive the immediate reward r. Observe the new state s'. Update the table entry for Q̂(s,a) as follows: Q̂(s,a) ← r + γ max_a' Q̂(s',a'). s ← s'.