1
Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002. Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
2
Reading Today: Reinforcement Learning
Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996).
For Markov Decision Processes: read 1st/2nd ed. AIMA Chapter 17, sections 1–4.
Optional reading: "Planning and Acting in Partially Observable Stochastic Domains" by L. Kaelbling, M. Littman and A. Cassandra, Elsevier (1998).
3
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement: Q values, Q learning, multi-step backups; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
4
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon. Situations: board configurations (~10^20). Actions: moves. Rewards: +100 if win, −100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself. Now roughly equal to the best human player.
5
Reinforcement Learning Problem
Given: repeatedly execute an action, observe the resulting state, and observe the reward. Learn an action policy π: S → A that maximizes lifetime reward r_0 + γ r_1 + γ² r_2 + … from any start state. Discount: 0 < γ < 1. Note: unsupervised learning, delayed reward. [Figure: agent–environment loop; the agent emits actions a_0, a_1, a_2, …; the environment returns states s_1, s_2, s_3, … and rewards r_0, r_1, r_2, ….] Goal: learn to choose actions that maximize lifetime reward r_0 + γ r_1 + γ² r_2 + …
6
How About Learning the Policy Directly?
π*: S → A. Fill out table entries for π* by collecting statistics on training pairs <s, a*>. Where does a* come from?
7
How About Learning the Value Function?
Have the agent learn the value function V^π*, denoted V*. Given the learned V*, the agent selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]. Problem: this works well if the agent knows the environment model, δ: S × A → S and r: S × A → ℝ. With no model, the agent can't choose an action from V*. With a model, it could compute V* via value iteration, so why learn it?
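As a concrete illustration of this one-step lookahead, here is a minimal Python sketch; the dictionaries delta[(s, a)] and r[(s, a)] holding the known model are illustrative names, not from the slides:

```python
# One-step lookahead on a learned V*, assuming a known deterministic model.
# delta[(s, a)] -> next state, r[(s, a)] -> immediate reward (illustrative names).
def greedy_policy_from_V(s, actions, V, delta, r, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]"""
    return max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
```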
8
How About Learning the Model as Well?
Have the agent learn δ and r from statistics on training instances <s_t, a_t, r_t+1, s_t+1>. Compute V* by value iteration: V_t+1(s) ← max_a [r(s,a) + γ V_t(δ(s,a))]. The agent then selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]. Problem: a viable strategy for many problems, but… When do you stop learning the model and compute V*? It may take a long time to converge on the model. We would like to continuously interleave learning and acting, but repeatedly computing V* is costly. How can we avoid learning the model and V* explicitly?
9
Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]. Key idea: define a function that encapsulates V*, δ and r: Q(s,a) = r(s,a) + γ V*(δ(s,a)). From the learned Q, the agent can choose an optimal action without knowing δ or r: π*(s) = argmax_a Q(s,a). V(s) = cumulative reward of being in s (and acting optimally thereafter). Q(s,a) = cumulative reward of being in s, taking action a (and acting optimally thereafter).
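By contrast, once a Q table has been learned, action selection needs no model at all. A minimal sketch (the table layout, keyed by (state, action) pairs, is an assumption):

```python
# Greedy action selection directly from a learned Q table -- no delta or r needed.
def greedy_policy_from_Q(s, actions, Q):
    """pi*(s) = argmax_a Q(s, a)"""
    return max(actions, key=lambda a: Q[(s, a)])
```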
10
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement: Q values, Q learning, multi-step backups; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
11
How Do We Learn Q?
Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t)). Idea: create an update rule similar to the Bellman equation and perform updates on training examples <s_t, a_t, r_t+1, s_t+1>: Q(s_t,a_t) ← r_t+1 + γ V*(s_t+1). How do we eliminate V*? Q and V* are closely related: V*(s) = max_a' Q(s,a'). Substituting Q for V*: Q(s_t,a_t) ← r_t+1 + γ max_a' Q(s_t+1,a'). This update is called a backup.
12
Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q. Initially: for each s, a, initialize the table entry Q̂(s,a) ← 0, and observe the initial state s_0. Do for all time t: Select an action a_t and execute it. Receive the immediate reward r_t+1. Observe the new state s_t+1. Update the table entry for Q̂(s_t,a_t) as follows: Q̂(s_t,a_t) ← r_t+1 + γ max_a' Q̂(s_t+1,a'). Continue from s_t+1.
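A minimal tabular sketch of this loop, assuming a hypothetical environment object exposing reset() and step(a) (returning next state, reward, and an end-of-episode flag); the exploration strategy here is uniformly random just to keep the sketch short:

```python
from collections import defaultdict
import random

def q_learning_deterministic(env, actions, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                     # every Q(s, a) starts at 0
    for _ in range(episodes):
        s = env.reset()                        # observe the initial state
        done = False
        while not done:
            a = random.choice(actions)         # select an action and execute it
            s_next, r, done = env.step(a)      # receive reward, observe new state
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next                         # continue from the new state
    return Q
```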
13
Example – Q Learning Update
[Figure: two grid-world states s1 and s2 joined by the action a_right; Q̂(s1, a_right) is currently 72, s2's outgoing Q̂ values are 63, 81, and 100, the immediate reward received is 0, and γ = 0.9.]
14
Example – Q Learning Update
[Figure: the same two states s1 and s2, with the a_right arc now relabeled 90.] Q̂(s1, a_right) ← r(s1, a_right) + γ max_a' Q̂(s2, a') = 0 + 0.9 × max{63, 81, 100} = 90. Note: if rewards are non-negative, then for all s, a, n: Q̂_n(s,a) ≤ Q̂_n+1(s,a) and 0 ≤ Q̂_n(s,a) ≤ Q(s,a).
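The single backup above can be checked directly; a few lines of Python reproducing the arithmetic:

```python
# gamma = 0.9, immediate reward 0, successor state s2 currently has Q values 63, 81, 100.
gamma, r = 0.9, 0
q_s2 = [63, 81, 100]
print(r + gamma * max(q_s2))   # 90.0
```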
15
Q-Learning Iterations: Episodic
Start at the upper left and move clockwise; table entries initially 0; γ = 0.8. Update rule: Q̂(s,a) ← r + γ max_a' Q̂(s',a'). [Figure: grid world with states s1–s6 and goal G; reaching G earns reward 10, all other rewards are 0. A table tracks Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) across episodes, all initially 0.]
17
Q-Learning Iterations
Start at the upper left and move clockwise; γ = 0.8. Update rule: Q̂(s,a) ← r + γ max_a' Q̂(s',a'). Episode 1, on the step that reaches the goal: Q̂(s4,W) ← r + γ max{Q̂(s5, loop)} = 10 + 0.8 × 0 = 10.
18
Q-Learning Iterations
Continuing the same run (γ = 0.8): Q̂(s4,W) ← 10 + 0.8 × 0 = 10. Next episode: Q̂(s3,S) ← r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8.
19
Q-Learning Iterations
Continuing (γ = 0.8): Q̂(s4,W) remains 10 and Q̂(s3,S) remains 8; in the next episode, Q̂(s2,E) ← r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4.
20
Q-Learning Iterations
After these episodes the table reads: Q̂(s4,W) = 10 + 0.8 × 0 = 10; Q̂(s3,S) = 0 + 0.8 × max{10, 0} = 8; Q̂(s2,E) = 0 + 0.8 × max{0, 8} = 6.4, with the earlier entries carried forward.
22
Example Summary: Value Iteration and Q-Learning
[Figures: the same grid world with goal G shown four ways: the r(s,a) values (100 on moves into G, 0 elsewhere), the resulting Q(s,a) values (72, 81, 90, 100), the V*(s) values (81, 90, 100), and one optimal policy.]
23
Exploration vs Exploitation
How do you pick actions as you learn? Greedy action selection: always select the action that currently looks best: π(s) = argmax_a Q̂(s,a). Probabilistic action selection: choose a with likelihood that grows with its current Q̂ value, e.g. P(a_i | s) = k^Q̂(s,a_i) / Σ_j k^Q̂(s,a_j) (for k > 1, larger k favors exploitation over exploration).
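A sketch of both selection rules, assuming the k**Q form of the probabilistic rule and a Q table keyed by (state, action); the constant k and the table layout are illustrative:

```python
import random

def greedy_action(s, actions, Q):
    """Always exploit: pick argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

def probabilistic_action(s, actions, Q, k=2.0):
    """Explore: choose a with probability proportional to k**Q(s, a); k > 1 favors exploitation."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```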
24
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement: Q values, Q learning, multi-step backups; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
25
TD(λ): Temporal Difference Learning (from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997). Q learning reduces the discrepancy between successive Q estimates. One-step time difference: Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_t+1, a). Why not two steps? Q^(2)(s_t,a_t) = r_t + γ r_t+1 + γ² max_a Q̂(s_t+2, a). Or n? Q^(n)(s_t,a_t) = r_t + γ r_t+1 + … + γ^(n−1) r_t+n−1 + γ^n max_a Q̂(s_t+n, a). Blend all of these: Q^λ(s_t,a_t) = (1 − λ) [Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + …].
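A sketch of these returns computed from a recorded episode; the trajectory lists, Q table layout, and the truncation of the λ-blend at n_max terms (rather than summing to infinity) are all assumptions:

```python
def n_step_return(rewards, states, Q, actions, t, n, gamma):
    """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped max_a Q at step t+n."""
    G = sum(gamma**i * rewards[t + i] for i in range(min(n, len(rewards) - t)))
    if t + n < len(states):                                  # bootstrap only if s_{t+n} exists
        G += gamma**n * max(Q[(states[t + n], a)] for a in actions)
    return G

def lambda_return(rewards, states, Q, actions, t, gamma, lam, n_max=50):
    """Q^lambda(s_t, a_t) = (1 - lambda) * sum_n lambda^(n-1) Q^(n)(s_t, a_t), truncated."""
    return (1 - lam) * sum(lam**(n - 1) * n_step_return(rewards, states, Q, actions, t, n, gamma)
                           for n in range(1, n_max + 1))
```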
26
Eligibility Traces. Idea: perform backups on the N previous data points as well as the most recent data point. Select which data to back up based on frequency of visitation, biasing toward frequently visited data with a geometric decay γ^(i−j). [Figure: plots over time t of the visits to a data point <s,a> and of the corresponding accumulating trace, which jumps on each visit and decays geometrically between visits.]
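One common way to implement this idea is an accumulating trace e(s, a) that is bumped on each visit and decays geometrically, with every traced pair sharing in each TD error. The sketch below is a Watkins-style Q(λ) step without the cutoff on exploratory actions, and assumes Q and e are defaultdict(float) tables:

```python
def trace_backup_step(Q, e, s, a, r, s_next, actions, gamma, lam, alpha):
    """Bump the trace for (s, a), then back up every traced pair by its share of the TD error."""
    td_error = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    e[(s, a)] += 1.0                          # accumulating trace: +1 on each visit
    for key in list(e):
        Q[key] += alpha * td_error * e[key]   # credit earlier (s, a) pairs by eligibility
        e[key] *= gamma * lam                 # geometric decay of the trace
```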
27
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs: value iteration, Q learning; Function Approximators; Model-based Learning; Summary
28
Nondeterministic MDPs
State transitions become probabilistic: δ(s,a,s') gives the probability of reaching s' when action a is taken in state s. Example: [Figure: four-state model with S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia; R marks the research path and D the development path; transition probabilities 0.9, 0.1 and 1.0 label the arcs.]
29
Nondeterministic Case
How do we redefine cumulative reward to handle nondeterminism? Define V and Q in terms of expected values: V^π(s_t) = E[r_t + γ r_t+1 + γ² r_t+2 + …] = E[Σ_i γ^i r_t+i]. Q(s_t,a_t) = E[r(s_t,a_t) + γ V*(δ(s_t,a_t))].
30
Value Iteration for Non-deterministic MDPs
V_1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Q_t(s, a) := r(s, a) + γ Σ_{s' in S} δ(s, a, s') V_t−1(s')
    end loop
    V_t(s) := max_a Q_t(s, a)
  end loop
until |V_t(s) − V_t−1(s)| < ε for all s in S
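A direct Python transcription of this loop, assuming the model is given as dictionaries r[(s, a)] (expected reward) and delta[(s, a)] mapping each successor state to its probability:

```python
def value_iteration(S, A, delta, r, gamma=0.9, eps=1e-6):
    """Value iteration for a (possibly) nondeterministic MDP."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): r[(s, a)] + gamma * sum(p * V[s2] for s2, p in delta[(s, a)].items())
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:   # values are "close enough"
            return V_new, Q
        V = V_new
```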
31
Q Learning for Nondeterministic MDPs
Q*(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_a' Q*(s',a'). Alter the training rule for the nondeterministic case: Q̂_n(s_t,a_t) ← (1 − α_n) Q̂_n−1(s_t,a_t) + α_n [r_t+1 + γ max_a' Q̂_n−1(s_t+1,a')], where α_n = 1/(1 + visits_n(s,a)). Convergence of Q̂ can still be proved [Watkins and Dayan, 92].
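A sketch of the altered training rule with the decaying learning rate α_n; the module-level tables are an illustrative layout, not part of the original algorithm statement:

```python
from collections import defaultdict

Q = defaultdict(float)       # current estimate Q_hat(s, a)
visits = defaultdict(int)    # how many times each (s, a) has been updated

def q_update_nondeterministic(s, a, r, s_next, actions, gamma=0.9):
    """Blend the old estimate with the new sample using alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```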
32
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
33
Function Approximation
[Figure: a function approximator maps inputs s and a to Q(s,a); training targets or errors are fed back to adjust it.] Function approximators: backprop neural network, radial basis function network, CMAC network, nearest neighbor / memory-based, decision tree; several of these are trained with gradient-descent methods.
34
Function Approximation Example: Adjusting Network Weights
[Figure: the approximator takes s and a, outputs Q(s,a), and is adjusted by targets or errors.] Function approximator: Q(s,a) = f(s,a,w). Update by gradient-descent Sarsa: w ← w + α [r_t+1 + γ Q(s_t+1,a_t+1) − Q(s_t,a_t)] ∇_w f(s_t,a_t,w), i.e. (target value − estimated value) times the gradient with respect to the weight vector, implemented with standard backprop.
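A sketch of the gradient-descent Sarsa update for the simplest case, a linear approximator Q(s,a) = w · φ(s,a) (so the gradient with respect to w is just φ); the feature function phi is an assumption, standing in for the slide's backprop network:

```python
import numpy as np

def sarsa_gradient_step(w, phi, s, a, r, s_next, a_next, gamma=0.9, alpha=0.01):
    """w <- w + alpha * [r + gamma*Q(s',a') - Q(s,a)] * grad_w Q(s,a), with Q(s,a) = w . phi(s,a)."""
    td_error = r + gamma * np.dot(w, phi(s_next, a_next)) - np.dot(w, phi(s, a))
    return w + alpha * td_error * phi(s, a)   # for a linear Q, grad_w Q(s, a) = phi(s, a)
```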
35
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon. Situations: board configurations (~10^20). Actions: moves. Rewards: +100 if win, −100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself. Now roughly equal to the best human player.
36
Example: TD-Gammon [Tesauro, 1995]
V(s) is the predicted probability of winning. On a win the training outcome is 1; on a loss it is 0. The TD error is V(s_t+1) − V(s_t). [Figure: a neural network with random initial weights; the raw board position (number of pieces at each point) feeds a layer of hidden units, which output V(s).]
37
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
38
Model-based Learning: Certainty-Equivalence Method
For every step: Use the new experience to update the model parameters (transitions and rewards). Solve the model for V and π, by value iteration or policy iteration. Use the resulting policy to choose the next action.
39
Learning the Model. For each state-action pair <s,a> visited, accumulate: the mean transition probability T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a), and the mean reward R(s, a).
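A sketch of the bookkeeping, with illustrative dictionary names; the estimates are just ratios of visit counts:

```python
from collections import defaultdict

seen = defaultdict(int)          # count of observed transitions (s, a) -> s'
tried = defaultdict(int)         # count of times (s, a) was executed
reward_sum = defaultdict(float)  # running total of rewards for (s, a)

def record(s, a, r, s_next):
    """Accumulate statistics for the certainty-equivalence model."""
    tried[(s, a)] += 1
    seen[(s, a, s_next)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability: times (s, a) led to s' over times (s, a) was tried."""
    return seen[(s, a, s_next)] / tried[(s, a)] if tried[(s, a)] else 0.0

def R(s, a):
    """Estimated mean reward for (s, a)."""
    return reward_sum[(s, a)] / tried[(s, a)] if tried[(s, a)] else 0.0
```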
40
Comparison of Model-based and Model-free methods
Temporal differencing / Q learning: only does computation for the states the system is actually in; good real-time performance; inefficient use of data. Model-based methods: compute the best estimates for every state on every time step; efficient use of data; terrible real-time performance. What is a middle ground?
41
Dyna: A Middle Ground [Sutton, Intro to RL, 97]
At each step, incrementally: update the model based on new data; update the policy based on new data; update the policy based on the updated model. Performance until optimal on a grid world: Q-learning: 531,000 steps, 531,000 backups. Dyna: 61,908 steps, 3,055,000 backups.
42
Dyna Algorithm. Given state s: Choose action a using the estimated policy.
Observe the new state s' and reward r. Update T and R of the model. Update V at s: V(s) ← max_a [R(s,a) + γ Σ_s' T(s,a,s') V(s')]. Perform k additional updates: pick k random previously visited states s_j and update each: V(s_j) ← max_a [R(s_j,a) + γ Σ_s' T(s_j,a,s') V(s')]. A sketch follows below.
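The sketch makes several assumptions: a hypothetical env.step(a) returning (next state, reward), the record/T/R helpers sketched in the model-learning slide, S given as a list of known states, V as a dict over S, and a uniformly random stand-in for the estimated policy:

```python
import random

def dyna_step(env, s, V, S, A, record, T, R, gamma=0.9, k=10):
    """Act once in the real world, update the model, then do 1 real + k simulated backups."""
    a = random.choice(A)                           # stand-in for the estimated policy
    s_next, r = env.step(a)                        # observe new state s' and reward r
    record(s, a, r, s_next)                        # update T and R of the model
    backup = lambda x: max(R(x, b) + gamma * sum(T(x, b, y) * V[y] for y in S) for b in A)
    V[s] = backup(s)                               # backup driven by real experience
    for sj in random.sample(S, min(k, len(S))):    # k additional backups on the learned model
        V[sj] = backup(sj)
    return s_next
```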
43
Markov Decision Processes and Reinforcement Learning
Motivation; Learning policies through reinforcement; Nondeterministic MDPs; Function Approximators; Model-based Learning; Summary
44
Ongoing Research: handling cases where the state is only partially observable; design of optimal exploration strategies; extending to continuous actions and states; learning and using a model δ: S × A → S; scaling up in the size of the state space: function approximators (neural net instead of table), generalization, macros, exploiting substructure; multiple learners: multi-agent reinforcement learning.
45
Markov Decision Processes (MDPs)
Model: a finite set of states S; a finite set of actions A; probabilistic state transitions δ(s,a); a reward for each state and action, R(s,a). Process: observe state s_t in S; choose action a_t in A; receive immediate reward r_t; the state changes to s_t+1. [Figure: deterministic example — a grid world with goal G; only legal transitions are shown; the transition into G has reward 10 and reward on unlabeled transitions is 0; below, the interaction sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3.]
46
Crib Sheet: MDPs by Value Iteration
Insight: optimal values can be calculated iteratively using dynamic programming. Algorithm: iteratively update values using Bellman's equation: V*_t+1(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))]. Terminate when values are "close enough": |V*_t+1(s) − V*_t(s)| < ε. The agent then selects the optimal action by one-step lookahead on V*: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))].
47
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q. Initially: for each s, a, initialize the table entry Q̂(s,a) ← 0, and observe the current state s. Do forever: Select an action a and execute it. Receive the immediate reward r. Observe the new state s'. Update the table entry for Q̂(s,a) as follows: Q̂(s,a) ← r + γ max_a' Q̂(s',a'). s ← s'.