Published by Marilynn Cameron. Modified over 8 years ago.
1
Reinforcement Learning
Guest Lecturer: Chengxiang Zhai
15-681 Machine Learning
December 6, 2001
2
Outline For Today
The Reinforcement Learning Problem
Markov Decision Process
Q-Learning
Summary
3
The Checker Problem Revisited
Goal: To win every game!
What to learn: Given any board position, choose a “good” move
But, what is a “good” move?
–A move that helps win a game
–A move that will lead to a “better” board position
So, what is a “better” board position?
–A position where a “good” next move exists!
4
Structure of the Checker Problem
You are interacting/experimenting with an environment (board)
You see the state of the environment (board position)
And, you take an action (move), which will
–change the state of the environment
–result in an immediate reward
Immediate reward = 0 unless you win (+100) or lose (-100) the game
You want to learn to “control” the environment (board) so as to maximize your long term reward (win the game)
5
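Reinforcement Learning Problem
[Figure: agent-environment loop. At time t the agent observes state $s_t$ and reward $r_t$ and takes action $a_t$; the environment responds with $s_{t+1}$ and $r_{t+1}$.]
Trajectory: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, \ldots$
Maximize $r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots$ ($\gamma$: discount factor)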
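A minimal sketch of the quantity being maximized, assuming a finite list of rewards and a discount factor between 0 and 1. The function name `discounted_return` and the sample episode are illustrative and not part of the lecture.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Checkers-style episode: no reward until the winning move (+100) on step 3.
print(discounted_return([0, 0, 100], gamma=0.9))
```

With a discount factor below 1, the same win is worth less the later it arrives, which is what pushes the agent toward shorter paths to a reward.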
6
Three Elements in RL
state s, action a, reward r
(Slide from Prof. Sebastian Thrun’s lecture)
7
Example 1: Slot Machine
State: configuration of slots
Action: stopping time
Reward: $$$
(Slide from Prof. Sebastian Thrun’s lecture)
8
Example 2: Mobile Robot
State: location of robot, people, etc.
Action: motion
Reward: the number of happy faces
(Slide from Prof. Sebastian Thrun’s lecture)
9
Example 3: Backgammon
State: Board position
Action: move
Reward:
–win (+100)
–lose (-100)
TD-Gammon ≈ best human players in the world
10
What Are We Learning Exactly?
A decision function/policy
–Given the state, choose an action
Formally,
–States: $S = \{s_1, \ldots, s_n\}$
–Actions: $A = \{a_1, \ldots, a_m\}$
–Reward: $R$
–Find $\pi: S \to A$ that maximizes $R$ (cumulative reward over time)
11
So, What’s Special About Reinforcement Learning?
Find $\pi: S \to A$
Function approximation
12
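Reinforcement Learning Problem
[Figure: agent-environment loop. At time t the agent observes state $s_t$ and reward $r_t$ and takes action $a_t$; the environment responds with $s_{t+1}$ and $r_{t+1}$.]
Trajectory: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, \ldots$
Maximize $r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots$ ($\gamma$: discount factor)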
13
What’s So Special About RL? (Answers from “the book”)
Delayed Reward
Exploration
Partially observable states
Life-long learning
14
Now that we know the problem, how do we solve it? ==> Markov Decision Process (MDP)
15
Markov Decision Process (MDP)
Finite set of states $S$
Finite set of actions $A$
At each time step the agent observes state $s_t \in S$ and chooses action $a_t \in A(s_t)$
Then receives immediate reward $r_{t+1} = r(s_t, a_t)$
And state changes to $s_{t+1} = \delta(s_t, a_t)$
Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_{t+1} = r(s_t, a_t)$
–Next reward and state only depend on current state $s_t$ and action $a_t$
–Functions $\delta(s_t, a_t)$ and $r(s_t, a_t)$ may be non-deterministic
–Functions $\delta(s_t, a_t)$ and $r(s_t, a_t)$ not necessarily known to agent
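To make the ingredients concrete, here is a small sketch of a deterministic MDP in code. The 4-state corridor, the names `delta`, `reward`, and `run`, and the +100 goal reward are all assumptions chosen for illustration (the reward mirrors the checkers example); the lecture does not define this environment.

```python
# Hypothetical 4-state corridor: states 0..3, where state 3 is an absorbing goal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3


def delta(s, a):
    """Deterministic transition function: next state after action a in state s."""
    if s == GOAL:
        return GOAL  # absorbing: nothing changes once the goal is reached
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)


def reward(s, a):
    """Immediate reward r(s, a): +100 only for the move that enters the goal."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def run(policy, s, steps):
    """Follow `policy` from state s and record (state, action, reward) triples."""
    trajectory = []
    for _ in range(steps):
        a = policy(s)
        trajectory.append((s, a, reward(s, a)))
        s = delta(s, a)
    return trajectory


print(run(lambda s: "right", 0, 4))
```

The next state and reward are computed from the current state and action alone, which is the Markov assumption in its simplest, deterministic form.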
16
Learning A Policy
A policy $\pi$ tells us how to choose an action given a state
An optimal policy is one that gives the best cumulative reward from any initial state
We define a cumulative value function for a policy $\pi$:
$V^{\pi}(s) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
where $r_t, r_{t+1}, \ldots$ are generated by following the policy $\pi$ from start state $s$
Task: Learn the optimal policy $\pi^*$ that maximizes $V^{\pi}(s)$: $\pi^* = \arg\max_{\pi} V^{\pi}(s), \ \forall s$
Define optimal value function $V^*(s) = V^{\pi^*}(s)$
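It may help to see that the value function has a one-step recursive form. The short derivation below assumes a deterministic world, writes $\delta$ for the transition function and $\gamma$ for the discount factor, and takes $r_t$ to be the reward for the action $\pi(s)$ taken in the start state $s$.

```latex
\begin{align*}
V^{\pi}(s) &= \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \\
           &= r_t + \gamma \sum_{i=0}^{\infty} \gamma^{i} r_{t+1+i} \\
           &= r(s, \pi(s)) + \gamma\, V^{\pi}\bigl(\delta(s, \pi(s))\bigr)
\end{align*}
```

The second sum is just the value of the state reached after one step, so the value of a state is its immediate reward plus the discounted value of its successor.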
17
Idea 1: Enumerating Policy
For each policy $\pi: S \to A$
For each state $s$, compute the evaluation function $V^{\pi}(s)$
Pick the $\pi$ that has the largest $V^{\pi}(s)$
What’s the problem? Complexity!
How do we get around this?
Observation: If we know $V^*(s) = V^{\pi^*}(s)$, we can find $\pi^*$:
$\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$
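The complexity problem is easy to see in code. The sketch below enumerates every policy on a hypothetical 4-state corridor (an assumption for illustration, not the lecture's environment) and scores each by the sum of its truncated values over all states. Summing is a convenience that works here because an optimal policy is best from every state at once.

```python
from itertools import product

GAMMA = 0.9
STATES = [0, 1, 2, 3]          # hypothetical corridor; state 3 is an absorbing goal
ACTIONS = ["left", "right"]


def delta(s, a):
    """Deterministic transition function for the corridor."""
    if s == 3:
        return 3
    return s + 1 if a == "right" else max(s - 1, 0)


def reward(s, a):
    """+100 for the move that enters the goal, 0 otherwise."""
    return 100 if s != 3 and delta(s, a) == 3 else 0


def value(policy, s, horizon=50):
    """Truncated value: discounted sum of rewards along the policy's path from s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * reward(s, a)
        discount *= GAMMA
        s = delta(s, a)
    return total


def enumerate_policies():
    """Try every mapping from states to actions; there are |A|**|S| of them."""
    best, best_score, count = None, float("-inf"), 0
    for choice in product(ACTIONS, repeat=len(STATES)):
        policy = dict(zip(STATES, choice))
        score = sum(value(policy, s) for s in STATES)
        count += 1
        if score > best_score:
            best, best_score = policy, score
    return best, count


best, count = enumerate_policies()
print(count, best)
```

Even this toy world has 2**4 = 16 policies; with n states and m actions the count is m**n, which is why enumeration does not scale.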
18
Idea 2: Learn V*(s)
For each state, compute $V^*(s)$ (less complexity)
Given the current state $s$, choose action according to $\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$
What’s the problem this time?
This works, but only if we know $r(s,a)$ and $\delta(s,a)$
How can we evaluate an action without knowing $r(s,a)$ and $\delta(s,a)$?
Observation: It seems that all we need is some function like $Q(s,a)$ … [$\pi^*(s) = \arg\max_a Q(s,a)$]
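The slide does not say how the optimal values are computed. One standard dynamic-programming option is value iteration, sketched below on a hypothetical 4-state corridor. Both the environment and the choice of value iteration are assumptions for illustration. The code calls `reward` and `delta` directly, which is exactly the knowledge the slide points out an agent may not have.

```python
GAMMA = 0.9
STATES = [0, 1, 2, 3]          # hypothetical corridor; state 3 is an absorbing goal
ACTIONS = ["left", "right"]


def delta(s, a):
    """Deterministic transition function for the corridor."""
    if s == 3:
        return 3
    return s + 1 if a == "right" else max(s - 1, 0)


def reward(s, a):
    """+100 for the move that enters the goal, 0 otherwise."""
    return 100 if s != 3 and delta(s, a) == 3 else 0


def value_iteration(sweeps=100):
    """Repeatedly back up V(s) = max_a [r(s,a) + gamma * V(delta(s,a))]."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(reward(s, a) + GAMMA * v[delta(s, a)] for a in ACTIONS)
             for s in STATES}
    return v


def greedy_policy(v):
    """Pick, in each state, the action with the best one-step lookahead value."""
    return {s: max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * v[delta(s, a)])
            for s in STATES}


v_star = value_iteration()
print(v_star)
print(greedy_policy(v_star))
```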
19
Idea 3: Learn Q(s,a)
Because we know $\pi^*(s) = \arg\max_a [r(s,a) + \gamma V^*(\delta(s,a))]$
If we want $\pi^*(s) = \arg\max_a Q(s,a)$, then we must have $Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a))$
We can express $V^*$ in terms of $Q$! $V^*(s) = \max_a Q(s,a)$
So, we have THE RULE FOR Q-LEARNING
$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$
–$Q(s,a)$: value of a on s
–$r(s,a)$: reward of a on s
–$\max_{a'} Q(\delta(s,a), a')$: best value of any action on next state
20
Q-Learning for Deterministic Worlds
For each $s, a$, initialize table entry $Q(s,a) = 0$
Observe current state $s$
Do forever:
–Select an action $a$ and execute it
–Receive immediate reward $r$
–Observe the new state $s'$
–Update the entry for $Q(s,a)$ as follows: $Q(s,a) = r + \gamma \max_{a'} Q(s',a')$
–Change to state $s'$
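The loop above can be sketched as a short program. The corridor environment, uniformly random action selection, and the split into episodes (restarting when the absorbing goal is reached, instead of literally running forever) are assumptions made to keep the example finite. The agent only calls `step`, so it never reads the reward or transition functions directly.

```python
import random

GAMMA = 0.9
STATES = [0, 1, 2, 3]          # hypothetical corridor; state 3 is an absorbing goal
ACTIONS = ["left", "right"]
GOAL = 3


def step(s, a):
    """Environment: return (reward, next_state). Hidden from the learner's logic."""
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return (100 if s_next == GOAL else 0), s_next


def q_learning(episodes=500, max_steps=20, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry Q(s, a) to 0.
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(STATES[:-1])            # start anywhere except the goal
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)            # select an action and execute it
            r, s_next = step(s, a)             # receive reward, observe new state
            # Deterministic update: Q(s,a) = r + gamma * max_a' Q(s',a')
            q[(s, a)] = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            s = s_next                         # change to state s'
            if s == GOAL:
                break
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES[:-1]}
print(policy)
```

Reading the greedy action out of the learned table recovers the policy without ever consulting a model of the environment.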
21
Why does Q-learning Work?
Q-learning converges!
Intuitively, for non-negative rewards,
Estimated Q values never decrease and never exceed true Q values
Maximum error goes down by a factor of $\gamma$ after each state is updated
$Q_{n+1}(s,a) = r + \gamma \max_{a'} Q_n(s',a') \le$ true $Q(s,a)$
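The factor-of-$\gamma$ claim can be checked in a few lines for the deterministic case. Here $Q$ denotes the true function, $Q_n$ the estimate after $n$ updates, and $\Delta_n$ (a symbol introduced only for this sketch) the largest error over all table entries.

```latex
\begin{align*}
\Delta_n &= \max_{s,a} \left| Q_n(s,a) - Q(s,a) \right| \\
\left| Q_{n+1}(s,a) - Q(s,a) \right|
  &= \left| \bigl(r + \gamma \max_{a'} Q_n(s',a')\bigr) - \bigl(r + \gamma \max_{a'} Q(s',a')\bigr) \right| \\
  &= \gamma \left| \max_{a'} Q_n(s',a') - \max_{a'} Q(s',a') \right| \\
  &\le \gamma \max_{a'} \left| Q_n(s',a') - Q(s',a') \right| \\
  &\le \gamma \, \Delta_n
\end{align*}
```

So any entry that gets updated ends up with error at most $\gamma \Delta_n$, and once every entry has been updated the largest error has shrunk by at least a factor of $\gamma$.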
22
Nondeterministic Case
Both $r(s,a)$ and $\delta(s,a)$ may have probabilistic outcomes
Solution: Just add expectation!
$Q(s,a) = E[r(s,a)] + \gamma E_{p(s'|s,a)}[\max_{a'} Q(s',a')]$
Update rule is slightly different (partially updating, “stop” after enough visits)
Also converges
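The slide does not spell out the modified update rule. One common form, assumed here, blends the old estimate with the new sample using a step size that shrinks with the number of visits to the state-action pair, so later samples change the entry less and less. The function name `q_update` and the step size 1 / (1 + visits) are illustrative choices.

```python
from collections import defaultdict


def q_update(q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One partial update of Q(s, a) toward the sampled one-step target."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])       # decays as (s, a) is visited more
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] = (1.0 - alpha) * q[(s, a)] + alpha * target
    return q[(s, a)]


# Two noisy-looking samples for the same pair, both leading to an unvisited state.
q = defaultdict(float)
visits = defaultdict(int)
print(q_update(q, visits, "s", "a", 10.0, "end", ["a"]))
print(q_update(q, visits, "s", "a", 10.0, "end", ["a"]))
```

Because each sample only moves the entry part of the way, a single unlucky outcome cannot overwrite what earlier visits have learned.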
23
Extensions of Q-Learning
How can we accelerate Q-learning?
–Choose action a that can maximize Q(s,a) (exploration vs. exploitation)
–Updating sequences
–Store past state-action transitions
–Exploit knowledge of transition and reward function (simulation)
What if we can’t store all the entries?
–Function Approximation (Neural Networks, etc.)
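The exploration vs. exploitation trade-off in the first bullet is often handled with an epsilon-greedy rule. That technique is not named in the lecture and is swapped in here only as a simple illustration: with probability `epsilon` the agent explores at random, otherwise it exploits the current Q table.

```python
import random


def epsilon_greedy(q, s, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(actions)                         # explore
    return max(actions, key=lambda a: q.get((s, a), 0.0))  # exploit


# Hypothetical Q table for a single state with two actions.
q = {("s", "a"): 1.0, ("s", "b"): 2.0}
print(epsilon_greedy(q, "s", ["a", "b"], epsilon=0.0))
```

Setting `epsilon` to 0 gives pure exploitation and 1 gives pure exploration; values in between let the agent keep testing actions whose Q estimates may still be wrong.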
24
Temporal Difference (TD) Learning
Learn by reducing discrepancies between estimates made at different times
Q-learning is a special case with one-step lookahead. Why not more than one step?
TD($\lambda$): Blend one-step, two-step, …, n-step lookahead with coefficients depending on $\lambda$
–When $\lambda = 0$, we get one-step Q-learning
–When $\lambda = 1$, only the observed $r$ values are considered
$Q^{\lambda}(s_t,a_t) = r_t + \gamma [(1-\lambda) \max_a Q(s_{t+1},a) + \lambda Q^{\lambda}(s_{t+1},a_{t+1})]$
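The recursive definition can be evaluated backwards over one recorded episode. The sketch below assumes the episode ends in a terminal state whose value is 0, and that the caller supplies, for each step, the current table's best value in the next state. The function name `lambda_returns` and the sample numbers are illustrative.

```python
def lambda_returns(rewards, next_state_values, gamma, lam):
    """Blend one-step and multi-step lookahead along one episode.

    rewards[t]           -- reward observed at step t
    next_state_values[t] -- max over actions of the current Q in the next state
    """
    returns = [0.0] * len(rewards)
    future = 0.0  # blended value beyond the final step (terminal, so 0)
    for t in reversed(range(len(rewards))):
        blended = (1 - lam) * next_state_values[t] + lam * future
        future = rewards[t] + gamma * blended
        returns[t] = future
    return returns


rewards = [0, 0, 100]
table_values = [50.0, 70.0, 0.0]   # hypothetical current estimates
print(lambda_returns(rewards, table_values, gamma=0.9, lam=0.0))
print(lambda_returns(rewards, table_values, gamma=0.9, lam=1.0))
```

With `lam=0.0` each target uses only the immediate reward and the table's estimate for the next state, and with `lam=1.0` the table is ignored and only the observed rewards are discounted back, matching the two limiting cases listed above.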
25
What You Should Know
All basic concepts of RL (state, action, reward, policy, value functions, discounted cumulative reward, …)
Mathematical foundation of RL is MDP and dynamic programming
Details of Q-learning, including its limitations (You should be able to implement it!)
Q-learning is a member of the family of temporal difference algorithms