1
CS 182/CogSci110/Ling109 Spring 2008 Reinforcement Learning: Algorithms 4/1/2008 Srini Narayanan – ICSI and UC Berkeley
2
Lecture Outline. Introduction. Basic concepts: expectation, utility, MEU. Neural correlates of reward-based learning. Utility theory from economics: preferences, utilities. Reinforcement learning, the AI approach: the problem; computing total expected value with discounting; Q-values and Bellman's equation; TD-learning.
3
Reinforcement Learning. Basic idea: receive feedback in the form of rewards. The agent's utility is defined by the reward function. The agent must learn to act so as to maximize expected utility. Change the rewards, change the behavior. DEMO.
4
Elements of RL. Transition model: how actions influence states. Reward R: immediate value of a state-action transition. Policy $\pi$: maps states to actions. [Figure: agent-environment loop with labels Agent, Environment, State, Reward, Action, Policy.]
5
Markov Decision Processes. A Markov decision process (MDP) consists of: a set of states $s \in S$; a model $T(s,a,s') = P(s' \mid s,a)$, the probability that action a in state s leads to s'; a reward function $R(s,a,s')$ (sometimes just $R(s)$ for leaving a state or $R(s')$ for entering one); a start state (or distribution); maybe a terminal state. MDPs are the simplest case of reinforcement learning. In general reinforcement learning, we don't know the model or the reward function.
6
Elements of RL. [Figure: grid world with goal state G, showing r(state, action) immediate reward values of 100 and 0.]
7
Reward Sequences. To formalize the optimality of a policy, we need to understand the utilities of reward sequences. We typically consider stationary preferences: if I prefer one state sequence starting today, I would prefer the same one starting tomorrow. Theorem: there are only two ways to define stationary utilities.
Additive utility: $U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \dots$
Discounted utility: $U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
8
Elements of RL. Value function: maps states to state values, $V^{\pi}(s) = r(t) + \gamma r(t+1) + \gamma^2 r(t+2) + \dots$ Discount factor $\gamma \in [0, 1)$ (here 0.9). [Figure: grid world with goal state G, showing r(state, action) immediate reward values (100 and 0) and V*(state) values; V* by row: 90, 100, 0 (G) and 81, 90, 100.]
9
RL task (restated). Execute actions in the environment and observe the results. Learn an action policy $\pi : \text{state} \to \text{action}$ that maximizes the expected discounted reward $E[r(t) + \gamma r(t+1) + \gamma^2 r(t+2) + \dots]$ from any starting state in S.
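The discounted sum inside the expectation can be computed for a single finite reward sequence as in the sketch below. The function name `discounted_return` and the example reward list are illustrative assumptions, not part of the lecture.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ... for a finite list."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sequence: two zero-reward steps, then a reward of 100.
print(discounted_return([0, 0, 100], gamma=0.9))
```

With gamma = 0.9, a reward of 100 received two steps in the future is worth about 81 now, which is the kind of value that appears two moves away from the goal in the grid-world figures.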
10
Hyperbolic discounting (Ainslie 1992). Short-term rewards are discounted differently from long-term rewards. Used in many animal discounting models. Has been used to explain procrastination and addiction. Evidence from neuroscience (next lecture).
11
MDP Solutions. In deterministic single-agent search, we want an optimal sequence of actions from the start to a goal. In an MDP, we want an optimal policy $\pi(s)$. A policy gives an action for each state. An optimal policy maximizes expected utility (i.e. expected rewards) if followed. [Figure: optimal policy when R(s, a, s') = -0.04 for all non-terminal states s.]
12
Example Optimal Policies. [Figure: four panels showing the optimal policy for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01.]
13
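Utility of a State. Define the utility of a state under a policy $\pi$: $V^{\pi}(s)$ = expected total (discounted) rewards starting in s and following $\pi$. Recursive definition (one-step look-ahead): $V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \, [R(s, \pi(s), s') + \gamma V^{\pi}(s')]$. Computing $V^{\pi}$ is also called policy evaluation.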
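A minimal sketch of policy evaluation by repeated one-step look-ahead, assuming the model is known. The two-state chain, the action name `"go"`, and the helper names are hypothetical and chosen only to keep the example small.

```python
def evaluate_policy(policy, T, R, gamma=0.9, tol=1e-10):
    """Iterate V(s) = sum_s' T(s,pi(s),s') * [R(s,pi(s),s') + gamma*V(s')]."""
    V = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s, a in policy.items():
            # States missing from V (terminal states) are worth 0.
            new_v = sum(p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                        for s2, p in T[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


# Hypothetical chain: A usually moves to B, B always exits with reward 10.
policy = {"A": "go", "B": "go"}
T = {("A", "go"): {"B": 0.8, "A": 0.2}, ("B", "go"): {"end": 1.0}}


def R(s, a, s2):
    return 10.0 if s2 == "end" else 0.0


V = evaluate_policy(policy, T, R)
print(V)
```

Solving the two equations by hand gives V(B) = 10 and V(A) = 7.2 / 0.82, which the iteration approaches.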
14
Bellman's Equation for Selecting Actions. The definition of utility leads to a simple relationship among optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy. Formally (Bellman's equation): $V^{*}(s) = \max_{a} \sum_{s'} T(s,a,s') \, [R(s,a,s') + \gamma V^{*}(s')]$. "That's my equation!"
15
Q-values. The expected utility of taking a particular action a in a particular state s is the Q-value of the pair (s, a). [Figure: three copies of the grid world with goal state G, showing r(state, action) immediate reward values (100 and 0), Q(state, action) values (100, 90, 81, 72, and 0), and V*(state) values (by row: 90, 100, 0 (G) and 81, 90, 100).]
16
Representation.
Explicit: a table of Q(s, a) values.
State   Action      Q(s, a)
2       MoveLeft    81
2       MoveRight   100
...     ...         ...
Implicit: a weighted linear function or neural network, with classical weight updating.
17
Q-Functions. A table of values for each action. A Q-value is the value of a (state, action) pair under a policy: the utility of starting in state s, taking action a, then following $\pi$ thereafter.
18
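The Bellman Equations. The definition of utility leads to a simple relationship among optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy. Formally: $V^{*}(s) = \max_{a} Q^{*}(s,a)$, where $Q^{*}(s,a) = \sum_{s'} T(s,a,s') \, [R(s,a,s') + \gamma V^{*}(s')]$, so that $V^{*}(s) = \max_{a} \sum_{s'} T(s,a,s') \, [R(s,a,s') + \gamma V^{*}(s')]$.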
19
Optimal Utilities. Goal: calculate the optimal utility of each state, V*(s) = expected (discounted) rewards with optimal actions. Why: given optimal utilities, MEU tells us the optimal policy.
20
MDP Solution Methods. If we know T(s, a, s') and R(s, a, s'), we can solve the MDP to find the optimal policy in a number of ways. Dynamic programming. Iterative estimation methods: value iteration and policy iteration. Value iteration: assume initial values of 0 for each state and update them using the Bellman equation to pick actions. Policy iteration: evaluate a given policy (find V(s) for the policy), then change it using Bellman updates until there is no further improvement in the policy.
21
Value Iteration. Idea: start with bad guesses at all utility values (e.g. $V_0(s) = 0$). Update all values simultaneously using the Bellman equation (called a value update or Bellman update): $V_{i+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s') \, [R(s,a,s') + \gamma V_i(s')]$. Repeat until convergence. Theorem: the values will converge to unique optimal values. Basic idea: bad guesses get refined towards optimal values. The policy may converge long before the values do.
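The sketch below applies value iteration to a small deterministic grid world like the one in the earlier figures: 2 rows by 3 columns, goal G in the top-right corner, reward 100 for entering G and 0 otherwise, discount 0.9. The grid layout, the choice to give G no outgoing actions, and all names are assumptions made for illustration.

```python
GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)  # top-right cell; assumed absorbing with no outgoing actions
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def build_model():
    """Return {state: {action: [(prob, next_state, reward), ...]}}."""
    model = {}
    for r in range(ROWS):
        for c in range(COLS):
            s = (r, c)
            model[s] = {}
            if s == GOAL:
                continue
            for a, (dr, dc) in MOVES.items():
                nr, nc = r + dr, c + dc
                if 0 <= nr < ROWS and 0 <= nc < COLS:
                    s2 = (nr, nc)
                    reward = 100.0 if s2 == GOAL else 0.0
                    model[s][a] = [(1.0, s2, reward)]  # deterministic move
    return model


def value_iteration(model, gamma=GAMMA, tol=1e-9):
    V = {s: 0.0 for s in model}  # V_0(s) = 0
    while True:
        new_V = {}
        for s, actions in model.items():
            # One Bellman backup per action, all computed from the old V.
            backups = [sum(p * (rew + gamma * V[s2]) for p, s2, rew in outcomes)
                       for outcomes in actions.values()]
            new_V[s] = max(backups) if backups else 0.0
        delta = max(abs(new_V[s] - V[s]) for s in model)
        V = new_V
        if delta < tol:
            return V


V_star = value_iteration(build_model())
for r in range(ROWS):
    print([round(V_star[(r, c)], 1) for c in range(COLS)])
```

Under these assumptions the values settle after a few sweeps at 90, 100, 0 in the top row and 81, 90, 100 in the bottom row.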
22
Reinforcement Learning. We have an MDP: a set of states $s \in S$; a set of actions (per state) A; a model T(s, a, s'); a reward function R(s, a, s'). We are looking for a policy $\pi(s)$. But we don't know T or R, i.e. we don't know which states are good or what the actions do. We must actually try actions and states out to learn.
23
Example: Animal Learning. RL has been studied experimentally for more than 60 years in psychology. Rewards: food, pain, hunger, drugs, etc. Mechanisms and sophistication are debated. Example: foraging. Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies. Bees have a direct neural connection from nectar intake measurement to the motor planning area.
24
Reinforcement Learning. The target function is $\pi : \text{state} \to \text{action}$. However, we have no training examples of the form <state, action>. Training examples are of the form <<state, action>, new-state, reward>.
25
Passive Learning. Simplified task: you don't know the transitions T(s, a, s'); you don't know the rewards R(s, a, s'); you are given a policy $\pi(s)$. Goal: learn the state values (and maybe the model). In this case there is no choice about what actions to take: just execute the policy and learn from experience.
26
Example: Direct Estimation (Simple Monte Carlo). [Figure: grid world with x and y axes and exit rewards +100 and -100.] $\gamma = 1$, R = -1.
Episodes:
Episode 1: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done).
Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done).
U(1,1) ~ (92 + -106) / 2 = -7
U(3,3) ~ (99 + 97 + -102) / 3 = 31.3
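Direct estimation can be sketched as averaging the observed return after every visit to a state. The code below uses the two episodes as listed on the slide; the function name and the every-visit averaging are assumptions. Note that with the first episode exactly as listed (seven -1 steps before the +100 exit), the return from (1,1) comes out as 93 rather than the 92 used on the slide, so the (1,1) average differs slightly, while the (3,3) estimate matches.

```python
from collections import defaultdict

# Each step is (state, action, reward), copied from the slide's episodes.
episodes = [
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((3, 3), "right", -1), ((4, 3), "exit", 100)],
    [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
     ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "up", -1),
     ((4, 2), "exit", -100)],
]


def direct_estimate(episodes, gamma=1.0):
    """Average the return observed after every visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the return from each step onward.
        for state, _action, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


U = direct_estimate(episodes)
print(round(U[(3, 3)], 1))
```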
27
Full Estimation (Dynamic Programming). [Figure: backup diagram over successor states, with terminal states marked T.]
28
Simple Monte Carlo. [Figure: backup diagram over successor states, with terminal states marked T.]
29
Combining DP and MC. [Figure: backup diagram over successor states, with terminal states marked T.]
30
Reinforcement Learning. The target function is $\pi : \text{state} \to \text{action}$. However, we have no training examples of the form <state, action>. Training examples are of the form <<state, action>, new-state, reward>.
31
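Model-Free Learning. Big idea: why bother learning T? Update each time we experience a transition. Frequent outcomes will contribute more updates (over time). Temporal difference (TD) learning: the policy is still fixed! Move values toward the value of whatever successor occurs: $V^{\pi}(s) \leftarrow (1 - \alpha) V^{\pi}(s) + \alpha \, [R(s, \pi(s), s') + \gamma V^{\pi}(s')]$. [Figure: one-step look-ahead diagram with nodes s, a, (s, a), (s, a, s'), and s'.]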
32
TD Learning Features. On-line and incremental. Bootstrapping (like DP, unlike MC). Model-free. Converges for any policy to the correct value of a state for that policy: on average when $\alpha$ is small; with probability 1 when $\alpha$ is high in the beginning and low at the end (say 1/k).
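A minimal sketch of TD value learning for a fixed policy: each observed transition nudges V(s) toward the sample r + gamma * V(s'). The function name, the use of a per-state visit count k for the 1/k step size, and the example transitions are assumptions for illustration.

```python
from collections import defaultdict


def td0_evaluate(transitions, gamma=0.9, alpha=None):
    """Run TD(0) over (state, reward, next_state) samples from a fixed policy.

    If alpha is None, use a decaying step size 1/k, where k counts visits
    to the state being updated; otherwise use the given constant alpha.
    """
    V = defaultdict(float)      # unseen and terminal states start at 0
    visits = defaultdict(int)
    for s, r, s_next in transitions:
        visits[s] += 1
        step = alpha if alpha is not None else 1.0 / visits[s]
        sample = r + gamma * V[s_next]           # bootstrapped target
        V[s] = (1 - step) * V[s] + step * sample
    return dict(V)


# Hypothetical experience: B exits with reward 10, then A is seen moving to B.
experience = [("B", 10.0, "end"), ("A", 0.0, "B")]
print(td0_evaluate(experience))             # 1/k step size
print(td0_evaluate(experience, alpha=0.5))  # constant step size
```

No transition model T is used anywhere: the update only needs the sampled successor, which is what makes the method model-free.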
33
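Problems with TD Value Learning. TD value learning is model-free for policy evaluation. However, if we want to turn our value estimates into a policy, we're sunk, because choosing the best action from state values needs a one-step look-ahead through the model: $\pi(s) = \arg\max_{a} \sum_{s'} T(s,a,s') \, [R(s,a,s') + \gamma V(s')]$. Idea: learn state-action pairings (Q-values) directly. This makes action selection model-free too! [Figure: one-step look-ahead diagram with nodes s, a, (s, a), (s, a, s'), and s'.]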
34
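Q-Learning. Learn Q*(s, a) values. Receive a sample (s, a, s', r). Consider your old estimate: $Q(s,a)$. Consider your new sample estimate: $\text{sample} = r + \gamma \max_{a'} Q(s',a')$. Nudge the old estimate towards the new sample: $Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha \cdot \text{sample}$.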
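The three steps above (old estimate, new sample estimate, nudge) can be sketched as a single update function. The table layout keyed by (state, action), the function name, and the example states are assumptions for illustration.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update for the sample (s, a, s_next, r)."""
    # Best estimated value available from the successor state (0 if terminal).
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the sample.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]


# Hypothetical table: state "s2" already has estimates for both actions.
Q = defaultdict(float)
Q[("s2", "right")] = 100.0
Q[("s2", "left")] = 81.0
new_q = q_update(Q, "s1", "right", 0.0, "s2", ["left", "right"], alpha=1.0)
print(new_q)
```

With alpha = 1 the old estimate is replaced outright, which is reasonable only in a deterministic world; a smaller alpha averages over noisy samples.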
35
Any problems with this? What if the starting policy doesn't let you explore the state space? T(s, a, s') is unknown and never estimated. The value of unexplored states is never computed. How do we address this problem? It is a fundamental problem in RL and in biology. AI solutions include $\epsilon$-greedy and softmax. Evidence from neuroscience (next lecture).
36
Exploration / Exploitation. There are several schemes for forcing exploration. Simplest: random actions ($\epsilon$-greedy). Every time step, flip a coin: with probability $\epsilon$, act randomly; with probability $1 - \epsilon$, act according to the current policy (best Q-value, for instance). Problems with random actions? You do explore the space, but you keep thrashing around once learning is done. One solution: lower $\epsilon$ over time. Another solution: exploration functions.
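A minimal sketch of epsilon-greedy action selection with a decaying epsilon. The function names, the Q-table layout, and the particular decay schedule are assumptions; the lecture only says that epsilon should be lowered over time.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise pick the best Q-value."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


def decayed_epsilon(step, start=1.0, floor=0.05):
    """One possible schedule (assumed): start / (1 + step), never below floor."""
    return max(floor, start / (1 + step))


# Hypothetical Q-table for a single state with two actions.
Q = {("s", "left"): 1.0, ("s", "right"): 2.0}
for step in (0, 10, 1000):
    eps = decayed_epsilon(step)
    print(step, eps, epsilon_greedy(Q, "s", ["left", "right"], eps))
```

Keeping a small floor on epsilon is a design choice: it preserves a little exploration indefinitely at the cost of occasional random actions after learning is done.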
37
Q-Learning
38
Q-Learning Features. On-line and incremental. Bootstrapping (like DP, unlike MC). Model-free. Converges to an optimal policy: on average when $\alpha$ is small; with probability 1 when $\alpha$ is high in the beginning and low at the end (say 1/k).
39
Reinforcement Learning. Basic idea: receive feedback in the form of rewards. The agent's utility is defined by the reward function. The agent must learn to act so as to maximize expected utility. Change the rewards, change the behavior. Examples: learning your way around, with a reward for reaching the destination; playing a game, with a reward at the end for winning or losing; vacuuming a house, with a reward for each piece of dirt picked up; an automated taxi, with a reward for each passenger delivered. DEMO.
40
Demo of Q-Learning. Demo: arm control. Parameters: $\alpha$ (learning rate), $\gamma$ (discount on reward; high values favor future rewards), $\epsilon$ (exploration; should decrease with time). MDP reward = number of pixels moved to the right / iteration number. Actions: arm up and down (yellow line), hand up and down (red line).
41
Exploration Functions. When to explore? Random actions: explore a fixed amount. Better idea: explore areas whose badness is not (yet) established. An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. $f(u, n) = u + k/n$ (exact form not important).
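A sketch of one possible exploration function: the value estimate plus a bonus that shrinks as the visit count grows. Since the exact form is not important, the bonus k / (n + 1), the constant k, and the function names below are all assumptions; the + 1 is added only to avoid dividing by zero for unvisited pairs.

```python
def exploration_value(u, n, k=10.0):
    """Optimistic utility: estimate u plus a bonus that decays with count n."""
    return u + k / (n + 1)


def pick_action(Q, N, state, actions, k=10.0):
    """Choose the action with the highest optimistic utility."""
    return max(
        actions,
        key=lambda a: exploration_value(Q.get((state, a), 0.0),
                                        N.get((state, a), 0), k),
    )


# Hypothetical tables: action "a" looks good and is well explored,
# action "b" has never been tried.
Q = {("s", "a"): 5.0, ("s", "b"): 0.0}
N = {("s", "a"): 9}
print(pick_action(Q, N, "s", ["a", "b"]))  # untried "b" gets the larger bonus

N[("s", "b")] = 9                          # after "b" has been tried a while
print(pick_action(Q, N, "s", ["a", "b"]))  # the estimate for "a" now wins
```

Unlike fixed random exploration, the bonus fades on its own as counts grow, so the agent stops exploring actions whose badness has been established.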