CS 188: Artificial Intelligence, Spring 2007
Lecture 23: Reinforcement Learning IV
4/19/2007
Srini Narayanan – ICSI and UC Berkeley
Announcements
Othello tournament rules are up.
On-line readings for this week.
Combining DP and MC
[figure: backup diagrams; T marks terminal states]
Model-Free Learning
Big idea: why bother learning T?
Update V(s) each time we experience a transition; frequent outcomes will contribute more updates (over time).
Temporal difference learning (TD): move values toward the value of whatever successor occurs:
sample = R(s, a, s') + γ Vπ(s')
Vπ(s) ← (1 − α) Vπ(s) + α · sample
Policy still fixed!
Example: Passive TD
Take γ = 1, α = 0.5.
Episode 1: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done)
Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done)
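A minimal sketch in Python of the passive TD(0) update replaying these two episodes (not course code; the per-step reward convention and the episode encoding are assumptions for illustration):

# Passive TD(0) policy evaluation on the two episodes above.
gamma, alpha = 1.0, 0.5
V = {}  # state values, default 0

# Each step is (state, reward received on this step, next state); None marks the end.
episode1 = [((1,1), -1, (1,2)), ((1,2), -1, (1,3)), ((1,3), -1, (2,3)),
            ((2,3), -1, (3,3)), ((3,3), -1, (3,2)), ((3,2), -1, (3,3)),
            ((3,3), -1, (4,3)), ((4,3), 100, None)]
episode2 = [((1,1), -1, (1,2)), ((1,2), -1, (1,3)), ((1,3), -1, (2,3)),
            ((2,3), -1, (3,3)), ((3,3), -1, (3,2)), ((3,2), -1, (4,2)),
            ((4,2), -100, None)]

for episode in (episode1, episode2):
    for s, r, s_next in episode:
        nxt = V.get(s_next, 0.0) if s_next is not None else 0.0
        sample = r + gamma * nxt                          # r + gamma * V(s')
        V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample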
TD Learning: Features
On-line, incremental.
Bootstrapping (like DP, unlike MC).
Model-free.
Converges to the correct value of each state for any fixed policy:
on average, when α is small;
with probability 1, when α starts high and decays toward zero over time (e.g. α_k = 1/k).
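The 1/k schedule mentioned above, as a one-liner (the convergence conditions in the comment are the standard stochastic-approximation ones, not stated on the slide):

# Decaying learning-rate schedule (the slide's own example: alpha_k = 1/k).
# Such schedules satisfy sum_k alpha_k = infinity and sum_k alpha_k^2 < infinity,
# the usual conditions for convergence with probability 1.
def alpha_schedule(k):
    return 1.0 / k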
Driving Home
[figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]
Problems with TD Value Learning
TD value learning is model-free for policy evaluation.
However, if we want to turn our value estimates into a policy, we're sunk: picking the best action requires a one-step lookahead over T and R, so we need the model after all.
Idea: learn values of state-action pairs (q-values) directly.
Makes action selection model-free too!
Q-Functions
A q-value is the value of a (state, action) pair under a policy π:
Qπ(s, a) = utility of starting in state s, taking action a, then following π thereafter.
The Bellman Equations
The definition of utility leads to a simple relationship amongst optimal utility values:
optimal rewards = maximize over the first action and then follow the optimal policy.
Formally:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
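To make the recursion concrete, here is a one-step Bellman backup sketched in Python (representing T and R as dictionaries is an illustrative assumption, not the course's code):

def q_value(V, s, a, T, R, gamma):
    """Q*(s,a): expected immediate reward plus discounted next-state value.
    T[(s, a)] maps next states s2 to probabilities; R[(s, a, s2)] is the reward."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())

def bellman_backup(V, s, actions, T, R, gamma):
    """V*(s) = max over actions a of Q*(s, a)."""
    return max(q_value(V, s, a, T, R, gamma) for a in actions)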
Q-Learning
Learn Q*(s, a) values.
Receive a sample (s, a, s', r).
Consider your old estimate: Q(s, a).
Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a')
Nudge the old estimate towards the new sample: Q(s, a) ← (1 − α) Q(s, a) + α · sample
Recap: Q-Learning
Learn Q*(s, a) values from samples.
Receive a sample (s, a, s', r).
On one hand: our old estimate of the return is Q(s, a).
But now we have a new sample estimate: sample = r + γ max_{a'} Q(s', a')
Nudge the old estimate towards the new sample: Q(s, a) ← (1 − α) Q(s, a) + α · sample
Equivalently, average samples over time: Q(s, a) ← Q(s, a) + α [ sample − Q(s, a) ]
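A sketch of this update in Python (the dictionary q-table, the action set, and the parameter values are illustrative assumptions):

from collections import defaultdict

gamma, alpha = 0.9, 0.5          # illustrative values, not from the slides
ACTIONS = ('N', 'S', 'E', 'W')   # assumed action set
Q = defaultdict(float)           # Q[(state, action)], defaults to 0

def q_update(s, a, s_next, r, terminal=False):
    """Nudge Q(s,a) toward the sample estimate r + gamma * max_a' Q(s', a')."""
    future = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in ACTIONS)
    sample = r + gamma * future
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample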
Q-Learning
Exploration / Exploitation
Several schemes for forcing exploration.
Simplest: random actions (ε-greedy).
Every time step, flip a coin:
with probability ε, act randomly;
with probability 1 − ε, act according to the current policy (e.g. the best q-value).
Problems with random actions? You do explore the space, but you keep thrashing around once learning is done.
One solution: lower ε over time.
Another solution: exploration functions.
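A minimal ε-greedy selector, sketched under the assumption that Q is a dict mapping (state, action) pairs to values:

import random

def epsilon_greedy(Q, actions, s, epsilon):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])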
Q-Learning
Q-learning produces tables of q-values.
Demo of Q-Learning
Demo: arm control.
Parameters: learning rate α; discount γ (high to value future rewards); exploration ε (should decrease with time).
MDP: reward = number of pixels moved to the right per iteration.
Actions: arm up and down (yellow line), hand up and down (red line).
Q-Learning Properties
Will converge to the optimal policy:
if you explore enough;
if you make the learning rate small enough (eventually).
Neat property: q-learning is off-policy. Caveat: it does not learn policies that are optimal in the presence of action-selection noise (an on-policy method would).
Exploration Functions
When to explore?
Random actions: explore a fixed amount.
Better idea: explore areas whose badness is not (yet) established.
An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n (the exact form is not important).
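A sketch of this idea in Python; the constant k, and Q and N as dicts defaulting to 0, are illustrative assumptions:

from collections import defaultdict

Q = defaultdict(float)   # q-value estimates
N = defaultdict(int)     # visit counts
k = 10.0                 # optimism constant (assumed value)

def explore_value(s, a):
    """f(u, n) = u + k/n: the estimate plus a bonus that fades with visits."""
    return Q[(s, a)] + k / (N[(s, a)] + 1)   # +1 avoids division by zero

def explore_action(actions, s):
    return max(actions, key=lambda a: explore_value(s, a))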
Q-Learning
In realistic situations, we cannot possibly learn about every single state!
Too many states to visit them all in training.
Too many states to even hold the q-tables in memory.
Instead, we want to generalize:
learn about some small number of training states from experience;
generalize that experience to new, similar states.
This is a fundamental idea in machine learning: clustering, classification (unsupervised, supervised), non-parametric learning.
Evaluation Functions
A function which scores non-terminal states.
Ideal function: returns the true utility of the position.
In practice: typically a weighted linear sum of features:
Eval(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
e.g. f1(s) = (num white queens − num black queens), etc.
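A toy linear evaluation function in Python (the example weights and feature values are made up for illustration):

def eval_fn(weights, features):
    """Eval(s) = w1*f1(s) + ... + wn*fn(s): a weighted linear sum of features."""
    return sum(w * f for w, f in zip(weights, features))

# e.g. features of s: [queen difference, pawn difference], weighted 9 and 1
score = eval_fn([9.0, 1.0], [1, 3])   # -> 12.0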
Function Approximation
Problem: it is inefficient to learn each state's utility (or eval function value) one by one.
Solution: what we learn about one state (or position) should generalize to similar states.
Very much like supervised learning!
If states are treated entirely independently, we can only learn on very small state spaces.
Linear Value Functions
Another option: values are linear functions of features of states (or state-action pairs):
V(s) = w1 f1(s) + … + wn fn(s)
Good if you can describe states well using a few features (e.g. board evaluations for game playing).
Now we only have to learn a few weights rather than a value for each state.
TD Updates for Linear Qs
Can use TD learning with linear q-functions (actually, it's just like the perceptron!).
Old q-learning update: Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
With a linear q-function, simply update the weights of the features in Q(s, a):
wi ← wi + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] fi(s, a)
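A sketch of this perceptron-style update (the feature extractor returning a {name: value} dict, and the parameter values, are illustrative assumptions):

from collections import defaultdict

gamma, alpha = 0.9, 0.5       # illustrative values
weights = defaultdict(float)  # one weight per feature name

def q_linear(s, a, features):
    """Q(s,a) = sum_i w_i * f_i(s,a); features(s, a) returns {name: value}."""
    return sum(weights[name] * v for name, v in features(s, a).items())

def linear_q_update(s, a, s_next, r, actions, features, terminal=False):
    """Shift each weight along its feature value, scaled by the TD error."""
    best_next = 0.0 if terminal else max(q_linear(s_next, a2, features) for a2 in actions)
    difference = (r + gamma * best_next) - q_linear(s, a, features)
    for name, v in features(s, a).items():
        weights[name] += alpha * difference * v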
Generalization of Q-Functions
Non-linear q-functions are required for more complex spaces. Such functions can be learned using:
multi-layer perceptrons (TD-Gammon);
support vector machines;
non-parametric methods.
Demo: Learning walking controllers (From Stanford AI Lab)
Policy Search
Problem: often the feature-based policies that work well aren't the ones that approximate V or Q best.
E.g. the value functions from the gridworld were probably bad estimates of future rewards, but they could still produce good decisions.
Solution: learn the policy that maximizes rewards rather than the value that predicts rewards.
This is the idea behind policy search, such as what controlled the upside-down helicopter.
Policy Search
Simplest policy search:
start with an initial linear value function or q-function;
nudge each feature weight up and down and see if your policy is better than before (see the sketch below).
Problems:
How do we tell the policy got better? Need to run many sample episodes!
If there are a lot of features, this can be impractical.
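A sketch of this nudge-and-evaluate loop; evaluate_policy, which runs sample episodes and returns an average return, is an assumed helper:

def hill_climb(weights, evaluate_policy, step=0.1):
    """Nudge each feature weight up and down; keep any change that improves the policy."""
    best = evaluate_policy(weights)            # runs many sample episodes: expensive!
    for name in list(weights):
        for delta in (step, -step):
            weights[name] += delta
            score = evaluate_policy(weights)
            if score > best:
                best = score                   # keep the improvement
            else:
                weights[name] -= delta         # revert the nudge
    return weights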
Policy Search*
Advanced policy search:
Write a stochastic (soft) policy, e.g. a softmax that picks actions with probability proportional to e^{Q(s, a)}.
It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don't have to know them).
Take uphill steps, recalculate derivatives, etc.
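A minimal softmax-policy sketch (sampling actions in proportion to e raised to their q-values; the slide's exact formula is not shown in this transcript, so this form is an assumption):

import math, random

def soft_policy(q_values):
    """Sample an action with probability proportional to exp(Q(s, a)).
    q_values maps each action to its score in the current state."""
    actions = list(q_values)
    scores = [math.exp(q_values[a]) for a in actions]
    total = sum(scores)
    r, acc = random.random() * total, 0.0
    for a, sc in zip(actions, scores):
        acc += sc
        if r <= acc:
            return a
    return actions[-1]  # guard against floating-point rounding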
Helicopter Control (Andrew Ng)
Neural Correlates of RL
Parkinson's disease: motor control + initiation?
Intracranial self-stimulation, drug addiction, natural rewards: reward pathway? learning?
Also involved in: working memory, novel situations, ADHD, schizophrenia, …
Conditioning (Ivan Pavlov)
[figure: classical conditioning setup] conditional stimulus, unconditional stimulus.
Response = unconditional response (reflex); conditional response (reflex).
Dopamine Levels Track RL Signals
Unpredicted reward (unlearned / no stimulus).
Predicted reward (learned task).
Omitted reward (probe trial).
(Montague et al. 1996)
Current Hypothesis
Phasic dopamine encodes a reward prediction error.
A precise (normative!) theory for the generation of DA firing patterns.
A compelling account of the role of DA in classical conditioning: the prediction error acts as the signal driving learning in prediction areas.
Evidence: monkey single-cell recordings; human fMRI studies.
Current research: better information-processing models; other reward/punishment circuits, including the amygdala (for visual perception); the overall circuit (PFC–basal ganglia interaction).
Reinforcement Learning: What You Should Know
MDPs: utilities, discounting; policy evaluation; Bellman's equation; value iteration; policy iteration.
Reinforcement learning: adaptive dynamic programming; TD learning (model-free); q-learning; function approximation.
Hierarchical Learning
Stratagus: Example of a Large RL Task (from Bhaskara Marthi's thesis, with Stuart Russell)
Stratagus is hard for reinforcement learning algorithms:
more than 10^100 states;
more than 10^30 actions at each point;
time horizon ≈ 10^4 steps.
Stratagus is hard for human programmers:
it typically takes game companies several person-months to write a computer opponent;
still, it is no match for experienced human players;
programming involves much trial and error.
Hierarchical RL: humans supply high-level prior knowledge using a partial program; the learning algorithm fills in the details.
Partial "Alisp" Program

(defun top ()
  (loop
    (choose
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))
Hierarchical RL
They then define a hierarchical q-function which learns a linear, feature-based mini-q-function at each choice point.
Very good at balancing resources and directing rewards to the right region.
Still not very good at the strategic elements of these kinds of games (i.e. the Markov-game aspect).
[DEMO]