CS 188: Artificial Intelligence Spring 2007 Lecture 23: Reinforcement Learning: IV 4/19/2007 Srini Narayanan – ICSI and UC Berkeley

Announcements  Othello tournament rules up.  On-line readings for this week.

Combining DP and MC  [backup diagrams comparing DP, MC, and TD targets]

Model-Free Learning  Big idea: why bother learning T?  Update each time we experience a transition  Frequent outcomes will contribute more updates (over time)  Temporal difference learning (TD)  Policy still fixed!  Move values toward the value of whichever successor actually occurs
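As a rough illustration of this update (not part of the original slides), here is a minimal TD(0) sketch in Python; the state names, the default value of 0, and the parameter choices are assumptions:

    from collections import defaultdict

    values = defaultdict(float)   # V(s); unseen states default to 0

    def td_update(s, r, s_next, alpha=0.5, gamma=1.0):
        # Move V(s) a step of size alpha toward the sampled target r + gamma * V(s')
        sample = r + gamma * values[s_next]
        values[s] = (1 - alpha) * values[s] + alpha * sample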

Example: Passive TD  Take γ = 1, α = 0.5  Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)  Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
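To make the update concrete: assuming all values start at 0 and the exit transition leads to a terminal state of value 0 (assumptions, not stated on the slide), the first transition of the first episode gives

    V(1,1) \leftarrow (1-\alpha)\,V(1,1) + \alpha\,[\,r + \gamma V(1,2)\,] = 0.5 \cdot 0 + 0.5 \cdot (-1 + 1 \cdot 0) = -0.5

and the exit transition of the first episode gives V(4,3) \leftarrow 0.5 \cdot 0 + 0.5 \cdot 100 = 50.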

TD Learning Features  On-line, incremental  Bootstrapping (like DP, unlike MC)  Model-free  Converges, for any fixed policy, to the correct value of each state under that policy:  in expectation when α is small and constant  with probability 1 when α starts high and decreases over time (e.g. α_k = 1/k)

Driving Home  [figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]

Problems with TD Value Learning  TD value learning is model-free for policy evaluation  However, if we want to turn our value estimates into a policy, we're sunk: picking the best action requires a one-step lookahead, which needs the transition model T and rewards R  Idea: learn values of state-action pairs (Q-values) directly  Makes action selection model-free too!

Q-Functions  A Q-value is the value of a (state, action) pair under a policy π  Utility of starting in state s, taking action a, and then following π thereafter
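In the notation used elsewhere in the course (transition model T, reward R, discount \gamma; the exact rendering on the slide is not recoverable), this reads

    Q^{\pi}(s,a) = \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma\, V^{\pi}(s')\,]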

The Bellman Equations  The definition of utility leads to a simple relationship among optimal utility values: optimal value = maximize over the first action, then follow the optimal policy  Formally:
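The formula referred to here is presumably the standard Bellman optimality equation (written out as an assumption about the missing image):

    V^*(s) = \max_a \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma\, V^*(s')\,]

    Q^*(s,a) = \sum_{s'} T(s,a,s')\,[\,R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\,]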

Q-Learning  Learn Q*(s,a) values  Receive a sample (s,a,s’,r)  Consider your old estimate:  Consider your new sample estimate:  Nudge the old estimate towards the new sample:

Recap: Q-Learning  Learn Q*(s,a) values from samples  Receive a sample (s,a,s’,r)  On one hand: old estimate of return:  But now we have a new estimate for this sample:  Nudge the old estimate towards the new sample:  Equivalently, average samples over time:
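Rearranging the same update shows the running-average view mentioned above: each sample pulls the estimate toward itself by a fraction \alpha, i.e. an exponential moving average of the samples:

    Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]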

Q-Learning

Exploration / Exploitation  Several schemes for forcing exploration  Simplest: random actions (ε-greedy)  Every time step, flip a coin  With probability ε, act randomly  With probability 1 − ε, act according to the current policy (e.g. the action with the highest Q-value)  Problems with random actions?  You do explore the space, but you keep thrashing around once learning is done  One solution: lower ε over time  Another solution: exploration functions
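A minimal ε-greedy action-selection sketch in Python (the Q-table representation and the action list are assumptions for illustration):

    import random

    def epsilon_greedy(q, state, actions, epsilon=0.1):
        # q: dict mapping (state, action) -> estimated Q-value
        if random.random() < epsilon:
            return random.choice(actions)                           # explore
        return max(actions, key=lambda a: q.get((state, a), 0.0))  # exploit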

Q-Learning  Q-learning produces tables of q-values:

Demo of Q-Learning  Demo: arm control  Parameters:  learning rate  discount factor (high values weight future rewards more)  exploration rate (should decrease over time)  MDP:  Reward = number of pixels moved to the right per iteration  Actions: arm up and down (yellow line), hand up and down (red line)

Q-Learning Properties  Will converge to the optimal policy  if you explore enough  if you make the learning rate small enough (one standard condition is sketched below)  Neat property (off-policy learning): the Q-values learned are those of the underlying MDP; Q-learning does not learn a policy adjusted to be optimal under the action-selection noise
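One standard way (not spelled out on the slide) to make "small enough" precise is the usual stochastic-approximation condition on the learning-rate schedule \alpha_k:

    \sum_k \alpha_k = \infty, \qquad \sum_k \alpha_k^2 < \infty \quad (\text{e.g. } \alpha_k = 1/k)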

Exploration Functions  When to explore?  Random actions: explore a fixed amount  Better idea: explore areas whose badness is not (yet) established  Exploration function  Takes a value estimate and a visit count, and returns an optimistic utility, e.g. the form sketched below (the exact form is not important)
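One illustrative choice (an assumption; the slide deliberately leaves the exact form open) adds an optimism bonus that shrinks with the visit count n:

    f(u, n) = u + \frac{k}{n + 1}

The Q-update then uses \max_{a'} f(Q(s',a'),\, N(s',a')) in place of \max_{a'} Q(s',a'), so rarely tried actions look attractive until they have actually been explored.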

Q-Learning  In realistic situations, we cannot possibly learn about every single state!  Too many states to visit them all in training  Too many states to even hold the q-tables in memory  Instead, we want to generalize:  Learn about some small number of training states from experience  Generalize that experience to new, similar states  This is a fundamental idea in machine learning  Clustering, classification (unsupervised, supervised), non- parametric learning

Evaluation Functions  A function which scores non-terminal positions  Ideal function: returns the true utility of the position  In practice: typically a weighted linear sum of features, e.g. f_1(s) = (num white queens − num black queens), etc.
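Written out, this is the standard linear form (weights w_i assumed):

    \text{Eval}(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s)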

Function Approximation  Problem: inefficient to learn each state’s utility (or eval function) one by one  Solution: what we learn about one state (or position) should generalize to similar states  Very much like supervised learning  If states are treated entirely independently, we can only learn on very small state spaces

Linear Value Functions  Another option: values are linear functions of features of states (or state-action pairs)  Good if you can describe states well using a few features (e.g. board evaluations for game playing)  Now we only have to learn a few weights rather than a value for each state
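Concretely (notation assumed to match the feature-based evaluation functions above):

    V(s) = \sum_i w_i\, f_i(s), \qquad Q(s,a) = \sum_i w_i\, f_i(s,a)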

TD Updates for Linear Qs  Can use TD learning with linear Q-functions  (Actually, it's just like the perceptron!)  Old Q-learning update:  With a linear Q-function, we simply update the weights of the features of Q(s,a)
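A sketch of the resulting weight update in Python (the feature representation and data structures are illustrative assumptions, not the course's code):

    from collections import defaultdict

    weights = defaultdict(float)   # feature name -> weight

    def q_value(feat_sa):
        # Linear Q(s,a): dot product of weights and active feature values
        return sum(weights[f] * v for f, v in feat_sa.items())

    def linear_q_update(feat_sa, r, max_q_next, alpha=0.1, gamma=0.9):
        # feat_sa: dict mapping feature name -> value for the pair (s, a)
        difference = (r + gamma * max_q_next) - q_value(feat_sa)
        for f, v in feat_sa.items():
            weights[f] += alpha * difference * v   # perceptron-style correction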

Generalization of Q-Functions  Non-linear Q-functions are needed for more complex spaces; such functions can be learned using:  Multi-layer perceptrons (e.g. TD-Gammon)  Support vector machines  Non-parametric methods

Demo: Learning walking controllers  (From Stanford AI Lab)

Policy Search

 Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best  E.g. the value functions from the gridworld were probably bad estimates of future rewards, but they could still produce good decisions  Solution: learn the policy that maximizes rewards rather than the value that predicts rewards  This is the idea behind policy search, such as what controlled the upside-down helicopter

Policy Search  Simplest policy search:  Start with an initial linear value function or Q-function  Nudge each feature weight up and down and see if your policy is better than before (see the sketch below)  Problems:  How do we tell whether the policy got better?  We need to run many sample episodes!  If there are a lot of features, this can be impractical
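A rough sketch of that naive search loop (evaluate_policy, the perturbation size, and the weight representation are all illustrative assumptions):

    def naive_policy_search(weights, evaluate_policy, step=0.05):
        # evaluate_policy(weights) -> average return over many sample episodes
        best_score = evaluate_policy(weights)
        for f in list(weights):
            for delta in (+step, -step):
                trial = dict(weights)
                trial[f] += delta                  # nudge one feature weight
                score = evaluate_policy(trial)     # this requires many rollouts!
                if score > best_score:
                    weights, best_score = trial, score
        return weights, best_score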

Policy Search*  Advanced policy search:  Write a stochastic (soft) policy:  Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don’t have to know them)  Take uphill steps, recalculate derivatives, etc.
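The soft policy referred to above is presumably a softmax over a linear score (written here as an assumption, consistent with the linear features used earlier):

    \pi_w(a \mid s) = \frac{e^{\,w \cdot f(s,a)}}{\sum_{a'} e^{\,w \cdot f(s,a')}}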

Helicopter Control (Andrew Ng)

Neural Correlates of RL  Parkinson's disease: motor control + initiation?  Intracranial self-stimulation, drug addiction, natural rewards: reward pathway? Learning?  Also involved in:  working memory  novel situations  ADHD  schizophrenia  …

Conditioning (Ivan Pavlov)  [diagram: conditional stimulus, unconditional stimulus, unconditional response (reflex), conditional response (reflex)]

Dopamine Levels Track RL Signals  Unpredicted reward (unlearned / no stimulus)  Predicted reward (learned task)  Omitted reward (probe trial)  (Montague et al. 1996)

Current Hypothesis Phasic dopamine encodes a reward prediction error  Precise (normative!) theory for generation of DA firing patterns  Compelling account for the role of DA in classical conditioning: prediction error acts as signal driving learning in prediction areas  Evidence  Monkey single cell recordings  Human fMRI studies  Current Research  Better information processing model  Other reward/punishment circuits including Amygdala (for visual perception)  Overall circuit (PFC-Basal Ganglia interaction)

Reinforcement Learning  What you should know  MDPs  Utilities, discounting  Policy Evaluation  Bellman’s equation  Value iteration  Policy iteration  Reinforcement Learning  Adaptive Dynamic Programming  TD learning (Model-free)  Q Learning  Function Approximation

Hierarchical Learning

Hierarchical RL  Stratagus: an example of a large RL task, from Bhaskara Marthi's thesis (with Stuart Russell)  Stratagus is hard for reinforcement learning algorithms:  a very large number of states  a very large number of actions at each point  time horizon ≈ 10^4 steps  Stratagus is hard for human programmers:  it typically takes game companies several person-months to write a computer opponent  the result is still no match for experienced human players  programming involves much trial and error  Hierarchical RL:  humans supply high-level prior knowledge using a partial program  the learning algorithm fills in the details

Partial "Alisp" Program

(defun top ()
  (loop (choose (gather-wood) (gather-gold))))

(defun gather-wood ()
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))

Hierarchical RL  They then define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point  Very good at balancing resources and directing rewards to the right region  Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect) [DEMO]