Warmup: Which one would you try next?
A: Tries: 1000, Winnings: 900
B: Tries: 100, Winnings: 90
C: Tries: 5, Winnings: 4
D: Tries: 100, Winnings: 0

CS 188: Artificial Intelligence
Reinforcement Learning II
Instructors: Stuart Russell and Patrick Virtue
University of California, Berkeley

Reminder: Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model P(s'|a,s)
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: we don't know P or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try actions and explore new states -- to boldly go where no Pacman agent has been before

Approaches to reinforcement learning
1. Learn the model, solve it, execute the solution
2. Learn values from experiences
   a. Direct utility estimation
      - Add up the rewards obtained from each state onwards (see the sketch after this list)
   b. Temporal difference (TD) learning
      - Adjust V(s) to agree better with R(s,a,s') + γV(s')
      - Still need a transition model to make decisions
   c. Q-learning
      - Like TD learning, but learn Q(s,a) instead
      - Sufficient to make decisions without a model!
- Important idea: use samples to approximate the weighted sum over next-state values
- Big missing piece: how can we scale this up to large state spaces like Tetris (~2^200 states)?
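A minimal sketch of direct utility estimation (option 2a), not from the slides: it averages the observed discounted returns from each state over complete episodes. The episode format, a list of (state, action, reward) triples, is an assumption made for this example.

    from collections import defaultdict

    def direct_utility_estimation(episodes, gamma=0.9):
        # Estimate V(s) as the average discounted return observed from s onwards.
        returns = defaultdict(list)
        for episode in episodes:          # episode: [(state, action, reward), ...]
            G = 0.0
            # Walk backwards so G is the return from each state onwards.
            for state, action, reward in reversed(episode):
                G = reward + gamma * G
                returns[state].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}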

TD learning and Q-learning
- Approximate version of the value iteration update gives TD learning:
  V^π_{k+1}(s) ← Σ_s' P(s'|π(s),s) [R(s,π(s),s') + γ V^π_k(s')]
  V^π(s) ← (1-α) V^π(s) + α [R(s,π(s),s') + γ V^π(s')]
- Approximate version of the Q iteration update gives Q-learning:
  Q_{k+1}(s,a) ← Σ_s' P(s'|a,s) [R(s,a,s') + γ max_a' Q_k(s',a')]
  Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a')]
- We obtain a policy from the learned Q, with no model!
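A minimal tabular sketch of the two sample-based updates above (not from the slides; the dictionary representation with a default value of 0 is an assumption):

    def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        # TD learning: nudge V(s) toward the sampled target r + gamma * V(s').
        V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0))

    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # Q-learning: the target uses the best next action, so no model is needed.
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)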

Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
- This is called off-policy learning
- Technical conditions for convergence:
  - Explore enough: eventually try every state-action pair infinitely often
  - Decay the learning rate α properly: Σ_t α_t = ∞ and Σ_t α_t² < ∞
    - α_t = O(1/t) meets these conditions

Demo Q-Learning Auto Cliff Grid

Exploration vs. Exploitation

Exploration vs. exploitation
- Exploration: try new things
- Exploitation: do what's best given what you've learned so far
- Key point: pure exploitation often gets stuck in a rut and never finds an optimal policy!

Exploration method 1: ε-greedy
- ε-greedy exploration
  - Every time step, flip a biased coin
  - With (small) probability ε, act randomly
  - With (large) probability 1-ε, act on the current policy
- Properties of ε-greedy exploration
  - Every (s,a) pair is tried infinitely often
  - Does a lot of stupid things, e.g., jumping off a cliff lots of times to make sure it hurts
  - Keeps doing stupid things forever
    - Fix: decay ε towards 0
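A minimal sketch of ε-greedy action selection with a decaying ε (not from the slides; the Q-table lookup and the particular decay schedule are assumptions):

    import random

    def epsilon_greedy(Q, s, actions, epsilon):
        # With probability epsilon act randomly, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    def epsilon_schedule(t, eps0=1.0, decay=0.001):
        # One way to decay epsilon towards 0 as the step count t grows.
        return eps0 / (1.0 + decay * t)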

Demo Q-learning – Epsilon-Greedy – Crawler

Sensible exploration: Bandits
- Which one-armed bandit to try next?
  A: Tries: 1000, Winnings: 900
  B: Tries: 100, Winnings: 90
  C: Tries: 5, Winnings: 4
  D: Tries: 100, Winnings: 0
- Most people would choose C > B > A > D
- Basic intuition: higher mean is better; more uncertainty is better
- Gittins (1979): rank arms by an index that depends only on the arm itself
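To make the "higher mean, more uncertainty" intuition concrete, here is a sketch using the UCB1 index -- a simpler optimism-based index than the Gittins index, added here as an illustration rather than taken from the slides. Treating each try as a win-or-lose pull, it reproduces the C > B > A > D ordering for the arms above.

    import math

    arms = {"A": (1000, 900), "B": (100, 90), "C": (5, 4), "D": (100, 0)}
    total = sum(n for n, _ in arms.values())

    def ucb1(n, wins, N):
        # Mean payoff plus an uncertainty bonus that shrinks as the arm is tried more.
        return wins / n + math.sqrt(2 * math.log(N) / n)

    ranking = sorted(arms, key=lambda name: -ucb1(*arms[name], total))
    print(ranking)   # ['C', 'B', 'A', 'D']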

Exploration Functions
- Exploration functions implement this tradeoff
  - Take a value estimate u and a visit count n, and return an optimistic utility, e.g., f(u,n) = u + k/√n
- Regular Q-update:
  Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a')]
- Modified Q-update:
  Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' f(Q(s',a'), n(s',a'))]
- Note: this propagates the "bonus" back to states that lead to unknown states as well!
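A minimal sketch of the modified update with an exploration function (not from the slides; the Q and visit-count tables, the constant k, and the handling of unvisited pairs are assumptions):

    import math

    def explore_f(u, n, k=1.0):
        # Optimistic utility; n + 1 avoids division by zero for unvisited pairs
        # (the slide's f(u,n) uses k/sqrt(n)).
        return u + k / math.sqrt(n + 1)

    def q_update_explore(Q, N, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        N[(s, a)] = N.get((s, a), 0) + 1
        target = r + gamma * max(
            explore_f(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0)) for a2 in actions
        )
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target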

Demo Q-learning – Exploration Function – Crawler

Optimality and exploration
[Plot: total reward per trial vs. number of trials for fixed ε-greedy, decaying ε-greedy, and an exploration function, compared against the optimal level; the shortfall relative to optimal is the regret.]

Regret
- Regret measures the total cost of your youthful errors: the reward given up while exploring and learning, relative to behaving optimally from the start
- Minimizing regret goes beyond learning to be optimal -- it requires optimally learning to be optimal

Approximate Q-Learning

Generalizing Across States
- Basic Q-learning keeps a table of all Q-values
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the Q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar situations
  - Can we apply some machine learning tools to do this? [demo -- RL Pacman]

Example: Pacman
- Let's say we discover through experience that this state is bad:
- In naïve Q-learning, we know nothing about this state:
- Or even this one!

Demo Q-Learning Pacman – Tiny – Watch All

Demo Q-Learning Pacman – Tiny – Silent Train

Demo Q-Learning Pacman – Tricky – Watch All

Break

Feature-Based Representations
- Solution: describe a state using a vector of features (properties)
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
- Example features:
  - Distance to the closest ghost, f_GST
  - Distance to the closest dot
  - Number of ghosts
  - 1 / (distance to the closest dot), f_DOT
  - Is Pacman in a tunnel? (0/1)
  - ...etc.
- Can also describe a Q-state (s,a) with features (e.g., action moves closer to food)

Linear Value Functions
- We can express V and Q (approximately) as weighted linear functions of feature values:
  V_w(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
  Q_w(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
- Important: depending on the features used, the best possible approximation may be terrible!
- But in practice we can compress a value function for chess (10^43 states) down to about 30 weights and get decent play!
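A minimal sketch of evaluating such a linear Q-function, with weights and features stored by name (not from the slides; the example numbers are taken from the Q-Pacman slide further below):

    def q_value(w, feats):
        # Q_w(s,a) = sum_i w_i * f_i(s,a); feats maps feature name -> f_i(s,a).
        return sum(w.get(name, 0.0) * value for name, value in feats.items())

    w = {"f_DOT": 4.0, "f_GST": -1.0}        # weights
    feats = {"f_DOT": 0.5, "f_GST": 1.0}     # features of one Q-state (s, NORTH)
    print(q_value(w, feats))                 # 4.0*0.5 - 1.0*1.0 = +1.0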

Updating a linear value function
- The original Q-learning rule tries to reduce the prediction error at (s,a):
  Q(s,a) ← Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a') - Q(s,a)]
- Instead, we update the weights to try to reduce the error at (s,a):
  w_i ← w_i + α [R(s,a,s') + γ max_a' Q(s',a') - Q(s,a)] ∂Q_w(s,a)/∂w_i
      = w_i + α [R(s,a,s') + γ max_a' Q(s',a') - Q(s,a)] f_i(s,a)
- Qualitative justification:
  - Pleasant surprise: increase weights on positive features, decrease on negative ones
  - Unpleasant surprise: decrease weights on positive features, increase on negative ones
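A minimal sketch of this weight update for the linear Q-function above (not from the slides; the dictionary-of-weights representation and the default learning rate are assumptions):

    def linear_q_update(w, feats_sa, r, q_sa, max_q_next, alpha=0.1, gamma=0.9):
        # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
        difference = (r + gamma * max_q_next) - q_sa
        for name, value in feats_sa.items():
            # w_i <- w_i + alpha * difference * f_i(s,a)
            w[name] = w.get(name, 0.0) + alpha * difference * value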

Example: Q-Pacman
- Before the update: Q(s,a) = 4.0 f_DOT(s,a) - 1.0 f_GST(s,a)
- Features of the current state s with a = NORTH: f_DOT(s,NORTH) = 0.5, f_GST(s,NORTH) = 1.0
  - So Q(s,NORTH) = 4.0(0.5) - 1.0(1.0) = +1
- Observed transition: a = NORTH, r = -500, and Q(s',·) = 0
  - r + γ max_a' Q(s',a') = -500, so difference = -500 - (+1) = -501
- Weight updates: w_DOT ← 4.0 + α[-501](0.5), w_GST ← -1.0 + α[-501](1.0)
- After the update: Q(s,a) = 3.0 f_DOT(s,a) - 3.0 f_GST(s,a)
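These numbers work out if the learning rate is α = 0.004; α is not stated on the slide, so it is inferred here from the before/after weights. A quick check:

    alpha = 0.004                         # assumed; inferred, not given on the slide
    w_dot, w_gst = 4.0, -1.0              # weights before the update
    f_dot, f_gst = 0.5, 1.0               # features of (s, NORTH)
    q_sa = w_dot * f_dot + w_gst * f_gst  # = +1
    difference = (-500 + 0) - q_sa        # r + gamma * max_a' Q(s',a') - Q(s,a) = -501
    print(w_dot + alpha * difference * f_dot)   # ~3.0
    print(w_gst + alpha * difference * f_gst)   # ~-3.0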

Demo Approximate Q-Learning -- Pacman

Convergence*
- Let V_L be the closest linear approximation to V*
- TD learning with a linear function approximator converges to some V that is pretty close to V_L
- Q-learning with a linear function approximator may diverge
- With much more complicated update rules, stronger convergence results can be proved -- even for nonlinear function approximators such as neural nets

Nonlinear function approximators
- We can still use the gradient-based update for any Q_w:
  w_i ← w_i + α [R(s,a,s') + γ max_a' Q(s',a') - Q(s,a)] ∂Q_w(s,a)/∂w_i
- Neural network error back-propagation already does this!
- Maybe we can get much better V or Q approximators using a complicated neural net instead of a linear function
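A minimal sketch of the same update with a small neural-network Q_w, letting back-propagation supply ∂Q_w(s,a)/∂w_i (an illustration only, not the slides' code; the PyTorch dependency, network shape, and squared-error loss are assumptions):

    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        def __init__(self, n_features, n_actions):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))

        def forward(self, x):
            return self.net(x)

    def q_step(qnet, opt, s, a, r, s_next, gamma=0.99):
        # Target r + gamma * max_a' Q(s',a'), held fixed (no gradient through it).
        with torch.no_grad():
            target = r + gamma * qnet(s_next).max()
        error = target - qnet(s)[a]
        loss = error ** 2                 # squared TD error
        opt.zero_grad()
        loss.backward()                   # backprop computes the gradient w.r.t. every weight
        opt.step()                        # e.g., opt = torch.optim.SGD(qnet.parameters(), lr=1e-3)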

Backgammon

TD-Gammon
- 4-ply lookahead using V(s) trained from 1,500,000 games of self-play
- 3 hidden layers, ~100 units each
- Input: contents of each location plus several handcrafted features
- Experimental results:
  - Plays approximately at parity with the world champion
  - Led to radical changes in the way humans play backgammon

DeepMind DQN
- Used a deep learning network to represent Q:
  - Input is the last 4 images (84x84 pixel values) plus the score
- 49 Atari games, incl. Breakout, Space Invaders, Seaquest, Enduro

RL and dopamine
- Dopamine signal generated by parts of the striatum
- Encodes the prediction error in the value function (as in TD learning)
