Reinforcement Learning (1)

Lirong Xia, Spring 2017

Last time
- Markov decision processes
- Computing the optimal policy
  - value iteration
  - policy iteration

Today: Reinforcement learning
- Still have an MDP:
  - A set of states
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for an optimal policy π*(s)
- New twist: we don't know T and/or R, but we can observe R
- Learning by doing: can have multiple episodes (trials)

Example: animal learning
- Studied experimentally for more than 60 years in psychology
- Rewards: food, pain, hunger, drugs, etc.

What can you do with RL?
- Stanford autonomous helicopter: http://heli.stanford.edu/

Choosing a treasure box
- First time playing this game
- Can only open one chest
- Let's cheat by googling a walkthrough
  - A: 1, 10, 10, 10, 1, 10, 10, 10, 10, …
  - B: 1, 1, 1, 100, 1, 100, 1, 1, 1, 1, 1, 1, …

Reinforcement learning methods
- Model-based learning
  - learn the model of the MDP (transition probabilities and rewards)
  - compute the optimal policy as if the learned model were correct
- Model-free learning
  - learn the optimal policy without explicitly learning the transition probabilities
  - Q-learning: learn the Q-values Q(s,a) directly

Choosing a treasure box
- Let's cheat by googling a walkthrough
  - A: 1, 10, 10, 10, 1, 10, 10, 10, 10, 10, …
  - B: 1, 1, 1, 100, 1, 100, 1, 1, 1, 1, …
- Model-based estimates:
  - p(A=1) = 2/10, p(A=10) = 8/10
  - p(B=1) = 8/10, p(B=100) = 2/10
- Model-free estimates:
  - E(A) = 8.2, E(B) = 20.8
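A tiny Python sketch of the two kinds of estimate on these samples (the reward lists are copied from the slide; everything else is illustrative). In this one-step example both routes boil down to counting, but the model-based route keeps an explicit distribution while the model-free route only keeps the average.

    # Minimal sketch: model-based vs. model-free estimates on the treasure-box samples.
    from collections import Counter

    A = [1, 10, 10, 10, 1, 10, 10, 10, 10, 10]
    B = [1, 1, 1, 100, 1, 100, 1, 1, 1, 1]

    def model_based(samples):
        # Estimate the reward distribution by counting, then compute the expectation.
        counts = Counter(samples)
        probs = {r: c / len(samples) for r, c in counts.items()}
        return probs, sum(r * p for r, p in probs.items())

    def model_free(samples):
        # Skip the model: average the observed rewards directly.
        return sum(samples) / len(samples)

    print(model_based(A))   # ({1: 0.2, 10: 0.8}, 8.2)
    print(model_free(B))    # 20.8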

Model-Based Learning
- Idea:
  - Learn the model empirically from trials (episodes)
  - Solve for values as if the learned model were correct
- Simple empirical model learning
  - Count outcomes for each (s,a)
  - Normalize to get an estimate of T(s,a,s')
  - Discover R(s,a,s') when we experience (s,a,s')
- Solving the MDP with the learned model
  - Iterative policy evaluation, for example

Example: Model-Based Learning
- Trials (γ = 1):
  - Trial 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)
  - Trial 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)
- Estimated transition model:
  - T(<3,3>, right, <4,3>) = 1/3
  - T(<2,3>, right, <3,3>) = 2/2
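Below is a minimal Python sketch of the counting step on the two trials above. The data layout and names are illustrative (not from the course code); rewards follow the R(s,a,s') convention used on the slides.

    # Minimal sketch of empirical model learning from the two trials above.
    from collections import defaultdict

    # Each trial is a list of (state, action, next_state, reward) transitions.
    trial1 = [((1,1),'up',(1,2),-1), ((1,2),'up',(1,2),-1), ((1,2),'up',(1,3),-1),
              ((1,3),'right',(2,3),-1), ((2,3),'right',(3,3),-1), ((3,3),'right',(3,2),-1),
              ((3,2),'up',(3,3),-1), ((3,3),'right',(4,3),-1), ((4,3),'exit',None,100)]
    trial2 = [((1,1),'up',(1,2),-1), ((1,2),'up',(1,3),-1), ((1,3),'right',(2,3),-1),
              ((2,3),'right',(3,3),-1), ((3,3),'right',(3,2),-1), ((3,2),'up',(4,2),-1),
              ((4,2),'exit',None,-100)]

    counts = defaultdict(lambda: defaultdict(int))   # counts[(s,a)][s'] = number of times seen
    rewards = {}                                     # rewards[(s,a,s')] = observed reward
    for s, a, s2, r in trial1 + trial2:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

    def T_hat(s, a, s2):
        # Normalize counts to estimate T(s,a,s').
        total = sum(counts[(s, a)].values())
        return counts[(s, a)][s2] / total

    print(T_hat((3,3), 'right', (4,3)))   # 0.333... i.e. 1/3
    print(T_hat((2,3), 'right', (3,3)))   # 1.0     i.e. 2/2

The estimated T_hat and rewards can then be handed to value iteration or iterative policy evaluation exactly as if they were the true model.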

Sample-Based Policy Evaluation
- Idea: approximate the expectation in policy evaluation with samples (drawn from an unknown T!)
- Almost works, but we cannot rewind time to get multiple samples from s in the same trial
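For reference, the update being approximated here is the standard policy-evaluation equation (reconstructed in the slides' notation, since the slide's own formula is not in the transcript):

    V^π_{k+1}(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]

and the sample-based version replaces the expectation over s' with an average of observed samples:

    V^π_{k+1}(s) ≈ (1/n) Σ_i [ R(s, π(s), s'_i) + γ V^π_k(s'_i) ]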

Temporal-Difference Learning
- Big idea: learn from every experience!
  - Update Vπ(s) each time we experience (s, a, s', R)
  - Likely successors s' will contribute updates more often
- Temporal-difference learning
  - Policy still fixed
  - Sample of V(s): sample = R(s, π(s), s') + γ Vπ(s')
  - Update: Vπ(s) ← (1-α) Vπ(s) + α · sample
  - Same as: Vπ(s) ← Vπ(s) + α (sample - Vπ(s))
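A minimal sketch of TD value learning for a fixed policy pi. The env.reset()/env.step(a) interface (returning next_state, reward, done) is an assumption for illustration, not a real API from the course code.

    # Minimal sketch of TD value learning under a fixed policy pi.
    from collections import defaultdict

    def td_policy_evaluation(env, pi, num_episodes=1000, alpha=0.1, gamma=1.0):
        V = defaultdict(float)                     # V[s], initialized to 0
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                s2, r, done = env.step(pi(s))      # act according to the fixed policy
                sample = r + gamma * (0.0 if done else V[s2])
                V[s] += alpha * (sample - V[s])    # exponential running average
                s = s2
        return V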

Exponential Moving Average
- Makes recent samples more important
- Forgets about the past (distant past values were wrong anyway)
- Easy to compute from the running average
- A decreasing learning rate can give converging averages
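For reference (a standard identity, not spelled out on the slide): with learning rate α, the running update

    x̄_n = (1-α) x̄_{n-1} + α x_n

gives the sample observed k steps ago a weight of α(1-α)^k, so older samples decay geometrically, which is exactly why recent samples matter more.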

Problems with TD Value Learning
- TD value learning does policy evaluation
- It is hard to turn values into a (new) policy without knowing T and R
- Idea: learn Q-values directly
  - Makes action selection model-free too!

Active Learning
- Full reinforcement learning. In this case:
  - You don't know the transitions T(s,a,s')
  - You don't know the rewards R(s,a,s')
  - You can choose any actions you like
  - Goal: learn the optimal policy
- In this case, the learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
    - exploration: try new actions
    - exploitation: focus on optimal actions based on the current estimates
  - This is NOT offline planning! You actually take actions in the world and see what happens…

Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
  - Start with V_0(s) = 0
  - Given V_i, calculate the values for all states for depth i+1 (update reconstructed below)
- But Q-values are more useful
  - Start with Q_0(s,a) = 0
  - Given Q_i, calculate the Q-values for all Q-states for depth i+1
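The two updates referred to on this slide are the standard ones, reconstructed here in the notation of the previous lecture:

    V_{i+1}(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
    Q_{i+1}(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q_i(s',a') ]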

Q-Learning
- Q-learning: sample-based Q-value iteration
- Learn Q*(s,a) values
  - Receive a sample (s,a,s') with reward R(s,a,s')
  - Consider your old estimate: Q(s,a)
  - Consider your new sample estimate: sample = R(s,a,s') + γ max_a' Q(s',a')
  - Incorporate the new estimate into a running average: Q(s,a) ← (1-α) Q(s,a) + α · sample
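A minimal sketch of tabular Q-learning, assuming the same illustrative env.reset()/env.step(a) interface as in the TD sketch above (not a real API); actions is the list of available actions.

    # Minimal sketch of tabular Q-learning with epsilon-greedy action selection.
    import random
    from collections import defaultdict

    def q_learning(env, actions, num_episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                              # Q[(s, a)], initialized to 0
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:
                    a = random.choice(actions)              # explore
                else:
                    a = max(actions, key=lambda a: Q[(s, a)])   # exploit current estimate
                s2, r, done = env.step(a)
                best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
                sample = r + gamma * best_next
                Q[(s, a)] += alpha * (sample - Q[(s, a)])   # running-average update
                s = s2
        return Q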

Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy
  - if you explore enough
  - if you make the learning rate small enough
  - …but do not decrease it too quickly!
- Basically, it doesn't matter how you select actions

Q-Learning
- Q-learning produces tables of Q-values, one entry per (state, action) pair

Exploration / Exploitation
- Several schemes for forcing exploration
  - Simplest: random actions (ε-greedy)
    - Every time step, flip a coin
    - With probability ε, act randomly
    - With probability 1-ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
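A small sketch of ε-greedy selection plus one common way to lower ε over time (geometric decay with a floor). Both the function names and the decay schedule are illustrative choices, not prescribed by the slides; Q is a dict keyed by (state, action) as in the Q-learning sketch above.

    # Minimal sketch of epsilon-greedy action selection with an optional decay schedule.
    import random

    def epsilon_greedy(Q, s, actions, epsilon):
        if random.random() < epsilon:
            return random.choice(actions)             # explore
        return max(actions, key=lambda a: Q[(s, a)])  # exploit the current estimate

    def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
        # One way to "lower epsilon over time": geometric decay with a floor.
        return max(eps_min, eps_start * decay ** episode)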

Exploration Functions
- When to explore
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function
  - Takes a value estimate and a count, and returns an optimistic utility (exact form not important; reconstructed below)
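The example formula is cut off in the transcript. A commonly used form (treat this reconstruction as an assumption about what the slide showed) is

    f(u, n) = u + k/n

where u is the current value estimate, n is the number of times the Q-state has been tried, and k is a constant. Plugged into Q-learning, the sample becomes

    sample = R(s,a,s') + γ max_a' f(Q(s',a'), N(s',a'))

so rarely tried actions receive an optimism bonus until their visit counts grow.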

Recap
- Model-based learning
  - use sampling to learn T and R
  - then compute the policy, e.g., by policy iteration
- Model-free learning
  - directly learn the Q-values Q(s,a)
  - Q-learning

Overview of Project 3
- MDPs
  - Q1: value iteration
  - Q2: find parameters that lead to a certain optimal policy
  - Q3: similar to Q2
- Q-learning
  - Q4: implement the Q-learning algorithm
  - Q5: implement ε-greedy action selection
  - Q6: try the algorithm
- Approximate Q-learning and state abstraction
  - Q7: Pacman
- Tips
  - make your implementation general