Deep Reinforcement Learning: Learning how to act using a deep neural network
Psych 209, Winter 2019 (February 12, 2019)

How can we teach a neural network to act?
Direct ‘policy’ supervision, or imitation learning: provide the learner with a teaching signal at each step and backpropagate.
What problems might we encounter with this approach? What if the environment gives us no examples of the correct action, and we only get rewards when special events occur, e.g. we stumble onto a silver dollar? Computer games can be like this, and animals foraging in the wild may face the same problem.

Core intuition: increase the probability of actions that maximize the ‘expected discounted future reward’.
Central concepts: V(s) and Q(s,a)
Karpathy’s version: directly calculate the expected discounted future reward from gameplay rollouts.
The classical RL approach: base actions on value estimates directly, gradually updating the estimates of V and/or Q via the Bellman Equation.
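To make the ‘expected discounted future reward’ concrete, here is a minimal sketch (not from the slides) of the quantity the rollout approach computes: the discounted sum of future rewards at each step of an episode, which can then be used to weight a policy-gradient update. The function name and the example rewards are illustrative assumptions.

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """Discounted future return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for each step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example rollout: the reward only arrives at the final step (the "silver dollar").
rewards = [0.0, 0.0, 0.0, 1.0]
print(compute_returns(rewards))  # earlier actions still get credit, discounted by gamma
```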

The Bellman Equation
Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
V(s) = max_a Q(s,a) = max_a [ r(s,a) + γ V(s') ]
Problem 1: Value depends on the policy.
Problem 2: Value information may require exploration to obtain.
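A hedged sketch of how the Bellman equation above becomes an update rule (value iteration) on a tiny tabular MDP; the transition table and rewards below are made-up placeholders, not taken from the slides.

```python
import numpy as np

# Toy deterministic MDP: T[s][a] = (next_state, reward). Values are illustrative only.
T = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 0.0), 1: (2, 1.0)},
    2: {0: (2, 0.0), 1: (2, 0.0)},   # absorbing state
}
gamma = 0.9
V = np.zeros(len(T))

# Repeatedly apply V(s) <- max_a [ r(s,a) + gamma * V(s') ] until the values settle.
for _ in range(100):
    for s, actions in T.items():
        V[s] = max(r + gamma * V[s_next] for (s_next, r) in actions.values())

print(V)  # converges to roughly [0.9, 1.0, 0.0]
```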

Example: 2-D grid world
Discrete states, 4 possible actions.
Next state depends on the action (no change if we move into a wall).
A positive reward occurs when we stumble onto the silver dollar; a negative reward occurs when we fall into the black hole.
How can we learn about this? We must explore:
Softmax-based exploration
ε-greedy exploration
The estimated reward depends on the exploration policy!
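A sketch of the two exploration rules named above, assuming we already hold a table of Q estimates for the four actions; the epsilon and temperature values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values) / temperature
    probs = np.exp(prefs - prefs.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.1, 0.5, 0.2, 0.0]   # e.g. Q(s, a) for the four actions in the current state
print(epsilon_greedy(q), softmax_action(q))
```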

Discussion
Still staying with V and Q learning, how can we speed things up?

Q-learning as in Mnih et al.

Value learning based on a rehearsal buffer
Take an action.
Store the (state, action, reward, next state) tuple in the rehearsal buffer.
Sample a batch of tuples from the buffer and, for each one:
Use the stored (‘old’) policy parameters to estimate the Q value of the next state.
Calculate the loss from the difference between your current estimate and the target: the reward plus the discounted value of the next state under the old parameters.
Update your weights based on the loss over the batch.
If the buffer is full, discard the oldest item in the buffer.
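A framework-free sketch of the rehearsal-buffer loop described above, using a tabular Q function so the example stays self-contained; in the Mnih et al. work the Q function is a deep convolutional network and the ‘old parameters’ are a separate, periodically refreshed target network. All names and constants below are illustrative.

```python
import random
from collections import deque

GAMMA, ALPHA = 0.99, 0.1
N_ACTIONS, BATCH_SIZE = 4, 32
buffer = deque(maxlen=10_000)   # when full, the oldest tuples are discarded automatically
Q = {}        # current estimates: Q[(state, action)]
Q_old = {}    # stale copy of the estimates, standing in for the frozen "old parameters"

def q(table, s, a):
    return table.get((s, a), 0.0)

def store(s, a, r, s_next):
    """After taking an action, store the (state, action, reward, next state) tuple."""
    buffer.append((s, a, r, s_next))

def train_step():
    """Sample a batch of tuples and nudge Q toward the bootstrapped targets."""
    if len(buffer) < BATCH_SIZE:
        return
    for s, a, r, s_next in random.sample(buffer, BATCH_SIZE):
        # Target: reward plus discounted value of the next state under the old parameters.
        target = r + GAMMA * max(q(Q_old, s_next, b) for b in range(N_ACTIONS))
        # The loss is the gap between the target and the current estimate; this tabular
        # step stands in for a gradient update on network weights.
        Q[(s, a)] = q(Q, s, a) + ALPHA * (target - q(Q, s, a))

def sync_target():
    """Periodically refresh the stale copy (the 'old parameters')."""
    Q_old.clear()
    Q_old.update(Q)
```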

Advantage Actor Critic (A2C)
Estimate both a value function and a policy.
Use the value estimate of the next state, together with the reward, as the measure of advantage.
Use the value estimate of the next state to update the value estimate of the current state.
Solve the correlated-samples problem by using many independent actors simultaneously learning in the same environment.
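A sketch of the one-step advantage and value targets that A2C-style updates use, under the assumption of a short rollout from a single actor; the actor and critic networks and the parallel-actor machinery are omitted, and all numbers are arbitrary.

```python
import numpy as np

def a2c_targets(rewards, values, next_value, gamma=0.99):
    """One-step advantage A_t = r_t + gamma*V(s_{t+1}) - V(s_t) for a short rollout.

    `values` are the critic's estimates V(s_t); `next_value` bootstraps the final step.
    The actor is pushed toward actions with positive advantage; the critic is trained
    toward the bootstrapped returns.
    """
    values_next = np.append(values[1:], next_value)
    returns = np.asarray(rewards) + gamma * values_next      # critic targets
    advantages = returns - np.asarray(values)                # weights for the actor update
    return returns, advantages

# Example: a 3-step rollout from one actor, with reward only on the last step.
returns, advantages = a2c_targets(rewards=[0.0, 0.0, 1.0],
                                  values=[0.2, 0.3, 0.5],
                                  next_value=0.0)
print(returns, advantages)
```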