Deep Reinforcement Learning

Deep Reinforcement Learning – On the road to Skynet!
UW CSE Deep Learning – Felix Leeb

Overview

Today:
- MDPs – formalizing decisions
- Function Approximation
- Value Functions – DQN
- Policy Gradients – REINFORCE, NPG
- Actor Critic – A3C, DDPG

Next time:
- Model Based RL – forward/inverse
- Planning – MCTS, MPPI
- Imitation Learning – DAgger, GAIL
- Advanced Topics – Exploration, MARL, Meta-learning, LMDPs…

Paradigms and Objectives
- Supervised Learning – Classification, Regression
- Unsupervised Learning – Inference, Generation
- Reinforcement Learning – Prediction, Control

In reinforcement learning the goal is to learn a policy, which gives us the action to take given the current state (or observation).

Prediction vs. Control
Prediction: finding the likely output 𝑦 given the input 𝑥.
Control is a little different: now we have to find a control 𝑢 given only the observation (and a reward signal).
So can we use any of the tricks we learned for prediction in control?

Setting
The agent interacts with the environment: it receives a state/observation and a reward, and takes an action using its policy.

Markov Decision Processes
An MDP is defined by a state space, an action space, a transition function, and a reward function.
- Markov – only the previous state matters
- Decision – the agent takes actions, and those decisions have consequences
- Process – there is some transition function
The transition function is sometimes called the dynamics of the system. The reward function can in general depend on both the state and the action, but often it is only related to the state.
Goal: maximize overall reward.
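To make these pieces concrete, here is a minimal sketch of a tabular MDP as plain arrays; the tiny 3-state example and all of its numbers are made up purely for illustration:

```python
import numpy as np

# A tiny tabular MDP: 3 states, 2 actions (all numbers are illustrative).
n_states, n_actions = 3, 2

# Transition function P[s, a, s2] = probability of landing in s2 after taking a in s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0          # action 0 in state 0 moves deterministically to state 1
P[0, 1, 2] = 1.0
P[1, :, 2] = 1.0          # both actions in state 1 lead to state 2
P[2, :, 2] = 1.0          # state 2 is absorbing

# Reward function R[s, a]; here the reward depends on the state and the action.
R = np.zeros((n_states, n_actions))
R[1, 0] = 1.0
R[0, 1] = 0.5

gamma = 0.9               # discount factor (next slide)
```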

Discount Factor
Return: $G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}$
- We want to be greedy but not impulsive.
- Implicitly takes uncertainty in the dynamics into account.
- Mathematically, γ < 1 allows infinite-horizon returns (the sum stays finite).
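As a quick illustration (not from the slides), the discounted return of a finite reward sequence can be computed by folding from the back:

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum_k gamma^k * r_k for a finite trajectory of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9**3 * 10.0 = 8.29
```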

Solving an MDP
Objective: the expected discounted return, $J(\pi) = \mathbb{E}_\pi\!\left[\sum_t \gamma^t r_t\right]$.
Goal: find the optimal policy $\pi^* = \arg\max_\pi J(\pi)$.

Value Functions
- Value = expected gain (return) of a state: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$
- Q function – action-specific value function: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$
- Advantage function – how much more valuable an action is: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
Value depends on future rewards → it depends on the policy.

Tabular Solution: Policy Iteration
Alternate between:
- Policy Evaluation – compute the value of the current policy
- Policy Update – make the policy greedy with respect to those values
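A sketch of tabular policy iteration over an MDP given as arrays P[s, a, s'] and R[s, a] (same layout as the toy MDP sketch above); the two steps mirror the slide, the implementation details are my own:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=200):
    """P[s, a, s2]: transition probabilities, R[s, a]: rewards."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation (iterative): V(s) <- R(s, pi(s)) + gamma * E[V(s')]
        V = np.zeros(n_states)
        for _ in range(eval_iters):
            V = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                          for s in range(n_states)])
        # Policy update: act greedily with respect to a one-step lookahead
        Q = R + gamma * P @ V                        # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # converged: policy is stable
            return policy, V
        policy = new_policy

# Usage with the toy MDP above: policy, V = policy_iteration(P, R, gamma)
```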

Q Learning
Learn the Q function without knowing the transition function, from sampled transitions (s, a, r, s'):
$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
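A minimal sketch of the tabular Q-learning update from a single sampled transition; the function name and hyperparameters are illustrative:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update from an observed transition (s, a, r, s')."""
    td_target = r + gamma * Q[s_next].max()      # bootstrap with the max over next actions
    Q[s, a] += alpha * (td_target - Q[s, a])     # move Q(s, a) toward the target
    return Q

# Usage: roll out an epsilon-greedy policy and apply the update to each transition.
Q = np.zeros((3, 2))                             # n_states x n_actions
Q = q_learning_update(Q, s=0, a=0, r=0.0, s_next=1)
```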

Function Approximation
Model: a parameterized function $Q_\theta(s, a)$, e.g. a neural network – this is where the "deep" in deep RL comes in.
Training data: off-policy transitions (s, a, r, s').
Allows continuous state spaces.
Loss function: $L(\theta) = \left( r + \gamma \max_{a'} \bar{Q}(s', a') - Q_\theta(s, a) \right)^2$, where $\bar{Q}$ is an older (target) copy of the Q function.
What's with this other Q? At the beginning of training our Q function will be really bad, so the updates will be bad, but each update moves in the right direction, so overall we're moving in the right direction.
Taking the derivative of the loss with respect to Q gives the Q-learning update, which shows that the MSE loss on the parameters is equivalent to the tabular update of the Q values.
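A sketch of this loss in PyTorch, assuming a small fully connected Q network and a frozen target copy (all names and sizes are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target.load_state_dict(q_net.state_dict())     # frozen copy, synced periodically

def dqn_loss(batch):
    s, a, r, s_next, done = batch                # tensors sampled from a replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q_theta(s, a)
    with torch.no_grad():                                          # no gradient through the target
        target = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```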

Implementation
- Action-in vs. action-out: the network can take the action as an input, or output one Q value per action.
- Off-Policy Learning: the target depends in part on our model → old observations are still useful.
- Use a Replay Buffer of the most recent transitions as the dataset (off-policy data).
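A minimal replay buffer sketch (capacity and interface are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of the most recent transitions (s, a, r, s', done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # old transitions fall off the back

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```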

Deep Q Networks (DQN) – Mnih et al. (2015)

DQN Issues
Convergence is not guaranteed – hope for deep magic! Tricks that help in practice:
- Replay buffer – prioritizing recent transitions
- Error clipping
- Reward scaling
- Double Q Learning – decouple action selection and value estimation by using separate target and training Q networks
Sample complexity is not great – we are training a deep CNN through RL.
Continuous action spaces are essentially impossible.
This is all really annoying.
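To illustrate the decoupling idea, here is a sketch of the Double DQN target, where the training network selects the action and the target network evaluates it (names follow the earlier loss sketch and are my own):

```python
import torch

def double_q_target(q_net, q_target, r, s_next, done, gamma=0.99):
    """Double DQN: select the argmax action with the online net, evaluate it with the target net."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)       # action selection
        q_next = q_target(s_next).gather(1, best_a).squeeze(1)    # value estimation
        return r + gamma * (1 - done) * q_next
```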

Policy Gradients
Parameterize the policy and update those parameters directly. Any model that can be trained could be a policy.
- Enables new kinds of policies: stochastic policies, continuous action spaces.
- On-policy learning → learn directly from your own actions.
Do we even have to bother with a value function?

Policy Gradients
(Note: we're going to be a little hand-wavy with the notation.)
Approximate the expectation over trajectories from samples – essentially importance sampling:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \, G_t^i$
There is no guarantee of finding a global optimum.

REINFORCE – Sutton et al. (2000)
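A minimal PyTorch sketch of a REINFORCE-style update (the network, optimizer, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """states: (T, obs_dim), actions: (T,) int64, returns: (T,) discounted returns-to-go."""
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs * returns).mean()        # maximize E[log pi(a|s) * G_t]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```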

Variance Reduction
Constant offsets in the return make it harder to identify the right direction. Remove the offset by subtracting the a priori value of each state: use the value function as a baseline to reduce variance.
It turns out standard gradient descent is not necessarily the direction of steepest descent for stochastic function optimization – consider natural gradients.
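A sketch of a baselined policy-gradient loss where a learned value estimate is subtracted from the returns, turning the weights into advantages (a continuation of the REINFORCE sketch; the names are mine):

```python
import torch

def baselined_pg_loss(log_probs, returns, values):
    """log_probs, returns, values: (T,) tensors; values come from a learned V(s) baseline."""
    advantages = returns - values.detach()         # don't backprop the PG loss into the baseline
    pg_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()  # fit the baseline by regression on returns
    return pg_loss + 0.5 * value_loss
```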

Advanced Policy Gradient Methods
For stochastic functions, the gradient is not the best direction – measure how much the policy changes using the KL divergence between the old and new policies.
- Natural Policy Gradients (NPG) – approximate the Fisher information matrix and use it to precondition the gradient.
- TRPO – adjust the gradient subject to a KL-divergence constraint.
- PPO – take a step penalized directly by the KL divergence.
NPG → TRPO → PPO
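A sketch of a PPO-style surrogate with a KL penalty (in practice the clipped variant is more common; the coefficient and the sample-based KL estimate below are illustrative choices):

```python
import torch

def ppo_kl_penalty_loss(new_log_probs, old_log_probs, advantages, beta=0.01):
    """Surrogate objective: importance-weighted advantage minus a KL penalty to the old policy."""
    ratio = torch.exp(new_log_probs - old_log_probs)       # pi_new(a|s) / pi_old(a|s)
    surrogate = ratio * advantages
    approx_kl = (old_log_probs - new_log_probs).mean()     # simple sample-based KL estimate
    return -(surrogate.mean() - beta * approx_kl)
```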

Advanced Policy Gradient Methods – Rajeswaran et al. (2017); Heess et al. (2017)

Actor Critic
- Critic – estimates the advantage, trained with a Q-learning (value) update.
- Actor – proposes actions, trained with a policy gradient update.
We get the convergence of policy gradients and the sample complexity of Q learning.
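A minimal one-step actor-critic update sketch in PyTorch (module names, sizes, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def actor_critic_step(s, a, r, s_next, done):
    """One-step TD actor-critic update from a batch of transitions."""
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        td_target = r + gamma * (1 - done) * critic(s_next).squeeze(-1)
    advantage = (td_target - v).detach()                    # critic's advantage estimate
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    loss = -(log_prob * advantage).mean() + (td_target - v).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```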

Asynchronous Advantage Actor-Critic (A3C) – Mnih et al. (2016)
- Async – parallelizes updates across workers
- Uses the advantage function
- REINFORCE-style updates to the policy

DDPG (Deep Deterministic Policy Gradient)
Off-policy learning using deterministic policy gradients, for continuous control.
Max Ferguson (2017)
- Replay buffer
- EMA (soft update) between target and training networks for stability
- Exploration noise added to the actions
- Batch normalization
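A sketch of the EMA (soft) target-network update used for stability (the value of tau is illustrative):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Polyak/EMA update: target <- tau * online + (1 - tau) * target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```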