Continuous Control with Prioritized Experience Replay

Continuous Control with Prioritized Experience Replay. Team GnaRLy: Joseph Simonian, Daniel Sochor

Problem Statement Deep Q-Networks (DQNs) are effective in high-dimensional observation spaces, as evidenced by their performance on tasks such as Atari. Unfortunately, DQNs can only handle discrete, low-dimensional action spaces, since they rely on searching the action space for the action that maximizes a value function; many tasks, such as physical control, have continuous and high-dimensional action spaces. The Deep Deterministic Policy Gradient (DDPG) algorithm extends the ideas underlying the success of DQNs to continuous action spaces. We explore whether recent improvements to DQNs have similar effects when applied to DDPG. We measure our model's performance by evaluating the average score it achieves on various tasks involving continuous action spaces.

Data Source To train and evaluate our model, we used a variety of physical environments from the MuJoCo module of OpenAI Gym, focusing on environments with continuous action spaces. For these physical tasks, we experimented on low-dimensional state data that did not require preprocessing. Gym provides a simple API for extracting an environment's state and the reward achieved by the model's actions. Note: benchmarks for the MuJoCo tasks are available in the DeepMind paper. (MuJoCo stands for Multi-Joint dynamics with Contact.)
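For concreteness, here is a minimal sketch of collecting transitions through the Gym API. It assumes the classic gym interface (reset() returns an observation; step() returns observation, reward, done, info), and the environment id "HalfCheetah-v1" is illustrative rather than taken from the project code.

```python
# Minimal sketch of interacting with a MuJoCo environment through OpenAI Gym.
# Assumes the classic gym API; the environment id is illustrative.
import gym

env = gym.make("HalfCheetah-v1")
state = env.reset()

for t in range(1000):
    # A real agent would query its policy here; we sample a random continuous action.
    action = env.action_space.sample()
    next_state, reward, done, info = env.step(action)
    # (state, action, reward, next_state) is exactly the tuple stored in the
    # replay buffer described in the following slides.
    state = env.reset() if done else next_state

env.close()
```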

Baseline Model We use the Deep Deterministic Policy Gradient (DDPG) algorithm, introduced by DeepMind (Lillicrap et al., 2016), as our baseline. Finding the greedy policy in a continuous action space would require an optimization over the action $a_t$ at every timestep; instead, DDPG uses an actor-critic approach to approximate the true action-value function Q. The actor network $\mu(s \mid \theta^{\mu})$ specifies the current policy by deterministically mapping states to a specific action. The critic $Q(s, a \mid \theta^{Q})$ is a learned action-value function: it estimates the action-value of the actor's output, and its output drives learning in both the actor and the critic. The actor is updated by applying the chain rule to the expected return from the start distribution $J$: $\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a=\mu(s)}\,\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big]$. Note: optimizing over all of the action space at every step is bad.
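To make the two updates concrete, below is a minimal PyTorch-style sketch of one DDPG training step: the critic is regressed toward a bootstrapped target, the actor is updated through the chain rule by ascending the critic's value of its own actions, and the target networks receive a soft update. The object names (actor, critic, target_actor, target_critic, the optimizers) are placeholders, not the authors' code, and the target networks come from the standard DDPG algorithm rather than this slide.

```python
# Sketch of a single DDPG update step (illustrative, not the authors' implementation).
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    s, a, r, s_next, done = batch  # float tensors sampled from the replay buffer

    # Critic update: regress Q(s, a) toward the bootstrapped target
    # y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: chain rule through the critic, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft updates keep the target networks slowly tracking the learned networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```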

Final Model: Prioritized Experience Replay We extend DDPG with Prioritized Experience Replay (PER). As with DQNs, DDPG uses a replay buffer to learn from mini-batches of uncorrelated updates: transitions are sampled from the environment according to the exploration policy, and each tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the buffer. In standard DDPG, the actor and critic are both updated by sampling a minibatch uniformly from the buffer (as in DQN). However, an RL agent can learn more effectively from some transitions than from others. With PER, we sample each transition $i$ with probability $P(i)$ proportional to its priority $p_i$, where $p_i$ is derived from the TD error $\delta_i$ of the transition (the difference between the critic's predicted value and the observed target).
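For reference, the rank-based formulation from the PER paper (Schaul et al., 2016), which matches the $p_i = 1/\mathrm{rank}(i)$ priority used later in this project, is:

```latex
% Rank-based prioritized sampling (Schaul et al., 2016).
% alpha controls how strongly prioritization is applied;
% alpha = 0 recovers uniform sampling.
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}},
\qquad
p_i = \frac{1}{\operatorname{rank}(i)}
```

Here $\mathrm{rank}(i)$ is the rank of transition $i$ when the buffer is ordered by $|\delta_i|$.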

Implementing Prioritized Experience Replay (diagram): The replay buffer supplies a minibatch of experience tuples of the form $(s_t, a_t, r_t, s_{t+1})$ to the optimizer; the optimizer sends updates to $p_i$ back to the buffer for each transition $i$ in the minibatch, where $p_i = 1/\mathrm{rank}(i)$ and $\mathrm{rank}(i)$ is the experience's rank in the buffer according to its temporal-difference error.
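A simplified sketch of how this loop could be implemented with rank-based priorities follows; it is an illustration under the $p_i = 1/\mathrm{rank}(i)$ scheme, not the authors' implementation (which, as described on the next slide, kept priorities in a heap that was re-sorted periodically).

```python
# Illustrative rank-based prioritized replay buffer (not the authors' code).
# Priorities follow p_i = 1 / rank(i), with rank determined by |TD error|.
import numpy as np

class RankPrioritizedReplay:
    def __init__(self, capacity, alpha=0.7):
        self.capacity = capacity
        self.alpha = alpha          # how strongly prioritization is applied
        self.storage = []           # transitions: (s, a, r, s_next, done)
        self.td_errors = []         # |delta_i| for each stored transition

    def add(self, transition, td_error=1e6):
        # New transitions get a large error so they are sampled at least once.
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
            self.td_errors.pop(0)
        self.storage.append(transition)
        self.td_errors.append(abs(td_error))

    def sample(self, batch_size):
        # rank 1 = largest |TD error|; p_i = (1 / rank(i)) ** alpha.
        order = np.argsort(-np.asarray(self.td_errors))
        ranks = np.empty(len(order), dtype=np.int64)
        ranks[order] = np.arange(1, len(order) + 1)
        probs = (1.0 / ranks) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=probs)
        return idx, [self.storage[i] for i in idx]

    def update_priorities(self, idx, new_td_errors):
        # Called by the optimizer after each update, as in the diagram above.
        for i, err in zip(idx, new_td_errors):
            self.td_errors[i] = abs(err)
```

Re-ranking the whole buffer on every sample is expensive, which is why the final model keeps only an approximately ordered heap that is re-sorted periodically.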

Final Model Details Ranking of experiences by δ (the TD error) was done by storing the priority-to-experience mappings in a heap, which provides an ordering of priorities that is approximate but close enough to correct for our purposes. The heap was re-sorted once every 1000 steps. The actor and critic networks in the final model each consisted of two affine layers of size 400 and 300, with a ReLU nonlinearity in between.
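A sketch of actor and critic networks with the layer sizes quoted above (400 and 300 hidden units). The slide only specifies the two hidden layers and the ReLU between them; the output layers, the second ReLU, and the tanh squashing of the actor output follow standard DDPG choices and are assumptions here.

```python
# Illustrative actor/critic networks with the hidden sizes given above.
# Output layers and activations are assumed (standard DDPG choices),
# since the slide only describes the 400- and 300-unit affine layers.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # assumes actions scaled to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),  # scalar Q-value estimate
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```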

Results We have tested our model on three MuJoCo environments so far: Cheetah, Hopper, and Ant. Our baseline model experienced difficulties learning the Ant environment, so we ran our final model on Cheetah and Hopper.

Results: Cheetah After enjoying strong performance for roughly 100,000 steps, our final model became unstable and began to underperform the baseline model, which uses a simple uniform replay buffer.

Results: Hopper After 500 epochs of 1000 timesteps each, our Hopper model achieved an average reward of 2400, a considerable gain over the baseline.

Comparison to DeepMind Results w/ DQNs

Lessons Learned & Challenges Training a model on a laptop or on a hive machine is extremely slow. We would like to switch to something faster before submitting our final report. We experienced difficulties converging on certain problems. We will try to attain improved convergence by tuning and annealing experience prioritization.