Presentation on theme: "Continuous Control with Prioritized Experience Replay"— Presentation transcript:

1 Continuous Control with Prioritized Experience Replay
GnaRLy: Joseph Simonian, Daniel Sochor

2 Problem Statement Deep Q-Networks (DQNs) are effective in a variety of high-dimensional observation spaces, as evidenced by their performance on tasks such as Atari. Unfortunately, DQNs are only useful in discrete, low-dimensional action spaces, because they rely on searching the action space for the action that maximizes a value function. Many tasks, such as physical control, have continuous and high-dimensional action spaces. The Deep DPG (DDPG) algorithm extends the ideas underlying the success of DQNs to continuous action spaces. We explore whether recent improvements to DQNs have similar effects when applied to DDPG. We measure our model's performance by the average score it achieves on various tasks involving continuous action spaces.

3 Data Source To train and evaluate our model, we used a variety of physical environments from the MuJoCo module of OpenAI Gym, focusing on environments with continuous action spaces. For these physical tasks, we experimented with low-dimensional state data that did not require preprocessing. OpenAI Gym provides a simple API for extracting an environment's state and the reward earned by the model's actions. Note: benchmarks for MuJoCo tasks are available in the DeepMind paper. (MuJoCo: Multi-Joint dynamics with Contact.)
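As an illustrative sketch (not the project's actual training loop), interacting with a continuous-control Gym environment under the classic Gym API looks like the following; the environment ID and return values depend on the installed Gym and MuJoCo versions:

```python
# Minimal Gym interaction loop with a random placeholder policy;
# DDPG's actor network would replace env.action_space.sample().
import gym

env = gym.make("HalfCheetah-v1")   # continuous-action MuJoCo task (version-dependent ID)
state = env.reset()                # low-dimensional state vector, no preprocessing needed

total_reward = 0.0
for t in range(1000):
    action = env.action_space.sample()                # random action in the continuous space
    next_state, reward, done, info = env.step(action) # classic 4-tuple Gym API
    total_reward += reward
    state = next_state
    if done:
        break

print("episode return:", total_reward)
```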

4 Baseline Model We use the Deep Deterministic Policy Gradient (DDPG) algorithm, introduced by DeepMind in February 2016, as our baseline. DDPG uses an actor-critic approach to approximate the true action-value function Q. Finding the greedy policy would require an optimization over the action a_t at every timestep, which is intractable in a continuous action space; instead, DDPG uses an actor and a critic. Actor: a network μ(s | θ^μ) that specifies the current policy by deterministically mapping states to a specific action. Critic: a learned action-value function Q(s, a | θ^Q). The critic estimates the action-value function using the actor's output, and the critic's output drives learning in both the actor and the critic. The actor is updated by applying the chain rule to the expected return from the start distribution J: ∇_{θ^μ} J ≈ E[ ∇_a Q(s, a | θ^Q) |_{a=μ(s)} · ∇_{θ^μ} μ(s | θ^μ) ]. Note: optimizing over the entire action space at every step is prohibitively expensive.
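As a concrete sketch of this update, written in PyTorch and assuming pre-built actor, critic, and target networks with their optimizers (none of these specifics appear on the slide; GAMMA and TAU below are the values from the DDPG paper, not necessarily ours):

```python
# Sketch of a single DDPG update step. Assumes actor, critic, target_actor,
# target_critic are torch.nn.Module networks and actor_opt / critic_opt are
# their optimizers; batch tensors come from the replay buffer (done is 0/1 float).
import torch
import torch.nn.functional as F

GAMMA = 0.99   # discount factor (DDPG paper value)
TAU = 0.001    # soft target-update rate (DDPG paper value)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt):
    s, a, r, s_next, done = batch

    # Critic: regress Q(s, a) toward the bootstrapped TD target.
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next))
        td_target = r + GAMMA * (1.0 - done) * q_next
    critic_loss = F.mse_loss(critic(s, a), td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the critic's estimate of Q(s, mu(s)) -- the chain-rule
    # (deterministic policy gradient) update described above.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Slowly track the learned networks with the target networks.
    for net, target_net in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target_net.parameters()):
            tp.data.mul_(1.0 - TAU).add_(TAU * p.data)
```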

5 Final Model: Prioritized Experience Replay
As with DQNs, DDPG uses a replay buffer so that it can learn from mini-batches of uncorrelated transitions. Transitions are sampled from the environment according to the exploration policy, and each tuple (s_t, a_t, r_t, s_{t+1}) is stored in the buffer. In DDPG, the actor and critic are both updated by sampling a minibatch uniformly from the buffer (as in DQN). However, an RL agent can learn more effectively from some transitions than from others. We therefore extend DDPG with Prioritized Experience Replay (PER): each transition i is sampled with probability P(i), which increases with the transition's priority p_i, and p_i is proportional to the magnitude of the TD error of transition i (the gap between the critic's prediction and the TD target).
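For reference, this is the sampling distribution from the Prioritized Experience Replay paper (the exponent α does not appear on the slide; α = 0 recovers uniform replay):

```latex
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}
```

In the rank-based variant used here (next slide), the priority is p_i = 1/rank(i), where transitions are ranked by the magnitude of their TD error.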

6 Implementing Prioritized Experience Replay
On each learning step, a minibatch of experience tuples of the form (s_t, a_t, r_t, s_{t+1}) is drawn from the replay buffer and passed to the optimizer. After the update, the priority p_i of each transition i in the minibatch is refreshed, where p_i = 1/rank(i) and rank(i) is the experience's rank in the buffer according to its temporal-difference error.
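A minimal Python sketch of the sampling side of this loop, using the rank-based priorities above; the class, its methods, and the α value are illustrative rather than taken from our actual implementation:

```python
# Rank-based prioritized replay: p_i = 1 / rank(i), P(i) = p_i^a / sum_k p_k^a.
# The buffer is only re-sorted periodically, so ranks (and hence probabilities)
# are approximate between sorts.
import random

class RankPrioritizedBuffer:
    def __init__(self, capacity, alpha=0.7):
        self.capacity = capacity
        self.alpha = alpha              # strength of prioritization
        self.data = []                  # entries: [abs_td_error, transition]

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop()             # drop the (approximately) lowest-priority entry
        self.data.append([abs(td_error), transition])

    def sort_by_priority(self):
        # Call periodically (e.g. every 1000 steps) so ranks stay roughly correct.
        self.data.sort(key=lambda e: e[0], reverse=True)

    def sample(self, batch_size):
        n = len(self.data)
        priorities = [(1.0 / (rank + 1)) ** self.alpha for rank in range(n)]
        total = sum(priorities)
        probs = [p / total for p in priorities]
        idx = random.choices(range(n), weights=probs, k=batch_size)
        return idx, [self.data[i][1] for i in idx]

    def update_priorities(self, idx, td_errors):
        # After a learning step, refresh |TD error| for the sampled transitions.
        for i, delta in zip(idx, td_errors):
            self.data[i][0] = abs(delta)
```

A sorted list keeps the sketch short; a fuller implementation would use the heap described on the next slide instead of fully sorting the buffer.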

7 Final Model Details Ranking of experiences by δ (TD error) was done by storing the priority-to-experience mappings in a heap, which provides an ordering of priorities that is close enough to correct for our purposes. The heap was re-sorted once per 1000 steps. The actor and critic networks in the final model each consisted of two affine layers of size 400 and 300, with ReLU nonlinearities in between.
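A sketch of these networks in PyTorch; the framework, the output layers, and the tanh action bound are assumptions, not stated on the slide:

```python
# Actor and critic with two hidden affine layers of 400 and 300 units and ReLU
# nonlinearities, as described above, plus assumed output layers.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),                      # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```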

8 Results We have tested our model on three MuJoCo environments so far: Cheetah, Hopper, and Ant. Our baseline model had difficulty learning the Ant environment, so we ran our final model only on Cheetah and Hopper.

9 Results: Cheetah After enjoying strong performance for roughly 100,000 steps, our final model became unstable and began to underperform the baseline model, which uses a simple (uniform) replay pool.

10 Results: Hopper After 500 epochs of timesteps each, our Hopper model achieved an average reward of . This is a considerable gain over the baseline.

11 Comparison to DeepMind Results w/ DQNs

12 Lessons Learned & Challenges
Training a model on a laptop or on a hive machine is extremely slow; we would like to switch to something faster before submitting our final report. We also experienced difficulties converging on certain problems, and we will try to improve convergence by tuning and annealing the experience prioritization.

