
1 Deep Reinforcement Learning
On the road to Skynet!
UW CSE Deep Learning – Felix Leeb

2 Overview
Today: MDPs – formalizing decisions; Function Approximation; Value Functions – DQN; Policy Gradients – REINFORCE, NPG; Actor-Critic – A3C, DDPG
Next time: Model-Based RL – forward/inverse models; Planning – MCTS, MPPI; Imitation Learning – DAgger, GAIL; Advanced Topics – Exploration, MARL, Meta-learning, LMDPs…

3 Paradigms and Objectives
Supervised Learning – Classification, Regression
Unsupervised Learning – Inference, Generation
Reinforcement Learning – Prediction, Control
In reinforcement learning the goal is to learn a policy, which gives us the action to take given the current state/observation.

4 Prediction vs. Control
Prediction: finding the likely output y given the input x.
Control is a little different: now we have to find a control u given only the observation (and a reward signal).
So can we use any of the tricks we learned for prediction in control?

5 Setting
The agent interacts with the environment in a loop: the environment emits a state/observation and a reward, and the agent responds with an action chosen using its policy.

6 Markov Decision Processes
An MDP consists of a state space, an action space, a transition function, and a reward function.
Markov – only the previous state matters.
Decision – the agent takes actions, and those decisions have consequences.
Process – there is some transition function, sometimes called the dynamics of the system.
The reward function can in general depend on both the state and the action, but often it depends only on the state.
Goal: maximize overall reward.
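As a quick illustration (not from the slides), a small finite MDP can be written out explicitly as arrays; the 2-state, 2-action numbers below are made up:

    import numpy as np

    # Hypothetical 2-state, 2-action MDP written out as tabular arrays.
    n_states, n_actions = 2, 2
    gamma = 0.9                              # discount factor (next slide)

    # P[s, a, s'] = probability of landing in s' after taking action a in state s
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.0, 1.0]]])

    # R[s, a] = expected immediate reward for taking action a in state s
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])

    assert np.allclose(P.sum(axis=2), 1.0)   # each (s, a) row is a distribution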

7 Discount Factor
We want to be greedy but not impulsive.
The discount implicitly takes uncertainty in the dynamics into account.
Mathematically, γ < 1 keeps infinite-horizon returns finite.
Return: G_t = Σ_{k≥0} γ^k r_{t+k}
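A tiny sketch of computing the return for a finite trajectory (function name and rewards are illustrative):

    def discounted_return(rewards, gamma=0.99):
        """G_0 = sum_t gamma^t * r_t, accumulated backwards over a finite trajectory."""
        G = 0.0
        for r in reversed(rewards):
            G = r + gamma * G
        return G

    print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))  # 1 + 0.99 + 0.99**2 = 2.9701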

8 Solving an MDP
Objective: J(π) = E_π[ Σ_t γ^t r_t ]
Goal: find the policy π* = argmax_π J(π)

9 Value Functions
Value = expected gain (return) from a state: V^π(s) = E_π[ G_t | s_t = s ]
Q function – action-specific value function: Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ]
Advantage function – how much more valuable an action is than average: A^π(s, a) = Q^π(s, a) − V^π(s)
Value depends on future rewards → it depends on the policy.

10 Tabular Solution: Policy Iteration
Policy Evaluation: compute V^π(s) = E_{a∼π}[ R(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]
Policy Update: π(s) ← argmax_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s') ]
Alternate the two steps until the policy stops changing (see the sketch below).
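A tabular policy iteration sketch in numpy, reusing the hypothetical P, R, gamma arrays from the MDP example above; evaluation solves the linear system for V^π exactly, and the update is greedy with respect to the one-step lookahead:

    import numpy as np

    def policy_iteration(P, R, gamma, max_iters=100):
        n_states, n_actions = R.shape
        pi = np.zeros(n_states, dtype=int)             # arbitrary initial policy
        for _ in range(max_iters):
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi
            P_pi = P[np.arange(n_states), pi]          # (S, S) dynamics under pi
            R_pi = R[np.arange(n_states), pi]          # (S,) rewards under pi
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
            # Policy update: greedy w.r.t. one-step lookahead
            Q = R + gamma * P @ V                      # (S, A)
            new_pi = Q.argmax(axis=1)
            if np.array_equal(new_pi, pi):             # policy stopped changing
                return pi, V
            pi = new_pi
        return pi, V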

11 Q Learning
Learn Q without knowing the transition function, directly from sampled transitions (s, a, r, s'):
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
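A tabular Q-learning sketch with epsilon-greedy exploration, assuming a classic Gym-style interface where reset() returns the state and step() returns (state, reward, done, info); all names are illustrative:

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy action selection
                a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
                s_next, r, done, _ = env.step(a)
                # TD target uses the max over next actions -> no transition function needed
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q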

12 Function Approximation
Model: Q_θ(s, a) – this is where the "deep" in deep RL comes in; function approximation allows continuous state spaces.
Training data: transitions (s, a, r, s') gathered by the agent.
Loss function: L(θ) = E[ ( r + γ max_{a'} Q_{θ⁻}(s', a') − Q_θ(s, a) )² ], where θ⁻ are the parameters of a separate target network – that is the "other Q".
At the beginning of training our Q function will be really bad, so the targets will be bad, but each update moves in the right direction, so overall training makes progress.
Taking the derivative of this MSE loss with respect to Q recovers the Q-learning update, so minimizing the loss over the parameters is equivalent to the tabular setting of updating the Q values.
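A minimal sketch of this loss in PyTorch for a discrete-action Q network (the layer sizes, hyperparameters, and batch format are assumptions, not the lecture's code):

    import torch
    import torch.nn as nn

    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q_theta
    target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q_theta^-
    target_net.load_state_dict(q_net.state_dict())

    def dqn_loss(batch, gamma=0.99):
        s, a, r, s_next, done = batch                          # tensors sampled from the replay buffer
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a)
        with torch.no_grad():                                  # the target does not backprop
            q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1 - done) * q_next
        return nn.functional.mse_loss(q_sa, target)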

13 Implementation
Action-in vs. action-out: the network can either take (s, a) as input and output a single value, or take s and output one Q value per action.
Off-Policy Learning: the target depends in part on our own model → old observations are still useful.
Use a Replay Buffer of the most recent transitions as the dataset (off-policy data).
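A minimal replay buffer sketch (class and method names are illustrative):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

        def push(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            batch = random.sample(self.buffer, batch_size)
            return list(zip(*batch))               # columns: states, actions, rewards, next states, dones

        def __len__(self):
            return len(self.buffer)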

14 Deep Q Networks (DQN) Mnih et al. (2015)

15 DQN Issues
Convergence is not guaranteed – hope for deep magic!
Tricks that help: error clipping, reward scaling, a replay buffer (prioritizing recent transitions), and Double Q Learning – using separate target and training Q networks to decouple action selection from value estimation.
Sample complexity is not great – we are training a deep CNN through RL.
Continuous action spaces are essentially impossible.
This is all really annoying.
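For the double Q idea, here is a sketch of the Double DQN target: the online network selects the action, the target network evaluates it (networks are passed in; tensor shapes are assumed to match the earlier DQN sketch):

    import torch

    def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
        with torch.no_grad():
            best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # action selection (online net)
            q_next = target_net(s_next).gather(1, best_a).squeeze(1)  # value estimation (target net)
            return r + gamma * (1 - done) * q_next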

16 Policy Gradients
Do we have to bother with a value function? Instead, parameterize the policy π_θ(a | s) and update those parameters directly.
Any model that can be trained could be a policy, which enables new kinds of policies: stochastic policies and continuous action spaces.
On-policy learning → learn directly from your own actions.

17 Policy Gradients
Note: we're going to be a little hand-wavy with the notation.
The policy gradient uses the likelihood-ratio (score function) trick, which is essentially importance sampling:
∇_θ J(θ) = E_{τ∼π_θ}[ Σ_t ∇_θ log π_θ(a_t | s_t) G_t ]
Approximate the expectation value from samples.
No guarantee of finding a global optimum.

18 REINFORCE – Sutton et al. (2000)
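A minimal REINFORCE sketch in PyTorch for a discrete-action policy (network sizes and batch format are illustrative; returns are the discounted sums G_t from slide 7):

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # action logits
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def reinforce_update(states, actions, returns):
        """One step on the loss -mean(log pi(a|s) * G_t) over sampled trajectories."""
        logits = policy(torch.as_tensor(states, dtype=torch.float32))
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(torch.as_tensor(actions))
        loss = -(log_probs * torch.as_tensor(returns, dtype=torch.float32)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()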

19 Variance Reduction
Constant offsets in the return make it harder to identify the right update direction.
Remove the offset → subtract the a priori value of each state: use the value function as a baseline, replacing the return with the advantage G_t − V(s_t) (see the baseline sketch below).
It also turns out that standard gradient descent is not necessarily the direction of steepest descent for stochastic function optimization – consider natural gradients (next slide).
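As a sketch, even a crude state-independent baseline (here just the batch mean return, an illustrative choice) leaves the gradient estimate unbiased while shrinking its variance:

    import numpy as np

    def advantages_with_baseline(returns):
        returns = np.asarray(returns, dtype=np.float64)
        baseline = returns.mean()          # crude state-independent baseline
        return returns - baseline          # use these in place of G_t in the REINFORCE update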

20 Advanced Policy Gradient Methods
For stochastic functions, the vanilla gradient is not the best direction – consider the KL divergence between the old and new policies.
NPG → TRPO → PPO:
Natural Policy Gradients – approximate the Fisher information matrix and use it to choose the gradient direction.
TRPO – compute the gradient step subject to a KL-divergence constraint.
PPO – take a step penalized directly in terms of the KL divergence.

21 Advanced Policy Gradient Methods
Natural Policy Gradients – use the Fisher information matrix to choose the gradient direction.
TRPO – adjust the gradient subject to a KL-divergence constraint.
PPO – take a step directly related to the KL divergence.
Examples: Rajeswaran et al. (2017), Heess et al. (2017)
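As a sketch of one common PPO variant (the clipped surrogate rather than the explicit KL penalty; epsilon and names are illustrative):

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
        """Clipped surrogate objective: discourage moving far from the old policy."""
        ratio = torch.exp(log_probs_new - log_probs_old)           # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()               # minimize the negative surrogate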

22 Actor-Critic
Critic – estimates the advantage, trained with a Q-learning style update.
Actor – proposes actions, trained with a policy gradient update.
The aim: get the convergence of policy gradients and the sample complexity of Q learning.
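A one-step actor-critic sketch in PyTorch, where the TD error stands in for the advantage (the networks, sizes, and single-transition update are assumptions):

    import torch
    import torch.nn as nn

    actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # action logits
    critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V(s)
    opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

    def actor_critic_step(s, a, r, s_next, done, gamma=0.99):
        s = torch.as_tensor(s, dtype=torch.float32)
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        v, v_next = critic(s), critic(s_next).detach()
        td_target = r + gamma * (1.0 - done) * v_next
        advantage = (td_target - v).detach()                       # critic's advantage estimate
        critic_loss = (td_target - v).pow(2).mean()                # regress V toward the TD target
        log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
        actor_loss = -(log_prob * advantage).mean()                # policy gradient weighted by advantage
        opt.zero_grad()
        (actor_loss + critic_loss).backward()
        opt.step()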

23 Async Advantage Actor-Critic (A3C)
Async – parallelizes updates across multiple workers.
Uses the advantage function.
REINFORCE-style updates to the policy.
Mnih et al. (2016)

24 DDPG
Off-policy learning – using deterministic policy gradients.
Continuous control.
Replay buffer.
EMA between target and training networks for stability (see the sketch below).
Eps-greedy exploration.
Batch normalization.
Figure: Max Ferguson (2017)
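A sketch of the EMA (soft) target-network update used for stability (tau is a hypothetical small constant):

    import torch

    def soft_update(target_net, train_net, tau=0.005):
        """Slowly track the training network: theta_target <- (1 - tau) * theta_target + tau * theta."""
        with torch.no_grad():
            for p_target, p_train in zip(target_net.parameters(), train_net.parameters()):
                p_target.mul_(1.0 - tau).add_(tau * p_train)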

