Deep Reinforcement Learning: Learning how to act using a deep neural network
Psych 209, Winter 2019
February 12, 2019
How can we teach a neural network to act?
- Direct ‘policy’ supervision, or imitation learning: provide the learner with a teaching signal at each step and backpropagate.
- What problems might we encounter with this approach? What if we don’t get examples from the environment telling us what the correct action is? Instead we only get rewards when special events occur, e.g. we stumble onto a silver dollar. Computer games can be like this, and animals foraging in the wild may face this problem too.
- Core intuition: increase the probability of actions that maximize the ‘expected discounted future reward’.
- Central concepts: V(s) and Q(s,a).
- Karpathy’s version: directly calculate the expected discounted future reward from gameplay rollouts (sketched after this list).
- The classical RL approach: base actions on value estimates directly, gradually updating estimates of V and/or Q via the Bellman equation.
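A minimal sketch of the rollout-based (‘Karpathy’) calculation under simple assumptions: after a rollout finishes, compute the discounted future return at every step; these returns then weight how strongly each chosen action’s probability is pushed up or down. The function name and discount value below are illustrative, not from the slides.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Expected discounted future reward G_t at each step of one finished rollout."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

# Example: a reward of 1 arrives only when we stumble onto the silver dollar at the end
print(discounted_returns([0, 0, 0, 1]))   # -> [0.970299, 0.9801, 0.99, 1.0]
```

In the rollout approach these returns (usually standardized across a batch) scale the gradient of the log-probability of each action taken, so actions followed by higher return become more probable.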
The Bellman Equation
- Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
- $V(s) = \max_a Q(s,a) = \max_a \left[ r(s,a) + \gamma V(s') \right]$, where $s'$ is the state reached by taking action $a$ in state $s$.
- Problem 1: Value depends on the policy.
- Problem 2: Value information may require exploration to obtain.
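If the transition and reward functions were known, the Bellman equation could be applied directly by sweeping over states, as in this minimal value-iteration sketch (deterministic transitions assumed; next_state and reward are placeholder functions, not anything from the slides). Problems 1 and 2 above are exactly about what happens when we lack this model and must act and explore to get the needed information.

```python
import numpy as np

def value_iteration(n_states, n_actions, next_state, reward, gamma=0.9, n_sweeps=100):
    """Repeatedly apply V(s) <- max_a [ r(s, a) + gamma * V(next_state(s, a)) ]."""
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            V[s] = max(reward(s, a) + gamma * V[next_state(s, a)]
                       for a in range(n_actions))
    return V
```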
Example: 2-D grid world
- Discrete states; 4 possible actions.
- Next state is determined by the action (no change if we move into a wall).
- A positive reward occurs when we stumble onto the silver dollar.
- A negative reward occurs when we fall into the black hole.
- How can we learn about this? We must explore (both rules are sketched below):
  - Softmax-based exploration
  - ε-greedy exploration
- The estimated reward depends on the exploration policy!
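A sketch of the two exploration rules named above, applied to the vector of Q estimates for the current state; the temperature and epsilon values are illustrative defaults, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())      # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

A lower temperature or smaller epsilon makes the agent greedier, which is one concrete way the estimated reward comes to depend on the exploration policy.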
Discussion: Staying with V and Q learning for now, how can we speed things up?
Q-learning as in Mnih et al.
Value learning based on rehearsal buffer
- Take an action.
- Store the (state, action, reward, next state) tuple in the rehearsal buffer.
- Sample a batch of tuples from the buffer and, for each one:
  - Use the stored (‘old’) policy parameters to estimate the Q value of the next state.
  - Calculate the loss as the difference between the target (the reward plus the discounted Q value of the next state under the old parameters) and your current estimate of the Q value of the state and action taken.
- Update your weights based on the loss over the batch (a sketch follows below).
- If the buffer is full, discard the oldest item in the buffer.
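A sketch of this loop in the spirit of Mnih et al.’s DQN, written in PyTorch as one common implementation choice rather than the original code; q_net, target_net (the ‘old’ parameters), and optimizer are assumed to be set up elsewhere.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=100_000)   # appending to a full buffer discards the oldest tuple

def store(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def dqn_update(q_net, target_net, optimizer, batch_size=32, gamma=0.99):
    batch = random.sample(list(buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Target: reward + gamma * max_a Q_old(s', a), computed with the frozen 'old' parameters
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values

    # Loss: difference between that target and the current estimate Q(s, a)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)    # Huber-style loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```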
Advantage Actor Critic (A2C)
- Estimate both a value and a policy.
- Use the value estimate of the next state, together with the reward, as a measure of advantage relative to the current value estimate.
- Use the value estimate of the next state to update the value estimate.
- Solve the independent-samples problem by having many independent actors simultaneously learning in the same environment (sketched below).
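A sketch of the A2C losses for a single transition, assuming a hypothetical policy_value_net that returns action probabilities and a state value; in practice the same computation is batched over transitions gathered from many parallel actors.

```python
import torch

def a2c_losses(policy_value_net, state, action, reward, next_state, done, gamma=0.99):
    """Actor (policy) and critic (value) losses for one transition."""
    probs, value = policy_value_net(state)            # pi(.|s) and V(s)
    with torch.no_grad():
        _, next_value = policy_value_net(next_state)  # V(s'), used only as a bootstrap target

    # Advantage: how much better the step turned out than the current value estimate predicted
    target = reward + gamma * (1.0 - done) * next_value
    advantage = target - value

    actor_loss = -torch.log(probs[action]) * advantage.detach()  # raise p(a|s) when advantage > 0
    critic_loss = advantage.pow(2)                               # move V(s) toward the target
    return actor_loss, critic_loss
```

Averaging these losses over transitions collected by many actors at once provides approximately independent samples, playing the role that the rehearsal buffer plays for Q-learning.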