1
A Crash Course in Reinforcement Learning
Oliver Schulte Simon Fraser University emphasis on connections with neural net learning
2
Outline
What is Reinforcement Learning?
Key Definitions
Key Learning Tasks
Reinforcement Learning Techniques
Reinforcement Learning with Neural Nets
3
Overview
4
Learning To Act So far: learning to predict Now: learn to act
In engineering: control theory
Economics, operations research: decision and game theory
Examples: fly a helicopter, drive a car, play Go, play soccer
5
RL at a glance
6
Acting in Action
Autonomous helicopter. An example of imitation learning: start by observing human actions.
Learning to play video games: “Deep Q works best when it lives in the moment”
Learn to flip pancakes
7
MARKOV DECISION PROCESSES
8
Markov Decision Processes
Recall Markov process (MP):
state = vector x ≅ s of input variable values
can contain hidden variables = partially observable (POMDP)
transition probability P(s’|s)
Markov reward process (MRP) = MP + rewards r
Markov decision process (MDP) = MRP + actions a
Markov game = MDP with actions, rewards for > 1 agent
9
Model Parameters: transition probabilities
Markov process: P(s(t+1)|s(t))
MDP: P(s(t+1)|s(t),a(t)) and expected reward E(r(t+1)|s(t),a(t))
Examples: recall basketball example, also hockey example, grid example, David Poole’s demo
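As a rough illustration of these model parameters, a small tabular MDP can be stored as nested dictionaries. The two-state, two-action model below is hypothetical (every number is made up for the sketch), and `step` is an illustrative helper, not part of the slides. It samples the next state from P(s’|s,a) and returns the expected reward E(r|s,a) instead of a sampled reward.

```python
import random

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
# P[s][a] maps each next state s' to P(s'|s,a); R[s][a] is E(r|s,a).
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.5}}


def step(s, a, rng):
    """Sample s' ~ P(.|s,a); return (s', expected reward for (s,a))."""
    next_states = list(P[s][a].keys())
    probs = list(P[s][a].values())
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s][a]


# Usage: roll out three steps, always taking action 1.
rng = random.Random(0)
s = 0
for _ in range(3):
    s, r = step(s, 1, rng)
```

A Markov process is the special case with a single action, so that P[s][a] reduces to P(s’|s).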
10
derived concepts
11
Returns and discounting
A trajectory is a (possibly infinite) sequence s(0),a(0),r(0),s(1),a(1),r(1),...,s(n),a(n),r(n),... The return is the total sum of rewards. But: if the trajectory is infinite, we have an infinite sum! Solution: Weight by discount factor γ between 0 and 1. Return = r(0)+γr(1)+γ²r(2)+... Can also interpret 1-γ as the probability of the process ending at each step.
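A minimal sketch of the discounted return for a finite list of rewards. The function name `discounted_return` is chosen here for illustration. It uses the recursion G(t) = r(t) + γG(t+1), which gives the same sum as the formula above.

```python
def discounted_return(rewards, gamma):
    """Return r(0) + gamma*r(1) + gamma^2*r(2) + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G(t) = r(t) + gamma * G(t+1).
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a constant reward of 1 and γ < 1, the return of a very long trajectory approaches 1/(1-γ), which is why discounting keeps the infinite sum finite.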
12
RL Concepts These 3 functions can be computed by neural networks
13
Policies and Values A deterministic policy π is a function that maps states to actions. i.e. tells us how to act. Can also be probabilistic. Can be implemented using neural nets. Given a policy and an MDP, we have the expected return from using the policy at a state. Notation: Vπ(s)
14
Optimal Policies A policy π* is optimal if for any other policy π and for all states s, Vπ*(s) ≥ Vπ(s). The value of the optimal policy is written as V*(s).
15
The action value function
Given a policy π, the expected return from taking action a at state s and following π afterwards is denoted as Qπ(s,a). Similarly Q*(s,a) for the value of an action given the optimal policy. grid example. Show Mitchell example.
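When the model is known, the action value follows from the state values by a one-step lookahead: Q(s,a) = E(r|s,a) + γ Σs’ P(s’|s,a) V(s’). The sketch below assumes a tabular model stored as dictionaries. The model, the values in `V`, and the helper name `q_from_v` are all hypothetical and only serve the illustration.

```python
def q_from_v(P, R, V, gamma):
    """Q(s,a) = E(r|s,a) + gamma * sum over s' of P(s'|s,a) * V(s')."""
    return {
        s: {
            a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in P[s]
        }
        for s in P
    }


# Hypothetical model: in state 0, action 0 stays put and action 1 moves to state 1.
P = {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {1: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}}
V = {0: 2.0, 1: 4.0}  # assumed state values, for illustration only
Q = q_from_v(P, R, V, gamma=0.5)
print(Q[0])  # {0: 1.0, 1: 3.0}
```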
16
LEARNING
17
Two Learning Problems Prediction: For a fixed policy, learn Vπ(s).
Control: For a given MDP, learn V*(s) (optimal policy). Variants for Q-function.
18
Model-Based Learning
Data → Transition Probabilities → (dynamic programming) → Value Function
Bellman equation: Vπ(s) = Σa π(a|s) × [ E(r|s,a) + γ × Σs’ P(s’|s,a) × Vπ(s’) ]
Developed for transition probabilities that are “nice”: discrete, Gaussian, Poisson, ...
grid example
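One common dynamic-programming use of the Bellman equation is iterative policy evaluation: start from V = 0 and apply the right-hand side repeatedly until the values stop changing. The sketch below is an illustration under assumptions, not the method from the slides. It assumes a tabular model in dictionaries, a stochastic policy `pi[s][a]`, a discount factor γ < 1, and a hypothetical stopping tolerance `tol`.

```python
def policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Iterate V(s) <- sum_a pi(a|s) [E(r|s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                pi[s][a]
                * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update, reused by later states in this sweep
        if delta < tol:
            return V


# Hypothetical one-state MDP: a single action loops back with reward 1.
P = {0: {0: {0: 1.0}}}
R = {0: {0: 1.0}}
pi = {0: {0: 1.0}}
V = policy_evaluation(P, R, pi, gamma=0.9)  # converges towards 1 / (1 - 0.9) = 10
```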
19
Model-free Learning By-pass estimating transition probabilities
Why? Continuous state variables, no “nice” functional form. (How about using LSTM/RNN dynamic model? deep dynamic programming?)
20
Model-free Learning Directly learn optimal policy π* (policy iteration) Directly learn optimal value function V*. Directly learn optimal action-value function Q*. All of these functions can be implemented in a neural network. NN learning = reinforcement learning
21
Model-free Learning: What are the data?
Data is simply a sequence of events s(0),a(0),r(0),s(1),a(1),r(1),... It doesn’t tell us expected values or optimal actions. Monte Carlo learning: to learn V, observe the return at the end of the episode. E.g. ChessBase gives the percentage of wins by White for any position.
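A minimal sketch of Monte Carlo value estimation from logged episodes. It uses the every-visit variant, which is one of several options and is an assumption here. Each episode is a list of (state, action, reward) triples. The state names and rewards are made up, with reward 1 standing in for a win, as in the chess example.

```python
from collections import defaultdict


def mc_value_estimate(episodes, gamma):
    """Every-visit Monte Carlo: average the return observed after each state visit."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the discounted return from each step onward.
        for s, a, r in reversed(episode):
            g = r + gamma * g
            totals[s] += g
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Two hypothetical episodes through positions "A" then "B": one win, one loss.
episodes = [
    [("A", 0, 0.0), ("B", 0, 1.0)],
    [("A", 0, 0.0), ("B", 0, 0.0)],
]
V = mc_value_estimate(episodes, gamma=1.0)
print(V)  # {'B': 0.5, 'A': 0.5}
```

With γ = 1 and a reward only at the end, the estimate for a position is simply the fraction of observed episodes through it that ended in a win.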
22
Temporal Difference Learning
Consistency idea: using the current model and the data s(0),a(0),r(0),s(1),a(1),r(1),..., estimate (1) the value V(s(t)) at the current state and (2) the next-step value V1(s(t)) = r(t)+γV(s(t+1)). Minimize the “error” [V1(s(t))-V(s(t))]².
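For a tabular value function, the consistency idea can be sketched as the standard TD(0) update: treat the next-step value as a fixed target and move V(s(t)) a small step towards it. The step size `alpha` and the function name `td0_update` are assumptions introduced for this sketch. Holding the target fixed is a common simplification and not a full gradient of the squared error.

```python
def td0_update(V, s, r, s_next, gamma, alpha):
    """One TD(0) step: V(s) <- V(s) + alpha * (r + gamma * V(s_next) - V(s))."""
    target = r + gamma * V[s_next]  # next-step value V1(s)
    td_error = target - V[s]        # the "error" before squaring
    V[s] += alpha * td_error        # step that reduces the squared error
    return td_error


# Hypothetical values and a single observed transition A -> B with reward 0.5.
V = {"A": 0.0, "B": 1.0}
err = td0_update(V, "A", 0.5, "B", gamma=0.5, alpha=0.5)
print(err, V["A"])  # 1.0 0.5
```

Unlike the Monte Carlo estimate, this update can be applied after every single transition, without waiting for the end of the episode.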
23
Model-Free Learning Example