1
Model-based RL, Part 1
2
400 ms: signal processing, swing, etc. => 125 ms (300-400 ms to blink)
3
Model-based RL
$R_\pi = E_\pi\left[\sum_{t=0}^{T} \gamma^t r_t\right]$
$R_\pi = E_\pi\left[\sum_{t=0}^{T} P(s' \mid s, a) \cdot R(s')\right]$
- Maximize the discounted cumulative reward in each episode by finding a good policy
4
Model-based RL
$\max_\theta \; E_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$
$p_\theta(s_0, a_0, \ldots, s_T, a_T) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$
Here $\pi_\theta(a_t \mid s_t)$ is the policy and $P(s_{t+1} \mid s_t, a_t)$ is the model, i.e. the transition probability. Model-free RL learns the model only indirectly, through a value function, by sampling data points with a given policy.
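A minimal sketch of what this factorization means in practice: sample a trajectory by alternating the policy and the learned transition model, and accumulate the discounted reward. The `policy`, `model`, and `reward` callables are placeholders for whatever parameterization is used, not anything defined on the slide.

```python
def sample_trajectory(policy, model, reward, s0, T, gamma=0.99):
    """Roll out p_theta(tau) = p(s0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)
    and return the discounted return sum_t gamma^t R(s_t, a_t)."""
    s, ret = s0, 0.0
    for t in range(T):
        a = policy(s)          # sample a_t ~ pi_theta(a_t | s_t)
        ret += (gamma ** t) * reward(s, a)
        s = model(s, a)        # sample s_{t+1} ~ P(s_{t+1} | s_t, a_t)
    return ret

# Monte Carlo estimate of E_{tau ~ p_theta}[ sum_t gamma^t R(s_t, a_t) ]:
# average sample_trajectory(...) over many rollouts from the initial state distribution.
```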
5
Why use Model-based RL?
- Partially observable environments
- Generality
- Sample efficiency
6
Training Model-based RL
1. Run an initial policy $\pi_0(a_t \mid s_t)$, e.g. a random policy
2. Learn the model, i.e. $P(s_{t+1} \mid s_t, a_t)$
3. Learn a new policy $\pi_n(a_t \mid s_t)$
4. Execute $\pi_n$ and collect new data
Iterative training 1) avoids getting stuck in local minima and 2) learns skills step by step (see the sketch below).
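A schematic version of this loop, assuming hypothetical helpers `collect_rollouts`, `fit_dynamics_model`, and `improve_policy` that stand in for whichever data collection, model fitting, and policy optimization method is used:

```python
def train_model_based(env, random_policy, collect_rollouts,
                      fit_dynamics_model, improve_policy, n_iterations=10):
    # 1) Run the initial (e.g. random) policy pi_0(a_t | s_t) to gather data
    dataset = collect_rollouts(env, random_policy)
    policy = random_policy
    for n in range(n_iterations):
        # 2) Learn the model P(s_{t+1} | s_t, a_t) from all data collected so far
        dynamics_model = fit_dynamics_model(dataset)
        # 3) Learn a new policy pi_n(a_t | s_t) using the learned model
        policy = improve_policy(dynamics_model, policy)
        # 4) Execute pi_n in the real environment and collect new data
        dataset += collect_rollouts(env, policy)
    return policy, dynamics_model
```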
7
Recurrent neural network (RNN)
[Diagram: a single RNN cell A mapping input $x_t$ and hidden state $h_t$ to output $y_t$]
8
Recurrent neural network (RNN)
[Diagram: the RNN unrolled through time, with inputs $x_1, x_2, x_3, \ldots, x_t$, hidden states $h_0, h_1, \ldots, h_{t-1}$, and outputs $y_1, y_2, \ldots, y_t$]
Vanishing gradients: the network has only short-term memory and forgets what it has seen in long sequences (see the sketch below).
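The recurrence behind the diagram, written as a minimal NumPy sketch (the weight names and the tanh nonlinearity are illustrative assumptions). Because $h_t$ is produced by repeatedly multiplying with the same recurrent weights, gradients flowing back through many steps shrink toward zero, which is the vanishing-gradient problem mentioned above.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, h0):
    """Unrolled vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1}), y_t = W_y h_t."""
    h, ys = h0, []
    for x in xs:                       # x_1, x_2, ..., x_t
        h = np.tanh(W_x @ x + W_h @ h)  # same W_h reused at every step
        ys.append(W_y @ h)
    return ys, h
```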
9
Long Short-Term Memory network (LSTM)
10
World Models
David Ha (Google Brain), Jürgen Schmidhuber (AI Lab, IDSIA, USI & SUPSI)
11
World Model
12
V Model - the decoder is only used during training and for reconstructing frames from the latent vector; at run time only the encoder's latent vector z is passed on.
13
Variational Autoencoder (VAE)
[Diagram: input $x$ passes through the encoder to mean and standard deviation $\sigma$, a latent vector $z$ is sampled, and the decoder produces the output $x'$]
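A minimal PyTorch-style sketch of the VAE in the diagram; the layer sizes and fully connected architecture are placeholders (the World Models paper uses a convolutional encoder/decoder on 64x64 frames).

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=4096, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)         # mean of q(z | x)
        self.log_sigma = nn.Linear(256, latent_dim)  # log std of q(z | x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, sigma = self.mu(h), self.log_sigma(h).exp()
        z = mu + sigma * torch.randn_like(sigma)     # reparameterization trick
        return self.decoder(z), mu, sigma, z         # reconstruction x', and latent z
```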
14
M Model: an MDN-RNN modeling $P(z_{t+1} \mid a_t, z_t, h_t)$
The MDN outputs a Gaussian mixture from which $z_{t+1}$ is sampled; a temperature parameter adds uncertainty to the sampling (see the sketch below).
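A sketch of sampling $z_{t+1}$ from the mixture with a temperature parameter. The exact way temperature enters (flattening the mixture weights and widening each Gaussian) follows the common MDN-RNN convention and is an assumption here, as are the array shapes.

```python
import numpy as np

def sample_mdn(logits, mus, sigmas, temperature=1.0):
    """Sample z_{t+1} from the Gaussian mixture output by the MDN-RNN.
    logits, mus, sigmas: arrays of shape (n_mixtures, z_dim)."""
    # Higher temperature flattens the mixture weights and widens each Gaussian,
    # adding uncertainty to the sampled next latent state.
    scaled = logits / temperature
    weights = np.exp(scaled - scaled.max(axis=0))
    weights /= weights.sum(axis=0)
    z_next = np.empty(mus.shape[1])
    for d in range(mus.shape[1]):                     # sample each latent dimension
        k = np.random.choice(len(weights), p=weights[:, d])
        z_next[d] = np.random.normal(mus[k, d], sigmas[k, d] * np.sqrt(temperature))
    return z_next
```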
15
C Model: a simple linear model, $a_t = W_c\,[z_t \; h_t] + b_c$
Only a few hundred parameters; the complexity lies in the world model (V and M). There are many ways to train C, e.g. evolution strategies, which can tackle challenges where the credit assignment problem is hard (see the sketch below).
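The controller is just one linear map. Below is a sketch of computing the action and of a basic evolution-strategy update over its flattened parameters; the paper uses CMA-ES, so the plain random-perturbation search here is only a stand-in, and `evaluate` is a hypothetical function returning an episode's cumulative reward for a given parameter vector.

```python
import numpy as np

def controller_action(W_c, b_c, z, h):
    """a_t = W_c [z_t ; h_t] + b_c  (only a few hundred parameters in total)."""
    return W_c @ np.concatenate([z, h]) + b_c

def simple_es_step(W_c, b_c, evaluate, pop_size=16, sigma=0.1, lr=0.01):
    """One step of a basic evolution strategy: perturb the flattened parameters,
    evaluate each candidate's episode return, and move toward better candidates."""
    flat = np.concatenate([W_c.ravel(), b_c])
    noise = np.random.randn(pop_size, flat.size)
    rewards = np.array([evaluate(flat + sigma * n) for n in noise])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    flat = flat + lr / (pop_size * sigma) * noise.T @ advantage
    return flat[:W_c.size].reshape(W_c.shape), flat[W_c.size:]
```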
16
Putting it all together
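A sketch of how V, M, and C interact at run time: each frame is encoded to $z_t$ by the VAE, the controller maps $[z_t \; h_t]$ to an action, and the MDN-RNN updates its hidden state given $(z_t, a_t)$. The `vae`, `mdn_rnn`, and `controller` objects and their method names are placeholders for the three trained components, and the old Gym-style `env.step` return is assumed.

```python
def rollout(env, vae, mdn_rnn, controller):
    """Run one episode with the trained World Model agent (schematic)."""
    obs = env.reset()
    h = mdn_rnn.initial_state()
    total_reward, done = 0.0, False
    while not done:
        z = vae.encode(obs)                  # V: compress the frame to latent z_t
        a = controller.action(z, h)          # C: a_t = W_c [z_t ; h_t] + b_c
        obs, reward, done, _ = env.step(a)
        h = mdn_rnn.step(z, a, h)            # M: update the hidden state
        total_reward += reward
    return total_reward
```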
17
Car Racing Experiment
Goal: visit as many tiles as possible in the least amount of time. Solving = an average reward of 900 over 100 consecutive trials. The agent controls three continuous actions: steering left/right, acceleration, and brake.
18
Car Racing Experiment
V model only: $a_t = W_c\,[z_t] + b_c$, no access to predictions of the future
Full World Model: $a_t = W_c\,[z_t \; h_t] + b_c$, access to the prediction of the future gives more 'reflexive' behaviour
19
Car Racing Experiment: Average Score over 100 Random Tracks
DQN: 343 ± 18
A3C (continuous): 891 ± 45
A3C (discrete): 652 ± 10
V model only, z input: 632 ± 251
V model only, z input with a hidden layer: 788 ± 141
Full World Model, z and h: 906 ± 21
The World Model was the first published method to solve this task.
20
VizDoom Experiment
Goal: avoid the fireballs shot by the monsters. Reward = number of time steps the agent survives; solving = an average survival time of more than 750 over 100 consecutive trials. In this experiment the M model predicts whether the agent dies in the next round, in addition to the next latent vector.
21
VizDoom Experiment: remember that the M model can construct the next latent vector $z_{t+1}$, so the agent can be trained entirely inside a virtual environment (see the sketch below).
[Figure: cropped 64x64 px frame of the environment next to its reconstruction from the latent vector]
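Because M can generate the next latent vector itself, the controller can be evaluated without touching the real simulator. A schematic dream rollout, assuming (as in the VizDoom setup described above) that M also outputs a death prediction; all object and method names are placeholders.

```python
def dream_rollout(mdn_rnn, controller, z0, max_steps=2100, temperature=1.15):
    """Evaluate the controller inside the learned model ('dream'): M replaces the
    real environment by sampling z_{t+1} and a done flag from P(z_{t+1} | a_t, z_t, h_t)."""
    z, h, survived = z0, mdn_rnn.initial_state(), 0
    for t in range(max_steps):
        a = controller.action(z, h)
        z, death_prob, h = mdn_rnn.sample_next(z, a, h, temperature)
        if death_prob > 0.5:      # M's prediction that the agent dies in the next round
            break
        survived += 1
    return survived               # reward = number of time steps survived
```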
22
VizDoom Experiment: the agent finds an adversarial policy, moving in a way such that the monsters never fire. The uncertainty in the model was too low.
23
VizDoom Experiment: Temperature / Score in Virtual Environment / Score in Actual Environment
0.10: 2086 ± 140 / 193 ± 58
0.50: 2060 ± 277 / 196 ± 50
1.00: 1145 ± 690 / 868 ± 511
1.15: 918 ± 546 / 1092 ± 556
1.30: 732 ± 269 / 753 ± 139
Random Policy Baseline: N/A / 210 ± 108
Gym Leaderboard: N/A / 820 ± 58
24
Wrap up World Models
- Learn the dynamics of the environment
- Train the controller in a virtual environment («dreams»)
- Different ways to train the C Model
- Combine Model-free with Model-based RL
25
Temporal Difference Variational Auto-Encoder (TD-VAE)
Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, Théophane Weber (DeepMind)
More involved than World Models; no reinforcement learning, it just builds a model.
26
TD-VAE: learn an abstract state representation of the data
- Learn a belief state $b_t$, i.e. a deterministic, coded representation of the filtering posterior of the state given all observations up to a given time. A belief state contains all the information an agent has about the state of the world, and thus about how to act optimally (see the sketch below).
- Learn a temporal abstraction capable of making predictions at the state level, not just the observation level, using latent variables to model the transition between states.
- Learn from temporally separated time steps, enabling 'jumpy' predictions of the state further in the future.
- In RL the state of an agent represents a belief about the sum of discounted rewards; here the state represents a belief about possible future states.
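A minimal sketch of the belief part: a recurrent network (an LSTM here, as an assumption) consumes the observations one by one, and its hidden output at time t serves as the deterministic code $b_t$ for the filtering posterior over states.

```python
import torch.nn as nn

class BeliefNetwork(nn.Module):
    """Aggregates observations x_1..x_t into belief states b_1..b_t."""
    def __init__(self, obs_dim=32, belief_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, belief_dim, batch_first=True)

    def forward(self, x_seq):            # x_seq: (batch, time, obs_dim)
        beliefs, _ = self.rnn(x_seq)     # beliefs[:, t] is b_t
        return beliefs
```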
27
TD-VAE
[Diagram: a state prediction network connects belief states $b_{t_1}$ and $b_{t_2}$, computed from observations $x_{t_1}$ and $x_{t_2}$ at times $t_1$ and $t_2$]
Goal: predict a future time step $t_2$ (i.e. predict the state) using all the data seen so far.
28
TD-VAE - Training
[Diagram: belief network, state prediction network, inference network, decoder network, and sampling; belief states $b_{t_1}$, $b_{t_2}$ give the distributions $p_B(t_1)$ and $p_B(t_2)$, the inference network gives $q_{s_{t_1}}$, latents $z_{t_1}$ and $z_{t_2}$ are sampled, and the transition $p_T(t_2)$ and decoder $p_D(t_2)$ are evaluated for the observations $x_{t_1}$, $x_{t_2}$]
Randomly sample $t_1$ and $t_2$ (see the sketch below).
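A heavily schematic sketch of one training step, assuming the five networks in the diagram are available as distribution-producing modules (all argument names are placeholders and the loss is written in the usual negative-ELBO form, not copied from the slide):

```python
import numpy as np

def td_vae_training_step(x_seq, belief_net, belief_to_state,
                         infer_net, transition_net, decoder_net):
    """One schematic TD-VAE training step on a batch of sequences x_1..x_T."""
    T = x_seq.shape[1]
    t1 = np.random.randint(0, T - 1)            # randomly sample t1 and t2 > t1
    t2 = np.random.randint(t1 + 1, T)

    beliefs = belief_net(x_seq)                 # belief network: b_t for every t
    p_b1 = belief_to_state(beliefs[:, t1])      # p_B(z | b_t1)
    p_b2 = belief_to_state(beliefs[:, t2])      # p_B(z | b_t2)

    z2 = p_b2.rsample()                                   # state at t2 from its belief
    q1 = infer_net(z2, beliefs[:, t1], beliefs[:, t2])    # inference net q(z_t1 | z_t2, b_t1, b_t2)
    z1 = q1.rsample()

    p_t2 = transition_net(z1)                   # state prediction network p_T(z_t2 | z_t1)
    p_d2 = decoder_net(z2)                      # decoder network p_D(x_t2 | z_t2)

    # Schematic loss: z_t1 should also be plausible under the belief at t1, the jumpy
    # transition should explain z_t2, and z_t2 should reconstruct the observation at t2.
    loss = (q1.log_prob(z1).sum(-1) - p_b1.log_prob(z1).sum(-1)
            + p_b2.log_prob(z2).sum(-1) - p_t2.log_prob(z2).sum(-1)
            - p_d2.log_prob(x_seq[:, t2]).sum(-1)).mean()
    return loss
```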
29
TD-VAE - Predicting
[Diagram: belief network, state prediction network, inference network, decoder network, and sampling; from the observations up to $t_1$, the belief $b_{t_1}$ and distribution $p_B(t_1)$ give $z_{t_1}$, and the transition $p_T(t_2)$ gives $z_{t_2}$]
Goal: given all observations up to $t_1$, predict the state at $t_2$.
1. Generate the belief state at $t_1$ and sample $z_{t_1}$
2. Predict the state distribution at $t_2$ and sample $z_{t_2}$
(see the sketch below)
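The two-step prediction procedure from the slide as a short sketch, using the same placeholder distribution-producing modules as above:

```python
def jumpy_predict(x_seq_up_to_t1, belief_net, belief_to_state, transition_net, decoder_net):
    """Given observations up to t1, predict (a sample of) the state and observation at t2."""
    beliefs = belief_net(x_seq_up_to_t1)
    b_t1 = beliefs[:, -1]                      # belief state at t1
    z_t1 = belief_to_state(b_t1).sample()      # sample z_t1 ~ p_B(z | b_t1)
    z_t2 = transition_net(z_t1).sample()       # sample z_t2 ~ p_T(z_t2 | z_t1): a jump in time
    x_t2 = decoder_net(z_t2).sample()          # optionally decode an observation at t2
    return z_t2, x_t2
```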
30
Noisy Harmonic Oscillator
- Noise is added to the position and velocity; the state consists of frequency, magnitude, and position (phase) (see the data-generation sketch below)
- Hierarchical model with two layers: stack TD-VAEs on top of each other; it is only the position that cannot be accurately predicted, accurate for dt = 20 but not for dt = 100
- Autoregressive model: a simple LSTM that predicts step by step
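A small sketch of how such sequences can be generated; the noise scale, sampling ranges, and time step are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def harmonic_oscillator_sequence(T=200, noise=0.05):
    """Observations of a noisy harmonic oscillator: fixed frequency and magnitude,
    with noise injected into the position (phase) and the observation at every step."""
    freq = np.random.uniform(0.5, 2.0)
    mag = np.random.uniform(0.5, 1.5)
    phase = np.random.uniform(0, 2 * np.pi)
    xs = []
    for t in range(T):
        phase += freq * 0.1 + noise * np.random.randn()     # noisy advance of the phase
        xs.append(mag * np.sin(phase) + noise * np.random.randn())
    return np.array(xs)
```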
31
Wrap up TD-VAE
- Different from step-by-step prediction: jumps in time
- Hierarchical model: stack TD-VAEs on top of each other
- Builds states from observations -> roll out possible future scenarios
- The belief state represents several possible futures
32
Thank you for your attention
33
Questions?