1
Model-based RL, Part 1
2
400 ms: signal processing, swing, etc. => 125 ms (300-400 ms to blink)
3
Model-based RL
$R_\pi = E_\pi\left[\sum_{t=0}^{T} \gamma^t r_t\right]$
$R_\pi = E_\pi\left[\sum_{t=0}^{T} P(s' \mid s, a) \cdot R(s')\right]$
- Maximize the discounted cumulative reward in each episode by finding a good policy
4
Model-based RL
$\max_\theta \; E_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$
$p_\theta(s_0, a_0, \ldots, s_T, a_T) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$
Here $\pi_\theta(a_t \mid s_t)$ is the policy and $P(s_{t+1} \mid s_t, a_t)$ is the model, i.e. the transition probability. Model-free RL learns the model only indirectly, through a value function, by sampling data points with a given policy.
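A minimal sketch of what this factorization means in practice: sample a trajectory by alternating the policy and the learned transition model, and accumulate the discounted reward. The `policy`, `model`, and `reward` callables are placeholders for whatever parameterization is used, not anything defined on the slide.

```python
def sample_trajectory(policy, model, reward, s0, T, gamma=0.99):
    """Roll out p_theta(tau) = p(s0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)
    and return the discounted return sum_t gamma^t R(s_t, a_t)."""
    s, ret = s0, 0.0
    for t in range(T):
        a = policy(s)          # sample a_t ~ pi_theta(a_t | s_t)
        ret += (gamma ** t) * reward(s, a)
        s = model(s, a)        # sample s_{t+1} ~ P(s_{t+1} | s_t, a_t)
    return ret

# Monte Carlo estimate of E_{tau ~ p_theta}[ sum_t gamma^t R(s_t, a_t) ]:
# average sample_trajectory(...) over many rollouts from the initial state distribution.
```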
5
Why use Model-based RL?
- Partially observable environments
- Generality
- Sample efficiency
6
Training Model-based RL
1. Run an initial policy $\pi_0(a_t \mid s_t)$, e.g. a random policy
2. Learn the model, i.e. $P(s_{t+1} \mid s_t, a_t)$
3. Learn a new policy $\pi_n(a_t \mid s_t)$
4. Execute $\pi_n$ and collect new data
Iterative training 1) avoids getting stuck in local minima and 2) learns skills step by step (see the sketch below).
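A schematic version of this loop, assuming hypothetical helpers `collect_rollouts`, `fit_dynamics_model`, and `improve_policy` that stand in for whichever data collection, model fitting, and policy optimization method is used:

```python
def train_model_based(env, random_policy, collect_rollouts,
                      fit_dynamics_model, improve_policy, n_iterations=10):
    # 1) Run the initial (e.g. random) policy pi_0(a_t | s_t) to gather data
    dataset = collect_rollouts(env, random_policy)
    policy = random_policy
    for n in range(n_iterations):
        # 2) Learn the model P(s_{t+1} | s_t, a_t) from all data collected so far
        dynamics_model = fit_dynamics_model(dataset)
        # 3) Learn a new policy pi_n(a_t | s_t) using the learned model
        policy = improve_policy(dynamics_model, policy)
        # 4) Execute pi_n in the real environment and collect new data
        dataset += collect_rollouts(env, policy)
    return policy, dynamics_model
```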
7
Recurrent neural network (RNN)
[Diagram: a single RNN cell A mapping input $x_t$ and hidden state $h_t$ to output $y_t$]
8
Recurrent neural network (RNN)
[Diagram: the RNN unrolled through time, with inputs $x_1, x_2, x_3, \ldots, x_t$, hidden states $h_0, h_1, \ldots, h_{t-1}$, and outputs $y_1, y_2, \ldots, y_t$]
Vanishing gradients: the network has only short-term memory and forgets what it has seen in long sequences (see the sketch below).
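The recurrence behind the diagram, written as a minimal NumPy sketch (the weight names and the tanh nonlinearity are illustrative assumptions). Because $h_t$ is produced by repeatedly multiplying with the same recurrent weights, gradients flowing back through many steps shrink toward zero, which is the vanishing-gradient problem mentioned above.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, h0):
    """Unrolled vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1}), y_t = W_y h_t."""
    h, ys = h0, []
    for x in xs:                       # x_1, x_2, ..., x_t
        h = np.tanh(W_x @ x + W_h @ h)  # same W_h reused at every step
        ys.append(W_y @ h)
    return ys, h
```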
9
Long Short-Term Memory network (LSTM)
10
World Models
David Ha (Google Brain), Jürgen Schmidhuber (AI Lab, IDSIA, USI & SUPSI)
11
World Model
12
V Model - the decoder is only used during training and for reconstructing frames from the latent vector; at run time only the encoder's latent vector z is passed on.
13
Variational Autoencoder (VAE)
[Diagram: input $x$ passes through the encoder to mean and standard deviation $\sigma$, a latent vector $z$ is sampled, and the decoder produces the output $x'$]
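A minimal PyTorch-style sketch of the VAE in the diagram; the layer sizes and fully connected architecture are placeholders (the World Models paper uses a convolutional encoder/decoder on 64x64 frames).

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=4096, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)         # mean of q(z | x)
        self.log_sigma = nn.Linear(256, latent_dim)  # log std of q(z | x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, sigma = self.mu(h), self.log_sigma(h).exp()
        z = mu + sigma * torch.randn_like(sigma)     # reparameterization trick
        return self.decoder(z), mu, sigma, z         # reconstruction x', and latent z
```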
14
M Model: an MDN-RNN modeling $P(z_{t+1} \mid a_t, z_t, h_t)$
The MDN outputs a Gaussian mixture from which $z_{t+1}$ is sampled; a temperature parameter adds uncertainty to the sampling (see the sketch below).
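A sketch of sampling $z_{t+1}$ from the mixture with a temperature parameter. The exact way temperature enters (flattening the mixture weights and widening each Gaussian) follows the common MDN-RNN convention and is an assumption here, as are the array shapes.

```python
import numpy as np

def sample_mdn(logits, mus, sigmas, temperature=1.0):
    """Sample z_{t+1} from the Gaussian mixture output by the MDN-RNN.
    logits, mus, sigmas: arrays of shape (n_mixtures, z_dim)."""
    # Higher temperature flattens the mixture weights and widens each Gaussian,
    # adding uncertainty to the sampled next latent state.
    scaled = logits / temperature
    weights = np.exp(scaled - scaled.max(axis=0))
    weights /= weights.sum(axis=0)
    z_next = np.empty(mus.shape[1])
    for d in range(mus.shape[1]):                     # sample each latent dimension
        k = np.random.choice(len(weights), p=weights[:, d])
        z_next[d] = np.random.normal(mus[k, d], sigmas[k, d] * np.sqrt(temperature))
    return z_next
```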
15
C Model: a simple linear model, $a_t = W_c\,[z_t \; h_t] + b_c$
Only a few hundred parameters; the complexity lies in the world model (V and M). There are many ways to train C, e.g. evolution strategies, which can tackle challenges where the credit assignment problem is hard (see the sketch below).
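The controller is just one linear map. Below is a sketch of computing the action and of a basic evolution-strategy update over its flattened parameters; the paper uses CMA-ES, so the plain random-perturbation search here is only a stand-in, and `evaluate` is a hypothetical function returning an episode's cumulative reward for a given parameter vector.

```python
import numpy as np

def controller_action(W_c, b_c, z, h):
    """a_t = W_c [z_t ; h_t] + b_c  (only a few hundred parameters in total)."""
    return W_c @ np.concatenate([z, h]) + b_c

def simple_es_step(W_c, b_c, evaluate, pop_size=16, sigma=0.1, lr=0.01):
    """One step of a basic evolution strategy: perturb the flattened parameters,
    evaluate each candidate's episode return, and move toward better candidates."""
    flat = np.concatenate([W_c.ravel(), b_c])
    noise = np.random.randn(pop_size, flat.size)
    rewards = np.array([evaluate(flat + sigma * n) for n in noise])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    flat = flat + lr / (pop_size * sigma) * noise.T @ advantage
    return flat[:W_c.size].reshape(W_c.shape), flat[W_c.size:]
```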
16
Putting it all together
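A sketch of how V, M, and C interact at run time: each frame is encoded to $z_t$ by the VAE, the controller maps $[z_t \; h_t]$ to an action, and the MDN-RNN updates its hidden state given $(z_t, a_t)$. The `vae`, `mdn_rnn`, and `controller` objects and their method names are placeholders for the three trained components, and the old Gym-style `env.step` return is assumed.

```python
def rollout(env, vae, mdn_rnn, controller):
    """Run one episode with the trained World Model agent (schematic)."""
    obs = env.reset()
    h = mdn_rnn.initial_state()
    total_reward, done = 0.0, False
    while not done:
        z = vae.encode(obs)                  # V: compress the frame to latent z_t
        a = controller.action(z, h)          # C: a_t = W_c [z_t ; h_t] + b_c
        obs, reward, done, _ = env.step(a)
        h = mdn_rnn.step(z, a, h)            # M: update the hidden state
        total_reward += reward
    return total_reward
```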
17
Car Racing Experiment
Goal: visit as many tiles as possible in the least amount of time. Solving = an average reward of 900 over 100 consecutive trials. The agent controls three continuous actions: steering left/right, acceleration, and brake.
18
Car Racing Experiment
V model only: $a_t = W_c\,[z_t] + b_c$, no access to predictions of the future
Full World Model: $a_t = W_c\,[z_t \; h_t] + b_c$, access to the prediction of the future gives more 'reflexive' behaviour
19
Car Racing Experiment: Average Score over 100 Random Tracks
DQN: 343 ± 18
A3C (continuous): 891 ± 45
A3C (discrete): 652 ± 10
V model only, z input: 632 ± 251
V model only, z input with a hidden layer: 788 ± 141
Full World Model, z and h: 906 ± 21
The World Model was the first published method to solve this task.
20
VizDoom Experiment
Goal: avoid the fireballs shot by the monsters. Reward = number of time steps the agent survives; solving = an average survival time of more than 750 over 100 consecutive trials. In this experiment the M model predicts whether the agent dies in the next round, in addition to the next latent vector.
21
VizDoom Experiment: remember that the M model can construct the next latent vector $z_{t+1}$, so the agent can be trained entirely inside a virtual environment (see the sketch below).
[Figure: cropped 64x64 px frame of the environment next to its reconstruction from the latent vector]
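Because M can generate the next latent vector itself, the controller can be evaluated without touching the real simulator. A schematic dream rollout, assuming (as in the VizDoom setup described above) that M also outputs a death prediction; all object and method names are placeholders.

```python
def dream_rollout(mdn_rnn, controller, z0, max_steps=2100, temperature=1.15):
    """Evaluate the controller inside the learned model ('dream'): M replaces the
    real environment by sampling z_{t+1} and a done flag from P(z_{t+1} | a_t, z_t, h_t)."""
    z, h, survived = z0, mdn_rnn.initial_state(), 0
    for t in range(max_steps):
        a = controller.action(z, h)
        z, death_prob, h = mdn_rnn.sample_next(z, a, h, temperature)
        if death_prob > 0.5:      # M's prediction that the agent dies in the next round
            break
        survived += 1
    return survived               # reward = number of time steps survived
```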
22
VizDoom Experiment: the agent finds an adversarial policy, moving in a way such that the monsters never fire. The uncertainty in the model was too low.
23
VizDoom Experiment: Temperature / Score in Virtual Environment / Score in Actual Environment
0.10: 2086 ± 140 / 193 ± 58
0.50: 2060 ± 277 / 196 ± 50
1.00: 1145 ± 690 / 868 ± 511
1.15: 918 ± 546 / 1092 ± 556
1.30: 732 ± 269 / 753 ± 139
Random Policy Baseline: N/A / 210 ± 108
Gym Leaderboard: N/A / 820 ± 58
24
Wrap up World Models
- Learn the dynamics of the environment
- Train the controller in a virtual environment («dreams»)
- Different ways to train the C Model
- Combine Model-free with Model-based RL
25
Temporal Difference Variational Auto-Encoder (TD-VAE)
Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, Théophane Weber (DeepMind)
More involved than World Models; no reinforcement learning, it just builds a model.
26
TD-VAE: learn an abstract state representation of the data
- Learn a belief state $b_t$, i.e. a deterministic, coded representation of the filtering posterior of the state given all observations up to a given time. A belief state contains all the information an agent has about the state of the world, and thus about how to act optimally (see the sketch below).
- Learn a temporal abstraction capable of making predictions at the state level, not just the observation level, using latent variables to model the transition between states.
- Learn from temporally separated time steps, enabling 'jumpy' predictions of the state further in the future.
- In RL the state of an agent represents a belief about the sum of discounted rewards; here the state represents a belief about possible future states.
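A minimal sketch of the belief part: a recurrent network (an LSTM here, as an assumption) consumes the observations one by one, and its hidden output at time t serves as the deterministic code $b_t$ for the filtering posterior over states.

```python
import torch.nn as nn

class BeliefNetwork(nn.Module):
    """Aggregates observations x_1..x_t into belief states b_1..b_t."""
    def __init__(self, obs_dim=32, belief_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, belief_dim, batch_first=True)

    def forward(self, x_seq):            # x_seq: (batch, time, obs_dim)
        beliefs, _ = self.rnn(x_seq)     # beliefs[:, t] is b_t
        return beliefs
```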
27
TD-VAE
[Diagram: a state prediction network connects belief states $b_{t_1}$ and $b_{t_2}$, computed from observations $x_{t_1}$ and $x_{t_2}$ at times $t_1$ and $t_2$]
Goal: predict a future time step $t_2$ (i.e. predict the state) using all the data seen so far.
28
TD-VAE - Training
[Diagram: belief network, state prediction network, inference network, decoder network, and sampling; belief states $b_{t_1}$, $b_{t_2}$ give the distributions $p_B(t_1)$ and $p_B(t_2)$, the inference network gives $q_{s_{t_1}}$, latents $z_{t_1}$ and $z_{t_2}$ are sampled, and the transition $p_T(t_2)$ and decoder $p_D(t_2)$ are evaluated for the observations $x_{t_1}$, $x_{t_2}$]
Randomly sample $t_1$ and $t_2$ (see the sketch below).
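A heavily schematic sketch of one training step, assuming the five networks in the diagram are available as distribution-producing modules (all argument names are placeholders and the loss is written in the usual negative-ELBO form, not copied from the slide):

```python
import numpy as np

def td_vae_training_step(x_seq, belief_net, belief_to_state,
                         infer_net, transition_net, decoder_net):
    """One schematic TD-VAE training step on a batch of sequences x_1..x_T."""
    T = x_seq.shape[1]
    t1 = np.random.randint(0, T - 1)            # randomly sample t1 and t2 > t1
    t2 = np.random.randint(t1 + 1, T)

    beliefs = belief_net(x_seq)                 # belief network: b_t for every t
    p_b1 = belief_to_state(beliefs[:, t1])      # p_B(z | b_t1)
    p_b2 = belief_to_state(beliefs[:, t2])      # p_B(z | b_t2)

    z2 = p_b2.rsample()                                   # state at t2 from its belief
    q1 = infer_net(z2, beliefs[:, t1], beliefs[:, t2])    # inference net q(z_t1 | z_t2, b_t1, b_t2)
    z1 = q1.rsample()

    p_t2 = transition_net(z1)                   # state prediction network p_T(z_t2 | z_t1)
    p_d2 = decoder_net(z2)                      # decoder network p_D(x_t2 | z_t2)

    # Schematic loss: z_t1 should also be plausible under the belief at t1, the jumpy
    # transition should explain z_t2, and z_t2 should reconstruct the observation at t2.
    loss = (q1.log_prob(z1).sum(-1) - p_b1.log_prob(z1).sum(-1)
            + p_b2.log_prob(z2).sum(-1) - p_t2.log_prob(z2).sum(-1)
            - p_d2.log_prob(x_seq[:, t2]).sum(-1)).mean()
    return loss
```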
29
TD-VAE - Predicting
[Diagram: belief network, state prediction network, inference network, decoder network, and sampling; from the observations up to $t_1$, the belief $b_{t_1}$ and distribution $p_B(t_1)$ give $z_{t_1}$, and the transition $p_T(t_2)$ gives $z_{t_2}$]
Goal: given all observations up to $t_1$, predict the state at $t_2$.
1. Generate the belief state at $t_1$ and sample $z_{t_1}$
2. Predict the state distribution at $t_2$ and sample $z_{t_2}$
(see the sketch below)
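The two-step prediction procedure from the slide as a short sketch, using the same placeholder distribution-producing modules as above:

```python
def jumpy_predict(x_seq_up_to_t1, belief_net, belief_to_state, transition_net, decoder_net):
    """Given observations up to t1, predict (a sample of) the state and observation at t2."""
    beliefs = belief_net(x_seq_up_to_t1)
    b_t1 = beliefs[:, -1]                      # belief state at t1
    z_t1 = belief_to_state(b_t1).sample()      # sample z_t1 ~ p_B(z | b_t1)
    z_t2 = transition_net(z_t1).sample()       # sample z_t2 ~ p_T(z_t2 | z_t1): a jump in time
    x_t2 = decoder_net(z_t2).sample()          # optionally decode an observation at t2
    return z_t2, x_t2
```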
30
Noisy Harmonic Oscillator
- Noise is added to the position and velocity; the state consists of frequency, magnitude, and position (phase) (see the data-generation sketch below)
- Hierarchical model with two layers: stack TD-VAEs on top of each other; it is only the position that cannot be accurately predicted, accurate for dt = 20 but not for dt = 100
- Autoregressive model: a simple LSTM that predicts step by step
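A small sketch of how such sequences can be generated; the noise scale, sampling ranges, and time step are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def harmonic_oscillator_sequence(T=200, noise=0.05):
    """Observations of a noisy harmonic oscillator: fixed frequency and magnitude,
    with noise injected into the position (phase) and the observation at every step."""
    freq = np.random.uniform(0.5, 2.0)
    mag = np.random.uniform(0.5, 1.5)
    phase = np.random.uniform(0, 2 * np.pi)
    xs = []
    for t in range(T):
        phase += freq * 0.1 + noise * np.random.randn()     # noisy advance of the phase
        xs.append(mag * np.sin(phase) + noise * np.random.randn())
    return np.array(xs)
```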
31
Wrap up TD-VAE
- Different from step-by-step prediction: jumps in time
- Hierarchical model: stack TD-VAEs on top of each other
- Builds states from observations -> roll out possible future scenarios
- The belief state represents several possible futures
32
Thank you for your attention
33
Questions?