
Model-based RL Part 1

~400 ms: signal processing, swing, etc. => ~125 ms (300-400 ms to blink)

Model-based RL
$R_\pi = E_\pi\!\left[\sum_{t=0}^{T} \gamma^t r_t\right]$
$R_\pi = E_\pi\!\left[\sum_{t=0}^{T} P(s' \mid s, a) \cdot R(s')\right]$
- Maximize the discounted cumulative reward in each episode by finding a good policy
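As a tiny worked example of the return above (a minimal sketch, not from the slides), the discounted cumulative reward of one episode can be computed directly from its reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """R = sum_{t=0}^{T} gamma^t * r_t for one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. rewards [1, 0, 2] with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```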

Model-based RL
$\max_\theta \; E_{\tau \sim p_\theta(\tau)}\!\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$
$p_\theta(s_0, a_0, \ldots, s_T, a_T) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)$
- Model-free RL learns the model only indirectly, through the value function, by sampling data points with a given policy
- In the trajectory distribution, $\pi_\theta(a_t \mid s_t)$ is the policy and $P(s_{t+1} \mid s_t, a_t)$ is the model, i.e. the transition probability
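To make the trajectory distribution concrete, here is a hedged sketch (not from the slides) of sampling a trajectory from a small discrete MDP given a policy and a transition model; the array layout is an assumption for illustration:

```python
import numpy as np

def sample_trajectory(p0, policy, transition, T, rng=None):
    """Sample (s_0, a_0, ..., s_T, a_T) from p(s_0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t).

    p0:         initial state distribution, shape (n_states,)
    policy:     policy[s] = distribution over actions, shape (n_states, n_actions)
    transition: transition[s, a] = distribution over next states, shape (n_states, n_actions, n_states)
    """
    rng = rng or np.random.default_rng()
    s = rng.choice(len(p0), p=p0)
    trajectory = []
    for t in range(T + 1):
        a = rng.choice(policy.shape[1], p=policy[s])
        trajectory.append((s, a))
        if t < T:
            s = rng.choice(len(p0), p=transition[s, a])
    return trajectory
```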

Why use Model-based RL?
- Partially observable environments
- Generality
- Sample efficiency

Training Model-based RL
1. Run an initial policy $\pi_0(a_t \mid s_t)$, e.g. a random policy
2. Learn the model, i.e. $P(s_{t+1} \mid s_t, a_t)$
3. Learn a new policy $\pi_n(a_t \mid s_t)$
4. Execute $\pi_n$ and collect new data
- Iterative training: 1) avoids getting stuck in local minima, 2) lets the agent learn skills step by step
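A minimal sketch of this loop (the helper functions are placeholders supplied by the caller, not part of any library):

```python
def train_model_based(env, make_random_policy, collect_rollouts,
                      fit_dynamics_model, improve_policy, n_iterations=10):
    """Iterative model-based RL loop from the slide above (all helpers are caller-supplied)."""
    policy = make_random_policy(env)                 # 1. run an initial (e.g. random) policy
    dataset = []
    for _ in range(n_iterations):
        dataset += collect_rollouts(env, policy)     # execute the current policy, collect (s, a, s') data
        model = fit_dynamics_model(dataset)          # 2. learn the model P(s_{t+1} | s_t, a_t)
        policy = improve_policy(model, policy)       # 3. learn a new policy pi_n(a_t | s_t) using the model
    return policy                                    # 4. repeat: execute pi_n and collect new data
```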

Recurrent neural network (RNN)
[Diagram: a single RNN cell A with input x_t, hidden state h_t, and output y_t]

Recurrent neural network (RNN)
[Diagram: the RNN unrolled over time: cells A with inputs x_1, x_2, x_3, ..., x_t, hidden states h_0, ..., h_{t-1}, and outputs y_1, ..., y_t]
- Vanishing gradients
- Only short-term memory: forgets what it has seen in long sequences
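The unrolled computation can be written out in a few lines; a minimal NumPy sketch of a vanilla RNN (weights and shapes are illustrative assumptions):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:                                   # iterate over the sequence x_1, ..., x_t
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)     # hidden state carries the short-term memory
        ys.append(W_hy @ h + b_y)
    return ys, h

# Repeated multiplication by W_hh (together with tanh derivatives < 1) is what
# makes gradients vanish over long sequences, motivating the LSTM on the next slide.
```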

Long Short Term Memory network (LSTM) https://www.youtube.com/watch?v=_h66BW-xNgk

World Models
David Ha (Google Brain), Jürgen Schmidhuber (AI Lab, IDSIA (USI & SUPSI))
https://worldmodels.github.io/

World Model

V Model (Vision: a VAE)
- The decoder is only needed for training and for reconstructing frames from the latent vector

Variational Autoencoder (VAE)
[Diagram: input x -> Encoder -> (µ, σ) -> sample latent vector z -> Decoder -> output x']
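A minimal PyTorch sketch of the diagram above, including the reparameterization trick used to sample z (layer sizes are illustrative assumptions; the World Models paper uses a convolutional VAE):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """x -> encoder -> (mu, log_sigma) -> sample z -> decoder -> x'."""
    def __init__(self, x_dim=1024, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, 256)
        self.mu = nn.Linear(256, z_dim)
        self.log_sigma = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        z = mu + log_sigma.exp() * torch.randn_like(mu)   # reparameterization: z = mu + sigma * eps
        return self.dec(z), mu, log_sigma
```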

M Model
MDN-RNN: $P(z_{t+1} \mid a_t, z_t, h_t)$
- The MDN outputs a Gaussian mixture from which $z_{t+1}$ is sampled; a temperature parameter adds uncertainty to the sampling
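A hedged sketch of sampling the next latent vector from the MDN output, with one Gaussian mixture per latent dimension and the temperature applied to both the mixture weights and the standard deviations (this scaling is one common choice, not necessarily the paper's exact implementation):

```python
import numpy as np

def sample_mdn(log_pi, mu, log_sigma, temperature=1.0, rng=None):
    """Sample z_{t+1} from the Gaussian mixture predicted by the MDN-RNN.

    log_pi, mu, log_sigma: arrays of shape (z_dim, n_mixtures).
    Higher temperature flattens the mixture weights and widens the Gaussians,
    i.e. it adds uncertainty to the sampled next latent state.
    """
    rng = rng or np.random.default_rng()
    logits = log_pi / temperature
    pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)
    z_next = np.empty(mu.shape[0])
    for d in range(mu.shape[0]):
        k = rng.choice(pi.shape[1], p=pi[d])       # pick a mixture component for this dimension
        z_next[d] = rng.normal(mu[d, k], np.exp(log_sigma[d, k]) * np.sqrt(temperature))
    return z_next
```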

C Model
Simple linear model: $a_t = W_c \, [z_t \; h_t] + b_c$
- Only a few hundred parameters; the complexity lies in the world model (V and M)
- Many ways to train C, e.g. evolution strategies
- Can tackle challenges where the credit assignment problem is hard
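The controller itself fits in a couple of lines; a minimal sketch (the parameter count assumes the car-racing setup with a 32-dimensional z, a 256-dimensional h, and 3 actions):

```python
import numpy as np

def controller_action(z_t, h_t, W_c, b_c):
    """C model: a_t = W_c [z_t ; h_t] + b_c, a single linear map from (z_t, h_t) to the action."""
    return W_c @ np.concatenate([z_t, h_t]) + b_c

# With z_dim = 32, h_dim = 256 and 3 actions this is only 3 * (32 + 256) + 3 = 867
# parameters, small enough to optimize with an evolution strategy such as CMA-ES
# instead of backpropagation through the world model.
```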

Putting it all together

Car Racing Experiment
- Goal: visit as many tiles as possible in the least amount of time
- Solving = average reward of 900 over 100 consecutive trials
- The agent controls three continuous actions: steering left/right, acceleration, and brake
https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/carracing_z_and_h.mp4

Car Racing Experiment
V model only: $a_t = W_c \, [z_t] + b_c$, no access to a prediction of the future
https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/carracing_z_only.mp4
Full world model: $a_t = W_c \, [z_t \; h_t] + b_c$, access to the future prediction gives more 'reflexive' behaviour
https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/carracing_z_and_h.mp4

Car Racing Experiment

Method | Average Score over 100 Random Tracks
DQN | 343 ± 18
A3C (continuous) | 891 ± 45
A3C (discrete) | 652 ± 10
V model only, z input | 632 ± 251
V model only, z input with a hidden layer | 788 ± 141
Full World Model, z and h | 906 ± 21

- The World Model was the first reported method to solve this task

VizDoom Experiment
- Goal: avoid fireballs shot by the monsters
- Reward = number of time steps the agent survives
- Solving = average survival time of more than 750 over 100 consecutive trials
- The M model predicts whether the agent dies in the next round, in addition to $z_{t+1}$
https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/doom_lazy_small.mp4

VizDoom Experiment
https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/doom_real_vae.mp4
- Remember that the M model can construct the next latent vector $z_{t+1}$ => the agent can be trained entirely in a virtual environment
[Side by side: cropped 64x64 px frame of the environment vs. its reconstruction from the latent vector]
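A hedged sketch of such a 'dream' rollout, rolling the policy forward purely inside the M model (controller.act, mdn_rnn.step, and mdn_rnn.initial_state are assumed interfaces for illustration, not the paper's API):

```python
def dream_rollout(controller, mdn_rnn, z0, max_steps=1000, temperature=1.15):
    """Run one episode entirely inside the learned M model (no real environment)."""
    z, h = z0, mdn_rnn.initial_state()
    total_reward = 0.0
    for _ in range(max_steps):
        a = controller.act(z, h)
        z, h, done = mdn_rnn.step(z, a, h, temperature)   # sample z_{t+1}, update h, predict 'done'
        total_reward += 1.0                               # VizDoom reward: +1 per surviving time step
        if done:
            break
    return total_reward
```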

VizDoom Experiment
https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/doom_adversarial.mp4
- The agent finds an adversarial policy: it moves in a way such that the monsters never fire
- The uncertainty in the model was too low

VizDoom Experiment

Temperature | Score in Virtual Environment | Score in Actual Environment
0.10 | 2086 ± 140 | 193 ± 58
0.50 | 2060 ± 277 | 196 ± 50
1.00 | 1145 ± 690 | 868 ± 511
1.15 | 918 ± 546 | 1092 ± 556
1.30 | 732 ± 269 | 753 ± 139
Random Policy Baseline | N/A | 210 ± 108
Gym Leaderboard | N/A | 820 ± 58

Wrap up World Models
- Learn the dynamics of the environment
- Train the controller in a virtual environment («dreams»)
- Different ways to train the C Model
- Combine model-free with model-based RL

Temporal Difference Variational Auto-Encoder (TD-VAE)
Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, Théophane Weber (DeepMind)
- More involved than World Models
- No reinforcement learning here; the goal is just to build a model

TD-VAE
- Learn an abstract state representation of the data and a belief state $b_t$: a deterministic, coded representation of the filtering posterior of the state given all observations up to the current time. A belief state contains all the information an agent has about the state of the world, and thus about how to act optimally.
- Learn temporal abstraction: make predictions at the state level, not just the observation level, using latent variables to model the transition between states.
- Learn from temporally separated time steps: the model can make 'jumpy' predictions of the state further into the future.
- In RL the state of the agent represents a belief about the sum of discounted rewards; here the state represents a belief about possible future states.

TD-VAE
[Diagram: observations x_{t1} and x_{t2} at times t1 < t2 feed belief states b_{t1} and b_{t2}; a state prediction network connects b_{t1} to b_{t2}]
- Goal: predict the state at a future time step t2 using all the data seen so far

TD-VAE - Training
[Diagram: belief network producing b_{t1} and b_{t2}; belief distributions p_B at t1 and t2; inference network q over z_{t1}; state prediction network p_T for the jump from t1 to t2; decoder network p_D reconstructing x_{t2}; z_{t1} and z_{t2} are obtained by sampling]
- t1 and t2 are sampled randomly during training

TD-VAE - Predicting
[Diagram: belief network producing b_{t1} from the observations up to t1; belief distribution p_B over z_{t1}; state prediction network p_T producing z_{t2}; decoder generating the predicted observation]
- Goal: given all observations up to t1, predict the state at t2
- Generate the belief state at t1 and sample z_{t1}
- Predict the state distribution at t2 and sample z_{t2}
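A hedged sketch of this jumpy prediction path (belief_rnn, p_B, p_T, and decoder are placeholder callables standing in for the trained networks):

```python
def jumpy_predict(belief_rnn, p_B, p_T, decoder, observations_up_to_t1):
    """Predict an observation at a later time t2 from everything seen up to t1."""
    b_t1 = belief_rnn(observations_up_to_t1)   # aggregate x_1..x_t1 into a belief state b_t1
    z_t1 = p_B(b_t1).sample()                  # sample a state z_t1 ~ p_B(z | b_t1)
    z_t2 = p_T(z_t1).sample()                  # jump forward: z_t2 ~ p_T(z | z_t1)
    return decoder(z_t2)                       # decode z_t2 into a predicted observation x_t2
```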

Noisy Harmonic Oscillator
- Noise is added to the position and velocity; the state consists of frequency, magnitude, and position (phase)
- Hierarchical with two layers: stack TD-VAEs on top of each other
- It is only the position that cannot be accurately predicted; predictions are accurate for dt = 20 but not for dt = 100
- Autoregressive baseline: a simple LSTM that predicts step by step

Wrap up TD-VAE
- Different from step-by-step prediction: jumps in time
- Hierarchical model: stack TD-VAEs on top of each other
- Builds states from observations -> can roll out possible future scenarios
- A belief state represents several possible futures

Thank you for your attention

Questions?