Generative Adversarial Imitation Learning

Presentation transcript:

Generative Adversarial Imitation Learning Stefano Ermon Joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li, and Jiaming Song

Reinforcement Learning. Goal: learn policies that map high-dimensional, raw observations to actions. RL needs a cost (reward) signal to learn from.

Imitation. Input: expert behavior generated by πE. Goal: learn the cost function (reward) behind it (Ng and Russell, 2000; Abbeel and Ng, 2004; Syed and Schapire, 2007; Ratliff et al., 2006; Ziebart et al., 2008; Kolter et al., 2008; Finn et al., 2016), etc.

Problem setup. Reinforcement learning (RL) takes a cost function c(s) and produces an optimal policy π for the environment (an MDP). Inverse reinforcement learning (IRL) goes the other way: from the expert's trajectories s0, s1, s2, … to a cost function c(s). Assumptions: the expert is following a policy, and it is rational, i.e., optimal with respect to an (unknown) cost. IRL needs "regularization": the expert's behavior should have small cost and everything else high cost (Ziebart et al., 2010; Rust, 1987).
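
A concrete way to write the two directions, following the maximum causal entropy formulation used in this line of work (Ziebart et al., 2010; Ho and Ermon, 2016), where H(π) denotes the policy's (causal) entropy:

\[
\text{RL}(c) = \operatorname*{arg\,min}_{\pi} \; -H(\pi) + \mathbb{E}_{\pi}[c(s,a)],
\qquad
\text{IRL}(\pi_E) = \operatorname*{arg\,max}_{c} \; \Big( \min_{\pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)].
\]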

Problem setup, continued. Run IRL on the expert's trajectories to get a cost function c(s), then run RL on that cost to get an optimal policy: how similar is it to the expert's behavior, measured with respect to ψ? Same assumptions as before, but the "regularization" is now made explicit as a convex cost regularizer ψ.
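
With the regularizer made explicit, the IRL step from the GAIL paper (Ho and Ermon, 2016) becomes:

\[
\text{IRL}_\psi(\pi_E) = \operatorname*{arg\,max}_{c} \; -\psi(c) + \Big( \min_{\pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)].
\]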

Combining RL ∘ IRL. Let ρπ be the occupancy measure of a policy π, i.e., the distribution of state-action pairs encountered when navigating the environment with that policy, and let ρπE be the expert's occupancy measure. Same assumptions as before. Theorem: ψ-regularized inverse reinforcement learning implicitly seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* (the convex conjugate of ψ).
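
In symbols (following Ho and Ermon, 2016), the occupancy measure and the composition are:

\[
\rho_\pi(s,a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \pi),
\qquad
\text{RL} \circ \text{IRL}_\psi(\pi_E) = \operatorname*{arg\,min}_{\pi} \; -H(\pi) + \psi^{*}\big(\rho_\pi - \rho_{\pi_E}\big).
\]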

Takeaway. Theorem: ψ-regularized inverse reinforcement learning implicitly seeks a policy whose occupancy measure is close to the expert's, as measured by ψ*. The typical definition of IRL is to find a cost function c such that the expert policy is uniquely optimal w.r.t. c. The alternative view: IRL is a procedure that tries to induce a policy matching the expert's occupancy measure, i.e., a generative model.

Special cases. If ψ(c) = constant, i.e., any cost function is allowed, then RL ∘ IRL reduces to exact occupancy measure matching (see below). This is not a useful algorithm: in practice we only have sampled trajectories, and there is too much flexibility in choosing the cost function (and the policy), so it overfits.
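
Why a constant regularizer gives exact matching: the convex conjugate of a constant function is +∞ everywhere except at zero, so the theorem above forces the two occupancy measures to coincide (the exact-matching corollary in Ho and Ermon, 2016):

\[
\psi(c) \equiv \text{const} \;\Longrightarrow\; \text{RL} \circ \text{IRL}(\pi_E) = \operatorname*{arg\,min}_{\pi} \; -H(\pi) \quad \text{s.t.} \quad \rho_\pi = \rho_{\pi_E}.
\]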

Towards apprenticeship learning. Solution: use features f_{s,a} and allow only "simple" cost functions that are linear in them, c(s,a) = θ · f_{s,a}. In regularizer terms, ψ(c) = 0 on this linear class and ψ(c) = ∞ on all other cost functions.
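
As a formula, this amounts to an indicator regularizer over the linear class; the bound on θ is one conventional choice (as in Abbeel and Ng, 2004), not something stated on the slide:

\[
\psi_{\mathcal{C}}(c) =
\begin{cases}
0 & \text{if } c \in \mathcal{C}, \\
+\infty & \text{otherwise,}
\end{cases}
\qquad
\mathcal{C} = \{\, c(s,a) = \theta \cdot f_{s,a} \;:\; \|\theta\|_2 \le 1 \,\}.
\]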

Apprenticeship learning. For that choice of ψ, the RL ∘ IRLψ framework recovers apprenticeship learning: find a policy π performing better than πE with respect to every cost that is linear in the features (Abbeel and Ng, 2004; Syed and Schapire, 2007).
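
Plugged into the theorem, this gives the familiar apprenticeship learning objective (the entropy term is the one addition coming from the occupancy-measure view; classic apprenticeship learning drops it):

\[
\min_{\pi} \; -H(\pi) + \max_{c \in \mathcal{C}} \; \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)].
\]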

Issues with apprenticeship learning. Features need to be crafted very carefully: unless the true expert cost function (assuming it exists) lies in C, there is no guarantee that apprenticeship learning will recover the expert policy. RL ∘ IRLψ(πE) "encodes" the expert behavior as a cost function in C, and it might not be possible to decode it back if C is too simple.

Generative Adversarial Imitation Learning. Solution: use a class of cost functions more expressive than those linear in features.

Generative Adversarial Imitation Learning. Choose a regularizer whose conjugate ψ* is the optimal negative log-loss of the binary classification problem of distinguishing between state-action pairs of π and πE, i.e., of a discriminator D separating the policy π from the expert policy πE. If the two occupancy measures are identical, the classifier does poorly, the classification loss is high, and ψ* is small; if they differ, the classification loss is small and ψ* is large. (Negative log loss: max log D = − min −log D.)
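
Written out, this is the objective from the GAIL paper (Ho and Ermon, 2016), where D plays the role of the GAN discriminator and λ weights the entropy term:

\[
\min_{\pi} \; \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} \;
\mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi),
\]

where the inner maximum is \(\psi_{GA}^{*}(\rho_\pi - \rho_{\pi_E})\), the optimal negative log-loss of the classifier.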

Generative Adversarial Networks. Figure from Goodfellow et al., 2014.

Generative Adversarial Imitation. [Diagram: the policy interacts with the environment to generate trajectories, which are compared against the expert demonstrations.]

How to optimize the objective. Previous apprenticeship learning work required a full dynamics model, a small environment, or repeatedly solving an RL problem in the inner loop. We propose gradient descent over the policy parameters (and the discriminator). J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. ICML 2016.
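
A minimal sketch of that alternating optimization, not the authors' implementation: it assumes PyTorch, Gym's CartPole-v1 (newer 5-tuple step API), and a hypothetical demonstration file expert_cartpole.npz, and it substitutes a plain REINFORCE update for the trust-region policy step used in practice.

# GAIL-style loop sketch: alternate a discriminator update with a policy-gradient
# update whose surrogate cost is log D(s, a). REINFORCE replaces TRPO for brevity,
# and the per-step cost is used directly instead of a discounted return-to-go.
import gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
disc = nn.Sequential(nn.Linear(obs_dim + 1, 64), nn.Tanh(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

data = np.load("expert_cartpole.npz")  # hypothetical file with "states", "actions"
expert_sa = torch.tensor(
    np.concatenate([data["states"], data["actions"][:, None]], axis=1),
    dtype=torch.float32)

def rollout(steps=1000):
    """Collect state-action pairs and log-probabilities under the current policy."""
    states, actions, logps = [], [], []
    s, _ = env.reset()
    for _ in range(steps):
        dist = torch.distributions.Categorical(
            logits=policy(torch.tensor(s, dtype=torch.float32)))
        a = dist.sample()
        states.append(s); actions.append(float(a)); logps.append(dist.log_prob(a))
        s, _, terminated, truncated, _ = env.step(int(a))
        if terminated or truncated:
            s, _ = env.reset()
    return (torch.tensor(np.array(states), dtype=torch.float32),
            torch.tensor(actions).unsqueeze(1),
            torch.stack(logps))

for it in range(200):
    states, actions, logps = rollout()
    policy_sa = torch.cat([states, actions], dim=1)

    # Discriminator step: push D toward 1 on policy pairs, 0 on expert pairs.
    d_loss = bce(disc(policy_sa), torch.ones(len(policy_sa), 1)) \
           + bce(disc(expert_sa), torch.zeros(len(expert_sa), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Policy step: REINFORCE on the cost log D(s, a), with a mean baseline.
    with torch.no_grad():
        cost = nn.functional.logsigmoid(disc(policy_sa)).squeeze(1)
    pi_loss = (logps * (cost - cost.mean())).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()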

Properties. Inherits the pros of policy gradient: convergence to a local optimum, and it can be model-free. Inherits the cons of policy gradient: high variance, and only small steps are allowed. Solution: trust region policy optimization (sketched below).
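
For reference, the standard TRPO step (Schulman et al., 2015) maximizes a surrogate objective under a KL trust-region constraint; in this setting the advantage is computed with respect to the discriminator-based cost:

\[
\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}\big[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \big] \le \delta.
\]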

Results. Input: driving demonstrations in TORCS. Output: a driving policy learned from raw visual inputs. Li et al., 2017. Inferring the Latent Structure of Human Decision-Making from Raw Visual Inputs. https://github.com/YunzhuLi/InfoGAIL

Results

Experimental results

InfoGAIL. Maximize the mutual information between latent variables z and the generated trajectories (Chen et al., 2016). [Diagram: latent variables Z feed into the policy, which acts in the environment to produce trajectories.] Does this recover semantically meaningful latent structure? (Compare interpolation in latent space, Hou, Shen, and Qiu, 2016.)
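
The InfoGAIL objective (Li et al., 2017) augments GAIL with a variational lower bound L_I(π, Q) on the mutual information between z and the trajectory, estimated by an auxiliary posterior network Q(z | τ); roughly:

\[
\min_{\pi, Q} \; \max_{D} \;
\mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))]
- \lambda_1 L_I(\pi, Q) - \lambda_2 H(\pi).
\]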

InfoGAIL. The latent code selects between behaviors: pass left (z = 0) vs. pass right (z = 1). Li et al., 2017. Inferring the Latent Structure of Human Decision-Making from Raw Visual Inputs.

InfoGAIL. Similarly: turn inside (z = 0) vs. turn outside (z = 1). Li et al., 2017. Inferring the Latent Structure of Human Decision-Making from Raw Visual Inputs.

Conclusions. IRL is the dual of an occupancy measure matching problem (generative modeling). Flexible cost functions may be needed, which motivates a GAN-style approach. A policy gradient approach scales to high-dimensional settings. A step towards unsupervised learning of latent structure from demonstrations.