
One-Shot Hierarchical Imitation Learning of Compound Visuomotor Tasks (and two related papers). Tianhe Yu, Pieter Abbeel, Sergey Levine, Chelsea Finn. Presented by Yuchen Mo, Feb 26, 2019.

Contents: Objectives & general approaches; Model: phase predictor, domain-adaptive meta-learning, one-shot skill composition; Experiments; Contributions & future improvements.

Objective “Learning multi-stage vision-based tasks on a real robot from a single video of a human performing the task, while leveraging demonstration data of subtasks with other objects.” What we have during meta-training: both videos of humans and teleoperated robot episodes performing primitive skills, without labels. What we have during meta-testing: a single video of a human performing a multi-stage task. What we want: imitate the multi-stage task. Previous work on compound-task learning uses demonstrations that are labeled or segmented into individual activities, commands, or descriptions, or assumes a library of pre-specified primitives. Here: a dataset of demonstrations that are not labeled by activity or command.

Method Suppose we have a phase predictor for each primitive; then the objective can be decomposed into three steps: decompose the test-time human video into primitives using a primitive phase predictor; compute a policy for each primitive; sequentially execute each policy.

General approach Learn a phase predictor; for each primitive, perform imitation learning; then perform one-shot composition of the primitives.

Phase predictor Recall our input dataset: demonstrations of primitives without labels. Questions: 1. How do we segment human demonstrations of compound tasks? 2. When do we transition from one policy to the next? 3. What if the human and robot demonstrations differ in length? Solution: given a video of a partial demonstration or execution of any primitive, predict how much temporal progress has been made towards the completion of that primitive, without any label for which primitive was executed in that segment. (This is what lets the robot learn a sequence of policies at meta-test time.)

Phase predictor Two predictors, one for human and one for robot demonstration data. Training data: the first t frames of a human/robot demonstration from the meta-training dataset. Training label: t/T, where T is the total length of that demonstration. Objective: minimize the mean squared error. Network: VGG-16 + LSTM, swish nonlinearity, layer normalization and dropout. The two predictors are identical except for the data they are trained on.
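A minimal sketch of what such a phase predictor could look like is below, assuming PyTorch; the per-frame encoder, hidden sizes, and dropout rate are illustrative stand-ins rather than the paper's exact architecture (which uses VGG-16 features).

```python
import torch
import torch.nn as nn

class PhasePredictor(nn.Module):
    """Predicts temporal progress t/T in (0, 1) from the first t frames of a demonstration."""
    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        # Per-frame visual encoder; the paper uses VGG-16 features instead of this small CNN.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.SiLU(),   # SiLU == swish
            nn.Conv2d(32, 64, 5, stride=2), nn.SiLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, feat_dim), nn.SiLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.drop = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, frames):                          # frames: (B, t, 3, H, W)
        B, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, t, -1)
        out, _ = self.lstm(feats)
        h = self.drop(self.norm(out[:, -1]))            # last hidden state summarizes the prefix
        return torch.sigmoid(self.head(h)).squeeze(-1)  # predicted phase in (0, 1)

def phase_loss(model, prefix_frames, t, T):
    """MSE between the predicted phase for a prefix of length t and the target t/T."""
    return ((model(prefix_frames) - t / T) ** 2).mean()
```

The same module would be instantiated twice, once trained on human prefixes and once on robot prefixes, mirroring the two predictors described above.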

Imitation learning Domain-adaptive meta-learning (DAML) is an extension of model-agnostic meta-learning (MAML). Meta-learning assumes that meta-training and meta-test tasks are sampled from the same distribution and share a common structure, so meta-learning corresponds to learning that structure. Imitation learning here means inferring a policy from a single human demonstration. (This can also be explained with probabilistic graphical models.)

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) (Chelsea Finn, Pieter Abbeel, Sergey Levine) Aims to learn a shared structure across tasks by learning a set of neural-network parameters such that, with one or a few gradient-descent steps, the model generalizes effectively to any one of those tasks. Notation: T is a task, D_T is labeled data for task T, θ is the initial model parameters, L(θ, D) is the loss function. Meta-training: θ ← θ − β ∇_θ Σ_T L(θ − α ∇_θ L(θ, D_T^train), D_T^val). Meta-testing: adapt θ'_T = θ − α ∇_θ L(θ, D_T^train) for the new task. (Figure 1: the adaptation, viewed in parameter space.)

MAML (cont.) α (large): inner, per-task step size; β (small): outer, meta step size. Randomly initialize θ. While not done: sample a batch of tasks; for each task i, compute the gradient of its training loss at θ and take a gradient step, storing the result in θ'_i; then, for each task i, compute the gradient of its validation loss at θ'_i, and update the global parameters θ with the sum of these gradients. MAML itself also has a reinforcement-learning variant, but it is not used in this paper.
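A compact, hedged sketch of this loop in PyTorch is below. It assumes a model whose forward pass can be run with explicit parameters via torch.func.functional_call; task_loss, the (inputs, targets) batch format, and the step sizes are illustrative rather than the paper's settings.

```python
import torch
from torch.func import functional_call

def task_loss(model, params, batch):
    """Supervised loss computed with an explicit parameter dictionary (batch = (inputs, targets))."""
    inputs, targets = batch
    preds = functional_call(model, params, (inputs,))
    return ((preds - targets) ** 2).mean()

def maml_outer_step(model, tasks, alpha=0.01, beta=0.001):
    """One MAML meta-update over a batch of tasks, each providing 'train' and 'val' batches."""
    theta = dict(model.named_parameters())
    meta_grads = {k: torch.zeros_like(v) for k, v in theta.items()}

    for task in tasks:
        # Inner step: adapt on the task's training data
        # (create_graph keeps second-order terms for the outer gradient).
        loss = task_loss(model, theta, task["train"])
        grads = torch.autograd.grad(loss, list(theta.values()), create_graph=True)
        theta_i = {k: v - alpha * g for (k, v), g in zip(theta.items(), grads)}

        # Outer gradient: validation loss of the adapted parameters, differentiated w.r.t. theta.
        val_loss = task_loss(model, theta_i, task["val"])
        outer = torch.autograd.grad(val_loss, list(theta.values()))
        for (k, _), g in zip(theta.items(), outer):
            meta_grads[k] += g

    # Meta-update of the shared initialization with the summed validation gradients.
    with torch.no_grad():
        for k, v in theta.items():
            v -= beta * meta_grads[k]
```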

Domain-Adaptive Meta-Learning (DAML) DAML learns to translate a video of a human performing a task into a policy that performs that task. It adds an "adaptation objective" to the MAML objective: min_{θ,ψ} Σ_T L_BC(θ − α ∇_θ L_ψ(θ, d^h_T), d^r_T), where d^h and d^r are the human and robot demonstrations, and L_ψ is a learned critic (adaptation loss) such that gradient descent on L_ψ yields effective adaptation. For a policy φ and robot demonstration d^r, L_BC is the behavior-cloning loss on the demonstrated actions. Inner to outer: loss for domain transfer => loss for the robot cloning the behavior => loss of the task. The optimization is not only over θ but also over the domain-transfer (adaptation-loss) parameters ψ. Previous DAML work: different meta-learning tasks mean different objects. This paper: primitives performed with different sets of objects.
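The sketch below illustrates, under stated assumptions rather than as the paper's exact implementation, how the inner adaptation step on a human video with the learned loss L_ψ and the outer behavior-cloning step on the paired robot demonstration could fit together; the data layout, the use of the policy's outputs as input to L_ψ, and the helper names are hypothetical.

```python
import torch
from torch.func import functional_call

def daml_meta_objective(policy, learned_loss, theta, human_demo, robot_demo, alpha=0.01):
    """DAML-style meta-objective for one task: adapt theta on the human video with the
    learned loss L_psi (no action labels available), then evaluate behavior cloning with
    the adapted parameters on the robot demonstration, which does contain actions.
    theta is dict(policy.named_parameters()); learned_loss is an nn.Module with parameters psi."""
    # Inner step: gradient of the learned adaptation loss on the human video.
    # (In the paper L_psi acts on intermediate policy features; here, for brevity,
    # it is applied to the policy outputs.)
    outputs = functional_call(policy, theta, (human_demo["frames"],))
    adapt_loss = learned_loss(outputs).mean()
    grads = torch.autograd.grad(adapt_loss, list(theta.values()), create_graph=True)
    phi = {k: v - alpha * g for (k, v), g in zip(theta.items(), grads)}

    # Outer step: behavior-cloning loss of the adapted parameters phi on the robot demo.
    # Because of create_graph=True, gradients flow back into both theta and psi.
    pred_actions = functional_call(policy, phi, (robot_demo["frames"],))
    bc_loss = ((pred_actions - robot_demo["actions"]) ** 2).mean()
    return bc_loss
```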

One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning (DAML) Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, Sergey Levine. An example of L_ψ: a neural network whose parameters are ψ.

One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning (DAML) Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, Sergey Levine. "L_ψ: a learned critic such that gradient descent on L_ψ yields effective adaptation"... what does that mean?

Composition What we have now: 1. a human phase predictor; 2. a robot phase predictor; 3. a DAML one-shot imitator that learns a robot policy for a specific primitive from a human demonstration. Composition algorithm (see the sketch below): 1. Feed the human demonstration into the human phase predictor frame by frame until the output exceeds 1 − ε; this yields one primitive segment. Repeat until the end of the video. 2. Given those primitive demonstrations, translate each into a policy with the DAML one-shot imitator. 3. Run the policies sequentially, feeding the observed images into the robot phase predictor and switching to the next policy when its output exceeds 1 − ε.
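A rough sketch of this composition procedure as code is below; the env interface, the one_shot_imitator.adapt() helper, and the tensor shapes are hypothetical placeholders for the paper's actual system.

```python
import torch

def segment_human_video(frames, human_phase_predictor, eps=0.1):
    """Split a human video of a compound task into primitive segments by running the
    human phase predictor on growing prefixes and cutting when progress > 1 - eps."""
    segments, start = [], 0
    for t in range(1, len(frames) + 1):
        progress = human_phase_predictor(frames[start:t].unsqueeze(0)).item()
        if progress > 1 - eps:
            segments.append(frames[start:t])
            start = t
    if start < len(frames):
        segments.append(frames[start:])            # trailing partial segment, if any
    return segments

def execute_compound_task(segments, one_shot_imitator, robot_phase_predictor, env,
                          eps=0.1, max_steps=200):
    """Translate each human segment into a robot policy with the one-shot imitator, then run
    the policies in sequence, switching when the robot phase predictor nears completion."""
    for segment in segments:
        policy = one_shot_imitator.adapt(segment)  # hypothetical adapt() helper
        obs_history = []
        obs = env.get_observation()                # hypothetical environment interface
        for _ in range(max_steps):
            obs_history.append(obs)
            obs = env.step(policy(obs))
            prefix = torch.stack(obs_history).unsqueeze(0)
            if robot_phase_predictor(prefix).item() > 1 - eps:
                break                              # move on to the next primitive
```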

Model summary

Experiments (Simulated Order Fulfillment in MuJoCo, Sawyer arm) Task: pick and place a single object into the bin, then push the bin to the goal; the other objects are distractors. In simulation, the "human" demonstration is generated by an expert policy trained with proximal policy optimization (PPO); it provides a demonstration for each primitive, and these are concatenated to generate more test data. Success is defined as having the correct objects in the bin and the bin at the target. The sliding-window baseline assumes that fixed-length windows are an appropriate representation of primitives. Results: phase prediction and gradient-based meta-learning are essential for good performance. Failures are caused by the one-shot imitator on grasping, which requires precise visual object localization and precise control.

Experiments (PR2 Kitchen Serving)

Demo https://sites.google.com/view/one-shot-hil

Contributions & Future improvements Contributions: 1. Meta-learning from unlabeled demonstrations of primitive tasks. 2. End-to-end video-to-policy mapping at test time. Improvements: 1. Improve the performance of the one-shot imitation learning method, potentially by incorporating reinforcement learning or other forms of online feedback. 2. Use more unstructured demonstration data, such as human and robot demonstrations of temporally-extended tasks that come from distinct sources rather than full-length demonstrations of the long task and need no labels; unlabeled video can convey motions or subtasks that are difficult to describe, such as the transitions from one subtask to another, and it does not require the human to know any grammar of commands. 3. Scale to large, unlabeled datasets, which may remove the need for an expert to decide the set of skills that constitute the primitives during training.

Thank you for listening! Q&A?