Deep Predictive Model for Autonomous Driving

Presentation transcript:

Deep Predictive Model for Autonomous Driving Wongun Choi

Scene Type Image classification: where was the image taken? City

Static Scene Elements Semantic segmentation: what does each pixel represent? Road, Sidewalk

Dynamic Objects Object detection: where are certain types of objects?

Dynamic Objects Multiple target tracking: how has each object been moving?

Planning?

Future Prediction Behavior prediction: how will each object be moving?

Challenges Multi-modal inputs

Challenges Multi-modal inputs Multi-modal future

Challenges Multi-modal inputs Multi-modal future Accurate time horizon

Challenges Multi-modal inputs Multi-modal future Accurate time horizon Large search space / Limited training data

Previous Works Conditional Variational Autoencoder, Walker et al. 2016. Adversarial Transformers, Vondrick et al. 2017. No previous work addresses all the challenges critical for prediction in the driving scenario.

Previous Works Conditional Variational Autoencoder, Walker et al. 2016. Adversarial Transformers, Vondrick et al. 2017. Activity Forecasting, Kitani et al. 2012. Guided Cost Learning, Finn et al. 2016. No previous work addresses all the challenges critical for prediction in the driving scenario.

DESIRE: Deep Stochastic IOC RNN Encoder-decoder. N. Lee, W. Choi, P. Vernaza, C. Choy, P. Torr, and M. Chandraker, CVPR 2017. End-to-end trainable framework for behavior prediction. Diverse hypothesis generation via cVAE. Data-efficient learning via an IOC-based framework to rank the hypotheses. Iterative refinement of the hypotheses. Two modules: Sample Generation, and Scoring and Refinement.
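
As a rough illustration of the two-stage flow on this slide (sample generation, then scoring and refinement), the Python sketch below shows how the pieces could fit together at inference time. The sampler and scorer modules, their signatures, and the sample/iteration counts are illustrative assumptions, not the paper's actual code.

    import torch

    def desire_inference(past_traj, scene_feat, sampler, scorer, num_samples=20, num_iters=2):
        # Stage 1: draw K diverse future hypotheses from the cVAE sampler,
        # conditioned only on the past trajectory.
        hypotheses = torch.stack([sampler(past_traj) for _ in range(num_samples)])  # (K, T, 2)

        # Stage 2: score the hypotheses with the IOC-based decoder and refine them
        # iteratively with the learned regression offsets.
        for _ in range(num_iters):
            scores, deltas = scorer(hypotheses, past_traj, scene_feat)  # (K,), (K, T, 2)
            hypotheses = hypotheses + deltas
        return hypotheses, scores, hypotheses[scores.argmax()]  # all samples, scores, top-ranked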

Overall Model Images / preprocessed BEV map

Sampling with cVAE Encoding of the past trajectory. Reconstruction of the future trajectory. Latent variable z with KLD regularization. Encoding of the future trajectory (training only). Images / preprocessed BEV map

Sampling with cVAE Images / preprocessed BEV map During training, the cVAE is learned to reconstruct the target future trajectory given the past trajectory, while enforcing z to match the prior distribution (KLD). During testing, z is drawn from the prior distribution. The latent random variable z encourages the model to learn diverse predictions. We condition the sampler solely on the past dynamics, which leads to better generalization. Kingma and Welling 2013, Walker et al. 2016.
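
A minimal PyTorch-style sketch of such a trajectory cVAE is given below, assuming GRU encoders/decoder over 2D positions; the layer sizes, the way z is injected into the decoder, and the loss weighting are illustrative choices rather than the exact DESIRE architecture.

    import torch
    import torch.nn as nn

    class TrajectoryCVAE(nn.Module):
        # Sketch: encode the past (and, during training, the future) trajectory with GRUs,
        # sample a latent z regularized by a KL term, and decode a future trajectory.
        def __init__(self, hidden=48, z_dim=16, t_future=40):
            super().__init__()
            self.t_future = t_future
            self.past_enc = nn.GRU(2, hidden, batch_first=True)
            self.future_enc = nn.GRU(2, hidden, batch_first=True)
            self.to_mu = nn.Linear(2 * hidden, z_dim)
            self.to_logvar = nn.Linear(2 * hidden, z_dim)
            self.decoder = nn.GRU(z_dim + hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 2)

        def forward(self, past, future=None):
            _, h_past = self.past_enc(past)                       # past: (B, T_past, 2)
            h_past = h_past[-1]                                   # (B, hidden)
            if future is not None:                                # training: use the future encoder
                _, h_fut = self.future_enc(future)
                h = torch.cat([h_past, h_fut[-1]], dim=-1)
                mu, logvar = self.to_mu(h), self.to_logvar(h)
                z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            else:                                                 # testing: draw z from the prior
                mu = logvar = None
                z = torch.randn(past.size(0), self.to_mu.out_features, device=past.device)
            dec_in = torch.cat([z, h_past], dim=-1).unsqueeze(1).repeat(1, self.t_future, 1)
            dec_out, _ = self.decoder(dec_in)
            return self.out(dec_out), mu, logvar                  # (B, T_future, 2), posterior params

    def cvae_loss(pred, target, mu, logvar):
        # Reconstruction of the target future plus the KLD regularization on z.
        recon = ((pred - target) ** 2).mean()
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kld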

Ranking with IOC The RNN decoder provides a score for the states of each sample. Encoding of the past trajectory. A global regression vector is learned from the last hidden vector. Images / preprocessed BEV map The CNN learns the static spatial context (e.g., preferred drivable locations, turn direction, etc.).

Ranking with IOC Scene context via CNN features. Interaction among dynamic agents. Dynamics. Images / preprocessed BEV map Need some work to improve!!!
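
The fusion of these three cues could be sketched roughly as follows: per time step, pool the CNN feature at the sample's predicted location, form a simple interaction cue from the other agents' positions, and concatenate both with the agent's own dynamics encoding. The module name, feature sizes, and pooling choices are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    class SceneContextFusion(nn.Module):
        # Rough sketch: fuse dynamics, scene context, and agent interactions into one
        # per-time-step feature for the scoring RNN. Sizes are illustrative.
        def __init__(self, dyn_dim=48, cnn_dim=32, out_dim=48):
            super().__init__()
            self.fuse = nn.Linear(dyn_dim + cnn_dim + 2, out_dim)

        def forward(self, dyn_feat, cnn_map, pos, other_pos):
            # dyn_feat:  (B, dyn_dim) agent's own dynamics encoding at this time step
            # cnn_map:   (B, cnn_dim, H, W) CNN features of the BEV map / image
            # pos:       (B, 2) predicted location, in feature-map grid coordinates
            # other_pos: (B, N, 2) other agents' predicted locations at the same time step
            B, C, H, W = cnn_map.shape
            x = pos[:, 0].long().clamp(0, W - 1)
            y = pos[:, 1].long().clamp(0, H - 1)
            scene = cnn_map[torch.arange(B), :, y, x]            # (B, cnn_dim) pooled scene feature
            interact = (other_pos - pos.unsqueeze(1)).mean(1)    # (B, 2) crude interaction cue
            return torch.relu(self.fuse(torch.cat([dyn_feat, scene, interact], dim=-1)))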

Ranking with IOC Images / preprocessed BEV map Need some work to improve!!!

Ranking with IOC The CNN learns the static cost features. Images / preprocessed BEV map The SCF module combines dynamics, scene context, and interactions to provide a time-varying cost function. The regression vector is learned to refine the "blind" samples further. The model is learned with the max-entropy IOC framework in an end-to-end manner. Ziebart et al. 2008, Finn et al. 2016.
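
In simplified form, such a ranking objective can be sketched as a cross-entropy between the decoder's score distribution over samples and a soft target distribution that favors the samples closest to the ground-truth future, plus a regression term for the refinement offsets. This is a hedged simplification of max-entropy IOC training, not the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def ioc_ranking_loss(scores, hypotheses, gt_future, deltas):
        # scores: (K,)  hypotheses: (K, T, 2)  gt_future: (T, 2)  deltas: (K, T, 2)
        dist = ((hypotheses - gt_future) ** 2).sum(-1).sqrt().mean(-1)  # (K,) mean displacement
        target = F.softmax(-dist, dim=0)                                # soft "expert" distribution
        log_q = F.log_softmax(scores, dim=0)                            # model's ranking distribution
        ranking = -(target * log_q).sum()                               # cross-entropy over samples
        refinement = ((hypotheses + deltas - gt_future) ** 2).mean()    # push refined samples to GT
        return ranking + refinement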

Experiments Datasets: KITTI dataset: 24 video sequences, about 6,000 frames, 2,500 prediction instances; preprocessed BEV maps using velodyne points and semantic segmentation. Stanford Drone Dataset: 16,000 prediction instances; use the images directly. Set-up: Predict 40 frames (4 sec) in the future given a 20-frame past trajectory. 4 / 5 fold cross-validation.
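
For this set-up, building prediction instances from a single object track might look like the sketch below. The frame counts follow the slide; the stride-1 sliding window is an illustrative assumption.

    def make_prediction_instances(track, t_past=20, t_future=40):
        # track: a list of (x, y) positions for one object, one entry per frame.
        # Returns (past, future) pairs: 20 observed frames and 40 frames (4 sec) to predict.
        instances = []
        for t in range(t_past, len(track) - t_future + 1):
            past = track[t - t_past:t]           # 20 past positions (observed trajectory)
            future = track[t:t + t_future]       # 40 future positions (prediction target)
            instances.append((past, future))
        return instances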

Experiments Baselines Linear prediction. RNN ED: a deterministic RNN encoder-decoder without scene/interaction. RNN ED-SI: a deterministic RNN encoder-decoder with scene/interaction. CVAE. DESIRE-S: the proposed method with scene context. DESIRE-SI: the proposed method with scene context and interaction.

Experiments (result slides: quantitative and qualitative figures only, no transcribed text)

Iterative feed-back (refinement result slides: figures only, no transcribed text)

Conclusion We propose an end-to-end trainable model for behavior prediction. Our model can produce multi-modal future predictions with an accurate temporal horizon. The scene context fusion module naturally integrates multiple cues. The IOC-based framework enables data-efficient learning of the predictive model.

Questions & career: wongun@nec-labs.com