Deep Predictive Model for Autonomous Driving

Presentation transcript:

Deep Predictive Model for Autonomous Driving Wongun Choi

Scene Type Image classification: where was the image taken? City

Static Scene Elements Semantic segmentation: what does each pixel represent? Road, Sidewalk

Dynamic Objects Object detection: where are certain types of objects?

Dynamic Objects Multiple target tracking: how has each object been moving?

Planning?

Future Prediction Behavior prediction: how will each object be moving?

Challenges Multi-modal inputs

Challenges Multi-modal inputs Multi-modal future

Challenges Multi-modal inputs Multi-modal future Accurate time horizon

Challenges Multi-modal inputs Multi-modal future Accurate time horizon Large search space / Limited training data

Previous Works Conditional Variational Autoencoder, Walker et al. 2016. Adversarial Transformers, Vondrick et al. 2017. No previous work addresses all the challenges critical for prediction in the driving scenario.

Previous Works Conditional Variational Autoencoder, Walker et al. 2016. Adversarial Transformers, Vondrick et al. 2017. Activity Forecasting, Kitani et al. 2012. Guided Cost Learning, Finn et al. 2016. No previous work addresses all the challenges critical for prediction in the driving scenario.

DESIRE: Deep Stochastic IOC RNN Encoder-decoder. N. Lee, W. Choi, P. Vernaza, C. Choy, P. Torr, and M. Chandraker, CVPR 2017. End-to-end trainable framework for behavior prediction. Diverse hypothesis generation via cVAE. Data-efficient learning via an IOC-based framework to rank the hypotheses. Iterative refinement of the hypotheses. Two modules: Sample Generation, and Scoring and Refinement.
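
As a rough illustration of the two-stage flow on this slide (sample generation, then scoring and refinement), the Python sketch below shows how the pieces could fit together at inference time. The sampler and scorer modules, their signatures, and the sample/iteration counts are illustrative assumptions, not the paper's actual code.

    import torch

    def desire_inference(past_traj, scene_feat, sampler, scorer, num_samples=20, num_iters=2):
        # Stage 1: draw K diverse future hypotheses from the cVAE sampler,
        # conditioned only on the past trajectory.
        hypotheses = torch.stack([sampler(past_traj) for _ in range(num_samples)])  # (K, T, 2)

        # Stage 2: score the hypotheses with the IOC-based decoder and refine them
        # iteratively with the learned regression offsets.
        for _ in range(num_iters):
            scores, deltas = scorer(hypotheses, past_traj, scene_feat)  # (K,), (K, T, 2)
            hypotheses = hypotheses + deltas
        return hypotheses, scores, hypotheses[scores.argmax()]  # all samples, scores, top-ranked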

Overall Model Images / preprocessed BEV map

Sampling with cVAE Encoding of the past trajectory. Reconstruction of the future trajectory. Latent variable z with KLD regularization. Encoding of the future trajectory (training only). Images / preprocessed BEV map

Sampling with cVAE Images / preprocessed BEV map During training, the cVAE is learned to reconstruct the target future trajectory given the past trajectory, while enforcing z to match the prior distribution (KLD). During testing, z is drawn from the prior distribution. The latent random variable z encourages the model to learn diverse predictions. We condition the sampler solely on the past dynamics, which leads to better generalization. Kingma and Welling 2013, Walker et al. 2016.
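
A minimal PyTorch-style sketch of such a trajectory cVAE is given below, assuming GRU encoders/decoder over 2D positions; the layer sizes, the way z is injected into the decoder, and the loss weighting are illustrative choices rather than the exact DESIRE architecture.

    import torch
    import torch.nn as nn

    class TrajectoryCVAE(nn.Module):
        # Sketch: encode the past (and, during training, the future) trajectory with GRUs,
        # sample a latent z regularized by a KL term, and decode a future trajectory.
        def __init__(self, hidden=48, z_dim=16, t_future=40):
            super().__init__()
            self.t_future = t_future
            self.past_enc = nn.GRU(2, hidden, batch_first=True)
            self.future_enc = nn.GRU(2, hidden, batch_first=True)
            self.to_mu = nn.Linear(2 * hidden, z_dim)
            self.to_logvar = nn.Linear(2 * hidden, z_dim)
            self.decoder = nn.GRU(z_dim + hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 2)

        def forward(self, past, future=None):
            _, h_past = self.past_enc(past)                       # past: (B, T_past, 2)
            h_past = h_past[-1]                                   # (B, hidden)
            if future is not None:                                # training: use the future encoder
                _, h_fut = self.future_enc(future)
                h = torch.cat([h_past, h_fut[-1]], dim=-1)
                mu, logvar = self.to_mu(h), self.to_logvar(h)
                z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            else:                                                 # testing: draw z from the prior
                mu = logvar = None
                z = torch.randn(past.size(0), self.to_mu.out_features, device=past.device)
            dec_in = torch.cat([z, h_past], dim=-1).unsqueeze(1).repeat(1, self.t_future, 1)
            dec_out, _ = self.decoder(dec_in)
            return self.out(dec_out), mu, logvar                  # (B, T_future, 2), posterior params

    def cvae_loss(pred, target, mu, logvar):
        # Reconstruction of the target future plus the KLD regularization on z.
        recon = ((pred - target) ** 2).mean()
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kld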

Ranking with IOC The RNN decoder provides a score for the states of each sample. Encoding of the past trajectory. A global regression vector is learned from the last hidden vector. Images / preprocessed BEV map The CNN learns the static spatial context (e.g., preferred drivable locations, turn direction, etc.).

Ranking with IOC Scene context via CNN features. Interaction among dynamic agents. Dynamics. Images / preprocessed BEV map Need some work to improve!!!
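
The fusion of these three cues could be sketched roughly as follows: per time step, pool the CNN feature at the sample's predicted location, form a simple interaction cue from the other agents' positions, and concatenate both with the agent's own dynamics encoding. The module name, feature sizes, and pooling choices are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    class SceneContextFusion(nn.Module):
        # Rough sketch: fuse dynamics, scene context, and agent interactions into one
        # per-time-step feature for the scoring RNN. Sizes are illustrative.
        def __init__(self, dyn_dim=48, cnn_dim=32, out_dim=48):
            super().__init__()
            self.fuse = nn.Linear(dyn_dim + cnn_dim + 2, out_dim)

        def forward(self, dyn_feat, cnn_map, pos, other_pos):
            # dyn_feat:  (B, dyn_dim) agent's own dynamics encoding at this time step
            # cnn_map:   (B, cnn_dim, H, W) CNN features of the BEV map / image
            # pos:       (B, 2) predicted location, in feature-map grid coordinates
            # other_pos: (B, N, 2) other agents' predicted locations at the same time step
            B, C, H, W = cnn_map.shape
            x = pos[:, 0].long().clamp(0, W - 1)
            y = pos[:, 1].long().clamp(0, H - 1)
            scene = cnn_map[torch.arange(B), :, y, x]            # (B, cnn_dim) pooled scene feature
            interact = (other_pos - pos.unsqueeze(1)).mean(1)    # (B, 2) crude interaction cue
            return torch.relu(self.fuse(torch.cat([dyn_feat, scene, interact], dim=-1)))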

Ranking with IOC Images / preprocessed BEV map Need some work to improve!!!

Ranking with IOC The CNN learns the static cost features. Images / preprocessed BEV map The SCF module combines dynamics, scene context, and interactions to provide a time-varying cost function. The regression vector is learned to refine the "blind" samples further. The model is learned with the max-entropy IOC framework in an end-to-end manner. Ziebart et al. 2008, Finn et al. 2016.
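
In simplified form, such a ranking objective can be sketched as a cross-entropy between the decoder's score distribution over samples and a soft target distribution that favors the samples closest to the ground-truth future, plus a regression term for the refinement offsets. This is a hedged simplification of max-entropy IOC training, not the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def ioc_ranking_loss(scores, hypotheses, gt_future, deltas):
        # scores: (K,)  hypotheses: (K, T, 2)  gt_future: (T, 2)  deltas: (K, T, 2)
        dist = ((hypotheses - gt_future) ** 2).sum(-1).sqrt().mean(-1)  # (K,) mean displacement
        target = F.softmax(-dist, dim=0)                                # soft "expert" distribution
        log_q = F.log_softmax(scores, dim=0)                            # model's ranking distribution
        ranking = -(target * log_q).sum()                               # cross-entropy over samples
        refinement = ((hypotheses + deltas - gt_future) ** 2).mean()    # push refined samples to GT
        return ranking + refinement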

Experiments Datasets: KITTI dataset: 24 video sequences, about 6,000 frames, 2,500 prediction instances; preprocessed BEV maps using velodyne points and semantic segmentation. Stanford Drone Dataset: 16,000 prediction instances; use the images directly. Set-up: Predict 40 frames (4 sec) in the future given a 20-frame past trajectory. 4 / 5 fold cross-validation.
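
For this set-up, building prediction instances from a single object track might look like the sketch below. The frame counts follow the slide; the stride-1 sliding window is an illustrative assumption.

    def make_prediction_instances(track, t_past=20, t_future=40):
        # track: a list of (x, y) positions for one object, one entry per frame.
        # Returns (past, future) pairs: 20 observed frames and 40 frames (4 sec) to predict.
        instances = []
        for t in range(t_past, len(track) - t_future + 1):
            past = track[t - t_past:t]           # 20 past positions (observed trajectory)
            future = track[t:t + t_future]       # 40 future positions (prediction target)
            instances.append((past, future))
        return instances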

Experiments Baselines Linear prediction. RNN ED: a deterministic RNN encoder-decoder without scene/interaction. RNN ED-SI: a deterministic RNN encoder-decoder with scene/interaction. CVAE. DESIRE-S: the proposed method with scene context. DESIRE-SI: the proposed method with scene context and interaction.

Experiments (result slides: quantitative and qualitative figures only, no transcribed text)

Iterative feed-back (refinement result slides: figures only, no transcribed text)

Conclusion We propose an end-to-end trainable model for behavior prediction. Our model can produce multi-modal future predictions with an accurate temporal horizon. The scene context fusion module naturally integrates multiple cues. The IOC-based framework enables data-efficient learning of the predictive model.

Questions & career: wongun@nec-labs.com