Deep Predictive Model for Autonomous Driving Wongun Choi
Scene Type Image classification: where was the image taken? City
Static Scene Elements Semantic segmentation: what is each pixel? Road Sidewalk
Dynamic Objects Object detection: where are certain types of objects?
Dynamic Objects Multiple target tracking: how has each object been moving?
Planning? ?
Future Prediction Behavior prediction: how will each object move?
Challenges Multi-modal inputs. Multi-modal future. Accurate time horizon. Large search space / limited training data.
Previous Works Conditional Variational Autoencoder, Walker et al 2016. Adversarial Transformers, Vondrick et al 2017. Activity Forecasting, Kitani et al 2012. Guided Cost Learning, Finn et al 2016. No previous work addresses all the challenges critical to prediction in driving scenarios.
DESIRE: Deep Stochastic IOC RNN Encoder-decoder N. Lee, W. Choi, P. Vernaza, C. Choy, P. Torr, and M. Chandraker, CVPR 2017 End-to-end trainable framework for behavior prediction. Diverse hypothesis generation via cVAE. Data-efficient learning via an IOC-based framework to rank the hypotheses. Iterative refinement of the hypotheses. Sample Generation Scoring and Refinement
Overall Model Images / preprocessed BEV map
Sampling with cVAE Encoding the past trajectory. Reconstruct the future trajectory. Latent variable z with KLD regularization. Encoding the future trajectory. Train only. Images / preprocessed BEV map
Sampling with cVAE Images / preprocessed BEV map During training, the cVAE is trained to reconstruct the target future trajectory given the past trajectory, while z is regularized to match the prior distribution (KLD). During testing, z is drawn from the prior distribution. The latent random variable z encourages the model to learn diverse predictions. We condition the sampler solely on the past dynamics information, which leads to better generalization. Kingma and Welling 2013, Walker et al 2016
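The KLD term and the train/test difference in how z is obtained can be sketched in a few lines. This is a minimal illustration, not the DESIRE implementation: it assumes a diagonal Gaussian posterior and a standard normal prior (the usual choice following Kingma and Welling 2013), and the function names `kld_to_standard_normal`, `sample_z_train`, and `sample_z_test` are hypothetical.

```python
import math
import random


def kld_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.

    Assumption: diagonal Gaussian posterior, standard normal prior.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_z_train(mu, log_var, rng):
    """Training: reparameterized draw z = mu + sigma * eps, eps ~ N(0, 1).

    mu and log_var would come from the encoders of the past and future
    trajectories (the future encoder is used at training time only).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def sample_z_test(dim, rng):
    """Testing: the future is unknown, so z is drawn from the prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.1], [-1.0, -0.5]
    print("KLD:", kld_to_standard_normal(mu, log_var))
    print("train z:", sample_z_train(mu, log_var, rng))
    # Several prior draws give several distinct future hypotheses.
    print("test z:", [sample_z_test(2, rng) for _ in range(3)])
```

Drawing several z values at test time and decoding each one is what yields multiple, diverse trajectory hypotheses from a single observed past.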
Ranking with IOC The RNN decoder provides a score for each state of each sample. Encoding the past trajectory. A global regression vector is learned from the last hidden vector. Images / preprocessed BEV map The CNN learns the static spatial context (e.g., favored drivable locations, turn direction, etc.).
Ranking with IOC Scene context via CNN features. Interaction among dynamic agents. Dynamics. Images / preprocessed BEV map Need some work to improve!!!
Ranking with IOC Images / preprocessed BEV map The CNN learns the static cost features. The SCF module combines dynamics, scene context, and interactions to provide a time-varying cost function. A regression vector is learned to further refine the “blind” samples. The model is trained end-to-end with the maximum-entropy IOC framework. Ziebart et al 2008, Finn et al 2016
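The ranking and refinement steps can be illustrated with a small sketch. It assumes that each sample's per-step scores are summed into one trajectory score and that, as in maximum-entropy IOC, the probability of a sample is proportional to the exponential of that score (a score being treated here as a reward, i.e. negative cost). The function names and the way the regression output is applied as per-step offsets are hypothetical; in the model these quantities come from the RNN decoder and the SCF module.

```python
import math


def trajectory_scores(step_scores):
    """Sum the per-time-step scores of each sample into one score."""
    return [sum(steps) for steps in step_scores]


def maxent_probabilities(scores):
    """p(sample) proportional to exp(score); max is subtracted for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_samples(step_scores):
    """Return sample indices ordered from highest to lowest total score."""
    scores = trajectory_scores(step_scores)
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


def refine(trajectory, offsets):
    """Apply a regressed (dx, dy) correction to every predicted position."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(trajectory, offsets)]


if __name__ == "__main__":
    # Three samples, two time steps each (illustrative numbers only).
    step_scores = [[1.0, 1.0], [3.0, 0.0], [0.0, 0.0]]
    print(rank_samples(step_scores))
    print(maxent_probabilities(trajectory_scores(step_scores)))
    print(refine([(0.0, 0.0), (1.0, 1.0)], [(0.5, 0.0), (0.0, -1.0)]))
```

Because the samples are generated without looking at the scene, scoring and refining them is where scene context and interactions enter the prediction.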
Experiments Datasets KITTI dataset: 24 video sequences, about 6,000 frames, 2,500 prediction instances. Preprocessed BEV maps using Velodyne points and semantic segmentation. Stanford Drone Dataset: 16,000 prediction instances. Use the images directly. Set-up Predict 40 frames (4 sec) into the future given a 20-frame past trajectory. 4 / 5 fold cross validation.
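The set-up implies 10 frames per second (40 frames over 4 sec), so the 20 observed frames cover 2 sec. A prediction instance can be cut from a longer track as below. This is a sketch only: the function name `make_instances` and the `stride` parameter are hypothetical, and how the instances were actually extracted from the datasets is not stated on the slide.

```python
def make_instances(track, past_len=20, future_len=40, stride=1):
    """Split one track (a list of (x, y) positions, one per frame) into
    (past, future) pairs of the lengths used in the set-up.

    stride is a hypothetical knob controlling how far consecutive
    windows are shifted; tracks shorter than one window yield nothing.
    """
    window = past_len + future_len
    instances = []
    for start in range(0, len(track) - window + 1, stride):
        past = track[start:start + past_len]
        future = track[start + past_len:start + window]
        instances.append((past, future))
    return instances


if __name__ == "__main__":
    track = [(float(t), 0.0) for t in range(61)]  # 61 frames of a toy track
    instances = make_instances(track)
    print(len(instances), len(instances[0][0]), len(instances[0][1]))
```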
Experiments Baselines Linear prediction. RNN ED: a deterministic RNN autoencoder without scene/interaction. RNN ED-SI: a deterministic RNN autoencoder with scene/interaction. CVAE. DESIRE-S: the proposed method with scene context. DESIRE-SI: the proposed method with scene context and interaction.
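For reference, a linear prediction baseline can be sketched as follows. The slide does not say how the linear model is fitted; this sketch assumes an ordinary least-squares line fitted to each coordinate over the observed frames and extrapolated over the prediction horizon. The names `fit_line` and `linear_predict` are hypothetical.

```python
def fit_line(values):
    """Least-squares fit of value = slope * t + intercept for t = 0..n-1.

    Requires at least two observations.
    """
    n = len(values)
    t_mean = (n - 1) / 2.0
    v_mean = sum(values) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum(
        (t - t_mean) * (v - v_mean) for t, v in enumerate(values)
    ) / denom
    intercept = v_mean - slope * t_mean
    return slope, intercept


def linear_predict(past, future_len=40):
    """Extrapolate x(t) and y(t) independently for the next future_len frames."""
    slope_x, icpt_x = fit_line([p[0] for p in past])
    slope_y, icpt_y = fit_line([p[1] for p in past])
    n = len(past)
    return [
        (slope_x * t + icpt_x, slope_y * t + icpt_y)
        for t in range(n, n + future_len)
    ]


if __name__ == "__main__":
    # Toy past trajectory moving at constant velocity for 20 frames.
    past = [(1.0 * t, 2.0 * t + 1.0) for t in range(20)]
    future = linear_predict(past)
    print(len(future), future[0], future[-1])
```

Such a baseline produces a single future per past trajectory, so it cannot represent a multi-modal future such as a choice between turning and going straight.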
Experiments
Iterative feed-back
Conclusion We propose an end-to-end trainable model for behavior prediction. Our model can produce multi-modal future predictions with an accurate temporal horizon. The scene context fusion module naturally integrates multiple cues. The IOC-based framework enables us to learn a predictive model.
Questions & career: wongun@nec-labs.com