Diversity meets Deep Networks: Inference, Ensembles, and Applications

Alexander Kirillov, Bogdan Savchynskyy, Carsten Rother (Technische Universität Dresden); Stefan Lee (Indiana University, Virginia Tech); Dhruv Batra (Virginia Tech)

Schedule
Time          Topic                                                   Presenter
2:15 – 3:00   Opening Remarks + Need for Multiple Diverse Solutions   Dhruv
3:00 – 3:15   Coffee Break
3:15 – 4:45   Generating Diverse Solutions from a Single Model        Alex & Bogdan
4:45 – 5:00
5:00 – 5:45   Training Diverse Deep Ensembles                         Stefan
(C) Dhruv Batra

1. Please interrupt & ask questions!
2. All slides available online.

Image Classification
ImageNet Large Scale Visual Recognition Challenge (ILSVRC): 1000 object classes, 1.4M/50k/100k images
Example labels: Person, Dalmatian

Image Classification
Image Credit: [He et al. CVPR16]

Revolution of Depth
Image Credit: [He et al. CVPR16]

Image Captioning
Image Credit: [Vinyals et al. CVPR15]

Visual Question Answering (VQA)
Slide Credit: Stan Antol

AI far from perfect

A Brief History of AI
Image Credit: Joseph Mehling;

A Brief History of AI
“We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. … [Our] conjecture is that every aspect of learning or intelligence can be so precisely described that a machine can be made to simulate it. An attempt will be made … to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”

Why is AI hard?
“The three biggest challenges to a computer being able to build knowledge from data are Ambiguity, Ambiguity, Ambiguity.”
-- Ray Mooney, CS, UT-Austin (quote in reference to IBM Watson)

Linguistic Ambiguity
“I saw her duck”
Image Credit: Liang Huang

What is this tutorial about?
Multiple Diverse Predictions in ML and AI

Classical Machine Learning
Image Classification: “Person”

Semantic Segmentation

Pose Estimation

Image Captioning: “Two people are petting horses.”

VQA: “How many people are there?” “2”

Dialogue System: “Count us in!”
Image Credit: Google Research Blog

Machine Learning: Input → Output
This Tutorial: Input → Multiple Outputs

Example: Segmentation
[Batra et al. ECCV12], [Guzman-Rivera et al. NIPS12], [Yadollahpour et al. CVPR13], [Gimpel et al. EMNLP13], [Guzman-Rivera AISTATS13], [Premachandran et al. CVPR14], [Prasad et al. NIPS14], [Guzman-Rivera et al. AISTATS14], [Sun et al. CVPR15], [Ahmed et al. ICCV15], [Sun et al. NIPS15]

Exponentially-Large Item Set
Semantic Segmentation

Pose Estimation

Image Captioning
“Two people are standing next to two horses.”
“A man pets a horse while a woman looks on.”
“There is a man sitting on a horse.”
“Two people and two horses standing in a field.”

Neural Image Captioning
Image Embedding (VGGNet), 4096-dim
Components: Convolution Layer + Non-Linearity; Pooling Layer; Fully-Connected MLP

Decoder: a chain of RNN steps starting from <start>; each step outputs P(next) over the next word.
Example output: “Two people and two horses.”

Beam Search Demo
Classical Beam Search vs. Diverse Beam Search
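To make the "Classical Beam Search" side of the demo concrete, here is a minimal sketch of beam search over a toy next-word model. The `TOY_MODEL` table, its probabilities, and the function name `beam_search` are hypothetical and chosen only for illustration; a real captioner would compute P(next) from the RNN state and the image embedding. The diverse variant named on the slide is not specified in this transcript, so it is not sketched here.

```python
import math

# Hypothetical toy model: P(next word | previous word). Illustration only;
# a real captioner conditions on the image and the full RNN state.
TOY_MODEL = {
    "<start>": {"two": 0.6, "a": 0.4},
    "two": {"people": 0.7, "horses": 0.3},
    "a": {"man": 0.9, "horse": 0.1},
    "people": {"<end>": 1.0},
    "horses": {"<end>": 1.0},
    "man": {"<end>": 1.0},
    "horse": {"<end>": 1.0},
}


def beam_search(model, beam_width, max_len=10):
    """Return up to beam_width finished (log_prob, tokens) pairs, best first."""
    beams = [(0.0, ["<start>"])]
    finished = []
    for _ in range(max_len):
        # Expand every partial sequence by every possible next word.
        candidates = []
        for logp, seq in beams:
            for token, p in model[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [token]))
        # Keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            if seq[-1] == "<end>":
                finished.append((logp, seq))
            else:
                beams.append((logp, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_width]


for logp, seq in beam_search(TOY_MODEL, beam_width=2):
    print(round(math.exp(logp), 2), " ".join(seq))
```

With a beam width of 2, the two surviving sequences in this toy model are the two highest-probability captions. In realistic models the surviving beams often differ only in a word or two, which is the motivation for a diversity-aware variant.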

[What?]: Multiple Diverse Predictions in ML and AI
[Why?]: Need for Diversity
    Overcoming poor models
    Hedging against ambiguity
    Don’t be boring
[How?]: Techniques
    Part 1 (Alex/Bogdan): Diverse Solutions from a Single (Deep) Model
    Part 2 (Stefan): Training Diverse Deep Ensembles
[Now what?]: What do I do with multiple predictions?


Need#1: Poor Models
Approximation Error: Model-Class is Wrong! Human Body ≠ Tree
Figure Courtesy: [Yang & Ramanan ICCV ‘11]
Unfortunately, we often run into a number of problems with MAP. Most often, our model is simply wrong, so even if we predict the most probable state under the model, it can be very far from the ground truth. For example, a tree model assumes that we walk around like this, with our limbs always un-occluded.

Need#1: Poor Models
Approximation Error: Model-Class is Wrong!
VQA model:
    Image → Embedding (VGGNet), 4096-dim (Convolution Layer + Non-Linearity; Pooling Layer; Fully-Connected MLP)
    Question (“How many horses are in this image?”) → Embedding (LSTM)
    Neural Network → Softmax over top K answers

Need#1: Poor Models
So how can we make multiple predictions? Well, it’s a probabilistic model, so we could sample from the distribution. Unfortunately, sampling is rather wasteful, since we observe the same modes of the distribution over and over again. And if there is a low-probability mode, we will have to wait a long time to observe a sample from it.
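The point about sampling being wasteful can be illustrated with a small sketch. The three-mode distribution and the helper names below are hypothetical; the only fact used is that an outcome with probability p needs 1/p independent draws on average before it first appears.

```python
import random
from collections import Counter


def draw_samples(probs, n, seed=0):
    """Draw n i.i.d. samples from a discrete distribution; return counts per mode."""
    rng = random.Random(seed)
    modes = list(probs)
    weights = [probs[m] for m in modes]
    return Counter(rng.choices(modes, weights=weights, k=n))


def expected_draws_until_seen(p):
    """Mean of a geometric distribution: average draws until a mode of probability p appears."""
    return 1.0 / p


# Hypothetical distribution with one dominant mode and one rare mode.
toy_distribution = {"A": 0.90, "B": 0.099, "C": 0.001}

counts = draw_samples(toy_distribution, n=100)
print(counts)  # mostly repeats of the dominant mode
print(expected_draws_until_seen(toy_distribution["C"]))  # about 1000 draws on average
```

Most of the 100 draws land on the dominant mode, so they carry no new information, while the rare mode is expected to show up only about once per thousand draws.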

Need#2: Ambiguity
Bayes Error: Not enough information
“I saw her duck”
Even if you can compute MAP, there may simply be multiple acceptable answers. For example, this woman could be rotating left or rotating right. This could be a young woman looking away or an old lady looking left. When we have a user-in-the-loop, different users may expect different outputs from the same input.
One instance / Two instances?

Need#2: Ambiguity
Dialogue System
“We’ll be there!”
“Sorry, we won’t be able to make it”
“Count us in!”
“Thanks so much, but we’re out of town.”
“Can I bring my dog?”
Image Credit: Google Research Blog

Need#2: Ambiguity
Image Captioning
“Single engine train rolling down the tracks”
“A steam locomotive is blowing steam”
“A locomotive drives along the tracks among trees and bushes”
“An engine is coming down the tracks”
“An old fashioned train with steam coming out of its pipe”

Need#3: Don’t be boring

Need#3: Don’t be boring
“[One] bizarre feature of our early prototype was its propensity to respond with ‘I love you’ to seemingly anything. As adorable as this sounds, it wasn’t really what we were hoping for. [It] turns out that responses like ‘Thanks’, ‘Sounds good’, and ‘I love you’ are super common, so the system would lean on them as a safe bet if it was unsure.”

Need#3: Don’t be boring
“[It] just said everything was awesome all the time: ‘all the people had a great time; everybody had an awesome time; it was a great day.’”
-- Meg Mitchell

Need#3: Don’t be boring
“I don’t know”, “no”, “I love you”, “yes”

Machine Learning: Input → Boring Output (“I love you”)
This Tutorial: Input → Multiple Outputs

Diverse Predictions: Now what?
[Batra et al. ECCV12], [Guzman-Rivera et al. NIPS12], [Yadollahpour et al. CVPR13], [Gimpel et al. EMNLP13], [Guzman-Rivera AISTATS13], [Premachandran et al. CVPR14], [Prasad et al. NIPS14], [Guzman-Rivera et al. AISTATS14], [Sun et al. CVPR15], [Ahmed et al. ICCV15], [Sun et al. NIPS15]

Your Options (in order of increasing side information)
1. Nothing: User-in-the-loop [ECCV12]. Additional Information: None
2. Tracking [ECCV12]. Additional Information: Time
3. (Approximate) Min Bayes Risk [CVPR14]. Additional Information: Loss function
4. Re-ranking [CVPR13]. Additional Information: higher-order constraints
5. Holistic Scene Understanding

User-in-the-loop

Pose Estimation Setup
Model: Mixture of Parts Tree [Park & Ramanan, ICCV ‘11]
Inference: Dynamic Programming
Dataset: PARSE
Next, we applied our approach to pose-tracking in videos. We replicated the setup of Park & Ramanan, who use a mixture of parts tree model. Exact inference can be performed by dynamic programming.
Image Credit: [Yang & Ramanan, ICCV ‘11]

Pose Estimation: 10 guesses/frame
[Premachandran, Tarlow, Batra, CVPR14]

Pose Tracking
Chain CRF with M states at each frame (the DivMBest solutions)
We compute M solutions in each frame of the video, and then choose a smooth trajectory using the Viterbi algorithm.
Image Credit: [Yang & Ramanan, ICCV ‘11]
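The selection step described above (M candidates per frame, then a smooth path via Viterbi) can be sketched as a dynamic program over a chain. This is only an illustration under simplifying assumptions: each candidate pose is reduced to a single hypothetical 1-D position, the smoothness cost is the absolute difference between consecutive positions, and `smooth_weight` is an invented parameter. The actual features and costs used in the cited work are not given in this transcript.

```python
def smooth_trajectory(candidates, unary_costs, smooth_weight=1.0):
    """Pick one candidate per frame minimizing unary + smoothness cost (Viterbi).

    candidates[t][i]  : 1-D position of candidate i in frame t (toy stand-in for a pose)
    unary_costs[t][i] : per-frame cost of that candidate (lower is better)
    Returns the list of chosen candidate indices, one per frame.
    """
    # cost[i]: best total cost of any path ending at candidate i of the current frame
    cost = list(unary_costs[0])
    backptr = []
    for t in range(1, len(candidates)):
        new_cost, ptr = [], []
        for j, pos_j in enumerate(candidates[t]):
            # Cheapest way to arrive at candidate j from any candidate in frame t-1.
            totals = [
                cost[i] + smooth_weight * abs(pos_i - pos_j)
                for i, pos_i in enumerate(candidates[t - 1])
            ]
            best_i = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[best_i] + unary_costs[t][j])
            ptr.append(best_i)
        cost = new_cost
        backptr.append(ptr)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(backptr):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path


# Toy video with 3 frames and M = 2 candidates per frame.
positions = [[0.0, 5.0], [5.1, 0.2], [0.1, 5.0]]
costs = [[0.0, 1.0], [0.0, 0.5], [0.0, 1.0]]

print(smooth_trajectory(positions, costs, smooth_weight=0.0))  # [0, 0, 0]: per-frame best, jumps 0.0 -> 5.1 -> 0.1
print(smooth_trajectory(positions, costs, smooth_weight=1.0))  # [0, 1, 0]: smooth path 0.0 -> 0.2 -> 0.1
```

With the smoothness term switched off, the result is the per-frame best candidate, which jumps between frames. With it switched on, the second-best candidate in the middle frame is chosen because it yields a much smoother trajectory.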

Pose Tracking
MAP / 1-Best vs. DivMBest + Viterbi
Here, on the left, I am showing you the MAP pose on each frame. We can see that it is quite noisy and jumps around, while the DivMBest solution is smooth.
[Batra, Yadollahpour, Guzman-Rivera, Shakhnarovich, ECCV12]

Statistics 101
Loss: Hamming, Jaccard Index, …
“True” Distribution
Expected Loss: Min Bayes Risk
Fit model (CRF, etc) to mimic the “true” distribution
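As a rough illustration of the minimum Bayes risk idea, the sketch below picks, from a set of M hypotheses, the one with the lowest expected loss under a distribution restricted to those hypotheses. The softmax over model scores, the `temperature` parameter, and the toy labelings are assumptions made for this example, not details taken from the slides.

```python
import math


def hamming_loss(a, b):
    """Fraction of positions where two equal-length labelings disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mbr_predict(hypotheses, scores, loss=hamming_loss, temperature=1.0):
    """Return (index of min-expected-loss hypothesis, list of expected losses).

    Assumption: probabilities over the M hypotheses are a softmax of the model scores.
    """
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Expected loss of predicting h when the truth is distributed as probs.
    risks = [
        sum(p * loss(h, other) for p, other in zip(probs, hypotheses))
        for h in hypotheses
    ]
    best = min(range(len(hypotheses)), key=risks.__getitem__)
    return best, risks


# Three toy labelings; the first has the highest score (the MAP choice).
hyps = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
scores = [math.log(0.4), math.log(0.3), math.log(0.3)]

best, risks = mbr_predict(hyps, scores)
print(best)  # 2: not the MAP hypothesis, but the one closest to the others on average
```

In this toy case the highest-scoring hypothesis has expected loss 0.525, while the third hypothesis has 0.375 because it is close to the second one, so the minimum-risk prediction differs from the MAP prediction.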

Pose Estimation
[Figure: results plotted against #Solutions / Frame; arrow label: Better]
State of art 2012: [Yang & Ramanan PAMI12]
MBR [PTB, CVPR14]: ~7% gain. Same features, same model, no new information!
DivMBest (Oracle): 22% gain possible
[Premachandran, Tarlow, Batra, CVPR14]

Re-ranking Diverse Segmentations
[Yadollahpour, Batra, Shakhnarovich, CVPR13]

PASCAL Sentence Dataset
“A dog is standing next to a woman on a couch”
Ambiguity: (dog next to woman) on couch vs. dog next to (woman on couch)
Vision: Semantic Segmentation. Labels: Chairs, desks, etc.
NLP: Sentence Parsing. Output: Parse Tree
[Figure labels: Hypothesis #1 … Hypothesis #M; Couch, Person, Dog; Consistent: Person, Couch]
