1
Diverse Beam Search Ashwin Kalyan
Explain the objective we minimize in sequence modeling. Then describe the three problems with current RNN-based modeling:
1) Loss – evaluation mismatch (cite paper from Sasha Rush’s lab).
2) Train – test mismatch: the model is not exposed to its own predictions (cite nips 16 curriculum paper that tries to solve this; maybe not necessary to relate to DAGGER, etc.).
3) Broken inference: it is NP-hard to do exact inference (show V^T complexity), so approximate inference methods like beam search are used (explain beam search right here).
Take-away: the modeling is not precise and, on top of that, inference is approximate. So we can’t say that the “most-likely” caption under the model is of high quality. Further, broken inference results in sequences with only “minor” changes. Give an example of broken beam search.
In this work, we don’t fix the modeling; instead, we fix the inference procedure to have more diversity. The hope is to decode lists of sentences that differ from each other significantly (“diversity”).
2
Sequence Modeling – RNN Recap
Task: Model the sequence $y = (y_1, \dots, y_T)$.
[Figure: an RNN unrolled over time steps]
3
Sequence Modeling – RNN Recap
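Task: Model the sequence $y = (y_1, \dots, y_T)$.
Effectively, RNNs model the probability of the next token given the history, i.e. $p(y_t \mid y_1, \dots, y_{t-1})$.
The joint probability is then $p(y) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1})$.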
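The factorization above can be sketched in a few lines. This is a minimal illustration, not the model from the talk: the `BIGRAM` table is a made-up stand-in for an RNN that only looks at the previous token, and `next_token_probs` and `sequence_log_prob` are hypothetical names.

```python
import math

# Hypothetical toy "model": P(next token | previous token). A real RNN would
# condition on the full history through its hidden state.
BIGRAM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a":   {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def next_token_probs(history):
    """Stand-in for one RNN step: distribution over the next token."""
    return BIGRAM[history[-1]]

def sequence_log_prob(tokens):
    """Sum of per-step log-probabilities, i.e. the log of the joint probability."""
    history = ["<s>"]
    total = 0.0
    for tok in tokens:
        total += math.log(next_token_probs(history)[tok])
        history.append(tok)
    return total

log_p = sequence_log_prob(["the", "cat", "</s>"])
print(math.exp(log_p))  # product of the per-step probabilities 0.4 * 0.9 * 1.0
```

Working in log space turns the product into a sum, which is why the training objective and the search procedures later in the deck are stated in terms of log-likelihood.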
4
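Sequence Modeling
When the output is a sequence, we optimize (on the training set) the log-likelihood of the ground-truth sequence:
$\sum_{(x, y)} \sum_{t=1}^{T} \log p(y_t \mid y_1, \dots, y_{t-1}, x)$
where $x$ is the input (e.g. the image) and $y$ is the ground-truth sequence.
In captioning, the image feature can be added as the first “word”.
[Figure: RNN unrolled over the caption “This is a good boy”]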
5
Inference in Seq2Seq models
There are $|V|$ choices for each word, so the search space has $|V|^T$ sequences. Inferring the “most-likely” sequence under the model is NP-hard. A tractable alternative is to use greedy approximate methods that decode sequences word by word, in a left-to-right manner.
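To make the size of the search space and the greedy alternative concrete, here is a minimal sketch. The vocabulary size, sequence length, and `BIGRAM` table are illustrative assumptions, and `greedy_decode` is a hypothetical helper, not code from the talk.

```python
import math

# Hypothetical toy "model" standing in for an RNN: P(next token | previous token).
BIGRAM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a":   {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy_decode(max_len=10):
    """Left-to-right decoding: keep only the single most likely next token."""
    seq, prev, log_p = [], "<s>", 0.0
    for _ in range(max_len):
        probs = BIGRAM[prev]
        tok = max(probs, key=probs.get)
        log_p += math.log(probs[tok])
        seq.append(tok)
        if tok == "</s>":
            break
        prev = tok
    return seq, math.exp(log_p)

# Size of the search space for an illustrative |V| = 10,000 and T = 20.
V, T = 10_000, 20
print(f"|V|^T has {len(str(V ** T))} digits")

seq, p = greedy_decode()
print(seq, round(p, 2))
```

In this toy table greedy decoding returns “a cat” with probability 0.33, although “the cat” has probability 0.36: committing to the locally best first word can miss the most likely sequence.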
6
Inference in Seq2Seq models
Now that we have a “trained” model, how do we generate sequences?
Method 1: Sampling
Does not necessarily generate good-quality sequences.
Initial words are crucial in deciding the sentence: sampling a sub-optimal continuation can hurt.
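Sampling from the model can be sketched as follows. This is a minimal illustration under assumptions: the `BIGRAM` table is a made-up stand-in for a trained RNN, and `sample_sequence` is a hypothetical helper.

```python
import random

# Hypothetical toy "model" standing in for an RNN: P(next token | previous token).
BIGRAM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a":   {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def sample_sequence(rng, max_len=10):
    """Draw each token from the model's next-token distribution."""
    seq, prev = [], "<s>"
    for _ in range(max_len):
        probs = BIGRAM[prev]
        tok = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        seq.append(tok)
        if tok == "</s>":
            break
        prev = tok
    return seq

rng = random.Random(0)  # fixed seed only so that runs are repeatable
for _ in range(3):
    print(" ".join(sample_sequence(rng)))
```

Because every step is a random draw, a low-probability first word (here “the ... dog”) can be sampled and then constrains everything that follows.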
7
Inference in Seq2Seq models
Method 2: Beam Search
Instead of sampling, select the top-B words at each time step.
[Figure: search tree with first-word candidates “This” and “A”]
8
Inference in Seq2Seq models
Method 2: Beam Search
Instead of sampling, select the top-B words at each time step.
[Figure: search tree in which “This” and “A” are expanded with the words “is”, “picture”, “man”, and “person”]
9
Inference in Seq2Seq models
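Method 2: Beam Search
Instead of sampling, select the top-B words at each time step.
The number of sequences grows as $B^T$!
Solution: retain only the top-B sequences.
In other words, Beam Search = truncated BFS.
[Figure: search tree in which “This” and “A” are expanded with the words “is”, “picture”, “man”, and “person”]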
10
Inference in Seq2Seq models
Method 2: Beam Search
Instead of sampling, select the top-B words at each time step.
The top-2 sequences are selected.
[Figure: search tree in which “This” and “A” are expanded with the words “is”, “picture”, “man”, and “person”]
11
Inference in Seq2Seq models
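Method 2: Beam Search
Instead of sampling, select the top-B words at each time step.
The remaining continuations are discarded.
[Figure: search tree in which “This” and “A” are expanded with the words “is”, “picture”, “man”, and “person”]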
12
Inference in Seq2Seq models
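Method 2: Beam Search
Instead of sampling, select the top-B words at each time step.
Decoding continues until the end token is generated or the maximum time is reached.
[Figure: search tree starting at “This” and continuing with the words “is”, “shows”, “picture”, “a”, and “the”]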
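The procedure walked through on the preceding slides (expand every retained sequence, keep the top-B, stop at the end token or a maximum length) can be sketched as below. This is a minimal illustration: the `BIGRAM` table is a made-up stand-in for the RNN, and `beam_search` is a hypothetical helper rather than the implementation used in the talk.

```python
import math

# Hypothetical toy "model" standing in for an RNN: P(next token | previous token).
BIGRAM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a":   {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_size, max_len=10):
    """Truncated breadth-first search: keep the top-B partial sequences per step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every retained sequence by every possible next token.
        candidates = []
        for log_p, seq in beams:
            for tok, p in BIGRAM[seq[-1]].items():
                candidates.append((log_p + math.log(p), seq + [tok]))
        # Retain only the top-B candidates; the rest are discarded.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_p, seq in candidates[:beam_size]:
            if seq[-1] == "</s>":
                finished.append((log_p, seq))
            else:
                beams.append((log_p, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return [(seq[1:], math.exp(log_p)) for log_p, seq in finished[:beam_size]]

for seq, p in beam_search(beam_size=2):
    print(" ".join(seq), round(p, 2))
```

With `beam_size=1` this reduces to greedy decoding and returns “a cat”; with `beam_size=2` the more likely “the cat” survives the first step and ends up ranked first.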
13
Problems with Decoding
Beam Search tends to produce nearly identical sequences that typically differ only in their endings.
Beam Search outputs (B=4):
A kitchen with a stove.
A kitchen with a stove and a sink.
A kitchen with a stove and a microwave.
A kitchen with a stove and a refrigerator.
14
Problems with Decoding
Beam Search tends to produce nearly identical sequences that typically differ only in their endings.
Beam Search outputs (B=4):
A woman and a child sitting at a table with food.
A woman and a child sitting at a table with a plate of food.
A woman and a child sitting at a table eating food.
A little girl sitting at a table with a plate of food.
15
Problems with Decoding
Beam Search tends to produce nearly identical sequences that typically differ only in their endings.
This is inadequate when there are multiple “correct” sequences. For example, the same image can be described in many ways: you can talk about different objects in the image, different perspectives, etc. Likewise, a sentence can have multiple correct translations.
It is also computationally wasteful, since the same set of inputs is forwarded through the RNN repeatedly.
16
Inference in Seq2Seq models
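Method 3: Diverse Beam Search
Select top-B words that result in “different” sequences.
[Equation: selection score with its two terms labeled “log-likelihood” and “diversity term”]
[Figure: search tree with first-word candidates “This” and “Person”]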
17
Inference in Seq2Seq models
Method 3: Diverse Beam Search
Select top-B words that result in “different” sequences.
[Figure: partial sequences “This is” and “Person in”]
18
Inference in Seq2Seq models
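Method 3: Diverse Beam Search
Select top-B words that result in “different” sequences.
‘a’ cannot be selected again if the diversity function ∆ is, say, Hamming distance.
[Figure: partial sequences “This is a” and “Person in”, with ‘a’ as a candidate next word for “Person in”]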
19
Inference in Seq2Seq models
Method 3: Diverse Beam Search
Select top-B words that result in “different” sequences.
‘a’ cannot be selected again if the diversity function ∆ is, say, Hamming distance.
[Figure: partial sequences “This is a” and “Person in striped”]
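One way to read the slides above is that each beam's candidate words are scored by log-likelihood minus a penalty for words that earlier beams already picked at the same time step (a Hamming-style ∆). The sketch below makes that reading concrete under loud assumptions: one beam per group, a hand-picked weight `lam`, a made-up `BIGRAM` table in place of the RNN, and a hypothetical function name. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical toy "model" standing in for an RNN: P(next token | previous token).
BIGRAM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a":   {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def diverse_beam_search(num_groups, lam, max_len=10):
    """Each group holds one beam (an assumption to keep the sketch short)."""
    groups = [(0.0, ["<s>"]) for _ in range(num_groups)]
    for _ in range(max_len):
        used = Counter()  # words chosen at this time step by earlier groups
        for g, (log_p, seq) in enumerate(groups):
            if seq[-1] == "</s>":
                continue
            best = None
            for tok, p in BIGRAM[seq[-1]].items():
                new_log_p = log_p + math.log(p)
                # Log-likelihood plus diversity term: subtract lam for every
                # earlier group that already picked this word at this step.
                score = new_log_p - lam * used[tok]
                if best is None or score > best[0]:
                    best = (score, new_log_p, tok)
            _, new_log_p, tok = best
            groups[g] = (new_log_p, seq + [tok])
            used[tok] += 1
        if all(seq[-1] == "</s>" for _, seq in groups):
            break
    return [(seq[1:], math.exp(log_p)) for log_p, seq in groups]

for lam in (0.0, 1.0):
    print(lam, [" ".join(seq) for seq, _ in diverse_beam_search(2, lam)])
```

With `lam=0.0` both groups decode the same greedy sequence “a cat”; with `lam=1.0` the second group is pushed away from “a” at the first step and decodes “the cat” instead. The penalty only changes which word is selected; each sequence is still reported with its unpenalized model probability.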
20
Diverse Beam Search outputs (B=4)
Modifies the inference procedure to produce “diverse” lists.
A kitchen with a stove and a microwave.
A kitchen with a stove and a sink.
A kitchen with a sink and a refrigerator.
The kitchen is clean and ready to be used.
21
Diverse Beam Search outputs (B=4)
Modifies the inference procedure to produce “diverse” lists.
A woman and a child are eating a meal.
A woman and a child sitting at a table with a plate of food.
A young girl is eating a piece of cake.
Two girls are sitting at a table with a cake.
22
Diverse Beam Search
Modifies the inference procedure to produce “diverse” lists.
Tries to capture the inherently multi-modal nature of the task.
Quantitatively finds better (more human-like) captions.
Diversity comes (almost) for free: it requires about the same memory and computation.
23
General Problems in Sequence Modeling
Loss – Evaluation Mismatch: We care about producing “human-like” captions but optimize a surrogate loss, i.e. the log-likelihood of the ground-truth caption given the image.
Train – Test Mismatch: At train time, the model is not exposed to its own outputs. But at test time, the input is sampled (or selected) from its own previous predictions.
Broken Inference: Inference is approximate (left-to-right, greedy), and training is not “aware” of the inference procedure.
24
Work with Michael Cogswell, Ramprasath Selvaraju, Qing Sun, David Crandall, Stefan Lee, Dhruv Batra
paper: code:
25
Questions ?
26
Interesting Related Papers
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning
Sequence-to-Sequence Learning as Beam-Search Optimization
Learning to Decode for Future Success
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks