A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg Presented by: Santhosh Kumar Ramakrishnan and Taylor Kessler Faulkner
Referring Expressions vs. Captions
A referring expression is discriminative: it must distinguish its target from the other objects in the image, whereas a caption describes the image as a whole. Caption: “Man holding a wine cup.” Referring expression: “Man on right.” (Image from [2])
Dual Tasks: 1. Generation, 2. Comprehension
Generation produces a referring expression for a given object (e.g., “Man in the yellow hat”); comprehension takes an expression (e.g., “Woman on the left”) and localizes the object it refers to. (Images from [2])
Framework
The proposed framework consists of three parts: a Speaker, a Listener, and a Reinforcer. (Figure from [1])
Framework - Image Features
[Figure from [1]: candidate regions I1…I4 each yield a visual/location feature pair (F1, x1)…(F4, x4); these are concatenated into the target representation (F, x), shown as F* in the diagram.]
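As an illustration of the concatenation step above, here is a minimal PyTorch-style sketch (our own, not the authors' code; the function name, shapes, and the exact composition of the location features are assumptions):

    import torch

    def region_representation(region_feats, region_locs, target_idx):
        # region_feats: (N, d_v) CNN features F_1..F_N for N candidate regions
        # region_locs:  (N, 5)   location features x_1..x_N, e.g. normalized
        #                        box coordinates plus relative size (assumed)
        F = region_feats[target_idx]     # visual features F_i of the target
        x = region_locs[target_idx]      # location features x_i of the target
        return torch.cat([F, x], dim=0)  # joint representation (F, x)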
Framework - Speaker
The speaker takes in the image features and generates referring expressions. Three cost terms are associated with it (a sketch follows below):
- Generation loss: penalizes the negative log-likelihood of the ground-truth expression.
- MMI loss: the current expression should explain the current object better than other objects.
- Ranking loss on expressions: the current object should be explained best by its associated expression.
(Figure from [1])
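The three terms can be written as a cross-entropy loss plus two max-margin comparisons. A hedged PyTorch sketch (the margin, the weights, and all variable names are our assumptions, not the paper's reported values):

    import torch.nn.functional as F

    def speaker_loss(logp_pos, logp_wrong_obj, logp_wrong_expr,
                     margin=0.1, lam1=1.0, lam2=1.0):
        # logp_pos:        log p(r_i | o_i), expression scored on its own object
        # logp_wrong_obj:  log p(r_i | o_j), same expression on another object
        # logp_wrong_expr: log p(r_j | o_i), another expression on this object
        gen  = -logp_pos                                    # generation loss (NLL)
        mmi  = F.relu(margin + logp_wrong_obj - logp_pos)   # r_i should fit o_i better than o_j
        rank = F.relu(margin + logp_wrong_expr - logp_pos)  # o_i should be best explained by r_i
        return (gen + lam1 * mmi + lam2 * rank).mean()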
Framework - Listener
The listener takes in a referring expression and a bounding box and measures their similarity in a joint embedding space. Two loss terms are associated with it (sketched below):
- The current expression should be closer to the current object than to other objects.
- The current object should be closer to the current expression than to other expressions.
(Figure from [1])
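A sketch of this bidirectional hinge loss in the joint embedding (using cosine similarity and a small margin is our assumption):

    import torch.nn.functional as F

    def listener_loss(expr_emb, obj_emb, other_obj_emb, other_expr_emb,
                      margin=0.1):
        # all inputs: (B, d) embeddings in the joint space
        s_pos  = F.cosine_similarity(expr_emb, obj_emb)        # S(r_i, o_i)
        s_obj  = F.cosine_similarity(expr_emb, other_obj_emb)  # S(r_i, o_j)
        s_expr = F.cosine_similarity(other_expr_emb, obj_emb)  # S(r_j, o_i)
        return (F.relu(margin + s_obj  - s_pos)    # expression prefers its object
              + F.relu(margin + s_expr - s_pos)).mean()  # object prefers its expression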
Framework - Reinforcer
The reinforcer rewards the speaker for generating correct expressions. (Figure from [1])
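Since sampling an expression is non-differentiable, the reward is typically propagated to the speaker with a policy gradient. A REINFORCE-style sketch (the mean baseline is our assumption; inputs are torch tensors):

    def reinforce_loss(sample_logprobs, rewards):
        # sample_logprobs: (B,) summed log-probs of each sampled expression
        # rewards:         (B,) reinforcer's match score for each sample
        advantage = rewards - rewards.mean()   # simple mean baseline (assumed)
        # maximize expected reward == minimize -E[advantage * log p(sample)]
        return -(advantage.detach() * sample_logprobs).mean()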
Framework - Overall Loss
The overall loss is the sum of the individual losses. (Figure from [1])
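A one-line sketch of the combination (the slide states a plain sum; equal weights are our assumption):

    def total_loss(l_speaker, l_listener, l_reinforce):
        # overall objective: sum of the three modules' losses
        return l_speaker + l_listener + l_reinforce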
Experiment 1
Do the assigned probabilities make sense in ambiguous cases? If not, is there a detectable bias? We probe the model with deliberately ambiguous queries such as “The person pointing left?”, “The person holding a skateboard?”, and “The person looking right?”. Overall, the model handles ambiguous descriptions well, but some results (e.g., the three men in the baseball image) suggest a center bias. (Images from [2])
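One way to read off such probabilities is to normalize the listener's per-box scores over all candidate boxes; for a genuinely ambiguous expression a near-uniform distribution is expected. A sketch (softmax normalization is our assumption about how the numbers on the following slides were produced):

    import torch

    def box_probabilities(scores):
        # scores: (N,) comprehension scores, one per candidate box
        return torch.softmax(scores, dim=0)

    # e.g. for "The person pointing left?" with several plausible people,
    # near-equal probabilities (as on the following slides) signal ambiguity.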
Experiment 1: Many ambiguous cases worked well. Example: “Person pointing left” gives the two plausible candidates nearly equal probabilities (0.074 and 0.071). (Image from [2])
Experiment 1: “Person holding skateboard” gives nearly equal probabilities to the plausible candidates (0.088 and 0.081). (Image from [2])
Experiment 1: “Kid looking right” gives 0.305 and 0.304. (Image from [2])
Experiment 1: “Kid wearing helmet” gives 0.305 and 0.304. (Image from [2])
Experiment 1: However, some results suggest a possible center bias. “Man holding bat” assigns 0.165, 0.105, 0.115, and 0.154 across the candidates. (Image from [2])
Experiment 1: However, some results suggest a possible center bias. “Right man” assigns 0.221, 0.118, 0.136, and 0.145. (Image from [2])
Experiment 1: However, some results suggest a possible center bias. “Person wearing glasses”, with a probability of 0.105 on one candidate. (Image from [2])
Experiment 1: Some words may also provide stronger cues than others. “Woman in blue”, with a probability of 0.146 on one candidate. (Image from [2])
Experiment 1: Some words may also provide stronger cues than others. “Baby in blue”, with probabilities 0.146 and 0.137. (Image from [2])
Experiment 2: How do the generation results look without MMI?
The paper shows qualitative sentence results for the MMI model but not for the model without MMI, so we generate sentences ourselves from a model trained without the MMI loss. We wanted to look at qualitative results for both generation and comprehension; such results seemed useful, and we noted that some were not shown. (The logp/ppl figures on the following slides can be read as sketched below.)
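The (logp, ppl) pairs on the following slides are consistent with the summed token log-probability and the per-token perplexity, counting the end-of-sequence token. A small sketch:

    import math

    def sentence_stats(token_logprobs):
        # token_logprobs: per-token log-probs of the generated expression,
        # including the end-of-sequence token (assumed from the slide numbers)
        logp = sum(token_logprobs)
        ppl = math.exp(-logp / len(token_logprobs))
        return logp, ppl

    # e.g. "blue shirt" scored over 3 tokens (2 words + end token) with
    # logp = -3.56 gives ppl = exp(3.56 / 3) ≈ 3.28, matching the slides.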
Experiment 2: Sometimes MMI was more descriptive.
No MMI: blue shirt (logp=-3.56, ppl=3.28); red shirt (logp=-3.61, ppl=3.33); guy in blue shirt (logp=-6.11, ppl=3.39)
MMI: guy in blue shirt in back (logp=-7.20, ppl=2.80); man in blue shirt in back (logp=-7.57, ppl=2.95); blue shirt in back (logp=-5.42, ppl=2.96)
The MMI model’s expressions tend to be more discriminative than the non-MMI model’s. Here the expressions are not exactly accurate, but they are more discriminative. (Image from [2])
Experiment 2:
No MMI: mom (logp=-2.00, ppl=2.72); woman (logp=-2.21, ppl=3.02); woman on right (logp=-4.53, ppl=3.10)
MMI: blue shirt (logp=-2.38, ppl=2.21); girl in blue shirt (logp=-5.20, ppl=2.83); green shirt (logp=-3.15, ppl=2.86)
Again, the MMI expressions are more discriminative. (Image from [2])
Experiment 2:
No MMI: second guy from right (logp=-4.89, ppl=2.66); man in white shirt (logp=-4.95, ppl=2.69); guy in white shirt (logp=-4.96, ppl=2.70)
MMI: man in white shirt (logp=-4.52, ppl=2.47); guy in white shirt (logp=-4.59, ppl=2.51); man in white (logp=-4.08, ppl=2.77)
(Image from [2])
Experiment 2: Sometimes neither the MMI model nor the regular model was descriptive.
No MMI: boy on left (logp=-4.40, ppl=3.01); boy in blue shirt (logp=-5.57, ppl=3.05); boy on left in blue shirt (logp=-8.37, ppl=3.31)
MMI: boy in blue shirt (logp=-4.66, ppl=2.54); boy on left (logp=-4.14, ppl=2.82); boy in striped shirt (logp=-5.35, ppl=2.91)
In most of the examples we observed, MMI was comparable to or better than the non-MMI model. However, there were failure cases for both models. (Image from [2])

References
[1] Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg. A Joint Speaker-Listener-Reinforcer Model for Referring Expressions. CVPR 2017.
[2] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.