A Joint Speaker-Listener-Reinforcer Model for Referring Expressions Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg Presented by: Santhosh Kumar Ramakrishnan and Taylor Kessler Faulkner
Referring Expressions vs. Captions. A referring expression must be discriminative: unlike a caption, which describes the image, it must single out one particular object among the others. Caption: “Man holding a wine cup”. Referring expression: “Man on right”. From [2]
Dual Tasks: 1. Generation: produce a referring expression for a given object (e.g., “Man in the yellow hat”). 2. Comprehension: given an expression (e.g., “Woman on the left”), locate the referred object. From [2]
Framework. The proposed framework consists of three parts: Speaker, Listener, and Reinforcer. From [1]
Framework - Image Features. [Figure: each object region Ii yields a visual feature Fi and a location feature xi; the per-region pairs (F1, x1) ... (F4, x4) are concatenated to form the feature representation (F, x).]
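A minimal sketch of this feature construction in PyTorch (the region count and feature sizes are our assumptions, not the authors' code):

import torch

# Hypothetical sizes: 4 candidate regions, 1024-d visual features, 5-d location features.
F_vis = [torch.randn(1024) for _ in range(4)]  # visual features F1..F4
x_loc = [torch.randn(5) for _ in range(4)]     # location features x1..x4

# Each region Ii is represented by the concatenated pair (Fi, xi);
# stacking the pairs gives one feature matrix for the whole image.
region_feats = torch.stack([torch.cat([f, x]) for f, x in zip(F_vis, x_loc)])
print(region_feats.shape)  # torch.Size([4, 1029])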
Framework - Speaker. Takes in the image features and generates referring expressions. Three cost terms are associated (see the sketch below):
- Generation loss: penalizes the negative log-likelihood of the ground-truth expression
- MMI loss: the current expression should explain the current object better than other objects
- Ranking loss: the current object should be explained best by its associated expression
From [1]
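The three speaker terms can be written as hinge losses over log-probabilities. A hedged sketch (the function and variable names are ours, and the margin is a placeholder, not the paper's value):

import torch
import torch.nn.functional as F

def speaker_losses(logp_pos, logp_wrong_obj, logp_wrong_expr, margin=1.0):
    # logp_pos:        log p(expression_i | object_i), the matched pair
    # logp_wrong_obj:  log p(expression_i | object_k), same expression, other object
    # logp_wrong_expr: log p(expression_j | object_i), other expression, same object
    gen_loss  = -logp_pos.mean()                                    # generation (NLL) loss
    mmi_loss  = F.relu(margin + logp_wrong_obj - logp_pos).mean()   # expression prefers its object
    rank_loss = F.relu(margin + logp_wrong_expr - logp_pos).mean()  # object prefers its expression
    return gen_loss, mmi_loss, rank_loss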
Framework - Listener. The listener takes in a referring expression and a bounding box, and computes their similarity in a joint embedding space. Two loss terms are associated (see the sketch below):
- the current expression should be closer to the current object than to other objects
- the current object should be closer to the current expression than to other expressions
From [1]
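The two listener terms are triplet-style losses in the joint embedding space. A sketch under our assumptions (the use of cosine similarity and the margin value are ours):

import torch
import torch.nn.functional as F

def listener_losses(expr_emb, obj_emb, other_obj_emb, other_expr_emb, margin=0.1):
    # All inputs are (batch, dim) embeddings in the joint space.
    pos = F.cosine_similarity(expr_emb, obj_emb)
    # The expression should be closer to its object than to other objects.
    l1 = F.relu(margin + F.cosine_similarity(expr_emb, other_obj_emb) - pos).mean()
    # The object should be closer to its expression than to other expressions.
    l2 = F.relu(margin + F.cosine_similarity(other_expr_emb, obj_emb) - pos).mean()
    return l1 + l2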
Framework - Reinforcer. The reinforcer rewards the speaker for generating the correct expression (see the sketch below). From [1]
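The reward enters training through a REINFORCE-style policy gradient. A minimal sketch (shapes and names are assumed):

import torch

def reinforcer_loss(word_logps, reward):
    # word_logps: (T,) log-probabilities of the sampled words of one expression.
    # reward: scalar score from the reinforcer, high if the expression refers correctly.
    # Scaling the log-probabilities by the reward raises the likelihood of
    # sampled expressions that earn high reward.
    return -(reward * word_logps).sum()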
Framework. The overall loss is the sum of the individual module losses (see the sketch below). From [1]
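Putting the modules together, a sketch of the joint objective (the weights are hyperparameters; the defaults here are placeholders, not the paper's values):

def total_loss(speaker_terms, listener_term, reinforcer_term,
               w_speaker=1.0, w_listener=1.0, w_reinforcer=1.0):
    # Weighted sum of the module losses; one backward pass trains all modules jointly.
    return (w_speaker * sum(speaker_terms)
            + w_listener * listener_term
            + w_reinforcer * reinforcer_term)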
Experiment 1. Do the assigned probabilities make sense in ambiguous cases? If not, is there a detectable bias? Test expressions: “The person pointing left?”, “The person holding a skateboard?”, “The person looking right?”. Observations: works well with ambiguous descriptions, but a possible center bias (e.g., the three men, baseball image). From [2]
Experiment 1. Many ambiguous cases worked well. Ex: “Person pointing left” (probabilities 0.074 and 0.071). From [2]
Experiment 1. “Person holding skateboard” (probabilities 0.088 and 0.081). From [2]
Experiment 1. “Kid looking right” (probabilities 0.305 and 0.304). From [2]
Experiment 1. “Kid wearing helmet” (probabilities 0.305 and 0.304). From [2]
Experiment 1. However, some cases suggest a possible center bias. Ex: “Man holding bat” (probabilities 0.165, 0.105, 0.115, 0.154). From [2]
Experiment 1. However, some cases suggest a possible center bias. Ex: “Right man” (probabilities 0.221, 0.118, 0.136, 0.145). From [2]
Experiment 1. However, some cases suggest a possible center bias. Ex: “Person wearing glasses” (probability 0.105). From [2]
Experiment 1. Some words may also provide stronger cues than others. Ex: “Woman in blue” (probability 0.146). From [2]
Experiment 1. Some words may also provide stronger cues than others. Ex: “Baby in blue” (probabilities 0.146 and 0.137). From [2]
Experiment 2. How do the generation results look without MMI? Qualitative sentence results are shown for MMI, but not without it, so we try generating sentences without MMI. We wanted to look at results for both Generation and Comprehension; qualitative results seemed useful, but we noted that some were not shown.
Experiment 2. Sometimes MMI was more descriptive.
No MMI: blue shirt (logp=-3.56, ppl=3.28); red shirt (logp=-3.61, ppl=3.33); guy in blue shirt (logp=-6.11, ppl=3.39)
MMI: guy in blue shirt in back (logp=-7.20, ppl=2.80); man in blue shirt in back (logp=-7.57, ppl=2.95); blue shirt in back (logp=-5.42, ppl=2.96)
MMI model expressions tend to be more discriminative than those of the non-MMI model. Here, the expressions are not exactly accurate, but they are more discriminative. From [2]
Experiment 2.
No MMI: mom (logp=-2.00, ppl=2.72); woman (logp=-2.21, ppl=3.02); woman on right (logp=-4.53, ppl=3.10)
MMI: blue shirt (logp=-2.38, ppl=2.21); girl in blue shirt (logp=-5.20, ppl=2.83); green shirt (logp=-3.15, ppl=2.86)
Again, the MMI expressions are more discriminative. From [2]
Experiment 2.
No MMI: second guy from right (logp=-4.89, ppl=2.66); man in white shirt (logp=-4.95, ppl=2.69); guy in white shirt (logp=-4.96, ppl=2.70)
MMI: man in white shirt (logp=-4.52, ppl=2.47); guy in white shirt (logp=-4.59, ppl=2.51); man in white (logp=-4.08, ppl=2.77)
From [2]
References
[1] Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg. A Joint Speaker-Listener-Reinforcer Model for Referring Expressions. CVPR 2017.
[2] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.
Experiment 2. Sometimes both the MMI and non-MMI models were not descriptive.
No MMI: boy on left (logp=-4.40, ppl=3.01); boy in blue shirt (logp=-5.57, ppl=3.05); boy on left in blue shirt (logp=-8.37, ppl=3.31)
MMI: boy in blue shirt (logp=-4.66, ppl=2.54); boy on left (logp=-4.14, ppl=2.82); boy in striped shirt (logp=-5.35, ppl=2.91)
In most of the examples we observed, MMI was comparable to or better than the non-MMI model; however, there were failure cases for both models. From [2]