A Joint Speaker-Listener-Reinforcer Model for Referring Expressions Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg Presented by: Santhosh Kumar Ramakrishnan and Taylor Kessler Faulkner
Referring Expressions vs. Captions. A referring expression must be discriminative: unlike a caption, which describes the image, it must single out one particular object among the others. Caption: “Man holding a wine cup”. Referring expression: “Man on right”. From [2]
Dual Tasks: 1. Generation: produce a referring expression for a given object (e.g., “Man in the yellow hat”). 2. Comprehension: given an expression (e.g., “Woman on the left”), locate the referred object. From [2]
Framework. The proposed framework consists of three parts: Speaker, Listener, and Reinforcer. From [1]
Framework - Image Features. [Figure: each object region Ii yields a visual feature Fi and a location feature xi; the per-region pairs (F1, x1) ... (F4, x4) are concatenated to form the feature representation (F, x).]
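A minimal sketch of this feature construction in PyTorch (the region count and feature sizes are our assumptions, not the authors' code):

import torch

# Hypothetical sizes: 4 candidate regions, 1024-d visual features, 5-d location features.
F_vis = [torch.randn(1024) for _ in range(4)]  # visual features F1..F4
x_loc = [torch.randn(5) for _ in range(4)]     # location features x1..x4

# Each region Ii is represented by the concatenated pair (Fi, xi);
# stacking the pairs gives one feature matrix for the whole image.
region_feats = torch.stack([torch.cat([f, x]) for f, x in zip(F_vis, x_loc)])
print(region_feats.shape)  # torch.Size([4, 1029])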
Framework - Speaker. Takes in the image features and generates referring expressions. Three cost terms are associated (see the sketch below):
- Generation loss: penalizes the negative log-likelihood of the ground-truth expression
- MMI loss: the current expression should explain the current object better than other objects
- Ranking loss: the current object should be explained best by its associated expression
From [1]
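The three speaker terms can be written as hinge losses over log-probabilities. A hedged sketch (the function and variable names are ours, and the margin is a placeholder, not the paper's value):

import torch
import torch.nn.functional as F

def speaker_losses(logp_pos, logp_wrong_obj, logp_wrong_expr, margin=1.0):
    # logp_pos:        log p(expression_i | object_i), the matched pair
    # logp_wrong_obj:  log p(expression_i | object_k), same expression, other object
    # logp_wrong_expr: log p(expression_j | object_i), other expression, same object
    gen_loss  = -logp_pos.mean()                                    # generation (NLL) loss
    mmi_loss  = F.relu(margin + logp_wrong_obj - logp_pos).mean()   # expression prefers its object
    rank_loss = F.relu(margin + logp_wrong_expr - logp_pos).mean()  # object prefers its expression
    return gen_loss, mmi_loss, rank_loss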
Framework - Listener. The listener takes in a referring expression and a bounding box, and computes their similarity in a joint embedding space. Two loss terms are associated (see the sketch below):
- the current expression should be closer to the current object than to other objects
- the current object should be closer to the current expression than to other expressions
From [1]
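The two listener terms are triplet-style losses in the joint embedding space. A sketch under our assumptions (the use of cosine similarity and the margin value are ours):

import torch
import torch.nn.functional as F

def listener_losses(expr_emb, obj_emb, other_obj_emb, other_expr_emb, margin=0.1):
    # All inputs are (batch, dim) embeddings in the joint space.
    pos = F.cosine_similarity(expr_emb, obj_emb)
    # The expression should be closer to its object than to other objects.
    l1 = F.relu(margin + F.cosine_similarity(expr_emb, other_obj_emb) - pos).mean()
    # The object should be closer to its expression than to other expressions.
    l2 = F.relu(margin + F.cosine_similarity(other_expr_emb, obj_emb) - pos).mean()
    return l1 + l2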
Framework - Reinforcer. The reinforcer rewards the speaker for generating the correct expression (see the sketch below). From [1]
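The reward enters training through a REINFORCE-style policy gradient. A minimal sketch (shapes and names are assumed):

import torch

def reinforcer_loss(word_logps, reward):
    # word_logps: (T,) log-probabilities of the sampled words of one expression.
    # reward: scalar score from the reinforcer, high if the expression refers correctly.
    # Scaling the log-probabilities by the reward raises the likelihood of
    # sampled expressions that earn high reward.
    return -(reward * word_logps).sum()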
Framework. The overall loss is the sum of the individual module losses (see the sketch below). From [1]
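Putting the modules together, a sketch of the joint objective (the weights are hyperparameters; the defaults here are placeholders, not the paper's values):

def total_loss(speaker_terms, listener_term, reinforcer_term,
               w_speaker=1.0, w_listener=1.0, w_reinforcer=1.0):
    # Weighted sum of the module losses; one backward pass trains all modules jointly.
    return (w_speaker * sum(speaker_terms)
            + w_listener * listener_term
            + w_reinforcer * reinforcer_term)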
Experiment 1. Do the assigned probabilities make sense in ambiguous cases? If not, is there a detectable bias? Test expressions: “The person pointing left?”, “The person holding a skateboard?”, “The person looking right?”. Observations: works well with ambiguous descriptions, but a possible center bias (e.g., the three men, baseball image). From [2]
Experiment 1. Many ambiguous cases worked well. Ex: “Person pointing left” (probabilities 0.074 and 0.071). From [2]
Experiment 1. “Person holding skateboard” (probabilities 0.088 and 0.081). From [2]
Experiment 1. “Kid looking right” (probabilities 0.305 and 0.304). From [2]
Experiment 1. “Kid wearing helmet” (probabilities 0.305 and 0.304). From [2]
Experiment 1. However, some cases suggest a possible center bias. Ex: “Man holding bat” (probabilities 0.165, 0.105, 0.115, 0.154). From [2]
Experiment 1. However, some cases suggest a possible center bias. Ex: “Right man” (probabilities 0.221, 0.118, 0.136, 0.145). From [2]
Experiment 1. However, some cases suggest a possible center bias. Ex: “Person wearing glasses” (probability 0.105). From [2]
Experiment 1. Some words may also provide stronger cues than others. Ex: “Woman in blue” (probability 0.146). From [2]
Experiment 1. Some words may also provide stronger cues than others. Ex: “Baby in blue” (probabilities 0.146 and 0.137). From [2]
Experiment 2. How do the generation results look without MMI? Qualitative sentence results are shown for MMI, but not without it, so we try generating sentences without MMI. We wanted to look at results for both Generation and Comprehension; qualitative results seemed useful, but we noted that some were not shown.
Experiment 2. Sometimes MMI was more descriptive.
No MMI: blue shirt (logp=-3.56, ppl=3.28); red shirt (logp=-3.61, ppl=3.33); guy in blue shirt (logp=-6.11, ppl=3.39)
MMI: guy in blue shirt in back (logp=-7.20, ppl=2.80); man in blue shirt in back (logp=-7.57, ppl=2.95); blue shirt in back (logp=-5.42, ppl=2.96)
MMI model expressions tend to be more discriminative than those of the non-MMI model. Here, the expressions are not exactly accurate, but they are more discriminative. From [2]
Experiment 2.
No MMI: mom (logp=-2.00, ppl=2.72); woman (logp=-2.21, ppl=3.02); woman on right (logp=-4.53, ppl=3.10)
MMI: blue shirt (logp=-2.38, ppl=2.21); girl in blue shirt (logp=-5.20, ppl=2.83); green shirt (logp=-3.15, ppl=2.86)
Again, the MMI expressions are more discriminative. From [2]
Experiment 2.
No MMI: second guy from right (logp=-4.89, ppl=2.66); man in white shirt (logp=-4.95, ppl=2.69); guy in white shirt (logp=-4.96, ppl=2.70)
MMI: man in white shirt (logp=-4.52, ppl=2.47); guy in white shirt (logp=-4.59, ppl=2.51); man in white (logp=-4.08, ppl=2.77)
From [2]
References
[1] Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg. A Joint Speaker-Listener-Reinforcer Model for Referring Expressions. CVPR 2017.
[2] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.
Experiment 2. Sometimes both the MMI and non-MMI models were not descriptive.
No MMI: boy on left (logp=-4.40, ppl=3.01); boy in blue shirt (logp=-5.57, ppl=3.05); boy on left in blue shirt (logp=-8.37, ppl=3.31)
MMI: boy in blue shirt (logp=-4.66, ppl=2.54); boy on left (logp=-4.14, ppl=2.82); boy in striped shirt (logp=-5.35, ppl=2.91)
In most of the examples we observed, MMI was comparable to or better than the non-MMI model; however, there were failure cases for both models. From [2]