
1 CVPR 2019 Poster

2 Task
Grounding referring expressions is typically formulated as the task of identifying, from a set of proposals in an image, the proposal that the expression refers to (a minimal sketch of this ranking view follows below).
Existing work summarizes:
visual features of single objects
global visual contexts (CNN)
pairwise visual differences
object-pair contexts
global language contexts (LSTM)
language features of the decomposed phrases
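The ranking view of the task can be written down in a few lines. A minimal sketch, assuming hypothetical visual and expression feature extractors and a learned joint space; this illustrates the formulation, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalScorer(nn.Module):
    """Ranks region proposals against a referring expression by cosine
    similarity in a learned joint space (illustrative formulation only)."""
    def __init__(self, visual_dim=2048, text_dim=1024, joint_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(visual_dim, joint_dim)
        self.txt_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, proposal_feats, expr_feat):
        # proposal_feats: (N, visual_dim); expr_feat: (text_dim,)
        v = F.normalize(self.vis_proj(proposal_feats), dim=-1)  # (N, joint_dim)
        q = F.normalize(self.txt_proj(expr_feat), dim=-1)       # (joint_dim,)
        return v @ q                                             # (N,) scores
```

Given N proposal features and one expression feature, the referred proposal is then simply the argmax of the returned scores.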

3 Problem
Existing work on global language context modeling and global visual context modeling introduces noisy information and makes it hard to match these two types of contexts.
Pairwise visual differences computed in existing work can only represent instance-level visual differences among objects of the same category.
Existing work on context modeling for object pairs only considers first-order relationships, not multi-order relationships.
Multi-order relationships are actually structured information, and the context encoders adopted by existing work on grounding referring expressions are simply incapable of modeling them.

4 Pipeline

5 Spatial Relation Graph
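The slide title alone does not fix the construction, so here is a plausible sketch of how a spatial relation graph over region proposals could be built from bounding-box geometry. The k-nearest-neighbour connectivity and the relative-offset edge features are assumptions, not necessarily the paper's exact rule.

```python
import torch

def spatial_relation_graph(boxes, k=5):
    """Builds directed edges between region proposals from their bounding
    boxes: connect each box to its k nearest neighbours (by centre distance)
    and attach a relative-geometry feature to each edge.
    boxes: (N, 4) float tensor of (x1, y1, x2, y2)."""
    k = min(k, boxes.size(0) - 1)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    centres = torch.stack([cx, cy], dim=1)                 # (N, 2)
    dist = torch.cdist(centres, centres)                   # (N, N)
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-edge
    edges, edge_feats = [], []
    for i in range(boxes.size(0)):
        for j in knn[i]:
            # relative offset and relative size, normalised by the source box
            feat = torch.stack([(cx[j] - cx[i]) / w[i],
                                (cy[j] - cy[i]) / h[i],
                                w[j] / w[i], h[j] / h[i]])
            edges.append((i, int(j)))
            edge_feats.append(feat)
    return edges, torch.stack(edge_feats)
```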

6 Language Context
(Figure labels: word type; word refers to vertex; vertex language context.)
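The figure labels suggest that every graph vertex receives a language context computed from the words that refer to it. A rough sketch of one way to do this with soft attention over word features; the projection dimensions and the dot-product attention form are assumptions.

```python
import torch
import torch.nn as nn

class VertexLanguageContext(nn.Module):
    """Per-vertex language context: each vertex attends over the word
    features of the expression and takes their weighted sum. Sketch only."""
    def __init__(self, word_dim=512, vertex_dim=512, attn_dim=256):
        super().__init__()
        self.wq = nn.Linear(vertex_dim, attn_dim)
        self.wk = nn.Linear(word_dim, attn_dim)

    def forward(self, vertex_feats, word_feats):
        # vertex_feats: (V, vertex_dim); word_feats: (T, word_dim)
        q = self.wq(vertex_feats)                                # (V, A)
        k = self.wk(word_feats)                                  # (T, A)
        attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=1)  # (V, T)
        return attn @ word_feats                                 # (V, word_dim)
```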

7 Language-Guided Visual Relation Graph
Note: this way of attending is implicit. (Figure labels: vertex, edge.)
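The note that the attention is implicit can be read as: the language context rescales vertex and edge features through learned gates rather than predicting explicit alignments. A small sketch under that assumption.

```python
import torch
import torch.nn as nn

class LanguageGuidedGate(nn.Module):
    """Modulates vertex or edge features with gates derived from their
    language contexts: no explicit alignment is predicted, the gate just
    rescales feature channels. Illustrative sketch."""
    def __init__(self, feat_dim=512, lang_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(lang_dim, feat_dim), nn.Sigmoid())

    def forward(self, feats, lang_context):
        # feats: (N, feat_dim); lang_context: (N, lang_dim); N = vertices or edges
        return feats * self.gate(lang_context)
```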

8 Language-Vision Feature
Semantic Context Modeling. Loss Function.
Note: the best-matching proposal obtains the most global expression information.
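The note that the best-matching proposal obtains the most global expression information pairs naturally with a ranking-style loss function. A sketch of one common choice, softmax cross-entropy over the proposal scores; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def grounding_loss(scores, gt_index):
    """Cross-entropy ranking loss over proposal matching scores: the
    ground-truth proposal should receive the highest score.
    scores: (num_proposals,) float tensor; gt_index: int."""
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gt_index]))
```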

9

10

11 ICCV 2019

12 Problem
Almost all existing approaches for referring expression comprehension do not introduce reasoning, or only support single-step reasoning.
The models trained with those approaches have poor interpretability.

13 Pipeline

14 Language-Guided Visual Reasoning Process
q is the concatenation of the last hidden states of the forward and backward LSTMs.
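The slide defines q directly, so it can be reproduced almost verbatim: run a bidirectional LSTM over the expression and concatenate the final forward and backward hidden states. Vocabulary size and dimensions below are placeholders.

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Encodes the referring expression with a bidirectional LSTM; q is the
    concatenation of the last forward and backward hidden states."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, T)
        outputs, (h_n, _) = self.lstm(self.embed(word_ids))
        # h_n: (2, batch, hidden_dim) -> forward and backward final states
        q = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)
        return outputs, q
```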

15 Static Attention

16 Dynamic Graph Attention
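A rough sketch of what multi-step, language-guided attention over an object graph can look like: at every step, a step-specific language cue and the neighbours' previous attention update each node's attention. The update rule here is illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DynamicGraphAttention(nn.Module):
    """Multi-step attention over an object graph guided by the expression.
    Sketch: per-node attention is refreshed from a step-specific language
    cue plus attention propagated from neighbouring nodes."""
    def __init__(self, node_dim=512, lang_dim=512, steps=3):
        super().__init__()
        self.steps = steps
        self.node_score = nn.Linear(node_dim + lang_dim, 1)

    def forward(self, node_feats, adj, step_cues):
        # node_feats: (N, node_dim); adj: (N, N) float adjacency (1.0 = edge)
        # step_cues: (steps, lang_dim), one language cue per reasoning step
        N = node_feats.size(0)
        attn = torch.full((N,), 1.0 / N)            # uniform initial attention
        for t in range(self.steps):
            cue = step_cues[t].unsqueeze(0).expand(N, -1)
            score = self.node_score(
                torch.cat([node_feats, cue], dim=-1)).squeeze(-1)
            score = score + (adj @ attn)            # propagate neighbours' attention
            attn = torch.softmax(score, dim=0)
        return attn                                  # final per-node attention
```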

17

18

19 CVPR 2019 Oral

20 Motivation
When we feed an unseen image scene into the framework, we usually get a simple and trivial caption about the salient objects, such as "there is a dog on the floor", which is no better than a bare list of detected objects.
Once we abstract the scene into symbols, the generation becomes almost disentangled from the visual perception.

21 Inductive Bias
Everyday practice makes us humans perform better than machines in high-level reasoning.
Template / rule-based caption models are well known to be less effective than encoder-decoder ones, due to the large gap between visual perception and language composition.
Scene graph → a bridge between the two worlds.
We can embed the graph structure into vector representations; these representations are expected to transfer the inductive bias from the pure language domain to the vision-language domain (a graph-convolution sketch follows).
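Embedding the graph structure into vector representations is typically done with graph convolutions over the scene-graph nodes. A minimal single-layer sketch of that idea; the multi-modal graph convolution network listed on slide 24 is richer than this.

```python
import torch
import torch.nn as nn

class SceneGraphConv(nn.Module):
    """One graph-convolution layer that embeds scene-graph nodes (objects,
    attributes, relations) into vectors by mixing each node with the mean
    of its neighbours. Minimal sketch of the general idea."""
    def __init__(self, in_dim=1024, out_dim=1024):
        super().__init__()
        self.self_fc = nn.Linear(in_dim, out_dim)
        self.neigh_fc = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, in_dim); adj: (N, N) float, 1.0 where an edge exists
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = (adj @ node_feats) / deg             # mean over neighbours
        return torch.relu(self.self_fc(node_feats) + self.neigh_fc(neigh))
```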

22 Encoder-Decoder Revisited

23 Auto-Encoding Scene Graphs
Dictionary
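One way to read the Dictionary bullet: a feature is re-encoded by attending over a set of learned dictionary entries and returning their weighted sum, so the dictionary stores reusable language inductive bias. The sketch below makes that concrete; the attention form and dictionary size are assumptions, not the paper's exact definition.

```python
import torch
import torch.nn as nn

class DictionaryReencoder(nn.Module):
    """Re-encodes a feature through a learned dictionary: attend over the
    dictionary entries and return their weighted sum. Sketch only."""
    def __init__(self, feat_dim=1024, num_entries=10000):
        super().__init__()
        self.D = nn.Parameter(torch.randn(num_entries, feat_dim) * 0.01)

    def forward(self, x):
        # x: (batch, feat_dim)
        attn = torch.softmax(x @ self.D.t(), dim=-1)   # (batch, num_entries)
        return attn @ self.D                            # re-encoded feature
```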

24 Overall Model: SGAE-based Encoder-Decoder
Object detector + relation detector + attribute classifier.
Multi-modal graph convolution network.
Training: pre-train D → cross-entropy loss → RL-based loss (see the sketch below).
Two decoders.
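The cross-entropy → RL-based loss item follows the usual two-stage captioning recipe: maximum-likelihood training with teacher forcing, then fine-tuning with a self-critical policy-gradient loss. A sketch with a hypothetical model / reward_fn interface, assuming self-critical training is the RL loss meant here.

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, stage, reward_fn=None):
    """Two-stage training sketch: 'xent' = cross-entropy with teacher
    forcing, otherwise a self-critical RL loss. model.sample and reward_fn
    are hypothetical interfaces used only for illustration."""
    if stage == "xent":
        # model returns per-step log-probabilities of shape (B, T, vocab)
        log_probs = model(batch["image"], batch["caption"])
        loss = F.nll_loss(log_probs.flatten(0, 1), batch["caption"].flatten())
    else:
        # self-critical: sampled caption's reward vs. a greedy baseline
        sample, sample_logp = model.sample(batch["image"])          # (B, ...), (B,)
        greedy, _ = model.sample(batch["image"], greedy=True)
        advantage = reward_fn(sample, batch["refs"]) - reward_fn(greedy, batch["refs"])
        loss = -(advantage * sample_logp).mean()
    return loss
```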

25

26

27 ICCV 2019

28 Motivation
Unlike a visual concept in ImageNet, which has 650 training images on average, a specific sentence in MS-COCO has only one single image, which is extremely scarce in the conventional view of supervised training.
Given a sentence pattern in Figure 1b, your descriptions for the three images in Figure 1a should be much more constrained.
Studies in cognitive science show that we humans do not speak an entire sentence word by word from scratch; instead, we compose a pattern first, then fill in the pattern with concepts, and we repeat this process until the whole sentence is finished.

29 Tackling the dataset bias

30 Relation Module, Object Module, Function Module, Attribute Module

31 Controller
Multi-step Reasoning: repeat the soft fusion and language decoding M times (see the sketch below).
Linguistic Loss.
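A sketch of how a controller can softly collocate the four modules from the previous slide and repeat the fusion for M reasoning steps. The module interfaces, dimensions, and the fixed controller state are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ModuleController(nn.Module):
    """Soft collocation of neural modules: the controller emits weights over
    the modules, their outputs are fused by those weights, and the fusion is
    repeated for M reasoning steps. Sketch with placeholder modules."""
    def __init__(self, modules, ctrl_dim=512, feat_dim=512, steps=3):
        super().__init__()
        self.mods = nn.ModuleList(modules)               # e.g. the 4 modules
        self.weights = nn.Linear(ctrl_dim, len(modules))
        self.steps = steps

    def forward(self, ctrl_state, visual_feats):
        # ctrl_state: (ctrl_dim,); each module maps visual_feats -> (feat_dim,)
        fused = None
        for _ in range(self.steps):
            outs = torch.stack([m(visual_feats) for m in self.mods])  # (K, feat_dim)
            w = torch.softmax(self.weights(ctrl_state), dim=-1)        # (K,)
            fused = (w.unsqueeze(-1) * outs).sum(dim=0)                # (feat_dim,)
            # in the full model the language decoder would update ctrl_state
            # between steps; it is kept fixed here for brevity
        return fused
```

Each entry of `modules` here is any nn.Module mapping the visual features to a single feature vector; in the paper these would be the object, attribute, relation, and function modules.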

32

33

