1 Vision and language: attention, navigation, and making it work ‘in the wild’
Peter Anderson Australian National University -> Macquarie University -> Georgia Tech CVPR VQA Workshop June 2018

2 Overview 1. Bottom-up and top-down attention
2. Vision-and-language navigation 3. Image captioning ‘in the wild’

3 Visual attention Vision and language tasks often require fine-grained visual processing, e.g. VQA: Q: What is the man holding?

4 Visual attention Vision and language tasks often require fine-grained visual processing, e.g. VQA: Q: What is the man holding? A: phone

5 Example: Stacked attention networks¹
¹Yang et al., CVPR 2016

6 Example: Stacked attention networks¹
The model attends over a set of attention candidates, 𝑉. ¹Yang et al., CVPR 2016

7 Example: Stacked attention networks¹
The model attends over a set of attention candidates, 𝑉, using a task context representation. ¹Yang et al., CVPR 2016

8 Example: Stacked attention networks¹
The model attends over a set of attention candidates, 𝑉, using a task context representation and a learned attention function. ¹Yang et al., CVPR 2016
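To make these three ingredients concrete, here is a minimal soft-attention sketch in PyTorch. It illustrates the general formulation rather than the stacked-attention-network code itself, and names such as `SoftAttention` and `score_net` are my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Minimal soft attention: score each candidate v_i against a task
    context vector, normalise with softmax, return the weighted sum."""
    def __init__(self, v_dim, ctx_dim, hidden_dim=512):
        super().__init__()
        self.score_net = nn.Sequential(        # learned attention function
            nn.Linear(v_dim + ctx_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, V, context):
        # V: (k, v_dim) attention candidates; context: (ctx_dim,) task context
        k = V.size(0)
        ctx = context.unsqueeze(0).expand(k, -1)             # (k, ctx_dim)
        scores = self.score_net(torch.cat([V, ctx], dim=1))  # (k, 1)
        alpha = F.softmax(scores.squeeze(1), dim=0)          # attention weights
        attended = (alpha.unsqueeze(1) * V).sum(dim=0)       # (v_dim,)
        return attended, alpha
```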

9 Attention candidates, 𝑉
Standard approach: use the spatial output of a CNN to extract a 2048-d feature vector for each position in a 10×10 grid, giving 𝑉 = {𝒗₁, …, 𝒗₁₀₀}.
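As an illustration of this grid approach, a short sketch of how the 100 grid vectors might be extracted. The ResNet-101 backbone from torchvision and the adaptive pooling to a 10×10 grid are my assumptions, chosen only to match the 2048-d / 10×10 numbers on the slide.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed setup: an ImageNet-pretrained ResNet-101 backbone.
backbone = models.resnet101(weights="IMAGENET1K_V1")
cnn = nn.Sequential(*list(backbone.children())[:-2]).eval()  # drop avgpool + fc
pool_to_grid = nn.AdaptiveAvgPool2d((10, 10))                 # force a 10x10 grid

def grid_candidates(image_batch):
    """image_batch: (B, 3, H, W) -> V: (B, 100, 2048), one vector per grid cell."""
    with torch.no_grad():
        fmap = pool_to_grid(cnn(image_batch))                 # (B, 2048, 10, 10)
    B, C, Hg, Wg = fmap.shape
    return fmap.view(B, C, Hg * Wg).permute(0, 2, 1)
```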

10 Attention candidates, 𝑉
Standard approach: the spatial output of a CNN (a fixed grid of positions). Our approach: object-based attention over 𝑘 detected regions.

11 Attention candidates, 𝑉
Our approach: bottom-up attention (using Faster R-CNN²). Each salient object / image region is detected (CNN feature map -> region proposal network -> ROI pooling) and represented by its mean-pooled feature vector, giving 𝑉 = {𝒗₁, …, 𝒗ₖ}. ²Ren et al., NIPS 2015
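The sketch below illustrates the idea of mean-pooled region features. It uses torchvision's COCO-pretrained Faster R-CNN and `roi_align` purely as stand-ins for the Visual-Genome-trained detector used in this work; the `fmap` argument (a backbone feature map for the same image) and the thresholds are assumptions.

```python
import torch
from torchvision import models
from torchvision.ops import roi_align

# Illustration only: a COCO-pretrained detector standing in for the
# Visual Genome-trained Faster R-CNN described on the next slide.
detector = models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def bottom_up_features(image, fmap, k=36, conf_thresh=0.3):
    """image: (3, H, W) tensor in [0, 1]; fmap: (1, C, Hf, Wf) backbone feature map.
    Returns V: (k', C), one mean-pooled feature vector per detected region."""
    with torch.no_grad():
        det = detector([image])[0]
    keep = det["scores"] > conf_thresh
    boxes = det["boxes"][keep][:k]                     # (k', 4) in image coordinates
    scale = fmap.shape[-1] / image.shape[-1]           # image coords -> feature map
    regions = roi_align(fmap, [boxes], output_size=(7, 7),
                        spatial_scale=scale)           # (k', C, 7, 7)
    return regions.mean(dim=(2, 3))                    # mean-pool each region
```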

12 Faster R-CNN pre-training
Using Visual Genome³ with 1600 filtered object classes and 400 filtered attribute classes. Example training data: a region labelled "bench" with attributes "worn, wooden, grey, weathered". ³Krishna et al., arXiv 2016

13 Bottom-up and top-down attention
Bottom-up process: extract all objects and other salient regions from the image (independent of the question / partially-completed caption), producing the candidate set 𝑉. Top-down process: given the task context, weight the attention candidates (i.e., use existing VQA / captioning models).

14 Benefits of ‘Up-Down’ attention
Unifies vision & language tasks with object detection models. Enables transfer learning by pre-training on object detection datasets. Complementary to other models (just swap the attention candidates). Can be fine-tuned. Gives more interpretable attention weights.

15 ResNet (10×10):  A man sitting on a toilet in a bathroom.
Up-Down (Ours):  A man sitting on a couch in a bathroom.

16 VQA examples Q: What room are they in? A: kitchen

17 VQA examples - counting
Q: How many oranges are on pedestals? A: two

18 VQA examples - reading Q: What is the name of the realty company?
A: none

19 COCO Captions results
1st place on the COCO Captions leaderboard (July 2017). COCO Captions “Karpathy” test set (single-model):

Model            BLEU-4   METEOR   CIDEr   SPICE
ResNet (10×10)   34.0     26.5     111.1   20.2
Up-Down (Ours)   36.3     27.7     120.1   21.4    (+6%)

20 VQA results
1st place in the 2017 VQA Challenge (June 2017). The Up-Down approach has since been incorporated into many other models (including the top three 2018 Challenge entries). VQA v2 val set (single-model):

Model            Yes/No   Number   Other   Overall
ResNet (1×1)     76.0     36.5     46.8    56.3
ResNet (14×14)   76.6     36.2     49.5    57.9
ResNet (7×7)     77.6     37.7     51.5    59.4
Up-Down (Ours)   80.3     42.8     55.8    63.2    (+6%)

21 Overview 1. Bottom-up and top-down attention
2. Vision-and-language navigation 3. Image captioning ‘in the wild’

22 Connecting vision & language to actions
Embodied agents need environments to explore. Enabling trend: the recent availability of large-scale 3D reconstructions, a kind of ‘Flickr for embodied agents’ (700K+ environments).

23 Matterport3D Simulator
A simulator for embodied visual agents, based on the Matterport3D dataset (Chang et al. 2017). Contains 10,800 panoramic images across 90 buildings, with high visual diversity.

24 Matterport3D Simulator
Feasible trajectories determined by navigation graph

25 Room-to-Room (R2R) Navigation dataset
Vision-and-Language Navigation task: given a natural language navigation instruction, navigate through the environment to find the goal location. Dataset statistics: 21,567 navigation instructions collected using AMT, with more data on the way. Clear evaluation protocol: a single test run; the agent must choose to stop; success is measured as stopping within 3m of the goal location. Project url: https://bringmeaspoon.org/
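As a small illustration of this evaluation protocol, a sketch of the success-rate computation; `goal_distance` (shortest-path distance in metres over the navigation graph) is an assumed helper, and only the 3 m threshold comes from the slide.

```python
import numpy as np

SUCCESS_RADIUS_M = 3.0  # success threshold stated on the slide

def success_rate(episodes, goal_distance):
    """episodes: list of (final_viewpoint, goal_viewpoint) pairs, recorded only
    after the agent itself chose to stop (single test run, no oracle stopping).
    goal_distance: assumed helper returning shortest-path distance in metres."""
    errors = np.array([goal_distance(final, goal) for final, goal in episodes])
    return float(np.mean(errors < SUCCESS_RADIUS_M))
```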


27 Challenging Aspects
The test set is composed of previously unseen buildings, requiring open-set recognition of new visual concepts and new vocabulary (e.g., squiggle painting, mannequins, hieroglyphs). Project url: https://bringmeaspoon.org/

28 Success rate (< 3m navigation error)
Project url: https://bringmeaspoon.org/

29 VLN Challenge: the test evaluation server is up and running on EvalAI.

30 Overview 1. Bottom-up and top-down attention
2. Vision-and-language navigation 3. Image captioning ‘in the wild’

31 Never forget the family tree
Most vision-and-language datasets are built on COCO (91 object classes, 328K total images), e.g. COCO Captions, VQA, Google Referring Expressions.

32 State-of-the-art in image captioning⁵
A brown sheep standing in a field of grass. A group of people are playing a video game. ⁵Anderson et al., CVPR 2018

33 State-of-the-art in image captioning⁵
Two hot dogs on a tray with a drink. Two elephants and a baby elephant walking together. ⁵Anderson et al., CVPR 2018

34 Same model on non-COCO images
A zebra is laying down in the grass. (The image actually shows a tiger.)

35 Vision-and-language ‘in the wild’
We must go beyond COCO in order to have impact in the wider world. How can we scale up vision and language models from thousands of visual concepts to hundreds of thousands or millions?

36 Scaling up
Dataset         Images   Object classes   Annotation format
COCO Captions   120K     91               captions, e.g. "A very pretty zebra crossing a paved road."
Open Images     2M       600              image-level labels, e.g. "tiger"
Flickr 100M     100M     ?                user tags, e.g. "tiger, zoo, pool, three"
Can we use these to improve captioning models?

37 Scaling up
Dataset         Images   Object classes   Annotation format
COCO Captions   120K     91               captions, e.g. "A very pretty zebra crossing a paved road."
Open Images     2-9M     600              image-level labels, e.g. "tiger"
Flickr 100M     100M     ?                user tags, e.g. "tiger, zoo, pool, three"
Can we use these to improve captioning models?

38 Approach
Treat image labels as partial captions. Interpretation: an image label such as ‘tiger’ is a partial caption, one that mentions tiger!

39 How to generate full training targets?
Before: the training target is a complete caption ("A very pretty ... road <EOS>"), and the CNN-RNN captioning model is trained to predict every word.

40 How to generate full training targets?
Before: the training target is a complete caption ("A very pretty ... road <EOS>"). Now: the target is an image label such as "tiger"; we know it should appear in the caption, but every other word is unknown.

41 How to generate full training targets?
Before: a complete caption. Now: the targets are image labels such as "tiger" and "meat"; we know they should appear somewhere in the caption, but their positions and the remaining words are unknown.

42 How to generate full training targets?
Before: a complete caption. Now: the targets are image labels and their word-form variants, e.g. "tiger(s)" and "meat"; we know they should appear somewhere in the caption, but their positions and the remaining words are unknown.

43 Key idea
Encode our limited knowledge about the true caption in a finite state machine (FSM). Example: a four-state FSM over states s0-s3 in which the words tiger/tigers and meat/flesh each advance the state (s0 to s1 or s2 once one constraint is mentioned, then s1/s2 to s3 once both are), so that only captions mentioning both a tiger word and a meat word reach the final state s3.
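One possible way to write this FSM down in code; a minimal sketch of the idea on the slide, not the paper's implementation. The state names and word sets follow the diagram.

```python
# States: s0 = neither constraint satisfied, s1 = tiger mentioned,
# s2 = meat mentioned, s3 = both mentioned (accepting).
TIGER = {"tiger", "tigers"}
MEAT = {"meat", "flesh"}

TRANSITIONS = {
    "s0": [(TIGER, "s1"), (MEAT, "s2")],
    "s1": [(MEAT, "s3")],
    "s2": [(TIGER, "s3")],
    "s3": [],
}
ACCEPTING = {"s3"}

def step(state, word):
    """Advance the FSM by one word; any word not on an outgoing edge self-loops."""
    for words, nxt in TRANSITIONS.get(state, []):
        if word in words:
            return nxt
    return state

def accepts(caption, start="s0"):
    """True iff the caption mentions both a tiger word and a meat word."""
    state = start
    for word in caption.lower().rstrip(".").split():
        state = step(state, word)
    return state in ACCEPTING
```

For example, `accepts("A tiger is sitting with some meat in the grass.")` returns True, while the unconstrained caption about a cat and a fence does not.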

44 How to generate full training targets?
With standard beam search decoding, the captioning model produces: "A cat is sitting in the grass near a fence."

46 How to generate full training targets?
Constrained beam search⁶ finds high-probability sequences that satisfy an FSM. Standard beam search decoding: "A cat is sitting in the grass near a fence." Constrained beam search decoding (with the tiger/meat FSM): "A tiger is sitting with some meat in the grass." ⁶Anderson et al., EMNLP 2017
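A much-simplified sketch in the spirit of constrained beam search⁶: it keeps a separate beam of partial captions per FSM state and only lets a hypothesis terminate from an accepting state. It reuses `step()` and `ACCEPTING` from the FSM sketch above, `next_word_logprobs` is an assumed stand-in for the captioning model's decoder, and the real algorithm differs in several details.

```python
import heapq

def constrained_beam_search(next_word_logprobs, beam_size=5, max_len=20,
                            start_state="s0", end_token="<EOS>"):
    """Simplified constrained beam search over an FSM.

    next_word_logprobs(prefix) -> {word: logprob} is an assumed stand-in for
    the captioning model's decoder. Captions may only emit <EOS> from an
    accepting FSM state, so returned captions satisfy the constraints.
    """
    beams = {start_state: [(0.0, ())]}   # FSM state -> [(logprob, word tuple)]
    finished = []                        # completed, constraint-satisfying captions

    for _ in range(max_len):
        candidates = {}
        for state, beam in beams.items():
            for logp, prefix in beam:
                for word, wlogp in next_word_logprobs(prefix).items():
                    if word == end_token:
                        if state in ACCEPTING:
                            finished.append((logp + wlogp, prefix))
                        continue
                    nxt = step(state, word)
                    candidates.setdefault(nxt, []).append(
                        (logp + wlogp, prefix + (word,)))
        # Keep only the top `beam_size` hypotheses for each FSM state.
        beams = {s: heapq.nlargest(beam_size, hyps)
                 for s, hyps in candidates.items()}
        if not beams:
            break

    if not finished:
        return ""
    best_logp, best_words = max(finished)
    return " ".join(best_words)
```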

47 Partially-supervised training using constrained decoding⁷
First, represent incomplete sequences (e.g., image labels) as FSMs. Then iterate: Step 1: estimate the complete sequences using constrained beam search (e.g., "A tiger is sitting with some meat in the grass."). Step 2: update the model parameters using the completed data. ⁷Anderson et al., arXiv (under submission)
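A rough sketch of the iterative procedure described on this slide; every name here (`label_to_fsm`, `model.constrained_beam_search`, `model.train_on`) is an assumed interface for illustration, not the paper's code.

```python
def partially_supervised_training(model, captioned_data, labelled_images,
                                  label_to_fsm, num_iterations=5):
    """Alternate between completing the weakly-labelled data and re-training.

    captioned_data: list of (image, caption) pairs with full captions.
    labelled_images: list of (image, labels) pairs with image-level labels only.
    label_to_fsm: assumed helper that builds an FSM from a set of labels.
    """
    for _ in range(num_iterations):
        # Step 1: estimate complete captions for the label-only images
        # with constrained beam search against each image's FSM.
        pseudo_captions = [
            (image, model.constrained_beam_search(image, label_to_fsm(labels)))
            for image, labels in labelled_images
        ]
        # Step 2: update the model parameters on the fully-captioned data
        # plus the completed (pseudo-captioned) data.
        model.train_on(captioned_data + pseudo_captions)
    return model
```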

48 Example – COCO novel object captioning
We excluded training captions mentioning racket(s) or racquet(s); the model learns about tennis rackets from image-level tags (and pre-trained word embeddings).

49 Examples – Open Images
COCO Baseline: "A zebra is laying down in the grass." Ours⁸: "A tiger that is sitting in the grass." COCO Baseline: "A man taking a picture of an old car." Ours⁸: "A man sitting in a car looking at an elephant." ⁸After learning some new visual concepts from Open Images v4.

50 Acknowledgments
Many collaborators were involved in this work:
Stephen Gould (PhD advisor), Australian National University
Basura Fernando, Australian National University
Mark Johnson, Macquarie University
Anton van den Hengel, University of Adelaide
Ian Reid, University of Adelaide
Qi Wu, University of Adelaide
Damien Teney, University of Adelaide
Jake Bruce, Queensland University of Technology
Niko Sünderhauf, Queensland University of Technology
Xiaodong He, Microsoft Research
Lei Zhang, Microsoft Research
Collection of the R2R dataset for Vision-and-Language Navigation was generously supported by a Facebook ParlAI Research Award.

51 Thank you
Bottom-up and top-down attention:
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR 2018 (Oral, Wed 2:50-4:30)
Tips and Tricks for Visual Question Answering: Learnings From the Challenge, CVPR 2018 (Poster, Wed 10:10-12:30)
Vision-and-language navigation:
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments, CVPR 2018 (Spotlight, Wed 8:30-10:10)
Image captioning ‘in the wild’:
Guided Open Vocabulary Image Captioning with Constrained Beam Search, EMNLP 2017
Partially-Supervised Image Captioning, arXiv, 2018
Code, data and models:

