Presentation transcript:

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision http://nscl.csail.mit.edu Jiayuan Mao1,2 Chuang Gan3 Pushmeet Kohli4 Josh Tenenbaum1 Jiajun Wu1 1MIT CSAIL 2Tsinghua University 3MIT-IBM Watson AI Lab 4DeepMind Good morning, everyone.

Concepts in Visual Reasoning We study the problem of visual concept learning and visual reasoning. Given this input image, humans can instantaneously recognize the objects in the scene. We also associate various visual concepts with the objects' appearance, including colors, shapes, types of materials, and so on. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … We study the problem of visual concept learning and visual reasoning. Given this input image, humans can instantaneously recognize the objects in the scene. We also associate various visual concepts with the objects' appearance, including colors, shapes, types of materials, and so on. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? The recognized visual concepts support our reasoning over the image. As an example, to answer the question "What's the shape of the red object?", we could associate the word red with the color property of the second object. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? The recognized visual concepts support our reasoning over the image. As an example, to answer the question "What's the shape of the red object?", we could associate the word red with the color property of the second object, and answer the question by querying the shape of the second object. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? A: Sphere. The recognized visual concepts support our reasoning over the image. As an example, to answer the question "What's the shape of the red object?", we could associate the word red with the color property of the second object, and answer the question by querying the shape of the second object, which gives the answer sphere. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? A: Sphere. Image Captioning Beyond just answering questions, visual concepts support visual reasoning of various forms. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? A: Sphere. Image Captioning There is a green cube behind a red sphere. Beyond just answering questions, visual concepts support visual reasoning of various forms. For example, you could write captions to describe the image you are looking at. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? A: Sphere. Image Captioning There is a green cube behind a red sphere. Instance Retrieval: rubber sphere. We can also retrieve instances from the dataset that are associated with specific concepts of interest; in this case, the rubber spheres. CLEVR [Johnson et al., 2017]

Concepts in Visual Reasoning Object 1: Color: Green, Shape: Cube, Material: Metal, … Object 2: Color: Red, Shape: Sphere, Material: Rubber, … Visual Question Answering Q: What's the shape of the red object? A: Sphere. Image Captioning Instance Retrieval We first review prior work on visual reasoning in the visual question answering setting. CLEVR [Johnson et al., 2017]

End-to-End Visual Reasoning Visual Question Answering Q: What's the shape of the red object? End-to-End Neural Network A: Sphere. We first review prior work on visual reasoning in the visual question answering case. A typical flavor of these models is to use an end-to-end neural architecture to predict the answer to the question. More intuitive: a magic black box. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018] MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al., 2018] Image Captioning Instance Retrieval

End-to-End Visual Reasoning Visual Question Answering Q: What's the shape of the red object? End-to-End Neural Network Concept (e.g., colors, shapes) Reasoning (e.g., count) A: Sphere. A typical flavor is to use an end-to-end neural architecture, including an image encoder, a question encoder, and a reasoning module, which predicts the answer to the question. Two things are learned in this loop: first, the concepts, such as the meaning of red; second, the ability to reason, such as the functionality of counting objects. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018] MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al., 2018] Image Captioning Instance Retrieval

End-to-End Visual Reasoning Visual Question Answering Q: What's the shape of the red object? End-to-End Neural Network Entangled Concept (e.g., colors, shapes) Reasoning (e.g., count) A: Sphere. Two things are learned in this loop: first, the concepts, such as the meaning of red; second, the ability to reason, such as the functionality of counting objects. In such an end-to-end design, concept learning and reasoning are entangled with each other. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018] MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al., 2018] Image Captioning Instance Retrieval

End-to-End Visual Reasoning Visual Question Answering Q: What's the shape of the red object? End-to-End Neural Network Entangled Concept (e.g., colors, shapes) Reasoning (e.g., count) A: Sphere. In such an end-to-end design, concept learning and reasoning are entangled with each other. As a result, the learned concepts cannot be easily transferred to other tasks, such as image captioning or instance retrieval. Hard to transfer. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018] MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al., 2018] Image Captioning Instance Retrieval

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing There have been recent attempts to incorporate concepts explicitly in visual reasoning. The NS-VQA framework, proposed by Yi and colleagues, uses a scene parsing subroutine to extract an abstract representation of images. It first detects all objects in the scene and extracts their attributes, such as color, shape, and material, with off-the-shelf neural networks. Language Q: What's the shape of the red object?

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing ID: 1, Color: Green, Shape: Cube, Material: Metal. The NS-VQA framework first detects all objects in the scene and extracts their attributes, such as color, shape, and material, with off-the-shelf neural networks. Language Q: What's the shape of the red object?

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing ID: 1, Color: Green, Shape: Cube, Material: Metal; ID: 2, Color: Red, Shape: Sphere, Material: Rubber. Meanwhile, it translates natural-language questions into logic programs with a hierarchical layout. In this case, the question "What's the shape of the red object?" can be translated into a two-step program. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What's the shape of the red object?

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing ID: 1, Color: Green, Shape: Cube, Material: Metal; ID: 2, Color: Red, Shape: Sphere, Material: Rubber. Symbolic Reasoning The generated program, together with the abstract scene representation, is sent to a symbolic program executor. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What's the shape of the red object?

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing ID: 1, Color: Green, Shape: Cube, Material: Metal; ID: 2, Color: Red, Shape: Sphere, Material: Rubber. Symbolic Reasoning The reasoning module executes the program. It finds the red object in this scene. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What's the shape of the red object?

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing ID: 1, Color: Green, Shape: Cube, Material: Metal; ID: 2, Color: Red, Shape: Sphere, Material: Rubber. Symbolic Reasoning The reasoning module executes the program: it finds the red object in the scene, and answers the question by querying the shape of that object. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What's the shape of the red object? Sphere
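To make this execution concrete, here is a minimal sketch of a purely symbolic executor running Filter(Red) followed by Query(Shape) over the parsed scene table. The scene rows follow the slide; the function names and structure are illustrative assumptions, not the NS-VQA codebase.

```python
# Minimal, illustrative symbolic executor over the parsed scene table.
# Scene rows follow the slide; everything else is an assumption.

scene = [
    {"id": 1, "color": "Green", "shape": "Cube",   "material": "Metal"},
    {"id": 2, "color": "Red",   "shape": "Sphere", "material": "Rubber"},
]

def filter_objects(objects, attribute, value):
    """Filter: keep only the objects whose attribute matches the concept."""
    return [obj for obj in objects if obj[attribute] == value]

def query(objects, attribute):
    """Query: return the queried attribute of the single selected object."""
    assert len(objects) == 1, "Query expects exactly one selected object."
    return objects[0][attribute]

# Program: Filter(Red) -> Query(Shape)
selected = filter_objects(scene, "color", "Red")
print(query(selected, "shape"))  # -> Sphere
```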

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing Concept Annotation ID: 1, Color: Green, Shape: Cube, Material: Metal; ID: 2, Color: Red, Shape: Sphere, Material: Rubber. Symbolic Reasoning During training, annotations for the visual concepts and the programs are needed. Language Program Annotation Semantic Parsing Filter(Red) Query(Shape) Program Q: What's the shape of the red object? Sphere

Incorporate Concepts in Visual Reasoning NS-VQA [Yi et al., 2018] Vision Scene Parsing Concept Annotation? This restricts its application in real-world scenarios, where neither concept annotations for natural images nor program annotations for natural language can be easily obtained. Program Annotation? Language Semantic Parsing Q: Are the animals grazing? VQA [Agrawal et al., 2015]

Idea: Joint Learning of Concepts and Semantic Parsing Vision Scene Parsing Concept In this paper, we present the idea of jointly learning visual concepts and semantic parsing from natural supervision, where no explicit human annotations for concepts or programs are needed. Analogous to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to; meanwhile, the learned visual concepts facilitate parsing new sentences. Program Language Semantic Parsing Q: Are the animals grazing? VQA [Agrawal et al., 2015]

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation Feature Extraction Let's run through a concrete example. We first detect the objects in the scene and extract a visual representation for each object. Q: What's the shape of the red object?

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Let's run through a concrete example. We first detect the objects in the scene and extract a visual representation for each object. Q: What's the shape of the red object?

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation Obj 1 Feature Extraction Obj 2 2 Let's run through a concrete example. We first detect the objects in the scene and extract a visual representation for each object. We also use a semantic parser to translate the input question into a symbolic program with a hierarchical structure. Q: What's the shape of the red object?

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Let's run through a concrete example. We first detect the objects in the scene and extract a visual representation for each object. We also use a semantic parser to translate the input question into a symbolic program with a hierarchical structure. Each concept in the question, such as the word "red", is associated with a vector embedding. A neuro-symbolic reasoning module takes the visual representations, the concept embeddings, and the parsed program as input, and gives the answer to the question. … Semantic Parsing Q: What's the shape of the red object? Filter Query red shape Sphere

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Concept Embeddings red ...... Semantic Parsing Q: What’s the shape of the red object? Filter Query red shape

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red ...... Semantic Parsing Q: What’s the shape of the red object? Filter Query red shape

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red And finally, the latent programs underlying natural language. … Semantic Parsing Q: What's the shape of the red object? Program Filter Query red shape
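As a rough illustration, such a latent program can be represented as a small tree of operations. The ProgramNode class below is hypothetical; the nesting mirrors the slide's Filter followed by Query layout.

```python
# Hypothetical representation of the hierarchical program recovered by the
# semantic parser: each node names an operation, an optional concept
# argument, and the subprogram that feeds it.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProgramNode:
    op: str                               # e.g., "Scene", "Filter", "Query"
    concept: Optional[str] = None         # e.g., "red", "shape"
    inp: Optional["ProgramNode"] = None   # input subprogram

# "What's the shape of the red object?" ->
program = ProgramNode(
    op="Query", concept="shape",
    inp=ProgramNode(op="Filter", concept="red", inp=ProgramNode(op="Scene")),
)
```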

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red First, we focus on the learning of visual concepts via neuro-symbolic reasoning. We assume that a latent program has been recovered by the semantic parsing module. … Semantic Parsing Q: What's the shape of the red object? Program Filter Query red shape

Neuro-Symbolic Reasoning Concept Embeddings red … Visual Representation Obj 1 Obj 2 1 2 First, we focus on the learning of visual concepts via neuro-symbolic reasoning. We assume that a latent program has been recovered by the semantic parsing module. Q: What's the shape of the red object? Filter Query red shape

Neuro-Symbolic Reasoning Visual Representation 1 Obj 1 Obj 2 2 Concept Embeddings red The perception module learns visual concepts based on the language description of the object being referred to. First, we focus on learning the visual representations of objects and the concept embeddings via neuro-symbolic reasoning. We assume that a latent program has been recovered by the semantic parsing module. … Q: What's the shape of the red object? Filter Query red shape

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 1 Obj 2 2 Concept Embeddings red Color Proj. Starting from the first Filter(red) operation: for each object in the scene, we use a small neural network to project its visual representation into a color space. … Color Space Q: What's the shape of the red object? Filter Query red shape Color(Obj 1)

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Concept Embeddings red Color Proj. For each object in the scene, we use a small neural network to project its visual representation into the color space. … Color Space Q: What's the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Concept Embeddings red The concept red also corresponds to a vector embedding in the color space. … Color Space red Q: What's the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Sim(Color(Obj1), red) = 0.1 ✗ Concept Embeddings red We compute the cosine similarity between each object's color representation and the vector embedding of red. … Color Space red Q: What's the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Sim(Color(Obj1), red) = 0.1 ✗ Sim(Color(Obj2), red) = 0.9 ✓ Concept Embeddings red We compute the cosine similarity between each object's color representation and the vector embedding of red to classify the objects. In this case, the second object is classified as red. … Color Space red Q: What's the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)
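A minimal PyTorch sketch of this filtering step, with made-up dimensions: project each object's visual feature into the color space and score it against the embedding of red by cosine similarity. The projection head and embedding here are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, space_dim = 256, 64                 # illustrative sizes
color_proj = nn.Linear(feat_dim, space_dim)   # small projection network
red_embedding = torch.randn(space_dim)        # embedding of "red" (learnable in practice)

object_feats = torch.randn(2, feat_dim)       # visual features of Obj 1 and Obj 2

color_vecs = color_proj(object_feats)         # project into the color space
scores = F.cosine_similarity(color_vecs, red_embedding.unsqueeze(0), dim=-1)
# e.g., scores ~ [0.1, 0.9]: the second object is classified as red.
```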

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Concept Embeddings red Shape Proj. Thus, the second object is selected as the input to the second Query(shape) operation. Next, we extract the shape representation of the second object. … Q: What's the shape of the red object? Shape Space Filter Query red shape Shape(Obj 2)

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder Concept Embeddings red We then compare the shape representation with the vector embeddings of the different shapes in the dataset: cube, sphere, and cylinder. … Q: What's the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder ✗ Concept Embeddings red Again, we compute the cosine similarity between the shape representation of the second object and the vector embeddings of different shapes to classify the shape of the object. ...... Sim(Shape(Obj2), cube) = 0.1 Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder ✗ Concept Embeddings red ✓ Again, we compute the cosine similarity between the shape representation of the second object and the vector embeddings of different shapes to classify the shape of the object. ...... Sim(Shape(Obj2), sphere) = 0.9 Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder

Neuro-Symbolic Reasoning General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder ✗ Concept Embeddings red ✓ ✗ Again, we compute the cosine similarity between the shape representation of the second object and the vector embeddings of different shapes to classify the shape of the object. In this case, the object will be classified as a sphere, which answers the question correctly. ...... Sim(Shape(Obj2), cylinder) = 0.1 Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder
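The Query(shape) step can be sketched the same way, again with hypothetical names and sizes: project the selected object into the shape space and take the concept with the highest cosine similarity as the answer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, space_dim = 256, 64
shape_proj = nn.Linear(feat_dim, space_dim)        # projection into the shape space
shape_vocab = ["cube", "sphere", "cylinder"]
shape_embeddings = torch.randn(len(shape_vocab), space_dim)  # one embedding per shape

obj2_feat = torch.randn(1, feat_dim)               # the object selected by Filter(red)
sims = F.cosine_similarity(shape_proj(obj2_feat), shape_embeddings, dim=-1)
answer = shape_vocab[sims.argmax().item()]         # highest similarity wins, e.g. "sphere"
```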

Neuro-Symbolic Reasoning Back-propagation Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red In the neuro-symbolic reasoning process, all operations are executed based on the similarity between object attributes and concepts in the latent embedding spaces. Thus, the derived answer is fully differentiable with respect to the visual representations of the objects as well as the concept embeddings. During training, we use the ground-truth answer as supervision and train with back-propagation. … Semantic Parsing Q: What's the shape of the red object? Filter Query red shape Sphere
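One way to see why the answer is differentiable: if Filter produces a soft mask (say, a sigmoid of the similarity scores) rather than a hard selection, the answer logits become a smooth function of both the object representations and the concept embeddings, and ordinary cross-entropy training applies. A minimal sketch under that assumption; the paper's exact soft-execution semantics may differ.

```python
import torch
import torch.nn.functional as F

# Soft execution: Filter(red) yields a probability per object rather than a
# hard selection, so gradients flow to features and concept embeddings alike.
red_scores = torch.tensor([0.1, 0.9], requires_grad=True)   # similarity to "red"
mask = torch.sigmoid((red_scores - 0.5) / 0.1)              # soft selection mask

shape_logits = torch.tensor([[ 2.0, -1.0, 0.5],             # per-object scores for
                             [-1.0,  3.0, 0.2]],            # cube/sphere/cylinder
                            requires_grad=True)
answer_logits = (mask.unsqueeze(1) * shape_logits).sum(0) / mask.sum()

target = torch.tensor([1])                                  # ground truth: "sphere"
loss = F.cross_entropy(answer_logits.unsqueeze(0), target)
loss.backward()   # gradients reach both the similarity scores and the logits
```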

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red We now look at how the learned visual concepts facilitate parsing new sentences. ...... Semantic Parsing Q: What’s the shape of the red object? Program Filter Query red shape

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red … During training of the semantic parser, we sample multiple candidate programs from the parser. In this example, two candidate programs are sampled; we show the natural-language interpretation of each program. The semantics of the first program is "What's the shape of the red object?", which is the correct one. The semantics of the second program is "Is there any other thing of the same shape as the red object?". The concepts in the question are associated with vector embeddings. The next step is to execute the candidate programs. Filter Query red shape Semantic Parsing What's the shape of the red object? Q: What's the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Groundtruth: Sphere ...... Based on the visual representations and concept embeddings Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Sphere Groundtruth: Sphere ...... Based on the visual representations and concept embeddings, the first program gives us the answer sphere. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Sphere ✓ Groundtruth: Sphere ...... Based on the visual representations and concept embeddings, the first program gives us the answer sphere, which is correct, compared with the groundtruth. Thus, the first candidate program will be marked as a positive example. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Groundtruth: Sphere ...... We also execute the second program. Filter Query red shape Semantic Parsing What's the shape of the red object? Q: What's the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: No Groundtruth: Sphere ...... We also execute the second program, which gives us the answer No, since there is only one sphere in the scene. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: No ✗ Groundtruth: Sphere ...... The answer No is incorrect compared with the groundtruth answer to the original question. Thus, the second program will be marked as a negative example. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

Concepts Facilitate Parsing New Sentences Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red … REINFORCE For each parsed program, we use the execution result, compared against the ground truth, as the reward, and apply the REINFORCE algorithm to train the semantic parser. Filter Query red shape Semantic Parsing What's the shape of the red object? Q: What's the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?
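A schematic of the REINFORCE update described here, assuming the parser exposes a log-probability for each sampled program; the names and values below are illustrative.

```python
import torch

def reinforce_step(log_probs, rewards, optimizer):
    """One REINFORCE update: maximize the reward-weighted log-probability of
    the sampled programs (reward 1 if the executed answer matches the ground
    truth, 0 otherwise)."""
    loss = -(rewards * log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The two candidate programs from the slide: the first answered "Sphere"
# correctly (reward 1), the second answered "No" (reward 0).
log_probs = torch.tensor([-0.7, -1.2], requires_grad=True)  # from the parser
rewards = torch.tensor([1.0, 0.0])
optimizer = torch.optim.SGD([log_probs], lr=0.1)
reinforce_step(log_probs, rewards, optimizer)
```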

Idea: Joint Learning of Concepts and Semantic Parsing Vision Scene Parsing Concept 1 2 Neuro-Symbolic Reasoning Language Program This closes the loop between concept learning and semantic parsing. In the NS-CL framework, these two modules are jointly learned by only looking at images and reading paired questions and answers. No annotations for concepts or programs are needed. Semantic Parsing Q: What's the shape of the red object? Sphere

Curriculum Learning Lesson 1: object-based questions. Q: What is the shape of the red object? A: Sphere. Lesson 2: relational questions. Q: Is the green cube behind the red sphere? A: Yes. Lesson 3: complex scenes, complex questions. Q: Does the big matte object behind the big sphere have the same color as the cylinder left of the small brown cube? A: No. To facilitate the joint learning, we draw inspiration from how children learn visual concepts. Our model starts by learning object-level concepts from simple scenes and simple questions. It then interprets referential expressions based on the learned object-level concepts in order to learn relational concepts. Finally, we extend it to learn from complex scenes and questions.
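A sketch of how such a curriculum might be scheduled in code; the lesson thresholds are made-up placeholders, not the paper's actual schedule.

```python
# Illustrative curriculum: each lesson admits only training examples below a
# scene/question complexity threshold, relaxed lesson by lesson.
lessons = [
    {"max_objects": 3,  "max_program_depth": 2},   # lesson 1: object-level concepts
    {"max_objects": 5,  "max_program_depth": 4},   # lesson 2: relational concepts
    {"max_objects": 10, "max_program_depth": 8},   # lesson 3: full scenes and questions
]

def admitted(example, lesson):
    return (example["num_objects"] <= lesson["max_objects"]
            and example["program_depth"] <= lesson["max_program_depth"])

def curriculum(dataset):
    """Yield progressively harder subsets of the data, lesson by lesson."""
    for lesson in lessons:
        yield [ex for ex in dataset if admitted(ex, lesson)]
```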

High Accuracy and Data Efficiency IEP [Johnson et al., 2017] MAC [Hudson & Manning, 2018] TbD [Mascharka et al., 2018] NS-CL [Ours] 7K Images + 70K Questions Our approach brings multiple advantages over prior work. First, looking at the QA accuracy on CLEVR, the standard testbed for visual reasoning, our model reaches state-of-the-art performance compared with the baselines. Moreover, our approach is more data efficient: using only 10% of the training data, our model reaches 98.9% QA accuracy, surpassing all baselines by at least fourteen percent.

Application in Real-World Scenarios VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] The neuro-symbolic concept learning framework can be easily extended to natural images and natural language. Here, we show the execution traces of two sample questions from the VQS dataset by Gan et al. The first question queries the color of the fire hydrant. Q: What color is the fire hydrant? Filter Query fire_hydrant color

Application in Real-World Scenarios VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] The first operation, Filter(fire_hydrant), selects the fire hydrant in the image. Q: What color is the fire hydrant? Filter Query fire_hydrant color

Application in Real-World Scenarios VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] The second operation, Query(color), gives the answer yellow. Q: What color is the fire hydrant? Filter Query fire_hydrant color A: Yellow

Application in Real-World Scenarios VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] In the second example, our goal is to count the number of zebras in the scene. Q: What color is the fire hydrant? Q: How many zebras are there? Filter Query Filter Count fire_hydrant color zebra A: Yellow

Application in Real-World Scenarios VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] To do this, we first select all the zebras in the scene. Q: What color is the fire hydrant? Q: How many zebras are there? Filter Query Filter Count fire_hydrant color zebra A: Yellow

Application in Real-World Scenarios VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] And return the number of zebras. Q: What color is the fire hydrant? Q: How many zebras are there? Filter Query Filter Count fire_hydrant color zebra A: Yellow A: 3

Concept: Person On a Skateboard Concept: Horse Concept: Person Concept: Person On a Skateboard The learned concepts can be easily transferred to other tasks. Here, we show examples of instance retrieval given object-level and relational concepts on the VQS dataset. In terms of accuracy, our model also achieves accuracy comparable to other baselines for visual question answering.

Limitations and Future Directions Q: What purpose does the thing on the person's head serve? A: Shade. VQA [Agrawal et al., 2015] There are also certain limitations of the current framework, which suggest various future directions for scene understanding, language understanding, and reasoning. Consider this example from the VQA dataset: "What purpose does the thing on the person's head serve?", where the answer is shade. It calls for robust recognition systems and, beyond just perception, an understanding of intentionality. It also requires language understanding algorithms that can interpret and represent noisy and complex natural-language questions. Moreover, although the neuro-symbolic framework has shown significant improvement in data efficiency, it is still unclear how to build learning systems that can acquire a new concept from just a few examples. Recognition of in-the-wild images and beyond (e.g., goals). Interpretation of noisy natural language. Concept learning from fewer examples.

Conclusion NS-CL learns visual concepts from language with no annotation. Advantages: high accuracy and data efficiency; transfer of learned concepts to other tasks. To conclude, in this paper we present the Neuro-Symbolic Concept Learner, which learns visual concepts from language with no explicit annotation. It has several advantages, including high accuracy and data efficiency. More importantly, it allows the learned concepts to be easily transferred to other tasks while learning only from VQA datasets.

Conclusion NS-CL learns visual concepts from language with no annotation. Advantages: high accuracy and data efficiency; transfer of learned concepts to other tasks. The principle behind our framework is the explicit grounding of visual concepts via neuro-symbolic reasoning, which facilitates the joint learning of concepts and language. For technical details and more results, please come to our poster, number 32.

Poster #32 Conclusion Project Page NS-CL learns visual concepts from language with no annotation. Principles: explicit visual grounding of concepts with neuro-symbolic reasoning; joint learning of concepts and language with a curriculum. Advantages: high accuracy and data efficiency; transfer of learned concepts to other tasks. The principle behind our framework is the explicit grounding of visual concepts via neuro-symbolic reasoning. To facilitate the joint learning of concepts and language, we also introduce the idea of curriculum concept learning. For technical details and more results, please come to our poster, number 32. Project Page

Idea: Joint Learning of Concepts and Semantic Parsing Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Let's run through a concrete example. We first detect the objects in the scene and extract a visual representation for each object. We also use a semantic parser to translate the input question into a symbolic program with a hierarchical structure. Each concept in the question, such as the word "red", is associated with a vector embedding. A neuro-symbolic reasoning module takes the visual representations, the concept embeddings, and the parsed program as input, and gives the answer to the question. … Semantic Parsing Q: Is there any red object? Filter Exist red Yes

Combinatorial Generalization Q: What's the shape of the big yellow thing? (Training) Q: What size is the cylinder that is left of the cyan thing that is in front of the gray cube? (Test) Another advantage of NS-CL is its better combinatorial generalization. To test this, we manually split the CLEVR dataset into four splits based on the number of objects in the scene and the complexity of the questions. Split A contains only scenes with a small number of objects and simple questions. We train the different models only on split A and test them on the other splits, which include scenes with more objects, questions of higher complexity, or both. All models perform fairly well when tested on split A. However, all baselines fail to generalize to the other splits, resulting in a noticeable drop in QA accuracy. Comparatively, NS-CL performs well on all splits.