
1 The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
Good morning everyone. Jiayuan Mao1,2 Chuang Gan3 Pushmeet Kohli4 Josh Tenenbaum1 Jiajun Wu1 1MIT CSAIL 2Tsinghua University 3MIT-IBM Watson AI Lab 4DeepMind

2 Concepts in Visual Reasoning
We study the problem of visual concept learning and visual reasoning. Given this input image, humans can instantly recognize the objects in the scene. We also associate various visual concepts with each object's appearance, including colors, shapes, materials, and so on. CLEVR [Johnson et al., 2017]


4 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Color Red Shape Sphere Material Rubber …… We study the problem of visual concept learning and visual reasoning. Given this input image, humans can instantly recognize the objects in the scene. We also associate various visual concepts with each object's appearance, including colors, shapes, materials, and so on. CLEVR [Johnson et al., 2017]

5 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? Color Red Shape Sphere Material Rubber …… The recognized visual concepts support our reasoning over the image. As an example, to answer the question What’s the shape of the red object, we could associate the word red with the color property of the second object. CLEVR [Johnson et al., 2017]

6 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? Color Red Shape Sphere Material Rubber …… The recognized visual concepts support our reasoning over the image. As an example, to answer the question What’s the shape of the red object, we could associate the word red with the color property of the second object, and answer the question by querying the shape of the second object. CLEVR [Johnson et al., 2017]


8 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? A: Sphere. Color Red Shape Sphere Material Rubber …… The recognized visual concepts support our reasoning over the image. As an example, to answer the question What’s the shape of the red object, we could associate the word red with the color property of the second object, and answer the question by querying the shape of the second object, which gives the answer sphere. CLEVR [Johnson et al., 2017]

9 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? A: Sphere. Image Captioning Color Red Shape Sphere Material Rubber …… Beyond just answering questions, visual concepts support visual reasoning of various forms. CLEVR [Johnson et al., 2017]

10 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? A: Sphere. Image Captioning There is a green cube behind a red sphere. Color Red Shape Sphere Material Rubber …… Beyond just answering questions, visual concepts support visual reasoning of various forms. For example, you could write captions to describe the image you are looking at. CLEVR [Johnson et al., 2017]


12 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? A: Sphere. Image Captioning There is a green cube behind a red sphere. Color Red Shape Sphere Material Rubber …… Instance Retrieval: rubber sphere. We can also retrieve instances from the dataset that are associated with specific concepts of interest. In this case, the rubber spheres. CLEVR [Johnson et al., 2017]
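Instance retrieval over learned concepts amounts to filtering a dataset by attribute predicates. A minimal sketch in Python, using a hypothetical toy dataset (not the paper's code; record fields and helper names are illustrative):

```python
# Hypothetical toy dataset of object records; retrieval keeps the records
# matching every requested concept (here: rubber material and sphere shape).
dataset = [
    {"image": "img_001", "material": "rubber", "shape": "sphere"},
    {"image": "img_001", "material": "metal", "shape": "cube"},
    {"image": "img_002", "material": "rubber", "shape": "cylinder"},
    {"image": "img_003", "material": "rubber", "shape": "sphere"},
]

def retrieve(records, **concepts):
    """Return the records matching all of the concept constraints."""
    return [r for r in records
            if all(r[attr] == value for attr, value in concepts.items())]

hits = retrieve(dataset, material="rubber", shape="sphere")
print([r["image"] for r in hits])  # ['img_001', 'img_003']
```

In NS-CL the predicates are soft (similarity scores rather than exact matches), but the retrieval logic is the same.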

13 Concepts in Visual Reasoning
Color Green Shape Cube Material Metal …… Visual Question Answering Q: What’s the shape of the red object? A: Sphere. Image Captioning Color Red Shape Sphere Material Rubber …… Instance Retrieval We first review the prior art for visual reasoning in the visual question answering setting. CLEVR [Johnson et al., 2017]

14 End-to-End Visual Reasoning
Visual Question Answering Q: What’s the shape of the red object? End-to-End Neural Network A: Sphere. We first review the prior art for visual reasoning in the visual question answering case. A typical flavor of these models uses an end-to-end neural architecture to predict the answer to the question: intuitive, but essentially a black box. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018], MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al. 2018] Image Captioning Instance Retrieval

15 End-to-End Visual Reasoning
Visual Question Answering Q: What’s the shape of the red object? End-to-End Neural Network Concept (e.g., colors, shapes) Reasoning (e.g., count) A: Sphere. A typical flavor uses an end-to-end neural architecture, including an image encoder, a question encoder, and a reasoning module, which predicts the answer to the question. There are two things to be learned in this loop: first, the concepts, such as the meaning of red; second, the ability to reason, such as the functionality of counting objects. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018], MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al. 2018] Image Captioning Instance Retrieval

16 End-to-End Visual Reasoning
Visual Question Answering Q: What’s the shape of the red object? End-to-End Neural Network Entangled Concept (e.g., colors, shapes) Reasoning (e.g., count) A: Sphere. In such an end-to-end design, concept learning and reasoning are entangled with each other. NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018], MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al. 2018] Image Captioning Instance Retrieval

17 End-to-End Visual Reasoning
Visual Question Answering Q: What’s the shape of the red object? End-to-End Neural Network Entangled Concept (e.g., colors, shapes) Reasoning (e.g., count) A: Sphere. Because concept learning and reasoning are entangled, the learned concepts cannot be easily transferred to other tasks, such as image captioning or instance retrieval. Hard to transfer NMN [Andreas et al., 2016] IEP [Johnson et al., 2017] FiLM [Perez et al., 2018], MAC [Hudson & Manning, 2018] Stack-NMN [Hu et al., 2018] TbD [Mascharka et al. 2018] Image Captioning Instance Retrieval

18 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing There have been recent attempts to incorporate concepts explicitly in visual reasoning. The NS-VQA framework, proposed by Yi and colleagues, uses a scene parsing subroutine to extract an abstract representation of images. It first detects all objects in the scene and extracts their attributes, such as color, shape, and material, with off-the-shelf neural networks. Language Q: What’s the shape of the red object?

19 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing ID Color Shape Material 1 Green Cube Metal 1 It first detects all objects in the scene and extracts their attributes, such as color, shape, and material, with off-the-shelf neural networks. Language Q: What’s the shape of the red object?

20 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing ID Color Shape Material 1 Green Cube Metal 2 Red Sphere Rubber 1 2 Meanwhile, it translates natural language questions into logic programs with a hierarchical layout. In this case, the question "what’s the shape of the red object" can be translated into a two-step program. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What’s the shape of the red object?

21 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing ID Color Shape Material 1 Green Cube Metal 2 Red Sphere Rubber 1 2 Symbolic Reasoning The generated program, together with the abstract scene representation, is sent to a symbolic program executor. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What’s the shape of the red object?


23 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing ID Color Shape Material 1 Green Cube Metal 2 Red Sphere Rubber 1 2 Symbolic Reasoning The reasoning module executes the program. It finds the red object in this scene. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What’s the shape of the red object?

24 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing ID Color Shape Material 1 Green Cube Metal 2 Red Sphere Rubber 1 2 Symbolic Reasoning The reasoning module executes the program. It finds the red object in the scene and answers the question by querying the shape of that object. Language Semantic Parsing Filter(Red) Query(Shape) Program Q: What’s the shape of the red object? Sphere
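The symbolic execution described above can be sketched as ordinary functions over the abstract scene table. A minimal illustration with hypothetical helper names (not the NS-VQA implementation):

```python
# Abstract scene representation: one row of attributes per detected object.
scene = [
    {"id": 1, "color": "green", "shape": "cube", "material": "metal"},
    {"id": 2, "color": "red", "shape": "sphere", "material": "rubber"},
]

def filter_concept(objects, attribute, value):
    """Filter(value): keep the objects whose attribute matches the value."""
    return [o for o in objects if o[attribute] == value]

def query(objects, attribute):
    """Query(attribute): return the attribute of the single selected object."""
    assert len(objects) == 1, "Query expects exactly one object"
    return objects[0][attribute]

# Program for "What's the shape of the red object?": Filter(Red) -> Query(Shape)
selected = filter_concept(scene, "color", "red")
answer = query(selected, "shape")
print(answer)  # sphere
```

Each program step is a deterministic operation over the table, which is why NS-VQA needs the table's attribute annotations at training time.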

25 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing Concept Annotation ID Color Shape Material 1 Green Cube Metal 2 Red Sphere Rubber 1 2 Symbolic Reasoning During training, annotations for visual concepts and programs are needed. Language Program Annotation Semantic Parsing Filter(Red) Query(Shape) Program Q: What’s the shape of the red object? Sphere

26 Incorporate Concepts in Visual Reasoning
NS-VQA [Yi et al. 2018] Vision Scene Parsing Concept Annotation? This restricts its application in real-world scenarios, where neither concept annotations for natural images nor program annotations for natural language can be easily obtained. Program Annotation? Language Semantic Parsing Q: Are the animals grazing? VQA [Agrawal et al., 2015]

27 Idea: Joint Learning of Concepts and Semantic Parsing
Vision Scene Parsing Concept In this paper, we present the idea of jointly learning visual concepts and semantic parsing from natural supervision, where no explicit human annotations for concepts or programs are needed. Analogous to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate parsing new sentences. Program Language Semantic Parsing Q: Are the animals grazing? VQA [Agrawal et al., 2015]

28 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation Feature Extraction Let’s run a concrete example. We first detect objects in the scene and extract visual representations for each of them. Q: What’s the shape of the red object?

29 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction We first detect objects in the scene and extract visual representations for each of them. Q: What’s the shape of the red object?

30 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation Obj 1 Feature Extraction Obj 2 2 We first detect objects in the scene and extract visual representations for each of them. We also use a semantic parser to translate the input question into a symbolic program with a hierarchical structure. Q: What’s the shape of the red object?

31 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red We first detect objects in the scene and extract visual representations for each of them. We also use a semantic parser to translate the input question into a symbolic program with a hierarchical structure. Each concept in the question, such as the word “red”, is associated with a vector embedding. A neuro-symbolic reasoning module takes the visual representations, concept embeddings, and parsed program as input and gives the answer to the question. ...... Semantic Parsing Q: What’s the shape of the red object? Filter Query red shape Sphere

32 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Concept Embeddings red ...... Semantic Parsing Q: What’s the shape of the red object? Filter Query red shape

33 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red ...... Semantic Parsing Q: What’s the shape of the red object? Filter Query red shape

34 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red And finally, the latent programs underlying natural language questions. ...... Semantic Parsing Q: What’s the shape of the red object? Program Filter Query red shape

35 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red First, we focus on the learning of visual concepts via the neuro-symbolic reasoning. We assume that a latent program has been recovered by the semantic parsing module. ...... Semantic Parsing Q: What’s the shape of the red object? Program Filter Query red shape

36 Neuro-Symbolic Reasoning
Concept Embeddings red ...... Visual Representation Obj 1 Obj 2 1 2 First, we focus on the learning of visual concepts via the neuro-symbolic reasoning. We assume that a latent program has been recovered by the semantic parsing module. Q: What’s the shape of the red object? Filter Query red shape

37 Neuro-Symbolic Reasoning
Visual Representation 1 Obj 1 Obj 2 2 Concept Embeddings red The perception module learns visual concepts based on the language description of the object being referred to. First, we focus on the learning of visual representations for objects and concept embeddings via the neuro-symbolic reasoning. We assume that a latent program has been recovered by the semantic parsing module. ...... Q: What’s the shape of the red object? Filter Query red shape

38 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 1 Obj 2 2 Concept Embeddings red Color Proj. Starting from the first Filter(red) operation: for each object in the scene, we use a small neural network to project its visual representation into a color space. The concept red also corresponds to a vector embedding in the color space. ...... Color Space Q: What’s the shape of the red object? Filter Query red shape Color(Obj 1)

39 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Concept Embeddings red Color Proj. For each object in the scene, we use a small neural network to project its visual representation into the color space. ...... Color Space Q: What’s the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)

40 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Concept Embeddings red The concept red also corresponds to a vector embedding in the color space. ...... Color Space red Q: What’s the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)

41 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Sim(Color(Obj1), red) = 0.1 ✗ Concept Embeddings red We compute the cosine similarity between the projected colors of the objects and the vector embedding of red. ...... Color Space red Q: What’s the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)

42 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Sim(Color(Obj1), red) = 0.1 ✗ Sim(Color(Obj2), red) = 0.9 ✓ Concept Embeddings red We use the cosine similarity between the projected colors of the objects and the vector embedding of red to classify the objects. In this case, the second object will be classified as red. ...... Color Space red Q: What’s the shape of the red object? Filter Query red shape Color(Obj 1) Color(Obj 2)
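The soft Filter(red) operation described above can be sketched as follows. This is a minimal illustration, not the paper's code: the projection matrix and embeddings are random placeholders standing in for learned parameters, and the sigmoid's shift and temperature are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dim, color_dim = 16, 8
W_color = rng.normal(size=(color_dim, dim))   # projection into the color space
red = rng.normal(size=color_dim)              # concept embedding for "red"
objects = rng.normal(size=(2, dim))           # visual features of Obj 1 and Obj 2

# Soft Filter(red): each object gets a probability of being red, computed from
# cosine similarity in the color space and squashed through a shifted sigmoid.
sims = np.array([cosine(W_color @ o, red) for o in objects])
mask = 1 / (1 + np.exp(-(sims - 0.5) / 0.1))
```

Because the mask is a smooth function of the object features and the concept embedding, gradients can flow through the filter operation during training.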

43 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 Concept Embeddings red Shape Proj. Thus, the second object will be selected as the input to the second Query(shape) operation. Next, we extract the shape representation of the second object and compare it with the vector embeddings of the different shapes in the dataset: cube, sphere, and cylinder. ...... Q: What’s the shape of the red object? Shape Space Filter Query red shape Shape(Obj 2)

44 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder Concept Embeddings red We compare the shape representation of the second object with the vector embeddings of the different shapes in the dataset: cube, sphere, and cylinder. ...... Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder

45 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder Concept Embeddings red Again, we compute the cosine similarity between the shape representation of the second object and the vector embeddings of different shapes to classify the shape of the object. ...... Sim(Shape(Obj2), cube) = 0.1 Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder

46 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder Concept Embeddings red Again, we compute the cosine similarity between the shape representation of the second object and the vector embeddings of different shapes to classify the shape of the object. ...... Sim(Shape(Obj2), sphere) = 0.9 Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder

47 Neuro-Symbolic Reasoning
General Representation Space Visual Representation Obj 1 1 Obj 2 Obj 1 Obj 2 2 sphere cube cylinder Concept Embeddings red Again, we compute the cosine similarity between the shape representation of the second object and the vector embeddings of different shapes to classify the shape of the object. In this case, the object will be classified as a sphere, which answers the question correctly. ...... Sim(Shape(Obj2), cylinder) = 0.1 Q: What’s the shape of the red object? Shape Space Filter Query red shape sphere Shape(Obj 2) cube cylinder
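The Query(shape) operation works the same way: project into the shape space, score every candidate shape embedding by cosine similarity, and normalize the scores into an answer distribution. Again a minimal sketch with random placeholders for learned weights; the temperature is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dim, shape_dim = 16, 8
W_shape = rng.normal(size=(shape_dim, dim))       # projection into the shape space
shapes = {s: rng.normal(size=shape_dim)           # one embedding per shape concept
          for s in ("cube", "sphere", "cylinder")}

obj2 = rng.normal(size=dim)                       # visual feature of the selected object
feat = W_shape @ obj2

# Similarities to all shape embeddings, softmaxed into an answer distribution.
sims = np.array([cosine(feat, v) for v in shapes.values()])
probs = np.exp(sims / 0.1) / np.exp(sims / 0.1).sum()
answer = list(shapes)[int(np.argmax(probs))]
```

The answer is a distribution over shape concepts rather than a hard symbol, which is what makes the whole execution differentiable.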

48 Neuro-Symbolic Reasoning
Back-propagation Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red In the neuro-symbolic reasoning process, all operations are executed based on the similarity between object attributes and concepts in the latent embedding spaces. Thus, the derived answer is fully differentiable with respect to the visual representations of objects as well as the concept embeddings. During training, we use the ground-truth answer as supervision and train by back-propagation. ...... Semantic Parsing Q: What’s the shape of the red object? Filter Query red shape Sphere

49 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 Concept 2 Concept Embeddings red We now look at how the learned visual concepts facilitate parsing new sentences. ...... Semantic Parsing Q: What’s the shape of the red object? Program Filter Query red shape

50 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red ...... During training, for the semantic parsing, we sample multiple candidate programs from the parser. In this example, two candidate programs are sampled. We show the semantic interpretation for each of the programs in natural language. The semantics of the first program is what’s the shape of the red object, which is the correct one. The semantics of the second program is: is there any other thing of the same shape as the red object? The concepts in the question are associated with vector embeddings. The next step is to execute the candidate programs. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

51 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Groundtruth: Sphere ...... Based on the visual representations and concept embeddings Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

52 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Sphere Groundtruth: Sphere ...... Based on the visual representations and concept embeddings, the first program gives us the answer sphere. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

53 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Sphere ✓ Groundtruth: Sphere ...... Based on the visual representations and concept embeddings, the first program gives us the answer sphere, which is correct, compared with the groundtruth. Thus, the first candidate program will be marked as a positive example. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

54 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: Groundtruth: Sphere ...... We also execute the second program. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

55 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: No Groundtruth: Sphere ...... We also execute the second program, which gives us the answer No, since there is only one sphere in the scene. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

56 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Answer: No ✗ Groundtruth: Sphere ...... The answer No is incorrect compared with the groundtruth answer to the original question. Thus, the second program will be marked as a negative example. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?

57 Concepts Facilitate Parsing New Sentences
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red ...... REINFORCE For each parsed program, we use the correctness of its execution result, compared with the ground truth, as the reward, and apply the REINFORCE algorithm to train the semantic parser. Filter Query red shape Semantic Parsing What’s the shape of the red object? Q: What’s the shape of the red object? Filter Same red shape Exist Any other thing of the same shape as the red object?
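The REINFORCE update for the parser can be sketched on this two-candidate example. This is an illustration, not the paper's exact setup: the candidate space, rewards, learning rate, and the use of an exact expectation (rather than sampling) over the tiny program space are all simplifying choices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Two candidate programs for the question. Program 0 executes to "Sphere"
# (matches the ground truth, reward 1); program 1 executes to "No" (reward 0).
rewards = np.array([1.0, 0.0])
logits = np.zeros(2)          # parser's scores for the two candidate programs
lr = 1.0

for _ in range(200):
    probs = softmax(logits)
    baseline = probs @ rewards        # expected reward as a variance-reducing baseline
    grad = np.zeros(2)
    for a in range(2):                # exact expectation over the tiny program space
        # grad of log pi(a) w.r.t. the logits is onehot(a) - probs
        grad += probs[a] * (rewards[a] - baseline) * (np.eye(2)[a] - probs)
    logits += lr * grad

probs = softmax(logits)
# After training, the parser puts most of its mass on the correct program.
```

In practice the parser is a neural sequence model and the gradient is estimated from sampled programs, but the reward signal is exactly this: did the executed program produce the ground-truth answer.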

58 Idea: Joint Learning of Concepts and Semantic Parsing
Vision Scene Parsing Concept 1 2 Neuro-Symbolic Reasoning Language Program This closes the loop of concept learning and semantic parsing. In the NS-CL framework, these two modules are jointly learned by only looking at images and reading paired questions and answers. No annotations for concepts or programs are needed. Semantic Parsing Q: What’s the shape of the red object? Sphere

59 Curriculum Learning
Lesson 1: Object-based questions. Q: What is the shape of the red object? A: Sphere. Lesson 2: Relational questions. Q: Is the green cube behind the red sphere? A: Yes. Lesson 3: Complex scenes, complex questions. Q: Does the big matte object behind the big sphere have the same color as the cylinder left of the small brown cube? A: No. To facilitate the joint learning, we draw inspiration from how children learn visual concepts. Our model starts by learning object-level concepts from simple scenes and simple questions. It then interprets referential expressions based on the learned object-level concepts in order to learn relational concepts. Finally, we extend it to learn from complex scenes and questions.

60 High Accuracy and Data Efficiency
IEP [Johnson et al. 2017] MAC [Hudson & Manning, 2018] TbD [Mascharka et al. 2018] NS-CL [Ours] 7K Images + 70K Questions Our approach brings multiple advantages over prior art. First, on CLEVR, the standard testbed for visual reasoning, our model reaches state-of-the-art QA accuracy compared with other baselines. Moreover, our approach is more data efficient: using only 10% of the training data, our model reaches 98.9 QA accuracy, surpassing all baselines by at least fourteen percent.

61 Application in Real-World Scenarios
VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] The neuro-symbolic concept learning framework can be easily extended to natural images and natural language. Here, we show the execution traces of two sample questions from the VQS dataset by Gan et al. The first question queries the color of the fire hydrant. Q: What color is the fire hydrant? Filter Query fire_hydrant color

62 Application in Real-World Scenarios
VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] The first operation filter fire hydrant selects the fire hydrant in the image. Q: What color is the fire hydrant? Filter Query fire_hydrant color

63 Application in Real-World Scenarios
VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] The second query color operation gives the answer yellow. Q: What color is the fire hydrant? Filter Query fire_hydrant color A: Yellow

64 Application in Real-World Scenarios
VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] In the second example, our goal is to count the number of zebras in the scene. Q: What color is the fire hydrant? Q: How many zebras are there? Filter Query Filter Count fire_hydrant color zebra A: Yellow

65 Application in Real-World Scenarios
VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] To do this, we first filter out all zebras in the scene. Q: What color is the fire hydrant? Q: How many zebras are there? Filter Query Filter Count fire_hydrant color zebra A: Yellow

66 Application in Real-World Scenarios
VQA [Agrawal et al., 2015] VQS [Gan et al., 2017] And return the number of zebras. Q: What color is the fire hydrant? Q: How many zebras are there? Filter Query Filter Count fire_hydrant color zebra A: Yellow A: 3

67 Concept: Person On a Skateboard
Concept: Horse Concept: Person Concept: Person On a Skateboard The learned concepts can be easily transferred to other tasks. Here, we show examples of instance retrieval given object-level concepts and relational concepts on the VQS dataset. In terms of accuracy, our model also achieves visual question answering accuracy comparable to other baselines.

68 Limitations and Future Directions
Q: What purpose does the thing on the person’s head serve? A: Shade. VQA [Agrawal et al., 2015] There are also certain limitations of the current framework, which suggest various future directions for scene understanding, language understanding, and reasoning. Consider this example from the VQA dataset: what purpose does the thing on the person’s head serve? The answer is shade. It calls for robust recognition systems and, beyond just perception, an understanding of intentionality. It also requires language understanding algorithms that can interpret and represent noisy and complex natural language questions. Moreover, although the neuro-symbolic framework has shown significant improvement in data efficiency, it is still unclear how to build learning systems that can acquire a new concept from just a few examples. Recognition of in-the-wild images and beyond (e.g., goals). Interpretation of noisy natural language. Concept learning from fewer examples.

69 Conclusion NSCL learns visual concepts from language with no annotation. Advantages: high accuracy and data efficiency; transfer of learned concepts to other tasks. To conclude, in this paper we present the Neuro-Symbolic Concept Learner, which learns visual concepts from language with no annotation. It has several advantages, including high accuracy and data efficiency. More importantly, it allows the learned concepts to be easily transferred to other tasks while learning only from VQA datasets.

70 Conclusion NSCL learns visual concepts from language with no annotation. Advantages: high accuracy and data efficiency; transfer of learned concepts to other tasks. The principle behind our framework is the explicit grounding of visual concepts via neuro-symbolic reasoning, which facilitates the joint learning of concepts and language. For technical details and more results, please come to our poster, number 32.

71 Poster #32 Conclusion Project Page
NSCL learns visual concepts from language with no annotation. Principles: explicit visual grounding of concepts with neuro-symbolic reasoning; joint learning of concepts and language with a curriculum. Advantages: high accuracy and data efficiency; transfer of learned concepts to other tasks. The principle behind our framework is the explicit grounding of visual concepts via neuro-symbolic reasoning. To facilitate the joint learning of concepts and language, we also introduce the idea of curriculum concept learning. For technical details and more results, please come to our poster, number 32. Project Page

75 Idea: Joint Learning of Concepts and Semantic Parsing
Object Detection Visual Representation 1 Obj 1 Feature Extraction Obj 2 2 Neuro-Symbolic Reasoning Concept Embeddings red Let’s run through a concrete example. We first detect objects in the scene and extract visual representations for each of them. We also use a semantic parser to translate the input question into a symbolic program with a hierarchical structure. Each concept in the question, such as the word “red”, is associated with a vector embedding. A neuro-symbolic reasoning module takes the visual representations, concept embeddings, and the parsed program as input, and outputs the answer to the question. Semantic Parsing Q: Is there any red object? Filter(red) → Exist
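The concept-embedding idea in the pipeline above can be sketched as a soft, differentiable Filter: each object's visual feature is scored against a learned concept embedding, producing a per-object probability mask instead of a hard set. This is a minimal sketch under stated assumptions; the feature dimensionality, random vectors, and the cosine-plus-sigmoid scoring are illustrative choices, not NS-CL's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
obj_features = rng.normal(size=(2, 64))   # stand-in: one feature vector per detected object
concept_red = rng.normal(size=64)         # stand-in: learned embedding for the concept "red"

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def soft_filter(features, concept, temperature=0.1):
    """Soft Filter: per-object probability of matching the concept."""
    scores = np.array([cosine(f, concept) for f in features])
    return 1.0 / (1.0 + np.exp(-scores / temperature))  # sigmoid over scaled similarity

mask = soft_filter(obj_features, concept_red)  # shape (2,), entries in (0, 1)
exist_prob = mask.max()                        # a soft Exist op: max over the mask
```

Because every step is differentiable, the concept embeddings can be trained end-to-end from question-answer supervision alone, which is what lets the model learn concepts without attribute annotations.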

76 Combinatorial Generalization
Q: What’s the shape of the big yellow thing? Training Q: What size is the cylinder that is left of the cyan thing that is in front of the gray cube? Test Another advantage of NS-CL is its better combinatorial generalization. To test this, we manually split the CLEVR dataset into four splits based on the number of objects in the scene and the complexity of the questions. Split A contains only scenes with few objects and simple questions. We train different models only on split A and test them on the other splits, which contain scenes with more objects, more complex questions, or both. All models perform fairly well when tested on split A, but the baselines fail to generalize to the other splits, showing a noticeable drop in QA accuracy. In contrast, NS-CL performs well on all splits.

