Visual Grounding.

Problem Definition Visual Grounding / Referring Expression Comprehension: matching an expression with a detected object, i.e., searching for the target object among a set of detected objects in an image given a natural-language expression.
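The task above can be phrased as a small retrieval loop; `ground` and `expression_score` are illustrative names, not from the paper:

```python
def ground(expression_score, objects):
    """Referring-expression comprehension as retrieval (sketch):
    score every detected candidate object against the expression
    and return the best-matching one. `expression_score` stands in
    for any scoring function, e.g. MAttNet's overall matching score."""
    return max(objects, key=expression_score)
```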

MAttNet: Modular Attention Network for Referring Expression Comprehension

Motivation Previous work uses a simple concatenation of all features as input and a single LSTM to encode/decode the whole expression. Problem: this ignores the variance among different types of referring expressions.

Contribution Presents the first modular network for the general referring expression comprehension task. MAttNet learns to parse expressions automatically through a soft-attention mechanism instead of relying on an external language parser. Applies different visual attention techniques in the subject and relationship modules to focus on the relevant image regions.

Method Overview

Method Language Attention Network

Method Visual Modules

Method Visual Modules Subject Module Given the C3 and C4 features of a candidate, we forward them to two tasks. Attribute prediction: Phrase-guided attentional pooling: given the subject phrase embedding q_subj, compute its attention on each grid location; the attention-weighted sum of the grid features V is the final subject visual representation for the candidate region.
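A minimal sketch of the phrase-guided attentional pooling step, assuming plain dot-product attention (MAttNet applies a learned transform before the softmax, omitted here; `V` is the G x d grid of C3/C4 features):

```python
import numpy as np

def attentional_pooling(V, q_subj):
    """Compute soft attention of the subject phrase embedding q_subj
    over each grid location of V (shape G x d), then return the
    attention-weighted sum as the subject visual representation."""
    logits = V @ q_subj                      # one score per grid cell
    a = np.exp(logits - logits.max())
    a = a / a.sum()                          # softmax attention weights
    return a @ V                             # weighted sum over the grid
```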

Method Visual Modules Matching Function Purpose: measure the similarity between the subject representation and the phrase embedding. Operation: two MLPs transform the visual and phrase representations into a common embedding space; the inner product of the two L2-normalized representations gives their similarity score.
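A sketch of the matching function, with single linear projections standing in for the two MLPs (`Wv`, `Wq` are illustrative weights):

```python
import numpy as np

def match_score(v, q, Wv, Wq):
    """Project the visual feature v and phrase embedding q into a
    common space, L2-normalize both, and return their inner product
    (i.e., cosine similarity in the joint space)."""
    ev = Wv @ v
    eq = Wq @ q
    ev = ev / np.linalg.norm(ev)
    eq = eq / np.linalg.norm(eq)
    return float(ev @ eq)
```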

Method Location Module Location embedding: Relative location embedding: Location representation for the target object is: Matching score:
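The location embedding itself did not survive the slide export; the 5-d feature below (normalized box corners plus relative area) is the one commonly used in referring-expression models, MAttNet included:

```python
def location_feature(box, W, H):
    """box = (x1, y1, x2, y2) in pixels; W, H = image width/height.
    Returns [x1/W, y1/H, x2/W, y2/H, area/(W*H)]."""
    x1, y1, x2, y2 = box
    return [x1 / W, y1 / H, x2 / W, y2 / H,
            (x2 - x1) * (y2 - y1) / (W * H)]
```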

Method Relationship Module We use the average-pooled C4 feature as the appearance feature of each surrounding object, and encode its offset to the candidate object via: The visual representation for each surrounding object is: The matching score:
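A sketch of the relationship score, assuming the candidate keeps the best-matching surrounding object (max over contexts); `Wv` and `Wq` are illustrative projection weights:

```python
import numpy as np

def relationship_score(q_rel, contexts, Wv, Wq):
    """Each surrounding object is represented by its average-pooled
    appearance feature concatenated with its spatial offset to the
    candidate; the module score is the maximum matching score over
    all surrounding objects."""
    eq = Wq @ q_rel
    eq = eq / np.linalg.norm(eq)
    scores = []
    for appearance, offset in contexts:
        v = np.concatenate([appearance, offset])
        ev = Wv @ v
        ev = ev / np.linalg.norm(ev)
        scores.append(float(ev @ eq))
    return max(scores)
```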

Loss Function Overall weighted matching score: Combined hinge loss:
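Both formulas on this slide were exported as images; the sketch below restates them under the paper's scheme, assuming a margin and per-pair scores already computed:

```python
def overall_score(module_scores, module_weights):
    """Weighted sum of the subject, location, and relationship module
    scores; the weights are predicted by the language attention network."""
    return sum(w * s for w, s in zip(module_weights, module_scores))

def combined_hinge_loss(pos, neg_expr, neg_obj, margin=0.1):
    """The matched (object, expression) pair must outscore both a
    mismatched expression for the same object and a mismatched object
    for the same expression by at least `margin`."""
    return (max(0.0, margin + neg_expr - pos)
            + max(0.0, margin + neg_obj - pos))
```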

Experiment