Visual Grounding.

Problem Definition Visual Grounding / Referring Expression Comprehension: matching an expression with a detected object, i.e., searching for the target object among a set of detected objects in an image given a natural-language expression.
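The task above can be phrased as a small retrieval loop; `ground` and `expression_score` are illustrative names, not from the paper:

```python
def ground(expression_score, objects):
    """Referring-expression comprehension as retrieval (sketch):
    score every detected candidate object against the expression
    and return the best-matching one. `expression_score` stands in
    for any scoring function, e.g. MAttNet's overall matching score."""
    return max(objects, key=expression_score)
```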

MAttNet: Modular Attention Network for Referring Expression Comprehension

Motivation Previous work uses a simple concatenation of all features as input and a single LSTM to encode/decode the whole expression. Problem: this ignores the variance among different types of referring expressions.

Contribution Presents the first modular network for the general referring expression comprehension task. MAttNet learns to parse expressions automatically through a soft-attention mechanism instead of relying on an external language parser. Applies different visual attention techniques in the subject and relationship modules to focus on the relevant image regions.

Method Overview

Method Language Attention Network

Method Visual Modules

Method Visual Modules Subject Module Given the C3 and C4 features of a candidate, we forward them to two tasks. Attribute prediction: Phrase-guided attentional pooling: given the subject phrase embedding q_subj, compute its attention on each grid location; the attention-weighted sum of the grid features V is the final subject visual representation for the candidate region.
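A minimal sketch of the phrase-guided attentional pooling step, assuming plain dot-product attention (MAttNet applies a learned transform before the softmax, omitted here; `V` is the G x d grid of C3/C4 features):

```python
import numpy as np

def attentional_pooling(V, q_subj):
    """Compute soft attention of the subject phrase embedding q_subj
    over each grid location of V (shape G x d), then return the
    attention-weighted sum as the subject visual representation."""
    logits = V @ q_subj                      # one score per grid cell
    a = np.exp(logits - logits.max())
    a = a / a.sum()                          # softmax attention weights
    return a @ V                             # weighted sum over the grid
```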

Method Visual Modules Matching Function Purpose: measure the similarity between the subject representation and the phrase embedding. Operation: two MLPs transform the visual and phrase representations into a common embedding space; the inner product of the two L2-normalized representations gives their similarity score.
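A sketch of the matching function, with single linear projections standing in for the two MLPs (`Wv`, `Wq` are illustrative weights):

```python
import numpy as np

def match_score(v, q, Wv, Wq):
    """Project the visual feature v and phrase embedding q into a
    common space, L2-normalize both, and return their inner product
    (i.e., cosine similarity in the joint space)."""
    ev = Wv @ v
    eq = Wq @ q
    ev = ev / np.linalg.norm(ev)
    eq = eq / np.linalg.norm(eq)
    return float(ev @ eq)
```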

Method Location Module Location embedding: Relative location embedding: Location representation for the target object is: Matching score:
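The location embedding itself did not survive the slide export; the 5-d feature below (normalized box corners plus relative area) is the one commonly used in referring-expression models, MAttNet included:

```python
def location_feature(box, W, H):
    """box = (x1, y1, x2, y2) in pixels; W, H = image width/height.
    Returns [x1/W, y1/H, x2/W, y2/H, area/(W*H)]."""
    x1, y1, x2, y2 = box
    return [x1 / W, y1 / H, x2 / W, y2 / H,
            (x2 - x1) * (y2 - y1) / (W * H)]
```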

Method Relationship Module We use the average-pooled C4 feature as the appearance feature of each surrounding object, and encode its offset to the candidate object via: The visual representation for each surrounding object is: The matching score:
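A sketch of the relationship score, assuming the candidate keeps the best-matching surrounding object (max over contexts); `Wv` and `Wq` are illustrative projection weights:

```python
import numpy as np

def relationship_score(q_rel, contexts, Wv, Wq):
    """Each surrounding object is represented by its average-pooled
    appearance feature concatenated with its spatial offset to the
    candidate; the module score is the maximum matching score over
    all surrounding objects."""
    eq = Wq @ q_rel
    eq = eq / np.linalg.norm(eq)
    scores = []
    for appearance, offset in contexts:
        v = np.concatenate([appearance, offset])
        ev = Wv @ v
        ev = ev / np.linalg.norm(ev)
        scores.append(float(ev @ eq))
    return max(scores)
```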

Loss Function Overall weighted matching score: Combined hinge loss:
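Both formulas on this slide were exported as images; the sketch below restates them under the paper's scheme, assuming a margin and per-pair scores already computed:

```python
def overall_score(module_scores, module_weights):
    """Weighted sum of the subject, location, and relationship module
    scores; the weights are predicted by the language attention network."""
    return sum(w * s for w, s in zip(module_weights, module_scores))

def combined_hinge_loss(pos, neg_expr, neg_obj, margin=0.1):
    """The matched (object, expression) pair must outscore both a
    mismatched expression for the same object and a mismatched object
    for the same expression by at least `margin`."""
    return (max(0.0, margin + neg_expr - pos)
            + max(0.0, margin + neg_obj - pos))
```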

Experiment