Describing Images using Inferred Visual Dependency Representations. Authors: Desmond Elliott & Arjen P. de Vries. Presentation of paper by: Jantre Sanket R. (12309), Senior Undergraduate Student, CSE, IIT Kanpur.

Abstract A Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. This paper presents an approach to training a VDR parsing model without extensive human supervision. The idea is to find the objects mentioned in a given description using a state-of-the-art object detector, and to use the successful detections to produce training data.

Abstract The description of an unseen image is produced by first predicting its VDR over the top-N objects extracted by an object detector, and then generating the text with a template-based language generation model. The performance of this approach is comparable to that of a multimodal deep neural network on images depicting actions.

Aim and Uses This paper focuses on generating literal descriptions, which are rarely found alongside images because they describe what can easily be seen by others. A computer that can automatically generate such literal descriptions, thereby filling the gap left by humans, may increase access to existing image collections or increase information access for visually impaired users. This paper presents the first approach to automatically inferring VDR training examples from natural scenes using only an object detector and an image description.

Evaluation of Approach The authors evaluate their method for inferring VDRs in an image description experiment on the Pascal1K and VLT2K data sets against a bi-directional recurrent neural network (BRNN). The main finding is that the quality of the descriptions generated by this method depends on whether the images depict an action. On the VLT2K data set of people performing actions, the performance of the authors' approach is comparable to the BRNN, but on the more diverse Pascal1K data set the BRNN is substantially better than this method.

Figure showing a detailed overview of the authors' approach.

Spatial Relation In a VDR, the spatial relationship between a pair of objects is encoded with one of the following eight relations: above, below, beside, opposite, on, surrounds, in-front, and behind. Previous work on VDR-based image description relied on training data from expert human annotators, which is expensive and difficult to scale to other data sets.

Spatial Relation An inferred VDR is constructed by searching for the subject and object referred to in the description of an image using an object detector. If both the subject and the object can be found in the image, a VDR is created by attaching the detected subject to the detected object, labelled with the spatial relationship between their bounding boxes. The spatial relationships that can hold between subjects and objects are defined in Table 1 (on the next slide). The set of relationships is reduced from eight to six because of the difficulty of predicting 3D relationships from 2D images.
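As a rough illustration only, a bounding-box rule of this kind could look like the sketch below; the thresholds and geometric tests are assumptions for exposition, not the definitions from the paper's Table 1, and only a subset of the relations is handled.

# Hypothetical sketch: assign a spatial relation to a (subject, object) pair of
# bounding boxes given as (x1, y1, x2, y2) with the origin at the top-left.
# Thresholds are illustrative assumptions, not the paper's rules.

def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def overlap_area(a, b):
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def spatial_relation(subj, obj):
    inter = overlap_area(subj, obj)
    if inter >= 0.9 * area(obj):
        return "surrounds"          # subject's box almost fully contains the object's
    if inter > 0 and subj[3] <= obj[3]:
        return "on"                 # boxes overlap and the subject sits higher
    subj_cy = (subj[1] + subj[3]) / 2.0
    if subj_cy < obj[1]:
        return "above"
    if subj_cy > obj[3]:
        return "below"
    return "beside"

print(spatial_relation((40, 30, 80, 90), (20, 80, 120, 140)))  # -> "on"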

Linguistic Processing The description of an image is processed to extract candidates for the mentioned objects. Candidates are extracted from the nsubj and dobj tokens in the dependency-parsed description. If the parsed description does not contain both a subject and an object, the example is discarded.
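The slides do not name the dependency parser used; purely as an illustration, the candidate extraction could be sketched with spaCy (an assumed toolchain, not the authors' implementation):

# Illustrative sketch with spaCy (assumed toolchain): collect nsubj/dobj tokens
# from the dependency parse of a description as subject/object candidates.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_candidates(description):
    doc = nlp(description)
    subjects = [t for t in doc if t.dep_ == "nsubj"]
    objects = [t for t in doc if t.dep_ == "dobj"]
    if not subjects or not objects:
        return None  # discard examples without both a subject and an object
    return subjects[0], objects[0]

print(extract_candidates("A man is riding a bike down the street."))
# -> (man, bike)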

Some relaxations to the constraint The words in a description that refer to objects in an image are not always within the constrained vocabulary of the object labels in the object detection model. The authors therefore increase the chance of finding objects with two simple back-offs: lemmatizing the token, and transforming the token into its WordNet hypernym parent (e.g., colour is a hypernym of red).
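A hedged sketch of these two back-offs using NLTK's WordNet interface; the authors' exact implementation is not shown in the slides, so the details below are assumptions.

# Illustrative back-offs (assumed implementation): 1) lemmatize the token,
# 2) fall back to the lemmas of its WordNet hypernym parent.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def backoff_candidates(token):
    candidates = [token, lemmatizer.lemmatize(token, pos="n")]
    synsets = wn.synsets(token, pos=wn.NOUN)
    if synsets:
        for hypernym in synsets[0].hypernyms():
            candidates.extend(hypernym.lemma_names())
    return candidates

print(backoff_candidates("bikes"))
# -> ['bikes', 'bike', ...lemmas of the hypernym of the first synset...]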

Generating Descriptions The description of an image is generated using a template-based language generation model designed to exploit the structure encoded in the VDR. The top-N objects are extracted from an image using a pre-trained R-CNN object detector. The objects are parsed into a VDR structure using the VDR parser trained on the automatically inferred training data. Given the VDR over the set of detected objects, all possible descriptions of the image that can be produced in a depth-first traversal of the VDR are generated.
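A toy sketch of this traversal, with the VDR represented as a simple head-to-children map (an assumed data structure for illustration); each head-child edge visited depth-first yields one candidate for the template on the next slide.

# Toy sketch (assumed data structure): depth-first traversal of a VDR over
# detected objects, yielding one (head, child, relation) triple per edge.

def candidate_pairs(vdr, head):
    """Yield (head, child, relation) triples in depth-first order."""
    for child, relation in vdr.get(head, []):
        yield head, child, relation
        yield from candidate_pairs(vdr, child)

# Example VDR: person --on--> bike, bike --beside--> car
vdr = {"person": [("bike", "on")], "bike": [("car", "beside")]}
for head, child, rel in candidate_pairs(vdr, "person"):
    print(head, rel, child)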

Generating Descriptions A description is assigned a score that combines corpus-based evidence and the visual confidence of the objects selected for the description. Descriptions are generated using the following template: DT head is V DT child. In this template, head and child are the labels of the objects that appear in the head and child positions of a specific VDR subtree, and V is a verb determined from a subject-verb-object-spatial-relation model derived from the training-data descriptions.

Generating Descriptions The verb v that fills the V field is chosen from the verbs observed between the head and child objects in the training corpus. If no verbs were observed between a particular object-object pair in the training corpus, v is filled using a back-off that uses the spatial relationship label between the objects in the VDR.
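A minimal sketch of the template filling with this verb back-off; the toy counts, the determiner choice, and the way the verb statistics are stored are assumptions, not the paper's model.

# Illustrative template filling for "DT head is V DT child" (assumed details):
# pick the most frequent verb seen between the object pair in training,
# otherwise back off to the spatial relationship label.

VERB_COUNTS = {("person", "bike"): {"riding": 12, "holding": 3}}  # toy counts

def fill_template(head, child, relation):
    verbs = VERB_COUNTS.get((head, child))
    v = max(verbs, key=verbs.get) if verbs else relation  # spatial-relation back-off
    return f"A {head} is {v} a {child}"                   # determiner simplified to "A"

print(fill_template("person", "bike", "on"))   # -> "A person is riding a bike"
print(fill_template("person", "horse", "beside"))  # -> "A person is beside a horse"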

Generating Descriptions The object detection confidence values, which are not probabilities and can vary substantially between images, are transformed into the range [0, 1] using the sigmoid sgm(conf) = 1 / (1 + exp(−conf)). The final score assigned to a description is then used to rank all of the candidate descriptions, and the highest-scoring description is assigned to the image.
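A sketch of the confidence transform and the ranking step; the exact formula combining corpus evidence with visual confidence is not reproduced in these slides, so the product below is only an assumed combination.

# Sketch: squash raw detector confidences with a sigmoid, then rank candidates.
import math

def sgm(conf):
    return 1.0 / (1.0 + math.exp(-conf))

def best_description(candidates):
    """candidates: list of (sentence, corpus_score, [detector confidences])."""
    def score(candidate):
        _, corpus_score, confs = candidate
        visual = 1.0
        for conf in confs:
            visual *= sgm(conf)
        return corpus_score * visual  # assumed combination, for illustration only
    return max(candidates, key=score)[0]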

Generating Descriptions If the VDR parser does not predict any relationships between the objects in an image, which may happen if none of the objects were observed in the training data, a back-off template is used to generate the description. In this case, the most confidently detected object in the image is used with the following template: A/An object is in the image.

Experimental Results

Examples of descriptions generated using VDR and the BRNN on the VLT2K data set.

Effect of optimization in object detection

Transferring models

Conclusion Two key differences between the BRNN and the authors' approach are the nature of the visual input and the language generation models. VDR is currently restricted to discrete object detector outputs, so the VDR-based approach is unable to describe 30% of the data in the VLT2K data set. This is because the object detection model does not recognize crucial objects for three of the action classes: cameras, books, and telephones.