Using Natural Language Processing to Aid Computer Vision


Using Natural Language Processing to Aid Computer Vision Ray Mooney Department of Computer Science University of Texas at Austin

My Initial Motivation for Exploring Grounded Language Learning I became frustrated having to manually annotate sentences with formal meaning representations (MRs) to train semantic parsers. I hoped to automatically extract MRs from the perceptual context of situated language. However, computer vision is generally not capable enough yet to provide the MRs needed for semantic parsing. Therefore, I focused on grounded language learning in virtual worlds to circumvent computer vision.

Computer Vision is Hard Both language understanding and vision are “AI complete” problems. However, in many ways, I think vision is the harder problem.

Text Really Helps Language Learning Both speech and visual perception begin with a complex analog signal (sound vs. light waves). However, for language we have orthographic text that mediates between the analog signal and the semantic representation. Training on large, easily available corpora of text dramatically benefits learning for language. No such pervasive, data-rich intermediate representation exists for vision.

Dimensionality Language vs. Vision Language input is fundamentally 1-d, i.e., linear sequences of sounds, phonemes, or words. Vision is fundamentally a 2-d (arrays of pixels), 2½-d (depth maps), or 3-d (world model) problem. Therefore, the "curse of dimensionality" makes vision a much harder problem.

Biological Hardware Vision vs. Language Humans arguably have a greater amount of neural hardware dedicated to vision than to language. Therefore, matching human performance on vision may be computationally more complex.

Biological Evolution Vision vs. Language Vision has a much longer "evolutionary history" than language. First mammals: 200-250 million years ago (MYA). First human language: Homo habilis, 2.3 MYA. Therefore, a much more complex neural system could have evolved for vision compared to language.

Language Helping Vision Now I feel sorry for my poor computer vision colleagues confronting an even harder problem. So I’d like to help them, instead of expecting them to help me. As Jerry Maguire said: Help me, help you!

NL-Acquired Knowledge for Vision Many types of linguistic knowledge or knowledge “extracted” from text can potentially aid computer vision.

Textual Retrieval of Images Most existing image search is based on retrieving images with relevant nearby text. A variety of computer vision projects have used text-based image search to automatically gather (noisy) training sets for object recognition. There are various ways of dealing with the noise in the resulting data: multiple-instance learning, or cleaning up the results with crowd-sourcing.
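A minimal sketch of one such noise-handling strategy, in the spirit of multiple-instance learning (not the specific method of any of the projects mentioned): each query's result set is treated as a partly relevant "bag," and we alternate between training a classifier and keeping only the highest-scoring images in each bag. Feature vectors and the negative image pool are assumed inputs.

```python
# Hedged sketch: iterative filtering of noisy web-search image sets.
import numpy as np
from sklearn.linear_model import LogisticRegression

def filter_noisy_bags(bags, negatives, keep_frac=0.5, iters=3):
    """bags: list of (n_i, d) arrays, one per search query (noisy positives).
    negatives: (m, d) array of background/negative image features."""
    kept = [b for b in bags]                      # start by trusting everything
    clf = None
    for _ in range(iters):
        X = np.vstack(kept + [negatives])
        y = np.concatenate([np.ones(sum(len(b) for b in kept)),
                            np.zeros(len(negatives))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        # Re-select the most confidently positive instances in each bag.
        kept = []
        for b in bags:
            scores = clf.predict_proba(b)[:, 1]
            k = max(1, int(keep_frac * len(b)))
            kept.append(b[np.argsort(-scores)[:k]])
    return clf, kept
```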

Lexical Ontologies ImageNet built a large-scale hierarchical database of images by using the WordNet ontology (Deng et al., CVPR-09). The "Hedging Your Bets" method uses the ImageNet hierarchy to classify an object as specifically as possible while maintaining high accuracy (Deng et al., CVPR-12).
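A simplified sketch of the hedging idea (the published method maximizes an information-gain reward subject to an accuracy guarantee; here we simply return the deepest node in the hierarchy whose accumulated probability clears a confidence threshold). The `parent` mapping and the leaf-class probabilities are assumed inputs.

```python
# Hedged sketch: back off to an ancestor class when leaf predictions are uncertain.
def hedged_prediction(leaf_probs, parent, threshold=0.9):
    """leaf_probs: dict leaf_name -> probability (sums to 1).
    parent: dict node -> parent node (the root maps to None)."""
    # Propagate probability mass from leaves up to every ancestor.
    node_prob = dict(leaf_probs)
    for leaf, p in leaf_probs.items():
        node = parent.get(leaf)
        while node is not None:
            node_prob[node] = node_prob.get(node, 0.0) + p
            node = parent.get(node)

    def depth(node):
        d = 0
        while parent.get(node) is not None:
            node, d = parent[node], d + 1
        return d

    # Most specific (deepest) node that is still confident enough.
    confident = [n for n, p in node_prob.items() if p >= threshold]
    return max(confident, key=depth) if confident else None
```

For example, with leaf probabilities {husky: 0.5, beagle: 0.35, tabby: 0.15} and a threshold of 0.8, the sketch backs off from the individual breeds to the confident prediction "dog".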

Subject-Verb-Object (SVO) Correlations We developed methods for using probability estimates of SVO combinations to aid activity recognition (Motwani & Mooney, ECAI-12) and sentential NL description (Krishnamoorthy et al., AAAI-13) for YouTube videos. SVO probabilities are estimated using a smoothed trigram “language model” trained on several large dependency-parsed corpora. Similar statistics can also be used to predict verbs for describing images based on the detected objects (Yang et al., EMNLP 2011).
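A minimal sketch of estimating SVO probabilities from parsed text, in the spirit of a smoothed trigram model: P(object | subject, verb) is backed off toward P(object | verb) and P(object) when the full triple is unseen. The triple counts would come from dependency-parsed corpora; the interpolation weights here are illustrative, not the values used in the cited work.

```python
# Hedged sketch: interpolated SVO "language model" from parsed triples.
from collections import Counter

class SVOModel:
    def __init__(self, triples):
        triples = list(triples)                   # (subject, verb, object) tuples
        self.svo = Counter(triples)
        self.sv = Counter((s, v) for s, v, _ in triples)
        self.vo = Counter((v, o) for _, v, o in triples)
        self.v = Counter(v for _, v, _ in triples)
        self.o = Counter(o for _, _, o in triples)
        self.total = sum(self.o.values())

    def prob(self, s, v, o, lambdas=(0.6, 0.3, 0.1)):
        p_svo = self.svo[(s, v, o)] / self.sv[(s, v)] if self.sv[(s, v)] else 0.0
        p_vo = self.vo[(v, o)] / self.v[v] if self.v[v] else 0.0
        p_o = self.o[o] / self.total if self.total else 0.0
        l1, l2, l3 = lambdas
        return l1 * p_svo + l2 * p_vo + l3 * p_o
```

In use, candidate (subject, verb, object) descriptions of a video clip would be reranked by `prob`, so that a plausible triple like (person, ride, horse) beats an implausible one like (person, ride, sandwich).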

Object-Activity-Scene Correlations Can use co-occurrence statistics mined from text for objects, verbs, and scenes to improve joint recognition of these from images or videos. Such statistics have been used to help predict scenes from objects and verbs (Yang et al. EMNLP 2011).

Object-Object Correlations Object-object co-occurrence statistics mined from text could provide knowledge that aids joint recognition of multiple objects in a scene. An "elephant" is more likely to be seen in the same image as a "giraffe" than in the same image as a "penguin".

Scripts: Activity-Activity Correlations and Orderings Knowledge of stereotypical sequences of actions/events can be mined from text (Chambers & Jurafsky, 2008). Such knowledge could be used to improve joint recognition of sequences of activities from video. Opening a bottle is typically followed by drinking or pouring.
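A sketch of using text-mined event-ordering statistics, in the spirit of narrative event chains, as a prior over activity sequences recognized in video: a candidate sequence is scored by its per-step detector confidence plus a smoothed transition score for each adjacent pair of activities. The bigram and unigram event counts are assumed to come from event pairs extracted from parsed text; the weighting is illustrative.

```python
# Hedged sketch: rescoring candidate activity sequences with a text-derived
# event-transition prior.
import math

def sequence_score(activities, detector_scores, bigram_counts, unigram_counts,
                   alpha=1.0, weight=0.5):
    """activities: e.g. ["open_bottle", "pour", "drink"];
    detector_scores: per-step log-probabilities from the vision system."""
    vocab = max(len(unigram_counts), 1)
    score = sum(detector_scores)
    for prev, nxt in zip(activities, activities[1:]):
        # Laplace-smoothed P(next event | previous event) estimated from text.
        p = ((bigram_counts.get((prev, nxt), 0) + alpha) /
             (unigram_counts.get(prev, 0) + alpha * vocab))
        score += weight * math.log(p)
    return score
```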

Transferring Algorithmic Techniques from Language to Vision Text classification using "bag of words" becomes image classification using a "bag of visual words". Linear CRFs for sequence labeling of text (e.g., POS tagging) become 2-d mesh CRFs for pixel classification in images. HMMs for speech recognition become HMMs for activity recognition in videos.
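A minimal sketch of the first of these transfers: local descriptors (e.g., SIFT-like vectors) are quantized with k-means into a visual vocabulary, each image becomes a normalized histogram of visual words, and an ordinary text-classification-style linear classifier is trained on those histograms. Descriptor extraction itself is assumed given, and the vocabulary size is illustrative.

```python
# Hedged sketch: bag-of-visual-words image classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_vocabulary(descriptor_sets, k=500):
    """descriptor_sets: list of (n_i, d) arrays, one per training image."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(all_desc)

def bovw_histogram(descriptors, vocab):
    words = vocab.predict(descriptors)            # assign each descriptor to a cluster
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)            # L1-normalize, like term frequencies

def train_image_classifier(descriptor_sets, labels, k=500):
    vocab = build_vocabulary(descriptor_sets, k)
    X = np.array([bovw_histogram(d, vocab) for d in descriptor_sets])
    return vocab, LinearSVC().fit(X, labels)
```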

Other Ways Language can Help Vision? Your Idea Here

Conclusions It's easier to use language processing to help computer vision than the other way around. For a variety of reasons, computer vision is harder than NLP. Knowledge about or from language can be used to help vision in various ways. Help me, help you, help me!

Recent Spate of Workshops on Grounded Language
NSF 2011 Workshop on Language and Vision
AAAI-2011 Workshop on Language-Action Tools for Cognitive Artificial Agents: Integrating Vision, Action and Language
NIPS-2011 Workshop on Integrating Language and Vision
NAACL-2012 Workshop on Semantic Interpretation in an Actionable Context
AAAI-2012 Workshop on Grounding Language for Physical Systems
NAACL-2013 Workshop on Vision and Language
CVPR-2013 Workshop on Language for Vision
UW-MSR 2013 Summer Institute on Understanding Situated Language