Learning Everything about Anything: Webly-Supervised Visual Concept Learning. Santosh K. Divvala, Ali Farhadi, Carlos Guestrin.


University of Toronto

Overview
A fully automated approach for learning models of the wide range of variations within a concept, such as actions, interactions, and attributes.
Needs to discover the vocabulary of variance.
Vast resources of online books are used.
The data collection and modelling steps are intertwined, so no explicit human supervision is needed to train the models.
Introduces a novel "webly-supervised" approach to discover and model the intra-concept visual variance.

Contributions
i. A novel approach, using no explicit supervision, for (1) discovering a comprehensive vocabulary and (2) training a full-fledged detection model for any concept.
ii. Substantial improvement over existing weakly supervised state-of-the-art methods.
iii. Impressive results for unsupervised action detection.
iv. An open-source online system (http://levan.cs.washington.edu/) that, given any query concept, automatically learns everything visual about it.


Main Motivation
Scalability: exhaustively covering all appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance.

The Problem
Learn all possible appearance variations (everything) of the different concepts for which visual models are to be learned (anything).

Related Work
Taming intra-class variance: previous works use simple annotations based on aspect ratio, viewpoint, and feature-space clustering.
Weakly supervised object localization: (1) existing image-based methods perform poorly when the object is highly cluttered or occupies only a small portion of the image (e.g., bottle); (2) video-based methods perform poorly when there are no motion cues (e.g., tv-monitor); (3) all existing methods train their models on weakly labeled datasets where each training image or video contains the object.
Learning from web images: either the data is not labelled precisely enough, or manual labelling is required.

The Conventional Approach
1) Variance discovery: discover the visual space of variance for the concept. Needs a manually defined vocabulary (relying on benchmark datasets).
2) Gather the training data, i.e., images and annotations.
3) Variance modeling: design a powerful model that can handle all the discovered intra-concept variations. Usually done by divide and conquer, independently of variance discovery: the training data within a category is grouped into smaller sub-categories of manageable visual variance, with the data partitioned by viewpoint, aspect ratio, poselets, visual phrases, taxonomies, and attributes.

Problems with the Conventional Approach
1. How can we gather an exhaustive vocabulary that covers all the visual variance within a concept?
2. How can we discover the vocabulary, gather training data and annotations, and learn models without using human supervision?

Drawbacks of Supervision in Variance Discovery
Extensivity: a manually defined vocabulary may have cultural biases, leading to the creation of very biased datasets.
Specificity: it is tedious to define every concept-specific vocabulary item (e.g., 'rearing' can modify a 'horse' with a very characteristic appearance, but does not extend to 'sheep').

Drawbacks of Supervision in Variance Modeling
Flexibility: it is best for annotations to be modifiable based on the feature representation and the learning algorithm.
Scalability: creating a new dataset, or adding new annotations to an existing one, is a huge task. Also, since the modeling step and the discovery step are done independently, the annotations for modeling the intra-concept variance are often disjoint from those gathered during variance discovery.

The New Technical Approach
The vocabulary of variance is discovered using vast resources of online books (Google Books Ngrams), making it both extensive and concept-specific. The ngram corpus can be queried programmatically, e.g., from Python.
Visual variance is modeled by intertwining the vocabulary discovery and model learning steps.
Builds on recent progress in text-based web image search engines and weakly supervised object localization methods.

Discovering the Vocabulary of Variance
Use the Google Books Ngram English 2012 corpora to obtain all keywords, specifically the dependency gram data, which contains parts-of-speech (POS) tagged head=>modifier dependencies between pairs of words.
POS tags help partially disambiguate the context of the query.
The ngram dependencies selected for a given concept have modifiers tagged as noun, verb, adjective, or adverb.
Frequencies are summed across years, yielding around 5000 ngrams per concept.
Not all ngrams are visually salient, so a simple and fast image-classifier-based pruning method is used.
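The frequency aggregation described above can be sketched as follows. The line format here is a simplified stand-in for the Google Books dependency-gram files (tab-separated ngram, year, match count, volume count), and the POS tag set is the one named on the slide; function and variable names are illustrative.

```python
from collections import defaultdict

# Modifier POS tags kept, per the slide: noun, verb, adjective, adverb.
VALID_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def aggregate_dependency_ngrams(lines, concept):
    """Sum match counts across years for head=>modifier dependencies
    whose head is the query concept and whose modifier carries one of
    the valid POS tags. Each line is assumed to look like:
        'horse=>jumping_VERB<TAB>2004<TAB>317<TAB>52'
    (a simplified stand-in for the dependency-gram file format)."""
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
        head, _, modifier = ngram.partition("=>")
        if head != concept:
            continue
        word, _, pos = modifier.rpartition("_")
        if pos not in VALID_POS:
            continue  # e.g., determiners are dropped
        totals[word] += int(match_count)
    return dict(totals)
```

Candidates with tiny total counts would then be cut before the classifier-based pruning step.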

Classifier-based Pruning
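The pruning idea can be sketched roughly as follows. This is a hypothetical simplification, not the paper's implementation: the paper trains a fast image classifier per ngram; here a nearest-centroid rule stands in for that classifier, and an ngram is kept only if held-out accuracy is well above chance (i.e., its images are visually consistent). The 0.75 threshold is an illustrative choice.

```python
def _centroid(points):
    """Mean vector of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def _dist2(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def is_visually_salient(pos_feats, neg_feats, threshold=0.75):
    """Train on the first half of each set (ngram images as positives,
    pooled images as negatives), evaluate on the second half, and keep
    the ngram only if accuracy clears the threshold."""
    hp, hn = len(pos_feats) // 2, len(neg_feats) // 2
    mu_p = _centroid(pos_feats[:hp])
    mu_n = _centroid(neg_feats[:hn])
    predict = lambda x: _dist2(x, mu_p) < _dist2(x, mu_n)  # True = positive
    correct = sum(predict(x) for x in pos_feats[hp:]) + \
              sum(not predict(x) for x in neg_feats[hn:])
    return correct / (len(pos_feats) - hp + len(neg_feats) - hn) >= threshold
```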

Space of Visual Variance
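One way to read the space-of-visual-variance idea, sketched here under assumptions (the function names, the score representation, and the 0.9 threshold are illustrative, not taken from the paper): represent each ngram by its detector's scores on a shared pool of images, then treat ngrams whose detectors respond alike (e.g., 'grazing horse' and 'eating horse') as candidates for merging into one superngram.

```python
import math

def cosine(a, b):
    """Cosine similarity between two score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_candidates(names, score_vectors, threshold=0.9):
    """Each ngram is a vector of its detector's scores on a common image
    pool; pairs whose vectors are nearly parallel are merge candidates."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(score_vectors[i], score_vectors[j]) >= threshold:
                pairs.append((names[i], names[j]))
    return pairs
```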


Model Learning
1) Images for training the detectors are gathered using image search, with the ngrams constituting the superngram as the query phrases.
2) 200 full-sized, 'full'-color, 'photo'-type images are downloaded per query.
3) Images are resized to a maximum of 500 pixels (preserving the aspect ratio).
4) All near-duplicates are discarded.
5) Images with extreme aspect ratios are discarded.
6) Training a mixture of roots: a separate Deformable Parts Model (DPM) is trained for each ngram, where the visual variance is constrained.
   a) The bounding box is initialized to a sub-image that ignores the image boundaries.
   b) The model is initialized using feature-space clustering.
   c) To deal with appearance variations (e.g., 'jumping horse' contains images of horses jumping at different orientations), a mixture of components initialized with feature-space clustering is used. Since approximately 70% of the components per ngram act as noise sinks, root filters are first trained for each component and the noisy ones are pruned.
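Steps 3 and 5 of the pipeline above can be sketched as two small pure functions. The resize rule follows the slide; the 4:1 aspect-ratio cutoff is an assumed, illustrative value, since the exact cutoff is not stated.

```python
def resized_dims(width, height, max_side=500):
    """Step 3: scale so the longer side is at most max_side pixels,
    preserving the aspect ratio. Never upscales small images."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

def keep_aspect_ratio(width, height, max_ratio=4.0):
    """Step 5: discard images with extreme aspect ratios.
    The 4:1 cutoff here is an illustrative assumption."""
    return max(width, height) / min(width, height) <= max_ratio
```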

Model Learning (cont'd)
7) Pruning noisy components: essentially the same procedure as on the Classifier-based Pruning slide.
8) Merging pruned components: essentially the same procedure as on the Space of Visual Variance slide.

Strengths/Weaknesses of the Approach

The Experimental Evaluation

Strengths and Weaknesses of the Evaluation
Sources of error:
Extent of overlap: e.g., an image of a horse-drawn carriage would have the 'profile horse', the 'horse carriage', and the 'horse head' all detected. The VOC criterion demands a single unique detection box per test instance with at least 50% overlap, so all the other valid detections are declared false positives, due either to poor localization or to multiple detections.
Polysemy: e.g., the car model includes some bus-like car components, whereas the VOC dataset focuses exclusively on typical cars (and moreover discriminates cars from buses).
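For reference, the VOC overlap criterion mentioned above counts a detection as correct only when its intersection-over-union (IoU) with the ground-truth box is at least 0.5, and each instance may be claimed by only one detection. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0]); iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2]); iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def voc_match(detection, ground_truth, min_overlap=0.5):
    """PASCAL VOC criterion: a detection is correct iff IoU >= 0.5."""
    return iou(detection, ground_truth) >= min_overlap
```

This is why a second, well-localized box on the same horse still counts as a false positive: the instance has already been claimed.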

Discussion: Future Directions
Some future directions (suggested by the authors):
Coreference resolution: determining when two textual mentions name the same entity. The Stanford state-of-the-art system does not link 'Mohandas Gandhi' to 'Mahatma Gandhi', or 'Mrs. Gandhi' to 'Indira Gandhi', in: "Indira Gandhi was the third Indian prime minister. Mohandas Gandhi was the leader of Indian nationalism. Mrs. Gandhi was inspired by Mahatma Gandhi's writings." The method presented here finds the appropriate coreferences.
Paraphrasing: for example, they discovered that 'grazing horse' is semantically very similar to 'eating horse'. Their method can produce a semantic similarity score for textual phrases.
Temporal evolution of concepts: the visual variance of a concept can be modeled along the temporal axis, using the year-based frequency information in the ngram corpus to identify peaks over a period of time and then learn models for them. This can help not only in learning the evolution of a concept [26], but also in automatically dating detected instances [35].
Deeper image interpretation: e.g., providing object part boxes ('horse head', 'horse foot', etc.) and annotating object action ('fighting') or type ('jennet horse').
Understanding actions: e.g., 'horse fighting', 'reining horse'.