Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags
Sung Ju Hwang and Kristen Grauman, University of Texas at Austin

Presentation transcript:


Detecting tagged objects: an image tagged with keywords clearly tells us which objects to search for. Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.

Detecting tagged objects: previous work using tagged images focuses on the noun ↔ object correspondence (Duygulu et al., Fergus et al., Berg et al., Vijayanarasimhan & Grauman).

Main Idea: the list of tags on an image may give useful information beyond just which objects are present. Given only the two tag lists below, can you guess where and at what size the mug will appear in each image?
Image 1: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Main Idea: tags as context. In the first image, Mug is named first and larger objects are absent; in the second, Mug is named later in the list and larger objects are present.

Feature: word presence/absence. The presence or absence of other objects, and how many of them there are, affects the scene layout. The presence of smaller objects such as a key, together with the absence of larger objects, hints that the image might be a close-up; the presence of larger objects such as a desk and a bookshelf hints that the image depicts a typical office scene.

Feature: word presence/absence. A plain bag-of-words feature describing word frequency: W_i = count of word i in the tag list, over the vocabulary {Mug, Computer, Screen, Keyboard, Desk, Bookshelf, Poster, Photo, Pen, Post-it, Toothbrush, Key} (larger objects were color-coded blue on the slide, smaller objects red).
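A minimal sketch of how such a bag-of-words tag feature could be computed, assuming a fixed vocabulary (the word list mirrors the office example from the slides):

```python
from collections import Counter

# Fixed vocabulary; W[i] is simply the count of vocabulary word i
# in the image's tag list (0 when the word is absent).
VOCAB = ["mug", "computer", "screen", "keyboard", "desk", "bookshelf",
         "poster", "photo", "pen", "post-it", "toothbrush", "key"]

def word_count_feature(tags):
    counts = Counter(t.lower() for t in tags)
    return [counts[w] for w in VOCAB]

# Office-scene tag list from the slides; "screen" appears twice
tags = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
        "Keyboard", "Screen", "Mug", "Poster", "Computer"]
W = word_count_feature(tags)
# -> [1, 2, 2, 1, 1, 1, 2, 0, 0, 0, 0, 0]
```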

Feature: tag rank. People tag the 'important' objects earlier. If an object is tagged first, there is a high chance that it is the main object: large and centered. If it is tagged later, the object may not be salient: it may be far from the center or small in scale.

Feature: tag rank. Encoded as the percentile of the tag's absolute rank compared against its typical rank: r_i = percentile of the rank for tag i (color-coded on the slide: blue = high relative rank, > 0.6; green = medium, 0.4~0.6; red = low, < 0.4).
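One way such a relative-rank percentile could be computed is sketched below; the rank history and the convention that "early for this word" means a high percentile are assumptions for illustration, not the paper's exact definition:

```python
# r_i: fraction of this word's historical ranks that the current rank
# beats, i.e. high when the object is named unusually early for that word.
def rank_percentile(rank, historical_ranks):
    later = sum(1 for h in historical_ranks if h > rank)
    return later / len(historical_ranks)

# Hypothetical training history: "mug" usually appears around rank 3-8
history = {"mug": [4, 5, 6, 7, 8, 8, 3, 6]}
r_first = rank_percentile(1, history["mug"])  # mug tagged first -> 1.0
r_late = rank_percentile(8, history["mug"])   # mug tagged 8th   -> 0.0
```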

Feature: proximity. People tend to move their eyes to nearby objects while tagging, so objects that are close to each other in the tag list are likely to be close in the image. Image 1: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it. Image 2: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster 10) Computer.

Feature: proximity. Encoded as the inverse of the average rank difference between tag words: P_i,j = inverse rank difference between tags i and j, giving a pairwise matrix over the tagged words (e.g. Mug, Screen, Keyboard, Desk, Bookshelf) whose entries are largest for objects tagged close to each other (color-coded blue on the slide).
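A sketch of that proximity encoding under the slide's definition (inverse of the average rank difference); the handling of repeated or absent words is an assumption for illustration:

```python
from itertools import product

# P[i][j]: inverse of the average absolute rank difference between
# occurrences of tags i and j; 0.0 when either word is absent.
def proximity(tags, word_i, word_j):
    ranks_i = [k + 1 for k, t in enumerate(tags) if t == word_i]
    ranks_j = [k + 1 for k, t in enumerate(tags) if t == word_j]
    if not ranks_i or not ranks_j:
        return 0.0
    diffs = [abs(a - b) for a, b in product(ranks_i, ranks_j)]
    avg = sum(diffs) / len(diffs)
    return 1.0 / avg if avg > 0 else 1.0

tags = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
        "Keyboard", "Screen", "Mug", "Poster", "Computer"]
p_near = proximity(tags, "Keyboard", "Screen")  # adjacent occurrences
p_far = proximity(tags, "Desk", "Mug")          # five positions apart
```

Keyboard (rank 6) sits right between the two Screen occurrences (ranks 5 and 7), so the pair scores the maximum 1.0, while Desk and Mug score only 0.2.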

Overview of the approach. From the image's tags (e.g. Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it) we compute the implicit tag features: word counts W = {1, 0, 2, ..., 3}, tag ranks R = {0.9, 0.5, ..., 0.2}, and proximities P = {0.25, 0.33, ..., 0.1}. These yield tag-based predictions P(X|W), P(X|R), P(X|P) of what to look for and where (modeling P(X|T)), while a sliding-window detector provides the appearance-based prediction P(X|A). The tag-based predictions prime the detector, and together they produce the localization result.

Overview of the approach (continued). Alternatively, the same tag-based predictions P(X|W), P(X|R), P(X|P) can be combined with the appearance-based prediction P(X|A) to modulate the sliding-window detector's scores before producing the localization result.

Approach: modeling P(X|T). We wish to know the conditional PDF of the location and scale of the target object given the tag features: P(X|T), where X = (s, x, y) and T is the tag feature. We model this conditional PDF directly, without computing the joint distribution P(X, T), using a mixture density network (MDN). Example tags shown for the class car: Lamp, Car, Wheel, Light, Window, House, Road, Lightpole, Windows, Building, Man, Barrel, Truck, Boulder. The slide shows the top 30 most likely positions for the class car, with bounding boxes sampled according to P(X|T).
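The MDN's architecture is not specified on the slide; the following is a minimal sketch of what evaluating P(X|T) with a mixture density network looks like, using random (untrained) weights purely for illustration: a small network maps tag features to the weights, means, and variances of a Gaussian mixture over X = (s, x, y):

```python
import numpy as np

rng = np.random.default_rng(0)
D_T, H, K, D_X = 8, 16, 3, 3  # tag-feature dim, hidden units, components, dim of X

# Random stand-in weights; a real MDN would learn these from training data
W1, b1 = rng.normal(size=(H, D_T)), np.zeros(H)
W2, b2 = rng.normal(size=(K * (2 + D_X), H)), np.zeros(K * (2 + D_X))

def mdn_density(x, t):
    """Evaluate p(x | t) under the Gaussian mixture predicted from tag features t."""
    h = np.tanh(W1 @ t + b1)            # hidden layer
    out = W2 @ h + b2
    logits = out[:K]                    # mixture weights (pre-softmax)
    log_sigma = out[K:2 * K]            # per-component isotropic std (log)
    mu = out[2 * K:].reshape(K, D_X)    # component means over (s, x, y)
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    sigma = np.exp(log_sigma)
    dens = 0.0
    for k in range(K):                  # sum of isotropic Gaussian densities
        d2 = np.sum((x - mu[k]) ** 2) / sigma[k] ** 2
        norm = (2 * np.pi * sigma[k] ** 2) ** (D_X / 2)
        dens += pi[k] * np.exp(-0.5 * d2) / norm
    return dens

t = rng.normal(size=D_T)          # implicit tag features for one image
x = np.array([0.3, 0.5, 0.4])     # candidate (scale, x, y)
p = mdn_density(x, t)
```

Sampling bounding boxes "according to P(X|T)", as on the slide, would amount to picking a component k with probability pi[k] and drawing from its Gaussian.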

Approach: priming the detector. How can we make use of the learned distribution P(X|T)? 1) Use it to speed up the detection process; 2) use it to modulate the detection confidence score. For (1): rank candidate windows according to the learned P(X|T), then search only the probable regions and scales, following that rank; unlikely regions and scales are ignored.
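A sketch of the priming step, assuming a toy Gaussian stand-in for the learned P(X|T) (the real model is the MDN) and a dummy detector:

```python
import numpy as np

# Toy stand-in prior over normalized (scale, x, y) windows; the real
# system would evaluate the learned MDN density P(X|T) here.
def prior(window, mu=np.array([0.3, 0.5, 0.5]), sigma=0.2):
    w = np.asarray(window, dtype=float)
    return float(np.exp(-np.sum((w - mu) ** 2) / (2 * sigma ** 2)))

def primed_search(windows, detector_score, budget):
    """Run the detector only on the `budget` windows ranked most probable."""
    ranked = sorted(windows, key=prior, reverse=True)
    return [(w, detector_score(w)) for w in ranked[:budget]]

# Toy grid of 27 candidate (scale, x, y) windows and a dummy detector
windows = [(s, x, y) for s in (0.1, 0.3, 0.6)
                     for x in (0.25, 0.5, 0.75)
                     for y in (0.25, 0.5, 0.75)]
dummy_detector = lambda w: 0.0
results = primed_search(windows, dummy_detector, budget=5)
# Only 5 of the 27 windows are evaluated, starting from the most probable
```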

Approach: modulating the detector. For (2): given the image tags (e.g. Lamp, Car, Wheel, Light), a logistic regression classifier combines the detector's appearance-based prediction P(X|A) with the tag-based predictions P(X|W), P(X|R), and P(X|P); we learn a weight for each prediction.
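A sketch of that combination; the logistic-regression weights and bias below are made-up numbers for illustration, not learned values:

```python
import math

# Hypothetical learned weights for P(X|A), P(X|W), P(X|R), P(X|P)
WEIGHTS = {"A": 2.0, "W": 1.0, "R": 0.8, "P": 0.5}
BIAS = -2.0

def modulated_score(p_a, p_w, p_r, p_p):
    """Logistic combination of the appearance- and tag-based predictions."""
    z = (BIAS + WEIGHTS["A"] * p_a + WEIGHTS["W"] * p_w
         + WEIGHTS["R"] * p_r + WEIGHTS["P"] * p_p)
    return 1.0 / (1.0 + math.exp(-z))

# The same detector response (0.9) ends up more confident when the
# tag-based predictions agree with it than when they disagree:
agree = modulated_score(0.9, 0.8, 0.7, 0.6)
disagree = modulated_score(0.9, 0.1, 0.1, 0.1)
```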

Approach: modulating the detector. Without tags, the prediction is based on the original detector score alone (e.g. 0.9).

Approach: modulating the detector. With tags, the prediction based on the original detector score (e.g. 0.9) is combined with the prediction based on the tag features.


Experiments  We compare:  Detection speed: number of windows to search  Detection accuracy: AUROC and AP  on three methods:  Appearance-only  Appearance + Gist  Appearance + tag features (ours)
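Since AP is one of the reported metrics, here is a sketch of average precision over a ranked detection list in its simple non-interpolated form (note that PASCAL VOC 2007 officially used an 11-point interpolated variant):

```python
# AP over a ranked list: average the precision at the rank of each
# true positive; 0.0 if there are no true positives at all.
def average_precision(labels):
    """labels: detections sorted by confidence, True = correct detection."""
    hits, precisions = 0, []
    for rank, correct in enumerate(labels, start=1):
        if correct:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ap = average_precision([True, False, True, True, False])
# precisions at the hits: 1/1, 2/3, 3/4 -> AP = (1 + 2/3 + 3/4) / 3
```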

Experiments: datasets.
 LabelMe: contains ordered tag lists; we used Dalal and Triggs' HOG detector.
 PASCAL VOC 2007: contains images with high variance in composition; tag lists obtained from anonymous workers on Mechanical Turk; we used Felzenszwalb's LSVM detector.

Dataset                          LabelMe     PASCAL VOC 2007
Number of training/test images   3799 / -    - / 4953
Number of classes                5           20
Number of keywords               -           -
Number of taggers                56          758
Avg. number of tags per image    23          5.5

LabelMe: performance evaluation (modified version of the HOG detector by Dalal and Triggs). Detection is more accurate, because we know which hypotheses to trust most, and faster, because we know where to look first.

Results: LabelMe (HOG vs. HOG+Gist vs. HOG+Tags; example tags: Sky, Buildings, Person, Sidewalk, Car, Road, Window, Wheel, Sign). Gist and tags are likely to predict the same position but different scales; most of the accuracy gain from the tag features comes from accurate scale prediction.

Results: LabelMe (HOG vs. HOG+Gist vs. HOG+Tags; example tags: Desk, Keyboard, Screen, Bookshelf, Mug, CD).

PASCAL VOC 2007: performance evaluation (modified Felzenszwalb LSVM detector). Fewer windows need to be tested to achieve the same detection rate, and accuracy improves by 9.2% over all classes (average precision).

Per-class localization accuracy: significant improvement on bird, boat, cat, dog, and potted plant.

PASCAL VOC 2007 (examples): ours vs. the LSVM baseline. Tags from the example images include: Aeroplane, Building, Smoke, Lamp, Person, Bottle, Dog, Sofa, Painting, Table, Chair, Mirror, Tablecloth, Bowl, Shelf, Food.

PASCAL VOC 2007 (examples, continued). Tags from the example images include: Dog, Floor, Hairclip, Person, Ground, Bench, Scarf, Microphone, Light, Horse, Tree, House, Building, Hurdle, Fence.

PASCAL VOC 2007 (failure cases). Tags from the example images include: Aeroplane, Sky, Building, Shadow, Person, Pole, Sidewalk, Grass, Road, Dog, Clothes, Rope, Plant, Ground, String, Wall, Bottle, Glass, Wine, Table.

Some Observations  Implicit tag features often predict scale better for indoor objects and position better for outdoor objects.  Gist is usually better for the y position, while tags are generally stronger for scale, which agrees with previous experiments using Gist.  In general, the system needs to have seen the target objects in a variety of examples with different contexts.

Conclusion  We showed how to exploit the implicit information present in human tagging behavior to improve object localization in both speed and accuracy.

Future Work  Joint multi-object detection  From tags to natural language sentences  Image retrieval  Using WordNet to group words with similar meanings
