1
Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin
2
Detecting tagged objects. Images tagged with keywords clearly tell us which objects to search for. Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.
3
Detecting tagged objects. Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002, Fergus et al. 2005, Berg et al. 2004, Vijayanarasimhan & Grauman 2008.
4
Main Idea. The list of tags on an image may give useful information beyond just what objects are present. Tags for image 1: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it. Tags for image 2: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer. Can you guess where, and at what size, the mug will appear in each image?
5
Main Idea: tags as context. In the first list, the mug is named first and larger objects are absent; in the second, the mug is named later and larger objects are present.
6
Feature: word presence/absence. The presence or absence of other objects, and how many of them there are, affects the scene layout. The presence of smaller objects such as a key, together with the absence of larger objects, hints that the image might be a close-up scene. The presence of larger objects such as a desk and a bookshelf hints that the image depicts a typical office scene.
7
Feature: word presence/absence. A plain bag-of-words feature describing word frequency; W_i = frequency of word i. (Blue: larger objects; red: smaller objects.)

Word   Mug  Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen  Post-it  Toothbrush  Key
W1      1      0        0        1       0       0         0       1     1      1         1        1
W2      1      2        2        1       1       1         2       0     0      0         0        0
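To make the feature concrete, here is a minimal Python sketch of the word-count feature W, using the vocabulary and the first example tag list from the table above; the lowercase vocabulary spelling is an assumption.

```python
# Minimal sketch of the bag-of-words feature W: for a fixed vocabulary,
# W_i counts how many times word i appears in the image's tag list.
from collections import Counter

VOCAB = ["mug", "computer", "screen", "keyboard", "desk", "bookshelf",
         "poster", "photo", "pen", "post-it", "toothbrush", "key"]

def word_feature(tags):
    """Return the count vector over VOCAB for one image's tag list."""
    counts = Counter(t.lower() for t in tags)
    return [counts[w] for w in VOCAB]

tags1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
print(word_feature(tags1))  # [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1] (row W1)
```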
8
Feature: tag rank. People tag the 'important' objects earlier. If an object is tagged first, there is a high chance that it is the main object: large and centered. If an object is tagged later, it might not be salient: it may be far from the center or small in scale.
9
Feature: tag rank. The percentile of the absolute rank of the tag compared against its typical rank; r_i = percentile of the rank for tag i. (Blue: high relative rank, >0.6; green: medium relative rank, 0.4–0.6; red: low relative rank, <0.4.)

Word   Mug   Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
R1     0.80     0         0      0.51     0        0        0     0.28   0.72   0.82       0        0.90
R2     0.23   0.62      0.21     0.13    0.48     0.61     0.41    0      0      0         0         0
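Below is a hedged sketch of one plausible way to compute this percentile: compare a tag's rank in this image against the empirical ranks the same tag received in training images. The exact convention (fraction of training occurrences tagged later) is an assumption, chosen so that earlier-than-usual tags score high, as in the table.

```python
# Sketch of the tag-rank feature r_i: percentile of this occurrence's rank
# within the distribution of ranks the same tag had in training images.
import bisect

def rank_percentile(rank, training_ranks):
    """Fraction of training occurrences where the tag appeared at a later rank."""
    sorted_ranks = sorted(training_ranks)
    later = len(sorted_ranks) - bisect.bisect_right(sorted_ranks, rank)
    return later / len(sorted_ranks)

# Hypothetical training data: ranks at which "mug" was tagged in training images.
mug_training_ranks = [1, 2, 3, 4, 5]
print(rank_percentile(1, mug_training_ranks))  # 0.8: tagged earlier than usual
```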
10
Feature: proximity. People tend to move their eyes to nearby objects, so objects that are close to each other in the tag list are likely to be close in the image. Tags for image 1: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it. Tags for image 2: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster 10) Computer.
11
Feature: proximity. Encoded as the inverse of the average rank difference between tag words; P_ij = inverse rank difference between tags i and j. (Blue: objects close to each other.)

Image 1:
Word        Mug   Screen  Keyboard  Desk  Bookshelf
Mug          1      0       0.5      0       0
Screen       -      0        0       0       0
Keyboard     -      -        1       0       0
Desk         -      -        -       0       0
Bookshelf    -      -        -       -       0

Image 2:
Word        Mug   Screen  Keyboard  Desk  Bookshelf
Mug          1      1       0.5     0.2    0.25
Screen       -      1        1      0.33   0.5
Keyboard     -      -        1      0.33   0.5
Desk         -      -        -       1      1
Bookshelf    -      -        -       -      1
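A minimal sketch of the pairwise proximity feature, following the tables above: 1 on the diagonal when a word is present, the inverse rank difference for distinct words, and 0 when either word is absent. Averaging over repeated occurrences is an assumption (the tables are consistent with it for most entries).

```python
# Sketch of the proximity feature P_ij for one image's ordered tag list.
from itertools import product

def proximity(tags, word_i, word_j):
    """Inverse of the average rank difference between two tagged words."""
    ranks_i = [r for r, t in enumerate(tags, 1) if t == word_i]
    ranks_j = [r for r, t in enumerate(tags, 1) if t == word_j]
    if not ranks_i or not ranks_j:
        return 0.0          # absent words contribute no proximity cue
    if word_i == word_j:
        return 1.0          # diagonal entry for a present word
    diffs = [abs(a - b) for a, b in product(ranks_i, ranks_j)]
    return 1.0 / (sum(diffs) / len(diffs))

tags2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
         "Keyboard", "Screen", "Mug", "Poster", "Computer"]
print(proximity(tags2, "Mug", "Desk"))  # |8 - 3| = 5 -> 0.2, as in the table
```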
12
Overview of the approach. Given an image and its tags (e.g., Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), we compute the implicit tag features W = {1, 0, 2, …, 3}, R = {0.9, 0.5, …, 0.2}, and P = {0.25, 0.33, …, 0.1}, and model P(X|W), P(X|R), and P(X|P); a sliding-window detector provides the appearance-based prediction P(X|A). First use: priming the detector, telling it what to look for and where, to produce the localization result.
13
Overview of the approach (continued). Second use: modulating the detector, combining the tag-based predictions P(X|W), P(X|R), P(X|P) with the appearance-based prediction P(X|A) to re-score detection hypotheses (e.g., from 0.24 to 0.81) and produce the localization result.
14
Approach: modeling P(X|T). We wish to know the conditional PDF of the location and scale of the target object given the tag features: P(X|T), where X = (s, x, y) and T is the tag feature. We model this conditional PDF directly, without calculating the joint distribution P(X,T), using a mixture density network (MDN). Figure: the top 30 most likely positions for class car, with bounding boxes sampled according to P(X|T).
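The slide names a mixture density network but gives no architecture; the sketch below, assuming PyTorch, shows the standard MDN construction, mapping the tag feature T to the parameters of a Gaussian mixture over X = (s, x, y). The layer sizes, number of components, and isotropic covariance are illustrative choices, not the paper's settings.

```python
# Sketch of a mixture density network for P(X|T): the network predicts
# mixing weights, means, and spreads of a Gaussian mixture over X.
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, t_dim, n_components=5, x_dim=3, hidden=64):
        super().__init__()
        self.k, self.d = n_components, x_dim
        self.body = nn.Sequential(nn.Linear(t_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_components)           # mixing weights
        self.mu = nn.Linear(hidden, n_components * x_dim)   # component means
        self.log_sigma = nn.Linear(hidden, n_components)    # isotropic spreads

    def forward(self, t):
        h = self.body(t)
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.k, self.d)
        sigma = torch.exp(self.log_sigma(h))
        return pi, mu, sigma

def mdn_nll(pi, mu, sigma, x):
    """Negative log-likelihood of X = (s, x, y) under the predicted mixture."""
    dist = torch.distributions.Normal(mu, sigma.unsqueeze(-1))
    log_comp = dist.log_prob(x.unsqueeze(1)).sum(-1)        # (batch, K)
    return -torch.logsumexp(torch.log(pi) + log_comp, dim=-1).mean()
```

Training minimizes `mdn_nll` over (tag feature, ground-truth box) pairs; sampling the fitted mixture gives boxes like those in the figure.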
15
Approach: priming the detector. How can we make use of this learned distribution P(X|T)? 1) Use it to speed up the detection process. 2) Use it to modulate the detection confidence score. For priming: first rank candidate windows by the learned P(X|T), then search only the probable regions and scales, following that rank. (Figure: region to search vs. ignored; most probable scale vs. unlikely scale.)
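A rough sketch of the priming step under stated assumptions: `prior_pdf` stands in for the learned MDN density P(X|T) and `detector_score` for the sliding-window classifier; only the top-ranked fraction of windows under the prior is ever evaluated.

```python
# Sketch of priming: rank all candidate windows under P(X|T), then run the
# expensive appearance detector only on the most probable ones, in rank order.
import numpy as np

def primed_detection(windows, prior_pdf, detector_score, budget=0.2):
    """windows: list of (x, y, scale); evaluate only the top `budget` fraction."""
    prior = np.array([prior_pdf(w) for w in windows])
    order = np.argsort(-prior)                  # most probable windows first
    keep = order[: max(1, int(budget * len(windows)))]
    scores = {i: detector_score(windows[i]) for i in keep}  # rest are ignored
    best = max(scores, key=scores.get)
    return windows[best], scores[best]
```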
16
Approach: modulating the detector. The second use of P(X|T) is to modulate the detection confidence score. A logistic regression classifier learns weights for each prediction: the appearance-based P(X|A) from the detector and the tag-based P(X|W), P(X|R), and P(X|P) from the image tags (e.g., Lamp, Car, Wheel, Light).
17
Approach: modulating the detector. Example: predictions based on the original detector score alone are 0.7, 0.8, and 0.9.
18
Approach: modulating the detector. The tag features give their own predictions for the same hypotheses: 0.3, 0.9, and 0.2.
19
Approach: modulating the detector. Combining the two yields the final scores 0.63, 0.24, and 0.18; in this illustration each combined score is the product of a detector score and a tag-based score (e.g., 0.9 × 0.2 = 0.18), so a confident detection in an implausible tag context is demoted.
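A small sketch of the modulation step, assuming scikit-learn for the logistic regression the slides describe; the feature rows, labels, and scores are made-up illustrations, with one row per candidate window holding [P(X|A), P(X|W), P(X|R), P(X|P)].

```python
# Sketch of modulating the detector: a logistic regression learns weights
# for the appearance-based and tag-based predictions of each window.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training rows: [P(X|A), P(X|W), P(X|R), P(X|P)] per window;
# labels mark whether the window was a correct localization.
X_train = np.array([[0.9, 0.2, 0.3, 0.1],
                    [0.7, 0.9, 0.8, 0.7],
                    [0.3, 0.1, 0.2, 0.2],
                    [0.8, 0.7, 0.9, 0.6]])
y_train = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X_train, y_train)
modulated = clf.predict_proba([[0.9, 0.2, 0.2, 0.3]])[0, 1]
print(modulated)  # a confident detector score demoted by weak tag support
```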
20
Experiments. We compare detection speed (the number of windows that must be searched) and detection accuracy (AUROC and average precision) across three methods: appearance only, appearance + Gist, and appearance + tag features (ours).
21
Experiments: datasets. LabelMe contains ordered tag lists; we used Dalal & Triggs' HOG detector. PASCAL VOC 2007 contains images with high variance in composition; its tag lists were obtained from anonymous workers on Mechanical Turk, and we used Felzenszwalb's LSVM detector.

Dataset                          LabelMe      PASCAL VOC 2007
Number of training/test images   3799/2553    5011/4953
Number of classes                5            20
Number of keywords               209          399
Number of taggers                56           758
Avg. number of tags per image    23           5.5
22
LabelMe: performance evaluation. Using a modified version of the HOG detector by Dalal and Triggs, we obtain more accurate detection, because we know which hypotheses to trust most, and faster detection, because we know where to look first.
23
Results: LabelMe. Panels compare HOG, HOG+Gist, and HOG+Tags (example tags: Sky, Buildings, Person, Sidewalk, Car, Road; Car, Window, Road, Window, Sky, Wheel, Sign). Gist and tags are likely to predict the same position but different scales; most of the accuracy gain from the tag features comes from accurate scale prediction.
24
Results: LabelMe (continued). Panels compare HOG, HOG+Gist, and HOG+Tags (example tags: Desk, Keyboard, Screen, Bookshelf; Desk, Keyboard, Screen, Mug; Keyboard, Screen, CD).
25
PASCAL VOC 2007: performance evaluation. With a modified version of Felzenszwalb's LSVM detector, we need to test fewer windows to achieve the same detection rate, and obtain a 9.2% improvement in accuracy (average precision) over all classes.
26
Per-class localization accuracy: significant improvements on bird, boat, cat, dog, and potted plant.
27
PASCAL VOC 2007 (examples). Panels compare ours against the LSVM baseline. Example tags: Aeroplane, Building / Aeroplane, Smoke / Aeroplane, Lamp / Person, Bottle, Dog, Sofa, Painting, Table / Bottle, Person, Table, Chair, Mirror, Tablecloth / Bowl, Bottle, Shelf, Painting, Food.
28
PASCAL VOC 2007 (more examples). Example tags: Dog, Floor, Hairclip / Dog, Person, Ground, Bench, Scarf / Person, Microphone, Light / Horse, Person, Tree, House, Building, Ground, Hurdle, Fence.
29
PASCAL VOC 2007 (failure cases). Example tags: Aeroplane, Sky, Building, Shadow / Person, Pole, Building, Sidewalk, Grass, Road / Dog, Clothes, Rope, Plant, Ground, Shadow, String, Wall / Bottle, Glass, Wine, Table.
30
Some observations. We find that the implicit tag features often predict scale better for indoor objects and position better for outdoor objects. Gist is usually better for the y position, while tags are generally stronger for scale, which agrees with previous experiments using Gist. In general, the method needs to have learned about target objects from a variety of examples with different contexts.
31
Conclusion. We showed how to exploit the implicit information in human tagging behavior to improve object localization in both speed and accuracy.
32
Future Work. Joint multi-object detection. From tags to natural language sentences. Image retrieval. Using WordNet to group words with similar meanings.