1
Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin
2
Detecting tagged objects. Images tagged with keywords clearly tell us which objects to search for. Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.
3
Detecting tagged objects. Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002, Fergus et al. 2005, Berg et al. 2004, Vijayanarasimhan & Grauman 2008.
4
Main Idea. The list of tags on an image may give useful information beyond just what objects are present. Tags for image 1: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it. Tags for image 2: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer. Can you guess where, and at what size, the mug will appear in each image?
5
Main Idea: tags as context. In the first list, the mug is named first and larger objects are absent; in the second, the mug is named later and larger objects are present.
6
Feature: word presence/absence. The presence or absence of other objects, and how many of them there are, affects the scene layout. The presence of smaller objects such as a key, together with the absence of larger objects, hints that the image might be a close-up scene. The presence of larger objects such as a desk and a bookshelf hints that the image depicts a typical office scene.
7
Feature: word presence/absence. A plain bag-of-words feature describing word frequency; W_i = frequency of word i. (Blue: larger objects; red: smaller objects.)

Word   Mug  Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen  Post-it  Toothbrush  Key
W1      1      0        0        1       0       0         0       1     1      1         1        1
W2      1      2        2        1       1       1         2       0     0      0         0        0
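To make the feature concrete, here is a minimal Python sketch of the word-count feature W, using the vocabulary and the first example tag list from the table above; the lowercase vocabulary spelling is an assumption.

```python
# Minimal sketch of the bag-of-words feature W: for a fixed vocabulary,
# W_i counts how many times word i appears in the image's tag list.
from collections import Counter

VOCAB = ["mug", "computer", "screen", "keyboard", "desk", "bookshelf",
         "poster", "photo", "pen", "post-it", "toothbrush", "key"]

def word_feature(tags):
    """Return the count vector over VOCAB for one image's tag list."""
    counts = Counter(t.lower() for t in tags)
    return [counts[w] for w in VOCAB]

tags1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
print(word_feature(tags1))  # [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1] (row W1)
```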
8
Feature: tag rank. People tag the 'important' objects earlier. If an object is tagged first, there is a high chance that it is the main object: large and centered. If an object is tagged later, it might not be salient: it may be far from the center or small in scale.
9
Feature: tag rank. The percentile of the absolute rank of the tag compared against its typical rank; r_i = percentile of the rank for tag i. (Blue: high relative rank, >0.6; green: medium relative rank, 0.4–0.6; red: low relative rank, <0.4.)

Word   Mug   Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
R1     0.80     0         0      0.51     0        0        0     0.28   0.72   0.82       0        0.90
R2     0.23   0.62      0.21     0.13    0.48     0.61     0.41    0      0      0         0         0
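Below is a hedged sketch of one plausible way to compute this percentile: compare a tag's rank in this image against the empirical ranks the same tag received in training images. The exact convention (fraction of training occurrences tagged later) is an assumption, chosen so that earlier-than-usual tags score high, as in the table.

```python
# Sketch of the tag-rank feature r_i: percentile of this occurrence's rank
# within the distribution of ranks the same tag had in training images.
import bisect

def rank_percentile(rank, training_ranks):
    """Fraction of training occurrences where the tag appeared at a later rank."""
    sorted_ranks = sorted(training_ranks)
    later = len(sorted_ranks) - bisect.bisect_right(sorted_ranks, rank)
    return later / len(sorted_ranks)

# Hypothetical training data: ranks at which "mug" was tagged in training images.
mug_training_ranks = [1, 2, 3, 4, 5]
print(rank_percentile(1, mug_training_ranks))  # 0.8: tagged earlier than usual
```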
10
Feature: proximity. People tend to move their eyes to nearby objects, so objects that are close to each other in the tag list are likely to be close in the image. Tags for image 1: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it. Tags for image 2: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster 10) Computer.
11
Feature: proximity. Encoded as the inverse of the average rank difference between tag words; P_ij = inverse rank difference between tags i and j. (Blue: objects close to each other.)

Image 1:
Word        Mug   Screen  Keyboard  Desk  Bookshelf
Mug          1      0       0.5      0       0
Screen       -      0        0       0       0
Keyboard     -      -        1       0       0
Desk         -      -        -       0       0
Bookshelf    -      -        -       -       0

Image 2:
Word        Mug   Screen  Keyboard  Desk  Bookshelf
Mug          1      1       0.5     0.2    0.25
Screen       -      1        1      0.33   0.5
Keyboard     -      -        1      0.33   0.5
Desk         -      -        -       1      1
Bookshelf    -      -        -       -      1
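A minimal sketch of the pairwise proximity feature, following the tables above: 1 on the diagonal when a word is present, the inverse rank difference for distinct words, and 0 when either word is absent. Averaging over repeated occurrences is an assumption (the tables are consistent with it for most entries).

```python
# Sketch of the proximity feature P_ij for one image's ordered tag list.
from itertools import product

def proximity(tags, word_i, word_j):
    """Inverse of the average rank difference between two tagged words."""
    ranks_i = [r for r, t in enumerate(tags, 1) if t == word_i]
    ranks_j = [r for r, t in enumerate(tags, 1) if t == word_j]
    if not ranks_i or not ranks_j:
        return 0.0          # absent words contribute no proximity cue
    if word_i == word_j:
        return 1.0          # diagonal entry for a present word
    diffs = [abs(a - b) for a, b in product(ranks_i, ranks_j)]
    return 1.0 / (sum(diffs) / len(diffs))

tags2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
         "Keyboard", "Screen", "Mug", "Poster", "Computer"]
print(proximity(tags2, "Mug", "Desk"))  # |8 - 3| = 5 -> 0.2, as in the table
```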
12
Overview of the approach. Given an image and its tags (e.g., Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), we compute the implicit tag features W = {1, 0, 2, …, 3}, R = {0.9, 0.5, …, 0.2}, and P = {0.25, 0.33, …, 0.1}, and model P(X|W), P(X|R), and P(X|P); a sliding-window detector provides the appearance-based prediction P(X|A). First use: priming the detector, telling it what to look for and where, to produce the localization result.
13
Overview of the approach (continued). Second use: modulating the detector, combining the tag-based predictions P(X|W), P(X|R), P(X|P) with the appearance-based prediction P(X|A) to re-score detection hypotheses (e.g., from 0.24 to 0.81) and produce the localization result.
14
Approach: modeling P(X|T). We wish to know the conditional PDF of the location and scale of the target object given the tag features: P(X|T), where X = (s, x, y) and T is the tag feature. We model this conditional PDF directly, without calculating the joint distribution P(X,T), using a mixture density network (MDN). Figure: the top 30 most likely positions for class car, with bounding boxes sampled according to P(X|T).
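The slide names a mixture density network but gives no architecture; the sketch below, assuming PyTorch, shows the standard MDN construction, mapping the tag feature T to the parameters of a Gaussian mixture over X = (s, x, y). The layer sizes, number of components, and isotropic covariance are illustrative choices, not the paper's settings.

```python
# Sketch of a mixture density network for P(X|T): the network predicts
# mixing weights, means, and spreads of a Gaussian mixture over X.
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, t_dim, n_components=5, x_dim=3, hidden=64):
        super().__init__()
        self.k, self.d = n_components, x_dim
        self.body = nn.Sequential(nn.Linear(t_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_components)           # mixing weights
        self.mu = nn.Linear(hidden, n_components * x_dim)   # component means
        self.log_sigma = nn.Linear(hidden, n_components)    # isotropic spreads

    def forward(self, t):
        h = self.body(t)
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.k, self.d)
        sigma = torch.exp(self.log_sigma(h))
        return pi, mu, sigma

def mdn_nll(pi, mu, sigma, x):
    """Negative log-likelihood of X = (s, x, y) under the predicted mixture."""
    dist = torch.distributions.Normal(mu, sigma.unsqueeze(-1))
    log_comp = dist.log_prob(x.unsqueeze(1)).sum(-1)        # (batch, K)
    return -torch.logsumexp(torch.log(pi) + log_comp, dim=-1).mean()
```

Training minimizes `mdn_nll` over (tag feature, ground-truth box) pairs; sampling the fitted mixture gives boxes like those in the figure.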
15
Approach: priming the detector. How can we make use of this learned distribution P(X|T)? 1) Use it to speed up the detection process. 2) Use it to modulate the detection confidence score. For priming: first rank candidate windows by the learned P(X|T), then search only the probable regions and scales, following that rank. (Figure: region to search vs. ignored; most probable scale vs. unlikely scale.)
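A rough sketch of the priming step under stated assumptions: `prior_pdf` stands in for the learned MDN density P(X|T) and `detector_score` for the sliding-window classifier; only the top-ranked fraction of windows under the prior is ever evaluated.

```python
# Sketch of priming: rank all candidate windows under P(X|T), then run the
# expensive appearance detector only on the most probable ones, in rank order.
import numpy as np

def primed_detection(windows, prior_pdf, detector_score, budget=0.2):
    """windows: list of (x, y, scale); evaluate only the top `budget` fraction."""
    prior = np.array([prior_pdf(w) for w in windows])
    order = np.argsort(-prior)                  # most probable windows first
    keep = order[: max(1, int(budget * len(windows)))]
    scores = {i: detector_score(windows[i]) for i in keep}  # rest are ignored
    best = max(scores, key=scores.get)
    return windows[best], scores[best]
```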
16
Approach: modulating the detector. The second use of P(X|T) is to modulate the detection confidence score. A logistic regression classifier learns weights for each prediction: the appearance-based P(X|A) from the detector and the tag-based P(X|W), P(X|R), and P(X|P) from the image tags (e.g., Lamp, Car, Wheel, Light).
17
Approach: modulating the detector. Example: predictions based on the original detector score alone are 0.7, 0.8, and 0.9.
18
Approach: modulating the detector. The tag features give their own predictions for the same hypotheses: 0.3, 0.9, and 0.2.
19
Approach: modulating the detector. Combining the two yields the final scores 0.63, 0.24, and 0.18; in this illustration each combined score is the product of a detector score and a tag-based score (e.g., 0.9 × 0.2 = 0.18), so a confident detection in an implausible tag context is demoted.
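A small sketch of the modulation step, assuming scikit-learn for the logistic regression the slides describe; the feature rows, labels, and scores are made-up illustrations, with one row per candidate window holding [P(X|A), P(X|W), P(X|R), P(X|P)].

```python
# Sketch of modulating the detector: a logistic regression learns weights
# for the appearance-based and tag-based predictions of each window.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training rows: [P(X|A), P(X|W), P(X|R), P(X|P)] per window;
# labels mark whether the window was a correct localization.
X_train = np.array([[0.9, 0.2, 0.3, 0.1],
                    [0.7, 0.9, 0.8, 0.7],
                    [0.3, 0.1, 0.2, 0.2],
                    [0.8, 0.7, 0.9, 0.6]])
y_train = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X_train, y_train)
modulated = clf.predict_proba([[0.9, 0.2, 0.2, 0.3]])[0, 1]
print(modulated)  # a confident detector score demoted by weak tag support
```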
20
Experiments. We compare detection speed (the number of windows that must be searched) and detection accuracy (AUROC and average precision) across three methods: appearance only, appearance + Gist, and appearance + tag features (ours).
21
Experiments: datasets. LabelMe contains ordered tag lists; we used Dalal & Triggs' HOG detector. PASCAL VOC 2007 contains images with high variance in composition; its tag lists were obtained from anonymous workers on Mechanical Turk, and we used Felzenszwalb's LSVM detector.

Dataset                          LabelMe      PASCAL VOC 2007
Number of training/test images   3799/2553    5011/4953
Number of classes                5            20
Number of keywords               209          399
Number of taggers                56           758
Avg. number of tags per image    23           5.5
22
LabelMe: performance evaluation. Using a modified version of the HOG detector by Dalal and Triggs, we obtain more accurate detection, because we know which hypotheses to trust most, and faster detection, because we know where to look first.
23
Results: LabelMe. Panels compare HOG, HOG+Gist, and HOG+Tags (example tags: Sky, Buildings, Person, Sidewalk, Car, Road; Car, Window, Road, Window, Sky, Wheel, Sign). Gist and tags are likely to predict the same position but different scales; most of the accuracy gain from the tag features comes from accurate scale prediction.
24
Results: LabelMe (continued). Panels compare HOG, HOG+Gist, and HOG+Tags (example tags: Desk, Keyboard, Screen, Bookshelf; Desk, Keyboard, Screen, Mug; Keyboard, Screen, CD).
25
PASCAL VOC 2007: performance evaluation. With a modified version of Felzenszwalb's LSVM detector, we need to test fewer windows to achieve the same detection rate, and obtain a 9.2% improvement in accuracy (average precision) over all classes.
26
Per-class localization accuracy: significant improvements on bird, boat, cat, dog, and potted plant.
27
PASCAL VOC 2007 (examples). Panels compare ours against the LSVM baseline. Example tags: Aeroplane, Building / Aeroplane, Smoke / Aeroplane, Lamp / Person, Bottle, Dog, Sofa, Painting, Table / Bottle, Person, Table, Chair, Mirror, Tablecloth / Bowl, Bottle, Shelf, Painting, Food.
28
PASCAL VOC 2007 (more examples). Example tags: Dog, Floor, Hairclip / Dog, Person, Ground, Bench, Scarf / Person, Microphone, Light / Horse, Person, Tree, House, Building, Ground, Hurdle, Fence.
29
PASCAL VOC 2007 (failure cases). Example tags: Aeroplane, Sky, Building, Shadow / Person, Pole, Building, Sidewalk, Grass, Road / Dog, Clothes, Rope, Plant, Ground, Shadow, String, Wall / Bottle, Glass, Wine, Table.
30
Some observations. We find that the implicit tag features often predict scale better for indoor objects and position better for outdoor objects. Gist is usually better for the y position, while tags are generally stronger for scale, which agrees with previous experiments using Gist. In general, the method needs to have learned about target objects from a variety of examples with different contexts.
31
Conclusion. We showed how to exploit the implicit information in human tagging behavior to improve object localization in both speed and accuracy.
32
Future Work. Joint multi-object detection. From tags to natural language sentences. Image retrieval. Using WordNet to group words with similar meanings.