
1 Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin. CVPR 2010.

2 Detecting tagged objects. Images tagged with keywords clearly tell us which objects to search for. (Example image tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.)

3 Detecting tagged objects. Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002; Berg et al. 2004; Fergus et al. 2005; Li et al. 2009. [See also Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, ...]

4 Our Idea. The list of human-provided tags gives useful cues beyond just which objects are present. Based on tags alone, can you guess where and what size the mug will be in each image? (Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it. Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster.)

5 Our Idea. The list of human-provided tags gives useful cues beyond just which objects are present. In image 1 (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), larger objects are absent and the mug is named first; in image 2 (Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster), larger objects are present and the mug is named later.

6 Our Idea. We propose to learn the implicit localization cues provided by tag lists to improve object detection.

7 Approach overview. Training: learn an object-specific connection between localization parameters and implicit tag features, P(location, scale | tags), from tagged training images (e.g., tag lists "Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it"; "Computer, Poster, Desk, Screen, Mug, Poster"; "Woman, Table, Mug, Ladder"). Testing: given a novel image (e.g., tagged with Mug together with Eiffel, Desk, Office, or Coffee), localize objects based on both the tags and appearance, combining the object detector with the implicit tag features.


9 Feature: Word presence/absence. The presence or absence of other objects affects the scene layout, so we record bag-of-words frequency: W = [w_1, ..., w_N], where w_i = count of the i-th word. For the two example images (im1: small objects mentioned; im2: large objects mentioned):

           Mug  Pen  Post-it  Toothbrush  Key  Photo  Computer  Screen  Keyboard  Desk  Bookshelf  Poster
  W(im1)    1    1      1         1        1     1       0         0        1       0       0         0
  W(im2)    1    0      0         0        0     0       1         2        1       1       1         1

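As a concrete illustration, here is a small Python sketch of this bag-of-words feature over the slide's example vocabulary; the function name and vocabulary ordering are mine, not from the paper.

```python
# Sketch of the word presence/absence feature: a count vector over the
# tag vocabulary (vocabulary taken from the slide's example).
vocab = ["mug", "pen", "post-it", "toothbrush", "key", "photo",
         "computer", "screen", "keyboard", "desk", "bookshelf", "poster"]

def word_counts(tags):
    """W(im): w_i = number of times the i-th vocabulary word was tagged."""
    tags = [t.lower() for t in tags]
    return [tags.count(w) for w in vocab]

im2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
       "Keyboard", "Screen", "Mug", "Poster"]
print(word_counts(im2))  # "screen" appears twice -> count 2
```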

11 Feature: Rank of tags. People tag the "important" objects earlier, so we record the rank of each tag compared to its typical rank: R = [r_1, ..., r_N], where r_i = percentile rank of the i-th word. Note the relatively high rank of "mug" in im1:

           Mug   Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
  R(im1)   0.80     0        0       0.51      0       0         0     0.28  0.72   0.82        0       0.90
  R(im2)   0.23    0.62     0.21     0.13    0.48     0.61      0.41     0     0      0          0        0

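The slide does not spell out how the percentile is computed. The Python sketch below takes one plausible reading, scoring a word highly when it is tagged earlier than it typically is in training; the training rank histories are hypothetical.

```python
# Sketch of the tag-rank feature. Assumption: r_i is the fraction of a
# word's training occurrences in which it was ranked no earlier than it
# is in this image, so early naming yields a high percentile.
def rank_percentile(tags, training_ranks):
    feature = {}
    for idx, word in enumerate(t.lower() for t in tags):
        rank = idx + 1                       # 1-based rank in this tag list
        history = training_ranks.get(word, [])
        if history:
            feature[word] = sum(r >= rank for r in history) / len(history)
        else:
            feature[word] = 0.0              # unseen word gets feature 0
    return feature

training_ranks = {"mug": [1, 3, 5, 8], "keyboard": [2, 3, 6]}  # hypothetical
print(rank_percentile(["Mug", "Key", "Keyboard"], training_ranks))
# mug tagged first -> percentile 1.0; key unseen in training -> 0.0
```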

13 Feature: Proximity of tags. People tend to move their eyes to nearby objects after the first fixation, so we record the proximity of all tag pairs: P = [p_ij], where p_ij is the reciprocal of the rank difference between words i and j (objects that may be close to each other in the image tend to be named consecutively). Ranked tag lists: im1: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it; im2: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster.

  P(im1)      Mug  Screen  Keyboard  Desk  Bookshelf
  Mug          1     0       0.5       0       0
  Screen             0        0        0       0
  Keyboard                    1        0       0
  Desk                                 0       0
  Bookshelf                                    0

  P(im2)      Mug  Screen  Keyboard  Desk  Bookshelf
  Mug          1     1       0.5      0.2     0.25
  Screen             1        1       0.33    0.5
  Keyboard                    1       0.33    0.5
  Desk                                 1       1
  Bookshelf                                    1

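The matrices above are consistent with p_ij being the reciprocal of the rank difference between words i and j. How repeated tags (e.g., the two "Screen" entries in im2) are aggregated is not fully clear from the slide, so this sketch takes the closest pair of occurrences; that choice is an assumption.

```python
# Sketch of the pairwise proximity feature: p_ij = 1 / |rank_i - rank_j|,
# 1 on the diagonal, 0 if either word is absent. For repeated tags we
# take the closest pair of occurrences (assumption).
def proximity(tags, vocab):
    ranks = {}
    for r, word in enumerate(tags, start=1):
        ranks.setdefault(word.lower(), []).append(r)
    n = len(vocab)
    P = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            if wi in ranks and wj in ranks:
                diff = min(abs(a - b) for a in ranks[wi] for b in ranks[wj])
                P[i][j] = 1.0 if diff == 0 else 1.0 / diff
    return P

vocab = ["mug", "screen", "keyboard", "desk", "bookshelf"]
im2 = ["computer", "poster", "desk", "bookshelf", "screen",
       "keyboard", "screen", "mug", "poster"]
# mug (rank 8) vs desk (rank 3): 1 / |8 - 3| = 0.2, as in the slide
print(proximity(im2, vocab)[0][3])
```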

15 Approach overview (recap). Training: learn P(location, scale | W, R, P) from the implicit tag features. Testing: combine this learned prior with the object detector on novel images.

16 Modeling P(X|T). We need a PDF for the location and scale of the target object given the tag features: P(X = (scale, x, y) | T = tag feature). We model it directly using a mixture density network (MDN) [Bishop, 1994]: the input tag feature (Words, Rank, or Proximity) feeds a neural network whose outputs are the mixture-model parameters (α, µ, Σ).
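Below is a minimal MDN sketch in PyTorch: a small network maps the tag feature to mixture weights α, means µ, and variances Σ of a Gaussian mixture over X = (scale, x, y). The layer sizes, component count, and diagonal-covariance choice are my assumptions, not details given on the slide.

```python
# Minimal mixture density network (MDN) sketch in the spirit of
# Bishop (1994). Hypothetical sizes; diagonal covariances assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagMDN(nn.Module):
    def __init__(self, feat_dim, n_comp=4, x_dim=3, hidden=64):
        super().__init__()
        self.n_comp, self.x_dim = n_comp, x_dim
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, n_comp)          # mixture weights
        self.mu_head    = nn.Linear(hidden, n_comp * x_dim)  # component means
        self.sigma_head = nn.Linear(hidden, n_comp * x_dim)  # log std devs

    def forward(self, t):
        h = self.trunk(t)
        alpha = F.softmax(self.alpha_head(h), dim=-1)
        mu    = self.mu_head(h).view(-1, self.n_comp, self.x_dim)
        sigma = torch.exp(self.sigma_head(h)).view(-1, self.n_comp, self.x_dim)
        return alpha, mu, sigma

def mdn_nll(alpha, mu, sigma, x):
    """Negative log-likelihood of x = (scale, x, y) under the mixture."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(x.unsqueeze(1)).sum(dim=-1)    # (batch, n_comp)
    return -torch.logsumexp(alpha.log() + log_prob, dim=-1).mean()
```

Training minimizes mdn_nll over (tag feature, ground-truth box) pairs; at test time the returned mixture is the prior P(X|T) for the target class.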

17 Modeling P(X|T). Example: the top 30 most likely localization parameters sampled for the object "car", given only the tags. (Example images tagged with, e.g., Lamp, Car, Wheel, Light, Window, House; Car, Road, House, Lightpole; Car, Windows, Building, Man, Barrel; Truck, Car; Boulder, Car.)



20 Integrating with the object detector. How do we exploit the learned distribution P(X|T)? 1) Use it to speed up the detection process (location priming).

21 Integrating with the object detector. 1) Location priming: (a) sort all candidate windows according to P(X|T), from most likely to least likely; (b) run the detector only at the most probable locations and scales.
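A sketch of this priming step under assumed interfaces: prior_density (the learned P(X|T) evaluated at one window) and detector are hypothetical callables standing in for the trained MDN and the HOG/LSVM detector.

```python
# Location priming sketch: rank candidate windows by the tag-conditioned
# prior and run the expensive appearance detector only on the top share.
def primed_detect(windows, prior_density, detector, keep_frac=0.30):
    """windows: list of (scale, x, y) candidates for one image."""
    ranked = sorted(windows, key=prior_density, reverse=True)
    n_keep = max(1, int(keep_frac * len(ranked)))
    # Detector runs only at the most probable locations and scales.
    return [(w, detector(w)) for w in ranked[:n_keep]]
```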

22 Integrating with the object detector. 2) Use it to increase detection accuracy by modulating the detector's output scores. Example per-window scores: predictions from the object detector 0.7, 0.8, 0.9; predictions based on tag features 0.3, 0.2, 0.9.

23 Integrating with the object detector. Modulating the detector scores with the tag-based predictions yields combined scores of 0.63, 0.24, and 0.18 for the three example windows (e.g., 0.7 × 0.9 = 0.63).
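The slide's numbers are consistent with a simple per-window product of the two scores (0.7 × 0.9 = 0.63, 0.8 × 0.3 = 0.24, 0.9 × 0.2 = 0.18), pairing scores by window rather than by the order the two lists are printed. The sketch below reproduces that arithmetic; whether the paper uses a plain product is my assumption.

```python
# Score modulation sketch: multiply each window's detector score by the
# tag-based prior evaluated at that window (assumption: plain product).
detector_scores = [0.7, 0.8, 0.9]   # appearance-based confidences
tag_scores      = [0.9, 0.3, 0.2]   # P(X|T) at the same three windows
combined = [d * t for d, t in zip(detector_scores, tag_scores)]
print(combined)   # approximately [0.63, 0.24, 0.18]
```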

24 Experiments: Datasets.
  LabelMe: street and office scenes; ordered tag lists via the order labels were added; 5 classes; 56 unique taggers; ~23 tags per image; Dalal & Triggs' HOG detector.
  PASCAL VOC 2007: Flickr images; tag lists obtained on Mechanical Turk; 20 classes; 758 unique taggers; ~5.5 tags per image; Felzenszwalb et al.'s LSVM detector.

25 Experiments. We evaluate:  detection speed,  detection accuracy. We compare:  the raw detector (HOG, LSVM),  the raw detector + our tag features. For reference, we also show results using Gist [Torralba 2003] as context.

26 PASCAL: Performance evaluation. We search fewer windows to achieve the same detection rate: a naïve sliding window must search 70% of windows, while we search only 30%. We also know which detection hypotheses to trust most.

27 PASCAL: Accuracy vs. Gist, per class.


29 PASCAL: Example detections, LSVM + Tags (Ours) vs. LSVM alone. (Example images tagged with, e.g., Lamp, Person, Bottle, Dog, Sofa, Painting, Table; Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food; Bottle; Car, License Plate, Building; Car, Door, Gear, Steering Wheel, Seat, Person, Camera.)

30 PASCAL: Example detections (dog, person), LSVM + Tags (Ours) vs. LSVM alone. (Example images tagged with, e.g., Dog, Floor, Hairclip; Dog, Person, Ground, Bench, Scarf; Person, Microphone, Light; Horse, Person, Tree, House, Building, Ground, Hurdle, Fence.)

31 PASCAL: Example failure cases, LSVM + Tags (Ours) vs. LSVM alone. (Example images tagged with, e.g., Aeroplane, Sky, Building, Shadow; Person, Pole, Building, Sidewalk, Grass, Road; Dog, Clothes, Rope, Plant, Ground, Shadow, String, Wall; Bottle, Glass, Wine, Table.)

32 Results: Observations.  Our implicit features often predict scale well for indoor objects and position well for outdoor objects.  Gist is usually better for y position, while our tag features are generally stronger for scale.  The method needs to have learned about the target objects from a variety of examples with different contexts; visual context and tag context are complementary.

33 Summary.  We want to learn what is implied, beyond which objects are present, by how a human provides tags for an image.  Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection.  The novel tag cues enable an effective localization prior.  We obtain significant gains with state-of-the-art detectors on two datasets.

34 Future work.  Joint multi-object detection.  From tags to natural language sentences.  Image retrieval applications.

