Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags
Sung Ju Hwang and Kristen Grauman, University of Texas at Austin. CVPR 2010.
Detecting tagged objects

Images tagged with keywords clearly tell us which objects to search for.
Example tag list: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.
Detecting tagged objects

Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al., Fergus et al., Li et al. 2009, Berg et al. [Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, …]
Our Idea

The list of human-provided tags gives useful cues beyond just which objects are present.
  Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
  Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster
Based on tags alone, can you guess where and what size the mug will be in each image?
Our Idea

The list of human-provided tags gives useful cues beyond just which objects are present.
  Image 1 (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it): absence of larger objects, and Mug is named first.
  Image 2 (Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster): presence of larger objects, and Mug is named later.
Our Idea

We propose to learn the implicit localization cues provided by tag lists to improve object detection.
Approach overview

Training: learn an object-specific connection between localization parameters and implicit tag features, from tagged training images such as (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), (Computer, Poster, Desk, Screen, Mug, Poster), (Woman, Table, Mug, Ladder), (Mug, Eiffel), (Desk, Mug, Office), and (Mug, Coffee).

Testing: given a novel image, localize objects based on both tags and appearance, combining an object detector with P(location, scale | tags) computed from the image's implicit tag features.
Feature: Word presence/absence

Presence or absence of other objects affects the scene layout, so we record bag-of-words frequency:
  W(im) = [w_1, …, w_N], where w_i = count of the i-th word.
  Image 1 (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it): small objects mentioned.
  Image 2 (Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster): large objects mentioned.
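The word presence/absence feature can be sketched as a simple count vector over a fixed tag vocabulary. The vocabulary and tag lists below are toy examples from the slides, not the paper's actual data:

```python
# Illustrative sketch of the bag-of-words tag feature W(im).
VOCAB = ["mug", "key", "keyboard", "toothbrush", "pen", "photo",
         "post-it", "computer", "screen", "desk", "bookshelf", "poster"]

def word_feature(tags):
    """W(im): w_i = count of the i-th vocabulary word in the tag list."""
    lowered = [t.lower() for t in tags]
    return [lowered.count(word) for word in VOCAB]

im1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
im2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
       "Keyboard", "Screen", "Mug", "Poster"]
W1, W2 = word_feature(im1), word_feature(im2)
```

Note that the feature records counts, not just presence: "Screen" appearing twice in image 2 yields a 2 in the corresponding entry.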
Feature: Rank of tags

People tag the "important" objects earlier, so we record the rank of each tag compared to its typical rank:
  R(im) = [r_1, …, r_N], where r_i = percentile rank of the i-th word.
  Image 1 (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it): Mug has a relatively high rank, being named first.
  Image 2 (Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster): Mug is named late.
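A minimal sketch of one reading of this feature: compare a word's normalized position in the current tag list against the positions observed for that word in training lists. The training data here is invented for illustration:

```python
def relative_rank(tags, word):
    """Position of `word` in the tag list, normalized to (0, 1]."""
    return (tags.index(word) + 1) / len(tags)

def percentile_rank(train_lists, tags, word):
    """r_i: fraction of the word's training occurrences tagged no earlier
    than it is here.  Values near 1 mean an unusually early (hence
    "important") mention.  A simplified reading of the paper's feature."""
    typical = [relative_rank(t, word) for t in train_lists if word in t]
    r = relative_rank(tags, word)
    return sum(1 for v in typical if v >= r) / len(typical)

# Hypothetical training tag lists in which "mug" is usually tagged late.
train = [["mug", "pen"], ["desk", "mug"], ["screen", "keyboard", "mug"]]
```

With this data, tagging "mug" first in a new image gives it a percentile rank of 1.0, signaling unusual prominence.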
Feature: Proximity of tags

People tend to move their eyes to nearby objects after the first fixation, so we record the proximity of all tag pairs:
  P(im) = [p_11, …, p_NN], where p_ij = rank difference between words i and j (zero when a word is absent).
  Image 1: 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it
  Image 2: 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster
Words tagged in close succession (e.g. Screen, Keyboard, and Mug in Image 2) may be close to each other in the image.
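One plausible encoding of this pairwise feature, using the inverse absolute rank difference so that adjacent tags score highest; this particular encoding is an illustrative assumption, not necessarily the paper's exact definition:

```python
def proximity_feature(tags, vocab):
    """P(im): for each vocabulary pair (i, j), the inverse absolute rank
    difference of the two tags, and 0 when either word is absent."""
    lowered = [t.lower() for t in tags]
    n = len(vocab)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if vocab[i] in lowered and vocab[j] in lowered:
                d = abs(lowered.index(vocab[i]) - lowered.index(vocab[j]))
                P[i][j] = 1.0 / d
    return P

vocab = ["mug", "screen", "keyboard"]
im2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
       "Keyboard", "Screen", "Mug", "Poster"]
P2 = proximity_feature(im2, vocab)
```

Screen and Keyboard, tagged consecutively, get the maximum proximity of 1.0, while Mug and Screen, three positions apart, score 1/3.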
Modeling P(X|T)

We need a PDF for the location and scale of the target object, given the tag feature:
  P(X = (scale, x, y) | T = tag feature)
We model it directly using a mixture density network (MDN) [Bishop, 1994]: a neural network maps the input tag feature (Words, Rank, or Proximity) to the parameters (α_k, μ_k, Σ_k) of a Gaussian mixture model over X.
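The resulting conditional density is a Gaussian mixture whose parameters are the network's outputs. A minimal sketch with fixed, hand-picked isotropic components (in the actual MDN, these would be produced by the trained network for a given tag feature):

```python
import math

def gaussian(x, mu, sigma):
    """Isotropic Gaussian density over X = (scale, x, y)."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mu))
    return math.exp(-sq / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2) ** (d / 2)

def mixture_density(x, components):
    """P(X | T) = sum_k alpha_k * N(x; mu_k, sigma_k).
    Here (alpha, mu, sigma) are hard-coded for illustration."""
    return sum(a * gaussian(x, mu, s) for a, mu, s in components)

# Hypothetical mixture for one tag list: a dominant large, centered mode
# plus a smaller off-center mode.
components = [(0.7, (0.6, 0.5, 0.5), 0.1),
              (0.3, (0.2, 0.3, 0.8), 0.15)]
```

The mixture weights sum to one, and the density is highest near the dominant mode, which is what lets the model rank candidate locations and scales.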
Modeling P(X|T)

Example: top 30 most likely localization parameters sampled for the object "car", given only the tags, e.g. (Lamp, Car, Wheel, Light, Window), (House, Car, Road), (House, Lightpole, Car, Windows), (Building, Man, Barrel, Car), (Truck, Car), and (Boulder, Car).
Integrating with the object detector

How to exploit the learned distribution P(X|T)?
1) Use it to speed up the detection process (location priming):
  (a) Sort all candidate windows according to P(X|T), from most likely to least likely.
  (b) Run the detector only at the most probable locations and scales.
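Steps (a) and (b) can be sketched as a simple ranking with a search budget. The prior below is a stand-in for the learned P(X|T); the budget value is illustrative:

```python
def prime_search(windows, density, budget=0.3):
    """Location priming: rank candidate windows (scale, x, y) by the
    tag-based density P(X|T) and keep only the top `budget` fraction
    for the expensive appearance-based detector."""
    ranked = sorted(windows, key=density, reverse=True)
    return ranked[: max(1, round(budget * len(windows)))]

# Toy prior favoring mid-sized windows; a real system would evaluate
# the learned mixture density here.
prior = lambda w: -abs(w[0] - 0.5)
windows = [(s / 10, 0.5, 0.5) for s in range(10)]
kept = prime_search(windows, prior, budget=0.3)
```

Only the kept windows are passed to the sliding-window detector, which is where the reported speedup comes from.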
Integrating with the object detector

How to exploit the learned distribution P(X|T)?
1) Use it to speed up the detection process (location priming).
2) Use it to increase detection accuracy: modulate the detector's output scores by combining predictions from the object detector with predictions based on the tag features.
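Score modulation can be sketched as re-scoring each detection hypothesis with the tag-based prior. The convex combination and the weight below are illustrative assumptions, not the paper's learned combination:

```python
def modulate_scores(detections, density, w=0.5):
    """Re-score detections by mixing appearance confidence with the
    tag-based prior P(X|T), then re-rank.  A simple convex combination
    stands in for however the two sources are actually fused."""
    rescored = [(win, w * s + (1 - w) * density(win)) for win, s in detections]
    return sorted(rescored, key=lambda d: d[1], reverse=True)

# Two hypothetical detections: the detector slightly prefers the first,
# but the tag prior strongly favors the second.
dets = [((0.9, 0.1, 0.1), 0.60), ((0.5, 0.5, 0.5), 0.55)]
prior = lambda win: 1.0 if win == (0.5, 0.5, 0.5) else 0.0
ranked = modulate_scores(dets, prior)
```

Here the tag prior overrides a marginal appearance preference, which is the mechanism behind knowing "which detection hypotheses to trust most."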
Experiments: Datasets

LabelMe:
- Street and office scenes
- Ordered tag lists, via the order in which labels were added
- 5 classes, 56 unique taggers, ~23 tags per image
- Detector: Dalal & Triggs's HOG detector

PASCAL VOC 2007:
- Flickr images
- Tag lists obtained on Mechanical Turk
- 20 classes, 758 unique taggers, ~5.5 tags per image
- Detector: Felzenszwalb et al.'s LSVM detector
Experiments

We evaluate detection speed and detection accuracy.
We compare the raw detector (HOG, LSVM) against the raw detector + our tag features.
We also show results using Gist [Torralba 2003] as context, for reference.
PASCAL: Performance evaluation

We search fewer windows to achieve the same detection rate: a naïve sliding window search scans 70% of the candidate windows, while we search only 30%. We also know which detection hypotheses to trust most.
PASCAL: Accuracy vs. Gist, per class
PASCAL: Example detections (LSVM + Tags (Ours) vs. LSVM alone)

Example tag lists: (Lamp, Person, Bottle, Dog, Sofa, Painting, Table), (Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food, Bottle), (Car, License Plate, Building, Car), (Car, Door, Gear, Steering Wheel, Seat, Person, Camera).
PASCAL: Example detections (LSVM + Tags (Ours) vs. LSVM alone)

Example tag lists: (Dog, Floor, Hairclip), (Dog, Person, Ground, Bench, Scarf), (Person, Microphone, Light), (Horse, Person, Tree, House, Building, Ground, Hurdle, Fence), (Dog, Person).
PASCAL: Example failure cases (LSVM + Tags (Ours) vs. LSVM alone)

Example tag lists: (Aeroplane, Sky, Building, Shadow), (Person, Pole, Building, Sidewalk, Grass, Road), (Dog, Clothes, Rope, Plant, Ground, Shadow, String, Wall), (Bottle, Glass, Wine, Table).
Results: Observations

- Our implicit features often predict scale well for indoor objects and position well for outdoor objects.
- Gist is usually better for y position, while our tag features are generally stronger for scale.
- The method needs to have learned about the target object from a variety of examples with different contexts; visual context and tag context are complementary.
Summary

- We want to learn what is implied, beyond which objects are present, by how a human provides tags for an image.
- Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection.
- The novel tag cues enable an effective localization prior.
- We obtain significant gains with state-of-the-art detectors on two datasets.
Future work

- Joint multi-object detection
- From tags to natural language sentences
- Image retrieval applications