Finding Things: Image Parsing with Regions and Per-Exemplar Detectors


Finding Things: Image Parsing with Regions and Per-Exemplar Detectors Joseph Tighe, Svetlana Lazebnik 2013 IEEE Conference on Computer Vision and Pattern Recognition

Outline Introduction Region-based parsing Detector-based parsing SVM combination and MRF smoothing Experiments Conclusion

Introduction We address the problem of image parsing: labeling each pixel in an image with its semantic category. Our goal is broad coverage – the ability to recognize hundreds or thousands of object classes that commonly occur in everyday street scenes and indoor environments.

Introduction A major challenge in doing this is posed by the non-uniform statistics of these classes in realistic scene images. There are two main categories: Stuff: a small number of classes – mainly ones associated with large regions, such as road, sky, trees, and buildings – account for most of the labeled pixels. Things: people, cars, dogs, mailboxes, vases, stop signs – occupy a small percentage of image pixels and have relatively few instances each.

Introduction Problems with sliding window detectors: They return only bounding box hypotheses, and obtaining segmentation hypotheses from them is challenging. They do not work well for classes with few training examples and large intra-class variation.

Introduction – parsing pipeline
1. Obtain a retrieval set of globally similar training images.
2. Compute the region-based data term (ER) using the SuperParsing system.
3. Compute the detector-based data term (ED): run per-exemplar detectors for the exemplars in the retrieval set, transfer masks from all detections above a set detection threshold to the test image, and take the sum of these masks scaled by their detection scores.
4. Combine the two data terms by training an SVM on the concatenation of ED and ER.
5. Smooth the SVM output (ESVM) using an MRF.
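As a rough, self-contained sketch of the data flow in the steps above (the per-class SVM is replaced here by a fixed linear combination with hypothetical weights; the real system trains one SVM per class on the concatenated data terms):

```python
import numpy as np

def combine_data_terms(e_r, e_d, w_r=1.0, w_d=1.0):
    # Stand-in for the per-class SVM: a fixed linear combination of the
    # region-based (e_r) and detector-based (e_d) data terms.
    # Both arrays have shape (H, W, num_classes).
    return w_r * e_r + w_d * e_d

def label_image(e_svm):
    # Take the highest-scoring class at each pixel (before MRF smoothing).
    return np.argmax(e_svm, axis=-1)

# Toy 2x2 image with 3 classes; the detector term is zero everywhere here.
e_r = np.array([[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
                [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4]]])
e_d = np.zeros_like(e_r)
labels = label_image(combine_data_terms(e_r, e_d))
```

In the full system the combination is learned rather than fixed, and the per-pixel argmax is replaced by MRF inference.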

Region-based parsing Based on [27] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. IJCV, 101(2):329–349, Jan 2013. Global features are extracted and used to compute image similarity for the retrieval set.

Region-based parsing
1. Find a retrieval set of images similar to the query image.
2. Segment the query image into superpixels and compute feature vectors for each superpixel.
3. For each superpixel and each feature type, find the nearest-neighbor superpixels in the retrieval set.
4. Compute a likelihood score for each class based on the superpixel matches.
5. Use the computed likelihoods together with pairwise co-occurrence energies in a Markov Random Field (MRF) framework to compute a global labeling of the image.

Region-based parsing Matches are used to produce a log-likelihood ratio score for label c at region si. Use this score to define our region-based data term ER for each pixel p and class c:
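The equation on this slide was an image and did not survive transcription. Based on the description above – with s_p the superpixel containing pixel p and L(s_p, c) its log-likelihood ratio score for class c – a plausible reconstruction (my reading, not copied from the paper) is:

```latex
E_R(p, c) = L(s_p, c)
```

That is, every pixel simply inherits the log-likelihood ratio score of the superpixel that contains it.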

Detector-based parsing Based on [19] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.

Detector-based parsing We train a per-exemplar detector for each labeled object instance in our dataset. While it may seem intuitive to train detectors only for “thing” categories, we train them for all categories, including ones seemingly inappropriate for a sliding window approach, such as “sky.” At test time, given an image that needs to be parsed, we first obtain a retrieval set of globally similar training images. Then we run the detectors associated with the first k instances of each class in that retrieval set. (The instances are ranked in decreasing order of the similarity of their image to the test image; different instances in the same image are ranked arbitrarily.)

Detector-based parsing The detector-based data term ED for a class c and pixel p is simply the sum of all detection masks from that class, weighted by their detection scores:
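The formula itself was an image lost from the transcript, but the description – the sum of all detection masks weighted by their detection scores – can be sketched directly (function and variable names below are illustrative, not from the paper's code):

```python
import numpy as np

def detector_data_term(masks, scores, shape):
    # Accumulate the detector-based data term for one class:
    # each transferred detection mask contributes its detection score
    # to every pixel it covers.
    e_d = np.zeros(shape, dtype=float)
    for mask, score in zip(masks, scores):
        e_d += score * mask.astype(float)
    return e_d

# Toy example: two overlapping detections on a 4x4 image.
m1 = np.zeros((4, 4), dtype=bool); m1[:2, :2] = True    # score 1.0
m2 = np.zeros((4, 4), dtype=bool); m2[1:3, 1:3] = True  # score 0.5
e_d = detector_data_term([m1, m2], [1.0, 0.5], (4, 4))
```

Pixels covered by both detections receive the sum of both scores; pixels covered by neither stay at zero.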

SVM Combination and MRF Smoothing For a test image, each pixel p and each class c has two data terms: ER(p, c) and ED(p, c). We combine them with a per-class SVM; training data for each SVM is generated by running region- and detector-based parsing on the entire training set. Finally, we smooth the labels with an MRF energy function.

SVM Combination and MRF Smoothing The resulting amount of training data is huge: our largest dataset has over nine billion pixels, which would require 30 terabytes of storage. To make SVM training feasible, we must subsample the data – a tricky task given the unbalanced class frequencies in our many-category datasets. Subsampling the data uniformly (i.e., reducing the number of points by a fixed factor without regard to class labels) preserves the relative class frequencies, but in practice it heavily biases the classifier towards the more common classes. Conversely, subsampling the data so that each class has a roughly equal number of points produces a bias towards the rare classes. We have found that combining these two schemes in a 2:1 ratio – obtaining 67% of the training data by uniform sampling and 33% by even per-class sampling – achieves a good trade-off on all our datasets.
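A minimal sketch of the 2:1 mixed subsampling scheme (67% uniform, 33% even per-class), assuming the pixel labels are stored in a flat integer array; the helper below is illustrative, not the paper's implementation:

```python
import numpy as np

def mixed_subsample(labels, n_keep, seed=0):
    # Pick ~2/3 of the kept points uniformly (preserving class frequencies)
    # and ~1/3 evenly across classes (boosting rare classes).
    rng = np.random.default_rng(seed)
    n_uniform = (2 * n_keep) // 3
    n_balanced = n_keep - n_uniform
    idx_uniform = rng.choice(len(labels), size=n_uniform, replace=False)
    classes = np.unique(labels)
    per_class = max(1, n_balanced // len(classes))
    idx_balanced = []
    for c in classes:
        idx_c = np.flatnonzero(labels == c)
        take = min(per_class, len(idx_c))
        idx_balanced.append(rng.choice(idx_c, size=take, replace=False))
    return np.concatenate([idx_uniform, *idx_balanced])

labels = np.array([0] * 90 + [1] * 10)   # heavily skewed toy labels
idx = mixed_subsample(labels, 30)        # 20 uniform + 5 per class
```

With purely uniform sampling the rare class would contribute roughly 10% of the kept points; the balanced third guarantees it a fixed minimum share.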

SVM Combination and MRF Smoothing Let ESVM(pi, ci) denote the response of the SVM for class ci at pixel pi. To obtain the final labeling, we could simply take the highest-scoring label at each pixel, but this produces noisy results. Instead, we smooth the labels with an MRF energy function defined over the field of pixel labels c, where I is the set of all pixels in the image, ε is the set of adjacent pixel pairs, M is the highest expected value of the SVM response, λ is a smoothing constant, and Esmooth(ci, cj) imposes a penalty when two adjacent pixels (pi, pj) are similar but are assigned different labels (ci, cj).
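The energy function on this slide was an image lost in transcription; a reconstruction consistent with the terms defined above (my reading of the description, not copied from the paper) is:

```latex
E(\mathbf{c}) = \sum_{p_i \in I} \bigl( M - E_{\mathrm{SVM}}(p_i, c_i) \bigr)
              + \lambda \sum_{(i,j) \in \varepsilon} E_{\mathrm{smooth}}(c_i, c_j)
```

The first sum rewards high SVM responses (subtracting from M turns a score to be maximized into a penalty to be minimized), and the second sum discourages label changes between similar adjacent pixels.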

Experiments Three challenging datasets:
SIFT Flow: 2,488 training images, 200 test images, and 33 labels; retrieval set size of 200.
LM+SUN: 45,676 images (21,182 indoor and 24,494 outdoor) and 232 labels, split into 45,176 training and 500 test images. Since this is a much larger dataset, the retrieval set size is increased to 800.
CamVid: video sequences taken from a moving vehicle and densely labeled at 1 Hz with 11 class labels. It is much smaller – a total of 701 labeled frames, split into 468 training and 233 test frames.

Experiments On all datasets, we report the overall per-pixel rate, which is dominated by the most common classes, as well as the average of per-class rates, which is dominated by the rarer classes.
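The two metrics can be made concrete with a small example (a toy sketch; class 0 plays the role of a common “stuff” class and class 1 a rare “thing”):

```python
import numpy as np

def pixel_and_class_rates(pred, gt):
    # Overall per-pixel rate: fraction of all pixels labeled correctly.
    per_pixel = float(np.mean(pred == gt))
    # Average per-class rate: mean recall over the classes present in gt,
    # so each class counts equally regardless of its pixel count.
    rates = [float(np.mean(pred[gt == c] == c)) for c in np.unique(gt)]
    return per_pixel, float(np.mean(rates))

gt   = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # class 0 dominates
pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # one rare pixel missed
per_pixel, per_class = pixel_and_class_rates(pred, gt)
```

Here a single mistake on the rare class barely moves the per-pixel rate (0.9) but drops the average per-class rate to 0.75, which is why the two numbers emphasize different classes.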

Running time We examine the computational requirements of our system on the largest dataset, LM+SUN, by timing our MATLAB implementation on a six-core 3.4 GHz Intel Xeon processor with 48 GB of RAM. Average training time per detector is 472.3 seconds; training all of them would require 1,938 days on a single CPU, but we do it on a 512-node cluster in four days. Leave-one-out parsing of the training set takes 939 hours on a single CPU, or about two hours on the cluster. Next, training the set of 232 SVMs takes one hour on a single machine for the linear kernel and ten hours for the approximate RBF. At test time, region-based parsing takes an average of 27.5 seconds per image. The detector-based parser runs an average of 4,842 detectors per image in 47.4 seconds total. The final combination step takes an average of 8.9 seconds for the linear kernel and 124 seconds for the approximate RBF. MRF inference takes only 6.8 seconds per image.

Conclusion We propose an image parsing system that integrates region-based cues with the promising novel framework of per-exemplar detectors. Our current system achieves very promising results, but at a considerable computational cost; reducing this cost is an important future research direction.

Reference Slides shared by author Joseph Tighe (University of North Carolina at Chapel Hill) for the CVPR 2013 oral presentation “Finding Things: Image Parsing with Regions and Per-Exemplar Detectors.”

Thank you for listening!