Towards Efficient Learning of Optimal Spatial Bag-of-Words Representations Lu Jiang 1, Wei Tong 1, Deyu Meng 2, Alexander G. Hauptmann 1 1 School of Computer.

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

TRECVID 2011 Surveillance Event Detection Speaker: Lu Jiang Longfei Zhang, Lu Jiang, Lei Bao, Shohei Takahashi, Yuanpeng Li, Alexander.
Context-based object-class recognition and retrieval by generalized correlograms by J. Amores, N. Sebe and P. Radeva Discussion led by Qi An Duke University.
A Comparison of Rule-Based versus Exemplar-Based Categorization Using the ACT-R Architecture Matthew F. RUTLEDGE-TAYLOR, Christian LEBIERE, Robert THOMSON,
Aggregating local image descriptors into compact codes
DONG XU, MEMBER, IEEE, AND SHIH-FU CHANG, FELLOW, IEEE Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment.
Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
Three things everyone should know to improve object retrieval
Foreground Focus: Finding Meaningful Features in Unlabeled Images Yong Jae Lee and Kristen Grauman University of Texas at Austin.
Algorithms + L. Grewe.
Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet Computer Science and Engineering Department Electrical and Computer Engineering Department.
Multi-layer Orthogonal Codebook for Image Classification Presented by Xia Li.
Tiled Convolutional Neural Networks TICA Speedup Results on the CIFAR-10 dataset Motivation Pretraining with Topographic ICA References [1] Y. LeCun, L.
Patch to the Future: Unsupervised Visual Prediction
Mixture of trees model: Face Detection, Pose Estimation and Landmark Localization Presenter: Zhang Li.
Robust Object Tracking via Sparsity-based Collaborative Model
Global spatial layout: spatial pyramid matching Spatial weighting the features Beyond bags of features: Adding spatial information.
SUPER: Towards Real-time Event Recognition in Internet Videos Yu-Gang Jiang School of Computer Science Fudan University Shanghai, China
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots Chao-Yeh Chen and Kristen Grauman University of Texas at Austin.
Discriminative and generative methods for bags of features
Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba.
Generic Object Detection using Feature Maps Oscar Danielsson Stefan Carlsson
Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba.
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.
Yajie Miao Florian Metze
DVMM Lab, Columbia UniversityVideo Event Recognition Video Event Recognition: Multilevel Pyramid Matching Dong Xu and Shih-Fu Chang Digital Video and Multimedia.
Large Scale Recognition and Retrieval. What does the world look like? High level image statistics Object Recognition for large-scale search Focus on scaling.
Global and Efficient Self-Similarity for Object Classification and Detection CVPR 2010 Thomas Deselaers and Vittorio Ferrari.
Salient Object Detection by Composition
Jinhui Tang †, Shuicheng Yan †, Richang Hong †, Guo-Jun Qi ‡, Tat-Seng Chua † † National University of Singapore ‡ University of Illinois at Urbana-Champaign.
Real-time Action Recognition by Spatiotemporal Semantic and Structural Forest Tsz-Ho Yu, Tae-Kyun Kim and Roberto Cipolla Machine Intelligence Laboratory,
Problem Statement A pair of images or videos in which one is close to the exact duplicate of the other, but different in conditions related to capture,
EADS DS / SDC LTIS Page 1 7 th CNES/DLR Workshop on Information Extraction and Scene Understanding for Meter Resolution Image – 29/03/07 - Oberpfaffenhofen.
SVCL Automatic detection of object based Region-of-Interest for image compression Sunhyoung Han.
Object Bank Presenter : Liu Changyu Advisor : Prof. Alex Hauptmann Interest : Multimedia Analysis April 4 th, 2013.
Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search Lu Jiang 1, Deyu Meng 2, Teruko Mitamura 1, Alexander G. Hauptmann 1 1 School.
Semantic Information Fusion Shashi Phoha, PI Head, Information Science and Technology Division Applied Research Laboratory The Pennsylvania State.
Object Stereo- Joint Stereo Matching and Object Segmentation Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on Michael Bleyer Vienna.
Bag-of-features models. Origin 1: Texture recognition Texture is characterized by the repetition of basic elements or textons For stochastic textures,
IEEE TRANSSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
Analysis of algorithms Analysis of algorithms is the branch of computer science that studies the performance of algorithms, especially their run time.
Object Detection with Discriminatively Trained Part Based Models
80 million tiny images: a large dataset for non-parametric object and scene recognition CS 4763 Multimedia Systems Spring 2008.
Experiments Test different parking lot images captured in different luminance conditions The test samples include 1300 available parking spaces and 1500.
Automatic Image Annotation by Using Concept-Sensitive Salient Objects for Image Content Representation Jianping Fan, Yuli Gao, Hangzai Luo, Guangyou Xu.
Efficient Subwindow Search: A Branch and Bound Framework for Object Localization ‘PAMI09 Beyond Sliding Windows: Object Localization by Efficient Subwindow.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
In Defense of Nearest-Neighbor Based Image Classification Oren Boiman The Weizmann Institute of Science Rehovot, ISRAEL Eli Shechtman Adobe Systems Inc.
A DISTRIBUTION BASED VIDEO REPRESENTATION FOR HUMAN ACTION RECOGNITION Yan Song, Sheng Tang, Yan-Tao Zheng, Tat-Seng Chua, Yongdong Zhang, Shouxun Lin.
Visual Categorization With Bags of Keypoints Original Authors: G. Csurka, C.R. Dance, L. Fan, J. Willamowski, C. Bray ECCV Workshop on Statistical Learning.
Event retrieval in large video collections with circulant temporal encoding CVPR 2013 Oral.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
Hierarchical Matching with Side Information for Image Classification
CS 1699: Intro to Computer Vision Support Vector Machines Prof. Adriana Kovashka University of Pittsburgh October 29, 2015.
Image Classification for Automatic Annotation
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Carnegie Mellon School of Computer Science Language Technologies Institute CMU Team-1 in TDT 2004 Workshop 1 CMU TEAM-A in TDT 2004 Topic Tracking Yiming.
Feature Selction for SVMs J. Weston et al., NIPS 2000 오장민 (2000/01/04) Second reference : Mark A. Holl, Correlation-based Feature Selection for Machine.
Similarity Measurement and Detection of Video Sequences Chu-Hong HOI Supervisor: Prof. Michael R. LYU Marker: Prof. Yiu Sang MOON 25 April, 2003 Dept.
Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.
1 Bilinear Classifiers for Visual Recognition Computational Vision Lab. University of California Irvine To be presented in NIPS 2009 Hamed Pirsiavash Deva.
Compact Bilinear Pooling
Data Driven Attributes for Action Detection
Learning Mid-Level Features For Recognition
ICCV Hierarchical Part Matching for Fine-Grained Image Classification
Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science
Distributed Learning of Multilingual DNN Feature Extractors using GPUs
CVPR 2014 Orientational Pyramid Matching for Recognizing Indoor Scenes
Objects as Attributes for Scene Classification
المشرف د.يــــاســـــــــر فـــــــؤاد By: ahmed badrealldeen
Presentation transcript:

Towards Efficient Learning of Optimal Spatial Bag-of-Words Representations Lu Jiang 1, Wei Tong 1, Deyu Meng 2, Alexander G. Hauptmann 1 1 School of Computer Science, Carnegie Mellon University 2 School of Mathematics and Statistics, Xi'an Jiaotong University

People CMU Informedia Team Wei TongDeyu Meng Alexander G. Hauptmann

Outline  Motivation  Related Work  Jensen - Shannon Tiling  Experiment Results  Conclusions

Outline  Motivation  Related Work  Jensen - Shannon Tiling  Experiment Results  Conclusions

Spatial Bag-of-Words The Spatial Bag-of-Words (BoW) model has proven one of the most broadly used models in image and video retrieval. It divides an image/video into one or more smaller tiles. The image represented by the concatenated BoW histograms from all the tiles.

Spatial Pyramid Matching (SPM) Spatial Pyramid Matching is a robust extension to spatial BoW Model. Combine a set of predefined partitions (1x1, 2x2, 4x4, etc.) But, are predefined representations in SPM sufficient for multimedia retrieval?

Spatial Pyramid Matching (SPM) Spatial Pyramid Matching is a robust extension to spatial BoW Model. Combine a set of predefined partitions (1x1, 2x2, 4x4, etc.) But, are predefined tilings in SPM sufficient for multimedia retrieval?

Spatial Pyramid Matching (SPM) Spatial Pyramid Matching is a robust extension to spatial BoW Model. Combine a set of predefined partitions (1x1, 2x2, 4x4, etc.) But, are predefined representations SPM sufficient for multimedia retrieval?

IBM’s TRECVID 12 Semantic Indexing

SRI Sarnoff’s 12 Multimedia Event Detection

CMU’s TRECVID 11 Surveillance Event Detection

Motivation Spatial Representation is fundamental to multimedia retrieval. – Semantic objects/concepts indexing. – Multimedia event retrieval. – Surveillance event detection, etc. Different spatial representations can affects results considerably.

Semi-Manual Approach A straightforward way to find optimal representations [1,2]: – Manually design representation candidates. – Verify the candidates by running the classifier. Cons: – Require manual effort. – Computationally infeasible to verify all the candidates. [1] W. Tong, Y. Yang, L. Jiang, S. I. Yu, Z. Lan, Z. Ma, W. Sze, E. Younessian, and A. G. Hauptmann. E-LAMP: integration of innovative ideas for multimedia event detection. Machine Vision and Applications, pages 1–11, [2] V. Viitaniemi and J. Laaksonen. Spatial extensions to bag of visual words. In ACM CIVR, 2009.

Motivation Manually designing representations is never an easy thing. Our goal: – Automatically learn salient spatial representations from data. – Efficient enough to run on large-scale data.

Outline  Motivation  Related Work  Jensen - Shannon Tiling  Experiment Results  Conclusions

Comparison with Related Work Existing studies learn the representations with the classifiers [3,4,5]. Reasonable Improvements. Time consuming. Low cost-effective. 2,000 core hours for 2% MAP (worth doing?) [3] J. Feng, B. Ni, Q. Tian, and S. Yan. Geometric lp-norm feature pooling for image classification. In CVPR, [4] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, [5] G. Sharma and F. Jurie. Learning discriminative spatial representation for image classification. In BMVC, 2011.

Comparison with Related Work Existing studies learn the representations with the classifiers [3,4,5]. JS(Jensen-Shannon)- Tiling directly captures representations at lower BoW level, independent of the classifier. Reasonable Improvements. Time consuming. Low cost-effective. 2,000 core hours for 2% MAP (worth doing?) Decent improvements. Orders of magnitude faster. High cost-effective. BoW Distribution

Comparison with Related Work Existing Work learn the representations with the classifiers [3,4,5]. JS Tiling directly captures them at lower BoW level, independent of the classifier. Embedded method in feature selection. Filter method in feature selection. Efficiency. Generalizability. BoW Distribution

Proposed Approach JS(Jensen-Shannon)-Tiling offers a solution because it is: – Learn salient representations automatically from data. – Applicably to large-scale datsets. It is an important component in CMU Teams' final submission in TRECVID 2012 Multimedia Event Detection[1].

Outline  Motivation  Related Work  Jensen - Shannon Tiling  Experiment Results  Conclusions

Problem Formulation A mask is a predefined partition. More representations can be derived by combining the tiles in the mask. Each representation is called a tiling.

Problem Formulation A mask is a predefined partition. More representations can be derived by combining the tiles in the mask. Each representation is called a tiling. 1,427 More

Problem Formulation Problem: Find optimal tilings for a given mask. Proposed approach: – Systematically generate all possible tilings from the given mask. – Efficiently evaluate each tiling without running classifiers.

Problem Formulation Problem: Find optimal tilings for a given mask. Proposed approach: – Systematically generate all possible tilings from the given mask. – Efficiently evaluate each tiling without running classifiers.

Tiling Definition Tiling can be defined based on the set-partition theory. Divide a set as a union of non-overlapping and non-empty subsets.

Tiling Definition Tiling can be defined based on the set-partition theory. Divide a set as a union of non-overlapping and non-empty subsets. A tiling can be defined as: – A complete partition of mask into non-overlapping area. – Each partition (tile) is visually adjacent[3].

Tiling Definition Tiling can be defined based on the set-partition theory. Divide a set as a union of non-overlapping and non-empty subsets. A tiling can be defined as: – A complete partition of mask into non-overlapping area. – Each partition (tile) is visually adjacent[3]. identical to the connected components in the graph.

Tiling Generation NP-hard problem. But given reasonable masks, it is solvable. Algorithm (Loop until termination): 1)Generate a set partition candidate; 2)Test whether this candidate obeys the adjacency constraint; Visual adjacency constraint significantly reduces the number of candidates.

Tiling Generation NP-hard problem. But given reasonable masks, it is solvable. Algorithm (Loop until termination): 1)Generate a set partition candidate; 2)Test whether this candidate obeys the adjacency constraint; Visual adjacency constraint significantly reduces the number of candidates.

Problem Formulation Problem: Find optimal tilings for a given mask. Proposed approach: – Systematically generate all possible tilings from the given mask. – Efficiently evaluate each tiling without running classifiers.

Tiling Evaluation Intuitively an optimal tiling would separate the positive and negative samples with the maximum distance. The distance is evaluated w.r.t Kullback-Leibler (KL) divergence. Symmetric version called Jensen-Shannon (JS) divergence. average word distributions of positive and negative samples generated by the tiling. is the tiling to evaluate.

Tiling Evaluation Consistent with the distribution separability principle in [6]. We prove that the negative JS divergence is approximately an upper bound of the training error of a weighted K- Nearest Neighbor classifier K = N. Justify why the computationally inexpensive divergence can be a proxy to the computationally expensive classifier. [6]Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In ICML, 2010.

Tiling Evaluation Consistent with the distribution separability principle in [6]. We prove that the negative JS divergence is approximately an upper bound of the training error of a weighted K- Nearest Neighbor classifier K = N. Justify why the computationally inexpensive divergence can be a proxy to the computationally expensive classifier.

Outline  Motivation  Related Work  Jensen - Shannon Tiling  Experiment Results  Conclusions

Comparison with state-of-the-art Consistently outperforms the SPM across datasets on scene/object recognition and event detection. Comparable or even better results with existing methods.

Reasons for the Improvement 1) Capture more salient spatial representations than SPM. The results are on 15 scene category dataset. Proposed Method Predefined tilings in SPM

Reasons for the Improvement 1) Capture more salient spatial representations than SPM. 2) Substantially augment the choices of representations. The results are on 15 scene category dataset.

Learned Tiling on SED dataset Heat maps are plotted based on manual annotations. Tilings are learned without using annotations. Learned tilings are more sensible than predefined tilings.

Runtime Comparison Compare the runtime with tiling selection by running classifiers. Search a space of 1,434 tilings. A single core Intel Core i7 with 4G memory. Orders of magnitude faster than running classifiers. Substantiate the theoretical complexity analysis.

Outline  Motivation  Related Work  Jensen - Shannon Tiling  Experiment Results  Conclusions

Summary A few messages to take away from this talk: – JS Tiling provides a efficient solution to automatically learn salient BoW representations for large-scale datasets. – JS Tiling consistently outperforms the spatial pyramid matching across datasets. Comparable or even better performance with existing methods. J

Beyond BoW representation Tokyo TechCanon’s 2012 AXES’s

Beyond spatial representation Temporal tiling – Determine optimal sliding window sizes. Time

Aspects to be Improved The tilings learned from different masks are not directly comparable. A practical trick: – Start with a number of masks. – Use JS-Tiling to find a couple of salient tilings from the huge search space. – Run classifiers on these tilings on the validation dataset, and fuse promising ones to obtain better performance. Sampling bias for small tiles (overestimate the distance). – Equal tiling can avoid this bias. – Study the smoothing function.

Acknowledgement This work was partially supported by Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.