Event retrieval in large video collections with circulant temporal encoding CVPR 2013 Oral.


Outline 1. Introduction 2. EVVE: an event retrieval dataset 3. Frame description 4. Circulant temporal aggregation 5. Indexing strategy and complexity 6. Experiments 7. Conclusion

1. Introduction
* This paper introduces an approach for specific event retrieval, a task related to video copy detection [13] and event category recognition [16].
* Searching for a specific event is related to finding deformed copies of a video.
* The method measures the similarity between two sequences over all possible alignments: the frame descriptors are jointly encoded in the frequency domain, and the representation is refined by frequency pruning, regularization and quantization.

1. Introduction
* Contributions:
1. Encode the frame descriptors of a video into a single temporal representation.
2. Exploit the properties of circulant matrices to compare videos in the frequency domain.
3. A dataset named EVVE for specific event retrieval in large user-generated video content.

2. EVVE: an event retrieval dataset
* EVVE also includes events for which relevant videos might not correspond to the same instance in place or time.
* Several of the events are localized precisely in time and space.
* Human annotators have marked the videos as either positive or negative; ambiguous videos were removed.
* A set of 100,000 distractor videos was also retrieved.

2. EVVE: an event retrieval dataset

3. Frame description
* Pre-processing: frames are sampled at a fixed rate of 15 fps and resized to a maximum of 120k pixels.
* Local description: SIFT descriptors are extracted on a dense grid. The SIFT components are square-rooted and the descriptor is reduced to 32 dimensions with PCA.
* Descriptor aggregation: the SIFT descriptors of a frame are encoded using MultiVLAD (VLAD: Vector of Locally Aggregated Descriptors). Two VLAD descriptors are obtained from two different codebooks of size 128 and concatenated. Power-law normalization is applied to the vector, and it is reduced by PCA to dimension d.
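The aggregation step can be sketched as follows: a minimal single-codebook VLAD in NumPy. This is only an illustration of the residual-accumulation idea; the paper's MultiVLAD additionally concatenates two such vectors from different codebooks and applies power-law normalization and PCA, which are omitted here.

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors into a VLAD vector: for each local
    descriptor, accumulate its residual to the nearest codebook centroid,
    then flatten and L2-normalize."""
    # Assign each descriptor to its nearest centroid (squared Euclidean).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    k, dim = centroids.shape
    v = np.zeros((k, dim))
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centroids[c]   # residual accumulation
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

rng = np.random.default_rng(0)
sift = rng.normal(size=(200, 32))       # 32-D PCA-reduced SIFT, as in the paper
codebook = rng.normal(size=(128, 32))   # codebook of size 128
frame_desc = vlad(sift, codebook)       # 128 * 32 = 4096-D frame descriptor
```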

4. Circulant temporal aggregation
* The query is a sequence of frame descriptors q = q_0, ..., q_{m-1} and the database video is b = b_0, ..., b_{n-1}, with q_t = 0 if t < 0 or t >= m, and b_t = 0 if t < 0 or t >= n.
* Assumption 1: there is no (or limited) temporal acceleration.
* Assumption 2: the inner product is a good similarity between individual frames.
* Assumption 3: the sum of similarities between the frame descriptors reflects the similarity of the sequences. (In practice, this assumption is not well satisfied, because videos are very self-similar in time.)

4. Circulant temporal aggregation
Circulant encoding of vector sequences
* Column notation: q^i denotes the sequence of values of the i-th descriptor component over time.
* Assuming sequences of equal lengths (n = m), the scores for all shifts can be computed at once with circular convolution, which the convolution theorem turns into an element-wise multiplication of two vectors in the Fourier domain:
s(q, b) = F^-1( sum_i F(q^i)* ⊙ F(b^i) )
where F is the 1-D discrete Fourier transform, * denotes complex conjugation and ⊙ element-wise multiplication.
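The Fourier-domain computation can be checked numerically: summing, over descriptor dimensions, the inverse FFT of the conjugated query spectrum times the database spectrum yields the circular cross-correlation score for every shift at once. A sketch with synthetic descriptors (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 8                        # descriptor dimension, sequence length (n = m)
q = rng.normal(size=(d, n))        # query frames, one row per dimension
b = rng.normal(size=(d, n))        # database frames

# Frequency domain: one 1-D FFT per descriptor dimension, then an
# element-wise product and a single inverse FFT.
score = np.fft.ifft(np.conj(np.fft.fft(q, axis=1)) * np.fft.fft(b, axis=1),
                    axis=1).sum(axis=0).real

# Direct computation: s_delta = sum_t <q_t, b_{t+delta}> (circular shift).
direct = np.array([(q * np.roll(b, -delta, axis=1)).sum()
                   for delta in range(n)])
assert np.allclose(score, direct)
```

The FFT route costs O(d n log n) instead of O(d n^2) for the direct loop, which is the point of the circulant encoding.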

4. Circulant temporal aggregation
Regularized comparison metric (which is more efficient to compute)
* We consider hereafter that the sequences have been preprocessed to have the same length m = n = 2^k.
* Because of the temporal consistency, and more generally the self-similarity of frames in videos, the values of the score vector s(q, b) are noisy and its peak over the shifts is not precisely localized.

4. Circulant temporal aggregation
* An additional per-frequency filter W_i is applied; we compute W_i assuming that the contributions are shared equally across dimensions (in the Fourier domain).

4. Circulant temporal aggregation
* One major drawback is that the denominator in Eqn. 8 may be close to zero.
* This issue is solved by adding a regularization coefficient λ to the denominator.
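The effect of the regularization coefficient can be illustrated on a toy whitened correlation. This is only a sketch: the exact filter of Eqn. 8 is in the paper, and here we assume a Wiener-style denominator |F(q)|^2 + λ, one common choice that exhibits the same near-zero-denominator problem.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
q = rng.normal(size=n)
b = np.roll(q, 3) + 0.05 * rng.normal(size=n)  # shifted, slightly noisy copy

Q, B = np.fft.fft(q), np.fft.fft(b)
lam = 1e-2                                     # regularization coefficient
# Without lam, frequencies where |Q[k]| is near zero would dominate the
# ratio and make the score vector noisy; lam bounds the denominator.
score = np.fft.ifft(np.conj(Q) * B / (np.abs(Q) ** 2 + lam)).real
delta = int(np.argmax(score))                  # recovered time shift
```

Dividing by the (regularized) power spectrum sharpens the peak of the score vector compared to plain correlation, which is the motivation given on the slide.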

4. Circulant temporal aggregation
Boundary detection
* The time shift δ maximizing the score gives the optimal alignment between the two videos.
* In some applications such as video alignment (see Section 6), we also need the boundaries of the matching segments.
* The matching sequence is defined as a set of contiguous t for which the scores S_t are high enough; for this, the database descriptors are reconstructed in the temporal domain from their Fourier representation.
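The boundary-detection step can be sketched as thresholding the per-frame scores and keeping the contiguous run around the best match (the threshold and score values below are illustrative, not the paper's):

```python
import numpy as np

def matching_segment(scores, threshold):
    """Return (start, end) of the contiguous run of above-threshold scores
    containing the global maximum -- a sketch of the slide's 'set of
    contiguous t for which the scores S_t are high enough'."""
    peak = int(np.argmax(scores))
    start = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    end = peak
    while end + 1 < len(scores) and scores[end + 1] >= threshold:
        end += 1
    return start, end

S = np.array([0.1, 0.2, 0.9, 1.0, 0.8, 0.2, 0.1])
seg = matching_segment(S, threshold=0.5)   # -> (2, 4)
```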

5. Indexing strategy and complexity
Frequency domain representation
* The goal is to implement the method presented in Section 4 in an approximate manner.
* A video b of length n is stored as a d × n complex matrix of its Fourier coefficients.
* Frequency pruning: the video representation is reduced by keeping only a fraction of the low-frequency vectors.

5. Indexing strategy and complexity
Complex PQ-codes and metric optimization
* Descriptor sizes: we pre-compute a Fourier descriptor for different zero-padded versions of the query, by noticing that the Fourier descriptor of the concatenation of a signal with itself is obtained, up to a scaling factor, by interleaving zeros between the coefficients of the original Fourier descriptor.
* We use the product quantization technique [9], a compression method that enables efficient compressed-domain comparison and search.
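The property used for the zero-padded query variants can be verified numerically: the DFT of a signal concatenated with itself carries the original coefficients (doubled) at the even frequencies and exact zeros at the odd ones.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=8)
X = np.fft.fft(x)
X2 = np.fft.fft(np.concatenate([x, x]))   # signal concatenated with itself

assert np.allclose(X2[0::2], 2 * X)       # even bins: doubled original spectrum
assert np.allclose(X2[1::2], 0)           # odd bins: exactly zero
```

This is why longer-length Fourier descriptors of the query can be derived cheaply instead of recomputing an FFT per padding length.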

5. Indexing strategy and complexity
* Each database vector y is split into p sub-vectors y_j (j = 1...p). The sub-vectors are separately quantized using k-means quantizers, producing a vector of p indexes.
* The comparison between a query descriptor x and the database vectors is performed in two stages. First, the squared distances between each sub-vector x_j and all the possible centroids are computed and stored in a table.

5. Indexing strategy and complexity
* Second, the squared distance between x and y is approximated as the sum over j of the tabulated squared distances between x_j and the centroid selected for y_j, which only requires p look-ups and additions.
Summary of search procedure and complexity
1. The video is pre-processed and each frame is described as a d-dimensional MultiVLAD descriptor.
2. The descriptor sequence is padded with zeros to the next power of two, and mapped to the Fourier domain using d independent 1-dimensional FFTs.
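The two-stage PQ comparison can be sketched for real-valued vectors. Note the paper uses a complex product quantizer on Fourier vectors; here the codebooks are random stand-ins for trained k-means centroids, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
D, p = 16, 4                 # vector dimension, number of sub-vectors
sub = D // p                 # sub-vector dimension
k = 32                       # centroids per sub-quantizer

# Hypothetical pre-trained sub-quantizers (random stand-ins for k-means).
codebooks = rng.normal(size=(p, k, sub))

def pq_encode(y):
    """Quantize each sub-vector of y to its nearest centroid index."""
    parts = y.reshape(p, sub)
    return np.array([np.argmin(((codebooks[j] - parts[j]) ** 2).sum(axis=1))
                     for j in range(p)])

def pq_asymmetric_dist(x, code):
    """Stage 1: table of squared distances from each query sub-vector to
    every centroid. Stage 2: p look-ups and additions per database code."""
    parts = x.reshape(p, sub)
    table = ((codebooks - parts[:, None, :]) ** 2).sum(axis=2)  # shape (p, k)
    return table[np.arange(p), code].sum()

y = rng.normal(size=D)
x = rng.normal(size=D)
code = pq_encode(y)                    # p indexes, i.e. p bytes when k <= 256
approx_d2 = pq_asymmetric_dist(x, code)
```

With the table precomputed once per query, scanning a large database costs only p look-ups and additions per stored vector.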

5. Indexing strategy and complexity
3. High frequencies are pruned: only a fraction of the frequency vectors are kept. After this step, the video is represented by a reduced number of d-dimensional complex vectors.
4. These vectors are separately encoded with a complex product quantizer, producing a compressed representation of a fixed number of bytes for the whole video.
* Complexity

5. Indexing strategy and complexity

6. Experiments

For TRECVID, the measure is NDCR (lower is better).

6. Experiments

7. Conclusion
* This video representation provides an efficient search scheme.
* Extensive experiments on two video copy detection benchmarks show that the approach improves over the state of the art with respect to accuracy, search time and memory usage.