DLSTM Approach to Video Modeling with Hashing for Large-Scale Video Retrieval Naifan Zhuang, Jun Ye, Kien A. Hua Department of Computer Science, University of Central Florida ICPR 2016. Presented by Naifan Zhuang
Motivation and Background According to a Cisco report, by 2019: A million minutes of video will be shared every second It would take an individual over 5 million years to watch all the video shared in a single month There is an urgent demand for indexing and retrieving this ever-growing volume of video
QBE with Euclidean Distance Query-by-Example (QBE) is effective for image retrieval Ineffective for video retrieval: High-dimensional feature space Curse of dimensionality [Figure: feature extraction pipelines for an image vs. a video]
Video modeling solutions Deterministic Quantization (DQ) based on Hamming distance [21]: Divides the video into equal-length segments Extracts a visual feature for each key frame Encodes a bag-of-words feature by hashing Performs ANN search using Hamming distance Dynamic Temporal Quantization (DTQ) [22]: Variable-length video segments Adapts to the semantic content of the video
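The ANN search above compares binary codes by Hamming distance. A minimal illustration (an assumed sketch, not code from the paper) of comparing two packed codes with XOR and popcount:

```python
# Minimal sketch (not from the paper): Hamming distance between two binary
# hash codes stored as integers, computed with XOR + popcount.

def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which the two codes differ."""
    return bin(code_a ^ code_b).count("1")

# Example: two 8-bit codes differing in 2 positions.
a = 0b10110010
b = 0b10011010
print(hamming_distance(a, b))  # prints 2
```

Because the codes are short binary strings, this comparison is far cheaper than Euclidean distance on high-dimensional real-valued features, which is the point of hashing-based retrieval.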
Video modeling solutions To support very large video databases, it is desirable to substantially reduce the number of feature vectors used Ideally, down to a single segment per video
Video modeling solutions Differential Long Short-Term Memory (DLSTM): Obtains a highly compact, fixed-size representation Hashes it to binary bits for further compression Performs video retrieval based on Hamming distance Evaluated on the UCF101 and MSRActionPairs datasets
Differential LSTM Architecture DLSTM [17] explicitly models information gain with Derivative of States (DoS) Better interprets the dynamic structures present in input image sequences
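To make the DoS idea concrete, here is a rough, assumed sketch of a single recurrent step in which the gates are modulated by the Derivative of States, approximated by the discrete difference d_t = c_t - c_{t-1}. The exact DLSTM equations in [17] differ (weight names, peephole terms, and higher-order DoS are illustrative assumptions here):

```python
import numpy as np

# Hedged sketch (not the authors' code): gates see the previous DoS d_prev,
# i.e. the information gain of the cell state, rather than the raw state.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlstm_step(x, h_prev, c_prev, d_prev, W):
    """One DLSTM-style step; W is a dict of illustrative weights."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["Wi"] @ z + W["Pi"] * d_prev)   # input gate driven by DoS
    f = sigmoid(W["Wf"] @ z + W["Pf"] * d_prev)   # forget gate driven by DoS
    g = np.tanh(W["Wg"] @ z)                      # candidate state
    c = f * c_prev + i * g                        # new cell state
    d = c - c_prev                                # new DoS (information gain)
    o = sigmoid(W["Wo"] @ z + W["Po"] * d)        # output gate sees current DoS
    h = o * np.tanh(c)
    return h, c, d

rng = np.random.default_rng(0)
nx, nh = 4, 3                                     # toy sizes
W = {k: rng.standard_normal((nh, nx + nh)) * 0.1 for k in ("Wi", "Wf", "Wg", "Wo")}
W.update({k: rng.standard_normal(nh) * 0.1 for k in ("Pi", "Pf", "Po")})

h = c = d = np.zeros(nh)
for t in range(5):                                # toy 5-frame sequence
    x = rng.standard_normal(nx)
    h, c, d = dlstm_step(x, h, c, d, W)
print(h.shape)                                    # fixed-size representation
```

Whatever the sequence length, the final hidden state h has a fixed size, which is what makes it usable as the single compact video representation described earlier.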
Pairwise Training and Objective Consider two videos $V_i$ and $V_j$ with lengths $T_i$ and $T_j$ The loss function is defined as: $$L_{i,j} = -\log \frac{1}{1 + \exp\!\left(\beta\, l_{ij}\, \big\| \mathbf{h}^i_{T_i} - \mathbf{h}^j_{T_j} \big\|^2 \right)}$$ where $\beta$ is a normalizing factor and $l_{ij}$ denotes the label similarity between $V_i$ and $V_j$: $$l_{ij} = \begin{cases} +1, & V_i \text{ and } V_j \text{ have the same label} \\ -1, & V_i \text{ and } V_j \text{ have different labels} \end{cases}$$ $\mathbf{h}^i_{T_i}$ and $\mathbf{h}^j_{T_j}$ denote the hidden states of $V_i$ and $V_j$ at the last time step
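The loss above penalizes same-label pairs for being far apart and different-label pairs for being close. A small sketch with toy hidden states and an assumed beta value (neither taken from the paper):

```python
import numpy as np

# Hedged sketch of the pairwise loss; beta and the vectors are toy values.

def pairwise_loss(h_i, h_j, same_label, beta=0.1):
    """L = -log(1 / (1 + exp(beta * l_ij * ||h_i - h_j||^2)))."""
    l_ij = 1.0 if same_label else -1.0
    d2 = float(np.sum((h_i - h_j) ** 2))
    return -np.log(1.0 / (1.0 + np.exp(beta * l_ij * d2)))

h_a = np.array([0.2, -0.5, 1.0])
h_b = np.array([0.1, -0.4, 0.9])   # close to h_a
h_c = np.array([-1.0, 2.0, -0.7])  # far from h_a

# For a same-label pair the loss grows with the squared distance;
# for a different-label pair it shrinks with it.
loss_close = pairwise_loss(h_a, h_b, same_label=True)
loss_far = pairwise_loss(h_a, h_c, same_label=True)
print(loss_close < loss_far)       # prints True
```

So minimizing the loss pulls last-step hidden states of same-label videos together and pushes different-label videos apart, which is what makes simple distance-based retrieval on these representations work.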
Training loss vs. epochs (UCF101) With a larger number of hidden states, the training process converges faster and reaches a lower training cost If the number of hidden states is too large, overfitting may occur
Impact of the Number of Hidden States (without hashing)

Hidden states      100     200     300     400     500
UCF101           38.52   43.13   45.40   46.63   42.33
MSRActionPairs   67.29   74.39   71.13   67.33   61.26

Video retrieval without hashing, using mAP as the metric On UCF101, DLSTM achieves the highest mAP with 400 hidden states (input dimension is 300: top-layer features of a network pre-trained on ILSVRC12, reduced with PCA) On MSRActionPairs, DLSTM achieves the highest mAP with 200 hidden states (input dimension is 162)
Comparison of Modeling Methods (without hashing)

Method           DTW [10]   BoW [16]   DTQ [22]   DLSTM
UCF101             31.02      21.53      36.59     46.63
MSRActionPairs       --                  62.37     74.39

Compared against three other video modeling methods: Dynamic Time Warping (DTW): DTW-based motion template method [10]; suffers from misalignment of videos with varied lengths Bag-of-Words (BoW): BoW based on the HogHoF feature detector [16]; local spatio-temporal features Dynamic Temporal Quantization (DTQ): the aforementioned method [22] DLSTM significantly outperforms the other methods in terms of mAP
Comparison on the UCF101 dataset DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method The simple hashing method encodes a feature value as 0 if it is smaller than the mean of the corresponding dimension, and as 1 otherwise
Comparison on the MSRActionPairs dataset DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method The simple hashing method encodes a feature value as 0 if it is smaller than the mean of the corresponding dimension, and as 1 otherwise
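The simple hashing method above can be sketched in a few lines. The feature matrix here is a toy example, and thresholding against the per-dimension mean of the database is the only assumption carried over from the slides:

```python
import numpy as np

# Hedged sketch of the "simple hashing" scheme: binarize each feature
# dimension against its mean over the database, then compare with Hamming
# distance. The feature values below are toy data.

def mean_threshold_hash(features):
    """1 where a value exceeds its dimension's mean, 0 otherwise."""
    means = features.mean(axis=0)          # per-dimension mean over the database
    return (features > means).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

feats = np.array([[0.9, 0.1, 0.7],        # video 0
                  [0.8, 0.2, 0.6],        # video 1 (similar to video 0)
                  [0.1, 0.9, 0.2]])       # video 2 (dissimilar)
codes = mean_threshold_hash(feats)
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))  # prints: 0 3
```

Even this crude binarization preserves neighborhood structure here: the two similar videos collapse to identical codes, while the dissimilar one ends up at maximal Hamming distance.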
Performance improvement of hashing methods with DLSTM modeling [Figures: results on UCF101 and MSRActionPairs]
Conclusion and Future Work Proposed DLSTM for video modeling Generates highly compact, fixed-length representations for videos of varied lengths The DLSTM feature can further benefit existing image hashing methods Future work: investigate an end-to-end DLSTM-based video hashing algorithm
Thank you! Q & A