DLSTM Approach to Video Modeling with Hashing for Large-Scale Video Retrieval
Naifan Zhuang, Jun Ye, Kien A. Hua
Department of Computer Science, University of Central Florida
ICPR 2016
Presented by Naifan Zhuang
Motivation and Background
- According to a report from Cisco, by 2019:
  - A million minutes of video will be shared every second
  - It would take an individual 5 million years to watch all the videos shared each month
- Urgent demand for indexing and retrieving these ever-increasing videos
QBE with Euclidean Distance
- Query-by-example (QBE) with Euclidean distance is effective for image retrieval
- Ineffective for video retrieval:
  - High-dimensional feature space
  - Curse of dimensionality
- [Figure: feature extraction for an image vs. for a video]
Video modeling solutions
- Deterministic Quantization (DQ) based on Hamming distance [21]:
  - Divides the video into equal-length segments
  - Extracts a visual feature for each key frame
  - Bag-of-words feature encoded by hashing
  - ANN search using Hamming distance (a sketch of this step follows below)
- Dynamic Temporal Quantization (DTQ) [22]:
  - Varied-length video segments
  - Segments follow the semantic content of the video
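As a concrete illustration of the retrieval step both quantization schemes rely on, here is a minimal sketch of approximate nearest-neighbor search over binary codes with Hamming distance. The function names, the 64-bit code length, and the random codes are illustrative assumptions, not part of the original work.

```python
import numpy as np

def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one binary code and a database of binary codes.

    query_code: shape (n_bits,), values in {0, 1}
    db_codes:   shape (n_videos, n_bits), values in {0, 1}
    """
    return np.count_nonzero(db_codes != query_code, axis=1)

def retrieve(query_code, db_codes, top_k=10):
    """Return indices of the top_k database videos closest to the query."""
    d = hamming_distances(query_code, db_codes)
    return np.argsort(d)[:top_k]

# Illustrative usage with random 64-bit codes for 1000 videos.
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 64))
q = rng.integers(0, 2, size=64)
print(retrieve(q, db, top_k=5))
```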
Video modeling solutions
- To support very large video databases, it is desirable to substantially reduce the number of feature vectors used per video
- Ideally, to only ONE segment
Video modeling solutions
- Differential Long Short-Term Memory (DLSTM):
  - Obtains a highly compact, fixed-size representation for each video
  - The representation is then hashed to binary bits for further compression
  - Video retrieval is performed based on Hamming distance
- Evaluated on the UCF101 and MSRActionPairs datasets
Differential LSTM Architecture
- DLSTM [17] explicitly models information gain with the Derivative of States (DoS)
- Better interprets the dynamic structures present in input image sequences (a minimal cell sketch follows below)
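A minimal first-order sketch of such a cell, assuming the gates are additionally conditioned on the DoS (the difference of consecutive cell states), following the differential RNN idea of [17]. The exact gate wiring, class name, and dimensions below are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DLSTMCell:
    """Minimal first-order Differential LSTM cell (illustrative, numpy only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        W = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
        d, h = input_dim, hidden_dim
        self.hidden_dim = h
        # Input (i), forget (f), output (o) gates and candidate (c):
        # each sees x_t and h_{t-1}; the gates additionally see the DoS.
        self.Wx = {g: W(h, d) for g in "ifoc"}
        self.Wh = {g: W(h, h) for g in "ifoc"}
        self.Wd = {g: W(h, h) for g in "ifo"}
        self.b = {g: np.zeros(h) for g in "ifoc"}

    def step(self, x, h_prev, s_prev, dos_prev):
        def gate(g, dos):
            return self.Wx[g] @ x + self.Wh[g] @ h_prev + self.Wd[g] @ dos + self.b[g]
        i = sigmoid(gate("i", dos_prev))
        f = sigmoid(gate("f", dos_prev))
        c = np.tanh(self.Wx["c"] @ x + self.Wh["c"] @ h_prev + self.b["c"])
        s = f * s_prev + i * c          # new cell state
        dos = s - s_prev                # first-order Derivative of States
        o = sigmoid(gate("o", dos))     # output gate sees the current DoS
        h = o * np.tanh(s)
        return h, s, dos

    def encode(self, sequence):
        """Run over a (T, input_dim) sequence; return the last hidden state."""
        h = s = dos = np.zeros(self.hidden_dim)
        for x in sequence:
            h, s, dos = self.step(x, h, s, dos)
        return h

# Illustrative usage: encode a 40-frame sequence of 162-dim features
# (162 and 200 match the MSR settings reported later in the slides).
cell = DLSTMCell(input_dim=162, hidden_dim=200)
video = np.random.default_rng(1).normal(size=(40, 162))
print(cell.encode(video).shape)  # (200,)
```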
Pairwise training and Objective
- Consider two videos $V_i$ and $V_j$ with lengths $T_i$ and $T_j$
- The loss function is defined as
  $$L_{i,j} = -\log \frac{1}{1 + \exp\!\left(\beta\, l_{ij}\, \big\lVert \mathbf{h}^{i}_{T_i} - \mathbf{h}^{j}_{T_j} \big\rVert^2\right)}$$
- where $\beta$ is a normalizing factor and $l_{ij}$ denotes the label-similarity between $V_i$ and $V_j$:
  $$l_{ij} = \begin{cases} +1, & V_i \text{ and } V_j \text{ have the same label} \\ -1, & V_i \text{ and } V_j \text{ have different labels} \end{cases}$$
- $\mathbf{h}^{i}_{T_i}$ and $\mathbf{h}^{j}_{T_j}$ denote the hidden states of $V_i$ and $V_j$ at the last time step
- (a numpy rendering of this loss follows below)
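The loss on the slide can be written down directly; below is a minimal numpy rendering of it. The function name and the example value of beta are illustrative assumptions.

```python
import numpy as np

def pairwise_loss(h_i: np.ndarray, h_j: np.ndarray, same_label: bool, beta: float = 1e-2) -> float:
    """Pairwise loss on the last-time-step hidden states of two videos.

    L = -log( 1 / (1 + exp(beta * l_ij * ||h_i - h_j||^2)) )
    with l_ij = +1 for same-label pairs and -1 otherwise, so minimizing the loss
    pulls same-label representations together and pushes different-label ones apart.
    beta is the normalizing factor; its default value here is an arbitrary example.
    """
    l_ij = 1.0 if same_label else -1.0
    sq_dist = float(np.sum((h_i - h_j) ** 2))
    return float(np.log1p(np.exp(beta * l_ij * sq_dist)))

# Illustrative check: for a same-label pair the loss grows with distance,
# for a different-label pair it shrinks.
a, b = np.zeros(200), np.ones(200)
print(pairwise_loss(a, b, same_label=True), pairwise_loss(a, b, same_label=False))
```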
Differential LSTM Architecture
- DLSTM explicitly models information gain with the Derivative of States (DoS)
- Better interprets the dynamic structures present in input image sequences
Training loss vs. epochs (UCF101)
- With a larger number of hidden states, the training process converges faster and achieves a lower training cost
- If the number of hidden states is too large, overfitting might occur
Impact of No. of Hidden States -- without hashing
Number of hidden states     100      200      300      400      500
UCF101 (mAP)                38.52    43.13    45.40    46.63    42.33
MSRActionPairs (mAP)        67.29    74.39    71.13    67.33    61.26

- Video retrieval without hashing, using mAP as the metric
- For the UCF101 dataset, DLSTM achieves the highest mAP with 400 hidden states
  - Input dimension is 300: top-layer features of a network pre-trained on ILSVRC-12, reduced with PCA
- For the MSRActionPairs dataset, DLSTM achieves the highest mAP with 200 hidden states
  - Input dimension is 162
Comparison of modeling methods -- without hashing
                   DTW [10]    BoW [16]    DTQ [22]    DLSTM
UCF101             31.02       21.53       36.59       46.63
MSRActionPairs     --                      62.37       74.39

- Compared with three other video modeling methods:
  - Dynamic Time Warping (DTW): a DTW-based motion-template method [10], which suffers from misalignment of videos with varied lengths
  - Bag-of-Words (BoW): BoW based on the HOG/HOF feature detector [16], using local spatio-temporal features
  - Dynamic Temporal Quantization (DTQ): the aforementioned method [22]
- DLSTM significantly outperforms the other methods in terms of mAP
Comparison on UCF101 dataset
- DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method
- Simple hashing: if a feature value is smaller or larger than the mean of the corresponding dimension, it is encoded as 0 or 1, respectively (a sketch follows below)
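A minimal sketch of the simple hashing scheme described on this slide. Taking the per-dimension mean over the whole feature database, and the function name, are assumptions for illustration.

```python
import numpy as np

def simple_hash(features: np.ndarray) -> np.ndarray:
    """Binarize DLSTM features by thresholding each dimension at its mean.

    features: (n_videos, dim) real-valued DLSTM representations.
    Returns an (n_videos, dim) array of bits: 1 where a value is larger than
    the mean of its dimension, 0 where it is smaller.
    """
    means = features.mean(axis=0)  # per-dimension mean over the database
    return (features > means).astype(np.uint8)

# Illustrative usage on random 400-dim features for 5 videos.
feats = np.random.default_rng(2).normal(size=(5, 400))
codes = simple_hash(feats)
print(codes.shape, codes.dtype)
```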
Comparison on MSRActionPairs dataset
- DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method
- Simple hashing: if a feature value is smaller or larger than the mean of the corresponding dimension, it is encoded as 0 or 1, respectively
Performance improvement of hashing methods with DLSTM modeling
[Charts: improvement on UCF101 and MSR]
Conclusion and Future Work
- Proposed DLSTM for video modeling
  - Generates highly compact, fixed-length representations for videos of varied lengths
  - DLSTM features can further benefit existing image hashing methods
- Future work
  - Investigate an end-to-end DLSTM-based video hashing algorithm
Thank you! Q & A