DLSTM Approach to Video Modeling with Hashing for Large-Scale Video Retrieval
Naifan Zhuang, Jun Ye, Kien A. Hua
Department of Computer Science, University of Central Florida
ICPR 2016
Presented by Naifan Zhuang
Motivation and Background
- According to a report from Cisco, by 2019:
  - A million minutes of video will be shared every second
  - It would take an individual 5 million years to watch all the videos shared each month
- Urgent demand for indexing and retrieving these ever-increasing videos
QBE with Euclidean Distance
- Query-by-example (QBE) with Euclidean distance is effective for image retrieval
- Ineffective for video retrieval:
  - High-dimensional feature space
  - Curse of dimensionality
- [Figure: feature extraction for an image vs. for a video]
Video modeling solutions
- Deterministic Quantization (DQ) based on Hamming distance [21]:
  - Divides the video into equal-length segments
  - Extracts a visual feature for each key frame
  - Bag-of-words feature encoded by hashing
  - ANN search using Hamming distance (a sketch of this step follows below)
- Dynamic Temporal Quantization (DTQ) [22]:
  - Varied-length video segments
  - Segments follow the semantic content of the video
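As a concrete illustration of the retrieval step both quantization schemes rely on, here is a minimal sketch of approximate nearest-neighbor search over binary codes with Hamming distance. The function names, the 64-bit code length, and the random codes are illustrative assumptions, not part of the original work.

```python
import numpy as np

def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one binary code and a database of binary codes.

    query_code: shape (n_bits,), values in {0, 1}
    db_codes:   shape (n_videos, n_bits), values in {0, 1}
    """
    return np.count_nonzero(db_codes != query_code, axis=1)

def retrieve(query_code, db_codes, top_k=10):
    """Return indices of the top_k database videos closest to the query."""
    d = hamming_distances(query_code, db_codes)
    return np.argsort(d)[:top_k]

# Illustrative usage with random 64-bit codes for 1000 videos.
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 64))
q = rng.integers(0, 2, size=64)
print(retrieve(q, db, top_k=5))
```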
Video modeling solutions
- To support very large video databases, it is desirable to substantially reduce the number of feature vectors used per video
- Ideally, to only ONE segment
Video modeling solutions
- Differential Long Short-Term Memory (DLSTM):
  - Obtains a highly compact, fixed-size representation for each video
  - The representation is then hashed to binary bits for further compression
  - Video retrieval is performed based on Hamming distance
- Evaluated on the UCF101 and MSRActionPairs datasets
Differential LSTM Architecture
- DLSTM [17] explicitly models information gain with the Derivative of States (DoS)
- Better interprets the dynamic structures present in input image sequences (a minimal cell sketch follows below)
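A minimal first-order sketch of such a cell, assuming the gates are additionally conditioned on the DoS (the difference of consecutive cell states), following the differential RNN idea of [17]. The exact gate wiring, class name, and dimensions below are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DLSTMCell:
    """Minimal first-order Differential LSTM cell (illustrative, numpy only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        W = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
        d, h = input_dim, hidden_dim
        self.hidden_dim = h
        # Input (i), forget (f), output (o) gates and candidate (c):
        # each sees x_t and h_{t-1}; the gates additionally see the DoS.
        self.Wx = {g: W(h, d) for g in "ifoc"}
        self.Wh = {g: W(h, h) for g in "ifoc"}
        self.Wd = {g: W(h, h) for g in "ifo"}
        self.b = {g: np.zeros(h) for g in "ifoc"}

    def step(self, x, h_prev, s_prev, dos_prev):
        def gate(g, dos):
            return self.Wx[g] @ x + self.Wh[g] @ h_prev + self.Wd[g] @ dos + self.b[g]
        i = sigmoid(gate("i", dos_prev))
        f = sigmoid(gate("f", dos_prev))
        c = np.tanh(self.Wx["c"] @ x + self.Wh["c"] @ h_prev + self.b["c"])
        s = f * s_prev + i * c          # new cell state
        dos = s - s_prev                # first-order Derivative of States
        o = sigmoid(gate("o", dos))     # output gate sees the current DoS
        h = o * np.tanh(s)
        return h, s, dos

    def encode(self, sequence):
        """Run over a (T, input_dim) sequence; return the last hidden state."""
        h = s = dos = np.zeros(self.hidden_dim)
        for x in sequence:
            h, s, dos = self.step(x, h, s, dos)
        return h

# Illustrative usage: encode a 40-frame sequence of 162-dim features
# (162 and 200 match the MSR settings reported later in the slides).
cell = DLSTMCell(input_dim=162, hidden_dim=200)
video = np.random.default_rng(1).normal(size=(40, 162))
print(cell.encode(video).shape)  # (200,)
```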
Pairwise training and Objective
- Consider two videos $V_i$ and $V_j$ with lengths $T_i$ and $T_j$
- The loss function is defined as
  $$L_{i,j} = -\log \frac{1}{1 + \exp\!\left(\beta\, l_{ij}\, \big\lVert \mathbf{h}^{i}_{T_i} - \mathbf{h}^{j}_{T_j} \big\rVert^2\right)}$$
- where $\beta$ is a normalizing factor and $l_{ij}$ denotes the label-similarity between $V_i$ and $V_j$:
  $$l_{ij} = \begin{cases} +1, & V_i \text{ and } V_j \text{ have the same label} \\ -1, & V_i \text{ and } V_j \text{ have different labels} \end{cases}$$
- $\mathbf{h}^{i}_{T_i}$ and $\mathbf{h}^{j}_{T_j}$ denote the hidden states of $V_i$ and $V_j$ at the last time step
- (a numpy rendering of this loss follows below)
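The loss on the slide can be written down directly; below is a minimal numpy rendering of it. The function name and the example value of beta are illustrative assumptions.

```python
import numpy as np

def pairwise_loss(h_i: np.ndarray, h_j: np.ndarray, same_label: bool, beta: float = 1e-2) -> float:
    """Pairwise loss on the last-time-step hidden states of two videos.

    L = -log( 1 / (1 + exp(beta * l_ij * ||h_i - h_j||^2)) )
    with l_ij = +1 for same-label pairs and -1 otherwise, so minimizing the loss
    pulls same-label representations together and pushes different-label ones apart.
    beta is the normalizing factor; its default value here is an arbitrary example.
    """
    l_ij = 1.0 if same_label else -1.0
    sq_dist = float(np.sum((h_i - h_j) ** 2))
    return float(np.log1p(np.exp(beta * l_ij * sq_dist)))

# Illustrative check: for a same-label pair the loss grows with distance,
# for a different-label pair it shrinks.
a, b = np.zeros(200), np.ones(200)
print(pairwise_loss(a, b, same_label=True), pairwise_loss(a, b, same_label=False))
```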
Differential LSTM Architecture
- DLSTM explicitly models information gain with the Derivative of States (DoS)
- Better interprets the dynamic structures present in input image sequences
Training loss vs. epochs (UCF101)
- With a larger number of hidden states, the training process converges faster and achieves a lower training cost
- If the number of hidden states is too large, overfitting might occur
Impact of No. of Hidden States -- without hashing
Number of hidden states     100      200      300      400      500
UCF101 (mAP)                38.52    43.13    45.40    46.63    42.33
MSRActionPairs (mAP)        67.29    74.39    71.13    67.33    61.26

- Video retrieval without hashing, using mAP as the metric
- For the UCF101 dataset, DLSTM achieves the highest mAP with 400 hidden states
  - Input dimension is 300: top-layer features of a network pre-trained on ILSVRC-12, reduced with PCA
- For the MSRActionPairs dataset, DLSTM achieves the highest mAP with 200 hidden states
  - Input dimension is 162
Comparison of modeling methods -- without hashing
                   DTW [10]    BoW [16]    DTQ [22]    DLSTM
UCF101             31.02       21.53       36.59       46.63
MSRActionPairs     --                      62.37       74.39

- Compared with three other video modeling methods:
  - Dynamic Time Warping (DTW): a DTW-based motion-template method [10], which suffers from misalignment of videos with varied lengths
  - Bag-of-Words (BoW): BoW based on the HOG/HOF feature detector [16], using local spatio-temporal features
  - Dynamic Temporal Quantization (DTQ): the aforementioned method [22]
- DLSTM significantly outperforms the other methods in terms of mAP
Comparison on UCF101 dataset
- DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method
- Simple hashing: if a feature value is smaller or larger than the mean of the corresponding dimension, it is encoded as 0 or 1, respectively (a sketch follows below)
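A minimal sketch of the simple hashing scheme described on this slide. Taking the per-dimension mean over the whole feature database, and the function name, are assumptions for illustration.

```python
import numpy as np

def simple_hash(features: np.ndarray) -> np.ndarray:
    """Binarize DLSTM features by thresholding each dimension at its mean.

    features: (n_videos, dim) real-valued DLSTM representations.
    Returns an (n_videos, dim) array of bits: 1 where a value is larger than
    the mean of its dimension, 0 where it is smaller.
    """
    means = features.mean(axis=0)  # per-dimension mean over the database
    return (features > means).astype(np.uint8)

# Illustrative usage on random 400-dim features for 5 videos.
feats = np.random.default_rng(2).normal(size=(5, 400))
codes = simple_hash(feats)
print(codes.shape, codes.dtype)
```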
Comparison on MSRActionPairs dataset
- DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method
- Simple hashing: if a feature value is smaller or larger than the mean of the corresponding dimension, it is encoded as 0 or 1, respectively
Performance improvement of hashing methods with DLSTM modeling
[Charts: improvement on UCF101 and MSR]
Conclusion and Future Work
- Proposed DLSTM for video modeling
  - Generates highly compact, fixed-length representations for videos of varied lengths
  - DLSTM features can further benefit existing image hashing methods
- Future work
  - Investigate an end-to-end DLSTM-based video hashing algorithm
Thank you! Q & A