DLSTM Approach to Video Modeling with Hashing for Large-Scale Video Retrieval Naifan Zhuang, Jun Ye, Kien A. Hua Department of Computer Science, University of Central Florida ICPR 2016. Presented by Naifan Zhuang
Motivation and Background According to a Cisco report, by 2019: A million minutes of video will be shared every second It would take an individual over 5 million years to watch all the video shared in a single month There is an urgent demand for indexing and retrieving this ever-growing volume of video
QBE with Euclidean Distance Query-by-Example (QBE) is effective for image retrieval Ineffective for video retrieval: High-dimensional feature space Curse of dimensionality [Figure: feature extraction pipelines for an image vs. a video]
Video modeling solutions Deterministic Quantization (DQ) based on Hamming distance [21]: Divides the video into equal-length segments Extracts a visual feature for each key frame Encodes a bag-of-words feature by hashing Performs ANN search using Hamming distance Dynamic Temporal Quantization (DTQ) [22]: Variable-length video segments Adapts to the semantic content of the video
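The ANN search above compares binary codes by Hamming distance. A minimal illustration (an assumed sketch, not code from the paper) of comparing two packed codes with XOR and popcount:

```python
# Minimal sketch (not from the paper): Hamming distance between two binary
# hash codes stored as integers, computed with XOR + popcount.

def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which the two codes differ."""
    return bin(code_a ^ code_b).count("1")

# Example: two 8-bit codes differing in 2 positions.
a = 0b10110010
b = 0b10011010
print(hamming_distance(a, b))  # prints 2
```

Because the codes are short binary strings, this comparison is far cheaper than Euclidean distance on high-dimensional real-valued features, which is the point of hashing-based retrieval.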
Video modeling solutions To support very large video databases, it is desirable to substantially reduce the number of feature vectors used Ideally, down to a single segment per video
Video modeling solutions Differential Long Short-Term Memory (DLSTM): Obtains a highly compact, fixed-size representation Hashes it to binary bits for further compression Performs video retrieval based on Hamming distance Evaluated on the UCF101 and MSRActionPairs datasets
Differential LSTM Architecture DLSTM [17] explicitly models information gain with Derivative of States (DoS) Better interprets the dynamic structures present in input image sequences
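To make the DoS idea concrete, here is a rough, assumed sketch of a single recurrent step in which the gates are modulated by the Derivative of States, approximated by the discrete difference d_t = c_t - c_{t-1}. The exact DLSTM equations in [17] differ (weight names, peephole terms, and higher-order DoS are illustrative assumptions here):

```python
import numpy as np

# Hedged sketch (not the authors' code): gates see the previous DoS d_prev,
# i.e. the information gain of the cell state, rather than the raw state.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlstm_step(x, h_prev, c_prev, d_prev, W):
    """One DLSTM-style step; W is a dict of illustrative weights."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["Wi"] @ z + W["Pi"] * d_prev)   # input gate driven by DoS
    f = sigmoid(W["Wf"] @ z + W["Pf"] * d_prev)   # forget gate driven by DoS
    g = np.tanh(W["Wg"] @ z)                      # candidate state
    c = f * c_prev + i * g                        # new cell state
    d = c - c_prev                                # new DoS (information gain)
    o = sigmoid(W["Wo"] @ z + W["Po"] * d)        # output gate sees current DoS
    h = o * np.tanh(c)
    return h, c, d

rng = np.random.default_rng(0)
nx, nh = 4, 3                                     # toy sizes
W = {k: rng.standard_normal((nh, nx + nh)) * 0.1 for k in ("Wi", "Wf", "Wg", "Wo")}
W.update({k: rng.standard_normal(nh) * 0.1 for k in ("Pi", "Pf", "Po")})

h = c = d = np.zeros(nh)
for t in range(5):                                # toy 5-frame sequence
    x = rng.standard_normal(nx)
    h, c, d = dlstm_step(x, h, c, d, W)
print(h.shape)                                    # fixed-size representation
```

Whatever the sequence length, the final hidden state h has a fixed size, which is what makes it usable as the single compact video representation described earlier.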
Pairwise Training and Objective Consider two videos $V_i$ and $V_j$ with lengths $T_i$ and $T_j$ The loss function is defined as: $$L_{i,j} = -\log \frac{1}{1 + \exp\!\left(\beta\, l_{ij}\, \big\| \mathbf{h}^i_{T_i} - \mathbf{h}^j_{T_j} \big\|^2 \right)}$$ where $\beta$ is a normalizing factor and $l_{ij}$ denotes the label similarity between $V_i$ and $V_j$: $$l_{ij} = \begin{cases} +1, & V_i \text{ and } V_j \text{ have the same label} \\ -1, & V_i \text{ and } V_j \text{ have different labels} \end{cases}$$ $\mathbf{h}^i_{T_i}$ and $\mathbf{h}^j_{T_j}$ denote the hidden states of $V_i$ and $V_j$ at the last time step
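The loss above penalizes same-label pairs for being far apart and different-label pairs for being close. A small sketch with toy hidden states and an assumed beta value (neither taken from the paper):

```python
import numpy as np

# Hedged sketch of the pairwise loss; beta and the vectors are toy values.

def pairwise_loss(h_i, h_j, same_label, beta=0.1):
    """L = -log(1 / (1 + exp(beta * l_ij * ||h_i - h_j||^2)))."""
    l_ij = 1.0 if same_label else -1.0
    d2 = float(np.sum((h_i - h_j) ** 2))
    return -np.log(1.0 / (1.0 + np.exp(beta * l_ij * d2)))

h_a = np.array([0.2, -0.5, 1.0])
h_b = np.array([0.1, -0.4, 0.9])   # close to h_a
h_c = np.array([-1.0, 2.0, -0.7])  # far from h_a

# For a same-label pair the loss grows with the squared distance;
# for a different-label pair it shrinks with it.
loss_close = pairwise_loss(h_a, h_b, same_label=True)
loss_far = pairwise_loss(h_a, h_c, same_label=True)
print(loss_close < loss_far)       # prints True
```

So minimizing the loss pulls last-step hidden states of same-label videos together and pushes different-label videos apart, which is what makes simple distance-based retrieval on these representations work.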
Training loss vs. epochs (UCF101) With a larger number of hidden states, the training process converges faster and reaches a lower training cost If the number of hidden states is too large, overfitting may occur
Impact of the Number of Hidden States (without hashing)

Hidden states      100     200     300     400     500
UCF101           38.52   43.13   45.40   46.63   42.33
MSRActionPairs   67.29   74.39   71.13   67.33   61.26

Video retrieval without hashing, using mAP as the metric On UCF101, DLSTM achieves the highest mAP with 400 hidden states (input dimension is 300: top-layer features of a network pre-trained on ILSVRC12, reduced with PCA) On MSRActionPairs, DLSTM achieves the highest mAP with 200 hidden states (input dimension is 162)
Comparison of Modeling Methods (without hashing)

Method           DTW [10]   BoW [16]   DTQ [22]   DLSTM
UCF101             31.02      21.53      36.59     46.63
MSRActionPairs       --                  62.37     74.39

Compared against three other video modeling methods: Dynamic Time Warping (DTW): DTW-based motion template method [10]; suffers from misalignment of videos with varied lengths Bag-of-Words (BoW): BoW based on the HogHoF feature detector [16]; local spatio-temporal features Dynamic Temporal Quantization (DTQ): the aforementioned method [22] DLSTM significantly outperforms the other methods in terms of mAP
Comparison on the UCF101 dataset DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method The simple hashing method encodes a feature value as 0 if it is smaller than the mean of the corresponding dimension, and as 1 otherwise
Comparison on the MSRActionPairs dataset DTQ + state-of-the-art hashing methods vs. DLSTM + a simple hashing method The simple hashing method encodes a feature value as 0 if it is smaller than the mean of the corresponding dimension, and as 1 otherwise
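The simple hashing method above can be sketched in a few lines. The feature matrix here is a toy example, and thresholding against the per-dimension mean of the database is the only assumption carried over from the slides:

```python
import numpy as np

# Hedged sketch of the "simple hashing" scheme: binarize each feature
# dimension against its mean over the database, then compare with Hamming
# distance. The feature values below are toy data.

def mean_threshold_hash(features):
    """1 where a value exceeds its dimension's mean, 0 otherwise."""
    means = features.mean(axis=0)          # per-dimension mean over the database
    return (features > means).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

feats = np.array([[0.9, 0.1, 0.7],        # video 0
                  [0.8, 0.2, 0.6],        # video 1 (similar to video 0)
                  [0.1, 0.9, 0.2]])       # video 2 (dissimilar)
codes = mean_threshold_hash(feats)
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))  # prints: 0 3
```

Even this crude binarization preserves neighborhood structure here: the two similar videos collapse to identical codes, while the dissimilar one ends up at maximal Hamming distance.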
Performance improvement of hashing methods with DLSTM modeling [Figures: results on UCF101 and MSRActionPairs]
Conclusion and Future Work Proposed DLSTM for video modeling Generates highly compact, fixed-length representations for videos of varied lengths The DLSTM feature can further benefit existing image hashing methods Future work: investigate an end-to-end DLSTM-based video hashing algorithm
Thank you! Q & A