A Discriminative CNN Video Representation for Event Detection


1 A Discriminative CNN Video Representation for Event Detection
Zhongwen Xu†, Yi Yang† and Alexander G. Hauptmann‡ †QCIS, University of Technology, Sydney ‡SCS, Carnegie Mellon University

2 Multimedia Event Detection
Detect user-defined events by analyzing the visual, acoustic, and textual information in web videos. An event is a phrase such as "Birthday party", "Wedding ceremony", "Making a sandwich", or "Changing a vehicle tire". MED is part of TRECVID, the largest video analysis competition in the world.

3 MED data
Source: MEDEval 14 dataset, collected from YouTube, uploaded by different users, and carefully labeled by human annotators.
MEDEval 14 dataset:
Number of events: 20
Training data: 32k videos, duration ~1,200 hours
Test data: 200k videos, duration ~8,000 hours, size ~5 TB

4 Video analysis costs a lot
Dense Trajectories and its enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection, with superior performance over other features such as the motion feature STIP and the static appearance feature Dense SIFT. Credits: Heng Wang

5 Video analysis costs a lot
Even with 1,000 cores running in parallel, it takes about one week to extract the IDT features for the 200,000 videos (8,000 hours of content) in the TRECVID MEDEval 14 collection, on "Blacklight", a machine with 4,096 CPU cores and 32 TB of shared memory at the Pittsburgh Supercomputing Center.

6 Video analysis costs a lot
It is also a real headache to deal with the I/O load caused by a thousand threads simultaneously reading videos and writing out generated features; the whole system slows down dramatically if the I/O is not coordinated well.

7 Video analysis costs a lot
As a result of the unaffordable computational cost (a cluster with 1,000 cores), it is extremely difficult for a smaller research group with limited computational resources to process large-scale video datasets. It therefore becomes important to propose an efficient representation for complex event detection that requires only affordable computational resources, e.g., a single machine.

8 Turn to CNNs? One intuitive idea is to utilize a deep learning approach, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and their fast processing speed, achieved by leveraging the massive parallel processing power of GPUs.

9 Turn to CNNs? However, it has been reported that the event detection performance of CNN-based video representations is WORSE than that of improved Dense Trajectories in TRECVID MED 2013.

10 Average Pooling for Videos
Winning solution for the TRECVID MED 2013 competition

11 Average Pooling of CNN frame features
Convolutional Neural Networks (CNNs) with the standard approach (average pooling) are used to generate a video representation from frame-level features. What's wrong with the CNN video representation?

Feature (mAP)                 MEDTest 13   MEDTest 14
Improved Dense Trajectories   34.0         27.6
CNN in 2013                   29.0         N.A.
CNN from VGG-16               32.7         24.8
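As a rough illustration of this average-pooling baseline, the Python sketch below mean-pools frame-level CNN descriptors (e.g. fc6/fc7 activations) into a single video vector. It is an assumption about the pipeline, not the authors' exact code; the L2 normalization step is a common choice before a linear SVM, not a detail given on the slide.

```python
import numpy as np

def average_pool(frame_features):
    """Average-pool frame-level CNN descriptors into one video vector.

    frame_features: (num_frames, dim) array, e.g. fc6/fc7 activations
    extracted from the sampled frames of a single video.
    """
    video_vec = frame_features.mean(axis=0)
    # L2 normalization (assumed here) so videos of different lengths are comparable.
    return video_vec / (np.linalg.norm(video_vec) + 1e-12)

# Usage: placeholder descriptors for a 120-frame video with 4096-D fc6 features.
feats = np.random.rand(120, 4096).astype(np.float32)
video_representation = average_pool(feats)   # shape: (4096,)
```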

12 Video Pooling on CNN Descriptors
Video pooling computes the video representation over the whole video by pooling all the descriptors from all of its frames. For local descriptors like HOG, HOF, and MBH in improved Dense Trajectories, the Fisher vector (FV) or Vector of Locally Aggregated Descriptors (VLAD) is applied to generate the video representation. To our knowledge, this is the first work on video pooling of CNN descriptors; we broaden encoding methods such as FV and VLAD from local descriptors to CNN descriptors in video analysis.

13 Illustration of VLAD encoding
Credits: Prateek Joshi
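Since the original illustration does not survive in this transcript, here is a minimal VLAD encoding sketch in numpy. Assumptions: a k-means codebook learned with scikit-learn, and power plus L2 normalization as commonly used with VLAD; this is not the authors' exact implementation, which relied on the vlfeat/Yael toolkits.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(sampled_descriptors, k=256):
    """Learn a k-means codebook from descriptors sampled over the training set."""
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(sampled_descriptors)

def vlad_encode(descriptors, kmeans):
    """VLAD: accumulate residuals to the nearest centroid, then normalize."""
    centers = kmeans.cluster_centers_              # (k, dim)
    assignments = kmeans.predict(descriptors)      # nearest centroid per descriptor
    k, dim = centers.shape
    vlad = np.zeros((k, dim), dtype=np.float32)
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members) > 0:
            vlad[i] = (members - centers[i]).sum(axis=0)
    vlad = vlad.ravel()
    # signed square-root (power) normalization, then global L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Usage: encode all frame-level CNN descriptors of one video into a k*dim vector.
codebook = train_codebook(np.random.rand(5000, 128).astype(np.float32), k=32)
video_vlad = vlad_encode(np.random.rand(300, 128).astype(np.float32), codebook)
```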

14 Discriminative Ability Analysis on Training Set of TRECVID MEDTest 14

15 Results

Encoding (mAP)    fc6    fc6_relu   fc7    fc7_relu
Average pooling   19.8   24.8       18.8   23.8
Fisher vector     28.3   28.4       27.4   29.1
VLAD              33.1   32.6       33.2   31.5

Table: Performance comparison (mAP in percentage) on MEDTest 14 100Ex
Figure: Performance comparisons on MEDTest 13 and MEDTest 14, both 100Ex and 10Ex

16 Results
For reference, we provide the performance of a number of widely used features on MEDTest 14 (100Ex) for comparison:
MoSIFT with Fisher vector achieves mAP 18.1%
STIP with Fisher vector achieves mAP 15.0%
CSIFT with Fisher vector achieves mAP 14.7%
IDT with Fisher vector achieves mAP 27.6%
Our single layer achieves mAP 33.2%
Note that with VLAD-encoded CNN descriptors, we achieve better performance with only 10Ex (mAP 20.8%) than the relatively weaker features such as MoSIFT, STIP, and CSIFT achieve with 100Ex!

17 Features from Convolutional Layers
Credits: Matthew Zeiler

18 Latent Concept Descriptors (LCD)
Convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept. From this interpretation, a pool5 layer of size a×a×M can be converted into a² latent concept descriptors of M dimensions each. Each latent concept descriptor represents the responses of the M filters at a specific pooling location.
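A minimal sketch of this reshaping step, assuming the pool5 map is stored channel-first (M×a×a) as in Caffe; this is an illustration, not the authors' exact code.

```python
import numpy as np

def latent_concept_descriptors(pool5_map):
    """Turn an M x a x a pool5 map into a*a descriptors of dimension M.

    Each spatial location (i, j) yields one latent concept descriptor:
    the responses of all M filters at that location.
    """
    M, a, _ = pool5_map.shape
    # (M, a, a) -> (a, a, M) -> (a*a, M)
    return pool5_map.transpose(1, 2, 0).reshape(a * a, M)

# Usage: VGG-16 pool5 on a 224x224 input is 512 x 7 x 7 -> 49 descriptors of 512-D.
pool5 = np.random.rand(512, 7, 7).astype(np.float32)
lcd = latent_concept_descriptors(pool5)   # shape: (49, 512)
```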

19 Latent Concept Descriptors (LCD)

20 LCD with SPP
LCD can be combined with a Spatial Pyramid Pooling (SPP) layer to enrich the visual information at only a marginal increase in computational cost. The last convolutional layer is pooled into 6x6, 3x3, 2x2, and 1x1 regions, each region keeping the responses of the M filters; see the sketch below.
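A rough sketch of that pyramid pooling, assuming max pooling within each bin and a channel-first conv map; this is a plausible reading of the slide, not necessarily the exact SPP variant used in the paper.

```python
import numpy as np

def spp_descriptors(conv_map, levels=(6, 3, 2, 1)):
    """Pool an M x H x W conv map into a pyramid of bins.

    Each bin of each level yields one M-dimensional descriptor (max pooling
    here), giving 36 + 9 + 4 + 1 = 50 descriptors for the default levels.
    """
    M, H, W = conv_map.shape
    descriptors = []
    for n in levels:
        # bin boundaries, rounded so the bins cover the whole map
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                patch = conv_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                    xs[j]:max(xs[j + 1], xs[j] + 1)]
                descriptors.append(patch.max(axis=(1, 2)))
    return np.stack(descriptors)   # shape: (sum of n*n over levels, M)

# Usage with a VGG-16 conv5 map (512 x 14 x 14, before the last pooling).
conv5 = np.random.rand(512, 14, 14).astype(np.float32)
lcd_spp = spp_descriptors(conv5)   # shape: (50, 512)
```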

21 LCD Results on pool5

MEDTest 13 (mAP)   100Ex   10Ex
Average pooling    31.2    18.8
LCD_VLAD           38.2    25.0
LCD_VLAD + SPP     40.3    25.6
Table 1: Performance comparisons for pool5 on MEDTest 13

MEDTest 14 (mAP)   100Ex   10Ex
Average pooling    24.6    15.3
LCD_VLAD           33.9    22.8
LCD_VLAD + SPP     35.7    23.2
Table 2: Performance comparisons for pool5 on MEDTest 14

22 LCD for image analysis Deep filter banks for texture recognition and segmentation, M. Cimpoi, S. Maji and A. Vedaldi, in CVPR, 2015 (Oral) Deep Spatial Pyramid: The Devil is Once Again in the Details, B. Gao, X. Wei, J. Wu and W. Lin, arXiv, 2015

23 Comparisons with the previous best feature (IDT)

Dataset (mAP)      Ours   IDT    Relative improvement
MEDTest 13 100Ex   44.6   34.0   31.2%
MEDTest 13 10Ex    29.8   18.0   65.6%
MEDTest 14 100Ex   36.8   27.6   33.3%
MEDTest 14 10Ex    24.5   13.9   76.3%

24 Comparison to the state-of-the-art systems on MEDTest 13
Natarajan et al. report mAP 38.5% on 100Ex and 17.9% on 10Ex from their whole visual system combining all their low-level visual features. Lan et al. report mAP 39.3% on 100Ex from their whole system, including non-visual features. Our results achieve 44.6% mAP on 100Ex and 29.8% mAP on 10Ex. Ours + IDT + MFCC achieves 48.6% mAP on 100Ex and 32.2% mAP on 10Ex. Our single feature beats state-of-the-art MED systems that combine more than 20 features. Our lightweight system (static + motion + acoustic features) sets a high standard for MED.

25 Notes
The proposed representation is extensible: performance can be further improved by better CNN models, appropriate fine-tuning, or better descriptor encoding techniques. The proposed representation is general for video analysis and not limited to multimedia event detection; we tested on the MED datasets since they are the largest available video analysis datasets in the world. The proposed representation is simple but very effective: it is easy to generate using the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and the vlfeat/Yael toolkits (for the encoding).
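For instance, frame-level features can be pulled out with pycaffe roughly as below. This is only a sketch: the deploy prototxt and weight file names are placeholders for the public VGG-16 release, and the layer choice and frame sampling are up to the user, not details taken from the slides.

```python
import numpy as np
import caffe

caffe.set_mode_gpu()
# Placeholder file names for the public VGG-16 model release.
net = caffe.Net('VGG_ILSVRC_16_layers_deploy.prototxt',
                'VGG_ILSVRC_16_layers.caffemodel', caffe.TEST)

# Standard pycaffe preprocessing: HWC -> CHW, RGB -> BGR, mean subtraction, 0-255 scale.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))  # VGG BGR means
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

def extract_frame_features(frame_paths, layer='fc6'):
    """Return a (num_frames, dim) array of CNN activations for sampled frames."""
    net.blobs['data'].reshape(1, 3, 224, 224)
    feats = []
    for path in frame_paths:
        image = caffe.io.load_image(path)            # float RGB in [0, 1]
        net.blobs['data'].data[...] = transformer.preprocess('data', image)
        net.forward()
        feats.append(net.blobs[layer].data[0].copy())
    return np.vstack(feats)
```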

26 THUMOS’ 15 Action Recognition Challenge
Action Recognition in Temporally Untrimmed Videos! A new forward-looking dataset containing over 430 hours of video data and 45 million frames (70% larger than THUMOS'14) is made available under this challenge, with the following components:
Training set: over 13,000 temporally trimmed videos from 101 action classes.
Validation set: over 2,100 temporally untrimmed videos with temporal annotations of actions.
Background set: approximately 3,000 relevant videos guaranteed not to include any instance of the 101 actions.
Test set: over 5,600 temporally untrimmed videos with withheld ground truth.

27 Results for THUMOS’ 15 validation set
Setting:
Training data: training part only (UCF-101)
Testing data: validation part in 2015
C = 100 in a linear SVM with the LIBSVM toolkit
Metric: mean Average Precision (mAP)
Comparison between average pooling and VLAD encoding:

Encoding (mAP)     fc6     fc7
Average pooling*   0.521   0.493
VLAD encoding      0.589   0.566

* In average pooling, we use the layer after ReLU since it shows better performance
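A minimal sketch of this classification setup, assuming scikit-learn's LinearSVC in place of the LIBSVM command-line tools, one-vs-rest training, and per-class average precision averaged into mAP; the feature matrices and class count in the usage example are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def train_and_score(X_train, y_train, X_val, y_val, num_classes, C=100.0):
    """One-vs-rest linear SVMs with C=100; report mAP over all classes."""
    aps = []
    for c in range(num_classes):
        clf = LinearSVC(C=C)
        clf.fit(X_train, (y_train == c).astype(int))
        scores = clf.decision_function(X_val)     # real-valued scores for ranking
        aps.append(average_precision_score((y_val == c).astype(int), scores))
    return float(np.mean(aps))

# Usage with placeholder features and 5 hypothetical classes for brevity.
rng = np.random.RandomState(0)
X_train, X_val = rng.rand(200, 1024), rng.rand(80, 1024)
y_train, y_val = np.arange(200) % 5, np.arange(80) % 5
print("mAP:", train_and_score(X_train, y_train, X_val, y_val, num_classes=5))
```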

28 Results for THUMOS’15 validation set
Performance from VLAD-encoded CNN features:

Layer   fc6     fc7     LCD
mAP     0.589   0.566   0.619

29 Results for THUMOS’ 15 validation set
LCD with a better CNN model: GoogLeNet with Batch Normalization (Inception v2)
Batch Normalization, Ioffe and Szegedy, ICML 2015
Trained with the cxxnet toolkit* on 4 NVIDIA K20 GPUs
Timing: about 4.5 days (~40 epochs); for reference, VGG-16 takes 2-3 weeks to train on 4 GPUs
Achieves the same performance as the single network in Google's ILSVRC 2014 submission last year

Model   LCD from VGG-16   LCD from Inception v2
mAP     0.619             0.628

* with great multi-GPU training support

30 Results for THUMOS’ 15 validation set
Feature   LCD with Inception v2   Multi-skip IDT   FlowNet
mAP       0.628                   0.529            0.416

With late fusion of the prediction scores, we can achieve mAP 0.689.
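A simple late-fusion sketch, assuming min-max normalization of each feature's score matrix followed by a weighted average; the slides do not specify the exact fusion scheme or weights.

```python
import numpy as np

def late_fusion(score_matrices, weights=None):
    """Fuse per-feature prediction scores for the same videos and classes.

    score_matrices: list of (num_videos, num_classes) arrays, one per feature
    (e.g. LCD, Multi-skip IDT, FlowNet). Each is min-max normalized to [0, 1]
    before a weighted average, so features with different score scales mix fairly.
    """
    if weights is None:
        weights = np.ones(len(score_matrices)) / len(score_matrices)
    fused = np.zeros_like(score_matrices[0], dtype=np.float64)
    for w, s in zip(weights, score_matrices):
        s = np.asarray(s, dtype=np.float64)
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)
        fused += w * s
    return fused

# Usage: three hypothetical score matrices for 100 videos and 101 classes.
scores = [np.random.rand(100, 101) for _ in range(3)]
fused_scores = late_fusion(scores, weights=[0.5, 0.3, 0.2])
```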

31 THUMOS'15 Ranking

Rank   Entry                     Best Result
1      UTS & CMU                 0.7384
2      MSR Asia (MSM)            0.6897
3      Zhejiang University*      0.6876
4      INRIA_LEAR*               0.6814
5      CUHK & SIAT               0.6803
6      University of Amsterdam   0.6798

* Utilized our CVPR 2015 paper as the main system component

32 Shared features
We share all the features for the MED datasets and the THUMOS 2015 dataset. You can download the features via Dropbox / Baidu Yun; links are on my homepage. The features can be used for machine learning / pattern recognition tasks.

33 Thanks

