A Discriminative CNN Video Representation for Event Detection


A Discriminative CNN Video Representation for Event Detection
Zhongwen Xu†, Yi Yang†, and Alexander G. Hauptmann‡
†QCIS, University of Technology, Sydney   ‡SCS, Carnegie Mellon University

Multimedia Event Detection
Detect user-defined events by analyzing the visual, acoustic, and textual information in web videos.
An event is a phrase such as "Birthday party", "Wedding ceremony", "Making a sandwich", or "Changing a vehicle tire".
Part of the TRECVID 2011-2015 competition, the largest video analysis competition in the world.

MED data
Source: MEDEval 14 dataset, collected from YouTube and uploaded by different users.
Number of events: 20
Training data: 32k videos, duration ~1,200 hours
Test data: 200k videos, duration ~8,000 hours, size ~5 TB
Carefully human-labeled.

Video analysis costs a lot
Dense Trajectories and their enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection, with superior performance over other features such as the motion feature STIP and the static appearance feature Dense SIFT.
Credits: Heng Wang

Video analysis costs a lot
Even with 1,000 cores running in parallel, it takes about one week to extract IDT features for the 200,000 videos (8,000 hours) in the TRECVID MEDEval 14 collection.
"Blacklight": 4,096 CPU cores and 32 TB of shared memory at the Pittsburgh Supercomputing Center.

Video analysis costs a lot
I/O is also a real headache: one thousand threads reading videos and writing features generate heavy traffic, and the whole system slows down dramatically if the I/O is not coordinated well.

Video analysis costs a lot
Because of this unaffordable computational cost (a cluster with 1,000 cores), it is extremely difficult for smaller research groups with limited computational resources to process large-scale video datasets.
It is therefore important to propose an efficient representation for complex event detection that requires only affordable computational resources, e.g., a single machine.

Turn to CNNs?
A natural idea is to adopt deep learning, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and their fast processing speed, achieved by leveraging the massive parallel processing power of GPUs.

Turn to CNNs?
However, it has been reported that the event detection performance of CNN-based video representations is WORSE than that of improved Dense Trajectories in TRECVID MED 2013.

Average Pooling for Videos
Winning solution for the TRECVID MED 2013 competition

Average Pooling of CNN frame features
Convolutional Neural Networks (CNNs) with the standard approach (average pooling) to generate a video representation from frame-level features.
What's wrong with the CNN video representation?

                              MEDTest 13   MEDTest 14
Improved Dense Trajectories   34.0         27.6
CNN in CMU@MED 2013           29.0         N.A.
CNN from VGG-16               32.7         24.8
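For reference, average pooling amounts to taking the mean of the frame-level CNN descriptors over a video; a minimal sketch (array names are illustrative):

```python
import numpy as np

def average_pool(frame_features):
    """Average-pool frame-level CNN descriptors (num_frames x dim)
    into one video-level vector, followed by L2 normalization."""
    video_vec = frame_features.mean(axis=0)
    return video_vec / (np.linalg.norm(video_vec) + 1e-12)

# e.g. fc7 activations extracted from sampled frames of one video
frame_features = np.random.rand(120, 4096).astype(np.float32)  # placeholder
video_representation = average_pool(frame_features)
```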

Video Pooling on CNN Descriptors
Video pooling computes the video representation by pooling all the descriptors from all the frames in a video.
For local descriptors such as HOG, HOF, and MBH in improved Dense Trajectories, the Fisher vector (FV) or Vector of Locally Aggregated Descriptors (VLAD) is applied to generate the video representation.
To our knowledge, this is the first work on video pooling of CNN descriptors; we extend encoding methods such as FV and VLAD from local descriptors to CNN descriptors in video analysis.

Illustration of VLAD encoding Credits: Prateek Joshi
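A minimal sketch of VLAD encoding applied to frame-level CNN descriptors, assuming a k-means codebook learned on training descriptors (scikit-learn is used here for illustration; the cluster count and descriptor dimensionality are placeholders, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, kmeans):
    """VLAD: for each cluster, sum the residuals between its assigned
    descriptors and the cluster center, then flatten and normalize."""
    k, dim = kmeans.cluster_centers_.shape
    assignments = kmeans.predict(descriptors)
    vlad = np.zeros((k, dim), dtype=np.float32)
    for c in range(k):
        members = descriptors[assignments == c]
        if len(members) > 0:
            vlad[c] = (members - kmeans.cluster_centers_[c]).sum(axis=0)
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))       # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)       # L2 normalization

# Codebook trained on CNN descriptors sampled from training videos
train_descriptors = np.random.rand(10000, 256).astype(np.float32)  # placeholder
kmeans = KMeans(n_clusters=256, n_init=1).fit(train_descriptors)

frame_descriptors = np.random.rand(120, 256).astype(np.float32)    # one video's frames
video_vlad = vlad_encode(frame_descriptors, kmeans)
```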

Discriminative Ability Analysis on Training Set of TRECVID MEDTest 14

Results

                   fc6    fc6_relu   fc7    fc7_relu
Average pooling    19.8   24.8       18.8   23.8
Fisher vector      28.3   28.4       27.4   29.1
VLAD               33.1   32.6       33.2   31.5

Table: Performance comparison (mAP in percentage) on MEDTest 14 100Ex
Figure: Performance comparisons on MEDTest 13 and MEDTest 14, both 100Ex and 10Ex

Results
For further reference, we provide the performance of a number of widely used features on MEDTest 14 100Ex for comparison:
MoSIFT with Fisher vector: mAP 18.1%
STIP with Fisher vector: mAP 15.0%
CSIFT with Fisher vector: mAP 14.7%
IDT with Fisher vector: mAP 27.6%
Our single layer: mAP 33.2%
Note that with VLAD-encoded CNN descriptors we achieve better performance with only 10Ex (mAP 20.8%) than the weaker features such as MoSIFT, STIP, and CSIFT achieve with 100Ex!

Features from Convolutional Layers Credits: Matthew Zeiler

Latent Concept Descriptors (LCD)
Convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept.
Under this interpretation, the pool5 layer of size a×a×M can be converted into a² latent concept descriptors of M dimensions each. Each latent concept descriptor represents the responses of the M filters at a specific pooling location.

Latent Concept Descriptors (LCD)
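A rough sketch of converting the pool5 feature map into latent concept descriptors, one descriptor per spatial location (the layer shape shown is VGG-16's; array names are illustrative):

```python
import numpy as np

def pool5_to_lcd(pool5):
    """Convert a pool5 feature map of shape (M, a, a) -- M filters over an
    a x a grid of pooling locations -- into a*a latent concept descriptors,
    each of dimension M."""
    M, a, _ = pool5.shape
    return pool5.reshape(M, a * a).T          # shape: (a*a, M)

pool5 = np.random.rand(512, 7, 7).astype(np.float32)  # pool5 of one frame (placeholder values)
lcd = pool5_to_lcd(pool5)                               # 49 descriptors of 512 dims
```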

LCD with SPP
LCD can be combined with a Spatial Pyramid Pooling (SPP) layer to enrich the visual information at only a marginal increase in computational cost.
The last convolutional layer is pooled into 6x6, 3x3, 2x2, and 1x1 grids of regions, each with M filters.
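A minimal sketch of the spatial pyramid pooling step, assuming max pooling over each grid cell of the last convolutional feature map (grid sizes follow the slide; helper names and the input shape are illustrative):

```python
import numpy as np

def spp_lcd(conv_map, grids=(6, 3, 2, 1)):
    """Max-pool a conv feature map of shape (M, H, W) over each cell of
    several grids; every pooled cell becomes one M-dimensional LCD."""
    M, H, W = conv_map.shape
    descriptors = []
    for g in grids:
        for i in range(g):
            for j in range(g):
                h0, h1 = int(np.floor(i * H / g)), int(np.ceil((i + 1) * H / g))
                w0, w1 = int(np.floor(j * W / g)), int(np.ceil((j + 1) * W / g))
                descriptors.append(conv_map[:, h0:h1, w0:w1].max(axis=(1, 2)))
    return np.stack(descriptors)  # 36 + 9 + 4 + 1 = 50 descriptors of M dims

conv_map = np.random.rand(512, 14, 14).astype(np.float32)  # last conv layer of one frame
lcds = spp_lcd(conv_map)
```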

LCD Results on pool5

                    100Ex   10Ex
Average pooling     31.2    18.8
LCD (VLAD)          38.2    25.0
LCD (VLAD) + SPP    40.3    25.6
Table 1: Performance comparisons for pool5 on MEDTest 13

                    100Ex   10Ex
Average pooling     24.6    15.3
LCD (VLAD)          33.9    22.8
LCD (VLAD) + SPP    35.7    23.2
Table 2: Performance comparisons for pool5 on MEDTest 14

LCD for image analysis
"Deep Filter Banks for Texture Recognition and Segmentation", M. Cimpoi, S. Maji, and A. Vedaldi, CVPR 2015 (Oral)
"Deep Spatial Pyramid: The Devil is Once Again in the Details", B. Gao, X. Wei, J. Wu, and W. Lin, arXiv, 2015

Comparisons with the previous best feature (IDT)

                    Ours   IDT    Relative improvement
MEDTest 13 100Ex    44.6   34.0   31.2%
MEDTest 13 10Ex     29.8   18.0   65.6%
MEDTest 14 100Ex    36.8   27.6   33.3%
MEDTest 14 10Ex     24.5   13.9   76.3%

Comparison to the state-of-the-art systems on MEDTest 13
Natarajan et al. report 38.5% mAP on 100Ex and 17.9% on 10Ex from their whole visual system, which combines all their low-level visual features.
Lan et al. report 39.3% mAP on 100Ex from their whole system, including non-visual features.
Our results: 44.6% mAP on 100Ex and 29.8% mAP on 10Ex.
Ours + IDT + MFCC: 48.6% mAP on 100Ex and 32.2% mAP on 10Ex.
Our single feature beats state-of-the-art MED systems that combine more than 20 features.
Our lightweight system (static + motion + acoustic features) sets a high standard for MED.

Notes
The proposed representation is extensible: performance can be further improved with better CNN models, appropriate fine-tuning, or better descriptor encoding techniques.
The proposed representation is general for video analysis and not limited to multimedia event detection; we tested on the MED datasets because they are the largest available video analysis datasets in the world.
The proposed representation is simple but very effective; it is easy to generate with the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and the vlfeat/Yael toolkits (for the encoding).
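For example, extracting frame-level fc6 descriptors with pycaffe looks roughly like this (a sketch assuming a VGG-16 deploy prototxt and caffemodel, with frames already resized, mean-subtracted, and in BGR order; the file paths are placeholders):

```python
import numpy as np
import caffe

caffe.set_mode_gpu()
net = caffe.Net('vgg16_deploy.prototxt', 'vgg16.caffemodel', caffe.TEST)  # placeholder paths

def extract_fc6(frames):
    """frames: array of shape (N, 3, 224, 224), already preprocessed.
    Returns one fc6 descriptor per frame."""
    features = []
    for frame in frames:
        net.blobs['data'].reshape(1, 3, 224, 224)
        net.blobs['data'].data[0] = frame
        net.forward()
        features.append(net.blobs['fc6'].data[0].copy())
    return np.vstack(features)
```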

THUMOS'15 Action Recognition Challenge
Action recognition in temporally untrimmed videos!
A new forward-looking dataset containing over 430 hours of video data and 45 million frames (70% larger than THUMOS'14), with the following components made available for the challenge:
Training set: over 13,000 temporally trimmed videos from 101 action classes.
Validation set: over 2,100 temporally untrimmed videos with temporal annotations of actions.
Background set: approximately 3,000 relevant videos guaranteed not to include any instance of the 101 actions.
Test set: over 5,600 temporally untrimmed videos with withheld ground truth.

Results for THUMOS'15 validation set
Setting:
Training data: training part only (UCF-101)
Testing data: 2015 validation part
Linear SVM with C = 100, using the LIBSVM toolkit
Metric: mean Average Precision (mAP)

Comparison between average pooling and VLAD encoding:
                    fc6     fc7
Average pooling*    0.521   0.493
VLAD encoding       0.589   0.566
* For average pooling we use the layer after ReLU, since it performs better.
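A minimal sketch of the classifier setup above, shown here with scikit-learn's linear SVM rather than the LIBSVM command-line toolkit (feature dimensions and variable names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

# One encoded (e.g. VLAD) representation per training video
video_vectors = np.random.rand(1000, 32768).astype(np.float32)  # placeholder
labels = np.random.randint(0, 101, size=1000)                   # 101 action classes

clf = LinearSVC(C=100.0)   # C = 100 as on the slide
clf.fit(video_vectors, labels)

# Per-class decision scores for test videos (used to compute mAP)
test_vectors = np.random.rand(5, 32768).astype(np.float32)      # placeholder
scores = clf.decision_function(test_vectors)
```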

Results for THUMOS'15 validation set
Performance of VLAD-encoded CNN features:
        fc6     fc7     LCD
mAP     0.589   0.566   0.619

Results for THUMOS'15 validation set
LCD with a better CNN model: GoogLeNet with Batch Normalization (Inception v2) [Batch Normalization, Ioffe and Szegedy, ICML 2015]
Trained with the cxxnet toolkit* on 4 NVIDIA K20 GPUs
Timing: about 4.5 days (~40 epochs); for reference, VGG-16 takes 2-3 weeks to train on 4 GPUs
Achieves the same performance as the single network in Google's ILSVRC 2014 submission

        LCD from VGG-16   LCD from Inception v2
mAP     0.619             0.628

* with great multi-GPU training support

Results for THUMOS'15 validation set

        LCD with Inception v2   Multi-skip IDT   FlowNet
mAP     0.628                   0.529            0.416

With late fusion of the prediction scores, we achieve mAP 0.689.
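Late fusion here means combining the per-video prediction scores of the individual feature streams; a minimal sketch using a weighted average (the weights and score normalization are illustrative, not the values used in the submission):

```python
import numpy as np

def late_fusion(score_list, weights):
    """Weighted average of per-stream prediction scores.
    Each element of score_list has shape (num_videos, num_classes)."""
    fused = np.zeros_like(score_list[0])
    for scores, w in zip(score_list, weights):
        # Standardize each stream's scores so their scales are comparable
        s = (scores - scores.mean()) / (scores.std() + 1e-12)
        fused += w * s
    return fused

lcd_scores, idt_scores, flow_scores = (np.random.rand(100, 101) for _ in range(3))  # placeholders
fused = late_fusion([lcd_scores, idt_scores, flow_scores], weights=[0.5, 0.3, 0.2])
```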

THUMOS'15 Ranking

Rank   Entry                      Best Result
1      UTS & CMU                  0.7384
2      MSR Asia (MSM)             0.6897
3      Zhejiang University*       0.6876
4      INRIA_LEAR*                0.6814
5      CUHK & SIAT                0.6803
6      University of Amsterdam    0.6798

* Utilized our CVPR 2015 paper as the main system component

Shared features
We share all the features for the MED datasets and the THUMOS 2015 dataset.
You can download the features via Dropbox / Baidu Yun; links are on my homepage.
The features can be used for machine learning / pattern recognition tasks.

Thanks