SUPER: Towards Real-time Event Recognition in Internet Videos Yu-Gang Jiang School of Computer Science Fudan University Shanghai, China


1 SUPER: Towards Real-time Event Recognition in Internet Videos. Speeded Up Event Recognition. Yu-Gang Jiang, School of Computer Science, Fudan University, Shanghai, China (ygj@fudan.edu.cn). ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, June 2012.

2 The Problem: Recognize high-level events in videos. We're particularly interested in Internet consumer videos. Applications: video search, personal video collection management, smart advertising, intelligence analysis, …

3 Our Objective: Improve efficiency while maintaining accuracy.

4 The Baseline Recognition Framework. Feature extraction: SIFT, spatial-temporal interest points (STIP), MFCC audio feature. Classifier: χ² kernel SVM. Fusion: late average fusion. This was the best-performing approach in the TRECVID-2010 Multimedia Event Detection (MED) task. Reference: Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, "Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching," NIST TRECVID Workshop, 2010.

5 Three Audio-Visual Features: SIFT (visual), D. Lowe, IJCV '04; STIP (visual), I. Laptev, IJCV '05; MFCC (audio; the slide's figure is labelled "16ms").

6 Bag-of-words Representation: SIFT / STIP / MFCC words with soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007). Figure label: Bag-of-SIFT. Bag of audio words / bag of frames: K. Lee and D. Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Trans. on Audio, Speech, and Language Processing, 2010.
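The soft-weighted bag-of-words idea can be sketched as follows. This is an illustrative sketch, not the authors' implementation: each descriptor votes for its few nearest codewords, and the i-th nearest codeword receives its similarity scaled by 1/2^(i-1). The function name `soft_bow`, the similarity 1/(1 + distance), the default of four neighbours, and the final L1 normalisation are all assumptions made for this example.

```python
import math


def soft_bow(descriptors, codebook, top_n=4):
    """Soft-weighted bag-of-words histogram (illustrative sketch).

    Each descriptor votes for its top_n nearest codewords; the i-th nearest
    (i = 1, 2, ...) receives weight sim / 2**(i - 1).
    The similarity sim = 1 / (1 + distance) is an assumption of this sketch.
    """
    hist = [0.0] * len(codebook)
    for d in descriptors:
        # Euclidean distance from this descriptor to every codeword, nearest first.
        ranked = sorted((math.dist(d, w), k) for k, w in enumerate(codebook))
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            hist[k] += (1.0 / (1.0 + dist)) / (2 ** rank)
    total = sum(hist)
    # L1-normalise so videos with different descriptor counts stay comparable.
    return [h / total for h in hist] if total > 0 else hist


codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
histogram = soft_bow([(0.0, 0.0)], codebook, top_n=2)
print(histogram)
```

The same routine applies to SIFT, STIP, or MFCC descriptors, since it only needs vectors and a codebook of the same dimensionality.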

7 Baseline Speed. Four factors affect speed: feature, classifier, fusion, and frame sampling. Time per stage (seconds): SIFT 82.0; spatial-temporal interest points (STIP) 916.8; MFCC audio feature 2.36; χ² kernel SVM classifier ~2.00; late average fusion <<1. Total: 1003 seconds per video! Feature efficiency is measured in seconds needed to process an 80-second video sequence (for SIFT: 0.5 fps). Classification time is measured by classifying a video using the classifiers of all 20 categories.
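As a quick check of the total, the per-stage figures from this slide can be summed directly. The dictionary keys are names chosen for this example; the slide gives "~2.00" and "<<1" only approximately, so 2.0 and 0.0 are stand-ins, and pairing them with the classifier and the fusion step respectively is an assumption.

```python
# Per-stage cost in seconds for one 80-second video, using the slide's figures.
# "~2.00" and "<<1" are approximate on the slide; 2.0 and 0.0 are stand-ins.
# Pairing 2.0 with the classifier and 0.0 with fusion is an assumption.
baseline_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "chi2_svm": 2.0,
    "late_fusion": 0.0,
}

total = sum(baseline_seconds.values())
stip_share = baseline_seconds["STIP"] / total

print(f"total: {total:.0f} s")          # total: 1003 s
print(f"STIP share: {stip_share:.0%}")  # STIP share: 91%
```

Under these figures STIP extraction dominates the baseline cost, so feature choice is the factor with the most room for savings.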

8 Dataset: Columbia Consumer Videos (CCV). Categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground. Reference: Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, "Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance," ACM ICMR 2011.

9 Feature Options: (Sparse) SIFT, STIP, MFCC, Dense SIFT (DIFT), Dense SURF (DURF), Self-Similarities (SSIM), Color Moments (CM), GIST, LBP, TINY. Suggested feature combinations: (shown in the slide's figure, which is not captured in this transcript). Reference: Uijlings, Smeulders, Scha, "Real-time bag of words, approximately," ACM CIVR 2009.

10 Classifier Kernels: chi-square kernel, histogram intersection kernel (HI), fast HI kernel (fastHI). Reference: Maji, Berg, Malik, "Classification Using Intersection Kernel Support Vector Machines is Efficient," CVPR 2008.
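The three kernel options can be sketched in a few lines. This is a hedged illustration rather than the code used in the talk: `chi2_kernel` uses one common exponential form of the chi-square kernel (the `gamma` parameter is an assumption), and `FastHIScorer` is a hypothetical class showing the general idea behind fast intersection-kernel evaluation, where sorting the support-vector values per dimension lets the SVM sum be evaluated with a binary search per dimension instead of a pass over every support vector. The bias term is omitted.

```python
import math
from bisect import bisect_right


def chi2_kernel(x, y, gamma=1.0):
    """exp(-gamma * chi-square distance) between two non-negative histograms."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)


def hi_kernel(x, y):
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


class FastHIScorer:
    """Evaluates sum_l coef[l] * hi_kernel(x, sv[l]) using per-dimension
    sorted support values and prefix sums (bias term omitted)."""

    def __init__(self, support_vectors, coefs):
        self.tables = []
        for i in range(len(support_vectors[0])):
            pairs = sorted((sv[i], c) for sv, c in zip(support_vectors, coefs))
            values = [v for v, _ in pairs]
            # below[r]: sum of coef * value over the r smallest values.
            below = [0.0]
            for v, c in pairs:
                below.append(below[-1] + c * v)
            # above[r]: sum of coef over everything except the r smallest values.
            above = [0.0] * (len(pairs) + 1)
            for r in range(len(pairs) - 1, -1, -1):
                above[r] = above[r + 1] + pairs[r][1]
            self.tables.append((values, below, above))

    def score(self, x):
        total = 0.0
        for xi, (values, below, above) in zip(x, self.tables):
            r = bisect_right(values, xi)  # number of support values <= xi
            # min(xi, v) equals v for those r values and xi for the rest.
            total += below[r] + xi * above[r]
        return total


svs = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
coefs = [1.0, -0.5, 0.25]
x = [0.4, 0.6]
direct = sum(c * hi_kernel(x, sv) for sv, c in zip(svs, coefs))
fast = FastHIScorer(svs, coefs).score(x)
print(direct, fast)
```

Both evaluations give the same score; the difference is that the fast version costs a logarithmic lookup per dimension at test time, which is where the speed advantage over the chi-square kernel comes from.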

11 Multi-modality Fusion. Early fusion: feature concatenation. Kernel fusion: $K_f = K_1 + K_2 + \dots$. Late fusion: fusion of classification scores. Feature sets shown: (MFCC, DURF, SSIM, CM, GIST, LBP) and (MFCC, DURF).
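The three fusion strategies differ only in where the modalities are combined: before the classifier, inside the kernel, or after the classifier. A minimal sketch follows, assuming plain Python lists for feature vectors, kernel matrices, and per-class scores; the function names are chosen for this example, and averaging is used for late fusion because the baseline slide names late average fusion.

```python
def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one long vector."""
    return [v for vec in feature_vectors for v in vec]


def kernel_fusion(kernel_matrices):
    """Element-wise sum of per-modality kernel matrices: K_f = K_1 + K_2 + ..."""
    rows = len(kernel_matrices[0])
    cols = len(kernel_matrices[0][0])
    return [
        [sum(K[i][j] for K in kernel_matrices) for j in range(cols)]
        for i in range(rows)
    ]


def late_average_fusion(scores_per_modality):
    """Average per-class classifier scores across modalities."""
    m = len(scores_per_modality)
    return [sum(col) / m for col in zip(*scores_per_modality)]


print(early_fusion([[1, 2], [3]]))
print(kernel_fusion([[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]))
print(late_average_fusion([[1.0, 0.0], [0.0, 1.0]]))
```

Early fusion needs a single classifier per category, whereas late fusion needs one classifier per modality per category, which is one reason the choice affects speed.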

12 Frame Sampling (DURF): Uniformly sampling 16 frames per video seems sufficient. Reference: K. Schindler and L. van Gool, "Action snippets: How many frames does human action recognition require?," CVPR 2008.
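Uniform sampling of a fixed number of frames can be sketched as below. The function name and the choice to take the frame at the centre of each equal-length segment are assumptions of this example; the slide only states that 16 uniformly sampled frames seem sufficient. The 30 fps frame rate in the usage line is likewise an assumption.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Indices of num_samples frames spread evenly over a video."""
    if num_frames <= 0:
        return []
    if num_frames <= num_samples:
        # Short clip: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the centre of each of the num_samples equal segments.
    return [int(step * (i + 0.5)) for i in range(num_samples)]


# An 80-second clip at an assumed 30 fps has 2400 frames.
indices = uniform_frame_indices(80 * 30)
print(indices)
```

Visual features are then extracted only at these indices, so the extraction cost no longer grows with video length.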

13 Frame Sampling (MFCC): Sampling audio frames is always harmful.

14 Summary. Feature: Dense SURF (DURF), MFCC, plus some global features. Classifier: fast HI kernel SVM. Fusion: early. Frame selection: audio, no; visual, yes. Result: a 220-fold speed-up!
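To put the speed-up in perspective, the arithmetic below assumes the 220-fold figure is measured against the 1003-second baseline total for an 80-second video. The slides do not state this explicitly, so the derived numbers are an estimate, not a reported result.

```python
baseline_seconds = 1003.0  # total from the baseline speed slide
speed_up = 220.0           # factor from the summary slide
video_seconds = 80.0       # clip length used for the timing measurements

# Assumption: the speed-up is relative to the 1003-second baseline.
optimized_seconds = baseline_seconds / speed_up
realtime_factor = video_seconds / optimized_seconds

print(f"{optimized_seconds:.2f} s per video")          # 4.56 s per video
print(f"{realtime_factor:.1f}x faster than playback")  # 17.5x faster than playback
```

Under that assumption, processing finishes well before the clip would finish playing, which is consistent with the "towards real-time" framing of the title.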

15 Demo…

16 Email: ygj@fudan.edu.cn

