SUPER: Towards Real-time Event Recognition in Internet Videos Yu-Gang Jiang School of Computer Science Fudan University Shanghai, China


Speeded Up Event Recognition (SUPER). ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, June 2012.

The Problem: Recognize high-level events in videos. We're particularly interested in Internet consumer videos. Applications: video search, personal video collection management, smart advertising, intelligence analysis, …

Our Objective: Improve efficiency, maintain accuracy.

The Baseline Recognition Framework: Feature extraction (SIFT, spatial-temporal interest points, MFCC audio feature); χ² kernel SVM classifier; late average fusion. Best-performing approach in the TRECVID-2010 Multimedia Event Detection (MED) task. Reference: Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop.

Three Audio-Visual Features: SIFT (visual; D. Lowe, IJCV '04); STIP (visual; I. Laptev, IJCV '05); MFCC (audio; … 16ms).

Bag-of-words Representation: SIFT / STIP / MFCC words, with soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007). [Figure: Bag-of-SIFT] Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
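The bag-of-words step can be illustrated with a minimal sketch. It assumes local descriptors (SIFT, STIP or MFCC) and a codebook are already available as plain lists of numbers. The function name `bag_of_words`, the `top_n` parameter and the rank-halving vote are illustrative assumptions: the vote is a simplified stand-in for soft weighting, not the exact scheme of the cited paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bag_of_words(descriptors, vocabulary, top_n=1):
    """Build an L1-normalized histogram over the vocabulary.

    top_n=1 gives hard assignment (each descriptor votes for its nearest word).
    top_n>1 gives a simplified soft assignment: the i-th nearest word
    (i = 0, 1, ...) receives a vote of 1 / 2**i. This weighting is an
    assumption made for illustration.
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Rank visual/audio words by distance to this descriptor.
        ranked = sorted(range(len(vocabulary)),
                        key=lambda k: euclidean(d, vocabulary[k]))
        for rank, k in enumerate(ranked[:top_n]):
            hist[k] += 1.0 / (2 ** rank)
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Toy codebook with three 2-D "words" and a single descriptor.
vocab = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
hard = bag_of_words([[1.0, 0.0]], vocab, top_n=1)
soft = bag_of_words([[1.0, 0.0]], vocab, top_n=2)
```

With hard assignment all of the mass lands on the nearest word; with `top_n=2` part of it is shared with the second-nearest word, which reduces quantization loss for descriptors that fall between words.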

Baseline Speed: Feature extraction (SIFT, spatial-temporal interest points, MFCC audio feature); χ² kernel SVM classifier; late average fusion. [Figure: baseline pipeline with per-stage timings; surviving labels: ~2.00, <<1] Four factors affect speed: feature, classifier, fusion, frame sampling. Feature efficiency is measured in seconds needed to process an 80-second video sequence (for SIFT: 0.5 fps). Classification time is measured by classifying a video using the classifiers of all 20 categories. Total: 1003 seconds per video!

Dataset: Columbia Consumer Videos (CCV). Categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground. Reference: Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.

Feature Options: (Sparse) SIFT, STIP, MFCC, Dense SIFT (DIFT), Dense SURF (DURF), Self-Similarities (SSIM), Color Moments (CM), GIST, LBP, TINY. Reference: Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR. Suggested feature combinations:

Classifier Kernels: Chi Square Kernel; Histogram Intersection Kernel (HI); Fast HI Kernel (fastHI). Reference: Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
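For reference, the two histogram kernels named on this slide can be sketched as follows. The chi-square kernel is written in one common exponential form; the `gamma` parameter and the function names are assumptions, since the slides do not give the exact formulation used. The fast HI kernel of Maji et al. speeds up evaluation of a trained intersection-kernel SVM and is not implemented here.

```python
import math


def chi_square_kernel(x, y, gamma=1.0):
    """exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))) for non-negative histograms.

    Bins where both histograms are zero are skipped to avoid division by zero.
    """
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)


def histogram_intersection_kernel(x, y):
    """Sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(x, y))


h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
k_chi2 = chi_square_kernel(h1, h2)
k_hi = histogram_intersection_kernel(h1, h2)
```

The intersection kernel needs only comparisons and additions per bin, which is one reason it is cheaper to evaluate than the chi-square kernel with its divisions and exponential.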

Multi-modality Fusion: Early fusion: feature concatenation. Kernel fusion: K_f = K_1 + K_2 + … Late fusion: fusion of classification scores. [Figure labels: MFCC, DURF, SSIM, CM, GIST, LBP; MFCC, DURF]
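The three fusion strategies differ only in where the modalities are combined. A minimal sketch, assuming per-modality feature vectors, precomputed square kernel matrices, and per-modality classifier scores are given as plain lists (all function names are illustrative; averaging is used for late fusion, as in the baseline):

```python
def early_fusion(features):
    """Concatenate per-modality feature vectors into one vector."""
    fused = []
    for f in features:
        fused.extend(f)
    return fused


def kernel_fusion(kernels):
    """Element-wise sum of same-sized kernel matrices: K_f = K_1 + K_2 + ..."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) for j in range(n)]
            for i in range(n)]


def late_average_fusion(scores):
    """Average the classification scores produced by per-modality classifiers."""
    return sum(scores) / len(scores)


fused_vec = early_fusion([[1, 2], [3]])
fused_kernel = kernel_fusion([[[1, 0], [0, 1]], [[2, 1], [1, 2]]])
fused_score = late_average_fusion([0.2, 0.4, 0.9])
```

Early fusion trains a single classifier on the concatenated vector, whereas late fusion requires one classifier per modality, which matters when classification time is part of the speed budget.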

Frame Sampling (DURF): Uniformly sampling 16 frames per video seems sufficient. Reference: K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
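Uniform frame sampling can be sketched as picking the centre frame of each of 16 equal-length segments. The function name and the centre-of-segment choice are assumptions; the slides state only that 16 uniformly sampled frames per video seem sufficient.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Return up to num_samples frame indices spread evenly over the video."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_frames <= num_samples:
        # Short video: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the middle frame of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


# An 80-second video at 30 fps has 2400 frames.
indices = uniform_frame_indices(2400)
```

Visual features are then extracted only at the returned indices, so extraction cost scales with the number of samples rather than with video length.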

Frame Sampling (MFCC): Sampling audio frames is always harmful.

Summary: Feature: Dense SURF (DURF), MFCC, plus some global features. Classifier: Fast HI kernel SVM. Fusion: Early. Frame selection: audio, no; visual, yes. fold speed-up!

Demo…
