Saliency-guided Video Classification via Adaptively Weighted Learning (ICME 2017) Yunzhen Zhao and Yuxin Peng* Institute of Computer Science and Technology, Peking University, Beijing 100871, China {pengyuxin@pku.edu.cn}
Outline Introduction Method Experiments Conclusion
Introduction Large-scale Internet video (Cisco statistics) By 2021, it would take an individual more than 5 million years to watch the amount of video that will cross global IP networks each month. Globally, IP video traffic will be 82 percent of all consumer Internet traffic by 2021, up from 73 percent in 2016. Big video data creates a pressing need for video classification, one of the key techniques for video understanding and analysis. Source: Cisco Visual Networking Index, 2017
Introduction What is video classification? Learning semantics from video content and classifying videos into pre-defined categories automatically. For example: classifying human actions, multimedia events, etc. (Example classes: Birthday Celebration, Parade, HorseRiding, PlayingGuitar)
Introduction Wide applications Video classification supports human-computer interaction, video search, sports analysis, and surveillance.
Introduction Deep video classification Inspired by the great progress of DNNs in image classification, DNN-based video classification has become a research hotspot. A classical deep video classification method: the two-stream ConvNet architecture. K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
Introduction Deep video classification Ji et al. developed a 3D CNN architecture: a 2D CNN computes features from the spatial dimensions only, while a 3D CNN computes features from both the spatial and temporal dimensions. S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE TPAMI, 2013.
Introduction Two problems From the view of motion, video frames can be decomposed into salient and non-salient areas, which should be treated differently. Information from multiple streams plays different roles in video classification, so the streams should also be weighted differently.
Introduction Main contributions Use optical flow to segment video frames into salient and non-salient areas, without any supervision information. Propose a hybrid framework that combines 3D and 2D CNNs to model the multi-stream information from salient and non-salient areas respectively. Introduce an adaptively weighted learning method that adaptively learns different fusion weights for the multiple information streams.
Outline Introduction Method Experiments Conclusion
Method Framework Salient area prediction Motion in videos may guide us to predict salient areas. Salient areas present static and motion information; non-salient areas present background information.
Method Framework Hybrid CNN networks Include three CNN streams: two 3D CNNs model the static and motion information from salient areas; one 2D CNN models the background information from non-salient areas.
Method Framework Adaptively weighted learning Adaptively learn the fusion weights of the three information streams modeled by the hybrid CNN networks.
Method Salient area prediction Motivation: human brains are selectively sensitive to motion. Motion in videos: Subject motion: caused by the movement of objects in the video (useful information). Camera motion: caused by the movement of the camera (needs to be eliminated).
Method Salient area prediction Step 1: Estimate the homography by finding correspondences between two frames. Step 2: Use the estimated homography to rectify the raw frames and remove the camera motion. Step 3: Analyze the trajectory vectors in the flow field and remove the vectors that are too small. Step 4: Apply an edge detection algorithm and take the connected domain as the salient region (see the sketch below). Heng Wang and Cordelia Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
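A minimal sketch of these four steps in Python with OpenCV, for illustration only: the feature detector (ORB), flow algorithm (Farnebäck), threshold value, and the use of connected components in place of the edge-detection step are assumptions of this sketch, not the authors' implementation, which follows the improved-trajectories pipeline cited above.

```python
# Illustrative sketch of the four salient-area-prediction steps (assumed
# choices throughout; not the paper's exact pipeline).
import cv2
import numpy as np

def predict_salient_region(prev_frame, curr_frame, min_flow=1.0):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Step 1: estimate the homography from feature correspondences.
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)

    # Step 2: rectify the previous frame to cancel camera motion.
    h, w = curr_gray.shape
    warped = cv2.warpPerspective(prev_gray, H, (w, h))

    # Step 3: dense flow on the rectified pair; drop small motion vectors.
    flow = cv2.calcOpticalFlowFarneback(warped, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    mask = (magnitude > min_flow).astype(np.uint8) * 255

    # Step 4: take the largest connected component as the salient region.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:
        return None  # no subject motion detected
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y, bw, bh = stats[largest, :4]
    return (x, y, bw, bh)  # bounding box of the salient area
```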
Method Hybrid CNN networks 3D CNN: applies 3D convolution; computes features along three dimensions (spatial and temporal); suitable for salient areas. 2D CNN: applies 2D convolution; computes features along two dimensions (spatial only); suitable for non-salient areas.
Method Formal description The value at position $(x, y)$ (or $(x, y, z)$) of the $j$-th feature map in the $i$-th convolution layer: For 2D convolution: $v_{ij}^{xy} = \tanh\big(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\big)$ For 3D convolution: $v_{ij}^{xyz} = \tanh\big(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\big)$ where $b_{ij}$ is the bias, $m$ indexes the feature maps of layer $i-1$, $w$ are the kernel weights, and $P_i$, $Q_i$, $R_i$ are the kernel's height, width, and temporal extent (equations as in Ji et al.).
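To make the dimensional difference between the two formulas concrete, here is a minimal PyTorch sketch; the input size and channel counts are illustrative assumptions, not the paper's actual architecture:

```python
# Contrast 2D and 3D convolution on a short clip (illustrative sizes).
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
frame = clip[:, :, 0]                     # a single frame: (1, 3, 112, 112)

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)

# 2D convolution slides only over space: temporal structure is ignored.
print(conv2d(frame).shape)   # torch.Size([1, 64, 112, 112])

# 3D convolution slides over space and time, so the temporal dimension
# (16 frames) is preserved in the output feature map.
print(conv3d(clip).shape)    # torch.Size([1, 64, 16, 112, 112])
```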
Method Adaptively weighted learning Motivation Information from different streams plays a different role for each class, so different fusion weights should be learned for different semantic classes. We propose an adaptively weighted learning method to learn the fusion weights for the multiple information streams in an adaptive way.
Method Adaptively weighted learning Objective function $P_j$ stands for the fusion score of samples within the corresponding semantic class $j$; $N_j$ stands for the fusion score of samples from the non-corresponding classes. The weights for class $j$ are learned to enlarge the gap between $P_j$ and $N_j$.
Method Adaptively weighted learning Classification with adaptively learned weights Through the above equation, different fusion weights are applied for each class, and the final result is determined by the highest fusion score (see the sketch below).
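A hedged sketch of the idea in Python/NumPy: per-class weights over the three stream scores are updated to widen the gap between the in-class fusion score ($P_j$) and the out-of-class fusion score ($N_j$), then classification picks the class with the highest fused score. The update rule, the simplex projection, and all names are assumptions for illustration, not the paper's exact optimization.

```python
# Illustrative per-class fusion-weight learning (assumed formulation).
import numpy as np

def learn_fusion_weights(scores, labels, num_classes, lr=0.01, epochs=100):
    """scores: (N, S, C) per-stream class scores; labels: (N,) ints."""
    N, S, C = scores.shape
    W = np.full((C, S), 1.0 / S)               # one weight vector per class
    for _ in range(epochs):
        for j in range(num_classes):
            pos = scores[labels == j, :, j]     # (N_pos, S) in-class scores
            neg = scores[labels != j, :, j]     # (N_neg, S) out-of-class
            if pos.size == 0 or neg.size == 0:
                continue
            # gradient of P_j - N_j with respect to the class-j weights
            grad = pos.mean(axis=0) - neg.mean(axis=0)
            W[j] += lr * grad
            W[j] = np.clip(W[j], 0, None)
            W[j] /= max(W[j].sum(), 1e-8)       # keep weights on the simplex
    return W

def classify(sample_scores, W):
    """sample_scores: (S, C). Returns the class with the highest fusion score."""
    fused = np.array([W[j] @ sample_scores[:, j] for j in range(W.shape[0])])
    return int(np.argmax(fused))
```

At test time, a video's three stream score vectors form `sample_scores`, and the predicted label is the class with the maximum weighted fusion score, matching the slide's description.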
Outline Introduction Method Experiments Conclusion
Experiments Datasets UCF-101 consists of 13,320 video clips in 101 classes; the clips total over 27 hours. All videos are collected from YouTube and have a fixed frame rate of 25 FPS at a resolution of 320x240. CCV is a consumer video dataset containing 9,317 web videos spanning 20 semantic categories. Its content is interesting and diverse, with fewer textual tags and content descriptions, and is thus more complex than UCF-101. (Example classes: TaiChi and Punch from UCF-101; wedding dance and graduation from CCV)
Experiments Evaluation metrics UCF-101: measure the results by averaging accuracy over the three standard splits. CCV: first calculate the average precision (AP) for each class, then report the mAP over the whole dataset (see the sketch below). The evaluation metrics are the same as in the paper below, for fair comparison: Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification," in ACM International Conference on Multimedia (ACM MM), pages 461-470, 2015.
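For reference, a minimal sketch of the CCV protocol (per-class AP averaged into mAP) using scikit-learn; the function and variable names are illustrative:

```python
# Per-class average precision, then mAP over all classes.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (N, C) binary labels; y_score: (N, C) fusion scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps)), aps
```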
Experiments Comparing different combinations of the three streams The first group shows the results achieved by each stream separately. The second group shows the results achieved by combining any two streams. The third group shows the results achieved by combining all three streams.
Experiments The effectiveness of modeling saliency
Experiments The effectiveness of adaptively weighted learning
Experiments Comparison with state-of-the-art methods
Outline Introduction Method Experiments Conclusion
Conclusion Optical flow can be used to predict the salient areas in an unsupervised way. Modeling multi-stream information from salient and non-salient areas respectively can boost the performance of video classification. The adaptively weighted learning method helps to learn different fusion weights for different semantic classes. Future directions Exploit the help of manual indication and handcrafted labeling. Attempt to apply unsupervised learning in our work.
Cross-media Retrieval More than video: cross-media retrieval is our current research focus. It performs retrieval across different media types, such as image, text, audio, and video. We have released the XMedia dataset with 5 media types; this dataset and the source code of our related works are available at http://www.icst.pku.edu.cn/mipl/xmedia Interested in cross-media retrieval? We hope our recent overview is helpful: Yuxin Peng, Xin Huang, and Yunzhen Zhao, "An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges," IEEE TCSVT, 2017. arXiv:1704.02223.
ICME 2017 Thank you!