Visual Event Recognition in Videos by Learning from Web Data Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶ † Nanyang Technological University, Singapore ¶ Kodak Research Labs, Rochester, NY, USA
Outline
- Overview of the Event Recognition System
- Similarity between Videos
  - Aligned Space-Time Pyramid Matching
- Cross-Domain Problem
  - Adaptive Multiple Kernel Learning
- Experiments
- Conclusion
Overview
GOAL: Recognize consumer videos (e.g., "wedding", "sports", "picnic")
Challenges: large intra-class variability; limited labeled consumer videos
Overview
GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube)
[Figure: a few labeled consumer videos ("wedding", "sports", "picnic") alongside a large number of web videos]
Overview
[Figure: flowchart of the system: a test video and the video database are fed to the learned classifier, which produces the recognition output]
Similarity between Videos
Pyramid matching methods:
- Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1] (time axis only)
- Unaligned space-time pyramid matching, I. Laptev et al. [2] (space-time axes)
Similarity between Videos
Aligned Space-Time Pyramid Matching
At level 1, each clip is divided into 2 x 2 x 2 = 8 non-overlapping space-time volumes, and the distance between every pair of volumes from two clips is computed.
[Figure: level-1 partition of two videos into space-time volumes and the volume-to-volume distances]
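Below is a minimal sketch (a toy illustration, not the authors' code) of the level-1 partition: each clip's local-feature histograms are split into 2 x 2 x 2 = 8 space-time volumes, and all pairwise volume-to-volume distances are computed. The histogram-grid representation and the Euclidean distance are our simplifying assumptions.

import numpy as np

def split_volumes(feat):
    # feat: (T, Y, X, D) grid of local-feature histograms for one clip
    T, Y, X, D = feat.shape
    vols = []
    for t0, t1 in [(0, T // 2), (T // 2, T)]:
        for y0, y1 in [(0, Y // 2), (Y // 2, Y)]:
            for x0, x1 in [(0, X // 2), (X // 2, X)]:
                h = feat[t0:t1, y0:y1, x0:x1].sum(axis=(0, 1, 2))
                vols.append(h / (h.sum() + 1e-12))  # L1-normalize each volume histogram
    return np.stack(vols)  # (8, D)

def volume_distances(va, vb):
    # all pairwise Euclidean distances between the 8 volumes of two clips
    return np.linalg.norm(va[:, None, :] - vb[None, :, :], axis=-1)  # (8, 8)

rng = np.random.default_rng(0)
clip_a = rng.random((8, 4, 4, 32))  # toy (T, Y, X, D) histogram grid
clip_b = rng.random((8, 4, 4, 32))
D = volume_distances(split_volumes(clip_a), split_volumes(clip_b))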
Similarity between Videos
Align the space-time volumes of two videos with the integer-flow Earth Mover's Distance (EMD), Y. Rubner et al. [3]:
D = (Σ_r Σ_c f̂_rc d_rc) / (Σ_r Σ_c f̂_rc), where F̂ = [f̂_rc] solves
min_F Σ_r Σ_c f_rc d_rc
s.t. f_rc ∈ {0, 1}, Σ_c f_rc = 1 ∀r, Σ_r f_rc = 1 ∀c
(d_rc: distance between volume r of one video and volume c of the other; f_rc: flow indicating whether the two volumes are matched)
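With equal weights on the volumes, the integer-flow EMD above is a one-to-one assignment problem, so a minimal sketch can reuse SciPy's Hungarian solver (our shortcut; the slides do not specify the solver):

import numpy as np
from scipy.optimize import linear_sum_assignment

def integer_flow_emd(dist):
    # dist: square matrix of volume-to-volume distances
    rows, cols = linear_sum_assignment(dist)       # optimal 0/1 flow (one-to-one matching)
    flow = np.zeros_like(dist)
    flow[rows, cols] = 1.0
    return (flow * dist).sum() / flow.sum(), flow  # normalized matched cost

rng = np.random.default_rng(0)
D = rng.random((8, 8))  # stand-in for the 8 x 8 block-distance matrix
d, F = integer_flow_emd(D)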
Cross-Domain Problem
Data distribution mismatch between consumer videos and web videos:
- Consumer videos: naturally captured
- Web videos: edited and selected
Measure the mismatch with the Maximum Mean Discrepancy (MMD), K. M. Borgwardt et al. [4]:
MMD(D^A, D^T) = || (1/n_A) Σ_i φ(x_i^A) − (1/n_T) Σ_i φ(x_i^T) ||
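A minimal empirical estimate of the squared MMD with a Gaussian RBF kernel; the kernel choice, bandwidth, and toy data are our assumptions:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel matrix between row-vector sets A and B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(XA, XT, gamma=1.0):
    # biased empirical estimate of the squared MMD between domains A and T
    return (rbf_kernel(XA, XA, gamma).mean()
            + rbf_kernel(XT, XT, gamma).mean()
            - 2.0 * rbf_kernel(XA, XT, gamma).mean())

rng = np.random.default_rng(0)
web = rng.normal(0.0, 1.0, (50, 16))       # stand-in web-video features
consumer = rng.normal(0.5, 1.0, (30, 16))  # stand-in consumer-video features
print(mmd2(web, consumer, gamma=0.1))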
Cross-Domain Problem
Prior information: prelearned base classifiers f_p(x) are incorporated into the target decision function.
Cross-Domain Problem
Adaptive Multiple Kernel Learning (A-MKL)
Jointly learn the base-kernel weights d and the target classifier by minimizing
G(d) = (1/2) Ω²(d) + θ J(d)
where Ω(d) = Σ_m d_m h_m is the squared MMD under the combined kernel Σ_m d_m K_m (with h_m = s' K_m s), and J(d) is the structural risk functional of the target decision function, which adds a perturbation term to the prelearned classifiers f_p(x).
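To make the Ω(d) term concrete, this toy sketch (a hypothetical setup; it omits the full A-MKL optimization over d, w, and β) evaluates h_m = s' K_m s for precomputed base kernels and combines them linearly:

import numpy as np

def mmd_vector(kernels, nA, nT):
    # s encodes the two domains: +1/nA for auxiliary (web) samples, -1/nT for target (consumer) samples
    s = np.concatenate([np.full(nA, 1.0 / nA), np.full(nT, -1.0 / nT)])
    return np.array([s @ K @ s for K in kernels])  # h_m = s' K_m s

def omega(d, h):
    return d @ h  # Omega(d) = sum_m d_m h_m, the squared MMD under the combined kernel

# toy usage with random PSD base kernels over nA + nT samples
rng = np.random.default_rng(0)
nA, nT, M = 6, 4, 3
kernels = []
for _ in range(M):
    X = rng.random((nA + nT, 5))
    kernels.append(X @ X.T)  # PSD Gram matrix
h = mmd_vector(kernels, nA, nT)
d = np.full(M, 1.0 / M)  # uniform kernel weights
print(omega(d, h))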
Experiments
Data set:
- 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set [5]
- 6 events: "wedding", "birthday", "picnic", "parade", "show" and "sports"
- Training data: 3 videos per event from the consumer videos, plus all web videos
- Test data: the remaining consumer videos
Experiments
Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM)
[Figure: per-level comparison of aligned and unaligned matching]
ASTPM outperforms USTPM at Level 1.
Experiments
Comparisons of cross-domain learning methods
[Figure: per-event results using (a) SIFT features [9], (b) ST features, (c) SIFT and ST features combined]
Example: on "parade", A-MKL achieves 75.7% vs. 62.2% for FR.
Experiments
Comparisons of cross-domain learning methods
Relative improvements of A-MKL over:
- SVM_T: 36.9%
- SVM_AT: 8.6%
- Feature Replication (FR) [6]: 7.6%
- Adaptive SVM (A-SVM) [8]: 49.6%
- Domain Transfer SVM (DTSVM) [7]: 9.9%
MKL-based methods [5]:
- better fuse SIFT features and ST features
- handle noise in the loose labels
Conclusion
- We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.
- We develop a new aligned space-time pyramid matching method.
- We present a new cross-domain learning method, A-MKL, which handles the mismatch between the data distributions of the consumer video domain and the web video domain.
References
[1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008.
[2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.
References
[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004.
[6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
[7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.
[8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007.
[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
Thank you!