Semantic Embedding Space for Zero Shot Action Recognition. Authors: Xun Xu, Timothy Hospedales, Shaogang Gong. Computer Vision Group, Queen Mary University of London.

Presentation transcript:

Semantic Embedding Space for Zero Shot Action Recognition. Authors: Xun Xu, Timothy Hospedales, Shaogang Gong. Computer Vision Group, Queen Mary University of London.

Action Recognition: ever-increasing number of categories (timeline from 2004). KTH: 6 classes; Weizmann: 9 classes; Olympic Sports: 16 classes; HMDB51: 51 classes; UCF101: 101 classes. Limitations: training data is expensive to collect; annotating video is costly.

Zero-Shot Action Recognition: can we use videos from seen classes to help predict the labels of videos from unseen classes? Diagram labels: Known Classes, Unknown Classes. Example actions: Hammer Throw, Discus Throw, Shot-Put.

Conventional Approaches: human-labelled attributes (Lampert et al., CVPR09 [1]; Liu et al., CVPR11 [2]; Fu et al., TPAMI15 [3]). Limitations: manual labelling is costly; ontological problem; incompatible with other attribute sets. [1] Lampert et al., "Learning to detect unseen object classes by between-class attribute transfer," CVPR 2009. [2] J. Liu, B. Kuipers, and S. Savarese, "Recognizing human actions by attributes," CVPR 2011. [3] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong, "Transductive multi-view zero-shot learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

Conventional Approaches: attribute based. Diagram labels: Ball, Throw Away, Shot-put, Hammer Throw, Discus Throw, Bend, Turn Around, Outdoor. Limitations: manual labelling is costly; ontological problem; incompatible with other attribute sets.

Semantic Embedding Approach. Semantic Embedding Space: Discus Throw = [ …], Hammer Throw = [ …], ShotPut = [ …]. Feature Space: Discus Throw, Hammer Throw.

Benefits: Unsupervised Semantic Space

Benefits: unsupervised; wide coverage of words. Examples: Vec(“Apple”) = [ …], Vec(“Bear”) = [ …], Vec(“Car”) = [ …], Vec(“Desk”) = [ …], Vec(“Fish”) = [ …], …

Benefits: unsupervised; wide coverage of words; semantically meaningful. Words shown in the Semantic Embedding Space: Run, Walk, ship, cat, dog.

Benefits: unsupervised; wide coverage of words; semantically meaningful; uniform across datasets. Dataset 1: HammerThrow = [ …], Discus Throw = [ …]. Dataset 2: HammerThrow = [ …], Discus Throw = [ …].

Challenges: Complex Mapping

Challenges. Semantic Vector Space (D dim): Discus Throw = [ …], HammerThrow = [ …]. Feature Space (N dim).

Challenges: Domain Shift

Challenges. Diagram labels: Semantic Vector Space, Feature Space. Classes shown: Discus Throw, Hammer Throw, Sword Exercise, Play Guitar.

Semantic Embedding Approach Y=“Discus Throw”

Low-Level Visual Feature: Improved Trajectory Feature [1] with Bag of Words encoding. [1] H. Wang and C. Schmid, "Action recognition with improved trajectories," ICCV 2013.

Semantic Embedding Space Y=“Discus Throw”

Semantic Word Vector: the skip-gram model [1] predicts nearby words. Example words: archery, hammer, sword, throw, … [1] T. Mikolov et al., "Distributed representations of words and phrases and their compositionality," NIPS 2013.

Combinations of Multiple Words: additive composition. vec(“Discus Throw”) = vec(“Discus”) + vec(“Throw”); vec(“Apply Eye Makeup”) = vec(“Apply”) + vec(“Eye”) + vec(“Makeup”); vec(“Playing Guitar”) = vec(“Playing”) + vec(“Guitar”).
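
The additive composition on this slide can be illustrated with a minimal sketch. The 3-dimensional toy vectors and the helper name `compose` are assumptions made for illustration only; the slides use 300-dimensional skip-gram vectors and do not say how labels are tokenised.

```python
# Toy 3-d word vectors with made-up values (assumption, not real skip-gram output).
TOY_VECTORS = {
    "discus": [0.2, 0.1, 0.7],
    "throw": [0.5, 0.3, 0.1],
    "playing": [0.1, 0.6, 0.2],
    "guitar": [0.0, 0.9, 0.4],
}


def compose(label, vectors):
    """Sum the word vectors of every word in a multi-word class label."""
    words = label.lower().split()  # hypothetical tokenisation: split on spaces
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total


# vec("Discus Throw") = vec("Discus") + vec("Throw")
discus_throw = compose("Discus Throw", TOY_VECTORS)
playing_guitar = compose("Playing Guitar", TOY_VECTORS)
```

A plain sum keeps the class vector in the same space as the single-word vectors, so multi-word and single-word class names can be compared directly.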

Visual to Semantic Mapping

Support Vector Regression with Chi2 Kernel: maps the N-dimensional visual feature x = (x1, x2, x3, …) to the D-dimensional semantic vector z = (z1, z2, …).
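
As a rough illustration of the kernel named on this slide, the sketch below computes an exponential chi-square kernel between bag-of-words histograms. The exact kernel form, the `gamma` value, and the function names are assumptions; the slides only say "Chi2 Kernel". The regression step itself is not shown.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two non-negative histograms."""
    distance = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            distance += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * distance)


def gram_matrix(histograms, gamma=1.0):
    """Pairwise kernel values between all training histograms."""
    return [[chi2_kernel(p, q, gamma) for q in histograms] for p in histograms]


# Three toy histograms (made-up values); the first two are similar to each other.
toy_histograms = [[0.5, 0.5, 0.0], [0.4, 0.4, 0.2], [0.0, 0.1, 0.9]]
K = gram_matrix(toy_histograms)
```

A matrix like `K` could then be handed to a kernel regressor that accepts precomputed kernels. Because support vector regression predicts a single output, one plausible design is to fit one regressor per dimension of the D-dimensional semantic vector.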

Semantic Word Vector Approach

Zeroshot Recognition: do a nearest-neighbour search in the semantic embedding space to predict the category of test data; the class at minimal distance from the test data is chosen. Candidate classes shown: Basketball, Kayaking, Fencing, Diving, HulaHoop, TaiChi, Rafting.
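
The nearest-neighbour step on this slide can be sketched as follows. Cosine distance, the toy 3-dimensional prototypes, and the names `cosine_distance` and `predict_class` are assumptions for illustration; the slide only says that the class at minimal distance is chosen.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def predict_class(projected_video, prototypes):
    """Return the unseen class whose word vector is closest to the projection."""
    return min(
        prototypes,
        key=lambda name: cosine_distance(projected_video, prototypes[name]),
    )


# Made-up class prototypes standing in for word vectors of unseen classes.
toy_prototypes = {
    "Basketball": [0.9, 0.1, 0.0],
    "Kayaking": [0.1, 0.8, 0.3],
    "Fencing": [0.0, 0.2, 0.9],
}
# A test video after it has been mapped into the semantic space.
prediction = predict_class([0.2, 0.7, 0.2], toy_prototypes)
```

No training video of the predicted class is needed: the class enters only through its word vector, which is what makes the recognition zero-shot.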

Domain Shift – Self Training: self-training, based on a KNN function, is applied to tackle domain shift. Diagram: a 4-NN example with points Z1 to Z8 in the semantic embedding space.
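
The formula on this slide did not survive in the transcript, so the following is only one plausible reading of KNN-based self-training: each class prototype is replaced by the mean of its K nearest projected test points. The update rule, the Euclidean distance, and the name `adapt_prototype` are all assumptions.

```python
import math


def adapt_prototype(prototype, test_projections, k=4):
    """Move a class prototype to the mean of its k nearest test projections."""
    neighbours = sorted(
        test_projections, key=lambda z: math.dist(prototype, z)
    )[:k]
    dim = len(prototype)
    return [sum(z[i] for z in neighbours) / len(neighbours) for i in range(dim)]


# Toy 2-d projections of unlabelled test videos (made-up values).
toy_projections = [
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [1.0, 1.2],  # cluster near the prototype
    [5.0, 5.0], [5.2, 4.8],                          # cluster far away
]
# 4-NN example: the prototype at the origin shifts toward the nearby cluster.
adapted = adapt_prototype([0.0, 0.0], toy_projections, k=4)
```

Under this reading, the adapted prototype sits where the test data actually falls, so a subsequent nearest-neighbour search is less affected by the shift between training and test classes.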

Domain Shift – Data Augmentation: the Target Dataset Train set (HMDB Train) and the Auxiliary Dataset Train set (UCF) are combined into an Augmented Train set; testing uses the Target Dataset Test set (HMDB Test). Diagram label: Visual Prototypes.

Experiments. Datasets: HMDB51 (51 classes, 6766 videos); UCF101 (101 classes). Feature: Improved Trajectory Feature [1] with Bag of Words encoding. Semantic Embedding Space: skip-gram neural network model trained on the Google News dataset, 300-dimensional word vectors. [1] H. Wang and C. Schmid, "Action recognition with improved trajectories," ICCV 2013. [2] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," ECCV 2010.

Zeroshot Recognition. Data splits: random 50/50 split, repeated 30 times. Evaluation: mean classification accuracy, reported as average and deviation. Table columns: Dataset, Training Classes, Testing Classes; rows: HMDB, UCF.
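
The evaluation protocol on this slide can be sketched as below. The seed, the placeholder class names, the handling of odd class counts, and the use of the sample standard deviation are assumptions; the slide only specifies random 50/50 splits repeated 30 times, summarised by average and deviation.

```python
import random
import statistics


def random_class_splits(class_names, num_splits=30, seed=0):
    """Draw repeated random 50/50 splits of classes into train and test sets."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    half = len(class_names) // 2  # with an odd count, the extra class is tested
    splits = []
    for _ in range(num_splits):
        shuffled = list(class_names)
        rng.shuffle(shuffled)
        splits.append((shuffled[:half], shuffled[half:]))
    return splits


def summarise(accuracies):
    """Average and sample standard deviation of per-split accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder class names; a real run would use the dataset's action labels.
toy_classes = [f"class_{i}" for i in range(10)]
splits = random_class_splits(toy_classes)
```

Splitting by class rather than by video is what keeps the test classes unseen: no video of a test class is available during training in any of the 30 runs.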

Zeroshot Experiment Models. Baselines: Random Guess; Nearest Neighbour Classifier (NN); NN with Self-Training (NN+ST); NN with Data Augmentation (NN+Aux); NN with ST and Aux (NN+ST+Aux). Comparison of models: Direct Attribute Prediction (DAP); Indirect Attribute Prediction (IAP).

Zeroshot Experiment Quantitative Evaluation

Qualitative Insight: Without Augmentation vs. With Augmentation

Conclusion: We exploited a semantic embedding model for zeroshot action recognition and detection. We experimented on 2 popular action/event datasets for zeroshot learning. We proposed the first zeroshot data splits for 2 action/event datasets.

Thank You

Multishot Experiment. Data splits: standard data splits. Evaluation: mean category accuracy on HMDB51 and UCF101. Comparison of models: (1) low-level feature with direct SVM classifier; (2) human-labelled attributes; (3) embedding with linear SVM classifier.

Multishot Experiment Quantitative Analysis