A Hierarchical Deep Temporal Model for Group Activity Recognition


A Hierarchical Deep Temporal Model for Group Activity Recognition. MSc Thesis Defence, Srikanth Muralidharan, 12 April 2016. Good afternoon, and welcome to my thesis talk. I am going to present our work on group activity recognition using a hierarchical deep temporal model.

Outline. Part I: Introduction to Group Activity. Part II: Description of the Model. Part III: Experimental Results and Conclusion.

Part I: Introduction to Group Activity

Preview – Action Recognition Walking

Action Recognition Datasets: A brief overview. 2004: KTH dataset, 6 classes. 2010: Olympic Sports dataset, 16 classes. 2014: YouTube 1M dataset, 480+ classes.

Summary: Action Recognition. Task: predict what a single person is doing. Difficulties: intraclass variations and the unconstrained nature of videos.

Example: A surveillance scene. We consider two types of scenarios. The first is a surveillance scene. In this example, most of the people are walking on a sidewalk, so the video can be labelled as a walking scene.

It’s a walking scene. Person labels: Walking, Walking, Walking, Walking, Walking, Standing.

Example: Rally in a volleyball scene. The second example is a rally in a volleyball scene. Here, the high-level activity is determined by the main activity taking place, i.e. a player on the left side spiking. Therefore, we could label this scene as left_spike.

Left Spike. Person labels: Spiking, Waiting, Waiting, Standing, Waiting, Waiting, Moving.

Challenge 1: Context dependency. In some scenes, group activity = the majority's activity; in others, group activity = the key player's activity (example: Group Activity: Right spike). Challenge 2: High-level description.
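The context dependency above can be illustrated with a small sketch. It applies a plain majority vote to the person labels shown on the two example slides; the function name `majority_label` is hypothetical and is not part of the presented model.

```python
from collections import Counter


def majority_label(person_labels):
    """Return the most frequent person-level label in a frame."""
    return Counter(person_labels).most_common(1)[0][0]


# Person labels as listed on the two example slides.
surveillance = ["Walking"] * 5 + ["Standing"]
volleyball = ["Spiking", "Waiting", "Waiting", "Standing",
              "Waiting", "Waiting", "Moving"]

print(majority_label(surveillance))  # Walking
print(majority_label(volleyball))    # Waiting
```

The vote matches the scene label for the surveillance example, but for the volleyball example it returns Waiting, while the scene is labelled by the key player's action (a spike). A fixed majority rule therefore cannot cover both kinds of scene.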

Group Activity Recognition vs Action Recognition Walking

It’s hard! Group activity label Image Classifier Be careful with the description!

Intuitive fix: Use only the foreground features. The intuitive fix is to use just the features obtained from the foreground.

Group Activity: ???? Person classifier outputs: waiting, digging, waiting, spiking, waiting. We cut out all the people and extract their feature representations.

Possible Solution: Hierarchical model. Stage 1: Person feature extractor (person labels: digging, waiting, waiting, spiking, waiting), followed by pooling of the person features.

Possible Solution: Hierarchical model. Stage 2: Frame classifier. The pooled person features go in and the group activity comes out.
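The two stages can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions: the slides say only that person features are pooled, so element-wise max pooling is an assumption here, and the linear-plus-softmax frame classifier, the toy feature values, and all function names are hypothetical.

```python
import math


def max_pool(person_features):
    """Element-wise max over equal-length person feature vectors (assumed pooling)."""
    return [max(values) for values in zip(*person_features)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_frame(person_features, weights, biases):
    """Pool the person features, then apply a linear layer and a softmax."""
    pooled = max_pool(person_features)
    scores = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy 3-dimensional features for three people (made-up values).
people = [[0.1, 0.9, 0.0], [0.4, 0.2, 0.3], [0.0, 0.5, 0.8]]
# Toy classifier with two group-activity classes (made-up weights).
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
biases = [0.0, 0.0]

print(max_pool(people))  # [0.4, 0.9, 0.8]
probs = classify_frame(people, weights, biases)
```

Because pooling collapses the person axis, the frame classifier sees a vector of the same length no matter how many people appear in the frame.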

Part II: Description of the Model

Pipeline Overview. 1. Learn people representations. 2. Aggregate people representations. 3. Learn group representations.

From images to video clips. Given the person-level annotations, we track each person and assign the same label across the whole track.
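A minimal sketch of this label propagation, assuming a track is stored as a mapping from frame index to bounding box. The data layout, the function name `propagate_label`, and the box values are hypothetical; the tracker itself is not shown.

```python
def propagate_label(track, annotated_frame, label):
    """Give every frame in one person's track the label of the annotated frame.

    `track` maps frame index -> bounding box (x, y, w, h) for a single person.
    """
    if annotated_frame not in track:
        raise ValueError("the annotated frame must be part of the track")
    return {frame: (box, label) for frame, box in track.items()}


# Toy track: one person followed over three frames (boxes are made up).
track = {9: (50, 40, 20, 60), 10: (52, 40, 20, 60), 11: (55, 41, 20, 60)}
labelled = propagate_label(track, annotated_frame=10, label="Walking")

print(sorted(labelled))  # [9, 10, 11]
print(labelled[9][1])    # Walking
```

Only frame 10 carries a human annotation in this toy example; frames 9 and 11 inherit its label through the track.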

LSTM: An Introduction. LSTM stands for Long Short-Term Memory. It is a sequential neural network that learns from inputs of arbitrary length.

LSTM: An Introduction. [Diagram: an LSTM unrolled over time, with one LSTM block and one output per time step, up to the final input x(t=T).]
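For readers new to LSTMs, here is a minimal pure-Python sketch of one recurrent step and of unrolling it over a sequence. It uses the standard gate equations without peephole connections; the slides do not say which LSTM variant was used, so that choice, the random weights, and all names below are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four gates (toy values)."""
    rng = random.Random(seed)

    def gate():
        width = input_size + hidden_size
        W = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
             for _ in range(hidden_size)]
        return W, [0.0] * hidden_size

    return {name: gate() for name in ("input", "forget", "output", "candidate")}


def lstm_step(x, h, c, params):
    """One step of a standard LSTM cell: returns the new hidden and cell state."""
    z = x + h  # concatenate the input with the previous hidden state

    def affine(name):
        W, b = params[name]
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W, b)]

    i = [sigmoid(a) for a in affine("input")]
    f = [sigmoid(a) for a in affine("forget")]
    o = [sigmoid(a) for a in affine("output")]
    g = [math.tanh(a) for a in affine("candidate")]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def run_lstm(sequence, params, hidden_size):
    """Unroll the cell over a sequence of any length; one output per time step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


params = init_params(input_size=3, hidden_size=2)
short = run_lstm([[0.1, 0.2, 0.3]] * 4, params, hidden_size=2)
longer = run_lstm([[0.1, 0.2, 0.3]] * 10, params, hidden_size=2)
print(len(short), len(longer))  # 4 10
```

The same parameters handle both the 4-step and the 10-step sequence, which is the "arbitrary length" property mentioned on the slide.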

We use LSTMs to build the person classification model and to extract person features. We then construct an LSTM-based frame classifier on top of the pooled LSTM features.

Stage 1: Learning Individual Activity Features. [Diagram: at each time step, AlexNet feeds an LSTM, which feeds a softmax.]

Stage 1: Learning Individual Activity Features. [Diagram: each person (Person 1, Person 2, Person 3, ..., Person n) passes through an LSTM to produce that person's feature representation.]

Stage 2: Learning Frame Representations
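As a rough sketch of how the input to the stage 2 model could be assembled, the snippet below pools person features at every time step of a clip, giving one fixed-length vector per frame. Element-wise max pooling, the function name `pool_clip`, and the toy values are assumptions; the second-stage LSTM that would consume this sequence is not shown.

```python
def pool_clip(clip_features):
    """Pool over people at each time step.

    clip_features[t][p] is the feature vector of person p at time step t.
    Returns one pooled vector per time step (element-wise max, an assumption).
    """
    return [[max(values) for values in zip(*people)] for people in clip_features]


# Toy clip with two time steps and 2-dimensional person features (made up).
clip = [
    [[0.1, 0.7], [0.3, 0.2]],              # t = 0, two people
    [[0.5, 0.1], [0.2, 0.6], [0.4, 0.9]],  # t = 1, three people
]
sequence = pool_clip(clip)
print(sequence)  # [[0.3, 0.7], [0.5, 0.9]]
```

The number of tracked people can differ between frames, yet every entry of `sequence` has the same length, so it can be fed to a temporal model step by step.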

Part III: Experimental Results and Conclusion

Tracker details. We obtain 10-frame video clips: 5 frames before and 4 frames after each annotated frame. We use LSTMs with a batch size of 10 video clips. The tracked frames carry no annotations of their own, so this makes use of unlabelled data.
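The clip construction amounts to simple index arithmetic: 5 frames before, the annotated frame itself, and 4 frames after give 10 frames. The function name `clip_window` and the example frame index are hypothetical, and the sketch ignores clips that would run past the start or end of a video.

```python
def clip_window(annotated_frame, before=5, after=4):
    """Frame indices of the clip built around one annotated frame."""
    return list(range(annotated_frame - before, annotated_frame + after + 1))


frames = clip_window(100)
print(frames)       # [95, 96, 97, 98, 99, 100, 101, 102, 103, 104]
print(len(frames))  # 10
```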

Collective Activity Dataset. The same label set is used for people and group activities. 1925 video clips for training, 638 video clips for testing. Labels: 1. Crossing 2. Queueing 3. Talking 4. Waiting 5. Walking

Experimental results on Collective Activity Dataset

Method                   Accuracy (%)
Image Classification     63.0
Person Classification    61.8
Person - Fine tuned      66.3
Temp Model - Person      62.2
Temp Model - Image       64.2
Our Model                81.5

Experimental results on Collective Activity Dataset: comparison with previous work

Method                                           Accuracy (%)
Contextual Model [Lan NIPS'10]                   79.1
Deep Structured Model [Deng BMVC'15]             80.6
Our Model                                        81.5
Cardinality Kernel [Hajimirsadeghi CVPR'15]      83.4

Volleyball Dataset: Frame Labels. 1047 images for training, 478 images for testing. Labels: 1. Spiking 2. Setting 3. Passing

Volleyball Dataset: People Labels. 1047 images for training, 478 images for testing. Labels: 1. Waiting 2. Digging 3. Setting 4. Spiking 5. Falling 6. Blocking

Experimental results on Volleyball Dataset

Method                   Accuracy (%)
Image Classification     46.7
Person Classification    33.1
Person - Fine tuned      35.2
Temp Model - Person      45.9
Temp Model - Image       37.4
Our Model                51.1


Visualization of results. [Figure: example frames captioned Left set, Right pass, Right spike, Left pass, Left spike (Left pass), Right spike (Left spike).]

Conclusion. A two-stage hierarchical model for group activity recognition. LSTMs are a highly effective temporal model and source of temporal features. Simple pooling gives decent modelling of the relations between people.

Future Work. Semi-supervised approaches to diversify the new datasets. Experiments under a weakly supervised setting.

THANK YOU