A Hierarchical Deep Temporal Model for Group Activity Recognition MSc Thesis Defence Srikanth Muralidharan 12 April 2016 Good afternoon. Welcome to my thesis talk. I am going to present our work on group activity recognition using a hierarchical deep temporal model.
Outline Part I : Introduction to Group Activity Part II : Description of the Model Part III : Experimental Results and Conclusion
Part I : Introduction to Group Activity
Preview – Action Recognition Walking
Action Recognition Datasets : A brief overview 2010 Olympic sports dataset 16 classes 2014 Youtube 1M dataset 480+ classes 2004 KTH dataset 6 classes
Summary-Action Recognition Task : Predict what a single person is doing Difficulty – intraclass variations Difficulty - unconstrained nature of videos
Example : A surveillance scene We consider two types of scenarios. The first is a surveillance scene. In this example, most of the people are walking on a sidewalk, so the video can be labelled as a walking scene.
It’s a walking scene. Walking Walking Walking Walking Walking Standing
Example: Rally in a Volleyball Scene The second example is a rally in a volleyball scene. Here, the high-level activity is determined by the main activity taking place, i.e. a player on the left side spiking. Therefore, we could label this scene as left_spike.
Left Spike Spiking Waiting Waiting Standing Waiting Waiting Moving
Challenge 1 – Context Dependency Group Activity = Majority’s Activity Group Activity = Key Player’s Activity Group Activity – Right spike Challenge 2 – High-level description
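The two labelling rules above can be illustrated with a short Python sketch. It is only an illustration of the context dependency, not the model from the thesis: the function names `majority_activity` and `key_player_activity` and the `key_actions` priority list are hypothetical.

```python
from collections import Counter

def majority_activity(person_labels):
    """Group activity = the most common individual action (surveillance-style scenes)."""
    return Counter(person_labels).most_common(1)[0][0]

def key_player_activity(person_labels, key_actions, default="unknown"):
    """Group activity = the first key action that anyone performs (sports-style scenes).

    `key_actions` is a hypothetical priority list used only for this illustration.
    """
    for action in key_actions:
        if action in person_labels:
            return action
    return default

# Person labels taken from the two example slides.
surveillance = ["walking"] * 5 + ["standing"]
volleyball = ["spiking", "waiting", "waiting", "standing", "waiting", "waiting", "moving"]

print(majority_activity(surveillance))                           # walking
print(majority_activity(volleyball))                             # waiting
print(key_player_activity(volleyball, ["spiking", "setting"]))   # spiking
```

The majority rule works for the sidewalk scene but labels the volleyball rally as "waiting", which is why a fixed rule over person labels is not enough.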
Group Activity Recognition vs Action Recognition Walking
It’s hard! Group activity label Image Classifier Be careful with the description!
Intuitive fix: Use only the foreground features Therefore, the intuitive fix is to use just the features obtained from the foreground.
Group Activity – ???? waiting Person classifier Digging waiting spiking waiting Person classifier We cut out all the people and extract their feature representations.
Possible Solution - Hierarchical model Pool person features Digging waiting waiting spiking waiting Stage 1 - Person feature extractor
Possible Solution - Hierarchical model Output Group Activity Stage 2: Frame Classifier Pooled person features
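The pooling step between the two stages can be sketched as follows. The slides only say that person features are pooled; element-wise max pooling is assumed here for illustration, and the function name `max_pool_people` and the toy feature values are made up.

```python
def max_pool_people(person_features):
    """Element-wise max over per-person feature vectors of equal length.

    Max pooling is an assumption for this sketch; the slides only say "pool".
    """
    if not person_features:
        raise ValueError("need at least one person")
    return [max(values) for values in zip(*person_features)]

# Toy 3-dimensional features for three people in one frame (made-up numbers).
people = [
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 0.3],
    [0.0, 0.5, 0.8],
]
frame_feature = max_pool_people(people)
print(frame_feature)  # [0.4, 0.9, 0.8]
```

The pooled vector has the same length no matter how many people are in the frame, so a single frame classifier can sit on top of it.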
Part II : Description of the Model
Pipeline Overview Learn People Representations Aggregate People Representations Learn Group Representations
From images to video clips Given the person-level annotations, we track each person and assign the same label across the track.
LSTM – An Introduction Stands for Long Short-Term Memory. A sequential neural network that learns from arbitrary-length inputs.
LSTM – An Introduction Output Output Output LSTM LSTM LSTM x(t=T)
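To make the unrolled picture concrete, here is a minimal sketch of the standard LSTM update in plain Python, using scalar input and state to keep it short. It is not the implementation used in the thesis; the names `lstm_step` and `run_lstm` and the weight values are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell (input, hidden and cell state are single floats).

    `params` maps each gate name ("i", "f", "o", "g") to (w_x, w_h, bias).
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate
    f = sigmoid(pre("f"))    # forget gate
    o = sigmoid(pre("o"))    # output gate
    g = math.tanh(pre("g"))  # candidate cell value
    c = f * c_prev + i * g   # new cell state
    h = o * math.tanh(c)     # new hidden state, also the output at this step
    return h, c

def run_lstm(sequence, params):
    """Apply the same cell to a sequence of any length; return one output per step."""
    h, c = 0.0, 0.0
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs

# Arbitrary illustrative weights, not learned values.
params = {gate: (0.5, 0.5, 0.0) for gate in ("i", "f", "o", "g")}
print(len(run_lstm([1.0, 0.5, -0.3], params)))  # 3
print(len(run_lstm([0.2] * 10, params)))        # 10
```

The same weights are reused at every time step, which is what lets the network accept inputs of arbitrary length.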
We use LSTMs to build the person classification model and to extract person features. We construct an LSTM-based frame classifier on top of the pooled LSTM features.
Stage 1 : Learning Individual Activity Features Softmax Softmax Softmax LSTM LSTM LSTM AlexNet AlexNet AlexNet
Stage 1 : Learning Individual Activity Features
Person 1 -> LSTM -> Person 1 feature representation
Person 2 -> LSTM -> Person 2 feature representation
Person 3 -> LSTM -> Person 3 feature representation
. . .
Person n -> LSTM -> Person n feature representation
Stage 2: Learning Frame Representations
Part III : Experimental Results and Conclusion
Tracker details We obtain 10-frame video clips: 5 frames before and 4 frames after each annotated frame. We use LSTMs with a batch size of 10 video clips. No annotations for the tracked frames: use of unlabelled data.
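The clip window described above can be written out as a small helper. This is a sketch under the stated numbers (5 frames before, the annotated frame, 4 frames after); the function name `clip_frame_indices` is hypothetical and the sketch ignores clips that would run past the start or end of a video.

```python
def clip_frame_indices(annotated_idx, before=5, after=4):
    """Frame indices of the clip around an annotated frame, both ends inclusive."""
    return list(range(annotated_idx - before, annotated_idx + after + 1))

frames = clip_frame_indices(100)
print(frames[0], frames[-1])  # 95 104
print(len(frames))            # 10
```

Only the annotated frame carries a label; the other nine frames inherit it through tracking.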
Collective Activity Dataset Same label set for people and group activities 1925 video clips for training, 638 video clips for testing 1. Crossing 2. Queueing 3. Talking 4. Waiting 5. Walking
Experimental results on Collective Activity Dataset
Method                    Accuracy
Image Classification      63.0
Person Classification     61.8
Person - Fine tuned       66.3
Temp Model - Person       62.2
Temp Model - Image        64.2
Our Model                 81.5
Experimental results on Collective Activity Dataset
Method                                        Accuracy
Contextual Model [Lan NIPS’10]                79.1
Deep Structured Model [Deng BMVC’15]          80.6
Our Model                                     81.5
Cardinality Kernel [Hajimirsadeghi CVPR’15]   83.4
Volleyball Dataset – Frame Labels 1047 images for training, 478 images for testing 1. Spiking 2. Setting 3. Passing
Volleyball Dataset – People Labels 1047 images for training, 478 images for testing 1. Waiting 2. Digging 3. Setting 4. Spiking 5. Falling 6. Blocking
Experimental results on Volleyball Dataset
Method                    Accuracy
Image Classification      46.7
Person Classification     33.1
Person - Fine tuned       35.2
Temp Model - Person       45.9
Temp Model - Image        37.4
Our Model                 51.1
Visualization of results Left set Right pass Right Spike Left pass Left spike (Left pass) Right spike (Left spike)
Conclusion
A two-stage hierarchical model for group activity recognition
LSTMs as a highly effective temporal model and temporal feature source
Decent people-relation modeling with simple pooling
Future Work Semi-supervised approaches to diversify the new datasets Experiments under weakly supervised setting
THANK YOU