Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition. ACM Multimedia, 2013. Xiaodan Liang (Sun Yat-Sen University), Liang Lin (Sun Yat-Sen University), Liangliang Cao (IBM Watson Research Center)
Difficulties in human action recognition
We want to build a model that works well even with: variant scales and views; diversified appearance; cluttered background.
Our belief: There are discriminative anchor frames and distinct body parts despite those variations. How to model?
And-Or graph: We borrow the idea of the And-Or graph, which hierarchically decomposes patterns with "and" nodes and "or" nodes, allows rich structural variation, and builds a stochastic grammar. We want to extend it to the video domain! Ref: Zhu and Mumford, A Stochastic Grammar of Images, 2006
Challenges: Videos are more complicated than images, so (1) we need more powerful models to capture more diversity, and (2) we need more efficient algorithms to solve such models.
Outline of our contributions. A powerful model: spatial composition; temporal composition; contextual interaction. An efficient learner: concave-convex procedure (CCCP); structural reconfiguration.
Spatio-temporal and-or graph. Root-node: global potential function. And-nodes: temporal anchor frames. Or-nodes: structural alternatives. Leaf-nodes: local classifiers for action parts. Spatial and temporal edges: contextual interactions.
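The node hierarchy on this slide can be sketched as a small set of Python data classes. This is only an illustrative layout, not the authors' implementation: the class names, the `expected_frame` field, the `build_staog` helper and its `video_len` parameter are all hypothetical, and the spatial and temporal edges (contextual interactions) are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LeafNode:
    """Local classifier for one action part (a weight vector)."""
    weights: List[float]


@dataclass
class OrNode:
    """Structural alternatives: one of the leaves is switched on."""
    leaves: List[LeafNode]


@dataclass
class AndNode:
    """One temporal anchor frame, composed of all of its or-nodes."""
    or_nodes: List[OrNode]
    expected_frame: int  # hypothetical: expected position of the anchor


@dataclass
class RootNode:
    """Top of the graph; holds the anchor frames of one action model."""
    and_nodes: List[AndNode]


def build_staog(num_anchors, parts_per_anchor, max_leaves, feat_dim, video_len=100):
    """Build an untrained graph with zero weights (hypothetical helper)."""
    step = video_len // (num_anchors + 1)  # spread anchors evenly (assumption)
    and_nodes = []
    for a in range(num_anchors):
        or_nodes = [
            OrNode([LeafNode([0.0] * feat_dim) for _ in range(max_leaves)])
            for _ in range(parts_per_anchor)
        ]
        and_nodes.append(AndNode(or_nodes, expected_frame=(a + 1) * step))
    return RootNode(and_nodes)


graph = build_staog(num_anchors=3, parts_per_anchor=4, max_leaves=2, feat_dim=5)
num_leaves = sum(len(o.leaves) for a in graph.and_nodes for o in a.or_nodes)
```

Here `max_leaves` plays the role of the maximum number m of leaf-nodes under each or-node discussed in the backup slide.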
Temporal Composition Spatial Composition
Spatial composition: one and-node for each anchor frame, with BOW histogram features. The and-node aggregates the response scores from all or-nodes. Switch variables in the hierarchy (i.e. the or-nodes) are introduced to model the variance.
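A minimal sketch of how one anchor frame could be scored, under stated assumptions: each leaf is a linear classifier over a part feature (standing in for a BOW histogram), each or-node's switch variable picks the highest-scoring leaf, and the and-node sums the or-node responses. The function names and the max-and-sum rule are assumptions for illustration; the slide's actual potential function is not reproduced here.

```python
def dot(w, x):
    """Plain dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def score_or_node(leaf_weights, feature):
    """Return (best response, switch index) over the leaf alternatives."""
    responses = [dot(w, feature) for w in leaf_weights]
    switch = max(range(len(responses)), key=responses.__getitem__)
    return responses[switch], switch


def score_and_node(or_nodes, part_features):
    """Sum the or-node responses for one anchor frame.

    or_nodes: list of or-nodes, each a list of leaf weight vectors.
    part_features: one feature vector per or-node.
    """
    total, switches = 0.0, []
    for leaf_weights, feature in zip(or_nodes, part_features):
        response, switch = score_or_node(leaf_weights, feature)
        total += response
        switches.append(switch)
    return total, switches


# Toy example: two or-nodes, two leaf alternatives each, 2-D features.
or_nodes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-1.0, 2.0]],
]
part_features = [[0.2, 0.9], [1.0, 0.0]]
anchor_score, switches = score_and_node(or_nodes, part_features)
```

In this toy input the first or-node switches to its second leaf and the second or-node to its first, which is the kind of structural alternative the switch variables are meant to capture.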
Temporal composition: the root node verifies the temporal compatibility of the model. It applies a temporal displacement penalty and aggregates the scores from all anchor frames.
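The root-node aggregation can be sketched as follows. The form of the displacement penalty is not given on the slide, so this sketch assumes a quadratic penalty on the offset from an expected frame, searched within a small window; `place_anchor`, `score_root`, `penalty` and `window` are hypothetical names, and no ordering constraint between anchors is enforced.

```python
def place_anchor(frame_scores, expected, penalty, window):
    """Pick the frame that maximizes score minus displacement penalty."""
    lo = max(0, expected - window)
    hi = min(len(frame_scores) - 1, expected + window)

    def penalized(t):
        # Assumed quadratic penalty on temporal displacement.
        return frame_scores[t] - penalty * (t - expected) ** 2

    best_t = max(range(lo, hi + 1), key=penalized)
    return penalized(best_t), best_t


def score_root(anchor_scores, expected_frames, penalty=0.1, window=2):
    """Aggregate the penalized scores of all anchor frames."""
    total, placements = 0.0, []
    for frame_scores, expected in zip(anchor_scores, expected_frames):
        score, t = place_anchor(frame_scores, expected, penalty, window)
        total += score
        placements.append(t)
    return total, placements


# Toy example: two anchors, per-frame and-node scores over a 5-frame video.
anchor_scores = [
    [0.1, 0.8, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.1, 0.9],
]
video_score, placements = score_root(anchor_scores, expected_frames=[1, 3])
```

The second anchor moves one frame away from its expected position because the gain in appearance score outweighs the penalty, which is the trade-off a temporal displacement term is meant to express.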
Final model (STAOG) Spatial interactions Temporal interactions
Our learner: The latent variables are the spatial configuration alternatives and the temporal displacements. An efficient learner is needed to find the optimal response over these latent variables. The resulting objective is not a convex function and cannot be solved by traditional convex solvers.
Our learner (2): The objective splits into a convex part and a concave part. This problem can be solved by the concave-convex procedure, which converts it into a sequence of convex programs. Ref: Yuille and Rangarajan, The concave-convex procedure, 2003
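To illustrate the procedure itself, here is a toy one-dimensional CCCP, not the model's actual objective. It minimizes f(x) = x^4 - 2x^2, split by assumption into a convex part u(x) = x^4 and a concave part v(x) = -2x^2. Each step linearizes v at the current point and minimizes the resulting convex surrogate u(x) + v'(x_t) x, which here has the closed-form solution x = cbrt(x_t).

```python
import math


def objective(x):
    """Non-convex toy objective: convex x**4 plus concave -2*x**2."""
    return x ** 4 - 2.0 * x ** 2


def cccp_step(x_t):
    """One CCCP update.

    The concave part v(x) = -2x^2 has gradient v'(x_t) = -4*x_t.
    The surrogate x**4 - 4*x_t*x is convex; setting its derivative
    4*x**3 - 4*x_t to zero gives x = cbrt(x_t).
    """
    return math.copysign(abs(x_t) ** (1.0 / 3.0), x_t)


def cccp(x0, num_iters=40):
    """Run CCCP from x0 and record the objective at every iterate."""
    x = x0
    history = [objective(x)]
    for _ in range(num_iters):
        x = cccp_step(x)
        history.append(objective(x))
    return x, history


x_star, history = cccp(0.5)
```

The objective never increases along the iterates, and the result depends on the starting point: positive starts reach the minimum at x = 1 and negative starts reach x = -1, mirroring the local (rather than global) convergence of CCCP.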
Learning the structure: In addition to inferring latent variables, we also update the model structure by clustering and greedy search. The whole learner works iteratively, like the EM algorithm, and converges to a local minimum rather than a global one. Our learner took 150-200 seconds to infer one video from the Olympic Sports dataset and the UCF YouTube dataset.
Discriminative frames and body parts. Olympic Sports dataset (High Jump): frames #26, #74, #103, #121, #152.
Discriminative frames and body parts. YouTube dataset (Swing): frames #49, #97, #132, #369, #114, #236.
Average Precision on the Olympic Sports dataset
Accuracy on the YouTube dataset
Conclusion: The And-Or graph is a powerful representation that can be generalized to model large variance and diversity in the video domain. We develop efficient algorithms to learn complex spatio-temporal And-Or graphs. Our future work is to apply the And-Or graph to large-scale problems.
Thank you! Gracias! 谢谢!
Backup slides
Experimental results: the effectiveness of the or-nodes in spatial compositions. Empirical analysis for different settings of spatial compositions, where we set different maximum numbers m of leaf-nodes under the or-nodes. Each bar represents the accuracy for one action category. The color indicates the setting, m = 2 or m = 4.