Xiaodan Liang Sun Yat-Sen University

1 Xiaodan Liang Sun Yat-Sen University
ACM Multimedia, 2013
Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition
Xiaodan Liang, Sun Yat-Sen University
Liang Lin, Sun Yat-Sen University
Liangliang Cao, IBM Watson Research Center

2 Difficulties in human action recognition

3 We want to build a model which works well even with
Varying scales and views
Diverse appearance
Cluttered background

4 Our belief: There are discriminative anchor frames and distinct body parts despite those variations. How to model?

7 And-Or graph
We borrow the idea of the And-Or graph, which
hierarchically decomposes patterns with “and” nodes and “or” nodes
allows rich structural variation
builds a stochastic grammar
And we want to extend it to the video domain!
Ref: Zhu and Mumford, A Stochastic Grammar of Images, 2006
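
As a rough illustration of the And-Or idea (a minimal sketch, not the authors' implementation), the Python below represents a pattern as a tree in which and-nodes sum the scores of all their children and or-nodes keep only the best-scoring alternative. The class names `Leaf`, `And`, `Or`, the function `parse`, and all numeric scores are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    name: str
    score: float  # hypothetical local evidence for this terminal pattern


@dataclass
class And:
    children: List["Node"]  # composition: every child must be present


@dataclass
class Or:
    children: List["Node"]  # alternatives: exactly one child is selected


Node = Union[Leaf, And, Or]


def parse(node: Node) -> Tuple[float, List[str]]:
    """Return the best score and the names of the leaves that achieve it."""
    if isinstance(node, Leaf):
        return node.score, [node.name]
    results = [parse(child) for child in node.children]
    if isinstance(node, And):
        # And-node: add up the scores of all children.
        total = sum(score for score, _ in results)
        leaves = [name for _, names in results for name in names]
        return total, leaves
    # Or-node: keep only the highest-scoring alternative.
    return max(results, key=lambda r: r[0])


# Hypothetical example: two body parts, each with two alternative poses.
graph = And([
    Or([Leaf("arms_up", 2.0), Leaf("arms_down", 0.5)]),
    Or([Leaf("legs_bent", 1.0), Leaf("legs_straight", 1.5)]),
])
print(parse(graph))  # (3.5, ['arms_up', 'legs_straight'])
```

Choosing one child per or-node is what lets a single graph describe many structurally different configurations of the same pattern.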

8 Challenges
Videos are more complicated than images, so
we need more powerful models to capture greater diversity
we need more efficient algorithms to solve such models

9 Outline of our contributions
A powerful model
  Spatial composition
  Temporal composition
  Contextual interaction
An efficient learner
  Concave-convex procedure (CCCP)
  Structural reconfiguration

10 Spatio-temporal and-or graph
Root-node: global potential function
And-nodes: temporal anchor frames
Or-nodes: structural alternatives
Leaf-nodes: local classifiers for action parts
Spatial and temporal edges: contextual interactions

11 Temporal Composition Spatial Composition

12 Spatial composition
And-node for each anchor frame
BOW histogram
Aggregates the response scores from all or-nodes
Switch variables in the hierarchy (i.e., or-nodes) are introduced to model the variance.
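
The aggregation described on this slide might be sketched as follows. This is an assumption-laden illustration rather than the model from the paper: it assumes each or-node sees one bag-of-words histogram, that each leaf-node is a linear classifier over that histogram, and it leaves out the contextual interaction terms. The function names `or_node_response` and `and_node_response` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(wi * hi for wi, hi in zip(w, h))


def or_node_response(
    leaf_weights: List[Sequence[float]], histogram: Sequence[float]
) -> Tuple[float, int]:
    """Score every leaf classifier and return the best score together with
    the switch variable (the index of the selected leaf)."""
    scores = [dot(w, histogram) for w in leaf_weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def and_node_response(
    or_nodes: List[List[Sequence[float]]], histograms: List[Sequence[float]]
) -> Tuple[float, List[int]]:
    """Sum the or-node responses for one anchor frame (assumed additive)."""
    total, switches = 0.0, []
    for leaf_weights, histogram in zip(or_nodes, histograms):
        score, switch = or_node_response(leaf_weights, histogram)
        total += score
        switches.append(switch)
    return total, switches


# Hypothetical anchor frame with two or-nodes, two leaves each.
or_nodes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0]],
]
histograms = [[1.0, 3.0], [1.0, 4.0]]
print(and_node_response(or_nodes, histograms))  # (5.5, [1, 0])
```

The returned switch indices play the role of latent variables: they record which structural alternative explained each part of the frame.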

13 Temporal composition
Root node for verifying the temporal compatibility of the model
Temporal displacement penalty
Aggregates the scores from all anchor frames
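
One way to picture the temporal composition is the sketch below. It is only an illustration under stated assumptions: the displacement penalty is taken to be quadratic, each anchor frame is placed independently within a small window around its expected position, and `frame_scores[k][t]` stands for the and-node response of anchor `k` at video frame `t`. The names `place_anchor`, `root_response`, `max_shift`, and `penalty_weight` are hypothetical.

```python
from typing import List, Sequence, Tuple


def place_anchor(
    scores: Sequence[float], expected_t: int, max_shift: int, penalty_weight: float
) -> Tuple[float, int]:
    """Pick the frame near expected_t that maximizes score minus penalty."""
    best_value, best_t = float("-inf"), expected_t
    lo = max(0, expected_t - max_shift)
    hi = min(len(scores) - 1, expected_t + max_shift)
    for t in range(lo, hi + 1):
        # Assumed quadratic penalty on the temporal displacement.
        value = scores[t] - penalty_weight * (t - expected_t) ** 2
        if value > best_value:
            best_value, best_t = value, t
    return best_value, best_t


def root_response(
    frame_scores: List[Sequence[float]],
    expected_ts: Sequence[int],
    max_shift: int = 2,
    penalty_weight: float = 0.5,
) -> Tuple[float, List[int]]:
    """Aggregate the penalized scores of all anchor frames at the root."""
    total, placements = 0.0, []
    for scores, expected_t in zip(frame_scores, expected_ts):
        value, t = place_anchor(scores, expected_t, max_shift, penalty_weight)
        total += value
        placements.append(t)
    return total, placements


# Hypothetical per-frame responses for two anchors in a five-frame clip.
frame_scores = [
    [0.0, 1.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 8.0],
]
print(root_response(frame_scores, [1, 3], max_shift=1, penalty_weight=0.5))
# (11.0, [2, 4])
```

With a larger `penalty_weight` the anchors stay at their expected positions, which is how the penalty trades appearance evidence against temporal consistency.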

14 Final model (STAOG) Spatial interactions Temporal interactions

15 Our learner
The latent variables are the spatial configuration alternatives and the temporal displacements. An efficient learner is needed to find the optimal response over them. The resulting objective is not convex, so it cannot be solved directly by traditional convex solvers.

17 Our learner (2) Convex Concave
This problem can be solved by the concave-convex procedure (CCCP), which converts it into a sequence of convex programs.
Ref: Yuille and Rangarajan, The Concave-Convex Procedure, 2003
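
To make the procedure concrete, here is a toy sketch of CCCP on a one-dimensional objective, f(x) = x^4 - 2x^2, which splits into a convex part x^4 and a concave part -2x^2. This toy objective is an assumption chosen because each convex subproblem has a closed-form solution; it is not the learning objective used in the talk. The function names are hypothetical.

```python
import math
from typing import List


def objective(x: float) -> float:
    # Convex part x**4 plus concave part -2 * x**2.
    return x ** 4 - 2.0 * x ** 2


def cccp_step(x_t: float) -> float:
    # Linearize the concave part at x_t; its gradient there is -4 * x_t.
    # The convex surrogate x**4 - 4 * x_t * x is minimized where
    # 4 * x**3 = 4 * x_t, i.e. at the real cube root of x_t.
    return math.copysign(abs(x_t) ** (1.0 / 3.0), x_t)


def cccp(x0: float, iterations: int = 50) -> List[float]:
    """Run CCCP from x0 and return every iterate, starting with x0."""
    xs = [x0]
    for _ in range(iterations):
        xs.append(cccp_step(xs[-1]))
    return xs


xs = cccp(0.1)
print(round(xs[-1], 6), round(objective(xs[-1]), 6))
```

Each step solves a convex problem and never increases the objective, but the limit depends on the starting point: a positive start moves toward x = 1 and a negative start toward x = -1, which reflects the local nature of the solution.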

18 Learning the structure
In addition to inferring the latent variables, we also update the model structure by clustering and greedy search. The whole learner works iteratively, like the EM algorithm, and converges to a local minimum rather than a global one. It took our learner seconds to infer one video from the Olympic Sports dataset and the UCF YouTube dataset.

19 Discriminative frames and body parts
Olympic Sports dataset (High Jump): Frame #26, Frame #74, Frame #103, Frame #121, Frame #152

20 Discriminative frames and body parts
YouTube dataset (Swing): Frame #49, Frame #97, Frame #132, Frame #369, Frame #114, Frame #236

21 Average Precision on the Olympic Sports dataset

22 Accuracy on the YouTube dataset

23 Conclusion
The And-Or graph is a powerful representation that can be generalized to model large variance and diversity in the video domain.
We develop efficient algorithms to learn complex spatio-temporal And-Or graphs.
Our future work is to apply the And-Or graph to large-scale problems.

24 Thank you! Gracias! 谢谢!

25 Backup slides

26 Experimental results
The effectiveness of the or-nodes in spatial compositions
Empirical analysis for different settings of spatial compositions, where we set different maximum numbers m of leaf-nodes under the or-nodes. Each bar represents the accuracy for one action category. The color indicates the setting, m = 2 or m = 4.

