Xiaodan Liang Sun Yat-Sen University

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition. ACM Multimedia, 2013. Xiaodan Liang (Sun Yat-Sen University), Liang Lin (Sun Yat-Sen University), Liangliang Cao (IBM Watson Research Center).

Difficulties in human action recognition

We want to build a model that works well even with varying scales and views, diverse appearance, and cluttered backgrounds.

Our belief: there are discriminative anchor frames and distinct body parts despite those variations. How do we model them?

And-Or graph: We borrow the idea of the And-Or graph, which hierarchically decomposes patterns with "and" nodes and "or" nodes, allows rich structural variation, and builds a stochastic grammar. We want to extend it to the video domain! Ref: Zhu and Mumford, A Stochastic Grammar of Images, 2006

Challenges: Videos are more complicated than images, so we need more powerful models to capture more diversity, and more efficient algorithms to solve such models.

Outline of our contributions. A powerful model: spatial composition, temporal composition, contextual interaction. An efficient learner: concave-convex procedure (CCCP), structural reconfiguration.

Spatio-temporal And-Or graph. Root-node: global potential function. And-nodes: temporal anchor frames. Or-nodes: structural alternatives. Leaf-nodes: local classifiers for action parts. Spatial and temporal edges: contextual interactions.
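The node hierarchy on this slide can be pictured as a small nested data structure. The following is only a minimal sketch: the class names, the `build_staog` helper and its parameters, and the zero-initialised weight vectors are all hypothetical illustrations, not the authors' implementation, and the spatial and temporal edges are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LeafNode:
    """Local classifier for one action part (weights are placeholders)."""
    weights: List[float]


@dataclass
class OrNode:
    """Structural alternatives: one child leaf is selected at inference."""
    leaves: List[LeafNode] = field(default_factory=list)


@dataclass
class AndNode:
    """One temporal anchor frame, composed of all of its or-nodes."""
    or_nodes: List[OrNode] = field(default_factory=list)


@dataclass
class RootNode:
    """Global potential over all anchor frames."""
    and_nodes: List[AndNode] = field(default_factory=list)


def build_staog(num_anchors: int, num_or_nodes: int,
                max_leaves: int, feature_dim: int) -> RootNode:
    """Build an empty graph with the given (assumed) fan-out at each level."""
    return RootNode([
        AndNode([
            OrNode([LeafNode([0.0] * feature_dim) for _ in range(max_leaves)])
            for _ in range(num_or_nodes)
        ])
        for _ in range(num_anchors)
    ])


def count_nodes(root: RootNode) -> Dict[str, int]:
    """Count nodes per level, as a quick check of the hierarchy."""
    ors = [o for a in root.and_nodes for o in a.or_nodes]
    return {
        "and": len(root.and_nodes),
        "or": len(ors),
        "leaf": sum(len(o.leaves) for o in ors),
    }


graph = build_staog(num_anchors=3, num_or_nodes=4, max_leaves=2, feature_dim=5)
print(count_nodes(graph))
```

The fan-out values here are arbitrary; the point is only that each level owns a list of children of the next level down.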

Temporal Composition Spatial Composition

Spatial composition: one and-node for each anchor frame (BoW histogram), which aggregates the response scores from all or-nodes. Switch variables in the hierarchy (i.e., or-nodes) are introduced to model the variance.
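One way to read the spatial composition is sketched below. It assumes linear scores (dot products) and a max over the alternatives of each or-node; the slide's exact potentials are not preserved in this transcript, so the function names, the linear form, and the toy numbers are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Plain dot product, standing in for a linear classifier response."""
    return sum(a * b for a, b in zip(w, x))


def or_node_response(leaf_weights: List[Sequence[float]],
                     part_feature: Sequence[float]) -> Tuple[float, int]:
    """Pick the best-scoring leaf; the index plays the role of the switch variable."""
    scores = [dot(w, part_feature) for w in leaf_weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def and_node_score(frame_weights: Sequence[float],
                   frame_hist: Sequence[float],
                   or_nodes: List[List[Sequence[float]]],
                   part_features: List[Sequence[float]]) -> Tuple[float, List[int]]:
    """Frame-level term (assumed BoW histogram) plus the sum of or-node responses."""
    total = dot(frame_weights, frame_hist)
    switches = []
    for leaf_weights, feature in zip(or_nodes, part_features):
        score, choice = or_node_response(leaf_weights, feature)
        total += score
        switches.append(choice)
    return total, switches


# Toy example: two or-nodes, each with two alternative leaf classifiers.
score, switches = and_node_score(
    frame_weights=[0.5, 0.5],
    frame_hist=[1.0, 0.0],
    or_nodes=[[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [1.0, 1.0]]],
    part_features=[[0.2, 0.8], [1.0, 0.5]],
)
print(score, switches)
```

In this toy input the first or-node switches to its second leaf and the second or-node to its first, which is the kind of per-video structural choice the switch variables are meant to capture.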

Temporal composition: the root node verifies the temporal compatibility of the model. It applies a temporal displacement penalty and aggregates the scores from all anchor frames.
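The temporal composition can be sketched as follows, assuming each anchor frame may shift by a few frames from its expected position and pays a quadratic penalty for doing so. The quadratic form, the `max_shift` window, and all names and numbers are assumptions for illustration; the slide does not state the exact penalty.

```python
from typing import List, Sequence, Tuple


def best_displacement(frame_scores: Sequence[float], anchor: int,
                      max_shift: int, penalty: float) -> Tuple[float, int]:
    """Search shifts around the expected anchor position, trading score against displacement."""
    best_score, best_shift = float("-inf"), 0
    for shift in range(-max_shift, max_shift + 1):
        t = anchor + shift
        if 0 <= t < len(frame_scores):
            # Assumed quadratic displacement penalty.
            score = frame_scores[t] - penalty * shift * shift
            if score > best_score:
                best_score, best_shift = score, shift
    return best_score, best_shift


def root_score(per_anchor_scores: List[Sequence[float]], anchors: List[int],
               max_shift: int, penalty: float) -> Tuple[float, List[int]]:
    """Aggregate the penalised best score of every anchor frame."""
    total, shifts = 0.0, []
    for scores, anchor in zip(per_anchor_scores, anchors):
        score, shift = best_displacement(scores, anchor, max_shift, penalty)
        total += score
        shifts.append(shift)
    return total, shifts


# Toy example: two anchor frames scored over a six-frame video.
total, shifts = root_score(
    per_anchor_scores=[[0.1, 0.9, 0.2, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.3, 1.0, 0.4]],
    anchors=[2, 4],
    max_shift=2,
    penalty=0.1,
)
print(total, shifts)
```

Here the first anchor moves one frame earlier because the gain in frame score outweighs the penalty, while the second stays at its expected position.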

Final model (STAOG): spatial interactions and temporal interactions.

Our learner: The latent variables comprise the spatial configuration alternatives and the temporal displacements. An efficient learner is needed to find the optimal response over these latent variables. The resulting objective is not convex and cannot be solved by traditional convex solvers.

Our learner (2): The objective decomposes into a convex part and a concave part. This problem can be solved by the concave-convex procedure (CCCP), which converts it into a sequence of convex programs. Ref: Yuille and Rangarajan, The concave-convex procedure, 2003
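To make the idea of CCCP concrete, here is a toy one-dimensional sketch. It is not the STAOG objective: the function f(x) = x^4 - 2x^2 is chosen here only because it splits into a convex part x^4 and a concave part -2x^2, and each convex subproblem then has a closed-form solution. At every step the concave part is replaced by its tangent at the current point and the resulting convex function is minimised.

```python
import math
from typing import List


def objective(x: float) -> float:
    """Toy non-convex objective: convex part x**4 plus concave part -2*x**2."""
    return x ** 4 - 2.0 * x ** 2


def cccp_step(x_t: float) -> float:
    """One CCCP update.

    Linearise the concave part at x_t: -2*x**2 is replaced by its tangent,
    whose slope is -4*x_t. The convex subproblem is then
    min_x x**4 - 4*x_t*x, solved by 4*x**3 = 4*x_t, i.e. x = cbrt(x_t).
    """
    return math.copysign(abs(x_t) ** (1.0 / 3.0), x_t)


def cccp(x0: float, iters: int = 50) -> List[float]:
    """Run CCCP from x0 and return the full list of iterates."""
    xs = [x0]
    for _ in range(iters):
        xs.append(cccp_step(xs[-1]))
    return xs


iterates = cccp(0.5)
print(iterates[-1], objective(iterates[-1]))
```

Starting from 0.5 the iterates approach the local minimiser x = 1, and the objective never increases along the way. Starting from a negative point leads to x = -1 instead, which illustrates that CCCP only reaches a stationary point that depends on initialisation.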

Learning the structure: In addition to inferring latent variables, we also update the model structure by clustering and greedy search. The whole learner works iteratively, like the EM algorithm, and converges to a local minimum rather than a global one. It took our learner 150-200 seconds to infer one video from the Olympic Sports dataset and the UCF YouTube dataset.

Discriminative frames and body parts, Olympic Sports dataset (High Jump): Frame #26, Frame #74, Frame #103, Frame #121, Frame #152

Discriminative frames and body parts, YouTube dataset (Swing): Frame #49, Frame #97, Frame #132, Frame #369, Frame #114, Frame #236

Average Precision on the Olympic Sports dataset

Accuracy on the YouTube dataset

Conclusion: The And-Or graph is a powerful representation that can be generalized to model large variance and diversity in the video domain. We develop efficient algorithms to learn complex spatio-temporal And-Or graphs. Our future work is to apply And-Or graphs to large-scale problems.

Thank you! Gracias! 谢谢！

Backup slides

Experimental results: the effectiveness of the or-nodes in spatial compositions. Empirical analysis for different settings of spatial compositions, where we set different maximum numbers m of leaf-nodes under the or-nodes. Each bar represents the accuracy for one action category. The color indicates the results with different settings, m = 2 or m = 4.