Real-time Action Recognition by Spatiotemporal Semantic and Structural Forest Tsz-Ho Yu, Tae-Kyun Kim and Roberto Cipolla Machine Intelligence Laboratory, Engineering Department, University of Cambridge

Introduction and Motivations. A novel real-time solution for action recognition that uses local-appearance and structural information. Main objective: efficiency, meaning high run-time performance and a short response time. Main features / major contributions: real-time feature extraction and classification; continuous, frame-by-frame recognition; local appearance + structural information; pyramidal spatiotemporal relationship match (PSRM).

A short demo. Please visit http://www.youtube.com/watch?v=eD5b8d7hV6E for the full demo video.

Related Work. Many current methods focus on accuracy and on the action representation model (feature design) [Schuldt et al. ICPR2004, Niebles et al. BMVC06, Ryoo and Aggarwal ICCV09, Willems BMVC09, Riemenschneider et al. BMVC09]. Typical components: the “bag of words” model, sophisticated spatiotemporal features, a K-means codebook and a learned classifier. Some achieve high accuracies, but take a long time to recognise. How can we improve efficiency? Can we improve codebook learning and feature matching?

Related Work. Vector quantisation by random forest [Moosmann et al. NIPS06], also used for image segmentation [Shotton et al. CVPR08]. Can we apply it in video analysis? Pyramid match kernel [Grauman and Darrell ICCV05]: used for image recognition [Grauman and Darrell ICCV05], scene classification [Lazebnik et al. CVPR06], etc. Spatiotemporal relationship match [Ryoo and Aggarwal ICCV09].
References:
S. Lazebnik, C. Schmid and J. Ponce. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories”, CVPR 2006.
K. Grauman and T. Darrell. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features”, ICCV 2005.
F. Moosmann, B. Triggs and F. Jurie. “Fast discriminative visual codebooks using randomized clustering forests”, NIPS 2006.
J. Shotton, M. Johnson and R. Cipolla. “Semantic texton forests for image categorization and segmentation”, CVPR 2008.
M. S. Ryoo and J. K. Aggarwal. “Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities”, ICCV 2009.

Our Contributions Our contribution is three-fold: Efficient codebook learning High run-time performance Local appearance + structural information

Comparison with existing approaches. Feature encoding: typical approaches use K-means clustering, which is slow for a large codebook; our method uses the Semantic Texton Forest, which is efficient. Feature matching: typical approaches use the “Bag of Words” (BOW) model, which lacks structural information and suffers from quantisation error; our method uses PSRM, which adds structural information and hierarchical matching and is robust.

Overview. Feature detection: V-FAST corner. Feature extraction: spatiotemporal cuboids, Spatiotemporal Semantic Texton Forest. Feature matching: PSRM and BOST. Classification: K-means forest and random forest classifier. Results.

Feature detection [Overview diagram repeated, with the feature detection stage (V-FAST corner) highlighted]

V-FAST: Spatiotemporal Feature Detection. A novel spatiotemporal interest point detector, inspired by FAST [Rosten and Drummond ECCV2006]. It is a cascade of three FAST detectors that consider three orthogonal Bresenham circles. Features: very fast! Reference: E. Rosten and T. Drummond. “Machine learning for high-speed corner detection”, ECCV 2006.
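The idea of running the FAST segment test on three orthogonal circles can be sketched as below. This is only an illustration under stated assumptions: the slide does not give the decision rule, so the sketch assumes a point is accepted when the test passes on the XY plane and on at least one of the XT or YT planes. The function names, the threshold of 20 and the run length n = 9 are hypothetical choices, not values from the presentation.

```python
# Radius-3 Bresenham circle (16 offsets), the ring used by the FAST detector.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def segment_test(ring, centre, threshold, n=9):
    """True if n contiguous ring values are all brighter or all darker than centre."""
    for sign in (1, -1):  # +1 checks "brighter", -1 checks "darker"
        flags = [sign * (v - centre) > threshold for v in ring]
        run = 0
        for flag in flags + flags:  # ring doubled so runs can wrap around
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


def is_vfast_point(video, t, y, x, threshold=20, n=9):
    """video[t][y][x] holds grey values; (t, y, x) must be >= 3 from every border."""
    centre = video[t][y][x]
    ring_xy = [video[t][y + b][x + a] for a, b in CIRCLE]  # spatial plane
    ring_xt = [video[t + b][y][x + a] for a, b in CIRCLE]  # x-time plane
    ring_yt = [video[t + b][y + a][x] for a, b in CIRCLE]  # y-time plane
    spatial = segment_test(ring_xy, centre, threshold, n)
    temporal = (segment_test(ring_xt, centre, threshold, n)
                or segment_test(ring_yt, centre, threshold, n))
    # Assumed combination rule: salient in space AND in time.
    return spatial and temporal
```

With this rule, a bright spot that stays still across frames is rejected, because the circles on the two temporal planes pass through the same bright spot in earlier and later frames and break the contiguous run.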

Feature extraction [Overview diagram repeated, with the feature extraction stage (spatiotemporal cuboids, Spatiotemporal Semantic Texton Forest) highlighted]

Building a codebook using STF. Extract small video cuboids at detected keypoints. Visual codebook using STF: a random-forest-based codebook that works on pixels directly and “textonises” patches recursively through hierarchical splits. It is an efficient visual codebook: one feature maps to multiple codewords, which supports quantisation and partial matching.
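A minimal sketch of how a forest can act as a codebook, so that one cuboid yields one codeword per tree. The trees here are generated at random purely to show the lookup; the real STF is trained (the extra slides mention information gain), and the pixel-difference split test, the parameter ranges and all function names below are assumptions for illustration.

```python
import random


def build_random_tree(depth, cuboid_len, rng):
    """Complete binary tree as a list of internal nodes; node i has children 2i+1, 2i+2."""
    n_internal = 2 ** depth - 1
    # Each node stores two voxel positions and a threshold (hypothetical split model).
    return [(rng.randrange(cuboid_len), rng.randrange(cuboid_len), rng.randint(-50, 50))
            for _ in range(n_internal)]


def leaf_index(tree, cuboid):
    """Route a flattened cuboid of grey values down one tree and return its leaf id."""
    node = 0
    while node < len(tree):  # indices >= len(tree) are leaves
        p, q, tau = tree[node]
        go_right = cuboid[p] - cuboid[q] > tau  # works on raw pixel values directly
        node = 2 * node + (2 if go_right else 1)
    return node - len(tree)  # leaves numbered 0 .. 2**depth - 1, left to right


def encode(forest, cuboid):
    """One feature -> multiple codewords: one leaf id per tree."""
    return [leaf_index(tree, cuboid) for tree in forest]


rng = random.Random(0)
forest = [build_random_tree(depth=3, cuboid_len=27, rng=rng) for _ in range(4)]
cuboid = [rng.randrange(256) for _ in range(27)]  # e.g. a flattened 3x3x3 cuboid
codewords = encode(forest, cuboid)
```

Encoding costs one pixel comparison per tree level, which is why a forest codebook stays fast as the number of codewords grows, in contrast to a nearest-centre search over a large K-means codebook.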

Feature matching [Overview diagram repeated, with the feature matching stage (PSRM and BOST) highlighted]

Pyramidal Spatiotemporal Relationship Match (PSRM). Old method: SRM [Ryoo and Aggarwal ICCV09]. PSRM is a multi-codebook, multi-resolution version of SRM. It is a natural combination of local appearance and action structure. Each pair of codewords is evaluated using a set of association rules. The set of “rules” (shown in different colours) is designed to describe the spatiotemporal structure of the features.
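Evaluating every pair of codewords against a set of association rules can be sketched as a histogram indexed by (codeword, codeword, rule). The slide does not list the rules, so the two rules below ("before" and "near"), the distance bound of 10 and the feature layout are hypothetical stand-ins chosen only to make the mechanism concrete.

```python
from collections import Counter
from itertools import permutations

# Hypothetical association rules; each takes an ordered pair of features.
RULES = {
    "before": lambda a, b: a["t"] < b["t"],
    "near": lambda a, b: abs(a["x"] - b["x"]) + abs(a["y"] - b["y"]) <= 10,
}


def relationship_histogram(features):
    """Count, for every ordered feature pair, which rules hold for their codewords."""
    hist = Counter()
    for a, b in permutations(features, 2):
        for name, rule in RULES.items():
            if rule(a, b):
                hist[(a["word"], b["word"], name)] += 1
    return hist


features = [
    {"x": 0, "y": 0, "t": 0, "word": 3},
    {"x": 5, "y": 0, "t": 4, "word": 7},
    {"x": 100, "y": 0, "t": 9, "word": 3},
]
hist = relationship_histogram(features)
```

In this toy input, codeword 3 occurs before codeword 7 once and near it once, so both `(3, 7, "before")` and `(3, 7, "near")` receive a count of 1.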

Pyramidal Spatiotemporal Relationship Match (PSRM) [Figure: per-tree matching, labelled up to Tree N]

Pyramid match kernel. It is applied to every association rule and to each tree in the STF. We apply it semantically rather than spatially. Assumption: neighbouring codewords are similar. We merge adjacent tree nodes instead of merging adjacent spatial bins. [Figure: typical pyramid match kernel compared with our pyramid match kernel]
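Merging adjacent tree nodes instead of spatial bins can be sketched as follows, assuming histograms are laid out over the leaves of a complete binary tree in left-to-right order, so that siblings sit at neighbouring indices. The weighting (new matches at level i count 1/2^i) follows the standard pyramid match kernel of Grauman and Darrell; whether the authors use exactly these weights is an assumption, and the function names are hypothetical.

```python
def intersection(h1, h2):
    """Histogram intersection: number of matches at the current resolution."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def merge_siblings(h):
    """Coarsen by one tree level: each parent bin is the sum of its two children."""
    return [h[i] + h[i + 1] for i in range(0, len(h), 2)]


def tree_pyramid_match(h1, h2):
    """Pyramid match over a tree hierarchy; histogram length must be a power of two."""
    score, previous, level = 0.0, 0, 0
    while True:
        current = intersection(h1, h2)
        # Only matches that first appear at this level count, weighted by 1 / 2**level.
        score += (current - previous) / 2 ** level
        if len(h1) == 1:  # reached the root
            return score
        h1, h2 = merge_siblings(h1), merge_siblings(h2)
        previous, level = current, level + 1


score = tree_pyramid_match([2, 0, 1, 1], [1, 1, 0, 2])
```

In this example two matches are found at the leaf level and two more after one merge, giving 2 + 2/2 = 3.0; features that land in sibling leaves still contribute, just with a lower weight, which is how the assumption that neighbouring codewords are similar is used.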

Pyramidal Spatiotemporal Relationship Match (PSRM) [Figure: multiple structural relationship histograms are compared with the pyramid match kernel (PMK)]

Continuous action recognition [Figure: typical methods compared with our approach; panels labelled Features and Classification]

Classification [Overview diagram repeated, with the classification stage (K-means forest, random forest classifier) highlighted]

Combined Classification. PSRM and BOST (bag of spatiotemporal textons) are classified independently. PSRM: k-means forest, originally used for nearest-neighbour approximation, here with PSRM as the matching kernel. BOST: random forest classifier. The PSRM result is combined with the BOST model for the final results. Reference: M. Muja and D. G. Lowe. “Fast approximate nearest neighbors with automatic algorithm configuration”, VISAPP 2009. K-means tree figure courtesy of David Aldavert Miró: http://www.cvc.uab.cat/~aldavert/plor/
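One simple way to combine two independently produced sets of class scores is a weighted sum followed by an argmax. The extra slide at the end of the deck labels the fusion step "weighted combination" but gives no formula, so the linear rule, the weight alpha and the example scores below are all assumptions.

```python
def combine(psrm_scores, bost_scores, alpha=0.5):
    """Fuse per-class scores from two classifiers; alpha weights the PSRM branch."""
    classes = sorted(psrm_scores.keys() & bost_scores.keys())
    fused = {c: alpha * psrm_scores[c] + (1 - alpha) * bost_scores[c] for c in classes}
    best = max(classes, key=lambda c: fused[c])
    return best, fused


# Hypothetical scores for one video snippet.
psrm = {"walk": 0.5, "run": 0.4, "box": 0.1}
bost = {"walk": 0.2, "run": 0.6, "box": 0.2}
label, fused = combine(psrm, bost)
```

Here the PSRM branch alone would pick "walk", but the fused scores favour "run", which illustrates how two weak individual decisions can be corrected by combining them.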

Experiments. Short video sequences (50 frames, about 2 seconds) are extracted from the input video. A new subsequence is sampled every 5 frames in the experiments and every frame in the laptop demo (so the demo performs frame-by-frame recognition). Two datasets are used for performance evaluation. KTH dataset: the standard benchmark; six classes, with viewpoint changes, illumination changes, zoom, etc. UT interaction dataset (for the ICPR 2010 contest on Semantic Description of Human Activities): human interactions, 6 classes of actions, cluttered background. Hardware specifications: Intel Core i7 920 (for accuracy and speed tests); Core 2 Duo P9400 (for the laptop demo).
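The subsequence sampling described above amounts to a sliding window over the frame indices. A minimal sketch, using the 50-frame length and the 5-frame and 1-frame steps stated on the slide; the function name is hypothetical.

```python
def subsequence_starts(num_frames, length=50, step=5):
    """Start indices of all full-length subsequences in a video of num_frames frames."""
    return list(range(0, num_frames - length + 1, step))


experiment_windows = subsequence_starts(60)        # step of 5 frames
demo_windows = subsequence_starts(52, step=1)      # frame-by-frame, as in the demo
```

A video shorter than one subsequence yields no windows, so at 50 frames per window the first label can only appear after roughly 2 seconds of input.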

Experiments: Results (KTH dataset). Evaluated with leave-one-out cross-validation. Snippet: subsequence-level recognition. Sequence: majority voting of subsequence labels. Comparable to most state-of-the-art methods; about 3% lower than the top performer. Is it a sensible trade-off? The method is useful for many more practical applications (surveillance, robotics, etc.).
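Turning snippet-level labels into one sequence-level label by majority voting can be sketched in a few lines. The function name and the example labels are hypothetical; ties are resolved in favour of the label that appears first, which is a choice of this sketch rather than something the slide specifies.

```python
from collections import Counter


def sequence_label(snippet_labels):
    """Return the most frequent snippet label; ties go to the earliest-seen label."""
    return Counter(snippet_labels).most_common(1)[0][0]


label = sequence_label(["walk", "walk", "jog", "walk", "run"])
```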

Experiments: Results. Results on the UT interaction dataset: PSRM and BOST gave low accuracies when applied separately; performance improved by about 20% simply by combining the class labels. Run-time performance: < 25 fps, but enough for most real-time applications. Can be further optimised (e.g. GPU, multi-core processing).

Demo video. Frame-level recognition. Potential improvement: the delay (~1 s) in recognition results, which depends on the subsequence length. Please visit http://www.youtube.com/watch?v=eD5b8d7hV6E for the full demo video.

Conclusions

THE END THANK YOU VERY MUCH

Extra slide Formulation of V-FAST

Extra slide Formulation of STF Split function model: Split criteria --- Information gain:

Extra slide Formulation of STF

Extra slide Formulation of PSRM Step 1 Feature matching: Step 2 Semantic PMK over histogram

Extra slide Formulation of Classifier training Optimising the clusters of features that maximise the PMK with the mean.

Extra slide Experiment parameters

Extra slide Confusion matrix:

Extra slide [Figure: PSRM is classified by a kernel k-means forest and BOST by a random forest; a weighted combination of the two gives the action recognition results (class labels)]
