Real-time Action Recognition by Spatiotemporal Semantic and Structural Forest Tsz-Ho Yu, Tae-Kyun Kim and Roberto Cipolla Machine Intelligence Laboratory,

Presentation transcript:

Real-time Action Recognition by Spatiotemporal Semantic and Structural Forest Tsz-Ho Yu, Tae-Kyun Kim and Roberto Cipolla Machine Intelligence Laboratory, Engineering Department, University of Cambridge

Introduction and Motivations
A novel real-time solution for action recognition that uses both local appearance and structural information.
Main objective: efficiency
Main features / major contributions:
- High run-time performance
- Local appearance + structural information
- Short response time
- Real-time feature extraction and classification
- Continuous / frame-by-frame recognition
- Pyramidal spatiotemporal relationship match (PSRM)

A short demo Please visit: “ on the Internet for the full demo video.“

Related Work
Many current methods focus on:
- Accuracy
- Action representation model (feature design)
[Schuldt et al. ICPR2004, Niebles et al. BMVC06, Ryoo and Aggarwal ICCV09, Willems BMVC09, Riemenschneider et al. BMVC09]
Typical components: the “bag of words” model, sophisticated spatiotemporal features, a learned classifier, a k-means codebook.
Some achieve high accuracies, but take a long time to recognise.
How can we improve efficiency? Can we improve codebook learning and feature matching?

Related Work
Vector quantisation by random forest [Moosmann et al. NIPS06]
- For image segmentation [Shotton et al. CVPR08]
- Can we apply it in video analysis?
Pyramid match kernel [Grauman and Darrell ICCV05]
- Image recognition [Grauman and Darrell ICCV05], scene classification [Lazebnik et al. CVPR06], etc.
Spatiotemporal relationship match [Ryoo and Aggarwal ICCV09]
References:
S. Lazebnik, C. Schmid and J. Ponce. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories” CVPR 2006
K. Grauman and T. Darrell. “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features” ICCV 2005
F. Moosmann, B. Triggs and F. Jurie. “Fast discriminative visual codebooks using randomized clustering forests” NIPS 2006
J. Shotton, M. Johnson and R. Cipolla. “Semantic texton forests for image categorization and segmentation” CVPR 2008
M. S. Ryoo and J. K. Aggarwal. “Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities” ICCV 2009
Figure credits: Grauman and Darrell ICCV05; Moosmann et al. NIPS06; Ryoo and Aggarwal ICCV09

Our Contributions
Our contribution is three-fold:
- Efficient codebook learning
- High run-time performance
- Local appearance + structural information

Comparison with existing approaches
Typical approaches:
- Feature encoding: k-means clustering (slow for large codebooks)
- Feature matching: the “bag of words” (BOW) model (lacks structural information, quantisation error)
Our method:
- Feature encoding: semantic texton forest (efficient)
- Feature matching: PSRM (structural information, hierarchical matching, robust)

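Overview
Pipeline: feature detection → feature extraction → feature matching → classification → results
- Feature detection: V-FAST corners
- Feature extraction: spatiotemporal cuboids, spatiotemporal semantic texton forest
- Feature matching: PSRM and BOST
- Classification: k-means forest and random forest classifier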

Feature detection
[Pipeline overview diagram repeated; current stage: feature detection]

V-FAST: Spatiotemporal Feature Detection
- A novel spatiotemporal interest point detector
- Inspired by FAST [Rosten and Drummond ECCV2006]
- A cascade of three FAST detectors
- Considers three orthogonal Bresenham circles
- Features: very fast!
E. Rosten and T. Drummond. “Machine learning for high-speed corner detection” ECCV 2006
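The idea of testing three orthogonal Bresenham circles can be sketched as follows. This is only an illustration under stated assumptions, not the authors' formulation: the function names, the threshold of 20, the run length n = 9, and the rule that all three planes (XY, XT, YT) must pass are hypothetical choices, and the plain FAST segment test is used without the learned decision tree.

```python
import numpy as np

# Offsets of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]


def has_contiguous_run(flags, n):
    """True if the circular boolean sequence contains n consecutive True values."""
    doubled = list(flags) + list(flags)  # unroll once to handle wrap-around
    run = 0
    for flag in doubled:
        run = run + 1 if flag else 0
        if run >= n:
            return True
    return False


def segment_test(ring, centre, threshold, n):
    """FAST segment test: n contiguous ring samples all brighter or all darker."""
    ring = np.asarray(ring, dtype=float)
    brighter = ring > centre + threshold
    darker = ring < centre - threshold
    return has_contiguous_run(brighter, n) or has_contiguous_run(darker, n)


def ring_samples(video, t, y, x, plane):
    """Sample the circle around (t, y, x) in one of the three orthogonal planes."""
    if plane == "xy":
        return [video[t, y + a, x + b] for a, b in CIRCLE]
    if plane == "xt":
        return [video[t + a, y, x + b] for a, b in CIRCLE]
    return [video[t + a, y + b, x] for a, b in CIRCLE]  # "yt"


def v_fast_point(video, t, y, x, threshold=20.0, n=9):
    """video: (T, H, W) greyscale array; (t, y, x) must be >= 3 samples from each border.

    Assumption: a point is kept only if the segment test passes in all three planes.
    The generator inside all() stops at the first failing plane, giving a cascade.
    """
    centre = float(video[t, y, x])
    return all(
        segment_test(ring_samples(video, t, y, x, plane), centre, threshold, n)
        for plane in ("xy", "xt", "yt")
    )
```

With this rule, a bright spot that stays fixed over time passes the XY test but fails the XT test, so only points that are salient in both space and time survive.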

Feature extraction
[Pipeline overview diagram repeated; current stage: feature extraction]

Building a codebook using STF
Extract small video cuboids at detected keypoints.
Visual codebook using STF:
- Efficient visual codebook
- Random forest based codebook
- Works on pixels directly
- Hierarchical splits: “textonises” patches recursively
- One feature → multiple codewords
- Quantisation and partial matching
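To make "one feature → multiple codewords" concrete, here is a minimal sketch of a forest used as a codebook. It assumes untrained trees with random pixel-difference splits; the class name `RandomCuboidTree`, the split form, and all parameters are hypothetical and stand in for the trained STF described on the slide.

```python
import numpy as np


class RandomCuboidTree:
    """Complete binary tree whose nodes compare two voxels of a cuboid.

    Assumption: splits are drawn at random rather than learned, purely to
    illustrate how a tree quantises raw pixels into a leaf (codeword).
    """

    def __init__(self, cuboid_size, depth, rng):
        self.depth = depth
        n_internal = 2 ** depth - 1
        self.p1 = rng.integers(0, cuboid_size, size=n_internal)  # first voxel index
        self.p2 = rng.integers(0, cuboid_size, size=n_internal)  # second voxel index
        self.tau = rng.normal(0.0, 10.0, size=n_internal)        # split threshold

    def leaf_index(self, cuboid):
        voxels = np.asarray(cuboid, dtype=float).ravel()
        node = 0  # heap indexing: children of node i are 2i+1 and 2i+2
        for _ in range(self.depth):
            go_right = voxels[self.p1[node]] - voxels[self.p2[node]] > self.tau[node]
            node = 2 * node + 1 + int(go_right)
        return node - (2 ** self.depth - 1)  # leaf id in [0, 2**depth)


def encode(forest, cuboid):
    """One cuboid yields one codeword (leaf id) per tree in the forest."""
    return [tree.leaf_index(cuboid) for tree in forest]


rng = np.random.default_rng(0)
forest = [RandomCuboidTree(cuboid_size=27, depth=4, rng=rng) for _ in range(3)]
cuboid = rng.random((3, 3, 3)) * 255
codewords = encode(forest, cuboid)  # three leaf ids, one per tree
```

Because each tree only performs `depth` comparisons on raw voxels, encoding cost grows with tree depth rather than with codebook size, which is the efficiency argument made against k-means codebooks.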

Feature matching
[Pipeline overview diagram repeated; current stage: feature matching]

Pyramidal Spatiotemporal Relationship Match (PSRM)
- Old method: SRM [Ryoo and Aggarwal ICCV09]
- PSRM: a multi-codebook, multi-resolution version of SRM
- Natural combination: local appearance + action structure
- Each pair of codewords is evaluated using a set of association rules.
- The set of “rules” (shown in different colours) is designed to describe the spatiotemporal structure of features.
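Evaluating every pair of codewords against a set of association rules can be sketched as below. The slides do not list the actual rules, so the two rules here (`before`, `near`), the feature dictionary layout, and the distance threshold are hypothetical placeholders chosen only to show the bookkeeping.

```python
import numpy as np
from itertools import permutations


def before(f, g):
    """Hypothetical temporal rule: feature f occurs earlier than feature g."""
    return f["t"] < g["t"]


def near(f, g, max_dist=10.0):
    """Hypothetical spatial rule: f and g are closer than max_dist pixels."""
    return np.hypot(f["x"] - g["x"], f["y"] - g["y"]) < max_dist


def relationship_histogram(features, n_codewords, rules):
    """Count, per rule, how often an ordered codeword pair (i, j) satisfies it."""
    hist = np.zeros((len(rules), n_codewords, n_codewords))
    for f, g in permutations(features, 2):  # all ordered pairs of distinct features
        for r, rule in enumerate(rules):
            if rule(f, g):
                hist[r, f["code"], g["code"]] += 1
    return hist


features = [
    {"code": 0, "t": 0, "y": 0, "x": 0},
    {"code": 1, "t": 5, "y": 1, "x": 1},
    {"code": 1, "t": 9, "y": 40, "x": 40},
]
hist = relationship_histogram(features, n_codewords=2, rules=[before, near])
```

Each rule produces its own codeword-pair histogram, so two videos can later be compared rule by rule.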

Pyramidal Spatiotemporal Relationship Match (PSRM)
[Figure; recoverable label: Tree N]

Pyramid match kernel:
- Typical pyramid match kernel: merges adjacent spatial bins
- Our pyramid match kernel: merges adjacent nodes instead; it is applied semantically, not spatially
- Assumption: neighbouring codewords are similar
- Applied to each “association rule”
- Applied to each tree in the STF
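Merging adjacent tree nodes rather than adjacent spatial bins can be illustrated with the standard pyramid match kernel of Grauman and Darrell applied to a histogram over the leaves of a complete binary tree. This is a sketch under assumptions: the function name `semantic_pmk`, the 1/2^level weights, and the requirement that sibling leaves sit next to each other in the histogram are illustrative choices, not the slide's exact formulation.

```python
import numpy as np


def semantic_pmk(hist_a, hist_b):
    """Pyramid match over a binary tree hierarchy.

    hist_a, hist_b: histograms over the leaves of a complete binary tree
    (length must be a power of two, sibling leaves adjacent). At each coarser
    level sibling bins are merged, and only matches that are new at that level
    are counted, weighted by 1 / 2**level.
    """
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    score, previous, weight = 0.0, 0.0, 1.0
    while True:
        intersection = np.minimum(a, b).sum()  # histogram intersection at this level
        score += weight * (intersection - previous)
        previous = intersection
        if a.size == 1:
            break
        a = a.reshape(-1, 2).sum(axis=1)  # merge sibling nodes into their parent
        b = b.reshape(-1, 2).sum(axis=1)
        weight /= 2.0
    return score


same_leaf = semantic_pmk([1, 0, 0, 0], [1, 0, 0, 0])      # matched at the finest level
sibling_leaf = semantic_pmk([1, 0, 0, 0], [0, 1, 0, 0])   # matched one level up
distant_leaf = semantic_pmk([1, 0, 0, 0], [0, 0, 0, 1])   # matched only at the root
```

Features that fall into sibling leaves still contribute a partial match, which reflects the stated assumption that neighbouring codewords are similar.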

Pyramidal Spatiotemporal Relationship Match (PSRM)
[Figure: multiple structural relationship histograms, compared using the pyramid match kernel (PMK)]

Continuous action recognition
[Figure comparing “Typical methods” and “Our approach”: repeated “Features” and “Classification” stages]

Classification
[Pipeline overview diagram repeated; current stage: classification]

Combined Classification
PSRM and BOST (bag of spatiotemporal textons) are classified independently.
PSRM: k-means forest
- Originally used for NN approximation
- Uses PSRM as the matching kernel
- Combined with the BOST model for final results
M. Muja and D. G. Lowe. “Fast approximate nearest neighbors with automatic algorithm configuration” VISAPP 2009
K-means tree figure courtesy of David Aldavert Miró
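Combining two independently trained classifiers can be sketched as a weighted sum of their per-class scores. The slides do not give the weighting scheme, so the function name `combine_scores`, the mixing weight `alpha`, and the assumption that both classifiers output comparable per-class scores are all hypothetical.

```python
import numpy as np


def combine_scores(psrm_scores, bost_scores, alpha=0.5):
    """Weighted combination of two per-class score vectors.

    alpha is a hypothetical mixing weight: 1.0 trusts only the PSRM classifier,
    0.0 trusts only the BOST classifier.
    """
    combined = alpha * np.asarray(psrm_scores, dtype=float) \
        + (1.0 - alpha) * np.asarray(bost_scores, dtype=float)
    return int(np.argmax(combined)), combined


label, combined = combine_scores([0.5, 0.4, 0.1], [0.2, 0.7, 0.1], alpha=0.5)
```

In this toy example the two classifiers disagree on the top class, and the combined score selects the class on which they jointly place the most weight.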

Experiments
Short video sequences (50 frames, about 2 seconds) are extracted from the input video.
A new subsequence is sampled every 5 frames in the experiments and every frame in the laptop demo (so the demo is a frame-by-frame recognition).
Two datasets are used for performance evaluation:
- KTH dataset: the standard benchmark. Six classes, with viewpoint changes, illumination changes, zoom, etc.
- UT interaction dataset (for the ICPR 2010 contest on Semantic Description of Human Activities): human interactions, 6 classes of actions, cluttered background.
Hardware specifications:
- Intel Core i7 920 (for accuracy and speed tests)
- Core 2 Duo P9400 (for laptop demo)
[Figures: KTH dataset, UT interaction dataset]
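The subsequence sampling described above amounts to a sliding window over the frame index. A minimal sketch, assuming 0-based frame indices and a hypothetical helper name `subsequence_starts`:

```python
def subsequence_starts(n_frames, length=50, stride=5):
    """Start indices of all full-length subsequences in a video of n_frames frames.

    length=50 and stride=5 follow the experiment setting; stride=1 corresponds
    to the frame-by-frame laptop demo.
    """
    return list(range(0, n_frames - length + 1, stride))


experiment_windows = subsequence_starts(100)        # stride of 5 frames
demo_windows = subsequence_starts(52, stride=1)     # a new window at every frame
```

A video shorter than the window length yields no subsequences, so very short clips would need padding or a shorter window.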

Experiments: Results (KTH dataset)
- Comparable to most state-of-the-art methods.
- Around 3% behind the top performer. Is it a sensible trade-off?
- Useful for many more practical applications (surveillance, robotics, etc.).
Notes on the results table:
- snippet: subsequence-level recognition
- sequence: majority voting of subsequence labels
- Evaluation protocol: leave-one-out cross-validation
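Sequence-level recognition by majority voting over subsequence labels can be sketched in a few lines. The function name `sequence_label` and the example labels are illustrative; ties are broken in favour of the label seen first, which is an assumption rather than something the slides specify.

```python
from collections import Counter


def sequence_label(snippet_labels):
    """Return the most frequent subsequence label for a whole sequence."""
    return Counter(snippet_labels).most_common(1)[0][0]


label = sequence_label(["walking", "jogging", "walking", "walking", "running"])
```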

Experiments: Results
Results: UT interaction dataset
- PSRM and BOST gave low accuracies when applied separately.
- Accuracy improved by ~20% simply by combining the class labels!
Run-time performance
- < 25 fps, but enough for most real-time applications
- Can be further optimised (e.g. GPU, multi-core processing)

Demo video
Frame-level recognition
Potential improvement: delay (~1 s) in recognition results (depends on the subsequence length)
Please visit: “ on the Internet for the full demo video.“

Conclusions

THE END THANK YOU VERY MUCH

Extra slide Formulation of V-FAST

Extra slide Formulation of STF Split function model: Split criteria --- Information gain:
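The equations for this slide did not survive in the transcript. For orientation only, a common textbook form of a threshold split and of the information-gain criterion used to train decision forests is given below; the symbols (f, τ, I, I_L, I_R, E) are assumptions and may differ from the authors' notation.

```latex
% Split function at a node: threshold test on a feature f of the cuboid x
\[
  x \mapsto
  \begin{cases}
    \text{left child}  & \text{if } f(x) < \tau \\
    \text{right child} & \text{otherwise}
  \end{cases}
\]

% Information gain of splitting the training set I into I_L and I_R,
% with E the Shannon entropy of the class distribution
\[
  \Delta E = E(I) - \frac{|I_L|}{|I|}\,E(I_L) - \frac{|I_R|}{|I|}\,E(I_R),
  \qquad
  E(I) = -\sum_{c} p(c \mid I)\,\log p(c \mid I)
\]
```

Under this criterion, the candidate split with the largest gain is kept at each node.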

Extra slide Formulation of STF

Extra slide Formulation of PSRM Step 1 Feature matching: Step 2 Semantic PMK over histogram

Extra slide Formulation of Classifier training Optimising the clusters of feature which maximise the PMK with the mean.

Extra slide Experiment parameters

Extra slide Confusion matrix:

Extra slide
[Diagram: PSRM → kernel k-means forest; BOST → random forest; weighted combination → action recognition results (class labels)]