Human Action Recognition


Human Action Recognition Avner Atias December 2011

Problem Description and Applications Recognition and classification of human actions in image sequences (video), using the temporal structure of video to associate sets of images with an action. Applications: real-time surveillance, recognition of specific actions, messages and commands, and many more…

2 Main Approaches Top-down – detect the human body, then extract geometrical features. Bottom-up – extract low-level features, then classify them into an action category.

Solution Approaches Three main approaches will be discussed: Hidden Markov Models (HMMs) – Junji Yamato, Jun Ohya and Kenichiro Ishii, “Recognizing human action in time-sequential images using HMM”, 1992. Shape-motion prototype trees – Zhe Lin, Zhuolin Jiang and Larry S. Davis, “Recognizing actions by shape-motion prototype trees”, 2009. Spatiotemporal graphs – William Brendel and Sinisa Todorovic, “Learning spatiotemporal graphs of human activities”, 2011.

First Approach Recognizing Human Action in Time-Sequential Images Using HMMs Junji Yamato, Jun Ohya, Kenichiro Ishii, 1992

First Approach – General Principles Utilizes HMMs to classify a set of images as a human action. Bottom-up approach. Learning – an HMM is trained for each action (category). Recognition – the forward variable selects the most probable category. Action primitives (codewords) serve as the observation symbols.

First Approach – What Are HMMs? An exemplary problem illustrating the “hidden” part of an HMM, taken from Rabiner’s tutorial on HMMs (link in references).

First Approach – What Are HMMs? (Cont.) Model notation: A – the state transition matrix. B – the symbol output (emission) probabilities. π – the initial state probability distribution. O – the sequence of observations. Together these define a complete HMM, λ = (A, B, π). http://en.wikipedia.org/wiki/Hidden_Markov_model

First Approach – What Are HMMs? (Cont.) An HMM lets us answer each of the following three questions: Evaluation – given the observation sequence O and the model λ, how can we efficiently compute P(O | λ)? Decoding – what is the most likely state sequence? (Viterbi algorithm) Learning – how do we adjust the model parameters to maximize P(O | λ)? (Baum-Welch algorithm)

First Approach – Forward Variable In our case we have several HMMs, one per action category, and must determine which of them is the most probable. The forward variable α is calculated recursively (Rabiner’s notation): α_1(i) = π_i · b_i(O_1); α_{t+1}(j) = [Σ_i α_t(i) · a_ij] · b_j(O_{t+1}); finally P(O | λ) = Σ_i α_T(i). The category whose model maximizes P(O | λ) is chosen.
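As a minimal sketch of this recursion (assuming NumPy arrays; the function and variable names are illustrative, not from the paper), the forward computation and the pick-the-best-model step look like:

```python
import numpy as np

def forward_prob(A, B, pi, obs):
    """Evaluate P(O | lambda) with the forward algorithm.

    A  : (N, N) state transition matrix, A[i, j] = P(i -> j)
    B  : (N, M) symbol emission matrix, B[i, k] = P(symbol k | state i)
    pi : (N,)   initial state distribution
    obs: sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]            # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: alpha_{t+1}(j)
    return alpha.sum()                   # termination: sum over final states

def classify(models, obs):
    """Pick the action category whose HMM gives the highest likelihood.
    models: dict mapping category name -> (A, B, pi)."""
    return max(models, key=lambda name: forward_prob(*models[name], obs))
```

In practice long sequences need log-space or scaled arithmetic to avoid underflow; this sketch omits that for clarity.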

First Approach – Mesh Features Low-level features are extracted from the human figure: the image is binarized (silhouette vs. background) and divided into a mesh of cells, and the feature vector collects one value per cell.

First Approach – Mesh Features (Cont.) Calculating the feature vector: each element is the ratio of foreground (black) pixels within its mesh cell. The feature vectors are clustered into 72 primitives (12 for each of 6 categories).
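A sketch of the mesh-feature computation as described above (the grid size is a hypothetical parameter; the paper's exact mesh dimensions are not given here):

```python
import numpy as np

def mesh_feature(binary_img, rows=8, cols=8):
    """Divide a binarized silhouette image into a rows x cols mesh and
    return, per cell, the fraction of foreground (value-1) pixels.
    binary_img: 2-D array of 0.0/1.0 values."""
    h, w = binary_img.shape
    feat = np.empty(rows * cols)
    for r in range(rows):
        for c in range(cols):
            cell = binary_img[r * h // rows:(r + 1) * h // rows,
                              c * w // cols:(c + 1) * w // cols]
            feat[r * cols + c] = cell.mean()   # foreground-pixel ratio
    return feat
```

The resulting vectors would then be vector-quantized (e.g. by K-means) into the 72 codewords that serve as HMM observation symbols.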

First Approach – Learning Phase Three learning/pre-processing steps were applied: Background – a background image was saved (for subtraction). Training of the HMMs – the Baum-Welch algorithm maximizes each category’s probability. Cluster generation – code words (the 72 primitives).

First Approach – Algorithm Block Diagram Image(t) + Background Image → Human Figure Extraction (threshold THR) → Mesh Feature Extraction → VQ against Codewords (Clusters) → Symbol Sequence → HMM

First Approach – Results First experiment – Same persons. 10 repetitions. Second experiment – Different persons. 10 repetitions.

First Approach – Pros and Cons Pros: Simplicity – the bottom-up approach needs only low-level image features that are easy to extract. Cons: Threshold setting – the threshold for extracting the human figure must be tuned. Requires a static camera. Limited robustness.

Second Approach Recognizing Actions by Shape-Motion Prototype Trees Zhe Lin, Zhuolin Jiang, Larry S. Davis, ICCV 2009

Second Approach – General Principles Decomposes full actions into atomic prototypes. Top-down approach. Tree configuration of the prototypes. Shape-motion descriptors.

Second Approach – What Are Shape-Motion Features? Two descriptors are used: shape and motion. The shape descriptor: Si = # of background pixels in region i.

Second Approach – What Are Shape-Motion Features? The motion descriptor is obtained as follows: Optical flow field (horizontal and vertical components). Median subtraction. Gaussian blurring.

Second Approach – What Are Shape-Motion Features? Motion descriptor:
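The three motion-descriptor steps listed above can be sketched as follows (a minimal sketch: computing the optical flow itself is assumed to happen upstream, and the small separable blur below stands in for a library Gaussian filter; all names are illustrative):

```python
import numpy as np

def _gauss_blur(img, sigma):
    """Minimal separable Gaussian blur with edge padding."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    pad = np.pad(img, ((0, 0), (r, r)), mode="edge")       # blur rows
    img = np.array([np.convolve(row, k, mode="valid") for row in pad])
    pad = np.pad(img, ((r, r), (0, 0)), mode="edge")       # blur columns
    return np.array([np.convolve(col, k, mode="valid") for col in pad.T]).T

def motion_descriptor(flow_x, flow_y, sigma=1.0):
    """Build a motion descriptor from a precomputed optical flow field:
    median subtraction (cancels global/camera motion), then Gaussian blur."""
    fx = _gauss_blur(flow_x - np.median(flow_x), sigma)
    fy = _gauss_blur(flow_y - np.median(flow_y), sigma)
    return np.concatenate([fx.ravel(), fy.ravel()])
```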

Second Approach – What Are Prototype Trees? Action prototypes are generated by K-means clustering of the shape-motion descriptors. The actions (sets of prototypes) are arranged in a binary tree for quick search and classification.
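A toy sketch of building such a tree by recursive 2-means splitting (the splitting criterion and leaf handling here are simplified assumptions, not the paper's exact procedure):

```python
import numpy as np

def two_means(X, iters=10, seed=0):
    """Plain 2-means split, a stand-in for the K-means clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        label = d.argmin(1)
        for k in (0, 1):
            if (label == k).any():
                centers[k] = X[label == k].mean(0)
    return label, centers

def build_tree(X, leaf_size=1, depth=0, max_depth=10):
    """Recursively split descriptors into a binary prototype tree;
    leaves hold prototype centers (cluster means)."""
    if len(X) <= leaf_size or depth >= max_depth:
        return {"prototype": X.mean(0)}
    label, centers = two_means(X)
    if label.min() == label.max():        # degenerate split: stop
        return {"prototype": X.mean(0)}
    return {"center": centers,
            "left": build_tree(X[label == 0], leaf_size, depth + 1, max_depth),
            "right": build_tree(X[label == 1], leaf_size, depth + 1, max_depth)}

def nearest_prototype(tree, x):
    """Descend to the closest leaf prototype - the fast-search use case."""
    while "prototype" not in tree:
        d = ((tree["center"] - x) ** 2).sum(1)
        tree = tree["left"] if d[0] <= d[1] else tree["right"]
    return tree["prototype"]
```

The point of the tree layout is that classification touches O(depth) prototypes instead of comparing against all of them.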

Second Approach – Learning Phase (Cont.) Distance matrices are constructed between prototypes.

Second Approach – Algorithm Block Diagram

Second Approach – Results Three datasets were used: the authors’ original dataset, Weizmann and KTH. All datasets were tested using the leave-one-person-out approach. Performance: The joint shape-motion method outperformed the motion-only and shape-only methods. The descriptor-distance method yielded the same recognition rates as the joint method.

Second Approach – Experiments Authors’ Original Dataset General description: 14 different gesture classes, 3 persons, each gesture class performed 3 times. Size: 3x3x14 = 126 learning video sequences. Experiments: Changing descriptors (static camera):

Second Approach – Experiments (Cont.) Authors’ Original Dataset Changing the number of prototypes (static camera):

Second Approach – Results (Cont.) Authors’ Original Dataset Changing descriptors (dynamic camera and background): Changing the number of prototypes (dynamic camera and background):

Second Approach – Results (Cont.) Weizmann Dataset General description: 10 action classes, 9 persons. Experiments: Static or dynamic? (Not stated.) Changing descriptors:

Second Approach – Results (Cont.) Weizmann Dataset Changing the number of prototypes

Second Approach – Pros and Cons Pros: The joint use of motion and shape descriptors increases robustness. Works with both static and dynamic cameras. Cons: Detecting the human figure is computationally expensive (optical flow).

Third Approach Learning Spatiotemporal Graphs of Human Activities William Brendel, Sinisa Todorovic, ICCV 2011

Third Approach – General Principles Uses motion and intensity features to generate 2D+t motion tubes. Learns graphs of actions and matches new actions against those graphs for classification. Top-down approach.

Third Approach – What Are the 2D+t Tubes? Objects and their motion are extracted throughout the image sequence. These tubes represent the objects’ relevant 3D (2D + time) spatiotemporal motion.

Third Approach – What Are the 2D+t Tubes? The tubes are constructed from homogeneous blocks. A homogeneous block is a group of pixels that presents lower variation in motion and intensity than its surroundings.
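That homogeneity criterion can be sketched as a simple variance check (the threshold values are hypothetical; the paper's actual criterion compares a block against its surroundings rather than against fixed thresholds):

```python
import numpy as np

def is_homogeneous(intensity_block, flow_block, i_thresh=1e-2, m_thresh=1e-2):
    """A block counts as 'homogeneous' when both its intensity values and
    its motion (flow) values vary less than the given thresholds."""
    return intensity_block.var() < i_thresh and flow_block.var() < m_thresh
```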

Third Approach – Extracting the Graphs After a video is segmented into its relevant moving objects and the tubes are extracted, a spatiotemporal graph is built: Object Segmentation → Tubes Extraction → Spatiotemporal Graph Generation.

Third Approach – Extracting the Graphs (Cont.) Graph nodes represent the tubes. Edges encode 3 types of relationships between tubes: Hierarchical (‘ascendant’, ‘descendant’). Temporal (‘before’, ‘after’, ‘overlap’, ‘meet’). Spatial (‘left’, ‘up’, ‘down’, ‘right’). The directed edges are labeled with the strength of the relationship.

Third Approach – Extracting the Graphs (Cont.) Adjacency matrices (n×n, where n is the number of nodes) were computed; they contain the strength of each of the 3 relationships between all pairs of nodes. The strengths were computed as follows: Hierarchical – the ratio of the ascendant and descendant volumes. Temporal – the ratio between the number of frames spanned by the tube and the whole video. Spatial – binary values for absent or present (within a certain distance of each tube).
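A sketch of filling the three adjacency matrices under these definitions (the tube representation, field names, and the spatial distance threshold are illustrative assumptions, not the paper's data structures):

```python
import numpy as np

def relation_strengths(tubes, spatial_radius=50.0):
    """Return the hierarchical, temporal and spatial n x n strength matrices.
    Each tube is a dict with hypothetical fields: 'volume' (voxel count),
    'frames' (set of frame indices), 'centroid' ((x, y) position)."""
    n = len(tubes)
    hier = np.zeros((n, n))
    temp = np.zeros((n, n))
    spat = np.zeros((n, n))
    total_frames = max(max(t["frames"]) for t in tubes) + 1
    for i, a in enumerate(tubes):
        for j, b in enumerate(tubes):
            if i == j:
                continue
            # hierarchical: volume ratio of the pair, capped at 1
            hier[i, j] = min(b["volume"] / a["volume"], 1.0)
            # temporal: tube duration relative to the whole video
            temp[i, j] = len(a["frames"]) / total_frames
            # spatial: binary presence within a distance threshold
            dist = np.hypot(a["centroid"][0] - b["centroid"][0],
                            a["centroid"][1] - b["centroid"][1])
            spat[i, j] = 1.0 if dist < spatial_radius else 0.0
    return hier, temp, spat
```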

Third Approach – Results The database used was the Olympic Sports dataset. The results were compared to other existing methods, in both recognition accuracy and running time. [12] I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. [16] J. C. Niebles, C.-W. Chen and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.

Third Approach – Results (Cont.) Accuracy results were usually better than those of other methods: [20] S. Todorovic and N. Ahuja. Unsupervised category modeling, recognition, and segmentation in images. IEEE TPAMI, 30(12):1–17, 2008.

Third Approach – Pros and Cons Pros: Once the graphs are extracted, the matching problem reduces to a QAP (Quadratic Assignment Problem). The method is more aware of which parts of the image represent the movements relevant to the overall action. Cons: The article is not self-contained – the QAP is solved using the commercial cvx software.

Conclusion and Timeline The three methods presented represent a timeline of improvements:

Approach                       Year  Feature                  Model
Hidden Markov Models (HMMs)    1992  Mesh feature             HMM
Shape-motion prototype trees   2009  Shape-motion descriptor  Binary tree
Spatiotemporal graphs          2011  2D+t tubes               Directed graph

Conclusion and Timeline (Cont.) Performance comparison: In terms of run time, only the last 2 approaches can be compared, because of the almost 20 years of hardware difference:

Approach                       Running Time [m/s]
Shape-motion prototype trees   0.5
Spatiotemporal graphs          14.2

Accuracy:

Approach                       Average Recognition %
Shape-motion prototype trees   94.22
Spatiotemporal graphs          77.30

Conclusion and Timeline (Cont.) Note: the accuracy comparison is limited because the datasets differ, and only the last 2 approaches handled dynamic camera and background issues.