Human Activity Recognition at Mid and Near Range Ram Nevatia University of Southern California Based on work of several collaborators: F. Lv, P. Natarajan, S. Lee, C. Huang International Workshop on Video 2009 May 26, 2009
Activity Recognition: Motivation Activity is the key content of a video (along with scene description). Useful for: monitoring (alerts); indexing (forensic, deep analysis, entertainment…); HCI; …
Activity Recognition: Goals Goal is not just to give a name, but also a description (not just the verb but a sentence) Who, what, when, where, why etc…? Some of these inferences require object recognition in addition to “action” recognition Actor, object, instrument…. Context and story understanding is important to infer intent
Action as Change of State A change in state is given by some function, say f(s, s’, t). Example: walking changes the position of the walker. An event can also be defined over an interval where some properties of f are constant (or within a certain range). Example: walking at a constant speed or in the same direction. Recognition methods require some estimate of the state, such as the positions or poses of actors, their trajectories, and their relation to scene objects.
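A minimal sketch of the "change of state" view, under assumptions not in the slides: the state is a 2-D position with a timestamp, the change function f is taken to be velocity, and the names `State`, `velocity`, `constant_speed_interval` and the tolerance `tol` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class State:
    t: float  # timestamp in seconds
    x: float  # position in metres (assumed units)
    y: float


def velocity(s: State, s_next: State) -> tuple[float, float]:
    """One possible change function f(s, s', t): displacement per unit time."""
    dt = s_next.t - s.t
    if dt <= 0:
        raise ValueError("states must be in increasing time order")
    return ((s_next.x - s.x) / dt, (s_next.y - s.y) / dt)


def constant_speed_interval(track: list[State], tol: float = 0.1) -> bool:
    """True if every step speed is within a relative tolerance of the mean speed."""
    speeds = [hypot(*velocity(a, b)) for a, b in zip(track, track[1:])]
    if not speeds:
        return False
    mean = sum(speeds) / len(speeds)
    return mean > 0 and all(abs(v - mean) <= tol * mean for v in speeds)


# A walker moving at a steady 1.2 units per second along x.
walk = [State(float(t), 1.2 * t, 0.0) for t in range(5)]
# A track that stops and then speeds up.
stop_and_go = [State(0, 0, 0), State(1, 1, 0), State(2, 1, 0), State(3, 3, 0)]
```

Here `constant_speed_interval(walk)` holds while `constant_speed_interval(stop_and_go)` does not, which is the interval-based event definition in miniature.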
Event Composition Composite events: compositions of other, simpler events. Composition is usually, but not necessarily, a sequence operation, e.g. getting out of a car, opening a door and entering a building. Primitive events: those we choose not to decompose, e.g. walking. Primitive events can be recognized directly from observables by using standard classifiers. Graphical models, such as HMMs and CRFs, are natural tools for recognition of composite events.
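A minimal sketch of sequence composition, assuming primitive events have already been recognized and arrive as a list of string labels. This is a deterministic ordering check, not the probabilistic graphical models (HMMs, CRFs) named on the slide; the function name and the event labels are hypothetical.

```python
def matches_sequence(observed, pattern):
    """True if the primitives in `pattern` occur in `observed` in that order.

    Other primitive events may be interleaved between the required ones.
    """
    remaining = iter(observed)
    # `p in remaining` consumes the iterator up to the first match, so each
    # later primitive must be found strictly after the previous one.
    return all(p in remaining for p in pattern)


# Composite event from the slide's example, as an ordered list of primitives.
ENTER_FROM_CAR = ["exit_car", "open_door", "enter_building"]

observed = ["drive", "exit_car", "walk", "open_door", "enter_building"]
wrong_order = ["open_door", "exit_car", "enter_building"]
```

With these inputs, `matches_sequence(observed, ENTER_FROM_CAR)` is true and `matches_sequence(wrong_order, ENTER_FROM_CAR)` is false. A probabilistic model would replace the hard match with a score over uncertain primitive detections.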
Hierarchical Models Hierarchical structure of events is naturally reflected in hierarchical graphical models
Issues in Activity Recognition Variations in image/video appearance due to changes in viewpoint, illumination, clothing, style of activity etc. Inherent ambiguities in 2-D videos Reliable detection and tracking of objects, especially those directly involved in activities Temporal segmentation “Recognition” of novel events
Mid vs Near Range Mid-range Limbs of human body, particularly the arms, are not distinguishable Common approach is to detect and track moving objects and make inferences based on trajectories Near-range Hands/arms are visible; activities are defined by pose transitions, not just the position transitions Pose tracking is difficult; top-down methods are commonly used
Mid-Range Example Example of abandoned luggage detection Based on trajectory analysis and simple object detection/recognition Uses a simple Bayesian classifier and logical reasoning about order of sub-events Tested on PETS and ETISEO data
Tracking in Crowded Environments Results from CVPR09 paper
Dealing with Track Failures In crowded environments, track fragmentation is common. Events of interest may themselves cause occlusions, e.g. two (or more) people meeting. Detection of a possible event can trigger a re-evaluation of the tracks. Meeting event example: people must have been separate, then get close to each other and stay together for some time. How to distinguish between passing by and meeting? Both may cause tracks to vanish.
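A minimal sketch of the meeting definition above (separate, then close, then together for some time), assuming both tracks are complete and sampled at the same timestamps. It does not handle the vanishing tracks the slide warns about. The function name and the thresholds `near` and `min_together` are hypothetical.

```python
from math import hypot


def classify_encounter(track_a, track_b, near=1.5, min_together=3.0):
    """Classify two synchronized tracks of (t, x, y) samples.

    Returns "meeting", "passing", or "none".
    """
    was_apart = False   # the two people must have been separate first
    run_start = None    # start time of the current "close together" run
    longest = None      # longest close run seen after being apart
    for (t, xa, ya), (_, xb, yb) in zip(track_a, track_b):
        close = hypot(xa - xb, ya - yb) <= near
        if not close:
            was_apart = True
            run_start = None
        elif was_apart:
            if run_start is None:
                run_start = t
            duration = t - run_start
            longest = duration if longest is None else max(longest, duration)
    if longest is None:
        return "none"
    return "meeting" if longest >= min_together else "passing"


standing = [(t, 5.0, 0.0) for t in range(11)]
walks_past = [(t, float(t), 0.0) for t in range(11)]
walks_up_and_stays = [(t, float(min(t, 5)), 0.0) for t in range(11)]
far_away = [(t, float(t), 10.0) for t in range(11)]
```

With these tracks, `walks_past` against `standing` is classified as "passing" (close for only 2 time units), `walks_up_and_stays` as "meeting", and `far_away` as "none". Only the dwell time separates the first two cases.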
Meeting Event Result (Videos) Tracking Result Meeting Event Detection Result
Events Requiring Fine Pose Tracking Many events, e.g. gestures, require tracking of body pose, not just position. Human pose has many degrees of freedom (> 50 joint angles/positions). Bottom-up pose tracking approaches are slow and not robust. Top-down approaches attempt to recognize activity and pose simultaneously. Note that the data is usually not pre-segmented into primitive action segments. Closed-world assumption.
Activity Recognition w/o Tracking [Figure: input sequence; outputs: 3D body pose + action segments (check watch, punch, kick, pick up, throw)]
Difficulties Viewpoint change & pose ambiguity (with a single camera view). Spatial and temporal variations (style, speed).
Key Poses and Action Nets Key poses are determined by an automatic method that computes large changes in energy; key poses may be shared among different actions
Experiments: Training Set 15 action models 177 key poses 6372 nodes in Action Net
Action Net: Apply constraints [Figure: Action Net; labels: 0°, 0°, 10°, …]
Experiments: Test Set 50 clips, average length 1165 frames 5 viewpoints 10 actors (5 men, 5 women)
Experiments: Results
                  PMK      PMK-NU
w/o Action Net    38.4%    44.1%
w/ Action Net     56.7%    80.6%
A Video Result [Video panels: original frame; extracted blob & ground truth; without action net; with action net]
Working with Natural Environments Foreground segmentation is difficult. This leads to the use of lower-level features, e.g. edges and optical flow. Key poses are not discriminative enough w/o accurate segmentation; actor position also needs to be inferred. We introduce the use of continuous pose sequences. More general graphical models that include: hierarchy; transition probabilities that may depend on observations; observations that may depend on multiple states; duration models (HMMs imply an exponential decay).
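A minimal sketch of why a plain HMM implies an exponentially decaying duration model: with self-transition probability a, the probability of staying in a state for exactly d frames is a^(d-1) (1 - a), a geometric distribution whose mode is always d = 1. The function names are hypothetical, and the value 0.9 is only an illustration.

```python
def hmm_duration_pmf(self_transition, d):
    """P(state lasts exactly d frames) in a first-order HMM: a^(d-1) * (1 - a)."""
    if not 0.0 <= self_transition < 1.0:
        raise ValueError("self_transition must be in [0, 1)")
    if d < 1:
        raise ValueError("duration must be at least one frame")
    return self_transition ** (d - 1) * (1.0 - self_transition)


def expected_duration(self_transition):
    """Mean of the geometric duration distribution: 1 / (1 - a)."""
    return 1.0 / (1.0 - self_transition)


# Probabilities shrink by a constant factor with every extra frame, so a
# short stay is always the most likely one, whatever the action.
durations = [hmm_duration_pmf(0.9, d) for d in range(1, 6)]
```

An action that typically lasts a fixed number of frames is poorly described by this shape, which is the motivation for an explicit duration model.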
Experiments Tested the approach on videos of 6 actions: sit-on-ground (SG), standup-from-ground (StG), sit-on-chair (SC), standup-from-chair (StC), pickup (PK), point (P). Collected instances of these actions around 4 tilt angles and 5 pan angles. A total of 400 instances over all actions, with various backgrounds. We assessed the relative importance of the shape, flow and duration features against our full system (shape+flow+duration).
Results Combining flow and shape produces a clear improvement. Bulk of the expense is in computing the flow.