Morphological Segmentation of Natural Gesture
Jacob Eisenstein
MAS 622 Final Project
[Figure: gesture phase labels - Prepare, Stroke, Hold, Retract]
Natural Gesture
- Gesture supplements verbal communication: turn boundaries, reference resolution, visual imagery
- What are the lowest-level gesture units?
- McNeill: "movement phases" - Prepare, Stroke, Hold, Retract
Videos of people explaining things to each other
[Figure: example timeline with Prepare, Stroke, and Hold phases marked over time]
Outline
- Hand Tracking
  - "Guided" clustering
  - Kalman filter
- Gesture Recognition
  - Durational HMMs
  - Recurrent neural networks
Hand Tracking
- Seems easy?
- Occlusion, shadows; hands are not in every frame
- 85% accuracy with color info alone
- How to do better?
Better Hand Tracking
- Other features: position, edges
- But how to use these features?
- Supervised training: P = set of positive examples, N = set of negative examples
[Figure: positive (P) and negative (N) example sets in feature space]
"Guided" Training
- Labeling is very expensive
- Approximate P and N with small labeled sets P' and N'
- Initialize clusters at the centers of P' and N'
- K-means cluster using all points (see the sketch below)
[Figure: clusters seeded from P' and N', refined over all points]
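A minimal sketch of this guided k-means step, assuming NumPy and scikit-learn (the slides do not name an implementation): the small labeled sets P' and N' only supply the initial centers, and k-means then refines the assignment over all points.

    import numpy as np
    from sklearn.cluster import KMeans

    def guided_cluster(features, p_prime, n_prime):
        """Cluster all feature vectors into hand / non-hand groups,
        seeding k-means with the centers of the small labeled sets
        P' (positive) and N' (negative)."""
        init = np.vstack([p_prime.mean(axis=0), n_prime.mean(axis=0)])
        km = KMeans(n_clusters=2, init=init, n_init=1).fit(features)
        return km.labels_ == 0   # True where a point falls in the P'-seeded cluster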
Hand Tracking Results
- Error rate: (FP + FN + 2 * WrongPos) / ALL
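For concreteness, the error measure reads as follows (a trivial sketch; the variable names are mine, not from the slides):

    def error_rate(fp, fn, wrong_pos, total):
        """Slide's error measure: false positives, false negatives,
        and doubly weighted wrong-position detections, over all frames."""
        return (fp + fn + 2 * wrong_pos) / total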
Kalman Filtering
- State: X(t) = X(t-1) + V(t-1);  V(t) = V(t-1) + W(t)
- Observation: Y(t) = X(t) + R(t)
- Initialization: Cov(W) = [[0.1, 0], [0, 0.1]], Cov(R) = [[1, 0], [0, 1]]
- Parameters re-estimated using EM (see the sketch below)
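A minimal constant-velocity Kalman filter matching the state and observation equations above, with the slide's covariance initialization; the EM re-estimation of the parameters is omitted, and NumPy is assumed.

    import numpy as np

    F = np.array([[1, 0, 1, 0],    # X(t) = X(t-1) + V(t-1)
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],    # V(t) = V(t-1) + W(t)
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],    # Y(t) = X(t) + R(t): only position is observed
                  [0, 1, 0, 0]], dtype=float)
    Q = np.diag([0.0, 0.0, 0.1, 0.1])   # Cov(W): process noise enters via velocity
    R = np.eye(2)                        # Cov(R): observation noise

    def kalman_step(x, P, y):
        """One predict/update cycle for the state x = [px, py, vx, vy]."""
        x, P = F @ x, F @ P @ F.T + Q                 # predict
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
        x = x + K @ (y - H @ x)                       # correct with observation y
        P = (np.eye(4) - K @ H) @ P
        return x, P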
Kalman Filter Results
- Reduces position accuracy
- Smooths velocity
- Improves overall performance by ~5%
Movement Phase Recognition
- Two sources of information:
  - Observable features: velocity, position
  - Temporal / sequential structure
- Ideal for an HMM?
HMM Setup
- We have data with states labeled
- Learn state transitions and outputs directly from the data; no need for Baum-Welch estimation
- Find the best path using Viterbi
- Any probabilistic classifier can supply the output probabilities (see the sketch below)
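A compact sketch of this supervised setup, assuming NumPy: transitions are estimated by counting labeled state pairs, and Viterbi decodes in log space from whatever per-frame classifier provides the emission scores. The add-one smoothing is my assumption, not from the slides.

    import numpy as np

    def learn_transitions(state_seqs, n_states):
        """Estimate transition probabilities by counting labeled state pairs
        (add-one smoothing is an assumption, not from the slides)."""
        counts = np.ones((n_states, n_states))
        for seq in state_seqs:
            for s, t in zip(seq[:-1], seq[1:]):
                counts[s, t] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def viterbi(log_emissions, log_trans, log_prior):
        """log_emissions[t, s] = log P(obs_t | state s) from any probabilistic
        classifier; returns the most likely state path."""
        T, S = log_emissions.shape
        delta = log_prior + log_emissions[0]
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans        # scores[i, j]: from state i to j
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emissions[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]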
Initial Results
- Accuracy = percent classified correctly
- Five-class problem, including "no gesture"
- 1-component mixture: 34.6%
- 3-component mixture: 33.3%
- 7-component mixture: 32.6%
- Not very good!
Durational HMMs
- A standard HMM implies an exponentially decaying state-duration model: with self-transition probability a, P(d = t) = a^(t-1) * (1 - a)
- What about other models of state duration?
- Rabiner explains parameter estimation for durational HMMs, but not Viterbi
Viterbi for Gaussian Durational HMMs
- Leaving a state follows a probability density function: P(d = t) = N(t; u, s)
- Each self-transition follows the complementary cumulative distribution: P(d > t) = 1 - C(t; u, s)
- Normalize for the cost you've already paid:
  - P(d = t | d > t-1) = N(t; u, s) / (1 - C(t-1; u, s))
  - P(d > t | d > t-1) = (1 - C(t; u, s)) / (1 - C(t-1; u, s))
- (see the sketch below)
[Figure: duration distributions Pi(d), Pj(d) for two states]
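A sketch of these duration terms, assuming SciPy's normal distribution for N (density) and C (CDF). A durational Viterbi then also tracks how long the current state has been occupied, scoring self-transitions with the "stay" term and exits with the "leave" term on top of the ordinary transition probabilities.

    import numpy as np
    from scipy.stats import norm

    def duration_log_probs(t, mu, sigma):
        """Log-probabilities of leaving vs. staying after t frames in a state,
        under a Gaussian duration model, conditioned on having already
        survived t - 1 frames (the cost already paid)."""
        paid = 1.0 - norm.cdf(t - 1, mu, sigma)           # P(d > t-1)
        leave = norm.pdf(t, mu, sigma) / paid             # P(d = t | d > t-1)
        stay = (1.0 - norm.cdf(t, mu, sigma)) / paid      # P(d > t | d > t-1)
        return np.log(leave + 1e-12), np.log(stay + 1e-12)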
Results for Durational Viterbi

    Components   Standard   Durational
    1            34.6%      35.5%
    3            33.3%      36.7%
    7            31.6%      38.0%

- Best durational is 3.4% better than the best baseline
Neural Networks
- Feedforward network (13 x 50 x 5): 44.5%
- Ignoring sequence and temporal information!
- Maybe recurrent NNs can do even better? (see the sketch below)
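A minimal stand-in for the 13 x 50 x 5 feedforward network using scikit-learn; the framework, the random placeholder data, and the train/test split are my assumptions, not the original setup.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Placeholder data with the slide's dimensions: 13 features per frame,
    # 5 movement-phase classes (including "no gesture").
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 13))
    y = rng.integers(0, 5, size=1000)

    clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)   # 13 -> 50 -> 5
    clf.fit(X[:800], y[:800])
    print("frame accuracy:", clf.score(X[800:], y[800:]))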
Future Work
- Hand Tracking: cluster to mixtures of Gaussians instead of single Gaussians
- Kalman Filtering: noise is not Gaussian; particle filter?
- Gesture Phase Recognition: recurrent neural networks, other discriminative methods