GW2003, Genoa, April 2003

GesRec3D: A real-time coded gesture-to-speech system with automatic segmentation and recognition thresholding using dissimilarity measures

Michael P. Craven, School of Engineering, University of Technology, Jamaica <michael.craven@ieee.org>
K. Mervyn Curtis, Department of Mathematics and Computer Science, University of the West Indies, Jamaica

Work carried out at the University of Nottingham, School of Electrical and Electronic Engineering, in collaboration with Access to Communication and Technology, Regional Rehabilitation Centre, Oak Tree Lane Centre, Selly Oak, Birmingham. Funded by an Action Research grant.
Motivations and Issues

- Apply gesture recognition to severely disabled users (cerebral palsy, stroke)
  - Augmentative and Alternative Communication (AAC), e.g. gesture-to-speech
  - environmental control, e.g. opening doors, operating appliances
  - replacing mouse buttons in PC applications
- Segment and recognise ‘crude’ gestures
  - be less reliant on fine motor control
  - maintain spatial and temporal differences
  - filter out ‘spurious’ movements
- Control over recognition confidence
  - robust acceptance/rejection strategy
  - reduce confusion between gestures, but avoid excessive rejection
  - may be safety critical
- Human factors
  - user fatigue: incremental training, short overall training time
  - understandability: for both disabled users and their helpers
GesRec3D gesture-to-speech system
GesRec3D: Summary

- Gesture -> Text -> Speech system
- MS Windows application running on a PC with a Soundblaster card
- Polhemus Fastrak tracker (1 to 4 sensors, 20 samples/sec)
- Up to 30 user-defined gestures linked to a user-defined (or preset) table of words/phrases, spoken by the TextAssist speech engine

Minimising fatigue
- On-line segmentation for fast training & recognition
- Only 5 examples of each gesture
- Incremental acquisition (or removal) of gesture examples

Other features
- Speech and/or text to prompt user input
- Sensitive to differences in scale and duration, but invariant to gesture start location
- User control over the segmentation & rejection/confusion trade-off
Real-time on-line segmentation

State machine (three states):
- RESET: remain here while the Starting condition is FALSE; when it becomes TRUE, enter GESTURE [start timer]
- GESTURE: remain here while the Continue condition is TRUE and the Time-out condition is FALSE
  - Continuation condition FALSE, Min. Duration condition FALSE: movement too short, return to RESET
  - Continuation condition FALSE, Min. Duration condition TRUE: enter END
  - Time-out condition TRUE: enter END
- END: add the segmented gesture to the training set, or recognise it; then pause before returning to RESET

Parameters:
1. Starting speed
2. Continuation speed
3. Minimum duration
4. Time-out interval
5. Pause interval

(A minimal code sketch of this state machine follows.)
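Below is a minimal Python sketch of the segmenter, intended only to make the transitions concrete. The class name, default values, and the mapping of the Starting/Continuation conditions onto speed thresholds are illustrative assumptions (the slide only names the five parameters); the pause interval after END is omitted.

```python
import math

class Segmenter:
    """Three-state on-line segmenter: RESET -> GESTURE -> END (sketch)."""

    def __init__(self, start_speed=5.0, cont_speed=2.0,
                 min_duration=0.25, time_out=5.0, sample_rate=20):
        self.start_speed = start_speed    # Starting condition threshold (assumed units)
        self.cont_speed = cont_speed      # Continuation condition threshold
        self.min_duration = min_duration  # Minimum duration (s)
        self.time_out = time_out          # Time-out interval (s)
        self.dt = 1.0 / sample_rate       # Fastrak delivers 20 samples/sec
        self.state = "RESET"
        self.gesture = []                 # (x, y, z) samples of the current gesture
        self.elapsed = 0.0

    def feed(self, prev, curr):
        """Feed consecutive (x, y, z) samples; return the segmented gesture
        when one ends, else None."""
        speed = math.dist(prev, curr) / self.dt
        if self.state == "RESET":
            if speed > self.start_speed:                  # Starting condition TRUE
                self.state, self.gesture, self.elapsed = "GESTURE", [curr], 0.0
        elif self.state == "GESTURE":
            self.elapsed += self.dt
            if speed > self.cont_speed and self.elapsed < self.time_out:
                self.gesture.append(curr)                 # Continue condition TRUE
            elif self.elapsed >= self.min_duration:       # Min. Duration or Time-out
                self.state = "RESET"
                return self.gesture                       # END: train or recognise
            else:
                self.state = "RESET"                      # too short: spurious, discard
        return None
```

In use, consecutive tracker samples would be fed in pairs; at 20 samples/sec each call advances the gesture timer by 50 ms.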
On-line segmentation video
Training - dissimilarity measure

Compare two segmented gestures G_a(x,y,z) and G_b(x,y,z), of lengths m_a and m_b.

Dissimilarity measure d_{ab}: accumulated ‘city block’ distance,

  d_{ab} = \sum_{i=1}^{m} ( |x_a(i) - x_b(i)| + |y_a(i) - y_b(i)| + |z_a(i) - z_b(i)| )   (1)

- Same length (m = m_a = m_b): use (1) directly
- Different lengths (m_a > m_b), three options:
  - dynamic time-warping (non-linear optimal match) - slowest
  - linearly interpolate the shorter gesture to length m_a and use (1) - faster
  - pad the shorter gesture with zeros and use (2) - fastest

  d_{ab} = \sum_{i=1}^{m_a} ( |x_a(i) - x_b(i)| + |y_a(i) - y_b(i)| + |z_a(i) - z_b(i)| ), with G_b zero-padded for i > m_b   (2)

(A code sketch of the three options follows.)
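A sketch of the three length-handling options, assuming gestures are stored as lists of (x, y, z) tuples; the function names are hypothetical, not taken from the GesRec3D code.

```python
def cityblock(ga, gb):
    """Equation (1): accumulated city-block distance, equal-length gestures."""
    return sum(abs(ax - bx) + abs(ay - by) + abs(az - bz)
               for (ax, ay, az), (bx, by, bz) in zip(ga, gb))

def d_zero_pad(ga, gb):
    """Equation (2), fastest: pad the shorter gesture with zeros."""
    if len(ga) < len(gb):
        ga, gb = gb, ga
    gb = gb + [(0.0, 0.0, 0.0)] * (len(ga) - len(gb))
    return cityblock(ga, gb)

def d_interp(ga, gb):
    """Faster: linearly interpolate the shorter gesture to the longer length."""
    if len(ga) < len(gb):
        ga, gb = gb, ga
    m, n = len(ga), len(gb)
    stretched = []
    for i in range(m):
        t = i * (n - 1) / (m - 1) if m > 1 else 0.0
        j, f = int(t), t - int(t)
        k = min(j + 1, n - 1)
        stretched.append(tuple(gb[j][c] + f * (gb[k][c] - gb[j][c])
                               for c in range(3)))
    return cityblock(ga, stretched)

def d_dtw(ga, gb):
    """Slowest: dynamic time-warping (non-linear optimal match)."""
    inf = float("inf")
    m, n = len(ga), len(gb)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = sum(abs(a - b) for a, b in zip(ga[i - 1], gb[j - 1]))
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]
```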
Training - rejection threshold

- Train C gesture classes, n examples of each
- Calculate the nC x nC dissimilarity matrix, e.g. 60x60 elements for n=5, C=12 (note: the computation scales with both n^2 and C^2)
- For each class:
  - find the worst match internal to the class, d_int (largest value)
  - find the best match external to the class, d_ext (smallest value)
  - calculate the rejection threshold d_th between d_int and d_ext
- Default global rejection parameter K=1 (midpoint threshold)
- Decrease K for stricter rejection
- Bounds may also be set on d_th

(A sketch of the threshold calculation follows.)
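A minimal sketch of the per-class threshold calculation over the nC x nC matrix. The slide does not give the threshold formula; the form below, d_th = d_int + K(d_ext - d_int)/2, is an assumption chosen so that K=1 yields the midpoint between d_int and d_ext and smaller K is stricter, matching the description. Function and argument names are hypothetical.

```python
def train_thresholds(examples, dissim, K=1.0):
    """examples: list of (class_label, gesture), n examples per class;
    dissim: a dissimilarity function such as d_zero_pad above.
    Returns {class_label: d_th}."""
    n = len(examples)
    # nC x nC dissimilarity matrix over all training examples
    D = [[dissim(examples[i][1], examples[j][1]) for j in range(n)]
         for i in range(n)]
    thresholds = {}
    for c in {label for label, _ in examples}:
        idx = [i for i, (label, _) in enumerate(examples) if label == c]
        ext = [i for i in range(n) if i not in idx]
        d_int = max(D[i][j] for i in idx for j in idx if i != j)  # worst internal match
        d_ext = min(D[i][j] for i in idx for j in ext)            # best external match
        thresholds[c] = d_int + K * (d_ext - d_int) / 2.0         # midpoint when K=1
    return thresholds
```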
Recognition - algorithms

The best match between an unknown gesture and any gesture in the training set is the minimum distance, d_min.

Single sensor:
1. Acquire the gesture and compare it with the training set for the best match, d_min
2. Find the gesture class corresponding to d_min
3. If d_min < d_th, select that class; otherwise reject the gesture
4. Perform the action linked to the selected gesture class

Multiple sensors:
1. Find the class with d_min for each sensor
2. (optional: reject the gesture if the classes are different)
3. Find d_th for each class for all sensors
4. Add the d_min values
5. Add the d_th values
6. If the summed d_min < the summed d_th, select the class corresponding to the ‘primary’ sensor; otherwise reject the gesture
7. Perform the action linked to the selected gesture class

(A sketch of both procedures follows.)
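A sketch of both recognition procedures, reusing the hypothetical helpers from the earlier sketches (a dissimilarity function and the per-class thresholds). The `strict` flag implements the optional class-agreement check; all names are assumptions.

```python
def recognise(gesture, examples, thresholds, dissim):
    """Single sensor: return the recognised class label, or None if rejected."""
    # Best match over every training example gives d_min and its class
    d_min, best_class = min((dissim(gesture, g), label) for label, g in examples)
    if d_min < thresholds[best_class]:    # accept only below the class threshold
        return best_class                 # then perform the linked action
    return None                           # rejected

def recognise_multi(gestures, per_sensor, primary=0, strict=False):
    """Multiple sensors: gestures has one segmented gesture per sensor;
    per_sensor is a list of (examples, thresholds, dissim) triples."""
    results = []
    for g, (examples, thresholds, dissim) in zip(gestures, per_sensor):
        d, label = min((dissim(g, ex), lab) for lab, ex in examples)
        results.append((d, label, thresholds[label]))
    if strict and len({label for _, label, _ in results}) > 1:
        return None                       # optional: sensors disagree on class
    d_min_sum = sum(d for d, _, _ in results)   # step 4: add the d_min
    d_th_sum = sum(th for _, _, th in results)  # step 5: add the d_th
    if d_min_sum < d_th_sum:
        return results[primary][1]        # class from the 'primary' sensor
    return None                           # rejected
```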
Experiment 1 - Shape gestures

[Slide shows example shape-gesture trajectories: multiple sensors, fast and slow performances]
Results - Shape gestures

- Hit rates between 82% and 96%
- 100 further arbitrary gestures were all rejected
- Spurious short gestures were rejected by the segmentation algorithm
- Fewer misses arose from confusion than from rejection
- Fast training: 5 minutes to input 60 gestures (5 examples x 12 classes)
- Fast 60x60 dissimilarity matrix calculation (on a Pentium 133MHz):
  - zero padded: 0.06 sec
  - linear interpolation: 0.6 sec
  - dynamic time-warping (DP): 7.5 sec
Dissimilarity data - one row
Experiment 2 - Greeting gestures

[Results table not reproduced.] Figures in brackets demonstrate the use of a stricter threshold to obtain lower confusion: the global threshold parameter K is reduced by 10%.
Dissimilarity data - multiple sensors
Research Directions

- Design alternative algorithms for multiple sensors, e.g. incorporate an arm model
- Use dissimilarity data to suggest ‘better’ gestures
- Further filter out ‘spurious’ movements, e.g. tremor
- Design a mobile tracking device with wireless sensors
- Improve the user interface
  - more intuitive control over recognition parameters, esp. for helpers
  - assess user motivation, esp. for children
  - investigate memorability of gestures