UCD Electronic and Electrical Engineering
Robust Multi-modal Person Identification with Tolerance of Facial Expression
Niall Fox, Dr Richard Reilly
University College Dublin, Ireland
Overview
- Motivation
- Analysis for Speech and Mouth Feature Experts
- Results for the 2 Individual Experts
- Automatic Integration of Experts
- Results of Integration
- Conclusions
Motivation
- Human communication is multimodal
- Benefits of using visual information
  - Unaffected by acoustic noise
  - Complementary to the audio signal
  - Audio and visual noise are uncorrelated
  - Increased robustness and accuracy
Audio-Visual Platform
- [Block diagram; stages: Feature Extraction, Modelling/Scoring, Score Integration]
Audio Expert
- 20 ms Hamming window, 10 ms overlap
- 16 static features
  - 15 Mel Frequency Cepstrum Coefficients (MFCC)
  - 1 energy value per frame
- 16 delta features
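The framing and delta steps of the audio front end can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 15 MFCCs are omitted (only the per-frame energy term is computed), the sample rate of 8000 Hz is a hypothetical value, and the central-difference delta formula is an assumption, since the slides do not state how the delta features were derived.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frame_signal(samples, sample_rate, win_ms=20.0, overlap_ms=10.0):
    """Split samples into Hamming-windowed frames (20 ms long, 10 ms overlap)."""
    win_len = int(round(sample_rate * win_ms / 1000.0))
    hop = win_len - int(round(sample_rate * overlap_ms / 1000.0))
    window = hamming(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frames.append([samples[start + k] * window[k] for k in range(win_len)])
    return frames

def log_energy(frame, floor=1e-10):
    """Log energy of one frame; the floor avoids log(0) on silent frames."""
    return math.log(max(sum(x * x for x in frame), floor))

def delta(features):
    """Assumed delta: central difference (f[t+1] - f[t-1]) / 2, edges clamped."""
    n = len(features)
    out = []
    for t in range(n):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

if __name__ == "__main__":
    rate = 8000  # hypothetical sample rate, not taken from the slides
    tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    frames = frame_signal(tone, rate)
    static = [[log_energy(f)] for f in frames]  # MFCCs would be appended here
    dynamic = delta(static)
    print(len(frames), len(static[0]), len(dynamic[0]))
```

In a full front end each static vector would hold the 15 MFCCs plus the energy term, and `delta` would then yield the matching 16 delta features per frame.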
Mouth Features Expert
- ROI extraction
- Gray scale image is employed
- Pre-processing:
  - Histogram equalisation
  - De-meaning
- DCT applied to ROI (top 14 features selected)
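The mouth feature pipeline (histogram equalisation, de-meaning, 2-D DCT, selection of 14 coefficients) might look roughly like the sketch below. It is illustrative only: the slides do not say how the "top 14" coefficients are chosen, so the low-frequency ordering used in `low_frequency_features` is an assumption, and all function names are hypothetical.

```python
import math

def equalise_histogram(img, levels=256):
    """Histogram-equalise an image given as rows of integer grey levels."""
    flat = [p for row in img for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    denom = max(len(flat) - cdf_min, 1)
    return [[(cdf[p] - cdf_min) * (levels - 1) / denom for p in row] for row in img]

def de_mean(img):
    """Subtract the image mean from every pixel."""
    mean = sum(sum(row) for row in img) / (len(img) * len(img[0]))
    return [[p - mean for p in row] for row in img]

def dct_1d(v):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(img):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def low_frequency_features(coeffs, count=14):
    """Assumed selection rule: first `count` coefficients by increasing u + v."""
    order = sorted(
        ((u, v) for u in range(len(coeffs)) for v in range(len(coeffs[0]))),
        key=lambda uv: (uv[0] + uv[1], uv[0]),
    )
    return [coeffs[u][v] for u, v in order[:count]]

def mouth_features(roi, count=14):
    """Full sketch: equalise, de-mean, DCT, keep `count` coefficients."""
    return low_frequency_features(dct_2d(de_mean(equalise_histogram(roi))), count)
```

Because the image is de-meaned before the transform, the DC coefficient is zero, so a practical system might skip it when selecting features; the sketch keeps it for simplicity.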
Database
- XM2VTS database
- 295 subjects
- 4 sessions (spaced one month apart) of the sentence “Joe took fathers green shoe bench out”
Person Identification Tests
- Tested on 251 subjects from the database of 295
- Models trained on monthly sessions 1, 2 and 3; tested on session 4
- HMMs model the audio and mouth features
- AWGN was added to the audio
- JPEG compression was applied to the video images
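Adding white Gaussian noise at a chosen SNR amounts to scaling the noise variance to the signal power. The sketch below shows one common way to do this; the function names are hypothetical and the slides do not describe the exact procedure used in the experiments.

```python
import math
import random

def signal_power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def add_awgn(samples, snr_db, rng=None):
    """Return samples plus zero-mean Gaussian noise at the requested SNR in dB."""
    rng = rng or random.Random()
    noise_power = signal_power(samples) / (10 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

def measured_snr_db(clean, noisy):
    """SNR actually obtained, computed from the clean and noisy signals."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))

if __name__ == "__main__":
    clean = [math.sin(2 * math.pi * n / 50.0) for n in range(20000)]
    noisy = add_awgn(clean, 21.0, random.Random(0))
    print(round(measured_snr_db(clean, noisy), 2))
```

The measured SNR only approximates the target because the noise is a finite random sample; the gap shrinks as the signal gets longer.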
Audio Expert Scores
- 97% at 48 dB, 37% at 21 dB
- Large roll-off
- [Plot: identification accuracy [%] versus audio SNR [dB] for the audio expert]
Image Degradation Levels
- 10 levels of JPEG compression, quality factor (QF) = 50, 25, 18, 14, 10, 8, 6, 4, 3, 2
- [Figure: image frames and mouth regions at each QF level]
Mouth Features Expert Scores
- 86% at QF = 50, 48% at QF = 2
- [Plot: identification accuracy [%] versus JPEG quality factor]
Audio-Visual Platform
- [Block diagram; stages: Feature Extraction, Modelling/Scoring, Score Integration]
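Expert Weightings
- Weighted likelihood summation:
  \[ l_{AV}(O_A, O_V \mid S_i) = \alpha_A \, l_A(O_A \mid S_i) + \alpha_V \, l_V(O_V \mid S_i) \]
  where \( l_m(O_m \mid S_i) \), \( m \in \{A, V\} \), is the score of expert \( m \) for subject \( S_i \), and
  \[ \alpha_A + \alpha_V = 1, \qquad \alpha_A, \alpha_V \in [0, 1] \]
- Expert reliability measure
- Automatically choose weight:
  \[ \alpha_V^{\mathrm{opt}} = \arg\max_{\alpha_V \in (0, 1)} \{\,\cdot\,\} \]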
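The weighted summation of the two expert scores can be illustrated with a short sketch. It assumes the audio and visual scores have already been brought to a comparable range, and it uses a fixed visual weight; how the weight is chosen automatically from the reliability measure is not shown here. The function names and the example scores are hypothetical.

```python
def fuse_scores(audio, visual, alpha_v):
    """Weighted sum of per-subject scores; the two weights sum to 1."""
    if not 0.0 <= alpha_v <= 1.0:
        raise ValueError("alpha_v must lie in [0, 1]")
    alpha_a = 1.0 - alpha_v
    return {s: alpha_a * audio[s] + alpha_v * visual[s] for s in audio}

def identify(scores):
    """Pick the subject with the highest fused score."""
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Made-up scores for three enrolled subjects (illustration only).
    audio = {"s1": 0.2, "s2": 0.5, "s3": 0.3}
    visual = {"s1": 0.7, "s2": 0.1, "s3": 0.2}
    for alpha_v in (0.0, 0.5, 1.0):
        print(alpha_v, identify(fuse_scores(audio, visual, alpha_v)))
```

With a visual weight of 0 the decision follows the audio expert alone, and with a weight of 1 it follows the mouth expert alone; intermediate weights trade the two off.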
Expert Weightings
- Automatically choose weight
Fusion of Audio and Mouth Feature Experts
- A = 37% at 21 dB, V = 48% at QF = 2, AV = 72% at (21 dB, QF = 2)
- [Plot: accuracy [%] versus audio level [SNR] and visual level [JPEG], with "Visual Alone" and "Audio Alone" curves]
Conclusions
- AV system is robust to both audio and visual degradations
- High performance of mouth region (85%)
  - Robust to facial expressions, occlusion
- Further work
  - Test other types of audio and visual degradations
  - XM2VTS DB: high quality
  - Record real-world data in an office-type scenario …
XM2VTS Database
- Controlled, uniform illumination
- Constant visual background
- Controlled acoustic background
UCD Recordings
- Non-controlled, non-uniform illumination
- Varying visual background
- Noisy acoustic background
Niall Fox
Web:
Dr Richard Reilly
DSP Group, UCD, Dublin, Ireland
This work is supported by Enterprise Ireland under the Informatics Research Initiative