1
Comparing Audio and Visual Information for Speech Processing
David Dean*, Patrick Lucey*, Sridha Sridharan* and Tim Wark*†
* Speech, Audio, Image and Video Research Laboratory
† e-Health Research Centre / CSIRO ICT Centre
Presented by David Dean
CRICOS No. 000213J
2
Audio-Visual Speech Processing - Overview
Speech and speaker recognition are traditionally audio only
– Mature area of research
Significant problems in real-world environments (Wark2001)
– High acoustic noise
– Variation of speech
Audio-visual speech processing adds a visual modality to help alleviate these problems
3
Audio-Visual Speech Processing - Overview
Speech and speaker recognition tasks have many overlapping areas
The same configuration can be used for both text-dependent speaker recognition and speaker-dependent speech recognition
– Train speaker-dependent word (or sub-word) models
– Speaker recognition chooses amongst speakers for a particular word, or
– Word recognition chooses amongst words for a particular speaker
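The shared configuration described on this slide can be illustrated with a minimal sketch. It assumes every speaker-dependent word model has already produced a score for the test utterance; the speaker IDs, words and log-likelihood values below are invented for illustration and do not come from the presentation.

```python
def recognise_speaker(scores, word):
    """Choose amongst speakers, using only the models for one known word."""
    candidates = {spk: s for (spk, w), s in scores.items() if w == word}
    return max(candidates, key=candidates.get)


def recognise_word(scores, speaker):
    """Choose amongst words, using only the models for one known speaker."""
    candidates = {w: s for (spk, w), s in scores.items() if spk == speaker}
    return max(candidates, key=candidates.get)


# Hypothetical log-likelihoods, one per (speaker, word) model.
scores = {
    ("s01", "one"): -120.0, ("s01", "two"): -150.0,
    ("s02", "one"): -135.0, ("s02", "two"): -110.0,
}

best_speaker = recognise_speaker(scores, "one")  # text-dependent speaker recognition
best_word = recognise_word(scores, "s02")        # speaker-dependent speech recognition
```

Both functions read the same table of model scores; only the subset of models being compared changes.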
4
Audio-Visual Speech Processing - Overview
Little research has been done into how the two applications (speaker vs. speech) differ in areas other than the set of models chosen for recognition
One area of interest in this research is the reliance on each modality
– Acoustic features typically work equally well in either application (Young2002)
– Little consensus has been reached on the suitability of visual features for each application
5
Experimental Setup
[Block diagram. Visual path: Lip Location & Tracking, Visual Feature Extraction, Visual Speech/Speaker Models. Acoustic path: Acoustic Feature Extraction, Acoustic Speech/Speaker Models. Final stages: Decision Fusion, Speech/Speaker Decision.]
6
Lip location and tracking
7
Finding Faces
Red, green and blue skin-colour thresholds were manually trained for each speaker
Faces were located by applying these thresholds to the video frames
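A minimal sketch of per-channel skin thresholding follows. The slide does not say how the face region is derived from the thresholded pixels, so the bounding box used here is an assumption, as are the threshold values and the tiny test frame.

```python
def skin_mask(frame, lower, upper):
    """Mark pixels whose R, G and B values all lie inside the speaker's bounds.

    frame is a list of rows of (r, g, b) tuples; lower and upper are
    per-channel inclusive bounds, e.g. (150, 80, 60) and (255, 180, 160).
    """
    return [
        [all(lo <= c <= hi for c, lo, hi in zip(pixel, lower, upper)) for pixel in row]
        for row in frame
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the marked pixels, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, marked in enumerate(row) if marked]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical 4x4 frame: a 2x2 patch of skin-coloured pixels on a dark background.
SKIN, DARK = (200, 140, 120), (20, 20, 20)
frame = [
    [DARK, DARK, DARK, DARK],
    [DARK, SKIN, SKIN, DARK],
    [DARK, SKIN, SKIN, DARK],
    [DARK, DARK, DARK, DARK],
]
face_box = bounding_box(skin_mask(frame, (150, 80, 60), (255, 180, 160)))
```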
8
Finding and tracking eyes
Top half of face region is searched for eyes
A shifted version of Cr-Cb thresholding was performed to locate possible eye regions (Butler2003)
Invalid eye candidate regions were removed, and the most likely pair of candidates was chosen as the eyes
Each new eye location was compared to the old one, and ignored if too far from it
About 40% of sequences had to be manually eye-tracked every 50 frames
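The rule of ignoring a new eye location that is too far from the old one might look like the sketch below. The Euclidean distance measure, the `max_jump` threshold and the choice to reject the whole pair when either eye jumps are assumptions; the slide gives none of these details.

```python
import math


def update_eyes(previous, candidate, max_jump):
    """Keep the previous eye pair if either eye in the candidate moved too far.

    previous and candidate are ((left_x, left_y), (right_x, right_y)) or None.
    max_jump is a hypothetical pixel threshold.
    """
    if previous is None:
        return candidate
    if candidate is None:
        return previous
    for old, new in zip(previous, candidate):
        if math.dist(old, new) > max_jump:
            return previous  # implausible jump: ignore the new location
    return candidate


tracked = ((10, 10), (30, 10))
small_move = ((11, 10), (31, 11))
big_jump = ((40, 40), (31, 11))
```

Here `update_eyes(tracked, small_move, 5)` accepts the new pair, while `update_eyes(tracked, big_jump, 5)` keeps the old one.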
9
Finding and tracking lips
Eye locations are used to define a rotation-normalised lip search region (LSR)
LSR is converted to Red/Green colour space and thresholded
Unlikely lip candidates are removed
The rectangular area containing the largest amount of lip-candidate area is the lip ROI
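The thresholding and rectangle search can be sketched as follows. This is an illustration under several assumptions: the Red/Green colour space is taken to be a per-pixel red-to-green ratio, the rectangle has a fixed, known size, and the removal of unlikely candidates is omitted. The threshold and pixel values are invented.

```python
def lip_candidates(region, ratio_threshold):
    """Mark pixels whose red/green ratio exceeds the (hypothetical) threshold."""
    return [
        [1 if r > ratio_threshold * max(g, 1) else 0 for (r, g, b) in row]
        for row in region
    ]


def best_rectangle(mask, height, width):
    """Find the height x width window containing the most candidate pixels.

    Uses a summed-area table so each window count costs four lookups.
    Returns ((top, left), count).
    """
    rows, cols = len(mask), len(mask[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            sat[y + 1][x + 1] = mask[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x]
    best_count, best_pos = -1, None
    for y in range(rows - height + 1):
        for x in range(cols - width + 1):
            count = (sat[y + height][x + width] - sat[y][x + width]
                     - sat[y + height][x] + sat[y][x])
            if count > best_count:
                best_count, best_pos = count, (y, x)
    return best_pos, best_count


# Hypothetical 4x6 search region with a 2x3 patch of reddish lip pixels.
LIP, SKIN = (200, 80, 80), (120, 110, 100)
region = [[SKIN] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (2, 3, 4):
        region[y][x] = LIP

roi_position, roi_count = best_rectangle(lip_candidates(region, 1.5), 2, 3)
```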
10
Feature Extraction and Datasets
MFCC: 15 coefficients + 1 energy, plus deltas and accelerations = 48 features
PCA: 20 eigenlip coefficients, plus deltas and accelerations = 60 features
– Eigenlip-space trained on entire data set of lip images
Stationary speech from CUAVE (Patterson2002)
– 5 sequences for training, 2 for testing (per speaker)
– Testing was also performed on versions corrupted with speech-babble noise
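The feature counts follow from appending first-order (delta) and second-order (acceleration) coefficients to the static vector: 16 x 3 = 48 and 20 x 3 = 60. The sketch below uses a regression-style delta over a window of two frames with edge frames replicated; the window size and edge handling are assumptions, since the slide does not state the settings used.

```python
def deltas(frames, window=2):
    """Regression deltas: sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2)."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]   # replicate last frame
                behind = frames[max(t - k, 0)][d]      # replicate first frame
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamics(frames, window=2):
    """Append deltas and accelerations (deltas of deltas) to each static frame."""
    d = deltas(frames, window)
    a = deltas(d, window)
    return [static + delta + accel for static, delta, accel in zip(frames, d, a)]


# Placeholder static vectors: 16 acoustic (15 MFCC + energy), 20 visual (eigenlip).
acoustic = add_dynamics([[0.0] * 16 for _ in range(5)])
visual = add_dynamics([[0.0] * 20 for _ in range(5)])
ramp = [[float(t)] for t in range(7)]  # a linear ramp has a constant slope of 1
```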
11
Training
Phone transcriptions obtained from earlier research (Lucey2004) were used to train speaker-independent HMM phone models in both audio and visual domains
Speaker-dependent models were adapted from the speaker-independent models using MLLR adaptation
The HMM Toolkit (HTK) was used (Young2002)
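For readers unfamiliar with MLLR, the standard mean transform is shown below as background. The slide does not say whether only the means were adapted or how many transforms were used, so this is the general form rather than a description of the configuration in these experiments. Here $\mu$ is a speaker-independent Gaussian mean, and $A$ and $b$ are estimated from the target speaker's adaptation data.

```latex
\hat{\mu} = A\mu + b = W\xi,
\qquad
W = \begin{bmatrix} b & A \end{bmatrix},
\qquad
\xi = \begin{bmatrix} 1 & \mu_1 & \cdots & \mu_n \end{bmatrix}^{\mathsf{T}}
```

Because one transform $W$ is shared across many Gaussians, a speaker-dependent model can be estimated from far less data than full retraining would need.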
12
Comparing acoustic and visual information for speech processing
Investigated using the identification rates of speaker-dependent acoustic and visual phoneme models
Test segments freely transcribed using all speaker-dependent phoneme models
– No restriction to specified user or word
Confusion tables for speech (phoneme) and speaker recognition were examined to get identification rates
Example transcription (speaker, phoneme):
Correct: s02m, /w/ | s02m, /ah/ | s02m, /n/
Audio: s10m, /w/ | s02m, /ah/ | s02m, /n/
Video: s02m, /sp/ | s02m, /n/
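The step from free transcriptions to confusion tables and identification rates can be sketched with the audio example from this slide. Pairing the correct and recognised labels by position is a simplifying assumption; a real scoring pass would align segments in time, which is why the two-segment video transcription is left out here.

```python
from collections import Counter


def identification_rate(pairs):
    """Fraction of (actual, recognised) pairs in which the two labels agree."""
    pairs = list(pairs)
    return sum(1 for actual, recognised in pairs if actual == recognised) / len(pairs)


# (speaker, phoneme) labels taken from the slide's example.
correct = [("s02m", "w"), ("s02m", "ah"), ("s02m", "n")]
audio = [("s10m", "w"), ("s02m", "ah"), ("s02m", "n")]

# Split each joint label into a speaker decision and a phoneme decision.
speaker_pairs = [(c[0], a[0]) for c, a in zip(correct, audio)]
phoneme_pairs = [(c[1], a[1]) for c, a in zip(correct, audio)]

# Confusion tables: counts of (actual, recognised) label pairs.
speaker_confusions = Counter(speaker_pairs)
phoneme_confusions = Counter(phoneme_pairs)

speaker_rate = identification_rate(speaker_pairs)
phoneme_rate = identification_rate(phoneme_pairs)
```

In this example every phoneme is identified correctly, but one of the three segments is attributed to the wrong speaker.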
13
Example Confusion Table (Phonemes in Clean Acoustic Speech)
[Figure: confusion table of Actual Phonemes against Recognised Phonemes]
14
Example Confusion Table (Phonemes in Clean Visual Speech)
[Figure: confusion table of Actual Phonemes against Recognised Phonemes]
15
Likelihood of speaker and phone identification using phoneme models
16
Fusion
Because each modality performs differently at speech and speaker recognition, the fusion configuration for each task must be adjusted with these performances in mind
For these experiments
– Weighted sum fusion of the top 10 normalised scores in each modality
– The fusion weight ranges from 0 (video only) to 1 (audio only)
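A minimal sketch of weighted-sum fusion over the top-N scores is given below. The slide does not say how scores were normalised, so min-max scaling is an assumption, as is scoring a candidate at 0 in a modality whose top-N list does not contain it. The candidate labels and scores are invented.

```python
def top_n(scores, n=10):
    """Keep the n best-scoring candidates of one modality."""
    best = sorted(scores, key=scores.get, reverse=True)[:n]
    return {label: scores[label] for label in best}


def normalise(scores):
    """Min-max scale scores to [0, 1] (assumed normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {label: 1.0 for label in scores}
    return {label: (s - lo) / (hi - lo) for label, s in scores.items()}


def fuse(audio, video, weight, n=10):
    """Weighted sum: weight = 1 is audio only, weight = 0 is video only."""
    a = normalise(top_n(audio, n))
    v = normalise(top_n(video, n))
    fused = {
        label: weight * a.get(label, 0.0) + (1 - weight) * v.get(label, 0.0)
        for label in set(a) | set(v)
    }
    return max(fused, key=fused.get), fused


# Hypothetical log-likelihoods for three candidates in each modality.
audio_scores = {"one": -100.0, "two": -120.0, "three": -140.0}
video_scores = {"one": -60.0, "two": -50.0, "three": -70.0}

audio_only, _ = fuse(audio_scores, video_scores, 1.0)
video_only, _ = fuse(audio_scores, video_scores, 0.0)
```

At the two extremes the fused decision reduces to the single-modality decision, which makes it easy to compare intermediate weights against each modality alone.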
17
Speech vs Speaker
The responses of each system to speech-babble noise over a selected range of fusion-weight values were compared.
[Figure: Word Identification]
18
Speech vs Speaker
The responses of each system to speech-babble noise over a selected range of fusion-weight values were compared.
[Figure: Speaker Identification]
19
Speech vs Speaker
Acoustic performance is basically equal for both tasks
Visual performance is clearly better for speaker recognition
Speech recognition fusion is catastrophic at nearly all noise levels
Speaker recognition fusion is only catastrophic at high noise levels
We can also get an idea of the dominance of each modality by looking at the fusion-weight values that produce the ‘best’ lines (ideal adaptive fusion)
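The two ideas on this slide can be made concrete with a short sketch. It reads "catastrophic" as a fused result that falls below the better single modality, which is one common usage; the slides do not define the term. The accuracy figures are invented placeholders, not results from the presentation.

```python
def best_weight(accuracy_by_weight):
    """Return the (weight, accuracy) pair with the highest accuracy.

    Choosing this weight separately at each noise level gives the
    'best' line, i.e. an upper bound for adaptive fusion.
    """
    return max(accuracy_by_weight.items(), key=lambda item: item[1])


def is_catastrophic(fused_accuracy, audio_accuracy, video_accuracy):
    """True when fusion does worse than the better single modality (assumed definition)."""
    return fused_accuracy < max(audio_accuracy, video_accuracy)


# Hypothetical accuracies at one noise level; weight 0 is video only, 1 is audio only.
accuracy_by_weight = {0.0: 0.60, 0.25: 0.72, 0.5: 0.80, 0.75: 0.78, 1.0: 0.70}

weight, accuracy = best_weight(accuracy_by_weight)
video_alone = accuracy_by_weight[0.0]
audio_alone = accuracy_by_weight[1.0]
```

A best weight near 0 at a given noise level indicates that the visual modality dominates there, and a best weight near 1 indicates that the acoustic modality dominates.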
20
‘Best’ Fusion
21
Conclusion and Further Work
PCA-based visual features are mostly person-dependent
– Should be used with care in visual speech recognition tasks
It is believed that this dependency stems from the large amount of static person-specific information captured along with the dynamic lip configuration
– Skin colour, facial hair, etc.
Visual information for speech recognition is only useful in high noise situations
22
Conclusion and Further Work
Even at very low levels of acoustic noise, visual speech information can provide similar performance to acoustic information for speaker recognition
Adaptive fusion for speaker recognition should therefore be biased towards visual features for best performance
Further study is needed into methods of improving the visual modality for speech recognition by focusing more on the dynamic speech-related information
– Mean-image removal, optical flow, contour representations
23
References
(Butler2003) D. Butler, C. McCool, M. McKay, S. Lowther, V. Chandran, and S. Sridharan, "Robust Face Localisation Using Motion, Colour and Fusion," presented at the Seventh International Conference on Digital Image Computing: Techniques and Applications (DICTA 2003), Macquarie University, Sydney, Australia, 2003.
(Lucey2004) P. Lucey, T. Martin, and S. Sridharan, "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments," presented at SST 2004, Sydney, Australia, 2004.
(Patterson2002) E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: a new audio-visual database for multimodal human-computer interface research," presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), 2002.
(Young2002) S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3.2 ed. Cambridge, UK: Cambridge University Engineering Department, 2002.
(Wark2001) T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, vol. 11, pp. 169-186, 2001.
24
Questions?