1
ICCS-NTUA: WP1+WP2
Prof. Petros Maragos
NTUA, School of ECE
Computer Vision, Speech Communication and Signal Processing Research Group
URL: http://cvsp.cs.ntua.gr
HIWIRE
2
ICCS - NTUA HIWIRE Meeting, July 2006
HIWIRE Involved CVSP Members
Group Leader: Prof. Petros Maragos
Ph.D. Students / Graduate Research Assistants:
- D. Dimitriadis (speech: recognition, modulations)
- V. Pitsikalis (speech: recognition, fractals/chaos, fusion)
- A. Katsamanis (speech: modulations, statistical processing, recognition, fusion)
- G. Papandreou (vision: PDEs, active contours, level sets, AV-ASR, fusion)
- G. Evangelopoulos (vision/speech: texture, modulations, fractals)
- S. Leukimmiatis (speech: statistical processing, microphone arrays)
3
ICCS-NTUA Tasks Involvement
WP1: Environment and Sensor Robustness (26MM)
- Task 1: Sensor Integration & Independence (11MM)
  - Subject 1: Multi-Microphone Systems (5MM)
  - Subject 5: Multi-Modal Features (audio-visual) (6MM)
- Task 2: Noise Independence (15MM)
  - Subject 2: Advanced Signal Processing (15MM)
WP2: User Robustness (8MM)
- Task 1: Improved Speaker Independence (4MM)
- Task 2: Rapid Speaker Adaptation (4MM)
WP3: System Integration (4MM)
WP4: Evaluation (5MM)
WP5: Exploitation and dissemination (1MM)
4
ICCS-NTUA in HIWIRE
Evaluation
- Databases & Baseline: Completed
Platform
- Front-end Release: 1st Version
WP1
- Noise Robust Features: Completed
- Multi-mic. array Enhancement: Prelim. Results
- Fusion: Prelim. Results
- Audio-Visual ASR: Baseline + Adv. Visual Features
- VAD: Completed + Integration
WP2
- VTLN Platform Integration: Completed
- Speaker Normalization Research: Prelim. Results
- Non-native Speech Database: Completed
6
HIWIRE Advanced Front-end: Challenges
Points Considered during Implementation
- Modular Architecture
- Implementation in C-Code
- Incorporation of Different Ideas/Algorithms
- User-friendly interface providing additional options dealing with on-site demands of the project
7
HIWIRE Advanced Front-end: Options
[Flowchart: Speech Signals -> (1) Want VAD? (Yes: LTSD-VAD / MTE-VAD) -> (2) Speech Pre-Processing (Denoising): Want Denoising? (Yes: Wiener Denoising) -> (3) Speech Processing (Features): MFCC / TECC]
Support for Input Speech Signals
- Different Sampling Frequencies: 8 kHz, 11 kHz, 16 kHz
- Different Byte-Ordering: Little-endian, Big-endian
- Different Input File Formats: RAW, NIST, HTK
Provides Flags/Options:
- Preprocessing: Smoothing of Speech Signals, Hamming Windowing, Pre-emphasis
- Denoising/VAD Algorithms: LTSD-VAD Algorithm (UGR), MTE-VAD Algorithm (ICCS-NTUA), Wiener Denoising Algorithm (used only with a VAD algorithm)
- Output Features: MFCC, TECC, C0 or logE
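The preprocessing options listed above (byte ordering, pre-emphasis, Hamming windowing) can be sketched as follows. This is a minimal illustration in Python, not the project's C front-end: the function names, the 16-bit sample width, the 0.97 pre-emphasis coefficient and the frame sizes are all assumptions made for the example.

```python
import math
import struct


def decode_raw_pcm(data, big_endian=False):
    """Decode headerless PCM bytes (assumed 16-bit signed) into a list of ints."""
    count = len(data) // 2
    fmt = (">" if big_endian else "<") + str(count) + "h"
    return list(struct.unpack(fmt, data[: 2 * count]))


def pre_emphasis(samples, coeff=0.97):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1] (first sample unchanged)."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, frame_len, frame_shift):
    """Cut the signal into overlapping frames and apply a Hamming window to each."""
    window = hamming_window(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
        start += frame_shift
    return frames
```

At 8 kHz, a 25 ms frame with a 10 ms shift would correspond to `frame_signal(x, 200, 80)`; these values are common defaults, not settings taken from the slides.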
8
HIWIRE Advanced Front-end: Things to Be Done
- Script is in testing phase
- Create a CVS repository where additional modules can be included
- Test further on speech databases
  - Evaluation in progress
  - Fine-tuning is necessary
- Final version should be faster (real-time processing)
- Incorporate it in the HIWIRE platform
9
Aurora 3 - Spanish
- Connected-Digits, Sampling Frequency 8 kHz
Training Set:
- WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
- MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
- HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)
Testing Set:
- WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
- MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
- HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits
Setup:
- 2 Back-end ASR Systems (HTK and BLasr)
- Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
- All-Pair, Unweighted Grammar (or Word-Pair Grammar)
- Performance Criterion: Word (digit) Accuracy Rates
10
Databases: Aurora 2
Task: Speaker Independent Recognition of Digit Sequences
- TI-Digits at 8 kHz
Training (8440 Utterances per scenario, 55M/55F)
- Clean (8 kHz, G712)
- Multi-Condition (8 kHz, G712)
  - 4 Noises (artificial): subway, babble, car, exhibition
  - 5 SNRs: 5, 10, 15, 20 dB, clean
Testing, artificially added noise
- 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean]
- A: noises as in multi-cond. train., G712 (28028 Utters)
- B: restaurant, street, airport, train station, G712 (28028 Utters)
- C: subway, street (MIRS) (14014 Utters)
11
ICCS-NTUA in HIWIRE: 1st, 2nd Year
Evaluation
- Databases & Baseline: Completed
Platform
- Front-end Release: 1st Version
WP1
- Noise Robust Features: Completed
- Multi-mic. array Enhancement: Prelim. Results
- Fusion: Prelim. Results
- Audio-Visual ASR: Baseline + Adv. Visual Features
- VAD: Completed + Integration?
WP2
- VTLN Platform Integration: Completed
- Speaker Normalization Research: Prelim. Results
- Non-native Speech Database: Completed
12
Microphone Arrays
Multi-channel Speech Enhancement for Diffuse Noise Fields
- MVDR (Minimum Variance Distortionless Response) Beamforming
- Single-Channel Linear and Non-linear Post-Filtering
  - The MSE criterion leads to the linear Wiener post-filter.
  - The MSE STSA and MSE log-STSA criteria lead to non-linear post-filters.
13
Microphone Arrays
The overall speech enhancement system includes the following steps:
- The noisy channel inputs are fed into a time alignment module (each input channel has a different propagation path).
- The time-aligned noisy observations are projected to a single-channel output with minimum noise variance by the MVDR beamformer.
- The output of the beamformer is further processed by a post-filter chosen according to the speech enhancement criterion in use (MSE, MSE STSA, MSE log-STSA).
- The post-filters depend on second-order statistics of the source and noise signals, so an estimation scheme for these statistics has to be developed.
Results on CMU Database: 10 speakers (13 utterances), diffuse noise
- SSNR Enhancement: SSNR_output - E[SSNR_input], where E[.] is the mean over the N input channels.
- LAR, LSD, IS, LLR: low values signify high speech quality. These measures have been found to correlate highly with human perception.
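The beamforming step and the SSNR enhancement measure can be illustrated with a small sketch. It assumes the channels are already time-aligned, so the steering vector is all ones, and that a real-valued noise covariance matrix is available; a practical beamformer would typically work per frequency bin with complex values. All function names here are hypothetical and the linear solver is a plain Gaussian elimination written only to keep the example self-contained.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("noise covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = aug[row][n] - sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / aug[row][row]
    return x


def mvdr_weights(noise_cov):
    """MVDR weights w = R^-1 d / (d^T R^-1 d) with d = ones (time-aligned channels)."""
    steering = [1.0] * len(noise_cov)
    r_inv_d = solve_linear(noise_cov, steering)
    denom = sum(r_inv_d)  # equals d^T R^-1 d because d is all ones
    return [v / denom for v in r_inv_d]


def beamform(aligned_channels, weights):
    """Weighted sum of the time-aligned channels, sample by sample."""
    length = len(aligned_channels[0])
    return [sum(w * ch[t] for w, ch in zip(weights, aligned_channels))
            for t in range(length)]


def ssnr_enhancement(ssnr_output, ssnr_inputs):
    """SSNR enhancement: output SSNR minus the mean SSNR of the input channels."""
    return ssnr_output - sum(ssnr_inputs) / len(ssnr_inputs)
```

The weights always sum to one, which is the distortionless constraint for an all-ones steering vector; channels with lower noise variance receive larger weights.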
14
Results: CMU Database
15
Spectrograms: CMU Database
16
Multi-Microphone ASR Experiments
Details on Setup of ASR Tasks:
- 700 Sentences for Training and 300 for Testing
- 12-state, left-right HMM w. Gaussian mixtures
- All-pair, unweighted grammar
- MFCC+C0+D+DD (39 coefficients in total)
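The 39-coefficient count is consistent with 13 static coefficients (12 MFCC plus C0) extended by first (D) and second (DD) time derivatives; the 12+1 split is an assumption, since the slide gives only the total. A minimal sketch of the usual regression-based delta computation follows; the window of 2 frames and the edge handling by frame replication are assumed defaults, and the function names are hypothetical.

```python
def deltas(frames, window=2):
    """Regression deltas: d_t = sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2)."""
    num = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(num):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, num - 1)][d]  # replicate last frame at the end
                prv = frames[max(t - k, 0)][d]        # replicate first frame at the start
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append delta (D) and delta-delta (DD) coefficients to each static frame."""
    first = deltas(static)
    second = deltas(first)
    return [s + d + dd for s, d, dd in zip(static, first, second)]
```

With 13 static coefficients per frame, `add_dynamic_features` returns 39 values per frame.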
18
Multi-Cue Feature Fusion
Goal: Fuse heterogeneous information streams optimally & adaptively
Our approach:
- Explicitly model uncertainty in all feature measurements (due to noise or model fitting errors)
- Adjust model training to account for uncertainty
- Dynamically compensate for feature uncertainty during decoding
Feature uncertainty estimation in the AV-ASR case:
- For the Audio Stream/MFCC: speech enhancement process
- For the Visual Stream: model fitting variance
Properties:
- Adaptation at the frame level
- Explains and generalizes cue weighting through stream exponents
- Integrates with a wide range of models, e.g. GMM, HMM
- Applicable to both audio-audio and audio-visual scenarios
- Can be combined with asynchronous models, e.g. Product-HMM
19
Measurement Noise and Adaptive Fusion
- Conventional View: Features are directly observable
- Our View: We can only measure noise-corrupted features
[Figure: graphical models with nodes C, X (conventional view) and C, X, Y (our view)]
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06
20
EM-Training with Partially Known Features
- Even training data can be uncertain
[Figure: graphical models with nodes C, X (conventional view) and C, X, Y (our view), each node marked Hidden or Observed]
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06
21
EM-Training: Results for GMM
[Equations for the E-Step and M-Step, annotated:]
- Filtered feature estimate
- Uncertainty-compensated scores
- Similar to conventional update rules
Formulas for HMM are similar
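The slide's equations are not reproduced here, so the following is only a sketch of what a "filtered feature estimate" looks like under the simplest assumptions: a Gaussian component with diagonal covariance, and an observation y = x + e with zero-mean Gaussian measurement noise of known variance. In that case the posterior mean of the clean feature x is a variance-weighted average of the observation and the component mean. The function names are hypothetical, and this textbook result is not claimed to be the authors' exact update rule.

```python
def filtered_estimate(y, mean, model_var, meas_var):
    """Posterior mean of x given y = x + e, per dimension (diagonal Gaussians)."""
    return [(e * m + s * yi) / (s + e)
            for yi, m, s, e in zip(y, mean, model_var, meas_var)]


def posterior_variance(model_var, meas_var):
    """Posterior variance of x given y, per dimension."""
    return [s * e / (s + e) for s, e in zip(model_var, meas_var)]
```

When the measurement variance is zero the estimate equals the observation, which recovers the conventional case of directly observable features; as the measurement variance grows, the estimate moves toward the component mean.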
22
Decoding & Uncertain Features
- Variance-Compensated (“Soft”) Scoring
- Probabilistic Justification for Stream Exponents
- Relative Measurement Error
- Adaptation at each frame: stream/class/mixture dependent stream weights
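Variance-compensated scoring can be sketched for diagonal-covariance Gaussians as follows: the measurement variance of each feature is added to the model variance before the likelihood is evaluated. This is a minimal illustration under the assumption of additive, zero-mean Gaussian measurement noise; the function names are hypothetical and the code is not taken from the HIWIRE platform.

```python
import math


def log_gaussian_diag(y, mean, var):
    """Log-density of a diagonal-covariance Gaussian at y."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
               for yi, m, v in zip(y, mean, var))


def compensated_log_score(y, mean, model_var, meas_var):
    """Score y against N(mean, model_var + meas_var) instead of N(mean, model_var)."""
    total_var = [s + e for s, e in zip(model_var, meas_var)]
    return log_gaussian_diag(y, mean, total_var)


def gmm_compensated_log_score(y, weights, means, variances, meas_var):
    """Log-likelihood of a GMM whose components are all variance-compensated."""
    logs = [math.log(w) + compensated_log_score(y, m, v, meas_var)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))
```

For two classes with equal variances, increasing the measurement variance shrinks the gap between their log-scores, so an unreliable stream contributes less to the decision. This behaves like a smaller stream exponent, which is the intuition behind the probabilistic justification mentioned above.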
23
Audio-visual Asynchrony Modeling
- Multi-stream HMM
- Product HMM
Ref: Gravier et al., 2002
24
Fusion: Multi-Cue Audio-Audio
Feature Uncertainty for Audio features
Baseline Audio Features: MFCC
- Enhancement using GMM of clean speech and Vector Taylor Series Approximation
- Uncertainty is Gaussian with Variance given by the enhancement process
- Used for Audio-Visual Fusion
Fractal Audio Features: MFD
- On-going research applying a similar framework (GMM, VTS)
25
MFD: From Noisy Speech to Feature Uncertainty
Ongoing Research: Noise Compensation for MFD
[Figure: MFD curves labeled Clean, Noise, True Noisy and Estimated Noisy; White Noise (0 dB)]
27
Showcase: Audio-Visual Speech Recognition
- Both shape & texture can assist lipreading
- Active Appearance Models for face modeling
- Shape and texture of faces “live” in low-dim manifolds
- Features: AAM Fitting (nonlinear least squares problem)
- Visual feature uncertainty is related to the sensitivity of the least-squares solution
28
Demo: AAM fitting and uncertainty estimates
The visual front-end supplies both features and their respective uncertainty.
29
Audio-Visual ASR: Database
Subset of CUAVE database used:
- 36 speakers (30 training, 6 testing)
- 5 sequences of 10 connected digits per speaker
- Training set: 1500 digits (30x5x10)
- Test set: 300 digits (6x5x10)
The CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)
CUAVE was kindly provided by Clemson University
30
Evaluation on the CUAVE Database
31
Audio-Visual Speech Classification with MS-HMM
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06
32
AV Digit Classification Results (Word Accuracy)
- Audio: MFCC_D_Z (26 features)
- Visual: 6 shape + 12 texture AAM coefficients
- AV MS-HMM: AudioVisual Multistream HMM, weights (1,1)
- AV MS-HMM, Var-Comp: AudioVisual Multistream HMM + Variance Compensation
- AV P-HMM: AudioVisual Product HMM, weights (1,1)
- AV P-HMM, Var-Comp: AudioVisual Product HMM + Variance Compensation

SNR (babble) | Audio | Visual | AV MS-HMM | AV MS-HMM Var-Comp | AV P-HMM | AV P-HMM Var-Comp
Clean        | 100%  | 68.7%  | 95.1%     | 97.0%              | 95.4%    | 99.6%
10 dB        | 92.8% | -      | 88.3%     | 90.2%              | 90.6%    | 92.5%
5 dB         | 73.9% | -      | 84.5%     | 86.8%              | 87.2%    | 89.1%
0 dB         | 54.7% | -      | 79.6%     | 81.1%              | 83.8%    | 82.6%

Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP’06
33
AV-ASR: Results with Uncertain Training
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06
35
Databases: Aurora 4
Task: 5000 Word, Continuous Speech Recognition
- WSJ0: (16 / 8 kHz) + Artificially Added Noise
- 2 microphones: Sennheiser, Other
- Filtering: G712, P341
- Noises: Car, Babble, Restaurant, Street, Airport, Train Station
Training (7138 Utterances per scenario)
- Clean: Sennheiser mic.
- Multi-Condition: Sennheiser and Other mic., 75% w. artificially added noise @ SNR: 10-20 dB
- Noisy: Sennheiser, artificially added noise, SNR: 10-20 dB
Testing (330 Utterances - 166 Utterances each; 8 Speakers)
- SNR: 5-15 dB
- 1-7: Sennheiser microphone
- 8-14: Other microphone
36
VTLN on the Platform
Warping in the front-end
- Piecewise Linear Warping Function
- Warping in the filterbank domain by stretching or compressing the frequency axis
Training
- HTK Implementation
Testing
- Fast Implementation using a GMM representing normalized speech to estimate warping factors per utterance
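A piecewise linear warping function of the kind named above can be sketched as follows. The frequency axis is scaled by a factor alpha up to a cutoff, and a second linear segment maps the remaining band so that the Nyquist frequency stays fixed. The cutoff at 85% of Nyquist, the function names and the grid search over candidate factors are assumptions for illustration; `score_fn` stands in for the GMM likelihood of the warped utterance and is not an existing API.

```python
def piecewise_linear_warp(freq, alpha, nyquist, cutoff_ratio=0.85):
    """Warp a frequency: scale by alpha below the cutoff, then map linearly to Nyquist."""
    cutoff = cutoff_ratio * nyquist
    if freq <= cutoff:
        return alpha * freq
    # Second segment joins (cutoff, alpha * cutoff) to (nyquist, nyquist).
    slope = (nyquist - alpha * cutoff) / (nyquist - cutoff)
    return alpha * cutoff + slope * (freq - cutoff)


def warp_filterbank_centers(centers, alpha, nyquist):
    """Apply the warp to every filterbank center frequency."""
    return [piecewise_linear_warp(f, alpha, nyquist) for f in centers]


def select_warp_factor(candidates, score_fn):
    """Pick the candidate warping factor with the highest score for one utterance."""
    return max(candidates, key=score_fn)
```

A typical candidate grid would be factors between about 0.88 and 1.12 in small steps; this range is a common choice and is not stated on the slide.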
37
VTLN on the Platform, Results
38
VTLN Research, TECC Features
- Teager Energy Cepstrum Coefficients are energy measurements at the output of a Gammatone filterbank, similar to MFCC
- VTLN can be applied in a similar manner
- The Bark scale, along which the filters are uniformly positioned, is stretched or shrunk to achieve warping
- Evaluation is currently in progress
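The energy measurement underlying TECC can be illustrated with the discrete Teager-Kaiser energy operator, Psi[x(n)] = x(n)^2 - x(n-1) x(n+1). The sketch below applies it to one band signal and takes the log of the mean energy. The Gammatone filtering and the final cepstral transform are omitted, and treating the log mean energy per band as the input to that transform is an assumption about the pipeline, not a detail given on the slide. Function names are hypothetical.

```python
import math


def teager_energy(signal):
    """Discrete Teager-Kaiser energy: x[n]^2 - x[n-1] * x[n+1] for interior samples."""
    if len(signal) < 3:
        raise ValueError("need at least 3 samples")
    return [signal[n] ** 2 - signal[n - 1] * signal[n + 1]
            for n in range(1, len(signal) - 1)]


def mean_log_teager_energy(band_signal, floor=1e-12):
    """Log of the mean Teager energy of one filterbank band (floored to avoid log(0))."""
    energy = teager_energy(band_signal)
    mean_energy = sum(energy) / len(energy)
    return math.log(max(mean_energy, floor))
```

For a pure tone A cos(W n + phi), the operator returns the constant A^2 sin^2(W), so it tracks both amplitude and frequency, unlike the squared-amplitude energy used in MFCC.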
39
VTLN Research, using Formants
40
Raw Formants - Dynamic Programming
[Figure: dynamic programming trellis over raw formant candidates; labels: time, node]
41
Formant Tracking
42
ICCS-NTUA in HIWIRE: 1st, 2nd Year
Evaluation
- Databases & Baseline: Completed
Platform
- Release: 1st Version
WP1
- Noise Robust Features: Completed
- Multi-mic. array Enhancement: Prelim. Results
- Fusion: Prelim. Results
- Audio-Visual ASR: Baseline + Adv. Visual Features
- VAD: Completed + Integration?
WP2
- VTLN Platform Integration: Completed
- Speaker Normalization Research: Prelim. Results
- Non-native Speech Database: Completed
43
Next...
- Fusion: Audio+Audio, Audio+Visual, Nonlinear Features+Visual
- Visual Front-end
- VAD + Nonlinear Features