
ICCS-NTUA: WP1+WP2
Prof. Petros Maragos
NTUA, School of ECE
URL: http://cvsp.cs.ntua.gr
Computer Vision, Speech Communication and Signal Processing Research Group
HIWIRE

ICCS - NTUA, HIWIRE Meeting, July 2006

Involved CVSP Members
Group Leader: Prof. Petros Maragos
Ph.D. Students / Graduate Research Assistants:
• D. Dimitriadis (speech: recognition, modulations)
• V. Pitsikalis (speech: recognition, fractals/chaos, fusion)
• A. Katsamanis (speech: modulations, statistical processing, recognition, fusion)
• G. Papandreou (vision: PDEs, active contours, level sets, AV-ASR, fusion)
• G. Evangelopoulos (vision/speech: texture, modulations, fractals)
• S. Leukimmiatis (speech: statistical processing, microphone arrays)

ICCS-NTUA Tasks Involvement
WP1: Environment and Sensor Robustness (26MM)
• Task 1: Sensor Integration & Independence (11MM)
  – Subject 1: Multi-Microphone Systems (5MM)
  – Subject 5: Multi-Modal Features (audio-visual) (6MM)
• Task 2: Noise Independence (15MM)
  – Subject 2: Advanced Signal Processing (15MM)
WP2: User Robustness (8MM)
• Task 1: Improved Speaker Independence (4MM)
• Task 2: Rapid Speaker Adaptation (4MM)
WP3: System Integration (4MM)
WP4: Evaluation (5MM)
WP5: Exploitation and dissemination (1MM)

ICCS-NTUA in HIWIRE
Evaluation
• Databases & Baseline: Completed
Platform
• Front-end Release: 1st Version
WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration
WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed

HIWIRE Advanced Front-end: Challenges
Points Considered during Implementation
• Modular Architecture
• Implementation in C-Code
• Incorporation of Different Ideas/Algorithms
• User-friendly interface providing additional options dealing with on-site demands of the project

HIWIRE Advanced Front-end: Options
[Flowchart: Speech Signals; Want VAD? (No / Yes: LTSD-VAD or MTE-VAD); Want Denoising? (No / Yes: Wiener Denoising); feature outputs MFCC/TECC and MFCC; stages labeled Speech Pre-Processing (Denoising) and Speech Processing (Features)]
Support for Input Speech Signals
• Different Sampling Frequencies: 8 kHz, 11 kHz, 16 kHz
• Different Byte-Ordering: Little-endian, Big-endian
• Different Input File Formats: RAW, NIST, HTK
Provides Flags/Options
• Preprocessing: Smoothing of Speech Signals, Hamming Windowing, Pre-emphasis
• Denoising/VAD Algorithms:
  – LTSD-VAD Algorithm (UGR)
  – MTE-VAD Algorithm (ICCS-NTUA)
  – Wiener Denoising Algorithm (used only with a VAD algorithm)
• Output Features: MFCC, TECC, C0 or logE
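The input-handling and preprocessing options listed on this slide can be illustrated with a minimal Python sketch. It is not the project's C front-end: the function names, the 0.97 pre-emphasis coefficient, and the frame/hop sizes are assumptions chosen for illustration only.

```python
import math
import struct


def decode_pcm16(raw, big_endian=False):
    """Decode headerless (RAW) 16-bit signed PCM bytes with the given byte order."""
    count = len(raw) // 2
    fmt = (">" if big_endian else "<") + str(count) + "h"
    return list(struct.unpack(fmt, raw[: 2 * count]))


def pre_emphasize(x, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample passes through.

    The coefficient 0.97 is a common default, assumed here.
    """
    if not x:
        return []
    return [float(x[0])] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def hamming(n):
    """Return an n-point Hamming window."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(x, frame_len, hop):
    """Cut x into overlapping frames and multiply each by a Hamming window."""
    win = hamming(frame_len)
    return [
        [x[start + i] * win[i] for i in range(frame_len)]
        for start in range(0, len(x) - frame_len + 1, hop)
    ]
```

At 8 kHz, a 25 ms frame with a 10 ms hop would correspond to `windowed_frames(samples, 200, 80)`; these durations are typical values, not ones stated on the slide.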

HIWIRE Advanced Front-end: Things to Be Done
• Script is in Testing Phase
• Create a CVS where Additional Modules should be included
• Tested Further in Speech Databases
• Evaluation in progress
• Fine-Tuning is Necessary
• Final Version should be Faster (Real-Time Processing)
• Incorporate it in the HIWIRE Platform

Aurora 3 - Spanish
Connected-Digits, Sampling Frequency 8 kHz
Training Set:
• WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
• MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
• HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)
Testing Set:
• WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
• MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
• HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits
2 Back-end ASR Systems (HTK and BLasr)
Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
All-Pair, Unweighted Grammar (or Word-Pair Grammar)
Performance Criterion: Word (digit) Accuracy Rates
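The word (digit) accuracy criterion can be sketched as follows. The sketch assumes the HTK-style definition, Accuracy = 100 · (N − S − D − I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions and insertions of a minimum-edit alignment; the function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment.

    Each DP cell holds (total_errors, subs, dels, ins); ties between equally
    cheap alignments are broken arbitrarily, which does not change the total.
    """
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            cost, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, s, d, n)          # match
            else:
                diagonal = (cost + 1, s + 1, d, n)  # substitution
            cost, s, d, n = prev[j]
            deletion = (cost + 1, s, d + 1, n)
            cost, s, d, n = cur[j - 1]
            insertion = (cost + 1, s, d, n + 1)
            cur.append(min(diagonal, deletion, insertion))
        prev = cur
    return prev[-1][1:]


def word_accuracy(ref, hyp):
    """Word accuracy in percent; it can be negative when insertions dominate."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (len(ref) - subs - dels - ins) / len(ref)
```

For example, `word_accuracy("1 2 3 4".split(), "1 3 4 5".split())` counts one deletion and one insertion against four reference digits.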

Databases: Aurora 2
Task: Speaker Independent Recognition of Digit Sequences
TI-Digits at 8 kHz
Training (8440 Utterances per scenario, 55M/55F)
• Clean (8 kHz, G712)
• Multi-Condition (8 kHz, G712)
  – 4 Noises (artificial): subway, babble, car, exhibition
  – 5 SNRs: 5, 10, 15, 20 dB, clean
Testing, artificially added noise
• 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean]
• A: noises as in multi-cond. train., G712 (28028 Utters)
• B: restaurant, street, airport, train station, G712 (28028 Utters)
• C: subway, street (MIRS) (14014 Utters)

ICCS-NTUA in HIWIRE: 1st, 2nd Year
Evaluation
• Databases & Baseline: Completed
Platform
• Front-end Release: 1st Version
WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration?
WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed

Microphone Arrays
Multi-channel Speech Enhancement for Diffuse Noise Fields
– MVDR (Minimum Variance Distortionless Response) Beamforming
– Single-Channel Linear and Non-linear Post-Filtering
• The MSE criterion leads to the linear Wiener post-filter.
• The MSE STSA and MSE log-STSA criteria lead to non-linear post-filters.
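For reference, the two linear building blocks named on this slide have well-known closed forms. The notation below is assumed rather than taken from the slides: $\mathbf{d}$ is the steering vector, $\boldsymbol{\Phi}_{nn}$ the noise spatial covariance matrix, and $\phi_{ss}$, $\phi_{nn}$ the speech and residual-noise power spectral densities at the beamformer output, all per frequency bin $\omega$.

```latex
% MVDR: minimize the output noise power subject to a distortionless constraint
\mathbf{w}_{\mathrm{MVDR}}(\omega)
  = \arg\min_{\mathbf{w}} \; \mathbf{w}^{H}\boldsymbol{\Phi}_{nn}(\omega)\,\mathbf{w}
    \quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{d}(\omega) = 1
  = \frac{\boldsymbol{\Phi}_{nn}^{-1}(\omega)\,\mathbf{d}(\omega)}
         {\mathbf{d}^{H}(\omega)\,\boldsymbol{\Phi}_{nn}^{-1}(\omega)\,\mathbf{d}(\omega)}

% Single-channel Wiener post-filter applied to the beamformer output
H_{\mathrm{W}}(\omega) = \frac{\phi_{ss}(\omega)}{\phi_{ss}(\omega) + \phi_{nn}(\omega)}
```

After time alignment of the channels, $\mathbf{d}$ reduces to a vector of ones. The non-linear STSA and log-STSA post-filters have more involved gain functions that are not reproduced here.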

Microphone Arrays
The Overall Speech Enhancement System includes the following steps:
• The noisy channel inputs are fed into a time alignment module (the propagation path differs for every input channel).
• The time-aligned noisy observations are projected to a single-channel output with minimum noise variance through the MVDR beamformer.
• The output of the beamformer is further processed by a post-filter according to the speech enhancement criterion used (MSE, MSE STSA, MSE log-STSA).
Since the post-filters depend on second-order statistics of the source and noise signals, an estimation scheme for these statistics has to be developed.
Results on CMU Database
• 10 Speakers (13 utterances)
• Diffuse Noise
• SSNR Enhancement: SSNR_output - E[SSNR_input] (E[] stands for the mean value over the N input channels)
• LAR, LSD, IS, LLR: low values signify high speech quality. These measures are found to correlate highly with human perception.
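The SSNR enhancement figure of merit can be sketched in Python as below. The slide gives only the definition SSNR_output − E[SSNR_input]; the frame length and the clamping of per-frame SNR to [−10, 35] dB are common conventions assumed here, and the function names are hypothetical.

```python
import math


def segmental_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB of `processed` against the `clean` reference.

    Per-frame values are clamped to [lo, hi] dB (an assumed convention);
    frames in which the reference is all zeros are skipped.
    """
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        out = processed[start:start + frame_len]
        signal = sum(r * r for r in ref)
        error = sum((r - o) ** 2 for r, o in zip(ref, out))
        if signal == 0.0:
            continue
        snr = hi if error == 0.0 else 10.0 * math.log10(signal / error)
        values.append(min(hi, max(lo, snr)))
    if not values:
        raise ValueError("no usable frames")
    return sum(values) / len(values)


def ssnr_enhancement(clean, input_channels, output, **kwargs):
    """SSNR of the enhanced output minus the mean SSNR over the input channels."""
    mean_input = sum(
        segmental_snr(clean, channel, **kwargs) for channel in input_channels
    ) / len(input_channels)
    return segmental_snr(clean, output, **kwargs) - mean_input
```

The sketch assumes a time-aligned clean reference is available for every channel, which holds for databases with artificially controlled conditions but not in general.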

Results: CMU Database

Spectrograms: CMU Database

Multi-Microphone ASR Experiments
Details on Setup of ASR Tasks:
• 700 Sentences for Training and 300 for Testing
• 12-state, left-right HMM w. Gaussian mixtures
• All-pair, unweighted grammar
• MFCC+C0+D+DD (39 coefficients in total)

Multi-Cue Feature Fusion
Goal:
• Fuse heterogeneous information streams optimally & adaptively
Our approach:
• Explicitly model uncertainty in all feature measurements (due to noise or model fitting errors)
• Adjust model training to accommodate uncertainty
• Dynamically compensate feature uncertainty during decoding
• Feature uncertainty estimation in the AV-ASR case:
  – For the Audio Stream/MFCC: speech enhancement process
  – For the Visual Stream: model fitting variance
Properties:
• Adaptation at the frame level
• Explains and generalizes cue weighting through stream exponents
• Integrates with a wide range of models, e.g. GMM, HMM
• Applicable to both audio-audio and audio-visual scenarios
• Can be combined with asynchronous models, e.g. Product-HMM

Measurement Noise and Adaptive Fusion
• Conventional View: Features are directly observable [graphical model with nodes C, X]
• Our View: We can only measure noise-corrupted features [graphical model with nodes C, X, Y]
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06

EM-Training with Partially Known Features
[Figure: graphical models for the Conventional View (nodes C, X) and Our View (nodes C, X, Y), with nodes marked Hidden or Observed]
Even training data can be uncertain
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06

EM-Training: Results for GMM
[Equations for the E-Step and M-Step not recovered; surviving annotations:]
• Filtered feature estimate
• Uncertainty-compensated scores
• Similar to conventional update rules
Formulas for HMM are similar
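The slide's own formulas did not survive extraction. As a hedged reconstruction from standard results (not necessarily the authors' exact notation), assume each observed feature is $y_t = x_t + e_t$ with known measurement-noise covariance $\Sigma_{e,t}$, and that the clean feature follows a GMM with weights $\pi_m$, means $\mu_m$ and covariances $\Sigma_m$. EM then takes the following form:

```latex
% E-step: uncertainty-compensated responsibilities
r_{t,m} = \frac{\pi_m \, \mathcal{N}\!\left(y_t;\, \mu_m,\, \Sigma_m + \Sigma_{e,t}\right)}
               {\sum_{k} \pi_k \, \mathcal{N}\!\left(y_t;\, \mu_k,\, \Sigma_k + \Sigma_{e,t}\right)}

% E-step: filtered feature estimate and its posterior covariance
\hat{x}_{t,m} = \mu_m + \Sigma_m \left(\Sigma_m + \Sigma_{e,t}\right)^{-1} \left(y_t - \mu_m\right)
\qquad
P_{t,m} = \Sigma_m - \Sigma_m \left(\Sigma_m + \Sigma_{e,t}\right)^{-1} \Sigma_m

% M-step: conventional-looking updates with y_t replaced by the filtered estimate
\pi_m = \frac{1}{T} \sum_{t} r_{t,m}
\qquad
\mu_m = \frac{\sum_{t} r_{t,m}\, \hat{x}_{t,m}}{\sum_{t} r_{t,m}}
\qquad
\Sigma_m = \frac{\sum_{t} r_{t,m} \left[ \left(\hat{x}_{t,m} - \mu_m\right)\left(\hat{x}_{t,m} - \mu_m\right)^{\top} + P_{t,m} \right]}
                {\sum_{t} r_{t,m}}
```

Setting $\Sigma_{e,t} = 0$ gives $\hat{x}_{t,m} = y_t$ and $P_{t,m} = 0$, which recovers the conventional GMM update rules.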

Decoding & Uncertain Features
• Variance-Compensated (“Soft”) Scoring
• Probabilistic Justification for Stream Exponents
• Relative Measurement Error
• Adaptation at each frame: stream/class/mixture-dependent stream weights
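Variance-compensated scoring can be sketched for diagonal Gaussians as below. The sketch assumes additive zero-mean Gaussian measurement noise, so that the model variance is simply increased by the measurement variance. The `effective_stream_weight` helper illustrates one reading of the stream-exponent interpretation (the quadratic term of the log-score is scaled by σ² / (σ² + σ_e²)); that reading and all function names are assumptions, not the authors' code.

```python
import math


def log_gauss_diag(y, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(y, mean, var)
    )


def compensated_log_score(y, mean, var, meas_var):
    """Score a noisy observation y with per-dimension measurement variance.

    With y = x + e and e ~ N(0, meas_var), p(y | class) is Gaussian with
    the model variance increased by the measurement variance.
    """
    inflated = [v + e for v, e in zip(var, meas_var)]
    return log_gauss_diag(y, mean, inflated)


def effective_stream_weight(var, meas_var):
    """Scaling of the quadratic term, 1 / (1 + relative measurement error)."""
    return var / (var + meas_var)
```

Because the measurement variance can change at every frame and differs per stream, class and mixture component, the resulting weighting is adaptive in the sense described on the slide.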

Audio-visual Asynchrony Modeling
[Figure panels: Multi-stream HMM; Product HMM]
Ref: Gravier et al., 2002

Fusion: Multi-Cue Audio-Audio
Feature Uncertainty for Audio features
Baseline Audio Features: MFCC
• Enhancement using GMM of clean speech and Vector Taylor Series Approximation
• Uncertainty is Gaussian with Variance given by the enhancement process
• Used for Audio-Visual Fusion
Fractal Audio Features: MFD
• On-going research applying a similar framework (GMM, VTS)

MFD: From Noisy Speech to Feature Uncertainty
Ongoing Research: Noise Compensation for MFD
[Figure: curves labeled Clean, Noise, True Noisy, Estimated Noisy; White Noise (0 dB)]

Showcase: Audio-Visual Speech Recognition
• Both shape & texture can assist lipreading
• Active Appearance Models for face modeling
  – Shape and texture of faces “live” in low-dim manifolds
• Features: AAM Fitting (nonlinear least squares problem)
• Visual feature Uncertainty related to the sensitivity of the least-squares solution
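One common way to attach an uncertainty to a nonlinear least-squares fit is the Gauss-Newton covariance approximation sketched below. It is offered as an illustration of what "sensitivity of the least-squares solution" can mean, not as the authors' estimator; the symbols $r$, $J$, $N$ and $n$ are assumptions introduced here.

```latex
% AAM fitting as nonlinear least squares over the parameter vector p
\hat{p} = \arg\min_{p} \; \lVert r(p) \rVert^{2},
\qquad r(p) \in \mathbb{R}^{N}, \quad p \in \mathbb{R}^{n}

% First-order covariance of the estimate, with J the Jacobian of r at the solution
\operatorname{Cov}(\hat{p}) \approx \hat{\sigma}^{2} \left( J^{\top} J \right)^{-1},
\qquad
J = \left. \frac{\partial r}{\partial p} \right|_{p = \hat{p}},
\qquad
\hat{\sigma}^{2} = \frac{\lVert r(\hat{p}) \rVert^{2}}{N - n}
```

A flat cost surface (small singular values of $J$) yields a large covariance, so poorly constrained fits are reported as uncertain features.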

Demo: AAM fitting and uncertainty estimates
The visual front-end supplies both features and their respective uncertainty.

Audio-Visual ASR: Database
Subset of CUAVE database used:
• 36 speakers (30 training, 6 testing)
• 5 sequences of 10 connected digits per speaker
• Training set: 1500 digits (30x5x10)
• Test set: 300 digits (6x5x10)
The CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)
CUAVE was kindly provided by Clemson University

Evaluation on the CUAVE Database

Audio-Visual Speech Classification with MS-HMM
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO’06

AV Digit Classification Results (Word Accuracy)
• Audio: MFCC_D_Z (26 features)
• Visual: 6 shape + 12 texture AAM coefficients
• AV MS-HMM: AudioVisual Multistream HMM, weights (1,1)
• AV MS-HMM, Var-Comp: AudioVisual Multistream HMM + Variance Compensation
• AV P-HMM: AudioVisual Product HMM, weights (1,1)
• AV P-HMM, Var-Comp: AudioVisual Product HMM + Variance Compensation

SNR (babble)  Audio   Visual  AV MS-HMM  AV MS-HMM Var-Comp  AV P-HMM  AV P-HMM Var-Comp
Clean         100%    68.7%   95.1%      97.0%               95.4%     99.6%
10 dB         92.8%   -       88.3%      90.2%               90.6%     92.5%
5 dB          73.9%   -       84.5%      86.8%               87.2%     89.1%
0 dB          54.7%   -       79.6%      81.1%               83.8%     82.6%

Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP’06

AV-ASR: Results with Uncertain Training
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS’06

Databases: Aurora 4
Task: 5000 Word, Continuous Speech Recognition
WSJ0: (16 / 8 kHz) + Artificially Added Noise
2 microphones: Sennheiser, Other
Filtering: G712, P341
Noises: Car, Babble, Restaurant, Street, Airport, Train Station
Training (7138 Utterances per scenario)
• Clean: Sennheiser mic.
• Multi-Condition: Sennheiser and Other mic., 75% w. artificially added noise @ SNR: 10-20 dB
• Noisy: Sennheiser, artificially added noise, SNR: 10-20 dB
Testing (330 Utterances, 166 Utterances each; Speaker # = 8)
• SNR: 5-15 dB
• Sets 1-7: Sennheiser microphone
• Sets 8-14: Other microphone

VTLN on the Platform
Warping in the front-end
• Piecewise Linear Warping Function
• Warping in the filterbank domain by stretching or compressing the frequency axis
Training
• HTK Implementation
Testing
• Fast Implementation using a GMM representing normalized speech to estimate warping factors per utterance
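A piecewise linear warping function and a per-utterance warp-factor search can be sketched as follows. This is a generic two-segment formulation, not the platform's implementation: the break point at 85% of the Nyquist frequency, the function names, and the `score_fn` callback (standing in for the likelihood of the warped features under a GMM of normalized speech) are all assumptions.

```python
def warp_frequency(f, alpha, f_nyquist, break_ratio=0.85):
    """Two-segment piecewise-linear warp of [0, f_nyquist] onto itself.

    Below the break point the axis is scaled by alpha; above it a second
    linear segment brings the warped axis back to f_nyquist.
    """
    if alpha <= 0.0 or not 0.0 <= f <= f_nyquist:
        raise ValueError("alpha must be positive and f within [0, f_nyquist]")
    # Shrink the break point for alpha > 1 so the first segment stays in band.
    f0 = break_ratio * f_nyquist * min(1.0, 1.0 / alpha)
    if f <= f0:
        return alpha * f
    slope = (f_nyquist - alpha * f0) / (f_nyquist - f0)
    return alpha * f0 + slope * (f - f0)


def estimate_warp_factor(score_fn, alphas):
    """Grid search: return the candidate alpha with the highest score.

    `score_fn(alpha)` is a hypothetical callback that would re-extract the
    utterance's features with the given warp and return their log-likelihood.
    """
    return max(alphas, key=score_fn)
```

In a filterbank front-end the warp would be applied to the filter centre frequencies, which stretches or compresses the frequency axis without resampling the signal.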

VTLN on the Platform, Results

VTLN Research, TECC Features
• Teager Energy Cepstrum Coefficients are energy measurements at the output of a Gammatone filterbank, similar to MFCC
• VTLN can be applied in a similar manner
• The Bark scale, along which the filters are uniformly positioned, is stretched or shrunk to achieve warping
• Evaluation is currently in progress
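The energy measurement underlying TECC can be illustrated with the discrete Teager-Kaiser energy operator, Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch below shows only the operator and a per-frame mean; the Gammatone filterbank, log compression and cepstral transform of a full TECC front-end are omitted, and the function names are hypothetical.

```python
import math


def teager_energy(x):
    """Discrete Teager-Kaiser energy for the interior samples of x."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def mean_teager_energy(frame):
    """Average Teager energy of one frame (e.g. one filterbank channel output)."""
    values = teager_energy(frame)
    if not values:
        raise ValueError("frame needs at least three samples")
    return sum(values) / len(values)


# For a pure tone A*cos(W*n + phase) the operator returns the constant
# A^2 * sin(W)^2, so it tracks both amplitude and frequency.
tone = [2.0 * math.cos(0.3 * n + 0.1) for n in range(50)]
tone_energy = mean_teager_energy(tone)
```

Unlike the squared-amplitude energy used for MFCC, this measure grows with the frequency of the oscillation as well as with its amplitude.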

VTLN Research, using Formants

Raw Formants - Dynamic Programming
[Figure labels: time, node]

Formant Tracking

ICCS-NTUA in HIWIRE: 1st, 2nd Year
Evaluation
• Databases & Baseline: Completed
Platform
• Release: 1st Version
WP1
• Noise Robust Features: Completed
• Multi-mic. array Enhancement: Prelim. Results
• Fusion: Prelim. Results
• Audio-Visual ASR: Baseline + Adv. Visual Features
• VAD: Completed + Integration?
WP2
• VTLN Platform Integration: Completed
• Speaker Normalization Research: Prelim. Results
• Non-native Speech Database: Completed

Next...
Fusion
• Audio+Audio
• Audio+Visual
• Nonlinear Features+Visual
Visual Front-end
VAD + Nonlinear Features
