AMSP : Advanced Methods for Speech Processing An expression of Interest to set up a Network of Excellence in FP6 Prepared by members of COST-277 and colleagues Submitted by Marcos FAUNDEZ-ZANUY Presented here by Gérard CHOLLET GET-ENST/CNRS-LTCI
Outline Rationale of the proposition Objectives Approaches Modeling Recognition by synthesis Robustness to environmental conditions Evaluation paradigm Excellence Integration and structuring effect
Rationale for the NoE-AMSP The areas of Automatic Speech Processing (recognition, synthesis, coding, language identification, speaker verification) should be better integrated Better models of Speech Production and Perception Investigate Nonlinear Speech Processing Understanding, Semantic interpretation
Integrated platform for Automatic Speech Processing
Levels of representations
Features of Speech Models Reflect auditory properties of human perception Explain articulatory movements Surpass the limitations of the source-filter model Capture the dynamics of speech Capable of natural speech restitution Be discriminant for segmental information Robust to noise and channel distortions Adaptable to new speakers and new environments
Time – Frequency distributions Short Time Fourier Transform Non-linear frequency scale (PLP, WLP), mel-cepstrum Wavelets, FAMlets Bilinear distributions (Wigner-Ville, Choi-Williams,...) Instantaneous frequency, Teager operator Time – dependent representations (parametric and non parametric) Vector quantisation Matrix quantisation, non linear prediction
Time-dependent Spectral Models Temporal Decomposition (B. Atal, 1983) Vectorial Autoregressive models with detection of model ruptures (A. DeLima, Y. Grenier) Segmental parameterisation using a time-dependent polynomial expansion (Y. Grenier)
Modeling of segmental units Hidden Markov Model Markov Fields Bayesian Networks, Graphical Models OR Production models Synthesis (concatenative or rule based) with voice transformation AND / OR Non linear predictor
Expected achievements in Speech Coding and Synthesis Modeling the non-linearities in Speech Production and Perception will lead to more accurate and/or compact parametric representations. Integrate segmental recognition and synthesis techniques in the coding loop to achieve bit rates as low as a few 100's bps with natural quality Develop voice transformation techniques in order to : Adapt segmental coders to new speakers, Modify the characteristics of synthetic voices
Expected achievements in Speech Synthesis Self-excited nonlinear feedback oscillators will allow to better match synthetic and human voices. Current concatenative techniques should be supplemented (or replaced) by (nonlinear) model based generative techniques to improve quality, naturalness, flexibility, training and adaptation. Model-based voice mimicry controled by textual, phonetic and/or parametric input should not only improve synthesis but also coding, recognition and speaker characterisation.
Automatic Speech Recognition Limitations of the HMM and hybrid HMM-ANN approaches Keyword spotting (detection with SVM), noise robustness, adaptation Large Vocabulary Speech Recognition (SIROCCO) Markov Random Fields, Bayesian Networks and Graphical Models
Markov Random Fields Bayesian Networks and Graphical Models Speech modelling with state constrained Markov Random Field over Frequency bands (Guillaume Gravier and Marc Sigelle) Comparative framework to study MRF, Bayesian Networks and Graphical Models.
Recognition by Synthesis If we could drive a synthesizer with meaningful units (phone sequences, words,...) to produce a speech signal that mimics the one to recognize, we may come close to transcription. Analysis by Synthesis (which is in fact modeling) is a powerful tool in recognition and coding. A trivial implementation is indexing a labelled speech memory
A L I S P Automatic Language Independent Speech Processing Automatic discovery of segmental units for speech coding, synthesis, recognition, language identification and speaker verification.
The robustness issue : Mismatch between training and testing conditions High Order Statistics are less sensitive to environment and transmission noise than autocorrelation CMS, RASTA filtering Independent Component Analysis From Speaker Independent to Speaker Dependent recognition (Personalisation)
Expected achievements in Automatic Speech Recognition Dynamic nonlinear models should allow to merge feature extraction and classification under a common paradigm Such models should be more robust to noise, channel distortions and missing data (transmission errors and packet losses) Indexing a speech memory may help in the verification of hypotheses (a technique shared with Very Low Bit Rate Coders) Statistical language models should be supplemented with adapted semantic information (conceptual graphs)
Voice technology in Majordome Server side background tasks: continuous speech recognition applied to voice messages upon reception Detection of sender’s name and subject User interaction: Speaker identification and verification Speech recognition (receiving user commands through voice interaction) Text-to-speech synthesis (reading text summaries, s or faxes)
Collaboration with COST-278 COST-278: Vocal Dialogue is a continuation of COST-249 High interest in Robust Speech Recognition, Word spotting, Speech to actions, Speaker adaptation,... Some members contribute to the Eureka- MAJORDOME project Could be the seed for a Network of Excellence in FP6
Evaluation paradigm DARPA NIST Could we organize evaluation campaigns in Europe ? The 6 th program of the EU is trying to promote Networks of Excellence. How should excellence be evaluated ? Should financial support be correlated with evaluation results ?