Alexandrina Rogozan
Adaptive Fusion of Acoustic and Visual Sources for Automatic Speech Recognition
Université du Maine

Bio Sketch
- Assistant Professor in Computer Science and Electrical Engineering at the University of Le Mans, France, and member of the Speech Processing Group at LIUM
- 1999: Ph.D. in Computer Science from the University of Paris XI - Orsay: "Heterogeneous Data Fusion for Audio-Visual Speech Recognition"
- Participant in the French project AMIBE, "Improvement of the Robustness and Confidentiality of Man-Machine Communication by Using Audio and Visual Data" (Universities of Grenoble, Le Mans, Toulouse, Avignon, Paris 6 & INRIA)

Research Activity
- GOAL: study the benefit of visual information for ASR
- METHOD: develop different audio-visual ASR systems
- APPROACH: copy the synergy observed in human speech perception
- EVALUATION: test the accuracy of the recognition process on a speaker-dependent connected-letter task

Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

1. Audio-Visual Speech System Overview
- Goal: obtain the synergy of the acoustic and visual modalities, i.e. audio-visual fusion results > uni-modal results
[Diagram: a visual front end (face tracking, lip localization, visual-feature extraction) and acoustic-feature extraction feed a joint treatment stage implementing the integration strategies]

1. Unanswered Questions in AV ASR
- When should the audio-visual fusion take place: before or after the categorization in each modality?
- How to take into account the differences in the temporal evolution of speech events in the acoustic and visual modalities?
- How to adapt the relative contribution of the acoustic and visual modalities during the recognition process?

1. Relative Contribution of the Acoustic and Visual Modalities
- Speech features: place and manner of articulation, and voicing
- They vary with the phonemic content. Ex: which modality distinguishes /m/ from /n/ (the visual one: their places of articulation differ visibly on the lips), and /m/ from /p/ (the acoustic one: same lip shape, but different voicing and nasality)?
- They vary with the environmental context. Ex: the acoustic cues to the place of articulation are the least robust ones in noise
=> Exploit the complementary nature of the modalities

1. Differences in the Temporal Evolution of Phonemes in the Acoustic and Visual Modalities
- Anticipation and retention phenomena: temporal shifts of up to 250 ms [Abry & Lalouache, 1991]
- This 'natural asynchrony' has to be handled with different phonemic boundaries in each modality
- It varies with the phonemic content
=> Exploit the 'natural asynchrony'

Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
   - One-level Fusion Architectures
   - Hybrid Fusion Architecture
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

2. One-Level Fusion Architectures
- At the data (features) level: the acoustic and visual data are fused first, then categorized into a recognized speech unit
- At the results (decision) level: the acoustic and visual data are categorized separately, and the results are fused into a recognized speech unit

2. Fusion Before Categorization
- Concatenation, or Direct Identification (DI)
- Re-coding in the Dominant modality (RD) or in a Motor space (RM) [Robert-Ribes, 1995]
- Problem: choice of the nature of the dominant space, and temporal 'resynchronization' in the common space
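To make the DI option concrete, here is a minimal, hypothetical sketch of feature concatenation: since the visual frame rate (e.g. 25 Hz) is lower than the acoustic analysis rate (e.g. 100 Hz), the visual stream is first resampled so that one joint observation vector exists per frame. All names, rates and dimensions below are illustrative, not taken from the slides.

```python
import numpy as np

def upsample_visual(visual_feats, n_acoustic_frames):
    """Repeat/interpolate visual frames (e.g. 25 Hz) up to the acoustic
    frame rate (e.g. 100 Hz) so both streams are time-aligned."""
    idx = np.linspace(0, len(visual_feats) - 1, n_acoustic_frames)
    return visual_feats[np.round(idx).astype(int)]

def direct_identification_features(acoustic_feats, visual_feats):
    """DI fusion: one joint observation vector per frame, fed to a
    single audio-visual HMM."""
    v = upsample_visual(visual_feats, len(acoustic_feats))
    return np.hstack([acoustic_feats, v])

# Example: 12 MFCCs at 100 Hz, 3 lip-shape parameters at 25 Hz
a = np.random.randn(100, 12)   # 1 s of acoustic frames
v = np.random.randn(25, 3)     # 1 s of visual frames
av = direct_identification_features(a, v)   # shape (100, 15)
```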

2. Fusion After Categorization
- Separate Identification (SI)
- Parallel structure: both modalities are categorized independently, and the two results are fused
- Serial structure: the categorization result of one modality is re-evaluated by the other before fusion

2. Level of Audio-Visual Fusion in Speech Perception
- Audio-visual fusion before categorization. Ex: lip image + larynx frequency (pulse train) => voicing features [Grant, 1985] [Figure: scores of 4.7%, 28.9% and 51.1%]
- Audio-visual fusion after categorization. Ex: lip image (t) + speech signal (t + ΔT) => McGurk illusions [Massaro, 1996]: visual /ga/ + acoustic /ba/ is perceived as /da/
=> Flexibility and robustness of speech perception; adaptability of the fusion mechanisms

2. Hybrid-Fusion Model for Audio-Visual ASR
[Diagram: the acoustic (a) and visual (v) streams are fused both in the continuous, time-varying space of data (DI, producing the joint av stream) and in the discrete, categorical space of results (SI), with adaptation, yielding a sequence of phonemes]

Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
   - Structure of the DI-based Fusion
   - Structure of the SI-based Fusion
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

3. Implementation of the DI-based Fusion
[Diagram: the hybrid architecture, with the DI path realized by a phonemic HMM operating on the joint av stream, followed by SI fusion and adaptation in the discrete, categorical space of results]

3. Characteristics of the DI-based Fusion
- The acoustic and visual speech events are synchronized on the phonemic HMM states
- A visual stream that is too strong perturbs the acoustic one at the TRANSITIONS between HMM states and at speech-unit LABELING
=> Necessity to adapt the DI-based fusion

3. Adaptation of the DI-based Fusion
- To the RELEVANCE of the speech features in each modality
- To the RELIABILITY of the processing in each modality
=> Necessity to estimate the reliability of the global process a posteriori

3. Realization of the Adaptation in the DI-based Fusion
- Exponential weight λ:
  - global to the recognition hypothesis
  - selected a posteriori, according to the SNR and the phonemic content
[Diagram: acoustic (A) and visual (V) phonemic HMMs produce weighted hypotheses (λi for hypothesis i, λj for hypothesis j); after fusion, a choice yields the output sequence of phonemes]
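The exponential weight can be read as raising the visual likelihood to a power λ before multiplying it with the acoustic one, i.e. adding λ · log P(v | unit) in the log domain. The sketch below assumes this standard formulation; the mapping from SNR to λ is purely illustrative (the slide selects λ a posteriori from the SNR and the phonemic content).

```python
import numpy as np

def weighted_di_score(log_lik_acoustic, log_lik_visual, lam):
    """Exponential stream weight: P(a) * P(v)**lam, in the log domain."""
    return log_lik_acoustic + lam * log_lik_visual

def select_lambda(snr_db):
    """Illustrative mapping only: trust the visual stream more as the
    acoustic SNR drops. The actual weights were chosen a posteriori,
    per hypothesis, from the SNR and the phonemic content."""
    return float(np.clip(1.0 - snr_db / 30.0, 0.0, 1.0))

# Rescore competing hypotheses and keep the best one
hypotheses = [("b", -42.0, -11.0), ("p", -43.5, -8.0)]  # (label, logP_a, logP_v)
lam = select_lambda(snr_db=0.0)
best = max(hypotheses, key=lambda h: weighted_di_score(h[1], h[2], lam))
```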

3. Choice of the Hybrid-Fusion Architecture
- DI + V => asynchronous fusion of information
[Diagram: the hybrid architecture, combining the DI path (phonemic HMM on the joint av stream) with the purely-visual path V]

3. Implementation of the SI-based Fusion
- Serial structure => visual evaluation of the DI solutions: the N-best phonetically labeled solutions of the DI path are re-evaluated against the visual stream (phonemic HMM on v), then fused with adaptation
[Diagram: the DI path produces N-best solutions in the discrete, categorical space of results; the visual phonemic HMM evaluates them; fusion and adaptation yield the sequence of phonemes]

3. Characteristics of the SI-based Fusion
- Multiplication of the modality output probabilities
- A temporal shift of up to 100 ms between the modality phonemic boundaries is allowed => the 'natural asynchrony' is accommodated
- A visual stream that is too strong perturbs the acoustic one at speech-unit LABELING
=> Necessity to adapt the SI-based fusion
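A minimal sketch of this serial rescoring, assuming a hypothetical data layout in which each phoneme hypothesis carries one acoustic and one visual segment with its own boundaries and log-probability; boundaries may disagree by up to 100 ms, and the segment probabilities are multiplied (summed in the log domain).

```python
def si_rescore(phone_hyp, tolerance=0.100):
    """Serial SI fusion sketch: multiply the acoustic and visual segment
    probabilities (sum of log-probs), accepting a temporal shift of up
    to 100 ms between the two modalities' phonemic boundaries.
    Each item: (label, (a_start, a_end, logp_a), (v_start, v_end, logp_v));
    times in seconds. Hypothetical layout, not the slide's exact one."""
    score = 0.0
    for label, (a_start, a_end, logp_a), (v_start, v_end, logp_v) in phone_hyp:
        if abs(a_start - v_start) > tolerance or abs(a_end - v_end) > tolerance:
            return float("-inf")   # asynchrony beyond the allowed shift
        score += logp_a + logp_v   # product of probabilities
    return score

hyp = [("b", (0.00, 0.08, -12.0), (0.02, 0.10, -5.0)),
       ("o", (0.08, 0.20, -15.0), (0.09, 0.21, -6.5))]
print(si_rescore(hyp))
```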

3. Realization of the Adaptation in the SI-based Fusion
- Exponential weight λ:
  - calculated a posteriori, according to the relative reliability of the acoustic and visual modalities
  - estimated from the dispersion of the 4-best solutions
  - λ varies with the SNR on the test data
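One possible reading of the dispersion criterion, as a sketch only: when a modality's best few hypotheses have nearly equal scores, that modality is ambiguous, so its weight should drop. The exact formula is not given on the slide; everything below is an assumption.

```python
import numpy as np

def nbest_dispersion(log_liks):
    """Reliability proxy: a large gap between the best hypothesis and the
    runner-ups means a confident modality; near-equal scores mean
    ambiguity (low reliability)."""
    top = np.sort(np.asarray(log_liks))[::-1][:4]   # 4-best solutions
    return float(top[0] - top[1:].mean())

def relative_weight(acoustic_scores, visual_scores):
    """Illustrative a posteriori weight: the visual modality's share of
    the total dispersion."""
    da = nbest_dispersion(acoustic_scores)
    dv = nbest_dispersion(visual_scores)
    return dv / (da + dv + 1e-10)
```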

Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
   - Visual Categorization
   - Parallel Structure for the SI-based Fusion
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

4. Type of Interaction in the SI-based Fusion
- Effective integration vs. coherence verification: depends on the ratio of the modality weights
- IMPROVEMENT: reinforcement of the purely-visual component
  - discriminative learning
  - effective visual categorization

4. Discriminative Learning of Visual Speech by Neural Networks (NN)
- Requires relevant visual differences between the classes to discriminate
- Inconsistent with phonemic classes because of visual doubles, e.g. /p/, /b/, /m/, which share the same lip shape
=> Use adapted classes: VISEMES
- Sources of variability: language, speech rate, differences among speakers

4. Definition of Visemes
- Extraction of visual phonemes from the training data: the middle of each acoustic-phonemic segment anchors a visual segment of 140 ms
- Mapping of the extracted visual phonemes with Kohonen's Self-Organizing Map (SOM) algorithm (see the sketch below)
- Identification of visemes at 3 resolution levels
[Figure: consonant viseme groups (e.g. {p, b, m}, {f, v}) and vowel viseme groups]
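A tiny 1-D Kohonen SOM in NumPy, to illustrate the mapping step: each 140 ms visual segment, flattened into one vector, is drawn toward its best-matching unit, and units are later grouped into visemes according to the phoneme labels they attract. Map size, learning rate and feature dimensions are illustrative assumptions.

```python
import numpy as np

def train_som(segments, n_units=16, epochs=50, lr0=0.5, radius0=4.0):
    """Minimal 1-D Kohonen SOM: for each input, find the best-matching
    unit (BMU) and pull it and its neighbors toward the input, with a
    learning rate and neighborhood radius that shrink over time."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=(n_units, segments.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        radius = max(radius0 * (1 - t / epochs), 0.5)
        for x in rng.permutation(segments):
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))
            d = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-(d ** 2) / (2 * radius ** 2))   # neighborhood
            w += lr * h[:, None] * (x - w)
    return w

# segments: one flattened 140 ms window of visual features per phoneme
segments = np.random.randn(200, 3 * 14)   # e.g. 3 lip features x 14 frames
units = train_som(segments)
```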

4. Reinforced Purely-Visual Component in the SI Parallel Structure
- Gets rid of the temporal dependence between the DI and V paths
- Effective visual categorization
- Difficulty: taking the temporal dimension of speech into account with an NN
=> Towards hybrid HMM-NN categorization

4. Hybrid HMM-NN Categorization
- NN + HMM: the NN produces a posteriori probabilities from the visible speech; the HMM decodes them into the recognized sequence of visemes
- HMM + NN: the HMM segments the visible speech; the NN classifies the segments into the recognized sequence of visemes
- HMM / NN: the HMM provides the segmentation and a recognized viseme sequence; the NN resolves the viseme confusions
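For the NN + HMM variant, a standard hybrid trick (in the spirit of Bourlard & Morgan's hybrid systems, not necessarily the exact method used here) converts the NN's posterior probabilities into scaled likelihoods by dividing by the class priors, then decodes them with Viterbi:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-10):
    """The NN outputs P(viseme | frame); dividing by the viseme priors
    gives a quantity proportional to p(frame | viseme), usable as HMM
    emission scores (here in the log domain)."""
    return np.log(posteriors + eps) - np.log(priors + eps)

def viterbi(log_emissions, log_trans, log_init):
    """Plain Viterbi over the scaled likelihoods (T x S)."""
    T, S = log_emissions.shape
    delta = log_init + log_emissions[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # best predecessor per state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```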

4. Reinforced Purely-Visual Component in the SI Parallel Structure
- Non-homogeneity of the output scores => inconsistent with the previous multiplicative SI fusion
[Diagram: the hybrid architecture with the V path realized by a visemic HMM / NN and the DI path by a phonemic HMM; fusion and adaptation take place in the discrete, categorical space of results]

4. Implementation of the SI-based Fusion in a Parallel Structure
- The N-best phoneme solutions of the DI path are converted into viseme sequences (phonemes => visemes)
- Edit-distance based alignment with the viseme sequence recognized by the purely-visual component
- Likelihood-ratio calculation and adaptation yield the output sequence of phonemes
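A sketch of this parallel-structure rescoring with a toy phoneme-to-viseme table (the real groups come from the SOM-based viseme definition above) and a plain Levenshtein alignment; the penalty weight alpha stands in for the adaptation step and is an assumption.

```python
# Hypothetical phoneme-to-viseme table (the real groups came from the SOM)
PHONEME_TO_VISEME = {"p": "PBM", "b": "PBM", "m": "PBM", "f": "FV", "v": "FV"}

def edit_distance(a, b):
    """Levenshtein distance between two viseme sequences (one-row DP)."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[-1]

def rescore_nbest(nbest, recognized_visemes, alpha=1.0):
    """Penalize each DI hypothesis by its visemic edit distance to the
    purely-visual recognition result, then keep the best one."""
    scored = []
    for phonemes, log_lik in nbest:
        visemes = [PHONEME_TO_VISEME.get(p, p) for p in phonemes]
        penalty = alpha * edit_distance(visemes, recognized_visemes)
        scored.append((log_lik - penalty, phonemes))
    return max(scored)
```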

4. 'Phonetic Plus Post-categorical' Model, Proposed by Burnham (1998)
- 2-level fusion architecture
- Visual categorization by comparison with visemic prototypes
- Optional use of the purely-visual component after categorization
[Diagram: audible and visual speech are categorized and fused at two levels, with adaptation, yielding the perceived speech]

Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

5. Experiments
- Audio-visual data of the AMIBE project: connected letters, 'dining-hall' noise at SNRs of 10 dB, 0 dB and -10 dB
- Speech features:
  - visual: internal lip-shape height, width and area + Δ + ΔΔ
  - acoustic: 12 MFCC + energy + Δ + ΔΔ
- Speech modeling: HMM + duration model [Suaudeau & André-Obrecht, 1994]; TDNN, SOM
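The acoustic front end (12 MFCC + energy + Δ + ΔΔ) can be reproduced roughly as follows, assuming the librosa library; the frame rates, filterbank settings and exact energy definition on the slide are not specified, so the defaults here are assumptions. The visual features came from lip tracking and are not shown.

```python
import librosa
import numpy as np

def acoustic_front_end(wav_path):
    """12 MFCCs + log energy, plus first and second temporal
    derivatives (delta and delta-delta), as listed on the slide."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)        # (12, T)
    energy = librosa.feature.rms(y=y)                          # (1, T)
    base = np.vstack([mfcc, np.log(energy + 1e-10)])           # (13, T)
    feats = np.vstack([base,
                       librosa.feature.delta(base),            # delta
                       librosa.feature.delta(base, order=2)])  # delta-delta
    return feats.T                                             # (T, 39)
```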

5. Results
The hybrid-fusion model DI+V achieves the audio-visual synergy: the fused system outperforms both uni-modal systems.

5. Results: recognition accuracy (%)

            -10 dB    0 dB   10 dB   clean
AUDIO        -2.1     67.9    88.0    91.5
VISUAL       30.9     30.9    30.9    30.9
DI           40.8     76.4    90.8    95.4
SI            6.3     81.6    89.4    91.9
DI+V         41.9     77.8    91.2    95.8

(The visual-only score does not depend on the acoustic SNR; the slide gives it once.)

5. Comparisons
- Master-Slave Model, proposed at IRIT, Univ. Toulouse [André-Obrecht et al., 1997]
- Product of Models, proposed at LIUAPV, Univ. Avignon [Jourlin, 1998]

5. Master-Slave Model of IRIT (1997)
- The acoustic HMM parameters are probabilistic functions of the master labial HMM
[Diagram: a master labial HMM with states such as open lips, semi-open lips and closed lips drives a slave acoustic HMM]
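A hypothetical reading of the master-slave idea: the current state of the labial (master) HMM selects which parameter set the slave acoustic HMM uses for its emission densities. The Gaussian parameters below are placeholders, not the IRIT model's values.

```python
import numpy as np

LIP_STATES = ["open", "semi-open", "closed"]   # master labial HMM states

# One diagonal Gaussian (mean, var) per acoustic state, per lip state;
# purely illustrative placeholder values.
rng = np.random.default_rng(0)
ACOUSTIC_PARAMS = {lip: {"means": rng.normal(size=(3, 13)),
                         "vars": np.ones((3, 13))}
                   for lip in LIP_STATES}

def slave_emission_logpdf(x, acoustic_state, lip_state):
    """Acoustic emission density whose parameters are a function of the
    current master (labial) HMM state, as in the master-slave scheme."""
    m = ACOUSTIC_PARAMS[lip_state]["means"][acoustic_state]
    v = ACOUSTIC_PARAMS[lip_state]["vars"][acoustic_state]
    return float(-0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v))
```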

5. Product of Models of LIUAPV (1998)
- The audio-visual HMM parameters are computed from separate acoustic and visual HMMs: each composite state pairs an acoustic state with a visual state, and the composite transition probabilities are products of the uni-modal ones (e.g. T11 x T44, T12 x T56, T23 x T66)
[Diagram: a 3-state acoustic HMM (distributions D1(A)-D3(A), transitions T11, T12, T23, T33) and a 3-state visual HMM (D4(V)-D6(V), transitions T44, T45, T55, T56, T66) combine into a 9-state audio-visual HMM over composite states (1,4) ... (3,6)]
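This composite transition structure is exactly a Kronecker product of the two uni-modal transition matrices, which the slide's products (T11 x T44, T12 x T56, ...) describe element by element. A sketch with illustrative numbers:

```python
import numpy as np

# Acoustic HMM transitions (states 1-3) and visual HMM transitions
# (states 4-6); the values are illustrative, not from the slide.
T_a = np.array([[0.6, 0.4, 0.0],
                [0.0, 0.7, 0.3],
                [0.0, 0.0, 1.0]])
T_v = np.array([[0.5, 0.5, 0.0],
                [0.0, 0.6, 0.4],
                [0.0, 0.0, 1.0]])

# Audio-visual composite states are pairs (i, j); their transition
# probabilities are the products T_a[i, i'] * T_v[j, j'], i.e. the
# Kronecker product of the two matrices (e.g. T11 x T44 on the slide).
T_av = np.kron(T_a, T_v)   # shape (9, 9)
```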

6. Conclusion: Contributions
- Addressed the main problems in AV ASR: fusion, (a)synchrony, adaptation, visemes
- Proposed the hybrid-fusion DI+V model
- Adapted the audio-visual fusion a posteriori to variations of both the context and the content
- Defined vision-specific units, the visemes, by self-organization and grouping

6. Conclusion: Further Work
- Use visemes also during the DI-based fusion
- Learn the temporal shifts between the modalities for the SI-based fusion
- Define a dependency function between the pre- and post-categorical weights
- Estimate the modality weights at a finer level
=> Learning on substantial training data and extensive testing

6. Perspectives
Towards a global platform for audio-visual speech communication:
- Preprocessing: source localization, enhancement of the speech signal, scene analysis
- Recognition
- Synthesis
- Coding