Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability
Mark Hasegawa-Johnson, Electrical and Computer Engineering

Audio-Visual Speech Recognition
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beamform, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

I. Video Noise
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beamform, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

AVICAR Database
● AVICAR = Audio-Visual In a CAR
● 100 talkers
● 4 cameras, 7 microphones
● 5 noise conditions: engine idling, 35 mph, 35 mph with windows open, 55 mph, 55 mph with windows open
● Three types of utterances:
  – Digits and phone numbers, for training and testing phone-number recognizers
  – TIMIT sentences, for training and testing large-vocabulary speech recognition
  – Isolated letters, to test the use of video for an acoustically hard recognition problem

AVICAR Recording Hardware (Lee, Hasegawa-Johnson et al., ICSLP 2004)
● 4 cameras, glare shields, adjustable mounting. Best placement: dashboard.
● 8 microphones, pre-amps, wooden baffle. Best placement: sun visor.
● The system is not permanently installed; mounting requires 10 minutes.

AVICAR Video Noise
● Lighting: many different angles, many types of weather
● Interlace: 30 fps NTSC encoding used to transmit data from camera to digital video tape
● Facial features:
  – Hair
  – Skin
  – Clothing
  – Obstructions

AVICAR Noisy Image Examples

Related Problem: Dimensionality
● Dimension of the raw grayscale lip rectangle: 30x200 = 6000 pixels
● Dimension of the DCT of the lip rectangle: 30x200 = 6000 dimensions
● Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25x25 = 625 dimensions
● Truncated DCT typically used in AVSR: 4x4 = 16 dimensions
● Dimension of "geometric lip features" that allow high-accuracy AVSR (e.g., Chu and Huang, 2000): 3 dimensions (lip height, lip width, vertical asymmetry)
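As a rough illustration of the DCT truncation described above, the sketch below keeps only the low-frequency corner of a 2-D DCT of a lip rectangle. It is a minimal sketch assuming NumPy/SciPy and an illustrative 30x200 grayscale array; it is not the exact feature pipeline used in the experiments.

```python
# Minimal sketch of truncated-DCT lip features (NumPy/SciPy assumed).
# The 30x200 lip rectangle and the 4x4 truncation are illustrative.
import numpy as np
from scipy.fft import dct

def truncated_dct_features(lip_rect, keep=4):
    """Keep only the low-frequency keep-by-keep block of the 2-D DCT."""
    # Separable 2-D DCT-II: transform columns, then rows.
    coeffs = dct(dct(lip_rect, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:keep, :keep].ravel()          # e.g. 4x4 = 16 features

lip_rect = np.random.rand(30, 200)               # stand-in for a real lip image
print(truncated_dct_features(lip_rect).shape)    # (16,)
```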

Dimensionality Reduction: The Classics
Principal Components Analysis (PCA):
● Project onto the eigenvectors of the total covariance matrix
● The projection includes noise
Linear Discriminant Analysis (LDA):
● Project onto v = W⁻¹(μ₁ − μ₂), where W is the within-class covariance and μ₁ − μ₂ is the difference of class means
● The projection reduces noise
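A minimal sketch of the two classical projections, assuming feature vectors in the rows of a NumPy array X and binary class labels y; it only illustrates the definitions above, not any specific system.

```python
# Minimal sketch of PCA vs. two-class LDA (NumPy assumed).
import numpy as np

def pca_directions(X, k):
    """Top-k eigenvectors of the total covariance matrix."""
    total_cov = np.cov(X - X.mean(axis=0), rowvar=False)
    evals, evecs = np.linalg.eigh(total_cov)
    return evecs[:, np.argsort(evals)[::-1][:k]]

def lda_direction(X, y):
    """Fisher direction v = W^{-1}(mu1 - mu0) for binary labels y."""
    X0, X1 = X[y == 0], X[y == 1]
    W = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # within-class cov
    d = X1.mean(axis=0) - X0.mean(axis=0)                     # mean difference
    return np.linalg.solve(W, d)
```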

Manifold Estimation (e.g., Roweis and Saul, Science 2000)
Neighborhood graph:
● Node = data point
● Edge = connect each data point to its K nearest neighbors
Manifold estimation:
● The K nearest neighbors of each data point define the local (K−1)-dimensional tangent space of a manifold
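A minimal sketch of the neighborhood-graph construction, assuming a NumPy data matrix; brute-force distances are used for clarity rather than efficiency.

```python
# Minimal sketch of a K-nearest-neighbor graph (brute-force distances).
import numpy as np

def knn_graph(X, K=5):
    """Map each point index to the indices of its K nearest neighbors."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                  # no self-edges
    return {i: np.argsort(D[i])[:K].tolist() for i in range(len(X))}
```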

Local Discriminant Graph (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
Maximize the local inter-manifold (other-class) interpolation error, subject to a constant same-class interpolation error:
Find P to maximize Σ_D = Σ_i ||Pᵀ(x_i − Σ_k c_k y_k)||², where y_k ∈ KNN(x_i) come from other classes,
subject to Σ_S = constant, where Σ_S = Σ_i ||Pᵀ(x_i − Σ_j c_j x_j)||², and x_j ∈ KNN(x_i) come from the same class.
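A criterion of this form (maximize one interpolation-error scatter while holding the other constant) can be attacked as a generalized eigenproblem. The sketch below is a hedged illustration of that idea, with uniform reconstruction weights c_k assumed for simplicity; it is not the authors' implementation.

```python
# Hedged sketch: build same-class and other-class "interpolation error"
# scatter matrices from KNN neighborhoods and solve the resulting
# generalized eigenproblem S_D p = lambda S_S p.  Uniform reconstruction
# weights (simple neighbor means) are a simplifying assumption here.
import numpy as np
from scipy.linalg import eigh

def ldg_projection(X, y, K=5, out_dim=16):
    n, d = X.shape
    S_D, S_S = np.zeros((d, d)), np.zeros((d, d))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        nn_same = same[np.argsort(dist[same])[1:K + 1]]   # skip the point itself
        nn_other = other[np.argsort(dist[other])[:K]]
        r_s = X[i] - X[nn_same].mean(axis=0)              # same-class residual
        r_d = X[i] - X[nn_other].mean(axis=0)             # other-class residual
        S_S += np.outer(r_s, r_s)
        S_D += np.outer(r_d, r_d)
    # Largest generalized eigenvalues maximize the discriminant ratio.
    evals, evecs = eigh(S_D, S_S + 1e-6 * np.eye(d))
    return evecs[:, np.argsort(evals)[::-1][:out_dim]]
```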

PCA, LDA, LDG: Experimental Test (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007) Lip Feature Extraction: DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph

Lip Reading Results (Digits) (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007) DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph

II. Audio Noise
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beamform, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

Audio Noise

Audio Noise Compensation
● Beamforming
  – Filter-and-sum (MVDR) vs. delay-and-sum
● Post-filter
  – MMSE log-spectral-amplitude estimator (Ephraim and Malah, 1984) vs. spectral subtraction
● Voice activity detection
  – Likelihood-ratio method (Sohn and Sung, ICASSP 1998)
  – Noise estimates:
    ● Fixed noise
    ● Time-varying noise (autoregressive estimator)
    ● High-variance noise (backoff estimator)
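For reference, a minimal delay-and-sum sketch, assuming the per-channel steering delays (in samples) are already known, e.g., from time-difference-of-arrival estimation; the MVDR filter-and-sum and MMSE post-filter are not shown.

```python
# Minimal delay-and-sum sketch; per-channel steering delays (in samples)
# are assumed known.  np.roll wraps around, so a real system would zero-pad.
import numpy as np

def delay_and_sum(channels, delays):
    """channels: list of equal-length 1-D arrays; delays: samples to advance."""
    aligned = [np.roll(x, -d) for x, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)              # average the aligned channels
```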

MVDR Beamformer + MMSE-logSA Postfilter
(MVDR = minimum variance distortionless response)
(MMSE-logSA = MMSE log-spectral-amplitude estimator)
(Proof of optimality: Balan and Rosca, ICASSP 2002)

Word Error Rate: Beamformers
● Ten-digit phone numbers; trained and tested with a 50/50 mix of quiet (idle) and noisy (55 mph, windows open) data
● DS = delay-and-sum; MVDR = minimum variance distortionless response

Word Error Rate: Postfilters

Voice Activity Detection
● Most errors at low SNR occur because noise is misrecognized as speech
● Effective solution: voice activity detection (VAD)
● Likelihood-ratio VAD (Sohn and Sung, ICASSP 1998):
  Λ_t = log { p(X_t = S_t + N_t) / p(X_t = N_t) }
  X_t = measured power spectrum
  S_t, N_t = exponentially distributed speech and noise
  Λ_t > threshold → speech present
  Λ_t < threshold → speech absent
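A minimal sketch of the frame-level likelihood-ratio statistic for exponentially distributed spectral power; the a-priori SNR estimate below is a deliberately crude placeholder for the decision-directed estimate used in practice.

```python
# Sketch of a frame-level likelihood-ratio VAD statistic, assuming the
# speech-plus-noise and noise-only power spectra are exponentially distributed.
import numpy as np

def lr_vad_statistic(X, N):
    """X, N: power spectrum of the current frame and the noise estimate."""
    gamma = X / np.maximum(N, 1e-12)                 # a-posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)                # crude a-priori SNR
    llr = gamma * xi / (1.0 + xi) - np.log1p(xi)     # per-bin log-likelihood ratio
    return llr.mean()                                # compare to a threshold
```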

VAD: Noise Estimators
● Fixed estimate: N_0 = average of the first 10 frames
● Autoregressive estimator (Sohn and Sung):
  N_t = α_t X_t + (1 − α_t) N_{t−1}
  α_t = a function of X_t and N_0
● Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007):
  N_t = α_t X_t + (1 − α_t) N_0
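The two time-varying estimators reduce to one-line updates. In the sketch below, α is shown as a fixed smoothing constant for brevity, although the slide defines α_t as a function of X_t and N_0.

```python
# Sketch of the two time-varying noise estimators described above.
def noise_autoregressive(N_prev, X, alpha=0.1):
    return alpha * X + (1.0 - alpha) * N_prev    # track the previous estimate

def noise_backoff(N0, X, alpha=0.1):
    return alpha * X + (1.0 - alpha) * N0        # always back off toward N_0
```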

Word Error Rate: Digits

III. Pronunciation Variability
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beamform, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

Graphical Methods: Dynamic Bayesian Network
● Bayesian network (BN) = a graph in which
  – nodes are random variables (RVs)
  – edges represent dependence
● Dynamic Bayesian network (DBN) = a BN in which RVs are repeated once per time step
  – Example: an HMM is a DBN
● Most important RV: the "phonestate" variable q_t
  – Typically q_t ∈ {phones} × {1,2,3}
  – Acoustic features x_t and video features y_t depend on q_t

Example: An HMM is a DBN
[Figure: the HMM unrolled as a DBN over frames t−1 and t, with nodes q_t, ν_t, x_t, y_t, w_t, winc_t, and qinc_t in each frame.]
● q_t is the phonestate, e.g., q_t ∈ { /w/1, /w/2, /w/3, /n/1, /n/2, … }
● w_t is the word label at time t, e.g., w_t ∈ {"one", "two", …}
● ν_t is the position of phone q_t within word w_t: ν_t ∈ {1st, 2nd, 3rd, …}
● qinc_t ∈ {0,1} specifies whether ν_{t+1} = ν_t or ν_{t+1} = ν_t + 1
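The qinc/winc bookkeeping amounts to a small deterministic update of the position variable. The sketch below, written against the variable names on this slide, is only meant to make that mechanism concrete.

```python
# Sketch of the deterministic position/word bookkeeping in the HMM-as-DBN:
# qinc decides whether the phone position advances, and winc fires when the
# last phone of the current word finishes.
def advance_position(position, word_phones, qinc):
    """Return (next_position, winc)."""
    if qinc == 0:
        return position, 0                 # stay in the same phonestate
    if position + 1 < len(word_phones):
        return position + 1, 0             # next phonestate of the same word
    return 0, 1                            # word complete; start the next word
```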

Pronunciation Variability  Even when reading phone numbers, talkers “blend” articulations.  For example: “seven eight:” /s  vәnet/→ /s  vne?/  As speech gets less formal, pronunciation variability gets worse, e.g., worse in a car than in the lab; worse in conversation than in read speech

A Related Problem: Asynchrony
● Audio and video information are not synchronous.
● For example, the "th" (/θ/) in "three" is visible, but not yet audible, because the audio is still silent.
● Should the HMM be in q_t = "silence", or q_t = /θ/?

A Solution: Two State Variables (Chu and Huang, ICASSP 2000)
[Figure: a coupled DBN unrolled over frames t−1 and t, with parallel audio and video chains.]
● Coupled HMM (CHMM): two parallel HMMs
● q_t: audio state (x_t: audio observation)
● v_t: video state (y_t: video observation)
● a_t = ν_t(audio) − ν_t(video): asynchrony, capped at |a_t| < 3
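The asynchrony cap defines the composite state space of the coupled HMM. A minimal sketch, assuming integer phone positions per stream:

```python
# Sketch of the coupled-HMM composite state space: pairs of audio and video
# phone positions with asynchrony capped at |audio - video| < 3.
def chmm_states(num_positions, max_async=3):
    return [(qa, qv)
            for qa in range(num_positions)
            for qv in range(num_positions)
            if abs(qa - qv) < max_async]
```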

Asynchrony in Articulatory Phonology (Livescu and Glass, 2004)
● It's not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.
[Figure: a DBN with a word variable, per-articulator state and index variables (S_1/U_1, S_2/U_2, S_3/U_3 with ind_1, ind_2, ind_3), and pairwise synchrony variables sync_1,2 and sync_2,3.]

Asynchrony in Articulatory Phonology
● It's not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.
[Figure: gestural scores for "three" over time, dictionary form vs. casual speech. Tongue tier: dental /θ/, retroflex /r/, palatal /i/. Glottis tier: unvoiced, then voiced, with silent intervals in the casual-speech version.]

Asynchrony in Articulatory Phonology
[Figure: gestural scores for "seven" over time, dictionary form /sɛvәn/ vs. casual speech /sɛvn/. Lips tier: fricative /v/. Tongue tier: fricative /s/, wide /ɛ/, neutral /ә/, closed /n/.]
● The same mechanism represents pronunciation variability:
  – "Seven": /vәn/ → /vn/ if the tongue closes before the lips open
  – "Eight": /et/ → /eʔ/ if the glottis closes before the tongue tip closes

An Articulatory Feature Model (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
[Figure: a DBN with three parallel chains (lips, tongue, glottis), each with its own state, position, and increment variables per frame.]
● There is no "phonestate" variable. Instead, we use a vector q_t → [l_t, t_t, g_t]:
  – lipstate variable l_t
  – tonguestate variable t_t
  – glotstate variable g_t
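By analogy with the CHMM sketch earlier, the factored state space can be pictured as triples of per-articulator states with an asynchrony cap. Limiting the cap to adjacent stream pairs below is a simplifying assumption, not the exact coupling used in the paper.

```python
# Sketch of a factored articulatory state space: (lipstate, tonguestate,
# glotstate) triples, with asynchrony capped between adjacent streams.
def afm_states(num_lip, num_tongue, num_glottis, max_async=3):
    return [(l, t, g)
            for l in range(num_lip)
            for t in range(num_tongue)
            for g in range(num_glottis)
            if abs(l - t) < max_async and abs(t - g) < max_async]
```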

Experimental Test (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● Training and test data: CUAVE corpus
  – Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002
  – 169 utterances used, 10 digits each, silence between words
  – Recorded without audio or video noise (studio lighting, silent background)
● Audio prepared by Kate Saenko at MIT
  – NOISEX speech babble added at various SNRs
  – MFCC + delta + delta-delta feature vectors, 10 ms frames
● Video prepared by Amar Subramanya at UW
  – Feature vector = DCT of the lip rectangle
  – Upsampled from 33 ms frames to 10 ms frames
● Experimental condition: train-test mismatch
  – Training on clean data
  – Audio/video weights tuned on noise-specific dev sets
  – Language model: uniform (all words equally probable), constrained to have the right number of words per utterance

Experimental Questions (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
1) Does video reduce word error rate?
2) Does audio-video asynchrony reduce word error rate?
3) Should asynchrony be represented as
   1) audio-video asynchrony (CHMM), or
   2) lips-tongue-glottis asynchrony (AFM)?
4) Is it better to use only the CHMM, only the AFM, or a combination of both methods?

Results, part 1: Should we use video? Answer: YES. Audio-Visual WER < Single-stream WER

Results, part 2: Are Audio and Video asynchronous? Answer: YES. Async WER < Sync WER.

Results, part 3: Should we use CHMM or AFM? Answer: DOESN’T MATTER! WERs are equal.

Results, part 4: Should we combine systems? Answer: YES. Best is AFM+CH1+CH2 ROVER

Conclusions
● Video feature extraction:
  – A manifold discriminant is better than a global discriminant
● Audio feature extraction:
  – Beamformer: delay-and-sum beats filter-and-sum
  – Postfilter: spectral subtraction gives the best WER (though MMSE-logSA sounds best)
  – VAD: backoff noise estimation works best in this corpus
● Audio-video fusion:
  – Video reduces WER in train-test mismatch conditions
  – Audio and video are asynchronous (CHMM)
  – Lips, tongue and glottis are asynchronous (AFM)
  – It doesn't matter whether you use the CHMM or the AFM, but…
  – Best result: combine both representations