Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability
Mark Hasegawa-Johnson, Electrical and Computer Engineering
Audio-Visual Speech Recognition
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
I. Video Noise
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
AVICAR Database
● AVICAR = Audio-Visual In a CAR
● 100 talkers
● 4 cameras, 7 microphones
● 5 noise conditions: engine idling, 35 mph, 35 mph with windows open, 55 mph, 55 mph with windows open
● Three types of utterances:
  – Digits & phone numbers, for training and testing phone-number recognizers
  – TIMIT sentences, for training and testing large-vocabulary speech recognition
  – Isolated letters, to test the use of video for an acoustically hard recognition problem
AVICAR Recording Hardware (Lee, Hasegawa-Johnson et al., ICSLP 2004)
● 4 cameras, glare shields, adjustable mounting; best placement: dashboard
● 8 mics, pre-amps, wooden baffle; best placement: sun visor
● The system is not permanently installed; mounting requires 10 minutes.
AVICAR Video Noise
● Lighting: many different angles, many types of weather
● Interlace: 30 fps NTSC encoding used to transmit data from camera to digital video tape
● Facial features:
  – Hair
  – Skin
  – Clothing
  – Obstructions
AVICAR Noisy Image Examples
Related Problem: Dimensionality
● Dimension of the raw grayscale lip rectangle: 30x200 = 6000 pixels
● Dimension of the DCT of the lip rectangle: 30x200 = 6000 dimensions
● Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25x25 = 625 dimensions
● Truncated DCT typically used in AVSR: 4x4 = 16 dimensions (see the sketch below)
● Dimension of "geometric lip features" that allow high-accuracy AVSR (e.g., Chu and Huang, 2000): 3 dimensions (lip height, lip width, vertical asymmetry)
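The truncated DCT described above can be computed in a few lines. A minimal sketch, assuming the lip region is already cropped to a 30x200 grayscale NumPy array and that SciPy is available; the array, function name, and 4x4 truncation size are illustrative choices, not the exact AVICAR pipeline:

```python
import numpy as np
from scipy.fft import dctn  # type-II 2-D DCT

def truncated_dct_features(lip_rect, keep=4):
    """Compress a grayscale lip rectangle to a keep x keep block of
    low-frequency DCT coefficients (16 dimensions when keep=4)."""
    coeffs = dctn(lip_rect, norm="ortho")      # full 2-D DCT, same shape as the input
    return coeffs[:keep, :keep].flatten()      # keep only the low-frequency corner

# Example: a 30x200-pixel lip rectangle reduced to 16 features per frame.
lip_rect = np.random.rand(30, 200)
features = truncated_dct_features(lip_rect, keep=4)
print(features.shape)  # (16,)
```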
Dimensionality Reduction: The Classics
● Principal Components Analysis (PCA): project onto the eigenvectors of the total covariance matrix.
  – The projection includes noise.
● Linear Discriminant Analysis (LDA): project onto v = W⁻¹d, where W is the within-class covariance matrix and d is the difference between the class means.
  – The projection reduces noise.
(A sketch of both projections follows below.)
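A compact sketch of both classical projections for the two-class case, assuming the rows of X are feature vectors and y holds binary class labels; the function names and the use of plain NumPy are illustrative, not the experimental code:

```python
import numpy as np

def pca_projection(X, n_components=16):
    """Project onto the top eigenvectors of the total covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]        # largest-variance directions first
    return Xc @ top

def lda_direction(X, y):
    """Two-class Fisher discriminant: v = W^{-1} d, d = difference of class means."""
    X0, X1 = X[y == 0], X[y == 1]
    W = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # within-class scatter
    d = X1.mean(axis=0) - X0.mean(axis=0)                     # class-mean difference
    return np.linalg.solve(W, d)                              # avoids an explicit inverse
```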
Manifold Estimation (e.g., Roweis and Saul, Science 2000)
● Neighborhood graph:
  – Node = data point
  – Edge = connect each data point to its K nearest neighbors
● Manifold estimation: the K nearest neighbors of each data point define the local (K−1)-dimensional tangent space of a manifold.
(A graph-construction sketch follows below.)
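A sketch of the neighborhood-graph construction, assuming a NumPy data matrix and scikit-learn's nearest-neighbor search; any KNN routine would do, and K=5 and the 16-dimensional example features are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, K=5):
    """Return, for each data point, the indices of its K nearest neighbors.
    These neighbors span the local tangent space used in manifold methods."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)   # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X)
    return idx[:, 1:]                                 # drop the self-neighbor column

X = np.random.rand(200, 16)     # e.g., 200 frames of 16-dim DCT lip features
neighbors = knn_graph(X, K=5)   # shape (200, 5)
```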
Local Discriminant Graph (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
Maximize the local inter-manifold interpolation error, subject to a constant same-class interpolation error:
● Find P to maximize D = Σ_i ||Pᵀ(x_i − Σ_k c_k y_k)||², where y_k ∈ KNN(x_i) are neighbors from other classes,
● subject to S = constant, where S = Σ_i ||Pᵀ(x_i − Σ_j c_j x_j)||² and x_j ∈ KNN(x_i) are neighbors from the same class.
(A solver sketch follows below.)
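One standard way to optimize a "maximize D subject to S = constant" criterion of this kind is a generalized eigenvalue problem on the two residual scatter matrices. The sketch below illustrates that route under the assumption that the interpolation residuals have already been computed; it is a plausible solver, not a description of the paper's exact algorithm, and the names S_b and S_w are introduced here purely for illustration:

```python
import numpy as np
from scipy.linalg import eigh

def ldg_projection(between_residuals, within_residuals, n_components=16):
    """Given residuals r_i = x_i - sum_k c_k y_k (other-class neighbors) and
    s_i = x_i - sum_j c_j x_j (same-class neighbors), maximize the projected
    inter-manifold residual energy at fixed same-class residual energy by
    solving a generalized symmetric eigenvalue problem."""
    S_b = between_residuals.T @ between_residuals   # inter-manifold scatter
    S_w = within_residuals.T @ within_residuals     # same-class scatter
    S_w += 1e-6 * np.eye(S_w.shape[0])              # regularize so S_w is invertible
    eigvals, eigvecs = eigh(S_b, S_w)               # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :n_components]       # columns of P, largest ratio first
```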
PCA, LDA, LDG: Experimental Test (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007) Lip Feature Extraction: DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph
Lip Reading Results (Digits) (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007) DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph
II. Audio Noise
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
Audio Noise
Audio Noise Compensation
● Beamforming: filter-and-sum (MVDR) vs. delay-and-sum (a delay-and-sum sketch follows this list)
● Post-filter: MMSE log spectral amplitude estimator (Ephraim and Malah, 1984) vs. spectral subtraction
● Voice activity detection:
  – Likelihood ratio method (Sohn and Sung, ICASSP 1998)
  – Noise estimates: fixed noise; time-varying noise (autoregressive estimator); high-variance noise (backoff estimator)
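A minimal delay-and-sum sketch for a multi-microphone recording, assuming the per-channel steering delays (in samples) have been estimated elsewhere; the 7-microphone, 16 kHz example values are illustrative and are not the AVICAR configuration:

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Align each microphone channel by its integer steering delay and average.
    channels: (n_mics, n_samples) array; delays_samples: per-mic delay in samples."""
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays_samples):
        out += np.roll(ch, -int(d))        # advance each channel so the target source aligns
    return out / n_mics                    # averaging attenuates uncorrelated noise

# Illustrative use: 7 mics, 1 s of 16 kHz audio, delays estimated elsewhere (e.g., by cross-correlation).
mics = np.random.randn(7, 16000)
delays = np.zeros(7)                       # broadside steering => zero relative delay
enhanced = delay_and_sum(mics, delays)
```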
MVDR Beamformer + MMSE-logSA Postfilter
● MVDR = minimum variance distortionless response
● MMSE-logSA = MMSE log spectral amplitude estimator
● Proof of optimality: Balan and Rosca, ICASSP 2002
(The standard MVDR weight formula is given below.)
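For reference, the standard narrowband MVDR weight vector takes the textbook form below; this is the generic formula, not necessarily the exact implementation evaluated here, and R_n and d denote the noise spatial covariance and the steering vector toward the talker:

```latex
% Standard MVDR beamformer weights, per frequency bin:
\mathbf{w}_{\mathrm{MVDR}}
  = \frac{\mathbf{R}_n^{-1}\,\mathbf{d}}
         {\mathbf{d}^{\mathsf H}\,\mathbf{R}_n^{-1}\,\mathbf{d}},
\qquad
\mathbf{w}^{\mathsf H}\mathbf{d} = 1 \ \text{(distortionless constraint)},
\quad
\min_{\mathbf{w}} \ \mathbf{w}^{\mathsf H}\mathbf{R}_n\mathbf{w} \ \text{(minimum output noise power)}.
```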
Word Error Rate: Beamformers
Ten-digit phone numbers; trained and tested with a 50/50 mix of quiet (engine idling) and noisy (55 mph, windows open) data.
DS = delay-and-sum; MVDR = minimum variance distortionless response
Word Error Rate: Postfilters
Voice Activity Detection
● Most errors at low SNR occur because noise is misrecognized as speech.
● Effective solution: voice activity detection (VAD).
● Likelihood ratio VAD (Sohn and Sung, ICASSP 1998) — see the sketch below:
  – Λ_t = log { p(X_t = S_t + N_t) / p(X_t = N_t) }
  – X_t = measured power spectrum
  – S_t, N_t = exponentially distributed speech and noise
  – Λ_t > threshold → speech present
  – Λ_t < threshold → speech absent
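A sketch of the per-frame likelihood-ratio test under the exponential (complex-Gaussian) model, assuming a noise power-spectrum estimate N_t is available from one of the estimators on the next slide; the crude maximum-likelihood a-priori SNR estimate and the threshold value are assumptions made for illustration, not the tuned system:

```python
import numpy as np

def frame_log_likelihood_ratio(power_spectrum, noise_psd):
    """Per-frame log likelihood ratio for speech presence under the
    exponential (complex-Gaussian) model used in likelihood-ratio VAD.
    power_spectrum: measured per-bin power X_t; noise_psd: noise estimate N_t."""
    gamma = power_spectrum / np.maximum(noise_psd, 1e-12)   # a-posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)                       # crude ML a-priori SNR estimate (assumption)
    per_bin = gamma * xi / (1.0 + xi) - np.log1p(xi)        # log likelihood ratio of each bin
    return per_bin.mean()                                   # average over frequency bins

def vad_decision(power_spectrum, noise_psd, threshold=0.2):
    """Speech present if the mean log likelihood ratio exceeds a tuned threshold."""
    return frame_log_likelihood_ratio(power_spectrum, noise_psd) > threshold
```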
VAD: Noise Estimators
● Fixed estimate: N_0 = average of the first 10 frames
● Autoregressive estimator (Sohn and Sung): N_t = α_t X_t + (1 − α_t) N_{t−1}, where α_t is a function of X_t and N_0
● Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007): N_t = α_t X_t + (1 − α_t) N_0
(A sketch of all three estimators follows below.)
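A sketch of the three noise estimators, assuming frames is an array of per-frame power spectra and treating the weight α_t as a user-supplied function alpha_fn; its exact dependence on X_t and N_0 is not reproduced here:

```python
import numpy as np

def fixed_noise(frames):
    """N_0: average power spectrum of the first 10 frames."""
    return frames[:10].mean(axis=0)

def autoregressive_noise(frames, alpha_fn):
    """N_t = alpha_t * X_t + (1 - alpha_t) * N_{t-1} (Sohn and Sung)."""
    N = fixed_noise(frames)
    estimates = []
    for X in frames:
        a = alpha_fn(X, N)                 # weight in [0, 1]; its exact form is an assumption here
        N = a * X + (1.0 - a) * N
        estimates.append(N)
    return np.array(estimates)

def backoff_noise(frames, alpha_fn):
    """N_t = alpha_t * X_t + (1 - alpha_t) * N_0: always backs off toward the initial estimate."""
    N0 = fixed_noise(frames)
    return np.array([alpha_fn(X, N0) * X + (1.0 - alpha_fn(X, N0)) * N0 for X in frames])
```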
Word Error Rate: Digits
III. Pronunciation Variability
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition
Graphical Methods: Dynamic Bayesian Network
● Bayesian network (BN) = a graph in which:
  – Nodes are random variables (RVs)
  – Edges represent dependence
● Dynamic Bayesian network (DBN) = a BN in which RVs are repeated once per time step
● Example: an HMM is a DBN
● Most important RV: the "phonestate" variable q_t
  – Typically q_t ∈ {phones} × {1, 2, 3}
  – Acoustic features x_t and video features y_t depend on q_t
Example: HMM is a DBN
(Figure: two frames of the DBN, frame t−1 and frame t, each containing the variables q_t, τ_t, x_t, y_t, w_t, winc_t, and qinc_t.)
● q_t is the phonestate, e.g., q_t ∈ { /w/1, /w/2, /w/3, /n/1, /n/2, … }
● w_t is the word label at time t, for example, w_t ∈ {"one", "two", …}
● τ_t is the position of phone q_t within word w_t: τ_t ∈ {1st, 2nd, 3rd, …}
● qinc_t ∈ {0, 1} specifies whether the position stays (τ_{t+1} = τ_t) or advances (τ_{t+1} = τ_t + 1)
(A bookkeeping sketch follows below.)
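A tiny sketch of the deterministic bookkeeping that the qinc/winc variables encode, assuming a fixed number of phonestate positions per word; the function name and the "one" example are purely illustrative, not the DBN implementation used in the work:

```python
# Sketch of the position/word bookkeeping in the HMM-as-DBN:
# qinc decides whether the within-word position counter advances; winc is set
# once the last position of the current word is finished.

def next_position(tau, qinc, positions_in_word):
    """Advance the within-word position tau when qinc == 1; return (tau, winc)."""
    if qinc == 0:
        return tau, 0                       # stay in the same phonestate, word unchanged
    if tau + 1 < positions_in_word:
        return tau + 1, 0                   # move to the next phonestate of the same word
    return 0, 1                             # word finished: reset position, set winc = 1

# Example: a word modeled with 9 phonestate positions (3 phones x 3 states each).
print(next_position(tau=8, qinc=1, positions_in_word=9))   # -> (0, 1): move on to the next word
```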
Pronunciation Variability
● Even when reading phone numbers, talkers "blend" articulations. For example, "seven eight": /sɛvənet/ → /sɛvneʔ/
● As speech gets less formal, pronunciation variability gets worse: worse in a car than in the lab; worse in conversation than in read speech.
A Related Problem: Asynchrony
● Audio and video information are not synchronous.
● For example: the "th" (/θ/) in "three" is visible, but not yet audible, because the audio is still silent.
● Should the HMM be in q_t = "silence," or in q_t = /θ/?
A Solution: Two State Variables (Chu and Huang, ICASSP 2000)
(Figure: two frames of a DBN with parallel audio and video state chains, each with its own position and increment variables, plus the shared word variables w_t and winc_t.)
● Coupled HMM (CHMM): two parallel HMMs
  – q_t: audio state (x_t: audio observation)
  – v_t: video state (y_t: video observation)
  – δ_t = (audio state position) − (video state position): asynchrony, capped at |δ_t| < 3 (see the sketch below)
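A small sketch of the joint state space such an asynchrony cap implies, assuming both streams walk left-to-right through the same number of positions; the function name and the six-position example are illustrative:

```python
from itertools import product

def coupled_state_space(n_positions, max_async=2):
    """Enumerate joint (audio_position, video_position) states of a coupled HMM
    whose two streams may drift apart by at most max_async positions
    (|delta| < 3 corresponds to max_async = 2)."""
    return [(a, v) for a, v in product(range(n_positions), repeat=2)
            if abs(a - v) <= max_async]

states = coupled_state_space(n_positions=6, max_async=2)
print(len(states))   # 24 joint states instead of 36: asynchrony is allowed but bounded
```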
Asynchrony in Articulatory Phonology (Livescu and Glass, 2004)
● It's not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.
(Figure: a DBN with a word variable, per-articulator index variables ind_1, ind_2, ind_3, state variables S_1, S_2, S_3 with observations U_1, U_2, U_3, and synchrony variables sync_{1,2} and sync_{2,3}.)
Asynchrony in Articulatory Phonology
● It's not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous.
(Figure: gestural scores for "three," dictionary form vs. casual speech. Tongue tier: dental /θ/, retroflex /r/, palatal /i/; glottis tier: unvoiced, then voiced, with silent intervals in the casual-speech version.)
Asynchrony in Articulatory Phonology
(Figure: gestural scores for "seven," dictionary form /sɛvən/ vs. casual speech /sɛvn/. Lips tier: fricative /v/; tongue tier: fricative /s/, wide /ɛ/, neutral /ə/, closed /n/.)
● The same mechanism represents pronunciation variability:
  – "Seven": /vən/ → /vn/ if the tongue closes before the lips open
  – "Eight": /et/ → /eʔ/ if the glottis closes before the tongue tip closes
An Articulatory Feature Model (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
(Figure: a DBN with three parallel articulatory chains per frame — lips, tongue, and glottis — each with its own position and increment variables (linc_t, tinc_t, ginc_t), plus the word variables w_t and winc_t.)
● There is no "phonestate" variable. Instead, we use a vector q_t → [l_t, t_t, g_t]:
  – Lipstate variable l_t
  – Tonguestate variable t_t
  – Glotstate variable g_t
(A state-space sketch follows below.)
Experimental Test (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● Training and test data: CUAVE corpus
  – Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002
  – 169 utterances used, 10 digits each, silence between words
  – Recorded without audio or video noise (studio lighting; silent background)
● Audio prepared by Kate Saenko at MIT
  – NOISEX speech babble added at various SNRs
  – MFCC+d+dd feature vectors, 10 ms frames
● Video prepared by Amar Subramanya at UW
  – Feature vector = DCT of the lip rectangle
  – Upsampled from 33 ms frames to 10 ms frames
● Experimental condition: train-test mismatch
  – Training on clean data
  – Audio/video weights tuned on noise-specific dev sets
  – Language model: uniform (all words equally probable), constrained to have the right number of words per utterance
Experimental Questions (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
1) Does video reduce word error rate?
2) Does audio-video asynchrony reduce word error rate?
3) Should asynchrony be represented as
   1) audio-video asynchrony (CHMM), or
   2) lips-tongue-glottis asynchrony (AFM)?
4) Is it better to use only the CHMM, only the AFM, or a combination of both methods?
Results, part 1: Should we use video? Answer: YES. Audio-Visual WER < Single-stream WER
Results, part 2: Are Audio and Video asynchronous? Answer: YES. Async WER < Sync WER.
Results, part 3: Should we use CHMM or AFM? Answer: DOESN’T MATTER! WERs are equal.
Results, part 4: Should we combine systems? Answer: YES. Best is AFM+CH1+CH2 ROVER
Conclusions
● Video feature extraction:
  – A manifold discriminant is better than a global discriminant
● Audio feature extraction:
  – Beamformer: delay-and-sum beats filter-and-sum
  – Postfilter: spectral subtraction gives the best WER (though MMSE-logSA sounds best)
  – VAD: backoff noise estimation works best on this corpus
● Audio-video fusion:
  – Video reduces WER in train-test mismatch conditions
  – Audio and video are asynchronous (CHMM)
  – Lips, tongue and glottis are asynchronous (AFM)
  – It doesn't matter whether you use the CHMM or the AFM, but...
  – Best result: combine both representations