Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability Mark Hasegawa-Johnson Electrical and Computer Engineering.


2 Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability Mark Hasegawa-Johnson Electrical and Computer Engineering

3 Audio-Visual Speech Recognition
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beam-Form, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

4 I. Video Noise

5 AVICAR Database
● AVICAR = Audio-Visual In a CAR
● 100 Talkers
● 4 Cameras, 7 Microphones
● 5 noise conditions: Engine idling, 35mph, 35mph with windows open, 55mph, 55mph with windows open
● Three types of utterances:
– Digits & Phone numbers, for training and testing phone-number recognizers
– TIMIT sentences, for training and testing large vocabulary speech recognition
– Isolated Letters, to test the use of video for an acoustically hard recognition problem

6 AVICAR Recording Hardware (Lee, Hasegawa-Johnson et al., ICSLP 2004)
● 4 Cameras, Glare Shields, Adjustable Mounting. Best Place = Dashboard.
● 8 Mics, Pre-amps, Wooden Baffle. Best Place = Sunvisor.
● System is not permanently installed; mounting requires 10 minutes.

7 AVICAR Video Noise
● Lighting: Many different angles, many types of weather
● Interlace: 30fps NTSC encoding used to transmit data from camera to digital video tape
● Facial Features:
– Hair
– Skin
– Clothing
– Obstructions

8 AVICAR Noisy Image Examples

9 Related Problem: Dimensionality
● Dimension of the raw grayscale lip rectangle: 30x200=6000 pixels
● Dimension of the DCT of the lip rectangle: 30x200=6000 dimensions
● Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25x25=625 dimensions
● Truncated DCT typically used in AVSR: 4x4=16 dimensions
● Dimension of “geometric lip features” that allow high-accuracy AVSR (e.g., Chu and Huang, 2000): 3 dimensions (lip height, lip width, vertical asymmetry)
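The truncated DCT in the list above can be sketched in a few lines. This is a minimal illustration, not the feature extractor used in the talk: it computes an unnormalized 2-D DCT-II and keeps only the lowest-frequency 4x4 block, so a 30x200 lip rectangle becomes 16 numbers. The function name `truncated_dct2` and the synthetic image are assumptions made for the example.

```python
import math

def truncated_dct2(image, keep_rows=4, keep_cols=4):
    """Return the low-frequency keep_rows x keep_cols block of the
    (unnormalized) 2-D DCT-II of image, flattened row by row."""
    n_rows, n_cols = len(image), len(image[0])
    coeffs = []
    for u in range(keep_rows):
        row_basis = [math.cos(math.pi * (2 * r + 1) * u / (2 * n_rows))
                     for r in range(n_rows)]
        for v in range(keep_cols):
            col_basis = [math.cos(math.pi * (2 * c + 1) * v / (2 * n_cols))
                         for c in range(n_cols)]
            total = 0.0
            for r in range(n_rows):
                row_sum = sum(image[r][c] * col_basis[c] for c in range(n_cols))
                total += row_basis[r] * row_sum
            coeffs.append(total)
    return coeffs

# Synthetic 30x200 "lip rectangle" (hypothetical pixel values).
lip_rect = [[(r * c) % 7 for c in range(200)] for r in range(30)]
features = truncated_dct2(lip_rect)
print(len(features))  # 16
```

Only the 16 retained coefficients are computed, which is why the loops run over `keep_rows` and `keep_cols` rather than over the full 30x200 transform.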

10 Dimensionality Reduction: The Classics
Principal Components Analysis (PCA):
● Project onto eigenvectors of the total covariance matrix
● Projection includes noise
Linear Discriminant Analysis (LDA):
● Project onto $v = W^{-1}\,\Delta\mu$, where $W$ = within-class covariance and $\Delta\mu$ = difference between class means
● Projection reduces noise
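A small two-class example may make the LDA projection concrete. The sketch below assumes the standard two-class Fisher form (within-class scatter inverted and applied to the difference of class means), uses 2-D points so the matrix inverse can be written by hand, and uses scatter matrices rather than covariances, which only changes the scale of the direction. All function names and the toy data are assumptions for illustration.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """2x2 scatter matrix of points around mu."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class_a, class_b):
    """Two-class Fisher direction: inverse within-class scatter times mean difference."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    w = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det],
           [-w[1][0] / det, w[0][0] / det]]
    d = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

# Toy data: large within-class spread along x ("noise"), class separation along y.
class_a = [(-3.0, 1.0), (-1.0, 1.2), (1.0, 0.8), (3.0, 1.0)]
class_b = [(-3.0, -1.0), (-1.0, -0.8), (1.0, -1.2), (3.0, -1.0)]
v = fisher_direction(class_a, class_b)
print(v)
```

On this data the total variance is largest along x, so PCA would keep the noisy axis, while the Fisher direction points almost entirely along y, the axis that separates the classes.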

11 Manifold Estimation (e.g., Roweis and Saul, Science 2000)
Neighborhood Graph
● Node = data point
● Edge = connect each data point to its K nearest neighbors
Manifold Estimation
● The K nearest neighbors of each data point define the local (K-1)-dimensional tangent space of a manifold
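The neighborhood graph described above is easy to build by brute force. The following sketch assumes Euclidean distance and a small data set; the function name `knn_graph` and the sample points are made up for the example, and a real system would likely use a spatial index instead of comparing all pairs.

```python
import math

def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbors (Euclidean)."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Two clusters on a line; edges mostly stay within a cluster.
points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (10.0, 0.0), (11.0, 0.0)]
graph = knn_graph(points, k=2)
print(graph)
```

Note that the graph is directed: point 3 lists point 2 as a neighbor because its own cluster has only one other member, but point 2 does not list point 3.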

12 Local Discriminant Graph (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
Maximize Local Inter-Manifold Interpolation Errors, subject to a constant Same-Class Interpolation Error:
● Find $P$ to maximize $\varepsilon_D = \sum_i \| P^T (x_i - \sum_k c_k y_k) \|^2$, $y_k \in \mathrm{KNN}(x_i)$, other classes
● Subject to $\varepsilon_S$ = constant, $\varepsilon_S = \sum_i \| P^T (x_i - \sum_j c_j x_j) \|^2$, $x_j \in \mathrm{KNN}(x_i)$, same class

13 PCA, LDA, LDG: Experimental Test (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007) Lip Feature Extraction: DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph

14 Lip Reading Results (Digits) (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007) DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph

15 II. Audio Noise

16 Audio Noise

17 Audio Noise Compensation
● Beamforming
– Filter-and-sum (MVDR) vs. Delay-and-sum
● Post-Filter
– MMSE log spectral amplitude estimator (Ephraim and Malah, 1984) vs. Spectral Subtraction
● Voice Activity Detection
– Likelihood ratio method (Sohn and Sung, ICASSP 1998)
– Noise estimates: fixed noise; time-varying noise (autoregressive estimator); high-variance noise (backoff estimator)
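Of the two beamformers named above, delay-and-sum is the simpler one, and a toy version fits in a few lines. The sketch assumes the per-microphone delays are known integer sample counts, which a real array would have to estimate; the function name and signals are hypothetical, and MVDR filter-and-sum is not shown.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average the channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]

source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic_a = source + [0.0, 0.0]   # wavefront arrives with no delay
mic_b = [0.0, 0.0] + source   # wavefront arrives two samples later
out = delay_and_sum([mic_a, mic_b], delays=[0, 2])
print(out == source)  # True
```

After alignment the speech adds coherently across microphones, while noise that is uncorrelated between microphones is averaged down.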

18 MVDR Beamformer + MMSElogSA Postfilter (MVDR = Minimum variance distortionless response) (MMSElogSA = MMSE log spectral amplitude estimator) (Proof of optimality: Balan and Rosca, ICASSP 2002)

19 Word Error Rate: Beamformers
● Ten-digit phone numbers; trained and tested with 50/50 mix of quiet (idle) and noisy (55mph open)
● DS=Delay-and-sum; MVDR=Minimum variance distortionless response

20 Word Error Rate: Postfilters

21 Voice Activity Detection
● Most errors at low SNR occur because noise is misrecognized as speech
● Effective solution: voice activity detection (VAD)
● Likelihood ratio VAD (Sohn and Sung, ICASSP 1998): $\Lambda_t = \log \{ p(X_t = S_t + N_t) / p(X_t = N_t) \}$
– $X_t$ = Measured Power Spectrum
– $S_t$, $N_t$ = Exponentially Distributed Speech, Noise
– $\Lambda_t >$ threshold → Speech Present
– $\Lambda_t <$ threshold → Speech Absent
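The likelihood ratio test above can be written out directly for exponentially distributed power spectra. In this sketch the measured power in each frequency bin is modeled as exponential with mean equal to the noise variance (speech absent) or the speech-plus-noise variance (speech present). Averaging the per-bin log ratios, the zero threshold, and the known speech variance are all assumptions made for the example.

```python
import math

def log_likelihood_ratio(power, noise_var, speech_var):
    """Average over frequency bins of log p(X | speech + noise) - log p(X | noise)."""
    total = 0.0
    for x, n, s in zip(power, noise_var, speech_var):
        # Exponential density with mean m: p(x) = exp(-x / m) / m
        log_h1 = -x / (s + n) - math.log(s + n)
        log_h0 = -x / n - math.log(n)
        total += log_h1 - log_h0
    return total / len(power)

def is_speech(power, noise_var, speech_var, threshold=0.0):
    """Declare speech present when the log likelihood ratio exceeds the threshold."""
    return log_likelihood_ratio(power, noise_var, speech_var) > threshold

noise_var = [1.0, 1.0, 1.0]
speech_var = [4.0, 4.0, 4.0]
print(is_speech([10.0, 10.0, 10.0], noise_var, speech_var))  # True
print(is_speech([0.5, 0.5, 0.5], noise_var, speech_var))     # False
```

The decision depends heavily on `noise_var`, which is why the choice of noise estimator matters at low SNR.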

22 VAD: Noise Estimators
● Fixed estimate: $N_0$ = average of first 10 frames
● Autoregressive estimator (Sohn and Sung): $N_t = \alpha_t X_t + (1-\alpha_t) N_{t-1}$, where $\alpha_t$ = function of $X_t$, $N_0$
● Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007): $N_t = \alpha_t X_t + (1-\alpha_t) N_0$
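The three noise estimators differ only in what the current frame is blended with, which a single-bin sketch makes visible. The slide does not give the smoothing weight, only that it depends on the current frame and the initial estimate, so `example_alpha` below is a hypothetical stand-in, as are the function names and the test signal.

```python
def fixed_estimate(frames, n_init=10):
    """Fixed estimate: average of the first n_init frames."""
    init = frames[:n_init]
    return sum(init) / len(init)

def autoregressive_estimates(frames, n0, alpha):
    """Blend each frame with the previous estimate."""
    estimates, prev = [], n0
    for x in frames:
        a = alpha(x, n0)
        prev = a * x + (1 - a) * prev
        estimates.append(prev)
    return estimates

def backoff_estimates(frames, n0, alpha):
    """Blend each frame with the initial estimate instead of the previous one."""
    return [alpha(x, n0) * x + (1 - alpha(x, n0)) * n0 for x in frames]

def example_alpha(x, n0):
    # Hypothetical weight: trust frames near the initial noise level,
    # distrust frames that are much louder.
    return 0.5 * n0 / (n0 + x)

frames = [1.0] * 10 + [1.0, 9.0, 1.0]   # power in one frequency bin
n0 = fixed_estimate(frames)
ar = autoregressive_estimates(frames[10:], n0, example_alpha)
bo = backoff_estimates(frames[10:], n0, example_alpha)
print(ar, bo)
```

After the loud frame, the autoregressive estimate stays elevated because it carries the previous estimate forward, while the backoff estimate returns to the initial level on the next quiet frame.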

23 Word Error Rate: Digits

24 III. Pronunciation Variability

25 Graphical Methods: Dynamic Bayesian Network
● Bayesian Network = A Graph in which
– Nodes are Random Variables (RVs)
– Edges Represent Dependence
● Dynamic Bayesian Network = A BN in which
– RVs are repeated once per time step
● Example: an HMM is a DBN
– Most important RV: the “phonestate” variable $q_t$
– Typically $q_t \in \{\text{Phones}\} \times \{1,2,3\}$
– Acoustic features $x_t$ and video features $y_t$ depend on $q_t$

26 Example: HMM is a DBN
[Figure: DBN with nodes q, φ, x, y, w, winc, qinc in Frame t-1 and Frame t]
● $q_t$ is the phonestate, e.g., $q_t \in \{ /w/1, /w/2, /w/3, /n/1, /n/2, \ldots \}$
● $w_t$ is the word label at time t, for example, $w_t \in \{\text{“one”}, \text{“two”}, \ldots\}$
● $\phi_t$ is the position of phone $q_t$ within word $w_t$: $\phi_t \in \{\text{1st}, \text{2nd}, \text{3rd}, \ldots\}$
● $\mathrm{qinc}_t \in \{0,1\}$ specifies whether $\phi_{t+1} = \phi_t$ or $\phi_{t+1} = \phi_t + 1$
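The bookkeeping among the word, position, phonestate, and increment variables can be traced with a deterministic unrolling. In the sketch below the increment bits are supplied by hand, whereas in the actual model they would be random variables; the toy lexicon, the phone labels, and the function names are assumptions for illustration.

```python
def phonestates(phones):
    """Expand each phone into three sub-states, e.g. /w/1, /w/2, /w/3."""
    return [f"/{p}/{k}" for p in phones for k in (1, 2, 3)]

# Hypothetical toy lexicon: word -> sequence of phonestates.
LEXICON = {
    "one": phonestates(["w", "ah", "n"]),
    "two": phonestates(["t", "uw"]),
}

def unroll(words, qinc):
    """Return one (word, position, phonestate, winc) tuple per frame.

    qinc[t] = 1 advances the position after frame t; winc is 1 when that
    advance leaves the last phonestate of the current word."""
    word_idx, pos, frames = 0, 0, []
    for inc in qinc:
        word = words[word_idx]
        states = LEXICON[word]
        winc = int(inc == 1 and pos == len(states) - 1)
        frames.append((word, pos + 1, states[pos], winc))
        if winc:
            word_idx, pos = word_idx + 1, 0
            if word_idx == len(words):
                break
        elif inc:
            pos += 1
    return frames

for frame in unroll(["two"], qinc=[0, 1, 1, 1, 0, 1, 1, 1]):
    print(frame)
```

Each frame's phonestate is fully determined by the word and the position within it, which is why the acoustic and video observations only need to depend on the phonestate.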

27 Pronunciation Variability
● Even when reading phone numbers, talkers “blend” articulations.
● For example: “seven eight”: /sɛvәnet/ → /sɛvneʔ/
● As speech gets less formal, pronunciation variability gets worse, e.g., worse in a car than in the lab; worse in conversation than in read speech

28 A Related Problem: Asynchrony
● Audio and Video information are not synchronous
● For example: “th” (/θ/) in “three” is visible, but not yet audible, because the audio is still silent
● Should HMM be in $q_t$ = “silence,” or $q_t$ = /θ/?

29 A Solution: Two State Variables (Chu and Huang, ICASSP 2000)
[Figure: DBN for Frame t-1 and Frame t with word variables w, winc; audio variables q, φ, qinc, x; video variables v, ψ, vinc, y; and asynchrony δ]
● Coupled HMM (CHMM): Two parallel HMMs
● $q_t$: Audio state ($x_t$: audio observation)
● $v_t$: Video state ($y_t$: video observation)
● $\delta_t = \phi_t - \psi_t$: Asynchrony between the audio position $\phi_t$ and the video position $\psi_t$, capped at $|\delta_t| < 3$
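The effect of the asynchrony cap on the joint state space can be checked by enumeration. The sketch below lists every pair of audio position and video position whose difference is smaller than 3 in magnitude; the function name and the five-position word are assumptions for the example.

```python
def allowed_joint_states(n_positions, max_async=3):
    """All (audio position, video position) pairs with |difference| < max_async."""
    return [(a, v)
            for a in range(n_positions)
            for v in range(n_positions)
            if abs(a - v) < max_async]

states = allowed_joint_states(5)
print(len(states))  # 19
```

For a five-position word the cap keeps 19 of the 25 possible pairs, so the two streams may drift apart by up to two positions but never more.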

30 Asynchrony in Articulatory Phonology (Livescu and Glass, 2004)
● It’s not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous
[Figure: DBN with a word variable and three feature streams, each with variables ind, U, and S, linked by sync 1,2 and sync 2,3 variables]

31 Asynchrony in Articulatory Phonology
● It’s not really the AUDIO and VIDEO that are asynchronous…
● It is the LIPS, TONGUE, and GLOTTIS that are asynchronous
[Figure: gestural scores over time for “three” in dictionary form and in casual speech; tongue tier: dental /θ/, retroflex /r/, palatal /i/; glottis tier: silent, unvoiced, voiced]

32 Asynchrony in Articulatory Phonology
[Figure: gestural scores over time for “seven” in dictionary form /sɛvәn/ and in casual speech /sɛvn/; lips tier: fricative /v/; tongue tier: fricative /s/, wide /ɛ/, neutral /ә/, closed /n/]
● Same mechanism represents pronunciation variability:
– “Seven”: /vәn/ → /vn/ if tongue closes before lips open
– “Eight”: /et/ → /eʔ/ if glottis closes before tongue tip closes

33 An Articulatory Feature Model (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
[Figure: DBN for Frame t-1 and Frame t with word variables w, winc and, for each articulator, a state variable (l, t, g) with its increment variable (linc, tinc, ginc) and its position and asynchrony variables]
● There is no “phonestate” variable. Instead, we use a vector $q_t \rightarrow [l_t, t_t, g_t]$
– Lipstate variable $l_t$
– Tonguestate variable $t_t$
– Glotstate variable $g_t$

34 Experimental Test (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
● Training and test data: CUAVE corpus
– Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002
– 169 utterances used, 10 digits each, silence between words
– Recorded without audio or video noise (studio lighting; silent background)
● Audio prepared by Kate Saenko at MIT
– NOISEX speech babble added at various SNRs
– MFCC+d+dd feature vectors, 10ms frames
● Video prepared by Amar Subramanya at UW
– Feature vector = DCT of lip rectangle
– Upsampled from 33ms frames to 10ms frames
● Experimental Condition: Train-Test Mismatch
– Training on clean data
– Audio/video weights tuned on noise-specific dev sets
– Language model: uniform (all words equal probability), constrained to have the right number of words per utterance

35 Experimental Questions (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
1) Does Video reduce word error rate?
2) Does Audio-Video Asynchrony reduce word error rate?
3) Should asynchrony be represented as
   1) Audio-Video Asynchrony (CHMM), or
   2) Lips-Tongue-Glottis Asynchrony (AFM)?
4) Is it better to use only CHMM, only AFM, or a combination of both methods?

36 Results, part 1: Should we use video? Answer: YES. Audio-Visual WER < Single-stream WER

37 Results, part 2: Are Audio and Video asynchronous? Answer: YES. Async WER < Sync WER.

38 Results, part 3: Should we use CHMM or AFM? Answer: DOESN’T MATTER! WERs are equal.

39 Results, part 4: Should we combine systems? Answer: YES. Best is AFM+CH1+CH2 ROVER
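ROVER combines recognizer outputs by voting. The full procedure first aligns the hypotheses and can weight votes by confidence; the sketch below skips both and assumes the hypotheses already have equal length, which is plausible here only because the language model was constrained to the right number of words per utterance. The function name and the example word strings are made up, not results from the talk.

```python
from collections import Counter

def vote(hypotheses):
    """Pick the most common word at each position of equal-length hypotheses.

    Ties go to the word seen first, i.e. to the earliest-listed system."""
    lengths = {len(h) for h in hypotheses}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes hypotheses are already aligned")
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

# Hypothetical outputs from three systems for one utterance.
afm = ["one", "nine", "three"]
ch1 = ["one", "two", "three"]
ch2 = ["four", "two", "three"]
combined = vote([afm, ch1, ch2])
print(combined)  # ['one', 'two', 'three']
```

Voting helps when the systems make different errors: each mistake above is outvoted by the other two systems.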

40 Conclusions
● Video Feature Extraction:
– Manifold discriminant is better than a global discriminant
● Audio Feature Extraction:
– Beamformer: Delay-and-sum beats Filter-and-sum
– Postfilter: Spectral subtraction gives best WER (though MMSE-logSA sounds best)
– VAD: Backoff noise estimation works best in this corpus
● Audio-Video Fusion:
– Video reduces WER in train-test mismatch conditions
– Audio and video are asynchronous (CHMM)
– Lips, tongue and glottis are asynchronous (AFM)
– It doesn’t matter whether you use CHMM or AFM, but...
– Best result: combine both representations
