1
HIWIRE Progress Report
Technical University of Crete, Speech Processing and Dialog Systems Group
Presenters: Alex Potamianos (WP1), Vassilis Diakoloukas (WP2)
2
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
3
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
4
Baseline
Baseline performance, completed:
- Aurora 2 on HTK
- Aurora 3 on HTK
- Aurora 4 on HTK
- Lattices for Aurora 4
Baseline performance, ongoing:
- WSJ1 (Decipher)
- DMHMMs (Decipher)
5
Aurora 2 Database
Based on TIdigits downsampled to 8 kHz; noise artificially added at several SNRs.
3 sets of noises:
- A: subway, babble, car, exhibition hall
- B: restaurant, street, airport, train station
- C: subway, street (with different frequency characteristics)
Two training conditions:
- Training on clean data
- Multi-condition training on noisy data
6
Aurora 2 Database
- 8440 training sentences
- 1001 test sentences per test set
Three front-end configurations:
- HTK default
- WI007 (Aurora 2 distribution)
- WI008 (thanks to Prof. Segura)
7
Aurora 2: Clean Training (HTK default front-end)
8
Aurora 2: Multi-Condition Training (HTK default front-end)
9
Aurora 2: Clean vs Multi-Condition Training
10
Aurora 2 Front End Comparison: Clean Training
11
Aurora 2 Front End Comparison: Multi-Condition Training
12
Aurora 3 Database
- 5 languages: Finnish, German, Italian, Spanish, Danish
- 3 noise conditions: quiet, low noise (low), high noise (high)
- 2 recording modes: close-talking microphone (ch0), hands-free microphone (ch1)
13
Aurora 3 Database
3 experimental setups:
- Well-Matched (WM): 70% of all utterances in quiet, low, and high conditions used for training; the remaining 30% used for testing
- Medium Mismatched (MM): 100% of hands-free recordings from quiet and low for training; 100% of hands-free recordings from high for testing
- High Mismatched (HM): 70% of close-talking recordings from all noise conditions for training; 30% of hands-free recordings from low and high for testing
14
Baseline Aurora 3 performance
16
Baseline Aurora 3 with WI007 FE (TUC vs. UGR comparison), word accuracy (%):

               Finnish                 Spanish                 German
Front-end      WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007-TUC      90.53   72.50   30.35   86.88   73.72   42.23   90.58   79.06   74.24
WI007-UGR      92.74   80.51   40.53   92.94   80.31   51.55   91.20   81.04   73.17
Train (#sent)  1778    561     889     3392    1607    1696    2032    997     1007
Test (#sent)   770     1462    83      1522    850     631     867     241     394

               Danish                  Italian
Front-end      WM      MM      HM      WM      MM      HM
WI007-TUC      79.62   49.29   33.15   93.64   82.02   39.84
WI007-UGR      87.28   67.32   39.37   93.64   82.02   39.84
Train (#sent)  3440    1254    1720    2951    1245    1720
Test (#sent)   1474    204     658     1309    405     626
17
Baseline Aurora 3 with WI007 FE (TUC vs. UGR comparison)
18
Baseline Aurora 3 with WI008 FE (TUC vs. UGR comparison)
19
Aurora 4 Database
- Based on the WSJ phase 0 collection; 5000-word vocabulary
- 7138 training utterances (as in the ARPA evaluation)
- 2 recording microphones
- 6 different noises artificially added: car, babble, restaurant, street, airport, train station
20
Aurora 4 Training Data Sets
3 training conditions (clean, multi-condition, noisy), 7138 utterances each (as in the ARPA evaluation):
- Clean training: 7138 utterances (Sennheiser microphone)
- Multi-condition training: 3569 utterances (Sennheiser) + 3569 utterances (2nd mic); within each half, 893 utterances have no noise added and 2676 have 1 of 6 noises added at SNRs between 10 and 20 dB
21
Aurora 4 Test Sets
14 test sets, in 2 sizes: small (166 utts) and large (330 utts). The large versions:
- Set 1: 330 utt. (Sennheiser microphone)
- Sets 2-7: 330 utt. each (Sennheiser mic; noises 1-6 added at SNRs between 5 and 15 dB)
- Set 8: 330 utt. (2nd microphone)
- Sets 9-14: 330 utt. each (2nd mic; noises 1-6 added at SNRs between 5 and 15 dB)
22
Lattices
- Obtained from the SONIC recognizer: real-time decoding for the WSJ 5k task, state-of-the-art performance (8% WERR)
- Lattices obtained from clean models
- Three lattice sizes: small, medium, large
- Fixed branching factor for each lattice size (small = 2.5, medium = 4, large = 5.5)
- Speed-up factors compared to HTK decoding: x100, x50, x10
23
Baseline Aurora 4 with Lattices
25
Baseline Aurora 4 (Comparing Lattices)
26
Aurora 4 Baseline: Conclusions on Lattices
- Lattices speed up recognition: the medium-size lattice is ~60 times faster, the small-size lattice ~108 times faster
- Problem: inflated performance on the noisy test sets
- Be careful when using lattices in mismatched conditions (clean training, noisy data)!
- Solution: two sets of lattices: matched and mismatched
27
Audio-Visual ASR: Database
Subset of the CUAVE database used:
- 36 speakers (30 training, 6 testing)
- 5 sequences of 10 connected digits per speaker
- Training set: 1500 digits (30 x 5 x 10)
- Test set: 300 digits (6 x 5 x 10)
CUAVE also contains more complex data sets: speaker moving around, speaker shown in profile, continuous digits, two speakers (to be used in future evaluations).
28
CUAVE Database Speakers
29
Audio-Visual ASR: Feature Extraction
Lip region of interest (ROI) tracking (a sketch follows below):
- A fixed-size ROI is detected using template matching
- The ROI minimizes the RGB Euclidean distance to a given ROI template
- The ROI template is selected from the 1st frame of each speaker
- Continuity constraint: search within a 20x20 pixel window around the previous frame's ROI (does not work for rapid speaker movements)
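A minimal sketch of this tracking step, assuming frames and the template are 8-bit RGB numpy arrays; the function and parameter names are illustrative, not taken from the HIWIRE code:

```python
import numpy as np

def track_roi(frame, template, prev_yx, search=10):
    """Return the top-left (y, x) of the fixed-size ROI in `frame` that
    minimizes the RGB Euclidean distance to `template`, searching only a
    20x20-pixel window (+/-10) around the previous frame's ROI position
    (the continuity constraint from the slide)."""
    h, w, _ = template.shape
    py, px = prev_yx
    best_d, best_yx = np.inf, prev_yx
    tmpl = template.astype(np.float64)
    for y in range(max(0, py - search), min(frame.shape[0] - h, py + search) + 1):
        for x in range(max(0, px - search), min(frame.shape[1] - w, px + search) + 1):
            patch = frame[y:y + h, x:x + w].astype(np.float64)
            d = np.sum((patch - tmpl) ** 2)   # squared RGB Euclidean distance
            if d < best_d:
                best_d, best_yx = d, (y, x)
    return best_yx
```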
31
Audio-Visual ASR: Feature Extraction
Features extracted from the ROI (see the sketch below):
- The ROI is transformed to grayscale and decimated to a 16x16 pixel region
- A 2D separable DCT is applied to the 16x16 pixel region
- The upper-left 6x6 coefficient block is kept, excluding the first coefficient, giving a 35-dimensional feature vector
- The feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps
- First and second derivatives in time are computed using a 6-frame window (feature size 105)
- Sanity check: unsupervised k-means clustering of ROI results in ...
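A sketch of the static-feature computation under the same assumptions (numpy/scipy; scipy.fftpack.dct applied twice gives the separable 2D transform):

```python
import numpy as np
from scipy.fftpack import dct

def roi_dct_features(roi16):
    """roi16: 16x16 grayscale ROI (already decimated). Applies a 2D
    separable DCT, keeps the upper-left 6x6 coefficient block, and drops
    the first (DC) coefficient, yielding the 35 static visual features."""
    coeffs = dct(dct(roi16, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:6, :6].flatten()[1:]

def resample_in_time(feats, fps_in=29.97, fps_out=100.0):
    """Linearly interpolate each feature dimension from the NTSC video
    rate to the 100 fps audio frame rate. feats: (n_frames, n_dims)."""
    n_in = feats.shape[0]
    t_in = np.arange(n_in) / fps_in
    t_out = np.arange(int(np.floor(t_in[-1] * fps_out)) + 1) / fps_out
    return np.stack([np.interp(t_out, t_in, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)
```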
33
Experiments
- Recognition experiment: open-loop digit grammar (50 digits per utterance, no endpointing)
- Classification experiment: single-digit grammar (endpointed digits based on the provided segmentation)
34
Models
Features:
- Audio: 39 features (MFCC_D_A)
- Visual: 105 features (ROIDCT_D_A)
- Audio-visual: 39+35 features (MFCC_D_A + ROIDCT)
HMM models:
- 8-state, left-to-right, whole-digit HMMs with no state skipping
- Single Gaussian mixture
- The audio-visual HMM uses separate audio and video feature streams with equal weights (1,1); the corresponding combination rule is given below
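For reference, the two-stream model corresponds to the standard multi-stream HMM output probability (textbook form, with the slide's equal stream weights):

$$ b_j(o_t) \;=\; \Big[ b_j^{(a)}\big(o_t^{(a)}\big) \Big]^{\gamma_a} \, \Big[ b_j^{(v)}\big(o_t^{(v)}\big) \Big]^{\gamma_v}, \qquad \gamma_a = \gamma_v = 1 . $$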
35
Results (Word Accuracy)
Data: training on 1500 digits (30 speakers), testing on 300 digits (6 speakers).

                 Audio   Visual   Audio-Visual
Recognition      98%     26%      78%
Classification   99%     46%      85%
36
Future Work
- Multi-mixture models
- Front-end (NTUA): tracking algorithms, feature extraction
- Feature combination: feature integration, feature weighting
37
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
38
Feature extraction and combination
- Noise robust features (NTUA) – m12
- AM-FM features (NTUA) – m12
- Feature combination – m12
- Supra-segmental features (see also segment models) – m18
39
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
40
Segment Models
- Baseline system
- Supra-segmental features: phone transition modeling – m12, prosody modeling – m18, stress modeling – m18
- Parametric modeling of feature trajectories
- Dynamical system modeling
- Combine with HMMs
41
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
42
Blind Source Separation (Mokios, Sidiropoulos)
Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data.
Collect spatial covariance matrix estimates which are sufficiently separated in time:

R(t) = A D(t) A^H + σ² I

Assumptions:
- Uncorrelated speaker signals and noise
- D(t) is a diagonal matrix of speaker powers for measurement period t
- σ² denotes the noise power (estimated from silence intervals)
A sketch of the decomposition step follows below.
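A rough sketch of the PARAFAC step, assuming real-valued mixing and using the open-source tensorly library for the CP/PARAFAC decomposition; this is not the HIWIRE implementation, only an illustration of the model above:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def estimate_mixing(R_stack, n_sources, sigma2):
    """R_stack: (T, M, M) spatial covariance estimates from T measurement
    periods sufficiently separated in time; sigma2: noise power estimated
    from silence intervals. Each slice is modeled as
    R(t) = A D(t) A^T + sigma2*I, so after removing the noise term the
    stack is a rank-n_sources three-way tensor whose second and third
    factor matrices both estimate the mixing matrix A (up to scaling
    and permutation)."""
    T, M, _ = R_stack.shape
    denoised = R_stack - sigma2 * np.eye(M)[None, :, :]   # remove noise term
    weights, factors = parafac(tl.tensor(denoised), rank=n_sources)
    powers, A_est, _ = factors   # (T, K) speaker powers, (M, K) mixing matrix
    return A_est
```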
43
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
44
Acoustic Model Adaptation
- Adaptation method: Bayes optimal classification
- Acoustic models: discrete mixture HMMs (DMHMMs)
45
Bayes optimal classification
Classifier decision for a test data vector x_test: choose the class c with the highest predictive score, obtained by integrating the class likelihood over the parameter posterior given the training data y:

P(c | x_test, y) ∝ P(c) ∫ p(x_test | θ, c) p(θ | y, c) dθ
46
Bayes optimal versus MAP
Assumption: the posterior is sufficiently peaked around its most probable point.
MAP approximation: θ_MAP is the set of parameters that maximizes the posterior,

θ_MAP = argmax_θ p(θ | y) = argmax_θ p(y | θ) p(θ),

so the integral above is replaced by the plug-in score p(x_test | θ_MAP, c).
47
Why Bayes optimal classification
- Optimal classification criterion: the predictions of all parameter hypotheses are combined
- Better discrimination
- Needs less training data
- Faster asymptotic convergence to the ML estimate
48
Why Bayes optimal classification
However:
- Computationally more expensive
- Difficult to find analytical solutions
... hence some approximations must still be considered. A toy comparison with MAP follows below.
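As a toy illustration (not from the report) of why the full Bayesian score differs from the MAP plug-in, consider 1-D Gaussian class models with an unknown mean and a conjugate Gaussian prior; the predictive integral then has a closed form whose extra variance term is exactly what MAP ignores:

```python
import numpy as np

def bayes_predictive_loglik(x, data, prior_mu=0.0, prior_var=10.0, noise_var=1.0):
    """log p(x | data): the class likelihood N(x; theta, noise_var)
    integrated against the posterior p(theta | data). In the conjugate
    Gaussian case the posterior over the mean is Gaussian, so the
    predictive is N(x; post_mu, noise_var + post_var)."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    var = noise_var + post_var          # parameter uncertainty widens the score
    return -0.5 * (np.log(2 * np.pi * var) + (x - post_mu) ** 2 / var)

def map_loglik(x, data, prior_mu=0.0, prior_var=10.0, noise_var=1.0):
    """MAP plug-in: score with the single most probable mean,
    ignoring its remaining uncertainty."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    theta_map = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    return -0.5 * (np.log(2 * np.pi * noise_var) + (x - theta_map) ** 2 / noise_var)
```

With few adaptation samples the two scores differ markedly (post_var is large); as n grows they converge, which matches the asymptotic-convergence point above.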
49
Approximate Bayesian decision rule (Merhav, Ephraim 1991)
Given:
- Training data y
- Test sequence x
- M, the number of source models
- H_λ, the parameter set of each source
However:
- Still difficult to implement
- Relies on strong assumptions
50
Discrete-Mixture HMMs (Digalakis et al. 2000)
- Based on sub-vector quantization
- Introduces a new form of observation distributions (see the sketch below)
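A schematic of this observation model, with illustrative data structures: one codebook per sub-vector, and table[k][s][q] holding the probability of codeword q of sub-vector s under mixture component k. This is a sketch of the discrete-mixture form, not the Decipher implementation:

```python
import numpy as np

def quantize_subvectors(o, codebooks):
    """Split the feature vector o into sub-vectors and map each one to the
    index of its nearest codeword (one codebook per sub-vector)."""
    indices, start = [], 0
    for cb in codebooks:                      # cb: (n_codewords, sub_dim)
        sub = o[start:start + cb.shape[1]]
        indices.append(int(np.argmin(np.sum((cb - sub) ** 2, axis=1))))
        start += cb.shape[1]
    return indices

def dmhmm_state_logprob(o, codebooks, mix_weights, table):
    """Discrete-mixture output probability, in the log domain:
    b(o) = sum_k w_k * prod_s P[k][s][q_s(o)]."""
    q = quantize_subvectors(o, codebooks)
    logs = [np.log(mix_weights[k]) +
            sum(np.log(table[k][s][q[s]]) for s in range(len(q)))
            for k in range(len(mix_weights))]
    return np.logaddexp.reduce(logs)
```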
51
DMHMM benefits (Digalakis et al. 2000)
- Quantization scheme driven by speech recognition performance
- Quantizes the acoustic space in sufficient detail
- Mixtures capture the correlation between sub-vectors
- Well matched to client-server applications
- Comparable performance to continuous HMMs
- Faster decoding speeds
52
DMHMM parameters that could be adapted
- Partitioning into sub-vectors: how many sub-vectors, and which MFCCs form each sub-vector
- Bit allocation: optimize the bit allocation based on the adaptation data
- Discrete mixture weights
- Centroids of the codebooks
- Centroid observation probabilities
53
Adaptation on DMHMMs
Goal: re-estimate the centroid observation distributions.
- Transformation-based adaptation? There may not be enough training data for the number of centroids
- Bayesian adaptation? Could benefit from its convergence property
- Optimal Bayes classification? Easier to find approximate forms for DMHMMs
54
Outline
Work package 1:
- Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
- Audio-Visual ASR: Baseline
- Feature extraction and combination
- Segment models for ASR
- Blind Source Separation for multi-microphone ASR
Work package 2:
- Adaptation
- Data collection
55
TUC Non-Native Recordings
- 10 speakers (6 male, 4 female)
- Fluency in English: 4 excellent, 5 good to very good, 1 satisfactory
- Speaker pronunciation (region of origin): 1 from Cyprus, 3 from Northern Greece, 1 from the Ionian Islands, 2 from the Athens area, 1 from Crete, 1 from Central Greece
56
EXTRA SLIDES
57
Prior Work Overview
- MLST adaptation
- Constrained estimation adaptation
- MAP (Bayes) adaptation
- Genones
- Segment models
- VTLN
- Combinations
- Robust features
58
HIWIRE Work Proposal
- Adaptation: Bayes optimal classification
- Audio-visual ASR: baseline experiments
- Microphone arrays: speech/noise separation
- Feature selection: AM-FM features
- Acoustic modeling: segment models
59
Aurora 2 Performance with HTK FE (Clean Training), word accuracy (%):

SNR    | Subway  Babble  Car     Exhibit | Avg A  | Restaur Street  Airport Station | Avg B  | Sub(M)  Str(M)  | Avg C  | Overall
Clean  | 98.83   98.97   98.81   99.14   | 98.94  | 98.83   98.97   98.81   99.14   | 98.94  | 99.02   98.97   | 99.00  | 98.95
20 dB  | 96.96   89.96   96.84   96.20   | 94.99  | 89.19   95.77   90.07   94.38   | 92.35  | 94.47   95.19   | 94.83  | 93.90
15 dB  | 92.91   73.43   89.53   91.85   | 86.93  | 74.39   88.27   76.89   83.62   | 80.79  | 87.63   89.69   | 88.66  | 84.82
10 dB  | 78.72   49.06   66.24   75.10   | 67.28  | 52.72   66.75   53.15   59.61   | 58.06  | 75.19   75.27   | 75.23  | 65.18
5 dB   | 53.39   27.03   32.80   43.51   | 39.18  | 29.57   38.15   30.69   29.71   | 32.03  | 52.84   48.85   | 50.85  | 38.65
0 dB   | 27.30   11.73   13.27   15.98   | 17.07  | 11.70   18.68   15.84   12.25   | 14.62  | 26.01   21.64   | 23.83  | 17.44
-5 dB  | 12.62   4.96    8.35    7.65    | 8.40   | 5.04    10.07   8.08    8.49    | 7.92   | 12.10   10.70   | 11.40  | 8.81
Avg.   | 65.82   50.73   57.98   61.35   | 58.97  | 51.63   59.52   53.36   55.31   | 54.96  | 63.89   62.90   | 63.40  | 58.25
60
Aurora 2 Performance with HTK FE (Multi-Condition Training), word accuracy (%):

SNR    | Subway  Babble  Car     Exhibit | Avg A  | Restaur Street  Airport Station | Avg B  | Sub(M)  Str(M)  | Avg C  | Overall
Clean  | 98.59   98.52   98.48   98.55   | 98.54  | 98.59   98.52   98.48   98.55   | 98.54  | 98.65   98.52   | 98.59  | 98.55
20 dB  | 97.64   97.61   97.85   96.98   | 97.52  | 96.56   97.46   97.17   96.64   | 96.96  | 97.05   96.43   | 96.74  | 97.14
15 dB  | 96.75   96.80   97.64   96.58   | 96.94  | 94.72   95.92   95.62   95.25   | 95.38  | 95.46   95.50   | 95.48  | 96.02
10 dB  | 94.38   95.22   95.65   93.12   | 94.59  | 90.97   94.20   92.78   92.35   | 92.58  | 92.35   91.90   | 92.13  | 93.29
5 dB   | 88.42   87.67   86.17   86.95   | 87.30  | 81.85   85.34   84.91   82.91   | 83.75  | 81.46   81.86   | 81.66  | 84.75
0 dB   | 65.67   61.03   50.82   61.80   | 59.83  | 56.83   60.22   64.36   54.21   | 58.91  | 45.16   54.05   | 49.61  | 57.42
-5 dB  | 26.01   26.18   19.15   22.49   | 23.46  | 22.60   26.30   27.65   18.88   | 23.86  | 18.61   25.54   | 22.08  | 23.34
Avg.   | 88.57   87.67   85.63   87.09   | 87.24  | 84.19   86.63   86.97   84.27   | 85.52  | 82.30   83.95   | 83.12  | 85.72
61
Aurora 2 Performance with WI008 FE (Clean Training), word accuracy (%); a dash marks a value missing on the original slide:

SNR    | Subway  Babble  Car     Exhibit | Avg A  | Restaur Street  Airport Station | Avg B  | Sub(M)  Str(M)  | Avg C  | Overall
Clean  | 99.08   99.03   99.05   99.23   | 99.10  | 99.08   99.03   99.05   99.23   | 99.10  | 99.02   99.03   | -      | 99.08
20 dB  | 97.88   98.25   98.36   97.81   | 98.08  | 98.07   97.64   98.42   98.43   | 98.14  | 97.36   97.67   | 97.52  | 97.99
15 dB  | 96.38   96.74   97.52   96.70   | 96.84  | 95.33   96.58   97.05   96.76   | 96.43  | 95.30   95.74   | 95.52  | 96.41
10 dB  | 92.26   91.99   95.29   92.59   | 93.03  | 89.87   92.74   93.26   93.86   | 92.43  | 90.33   90.75   | 90.54  | 92.29
5 dB   | 83.88   80.68   86.01   84.05   | 83.66  | 76.05   83.25   83.54   84.20   | 81.76  | 78.88   78.48   | 78.68  | 81.90
0 dB   | 61.93   51.12   66.06   63.50   | 60.65  | 50.26   59.70   60.24   62.23   | 58.11  | 52.59   52.12   | 52.36  | 57.98
-5 dB  | 31.07   18.95   29.82   33.20   | 28.26  | 18.39   29.23   27.32   29.56   | 26.13  | 25.15   26.12   | 25.64  | 26.88
Avg.   | 86.47   83.76   88.65   86.93   | 86.45  | 81.92   85.98   86.50   87.10   | 85.37  | 82.89   82.95   | 82.92  | 85.31
62
Aurora 2 Performance with WI008 FE (Multi-Condition Training), word accuracy (%); a dash marks a value missing on the original slide:

SNR    | Subway  Babble  Car     Exhibit | Avg A  | Restaur Street  Airport Station | Avg B  | Sub(M)  Str(M)  | Avg C  | Overall
Clean  | 99.02   98.82   98.99   99.14   | 98.99  | 99.02   98.82   98.99   99.14   | 98.99  | -       98.85   | 98.92  | 98.98
20 dB  | 98.62   98.58   98.54   98.24   | 98.50  | 98.10   98.13   98.63   98.80   | 98.42  | 98.07   97.94   | 98.01  | 98.37
15 dB  | 97.54   97.91   98.42   97.56   | 97.86  | 96.93   97.85   98.03   97.69   | 97.63  | 97.54   97.73   | 97.64  | 97.72
10 dB  | 95.33   96.07   97.38   95.34   | 96.03  | 94.84   95.59   95.91   96.05   | 95.60  | 95.58   95.31   | 95.45  | 95.74
5 dB   | 91.43   90.21   90.93   90.10   | 90.67  | 87.14   90.39   91.44   90.16   | 89.78  | 88.92   87.52   | 88.22  | 89.82
0 dB   | 75.28   68.71   80.70   76.00   | 75.17  | 65.55   73.85   75.78   74.08   | 72.32  | 66.99   65.63   | 66.31  | 72.26
-5 dB  | 39.85   30.05   40.41   44.99   | 38.83  | 28.52   38.88   40.95   41.75   | 37.53  | 30.43   30.59   | 30.51  | 36.64
Avg.   | 91.64   90.30   93.19   91.45   | 91.65  | 88.51   91.16   91.96   91.36   | 90.75  | 89.42   88.83   | 89.13  | 90.78
63
Aurora 3 HTK Settings: Spanish

Parametrize.csh:
    set Options = "-F RAW -fs 8 -q -noc0 -swap"

Config_tr:
    TARGETKIND = MFCC_E_D_A
    DELTAWINDOW = 3
    ACCWINDOW = 2
    ENORMALISE = F
    HNET:TRACE = 2
    NATURALREADORDER = T
    NATURALWRITEORDER = T
64
Aurora 3 HTK Settings: Italian

Sdc_it.conf:
    $FE_OPTIONS = "-q -F RAW -fs 8"

Config:
    TARGETKIND = MFCC_D_A_E
    HNET:TRACE = 2
    ACCWINDOW = 2
    DELTAWINDOW = 3
    ENORMALISE = F
    NATURALREADORDER = T
    NATURALWRITEORDER = T
65
Baseline Aurora 3 Performance, word accuracy (%):

               Finnish                 Spanish                 German
Front-end      WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007          90.53   72.50   30.35   86.88   73.72   42.23   90.58   79.06   74.24
WI008          95.62   76.68   86.11   93.47   85.41   81.02   94.49   88.73   89.55
Train (#sent)  1778    561     889     3392    1607    1696    2032    997     1007
Test (#sent)   770     1462    83      1522    850     631     867     241     394

               Danish                  Italian
Front-end      WM      MM      HM      WM      MM      HM
WI007          79.62   49.29   33.15   93.64   82.02   39.84
WI008          84.99   65.68   63.91   96.58   88.53   88.22
Train (#sent)  3440    1254    1720    2951    1245    1720
Test (#sent)   1474    204     658     1309    405     626
66
Baseline Aurora 3 with WI007 FE (TUC vs. UGR comparison), word accuracy (%):

               Finnish                 Spanish                 German
Front-end      WM      MM      HM      WM      MM      HM      WM      MM      HM
WI007-TUC      90.53   72.50   30.35   86.88   73.72   42.23   90.58   79.06   74.24
WI007-UGR      92.74   80.51   40.53   92.94   80.31   51.55   91.20   81.04   73.17
Train (#sent)  1778    561     889     3392    1607    1696    2032    997     1007
Test (#sent)   770     1462    83      1522    850     631     867     241     394

               Danish                  Italian
Front-end      WM      MM      HM      WM      MM      HM
WI007-TUC      79.62   49.29   33.15   93.64   82.02   39.84
WI007-UGR      87.28   67.32   39.37   93.64   82.02   39.84
Train (#sent)  3440    1254    1720    2951    1245    1720
Test (#sent)   1474    204     658     1309    405     626
67
Baseline Aurora 3 with WI008 FE (TUC vs. UGR comparison), word accuracy (%):

               Finnish                 Spanish                 German
Front-end      WM      MM      HM      WM      MM      HM      WM      MM      HM
WI008-TUC      95.62   76.68   86.11   93.47   85.41   81.02   94.49   88.73   89.55
WI008-UGR      96.09   80.92   86.61   96.64   93.92   91.55   95.11   90.84   91.25
Train (#sent)  1778    561     889     3392    1607    1696    2032    997     1007
Test (#sent)   770     1462    83      1522    850     631     867     241     394

               Danish                  Italian
Front-end      WM      MM      HM      WM      MM      HM
WI008-TUC      84.99   65.68   63.91   96.58   88.53   88.22
WI008-UGR      93.37   81.49   79.59   96.71   92.53   89.00
Train (#sent)  3440    1254    1720    2951    1245    1720
Test (#sent)   1474    204     658     1309    405     626
68
Baseline Aurora 4 with Lattices, Small Lattice Size, word accuracy (%); rows are training conditions, the Avg. column is over all 14 test sets:

Training   1       2       3       4       5       6       7
Clean      88.36   85.67   74.36   73.44   66.41   74.59   63.87
Multi      86.81   86.85   85.78   85.34   85.56   85.89   84.42
Noisy      87.81   86.96   85.71   83.61   83.09   85.60   81.80
Average    87.66   86.49   81.95   80.80   78.35   82.03   76.70

Training   8       9       10      11      12      13      14      Avg.
Clean      86.56   80.77   68.43   64.75   55.31   70.98   59.70   72.37
Multi      86.85   86.52   83.98   82.50   81.33   84.64   81.84   84.88
Noisy      87.00   85.97   81.58   80.48   76.51   82.65   77.48   83.30
Average    86.80   84.42   78.00   75.91   71.05   79.42   73.01   80.19
69
Baseline Aurora 4 with Lattices, Medium Lattice Size, word accuracy (%); rows are training conditions, the Avg. column is over all 14 test sets:

Training   1       2       3       4       5       6       7
Clean      87.92   84.71   72.89   72.78   65.12   73.78   62.91
Multi      85.97   85.52   84.79   83.83   83.90   84.24   83.24
Noisy      87.33   85.78   84.42   82.28   81.58   84.16   80.88
Average    87.07   85.34   80.70   79.63   78.87   80.73   75.68

Training   8       9       10      11      12      13      14      Avg.
Clean      85.67   79.45   66.08   63.68   53.86   69.07   58.31   71.16
Multi      86.19   84.97   82.65   81.18   80.63   82.84   80.29   83.59
Noisy      86.70   85.45   81.14   78.67   74.22   82.21   76.65   82.25
Average    86.19   83.29   76.62   74.51   69.57   78.04   71.75   79.14