Models of speech dynamics for ASR, using intermediate linear representations
Philip Jackson, Boon-Hooi Lo and Martin Russell
Electronic, Electrical and Computer Engineering
Speech dynamics into ASR
- dynamics of speech production to constrain the recognizer:
  - noisy environments
  - conversational speech
  - speaker adaptation
- efficient, complete and trainable models:
  - for recognition
  - for analysis
  - for synthesis
INTRODUCTION
Articulatory trajectories
[figure: articulatory trajectories, from West (2000)]
INTRODUCTION
Articulatory-trajectory model
[diagram: finite-state level, intermediate level and surface level, with a source-dependent mapping]
INTRODUCTION
Multi-level Segmental HMM
- segmental finite-state process
- intermediate "articulatory" layer:
  - linear trajectories
- mapping required:
  - linear transformation
  - radial basis function network
INTRODUCTION
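As a rough sketch of the generative story above (the function names, the noise model and all parameter values are illustrative, not the authors' implementation): each segmental state emits a straight-line trajectory in the intermediate layer, which a linear transformation maps to the acoustic layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_segment(duration, slope, intercept, W, noise_std=0.1):
    """One multi-level SHMM segment: a linear trajectory in the intermediate
    ('articulatory') layer, mapped linearly to the acoustic layer."""
    t = np.arange(duration) - (duration - 1) / 2.0   # time index centred on the segment
    artic = intercept + np.outer(t, slope)           # (T, D_artic) straight-line trajectory
    acoustic = artic @ W.T                           # linear articulatory-to-acoustic map
    acoustic = acoustic + rng.normal(0.0, noise_std, acoustic.shape)  # observation noise
    return artic, acoustic

artic, acoustic = generate_segment(
    10,
    slope=np.array([0.2, -0.1]),       # per-dimension rate of change
    intercept=np.array([1.0, 0.5]),    # mid-segment value
    W=np.array([[1.0, 0.0],            # (D_acoustic=3, D_artic=2) mapping
                [0.5, 0.5],
                [0.0, 1.0]]),
)
```

Because the time index is centred, `intercept` is the trajectory's value at the segment midpoint, which keeps slope and intercept estimates decoupled.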
Linear-trajectory model
[diagram: segmental HMM, intermediate layer, articulatory-to-acoustic mapping, acoustic layer]
INTRODUCTION
Linear-trajectory equations
- trajectory defined as a linear function of time within each segment
- segment probability from the frame-level likelihoods along the trajectory
THEORY
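The slide's equations did not survive extraction; as a hedged sketch of the usual linear-trajectory segment score (a straight line per segment with independent Gaussian frame errors, which may differ in detail from the slide's exact form):

```python
import numpy as np

def segment_log_prob(y, slope, intercept, var):
    """Log-probability of observed frames y (T, D) under a linear trajectory:
    mean_t = intercept + slope * t, with t centred on the segment midpoint,
    and independent Gaussian frame errors of variance `var` per dimension.
    A sketch of the standard form, not the slide's exact equation."""
    T, D = y.shape
    t = np.arange(T) - (T - 1) / 2.0
    mean = intercept + np.outer(t, slope)            # straight-line segment mean
    resid = y - mean
    return -0.5 * np.sum(resid ** 2 / var + np.log(2.0 * np.pi * var))

t = np.arange(5) - 2.0
y_good = np.outer(t, [1.0])                          # frames lying exactly on the line
lp_good = segment_log_prob(y_good, np.array([1.0]), np.array([0.0]), 1.0)
lp_bad = segment_log_prob(y_good, np.array([0.0]), np.array([0.0]), 1.0)
```

A segment whose frames sit on the hypothesised line scores higher than one with a mismatched slope, which is what drives segmentation and recognition.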
Linear mapping
- objective function: least-squares error between matched articulatory and acoustic sequences
THEORY
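A minimal version of the matched-sequence least-squares fit (assuming `X` holds intermediate-layer frames and `Y` the corresponding acoustic frames; variable names are illustrative):

```python
import numpy as np

def fit_linear_mapping(X, Y):
    """Least-squares linear articulatory-to-acoustic mapping: the W
    minimising ||X @ W - Y||^2 over matched frame sequences
    X (T, D_in) and Y (T, D_out)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W                                   # (D_in, D_out)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
W_true = rng.normal(size=(3, 4))
W_est = fit_linear_mapping(X, X @ W_true)      # noise-free sanity check
```

With noise-free matched data the mapping is recovered exactly; with real data the residual measures how much of the acoustics the linear map can explain.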
Trajectory parameters
- utterance probability
- ML estimates of the trajectory parameters, for the optimal state sequence
THEORY
Non-linear (RBF) mapping
[diagram: formant trajectories mapped to the acoustic layer via an RBF network]
THEORY
Trajectory parameters
- with the RBF, the least-squares solution is sought by gradient descent
THEORY
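One way the RBF least-squares fit by gradient descent could look (basis placement, widths, learning rate and test function are my choices, not the authors'):

```python
import numpy as np

def rbf_features(X, centres, width):
    """Gaussian basis activations, shape (T, K)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf_gd(X, Y, centres, width, lr=0.1, steps=500):
    """Fit RBF output weights by batch gradient descent on the
    mean squared error (the least-squares objective)."""
    Phi = rbf_features(X, centres, width)      # (T, K) basis activations
    W = np.zeros((Phi.shape[1], Y.shape[1]))
    for _ in range(steps):
        err = Phi @ W - Y                      # (T, D_out) residual
        W -= lr * (Phi.T @ err) / len(X)       # gradient of 0.5 * MSE
    return W

X = np.linspace(-1.0, 1.0, 50)[:, None]
Y = np.sin(3.0 * X)                            # toy non-linear target
centres = np.linspace(-1.0, 1.0, 7)[:, None]
W = fit_rbf_gd(X, Y, centres, width=0.3)
mse = np.mean((rbf_features(X, centres, 0.3) @ W - Y) ** 2)
```

Unlike the linear map, the RBF objective has no closed form once the basis parameters are also adapted, hence the iterative descent.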
Tests on TIMIT
- N. American English, at 8 kHz
  - MFCC13 acoustic features (incl. zeroth)
- feature sets:
  a) F1-3: formants F1, F2 and F3, estimated by the Holmes formant tracker
  b) F1-3+BE5: five band energies added
  c) PFS12: synthesiser control parameters
METHOD
TIMIT baseline performance
- constant-trajectory SHMM (ID_0)
- linear-trajectory SHMM (ID_1)
RESULTS
Performance across feature sets RESULTS
Phone categorisation

  Grouping  No.  Description
  A          1   all data
  B          2   silence; speech
  C          6   linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
  D         10   as Deng and Ma (2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
  E         10   discrete articulatory regions
  F         49   silence; individual phones
METHOD
Discrete articulatory regions

  Region  Features                      Description
  0       -voice                        Silence, non-speech
  1       +voice, VT open               Vowel, glide
  2       +voice, VT part.              Liquid, approximant
  3       +voice, VT closed, +velum     Nasal
  4       +voice, VT closed             Voiced plosive (closure)
  5       -voice, VT closed             Voiceless plosive (closure)
  6       +voice, VT open, +plosion     Voiced plosive (release)
  7       -voice, VT open, +plosion     Voiceless plosive (release)
  8       +voice, VT part., +fric/asp   Voiced fricative
  9       -voice, VT part., +fric/asp   Voiceless fricative
METHOD
Performance across groupings RESULTS
Results across groupings RESULTS
Tests on MOCHA
- S. British English, at 16 kHz
  - MFCC13 acoustic features (incl. zeroth)
- articulatory x- & y-coordinates from 7 EMA coils
- PCA9+Lx: first nine articulatory modes plus the laryngograph log energy
METHOD
MOCHA baseline performance RESULTS
Performance across mappings RESULTS
Model visualisation
[figure panels: original acoustic data; constant-trajectory model; linear-trajectory model, grouping (F), PFS12]
DISCUSSION
Conclusions
- theory of Multi-level Segmental HMMs
- benefits of linear trajectories
- results show near-optimal performance with linear mappings
- progress towards unified models of the speech production process
- what next?
  - unsupervised (embedded) training, to derive pseudo-articulatory representations
  - implement non-linear mapping (i.e., RBF)
  - include biphone language model, and segment duration models
SUMMARY