1
Models of speech dynamics for ASR, using intermediate linear representations
Philip Jackson, Boon-Hooi Lo and Martin Russell
Electronic, Electrical and Computer Engineering
http://web.bham.ac.uk/p.jackson/balthasar/
2
Abstract INTRODUCTION
3
Speech dynamics into ASR
– dynamics of speech production to constrain recognizer
  – noisy environments
  – conversational speech
  – speaker adaptation
– efficient, complete and trainable models
  – for recognition
  – for analysis
  – for synthesis
INTRODUCTION
4
Articulatory trajectories from West (2000) INTRODUCTION
5
Articulatory-trajectory model INTRODUCTION
6
Articulatory-trajectory model
[Diagram of model levels: finite-state, intermediate and surface; labelled source dependent.]
INTRODUCTION
7
Multi-level Segmental HMM
– segmental finite-state process
– intermediate “articulatory” layer
  – linear trajectories
– mapping required
  – linear transformation
  – radial basis function network
INTRODUCTION
8
Linear-trajectory model
[Diagram: a segmental HMM (states 1–5) generates trajectories in the intermediate layer, which an articulatory-to-acoustic mapping projects to the acoustic layer.]
INTRODUCTION
9
Linear-trajectory equations
Defined as [equation], where [equation].
Segment probability: [equation]
THEORY
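The equation images on this slide did not survive extraction. As a hedged sketch only, a standard linear-trajectory segmental formulation (assumed notation, with per-state mid-point m_i and slope c_i, a mapping F and residual covariance Σ_i; the slide's own symbols may differ) would read:

```latex
% Sketch of a linear-trajectory segment model; assumed notation, not verbatim from the slide.
\[
  y_t \;=\; m_i + c_i\Bigl(t - \tfrac{\tau+1}{2}\Bigr), \qquad t = 1,\dots,\tau ,
\]
% where m_i is the mid-point and c_i the slope of the intermediate trajectory for a
% segment of duration \tau in state i. With an articulatory-to-acoustic mapping F and
% Gaussian residual covariance \Sigma_i, the segment probability is
\[
  p(x_1,\dots,x_\tau \mid i, \tau) \;=\; \prod_{t=1}^{\tau}
  \mathcal{N}\bigl(x_t ;\; F(y_t),\, \Sigma_i\bigr).
\]
```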
10
Linear mapping
Objective function: [equation], with matched sequences [equation] and [equation].
THEORY
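The objective function itself is missing from the extraction. As a minimal sketch, assuming a least-squares criterion over matched intermediate/acoustic sequences Y and X and a linear map W (symbols are assumptions, not the slide's):

```latex
% Assumed least-squares objective for a linear articulatory-to-acoustic map W,
% given matched sequences Y = [y_1 ... y_T] (intermediate) and X = [x_1 ... x_T] (acoustic):
\[
  E(W) \;=\; \sum_{t=1}^{T} \bigl\lVert x_t - W y_t \bigr\rVert^{2},
  \qquad
  \hat{W} \;=\; X Y^{\top}\bigl(Y Y^{\top}\bigr)^{-1}.
\]
```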
11
Trajectory parameters
Utterance probability: [equation]; and, for the optimal (ML) state sequence: [equation].
THEORY
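The two equations here are also lost to extraction. As a generic, hedged sketch of how a segmental HMM scores an utterance over segmentations (standard form, not necessarily the slide's exact expression):

```latex
% Assumed standard segmental-HMM decomposition over state sequences s_1..s_N and
% durations \tau_1..\tau_N; the optimal (ML) state sequence maximises this product.
\[
  \hat{p}(x_1^{T}) \;=\; \max_{s_1^N,\;\tau_1^N}\;
  \prod_{k=1}^{N} P(s_k \mid s_{k-1})\, P(\tau_k \mid s_k)\,
  p\bigl(x_{t_k}^{\,t_k+\tau_k-1} \mid s_k, \tau_k\bigr),
\]
% with each segment likelihood given by the linear-trajectory term sketched above.
```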
12
Non-linear (RBF) mapping
[Diagram: radial basis function network mapping formant trajectories to the acoustic layer.]
THEORY
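As a minimal sketch of what such a mapping can look like, assuming Gaussian basis functions with centres c_j, widths σ_j and output weights w_j (all symbols assumed, not taken from the slide):

```latex
% Assumed Gaussian-RBF map from an intermediate (formant) vector y to an acoustic vector:
\[
  F(y) \;=\; \sum_{j=1}^{J} w_j
  \exp\!\Bigl(-\,\tfrac{\lVert y - c_j \rVert^{2}}{2\sigma_j^{2}}\Bigr),
\]
% where the weight vectors w_j are trained from matched intermediate/acoustic data.
```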
13
Trajectory parameters
With the RBF, the least-squares solution is sought by gradient descent: [equation]
THEORY
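To make the gradient-descent idea concrete, here is a small self-contained sketch (not the authors' code): an RBF map from intermediate trajectories to acoustic vectors whose output weights are fitted by gradient descent on the least-squares error. All data, sizes and the learning rate are placeholders.

```python
import numpy as np

# Hedged sketch, not the authors' implementation: fit the output weights of an
# RBF map (intermediate trajectory -> acoustic vector) by gradient descent on
# the mean squared error. Data and hyper-parameters below are placeholders.

rng = np.random.default_rng(0)
T, D_in, D_out, n_centres = 200, 3, 13, 20    # frames, formant dims, acoustic dims, RBF centres

Y = rng.normal(size=(T, D_in))                # intermediate trajectories (placeholder)
X = rng.normal(size=(T, D_out))               # acoustic targets (placeholder)

centres = Y[rng.choice(T, n_centres, replace=False)]   # centres drawn from the data
width = 1.0                                             # shared Gaussian width (assumption)

def rbf_design(Y):
    """Activations phi_j(y_t) = exp(-||y_t - c_j||^2 / (2 width^2))."""
    d2 = ((Y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

Phi = rbf_design(Y)                           # (T, n_centres)
W = np.zeros((n_centres, D_out))              # output weights to learn

lr = 0.1
for _ in range(500):
    err = Phi @ W - X                         # residual of the current map
    W -= lr * (Phi.T @ err) / T               # gradient step on the MSE

print("final MSE:", float(((Phi @ W - X) ** 2).mean()))
```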
14
Tests on TIMIT
N. American English, at 8 kHz
– MFCC13 acoustic features (incl. zeroth)
a) F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker
b) F1-3+BE5: five band energies added
c) PFS12: synthesiser control parameters
METHOD
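As a small hedged sketch of an MFCC13-style front end (13 MFCCs including the zeroth coefficient at 8 kHz), using librosa on a synthetic signal; the authors' actual analysis settings are not given on the slide, so treat this only as an illustration:

```python
import numpy as np
import librosa

# Hedged illustration of an MFCC13-style front end: 13 MFCCs (the first row is
# the zeroth, energy-like coefficient) at an 8 kHz sampling rate. The signal is
# a synthetic placeholder, not a TIMIT utterance, and the analysis settings are
# librosa defaults rather than the authors' configuration.

sr = 8000
t = np.arange(sr) / sr                                   # 1 second of samples
y = 0.1 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)

mfcc13 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # shape: (13, n_frames)
print(mfcc13.shape)
```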
15
TIMIT baseline performance
– Constant-trajectory SHMM (ID_0)
– Linear-trajectory SHMM (ID_1)
RESULTS
16
Performance across feature sets RESULTS
17
Phone categorisation
Grouping | No. | Description
A | 1 | all data
B | 2 | silence; speech
C | 6 | linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
D | 10 | as Deng and Ma (2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
E | 10 | discrete articulatory regions
F | 49 | silence; individual phones
METHOD
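Purely as an illustration of how such a grouping might be applied in code, here is one possible lookup for grouping C (six linguistic categories). The phone symbols are TIMIT-style examples chosen for illustration; the authors' exact class membership is not given on the slide.

```python
# Illustrative only: one possible grouping-C (6-class) lookup for TIMIT-style
# phone labels. Class membership here is an editorial example, not the
# authors' definitive assignment.
GROUP_C = {
    "silence/stop": ["sil", "pau", "p", "t", "k", "b", "d", "g"],
    "vowel":        ["aa", "ae", "ah", "eh", "ih", "iy", "uh", "uw", "ao", "ey", "ay", "ow"],
    "liquid":       ["l", "r", "w", "y", "el"],
    "nasal":        ["m", "n", "ng", "en"],
    "fricative":    ["s", "sh", "z", "zh", "f", "th", "v", "dh", "hh"],
    "affricate":    ["ch", "jh"],
}

PHONE_TO_CLASS = {p: c for c, phones in GROUP_C.items() for p in phones}
print(PHONE_TO_CLASS["ae"], PHONE_TO_CLASS["ch"])
```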
18
Discrete articulatory regions
Region | Features | Description
0 | -voice | Silence, non-speech
1 | +voice, VT open | Vowel, glide
2 | +voice, VT part. | Liquid, approximant
3 | +voice, VT closed, +velum | Nasal
4 | +voice, VT closed | Voiced plosive (closure)
5 | -voice, VT closed | Voiceless plosive (closure)
6 | +voice, VT open, +plosion | Voiced plosive (release)
7 | -voice, VT open, +plosion | Voiceless plosive (release)
8 | +voice, VT part., +fric/asp | Voiced fricative
9 | -voice, VT part., +fric/asp | Voiceless fricative
METHOD
19
Performance across groupings RESULTS
20
Results across groupings RESULTS
21
Tests on MOCHA
S. British English, at 16 kHz
– MFCC13 acoustic features (incl. zeroth)
– articulatory x- & y-coords from 7 EMA coils
– PCA9+Lx: first nine articulatory modes plus the laryngograph log energy
METHOD
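As a hedged sketch of the PCA9 step (reducing the 14 EMA coordinates from 7 coils to the first nine articulatory modes), using scikit-learn on placeholder data; the laryngograph log energy (Lx) would simply be appended as an extra dimension:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hedged sketch of the PCA9 part of the PCA9+Lx feature set: project the
# 14 EMA coordinates (7 coils x x/y) onto their first nine principal modes.
# The data below are random placeholders, not MOCHA recordings.

rng = np.random.default_rng(0)
T = 500
ema = rng.normal(size=(T, 14))          # frames of x/y positions for 7 coils

pca = PCA(n_components=9)
modes = pca.fit_transform(ema)          # (T, 9) articulatory modes
print(modes.shape, round(float(pca.explained_variance_ratio_.sum()), 3))
```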
22
MOCHA baseline performance RESULTS
23
Performance across mappings RESULTS
24
Model visualisation
[Figure panels: original acoustic data; constant-trajectory model; linear-trajectory model, grouping (F), PFS12 (c).]
DISCUSSION
25
Conclusions
– Theory of Multi-level Segmental HMMs
– Benefits of linear trajectories
– Results show near optimal performance with linear mappings
– Progress towards unified models of the speech production process
What next?
– unsupervised (embedded) training, to derive pseudo-articulatory representations
– implement non-linear mapping (i.e., RBF)
– include biphone language model, and segment duration models
SUMMARY