A Recognition Model for Speech Coding Wendy Holmes 20/20 Speech Limited, UK A DERA/NXT Joint Venture
2 Introduction –Speech coding at low data rates (a few hundred bits/s) requires a compact, low-dimensional representation => code variable-length speech “segments”. –Automatic speech recognition is potentially a powerful way to identify useful segments for coding. –BUT HMM-based coding has limitations: (i) shortcomings of HMMs as production models; (ii) typical recognition feature sets (e.g. cepstral coefficients) impose limits on coded speech quality; (iii) it is difficult to retain speaker characteristics (at least for speaker-independent recognition).
3 A “unified” model for speech coding
4 A simple coding scheme Demonstrate principles of coding using same model for both recognition and synthesis. Model represents linear formant trajectories. Recognition: linear trajectory segmental HMMs of formant features. Synthesis: JSRU parallel-formant synthesizer. Coding is applied to analysed formant trajectories => relatively high bit-rate (typically bits/s). Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.
5 Segment coding scheme overview
6 Formant analyser (EUROSPEECH’97) –Each formant frequency estimate is assigned a value representing confidence in its measurement accuracy. When formants are indistinct, confidence is low. –In cases of ambiguity, the analyser offers two alternative sets of formant trajectories for resolution in the recognition process. [Figure: example utterance “four seven”]
7 Linear formant trajectory recognition Feature set: formant frequencies plus mel-cepstrum coefficients and overall energy feature. Confidences: represent as variances: low confidence => large variance. Add confidence variance to model variance, so low-confidence features have little influence. Formant alternatives: choose one giving highest probability for each possible data segment and model state. Numbers of segments depend on phone identity: e.g. 1 segment for fricatives; 3 for voiceless stops. Range of durations: segment-dependent minimum and maximum segment duration.
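The confidence handling on this slide can be sketched in a few lines. This is a simplified illustration, not the recognizer described in the talk: it scores a single frame against one diagonal-covariance Gaussian rather than a linear-trajectory segmental HMM, and the function names (`log_gauss`, `frame_log_prob`, `best_alternative`) are hypothetical. It shows only the two ideas stated above: the confidence variance is added to the model variance, and the formant alternative with the highest probability is chosen.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def frame_log_prob(features, conf_vars, means, model_vars):
    """Score one frame, adding each feature's confidence variance to the
    model variance so that low-confidence features have little influence.
    (Assumption: independent features, i.e. a diagonal covariance.)"""
    return sum(
        log_gauss(x, m, v + c)
        for x, c, m, v in zip(features, conf_vars, means, model_vars)
    )


def best_alternative(alternatives, means, model_vars):
    """Pick the formant alternative with the highest log probability.

    `alternatives` is a list of (features, conf_vars) pairs."""
    scores = [frame_log_prob(f, c, means, model_vars) for f, c in alternatives]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]
```

With a large confidence variance, a formant estimate far from the model mean is penalised much less than the same estimate reported with full confidence, which is the intended effect.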
8 Frame-by-frame synthesizer controls Values for each of 10 synthesizer control parameters are obtained at 10ms intervals: Voicing and fundamental frequency from excitation analysis program. 3 Formant frequency controls from formant analyser. 5 Formant amplitude controls from FFT-based method. With 6 bits assigned to each of the 10 controls, the baseline data rate is 6000 bits/s.
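The baseline rate follows directly from the figures on this slide; the short calculation below just makes the arithmetic explicit. The constant names are illustrative.

```python
FRAME_INTERVAL_MS = 10   # control values are obtained every 10 ms
N_CONTROLS = 10          # voicing, F0, 3 formant frequencies, 5 formant amplitudes
BITS_PER_CONTROL = 6     # bits assigned to each control

frames_per_second = 1000 // FRAME_INTERVAL_MS
baseline_bits_per_second = N_CONTROLS * BITS_PER_CONTROL * frames_per_second
print(baseline_bits_per_second)  # 6000
```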
9 Segment coding –Segments identified by recognizer are coded using straight-line fits to observed formant parameters. –Use a least mean square error criterion. For formant frequencies, frame error is weighted by confidence variance. Thus the more reliable frames have more influence. –To code a segment, represent value at start, and difference of end value from start value. –Force continuity across segment boundaries where smooth changes are required for naturalness (e.g. semivowel-vowel boundaries). –When there are formant alternatives, use those selected by recognizer.
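A minimal sketch of the straight-line fit described above, for one formant track within one segment. Two points are assumptions rather than details from the talk: each frame is weighted by the inverse of its confidence variance (so reliable frames count more), and a small `var_floor` guards against division by zero. The names `fit_segment` and `decode_segment` are hypothetical, and quantisation of the coded values and continuity constraints at segment boundaries are not covered.

```python
def fit_segment(values, conf_vars, var_floor=1e-6):
    """Weighted least-squares straight-line fit over one segment.

    Returns (start, delta): the fitted value at the first frame and the
    difference between the fitted end value and the start value."""
    n = len(values)
    # Assumption: weight = 1 / confidence variance (plus a small floor).
    weights = [1.0 / (v + var_floor) for v in conf_vars]
    sw = sum(weights)
    st = sum(w * t for t, w in enumerate(weights))
    sy = sum(w * y for w, y in zip(weights, values))
    stt = sum(w * t * t for t, w in enumerate(weights))
    sty = sum(w * t * y for t, (w, y) in enumerate(zip(weights, values)))
    denom = sw * stt - st * st
    # A single-frame segment has no slope; code it as a constant.
    slope = (sw * sty - st * sy) / denom if n > 1 and denom > 0 else 0.0
    intercept = (sy - slope * st) / sw
    return intercept, slope * (n - 1)


def decode_segment(start, delta, n_frames):
    """Rebuild the frame-by-frame trajectory from the coded pair."""
    if n_frames == 1:
        return [start]
    step = delta / (n_frames - 1)
    return [start + step * t for t in range(n_frames)]
```

For example, `fit_segment([500, 510, 900, 530], [1, 1, 1e9, 1])` gives a line very close to start 500, delta 30: the third frame carries a huge confidence variance, so its outlying value barely affects the fit.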
10 Coding experiments –Tested on 2 tasks: speaker-independent connected digit recognition and speaker-dependent recognition of airborne reconnaissance reports (500-word vocabulary). –Frame-by-frame analysis-synthesis (at 6000 bits/s) generally produced a close copy of the original speech. –Segment-coded versions preserved the main characteristics. –There were some instances of formant analysis errors. –In some cases, using the recognizer to select between alternative formant trajectories improved segment coding quality. –In general, coding still works well even if there are recognition errors, as the main requirement is to identify suitable segments for linear trajectory coding.
11 Speech coding results –Audio examples, coded at about 600 bits/s: Speaker 1: digits; Speaker 2: digits; Speaker 3: digits; Speaker 1: ARM report. –Audio examples, natural: Speaker 1: digits; Speaker 2: digits; Speaker 3: digits; Speaker 1: ARM report. –Achievements of study: established the principle of using a formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding. –Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.