
Articulation Rate: Measures, Realised Lexical Form and Phone Classification in Spontaneous and Read German Speech

Jürgen Trouvain, Jacques Koreman, Attilio Erriquez and Bettina Braun
Universität des Saarlandes, Saarbrücken, Germany
{trouvain,koreman,erriquez,bebr}@coli.uni-sb.de

Introduction

Speech rate affects the word error rate of automatic speech recognition systems. Error rates are higher for fast speech, but also for slow, hyperarticulated speech (Siegler and Stern, 1995; Mirghafori et al., 1995; Martínez et al., 1997; Pfau and Ruske, 1998; Alleva et al., 1998).

Aims

We address three questions:

- Articulation rate measures: What linguistic unit should we use to quantify speech rate and what domain is appropriate?
- Articulation rate and realised lexical form: What are the most important effects on the realisation of the lexical forms?
- Articulation rate and phone classification: How well are the acoustic models suited for different speech rates?

Articulation rate measures

We investigated several linguistic units for measuring articulation rate as well as two different domains.

Linguistic unit

It is important to distinguish between intended and realised units. Intended forms can be easily derived from the canonical transcription of the uttered words, but their actual realisation can vary strongly:

Am blauen Himmel ziehen die Wolken
Engl. In the blue sky wander the clouds
/   / [   ]

The definition of what is and is not a unit is also problematical:

- Vowel-/r/ combinations are counted as two phones in the intended form, but as one in the realised form (except for schwa-/r/ combinations, which were labelled /  /).
- Realised syllables can be problematical: e.g. the /  /-syllable in „ziehen“ in the example above can be realised as a syllabic or non-syllabic /n/, leading to different syllable counts (one and zero, respectively).
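To make the difference between intended and realised units concrete, the short sketch below computes articulation rate as units per second for the example sentence. The counts (10 syllables and 26 phones in the intended form, 8 syllables and 20 phones in the realised form) follow the counts reported on the poster for this sentence. The duration of 2.0 seconds and the function name are assumptions made purely for illustration, not measured values.

```python
def articulation_rate(n_units: int, duration_s: float) -> float:
    """Units per second over one domain; pauses are assumed to be excluded already."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_units / duration_s


# Unit counts for "Am blauen Himmel ziehen die Wolken".
counts = {
    "intended syllable": 10,
    "realised syllable": 8,
    "intended phone": 26,
    "realised phone": 20,
}

DURATION_S = 2.0  # hypothetical duration, chosen only for the example

rates = {unit: articulation_rate(n, DURATION_S) for unit, n in counts.items()}
for unit, rate in rates.items():
    print(f"{unit}: {rate:.1f} units/s")
```

With the same duration, the realised counts give a lower measured rate than the intended counts, which is the effect of phone and syllable deletions on the measurement.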
The following units were measured:

- intended word
- intended syllable
- realised syllable
- intended phone
- realised phone

For the example sentence, the intended form has 10 syllables and 26 phones, and the realised form has 8 syllables and 20 phones.

- Glottal stops are considered to be a phone (in contrast to laryngealisation).
- Due to the labelling conventions of the database, affricates are counted as two phones and diphthongs as one.

Domain

Articulation rate changes continuously while speaking and is not always constant within an utterance. Therefore we use two prosodic domains (although it is clear that more local variation can and will occur even within these domains):

inter-pause stretch (ips)
- The pauses which delimit them (pause, breathing, filled pause, lip smacks, coughing and other non-verbal articulations) are easy to determine in the labelfile and are often used to delimit the domain over which articulation rate is calculated.
- ASR is primarily interested in decoding speech (not silence) from the information contained in the phone segments.

intonational phrase (IP)
- is considered as an important planning unit, reflected by the intonation contour.
- utterances must be labelled intonationally to obtain IPs. The criteria for IPs can differ considerably between studies.

Results and discussion

Despite many sources of variation in both the units and the domains, we found high correlations between the number of units and domain duration for all units, both for read and spontaneous speech.

Results
- Correlations higher for ips than for IP.
- Words/second show lowest correlations with duration.
- Realised phones/second result in the highest correlation with duration.

Realised phones/second in ips are used in this study. For other applications, comparable results can be obtained when using the graphical word or the intended syllable, which can be measured/derived more easily.

Note: Although phone and syllable deletions lower the measured articulation rate, it is not clear what their effect on the perceived articulation rate is.

Articulation rate and realised lexical form

The KielCorpus database was subdivided into three parts on the basis of the articulation rate measured in realised phones/second (for read and spontaneous speech separately):

- slow: more than 1 sd below the mean
- medium: between -1 and +1 sd from the mean
- fast: more than 1 sd above the mean

Database analysis: realised lexical form

- Generally more deletions than replacements, especially / , , t/
- Consonants generally affected more strongly by deletions and replacements than vowels. Exception: schwa
- For /n/ more replacements than deletions (place assimilation)
- /  /, which is /  r/ in the canonical form, is seldom deleted or reduced

Implications for ASR

- Deletion of /  /, /  /, /t/ (closure and especially release + aspiration) and also /n/ should be represented in the lexicon by means of pronunciation variants.
- Pronunciation variants due to assimilation of /n/ and replacement of /t/ closures and /  / should also be added to the lexicon.
- If there is any vowel reduction, therefore, it must take place on the acoustic rather than the lexical level (except for schwa).

Articulation rate and phone classification

Phone classification for individual phones in hidden Markov modelling experiments using HTK, for read and spontaneous speech separately:

- Hidden Markov models: 3 states (5 for diphthongs), left-to-right (no states skipped) and 8 mixtures per state
- Jackknife experiments with 20% of the database as test data, results computed as weighted averages
- Results evaluated for slow, medium and fast speech

Results: We found a deterioration of phone classification with articulation rate (unlike e.g. Siegler and Stern, 1995).
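The rate-based subdivision of the database described above can be sketched as follows. This is a minimal illustration and not the authors' actual procedure: the segment format, the pause label names, the toy data and the choice of the sample standard deviation are all assumptions. The sketch first computes realised phones/second for each inter-pause stretch and then assigns a rate class relative to the mean.

```python
from statistics import mean, stdev

# Hypothetical label names for the pause types that delimit an ips.
PAUSE_LABELS = {"<pause>", "<breath>", "<filled_pause>"}


def split_into_ips(segments):
    """Split (label, start_s, end_s) segments into inter-pause stretches."""
    stretches, current = [], []
    for seg in segments:
        if seg[0] in PAUSE_LABELS:
            if current:  # close the stretch that ends at this pause
                stretches.append(current)
                current = []
        else:
            current.append(seg)
    if current:
        stretches.append(current)
    return stretches


def phones_per_second(ips):
    """Realised phones per second, from first phone onset to last phone offset."""
    duration = ips[-1][2] - ips[0][1]
    return len(ips) / duration


def rate_class(rate, mu, sd):
    """slow: more than 1 sd below the mean; fast: more than 1 sd above; else medium."""
    if rate < mu - sd:
        return "slow"
    if rate > mu + sd:
        return "fast"
    return "medium"


# Toy label file; the phone labels are placeholders, not corpus data.
segments = [
    ("a", 0.0, 0.125), ("b", 0.125, 0.25),
    ("<pause>", 0.25, 0.5),
    ("c", 0.5, 0.75), ("d", 0.75, 0.875), ("e", 0.875, 1.0),
]
ips_rates = [phones_per_second(ips) for ips in split_into_ips(segments)]

# Made-up rates standing in for all stretches of one speaking style.
corpus_rates = [10.0, 12.0, 12.0, 12.0, 14.0]
mu, sd = mean(corpus_rates), stdev(corpus_rates)
classes = [rate_class(r, mu, sd) for r in corpus_rates]

print(ips_rates)
print(classes)
```

Because the split is made for read and spontaneous speech separately, the mean and standard deviation would be computed once per speaking style rather than over the pooled data.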
Articulation rate effects

Our findings are comparable to those of Wrede et al. (2001) and are probably caused by the greater spectral variation at faster articulation rates.

- Articulation rate affects both vowels and consonants (lower phone classification results for faster speaking rates).
- t-tests on average vowel classification rates for matched pairs showed that the phone classification rates for normal and fast vowels do not differ significantly in spontaneous speech. This is probably due to the large amount of variation in the average vowel classification rates.
- Among the consonants, particularly fricatives and also plosives were affected by articulation rate.

Rate-independent phone class effects

- Phone classification rates for consonants, particularly voiceless obstruents, are higher than for vowels.
- Schwa is recognised particularly poorly, possibly because of its liability to transconsonantal coarticulation.
- Diphthongs and /  / are also recognised poorly.

Average phone classification rates for slow, normal and fast speech (read and spontaneous)

By performing phone classification for individual phones, the effects of silences in the utterance are excluded as a source of recognition error. The emphasis is entirely on the recognition of the phones at different articulation rates (calculated for each ips as phones/second).

Database

The German KielCorpus for Read and Spontaneous Speech
- manually labelled realised phones along with intended (canonical) transcriptions
- large parts also prosodically annotated

Read: single sentences of variable length and two short stories, 4 hours, 53 speakers (27 male and 26 female)
Spontaneous: appointment-making dialogues, 4 hours, 42 speakers (24 male, 18 female)

Only segmentally and prosodically labelled parts were selected for this study.

Database analysis: spontaneous versus read speech

As is well known, spontaneous speech differs from read speech with respect to pauses (more unfilled pauses, filled pauses, ungrammatical pauses). We also find differences in temporal structure (shorter phrases, greater variance in phrase duration, greater variance in articulation rate for spontaneous speech). But we also observe changes on the phonemic level.

[Figure: word error rate and number of utterances plotted against articulation rate (sd), from slow to fast]

References

Alleva, F., Huang, X., Hwang, M-Y. & Jiang, L. "Can continuous speech recognisers handle isolated speech?" Speech Communication 26 (3), 183-190, 1998.
Martínez, F., Tapias, D., Álvarez, J. & León, P. "Characteristics of slow, average and fast speech and their effects in large vocabulary continuous speech recognition." Proc. Eurospeech Rhodes, 469-472, 1997.
Mirghafori, N., Fosler, E. & Morgan, N. "Fast speakers in large vocabulary continuous speech recognition." Proc. Eurospeech Madrid, 1995.
Pfau, T. & Ruske, G. "Creating Hidden Markov Models for fast speech." Proc. ICSLP Sydney, 205-208, 1998.
Siegler, M. A. & Stern, R. M. "On the effects of speech rate in large vocabulary speech recognition systems." Proc. ICASSP Detroit (1), 612-615, 1995.
Wrede, B., Fink, G. & Sagerer, G. "An investigation of modelling aspects for rate-dependent speech recognition." Proc. Eurospeech Aalborg, 2001.

