Human and Machine Performance in Speech Processing
Louis C.W. Pols
Institute of Phonetic Sciences / ACLC, University of Amsterdam, The Netherlands
(Apologies: this presentation resembles the keynote at ICPhS’99, San Francisco, CA)
Welcome
IFA, Herengracht 338, Amsterdam
Heraeus-Seminar “Speech Recognition and Speech Understanding”
April 3-5, 2000, Physikzentrum Bad Honnef, Germany
Overview
- Phonetics and speech technology
- Do recognizers need ‘intelligent ears’?
- What is knowledge?
- How good is human/machine speech recognition?
- How good is synthetic speech?
- Pre-processor characteristics
- Useful (phonetic) knowledge
- Computational phonetics
- Discussion/conclusions
Phonetics and Speech Technology
Machine performance is more difficult if ...
- the test condition deviates from the training condition, because of:
  - nativeness and age of speakers
  - size and content of vocabulary
  - speaking style, emotion, rate
  - microphone, background noise, reverberation, communication channel
  - non-availability of certain features
- however, machines never get tired, bored, or distracted
Do recognizers need intelligent ears?
- ‘intelligent ears’ = the front-end pre-processor
- only if it improves performance
- humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
- robustness is at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)
What is knowledge?
- phonetic knowledge
- probabilistic knowledge from databases
- fixed set of features vs. adaptable set
- trading relations, selectivity
- knowledge of the world, expectation
- global vs. detailed
see video (with permission from Interbrew Nederland NV)
The video is a metaphor for:
- from global to detail (world → Europe → Holland → North Sea coast → Scheveningen beach → young lady drinking Dommelsch beer)
- sound → speech → speaker → English → utterance
- ‘recognize speech’ or ‘wreck a nice beach’
- zoom in on whatever information is available
- make an intelligent interpretation, given the context
- beware of distractors!
Human auditory sensitivity
- stationary vs. dynamic signals
- simple vs. spectrally complex
- detection threshold
- just noticeable differences
Detection thresholds and jnd’s (Table 3 in Proc. ICPhS’99 paper)
- multi-harmonic, simple, stationary signals: frequency jnd ca. 1.5 Hz
- single-formant-like periodic signals: F2 jnd 3-5%, bandwidth (BW) jnd 20-40%
DL for short speech-like transitions
[Figure adapted from van Wieringen & Pols (Acta Acustica ’98): difference limens for simple vs. complex signals and for short vs. longer transitions]
How good is human / machine speech recognition?
- machine SR surprisingly good for certain tasks
- machine SR could be better for many others: robustness, outliers
- what are the limits of human performance?
  - in noise
  - for degraded speech
  - with missing information (trading)
Human word intelligibility vs. noise
[Figure adapted from Steeneken (1992); annotation: at noise levels where humans start to have some trouble, recognizers are already in trouble]
Robustness to degraded speech
- speech = time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
  - prerequisite for digital hearing aids
  - modulating the spectral slope: -5 to +5 dB/oct, at 0.25-2 Hz
- temporal smearing of envelope modulation (see the sketch below)
  - ca. 4 Hz maximum in the modulation spectrum ≈ syllable rate
  - LP > 4 Hz and HP < 8 Hz have little effect on intelligibility
- spectral envelope smearing
  - for BW > 1/3 oct the masked SRT starts to degrade
(for references, see the paper in Proc. ICPhS’99)
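To make the envelope-smearing manipulation concrete, here is a minimal single-band sketch: the Hilbert envelope is low-pass filtered at 4 Hz and re-imposed on the fine structure. The cited experiments smear the envelope per frequency band; the single-band simplification, the filter order, and the toy test signal below are illustrative assumptions, not the stimuli of those studies.

```python
# Sketch: temporal smearing of envelope modulation (single band, illustrative).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def smear_envelope(x, fs, cutoff_hz=4.0):
    """Low-pass filter the Hilbert envelope of x and re-impose it on the carrier."""
    analytic = hilbert(x)
    envelope = np.abs(analytic)
    carrier = np.cos(np.angle(analytic))          # fine structure, unit amplitude
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    smeared = np.clip(filtfilt(b, a, envelope), 0.0, None)
    return smeared * carrier

fs = 16000
t = np.arange(fs) / fs                            # one second of signal
x = np.sin(2 * np.pi * 150 * t) * (1 + 0.8 * np.sin(2 * np.pi * 6 * t))
y = smear_envelope(x, fs)                         # the 6 Hz modulation is largely removed
```

Run on real speech instead of the toy signal, slow syllable-rate modulation survives while faster modulation is attenuated, which is the manipulation whose perceptual effect the slide summarizes.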
Robustness to degraded speech and missing information
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
  - fixed-duration segments time-reversed or shifted in time
  - perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed vs. original; see the sketch below)
  - low-frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  - the syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
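The locally time-reversed stimulus from Saberi & Perrott is simple to reproduce; a minimal sketch, assuming a mono waveform `x` sampled at `fs`: cut the signal into fixed-duration segments and play each one backwards. The 50 ms default follows the value quoted above.

```python
# Sketch: "partly reversed speech" -- time-reverse every fixed-length segment.
import numpy as np

def locally_reverse(x, fs, segment_ms=50.0):
    """Reverse each consecutive segment of segment_ms milliseconds in the waveform."""
    seg = max(1, int(round(fs * segment_ms / 1000.0)))
    y = x.copy()
    for start in range(0, len(x), seg):
        y[start:start + seg] = x[start:start + seg][::-1]
    return y

# e.g., with a waveform x loaded at 16 kHz:
# y = locally_reverse(x, 16000, segment_ms=50.0)
```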
How good is synthetic speech?
(not the main theme of this seminar, but synthesis and dialogue still deserve attention)
- good enough for certain applications
- could be better in most others
- evaluation: application-specific or multi-tier required
- interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998
Workshop evaluation procedure
- participants as native listeners
- DARPA-type procedures in data preparation
- balanced listening design
- no detailed results made public
- 3 text types:
  - newspaper sentences
  - semantically unpredictable sentences
  - telephone directory entries
- 42 systems in 8 languages tested
Screen for newspaper sentences
Some global results
- it worked, but with many practical problems (for a demo see http://www.fon.hum.uva.nl)
- this seems the way to proceed and to expand
- global rating (poor to excellent)
  - of text analysis, prosody & signal processing
- and/or more detailed scores
- transcriptions subjectively judged
  - major/minor/no problems per entry
- web-site access to several systems (http://www.ldc.upenn.edu/ltts/)
Phonetic knowledge to improve speech synthesis (assuming concatenative synthesis)
- control of emotion, style, voice characteristics
- perceptual implications of:
  - parameterization (LPC, PSOLA)
  - discontinuities (spectral, temporal, prosodic; see the sketch below)
- improved naturalness (prosody!)
- active adaptation to other conditions
  - hyper/hypo speech, noise, communication channel, listener impairment
- systematic evaluation
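To make the spectral-discontinuity point operational, here is one common way to quantify the mismatch at a concatenation join: compare short-term log-magnitude spectra on both sides of the splice point. This is a hedged sketch, not the method of any particular synthesizer; the frame length, window, and distance measure are illustrative choices, and `join_cost` is a hypothetical helper name.

```python
# Sketch: spectral discontinuity ("join cost") at a concatenation point.
import numpy as np

def join_cost(left_unit, right_unit, frame=512):
    """RMS distance between log-magnitude spectra across the join.

    Both units are 1-D waveforms at least `frame` samples long.
    """
    w = np.hanning(frame)
    a = np.log(np.abs(np.fft.rfft(left_unit[-frame:] * w)) + 1e-10)
    b = np.log(np.abs(np.fft.rfft(right_unit[:frame] * w)) + 1e-10)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

A unit-selection synthesizer would minimize a weighted sum of such join costs and target costs over candidate units; large values flag joins likely to be audible.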
Desired pre-processor characteristics in automatic speech recognition
- basic sensitivity to stationary and dynamic sounds
- robustness to degraded speech
  - rather insensitive to spectral and temporal smearing
- robustness to noise and reverberation
- filter characteristics
  - are BP, PLP, MFCC, RASTA, TRAPS good enough? (see the MFCC sketch below)
  - lateral inhibition (spectral sharpening); dynamics
- what can be neglected?
  - non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
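For reference, this is what a plain MFCC front-end of the kind listed above looks like, using librosa; the 13 coefficients, 25 ms frames, and 10 ms hop are typical values, not a claim about any specific recognizer, and the random signal merely stands in for recorded speech.

```python
# Sketch: a standard MFCC + delta front-end (typical, illustrative settings).
import numpy as np
import librosa

fs = 16000
y = np.random.randn(fs).astype(np.float32)        # stand-in for one second of speech
mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms frames, 10 ms hop
delta = librosa.feature.delta(mfcc)               # dynamic (velocity) features
features = np.vstack([mfcc, delta])               # shape (26, n_frames)
```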
Caricature of a present-day speech recognizer
- trained with a variety of speech input
  - much global information, no interrelations
- monaural, uni-modal input
- pitch extractor generally not operational
- performs well on average behavior
  - does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
- neglects lots of useful (phonetic) information
- relies heavily on the language model
Useful (phonetic) knowledge neglected so far
- pitch information
- (systematic) durational variability
- spectral reduction/coarticulation (other than via multiphones)
- intelligent selection from multiple features
- quick adaptation to speaker, style & channel
- communicative expectations
- multi-modality
- binaural hearing
Useful information: durational variability
[Figure adapted from Wang (1998)]
Useful information: durational variability
[Figure adapted from Wang (1998): overall average = 95 ms; normal rate = 95 ms, primary stress = 104 ms, word-final = 136 ms, utterance-final = 186 ms]
Useful information: V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read vs. spontaneous; stressed vs. unstressed
- not just for vowels, but also for consonants:
  - duration
  - spectral balance
  - intervocalic sound-energy difference
  - F2 slope difference
  - locus equation (see the sketch below)
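Of the consonant measures above, the locus equation is the easiest to make concrete: F2 at the CV boundary is regressed linearly on F2 at the vowel target, and the slope indexes the degree of coarticulation. The sketch below uses made-up F2 values purely for illustration.

```python
# Sketch: fitting a locus equation for one consonant context (made-up data).
import numpy as np

f2_vowel = np.array([2200.0, 1800.0, 1500.0, 1100.0, 900.0])   # F2 at vowel target (Hz)
f2_onset = np.array([1900.0, 1700.0, 1500.0, 1250.0, 1150.0])  # F2 at CV boundary (Hz)

slope, intercept = np.polyfit(f2_vowel, f2_onset, 1)
# slope near 1 = strong CV coarticulation; slope near 0 = invariant consonant locus
print(f"locus equation: F2_onset = {slope:.2f} * F2_vowel + {intercept:.0f} Hz")
```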
Mean consonant duration and mean error rate for C identification
[Figure adapted from van Son & Pols (Eurospeech’97): C duration vs. C error rate for 791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); C identification by 22 Dutch subjects]
Other useful information
- pronunciation variation (ESCA workshop)
- acoustic attributes of prominence (B. Streefkerk)
- speech efficiency (post-doc project R. van Son)
- confidence measures
- units in speech recognition
  - rather than PLUs, perhaps syllables (S. Greenberg)
- quick adaptation
- prosody-driven recognition / understanding
- multiple features
Speech efficiency
- speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ’96)
- less information is needed for more predictable things:
  - shorter duration and more spectral reduction for frequently occurring syllables and words
  - C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency)
- I(x) = -log2(Prob(x)) in bits (see the sketch below)
(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
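The information measure on the slide is directly computable from corpus counts; a minimal sketch, with made-up (hypothetical) word counts standing in for a real frequency list:

```python
# Sketch: surprisal I(x) = -log2(P(x)) from raw frequency counts.
import math

counts = {"de": 5000, "een": 3000, "spraak": 40, "foneem": 5}  # hypothetical counts
total = sum(counts.values())

def surprisal_bits(word):
    """Frequent words carry few bits; rare words carry many."""
    return -math.log2(counts[word] / total)

print({w: round(surprisal_bits(w), 2) for w in counts})
```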
Correlation between consonant confusion and the 4 measures indicated
[Figure adapted from van Son et al. (Proc. ICSLP’98). Data: one Dutch male speaker, 20 min. of read/spontaneous speech, 12k syllables, 8k words; 791 VCV pairs (read/spontaneous), 308 lexically stressed (+) and 483 unstressed (–); C identification by 22 subjects; significance marked at p 0.01 and p 0.001]
Computational Phonetics (first suggested by R. Moore, ICPhS’95 Stockholm)
- duration modeling (see the sketch below)
- optimal unit selection (as in concatenative synthesis)
- pronunciation variation modeling (SpeCom Nov. ’99)
- vowel reduction models
- computational prosody
- information measures for confusion
- speech efficiency models
- modulation transfer function for speech
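As one concrete example of what "duration modeling" can look like, here is a sketch of classical Klatt-style multiplicative rules, where an inherent segment duration is scaled by context factors down to a minimum duration. This illustrates the genre, not a model from the talk; all numbers below are made up.

```python
# Sketch: Klatt-style rule-based duration model (illustrative factor values).
import math

def rule_duration(inherent_ms, min_ms, factors):
    """dur = min + (inherent - min) * product of context factors."""
    scale = math.prod(factors)
    return min_ms + (inherent_ms - min_ms) * scale

# e.g., a segment with 90 ms inherent and 40 ms minimum duration,
# lengthened utterance-finally (1.4) and shortened when unstressed (0.85)
print(rule_duration(90.0, 40.0, [1.4, 0.85]))   # -> 99.5 ms
```

Statistical alternatives (regression trees, sums-of-products models) fit such factors from data; the durational-variability figures earlier in the talk are exactly the kind of evidence such models would encode.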
Discussion / conclusions
- speech technology needs further improvement for certain tasks (flexibility, robustness)
- phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
- phonetics and speech/language technology should work together more closely, for their mutual benefit
- this Heraeus-Seminar is a possible platform for that discussion