
1 Human and Machine Performance in Speech Processing. Louis C.W. Pols, Institute of Phonetic Sciences / ACLC, University of Amsterdam, The Netherlands. (Apologies: this presentation resembles the keynote at ICPhS'99, San Francisco, CA)

2 IFA, Herengracht 338, Amsterdam. Welcome to the Heraeus-Seminar "Speech Recognition and Speech Understanding", April 3-5, 2000, Physikzentrum Bad Honnef, Germany.

3 Overview
- Phonetics and speech technology
- Do recognizers need 'intelligent ears'?
- What is knowledge?
- How good is human/machine speech recognition?
- How good is synthetic speech?
- Pre-processor characteristics
- Useful (phonetic) knowledge
- Computational phonetics
- Discussion/conclusions

4 Phonetics ↔ Speech Technology

5 Machine performance is more difficult if the test condition deviates from the training condition, because of:
- nativeness and age of speakers
- size and content of vocabulary
- speaking style, emotion, rate
- microphone, background noise, reverberation, communication channel
- non-availability of certain features
However, machines never get tired, bored, or distracted.

6 Do recognizers need intelligent ears?
- intelligent ears = front-end pre-processor
- only if it improves performance
- humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
- robustness is at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)

7 What is knowledge?
- phonetic knowledge
- probabilistic knowledge from databases
- fixed set of features vs. adaptable set
- trading relations, selectivity
- knowledge of the world, expectation
- global vs. detailed → see video (with permission from Interbrew Nederland NV)

8

9 Video is a metaphor for:
- from global to detail (world → Europe → Holland → North Sea coast → Scheveningen → beach → young lady → drinking Dommelsch beer)
- sound → speech → speaker → English → utterance
- 'recognize speech' or 'wreck a nice beach'
- zoom in on whatever information is available
- make an intelligent interpretation, given context
- beware of distracters!

10 Human auditory sensitivity
- stationary vs. dynamic signals
- simple vs. spectrally complex
- detection threshold
- just noticeable differences

11 Detection thresholds and jnd (Table 3 in Proc. ICPhS'99 paper)
- multi-harmonic, simple, stationary signals: frequency jnd ca. 1.5 Hz
- single-formant-like periodic signals: F2 jnd ca. 3-5%, bandwidth (BW) jnd ca. 20-40%

12 DL for short speech-like transitions (adopted from van Wieringen & Pols, Acta Acustica '98; the figure compares complex vs. simple signals and short vs. longer transitions)

13 How good is human / machine speech recognition?

14
- machine SR is surprisingly good for certain tasks
- machine SR could be better for many others (robustness, outliers)
- what are the limits of human performance?
  - in noise
  - for degraded speech
  - with missing information (trading)

15 Human word intelligibility vs. noise (adopted from Steeneken, 1992). At noise levels where recognizers already have trouble, humans only start to have some trouble.

16 Robustness to degraded speech
- speech = time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
  - a prerequisite for digital hearing aids
  - modulating the spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
- temporal smearing of envelope modulation
  - ca. 4 Hz max. in the modulation spectrum → syllable
  - LP > 4 Hz and HP < 8 Hz: little effect on intelligibility
- spectral envelope smearing
  - for BW > 1/3 oct, masked SRT starts to degrade
(for references, see paper in Proc. ICPhS'99)
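The slide's point about a ~4 Hz peak in the modulation spectrum (the syllable rate) can be illustrated numerically. This is a toy NumPy sketch using a synthetic intensity envelope, not the analysis pipeline of the cited studies:

```python
import numpy as np

def modulation_spectrum(envelope, fs):
    """Magnitude spectrum of a mean-removed intensity envelope.

    For running speech the dominant modulation peak lies near
    4 Hz, reflecting the syllable rate (toy sketch only).
    """
    env = envelope - envelope.mean()
    spectrum = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spectrum

# Synthetic 10 s envelope with a 4 Hz "syllable" modulation,
# sampled at 100 Hz (an envelope rate, not an audio rate)
fs = 100.0
t = np.arange(0, 10, 1.0 / fs)
envelope = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)

freqs, spectrum = modulation_spectrum(envelope, fs)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # 4.0 -- the "syllable-rate" modulation peak
```

Low-pass filtering such an envelope below ~4 Hz (temporal smearing) removes exactly this syllable-rate modulation, which is where, as the slide notes, intelligibility starts to suffer.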

17 Robustness to degraded speech and missing information
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
  - fixed-duration segments time-reversed or shifted in time
  - perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed vs. original)
  - low-frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  - syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
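The partly-reversed-speech manipulation is simple to state: time-reverse every successive fixed-duration chunk of the waveform. A minimal NumPy sketch, assuming a mono sample array (an illustration of the manipulation, not Saberi & Perrott's stimulus code):

```python
import numpy as np

def reverse_segments(signal, fs, seg_ms=50):
    """Time-reverse each successive fixed-duration segment.

    Saberi & Perrott (Nature, 1999) report near-perfect sentence
    intelligibility for segment durations up to about 50 ms.
    """
    seg_len = max(1, int(fs * seg_ms / 1000))
    out = signal.copy()
    for start in range(0, len(out), seg_len):
        out[start:start + seg_len] = out[start:start + seg_len][::-1]
    return out

fs = 16000
x = np.arange(fs)                      # one second of dummy samples
y = reverse_segments(x, fs, seg_ms=50)

# Applying the manipulation twice restores the original signal
assert np.array_equal(reverse_segments(y, fs, seg_ms=50), x)
```

Because the low-frequency modulation envelope survives this local reversal, intelligibility is preserved even though the fine spectral detail is scrambled.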

18 How good is synthetic speech? (not the main theme of this seminar, but synthesis and dialogue still deserve attention)
- good enough for certain applications
- could be better in most others
- evaluation: application-specific or multi-tier approach required
- interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998

19 Workshop evaluation procedure
- participants as native listeners
- DARPA-type procedures in data preparation
- balanced listening design
- no detailed results made public
- 3 text types:
  - newspaper sentences
  - semantically unpredictable sentences
  - telephone directory entries
- 42 systems in 8 languages tested

20 Screen for newspaper sentences

21 Some global results
- it worked, but with many practical problems (for a demo see http://www.fon.hum.uva.nl)
- this seems the way to proceed and to expand
- global rating (poor to excellent): text analysis, prosody & signal processing
- and/or more detailed scores
- transcriptions subjectively judged: major/minor/no problems per entry
- web site access to several systems (http://www.ldc.upenn.edu/ltts/)

22 Phonetic knowledge to improve speech synthesis (assuming concatenative synthesis)
- control of emotion, style, voice characteristics
- perceptual implications of:
  - parameterization (LPC, PSOLA)
  - discontinuities (spectral, temporal, prosody)
- improved naturalness (prosody!)
- active adaptation to other conditions: hyper/hypo speech, noise, communication channel, listener impairment
- systematic evaluation

23 Desired pre-processor characteristics in automatic speech recognition
- basic sensitivity to stationary and dynamic sounds
- robustness to degraded speech: rather insensitive to spectral and temporal smearing
- robustness to noise and reverberation
- filter characteristics
  - are BP, PLP, MFCC, RASTA, TRAPS good enough?
  - lateral inhibition (spectral sharpening); dynamics
- what can be neglected? non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
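For reference, the MFCC front-end questioned above rests on a triangular mel filterbank applied to the power spectrum. A minimal NumPy sketch of its construction (omitting the log compression, DCT, liftering, and delta features of a complete front-end):

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale warping used in MFCC front-ends."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular mel filterbank (rows: filters, cols: FFT bins).

    Filter centers are spaced uniformly on the mel scale, so
    resolution is finer at low frequencies, coarser at high ones,
    roughly mimicking cochlear frequency analysis.
    """
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank(n_filters=24, n_fft=512, fs=16000)
print(fb.shape)  # (24, 257)
```

The slide's question is whether this fixed, purely spectral weighting is rich enough, since it captures none of the dynamics, lateral inhibition, or adaptation of the auditory periphery.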

24 Caricature of a present-day speech recognizer
- trained with a variety of speech input: much global information, no interrelations
- monaural, uni-modal input
- pitch extractor generally not operational
- performs well on average behavior, but does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
- neglects lots of useful (phonetic) information
- relies heavily on the language model

25 Useful (phonetic) knowledge neglected so far
- pitch information
- (systematic) durational variability
- spectral reduction/coarticulation (other than multiphone)
- intelligent selection from multiple features
- quick adaptation to speaker, style & channel
- communicative expectations
- multi-modality
- binaural hearing

26 Useful information: durational variability Adopted from Wang (1998)

27 Useful information: durational variability (adopted from Wang, 1998). Mean durations: normal rate = 95 ms; primary stress = 104 ms; word final = 136 ms; utterance final = 186 ms; overall average = 95 ms.

28 Useful information: V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read vs. spontaneous; stressed vs. unstressed
- not just for vowels, but also for consonants:
  - duration
  - spectral balance
  - intervocalic sound energy difference
  - F2 slope difference
  - locus equation
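Of the consonant measures listed, the locus equation is the most directly algorithmic: a linear regression of F2 at the consonant-vowel boundary on F2 at the vowel midpoint, whose slope indexes the degree of coarticulation. A toy NumPy sketch with hypothetical F2 values (the numbers are illustrative, not data from the cited work):

```python
import numpy as np

# Hypothetical F2 measurements (Hz) for one consonant across
# five vowel contexts: a steeper slope means the CV onset tracks
# the vowel more closely, i.e. stronger coarticulation.
f2_mid = np.array([800.0, 1200.0, 1600.0, 2000.0, 2400.0])
f2_onset = 0.6 * f2_mid + 500.0 + np.array([5.0, -8.0, 3.0, -2.0, 2.0])

# Least-squares fit of the locus equation: F2_onset = slope * F2_mid + intercept
slope, intercept = np.polyfit(f2_mid, f2_onset, 1)
print(slope, intercept)  # slope ~0.6, intercept ~500 (built into the toy data)
```

Slopes and intercepts estimated this way are reported to vary systematically with place of articulation and speaking style, which is why the slide lists the locus equation as a measure of consonant reduction.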

29 Mean consonant duration and mean error rate for C identification (adopted from van Son & Pols, Eurospeech'97). 791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); C identification by 22 Dutch subjects.

30 Other useful information:
- pronunciation variation (ESCA workshop)
- acoustic attributes of prominence (B. Streefkerk)
- speech efficiency (post-doc project R. van Son)
- confidence measures
- units in speech recognition: rather than PLU, perhaps syllables (S. Greenberg)
- quick adaptation
- prosody-driven recognition / understanding
- multiple features

31 Speech efficiency
- speech is most efficient if it contains only the information needed to understand it: "Speech is the missing information" (Lindblom, JASA '96)
- less information is needed for more predictable things:
  - shorter duration and more spectral reduction for frequently occurring syllables and words
  - C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency)
- I(x) = -log2(Prob(x)) in bits (see van Son, Koopmans-van Beinum, and Pols, ICSLP'98)
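The self-information measure on this slide computes directly; the probabilities below are illustrative, not corpus frequencies from the cited study:

```python
import math

def information_content(prob):
    """Self-information I(x) = -log2(Prob(x)), in bits (Shannon).

    The slide's logic: frequent (predictable) syllables and words
    carry fewer bits, and so tolerate more durational and spectral
    reduction without becoming confusable.
    """
    return -math.log2(prob)

# A syllable occurring once in 1024 tokens carries 10 bits;
# one occurring half the time carries only 1 bit.
print(information_content(1 / 1024))  # 10.0
print(information_content(0.5))       # 1.0
```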

32 Correlation between consonant confusion and the 4 measures indicated (adopted from van Son et al., Proc. ICSLP'98). Data: Dutch male speaker, 20 min. read/spontaneous speech, 12k syllables, 8k words; 791 VCV pairs (read/spontaneous), 308 lexically stressed (+) and 483 unstressed (-); C identification by 22 subjects. Significance markers: p ≤ 0.01 and p ≤ 0.001.

33 Computational phonetics (first suggested by R. Moore, ICPhS'95 Stockholm)
- duration modeling
- optimal unit selection (as in concatenative synthesis)
- pronunciation variation modeling (SpeCom Nov. '99)
- vowel reduction models
- computational prosody
- information measures for confusion
- speech efficiency models
- modulation transfer function for speech

34 Discussion / conclusions
- speech technology needs further improvement for certain tasks (flexibility, robustness)
- phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
- phonetics and speech/language technology should work together more closely, for their mutual benefit
- this Heraeus-Seminar is a possible platform for that discussion

