
Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology Louis C.W. Pols Institute of Phonetic Sciences / IFOTT University of Amsterdam The Netherlands

IFA, Herengracht 338, Amsterdam
My pre-predecessor: Louise Kaiser, Secretary of the First International Congress of Phonetic Sciences, Amsterdam, 3-8 July 1932
Welcome

Amsterdam ICPhS’32
Jac. van Ginneken, president; L. Kaiser, secretary; A. Roozendaal, treasurer
Subjects:
- physiology of speech and voice (experimental phonetics in its strict meaning)
- study of the development of speech and voice in the individual; their evolution in the history of mankind; the influence of heredity
- anthropology of speech and voice
- phonology
- linguistic psychology
- pathology of speech and voice
- comparative physiology of the sounds of animals
- musicology
136 participants from 16 countries; 43 plenary papers; 24 demonstrations

Amsterdam ICPhS’32
Some of the participants:
- prof. Daniel Jones, London: The theory of phonemes, and its importance in Practical Linguistics
- Sir Richard Paget, London: The Evolution of Speech in Men
- prof. R.H. Stetson, Oberlin: Breathing Movements in Speech
- prof. Prince N. Trubetzkoy, Wien: Charakter und Methode der systematischen phonologischen Darstellung einer gegebenen Sprache
- dr. E. Zwirner, Berlin-Buch:
  - Phonetische Untersuchungen an Aphasischen und Amusischen
  - Quantität, Lautdauerschätzung und Lautkurvenmessung (Theorie und Material)
Later congresses: 2nd, London ’35; 3rd, Ghent ’38; 4th, Helsinki ’61; 5th, Münster ’64

Overview
- Phonetics and speech technology
- Do recognizers need ‘intelligent ears’?
- What is knowledge?
- How good is human/machine speech recognition?
- How good is synthetic speech?
- Pre-processor characteristics
- Useful (phonetic) knowledge
- Computational phonetics
- Discussion/conclusions

Phonetics and Speech Technology

Do recognizers need intelligent ears?
- intelligent ears → front-end pre-processor
- only if it improves performance
- humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
- robustness at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)

What is knowledge?
- phonetic knowledge
- probabilistic knowledge from databases
- fixed set of features vs. adaptable set
- trading relations, selectivity
- knowledge of the world, expectation
- global vs. detailed → see video (with permission from Interbrew Nederland NV)

Video is a metaphor for:
- from global to detail (world → Europe → Holland → North Sea coast → Scheveningen → beach → young lady → drinking Dommelsch beer)
- sound → speech → speaker → English → utterance
- ‘recognize speech’ or ‘wreck a nice beach’
- zoom in on whatever information is available
- make intelligent interpretation, given context
- beware of distractors!

Human auditory sensitivity
- stationary vs. dynamic signals
- simple vs. spectrally complex
- detection threshold
- just noticeable differences
- see Table 3 in paper

Detection thresholds and jnd
[Table residue; signal types: multi-harmonic, simple, stationary signals; single-formant-like periodic signals]
- frequency: 1.5 Hz
- F2: 3 - 5%
- BW: 20 - 40%

DL for short speech-like transitions
Adopted from van Wieringen & Pols (Acta Acustica ’98)
[Figure not reproduced; labels: simple vs. complex signals, short vs. longer transitions]

How good is human / machine speech recognition?

- machine SR surprisingly good for certain tasks
- machine SR could be better for many others
  - robustness, outliers
- what are the limits of human performance?
  - in noise
  - for degraded speech
  - missing information (trading)

Human word intelligibility vs. noise
Adopted from Steeneken (1992)
[Figure not reproduced; annotations: "recognizers have trouble!" and "humans start to have some trouble"]

Robustness to degraded speech
- speech = time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
  - prerequisite for digital hearing aid
  - modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
- temporal smearing of envelope modulation
  - ca. 4 Hz max. in modulation spectrum → syllable
  - LP>4 Hz and HP<8 Hz: little effect on intelligibility
- spectral envelope smearing
  - for BW>1/3 oct masked SRT starts to degrade
(for references, see paper in Proc. ICPhS’99)
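Temporal envelope smearing can be illustrated with a minimal sketch. It is not the procedure used in the studies referred to on this slide: the one-pole low-pass filter, the 100 Hz frame rate, the 4 Hz cutoff, and the names `lowpass_envelope` and `modulation_depth` are all assumptions chosen for illustration. It only shows that low-pass filtering a band envelope keeps slow (syllable-rate) modulations and attenuates fast ones.

```python
import math

def lowpass_envelope(envelope, frame_rate, cutoff_hz):
    """Smear an envelope with a one-pole low-pass filter (illustrative choice)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / frame_rate)
    out, state = [], envelope[0]
    for value in envelope:
        state += alpha * (value - state)  # move part of the way toward the input
        out.append(state)
    return out

def modulation_depth(envelope):
    """Half the peak-to-peak excursion of the envelope."""
    return (max(envelope) - min(envelope)) / 2.0

# Synthetic envelopes (not measured data): 2 Hz and 16 Hz sinusoidal modulation
frame_rate = 100  # assumed envelope sampling rate in Hz
t = [n / frame_rate for n in range(400)]
slow = [1.0 + 0.5 * math.sin(2 * math.pi * 2 * x) for x in t]
fast = [1.0 + 0.5 * math.sin(2 * math.pi * 16 * x) for x in t]

# Fraction of the modulation depth that survives a 4 Hz low-pass filter
slow_kept = modulation_depth(lowpass_envelope(slow, frame_rate, 4.0)) / modulation_depth(slow)
fast_kept = modulation_depth(lowpass_envelope(fast, frame_rate, 4.0)) / modulation_depth(fast)
print(f"2 Hz modulation kept: {slow_kept:.2f}, 16 Hz modulation kept: {fast_kept:.2f}")
```

In a real experiment the envelope would be extracted per frequency band and a much steeper filter would be used; the one-pole filter is kept here only because it fits in a few lines.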

Robustness to degraded speech and missing information
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
  - fixed duration segments time reversed or shifted in time
  - perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed, original)
  - low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  - syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
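The local time reversal of fixed-duration segments can be sketched as follows. This is an illustrative reconstruction, not the stimulus-generation code of the cited study; the function name `reverse_segments` and the toy signal are assumptions.

```python
def reverse_segments(samples, sample_rate, segment_ms):
    """Time-reverse each consecutive fixed-duration segment of a signal."""
    seg_len = max(1, round(sample_rate * segment_ms / 1000))
    out = []
    for start in range(0, len(samples), seg_len):
        # Reverse the samples inside this segment; segment order is unchanged
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy signal: 10 samples at 1 kHz, reversed in 4 ms (4-sample) segments
toy = list(range(10))
locally_reversed = reverse_segments(toy, sample_rate=1000, segment_ms=4)
print(locally_reversed)
```

For real speech one would call it with, for example, `segment_ms=50` on the waveform samples. Applying the function twice with the same settings restores the original signal, which makes it easy to check.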

How good is synthetic speech?
- good enough for certain applications
- could be better in most others
- evaluation: application-specific or multi-tier required
- interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998

Workshop evaluation procedure
- participants as native listeners
- DARPA-type procedures in data preparations
- balanced listening design
- no detailed results made public
- 3 text types
  - newspaper sentences
  - semantically unpredictable sentences
  - telephone directory entries
- 42 systems in 8 languages tested

Screen for newspaper sentences

Some global results
- it worked!, but many practical problems (for demo see http://www.fon.hum.uva.nl)
- this seems the way to proceed and to expand
- global rating (poor to excellent)
  - text analysis, prosody & signal processing
- and/or more detailed scores
- transcriptions subjectively judged
  - major/minor/no problems per entry
- web site access of several systems (http://www.ldc.upenn.edu/ltts/)

Phonetic knowledge to improve speech synthesis (suppose concatenative synthesis)
- control emotion, style, voice characteristics
- perceptual implications of
  - parameterization (LPC, PSOLA)
  - discontinuities (spectral, temporal, prosody)
- improve naturalness (prosody!)
- active adaptation to other conditions
  - hyper/hypo, noise, comm. channel, listener impairment
- systematic evaluation

Desired pre-processor characteristics in Automatic Speech Recognition
- basic sensitivity for stationary and dynamic sounds
- robustness to degraded speech
  - rather insensitive to spectral and temporal smearing
- robustness to noise and reverberation
- filter characteristics
  - are BP, PLP, MFCC, RASTA, TRAPS good enough?
  - lateral inhibition (spectral sharpening); dynamics
- what can be neglected?
  - non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.

Caricature of present-day speech recognizer
- trained with a variety of speech input
  - much global information, no interrelations
- monaural, uni-modal input
- pitch extractor generally not operational
- performs well on average behavior
  - does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
- neglects lots of useful (phonetic) information
- heavily relies on language model

Useful (phonetic) knowledge neglected so far
- pitch information
- (systematic) durational variability
- spectral reduction/coarticulation (other than multiphone)
- intelligent selection from multiple features
- quick adaptation to speaker, style & channel
- communicative expectations
- multi-modality
- binaural hearing

Useful information: durational variability Adopted from Wang (1998)

Useful information: durational variability
Adopted from Wang (1998)
[Figure not reproduced; values in ms]
- normal rate = 95
- primary stress = 104
- word final = 136
- utterance final = 186
- overall average = 95

Useful information: V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read - spontaneous; stressed - unstressed
- not just for vowels, but also for consonants
  - duration
  - spectral balance
  - intervocalic sound energy difference
  - F2 slope difference
  - locus equation
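Of the consonant measures listed above, the locus equation is the easiest to show in code. It is commonly defined as a straight-line fit of F2 at vowel onset against F2 in the vowel nucleus, per consonant. The sketch below is a plain least-squares fit; the function name `fit_locus_equation` and the formant values are assumptions, and the values are synthetic (generated from a known line), not measurements from the studies cited here.

```python
def fit_locus_equation(f2_vowel, f2_onset):
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept."""
    n = len(f2_vowel)
    mean_x = sum(f2_vowel) / n
    mean_y = sum(f2_onset) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(f2_vowel, f2_onset))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic F2 values in Hz, generated from slope 0.5 and intercept 900 Hz
f2_vowel = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]
f2_onset = [0.5 * v + 900.0 for v in f2_vowel]

slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

A slope near 1 means the onset F2 follows the vowel closely (strong coarticulation); a slope near 0 means a fixed consonant locus.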

Mean consonant duration / Mean error rate for C identification
Adopted from van Son & Pols (Eurospeech’97)
[Figure not reproduced; panels: C-duration, C error rate]
- 791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker)
- C-identification by 22 Dutch subjects

Other useful information:
- pronunciation variation (ESCA workshop)
- acoustic attributes of prominence (B. Streefkerk)
- speech efficiency (post-doc project R. v. Son)
- confidence measure
- units in speech recognition
  - rather than PLU, perhaps syllables (S. Greenberg)
- quick adaptation
- prosody-driven recognition / understanding
- multiple features

Speech efficiency
- speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ’96)
- less information needed for more predictable things:
  - shorter duration and more spectral reduction for high-frequency syllables and words
  - C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)
I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
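The information measure on this slide can be computed directly from frequency counts. The sketch below is a minimal illustration, assuming Prob(x) is estimated as relative frequency; the function name `information_content` and the toy counts are hypothetical and not taken from the cited corpus.

```python
import math
from collections import Counter

def information_content(counts):
    """Return I(x) = -log2(Prob(x)) in bits, with Prob(x) as relative frequency."""
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}

# Hypothetical toy syllable counts (illustration only)
counts = Counter({"de": 8, "van": 4, "straat": 2, "schrift": 2})
info = information_content(counts)

for syllable, bits in info.items():
    print(f"{syllable}: {bits:.1f} bits")
```

Frequent items get low values and rare items get high values, which is the sense in which more predictable syllables and words need less acoustic information.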

Correlation between consonant confusion and 4 measures indicated
Adopted from van Son et al. (Proc. ICSLP’98)
[Figure not reproduced]
- Dutch male speaker, 20 min. read/spontaneous (R/S) speech: 12k syllables, 8k words
- 791 VCV (R/S): 308 lexically stressed, 483 unstressed
- C identification by 22 subjects

Computational Phonetics (R. Moore, ICPhS’95 Stockholm)
- duration modeling
- optimal unit selection (like in concatenative synthesis)
- pronunciation variation modeling
- vowel reduction models
- computational prosody
- information measures for confusion
- speech efficiency models
- modulation transfer function for speech

Discussion / Conclusions
- speech technology needs further improvement for certain tasks (flexibility, robustness)
- phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
- phonetics and speech/language technology should work together more closely, for their mutual benefit
- this conference is the ideal platform for that
