Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology Louis C.W. Pols Institute of Phonetic Sciences / IFOTT University of Amsterdam The Netherlands

IFA, Herengracht 338, Amsterdam. My pre-predecessor: Louise Kaiser, secretary of the First International Congress of Phonetic Sciences, Amsterdam, 3-8 July 1932. Welcome!

Amsterdam ICPhS’32
Jac. van Ginneken, president; L. Kaiser, secretary; A. Roozendaal, treasurer
Subjects:
- physiology of speech and voice (experimental phonetics in its strict meaning)
- study of the development of speech and voice in the individual; their evolution in the history of mankind; the influence of heredity
- anthropology of speech and voice
- phonology
- linguistic psychology
- pathology of speech and voice
- comparative physiology of the sounds of animals
- musicology
136 participants from 16 countries; 43 plenary papers; 24 demonstrations

Amsterdam ICPhS’32
Some of the participants:
- prof. Daniel Jones, London: The theory of phonemes, and its importance in Practical Linguistics
- Sir Richard Paget, London: The Evolution of Speech in Men
- prof. R.H. Stetson, Oberlin: Breathing Movements in Speech
- prof. Prince N. Trubetzkoy, Wien: Charakter und Methode der systematischen phonologischen Darstellung einer gegebenen Sprache
- dr. E. Zwirner, Berlin-Buch: Phonetische Untersuchungen an Aphasischen und Amusischen; Quantität, Lautdauerschätzung und Lautkurvenmessung (Theorie und Material)
Later congresses: 2nd, London ’35; 3rd, Ghent ’38; 4th, Helsinki ’61; 5th, Münster ’64

Overview
- Phonetics and speech technology
- Do recognizers need ‘intelligent ears’?
- What is knowledge?
- How good is human/machine speech recognition?
- How good is synthetic speech?
- Pre-processor characteristics
- Useful (phonetic) knowledge
- Computational phonetics
- Discussion/conclusions

Phonetics ↔ Speech Technology

Do recognizers need intelligent ears?
- intelligent ears → front-end pre-processor
- only if it improves performance
- humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
- robustness is at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)

What is knowledge?
- phonetic knowledge
- probabilistic knowledge from databases
- fixed set of features vs. adaptable set
- trading relations, selectivity
- knowledge of the world, expectation
- global vs. detailed → see video (with permission from Interbrew Nederland NV)

Video is a metaphor for:
- from global to detail (world → Europe → Holland → North Sea coast → Scheveningen → beach → young lady → drinking Dommelsch beer)
- sound → speech → speaker → English → utterance
- ‘recognize speech’ or ‘wreck a nice beach’
- zoom in on whatever information is available
- make an intelligent interpretation, given context
- beware of distractors!

Human auditory sensitivity
- stationary vs. dynamic signals
- simple vs. spectrally complex
- detection threshold
- just noticeable differences
- see Table 3 in paper

Detection thresholds and jnd
[Figure: detection thresholds and just noticeable differences for multi-harmonic, simple, stationary signals and for single-formant-like periodic signals; reported jnd values on the order of 1.5 Hz and 3-5% (frequency, F2, bandwidth)]

DL for short speech-like transitions
[Figure adapted from van Wieringen & Pols (Acta Acustica ’98): difference limens for complex vs. simple stimuli and for short vs. longer transitions]

How good is human / machine speech recognition?

- machine SR surprisingly good for certain tasks
- machine SR could be better for many others: robustness, outliers
- what are the limits of human performance?
  - in noise
  - for degraded speech
  - missing information (trading)

Human word intelligibility vs. noise
[Figure adapted from Steeneken (1992): at noise levels where humans start to have some trouble, recognizers are already in serious trouble]

Robustness to degraded speech
- speech = time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
  - prerequisite for digital hearing aids
  - modulating spectral slope: -5 to +5 dB/oct, Hz
- temporal smearing of envelope modulation
  - ca. 4 Hz max. in modulation spectrum → syllable
  - LP > 4 Hz and HP < 8 Hz have little effect on intelligibility
- spectral envelope smearing
  - for BW > 1/3 oct, masked SRT starts to degrade
(for references, see paper in Proc. ICPhS’99)
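The envelope-modulation idea above can be made concrete with a short numerical sketch. This is my own minimal illustration, not from the talk: the function name and the toy amplitude-modulated signal are assumptions. It computes a short-time RMS envelope and its spectrum, which peaks at the modulation rate, here 4 Hz, the syllable-like rate the slide mentions.

```python
import numpy as np

def modulation_spectrum(x, fs, frame_ms=10):
    """Envelope modulation spectrum (illustrative sketch).

    The short-time RMS envelope is sampled every frame_ms; its Fourier
    transform shows how fast the envelope fluctuates. For natural
    speech the dominant modulation frequency lies near 4 Hz.
    """
    hop = int(fs * frame_ms / 1000)
    n_frames = len(x) // hop
    env = np.array([np.sqrt(np.mean(x[i * hop:(i + 1) * hop] ** 2))
                    for i in range(n_frames)])
    env = env - env.mean()                       # remove DC component
    spec = np.abs(np.fft.rfft(env))
    mod_freqs = np.fft.rfftfreq(n_frames, d=frame_ms / 1000)
    return mod_freqs, spec

# toy "speech": a 500 Hz carrier, amplitude-modulated at 4 Hz
fs = 8000
t = np.arange(2 * fs) / fs                       # 2 s of signal
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
freqs, spec = modulation_spectrum(x, fs)
peak = freqs[np.argmax(spec)]                    # dominant modulation frequency
```

Low-pass filtering this envelope spectrum above 4 Hz, or high-pass filtering it below 8 Hz, is the kind of temporal smearing the slide says barely harms intelligibility.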

Robustness to degraded speech and missing information
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
  - fixed-duration segments time-reversed or shifted in time
  - perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed / original)
  - low-frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  - syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
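The Saberi & Perrott manipulation referred to above is easy to sketch: cut the signal into fixed-duration chunks and play each chunk backwards. The snippet below is a hedged illustration of that manipulation only (function name and ramp test signal are my own); it makes no intelligibility claims by itself.

```python
import numpy as np

def reverse_segments(x, fs, seg_ms=50):
    """Time-reverse every successive seg_ms chunk of the signal.

    Mimics the Saberi & Perrott (1999) manipulation: with segments up
    to about 50 ms, sentences reportedly remain perfectly intelligible
    even though each local waveform chunk is played backwards.
    """
    seg = int(fs * seg_ms / 1000)
    y = x.copy()
    for start in range(0, len(x) - seg + 1, seg):
        y[start:start + seg] = x[start:start + seg][::-1]
    return y

fs = 16000
x = np.arange(fs, dtype=float)    # 1 s ramp as a stand-in for speech
y = reverse_segments(x, fs)       # each 50 ms (800-sample) chunk reversed
```

Note the manipulation is its own inverse: applying it twice restores the original signal, which makes it convenient for building the reversed/original demo pairs mentioned on the slide.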

How good is synthetic speech?
- good enough for certain applications
- could be better in most others
- evaluation: application-specific or multi-tier required
- interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998

Workshop evaluation procedure
- participants as native listeners
- DARPA-type procedures in data preparation
- balanced listening design
- no detailed results made public
- 3 text types
  - newspaper sentences
  - semantically unpredictable sentences
  - telephone directory entries
- 42 systems in 8 languages tested

Screen for newspaper sentences

Some global results
- it worked!, but many practical problems (for demo see …)
- this seems the way to proceed and to expand
- global rating (poor to excellent)
  - text analysis, prosody & signal processing
- and/or more detailed scores
- transcriptions subjectively judged
  - major/minor/no problems per entry
- web site access of several systems (…)

Phonetic knowledge to improve speech synthesis (assuming concatenative synthesis)
- control of emotion, style, voice characteristics
- perceptual implications of
  - parameterization (LPC, PSOLA)
  - discontinuities (spectral, temporal, prosodic)
- improve naturalness (prosody!)
- active adaptation to other conditions
  - hyper/hypo speech, noise, communication channel, listener impairment
- systematic evaluation

Desired pre-processor characteristics in Automatic Speech Recognition
- basic sensitivity for stationary and dynamic sounds
- robustness to degraded speech
  - rather insensitive to spectral and temporal smearing
- robustness to noise and reverberation
- filter characteristics
  - are BP, PLP, MFCC, RASTA, TRAPS good enough?
  - lateral inhibition (spectral sharpening); dynamics
- what can be neglected?
  - non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
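One shared ingredient of the MFCC/PLP-style front-ends questioned above is a mel-scaled filterbank, which at least captures the ear's decreasing spectral resolution with frequency. A minimal sketch, assuming the common O'Shaughnessy mel formula (the function names and filterbank parameters are my own choices, not from the talk):

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy formula commonly used in MFCC front-ends
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_freqs(n_filters=20, f_lo=0.0, f_hi=4000.0):
    """Center frequencies of a triangular mel filterbank: equally
    spaced on the mel scale, hence increasingly widely spaced in Hz,
    a crude model of auditory frequency resolution."""
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    return mel_to_hz(mels[1:-1])

centers = mel_center_freqs()
spacings = np.diff(centers)       # grows monotonically with frequency
```

The slide's point stands: such a filterbank models spectral resolution only; lateral inhibition, temporal dynamics, and the non-linearities listed above are all outside this sketch.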

Caricature of a present-day speech recognizer
- trained with a variety of speech input
  - much global information, no interrelations
- monaural, uni-modal input
- pitch extractor generally not operational
- performs well on average behavior
  - does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
- neglects lots of useful (phonetic) information
- heavily relies on the language model

Useful (phonetic) knowledge neglected so far
- pitch information
- (systematic) durational variability
- spectral reduction/coarticulation (other than multiphone)
- intelligent selection from multiple features
- quick adaptation to speaker, style & channel
- communicative expectations
- multi-modality
- binaural hearing

Useful information: durational variability
[Figure adapted from Wang (1998)]

Useful information: durational variability
Adapted from Wang (1998). Mean durations: normal rate = 95 ms; primary stress = 104 ms; word final = 136 ms; utterance final = 186 ms; overall average = 95 ms.
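The durational variability above is systematic enough to express as lengthening factors. A small worked example using only the numbers read off the Wang (1998) slide (the dictionary layout and variable names are mine):

```python
# Mean segment durations (ms) read off the Wang (1998) slide
durations_ms = {
    "normal rate": 95,
    "primary stress": 104,
    "word final": 136,
    "utterance final": 186,
}
overall_average_ms = 95

# lengthening factor relative to the overall average duration
factors = {k: v / overall_average_ms for k, v in durations_ms.items()}
# utterance-final segments come out nearly twice as long as the average,
# which is exactly the kind of systematic variability a recognizer
# could exploit instead of treating duration as noise
```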

Useful information: V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read vs. spontaneous; stressed vs. unstressed
- not just for vowels, but also for consonants
  - duration
  - spectral balance
  - intervocalic sound energy difference
  - F2 slope difference
  - locus equation

Mean consonant duration and mean error rate for C identification
[Figure adapted from van Son & Pols (Eurospeech’97): C-duration vs. C error rate for 791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); C-identification by 22 Dutch subjects]

Other useful information:
- pronunciation variation (ESCA workshop)
- acoustic attributes of prominence (B. Streefkerk)
- speech efficiency (post-doc project R. van Son)
- confidence measures
- units in speech recognition
  - rather than PLU, perhaps syllables (S. Greenberg)
- quick adaptation
- prosody-driven recognition / understanding
- multiple features

Speech efficiency
- speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ’96)
- less information is needed for more predictable things:
  - shorter duration and more spectral reduction for frequently occurring syllables and words
  - C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency)
- I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
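The information measure I(x) = -log2(Prob(x)) above is straightforward to compute from frequency counts. A minimal sketch (the function name and the toy Dutch syllable counts are hypothetical, for illustration only):

```python
import math
from collections import Counter

def information_bits(counts):
    """Per-item information content I(x) = -log2(P(x)), in bits,
    with P(x) estimated as relative frequency from raw counts.

    Frequent items are predictable and carry few bits; rare items
    carry many. Per the slide, the frequent syllables/words are the
    ones that get shortened and spectrally reduced.
    """
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}

# toy syllable counts (hypothetical, for illustration only)
counts = Counter({"de": 64, "het": 16, "strengst": 1})
info = information_bits(counts)
# "de" covers 64 of 81 tokens, so it carries well under 1 bit,
# while the rare "strengst" carries log2(81) ≈ 6.3 bits
```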

Correlation between consonant confusion and the 4 measures indicated
[Figure adapted from van Son et al. (Proc. ICSLP’98): one Dutch male speaker, 20 min. read/spontaneous speech, 12k syllables, 8k words, 791 VCV pairs, lexically stressed and unstressed; C identification by 22 subjects]

Computational Phonetics (R. Moore, ICPhS’95 Stockholm)
- duration modeling
- optimal unit selection (as in concatenative synthesis)
- pronunciation variation modeling
- vowel reduction models
- computational prosody
- information measures for confusion
- speech efficiency models
- modulation transfer function for speech

Discussion / Conclusions
- speech technology needs further improvement for certain tasks (flexibility, robustness)
- phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
- phonetics and speech/language technology should work together more closely, for their mutual benefit
- this conference is the ideal platform for that