Spoken Language Technologies: A review of application areas and research issues
Analysis and synthesis of F0 contours
Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań
Humboldt-Kolleg, Słubice November 2008

Spoken Language Technologies: Introduction (1)

The need for and increasing interest in SLT systems:
- oral information is more efficient than a written message
- speech is the easiest and fastest way of communication (man – man, man – machine)

Progress in the field:
- technological advances in computer science
- availability of specialized speech analysis and processing tools
- collection and management of large speech corpora
- investigation of acoustic dimensions of speech signals: fundamental frequency (F0), duration, intensity and spectral characteristics

Spoken Language Technologies: Introduction (2)

The tasks of SLT systems (TTS and ASR)

Speech synthesis (TTS, text-to-speech) systems generate a speech signal for a given input text
- example: BOSS (Polish module developed at Dept. of Phonetics in cooperation with IKP, Uni Bonn)
- ECESS (European Centre of Excellence in Speech Synthesis): standards of development of language resources, tools, modules and systems

Automatic speech recognition (ASR) systems provide the text of the input speech signal
- example: Jurisdic (first Polish ASR system for needs of Police, Public Prosecutors and Administration of Justice)

Spoken Language Technologies: Application areas

Speech synthesis
- telecommunications (access to textual information over the telephone)
- information retrieval
- measurement and control systems
- fundamental & applied research on speech and language
- a tool of communication, e.g. for the visually handicapped

Speech recognition & related technologies
- text dictation
- information retrieval & management
- man-machine communication (together with speech synthesis):
  - dialogue systems
  - speech-to-speech translation
  - Computer Assisted Language Learning, CALL (e.g. the AZAR tutoring system developed in the scope of the EURONOUNCE project)

Spoken Language Technologies: Performance of TTS and ASR systems

Speech synthesis
- high intelligibility and naturalness in limited domains (e.g. broadcasting news)

Speech recognition
- the best results for small vocabulary tasks
- the state-of-the-art speaker-independent LVCSR (large-vocabulary continuous speech recognition) systems achieve a word-error rate of 3%

Generally, the output quality is high as regards generation/recognition of the linguistic propositional content of speech

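The word-error rate quoted on this slide is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the talk or from any particular ASR toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# Illustrative (made-up) transcripts: one word of six is dropped by the recognizer.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.2%}")
```

Because the denominator is the reference length, a WER of 3% corresponds to roughly three word errors per hundred reference words, and the value can exceed 100% when the hypothesis contains many insertions.
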
Spoken Language Technologies: Limitations of TTS and ASR systems

Insufficient knowledge about methods for processing the non-verbal content of speech, i.e. affective information: speaker’s attitude, emotional state, mood, interpersonal stances & personality traits

Speech synthesis
- lack of variability in speaking style, which encodes affective information, can be detrimental to communication (e.g. in speech-to-speech translation)
- data-driven approach to conversational, expressive speech synthesis is inflexible and quite costly

Speech recognition
- transcription of conversational and expressive speech: substantially higher word-error rate

Spoken Language Technologies: Progress in the field (1)

The need for modeling the non-verbal content of speech, i.e. affective information

Applications:
- high-quality conversational and emotional speech synthesis (for dialogue or speech-to-speech translation systems)
- commerce: monitoring of agent-customer interactions, information retrieval and management (e.g. QA5)
- public security, criminology: secured area access control (speaker verification), truth-detection investigation (e.g. Computer Voice Stress Analyzer, Layered Voice Analysis)

Spoken Language Technologies: Progress in the field (2)

Prosodic features: fundamental frequency (F0, the central acoustic variable that underlies intonation), intensity, duration and voice quality -> encoding and decoding of affective information

Emotion: Anger, Fear, Elation
- higher mean F0
- higher F0 variability
- higher intensity
- increased speaking rate

Emotion: Sadness, Boredom
- lower mean F0
- lower F0 variability
- lower intensity
- decreased speaking rate

Intonation models: hierarchical, sequential, acoustic-phonetic, phonological, etc.
- linguistic variation: well handled
- affective, emotional variation: unaccounted for

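Two of the cues listed on this slide, mean F0 and F0 variability, can be read directly off a frame-level F0 track. The sketch below assumes a list of F0 values in Hz in which unvoiced frames are marked with 0 (a common but not universal convention) and uses the standard deviation as the variability measure; the talk does not say which measure was used. The function name `f0_summary` and the two tracks are hypothetical toy values, not measurements.

```python
import statistics


def f0_summary(f0_track_hz):
    """Return (mean F0, F0 standard deviation) in Hz over voiced frames only."""
    # Assumption: a value of 0 marks an unvoiced frame and must be skipped,
    # otherwise it would drag the mean down and inflate the variability.
    voiced = [f for f in f0_track_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.mean(voiced), statistics.stdev(voiced)


# Made-up toy tracks for illustration only.
flat_track = [0, 118.0, 120.0, 122.0, 0, 120.0]
lively_track = [0, 180.0, 240.0, 150.0, 0, 230.0]

for name, track in [("flat", flat_track), ("lively", lively_track)]:
    mean_f0, sd_f0 = f0_summary(track)
    print(f"{name}: mean F0 = {mean_f0:.1f} Hz, SD = {sd_f0:.1f} Hz")
```

In practice F0 variability is often expressed in semitones rather than Hz so that speakers with different pitch ranges can be compared; the Hz version is kept here for brevity.
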
The comprehensive intonation model: Components
- a module of F0 contour analysis
- a module of F0 contour synthesis
- description of intonation
  - discrete tonal categories (higher-level, access to the meaning of the utterance)
  - acoustic parameters (low-level)

[Diagram labels: F0, analysis (encoding), intonation description, generation (decoding)]

The comprehensive intonation model: Analysis and Synthesis

Automatic analysis of F0 contours
Summary
- results comparable to inter-labeler consistency in manual annotation of intonation
- high accuracy achieved using small vectors of acoustic features
- statistical modeling techniques
- application: 1) automatic labeling of speech corpora, 2) lexical & semantic content, 3) ambiguous parses, 4) estimation of F0 targets

Automatic synthesis of F0 contours
Summary
- estimation of F0 values with a regression model
- results comparable to those reported in the literature
- natural (similar to the original ones) F0 contours for synthesis of high-quality and comprehensible speech (confirmed in perception tests)

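The slide states only that F0 values are estimated with a regression model; it does not name the model type or the predictors. As a rough illustration of the general idea, the sketch below fits an ordinary least-squares line with a single hypothetical predictor (relative position of a syllable in the phrase) and reports the root-mean-square error against the observed values. The function names, the predictor and the data are all assumptions made for this example, and the numbers are synthetic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed F0 values (Hz)."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )


# Synthetic toy data: hypothetical relative position in the phrase (0 = start,
# 1 = end) against an F0 target in Hz, mimicking a gradual downtrend.
positions = [0.0, 0.25, 0.5, 0.75, 1.0]
f0_targets = [205.0, 188.0, 181.0, 172.0, 158.0]

slope, intercept = fit_line(positions, f0_targets)
predicted = [slope * x + intercept for x in positions]
print(f"F0 ~ {slope:.1f} * position + {intercept:.1f}")
print(f"RMSE = {rmse(predicted, f0_targets):.2f} Hz")
```

A realistic model would use several predictors (for example the tonal category of the accent and its phrasal context) and would be evaluated on held-out utterances rather than on the training data used here.
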
The comprehensive intonation model: Synthesis example (1)
Audio (1): Mean opinion in the perception test: no audible difference

The comprehensive intonation model: Synthesis example (2)
Audio (2): Mean opinion in the perception test: very good quality

Spoken Language Technologies: Future research issues

Extensive and systematic investigation of the mechanisms in voice production and perception of affective speech:
- contribution from other knowledge domains (psychology)
- affective speech data collection
- classification of affective states
- types of acoustic parameters
- measurement of affective inferences

THANK YOU FOR YOUR ATTENTION!