
1 Speech synthesis

2 What is the task?
– Generating natural-sounding speech on the fly, usually from text
What are the main difficulties?
– What to say and how to say it
How is it approached?
– Two main approaches, both with pros and cons
How good is it?
– Excellent, almost unnoticeable at its best
How much better could it be?
– Marginally

3 Input type
Concept-to-speech vs text-to-speech
In CTS, the content of the message is determined from an internal representation, not by reading out text
– E.g. a database query system
– No problem of text interpretation

4 Text-to-speech
What to say: text-to-phoneme conversion is not straightforward
– Dr Smith lives on Marine Dr in Chicago IL. He got his PhD from MIT. He earns $70,000 p.a.
– Have you read that book? No, I'm still reading it. I live in Reading.
How to say it: not just the choice of phonemes, but allophones and coarticulation effects, as well as prosodic features (pitch, loudness, length)

5 Architecture of TTS systems
[Diagram: a text-to-phoneme module takes text input in orthographic form through normalization and grapheme-to-phoneme conversion (drawing on an abbreviation lexicon, an exceptions lexicon, orthographic rules and grammar rules) to produce a phoneme string; prosodic modelling (using a prosodic model) yields a phoneme string + prosodic annotation; a phoneme-to-speech module then performs acoustic synthesis (various methods) to give the synthetic speech output.]

6 Text normalization
Any text that has a special pronunciation should be stored in a lexicon
– Abbreviations (Mr, Dr, Rd, St, Middx)
– Acronyms (UN but UNESCO)
– Special symbols (&, %)
– Particular conventions (£5, $5 million, 12°C)
– Numbers are especially difficult (e.g. 1,995)
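A minimal sketch of what such a lookup-plus-rules normalization pass might look like. All lexicon entries and the number expansion are illustrative stand-ins, not from any real system:

```python
import re

# Illustrative abbreviation/symbol lexicon (hypothetical entries)
ABBREVIATIONS = {
    "Dr": "doctor",   # ambiguous: could also be "drive" -- see slide 4
    "Mr": "mister",
    "Rd": "road",
    "St": "saint",    # or "street", context-dependent
    "&": "and",
    "%": "percent",
}

def expand_number(digits: str) -> str:
    """Toy number expansion: read out the digits one by one.
    A real system needs year/amount/phone-number context."""
    return " ".join(digits.replace(",", ""))

def normalize(text: str) -> str:
    tokens = []
    for tok in text.split():
        bare = tok.strip(".,")
        if bare in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[bare])
        elif re.fullmatch(r"[\d,]+", bare):
            tokens.append(expand_number(bare))
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(normalize("Dr Smith lives on Marine Dr"))
# -> "doctor Smith lives on Marine doctor"  (wrong! "Dr" needs context)
```

The deliberately wrong output shows why pure lexicon lookup is not enough: normalization interacts with the disambiguation problems of slides 4 and 9.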

7 Grapheme-to-phoneme conversion
English spelling is complex but largely regular; other languages are more (or less) so
Gross exceptions must be in the lexicon
Lexicon or rules?
– If look-up is quick, may as well store them
– But you need rules anyway for unknown words
Many words have multiple pronunciations
– Free variation (e.g. controversy, either)
– Conditioned variation (e.g. record, import, weak forms)
– Genuine homographs
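A sketch of the lexicon-first, rules-as-fallback strategy. The phoneme strings are illustrative ARPAbet-like symbols, and the "rules" are a trivial stand-in for what is really a large set of context-sensitive rules or a trained model:

```python
# Hypothetical exceptions lexicon: spelling -> phonemes
LEXICON = {
    "colonel": "K ER N AH L",
    "yacht":   "Y AA T",
    "either":  "IY DH ER",   # free variation: also "AY DH ER"
}

# Trivial letter-to-sound stand-in
LETTER_RULES = {"c": "K", "a": "AE", "t": "T", "d": "D",
                "o": "AA", "g": "G", "s": "S"}

def g2p(word: str) -> str:
    word = word.lower()
    if word in LEXICON:          # quick look-up for known words
        return LEXICON[word]
    # fall back to rules for unknown words
    return " ".join(LETTER_RULES.get(ch, ch.upper()) for ch in word)

print(g2p("colonel"))  # lexicon hit:     K ER N AH L
print(g2p("cats"))     # rule fallback:   K AE T S
```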

8 Grapheme-to-phoneme conversion
Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean)
Much harder for others (English, French)
Especially if the writing system is only partially alphabetic (Arabic, Urdu)
Or not alphabetic at all (Chinese, Japanese)

9 Syntactic (etc.) analysis
Homograph disambiguation requires syntactic analysis
– He makes a record of everything they record.
– I read a lot. What have you read recently?
Analysis is also essential to determine appropriate prosodic features
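A sketch of how a part-of-speech tagger can separate noun "record" (stress on the first syllable) from verb "record" (stress on the second). The pronunciation table is hypothetical; the example assumes NLTK is installed along with its tokenizer and tagger models (via nltk.download):

```python
import nltk

# Hypothetical pronunciation table keyed on (word, coarse POS);
# ARPAbet-style symbols, 1 = primary stress, 0 = unstressed
HOMOGRAPH_PRONUNCIATIONS = {
    ("record", "NOUN"): "R EH1 K ER0 D",
    ("record", "VERB"): "R IH0 K AO1 R D",
}

def pronounce_homographs(sentence: str) -> None:
    tokens = nltk.word_tokenize(sentence)
    for word, tag in nltk.pos_tag(tokens):
        coarse = "VERB" if tag.startswith("VB") else "NOUN"
        pron = HOMOGRAPH_PRONUNCIATIONS.get((word.lower(), coarse))
        if pron:
            print(f"{word}/{tag} -> {pron}")

pronounce_homographs("He makes a record of everything they record.")
# expected (tagger output may vary):
#   record/NN  -> R EH1 K ER0 D
#   record/VBP -> R IH0 K AO1 R D
```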

10 Architecture of TTS systems
[Same pipeline diagram as slide 5, shown again before turning to the phoneme-to-speech module.]

11 Prosody modelling
Pitch, length, loudness
Intonation (pitch)
– essential to avoid a monotonous robot-like voice
– linked to basic syntax (e.g. statement vs question), but also to thematization (stress)
– pitch range is a sensitive issue
Rhythm (length)
– has to do with pace (natural tendency to slow down at the end of an utterance)
– also need to pause at appropriate places
– linked (with pitch and loudness) to stress
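A toy illustration of the intonation side: a pitch contour that declines steadily across the utterance (declination) and rises at the end for a yes/no question. All constants are arbitrary, for illustration only:

```python
import numpy as np

def f0_contour(n_syllables: int, question: bool = False,
               start_hz: float = 220.0, decline_hz: float = 6.0):
    """Toy pitch contour: steady declination across the utterance,
    with a final rise if it is a yes/no question."""
    f0 = start_hz - decline_hz * np.arange(n_syllables, dtype=float)
    if question:
        f0[-1] += 40.0   # final rise, arbitrary magnitude
    return f0

print(f0_contour(6))                 # statement: falls throughout
print(f0_contour(6, question=True))  # question: rises at the end
```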

12 Acoustic synthesis
Alternative methods:
– Articulatory synthesis
– Formant synthesis
– Concatenative synthesis
– Unit selection synthesis

13 Articulatory synthesis
Simulation of the physical processes of human articulation
Wolfgang von Kempelen (1734–1804) and others used bellows, reeds and tubes to construct mechanical speaking machines
Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.
Too much like hard work

14 Formant synthesis
Reproduce the relevant characteristics of the acoustic signal
In particular, the amplitude and frequency of formants
But also other resonances and noise, e.g. for nasals, laterals, fricatives
Values of acoustic parameters are derived by rule from the phonetic transcription
Result is intelligible, but too "pure" and sounds synthetic
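A source-filter sketch of the idea: an impulse train at the fundamental frequency passed through a cascade of second-order resonators, one per formant. The formant frequencies and bandwidths below are rough, illustrative values for an [a]-like vowel, not measurements from any particular speaker:

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate, Hz

def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order all-pole resonator (one formant)."""
    r = np.exp(-np.pi * bandwidth_hz / FS)
    theta = 2 * np.pi * freq_hz / FS
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, signal)

def vowel(f0_hz=120, formants=((700, 130), (1200, 70), (2600, 160)),
          dur_s=0.5):
    """Cascade formant synthesis of a static vowel."""
    n = int(FS * dur_s)
    source = np.zeros(n)
    source[::FS // f0_hz] = 1.0      # impulse train at F0
    out = source
    for freq, bw in formants:        # one resonator per formant
        out = resonator(out, freq, bw)
    return out / np.abs(out).max()   # normalize amplitude

samples = vowel()  # write to a WAV file or play back to listen
```

The "too pure" quality of formant synthesis comes from exactly this: a perfectly periodic source and a handful of clean resonances, with none of the irregularity of a human voice.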

15 Formant synthesis
Demo:
– In the control panel, select the "Speech" icon
– Type in your text and Preview Voice
– You may have a choice of voices

16 Concatenative synthesis
Concatenate segments of pre-recorded natural human speech
Requires a database of previously recorded human speech covering all the possible segments to be synthesised
A segment might be a phoneme, syllable, word, phrase, or any combination
Or, something else more clever...

17 Diphone synthesis
Most important for natural-sounding speech is to get the transitions right (allophonic variation, coarticulation effects)
These are found at the boundary between phoneme segments
"Diphones" are fragments of the speech signal cutting across phoneme boundaries
If a language has P phones, the number of diphones is ~P² (some combinations are impossible)
– e.g. 800 for Spanish, 1200 for French, 2500 for German
Example: "my number" → phones m, y, n, u, m, b, er, with one diphone spanning each phone boundary
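The diphone inventory idea in code, using the slide's "my number" example (the phone symbols are illustrative, with "#" marking silence):

```python
# Phones for "my number" (illustrative, silence marked '#'):
phones = ["#", "m", "y", "n", "u", "m", "b", "er", "#"]

# Each diphone runs from the middle of one phone to the middle of
# the next, so it contains exactly one phone-to-phone transition.
diphones = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]
print(diphones)
# ['#-m', 'm-y', 'y-n', 'n-u', 'u-m', 'm-b', 'b-er', 'er-#']

# Inventory size grows roughly with the square of the phone set:
P = 40        # e.g. ~40 phones for English
print(P * P)  # ~1600 upper bound; many combinations never occur
```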

18 Diphone synthesis
Most systems use diphones because they
– are manageable in number
– can be automatically extracted from recordings of human speech
– capture most inter-allophonic variants
But they do not capture all coarticulatory effects, so some systems also include triphones, as well as fixed phrases and other larger units (= unit selection, USS)

19 Concatenative synthesis
Input is a phonemic representation + prosodic features
Diphone segments can be digitally manipulated for length, pitch and loudness
Segment boundaries need to be smoothed to avoid distortion
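A minimal sketch of boundary smoothing: joining two recorded segments with a short linear crossfade instead of a hard cut. The fade length is an arbitrary choice, and both segments are assumed to be longer than the fade. Real diphone systems typically use pitch-synchronous overlap-add (PSOLA) style processing, which also handles the pitch and length manipulation mentioned above:

```python
import numpy as np

def concatenate(seg_a: np.ndarray, seg_b: np.ndarray,
                fade_samples: int = 80) -> np.ndarray:
    """Join two speech segments, crossfading over the boundary
    to avoid a click or other distortion at the join."""
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = 1.0 - fade_out
    overlap = (seg_a[-fade_samples:] * fade_out
               + seg_b[:fade_samples] * fade_in)
    return np.concatenate([seg_a[:-fade_samples], overlap,
                           seg_b[fade_samples:]])

# Usage: out = concatenate(diphone1_samples, diphone2_samples)
```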

20 Unit selection synthesis (USS)
Same idea as concatenative synthesis, but the database contains a bigger variety of "units"
Multiple examples of each phoneme (under different prosodic conditions) are recorded
Selection of the appropriate unit therefore becomes more complex, as there are competing candidates in the database for selection
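Unit selection is usually framed as finding the unit sequence that minimizes the sum of target costs (how well each candidate matches its specification) and join costs (how smoothly adjacent units concatenate), solved with dynamic programming. A schematic sketch, with the cost functions left as stand-ins:

```python
def select_units(targets, candidates_for, target_cost, join_cost):
    """targets: list of target specifications (one per unit slot).
    candidates_for(t): candidate units for target t.
    Returns the lowest-total-cost sequence of units."""
    best = [{}]  # best[i][unit] = (cost so far, previous unit)
    for u in candidates_for(targets[0]):
        best[0][u] = (target_cost(targets[0], u), None)
    for i, t in enumerate(targets[1:], start=1):
        best.append({})
        for u in candidates_for(t):
            tc = target_cost(t, u)
            cost, prev = min(
                ((best[i - 1][p][0] + join_cost(p, u) + tc, p)
                 for p in best[i - 1]),
                key=lambda cp: cp[0],
            )
            best[i][u] = (cost, prev)
    # backtrack from the cheapest final unit
    u = min(best[-1], key=lambda v: best[-1][v][0])
    path = [u]
    for i in range(len(targets) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path))
```

In real systems the target cost compares prosodic and phonetic context features, and the join cost measures spectral and F0 mismatch at the boundary; the search above is why selection is "more complex" than plain diphone concatenation.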

21 Speech synthesis demo

22 Speech synthesis demo