Text to Speech for In-car Navigation Systems Luisa Cordano August 8, 2006.

Slides:

Advertisements

Similar presentations

TEL: FAX: WEBSITE: © 2002 iFLYTEK. All rights reserved. This presentation is for informational.

Advertisements

Page 1. Page 2 Virtual Speaker: A Virtual Studio The software: Virtual Speaker is a package that automatically creates your voice files, prompts or any.

Research & Development ICASSP' Analysis of Model Adaptation on Non-Native Speech for Multiple Accent Speech Recognition D. Jouvet & K. Bartkova France.

Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.

IBM Labs in Haifa © 2007 IBM Corporation SSW-6, Bonn, August 23th, 2007 Maximum-Likelihood Dynamic Intonation Model for Concatenative Text to Speech System.

Communicating with Robots using Speech: The Robot Talks (Speech Synthesis) Stephen Cox Chris Watkins Ibrahim Almajai.

S. P. Kishore*, Rohit Kumar** and Rajeev Sangal* * Language Technologies Research Center International Institute of Information Technology Hyderabad **

Facial expression as an input annotation modality for affective speech-to-speech translation Éva Székely, Zeeshan Ahmed, Ingmar Steiner, Julie Carson-Berndsen.

Speech in Multimedia Hao Jiang Computer Science Department Boston College Oct. 9, 2007.

Analysis and Synthesis of Shouted Speech Tuomo Raitio Jouni Pohjalainen Manu Airaksinen Paavo Alku Antti Suni Martti Vainio.

This is an audio presentation. Please turn on your computer speakers. Press to start the presentation.

December 2006 Cairo University Faculty of Computers and Information HMM Based Speech Synthesis Presented by Ossama Abdel-Hamid Mohamed.

MULTI LINGUAL ISSUES IN SPEECH SYNTHESIS AND RECOGNITION IN INDIAN LANGUAGES NIXON PATEL Bhrigus Inc Multilingual & International Speech.

Bootstrapping a Language- Independent Synthesizer Craig Olinsky Media Lab Europe / University College Dublin 15 January 2002.

Introduction to Speech Synthesis ● Key terms and definitions ● Key processes in sythetic speech production ● Text-To-Phones ● Phones to Synthesizer parameters.

SPOKEN LANGUAGE SYSTEMS MIT Computer Science and Artificial Intelligence Laboratory Mitchell Peabody, Chao Wang, and Stephanie Seneff June 19, 2004 Lexical.

Back-End Synthesis* Julia Hirschberg (*Thanks to Dan, Jim, Richard Sproat, and Erica Cooper for slides)

Linguisitics Levels of description. Speech and language Language as communication Speech vs. text –Speech primary –Text is derived –Text is not “written.

1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.

Text-To-Speech Synthesis An Overview. What is a TTS System  Goal A system that can read any text Automatic production of new sentences Not just audio.

Chapter 15 Speech Synthesis Principles 15.1 History of Speech Synthesis 15.2 Categories of Speech Synthesis 15.3 Chinese Speech Synthesis 15.4 Speech Generation.

1 Speech synthesis 2 What is the task? –Generating natural sounding speech on the fly, usually from text What are the main difficulties? –What to say.

A PRESENTATION BY SHAMALEE DESHPANDE

Text-To-Speech System for Marathi Miss. Deepa V. Kadam Indian Institute of Technology, Bombay.

Talk, Translate, and Voice By: Jill Gruttadauro, Amanda Swetish, Porter Waung.

English Phonetics arifsuryopriyatmojo.com. Questions to consider? what is a language? how many languages are there? why do people need a language? how.

Building High Quality Databases for Minority Languages such as Galician F. Campillo, D. Braga, A.B. Mourín, Carmen García-Mateo, P. Silva, M. Sales Dias,

Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab.

04/08/04 Why Speech Synthesis is Hard Chris Brew The Ohio State University.

Public 1 © 2005 Nokia V1-Filename.ppt / yyyy-mm-dd / Initials Development Challenges of Multilingual Text-to-Speech Systems Kimmo Pärssinen

1 7-Speech Recognition (Cont’d) HMM Calculating Approaches Neural Components Three Basic HMM Problems Viterbi Algorithm State Duration Modeling Training.

A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification Bin MA and Haizhou LI Institute for Infocomm Research Singapore.

Du “Text-to-Speech” au multilinguïsme Isabel Meurisse Babel Technologies

1 Speech Perception 3/30/00. 2 Speech Perception How do we perceive speech? –Multifaceted process –Not fully understood –Models & theories attempt to.

Chapter 7. BEAT: the Behavior Expression Animation Toolkit

Prepared by: Waleed Mohamed Azmy Under Supervision:

Spoken dialog for e-learning supported by domain ontologies Dario Bianchi, Monica Mordonini and Agostino Poggi Dipartimento di Ingegneria dell’Informazione.

1 Computational Linguistics Ling 200 Spring 2006.

Copyright 2007, Toshiba Corporation. How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced Tanya Lambert, Norbert Braunschweiler,

Korea Maritime and Ocean University NLP Jung Tae LEE

SPEECH CONTENT Spanish Expressive Voices: Corpus for Emotion Research in Spanish R. Barra-Chicote 1, J. M. Montero 1, J. Macias-Guarasa 2, S. Lufti 1,

Kishore Prahallad IIIT-Hyderabad 1 Unit Selection Synthesis in Indian Languages (Workshop Talk at IIT Kharagpur, Mar 4-5, 2009)

LREC 2008, Marrakech, Morocco1 Automatic phone segmentation of expressive speech L. Charonnat, G. Vidal, O. Boëffard IRISA/Cordial, Université de Rennes.

Building a sentential model for automatic prosody evaluation Kyuchul Yoon School of English Language & Literature Yeungnam University Korea.

Machine Translation  Machine translation is of one of the earliest uses of AI  Two approaches:  Traditional approach using grammars, rewrite rules,

Speech Perception 4/4/00.

Rundkast at LREC 2008, Marrakech LREC 2008 Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen RUNDKAST: An Annotated.

Bernd Möbius CoE MMCI Saarland University Lecture 7 8 Dec 2010 Unit Selection Synthesis B Möbius Unit selection synthesis Text-to-Speech Synthesis.

Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features in Sphinx III.

Introduction to Computational Linguistics

Creating User Interfaces Directed Speech. XML. VoiceXML Classwork/Homework: Sign up to be Voxeo developer. Do tutorials.

Ways to generate computer speech Record a human speaking every sentence HAL will ever speak (not likely) Make a mathematical model of the human vocal.

Performance Comparison of Speaker and Emotion Recognition

Speech Processing 1 Introduction Waldemar Skoberla phone: fax: WWW:

Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning Speech Communication, 2000 Authors: S. M. Witt, S. J. Young Presenter:

MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Sravanthi Lakkimsety Mar 14,2016.

2014 Development of a Text-to-Speech Synthesis System for Yorùbá Language Olúòkun Adédayọ̀ Tolulope Department of Computer Science.

TECHNICAL SEMINAR ON IMPLEMENTATION OF PHONETICS IN CRYPTOGRAPHY BY:- VICKY AGARWAL (4JN03CS078) GUIDED BY:- SREEDEVI.S LECTURER DEPT OF CS&E.

G. Anushiya Rachel Project Officer

Efficient Inference on Sequence Segmentation Models

Mr. Darko Pekar, Speech Morphing Inc.

Text-To-Speech System for English

Speech Technology for Language Learning

Speech and Language Processing

Dialog Design 4 Speech & Natural Language

EXPERIMENTS WITH UNIT SELECTION SPEECH DATABASES FOR INDIAN LANGUAGES

Indian Institute of Technology Bombay

Presentation transcript:

Text to Speech for In-car Navigation Systems Luisa Cordano August 8, 2006

TTS in the Navigation Domain  Flexibility  Perfect pronunciation  Cost savings  Expressivity What kind of expectations does the Navigation market have on a TTS system when applying it in its domain?

Key aspects of a TTS system  Unit Selection Technique  Domain-dependent customization  Phonetic Input Interface  Native vs. Foreign Pronunciation

The importance of Unit Selection TTS  Loquendo TTS is a general purpose (Unrestricted) Text-to-Speech system based on the Unit Selection Speech Synthesis Technique  The Unit Selection Technique is at present the only existing technique that enables natural sounding speech  Its basic idea is to concatenate long phoneme sequences extracted from large DataBases of speech recorded by a single voice talent  Advantages: little speech processing, the natural voice timbre is preserved  Drawbacks: scarce control, speech quality depends on the contents of the database

Input text phonemes&prosodic labels (&prosodic values) speech TEXT ANALYZER SPEECH SYNTHESIZER ITALIAN SPANISH... SPAGNOL O... ITALIAN female voice ITALIAN Voice-2 SPAGNOL O... SPANISH male voice Statistical Tools for Database Design Tools for: Speech Acquisition Phoneme Segmentation Signal Analysis Dense text corpus Vocal DataBases Domain- representative large text corpus Language Libraries... ENGLISH Unit Selection TTS

p r ò v a d i s ì n t e s i P P P P P P P P C C C C C W define a synthesis TARGET..r i n: ò v a d i s ò l i t o... P PP P P P P P P P P P P P P PPP S S S S SS SS UU p r ò v a d i s ì n t e s i match phonetic and prosodic labels look for similar f0 values at unit junctions preferably cut units at diphone boundaries concatenate waveforms adjust f0 to remove jumps (pitch scaling)... p r ò s a e l e g à n t e... P P PP P P P P P P PP PPP P P P P PPPPPP...l a f o t o s ì n t e s i PPP P PPPP PP PPPP C CCCCCCCCCWW Vocal DataBase The Synthesis Algorithm

Domain Dependent Customization  Loquendo TTS can read any text  In order to improve accuracy and fluency on an application domain it can be customized –Lexical customization: adapting Text-to- Phoneme conversion via Pronunciation Lexicon, Phonetic Input –Vocal customization: enhancing the Vocal DataBase with application prompts and recurrent phrases

Domain Dependent Customization  The navigation domain: –A limited number of recurring phrases: Ex. “turn left”, “…” –A peculiar and huge lexical domain: addresses, names and toponyms: due to their they historical/foreign origin, they do not necessarily follow the standard grapheme-to-phoneme rules of the language  Customizing Loquendo TTS for Navigation: –Navigation Vocal Add-On: the recurring phrases can be recorded in case their synthesis does not sound perfect –The exact pronunciation of names and addresses can be specified via the Phonetic Input Interface or via the Pronunciation Lexicon (this does not prevent from possible acoustic defects)

Phonetic Input Interface  Loquendo TTS let the User control the exact pronunciation of words  An escape sequence in the input text allows skipping the grapheme-to-phoneme process driven by the Language Library…  And inputting the desired phoneme sequence directly to the Speech Synthesis process based on the Vocal Database

Input text phonemes&prosodic labels (&prosodic values) speech TEXT ANALYZER SPEECH SYNTHESIZER ITALIAN SPANISH... SPAGNOL O... ITALIAN female voice ITALIAN Voice-2 SPAGNOL O... SPANISH male voice Vocal DataBases Language Libraries ENGLISH Phonetic Input Interface Input Phonetic Transcription (language independent SAMPA Alphabet)

Phonetic Input Interface  The special feature is the language independence  The input phonemes are Mapped on the Phonemes of the Voice Via the Phonetic Interface, an Italian voice can be made to pronounce \SAMPA="san$te$na for the Italian name “Santena”, overriding the the default san“tena. But also \SAMPA=% le#"gRa~Z For the French “Les Granges”, otherwise transcribed les#"grandges. The foreign R and a~ will be mapped on the Italian r and a

Native vs Foreign Pronunciation  Each Voice has its own native language  Each Vocal DataBase has its own: –Phoneme Set (the db does not contain foreign phonemes) –Coverage of Phoneme Sequences (the db contains only sequences that frequent in its native language) The Foreign Pronunciation feature of Loquendo TTS maps foreign phonemes onto the most similar native phonemes. But it can’t avoid obtaining phoneme sequences not present in the Vocal DB, requiring the concatenation of shorter speech units. Foreign Pronunciation is: Plausible Approximated Sometimes choppy

Foreign Pronunciation: the Phoneme Mapping Algorithm The algorithm centers around a Phonetic Similarity Function (PSF) that computes a similarity score of two phonemes that only depends on their phonetic- articulatory features Such an approach overlooks:  Finer language specific aspects of speech perception  Pragmatic/cultural aspects that may affect foreign pronunciation But… makes comparison between phonemes free from any language-specific knowledge!

Phonetic Similarity Function This approach requires, as a first step, the definition of a vector of phonetic-articulatory features for each phoneme Each feature has a different weight in the computation of phonetic similarity Values of a non-binary feature are placed in a scale of perceptual distance `avoweltonicopencentral… Non-binary features Binary feature Tongue Position Tonicity

Foreign Pronunciation: Demos English Text: “Good afternoon everybody! This is a live demo of Loquendo's lifelike text-to-speech technology, using the mixed language feature” Katrin (German Voice) Jorge (Spanish Voice) Bernard (French Voice) Interactive demos are available on

Foreign Pronunciation: Navigation Demos Dave (American Voice) Kate (British Voice) “In 250 yards, bear left, at the roundabout, take the second exit, and follow the signpost indicating, Charles de Gaulle airport.” “In 100 yards, stay in the right lane, and take the main road. After the tunnel, turn left, and follow the signpost indicating, Fiumicino, Leonardo da Vinci airport.”

Future Works for the Foreign Pronunciation The problem of the the unusual sequences of phones could be partially solved by :  Augmenting the speech databases with ad- hoc vocal material  Rewriting the PMM output by reducing the number of unusual phoneme sequences without modifying the plausibility of the reading  Adding new spectral features to better describe each speech database phone