DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A SPEECH SYNTHESIS SYSTEM Craig Olinsky Media Lab Europe / University College Dublin.



OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS Many of the areas which could most benefit from community-focused IT resource development have very high illiteracy rates among their populace. For such users, speech-based systems provide the most obvious and natural mechanism for interfacing with computers. Without the widespread availability of high-quality speech databases, computer-readable lexicons, and other pre-processed linguistic information of the kind available for, for instance, standard dialects of French or German, it is expensive and difficult to build such systems. (“learning from sample” case in other presentation)

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS Even within a particular language (including the major ones), the personalization of a speech synthesis system for a particular use, market, and especially accent can provide much benefit to a deployed system. Recent articles have suggested, in fact, that humans connect better as listeners with a speaker and voice that sound like them, not only finding it easier to listen to and understand what is said, but also finding it more natural to assign emotional state and to judge such factors as authority, honesty, and even intelligibility.

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS Perhaps the system can LISTEN to the user, and then CHANGE ITS OUTPUT to sound more like what it hears? Instead of creating a dedicated system for every purpose, set up a number of “baseline” systems (for different languages, language families, etc.) and set them learning. We benefit from the work put into developing the baseline system, while requiring a minimum(?) of additional focused training data. Assumption: learning “Accent”, “Dialect”, “Language” is not a set of distinct processes, but all a matter of degree?

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS HUMAN ANALOGUE: People who live for a period of time in an area where a different accent or dialect of their language is spoken often (involuntarily) start to pick up the local manners of speech. SPEECH RECOGNITION ANALOGUE: “Speaker Adaptation”, a procedure in which the acoustic model of the recognition system (or, in limited cases, the language model as well), after being fully trained, is provided with additional speech data. Based upon this data, the values, parameters, nodes, weights, or other coefficients representing the acoustic model are shifted “towards” the new information, such that the system should exhibit improved performance on data resembling the new training data, even though such data may not have been included in its initial training procedure.
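The idea of shifting model parameters “towards” new data can be illustrated with a minimal sketch. The slides do not name a specific adaptation method; the code below assumes a MAP-style update of a single Gaussian mean, and the function name `map_adapt_mean` and the prior weight `tau` are illustrative choices, not part of the original system.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   frames: Sequence[Sequence[float]],
                   tau: float = 10.0) -> List[float]:
    """Shift a trained Gaussian mean towards adaptation data.

    Assumed update (MAP-style interpolation, not taken from the slides):
        new_mean = (tau * prior_mean + sum(frames)) / (tau + n)
    A large tau keeps the mean close to the original model; a small tau
    lets a few adaptation frames move it a long way.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    dim = len(prior_mean)
    sums = [0.0] * dim
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("frame dimension does not match the model")
        for i, value in enumerate(frame):
            sums[i] += value
    n = len(frames)
    return [(tau * prior_mean[i] + sums[i]) / (tau + n) for i in range(dim)]
```

With no adaptation frames the mean is returned unchanged, and as more frames arrive the result moves progressively closer to their average, which matches the "matter of degree" view taken in these slides.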

BACKGROUND: SPEAKER ADAPTATION FOR SPEECH RECOGNITION SYSTEMS QUICK PROCEDURE OVERVIEW: Given a set of recorded target utterances and associated transcripts:
1. Generate a synthesized utterance from each transcript using the current synthesizer (letter-to-sound rules, phones, speech database, etc.).
2. Compare the target recording to the generated source form to determine how the two pronunciations differ.
3. Re-organize the phone units and the speech unit selection process to incorporate the differences and information from the target recording units.
4. Modify the lexical entries and letter-to-sound rules of the existing synthesizer to produce output that more closely resembles the target utterance.
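The comparison step of this procedure can be sketched as an alignment of two phone sequences. This is only an illustration under assumptions: it presumes that both the synthesized form and the target recording have already been reduced to phone label sequences, and it uses Python's standard `difflib` alignment rather than whatever acoustic comparison the actual system performs. The function name `pronunciation_differences` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Difference = Tuple[str, List[str], List[str]]


def pronunciation_differences(source_phones: List[str],
                              target_phones: List[str]) -> List[Difference]:
    """Align two phone sequences and report where they differ.

    Each result is (operation, source_span, target_span), where the
    operation is 'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = SequenceMatcher(a=source_phones, b=target_phones, autojunk=False)
    differences: List[Difference] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, source_phones[i1:i2], target_phones[j1:j2]))
    return differences
```

The resulting list of substitutions, insertions and deletions is the kind of evidence that the later steps could use when revising lexical entries or letter-to-sound rules.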

VARIATION AND ADAPTATION Ignoring for a moment issues such as vocabulary choice and other semantic issues of usage, it is possible to consider variation across accents, “dialects”, and even languages as differing in degree in a few key areas: the phonetic inventory, which comprises the basic building blocks from which utterances are pronounced; a set of pronunciation rules or examples, which dictate how the phonetic units are put together to assign a pronunciation to an orthographic form and subsequently speak the desired text; and a collection of conventionalized stress and intonational patterns, which help provide structure and syntactic/semantic context to the overall produced utterances.

VARIATION AND ADAPTATION Cross-Speaker Adaptation. In this mode, a generalized speech synthesizer is adapted towards the voice of a single user of the system. Assuming that the original “voice” of the synthesizer is that of a professional speaker, this can be done in one of two ways. Either qualities of the user’s voice are applied to the default voice, while the database of sound samples from the original speaker is retained for use as the concatenative synthetic voice; or the database is expanded (or replaced) with samples of the user’s voice, while some abstract “quality” of the original professional voice is nonetheless retained, ideally preserving some measure of the clarity and understandability for which the original speaker was initially chosen. The ability to create natural-sounding speech by concatenating samples drawn from a speech database comprising recordings from multiple users, and/or of varying quality, would also help encourage an open-source “bazaar” of decentralized users attempting to amass the large number of recorded forms necessary for a multi-purpose unit-selection synthesizer.

VARIATION AND ADAPTATION Cross-Dialect Adaptation. This is almost exactly the case described above, except that the “default” voice and the specific user’s voice differ in dialect, or differ to a greater degree than the average set of native speakers from a given area. That is, we would expect not only variation in voice quality, but also limited differences in vocabulary, phonetic inventory, distinguishable minimal pairs, accent, and the like. The result is that not only the unit-selection database but also the components which assign phonetic realizations to the given text (the letter-to-sound rules and the pronunciation dictionary or lexicon) may need alteration. Cross-Language Adaptation. In this case, we retain some degree of phonetic inventory similarity between the source and destination languages, but our letter-to-sound rules and lexicon need gross modification, or may even be unusable (even some language pairs which are very similar in pronunciation, such as Japanese and Korean, could nonetheless use unrelated orthographic forms, or vice versa).

VARIATION AND ADAPTATION Cross-Language Adaptation, Single Speaker Variant. In this case, we have recordings from a single speaker (i.e., the user), whose voice we want to be able to speak naturally in languages of which the user is not a native speaker. We thus want to use information about these other languages to adapt the synthesizer of the user’s voice to speak multilingually. (This is especially significant in our global community, where many proper nouns, such as personal names and place names, cannot be properly pronounced simply by following the phonological rules of a single language.) Language “Acquisition”. In the extreme case, we wish to bootstrap an “empty” synthesizer (with no lexicon or knowledge of pronunciation rules whatsoever) to speak like us simply by speaking to it, without hard-coding direct linguistic or phonetic knowledge. This is a task that a non-technical, non-expert native speaker should be able to perform.


OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS Synthesis adds a problem that recognition adaptation does not have: the database of recorded segments is itself used for concatenation. This means that we cannot just merge the entire set of recorded data together; there would be noticeable discrepancies between concatenative units taken from each individual speaker. On the other hand, if we just use the new set of segments, we aren’t adapting; we’re just building a new synthesizer. For this study, we take the new target data to be a small data set, not enough to be a good set of units for synthesis on its own.

OVERVIEW: ADAPTATION IN SPEECH SYNTHESIS We are thus required to use existing (source) units for synthesis. However, these source recordings and their associated existing synthetic voice have a specific accent/dialect, with a pre-defined phone set. Even with a proper dictionary and proper letter-to-sound rules providing us with a “proper” pronunciation that takes into account pronunciation variation for our target accent, stringing the “best match” units together likely won’t sound like a native speaker of that accent. The vowel quality might be vastly different, or phones might be missing in the source language (e.g., a French /r/). We want to adapt for this. Overall, we want to sound native in the target accent/dialect/language, using units recorded from a speaker of a different one.

PHONE UNIT ADAPTATION If the variation between source and target speech is large enough, it is likely that we will describe the target speech with a different phone set than that of the source speech. We may still find that the pronunciation of a particular phone in the target corresponds more closely to that of a different phone than our source pronunciation lexicon would suggest (for instance, schwa reduction). Or we might have an existing target pronunciation lexicon or pronunciation rules with a predefined phone set we wish to use. To utilize data from our source synthesizer in such a case, we need to assign appropriate mappings between source and target phones. This can be seen as a matter of degree: how much effort or knowledge is incorporated into creating the mapping, how closely the mapping corresponds to the observed data, and thus our (assumed) rating of the quality of the mapping.

PHONE UNIT ADAPTATION Figure 1: Degrees of Phoneme Mapping, on a scale from (alleged) WORST to (alleged) BEST: Source Phoneme Set; Naïve Mapping; Linguistically-Motivated Phoneme Mapping; Data-Driven Mapping; Target Phoneme Set.

PHONE UNIT ADAPTATION
na(t)ive approach: this approach follows the principle a non-native would follow when speaking a second language: he basically has the phonetic inventory of the first language and partially uses that inventory when speaking the second language….
phonetic approach: this strategy follows principles in the production of sounds in the human vocal tract … that sound that agrees in the most phonetic features with the untrained one is taken instead of the unknown one of the goal language….
data-driven approach: this approach determines the similarity among phones with the data given by the trained recognizer… according to a distance measure the most similar units may be joined.
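The phonetic and data-driven approaches described above can be sketched as two small selection functions. Everything here is an illustrative assumption: the feature labels, the two-dimensional acoustic vectors, the use of Euclidean distance, and the names `map_by_features` and `map_by_distance` are hypothetical and not taken from the quoted source.

```python
from math import dist
from typing import Dict, FrozenSet, Sequence


def map_by_features(target_features: FrozenSet[str],
                    source_inventory: Dict[str, FrozenSet[str]]) -> str:
    """Phonetic approach: choose the source phone that shares the most
    phonetic features with the unknown target phone.
    Ties are broken alphabetically, purely to keep the result deterministic."""
    return max(sorted(source_inventory),
               key=lambda phone: len(source_inventory[phone] & target_features))


def map_by_distance(target_vector: Sequence[float],
                    source_vectors: Dict[str, Sequence[float]]) -> str:
    """Data-driven approach: choose the source phone whose acoustic vector
    is nearest to the target phone under a distance measure
    (Euclidean distance is assumed here)."""
    return min(sorted(source_vectors),
               key=lambda phone: dist(source_vectors[phone], target_vector))


# Hypothetical toy data: a source inventory lacking a velar fricative.
SOURCE_FEATURES = {
    "k": frozenset({"consonant", "voiceless", "dorsal", "velar", "stop"}),
    "h": frozenset({"consonant", "voiceless", "glottal", "fricative"}),
}
TARGET_VELAR_FRICATIVE = frozenset(
    {"consonant", "voiceless", "dorsal", "velar", "fricative"})

# Hypothetical acoustic vectors (for example, two formant values per vowel).
SOURCE_VECTORS = {"aa": (700.0, 1200.0), "iy": (300.0, 2300.0)}
```

Calling `map_by_features(TARGET_VELAR_FRICATIVE, SOURCE_FEATURES)` selects the source phone with the larger feature overlap, while `map_by_distance((650.0, 1300.0), SOURCE_VECTORS)` selects the acoustically nearest source vowel. Feature overlap often produces ties between candidates, which is one reason a data-driven measure may be rated higher on the scale in Figure 1.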

PRONUNCIATION ADAPTATION Typically taken for granted in multilingual speech adaptation studies is the presence of a pronunciation dictionary and/or rules for the target language. At the far extremes, we assume the existence of well-targeted pronunciation rules: in the worst case, rules designed for the source speech, and in the best case, rules specifically designed for the target. In between, we use a number of methods to derive or create a pronunciation module, based upon the existing source-language methods, the target speech data itself, or some combination.

PRONUNCIATION ADAPTATION Figure 2: Letter-to-sound rules / lexicon, on a scale from (alleged) WORST to (alleged) BEST: Principled Source-Only; “Foreign Approximation”; Language-Neutral; Trained from Target Data; Principled Target-Only.

PRONUNCIATION ADAPTATION Principled Source-Only: this approach merely uses pronunciation methods specifically designed for the source speech to generate a pronunciation form for the target. This approach can result in extremely inaccurate pronunciation approximations, such as one might expect from a native English speaker’s attempt at a native pronunciation of an unusual foreign word. “Foreign Approximation”: this approach can be seen as akin to the na(t)ive approach of phone mapping discussed above. In this case, the speaker recognizes that the word being pronounced is not a native one, and relaxes some of the language-specific rules or attempts to move the pronunciation closer to that of the “assumed” language of the word in question. The result is closer, but still inaccurate and strongly accented.

PRONUNCIATION ADAPTATION Language-Neutral: this approach ignores all language-specific information, assuming a set of very generic or regular pronunciation rules that propose a (relatively) direct relation between orthographic form and pronunciation. Such rules would closely resemble those used for a language with artificially few pronunciation exceptions, such as Esperanto, rather than those of English. Trained from Target Data: in this method, an aligned text and speech signal are provided to a recognizer, along with (possibly) a limited set of pronunciation transcriptions as training data. In some automatic way, the system learns a set of pronunciation rules and/or a lexicon of pronunciations which closely matches the training data. Principled Target-Only: this approach assumes a provided pronunciation module specifically designed to generate correct pronunciations for our target language/dialect/accent.
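As a rough illustration of the Language-Neutral option, the sketch below maps each letter directly to one phone with no exceptions. The rule table `NEUTRAL_RULES`, its phone labels, and the function `neutral_pronunciation` are hypothetical and cover only a handful of letters; a real rule set would need to be much more complete.

```python
from typing import Dict, List

# Hypothetical one-letter-to-one-phone table (illustrative subset only).
NEUTRAL_RULES: Dict[str, str] = {
    "a": "a", "b": "b", "d": "d", "e": "e", "i": "i", "k": "k",
    "l": "l", "m": "m", "n": "n", "o": "o", "p": "p", "r": "r",
    "s": "s", "t": "t", "u": "u",
}


def neutral_pronunciation(word: str,
                          rules: Dict[str, str] = NEUTRAL_RULES) -> List[str]:
    """Assign a pronunciation by direct letter-to-phone lookup.

    Letters missing from the table are skipped in this sketch; no
    language-specific context rules or exception lexicon are consulted.
    """
    phones: List[str] = []
    for letter in word.lower():
        if letter in rules:
            phones.append(rules[letter])
    return phones
```

Because every letter is treated independently, such rules behave sensibly only for regular orthographies, which is why this option sits in the middle of the scale rather than at the “best” end.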

UNIT DATABASE COMPOSITION Figure 3: Methods of Composing the Unit Database, on a scale from (alleged) WORST to (alleged) BEST: Source Speaker Only; Union of All Recordings (unprincipled); Source Speaker + uncovered phones from target only; Set of Digitally Altered Segments; Target Speaker Only.
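The option of source-speaker units plus uncovered phones taken from the target can be sketched as a simple merge. The representation below (a dictionary from phone label to a list of unit identifiers) and the function name `compose_unit_database` are assumptions made for illustration; a real unit-selection database would carry far more information per unit.

```python
from typing import Dict, List


def compose_unit_database(source_units: Dict[str, List[str]],
                          target_units: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep every source-speaker unit, and add target-speaker units only
    for phones that the source database does not cover at all."""
    combined = {phone: list(units) for phone, units in source_units.items()}
    for phone, units in target_units.items():
        if phone not in combined:
            combined[phone] = list(units)
    return combined
```

This keeps the voice as consistent as possible, since target recordings are used only where the source speaker offers nothing, at the cost of audible speaker changes on exactly those phones.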

ADAPTATION FROM MIMICRY We know from the beginning that our source unit database is of the best quality (in terms of recording, segmentation, labelling, etc.). But we can’t directly synthesize from the source database, because we will get accented, non-native-sounding speech. Is there a way to generate speech in a non-accented or differently-accented way from a single speech database? Try to find a “neutrally” accented speaker? (What does this mean: someone heavily polylingual? Someone geographically in between the two languages or accents?) Look at mimicry studies: how someone (intentionally) modifies their voice to sound like a different speaker.

ADAPTATION FROM MIMICRY Anders Eriksson and Pär Wretling – “How Flexible is the Human Voice? – A Case Study of Mimicry”
Close mimicry of global speech rate
No change for timing at segmental level
Mean fundamental frequency and variation matched timing closely
Formant frequencies attained with variant success: vowel imitation intermediate between voice and target
“Fundamental frequency changes were more successful than changes in timing”

STAGES OF THE EXPERIMENT Our development efforts and systems will follow the four modes listed in the research overview in order of ascribed complexity. For the Cross-Speaker Adaptation case, we will utilize a base voice and a training speaker of native American English. For the Cross-Dialect Adaptation study, we will retain the use of English for the basic case, adapting over a selection of American, British, and Irish English dialects. We will then finish with two data sets for Cross-Language Adaptation, proceeding in order of linguistic variation: first the set of Celtic languages still in current use (Irish, Scottish Gaelic, and, slightly more distantly, Welsh), and then a selection of Indian languages, including (at least) Bengali and Hindi.