Speech and Language Processing

Speech and Language Processing, Chapter 8: Speech Synthesis / Waveform Synthesis

Waveform Synthesis
Given:
- String of phones
- Prosody:
  - Desired F0 for the entire utterance
  - Duration for each phone
  - Stress value for each phone, possibly accent value
Generate:
- Waveforms

Diphone TTS architecture
Training:
- Choose units (kinds of diphones)
- Record 1 speaker saying 1 example of each diphone
- Mark the boundaries of each diphone, cut each diphone out, and create a diphone database
Synthesizing an utterance:
- Grab the relevant sequence of diphones from the database
- Concatenate the diphones, doing slight signal processing at the boundaries
- Use signal processing to change the prosody (F0, energy, duration) of the selected sequence of diphones
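As a toy sketch of the synthesis half of this pipeline (not the book's implementation), the Python below assumes a hypothetical in-memory diphone_db mapping diphone names to NumPy waveforms; it maps a phone string to a diphone sequence and concatenates the stored units, without the boundary smoothing or prosody modification described on the following slides.

```python
import numpy as np

# Hypothetical diphone database: maps "left-right" diphone names to waveforms
# (NumPy arrays of 16 kHz samples), e.g. cut from recorded, hand-labeled audio.
diphone_db = {
    "_-h": np.zeros(800), "h-ax": np.zeros(1200),
    "ax-l": np.zeros(1100), "l-ow": np.zeros(1600), "ow-_": np.zeros(900),
}

def phones_to_diphones(phones):
    """['h', 'ax', 'l', 'ow'] -> ['_-h', 'h-ax', 'ax-l', 'l-ow', 'ow-_']."""
    padded = ["_"] + phones + ["_"]            # silence at the utterance edges
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def synthesize(phones):
    """Concatenate stored diphone waveforms; no join smoothing or prosody yet."""
    units = [diphone_db[d] for d in phones_to_diphones(phones)]
    return np.concatenate(units)

waveform = synthesize(["h", "ax", "l", "ow"])   # roughly, "hello"
```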

Diphones
- Mid-phone is more stable than the edge

Diphones
- Mid-phone is more stable than the edge
- Need O(phones²) units; some combinations don't exist (hopefully)
- The AT&T (Olive et al. 1998) system had 43 phones
  - 43² = 1849 possible diphones
  - Phonotactics help: [h] only occurs before vowels, and diphones across silence need not be kept
  - Only 1172 actual diphones
  - May include stress and consonant clusters, so there could be more
- Lots of phonetic knowledge in the design
- Database relatively small (by today's standards): around 8 megabytes for English (16 kHz, 16-bit)
(Slide from Richard Sproat)
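The counts and the storage figure on this slide are easy to recheck; the 8 MB at 16 kHz, 16-bit works out to roughly four and a half minutes of audio:

```python
phones = 43
possible_diphones = phones ** 2          # 43 * 43 = 1849
actual_diphones = 1172                   # after phonotactic restrictions

# Storage: 16 kHz sampling, 16-bit (2-byte) samples, mono
bytes_per_second = 16_000 * 2
db_bytes = 8 * 1024 * 1024               # "around 8 megabytes"
seconds_of_audio = db_bytes / bytes_per_second
print(possible_diphones, actual_diphones, round(seconds_of_audio))  # 1849 1172 262
```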

Voice
- Speaker: called a "voice talent"
- Diphone database: called a "voice"

MBROLA
- Diphone synthesis system (open source) (Thierry Dutoit, Mons, Belgium)
- Specify as ingredients, for each sound:
  - Phoneme
  - Pitch
  - Duration

MBROLA procedure
- Needed: an MBROLA diphone set
- Control data in a .pho file: phonemes, pitches, durations
- MBROLA produces a .wav file:
  $ mbrola mbrola/nl2/nl2 woord.pho woord.wav

MBROLA synthesis
Each line gives a phoneme, its duration (ms), and optional pitch targets as (percentage of duration, pitch in Hz) pairs:
; Utterance: "Hallo!"
_ 100  100 120
h 96
A 48
l 76   5 100   75 120
o 224  25 85
_ 100  40 70
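A minimal way to drive MBROLA from a script, assuming the nl2 voice from the previous slide is installed at the path shown there: write these lines to a .pho file and invoke the mbrola command.

```python
import subprocess

# One line per phone: phoneme, duration (ms), then (percent-of-duration, pitch-Hz) pairs.
pho_lines = """\
; Utterance: "Hallo!"
_ 100 100 120
h 96
A 48
l 76 5 100 75 120
o 224 25 85
_ 100 40 70
"""

with open("woord.pho", "w") as f:
    f.write(pho_lines)

# mbrola <voice database> <input .pho> <output .wav>, as on the previous slide
subprocess.run(["mbrola", "mbrola/nl2/nl2", "woord.pho", "woord.wav"], check=True)
```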

Prosodic Modification
- Modifying pitch and duration independently
- Changing the sample rate modifies both: "chipmunk" speech
- Duration: duplicate/remove parts of the signal
- Pitch: resample to change pitch
(Text from Alan Black)
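A quick illustration of the "chipmunk" point, assuming a hypothetical input.wav: relabeling a recording with double its sample rate halves its duration and raises its pitch by an octave at the same time, which is exactly why pitch and duration have to be modified separately.

```python
from scipy.io import wavfile

rate, samples = wavfile.read("input.wav")          # hypothetical 16 kHz recording
wavfile.write("chipmunk.wav", rate * 2, samples)   # same samples, doubled rate:
                                                   # half the duration, doubled pitch
```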

Speech as Short-Term Signals
(Alan Black)

Duration Modification
- Duplicate/remove short-term signals
(Slide from Richard Sproat)
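A rough sketch of the duplicate/remove idea, not a production time-stretcher: windowed frames are copied from the input at a rate scaled by the stretch factor and overlap-added into the output. The frame and hop sizes are arbitrary choices here, and window gain is only approximately normalized.

```python
import numpy as np

def change_duration(x, factor, frame=400, hop=200):
    """Lengthen (factor > 1) or shorten (factor < 1) x by repeating/skipping frames."""
    window = np.hanning(frame)
    n_out_frames = int(len(x) / hop * factor)
    out = np.zeros(int(len(x) * factor) + frame)
    for i in range(n_out_frames):
        src = int(i * hop / factor)              # which input frame to copy
        if src + frame > len(x):
            break
        out[i * hop : i * hop + frame] += x[src : src + frame] * window
    return out
```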

Pitch Modification
- Move short-term signals closer together / further apart
(Slide from Richard Sproat)

TD-PSOLA™
- Time-Domain Pitch-Synchronous Overlap-and-Add
- Patented by France Telecom (CNET)
- Windowed, pitch-synchronous overlap-and-add
- Very efficient: no FFT (or inverse FFT) required
- Can modify F0 by up to a factor of two or down to half
(Slide from Richard Sproat)
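A much-simplified sketch of the pitch-modification half of TD-PSOLA, assuming pitch marks (one per glottal period) are already known: two-period Hann-windowed frames are cut around each mark and overlap-added at marks respaced by the F0 factor. Real TD-PSOLA also duplicates or drops frames so that the overall duration is preserved; that step is omitted here.

```python
import numpy as np

def psola_pitch_shift(x, pitch_marks, factor):
    """Raise (factor > 1) or lower (factor < 1) F0 by rescaling pitch-mark spacing.

    x           : 1-D signal
    pitch_marks : sample indices of glottal pulses (assumed given)
    factor      : F0 multiplier, roughly limited to 0.5-2.0 in practice
    """
    out = np.zeros(len(x))
    for i in range(1, len(pitch_marks) - 1):
        m = pitch_marks[i]
        period = pitch_marks[i + 1] - pitch_marks[i]
        # Two-period, pitch-synchronous analysis window around this mark
        left, right = m - period, m + period
        if left < 0 or right > len(x):
            continue
        frame = x[left:right] * np.hanning(right - left)
        # New mark position: spacing divided by factor (closer together = higher F0)
        new_m = int(pitch_marks[1] + (m - pitch_marks[1]) / factor)
        lo, hi = new_m - period, new_m + period
        if lo < 0 or hi > len(out):
            continue                 # sketch simply truncates at the output edge
        out[lo:hi] += frame
    return out
```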

Unit Selection Synthesis
- Generalization of the diphone intuition
- Larger units: from diphones to sentences
- Many, many copies of each unit: 10 hours of speech instead of 1500 diphones (a few minutes of speech)

Unit Selection Intuition
- Given a big database, find the unit in the database that is the best to synthesize some target segment
- What does "best" mean?
  - "Target cost": closest match to the target description, in terms of
    - Phonetic context
    - F0, stress, phrase position
  - "Join cost": best join with neighboring units
    - Matching formants and other spectral characteristics
    - Matching energy
    - Matching F0

Targets and Target Costs
- Target cost T(u_t, s_t): how well the target specification s_t matches the potential database unit u_t
- Features, costs, and weights
- Examples:
  - /ih-t/  +stress, phrase internal, high F0, content word
  - /n-t/   -stress, phrase final, high F0, function word
  - /dh-ax/ -stress, phrase initial, low F0, word "the"

Target Costs
- Comprised of k subcosts:
  - Stress
  - Phrase position
  - F0
  - Phone duration
  - Lexical identity
- Target cost for a unit: see the formula below
(Slide from Paul Taylor)
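The formula behind "Target cost for a unit" did not survive transcription; in the Hunt and Black formulation it is the weighted sum of the k subcosts:

T(s_t, u_t) = \sum_{j=1}^{k} w_j \, T_j(s_t, u_t)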

Diphone joins (join costs)
- Diphones pa, ka, ta and ap, ak, at are usually six different recordings, but that gives spectral differences at the join:
  - pa-ap, pa-ak, pa-at
  - ka-ap, ka-ak, ka-at
  - ta-ap, ta-ak, ta-at

Join (Concatenation) Cost
- Measure of smoothness of the join
- Measured between two database units (the target is irrelevant)
- Features, costs, and weights
- Comprised of k subcosts:
  - Spectral features
  - F0
  - Energy
- Join cost: see the formula below
(Slide from Paul Taylor)
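Likewise, the join cost (formula missing from the transcript) is the weighted sum of its k subcosts, computed between two consecutive database units:

J(u_t, u_{t+1}) = \sum_{j=1}^{k} w_j \, J_j(u_t, u_{t+1})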

Total Costs
- Hunt and Black 1996
- We now have weights (per phone type) for the feature set between target and database units
- Find the best path of units through the database that minimizes the total cost (see below)
- A standard problem, solvable with Viterbi search, with a beam-width constraint for pruning
(Slide from Paul Taylor)
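The quantity being minimized (the expression is missing from the transcript) is the sum of all target costs plus all join costs along the utterance:

\hat{U} = \arg\min_{u_1, \ldots, u_T} \left[ \sum_{t=1}^{T} T(s_t, u_t) + \sum_{t=2}^{T} J(u_{t-1}, u_t) \right]

A sketch of the Viterbi search over candidate units follows; target_cost, join_cost, and the per-target candidate lists are assumed to be supplied by the caller, and the beam-width pruning mentioned above is omitted for brevity.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search for the cheapest unit sequence.

    targets     : list of target specifications s_1 .. s_T
    candidates  : candidates[t] is the list of database units considered for s_t
    target_cost : function (s, u)  -> float
    join_cost   : function (u, u') -> float
    """
    # best[t][i] = (cost of best path ending in candidates[t][i], backpointer index)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for t in range(1, len(targets)):
        column = []
        for u in candidates[t]:
            tc = target_cost(targets[t], u)
            cost, back = min(
                (best[t - 1][i][0] + join_cost(prev, u) + tc, i)
                for i, prev in enumerate(candidates[t - 1])
            )
            column.append((cost, back))
        best.append(column)
    # Trace back from the cheapest final state
    i = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for t in range(len(targets) - 1, -1, -1):
        path.append(candidates[t][i])
        i = best[t][i][1]
    return list(reversed(path))
```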

Unit Selection Summary
Advantages:
- Quality is far superior to diphones
- Natural prosody selection sounds better
Disadvantages:
- Quality can be very bad in places; HCI problem: a mix of very good and very bad is quite annoying
- Synthesis is computationally expensive
- Can't synthesize everything you want: the diphone technique can move emphasis; unit selection gives a good (but possibly incorrect) result
(Slide from Richard Sproat)