© 2013 by Larson Technical Services

Outline
- Grammar-based speech recognition
- Statistical language model-based recognition
- Speech Synthesis
- Dialog Management
- Natural Language Processing
Speech Synthesis (Text-To-Speech, TTS)
Processing stages, each shown with the resource it draws on:
1. Structure Analysis (Structure Rules)
2. Text Normalization (Abbreviation and Acronym Database)
3. Text-to-phoneme Conversion (Pronunciation Lexicon)
4. Prosody Analysis (Prosody Rules)
5. Waveform Production (Phoneme-to-sound Database)
Concatenated vs. Parameter-based Speech Synthesis
[Figure: two ways to produce “red car”.
Concatenated: Isolate Phonemes from recorded speech such as “The dog barked” (dh eh d ao g b ah er k eh d), then Concatenate the needed units (er ed d k ah er).
Parameter-based: Generate Speech for the same phoneme sequence (er ed d k ah er) from Voice Parameters.]
Speech Synthesis ML (SSML)
Stages: Structure Analysis, Text Normalization, Text-to-Phoneme Conversion, Prosody Analysis, Waveform Production

Structure Analysis
- Markup support: p, s
- Non-markup behavior: infer structure by automated text analysis

Critic: I don’t want to use TTS; it’s too difficult to understand.
Response: Developers can replay audio files, use TTS, or a combination of both. Developers can rely upon defaults from the TTS engine, or specify commands to override defaults.
Before and after Structure Analysis

Before structure analysis:
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.

After structure analysis:
<p>
  <s>Dr. Smith lives at 214 Elm Dr.</s>
  <s>He weighs 214 lb.</s>
  <s>He plays bass guitar.</s>
  <s>He also likes to fish; last week he caught a 19 lb. bass.</s>
</p>
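The non-markup behavior, inferring structure by automated text analysis, can be sketched in Python. This is only a toy heuristic and not how any particular TTS engine works: the TITLES set, the function names, and the rule that a leading title such as "Dr." never ends a sentence are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical list of title abbreviations; a real engine would consult
# its own structure rules instead.
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms."}


def split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        ends_with_stop = tok[-1] in ".?!"
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # A sentence can only end before a capitalized word or at end of text.
        next_starts_upper = nxt is None or nxt[0].isupper()
        # "Dr." opening a sentence is a title, not a sentence end.
        is_leading_title = tok in TITLES and len(current) == 1
        if ends_with_stop and next_starts_upper and not is_leading_title:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def to_ssml(text):
    """Wrap each inferred sentence in <s> and the whole text in <p>."""
    body = "".join(f"<s>{escape(s)}</s>" for s in split_sentences(text))
    return f"<p>{body}</p>"


if __name__ == "__main__":
    print(to_ssml("Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. "
                  "He plays bass guitar."))
```

The heuristic handles the sample paragraph, but it is easy to fool (for example, "Dr." in the middle of a sentence followed by a name), which is why explicit p and s markup is available to override the engine's guess.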
Speech Synthesis ML: Text Normalization
- Markup support: say-as for dates, times, etc.; sub for aliasing
- Non-markup behavior: automatically identify and convert constructs
After Text Normalization
<p>
  <s><sub alias="doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="drive">Dr.</sub></s>
  <s>He weighs 214 <sub alias="pounds">lb.</sub></s>
  <s>He plays bass guitar.</s>
  <s>He also likes to fish; last week he caught a 19 <sub alias="pound">lb.</sub> bass.</s>
</p>
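A minimal Python sketch of how abbreviations might be aliased automatically. The two context rules below (a "Dr." followed by a capitalized word and not preceded by one is a title; an "lb." followed by a lowercase word is read as singular "pound") are assumptions chosen to reproduce the sample above, not rules taken from any real engine.

```python
def normalize(text):
    """Wrap known abbreviations in <sub alias="..."> using toy context rules."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        alias = None
        if tok == "Dr.":
            # Assumed rule: title before a name, street suffix after a name.
            is_title = next_tok[:1].isupper() and not prev_tok[:1].isupper()
            alias = "doctor" if is_title else "drive"
        elif tok == "lb.":
            # Assumed rule: "19 lb. bass" modifies a noun, so it is singular.
            alias = "pound" if next_tok[:1].islower() else "pounds"
        out.append(f'<sub alias="{alias}">{tok}</sub>' if alias else tok)
    return " ".join(out)


if __name__ == "__main__":
    print(normalize("Dr. Smith lives at 214 Elm Dr. He weighs 214 lb."))
```

When such rules guess wrong, the designer can write the sub element by hand, which takes precedence over automatic conversion.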
Speech Synthesis ML: Text-to-Phoneme Conversion
- Markup support: phoneme, say-as
- Non-markup behavior: look up in pronunciation dictionary
After Text-to-Phoneme Conversion
<p>
  <s><sub alias="doctor">Dr.</sub> Smith lives at <say-as interpret-as="address">214</say-as> Elm <sub alias="drive">Dr.</sub></s>
  <s>He weighs <say-as interpret-as="number">214</say-as> <sub alias="pounds">lb.</sub></s>
  <s>He plays <phoneme alphabet="ipa" ph="beɪs">bass</phoneme> guitar.</s>
  <s>He also likes to fish; last week he caught a <say-as interpret-as="number">19</say-as> <sub alias="pound">lb.</sub> <phoneme alphabet="ipa" ph="bæs">bass</phoneme>.</s>
</p>
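The two pronunciations of "bass" show why a plain dictionary lookup is not always enough. The Python sketch below picks a pronunciation from cue words in the same sentence; the HOMOGRAPHS table, its cue words, and the function name are hypothetical and stand in for whatever disambiguation a real engine performs.

```python
import re

# Hypothetical homograph table: each entry pairs cue words with an IPA string.
HOMOGRAPHS = {
    "bass": [
        ({"guitar", "plays", "music"}, "beɪs"),  # the instrument
        ({"fish", "caught", "lake"}, "bæs"),     # the fish
    ],
}


def choose_pronunciation(word, sentence):
    """Return the IPA string whose cue words appear in the sentence, else None."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    for cues, ipa in HOMOGRAPHS.get(word.lower(), []):
        if cues & context:
            return ipa
    return None


if __name__ == "__main__":
    print(choose_pronunciation("bass", "He plays bass guitar."))
```

Returning None when no cue matches leaves the word to the engine's default pronunciation.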
Pronunciation Specification
Designer has preference for how words should be spoken, e.g., creek, aluminum. A pronunciation can be specified in three ways:
- Within the text: replace "creek" by "krik"
- With the phoneme command: <phoneme alphabet="ipa" ph="krik">creek</phoneme>
- In the pronunciation lexicon: <lexeme> <grapheme>creek</grapheme> <phoneme>krik</phoneme> </lexeme>
Phonetic spellings sometimes don’t have the desired effect.
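As a rough illustration of the phoneme option, the Python sketch below wraps a word in a phoneme element when a designer-supplied table has an entry for it. The LEXICON dictionary and the function name are hypothetical; only the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical designer preferences: word -> IPA string (value from the slide).
LEXICON = {"creek": "krik"}


def phoneme_markup(word, lexicon=LEXICON):
    """Return the word wrapped in <phoneme> if the lexicon has an entry."""
    ipa = lexicon.get(word.lower())
    if ipa is None:
        return word  # fall back to the engine's default pronunciation
    element = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    element.text = word
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(phoneme_markup("creek"))
```

Building the element with ElementTree rather than string concatenation keeps attribute values and text properly escaped.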
Speech Synthesis ML: Prosody Analysis
- Markup support: emphasis, break, prosody
- Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
Prosody Analysis (Initial text)
<prompt>
  Environmental control menu. Do you want to adjust the lighting or temperature?
</prompt>
Prosody Analysis
<prompt>
  Environmental control menu <break/>
  <emphasis level="reduced">do you want to adjust the</emphasis>
  <emphasis level="strong">lighting</emphasis> <break/>
  or <emphasis level="strong">temperature?</emphasis>
</prompt>
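A menu prompt with this emphasis pattern could be generated rather than written by hand. The Python sketch below is one possible way to do it; the helper names (emphasis, menu_prompt, is_well_formed) are hypothetical, and the layout (reduced lead-in, strong options separated by a break) simply mirrors the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

BREAK = "<break/>"


def emphasis(text, level):
    """Wrap text in an emphasis element with the given level."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def menu_prompt(title, lead_in, options):
    """Build a prompt: title, de-emphasized lead-in, strongly stressed options."""
    stressed = [emphasis(option, "strong") for option in options]
    choices = f" {BREAK} or ".join(stressed)
    return (f"<prompt>{escape(title)} {BREAK} "
            f"{emphasis(lead_in, 'reduced')} {choices}</prompt>")


def is_well_formed(markup):
    """Check that the generated markup parses as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    ssml = menu_prompt("Environmental control menu",
                       "do you want to adjust the",
                       ["lighting", "temperature?"])
    print(ssml, is_well_formed(ssml))
```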
Speech Synthesis ML: Waveform Production
- Markup support: voice, audio*
* audio icons, branding, advertising
Prerecorded messages vs. Speech Synthesis

Prerecorded messages:
- Natural sounding
- Easy to understand
- Static data
- Tedious to record and tag

Speech Synthesis (TTS):
- Artificial sounding
- May be difficult to understand
- Computer-generated data
- Easy to specify
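The trade-off suggests combining the two: prerecorded audio for static messages and TTS for computer-generated data. The Python sketch below shows one way that might look. The RECORDINGS table, the file path, and the function name are hypothetical; the audio element's enclosed text serves as the fallback that is spoken if the file cannot be played.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical map from static message text to a prerecorded file.
RECORDINGS = {
    "Welcome to the environmental control menu.": "audio/welcome.wav",
}


def render_prompt(segments, recordings=RECORDINGS):
    """Use a recording where one exists; leave other segments for TTS."""
    parts = []
    for text in segments:
        src = recordings.get(text)
        if src:
            # The enclosed text is rendered by TTS if the audio is unavailable.
            parts.append(f"<audio src={quoteattr(src)}>{escape(text)}</audio>")
        else:
            parts.append(escape(text))
    return "<prompt>" + " ".join(parts) + "</prompt>"


if __name__ == "__main__":
    print(render_prompt(["Welcome to the environmental control menu.",
                         "The temperature is 72 degrees."]))
```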