VoiceXML: SSML (Speech Synthesis Markup Language) Recorded speech and audio

Acknowledgements: Prof. McTear, Natural Language Processing, http://www.infj.ulst.ac.uk/nlp/index.html, University of Ulster.

Overview
- Speech Synthesis Markup Language (SSML)
- Phases of text-to-speech synthesis:
  - Structure analysis
  - Text normalisation
  - Text-to-phoneme conversion
  - Prosody analysis
  - Waveform production
- Recorded speech

SSML (Speech Synthesis Markup Language)
Enables developers to override the default behaviour at each stage of synthesis:
- Structure analysis
- Text normalisation
- Text-to-phoneme conversion
- Prosody analysis
- Waveform production
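
The stages listed above can be overridden from a single document. A minimal standalone SSML sketch (element support varies by platform; the text content here is illustrative):

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <s>
      <!-- text normalisation override -->
      Your appointment is with <sub alias="Doctor">Dr.</sub> Smith.
    </s>
    <s>
      <!-- prosody overrides -->
      <prosody rate="slow">Please arrive ten minutes early.</prosody>
      <break time="500ms"/>
      <emphasis level="strong">Do not be late.</emphasis>
    </s>
  </p>
</speak>
```

Note that the <speak> wrapper is only needed in a standalone SSML document; inside a VoiceXML prompt the SSML tags are used directly.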

Structure Analysis
Division of text into basic elements, e.g. sentence and paragraph, to support more natural phrasing:
<s> - sentence
<p> - paragraph
Structure can be inferred from punctuation and formatting, but this is error-prone:
Dr. Lewis works at the clinic on Sunset Dr. in western Portland.
Explicit markup removes the ambiguity:
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 20 lb. bass.
<p>
  <s>Dr. Smith lives at 214 Elm Dr.</s>
  <s>He weighs 214 lb.</s>
  <s>He plays bass guitar.</s>
  <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
</p>

Text Normalisation
Annotation of text so that it is spoken correctly. Ambiguous examples:
1/2 - may be spoken as "half", "January second", "February first", or "one of two"
Dr. - may be "doctor" or "drive", e.g. "Dr. John Dr." is rewritten as "Doctor John Drive"
St. - may be "saint" or "street", e.g. "St. John St." is rewritten as "Saint John Street"
Acronyms - some, e.g. ACM or IEEE, should be spelled out; others are pronounced as words, e.g. RAM, ROM
Email addresses - e.g. catazman@bee.com:
  first part: "Cat Azman", "C. A. Tazman", or "C. Atazman"?
  last part: "bee dot com" or "B. E. E. dot com"?

<sub>
New in VoiceXML 2.0 (Speech Synthesis Markup).
Syntax: <sub alias="substituteText">OriginalText</sub>
Description: element whose alias attribute provides substitute text to be spoken instead of the contained text. This allows the document to contain both a written and a spoken form for a string.

<sub> example
<sub alias="doctor">Dr.</sub> Smith lives at
<sub alias="two fourteen">214</sub> Elm <sub alias="drive">Dr.</sub>
He weighs <sub alias="two hundred and fourteen">214</sub>
<sub alias="pounds">lb.</sub>
He plays bass guitar. He also likes to fish; last week he caught a
<sub alias="twenty">20</sub> <sub alias="pound">lb.</sub> bass.

Alternatively, leaving the numbers for the synthesizer to normalise:
<sub alias="doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="drive">Dr.</sub>
He weighs 214 <sub alias="pounds">lb.</sub>
He plays bass guitar.
He also likes to fish; last week he caught a 20 <sub alias="pound">lb.</sub> bass.

<say-as>
Speaks the enclosed text in the given style. Implemented (with limitations) on some platforms.
Example: numbers. The contained text can be interpreted as a number; the allowed number formats are ordinal, cardinal, and digits:
<say-as type="number:ordinal">12</say-as> is spoken as "twelfth"
<say-as type="number:digits">12</say-as> is spoken as "one two"
Other types: acronyms, currency, time, date, duration, measures, telephone, spell-out, names, and net.
BeVocal provides a set of extended types for items such as: airline, equity, street, city, state, citystate, address.

Text-to-Phoneme Conversion
Specify the pronunciation of words that are difficult or ambiguous to pronounce, e.g. read = "reed" / "red"; wind: "Wind the watch when you face into the wind."
<phoneme> - uses a standard phonetic alphabet, the International Phonetic Alphabet (IPA):
He plays
<phoneme alphabet="ipa" ph="U0062 U0258 U0073">bass</phoneme> guitar.
He also likes to fish; last week he caught a
<sub alias="twenty">20</sub> <sub alias="pound">lb.</sub>
<phoneme alphabet="ipa" ph="U0062 U00E6 U0073">bass</phoneme>.
(The ph values here are the Unicode numbers of the IPA symbols.)

Attributes of <phoneme>
alphabet - the phonetic alphabet used to specify the pronunciation of the word contained in the <phoneme> element. The only valid values are alphabet="ipa" and vendor-defined strings of the form alphabet="x-organization" or alphabet="x-organization-alphabet".
ph - the phonetic spelling of the word, expressed using the chosen alphabet.
Using the IPA requires some linguistic training:
- For an excellent tutorial on the IPA symbols and sounds, see http://www.unil.ch/ling/english/phonetique/table-eng.html
- For an overview of the IPA and a full chart of symbols, see http://www.arts.gla.ac.uk/IPA/ipa.html
- The sounds used in English and their IPA symbols are illustrated at http://www.antimoon.com/how/pronunc-soundsipa.htm; you can hear each sound by clicking the word that contains it
- To identify the corresponding Unicode number, go to http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm and move the cursor over the IPA symbol; the Unicode value will appear

Prosody Analysis
Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.
Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of the synthesized speech, often based on parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed. The synthesized speech pauses for commas and is inflected appropriately depending on whether the sentence is declarative, interrogative, or exclamatory.
Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules for different spoken national languages may be quite different; for example, the prosody of American, British, Indian, and Jamaican pronunciations of English differs.
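
Where the engine's defaults are not what you want, the SSML elements described in the following slides let you override them explicitly. A hedged sketch (attribute support varies by platform, and the wording is illustrative):

```xml
<prompt>
  Your balance is
  <break time="300ms"/>
  <prosody rate="slow" volume="loud">two hundred and fourteen dollars</prosody>.
  Would you like to <emphasis level="moderate">transfer</emphasis> funds?
</prompt>
```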

<prosody>: pitch
Pitch refers to the "highness or lowness" of speech, measured by the frequency (Hz, vibrations per second) of the sound (currently not implemented in BeVocal Café). It can be specified with:
- a number followed by "Hz"
- a relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
- a relative change as a relative number, for example "+10" or "-8.7"
- one of the words "x-high", "high", "medium", "low", "x-low", or "default"

<prosody>: range
Range specifies the variability of the pitch, using the same options as pitch (currently not implemented in BeVocal Café), e.g.:
<prosody pitch="medium" range="x-low">

<prosody>: contour
Describes the actual pitch contour for the text (currently not implemented in BeVocal Café): a set of time segments with a target pitch specified for each segment. Each time segment is defined as a percentage of the total time for speaking the contained text, e.g. (25%, 25%, 25%, 25%) would speak the contained text in four equal segments. An interpolation algorithm smooths the transitions between the time segments.
For example, a contour can describe the rise in pitch at the end of a question:
<prosody contour="(90%, medium) (10%, high)"> You said what? </prosody>

<prosody>: rate, duration
Rate. The speaking rate in words per minute (currently not implemented in BeVocal Café), specified using any of:
- a number
- a relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
- a relative change as a relative number, for example "+10" or "-8.7"
- one of the words "x-fast", "fast", "medium", "slow", "x-slow", or "default"
The student's name is <prosody rate="-10%"> John Scott </prosody>
Duration. A value in seconds or milliseconds for the desired time to read the element contents, e.g. <prosody duration="10s">

<prosody>: volume
Volume specifies how loudly or quietly the words are spoken, specified by:
- a number in the range 0.0 to 100.0
- a relative change expressed as a percentage, for example "+18.2%" or "-10.3%"
- a relative change as a relative number, for example "+10" or "-8.7"
- one of the words "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default"
<prosody volume="loud"> text to be spoken </prosody>

<emphasis> (formerly <emph>)
level - values "strong", "moderate", "none", and "reduced". "none" is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize.
<emphasis level="strong">help</emphasis>

<break>
Specifies when to insert silence (a pause) in the text.
strength - the strength of the prosodic break. Values are "none", "x-small", "small", "medium" (the default), "large", or "x-large".
time - e.g. "250ms", "3s".
Welcome to the Student System <break time="250ms"/> Please say one of the following: …

Waveform Production
The process of converting a textual representation into acoustic sounds that humans hear and interpret as human-like speech.
<voice> - uses a different voice from the default specified for TTS:
<voice age="3" gender="female"> text to speak </voice>
<audio> - specifies what audio to present to the user
<desc> - specifies text-only output describing the audio output (e.g. dog barking)

Other SSML elements
<speak> - defines a container for a speech synthesis document; not required when SSML tags are used in PCDATA within VoiceXML.
<lexicon> - specifies a pronunciation lexicon document which the speech synthesis engine uses to generate the pronunciation of words. The format is not yet defined; see the documentation of your VoiceXML browser vendor.
<mark> - places a marker into the text to be processed by the speech synthesis engine, e.g. <mark name="pause"/>. When the marker is encountered, the speech synthesis engine pauses and throws an event referencing the marker name; a built-in event handler processes the event and causes the engine to resume.
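
A minimal sketch combining these container elements in a standalone SSML document (the lexicon URI is a hypothetical placeholder, since the lexicon format is vendor-defined):

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- vendor-specific pronunciation lexicon; URI is illustrative only -->
  <lexicon uri="http://example.com/student-names.lex"/>
  Please hold
  <mark name="pause"/>
  while I look up your record.
</speak>
```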

<audio>: playing prerecorded audio files
Output can consist of a combination of prerecorded files, audio streams, and synthesised speech, e.g.:
<prompt> Welcome to the Student System <audio src="AudioSample.wav"/> How can I help you? </prompt>
<audio> can have alternative content in case the audio sample is not available, e.g.:
<audio src="welcome.wav"> Welcome to the Student System </audio>

Recording speech input using <record>
<record> is a form item similar to <field>. It is used to collect a recording from the user that can be played back or submitted to a server. It has a <prompt> element and can have a <filled> element, and it can have a grammar for a spoken command that terminates the recording.

Attributes of <record>
name - the name of the variable that holds the recorded item
expr - the initial value of the form item variable
beep - "true" or "false"; if true, a beep tone is played to the user just before the recording begins (default: false)
maxtime - the maximum duration of the recording, beginning when the recording starts, e.g. maxtime = "10s" (10 seconds)
finalsilence - the interval of silence indicating the end of speech, e.g. finalsilence = "3s" (not implemented in IBM Voice Server SDK)
dtmfterm - "true" or "false"; if true, any DTMF key press not matched by an active grammar terminates the input (default: true)
type - media format of the resulting recording. A media type is written in the form type/subtype; for audio files the type is always audio

Example using <record>
<form>
  <record name = "msg" beep = "true" maxtime = "5s"
          finalsilence = "5000ms" dtmfterm = "true" type = "audio/x-wav">
    <prompt timeout = "5s">
      Record your message after the beep.
    </prompt>
  </record>
  <filled>
    <!-- when recording is completed, replay recorded message -->
    <prompt> You said <audio expr="msg"/> </prompt>
  </filled>
</form>

Submitting the recording to the server
In this example, a recording has been stored in the variable 'msg' and the system confirms whether the user wishes to keep it:
<field name="confirm" type="boolean">
  <prompt> Your message is <audio expr="msg"/>. </prompt>
  <prompt> To keep it, say yes. To discard it, say no. </prompt>
  <filled>
    <if cond="confirm">
      <submit next="save_message.jsp" enctype="multipart/form-data"
              method="post" namelist="msg"/>
    </if>
    <clear/>
  </filled>
</field>

<record> shadow variables (1)
NB: 'name' represents the name of the form item variable.
name$.duration - the duration of the recording in milliseconds
name$.size - the size of the recording in bytes
name$.termchar - the DTMF key used by the caller to terminate the recording; undefined if a key was not used to terminate the recording
name$.maxtime - true if the recording was terminated because the maxtime duration was reached; false otherwise

<record> shadow variables (2)
name$.utterance - the string of words spoken by the user if the recording was terminated by speech recognition input; undefined otherwise
name$.confidence - the confidence level (0.0 - 1.0) if the recording was terminated by speech; undefined otherwise. The confidence level is the speech recognizer's estimate of the accuracy of its results, in this case of the contents of name$.utterance
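
A hedged sketch of inspecting these shadow variables after a recording (the element layout follows the earlier <record> example; the prompt wording is illustrative):

```xml
<form>
  <record name="msg" beep="true" maxtime="10s">
    <prompt> Record your message after the beep. </prompt>
    <filled>
      <!-- report the recording length: duration is in milliseconds -->
      Your message lasted
      <value expr="msg$.duration / 1000"/> seconds.
      <if cond="msg$.maxtime">
        <!-- recording was cut off when maxtime was reached -->
        Your message was cut off at the time limit.
      </if>
    </filled>
  </record>
</form>
```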

Dealing with user hang-up during recording
When a user hangs up during recording, the recording terminates and a connection.disconnect.hangup event is thrown. Audio recorded up to the hang-up is available through the <record> variable, e.g.:
<catch event="connection.disconnect.hangup">
  … action such as submitting the recording to the server …
</catch>

Exercise: SSML markup
Create a file using some SSML markup for TTS. Examples:
He drove his new car, <prosody pitch="-10%" range="-20%" volume="-20%"> not his ugly old car </prosody>, because he wanted to seem more <emphasis level="strong"> impressive </emphasis>
My user number is <say-as interpret-as="digits"> 145678 </say-as>
Sample file: tts.vxml

Exercise: recording and using audio files
Create a simple application that includes a field in which you ask the user to speak some information, such as name and address, which the system records for later playback. Play back a pre-recorded file (music to be played as an introduction).