(c) 2007 Larson Technical Services
VoiceXML Overview
James A. Larson, Intel Corporation
Outline
– Motivation for VoiceXML
– W3C Speech Interface Framework
– Languages
  – Dialog: VoiceXML 2.0
  – Speech Synthesis: SSML
  – Grammars: SRGS
  – Semantic Interpretation: SI
– VoiceXML 2.1
VoiceXML in the Marketplace
– VoiceXML 2.0 is now ratified as a Recommendation (i.e., an official standard) by the W3C
– Hundreds of millions of VoiceXML calls are answered every day
– VoiceXML is the standard for building speech-enabled applications
Motivation for Speech Applications
– Users access Web sites from any telephone, anywhere, any time.
– Speaking and listening are the natural usage modes for phones.
Strengths of VoiceXML Applications
– Traditional system-directed dialogs for novice users
– Mixed-initiative dialogs for experienced users
– Novice users smoothly become experienced users at their own pace
Limitations of VoiceXML Applications
– No special analysis of speech input
  – Not suitable for training speech skills such as reading, ESL, or singing
– VUI conversational bandwidth is lower than GUI conversational bandwidth
  – Using a VUI is like drinking from Lake Superior with a straw
Exercise 1
– Name or describe a speech application you could use at work.
– Name or describe a speech application you or a family member could use at home.
XML
– XML = eXtensible Markup Language
– Elements are surrounded by tags (example text: "Welcome to the voice system")
– Elements may be nested (example text: "Welcome to Ajax Travel we have the cheapest fares")
– Elements may have attributes
– Because "<", ">", and "&" have special meanings, use
  – "&lt;" in place of "<"
  – "&gt;" in place of ">"
  – "&amp;" in place of "&"
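The points above can be seen together in one small document. This is a minimal sketch: the slide's own markup is not preserved, so the choice of the VoiceXML elements `prompt` and `emphasis`, the `bargein` attribute, and the wording of the last prompt are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch; element and attribute choices are assumptions. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <!-- An element: text surrounded by a start tag and an end tag -->
      <prompt>Welcome to the voice system</prompt>

      <!-- Nesting: the emphasis element sits inside the prompt element -->
      <prompt>Welcome to <emphasis>Ajax Travel</emphasis>, we have the cheapest fares</prompt>

      <!-- An attribute on the start tag, plus escaped special characters.
           The text is read by the parser as: Flights & hotels, fares < 100 dollars -->
      <prompt bargein="false">Flights &amp; hotels, fares &lt; 100 dollars</prompt>
    </block>
  </form>
</vxml>
```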
[Figure: system architecture. Labels: Database Server (DB); Web Server (HTML scripts, VoiceXML scripts, grammars, audio files, multimedia files, documents); Web Browser; Voice Browser on a Speech Server/Gateway (capture voice, ASR, DTMF, replay audio, TTS).]
W3C Speech Interface Framework
[Figure: framework components. Labels: VoiceXML 2.0, Speech Synthesis, Grammar, Semantic Interpretation, Call Control, Other.]
Status of W3C Speech Interface Languages
[Figure: chart placing each language at its W3C standardization stage. Languages: VoiceXML 2.0, VoiceXML 2.1, V3, Grammar (SRGS), Synthesis (SSML), Call Control (CCXML), Semantic Interpretation (SISR). Stages: Requirements, Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, Recommendation.]
Example of VoiceXML 2.0 Fragment
– Prompt text: "Which account, savings or checking?"
– Grammar alternatives: savings, checking, CD, certificate of deposit
– Semantic interpretation tag following "certificate of deposit": $ = "CD"
– Languages used in the fragment:
  – Dialog Language (VoiceXML 2.0)
  – Speech Synthesis Markup Language (SSML)
  – Speech Recognition Grammar Specification (SRGS)
  – Semantic Interpretation (SI)
Example of VoiceXML 2.0 Fragment (semantic interpretation highlighted)
– Prompt text: "Which account, savings or checking?"
– Grammar alternatives: savings, checking, CD, certificate of deposit
– Semantic interpretation tag following "certificate of deposit": new.account = "CD"
– Languages used in the fragment:
  – Dialog Language (VoiceXML 2.0)
  – Speech Synthesis Markup Language (SSML)
  – Speech Recognition Grammar Specification (SRGS)
  – Semantic Interpretation (SI)
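A fragment of this kind might look like the sketch below, which marks where each of the four languages appears. The markup is a reconstruction rather than the slide's original: the identifiers `choose_account` and `account` and the placement of `emphasis` are assumptions, and the tag uses the `out` variable of the SISR 1.0 Recommendation in place of the `$` notation shown on the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Dialog language (VoiceXML 2.0): form, field, prompt -->
  <form id="choose_account">
    <field name="account">
      <prompt>
        <!-- Speech synthesis markup (SSML): emphasis -->
        <emphasis>Which account,</emphasis> savings or checking?
      </prompt>
      <!-- Speech recognition grammar (SRGS): grammar, rule, one-of, item -->
      <grammar version="1.0" xml:lang="en-US" mode="voice" root="account"
               type="application/srgs+xml" tag-format="semantics/1.0">
        <rule id="account">
          <one-of>
            <item>savings</item>
            <item>checking</item>
            <item>CD</item>
            <!-- Semantic interpretation (SI): the tag maps the long phrase to "CD" -->
            <item>certificate of deposit <tag>out = "CD";</tag></item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```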
VoiceXML 2.0 features
– Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>
– Inputs
  – Speech recognition
  – Recording
  – Keypad
– Output
  – Audio files
  – Text-to-speech
– Variables
– Events
– Transition and submission: <goto>, <submit>
– Telephony
  – Connection control: <transfer>, <disconnect>
  – Telephony information
– Platform
  – Objects
  – Performance
  – Fetch
Typical Form Fill-In
– "Welcome to the electronic payment system."
– "Please enter your credit card number."
– "Please enter your expiration date."
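One way to express this form in VoiceXML 2.0 is sketched below. The field names, the submit URL, and the use of the built-in `digits` and `date` field types are assumptions; built-in types are optional in VoiceXML 2.0, so a given platform may need explicit grammars instead.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="payment">
    <!-- Played once when the form is entered -->
    <block>Welcome to the electronic payment system.</block>

    <!-- Each field prompts, listens, and stores the result in its variable -->
    <field name="card_number" type="digits">
      <prompt>Please enter your credit card number.</prompt>
    </field>

    <field name="expiration_date" type="date">
      <prompt>Please enter your expiration date.</prompt>
    </field>

    <!-- Runs after both fields are filled; the URL is hypothetical -->
    <filled>
      <submit next="http://example.com/payment"
              namelist="card_number expiration_date"/>
    </filled>
  </form>
</vxml>
```

The interpreter visits the form items in document order, so the welcome block plays first and each unfilled field is then prompted in turn.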
Exercise 2
Capture "birth date"
_____________________
_______________________________
______________________________
______________________________
Event Handlers
– Deal with exceptional or error conditions
– Control mechanism for dialog turn retries
– Shorthand notation available
– Scoped according to where they occur
Adding Event Handlers
– "When were you born?" …
– "What month?" …
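Handlers attached at different scopes might look like the following sketch. The handler wording, the retry counts, and the inline month grammar are assumptions added for illustration; only the "What month?" prompt comes from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope: applies to every dialog in this document -->
  <catch event="error">
    <prompt>Sorry, something went wrong. Goodbye.</prompt>
    <exit/>
  </catch>

  <form id="birth_date">
    <!-- Form scope: help is shorthand for a catch of the help event -->
    <help>Please say the month you were born in. <reprompt/></help>

    <field name="month">
      <prompt>What month?</prompt>
      <grammar version="1.0" xml:lang="en-US" mode="voice" root="month"
               type="application/srgs+xml">
        <rule id="month">
          <one-of>
            <item>January</item> <item>February</item> <item>March</item>
            <item>April</item> <item>May</item> <item>June</item>
            <item>July</item> <item>August</item> <item>September</item>
            <item>October</item> <item>November</item> <item>December</item>
          </one-of>
        </rule>
      </grammar>

      <!-- Field scope: count selects a handler by how often the event occurred -->
      <nomatch count="1">I did not understand. <reprompt/></nomatch>
      <nomatch count="2">Please say a month, for example January. <reprompt/></nomatch>
      <noinput>I did not hear anything. <reprompt/></noinput>
    </field>
  </form>
</vxml>
```

When an event is thrown, the interpreter looks for a matching handler in the innermost scope first (here the field), then the form, then the document.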
Default Event Handlers
– help: "Sorry, no help is available."
– nomatch: "I did not understand, please try again."
– noinput: "I did not hear anything, please speak again."
Exercise 3
Write event handlers for the month field
____________________
__________________________
___________________________________
Speech Synthesis ML
Processing stages, in order: Structure Analysis, Text Normalization, Text-to-Phoneme Conversion, Prosody Analysis, Waveform Production
– Structure Analysis
  – Markup support: p, s
  – Non-markup behavior: infer structure by automated text analysis
Before and after Structure Analysis
– Before structure analysis: Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.
– After structure analysis, the text is divided into paragraphs and sentences:
  – Dr. Smith lives at 214 Elm Dr. He weighs 214 lb.
  – He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.
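With explicit markup, the author states the structure instead of leaving it to automated analysis. The sketch below applies the SSML `p` and `s` elements to the sample text; the split into two paragraphs is an assumption based on how the slide groups the sentences.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- p marks a paragraph, s marks a sentence -->
  <p>
    <s>Dr. Smith lives at 214 Elm Dr.</s>
    <s>He weighs 214 lb.</s>
  </p>
  <p>
    <s>He plays bass guitar.</s>
    <s>He also likes to fish; last week he caught a 19 lb. bass.</s>
  </p>
</speak>
```

Marking the sentences explicitly keeps the synthesizer from treating the period in "Dr." or "lb." as a sentence end.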
Speech Synthesis ML: Text Normalization
– Markup support: say-as for dates, times, etc.; sub for aliasing
– Non-markup behavior: automatically identify and convert constructs
After Text Normalization
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.
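The normalization markup could be applied to the first two sentences roughly as follows. This is a sketch, not the slide's original markup: the aliases are assumptions, and the `interpret-as` value `cardinal` is not defined by SSML 1.0 itself, so its support depends on the synthesizer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- sub replaces the written form with the alias when speaking -->
  <s><sub alias="Doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="Drive">Dr.</sub></s>
  <!-- say-as hints at how to read the enclosed text; the value is platform-dependent -->
  <s>He weighs <say-as interpret-as="cardinal">214</say-as> <sub alias="pounds">lb.</sub></s>
</speak>
```

The two occurrences of "Dr." show why markup helps: the same abbreviation expands to "Doctor" in one place and "Drive" in the other.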
Speech Synthesis ML: Text-to-Phoneme Conversion
– Markup support: phoneme, say-as
– Non-markup behavior: look up in pronunciation dictionary
After Text-to-Phoneme Conversion
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar. He also likes to fish; last week he caught a 19 lb. bass.
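The word "bass" is the ambiguous case in the sample text: the instrument and the fish are spelled alike but pronounced differently. A sketch of how the SSML `phoneme` element can pin down each pronunciation, using IPA transcriptions chosen for this illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- The instrument, pronounced like "base" -->
  <s>He plays <phoneme alphabet="ipa" ph="beɪs">bass</phoneme> guitar.</s>
  <!-- The fish, with the vowel of "pass" -->
  <s>Last week he caught a 19 lb. <phoneme alphabet="ipa" ph="bæs">bass</phoneme>.</s>
</speak>
```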
Speech Synthesis ML: Prosody Analysis
– Markup support: emphasis, break, prosody
– Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
Prosody Analysis (initial text)
Environmental control menu. Do you want to adjust the lighting or temperature?
Prosody Analysis (after markup)
Environmental control menu. Do you want to adjust the lighting or temperature?
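Prosody markup for this prompt might look like the sketch below. Which words are emphasized, the pause length, and the pitch settings are assumptions for illustration, since the slide's own markup is not preserved.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <!-- emphasis stresses the menu title -->
    <s><emphasis level="strong">Environmental control menu.</emphasis></s>
    <!-- break inserts a pause before the question -->
    <break time="500ms"/>
    <!-- prosody changes pitch to set the two choices apart -->
    <s>Do you want to adjust the
       <prosody pitch="high">lighting</prosody> or
       <prosody pitch="low">temperature</prosody>?</s>
  </p>
</speak>
```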
Speech Synthesis ML: summary of stages
– Structure Analysis
  – Markup support: paragraph, sentence
  – Non-markup behavior: infer structure by automated text analysis
– Text Normalization
  – Markup support: say-as for dates, times, etc.; sub for aliasing
  – Non-markup behavior: automatically identify and convert constructs
– Text-to-Phoneme Conversion
  – Markup support: phoneme, say-as
  – Non-markup behavior: look up in pronunciation dictionary
– Prosody Analysis
  – Markup support: emphasis, break, prosody
  – Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
– Waveform Production
  – Markup support: voice, audio (audio icons, branding, advertising)
Waveform Production
Environmental control menu. Do you want to adjust the lighting or temperature?
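The `voice` and `audio` elements act at this last stage. In the sketch below, the audio file URL and the choice of a female voice are hypothetical; the text inside `audio` is spoken only if the file cannot be played.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- audio plays a recorded file, e.g. an audio icon or branding jingle -->
  <audio src="http://example.com/audio/chime.wav">Welcome.</audio>
  <!-- voice requests a change in the speaking voice -->
  <voice gender="female">
    Environmental control menu.
    Do you want to adjust the lighting or temperature?
  </voice>
</speak>
```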
Exercise 4 (insert SSML commands)
Welcome to Ajax Bank do you want to withdraw or deposit funds?
Grammars
– Describe what the user may say at a point in the dialog
– Enable the speech recognition engine to work faster and more accurately
– Consist of one or more "rules"
Example Grammar
– Words covered by the grammar: zero, ten, one, two, three, four, five, six, seven, eight, nine
– The example uses the XML form of grammars
– The grammar processor should start with the "zero_to_ten" rule
– This is a grammar used by the speech recognizer (there may also be grammars for DTMF recognizers)
– One rule describes the single digits; another rule describes the numbers zero through ten
– The <one-of> element describes alternatives
– A rule reference element (<ruleref>) references another rule
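A grammar with the properties described for this example could be written as follows. This is a reconstruction under assumptions: the root rule name `zero_to_ten` comes from the slides, while the name `single_digit` and the exact division of words between the two rules are guesses.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- XML form of SRGS; mode="voice" marks a speech (not DTMF) grammar;
     root names the rule where the grammar processor starts -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="zero_to_ten">

  <!-- Rule describing the numbers zero through ten -->
  <rule id="zero_to_ten" scope="public">
    <one-of>                                   <!-- one-of lists alternatives -->
      <item>zero</item>
      <item>ten</item>
      <item><ruleref uri="#single_digit"/></item>  <!-- reference to another rule -->
    </one-of>
  </rule>

  <!-- Rule describing the single digits one through nine -->
  <rule id="single_digit">
    <one-of>
      <item>one</item>   <item>two</item>   <item>three</item>
      <item>four</item>  <item>five</item>  <item>six</item>
      <item>seven</item> <item>eight</item> <item>nine</item>
    </one-of>
  </rule>
</grammar>
```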
Exercise 5
Write a grammar that recognizes the numbers zero to nineteen.
More Grammar Elements
– Repeat and optional: "very good"
– Sequence: "Twenty" followed by another item
– Garbage: "James", any speech, "Lewis"
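The three constructs can be sketched in SRGS as shown below. The rule names, the choice of `repeat="0-1"` for the optional word, and the digit rule that follows "twenty" are assumptions; the slide's own markup is not preserved.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="examples">
  <rule id="examples" scope="public">
    <one-of>
      <item><ruleref uri="#praise"/></item>
      <item><ruleref uri="#twenties"/></item>
      <item><ruleref uri="#full_name"/></item>
    </one-of>
  </rule>

  <!-- Repeat and optional: repeat="0-1" makes "very" optional,
       so the rule matches "good" and "very good" -->
  <rule id="praise">
    <item repeat="0-1">very</item> good
  </rule>

  <!-- Sequence: "twenty" followed by a digit, e.g. "twenty three" -->
  <rule id="twenties">
    twenty <ruleref uri="#single_digit"/>
  </rule>

  <!-- Garbage: "James", then any speech, then "Lewis" -->
  <rule id="full_name">
    James <ruleref special="GARBAGE"/> Lewis
  </rule>

  <rule id="single_digit">
    <one-of>
      <item>one</item>   <item>two</item>   <item>three</item>
      <item>four</item>  <item>five</item>  <item>six</item>
      <item>seven</item> <item>eight</item> <item>nine</item>
    </one-of>
  </rule>
</grammar>
```

Writing `repeat="0-"` instead would allow "very" any number of times, as in "very very good".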
Reusing existing grammars
<grammar type="application/srgs+xml" root="size" src="
Semantic Interpretation
– Semantic Interpretation defines how to extract and modify the results returned by the speech recognition engine
– Semantic interpretation instructions are contained in the <tag> element
– Two kinds of syntax for <tag> contents:
  – Semantic Literals (literal values)
  – Semantic Scripts (ECMAScript)
Semantic Interpretation
Semantic Literals example:
– "coca cola" returns coke
– "cola" returns coke
– "black fizzy stuff" returns coke
– "coke" returns coke (default assignment)
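In SRGS, the literal syntax is selected with the `tag-format` value `semantics/1.0-literals`, and each `tag` then holds a plain string. A sketch of the example above, with the rule name `drink` assumed:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="drink"
         tag-format="semantics/1.0-literals">
  <rule id="drink" scope="public">
    <one-of>
      <!-- The tag content is the literal value returned for the phrase -->
      <item>coca cola <tag>coke</tag></item>
      <item>cola <tag>coke</tag></item>
      <item>black fizzy stuff <tag>coke</tag></item>
      <!-- No tag: default assignment returns the matched text itself -->
      <item>coke</item>
    </one-of>
  </rule>
</grammar>
```

All four phrasings yield the same value, so the dialog only has to test for "coke".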
Semantic Interpretation
[Figure: processing chain. ASR, using a grammar with semantic interpretation scripts, produces text; the Semantic Interpretation Processor turns the text into an ECMAScript object; the object goes to the VoiceXML Interpreter.]
– Without semantic interpretation: the recognized text "fourteen" is passed on unchanged
– With semantic interpretation: the grammar item "fourteen" carries the script new.quantity="14"; and the result is the ECMAScript object { quantity: "14" }
Semantic Interpretation
– Semantic Scripts employ ECMAScript
– Advantages:
  – Richer structure (objects)
  – Ability to perform computations
Semantic Interpretation
Example grammar rule with Script Syntax:
– small: out.size = "small";
– medium: out.size = "medium";
– large: out.size = "large";
– green: out.color = "green";
– blue: out.color = "blue";
– white: out.color = "white";
Utterance: "Large white"
ECMAScript structure: action: { size: "large", color: "white" }
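A complete grammar around these script tags might look like the sketch below. It assumes a single rule named `action` (taken from the result structure) that matches a size followed by a color; the slide's original rule layout is not preserved.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="action"
         tag-format="semantics/1.0">
  <rule id="action" scope="public">
    <!-- First word: each tag sets the size property of the rule result -->
    <one-of>
      <item>small  <tag>out.size = "small";</tag></item>
      <item>medium <tag>out.size = "medium";</tag></item>
      <item>large  <tag>out.size = "large";</tag></item>
    </one-of>
    <!-- Second word: each tag sets the color property -->
    <one-of>
      <item>green <tag>out.color = "green";</tag></item>
      <item>blue  <tag>out.color = "blue";</tag></item>
      <item>white <tag>out.color = "white";</tag></item>
    </one-of>
  </rule>
</grammar>
```

For the utterance "large white", the tags on the two matched items run in order and fill in both properties of the result object.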
Semantic Interpretation
Example grammar rule with Script Syntax:
– "What is" [digit]: $.total = $digit;
– "plus" [digit]: $.total = $.total + $digit;
Utterance: "What is … ?"
ECMAScript structure: calculator: { total: 6 }
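The slide uses an older `$` notation. In the syntax of the SISR 1.0 Recommendation, the same computation can be sketched as below, where `out` is the result of the current rule and `rules.digit` is the result of the most recently matched reference to the `digit` rule. The `digit` rule and its numeric values are assumptions added to make the grammar complete.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="calculator"
         tag-format="semantics/1.0">
  <rule id="calculator" scope="public">
    what is
    <!-- Store the first operand -->
    <ruleref uri="#digit"/> <tag>out.total = rules.digit;</tag>
    plus
    <!-- Add the second operand to the running total -->
    <ruleref uri="#digit"/> <tag>out.total = out.total + rules.digit;</tag>
  </rule>

  <!-- Each digit word returns a number, not a string, so + adds -->
  <rule id="digit">
    <one-of>
      <item>zero  <tag>out = 0;</tag></item>
      <item>one   <tag>out = 1;</tag></item>
      <item>two   <tag>out = 2;</tag></item>
      <item>three <tag>out = 3;</tag></item>
      <item>four  <tag>out = 4;</tag></item>
      <item>five  <tag>out = 5;</tag></item>
      <item>six   <tag>out = 6;</tag></item>
      <item>seven <tag>out = 7;</tag></item>
      <item>eight <tag>out = 8;</tag></item>
      <item>nine  <tag>out = 9;</tag></item>
    </one-of>
  </rule>
</grammar>
```

With this grammar, "what is two plus four" produces a result whose `total` property is 6, matching the structure on the slide.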
Exercise 6
Fill in the contents of the <tag> elements.
Grammar rule:
– from
  – savings ________________________
  – checking ________________________
– to
  – savings ________________________
  – checking ________________________
Utterance: "From savings to checking"
ECMAScript structure: transfer: { source_account: "savings", target_account: "checking" }
VoiceXML 2.1
– VoiceXML's success and popularity resulted in many implementations early in the standardization process
– Additional, innovative features were conceived after the VoiceXML 2.0 content was agreed
– Goals of VoiceXML 2.1:
  – Ensure portability by specifying a set of commonly implemented extensions
  – Remain backwards-compatible with VoiceXML 2.0
  – Follow a "fast track" to standardization
VoiceXML 2.1
Standardized extensions:
– Locate barge-in occurrences within prompts
– Access recognition utterances for analysis
– Increase performance by reducing server round-trips
– Extended call transfer types
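Three of these extensions are sketched together below: named marks for locating barge-in, the `recordutterance` property for keeping the caller's audio, and the `consultation` transfer type. The form layout, city names, mark names, and telephone number are hypothetical, and how the logged values appear depends on the platform. The round-trip reduction is not shown; VoiceXML 2.1 addresses it with elements such as `data` and `foreach`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Keep the caller's utterance audio for each recognition -->
  <property name="recordutterance" value="true"/>

  <form id="office">
    <field name="city">
      <prompt>
        Our offices are in
        <mark name="before_list"/> Boston, Chicago, and Denver.
        <mark name="after_list"/> Which city do you want?
      </prompt>
      <grammar version="1.0" xml:lang="en-US" mode="voice" root="city"
               type="application/srgs+xml">
        <rule id="city">
          <one-of>
            <item>Boston</item> <item>Chicago</item> <item>Denver</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <!-- Shadow variables report the last mark reached before barge-in
             and details of the recorded utterance -->
        <log>
          last mark: <value expr="city$.markname"/>
          time since mark in ms: <value expr="city$.marktime"/>
          utterance length in ms: <value expr="city$.recordingduration"/>
        </log>
      </filled>
    </field>

    <!-- Transfer type added in VoiceXML 2.1; the number is hypothetical -->
    <transfer name="agent" type="consultation" dest="tel:+15555550100">
      <prompt>Connecting you to that office.</prompt>
    </transfer>
  </form>
</vxml>
```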
Summary
– W3C Speech Interface Framework
  – Dialog: VoiceXML
  – Grammar: SRGS
  – Synthesis: SSML
  – Semantic Interpretation: SI
  – Call Control: CCXML
– The languages can work together or separately
– For details, see http://
Industry Organizations
– World Wide Web Consortium
– W3C Voice Browser Working Group
– W3C Multi-Modal Working Group
– VoiceXML Forum
– SALT Forum
– Speech Technology Magazine
Books
– James A. Larson, VoiceXML: An Introduction to Developing Speech Applications, 2002, Upper Saddle River, NJ: Prentice Hall.
– Eve Astrid Andersson et al., Early Adopter VoiceXML, 2001, Birmingham, UK: Wrox.
– Bruce Balentine & David P. Morgan, How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues, 1999, San Ramon, CA: Enterprise Integration Group.
– Rick Beasley et al., Voice Application Development with VoiceXML, 2002, Indianapolis: Sams.
– Bob Edgar, The VoiceXML Handbook, 2001, New York: CMP.
– Susan Weinschenk & Dean T. Barker, Designing Effective Speech Interfaces, 2000, New York: John Wiley & Sons.
– Chetan Sharma & Jeff Kunins, VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0, 2002, New York: John Wiley.
– Michael H. Cohen, James P. Giangola, & Jennifer Balogh, Voice User Interface Design, 2004, Addison Wesley.
Other Resources
– The VoiceXML Guide
Tutorials and Articles
– VoiceXML Forum
– VoiceXML Review
– World of VoiceXML
Online Voice SDKs
– BeVocal Cafe: http://cafe.bevocal.com
– Tellme Studio: http://studio.tellme.com
– VoiceGenie Developer Workshop
– Voxpilot voxbuilder: http://
Questions?
Thanks for your attention
Answer to Exercise 2
– "When were you born?"
– "What month?"
– "What day of the month?"
– "What year?"
Answer to Exercise 3
Write event handlers for the month field
– "In what month were you born?"
– "Which month, for example, January, February, or March?"
– "Say the name of the month you were born in."
Answer to Exercise 4
Welcome to Ajax Bank do you want to withdraw or deposit funds?
Answer to Exercise 5
Write a grammar for zero to nineteen
– First group of words: zero, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen
– Second group of words: one, two, three, four, five, six, seven, eight, nine
Answer to Exercise 6
Utterance: "From savings to checking"
Grammar rule:
– from
  – savings: out.source_account = "savings";
  – checking: out.source_account = "checking";
– to
  – savings: out.target_account = "savings";
  – checking: out.target_account = "checking";
ECMAScript structure: transfer: { source_account: "savings", target_account: "checking" }