Speech Technology
What are the big players in the area up to?
–Google: http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html
–Microsoft: …enabled-world/
–Apple: …heats-up-the-race-toward-a-voice-activated/
–IBM
–Nuance: …speechify-apps/
–Voxeo
Apple, and the case of Siri
–Siri: …B06O4
–Review of Siri: …SkAU7c&feature=watch_response
Types of dialog systems
by modality
–text-based
–spoken
–graphical user interface
–multi-modal
by device
–telephone-based systems
–PDA systems
–in-car systems
–robot systems
–desktop/laptop systems
  –native
  –in-browser systems
  –in-virtual machine
–in-virtual environment
–robots
by style
–command-based
–menu-driven
–natural language
by initiative
–system initiative
–user initiative
–mixed initiative
by application
–information service
–command-and-control
–entertainment
–education/tutorial
–edutainment
–reminder systems
–companion systems
–healthcare
–eldercare
–assistive/access systems
More about application types
Information providing systems:
–weather reports
–stock quotes
–timetables
–...
Transaction-based systems:
–calendar functions
–shopping
–financial transactions
–travel reservations
–...
Why Voice?
Why voice?
–Wireless devices have small screens and limited input capabilities.
–A telephone keypad can give users only a limited number of choices.
–Speech technology is improving. The exchange of information between a person and a computer is becoming more like a real conversation.
–Users want hands-free or eyes-free use.
–From a business viewpoint, voice applications open up a host of new revenue opportunities.
–There are many more telephones than computers with the potential to access the Internet.
Traditional Interactive Voice Response (IVR)
Speech versus Touch Tone
Architecture 1
Architecture 2
Today
–Presentation of project ideas
–TTS evaluation
–Short intro to XML
–Speech technology standards overview
–Speech Synthesis Markup Language (SSML)
–Presentation of home assignment 3: ASR evaluation
Project ideas?
Intro to XML
W3C Speech Standards Torbjörn Lager
VoiceXML – a part of the web
[Figure: web servers deliver VoiceXML to a VoiceXML browser (ASR, TTS, interpreter) and HTML to an HTML browser]
The place of speech technology … speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies. Tim Berners-Lee
The What and Why of Standards
Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages.
Advantages:
–developers can create applications using the standard languages that are portable across a variety of platforms;
–products from different vendors are able to interact with each other;
–a community of experts evolves around the standard and is available to develop products and services based on the standard.
Disadvantages:
–some developers feel that standards may inhibit creativity and stall the introduction of superior technology.
However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough. Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.
World Wide Web Consortium
W3C Speech Standards
–Speech Recognition Grammar Specification (SRGS): what the user can say
–Semantic Interpretation for Speech Recognition (SISR): what the user means
–Speech Synthesis Markup Language (SSML): what the user hears
–VoiceXML: dialog management, what the system is to do
Speech Recognition Grammar Specification (SRGS) Covers both speech and DTMF (Dual-Tone Multi-Frequency) input. (DTMF is valuable in noisy conditions or when the social context makes it awkward to speak.) Grammars can be specified in either an XML or an equivalent augmented BNF (ABNF) syntax. –Speech recognition is an inherently uncertain process. Recognizers may report confidence values. –If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (N-best results). What about statistical language models? Not covered by SRGS!
Semantic Interpretation for Speech Recognition (SISR)
Each phrase the user may say is mapped to one of two values:
–"yes", "yeah", "you bet", "oui" → yes
–"no", "nope", "no way" → no
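The yes/no mapping on this slide can be sketched in Python as a plain lookup from the recognized phrase to its semantic value. This is only an illustration of the idea; the table `YES_NO` and the function `interpret` are hypothetical names, not part of SISR.

```python
# Hypothetical phrase-to-value table mirroring the yes/no slide.
YES_NO = {
    "yes": "yes",
    "yeah": "yes",
    "you bet": "yes",
    "oui": "yes",
    "no": "no",
    "nope": "no",
    "no way": "no",
}


def interpret(utterance):
    """Return the semantic value for a recognized phrase, or None if unknown."""
    key = " ".join(utterance.lower().split())  # normalize case and spacing
    return YES_NO.get(key)
```

The dialog manager then only has to handle two values, however many surface forms the grammar accepts.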
Semantic Interpretation for Speech Recognition (SISR)
Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result:
{
  drink: { liquid: "coke", drinksize: "medium" },
  pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }
}
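<rule id="order">
  I would like a
  <ruleref uri="#drink"/>
  <tag>out.drink = new Object(); out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize;</tag>
  and
  <ruleref uri="#pizza"/>
  <tag>out.pizza=rules.pizza;</tag>
</rule>
<rule id="kindofdrink">
  <one-of>
    <item>coke</item>
    <item>pepsi</item>
    <item>coca cola<tag>out="coke";</tag></item>
  </one-of>
</rule>
<rule id="foodsize">
  <tag>out="medium";</tag>
  <item repeat="0-1">
    <one-of>
      <item>small<tag>out="small";</tag></item>
      <item>medium</item>
      <item>large<tag>out="large";</tag></item>
      <item>regular<tag>out="medium";</tag></item>
    </one-of>
  </item>
</rule>
<rule id="tops">
  <tag>out=new Array;</tag>
  <ruleref uri="#top"/>
  <tag>out.push(rules.top);</tag>
  <item repeat="1-">
    and
    <ruleref uri="#top"/>
    <tag>out.push(rules.top);</tag>
  </item>
</rule>
<rule id="top">
  <one-of>
    <item>anchovies</item>
    <item>pepperoni</item>
    <item>mushroom<tag>out="mushrooms";</tag></item>
    <item>mushrooms</item>
  </one-of>
</rule>
<rule id="drink">
  <ruleref uri="#foodsize"/>
  <ruleref uri="#kindofdrink"/>
  <tag>out.drinksize=rules.foodsize; out.type=rules.kindofdrink;</tag>
</rule>
<rule id="pizza">
  <ruleref uri="#number"/>
  <ruleref uri="#foodsize"/>
  <tag>out.pizzasize=rules.foodsize; out.number=rules.number;</tag>
  pizzas with
  <ruleref uri="#tops"/>
  <tag>out.topping=rules.tops;</tag>
</rule>
<rule id="number">
  <one-of>
    <item>
      <tag>out=1;</tag>
      <one-of>
        <item>a</item>
        <item>one</item>
      </one-of>
    </item>
    <item>two<tag>out=2;</tag></item>
    <item>three<tag>out=3;</tag></item>
  </one-of>
</rule>

Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result:
{
  drink: { liquid: "coke", drinksize: "medium" },
  pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }
}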
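To make the flow of values through the tags easier to follow, here is a minimal Python sketch that parses the same kind of order by hand and builds the result object. It is not how a SISR processor works internally: the `Parser` class and its method names are hypothetical, named after the rule references visible in the tags (drink, pizza, foodsize, kindofdrink, number, tops, top), and it assumes that the size is optional (defaulting to "medium") and that one or more toppings joined by "and" are allowed.

```python
# Word-to-value tables, following the values assigned in the slide's tags.
SIZES = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
DRINKS = {("coke",): "coke", ("pepsi",): "pepsi", ("coca", "cola"): "coke"}
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}
TOPPINGS = {"anchovies": "anchovies", "pepperoni": "pepperoni",
            "mushroom": "mushrooms", "mushrooms": "mushrooms"}


class Parser:
    """Hypothetical hand-written parser; each method plays the role of one rule."""

    def __init__(self, text):
        self.tokens = text.lower().split()
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, *words):
        for word in words:
            if self.peek() != word:
                raise ValueError(f"expected {word!r} at token {self.pos}")
            self.pos += 1

    def lookup(self, table, what):
        word = self.peek()
        if word not in table:
            raise ValueError(f"expected {what} at token {self.pos}")
        self.pos += 1
        return table[word]

    def foodsize(self):
        # Assumption: the size is optional and defaults to "medium".
        return self.lookup(SIZES, "size") if self.peek() in SIZES else "medium"

    def kindofdrink(self):
        for words, value in DRINKS.items():
            if tuple(self.tokens[self.pos:self.pos + len(words)]) == words:
                self.pos += len(words)
                return value
        raise ValueError(f"expected a drink at token {self.pos}")

    def drink(self):
        size = self.foodsize()
        return {"liquid": self.kindofdrink(), "drinksize": size}

    def tops(self):
        result = [self.lookup(TOPPINGS, "topping")]
        while self.peek() == "and":
            self.pos += 1
            result.append(self.lookup(TOPPINGS, "topping"))
        return result

    def pizza(self):
        number = self.lookup(NUMBERS, "number")
        size = self.foodsize()
        self.expect("pizzas", "with")
        return {"number": number, "pizzasize": size, "topping": self.tops()}

    def order(self):
        self.expect("i", "would", "like", "a")
        drink = self.drink()
        self.expect("and")
        pizza = self.pizza()
        if self.peek() is not None:
            raise ValueError("unexpected words after the order")
        return {"drink": drink, "pizza": pizza}
```

Calling `Parser("I would like a coca cola and three large pizzas with pepperoni and mushrooms").order()` yields a nested dictionary with the same shape as the result shown on the slide.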
Foundational Grammar (CFG, PSG) Automata theory (FSMs, FSTs, etc) Logic Phonetics Linguistics Computer science
Speech Synthesis Markup Language (SSML) The key concepts of SSML are –interoperability, or interacting with other markup languages (VoiceXML, etc.); –consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and –internationalization, or enabling speech output in a large number of languages within or across documents.
Speech Synthesis Markup Language (SSML) – An Example For English, press one. Para español, oprima el dos.
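As a rough illustration of what such a bilingual prompt could look like as markup, the following Python sketch builds a small SSML document with the standard library. The slide's original markup is not reproduced here: the language codes "en-US" and "es", the use of one s element per sentence, and the function name `bilingual_prompt` are assumptions, and the SSML namespace declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def bilingual_prompt():
    """Build a two-sentence prompt, one sentence per language (assumed layout)."""
    speak = ET.Element("speak", {"version": "1.0", XML_LANG: "en-US"})
    english = ET.SubElement(speak, "s")
    english.text = "For English, press one."
    spanish = ET.SubElement(speak, "s", {XML_LANG: "es"})  # language switch
    spanish.text = "Para español, oprima el dos."
    return ET.tostring(speak, encoding="unicode")
```

Marking the language per sentence is what lets a synthesizer switch pronunciation rules within one document, which is the internationalization goal listed among the key concepts.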
Text Structure: p and s Elements
A p element represents a paragraph. An s element represents a sentence.
<p>
  <s>This is the first sentence of the paragraph.</s>
  <s>Here's another sentence.</s>
</p>
The phoneme Element The phoneme element provides a phonemic/phonetic pronunciation for the contained text. tomato
The sub Element The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. W3C
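The effect of sub can be sketched in Python: the written form stays in the document, while the alias is what gets spoken. The function name `spoken_text` and the alias value used in the usage note are illustrative assumptions, and a real synthesis processor does far more than this text substitution.

```python
import xml.etree.ElementTree as ET


def spoken_text(ssml):
    """Return the text to be spoken, replacing each sub element by its alias."""
    parts = []

    def walk(node):
        if node.tag == "sub" and node.get("alias") is not None:
            parts.append(node.get("alias"))  # the alias replaces the content
        else:
            parts.append(node.text or "")
            for child in node:
                walk(child)
                parts.append(child.tail or "")  # text following the child

    walk(ET.fromstring(ssml))
    return " ".join("".join(parts).split())
```

For example, `spoken_text('<speak>Welcome to the <sub alias="World Wide Web Consortium">W3C</sub> site.</speak>')` returns the sentence with the abbreviation spelled out in full.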
The voice Element The voice element is a production element that requests a change in speaking voice. A selection of attributes is: –gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral". –age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. –name: optional attribute indicating a processor-specific voice name to speak the contained text. Mary had a little lamb, Its fleece was white as snow. I want to be like Mike.
The emphasis Element The emphasis element requests that the contained text be spoken with emphasis. That is a big car! That is a huge bank account!
The break Element The break element is an empty element that controls the pausing or other prosodic boundaries between words. Take a deep breath then continue. Press 1 or wait for the tone. I didn't hear you! Please repeat.
The prosody Element
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:
–pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
–contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
–range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
–rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
–duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
–volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.
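Two of the value formats described above lend themselves to a short worked example: the time values used by duration and the numeric multiplier used by rate. The Python sketch below is illustrative only; the function names are hypothetical, and expressing the default rate in words per minute is an assumption made for the example, since the default rate is processor-specific.

```python
def parse_time(value):
    """Convert a time value such as "250ms" or "3s" to seconds."""
    text = value.strip()
    if text.endswith("ms"):  # check "ms" before "s"
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"not a time value: {value!r}")


def scaled_rate(default_rate, multiplier):
    """Apply a numeric rate value, which multiplies the voice's default rate."""
    if multiplier <= 0:
        raise ValueError("rate multiplier must be positive")
    return default_rate * multiplier
```

With an assumed default of 160 words per minute, `scaled_rate(160, 0.5)` gives half that rate and `scaled_rate(160, 2)` gives twice that rate, matching the multiplier rule above.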
The prosody Element (cont’d)
Pitch contour. The pitch contour is defined as a set of whitespace-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value).
Example text: good morning
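The (time position,target) format can be made concrete with a small parser. This Python sketch only splits a contour string into pairs; it does not interpolate, since that is processor-specific. The function name `parse_contour` is hypothetical, and the contour string in the usage note is an illustrative value, not the attribute from the slide's example.

```python
import re

# One "(NN%,target)" pair: a percentage, a comma, then a target without spaces.
PAIR = re.compile(r"\(\s*([0-9.]+)%\s*,\s*([^)\s]+)\s*\)")


def parse_contour(contour):
    """Split a contour string into (percent, target) pairs, in document order."""
    return [(float(position), target) for position, target in PAIR.findall(contour)]
```

For instance, `parse_contour("(0%,+20Hz) (10%,+30%) (40%,+10Hz)")` returns three pairs, with the targets kept as strings because they may be absolute values, relative changes, or labels.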
Today
–Project reminder
–Presentation of the results of the TTS evaluation
–Speech Synthesis Poetry Slam
–Wrapping up TTS (stages of TTS)
–Presentation of home assignment 3: ASR evaluation
–Automatic speech recognition (ASR)
–Natural language understanding (NLU)
–Speech Recognition Grammar Specification (SRGS)
–Semantic Interpretation for Speech Recognition (SISR)
–Thursday's Lab session
Architecture 1
Wrapping up TTS
Stages of TTS:
–Structure analysis (sentence splitting)
–Text normalisation
–Text to phoneme conversion
–Prosody analysis
–Waveform production
Speech Synthesis Markup Language
–enables developers to override default behavior
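TTS stages and SSML elements
Stage | SSML elements
Structure analysis (sentence splitting) | p, s, and the punctuation ., ?, !
Text normalisation | say-as, sub
Text to phoneme conversion | phoneme
Prosody analysis | emphasis, break, prosody, and the punctuation ., ?, !
Waveform production | voice, audio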
Prosody analysis
Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses.
Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on the parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be unstressed. The synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory.
Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules for different spoken languages and varieties may be quite different. For example, the prosody of American, British, Indian, and Jamaican pronunciations of English differs.
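The part-of-speech heuristic mentioned above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the algorithm of any particular TTS engine: the tag names (NOUN, VERB, ADJ, AUX, and so on) and the function name `mark_accents` are assumptions, and the input is expected to be tagged already.

```python
# Assumed tag set: content-word categories that receive an accent.
ACCENTED_TAGS = {"NOUN", "VERB", "ADJ"}


def mark_accents(tagged_words):
    """Pair each word with True if its part of speech is accented."""
    return [(word, tag in ACCENTED_TAGS) for word, tag in tagged_words]
```

Given `[("the", "DET"), ("dog", "NOUN"), ("has", "AUX"), ("eaten", "VERB")]`, only "dog" and "eaten" are marked as accented.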
Speech Recognition (ASR)
Architecture 1
ASR Input and Output
A speech recognizer is a component with the following inputs and outputs:
Input
–A grammar or multiple grammars as defined by the SRGS specification. These grammars inform the recognizer of the words and patterns of words to listen for.
–An audio stream that may contain speech content that matches the grammar(s).
–Parameters: timeouts, recognition thresholds, or N-best result counts.
Output
–Descriptions of results that indicate details about the speech content detected by the speech recognizer. Recognizers will include at least a transcription of any detected words.
–Errors and other performance information such as confidence.
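How the threshold and N-best parameters shape the output can be sketched as follows. This Python example does no recognition; it only filters and ranks hypotheses that a recognizer is assumed to have produced already. The `Hypothesis` class and `select_n_best` function are hypothetical names, and the confidence scale from 0 to 1 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    transcription: str  # the detected words
    confidence: float   # assumed to lie between 0 and 1


def select_n_best(hypotheses, threshold, n_best):
    """Keep hypotheses at or above the threshold, best first, at most n_best."""
    accepted = [h for h in hypotheses if h.confidence >= threshold]
    accepted.sort(key=lambda h: h.confidence, reverse=True)
    return accepted[:n_best]
```

An empty result then corresponds to the case where nothing was recognized with enough confidence, which a dialog typically handles by asking the user to repeat.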
hello
s -> "hello"

SRGS
hello goodbye
s -> "hello"
s -> "goodbye"
or, equivalently:
s -> "hello" | "goodbye"

SRGS
hello how are you
s -> "hello" ("how are you")
SRGS
hello
s -> "hello"
s -> "hello" s
or, equivalently:
s -> "hello"+
NOTE: Listing all sentences is no longer possible, because the recursive rule generates infinitely many.
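The point of the note can be seen by enumerating the language of s -> "hello"+ up to a length bound: any finite list is only a prefix of the full set. The Python generator below is a sketch with a hypothetical name, `sentences`.

```python
def sentences(max_words):
    """Yield the sentences of s -> "hello"+ that have at most max_words words."""
    for count in range(1, max_words + 1):
        yield " ".join(["hello"] * count)
```

`list(sentences(3))` gives "hello", "hello hello" and "hello hello hello"; raising the bound always adds more sentences, so the grammar has to be described by rules rather than by a list.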
SRGS hello goodbye
SRGS
hello goodbye
s -> greeting+
greeting -> "hello" | "goodbye"
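Because this grammar has no nesting, the set of word sequences it accepts can also be described by a regular expression. The Python sketch below checks a transcription against the two rules; the function name `accepts` is hypothetical, and a real recognizer matches audio against the grammar rather than text.

```python
import re

# greeting+ : one or more greetings separated by single spaces.
GREETINGS = re.compile(r"(?:hello|goodbye)(?: (?:hello|goodbye))*")


def accepts(utterance):
    """Return True if the word sequence is generated by s -> greeting+."""
    normalized = " ".join(utterance.lower().split())
    return GREETINGS.fullmatch(normalized) is not None
```

So `accepts("hello goodbye hello")` holds, while an empty string or a sequence containing any other word is rejected.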
SRGS Boston Philadelphia Fargo Florida North Dakota New York
SRGS + SISR hello hello hi
SRGS + SISR
Each phrase the user may say is mapped to one of two values:
–"yes", "yeah", "you bet", "oui" → yes
–"no", "nope", "no way" → no
SISR
Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result:
{
  drink: { liquid: "coke", drinksize: "medium" },
  pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }
}
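<rule id="order">
  I would like a
  <ruleref uri="#drink"/>
  <tag>out.drink={}; out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize;</tag>
  and
  <ruleref uri="#pizza"/>
  <tag>out.pizza=rules.pizza;</tag>
</rule>
<rule id="kindofdrink">
  <one-of>
    <item>coke</item>
    <item>pepsi</item>
    <item>coca cola<tag>out="coke";</tag></item>
  </one-of>
</rule>
<rule id="foodsize">
  <tag>out="medium";</tag>
  <item repeat="0-1">
    <one-of>
      <item>small<tag>out="small";</tag></item>
      <item>medium</item>
      <item>large<tag>out="large";</tag></item>
      <item>regular<tag>out="medium";</tag></item>
    </one-of>
  </item>
</rule>
<rule id="tops">
  <tag>out=[];</tag>
  <ruleref uri="#top"/>
  <tag>out.push(rules.top);</tag>
  <item repeat="1-">
    and
    <ruleref uri="#top"/>
    <tag>out.push(rules.top);</tag>
  </item>
</rule>
<rule id="top">
  <one-of>
    <item>anchovies</item>
    <item>pepperoni</item>
    <item>mushroom<tag>out="mushrooms";</tag></item>
    <item>mushrooms</item>
  </one-of>
</rule>
<rule id="drink">
  <ruleref uri="#foodsize"/>
  <ruleref uri="#kindofdrink"/>
  <tag>out.drinksize=rules.foodsize; out.type=rules.kindofdrink;</tag>
</rule>
<rule id="pizza">
  <ruleref uri="#number"/>
  <ruleref uri="#foodsize"/>
  <tag>out.pizzasize=rules.foodsize; out.number=rules.number;</tag>
  pizzas with
  <ruleref uri="#tops"/>
  <tag>out.topping=rules.tops;</tag>
</rule>
<rule id="number">
  <one-of>
    <item>
      <tag>out=1;</tag>
      <one-of>
        <item>a</item>
        <item>one</item>
      </one-of>
    </item>
    <item>two<tag>out=2;</tag></item>
    <item>three<tag>out=3;</tag></item>
  </one-of>
</rule>

Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result:
{
  drink: { liquid: "coke", drinksize: "medium" },
  pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }
}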