1
Improving Dialogs with EMMA
Deborah A. Dahl
Principal, Conversational Technologies
Chair, W3C Multimodal Interaction Working Group
SpeechTEK 2009, August 24-27, 2009, New York
2
What information does speech recognition contribute to a dialog system?
3
What the person said and what it means
The basics: What the person said and what it means
But there's a lot more information available!
VoiceXML represents some of this:
Other alternatives
Confidence
4
Information about the Context
When they said it
How long it took to say it
Other possibilities about what they might have said
The recognizer's confidence
This additional information can be extremely useful
We'll talk about a few ideas today
5
How can we get more information?
Low-level recognition APIs like SAPI or JSAPI can provide a lot of other information
But not all recognizers support these APIs
They can be very complex to use
They only support speech, not multimodal inputs
EMMA can provide much more information
6
A New W3C Standard: EMMA
EMMA (Extensible MultiModal Annotation) provides a standard, XML-based, multimodal way to represent detailed information about user inputs and their contexts, from speech, handwriting, typing, biometrics, haptics, and many other modalities
7
EMMA adds
Timestamps
Processor
Source
Signal
Endpoints
Application-specific information
Grammar
Groupings of related inputs
Stages of interpretation: speech, natural language
Alternatives: nbest or lattices
8
An EMMA Document from a Speech Recognizer
<emma:emma version="1.0" xmlns=" xmlns:emma=" xmlns:xsi=" xsi:schemaLocation=" <emma:info> <application>music</application> </emma:info> <emma:one-of emma:dialog-turn="1" emma:duration="1860" emma:end=" " emma:function="dialog" emma:grammar-ref="gram-4" emma:lang="en-us" emma:medium="acoustic" emma:mode="speech" emma:start=" " emma:verbal="true" id="oneof2"> <emma:interpretation emma:confidence=" " emma:tokens="Beethoven third symphony" id="interp10"> <composer>ludwig_van_beethoven</composer> <name>opus 55</name> </emma:interpretation> <emma:interpretation emma:confidence=" " emma:tokens="Beethoven's ninth symphony" id="interp8"> <name>opus 125</name> </emma:one-of> </emma:emma>
9
Not very human-friendly…
10
But, since it’s an XML language,
many standard tools can process it and provide useful visualizations
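For example, a few lines of Python with the standard-library ElementTree are enough to pull the tokens and confidence out of each interpretation. This is a minimal sketch, assuming a document like the one above has been saved to a file; the file name music_query.xml is hypothetical:

    import xml.etree.ElementTree as ET

    # The EMMA 1.0 namespace, used to qualify element and attribute names
    EMMA_NS = "{http://www.w3.org/2003/04/emma}"

    def interpretations(emma_file):
        """Yield (tokens, confidence) for each emma:interpretation."""
        tree = ET.parse(emma_file)
        for interp in tree.iter(EMMA_NS + "interpretation"):
            yield interp.get(EMMA_NS + "tokens"), interp.get(EMMA_NS + "confidence")

    for tokens, confidence in interpretations("music_query.xml"):
        print(tokens, confidence)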
11
How can we use the additional information available in EMMA to improve dialogs?
12
1. Improving dialogs in real time
Example: timestamps, along with the words spoken, can be used to compute speech rate
The dialog can then be sped up or slowed down to accommodate the user
13
Log Analysis
Timestamps can tell you how long it took for the person to start talking after the end of the prompt
A longer time might indicate that the person was confused by the prompt
Looking at confidence across semantics could indicate problems with specific words
14
Test Example
300 EMMA documents from a demo music-playing application
Sample queries: "Play something by Beethoven", "I'd like to hear Mozart", "Brandenburg Concertos"
EMMA documents imported into Excel 2007
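One way to get a batch of EMMA documents into spreadsheet form is a short conversion script. A minimal sketch in Python, assuming the documents sit in one directory and only a few annotations are wanted; the names emma_logs and queries.csv are hypothetical:

    import csv
    import glob
    import xml.etree.ElementTree as ET

    EMMA_NS = "{http://www.w3.org/2003/04/emma}"

    with open("queries.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "duration", "tokens", "confidence"])
        for path in glob.glob("emma_logs/*.xml"):
            root = ET.parse(path).getroot()
            # Take the first interpretation in each document
            # (typically the top-scoring one)
            interp = root.find(".//" + EMMA_NS + "interpretation")
            oneof = root.find(EMMA_NS + "one-of")
            if interp is None:
                continue
            writer.writerow([
                path,
                oneof.get(EMMA_NS + "duration") if oneof is not None else "",
                interp.get(EMMA_NS + "tokens"),
                interp.get(EMMA_NS + "confidence"),
            ])

The resulting CSV file opens directly in Excel.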
15
File of Music Queries in Excel
[Excel screenshot: columns Duration, Start, End, Confidence, Tokens, Composer, Action, Name, Artist; sample rows include the tokens "something by Beethoven", "I want to hear something by Beethoven", and "I want to listen to Beethoven", with composer ludwig_van_beethoven and action play]
16
Problem: Speech Rate
Some users think that the application speaks too slowly
Other users think that it speaks too fast
If we dynamically adjust the system's speech rate to the user's speech rate, we can accommodate both kinds of users
17
Speech Rate Data from Music Example
[Histogram of users' speech rates in words per minute]
A bimodal distribution may point to two kinds of users
Match the UI to different kinds of users: slow down for novices, speed up for experts
18
Calculating Users’ Speech Rate in real time from EMMA
EMMA provides the words that the user spoke and the duration of the user's speech
Number of tokens / duration in minutes = words per minute
We can measure the user's speech rate and match the system's speech rate in real time
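As a sketch of the arithmetic, assuming emma:tokens holds the recognized words and emma:duration is in milliseconds, as in the example document earlier:

    def words_per_minute(tokens, duration_ms):
        """Speech rate from the emma:tokens string and the
        emma:duration annotation (milliseconds)."""
        return len(tokens.split()) / (duration_ms / 60000.0)

    # "Beethoven third symphony" spoken over 1860 ms is about 97 wpm
    rate = words_per_minute("Beethoven third symphony", 1860)

The system's own speech rate can then be nudged toward this value on each turn.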
19
Log Analysis
Simple yet powerful log analysis can be done by using EMMA with common tools like spreadsheets
20
Problem: Application Performance in a Specific State has Deteriorated
More misrecognitions
More noinputs
21
EMMA timestamps tell when the user has started speaking
If we know when the prompt ended, we can measure the lag between the end of the prompt and the beginning of speech
Longer times indicate uncertainty
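A minimal sketch of the measurement, assuming the prompt's end time is logged for each turn and that emma:start is an absolute timestamp in milliseconds:

    def response_lag_ms(prompt_end_ms, emma_start_ms):
        """Lag between the end of the prompt and the start of speech.
        Negative values mean the user barged in over the prompt."""
        return emma_start_ms - prompt_end_ms

    # Hypothetical (prompt end, emma:start) pairs for two turns
    turns = [(1251122334000, 1251122335200),   # answered 1.2 s after the prompt
             (1251122390000, 1251122389500)]   # barge-in, 0.5 s before the end
    lags = [response_lag_ms(end, start) for end, start in turns]

Unusually long lags, or a cluster of ASR timeouts, point at a prompt worth rewriting.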
22
Original Distribution of Start of Speech vs. Prompt
[Histogram of start of speech relative to the prompt, with markers for barge-in, the length of the full prompt, and the ASR timeout]
23
Distribution with New Prompt
[The same histogram for the new prompt, with markers for barge-in, the length of the full prompt, and the ASR timeout]
Uh oh! People are waiting too long to speak!
The ASR is timing out too often
24
Conclusion – the new prompt may be complex or confusing
25
Another Example: Confidence for Different Semantics
What responses have consistently lower confidences?
Can the grammar or dictionary be tuned, or prompts clarified, to get better inputs?
26
Problem: Accuracy needs to be improved
27
EMMA makes it easy to compare confidences for different semantics
Does the problem lie just with certain requests? If so, are those requests frequent?
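One sketch of the comparison, assuming (semantic value, confidence) pairs have already been pulled from the EMMA logs, for instance with the CSV conversion shown earlier; the values below are hypothetical:

    from collections import defaultdict
    from statistics import mean

    rows = [("ludwig_van_beethoven", 0.42), ("ludwig_van_beethoven", 0.48),
            ("wolfgang_amadeus_mozart", 0.81), ("johann_sebastian_bach", 0.79)]

    by_semantic = defaultdict(list)
    for semantic, confidence in rows:
        by_semantic[semantic].append(confidence)

    # A low average confidence on a frequent value is worth a closer look
    for semantic, confs in sorted(by_semantic.items()):
        print(semantic, "count:", len(confs), "mean confidence:", round(mean(confs), 2))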
28
Confidence for Different Semantics
Should there be a dictionary entry for “Beethoven”?
29
Frequency of Requests
Johann Christian Bach: 12%
Wolfgang Amadeus Mozart: 14%
Johann Sebastian Bach: 8%
Ludwig van Beethoven: 66%
30
So, we do want to look more closely at Beethoven!
31
Conclusion The additional information in EMMA can be useful for both real time and later analysis
32
There’s still more information in EMMA!
Alternatives: the nbest list is traditional
"I want to hear beethoven"
"play beethoven"
"play bach"
"I'd like mozart"
But nbest is very verbose and coarse-grained
33
Two Ways to Represent Alternatives in EMMA
Traditional nbest
Lattice: a compact representation of many alternatives
34
A lattice can provide detailed information about each word or concept, including
Start and end time
Confidence
Semantic interpretation
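A sketch of how such a lattice might be traversed, using lattice markup in the style of the EMMA 1.0 specification (emma:arc elements with from/to node attributes, annotated here with emma:confidence); the words and scores are hypothetical:

    import xml.etree.ElementTree as ET
    from functools import lru_cache

    EMMA_NS = "{http://www.w3.org/2003/04/emma}"

    DOC = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:interpretation id="lat1">
        <emma:lattice initial="1" final="4">
          <emma:arc from="1" to="2" emma:confidence="0.9">play</emma:arc>
          <emma:arc from="2" to="3" emma:confidence="0.6">beethoven</emma:arc>
          <emma:arc from="2" to="3" emma:confidence="0.3">bach</emma:arc>
          <emma:arc from="3" to="4" emma:confidence="0.8">symphonies</emma:arc>
        </emma:lattice>
      </emma:interpretation>
    </emma:emma>"""

    lattice = ET.fromstring(DOC).find(".//" + EMMA_NS + "lattice")
    final = int(lattice.get("final"))
    arcs = [(int(a.get("from")), int(a.get("to")),
             float(a.get(EMMA_NS + "confidence")), a.text)
            for a in lattice.findall(EMMA_NS + "arc")]

    @lru_cache(maxsize=None)
    def best(node):
        """Highest-confidence word sequence from node to the final node."""
        if node == final:
            return 1.0, ()
        options = [(conf * best(to)[0], (word,) + best(to)[1])
                   for frm, to, conf, word in arcs if frm == node]
        return max(options) if options else (0.0, ())

    score, words = best(int(lattice.get("initial")))
    print(" ".join(words), score)  # prints: play beethoven symphonies ~0.432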
35
Finally, Extensions
Information not in standard EMMA can be added using <emma:info>
  <emma:info>
    <state>ask_for_music</state>
  </emma:info>
  <prompt start=" " end=" " tokens="what music would you like?"/>
36
Why do we need a standard?
These kinds of analyses can be done in proprietary ways, but are much easier with a standard
With a standard, you aren't tied to a certain recognizer's analysis tools
Third-party analysis tools are feasible
37
Organizations with EMMA Implementations
AT&T
Avaya
Conversational Technologies
Deutsche Telekom
DFKI
Kyoto Institute of Technology
Loquendo
Microsoft
Nuance
University of Trento
38
Available Implementations
AT&T Speech Mashup -- Cloud-based ASR Conversational Technologies NLWorkbench– tools for illustrating principles of natural language processing At SpeechTEK, Thursday’s Natural Language Processing Tutorial
39
More Information
EMMA specification: http://www.w3.org/TR/emma/
40
Summary
EMMA provides a rich, standard, and easy-to-use representation of users' inputs
This information can be exploited to improve dialogs
Improvements can be made both in real time and after the fact
We've seen a few examples, but there are many more possibilities