1
Improving Dialogs with EMMA
Deborah A. Dahl
Principal, Conversational Technologies
Chair, W3C Multimodal Interaction Working Group
SpeechTEK 2009, August 24-27, 2009, New York
2
What information does speech recognition contribute to a dialog system?
3
What the person said and what it means
The basics: What the person said and what it means
But there's a lot more information available!
VoiceXML represents some of this:
Other alternatives
Confidence
4
Information about the Context
When they said it
How long it took to say it
Other possibilities about what they might have said
The recognizer's confidence
This additional information can be extremely useful
We'll talk about a few ideas today
5
How can we get more information?
Low-level recognition APIs like SAPI or JSAPI can provide a lot of other information
But not all recognizers support these APIs
They can be very complex to use
They only support speech, not multimodal inputs
EMMA can provide much more information
6
A New W3C Standard: EMMA
EMMA (Extensible MultiModal Annotation) provides a standard, XML-based, multimodal way to represent detailed information about user inputs and their contexts, from speech, handwriting, typing, biometrics, haptics, and many other modalities
7
EMMA adds
Timestamps
Processor
Source
Signal
Endpoints
Application-specific information
Grammar
Groupings of related inputs
Stages of interpretation: speech, natural language
Alternatives: nbest or lattices
8
An EMMA Document from a Speech Recognizer
<emma:emma version="1.0" xmlns=" xmlns:emma=" xmlns:xsi=" xsi:schemaLocation=" <emma:info> <application>music</application> </emma:info> <emma:one-of emma:dialog-turn="1" emma:duration="1860" emma:end=" " emma:function="dialog" emma:grammar-ref="gram-4" emma:lang="en-us" emma:medium="acoustic" emma:mode="speech" emma:start=" " emma:verbal="true" id="oneof2"> <emma:interpretation emma:confidence=" " emma:tokens="Beethoven third symphony" id="interp10"> <composer>ludwig_van_beethoven</composer> <name>opus 55</name> </emma:interpretation> <emma:interpretation emma:confidence=" " emma:tokens="Beethoven's ninth symphony" id="interp8"> <name>opus 125</name> </emma:one-of> </emma:emma>
9
Not very human-friendly…
10
But, since it’s an XML language,
many standard tools can process it and provide useful visualizations
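For example, a few lines of Python with the standard-library ElementTree are enough to pull the tokens and confidence out of each interpretation. This is a minimal sketch, assuming a document like the one above has been saved to a file; the file name music_query.xml is hypothetical:

    import xml.etree.ElementTree as ET

    # The EMMA 1.0 namespace, used to qualify element and attribute names
    EMMA_NS = "{http://www.w3.org/2003/04/emma}"

    def interpretations(emma_file):
        """Yield (tokens, confidence) for each emma:interpretation."""
        tree = ET.parse(emma_file)
        for interp in tree.iter(EMMA_NS + "interpretation"):
            yield interp.get(EMMA_NS + "tokens"), interp.get(EMMA_NS + "confidence")

    for tokens, confidence in interpretations("music_query.xml"):
        print(tokens, confidence)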
11
How can we use the additional information available in EMMA to improve dialogs?
12
1. Improving dialogs in real time
Example: timestamps, along with the words spoken, can be used to compute speech rate
The dialog can then be sped up or slowed down to accommodate the user
13
Log Analysis
Timestamps can tell you how long it took for the person to start talking after the end of the prompt
A longer time might indicate that the person was confused by the prompt
Looking at confidence across semantics could indicate problems with specific words
14
Test Example
300 EMMA documents from a demo music-playing application
Sample queries: "Play something by Beethoven", "I'd like to hear Mozart", "Brandenburg Concertos"
EMMA documents imported into Excel 2007
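One way to get a batch of EMMA documents into spreadsheet form is a short conversion script. A minimal sketch in Python, assuming the documents sit in one directory and only a few annotations are wanted; the names emma_logs and queries.csv are hypothetical:

    import csv
    import glob
    import xml.etree.ElementTree as ET

    EMMA_NS = "{http://www.w3.org/2003/04/emma}"

    with open("queries.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "duration", "tokens", "confidence"])
        for path in glob.glob("emma_logs/*.xml"):
            root = ET.parse(path).getroot()
            # Take the first interpretation in each document
            # (typically the top-scoring one)
            interp = root.find(".//" + EMMA_NS + "interpretation")
            oneof = root.find(EMMA_NS + "one-of")
            if interp is None:
                continue
            writer.writerow([
                path,
                oneof.get(EMMA_NS + "duration") if oneof is not None else "",
                interp.get(EMMA_NS + "tokens"),
                interp.get(EMMA_NS + "confidence"),
            ])

The resulting CSV file opens directly in Excel.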
15
File of Music Queries in Excel
[Excel screenshot: columns Duration, Start, End, Confidence, Tokens, Composer, Action, Name, Artist; sample rows include the tokens "something by Beethoven", "I want to hear something by Beethoven", and "I want to listen to Beethoven", with composer ludwig_van_beethoven and action play]
16
Problem: Speech Rate
Some users think that the application speaks too slowly
Other users think that it speaks too fast
If we dynamically adjust the system's speech rate to the user's speech rate, we can accommodate both kinds of users
17
Speech Rate Data from Music Example
[Histogram of users' speech rates in words per minute]
A bimodal distribution may point to two kinds of users
Match the UI to different kinds of users: slow down for novices, speed up for experts
18
Calculating Users’ Speech Rate in real time from EMMA
EMMA provides the words that the user spoke and the duration of the user's speech
Number of tokens / duration in minutes = words per minute
We can measure the user's speech rate and match the system's speech rate in real time
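As a sketch of the arithmetic, assuming emma:tokens holds the recognized words and emma:duration is in milliseconds, as in the example document earlier:

    def words_per_minute(tokens, duration_ms):
        """Speech rate from the emma:tokens string and the
        emma:duration annotation (milliseconds)."""
        return len(tokens.split()) / (duration_ms / 60000.0)

    # "Beethoven third symphony" spoken over 1860 ms is about 97 wpm
    rate = words_per_minute("Beethoven third symphony", 1860)

The system's own speech rate can then be nudged toward this value on each turn.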
19
Log Analysis
Simple yet powerful log analysis can be done by using EMMA with common tools like spreadsheets
20
Problem: Application Performance in a Specific State has Deteriorated
More misrecognitions
More noinputs
21
EMMA timestamps tell when the user has started speaking
If we know when the prompt ended, we can measure the lag between the end of the prompt and the beginning of speech
Longer times indicate uncertainty
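A minimal sketch of the measurement, assuming the prompt's end time is logged for each turn and that emma:start is an absolute timestamp in milliseconds:

    def response_lag_ms(prompt_end_ms, emma_start_ms):
        """Lag between the end of the prompt and the start of speech.
        Negative values mean the user barged in over the prompt."""
        return emma_start_ms - prompt_end_ms

    # Hypothetical (prompt end, emma:start) pairs for two turns
    turns = [(1251122334000, 1251122335200),   # answered 1.2 s after the prompt
             (1251122390000, 1251122389500)]   # barge-in, 0.5 s before the end
    lags = [response_lag_ms(end, start) for end, start in turns]

Unusually long lags, or a cluster of ASR timeouts, point at a prompt worth rewriting.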
22
Original Distribution of Start of Speech vs. Prompt
[Histogram of start of speech relative to the prompt, with markers for barge-in, the length of the full prompt, and the ASR timeout]
23
Distribution with New Prompt
[The same histogram for the new prompt, with markers for barge-in, the length of the full prompt, and the ASR timeout]
Uh oh! People are waiting too long to speak!
The ASR is timing out too often
24
Conclusion – the new prompt may be complex or confusing
25
Another Example: Confidence for Different Semantics
What responses have consistently lower confidences?
Can the grammar or dictionary be tuned, or prompts clarified, to get better inputs?
26
Problem: Accuracy needs to be improved
27
EMMA makes it easy to compare confidences for different semantics
Does the problem lie just with certain requests? If so, are those requests frequent?
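One sketch of the comparison, assuming (semantic value, confidence) pairs have already been pulled from the EMMA logs, for instance with the CSV conversion shown earlier; the values below are hypothetical:

    from collections import defaultdict
    from statistics import mean

    rows = [("ludwig_van_beethoven", 0.42), ("ludwig_van_beethoven", 0.48),
            ("wolfgang_amadeus_mozart", 0.81), ("johann_sebastian_bach", 0.79)]

    by_semantic = defaultdict(list)
    for semantic, confidence in rows:
        by_semantic[semantic].append(confidence)

    # A low average confidence on a frequent value is worth a closer look
    for semantic, confs in sorted(by_semantic.items()):
        print(semantic, "count:", len(confs), "mean confidence:", round(mean(confs), 2))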
28
Confidence for Different Semantics
Should there be a dictionary entry for “Beethoven”?
29
Frequency of Requests
Johann Christian Bach: 12%
Wolfgang Amadeus Mozart: 14%
Johann Sebastian Bach: 8%
Ludwig van Beethoven: 66%
30
So, we do want to look more closely at Beethoven!
31
Conclusion The additional information in EMMA can be useful for both real time and later analysis
32
There’s still more information in EMMA!
Alternatives: the nbest list is traditional
"I want to hear beethoven"
"play beethoven"
"play bach"
"I'd like mozart"
But nbest is very verbose and coarse-grained
33
Two Ways to Represent Alternatives in EMMA
Traditional nbest
Lattice: a compact representation of many alternatives
34
A lattice can provide detailed information about each word or concept, including
Start and end time
Confidence
Semantic interpretation
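A sketch of how such a lattice might be traversed, using lattice markup in the style of the EMMA 1.0 specification (emma:arc elements with from/to node attributes, annotated here with emma:confidence); the words and scores are hypothetical:

    import xml.etree.ElementTree as ET
    from functools import lru_cache

    EMMA_NS = "{http://www.w3.org/2003/04/emma}"

    DOC = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:interpretation id="lat1">
        <emma:lattice initial="1" final="4">
          <emma:arc from="1" to="2" emma:confidence="0.9">play</emma:arc>
          <emma:arc from="2" to="3" emma:confidence="0.6">beethoven</emma:arc>
          <emma:arc from="2" to="3" emma:confidence="0.3">bach</emma:arc>
          <emma:arc from="3" to="4" emma:confidence="0.8">symphonies</emma:arc>
        </emma:lattice>
      </emma:interpretation>
    </emma:emma>"""

    lattice = ET.fromstring(DOC).find(".//" + EMMA_NS + "lattice")
    final = int(lattice.get("final"))
    arcs = [(int(a.get("from")), int(a.get("to")),
             float(a.get(EMMA_NS + "confidence")), a.text)
            for a in lattice.findall(EMMA_NS + "arc")]

    @lru_cache(maxsize=None)
    def best(node):
        """Highest-confidence word sequence from node to the final node."""
        if node == final:
            return 1.0, ()
        options = [(conf * best(to)[0], (word,) + best(to)[1])
                   for frm, to, conf, word in arcs if frm == node]
        return max(options) if options else (0.0, ())

    score, words = best(int(lattice.get("initial")))
    print(" ".join(words), score)  # prints: play beethoven symphonies ~0.432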
35
Finally, Extensions
Information not in standard EMMA can be added using <emma:info>
  <emma:info>
    <state>ask_for_music</state>
  </emma:info>
  <prompt start=" " end=" " tokens="what music would you like?"/>
36
Why do we need a standard?
These kinds of analyses can be done in proprietary ways, but are much easier with a standard
With a standard, you aren't tied to a certain recognizer's analysis tools
Third-party analysis tools are feasible
37
Organizations with EMMA Implementations
AT&T
Avaya
Conversational Technologies
Deutsche Telekom
DFKI
Kyoto Institute of Technology
Loquendo
Microsoft
Nuance
University of Trento
38
Available Implementations
AT&T Speech Mashup -- Cloud-based ASR Conversational Technologies NLWorkbench– tools for illustrating principles of natural language processing At SpeechTEK, Thursday’s Natural Language Processing Tutorial
39
More Information
EMMA specification: http://www.w3.org/TR/emma/
40
Summary
EMMA provides a rich, standard, and easy-to-use representation of users' inputs
This information can be exploited to improve dialogs
Improvements can be made both in real time and after the fact
We've seen a few examples, but there are many more possibilities