
1 Tuning Your Application: The Job’s Not Done at Deployment
Monday, February 3, 2003
Developing Applications: It’s Not Just the Dialog

2 Presentation Overview
1. Potential “Non-Recognizer” Errors
2. System Evaluation Metrics
3. Tuning/Testing Tools
Speech applications face perception problems that revolve around numerous “non-recognizer” errors.

3 Types of Potential Errors
Grammar Errors
Prompt Errors
User Errors
Pronunciation Errors

4 Types of Potential Errors: Grammar Errors
Program Error – Application fails to perform as designed (perhaps a bug or incorrect linkage)
Prediction Error – Grammar set does not accurately reflect all the words a caller might use (see the sketch below)
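Prediction errors can be surfaced from call logs. A minimal Python sketch, assuming a hypothetical grammar vocabulary and invented caller utterances (both placeholders for whatever your platform actually records):

```python
# Hedged sketch: surfacing grammar "prediction errors" by checking
# logged caller utterances against the words the grammar accepts.
# GRAMMAR_WORDS and the utterances are invented examples.

GRAMMAR_WORDS = {"checking", "savings", "balance", "transfer", "operator"}

logged_utterances = [
    "balance please",       # "please" was not anticipated
    "uh checking account",  # "uh" and "account" were not anticipated
    "transfer",
]

for utt in logged_utterances:
    missing = [w for w in utt.split() if w not in GRAMMAR_WORDS]
    if missing:
        print(f"possible prediction error in {utt!r}: {missing}")
```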

5 Types of Potential Errors: Prompt Errors
Vague Prompts – Prompt doesn’t provide enough information to help the caller complete their goal
Redundant Prompts – Prompt continually repeats information that can easily be inferred
Lengthy Prompts – Prompt is so long that the caller loses track of their options
Misleading Prompts – Prompt presents choices whose language may lead a caller to an incorrect/inappropriate response

6 Types of Potential Errors
Prompt/Grammar Coordination Problems – Prompts and grammar sets do not match
Pronunciation Variation – Unanticipated pronunciations cause mismatches

7 Types of Potential Errors: Issues on the Caller’s Side
Loud background noise
Speech directed at a person in the background instead of the system
Bad phone or connection
Unintentional speech (like exclamations) or speech-like noise (coughs, breaths)

8 System Evaluation Metrics
1. Non-Programmatic Evaluations
Customer Satisfaction
Word Error Rate (WER)
Dialog System Metrics
2. Programmatic Evaluations
PARADISE
M. A. Walker, D. Litman, C. A. Kamm, and A. Abella. PARADISE: A general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, ACL/EACL 97, pages 271–280, 1997.

9 Non-Programmatic Evaluation: Customer Satisfaction
Relies on customer surveys
Poses problems with accuracy and missing/inaccurate information
Low response rate
Typically the most neglected evaluation, yet the most important

10 Non-Programmatic Evaluation: Word Error Rate (WER) / Word Accuracy (WA)
WER measures the percentage of words incorrectly recognized (substitutions, deletions, and insertions, relative to the number of words in the reference transcript)
WA measures the percentage of words correctly recognized, regardless of insertions, deletions, etc.
WER and WA do not measure CPU load, response times, task completion rates, etc. – all critical measures for a dialog system
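For concreteness, a minimal sketch of the standard WER computation (word-level Levenshtein edit distance); the sample utterances are invented:

```python
# WER = (substitutions + deletions + insertions) / words in reference,
# computed via word-level edit distance between reference and hypothesis.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("transfer me to billing", "transfer me a building"))
# 2 substitutions over 4 reference words -> 0.5 (WA would be 50%)
```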

11 Non-Programmatic Evaluation: Dialog System Metrics
Unlike customer satisfaction and WER, these are frequency counts, which generally require someone to tally events from the call logs (see the sketch after this list)
Some typical questions:
How often do callers hang up in the middle of the call flow?
How often/soon do callers “pound out” to an operator?
How often does a caller use a word not in the grammar?
How many times does the system ask the caller for confirmation, or the caller corrects the system?
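A hedged sketch of tallying these counts from call logs. The event names ("hangup", "pound_out", "no_match", "confirm") and the log format are hypothetical; substitute whatever events your platform emits:

```python
from collections import Counter

def tally_dialog_metrics(calls):
    """calls: one list of event names per call, in order of occurrence."""
    totals = Counter()
    for events in calls:
        totals["calls"] += 1
        if "hangup" in events:
            totals["mid_call_hangups"] += 1
        if "pound_out" in events:
            totals["pound_outs"] += 1
        totals["out_of_grammar"] += events.count("no_match")  # word not in grammar
        totals["confirmations"] += events.count("confirm")
    return totals

logs = [
    ["prompt", "no_match", "confirm", "hangup"],
    ["prompt", "confirm", "transfer"],
    ["prompt", "pound_out"],
]
t = tally_dialog_metrics(logs)
print(f"mid-call hang-up rate: {t['mid_call_hangups'] / t['calls']:.0%}")  # 33%
```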

12 Programmatic Evaluation: Research Approach – PARADISE
PARADISE is a complex, research-oriented approach that performs the same analysis as the non-programmatic approaches. It combines customer satisfaction surveys with detailed statistical models to produce a measure of dialog system performance.

13 Programmatic Evaluation: Things You Can Do in PARADISE
Measure performance correlations between customer satisfaction and different dialog strategies
Measure performance correlations between parts of the system and the system as a whole
Measure performance correlations between very simple and very complex dialog systems/tasks
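As a sketch of the statistical machinery: Walker et al. (1997) model performance as a weighted combination of a task-success measure and normalized cost measures, with the weights fit by multiple linear regression against survey-based user satisfaction. The per-dialogue data below is invented for illustration:

```python
import numpy as np

# Per-dialogue predictors: a task-success measure and a cost measure
# (here, number of turns), plus the survey satisfaction score to fit.
task_success = np.array([0.9, 0.4, 0.8, 0.2, 0.7])
num_turns    = np.array([6, 14, 8, 18, 9])
satisfaction = np.array([4.5, 2.0, 4.0, 1.5, 3.5])  # survey scores

def zscore(x):
    return (x - x.mean()) / x.std()  # PARADISE normalizes each measure

# Design matrix: [intercept, N(task success), N(turns)]
X = np.column_stack([np.ones(5), zscore(task_success), zscore(num_turns)])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(weights)  # intercept, payoff of task success, cost of extra turns
```

The fitted weights tell you how much each factor contributes to satisfaction, which is what lets PARADISE compare dialog strategies and systems on a common scale.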

14 Programmatic vs. Non-Programmatic Solutions
PARADISE requires a fairly sophisticated level of programming support, as well as extensive knowledge of statistical analysis
The non-programmatic metrics are relatively straightforward to calculate and require very little programming

15 Tuning/Testing Tools
Does the speech recognition company you currently use, or plan on using, provide tuning/testing tools?
Will the tools be detailed enough to allow someone within your company to evaluate potential errors and general caller satisfaction?
If these tools are not available, your company will need to rely upon caller feedback and the diligence of your application provider.

16 Thank You!

