Dialog Management for Rapid-Prototyping of Speech-Based Training Agents Victor Hung, Avelino Gonzalez, Ronald DeMara University of Central Florida

Agenda
Introduction
Approach
Evaluation
Results
Conclusions

Introduction
General Problem – Raise speech-based discourse to a new level of naturalness in Embodied Conversational Agents (ECAs) carrying an open-domain dialog
Specific Problem – Overcome Automatic Speech Recognition (ASR) limitations; manage domain-independent knowledge
Training Agent Design – Conversational input that is robust to ASR errors, with an adaptable knowledge base

Approach
Build a dialog manager that:
– Handles ASR limitations
– Manages domain-independent knowledge
– Provides open dialog
CONtext-driven Corpus-based Utterance Robustness (CONCUR):
– Input Processor
– Knowledge Manager
– Discourse Model
[Block diagram: user input flows through I/O into the dialog manager's Input Processor, Discourse Model, and Knowledge Manager, which produce the agent response]
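As a rough illustration of this three-component design, the sketch below wires a minimal input processor, knowledge manager, and discourse model into a dialog-manager loop. All class and method names are hypothetical; the actual CONCUR implementation is more elaborate (context-based reasoning, goal bookkeeping, WordNet-backed parsing).

```python
# Illustrative sketch of a CONCUR-style dialog-manager loop.
# Component and method names are assumptions, not the actual CONCUR API.

class InputProcessor:
    def parse(self, utterance: str) -> list[str]:
        # Break the user utterance into candidate keyphrases (placeholder: whitespace tokens).
        return [w.lower() for w in utterance.split()]

class KnowledgeManager:
    def __init__(self, corpus: dict[str, str]):
        # Encyclopedia-entry style corpus: keyphrase -> entry text.
        self.corpus = corpus

    def lookup(self, keyphrases: list[str]) -> str | None:
        # Return the first corpus entry whose key matches a keyphrase.
        for phrase in keyphrases:
            if phrase in self.corpus:
                return self.corpus[phrase]
        return None

class DiscourseModel:
    def next_response(self, entry: str | None) -> str:
        # Fall back to a clarification request when nothing in the corpus matches.
        return entry if entry else "Could you rephrase that?"

class DialogManager:
    def __init__(self, corpus: dict[str, str]):
        self.ip = InputProcessor()
        self.km = KnowledgeManager(corpus)
        self.dm = DiscourseModel()

    def respond(self, utterance: str) -> str:
        keyphrases = self.ip.parse(utterance)
        entry = self.km.lookup(keyphrases)
        return self.dm.next_response(entry)

if __name__ == "__main__":
    manager = DialogManager({"concur": "CONCUR is a context-driven, corpus-based dialog manager."})
    print(manager.respond("Tell me about CONCUR"))
```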

CONCUR
Input Processor
– Pre-processes the knowledge corpus via keyphrasing
– Breaks down the user utterance
[Diagram: the Input Processor combines corpus data, a keyphrase extractor, WordNet, and an NLP toolkit to process the user utterance]
Knowledge Manager
– Three databases
– Encyclopedia-entry style corpus
– Context-driven
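A minimal sketch of what keyphrase-based corpus pre-processing could look like, assuming a simple stopword-filtered frequency count. CONCUR's actual extractor additionally draws on WordNet and an NLP toolkit, so this shows only the general idea, not its implementation.

```python
# Stopword-filtered keyphrase extraction sketch (illustrative; not CONCUR's actual extractor).
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def extract_keyphrases(text: str, top_n: int = 5) -> list[str]:
    # Lowercase, keep alphabetic tokens, drop stopwords, rank remaining words by frequency.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]

entry = ("The NSF Industry/University Cooperative Research Centers program "
         "partners universities with industry to conduct research.")
print(extract_keyphrases(entry))  # top keyphrases by frequency, e.g. ['industry', 'research', ...]
```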

CONCUR
CxBR Discourse Model
– Goal Bookkeeper: Goal Stack (Branting et al., 2004), Inference Engine
– Context Topology: Agent Goals, User Goals
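The goal bookkeeping borrowed from Branting et al. (2004) can be pictured as a stack of open conversational goals: new goals are pushed as they are detected and removed once fulfilled, so the most recent unfulfilled goal stays active. The sketch below shows only that idea, with hypothetical names; it is not CONCUR's inference engine or context topology.

```python
# Goal-stack bookkeeping sketch (hypothetical names; not CONCUR's actual implementation).

class GoalStack:
    def __init__(self):
        self._stack: list[str] = []

    def push(self, goal: str) -> None:
        # A newly detected agent or user goal becomes the active goal.
        self._stack.append(goal)

    def fulfill(self, goal: str) -> None:
        # Completed goals are removed so the previous goal resumes.
        if goal in self._stack:
            self._stack.remove(goal)

    def active(self) -> str | None:
        return self._stack[-1] if self._stack else None

stack = GoalStack()
stack.push("greet_user")             # agent goal
stack.push("explain_iucrc_program")  # user goal inferred from an utterance
print(stack.active())                # explain_iucrc_program
stack.fulfill("explain_iucrc_program")
print(stack.active())                # greet_user
```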

Detailed CONCUR Block Diagram

Evaluation
 Plagued by subjectivity
 Gathering of both objective and subjective metrics
 Qualitative and quantitative metrics:
 Efficiency metrics: total elapsed time, number of user turns, number of system turns, total elapsed time per turn, Word Error Rate (WER)
 Quality metrics: out-of-corpus misunderstandings, general misunderstandings, errors, total number of user goals, total number of user goals fulfilled, goal completion accuracy, conversational accuracy
 Survey data: naturalness, usefulness
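WER, goal completion accuracy, and conversational accuracy have natural definitions: WER is word-level edit distance (substitutions, deletions, insertions) over the number of reference words; goal completion accuracy is fulfilled user goals over total user goals; and the later result tables are consistent with conversational accuracy being roughly 100% minus the general misunderstanding and error rates. The helpers below compute the metrics under those assumed definitions; whether the study used exactly these formulas is not stated in the slides.

```python
# Evaluation-metric helpers under standard/assumed definitions (see note above).

def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance (substitutions + deletions + insertions)
    # divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def goal_completion_accuracy(goals_fulfilled: int, total_goals: int) -> float:
    return goals_fulfilled / total_goals

def conversational_accuracy(misunderstood_turns: int, error_turns: int, total_turns: int) -> float:
    # Turns handled without a misunderstanding or error, over all turns.
    return 1 - (misunderstood_turns + error_turns) / total_turns

print(word_error_rate("what centers are in florida", "what enters are florida"))  # 0.4
```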

Evaluation Instrument
 Nine statements, judged on a 1-to-7 scale based on level of agreement
 Naturalness
 If I told someone the character in this tool was real, they would believe me.
 The character on the screen seemed smart.
 I felt like I was having a conversation with a real person.
 This did not feel like a real interaction with another person.
 Usefulness
 I would be more productive if I had this system in my place of work.
 The tool provided me with the information I was looking for.
 I found this to be a useful way to get information.
 This tool made it harder to get information than talking to a person or using a website.
 This does not seem like a reliable way to retrieve information from a database.
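Three of the nine statements are negatively worded, so they presumably need reverse scoring before the 1-to-7 responses are averaged into Naturalness and Usefulness scores. The sketch below shows that aggregation under assumed conventions (reverse scoring and simple averaging); the item keys are shorthand labels, not the study's actual coding.

```python
# Survey aggregation sketch: reverse-score negatively worded items (assumed convention),
# then average each subscale on the 1-to-7 scale.

NATURALNESS = ["believable", "smart", "real_conversation", "not_real_interaction"]
USEFULNESS = ["productive", "found_info", "useful_way", "harder_than_alternatives", "not_reliable"]
REVERSED = {"not_real_interaction", "harder_than_alternatives", "not_reliable"}

def subscale_score(responses: dict[str, int], items: list[str]) -> float:
    # Reverse a 1-7 response by mapping r -> 8 - r, then average the subscale.
    scored = [8 - responses[i] if i in REVERSED else responses[i] for i in items]
    return sum(scored) / len(scored)

responses = {"believable": 3, "smart": 5, "real_conversation": 4, "not_real_interaction": 5,
             "productive": 4, "found_info": 6, "useful_way": 6,
             "harder_than_alternatives": 2, "not_reliable": 3}
print(subscale_score(responses, NATURALNESS))  # (3 + 5 + 4 + 3) / 4 = 3.75
print(subscale_score(responses, USEFULNESS))   # (4 + 6 + 6 + 6 + 5) / 5 = 5.4
```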

Data Acquisition
 General data set acquisition procedure:
 User asked to interact with agent
 Natural, information-seeking interaction
 Voice recording
 User asked to complete survey
 Data analysis process: voice transcriptions, ASR transcripts, internal data, and surveys analyzed

Data Set | Dialog Manager | Agent Style     | Domain         | Surveys/Transcripts Collected
1        | AlexDSS        | LifeLike Avatar | NSF I/UCRC     | 30/30
2        | CONCUR         | LifeLike Avatar | NSF I/UCRC     | 30/20
3        | CONCUR         | Chatbot         | NSF I/UCRC     | 0/20
4        | CONCUR         | Chatbot         | Current Events | 20/20

Data Acquisition
[Diagram, ECA setup: the user speaks into a microphone; a speech recognizer passes the ASR string to the CONCUR dialog manager, whose response string drives the LifeLike Avatar externals (agent voice on the speaker, agent image on the monitor)]
[Diagram, Chatbot setup: the user types input on a keyboard; the CONCUR dialog manager, wrapped in a Jabber-based agent, returns agent text output on the monitor]

Survey Baseline
[Table: Naturalness and Usefulness user ratings for Data Set 1 (AlexDSS Avatar), Data Set 2 (CONCUR Avatar), Amani (Gandhe et al., 2009), and Hassan (Gandhe et al., 2009)]
 Question 1: What are the expectations of naturalness and usefulness for the conversation agents in this study?
 Question 2: How differently did users rate the AlexDSS Avatar versus the CONCUR Avatar?
 1. Both LifeLike Avatars received user assessments that exceeded other ECA efforts
 2. Both avatar-based systems in the speech-based data sets received similar scores in Naturalness and Usefulness

Survey Baseline
[Chart: per-statement ratings of the nine Naturalness and Usefulness survey items for Data Set 1 (AlexDSS Avatar), Data Set 2 (NSF I/UCRC CONCUR Avatar), and Data Set 4 (Current Events CONCUR Chatbot)]
 Question 3: How differently did users rate the ECA systems versus the chatbot system?
 3. The ECA-based systems were judged similarly, and both were rated better than the chatbot

ASR Resilience

Metric                              | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER                                 | 60.85%                     | 58.48%
Out-of-Corpus Misunderstanding Rate | 0.29%                      | 6.37%
Goal Completion Accuracy            | 63.29%                     | 60.48%

 Question 1: Can a speech-based CONCUR Avatar's goal completion accuracy measure up to the AlexDSS Avatar under a high WER?
 1. A speech-based CONCUR Avatar's goal completion accuracy measures up to the AlexDSS Avatar under a similarly high WER

ASR Resilience

Metric                              | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER                                 | 58.48%                    | 0.00%
Out-of-Corpus Misunderstanding Rate | 6.37%                     | 6.77%
Goal Completion Accuracy            | 60.48%                    | 68.48%

 Question 2: How does improving WER affect CONCUR's goal completion accuracy?
 2. Improved WER does not increase CONCUR's goal completion accuracy, because no new user goals were identified or corrected with the better recognition

ASR Resilience

Agent                                   | Average WER | Goal Completion Accuracy
Data Set 2: CONCUR Avatar               | 58.48%      | 60.48%
Digital Kyoto (Misu and Kawahara, 2007) | 29.40%      | 61.40%

 Question 3: Can CONCUR's goal completion accuracy measure up to other conversation agents despite high WER?
 3. CONCUR's goal completion accuracy is similar to that of the Digital Kyoto system, despite twice the WER

ASR Resilience

Metric                        | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER                           | 60.85%                     | 58.48%
General Misunderstanding Rate | 9.51%                      | 14.12%
Error Rate                    | 8.71%                      | 21.81%
Conversational Accuracy       | 81.78%                     | 64.22%

 Question 4: Can a speech-based CONCUR Avatar's conversational accuracy measure up to the AlexDSS Avatar under a high WER?
 4. Speech-based CONCUR's conversational accuracy does not measure up to the AlexDSS Avatar's under a similarly high WER. This can be attributed to general misunderstandings and errors caused by misheard user requests and by specific question-answering requests that do not arise in menu-driven discourse models

ASR Resilience

Metric                        | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER                           | 58.48%                    | 0.00%
General Misunderstanding Rate | 14.12%                    | 7.48%
Error Rate                    | 21.81%                    | 16.68%
Goal Completion Accuracy      | 60.48%                    | 68.48%
Conversational Accuracy       | 64.22%                    | 75.31%

 Question 5: How does improving WER affect CONCUR's conversational accuracy?
 5. Improved WER increases CONCUR's conversational accuracy by decreasing general misunderstandings

ASR Resilience

Agent                        | Average WER | Conversational Accuracy
Data Set 2: CONCUR Avatar    | 58.48%      | 64.22%
TARA (Schumaker et al., 2007)| 0.00%       | 54.00%

 Question 6: Can CONCUR's conversational accuracy measure up to other conversation agents despite high WER?
 6. CONCUR's conversational accuracy surpasses that of the TARA system, which is text-based

Domain-Independence

Metric                              | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
Out-of-Corpus Misunderstanding Rate | 6.15%                         | 6.77%                          | 17.45%
Goal Completion Accuracy            | 60.48%                        | 68.48%                         | 48.08%

 Question 1: Can CONCUR maintain goal completion accuracy after changing to a less specific domain corpus?
 1. CONCUR's goal completion accuracy does not remain consistent after a change to a generalized domain corpus. Changing domain expertise may increase out-of-corpus requests, which decreases goal completion

Domain-Independence

Metric                        | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
General Misunderstanding Rate | 14.49%                        | 7.48%                          | 0.00%
Error Rate                    | 21.81%                        | 16.68%                         | 16.46%
Conversational Accuracy       | 64.22%                        | 75.34%                         | 83.54%

 Question 2: Can CONCUR maintain conversational accuracy after changing to a less specific domain corpus?
 2. After changing to a general domain corpus, CONCUR is capable of maintaining its conversational accuracy

Domain-Independence

Dialog System                               | Method                | Turnover Time
CONCUR                                      | Corpus-based          | 3 days
Marve (Babu et al., 2006)                   | Wizard-of-Oz          | 18 days
Amani (Gandhe et al., 2009)                 | Question-answer pairs | Weeks
AlexDSS                                     | Expert system         | Weeks
Sergeant Blackwell (Robinson et al., 2008)  | Wizard-of-Oz          | 7 months
Sergeant Star (Artstein et al., 2009)       | Question-answer pairs | 1 year
HMIHY (Béchet et al., 2004)                 | Hand-modeled          | 2 years
Hassan (Gandhe et al., 2009)                | Question-answer pairs | Years

 Question 3: Can CONCUR provide a quick method of providing agent knowledge?
 3. CONCUR's Knowledge Manager enables a shorter knowledge-development turnover time than other conversation agent knowledge management systems

Conclusions
Building Training Agents
– Agent Design: the ECA format is preferred over the chatbot format
– ASR: ASR improvements lead to better conversation-level processing; a high ASR error rate is not necessarily an obstacle for ECA design
– Knowledge Management: tailoring domain expertise to an intended audience is more effective than a generalized corpus; separating domain knowledge from agent discourse helps maintain conversational accuracy and speeds up agent development times