1
Dialog Management for Rapid-Prototyping of Speech-Based Training Agents Victor Hung, Avelino Gonzalez, Ronald DeMara University of Central Florida
2
Agenda
– Introduction
– Approach
– Evaluation
– Results
– Conclusions
3
Introduction
General Problem
– Elevate speech-based discourse to a new level of naturalness in Embodied Conversational Agents (ECAs) carrying open-domain dialog
Specific Problems
– Overcome Automatic Speech Recognition (ASR) limitations
– Manage knowledge in a domain-independent way
Training Agent Design
– Conversational input that is robust to ASR errors, with an adaptable knowledge base
4
Approach
Build a dialog manager that:
– Handles ASR limitations
– Manages domain-independent knowledge
– Provides open dialog
CONtext-driven Corpus-based Utterance Robustness (CONCUR), built from three components (a minimal sketch of how they fit together follows):
– Input Processor
– Knowledge Manager
– Discourse Model
[Block diagram: user input enters the dialog manager at the Input Processor, which works with the Discourse Model and Knowledge Manager to produce the agent response.]
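The sketch below shows one way the three components could be wired together. The class and method names are illustrative assumptions; the slides do not expose CONCUR's actual API.

    # Illustrative sketch only: names are assumptions, not CONCUR's real API.
    class InputProcessor:
        """Breaks a (possibly noisy) ASR string into keyphrases."""
        def parse(self, utterance):
            return [w for w in utterance.lower().split() if len(w) > 3]

    class KnowledgeManager:
        """Context-driven lookup over an encyclopedia-entry style corpus."""
        def __init__(self, corpus):
            self.corpus = corpus  # dict: topic -> entry text

        def lookup(self, keyphrases):
            def overlap(topic):
                text = self.corpus[topic].lower()
                return sum(k in text for k in keyphrases)
            best = max(self.corpus, key=overlap, default=None)
            return self.corpus[best] if best and overlap(best) > 0 else None

    class DiscourseModel:
        """Chooses the agent response; falls back on a misunderstanding reply."""
        def respond(self, entry):
            return entry or "I'm sorry, I don't have information on that."

    class DialogManager:
        def __init__(self, corpus):
            self.ip = InputProcessor()
            self.km = KnowledgeManager(corpus)
            self.dm = DiscourseModel()

        def turn(self, user_input):
            return self.dm.respond(self.km.lookup(self.ip.parse(user_input)))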
5
CONCUR
Input Processor
– Pre-processes the knowledge corpus via keyphrasing (see the extraction sketch below)
– Breaks down the user utterance
[Block diagram: the Input Processor applies a keyphrase extractor, WordNet, and an NLP toolkit to both the corpus data and the user utterance.]
Knowledge Manager
– Three databases
– Encyclopedia-entry style corpus
– Context-driven
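A sketch of the keyphrasing step, using NLTK and its WordNet interface as stand-ins for the "NLP Toolkit" and "WordNet" boxes on the slide; the extractor CONCUR actually uses is not specified here.

    # Keyphrase extraction sketch with NLTK as a stand-in toolkit.
    # Requires the 'punkt', 'stopwords', 'wordnet', and
    # 'averaged_perceptron_tagger' NLTK data packages.
    import nltk
    from nltk.corpus import stopwords, wordnet

    def keyphrases(text):
        words = nltk.word_tokenize(text.lower())
        stop = set(stopwords.words("english"))
        tagged = nltk.pos_tag(words)
        # Keep content-bearing nouns that WordNet recognizes.
        return [w for w, tag in tagged
                if tag.startswith("NN") and w not in stop and wordnet.synsets(w)]

    entry = "The research center develops speech technology for training agents."
    print(keyphrases(entry))  # e.g. ['research', 'center', 'speech', 'technology', 'agents']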
6
CONCUR
Context-Based Reasoning (CxBR) Discourse Model (a toy sketch of these structures follows)
– Goal Bookkeeper: Goal Stack (Branting et al., 2004)
– Inference Engine: Context Topology, Agent Goals, User Goals
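A toy sketch of the goal-bookkeeping and context-topology ideas: user goals live on a stack (after Branting et al., 2004) and the context topology constrains which context transitions are legal. Data structures and names here are illustrative assumptions.

    class GoalBookkeeper:
        def __init__(self):
            self._stack = []

        def push(self, goal):            # a new user goal is detected
            self._stack.append(goal)

        def current(self):               # goal the agent should pursue now
            return self._stack[-1] if self._stack else None

        def fulfill(self):               # pop once the goal is satisfied
            return self._stack.pop() if self._stack else None

    class ContextTopology:
        def __init__(self, edges, start):
            self.edges = edges           # dict: context -> set of successors
            self.active = start

        def transition(self, context):
            if context in self.edges.get(self.active, set()):
                self.active = context    # legal move in the topology
            return self.active

    topology = ContextTopology(
        {"greeting": {"topic_qa"}, "topic_qa": {"topic_qa", "closing"}},
        start="greeting")
    goals = GoalBookkeeper()
    goals.push("explain_membership_fees")
    topology.transition("topic_qa")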
7
Detailed CONCUR Block Diagram
8
Evaluation
– Evaluation of conversational agents is plagued by subjectivity, so both objective and subjective metrics were gathered
Qualitative and quantitative metrics:
Efficiency metrics
– Total elapsed time
– Number of user turns
– Number of system turns
– Total elapsed time per turn
– Word-Error Rate (WER; a minimal computation sketch follows this list)
Quality metrics
– Out-of-corpus misunderstandings
– General misunderstandings
– Errors
– Total number of user goals
– Total number of user goals fulfilled
– Goal completion accuracy
– Conversational accuracy
Survey data
– Naturalness
– Usefulness
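WER here is the standard word-level measure: the minimum number of substitutions, insertions, and deletions needed to turn the ASR hypothesis into the reference transcription, divided by the number of reference words. A minimal sketch:

    def wer(reference, hypothesis):
        r, h = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(r)][len(h)] / len(r)

    print(wer("tell me about the center", "tell me a bout the enter"))  # 0.6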
9
Evaluation Instrument
Nine statements, each judged on a 1-to-7 scale of agreement
Naturalness
– If I told someone the character in this tool was real, they would believe me.
– The character on the screen seemed smart.
– I felt like I was having a conversation with a real person.
– This did not feel like a real interaction with another person.
Usefulness
– I would be more productive if I had this system in my place of work.
– The tool provided me with the information I was looking for.
– I found this to be a useful way to get information.
– This tool made it harder to get information than talking to a person or using a website.
– This does not seem like a reliable way to retrieve information from a database.
10
Data Acquisition
General data-set acquisition procedure:
– User asked to interact with the agent (natural, information-seeking)
– Voice recorded
– User asked to complete the survey
Data analysis process:
– Voice transcriptions, ASR transcripts, internal data, and surveys analyzed

Data Set | Dialog Manager | Agent Style | Domain | Surveys/Transcripts Collected
1 | AlexDSS | LifeLike Avatar | NSF I/UCRC | 30/30
2 | CONCUR | LifeLike Avatar | NSF I/UCRC | 30/20
3 | CONCUR | Chatbot | NSF I/UCRC | 0/20
4 | CONCUR | Chatbot | Current Events | 20/20
11
Data Acquisition
[Block diagram, ECA configuration: the user speaks into a microphone; the speech recognizer passes an ASR string to the CONCUR dialog manager; the response string drives the LifeLike Avatar's voice (speaker) and image (monitor).]
[Block diagram, chatbot configuration: the user types keyboard input to a Jabber-based agent, which exchanges text with the CONCUR dialog manager and displays the agent's text output.]
12
Survey Baseline
Question 1: What are the expectations of naturalness and usefulness for the conversation agents in this study?
Question 2: How differently did users rate the AlexDSS Avatar versus the CONCUR Avatar?

Agent | Naturalness User Rating | Usefulness User Rating
Data Set 1: AlexDSS Avatar | 4.02 | 4.47
Data Set 2: CONCUR Avatar | 4.14 | 4.51
Amani (Gandhe et al., 2009) | 3.09 | 3.24
Hassan (Gandhe et al., 2009) | 3.55 | 4.00

1. Both LifeLike Avatars received user assessments that exceeded other ECA efforts
2. Both avatar-based systems in the speech-based data sets received similar scores for Naturalness and Usefulness
13
Survey Baseline
Question 3: How differently did users rate the ECA systems versus the chatbot system?

Statements (N = Naturalness, U = Usefulness):
S1 (N): If I told someone the character in this tool was real, they would believe me.
S2 (U): I would be more productive if I had this system in my place of work.
S3 (N): The character on the screen seemed smart.
S4 (N): I felt like I was having a conversation with a real person.
S5 (U): The tool provided me with the information I was looking for.
S6 (U): I found this to be a useful way to get information.
S7 (U): This tool made it harder to get information than talking to a person or using a website.
S8 (U): This does not seem like a reliable way to retrieve information from a database.
S9 (N): This did not feel like a real interaction with another person.

Agent | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | Naturalness | Usefulness
Data Set 1: AlexDSS Avatar | 3.20 | 4.10 | 4.73 | 4.10 | 4.57 | 5.07 | 4.23 | 3.17 | 3.97 | 4.02 | 4.47
Data Set 2: NSF I/UCRC CONCUR Avatar | 4.07 | 4.00 | 4.97 | 3.83 | 4.90 | 5.43 | 4.33 | 3.43 | 4.30 | 4.14 | 4.51
Data Set 4: Current Events CONCUR Chatbot | 2.20 | 2.45 | 3.00 | 2.35 | 4.10 | 3.70 | 4.80 | 4.55 | 5.95 | 2.40 | 3.38

3. The ECA-based systems were judged similarly, and both were rated better than the chatbot
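The aggregate Naturalness and Usefulness columns are consistent with the negatively worded statements (S7, S8, S9) being reverse-scored as 8 − score before averaging; that is an inference from the numbers, not something stated on the slide. A sketch that reproduces the Data Set 1 aggregates under this assumption:

    # Reverse-scoring assumption: score' = 8 - score on the 1-to-7 scale
    # for negatively worded statements, then average per category.
    scores = [3.20, 4.10, 4.73, 4.10, 4.57, 5.07, 4.23, 3.17, 3.97]  # Data Set 1, S1..S9
    naturalness = [0, 2, 3, 8]      # S1, S3, S4, S9
    usefulness = [1, 4, 5, 6, 7]    # S2, S5, S6, S7, S8
    negated = {6, 7, 8}             # S7, S8, S9 are negatively worded

    def aggregate(indices):
        vals = [8 - scores[i] if i in negated else scores[i] for i in indices]
        return sum(vals) / len(vals)

    print(round(aggregate(naturalness), 2))  # ~4.02, matching the table
    print(round(aggregate(usefulness), 2))   # ~4.47, matching the table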
14
ASR Resilience
Question 1: Can a speech-based CONCUR Avatar's goal completion accuracy measure up to the AlexDSS Avatar's under a high WER?

Metric | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER (efficiency) | 60.85% | 58.48%
Out-of-Corpus Misunderstanding Rate | 0.29% | 6.37%
Goal Completion Accuracy | 63.29% | 60.48%

1. A speech-based CONCUR Avatar's goal completion accuracy measures up to the AlexDSS Avatar's under a similarly high WER
15
ASR Resilience
Question 2: How does improving WER affect CONCUR's goal completion accuracy?

Metric | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER (efficiency) | 58.48% | 0.00%
Out-of-Corpus Misunderstanding Rate | 6.37% | 6.77%
Goal Completion Accuracy | 60.48% | 68.48%

2. Improved WER does not by itself increase CONCUR's goal completion accuracy: no new user goals were identified or corrected with the better recognition
16
ASR Resilience
Question 3: Can CONCUR's goal completion accuracy measure up to other conversation agents' in spite of high WER?

Agent | Average WER | Goal Completion Accuracy
Data Set 2: CONCUR Avatar | 58.48% | 60.48%
Digital Kyoto (Misu and Kawahara, 2007) | 29.40% | 61.40%

3. CONCUR's goal completion accuracy is similar to that of the Digital Kyoto system, despite twice the WER
17
ASR Resilience
Question 4: Can a speech-based CONCUR Avatar's conversational accuracy measure up to the AlexDSS Avatar's under a high WER?

Metric | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER (efficiency) | 60.85% | 58.48%
General Misunderstanding Rate | 9.51% | 14.12%
Error Rate | 8.71% | 21.81%
Conversational Accuracy | 81.78% | 64.22%

4. The speech-based CONCUR Avatar's conversational accuracy does not measure up to the AlexDSS Avatar's under a similarly high WER. This can be attributed to general misunderstandings and errors caused by misheard user requests, and to specific question-answering requests that do not arise in menu-driven discourse models
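Note (an inference from the reported figures, not stated on the slide): conversational accuracy appears to be the complement of the general misunderstanding and error rates. For Data Set 1, 100% − 9.51% − 8.71% = 81.78% exactly, and the other data sets come within a fraction of a percentage point of the same identity.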
18
ASR Resilience
Question 5: How does improving WER affect CONCUR's conversational accuracy?

Metric | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER (efficiency) | 58.48% | 0.00%
General Misunderstanding Rate | 14.12% | 7.48%
Error Rate | 21.81% | 16.68%
Goal Completion Accuracy | 60.48% | 68.48%
Conversational Accuracy | 64.22% | 75.31%

5. Improved WER increases CONCUR's conversational accuracy by decreasing general misunderstandings
19
ASR Resilience
Question 6: Can CONCUR's conversational accuracy measure up to other conversation agents' in spite of high WER?

Agent | Average WER | Conversational Accuracy
Data Set 2: CONCUR Avatar | 58.48% | 64.22%
TARA (Schumaker et al., 2007) | 0.00% | 54.00%

6. CONCUR's conversational accuracy surpasses that of the TARA system, which is text-based
20
Domain-Independence
Question 1: Can CONCUR maintain goal completion accuracy after changing to a less specific domain corpus?

Metric | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
Out-of-Corpus Misunderstanding Rate | 6.15% | 6.77% | 17.45%
Goal Completion Accuracy | 60.48% | 68.48% | 48.08%

1. CONCUR's goal completion accuracy does not remain consistent after a change to a generalized domain corpus. Changing domain expertise may increase out-of-corpus requests, which decreases goal completion
21
Domain-Independence
Question 2: Can CONCUR maintain conversational accuracy after changing to a less specific domain corpus?

Metric | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
General Misunderstanding Rate | 14.49% | 7.48% | 0.00%
Error Rate | 21.81% | 16.68% | 16.46%
Conversational Accuracy | 64.22% | 75.34% | 83.54%

2. After changing to a general domain corpus, CONCUR is capable of maintaining its conversational accuracy
22
Domain-Independence
Question 3: Can CONCUR provide a quick method of providing agent knowledge?

Dialog System | Method | Turnover Time
CONCUR | Corpus-based | 3 days
Marve (Babu et al., 2006) | Wizard-of-Oz | 18 days
Amani (Gandhe et al., 2009) | Question-answer pairs | Weeks
AlexDSS | Expert system | Weeks
Sergeant Blackwell (Robinson et al., 2008) | Wizard-of-Oz | 7 months
Sergeant Star (Artstein et al., 2009) | Question-answer pairs | 1 year
HMIHY (Béchet et al., 2004) | Hand-modeled | 2 years
Hassan (Gandhe et al., 2009) | Question-answer pairs | Years

3. CONCUR's Knowledge Manager enables a shorter knowledge-development turnover time than other conversation-agent knowledge management systems
23
Conclusions
Building Training Agents
– Agent Design: users preferred the ECA over the chatbot format
– ASR: ASR improvements lead to better conversation-level processing, and a high WER is not necessarily an obstacle to ECA design
– Knowledge Management: tailoring domain expertise to an intended audience is more effective than a generalized corpus; separating domain knowledge from agent discourse helps maintain conversational accuracy and shortens agent development time