1 Evaluating Human-Machine Conversation for Appropriateness
David Benyon, Preben Hansen, Oli Mival and Nick Webb

2 Overview
www.companions-project.org
Companions are targeted as persistent, collaborative, conversational partners
Rather than singular tasks, Companions have a range of tasks
Completion of tasks is important
So is conversational performance

3 Metrics
Objective measures
– WER, CER, Turn Duration, Vocabulary…
Subjective user measures
– User satisfaction surveys
Appropriateness
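For concreteness, WER here is the usual word-level edit distance divided by the reference length (CER is the same calculation over characters). A minimal sketch of that calculation in Python; it is illustrative only, not the project's evaluation code, and the example strings are made up.

# Minimal word error rate sketch (edit distance over word sequences).
# Illustrative only; the Companions evaluation used its own tooling.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one word dropped from a 7-word reference gives 1/7
print(wer("how about ordering lunch from a takeaway",
          "how about ordering lunch from takeaway"))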

4 D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction, in LREC, 2004.
Alongside traditional measures, introduces concept of "response appropriateness"
Created for ICT/ISI mission rehearsal exercise system

5 Initial Companion Evaluation
2 Companion prototypes
– Health & Fitness
– Senior Companion
8 users completed entire protocol
All participants were native English speakers without strong accents
Ages from 27 to 61
2 were female, 6 were male

6 Initial Companion Evaluation
New version (2.0) of Senior Companion
– 12 new participants
– 9 male, 3 female (ages 21-38)
Key changes
– Facebook photographs (pre-tagged)
– Loquendo TTS elements (cough, laugh)
– Additional "chat" ability from a chatbot
Improved metric results
– Avg. words / utterance: 4.27 (v1) to 6.1 (v2)

7 I found the Companion engaging (user survey chart: Senior Companion v1.0 vs v2.0)

8 The Companion demonstrated emotion at times (user survey chart: Senior Companion v1.0 vs v2.0)

9 Appropriateness
Traum et al. devised an "appropriateness" coding scheme.
Split system and user utterances.
Users:
– Response To System [RTS]
– Gets RESponse [RES]
– No Response: Appropriate [NRA]
– No Response: Not appropriate [NRN]

10 3rd Phase - Appropriateness
For agents:
– Filled Pause [FP]
– Request for Repair [RR]
– Appropriate Response [AR]
– Appropriate Question [AQ]
– Appropriate new INItiative [INI]
– Appropriate CONtinuation [CON]
– iNAPpropriate response, initiative or continuation [NAP]
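One way to hold the full tag inventory in an annotation tool is as two small enumerations, one per speaker role. This sketch simply transcribes the abbreviations from the two slides above; the Python structure itself is an assumption, not the Companions annotation tool.

# Hedged sketch of the appropriateness tag inventory as Python enums;
# the abbreviations follow the slides, the class layout is assumed.
from enum import Enum

class UserTag(Enum):
    RTS = "Response To System"
    RES = "Gets RESponse"
    NRA = "No Response: Appropriate"
    NRN = "No Response: Not appropriate"

class AgentTag(Enum):
    FP  = "Filled Pause"
    RR  = "Request for Repair"
    AR  = "Appropriate Response"
    AQ  = "Appropriate Question"
    INI = "Appropriate new INItiative"
    CON = "Appropriate CONtinuation"
    NAP = "iNAPpropriate response, initiative or continuation"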

11 Scoring Intuitions
Filled pauses generally human-like and good for virtual agents to perform but don't add a lot (0)
Appropriate responses and questions very good (+2), but initiatives that push the interaction back on track are better (+3)
Extended contributions on topic somewhat good (+0.5)
Repairs and clarifications bad (-0.5), but their use can still gain points by allowing subsequent appropriate response
Inappropriate response bad (-1), no response worse (-2)
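Read as per-tag weights, these intuitions give a simple dialogue-level score. A minimal sketch assuming exactly the weights above; mapping the -2 "no response" penalty onto the NRN user tag is my reading of the slides rather than something they state, and the function names are illustrative.

# Hedged sketch: per-utterance weights as read off the "Scoring Intuitions"
# slide. Mapping the -2 "no response" penalty to NRN is an assumption;
# the functions are illustrative, not project code.
WEIGHTS = {
    "FP": 0.0,    # filled pause: human-like but adds little
    "AR": 2.0,    # appropriate response
    "AQ": 2.0,    # appropriate question
    "INI": 3.0,   # initiative that pushes the interaction back on track
    "CON": 0.5,   # on-topic continuation
    "RR": -0.5,   # request for repair / clarification
    "NAP": -1.0,  # inappropriate response, initiative or continuation
    "NRN": -2.0,  # no response where one was needed (assumed mapping)
}

def dialogue_score(tags):
    """Sum the weights of all scored tags in one annotated dialogue."""
    return sum(WEIGHTS.get(tag, 0.0) for tag in tags)

def per_utterance_score(tags):
    """Average score per scored utterance (cf. the later results slides)."""
    scored = [WEIGHTS[t] for t in tags if t in WEIGHTS]
    return sum(scored) / len(scored) if scored else 0.0

# e.g. the sample HFC exchange below: AQ (+2) followed by RR (-0.5)
print(dialogue_score(["AQ", "RR"]))       # 1.5
print(per_utterance_score(["AQ", "RR"]))  # 0.75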

12 Appropriateness Evaluation
7 HFC, 13 SC dialogues
4 pre-chatbot SC, 9 post-chatbot
All annotated by a single annotator
To check consistency, one dialogue of each was annotated by at least 3 annotators
Inter-coder reliability scored at over 0.8 (Krippendorff's Alpha) for all
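The consistency check amounts to computing Krippendorff's alpha over the tags that several annotators assigned to the same utterances. A small sketch using NLTK's agreement module; the annotator/utterance/tag triples are invented for illustration.

# Hedged sketch of the inter-coder reliability check. The triples below are
# made up; NLTK's default distance treats the tags as nominal categories.
from nltk.metrics.agreement import AnnotationTask

# (annotator, utterance id, appropriateness tag)
triples = [
    ("coder1", "utt1", "AQ"), ("coder2", "utt1", "AQ"), ("coder3", "utt1", "AQ"),
    ("coder1", "utt2", "RR"), ("coder2", "utt2", "RR"), ("coder3", "utt2", "NAP"),
    ("coder1", "utt3", "AR"), ("coder2", "utt3", "AR"), ("coder3", "utt3", "AR"),
]

task = AnnotationTask(data=triples)
# The slides report alpha above 0.8 on the doubly-annotated dialogues.
print(task.alpha())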

13 Sample HFC Dialogue
System: Alright. How about ordering lunch from a takeaway? [Appropriate Question]
User: [Response To System]
System: Sorry, I didn't understand. Can you re-phrase: How about ordering lunch from a takeaway? [Request Repair]

14 Sample SC Dialogue
System: What is your relationship to Libby. [Appropriate Question]
System: I'm sorry I didn't understand your relationship to Libby. [Inappropriate Response]
User: … could be as my friend [Response To System]

15 Average Score

16 Per Utterance Score

17 Tag Distribution

18 Initial Conclusions
Seems to correlate with improvement in user responses (needs further investigation)
Reliably encoded by annotators
Indicates problem areas in dialogue

19 Tools and Resources
XML encoded dialogue corpus
Corpus collection tool
Appropriateness annotation guidelines
Appropriateness annotation tool
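The slides do not show the corpus schema, so the element and attribute names below are assumptions; this is only a sketch of how an XML-encoded, appropriateness-tagged dialogue might be read back for scoring, not the Companions corpus format.

# Hedged sketch: reading an appropriateness-annotated dialogue from XML.
# The <dialogue>/<utterance speaker=... tag=...> schema and the sample
# utterance texts are invented for illustration.
import xml.etree.ElementTree as ET

sample = """<dialogue id="hfc-01">
  <utterance speaker="system" tag="AQ">How about ordering lunch from a takeaway?</utterance>
  <utterance speaker="user" tag="RTS">maybe a sandwich</utterance>
  <utterance speaker="system" tag="RR">Sorry, I didn't understand. Can you re-phrase?</utterance>
</dialogue>"""

root = ET.fromstring(sample)
tags = [u.get("tag") for u in root.iter("utterance") if u.get("speaker") == "system"]
print(tags)   # ['AQ', 'RR']; ready to feed into a scoring function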

20 Next Steps
Refine appropriateness measures
– Add new tags: confirmation, politeness, emotion
– Modify existing tags: specific inappropriate tags
Don't have upper bounds of performance – require WoZ models
Need to monitor user behaviour over time
Use scoring system to inform reinforcement learning
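Using the scoring system to inform reinforcement learning presumably means treating per-utterance appropriateness scores as a reward signal. A toy sketch of that idea; the states, actions, reward mapping and hyperparameters are all assumptions, not anything described in the slides.

# Toy sketch: appropriateness scores as rewards in a tabular Q-learning
# update. Entirely illustrative; states, actions, learning rate, discount
# and the reward table are assumptions.
from collections import defaultdict

REWARD = {"AQ": 2.0, "AR": 2.0, "INI": 3.0, "CON": 0.5,
          "RR": -0.5, "NAP": -1.0, "FP": 0.0}   # weights from the scoring slide
ALPHA, GAMMA = 0.1, 0.9                          # assumed learning rate / discount
Q = defaultdict(float)

def update(state, action, tag, next_state, actions):
    """One Q-learning step, rewarding the action by its appropriateness tag."""
    reward = REWARD.get(tag, 0.0)
    best_next = max((Q[(next_state, a)] for a in actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# e.g. a system question later tagged AQ (+2), taken from a "greeting" state
update("greeting", "ask_question", "AQ", "awaiting_answer", ["ask_question", "repair"])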

