Evaluating Human-Machine Conversation for Appropriateness David Benyon, Preben Hansen, Oli Mival and Nick Webb.


Overview
Companions are targeted as persistent, collaborative, conversational partners
Rather than singular tasks, Companions have a range of tasks
Completion of tasks is important
So is conversational performance

Metrics
Objective measures
  – WER, CER, Turn Duration, Vocabulary…
Subjective user measures
  – User satisfaction surveys
Appropriateness

D. Traum, S. Robinson and J. Stephan. Evaluation of multi-party virtual reality dialogue interaction, in LREC.
Alongside traditional measures, introduces concept of “response appropriateness”
Created for ICT/ISI mission rehearsal exercise system

Initial Companion Evaluation
2 Companion prototypes
  – Health & Fitness
  – Senior Companion
8 users completed entire protocol
All participants were native English speakers without strong accents
Ages from 27 to 61
2 were female, 6 were male

Initial Companion Evaluation
New version (2.0) of Senior Companion
  – 12 new participants
  – 9 male, 3 female (ages 21-38)
Key changes
  – Facebook photographs (pre-tagged)
  – Loquendo TTS elements (cough, laugh)
  – Additional “chat” ability from a chatbot
Improved metric results
  – Avg. words / utterance
  – 4.27 (v1) to 6.1 (v2)

I found the Companion engaging
[Chart: v1.0 and v2.0; v1.0 SC vs v2.0]

The Companion demonstrated emotion at times
[Chart: v1.0 and v2.0; v1.0 SC vs v2.0]

Appropriateness
Traum et al. devised an “appropriateness” coding scheme.
Split system and user utterances.
Users:
  – Response To System [RTS]
  – Gets RESponse [RES]
  – No Response: Appropriate [NRA]
  – No Response: Not appropriate [NRN]

3rd Phase - Appropriateness
For agents:
  – Filled Pause [FP]
  – Request for Repair [RR]
  – Appropriate Response [AR]
  – Appropriate Question [AQ]
  – Appropriate new INItiative [INI]
  – Appropriate CONtinuation [CON]
  – iNAPpropriate response, initiative or continuation [NAP]

Scoring Intuitions
Filled pauses generally human-like and good for virtual agents to perform but don’t add a lot (0)
Appropriate responses and questions very good (+2), but initiatives that push the interaction back on track are better (+3)
Extended contributions on topic somewhat good (+.5)
Repairs and clarifications bad (-.5), but their use can still gain points by allowing subsequent appropriate response
Inappropriate response bad (-1), no response worse (-2)
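
The scoring intuitions above can be sketched as a small lookup table plus a summing function. This is a minimal illustration, not the authors' tool. The agent-tag weights come from the slide. Three things are assumptions: that the "no response" penalty (-2) attaches to the NRN tag, that the remaining user tags score 0, and that the per-utterance score divides by all tagged utterances. The names `TAG_SCORES` and `score_dialogue` are hypothetical.

```python
# Weights for agent tags, taken from the "Scoring Intuitions" slide.
TAG_SCORES = {
    "FP": 0.0,    # filled pause
    "AR": 2.0,    # appropriate response
    "AQ": 2.0,    # appropriate question
    "INI": 3.0,   # appropriate new initiative
    "CON": 0.5,   # appropriate continuation
    "RR": -0.5,   # request for repair
    "NAP": -1.0,  # inappropriate response, initiative or continuation
    # Assumption: the "no response" penalty is attached to NRN.
    "NRN": -2.0,
    # Assumption: the other user tags are neutral.
    "RTS": 0.0,
    "RES": 0.0,
    "NRA": 0.0,
}


def score_dialogue(tags):
    """Return (total score, score per utterance) for a list of tags.

    Assumption: the per-utterance score divides by every tagged
    utterance, user and agent alike.
    """
    total = sum(TAG_SCORES[tag] for tag in tags)
    per_utterance = total / len(tags) if tags else 0.0
    return total, per_utterance


if __name__ == "__main__":
    # Tag sequence of the sample HFC dialogue: AQ, RTS, RR.
    print(score_dialogue(["AQ", "RTS", "RR"]))
```

Normalising by dialogue length is what makes short and long dialogues comparable, which is presumably why both an average score and a per-utterance score are reported.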

Appropriateness Evaluation
7 HFC, 13 SC dialogues
4 Pre-Chatbot SC, 9 Post-Chatbot
All annotated by a single user
To check consistency, one dialogue of each annotated by at least 3 annotators
Inter-coder reliability scored at over 0.8 (Krippendorff’s Alpha) for all
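
For readers unfamiliar with the reliability figure, Krippendorff's Alpha for nominal labels such as the appropriateness tags can be computed as below. This is a sketch of the standard coincidence-matrix formulation, not the authors' own tooling. The function name and input layout (one list of labels per utterance, one label per annotator who tagged it) are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    annotators gave to one utterance (missing labels simply omitted).
    """
    # Coincidence matrix: each ordered pair of labels from different
    # annotators within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # units with a single label carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (both scaled by the same factor n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical tags from three annotators for four agent utterances.
    sample = [["AR", "AR", "AR"], ["AQ", "AQ", "AR"],
              ["NAP", "NAP", "NAP"], ["RR", "RR", "RR"]]
    print(round(krippendorff_alpha_nominal(sample), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.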

Sample HFC Dialogue
Alright. How about ordering lunch from a takeaway? [Appropriate Question]
[Response To System]
Sorry, I didn't understand. Can you rephrase: How about ordering lunch from a takeaway? [Request Repair]

Sample SC Dialogue
What is your relationship to Libby. [Appropriate Question]
I'm sorry I didn't understand your relationship to Libby. [Inappropriate Response]
… could be as my friend [Response To System]

Average Score

Per Utterance Score

Tag Distribution

Initial Conclusions
Seems to correlate with improvement in user responses (needs further investigation)
Reliably encoded by annotators
Indicates problem areas in dialogue

Tools and Resources
XML encoded dialogue corpus
Corpus collection tool
Appropriateness annotation guidelines
Appropriateness annotation tool

Next Steps
Refine appropriateness measures
  – Add NEW tags: confirmation, politeness, emotion
  – Modify existing tags: specific inappropriate tags
Don’t have upper bounds of performance – require WoZ models
Need to monitor users’ behaviour over time
Use scoring system to inform reinforcement learning