Dialogues in Context: An Objective User-Oriented Evaluation Approach for Virtual Human Dialogue Susan Robinson, Antonio Roque & David Traum
2 Overview
We present a method for evaluating the dialogue of agents in complex, non-task-oriented dialogues.
3 Staff Duty Officer Moleno
4 System Features
- Agent communicates through text-based modalities (IM and chat)
- Core response selection handled by the statistical classifier NPCEditor (Leuski and Traum; P32, Sacra Infermeria, Thurs 16:55-18:15)
- To handle multi-party dialogue, Moleno (sketched below):
  – Keeps a user model with username, elapsed time, typing status and location
  – Delays responding to an utterance it is unsure about until no users are typing
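A minimal Python sketch of the multi-party behaviour described above; all names here are hypothetical illustrations, not the actual Moleno or NPCEditor code.

```python
import time
from dataclasses import dataclass, field

@dataclass
class UserModel:
    """Per-user state the agent keeps: username, elapsed time, typing status, location."""
    username: str
    joined_at: float = field(default_factory=time.time)
    is_typing: bool = False
    location: str = "unknown"

    def elapsed_time(self) -> float:
        return time.time() - self.joined_at

def should_respond_now(confidence: float, users: dict[str, UserModel],
                       threshold: float = 0.5) -> bool:
    """Respond immediately if response selection is confident; otherwise
    hold the response until no user in the conversation is typing."""
    if confidence >= threshold:
        return True
    return not any(u.is_typing for u in users.values())
```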
5 Desired Qualities
Ideally, an evaluation method would:
- Give direct, measurable feedback on the quality of the agent's actual dialogue performance
- Have sufficient detail to direct improvement of an agent's dialogue at multiple phases of development
- Be largely transferable to the evaluation of multiple agents in different domains, and with different system architectures
6 Problems with Current Approaches
Component Performance
– Difficult to compare across systems
– Does not directly evaluate dialogue performance
User Survey
– Lacks objectivity and detail
Task Success
– Problematic when tasks are complex or success is hard to specify
7 Our Approach: Linguistic Evaluation
Evaluate from the perspective of the interactive dialogue itself
– Allows evaluation metrics to be divorced from system-internal features
– Allows more objective measures than the user's subjective experience
– Allows detailed examination of, and feedback on, dialogue success
Paired coding scheme
– Annotate the dialogue action of the user's utterances
– Evaluate the quality of the agent's response
8 Scheme 1: Dialogue Action
Top Code – Category (Subcategories)
D – Dialogue Functions (Greeting / Closing / Politeness)
C – Critique (Positive / Negative, of Agent / Domain)
E – Exclamations: emotive expressions
H – Hazing (Testing / Flaming)
F – Flirting: playful question or offer
Q – Information Request (about Domain / Agent / Modality / Other)
R – Requests (for Help / Clarification / Motion / Domain-specific)
A – Answers: responses to agent questions/offers
S – Statements (Social topic / Domain-related / Responsive / Other)
L – Utterance in another language
G – Garbage: incomprehensible to annotator
O – Other Addressee
9 Scheme 1: Domain Actions
Increasingly detailed sub-categorization of acts relevant to domain activities and topics. Categories are defined empirically and by need: the distinctions the agent must recognize to respond appropriately to the user's actions.
Code – Domain-Specific Information Request
QDG – General Army Island information
QDL – Location/navigational questions
QDE – Specific events/activity information
QDJ – Information about joining the Army
QDU – About getting a uniform
QDA – Other questions about the US Army
QMB – Biographical info about the agent
QMJ – Agent's job/purpose
QMP – Agent's preferences (favorite food, etc.)
QMI – Agent's immediate experience
QMA – Agent's agency (are you a real person?)
10 Scheme 2: Evaluative Codes
Code – Value
3 – Fully Appropriate Response
NR3 – No Response (Appropriate Silence)
2 – Partially Appropriate Response: adequate, but lacks some coherence or relevance
RR – Requests Repair
NR1 – No Response (Inappropriate Silence)
1 – Inappropriate Response
11 Example Annotation
Speaker – Utterance – Code
Moleno – Good evening, I'm Lt Moleno, the Army Island Duty Officer – I
Moleno – Here is a calling card if you ever need to call me – I
Moleno – How can I help you? – I
User – hold on [pause, 1 min. 26 seconds] – RMW-NR3
User – Hey i wanna go sky diving – SDP-3
Moleno – Jumpmaster Neiro at the Skydiving Center will help you.
User – i wanna fly that black thing – SDP-RR
Moleno – Could you rephrase that?
User – the airplane do you compute – SDP-1
Moleno – I don't know but I will find out and get back to you.
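A small sketch of how the paired annotations above could be represented; this is an assumed representation for illustration, not the authors' annotation format. Each user utterance carries a Scheme 1 action code and a Scheme 2 evaluation of the agent's response.

```python
from typing import NamedTuple

class PairedAnnotation(NamedTuple):
    speaker: str
    utterance: str
    action_code: str  # Scheme 1: dialogue/domain action of the user utterance
    eval_code: str    # Scheme 2: quality of the agent's response

# The user turns from the example above
example = [
    PairedAnnotation("User", "hold on", "RMW", "NR3"),                      # appropriate silence
    PairedAnnotation("User", "Hey i wanna go sky diving", "SDP", "3"),      # fully appropriate
    PairedAnnotation("User", "i wanna fly that black thing", "SDP", "RR"),  # repair request
    PairedAnnotation("User", "the airplane do you compute", "SDP", "1"),    # inappropriate
]
```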
12 Agreement Measures
13 Results 1: Overview
Appropriateness Rating: AR = ('3' + NR3) / Total = 0.56
Response Precision: RP = '3' / ('3' + '2' + RR + '1') = 0.50
Rating – Result (% of Total)
3 – 167 (24.6%)
NR3 – 211 (31.1%)
2 – 67 (9.9%)
RR – 73 (10.8%)
NR1 – 65 (9.6%)
1 – 95 (14.0%)
Total – 678
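A short worked sketch (Python, illustrative only) that applies the two formulas exactly as written on this slide to the counts in the table above.

```python
# Rating counts from the results table above
counts = {"3": 167, "NR3": 211, "2": 67, "RR": 73, "NR1": 65, "1": 95}
total = sum(counts.values())  # 678

# Appropriateness Rating: AR = ('3' + NR3) / Total  ->  378 / 678 ≈ 0.56
ar = (counts["3"] + counts["NR3"]) / total

# Response Precision: RP = '3' / ('3' + '2' + RR + '1'), i.e. fully appropriate
# responses out of all turns where the agent actually produced a response
rp = counts["3"] / (counts["3"] + counts["2"] + counts["RR"] + counts["1"])
```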
14 Results 2: Silence & Multiparty
Quality of Silences: AR_nr = NR3 / (NR3 + NR1) = 0.76
By considering the two schemes together, we can examine performance on specific subsets of the data:
– Performance in multiparty dialogues on utterances addressed to others: Appropriate (AR) = , Precision (RP) = 0.147
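A sketch of the subset analysis this slide describes, assuming a simple list of (action_code, eval_code) pairs like the example representation shown earlier: filter by the Scheme 1 code, then recompute the Scheme 2 metrics over just that subset.

```python
def appropriateness(pairs):
    """AR over a list of (action_code, eval_code) pairs."""
    ok = sum(1 for _, e in pairs if e in ("3", "NR3"))
    return ok / len(pairs) if pairs else 0.0

def quality_of_silences(pairs):
    """AR_nr = NR3 / (NR3 + NR1): how often the agent's silences were appropriate.
    With the overall counts from the previous slide: 211 / (211 + 65) ≈ 0.76."""
    nr3 = sum(1 for _, e in pairs if e == "NR3")
    nr1 = sum(1 for _, e in pairs if e == "NR1")
    return nr3 / (nr3 + nr1) if (nr3 + nr1) else 0.0

# e.g. restrict to utterances addressed to others (Scheme 1 code 'O'):
# other_addressee = [(a, e) for a, e in annotations if a.startswith("O")]
# appropriateness(other_addressee)
```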
15 Results 3: Combined Overview
Per-category breakdown (columns: Category, Total #, AR, RP) for:
– Dialogue General
– Answer/Acceptance
– Requests
– Information Requests
– Critiques
– Statements
– Hazing
– Exclamations/Emotive
– Other Addressee
16 Results 4: Domain Performance
– 461 utterances fell into the 'actual domain'
– 410 of these (89%) were actions covered in the agent's design
– The remaining 51 were not anticipated in the initial design; performance on these is much lower
17 Conclusion
– General performance scores may be used to measure system progress over time
– The paired coding method allows analysis that provides specific direction for agent improvement
– The general method may be applied to the evaluation of a variety of agents
18 Thank You Questions?