Crowdsourcing for Spoken Dialogue System Evaluation Ling 575 Spoken Dialog April 30, 2015.

Roadmap
Spoken Dialogue Systems (SDS) Evaluation
The PARADISE Model
Challenges: Judgments, Portability
Data: Let's Go!, Spoken Dialogue Challenge
Crowdsourcing SDS Evaluation
Predicting SDS scores: Semi-supervised learning; Collaborative filtering

Evaluating Spoken Dialogue Systems
Q: How good is a spoken dialogue system (SDS)?
A: Ask users! Typically conduct a survey after the interaction.
Q: What factors contribute to user satisfaction?
Combine measures spanning:
  Task success: found information, recognized the user's request, etc.
  Dialogue cost: # turns, dialogue length, # ASR errors, etc.

PARADISE Model
Q: How do we weight these components? How can we build a predictive model?
PARADISE, the 'Paradigm for Dialogue System Evaluation', Walker et al. (1997):
Computes user satisfaction from surveys
Extracts success and cost measures from the dialogues
Employs multiple regression to train the weights
The resulting model should predict scores for new systems
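For concreteness, here is a minimal sketch of the PARADISE-style idea of regressing survey scores on task-success and cost measures. It is not the original implementation; the feature set, data values, and use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-dialogue measures extracted from logs:
# [task_success (0/1), n_turns, dialogue_seconds, n_asr_errors]
X = np.array([
    [1, 12,  95.0, 1],
    [0, 25, 210.0, 6],
    [1,  9,  70.0, 0],
    [0, 30, 260.0, 8],
])
# User-satisfaction scores from post-interaction surveys (e.g., a 1-5 scale)
y = np.array([4.5, 2.0, 5.0, 1.5])

# Fit regression weights over the success and cost measures
model = LinearRegression().fit(X, y)
print("weights:", model.coef_, "intercept:", model.intercept_)

# Predicted satisfaction for a new dialogue, possibly from another system
print(model.predict([[1, 15, 120.0, 2]]))
```

PARADISE additionally normalizes the measures before regression; that step is omitted here for brevity.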

Trouble in PARADISE
Adaptation: ideally, one should train on System A and apply the model to System B.
In practice, models do not transfer well from system to system; Walker et al. applied PARADISE only to variants of the same system.
User satisfaction: how can we collect ratings?
Real users don't stay on the line for surveys, and collecting 'expert' ratings is expensive and slow.

Strategy
Employ crowdsourcing to collect evaluations: lots of them, cheaply.
Employ machine learning to help generalize and adapt: semi-supervised learning, collaborative filtering.

Let's Go! Corpus
Spoken dialogue system providing bus information
Developed at CMU; deployed in Pittsburgh as the ONLY phone transit-info service after hours
Real users!!!
Simple information-seeking dialogue: bus #, start, destination, start time, arrival time
Success: gives bus information
Out-of-scope: knows only a subset of routes
Failure: no information

Let's Go! Example
S: What can I do for you?
U: Leaving Lawrenceville.
S: Leaving from Lawrenceville. Did I get that right?
U: Yes.
S: Alright. Where do you wanna go?
U: Downtown.
S: Going to downtown. Is that correct?
U: Yes.
S: Right. When do you want to travel?
U: Now.
S: The next 54C leaves MAIN STREET AT PENN at 1:10 a.m.

Crowdsourcing SDS Evaluation
Amazon Mechanical Turk: create HITs (Human Intelligence Tasks) for MTurk workers
Each HIT: text transcript of a dialogue plus a questionnaire
Q1: Did you understand what the user wanted?
Q2: Did the system provide the info the user wanted?
Q3: Did the system work as you would expect?
Q4: Overall, is this a good system?
Q5: Was the task completed successfully? (detailed)
Two HIT types:
Overall evaluation: $0.05/dialogue; 11,000 dialogues
Inter-rater agreement: $1.50 for 30 dialogues
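As a rough illustration of posting such a HIT, below is a sketch using the boto3 MTurk client. This is not the tooling used in the original work (which predates boto3); the HTML layout, reward, and other parameters are assumptions.

```python
import boto3

# Illustrative only: endpoint, reward, and question layout are assumptions.
mturk = boto3.client("mturk", region_name="us-east-1")

QUESTIONS = [
    "Did you understand what the user wanted?",
    "Did the system provide the info the user wanted?",
    "Did the system work as you would expect?",
    "Overall, is this a good system?",
    "Was the task completed successfully?",
]

def make_question_xml(transcript: str) -> str:
    """Wrap a dialogue transcript and the questionnaire as an MTurk HTMLQuestion.
    A production HIT would also need to post the assignmentId back to MTurk's
    externalSubmit endpoint; that plumbing is omitted from this sketch."""
    items = "".join(
        f"<p>Q{i + 1}: {q} <input type='text' name='q{i + 1}'/></p>"
        for i, q in enumerate(QUESTIONS)
    )
    body = f"<pre>{transcript}</pre>{items}"
    return (
        "<HTMLQuestion xmlns='http://mechanicalturk.amazonaws.com/"
        "AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd'>"
        f"<HTMLContent><![CDATA[<html><body><form>{body}</form></body></html>]]></HTMLContent>"
        "<FrameHeight>600</FrameHeight></HTMLQuestion>"
    )

hit = mturk.create_hit(
    Title="Rate a spoken dialogue transcript",
    Description="Read one dialogue and answer five questions.",
    Reward="0.05",                      # $0.05 per dialogue, as in the slides
    MaxAssignments=3,
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=make_question_xml("S: What can I do for you? ..."),
)
print(hit["HIT"]["HITId"])
```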

Validation & Collection
Issue: quality control
Automatic validation checks:
Did the worker spend enough time?
Did the worker avoid giving all the same answers?
Are the completion-related responses consistent with each other and with a heuristic classification?
11K HITs rated by 700 MTurk workers in 45 days
~8,400 dialogues accepted; $350
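A minimal sketch of such automatic validation checks follows; the thresholds, answer encoding, and field names are assumptions, not the rules used in the study.

```python
from collections import Counter

MIN_SECONDS = 20  # assumed threshold: suspiciously fast submissions are rejected

def validate_assignment(answers: dict, seconds_spent: float,
                        heuristic_task_completed: bool) -> bool:
    """Accept or reject one worker's answers for one dialogue.

    answers: {'q1': 4, 'q2': 5, ..., 'q5': 1} on a 1-5 scale (assumed encoding).
    heuristic_task_completed: rule-based guess from the system log, e.g.
    whether bus information was actually given.
    """
    ratings = [answers[k] for k in sorted(answers)]

    # 1. Spent enough time on the HIT?
    if seconds_spent < MIN_SECONDS:
        return False

    # 2. Not just the same answer for every question?
    if len(Counter(ratings)) == 1:
        return False

    # 3. Completion-related answers (Q2 and Q5) consistent with each other?
    q2_yes, q5_yes = answers["q2"] >= 4, answers["q5"] >= 4
    if q2_yes != q5_yes:
        return False

    # 4. ... and consistent with the heuristic classification from the log?
    return q2_yes == heuristic_task_completed
```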

Inter-rater Agreement
Measured with Cohen's weighted kappa
Fair-to-good agreement on the task-completion-related item
What constitutes a 'good' system is more nebulous
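For reference, weighted kappa between two raters can be computed as in the generic sketch below; the ratings shown are made up, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings of the same ten dialogues by two MTurk workers
rater_a = [5, 4, 2, 5, 1, 3, 4, 4, 2, 5]
rater_b = [5, 5, 1, 4, 1, 3, 4, 3, 2, 4]

# 'linear' (or 'quadratic') weights penalize larger disagreements more heavily
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```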

Learning to Evaluate
Semi-supervised learning approaches: employ small amounts of labeled data, augmented with large amounts of unlabeled data, to train evaluation classifiers
Compare two semi-supervised approaches:
Semi-supervised variant of SVM (S3VM)
Co-training: train two classifiers on different views of the data (dialogue features vs. meta-communication features); the base classifier is logistic regression
Supervised baselines: linear regression, logistic regression, kNN, SVM
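A minimal co-training sketch under the two-view setup described above is given below; the confidence criterion, iteration count, and growth rate are assumptions rather than the settings used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_dlg, X_meta, y, labeled_idx, rounds=10, per_round=5):
    """Co-training with two feature views (dialogue vs. meta-communication).

    X_dlg, X_meta: feature matrices for the two views (same rows).
    y: binary labels known only for labeled_idx; other entries are ignored.
    """
    labeled = set(labeled_idx)
    unlabeled = set(range(len(y))) - labeled
    y_work = np.array(y, dtype=float)  # gold labels plus pseudo-labels

    for _ in range(rounds):
        if not unlabeled:
            break
        idx = sorted(labeled)
        clf_dlg = LogisticRegression(max_iter=1000).fit(X_dlg[idx], y_work[idx])
        clf_meta = LogisticRegression(max_iter=1000).fit(X_meta[idx], y_work[idx])

        # Each view labels the unlabeled pool; its most confident predictions
        # are added to the shared labeled set for the next round.
        pool = sorted(unlabeled)
        for clf, X in ((clf_dlg, X_dlg), (clf_meta, X_meta)):
            proba = clf.predict_proba(X[pool])[:, 1]
            conf = np.abs(proba - 0.5)
            for j in np.argsort(-conf)[:per_round]:
                i = pool[j]
                if i in unlabeled:
                    y_work[i] = int(proba[j] >= 0.5)
                    labeled.add(i)
                    unlabeled.discard(i)

    idx = sorted(labeled)
    # Final classifiers trained on gold labels plus pseudo-labels
    return (LogisticRegression(max_iter=1000).fit(X_dlg[idx], y_work[idx]),
            LogisticRegression(max_iter=1000).fit(X_meta[idx], y_work[idx]))
```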

Experiments
Data: ~5K MTurk-rated dialogues; 500 held out for test; training size varied
Binary classification: good/bad
Features (all derived automatically from system logs):
Task/communication-related: # user turns, # system turns, words per user utterance, average speaking rate, # user questions, # system questions
Meta-communication-related: # barge-ins, # DTMF (touch-tone) inputs, # help requests, average recognition confidence
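A sketch of deriving such features from a parsed log is shown below; the log schema is hypothetical and stands in for whatever the Let's Go! logs actually record.

```python
def extract_features(log):
    """log: list of turn dicts, e.g.
    {'speaker': 'user'|'system', 'words': [...], 'duration': float,
     'is_question': bool, 'barge_in': bool, 'dtmf': bool,
     'help_request': bool, 'asr_confidence': float or None}
    (a made-up schema for illustration).
    """
    user = [t for t in log if t["speaker"] == "user"]
    system = [t for t in log if t["speaker"] == "system"]
    confs = [t["asr_confidence"] for t in user if t.get("asr_confidence") is not None]
    user_words = sum(len(t["words"]) for t in user)
    user_secs = sum(t["duration"] for t in user) or 1.0

    return {
        "n_user_turns": len(user),
        "n_system_turns": len(system),
        "words_per_user_utt": user_words / max(len(user), 1),
        "avg_speaking_rate": user_words / user_secs,  # words per second
        "n_user_questions": sum(t["is_question"] for t in user),
        "n_system_questions": sum(t["is_question"] for t in system),
        "n_barge_in": sum(t.get("barge_in", False) for t in user),
        "n_dtmf": sum(t.get("dtmf", False) for t in user),
        "n_help_requests": sum(t.get("help_request", False) for t in user),
        "avg_asr_confidence": sum(confs) / len(confs) if confs else 0.0,
    }
```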

Results
Training on 0.3% of the available labeled data:
At this low training rate, co-training performs best
At high training rates, all methods perform similarly

Results II: Accuracy with increasing amounts of training data (%)

Conclusions & Future Work
Crowdsourcing can support spoken dialogue system evaluation
Semi-supervised approaches, especially co-training, can enable quick adaptation to a new domain
Future work:
Compare crowdsourced to expert SDS evaluation
Contrast evaluation on transcripts with evaluation on audio
Explore more fine-grained (non-binary) classification

Thanks
Human-Computer Communication Laboratory, Chinese University of Hong Kong: Helen Meng, Irwin King, Baichuan Li, Zhaojun Yang, Yi Zhu
Spoken Dialogue Challenge: Alan Black, Maxine Eskenazi