Spoken Dialogue System Architecture

Spoken Dialogue System Architecture Joshua Gordon CS4706

Outline
- Goals of an SDS architecture
- Research challenges
- Practical considerations
- An end-to-end tour of a real-world SDS

SDS Architectures
- Software abstractions that orchestrate the many NLP components required for human-computer dialogue
- Conduct task-oriented, limited-domain conversations
- Manage the many levels of information processing (e.g., utterance interpretation, turn taking) necessary for dialogue, in real time and under uncertainty

Examples: Information Seeking, Transactional
- The most common type of SDS
- CMU – bus route information (Let's Go Public)
- Columbia – Virtual Librarian
- Google – directory service

Examples: Virtual Humans
- Multimodal input / output: prosody and facial expression
- Auditory and visual cues assist turn taking
- Many limitations: scripting, constrained domains
- http://ict.usc.edu/projects/virtual_humans

Examples: Interactive Kiosks
- Multi-participant conversations!
- Surprises and challenges passersby to trivia games [Bohus and Horvitz, 2009]

Examples: Robotic Interfaces
- www.cellbots.com
- Speech interface to a UAV [Eliasson, 2007]

Conversational skills
SDS architectures tie together:
- Speech recognition
- Turn taking
- Dialogue management
- Utterance interpretation
- Grounding
- Natural language generation
And increasingly include:
- Multimodal input / output
- Gesture recognition

Research challenges in every area
- Speech recognition: accuracy in interactive settings, detecting emotion
- Turn taking: fluidly handling overlap, backchannels
- Dialogue management: increasingly complex domains, better generalization, multi-party conversations
- Utterance interpretation: reducing constraints on what the user can say, and how they can say it; attending to prosody, emphasis, speech rate

A tour of a real-world SDS: CMU Olympus
- Open-source collection of dialogue system components
- Research platform used to investigate dialogue management, turn taking, and spoken language interpretation
- Actively developed
- Many implementations: Let's Go Public, TeamTalk, CheckItOut
- www.speech.cs.cmu.edu

Conventional SDS Pipeline
- Speech signals to words
- Words to domain concepts
- Concepts to system intentions
- Intentions to utterances (represented as text)
- Text to speech
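The pipeline above can be sketched as a chain of functions. This is an illustrative stand-in, not the Olympus API: each stage is stubbed with invented values, and real systems pass richer structures (n-best lists, confidence scores) between components.

```python
# Minimal sketch of the conventional SDS pipeline. Every function body
# is a placeholder; only the shape of the data flow is the point.

def recognize(audio):
    """Speech signal -> words (stubbed for illustration)."""
    return "when is the next 61c"

def understand(words):
    """Words -> domain concepts."""
    return {"dialog_act": "route_query", "route": "61c"}

def decide(concepts):
    """Concepts -> system intention."""
    return {"intent": "inform_next_departure", "route": concepts["route"]}

def generate(intention):
    """Intention -> utterance text."""
    return f"The next {intention['route'].upper()} leaves at 10:15."

def synthesize(text):
    """Text -> speech (here, just return the text to be spoken)."""
    return text

def run_pipeline(audio):
    return synthesize(generate(decide(understand(recognize(audio)))))
```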

Olympus under the hood: provider pattern

Speech recognition

The Sphinx Open Source Recognition Toolkit
- PocketSphinx: a continuous-speech, speaker-independent recognition system
- Includes tools for language model compilation, pronunciation, and acoustic model adaptation
- Provides word-level confidence annotation, n-best lists
- Efficient: runs on embedded devices (including an iPhone SDK)
- Olympus supports parallel decoding engines / models
- Typically runs parallel acoustic models for male and female speech
- http://cmusphinx.sourceforge.net/

Speech recognition challenge in interactive settings

Spontaneous dialogue is difficult for speech recognizers
- Performance is poor in interactive settings compared to one-off applications like voice search and dictation
- Performance phenomena: backchannels, pause-fillers, false starts…
- OOV (out-of-vocabulary) words
- Interaction with an SDS is cognitively demanding for users: What can I say, and when? Will the system understand me?
- Uncertainty increases disfluency, resulting in further recognition errors

WER (Word Error Rate)
- Non-interactive settings – Google Voice Search: 17% as deployed (0.57% OOV over 10k queries randomly sampled from Sept–Dec 2008)
- Interactive settings:
  - Let's Go Public: 17% in controlled conditions vs. 68% in the field
  - CheckItOut: used to investigate task-oriented performance under worst-case ASR – 30% to 70% depending on the experiment
  - Virtual Humans: 37% in laboratory conditions
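Word error rate, as quoted above, is the word-level edit distance (substitutions + insertions + deletions) between the recognized string and the reference, divided by the reference length. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate via standard Levenshtein dynamic programming
    over word tokens: (subs + ins + dels) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why worst-case interactive numbers like the 68% above are plausible.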

Examples of (worst-case) recognizer noise

S: What book would you like?
U: The Language of Sycamores
ASR: THE LANGUAGE OF IS .A. COMING WARS

S: Hi Scott, welcome back!
U: Not Scott, Sarah! Sarah Lopez.
ASR: SCOTT SARAH SCOUT LAW

Error Propagation
- Recognizer noise injects uncertainty into the pipeline
- Information loss occurs when moving from an acoustic signal to a lexical representation: most SDSs ignore prosody, amplitude, emphasis
- Information provided to downstream components includes an n-best list or word lattice, and low-level features: speech rate, speech energy…

Spoken Language Understanding

SLU maps from words to concepts
- Dialog acts (the overall intent of an utterance)
- Domain-specific concepts (like a book, or a bus route)
- Single utterances vs. across turns
- Challenging in noisy settings

Ex. "Does the library have Hitchhikers Guide to the Galaxy by Douglas Adams on audio cassette?"

Dialog Act: Book Request
Title: The Hitchhikers Guide to the Galaxy
Author: Douglas Adams
Media: Audio Cassette

Semantic grammars
- Domain-independent concepts: [Yes], [No], [Help], [Repeat], [Number]
- Domain-specific concepts: [Book], [Author]

Example grammar fragment:

[Quit]
    (*THANKS *good bye)
    (*THANKS goodbye)
    (*THANKS +bye)
;
THANKS
    (thanks *VERY_MUCH)
    (thank you *VERY_MUCH)
VERY_MUCH
    (very much)
    (a lot)

Grammars generalize poorly
- Useful for extracting fine-grained concepts, but…
- Hand-engineered: time-consuming to develop and tune, and requires expert linguistic knowledge to construct
- Difficult to maintain over complex domains
- Lack robustness to OOV words and novel phrasing
- Sensitive to recognizer noise

SLU in Olympus: the Phoenix Parser
- Phoenix is a semantic parser, intended to be robust to recognition noise
- Phoenix parses the incoming stream of recognition hypotheses
- Maps words in ASR hypotheses to semantic frames
- Each frame has an associated CFG grammar, specifying word patterns that match the slot
- Multiple parses may be produced for a single utterance
- The frame is forwarded to the next component in the pipeline
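To make the words-to-frame idea concrete, here is a toy sketch of slot filling in the spirit of a semantic parser like Phoenix. This is not Phoenix's grammar engine or file format: the slot names and regex patterns below are invented for this example, standing in for the word patterns a real grammar would specify.

```python
import re

# Invented slot patterns standing in for a semantic grammar. A real
# Phoenix grammar compiles CFG word patterns; plain regexes are used
# here only to illustrate the hypothesis -> frame mapping.
SLOT_PATTERNS = {
    "title": re.compile(r"(the hitchhikers guide to the galaxy|the language of \w+)"),
    "author": re.compile(r"by ([a-z]+ [a-z]+)"),
    "media": re.compile(r"on (audio cassette|cd)"),
}

def parse(hypothesis):
    """Map an ASR hypothesis to a (possibly partial) semantic frame.
    Slots whose patterns find no match are simply left unfilled,
    which is one source of robustness to recognizer noise."""
    frame = {}
    text = hypothesis.lower()
    for slot, pattern in SLOT_PATTERNS.items():
        m = pattern.search(text)
        if m:
            frame[slot] = m.group(1)
    return frame
```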

Statistical methods
- Supervised learning is commonly used for single-utterance interpretation
- Given word sequence W, find the semantic representation of meaning M that has maximum a posteriori probability P(M|W)
- Useful for dialog act identification, determining broad intent
- Like all supervised techniques, requires a training corpus, and is often domain- and recognizer-dependent
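The MAP formulation above can be sketched with a unigram naive Bayes classifier: choose the act M maximizing P(M|W) ∝ P(W|M)P(M). The tiny training corpus and act labels below are invented for illustration; real systems train on transcribed, annotated dialogues.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus of (utterance, dialog act) pairs.
TRAIN = [
    ("when does the next bus leave", "route_query"),
    ("what time is the 61c", "route_query"),
    ("yes that is right", "confirm"),
    ("no i said forbes avenue", "reject"),
]

def train(corpus):
    """Collect act priors and per-act word counts."""
    prior, counts, vocab = Counter(), defaultdict(Counter), set()
    for words, act in corpus:
        prior[act] += 1
        for w in words.split():
            counts[act][w] += 1
            vocab.add(w)
    return prior, counts, vocab

def classify(words, prior, counts, vocab):
    """argmax_M log P(M) + sum_w log P(w|M), with add-one smoothing."""
    best_act, best_lp = None, -math.inf
    total = sum(prior.values())
    for act in prior:
        lp = math.log(prior[act] / total)
        n = sum(counts[act].values())
        for w in words.split():
            lp += math.log((counts[act][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best_act, best_lp = act, lp
    return best_act
```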

Belief updating

Cross-utterance SLU
U: Get my coffee cup and put it on my desk. The one at the back.
- Difficult in noisy settings
- Mostly new territory for SDS [Zuckerman, 2009]

Dialogue Management

The Dialogue Manager
- Represents the system's agenda
- Many techniques: hierarchical plans, state/transition tables, Markov processes
- System initiative vs. mixed initiative: system initiative has less uncertainty about the dialog state, but is clunky
- Required to manage uncertainty and error handling: belief updating, domain-independent error handling strategies
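Of the techniques listed above, the state/transition table is the simplest to show. The sketch below is a system-initiative manager over an invented bus-information agenda; the states, prompts, and concept names are all illustrative, not drawn from any real system.

```python
# Minimal system-initiative dialog manager as a state/transition table:
# each state has a prompt and maps understood concepts to a next state.
STATES = {
    "ask_route": {
        "prompt": "Which bus route?",
        "next": {"route": "ask_origin"},
    },
    "ask_origin": {
        "prompt": "Where are you leaving from?",
        "next": {"place": "confirm"},
    },
    "confirm": {
        "prompt": "Shall I look that up?",
        "next": {"yes": "done", "no": "ask_route"},
    },
    "done": {"prompt": "One moment.", "next": {}},
}

def step(state, concept):
    """Advance the dialog. A concept the current state does not expect
    is treated as a non-understanding: we stay put and re-prompt."""
    return STATES[state]["next"].get(concept, state)
```

The rigidity is visible here: the system only ever accepts the one concept its current state expects, which is exactly the "less uncertainty, but clunky" trade-off of system initiative.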

Task Specification, Agenda, and Execution [Bohus, 2007]

Domain independent error handling [Bohus, 2007]

Error recovery strategies

Misunderstanding strategies:
- Explicit confirmation: "Did you say you wanted a room starting at 10 a.m.?"
- Implicit confirmation: "Starting at 10 a.m. ... until what time?"

Non-understanding strategies:
- Notify that a non-understanding occurred: "Sorry, I didn't catch that."
- Ask the user to repeat: "Can you please repeat that?"
- Ask the user to rephrase: "Can you please rephrase that?"
- Repeat the prompt: "Would you like a small room or a large one?"
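A common way to pick among such strategies is to gate them on the recognizer's confidence score. The sketch below is a simple illustration of that idea; the thresholds are invented, whereas deployed systems tune them (or learn a policy) from data.

```python
# Confidence-gated error-recovery selection. Thresholds are illustrative.
def choose_strategy(confidence):
    if confidence >= 0.85:
        return "accept"                 # proceed without confirmation
    if confidence >= 0.60:
        return "implicit_confirmation"  # "Starting at 10 a.m. ... until what time?"
    if confidence >= 0.30:
        return "explicit_confirmation"  # "Did you say 10 a.m.?"
    return "ask_repeat"                 # non-understanding: "Can you please repeat that?"
```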

Statistical Approaches to Dialogue Management
- Learn the management policy from a corpus
- Dialogue can be modeled as a Partially Observable Markov Decision Process (POMDP)
- Reinforcement learning is applied (either to existing corpora or through user simulation studies) to learn an optimal strategy
- Evaluation functions typically reference the PARADISE framework
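As a toy illustration of the reinforcement-learning idea, the sketch below learns when to confirm versus accept with tabular Q-learning against a hand-coded stand-in for a user simulator. A real system would maintain a belief state over a POMDP; the two-state environment and reward model here are invented purely to show the learning loop.

```python
import random

ACTIONS = ["accept", "confirm"]

def simulate(state, action):
    """Invented reward model standing in for a user simulator:
    confirming pays off when confidence is low, accepting when high."""
    good = "confirm" if state == "low_conf" else "accept"
    return 1.0 if action == good else -1.0

def learn_policy(episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Epsilon-greedy tabular Q-learning over single-step episodes."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in ("low_conf", "high_conf") for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice(["low_conf", "high_conf"])
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        reward = simulate(state, action)
        q[(state, action)] += alpha * (reward - q[(state, action)])
    # Greedy policy from the learned Q-values.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in ("low_conf", "high_conf")}
```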

Interaction management

The Interaction Manager
- Mediates between the discrete, symbolic reasoning of the dialog manager and the continuous, real-time nature of user interaction
- Manages timing, turn-taking, and barge-in
- Yields the turn to the user on interruption
- Prevents the system from speaking over the user
- Notifies the dialog manager of interruptions and incomplete utterances
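The barge-in behavior described above can be sketched as a small event handler: when user speech is detected while the system holds the floor, the system yields the turn and the dialog manager is notified. The class and event names below are invented for illustration, not Olympus interfaces.

```python
# Illustrative barge-in handling in an interaction manager.
class InteractionManager:
    def __init__(self):
        self.system_speaking = False
        self.notifications = []   # messages destined for the dialog manager

    def start_prompt(self):
        """System takes the floor to play a TTS prompt."""
        self.system_speaking = True

    def on_user_speech(self):
        """Voice activity detected: if the system is speaking, barge-in.
        Stop output, yield the turn, and notify the dialog manager."""
        if self.system_speaking:
            self.system_speaking = False
            self.notifications.append("interrupted")
        return self.system_speaking
```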

Natural Language Generation and Speech Synthesis

NLG and Speech Synthesis
- Template-based generation, e.g., for explicit error handling strategies: "Did you say <concept>?"
- More interesting cases arise in disambiguation dialogs
- A TTS engine synthesizes the NLG output
- The audio server allows interruption mid-utterance
- Production systems incorporate prosody and intonation contours to indicate degree of certainty
- Open-source TTS frameworks: Festival (http://www.cstr.ed.ac.uk/projects/festival/), Flite (http://www.speech.cs.cmu.edu/flite/)
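Template-based NLG, as in the "Did you say <concept>?" example above, amounts to filling concept slots into canned strings. A minimal sketch, with invented template names and strings:

```python
# Template-based NLG: render a system intention by filling slots
# into a canned template. Templates here are illustrative.
TEMPLATES = {
    "explicit_confirm": "Did you say {concept}?",
    "implicit_confirm": "Starting at {concept} ... until what time?",
    "inform": "The next {route} bus leaves at {time}.",
}

def generate(intent, **slots):
    """Intention + concept slots -> utterance text for the TTS."""
    return TEMPLATES[intent].format(**slots)
```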

Asynchronous architectures
- [Lemon, 2003] A backup recognition pass enables better discussion of OOV utterances
- [Blaylock, 2002] An asynchronous modification of TRIPS; most work is directed toward best-case speech recognition

Problem-solving architectures
- FORRSooth models task-oriented dialogue as cooperative decision making
- Six FORR-based services operate in parallel: interpretation, grounding, generation, discourse, satisfaction, interaction
- Each service has access to the same knowledge, in the form of descriptives

Thanks! Questions?