Spoken Dialogue Systems Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003
Outline Discourse Research Issues Spoken Dialogue Systems Pragmatics (dialogue acts) Dialogue management Multimodal Systems Examples
Definitions Discourse Monologue Dialogue
Discourse: Research Issues Reference resolution, e.g., “That was a lie” Anaphora, e.g., “John left …. He was bored.” Co-reference, e.g., “John” and “He” refer to the same entity Text coherence, e.g., Coherence: “John left early. He was tired” Incoherence: “John left early. He likes spinach”
Spoken Dialogue Systems: Concepts Turn-taking Dialogue Segmentation Grounding Backchannel, e.g., ‘Mm Hmm’ Acknowledgment Explicit/implicit confirmation Implicature “What time are you flying?” “Well, I have a meeting at three” Initiative “What time are you flying?” “Don’t feel like booking the flight right now. Let’s look at hotels”
Speech, Dialogue and Application Acts Speech Acts (Austin 1962, Searle 1975) Assertive (conclude), Directive (ask, order), Commissive (promise), Expressive (apologize, thank), Declarations Dialogue Acts Statement, Info-Request, Wh-Question, Yes-No Question, Opening, Closing, Open-Option, Action-Directive, Offer, Commit, Agree, etc. Application Acts Domain specific but general, e.g., Info-Request into system’s semantic state, Info-Request into database, Info-Request into database results
Dialogue/Application Act Classification Semantic parsing followed by deterministic rules, e.g., an utterance that starts with ‘what’, ‘when’, ‘where’, or ‘who’ is a Wh-Question Bayesian Formulation Given a sentence W, the most probable dialogue act A is argmax_A P(A|W) = argmax_A P(W|A) P(A) P(W|A) can be an n-gram model, one for each dialogue act P(A) can also be an n-gram model of dialogue acts
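The Bayesian formulation above can be illustrated with a minimal sketch. It assumes the simplest case: P(W|A) is a unigram model per dialogue act with add-one smoothing, and P(A) is a relative-frequency prior rather than an n-gram over the act sequence. The training sentences, the function names train and classify, and the smoothing choice are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Collect counts from (dialogue_act, sentence) pairs."""
    act_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for act, sentence in examples:
        words = sentence.lower().split()
        act_counts[act] += 1
        word_counts[act].update(words)
        vocab.update(words)
    return act_counts, word_counts, vocab

def classify(sentence, model):
    """Return argmax_A of log P(W|A) + log P(A)."""
    act_counts, word_counts, vocab = model
    total_acts = sum(act_counts.values())
    best_act, best_score = None, -math.inf
    for act in act_counts:
        # log P(A): relative-frequency prior over dialogue acts
        score = math.log(act_counts[act] / total_acts)
        n_words = sum(word_counts[act].values())
        for word in sentence.lower().split():
            # log P(w|A): unigram model with add-one smoothing
            # (one extra vocabulary slot reserved for unseen words)
            score += math.log((word_counts[act][word] + 1) / (n_words + len(vocab) + 1))
        if score > best_score:
            best_act, best_score = act, score
    return best_act

# Toy training data (hypothetical, for illustration only)
TOY_DATA = [
    ("Wh-Question", "what time does the flight leave"),
    ("Wh-Question", "where is the hotel"),
    ("Yes-No Question", "is the flight direct"),
    ("Yes-No Question", "does the flight leave today"),
    ("Statement", "i am flying to boston"),
    ("Statement", "i want a morning flight"),
]
model = train(TOY_DATA)
print(classify("what time is the flight", model))
```

A higher-order n-gram for P(W|A), or an n-gram over the preceding dialogue acts for P(A), would replace the two log-probability terms without changing the argmax structure.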
Dialogue Management 1 Frame-based, e.g., DeptCity “From what city are you leaving?” GRM_CITY ArrCity “Where are you flying to?” GRM_CITY DeptTime “What time would you like to fly?” GRM_TIME DeptDate “When are you flying?” GRM_DATETIME DeptTime Finite state machine dialogue manager Mostly system-initiated dialogue VXML-like dialogue structure (forms and frames)
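A frame-based, system-initiated dialogue of this kind can be sketched as a loop over slots, each with a prompt and a grammar label. The slot names, prompts, and grammar labels are taken from the slide; the Slot class, run_form, the max_attempts limit, and the scripted recognize callback (a stand-in for ASR plus parsing) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Slot:
    name: str
    prompt: str
    grammar: str  # grammar label only; not interpreted in this sketch

FLIGHT_FORM: List[Slot] = [
    Slot("DeptCity", "From what city are you leaving?", "GRM_CITY"),
    Slot("ArrCity", "Where are you flying to?", "GRM_CITY"),
    Slot("DeptTime", "What time would you like to fly?", "GRM_TIME"),
    Slot("DeptDate", "When are you flying?", "GRM_DATETIME"),
]

def run_form(form: List[Slot],
             recognize: Callable[[str, str], Optional[str]],
             max_attempts: int = 3) -> Dict[str, Optional[str]]:
    """Prompt for each slot in order; re-prompt until it is filled or attempts run out."""
    frame: Dict[str, Optional[str]] = {slot.name: None for slot in form}
    for slot in form:
        for _ in range(max_attempts):
            value = recognize(slot.prompt, slot.grammar)
            if value is not None:
                frame[slot.name] = value
                break
    return frame

def scripted_user(answers):
    """Hypothetical recognizer that replays canned answers (None = no match)."""
    remaining = iter(answers)
    return lambda prompt, grammar: next(remaining)

frame = run_form(FLIGHT_FORM,
                 scripted_user(["Austin", None, "Newark", "morning", "August fifth"]))
print(frame)
```

The fixed slot order is what makes the dialogue system-initiated: the user can only answer the question currently being asked.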
Dialogue Management 2 Application Independent Flow Chart structure Generic dialogue/application manager (really this is more like a controller)
Dialogue Management 3 Generalized Finite State Machine Dialogue Management Application Dependent but General Dialogue Superstates Fill: adaptive dialogue module, uses dynamic e-forms to elicit AV pairs from the user; resolves value and tree-position ambiguities Navigate: presents database results and lets the user select the appropriate ones [Figure: flow chart with superstates Fill, Verify, Create Query, Navigate and Yes/No decisions “Is Full” and “Is Correct”]
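The superstate flow can be sketched as a small transition function. The state and decision names come from the slide; the exact arrows are an assumption read off the diagram labels (Fill repeats until the form is full, Verify returns to Fill when the values are not correct, and Create Query leads to Navigate). The function names next_state and trace are hypothetical.

```python
def next_state(state, is_full, is_correct):
    """Return the next superstate given the two decision outcomes."""
    if state == "Fill":
        # "Is Full?": stay in Fill until all required attributes have values
        return "Verify" if is_full else "Fill"
    if state == "Verify":
        # "Is Correct?": go back to Fill if the user rejects the values
        return "Create Query" if is_correct else "Fill"
    if state == "Create Query":
        return "Navigate"
    return "Navigate"  # Navigate is treated as the final superstate here

def trace(observations, start="Fill"):
    """Apply a sequence of (is_full, is_correct) observations and record the states visited."""
    state = start
    visited = [state]
    for is_full, is_correct in observations:
        state = next_state(state, is_full, is_correct)
        visited.append(state)
    return visited

print(trace([(False, False), (True, False), (True, False),
             (True, True), (True, True), (True, True)]))
```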
Advanced Dialogue Systems Mixed Initiative: Allow user to say anything (global grammar active at all states), e.g., “What date are you flying?” “I am flying next Tuesday in the morning” Allow user to navigate the system’s state machine, e.g., “I would like to look at hotels first” Open prompts, give user the initiative, e.g., “What next?” Advanced dialogue features Corrections, e.g., “No not Boston, Atlanta” Negation, e.g., “Anything but Olympic” Complex semantic expressions, e.g., “tomorrow evening or Sunday morning” Ambiguity resolution and representation, e.g., “next Tuesday” Persistent Semantics, e.g., “Info about his organization”
Ambiguity Representation Position ambiguity Value ambiguity
Error Correction Mechanisms Sys: How can I help you? Usr: I’d like to fly from Austin to Newark on August fifth in the morning Asr: I’d like to fly from Boston to Newark on August fifth in the morning Sys: I've got you leaving Boston on Sunday, August fifth in the morning and arriving in Newark airport. What is your preferred airline?
Usr: Leaving from Austin, Texas Sys: Sorry …, are you leaving from Austin, Texas, or from Boston? Usr 1: Austin, Texas Sys: Leaving from Austin, Texas. Usr 2: Change the departure city to Austin, Texas Alternate: use error correction
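One way to read the exchange above: a restated value that conflicts with the stored one creates a value ambiguity and triggers a clarification question, while an explicit change request overwrites the stored value directly. The sketch below is an assumption about that behavior, not the system's actual mechanism; update_slot, departure_prompt, and the explicit_change flag are hypothetical names.

```python
def update_slot(frame, slot, value, explicit_change=False):
    """Update the candidate list of a slot after a user utterance."""
    candidates = frame.get(slot, [])
    if explicit_change or not candidates or value in candidates:
        # Explicit correction, first fill, or the user picking one of the candidates
        frame[slot] = [value]
    else:
        # Conflicting restatement: keep both values as a value ambiguity
        frame[slot] = [value] + candidates
    return frame[slot]

def departure_prompt(frame):
    """Build the system response for the departure city slot."""
    candidates = frame["DeptCity"]
    if len(candidates) > 1:
        return "Sorry, are you leaving from " + ", or from ".join(candidates) + "?"
    return "Leaving from " + candidates[0] + "."

frame = {"DeptCity": ["Boston"]}          # misrecognized value from the first turn
update_slot(frame, "DeptCity", "Austin, Texas")
print(departure_prompt(frame))            # clarification question
update_slot(frame, "DeptCity", "Austin, Texas")
print(departure_prompt(frame))            # confirmation

alt = {"DeptCity": ["Boston"]}            # alternate path: explicit correction
update_slot(alt, "DeptCity", "Austin, Texas", explicit_change=True)
print(departure_prompt(alt))
```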
Spoken Dialogue System Architecture [Figure: block diagram with components Controller, Database, Parser, TTS, Platform, ASR, Telephony, Generation, App. Controller, DM/Initiative, Interpreter/Context Tr., AI, …]
System Architecture and Portability Ambiguity representation Pragmatic Confidence Scores Application dependent Application independent Dialogue Manager Semantics Pragmatics Generation Parser Semantic Interpreter Context Tracker Pragmatic Interpreter Expert Domain Knowledge Initiative Tracking Utterance Planner Surface Realizer Controller
Advantages of application-centric system design: Increased modularity. Flexible multi-stage data collection. Extensible to multi-modal input (universal access).
Multimodal Systems Definition Input Modalities/Output Media Research Issues User Interface Design Semantic Module Examples
Input Modalities/Output Media Unimodal: Speech input/Speech output. Multimodal: Speech+DTMF input/Speech output. Speech input/Speech and GUI output. Speech and pen input/Speech and GUI output. Definitions: Pen input: buttons, pull-down menus, graffiti, pen gestures. GUI output: text and graphics [Figure labels: S (speech), D (DTMF), P (pen), G (GUI), and the combinations S+D, S+P, S+G]
Issues Semantic/Pragmatic Module: Merging semantic information from different modalities, e.g., “Draw a line from here to there” Ambiguity representation and resolution User Interface: Synergies between input modalities Turn-taking and appropriate mix of modalities Maintain interface consistency Focus/context visualization System issues: Synchronization and latency
Semantic and Pragmatic Module [Figure: two parallel input pipelines. Speech input “July fifth” -> NL Parser -> “July”, “fifth” -> NL Interpreter -> {“date”, “Jul 5, 2002”} -> Context Tracking -> {“travel.flight.leg1.departure.date”, “Jul 5, 2002”} -> Pragmatic Analysis -> {“travel.flight.leg1.departure.date”, “Jul 5, 2002”, 0.4}. GUI input “7/10” -> GUI Parser -> “7”, “/”, “10” -> GUI Interpreter -> {“date”, “Jul 10, 2002”} -> Context Tracking -> {“travel.flight.leg1.departure.date”, “Jul 10, 2002”} -> Pragmatic Analysis -> {“travel.flight.leg1.departure.date”, “Jul 10, 2002”, 0.9}. Both results feed Update Semantic Tree & Pragmatic Scores.]
[Figure: semantic tree. travel > flight > leg 1 > departure: city {“BOS”, 0.5}, date {“Jul 5, 2002”, 0.4} {“Jul 10, 2002”, 0.9}; travel > flight > leg 1 > arrival: city {“NYC”, 0.5}]
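The semantic tree with scored candidates can be sketched as a mapping from dotted attribute paths to {value: pragmatic score} candidates. The paths, values, and scores are those shown on the slides; the flat dictionary in place of a nested tree, the merge rule (keep the highest score seen for each value), and the selection rule (pick the top-scoring value) are simplifying assumptions, and merge, best, and is_ambiguous are hypothetical names.

```python
from collections import defaultdict

def merge(tree, path, value, score):
    """Add a scored candidate at a path, keeping the highest score seen for that value."""
    candidates = tree[path]
    candidates[value] = max(score, candidates.get(value, 0.0))

def best(tree, path):
    """Return the candidate value with the highest pragmatic score."""
    candidates = tree[path]
    return max(candidates, key=candidates.get)

def is_ambiguous(tree, path):
    """A path holds a value ambiguity when it has more than one candidate."""
    return len(tree[path]) > 1

tree = defaultdict(dict)
merge(tree, "travel.flight.leg1.departure.city", "BOS", 0.5)
merge(tree, "travel.flight.leg1.arrival.city", "NYC", 0.5)
merge(tree, "travel.flight.leg1.departure.date", "Jul 5, 2002", 0.4)   # from speech
merge(tree, "travel.flight.leg1.departure.date", "Jul 10, 2002", 0.9)  # from GUI

print(best(tree, "travel.flight.leg1.departure.date"))
print(is_ambiguous(tree, "travel.flight.leg1.departure.date"))
```

Keeping both date candidates, rather than overwriting the lower-scoring one, is what allows the ambiguity to be displayed and resolved later.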
Multi-Modal User Interface Emphasis on synergies between modalities: Value(s) of attributes are displayed graphically Erroneous values can be easily corrected via the GUI Focus (aka context) of speech modality is highlighted Position and value ambiguity are shown (and typically resolved) via the GUI Voice prompts are significantly shorter and mostly used to emphasize information that is already displayed graphically GUI takes full advantage of intelligence of voice UI, e.g., ‘round trip’ speech input will ‘gray out’ the third leg button in the GUI Seamless integration of semantics from the two modalities using modality-specific pragmatic scores
Example 1: Flight First Leg ASR: “I want to fly from Boston to New York on September 6th.” [Screenshot callouts: new focus, field disabled, navigation buttons]
Example 2: Flight Second Leg ASR: “round trip” [Screenshot callouts: value induction, button disabled]
Example 3: Car Rental ASR: “I want a compact car from AVIS” GUI: “rental” button pressed
Example 4: Ambiguity and Errors
Mixing the Modalities: Turn-Taking “Click to talk” vs. “open mike” “Click to talk” can be restrictive “Open mike” can be confusing (falling out of turn) Both have limitations Often there is a dominant modality based on Type of input, e.g., “select from menu” vs. “enter free text” Recent input history User preferences System automatically selects the dominant modality and the user can click to change it Dominant modality selection algorithm is adaptive
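The slides do not give the selection algorithm, so the following is only a sketch of how the three listed factors (type of input, recent input history, user preference) and the click override could be combined. The ModalitySelector class, the additive scoring, the weights, and the history window of five inputs are all hypothetical choices.

```python
from collections import deque

class ModalitySelector:
    """Pick a dominant input modality ("speech" or "gui") for the next field."""

    def __init__(self, preferred="speech", history_size=5):
        self.preferred = preferred                 # user preference
        self.history = deque(maxlen=history_size)  # recent input history
        self.override = None                       # set when the user clicks to change

    def record_input(self, modality):
        self.history.append(modality)

    def user_click(self, modality):
        self.override = modality

    def dominant(self, field_type):
        # A user click wins once, then automatic selection resumes
        if self.override is not None:
            chosen, self.override = self.override, None
            return chosen
        scores = {"speech": 0.0, "gui": 0.0}
        # Type of input: menus favor the GUI, free text favors speech
        scores["gui" if field_type == "menu" else "speech"] += 1.0
        # Recent input history: each recent input adds a fraction of a point
        for modality in self.history:
            scores[modality] += 1.0 / self.history.maxlen
        # User preference
        scores[self.preferred] += 0.5
        return max(scores, key=scores.get)

selector = ModalitySelector(preferred="speech")
print(selector.dominant("menu"))        # no history yet
for _ in range(5):
    selector.record_input("speech")     # user keeps speaking
print(selector.dominant("menu"))        # history now shifts the choice
```

The second call illustrates the adaptive part: the same field type yields a different dominant modality once the recent history is taken into account.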