SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions
Wolfgang Wahlster
German Research Center for Artificial Intelligence, DFKI GmbH, Stuhlsatzenhausweg, Saarbruecken, Germany
International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213
© W. Wahlster Spoken Dialogue Graphical User interfaces Gestural Interaction Multimodal Interaction SmartKom: Merging Various User Interface Paradigms Facial Expressions Biometrics
Symbolic and Subsymbolic Fusion of Multiple Modes
Speech Recognition, Gesture Recognition, Prosody Recognition, Facial Expression Recognition, Lip Reading
Subsymbolic Fusion
- Neural Networks
- Hidden Markov Models
Symbolic Fusion
- Graph Unification
- Bayesian Networks
Reference Resolution and Disambiguation
Modality-Free Semantic Representation
Outline of the Talk
1. Using all Human Senses for Symbiotic Man-Machine Interaction
2. SmartKom: Multimodal, Multilingual and Multidomain Dialogues
3. Modality Fusion in SmartKom
4. Multimodal Discourse Processing
5. Plan-based Modality Fission in SmartKom
6. Conclusions
SmartKom: A Highly Portable Multimodal Dialogue System
MM Dialogue Back-Bone
Application Layer:
- SmartKom-Mobile: Car and Pedestrian Navigation
- SmartKom-Public: Cinema, Phone, Fax, Mail, Biometrics
- SmartKom-Home/Office: Consumer Electronics, EPG
The SmartKom Consortium
SmartKom: Intuitive Multimodal Interaction
Main Contractor: DFKI Saarbrücken; Scientific Director: W. Wahlster
Partners named on the slide: MediaInterface, European Media Lab, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen
Locations: Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, Ulm
Project Budget: € 25.5 million, funded by BMBF (Dr. Reuse) and industry
Project Duration: 4 years (September 1999 – September 2003)
SmartKom's SDDP Interaction Metaphor
SDDP = Situated Delegation-oriented Dialogue Paradigm
Anthropomorphic Interface = Dialogue Partner
Interaction between User and Personalized Interaction Agent: specifies goal, delegates task, cooperate on problems, asks questions, presents results
Webservices: Service 1, Service 2, Service 3
See: Wahlster et al. 2001, Eurospeech
Multimodal Input and Output in the SmartKom System
Where would you like to sit?
Symbiotic Interaction with a Life-like Character
User: I'd like to reserve tickets for this performance.
Smartakus: Where would you like to sit?
User: I'd like these two seats.
User Input: Speech, Gesture, and Facial Expressions
Smartakus Output: Speech, Gesture and Facial Expressions
Multimodal Input and Output in SmartKom: Fusion and Fission of Multiple Modalities
Input by the User and Output by the Presentation Agent: Speech, Gesture, Facial Expressions
SmartKom's Data Collection of Multimodal Dialogs
Recording setup (figure labels): User, Side-view Camera, Bird's-eye Camera, Face-tracking Camera with Microphone, Environmental Noise, Microphone Array, Loudspeaker, LCD Beamer, SIVIT-Camera, Screen, Projected Webpage
Personalized Interaction with WebTVs via SmartKom (DFKI with Sony, Philips, Siemens)
Example: Multimodal Access to Electronic Program Guides for TV
User: Switch on the TV.
Smartakus: Okay, the TV is on.
User: Which channels are presenting the latest news right now?
Smartakus: CNN and NTV are presenting news.
User: Please record this news channel on a videotape.
Smartakus: Okay, the VCR is now recording the selected program.
Using Facial Expression Recognition for Affective Personalization
Processing ironic or sarcastic comments
(1) Smartakus: Here you see the CNN program for tonight.
(2) User: That's great.
(3) Smartakus: I'll show you the program of another channel for tonight.
(2') User: That's great.
(3') Smartakus: Which of these features do you want to see?
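The two continuations of the dialogue above differ only in how the same words, "That's great.", are interpreted. A minimal sketch of that decision follows, assuming the facial expression recognizer delivers one of the labels "negative", "neutral", or "positive". The class `Turn`, the functions `interpret` and `next_system_move`, and the move names are hypothetical and are not taken from SmartKom.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    text_polarity: str   # polarity of the words alone: "positive" or "negative"
    facial_affect: str   # recognized expression: "negative", "neutral" or "positive"


def interpret(turn: Turn) -> str:
    """Return 'accept' or 'reject' for the content the system just presented."""
    # Assumption: positive wording combined with a negative face is read as
    # irony or sarcasm, so the facial expression overrides the literal meaning.
    if turn.text_polarity == "positive" and turn.facial_affect == "negative":
        return "reject"
    return "accept" if turn.text_polarity == "positive" else "reject"


def next_system_move(turn: Turn) -> str:
    """Map the interpretation to a (hypothetical) system move."""
    moves = {"reject": "show_other_channel", "accept": "ask_which_item"}
    return moves[interpret(turn)]


# Turn (2): "That's great." with a negative facial expression
sarcastic = next_system_move(Turn("positive", "negative"))
# Turn (2'): "That's great." with a neutral facial expression
sincere = next_system_move(Turn("positive", "neutral"))
```

With these inputs, `sarcastic` corresponds to reply (3) and `sincere` to reply (3').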
Recognizing Affect: A Negative Facial Expression of the User
Figure panels: negative, neutral
The SmartKom Demonstrator System
Figure labels: Camera for Gestural Input, Camera for Facial Analysis, Microphone, Multimodal Control of TV-Set, Multimodal Control of VCR/DVD Player
Combination of Speech and Gesture in SmartKom
This one I would like to see. Where is it shown?
Multimodal Input and Output in SmartKom
Please show me where you would like to be seated.
Getting Driving and Walking Directions via SmartKom
SmartKom can be used for Multimodal Navigation Dialogues in a Car
User: I want to drive to Heidelberg.
Smartakus: Do you want to take the fastest or the shortest route?
User: The fastest.
Smartakus: Here you see a map with your route from Saarbrücken to Heidelberg.
Getting Driving and Walking Directions via SmartKom
Smartakus: You are now in Heidelberg. Here is a sightseeing map of Heidelberg.
User: I would like to know more about this church!
Smartakus: Here is some information about the St. Peter's Church.
User: Could you please give me walking directions to this church?
Smartakus: In this map, I have highlighted your walking route.
SmartKom: Multimodal Dialogues with a Hybrid Navigation System
Salient Characteristics of SmartKom
- Seamless integration and mutual disambiguation of multimodal input and output on semantic and pragmatic levels
- Situated understanding of possibly imprecise, ambiguous, or incomplete multimodal input
- Context-sensitive interpretation of dialog interaction on the basis of dynamic discourse and context models
- Adaptive generation of coordinated, cohesive and coherent multimodal presentations
- Semi- or fully automatic completion of user-delegated tasks through the integration of information services
- Intuitive personification of the system through a presentation agent
The High-Level Control Flow of SmartKom
SmartKom's Multimodal Dialogue Back-Bone
Legend: Communication Blackboards, Data Flow, Context Dependencies
Components: Analyzers (Speech, Gestures, Facial Expressions), Modality Fusion, Discourse Modeling, Action Planning, Modality Fission, Generators (Speech, Graphics, Gestures), Dialogue Manager, External Services
Unification of Scored Hypothesis Graphs for Modality Fusion in SmartKom
Inputs to Modality Fusion:
- Word Hypothesis Graph with Acoustic Scores
- Clause and Sentence Boundaries with Prosodic Scores
- Gesture Hypothesis Graph with Scores of Potential Reference Objects
- Scored Hypotheses about the User's Emotional State
Modality Fusion: Mutual Disambiguation, Reduction of Uncertainty
Intention Hypotheses Graph
Intention Recognizer: Selection of Most Likely Interpretation
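Mutual disambiguation across scored hypotheses can be illustrated with a small sketch. It is not SmartKom's graph unification: it assumes flat hypothesis lists instead of graphs, a simple type-compatibility check instead of unification, and a product of scores as the joint score. The function `fuse`, the tuple layouts, and all score values are hypothetical.

```python
from itertools import product


def fuse(speech_hyps, gesture_hyps):
    """Return the best compatible (intent, object, joint_score) triple, or None.

    speech_hyps:  list of (intent, required_object_type, score)
    gesture_hyps: list of (object_id, object_type, score)
    """
    best = None
    for (intent, wanted, s_score), (obj, obj_type, g_score) in product(
        speech_hyps, gesture_hyps
    ):
        if wanted != obj_type:
            continue  # the two hypotheses cannot describe the same intention
        joint = s_score * g_score  # assumed scoring: product of both scores
        if best is None or joint > best[2]:
            best = (intent, obj, joint)
    return best


# Hypothetical recognizer outputs for one user turn
speech = [("show_info", "cinema", 0.6), ("reserve", "seat", 0.4)]
gesture = [("seat_12", "seat", 0.7), ("cinema_17a", "cinema", 0.3)]

best = fuse(speech, gesture)
```

In this example the top-ranked speech hypothesis loses: the gesture evidence favours a seat, so the lower-ranked "reserve" reading is selected. That is the sense in which each modality reduces the uncertainty of the other.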
SmartKom's Computational Mechanisms for Modality Fusion and Fission
M3L: Modality-Free Semantic Representation
Mechanisms used by Modality Fusion and Modality Fission: Ontological Inferences, Unification, Overlay Operations, Planning, Constraint Propagation
The Overlay Operation Versus the Unification Operation
Overlay is a nonmonotonic and noncommutative unification-like operation that inherits (non-conflicting) background information.
Two sources of conflicts:
- conflicting atomic values: overwrite background (old) with covering (new)
- type clash: assimilate background to the type of covering; recursion
cf. J. Alexandersson, T. Becker 2001
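The contrast between the two operations can be sketched on nested dictionaries. This is a simplification under stated assumptions: plain Python dicts stand in for typed feature structures, and the type-clash case (assimilating the background to the type of the covering) is left out because it needs a type hierarchy. The names `unify`, `overlay`, and `UnificationFailure` are illustrative only.

```python
class UnificationFailure(Exception):
    """Raised when two structures carry conflicting atomic values."""


def unify(a, b):
    """Monotonic unification: merge both structures, fail on any conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            out[key] = unify(a[key], value) if key in a else value
        return out
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


def overlay(covering, background):
    """Nonmonotonic overlay: covering (new) wins, background (old) fills gaps."""
    if isinstance(covering, dict) and isinstance(background, dict):
        out = dict(background)  # inherit non-conflicting background information
        for key, value in covering.items():
            out[key] = overlay(value, background[key]) if key in background else value
        return out
    return covering  # conflicting atomic values: overwrite old with new


# Hypothetical structures for "films on TV tonight" followed by "rather the movies"
background = {"genre": "film", "time": "tonight", "medium": "tv"}
covering = {"medium": "cinema"}

merged = overlay(covering, background)
```

Here `merged` keeps the genre and time from the background and takes the medium from the covering, whereas `unify(covering, background)` raises `UnificationFailure`. Swapping the arguments of `overlay` changes the result, which shows that the operation is noncommutative.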
Overlay Operations Using the Discourse Model
Augmentation and Validation
- compare with a number of previous discourse states: fill in consistent information, compute a score
- for each hypothesis-background pair: Overlay(covering, background)
Figure labels: Covering, Background, Intention Hypothesis Lattice, Selected Augmented Hypothesis Sequence
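The augmentation loop can be sketched as follows. The slide does not give the scoring formula, so the score used here (hypothesis score plus a bonus per inherited feature minus a penalty per conflict) and its weights are assumptions, as are the flat dictionaries and the names `overlay_scored` and `best_augmentation`.

```python
def overlay_scored(covering, background):
    """Overlay two flat feature dicts; also count inherited and conflicting features."""
    result = dict(background)
    result.update(covering)  # covering overwrites conflicting background values
    inherited = sum(1 for key in background if key not in covering)
    conflicts = sum(
        1 for key in covering if key in background and background[key] != covering[key]
    )
    return result, inherited, conflicts


def best_augmentation(hypotheses, history):
    """Try every hypothesis-background pair and keep the best-scoring overlay."""
    best = None
    for hypothesis, hyp_score in hypotheses:
        for background in history:
            result, inherited, conflicts = overlay_scored(hypothesis, background)
            # Assumed weighting: reward filled-in information, punish conflicts.
            score = hyp_score + 0.1 * inherited - 0.2 * conflicts
            if best is None or score > best[0]:
                best = (score, result)
    return best


# Hypothetical previous discourse states and scored intention hypotheses
history = [
    {"act": "inform", "medium": "tv", "time": "tonight"},
    {"act": "reserve", "cinema": "Europa"},
]
hypotheses = [
    ({"act": "reserve", "seats": 2}, 0.7),
    ({"act": "inform", "medium": "cinema"}, 0.5),
]

score, augmented = best_augmentation(hypotheses, history)
```

The reservation hypothesis is completed with the cinema from the matching discourse state, because that pair fills in information without producing a conflict.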
An Example of the Overlay Operation
Generalisation and Specialisation: Films on TV tonight, Go to the movies
U: What films are shown on TV tonight?
....
U: I'd rather go to the movies.
SmartKom's Three-Tiered Discourse Model
Layers: Modality Layer, Discourse Layer, Domain Layer
System: This [ ] is a list of films showing in Heidelberg.
User: Please reserve a ticket for the first one.
Figure nodes: DO 1, DO 2, DO 3, DO 9, DO 10, DO 11, DO 12; LO 2, LO 3, ..., LO 4, LO 5, LO 6; VO 1; GO 1, ...; DomainObject 1, DomainObject 2
Words attached to nodes: heidelberg, list, reserve, ticket, first
DO = Discourse Object, LO = Linguistic Object, GO = Gestural Object, VO = Visual Object
cf. M. Löckelt et al. 2002, N. Pfleger 2002
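How the three layers cooperate to resolve "the first one" can be sketched with a small data structure. The layer names follow the slide; the classes, the method `resolve_ordinal`, and the film titles are hypothetical, and the resolution rule (take the most recently mentioned object whose domain object is a collection) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DomainObject:            # Domain Layer: the application-level entity
    name: str
    items: list = field(default_factory=list)


@dataclass
class DiscourseObject:         # Discourse Layer: what the dialogue is about
    label: str
    domain: DomainObject


@dataclass
class ModalityObject:          # Modality Layer: LO, VO or GO realizing a DO
    kind: str
    surface: str
    refers_to: DiscourseObject


class DiscourseModel:
    def __init__(self):
        self.modality_layer = []

    def mention(self, kind, surface, discourse_object):
        """Record a linguistic, visual or gestural mention of a discourse object."""
        self.modality_layer.append(ModalityObject(kind, surface, discourse_object))

    def resolve_ordinal(self, index):
        """Resolve 'the n-th one' against the most recent mentioned collection."""
        for mention in reversed(self.modality_layer):
            items = mention.refers_to.domain.items
            if len(items) > index:
                return items[index]
        return None


model = DiscourseModel()
films = DiscourseObject("list", DomainObject("film_list", ["Film A", "Film B"]))
city = DiscourseObject("heidelberg", DomainObject("Heidelberg"))
model.mention("VO", "list shown on screen", films)   # system output, visual
model.mention("LO", "Heidelberg", city)              # system output, spoken

first_one = model.resolve_ordinal(0)
```

Because the system's own visual output is recorded as a VO, the user's later phrase "the first one" can be resolved even though the list was never spoken.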
The High-Level Control Flow of SmartKom
Smartakus is a Self-Animated Interface Agent
Smartakus uses body language to notify the user that it is waiting for his input, that it is listening to him, that it has trouble understanding his input, or that it is trying hard to find an answer to his question.
Figure panels: Idle Time, Navigation, Presentation, System State
Some Complex Behavioural Patterns of the Interaction Agent Smartakus
M3L Representation of the Multimodal Discourse Context
Blackboard with Presentation Context of the Previous Dialogue Turn
[...] cinema_17a Europa [...] pid1234 [...]
[...] cinema_17a Europa [...] pid1234 [...]
M3L Specification of a Presentation Task
EuroSport T14:00: T15:00:00 Sport News sport... leanForward APGOAL3000 generatorAction GraphicsAndSpeech
SmartKom's Presentation Planner
The Presentation Planner generates a Presentation Plan by applying a set of Presentation Strategies to the Presentation Goal.
Plan nodes shown: GlobalPresent, Present, AddSmartakus, DoLayout, EvaluatePersonaNode, Inform, TryToPresentTVOverview, ShowTVOverview, SetLayoutData, PersonaAction, SendScreenCommand, GenerateText, ..., Speak
Plan results: Generation of Layout, Smartakus Actions
cf. J. Müller, P. Poller, V. Tschernomas 2002
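Applying presentation strategies to a presentation goal amounts to hierarchical decomposition, which can be sketched as below. The node names are taken from the slide, but the parent-child structure in `STRATEGIES` is a guess for illustration only (the tree layout is not recoverable from the slide text), and the function `plan` is hypothetical.

```python
# Assumed decomposition: each strategy expands a goal into ordered subgoals.
STRATEGIES = {
    "GlobalPresent": ["Present", "AddSmartakus"],
    "Present": ["TryToPresentTVOverview", "Inform"],
    "TryToPresentTVOverview": ["ShowTVOverview", "SetLayoutData"],
    "Inform": ["GenerateText", "Speak"],
    "AddSmartakus": ["PersonaAction"],
}


def plan(goal, strategies):
    """Expand a goal depth-first until only primitive actions remain."""
    if goal not in strategies:
        return [goal]  # primitive action: no strategy applies
    actions = []
    for subgoal in strategies[goal]:
        actions.extend(plan(subgoal, strategies))
    return actions


presentation_plan = plan("GlobalPresent", STRATEGIES)
```

The resulting list of primitive actions is what would be handed to the layout generator, the text generator, and the character animation in such a design.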
SmartKom's Use of Semantic Web Technology
Three Layers of Annotations
Personalized Presentation:
- M3L: Content, high
- XML: Structure, medium
- HTML: Layout, low
cf.: Dieter Fensel, James Hendler, Henry Lieberman, Wolfgang Wahlster (eds.), Spinning the Semantic Web, MIT Press, November 2002
Conclusions
- Various types of unification, overlay, constraint processing, planning and ontological inferences are the fundamental processes involved in SmartKom's modality fusion and fission components.
- The key function of modality fusion is the reduction of the overall uncertainty and the mutual disambiguation of the various analysis results based on a three-tiered representation of multimodal discourse.
- We have shown that a multimodal dialogue system must not only understand and represent the user's input, but also its own multimodal output.
First International Conference on Perceptive & Multimodal User Interfaces (PMUI'03)
November 5-7, 2003, Delta Pinnacle Hotel, Vancouver, B.C., Canada
Conference Chair: Sharon Oviatt, Oregon Health & Science Univ., USA
Program Chairs: Wolfgang Wahlster, DFKI, Germany; Mark Maybury, MITRE, USA
PMUI'03 is sponsored by ACM, and will be co-located in Vancouver with ACM's UIST'03. This meeting follows three successful Perceptive User Interface Workshops (with PUI'01 held in Florida) and three International Multimodal Interface Conferences initiated in Asia (with ICMI'02 held in Pittsburgh).