1
Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan
2
Introduction
The term "multi-modal": a general description of an application that can be operated in multiple input/output modes.
E.g. input: voice, pen, gesture, facial expression. Output: voice, graphical output.
[Also see the supplementary slides on Alex and Arthur's discussion of the definition.]
3
Multi-modal Dialogue (MMD) in Personal Navigation Systems
Motivation of this presentation:
- Navigation systems give MMD an interesting scenario, and a case for why MMD is useful.
Structure of this presentation: three system papers
- AT&T MATCH: speech and pen input, with pen gestures
- SpeechWorks Walking Directions System: speech and stylus input
- Univ. of Saarland REAL: speech and pen input; both GPS and a magnetic tracker were used
4
Multi-modal Language Processing for Mobile Information Access
5
Overall Function
A working city guide and navigation system:
- Easy access to restaurant and subway information
- Runs on a Fujitsu pen computer
- Users are free to give speech commands or draw on the display with a stylus
6
Types of Inputs
Speech input:
- "show cheap italian restaurants in chelsea"
Simultaneous speech and pen input:
- Circle an area and say "show cheap italian restaurants in neighborhood" at the same time
Functionalities include restaurant reviews and subway routing.
7
Input Overview
Speech input:
- Uses the AT&T Watson speech recognition engine
Pen input (electronic ink):
- Allows the use of pen gestures, which can be complex pen input
- Special aggregation techniques are used for these gestures
Inputs are combined using lattice combination.
8
Pen Gesture and Speech Input
For example:
U: "How do I get to this place?"
S: "Where do you want to go from?"
U: "25th St & 3rd Avenue"
9
Summary
Interesting aspects of the system:
- Illustrates a real-life scenario where multi-modal inputs can be used
- Design issue: how should different inputs be used together?
- Algorithmic issue: how should different inputs be combined?
10
Multi-modal Spoken Dialog with Wireless Devices
11
Overview
Work by SpeechWorks, jointly conducted by speech recognition and user interface researchers.
Two distinct elements:
Speech recognition:
- In an embedded domain, which speech recognition paradigm should be used? Embedded, network, or distributed speech recognition?
User interface:
- How to "situationalize" the application?
12
Overall Function
Walking Directions application:
- Assumes the user is walking in an unknown city
- Runs on a Compaq iPAQ 3765 Pocket PC
Users can:
- Select a city and start/end addresses
- Display a map and control the display
- Display directions, including interactive directions as a list of steps
The system accepts speech and stylus input, but not pen gestures.
13
Choice of Speech Recognition Paradigm
Embedded speech recognition:
- Only simple commands can be used, due to computational limits
Network speech recognition:
- Requires bandwidth; the network connection can sometimes be cut off
Distributed speech recognition:
- The client handles the front-end; the server handles decoding (see the sketch below)
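As a rough illustration of the distributed paradigm (a sketch, not the paper's implementation), the code below splits recognition into a client-side front-end that turns audio into per-frame features and a server-side decoder stub. The frame size, the toy log-energy feature, and the `decode_on_server` placeholder are all assumptions.

```python
# Hypothetical sketch of the distributed speech recognition split:
# the client computes compact features, and only those are shipped to the server.
import numpy as np

FRAME_SIZE = 400   # assumed 25 ms frames at 16 kHz


def client_front_end(audio: np.ndarray) -> np.ndarray:
    """Client side: turn raw samples into per-frame feature vectors.
    A toy log-energy feature stands in for a real MFCC front-end."""
    n_frames = len(audio) // FRAME_SIZE
    frames = audio[: n_frames * FRAME_SIZE].reshape(n_frames, FRAME_SIZE)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    return log_energy.reshape(-1, 1)   # shape: (n_frames, feature_dim)


def decode_on_server(features: np.ndarray) -> str:
    """Server side: placeholder for the actual decoder (acoustic + language model)."""
    return "<decoded transcript>"


if __name__ == "__main__":
    audio = np.random.randn(16000)          # one second of fake audio
    features = client_front_end(audio)      # cheap computation on the handheld
    print(decode_on_server(features))       # heavy decoding happens on the server
```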
14
User Interface Situationalization
Potential scenarios:
- Sitting at a desk
- Getting out of a cab, building, or subway and preparing to walk somewhere
- Walking somewhere with hands free
- Walking somewhere carrying things
- Driving somewhere in heavy traffic
- Driving somewhere in light traffic
- Being the passenger in a car
- Being in a highly noisy environment
15
Their Conclusion
The balance of audio and visual information can be reduced to four complementary configurations:
Single-modal:
1. Visual mode
2. Audio mode
Multi-modal:
3. Visual dominant
4. Audio dominant
16
A Glance at the UI
17
Summary
Interesting aspects:
- A thorough discussion of how speech recognition can be used in an embedded domain
- A discussion of how users would use the dialogue application
18
Multi-modal Dialog in a Mobile Pedestrian Navigation System
19
Overview
A pedestrian navigation system with two components:
- IRREAL: indoor navigation system, using a magnetic tracker
- ARREAL: outdoor navigation system, using GPS
20
Speech Input/Output
Speech input: HTK; IBM ViaVoice Embedded and Logox were being evaluated
Speech output: Festival
21
Visual Output
Both 2D and 3D spatialization are supported.
22
Interesting Aspects
The system is tailored for elderly people:
- Speaker clustering to improve the recognition rate for elderly users
- Model selection: choose between two model sets based on likelihood (a sketch of likelihood-based selection follows below)
  - Elderly models
  - Normal adult models
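A minimal sketch of likelihood-based model selection, assuming per-frame features scored under two candidate model sets; the diagonal-Gaussian scoring and the model parameters are illustrative assumptions, not the system's actual acoustic models.

```python
# Hypothetical sketch: score an utterance under "elderly" and "adult" model sets
# and keep whichever assigns the higher likelihood, as described on the slide.
import numpy as np


def log_likelihood(features: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    per_frame = -0.5 * (np.log(2 * np.pi * var) + (features - mean) ** 2 / var)
    return float(per_frame.sum())


def select_model(features: np.ndarray, models: dict) -> str:
    """Return the name of the model set that scores the utterance highest."""
    scores = {name: log_likelihood(features, m["mean"], m["var"])
              for name, m in models.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    dim = 13
    models = {
        "elderly": {"mean": np.zeros(dim), "var": np.ones(dim)},
        "adult":   {"mean": np.full(dim, 0.5), "var": np.ones(dim)},
    }
    utterance = np.random.randn(200, dim)  # fake feature frames
    print("Selected model set:", select_model(utterance, models))
```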
23
Conclusion
Aspects of multi-modal dialogue:
- What kinds of inputs should be used?
- How can speech and other inputs be combined and interact?
- How would users use the system?
- How should the system respond to the users?
24
Supplements: the Definition of Multi-modal Dialogue, and How MATCH Combines Multi-modal Inputs
25
Definition of Multi-modal Dialogue
In the "Introduction" slide, Arthur's definition of a multi-modal application:
- A general description of an application that can be operated in multiple input/output modes.
Alex's comment: "So how about the laptop? Will you consider it as a multi-modal application?"
26
I am stunned! Alex makes some sense!
The laptop example shows that we expect a "multi-modal application" to allow, in some way, two different modes to operate simultaneously. So, even though a laptop allows both mouse and keyboard input, it does not fit what people call a multi-modal application.
27
A Further Refinement
It is still important to consider a multi-modal application as a generalization of a single-modal application. This allows thinking about how to deal with situations where a particular mode fails.
28
How Can Multi-modal Inputs Be Combined?
How is speech input? A simple click-to-speak interface is used; the output is a speech lattice.
29
How Are Pen Gestures Input?
Strokes can contain:
- Lines and arrows
- Handwritten words
- Selections of entities on the screen
A standard template-based algorithm is used; arrow heads and marks are also extracted.
Recognition covers 285 words:
- Attributes of the restaurants: "cheap", "chinese"
- Zones or points of interest: "soho", "empire"
- 10 basic gesture marks: lines, arrows, areas, points, and the question mark
The input is broken into a lattice of strokes.
30
Pen Input Representation
Gestures are represented as FORM MEANING (NUMBER TYPE) SEM (see the record-type sketch below):
- FORM: physical form of the gesture, e.g. area, point, line, arrow
- MEANING: meaning of the form, e.g. an "area" could be loc(ation) or sel(ection)
- NUMBER: number of entities in the selection, e.g. 1, 2, 3, or many
- TYPE: the type of entities, e.g. res(taurant) and theater
- SEM: placeholder for the specific contents of a gesture, e.g. the points that make up an area, or the identifiers of an object
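As a rough sketch (not MATCH's actual data structures), the attributes above can be pictured as a small record type. The field names follow the slide; the class name and the example values are assumed.

```python
# Hypothetical sketch of a gesture symbol carrying the
# FORM MEANING (NUMBER TYPE) SEM attributes described on the slide.
from dataclasses import dataclass, field
from typing import List


@dataclass
class GestureSymbol:
    form: str          # physical form: "area", "point", "line", "arrow"
    meaning: str       # interpretation of the form: "loc" or "sel"
    number: int        # number of entities in the selection: 1, 2, 3, ...
    entity_type: str   # type of entities: "res" (restaurant), "theater"
    sem: List[str] = field(default_factory=list)  # specific contents, e.g. object identifiers


# An area gesture that selects two restaurants (identifiers are made up):
two_restaurants = GestureSymbol(
    form="area",
    meaning="sel",
    number=2,
    entity_type="res",
    sem=["id1", "id2"],
)
print(two_restaurants)
```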
31
Example: first area gesture and second area gesture [gesture lattice figures omitted]
32
Example (cont.)
First gesture: either a location (0->1->2->3->7) or the restaurant (0->1->2->4->5->6->7)
Second gesture: either a location (8->9->10->16) or two restaurants (8->9->11->12->13->16)
An aggregate numerical expression from gestures 1 and 2 passes through ->14->15
33
Example (cont.)
The user says: "show Chinese restaurant in this and this neighborhood" (two locations are specified)
34
Example (cont.)
The user says: "Tell me about this place and these places" (two restaurants are specified)
35
Example (cont.)
Not covered here: if the user says "these three restaurants", the program needs to aggregate two gestures together. This is covered by "Deixis and Conjunction in Multimodal Systems" by Michael Johnston. In brief: gestures are combined, forming new paths in the lattice.
36
How Are Multi-modal Inputs Integrated?
Issues:
1. Timing of inputs
2. How are inputs processed? (FSTs) Details can be found in "Finite-state Multimodal Parsing and Understanding" and "Tight-coupling of Multimodal Language Processing with Speech Recognition"
3. Multi-modal grammars
37
Timing of Inputs
MATCH takes the speech and gesture lattices and creates a meaning lattice. A time-out scheme is used: when the user hits the click-to-speak button and the speech result arrives, MATCH waits for the gesture lattice within a short time-out if inking is in progress; otherwise MATCH treats the input as unimodal. The same applies for the gesture lattice. (A sketch of this time-out logic follows.)
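A minimal sketch of such a time-out policy, reconstructed from the description above rather than from MATCH's code; the time-out length, the queue-based gesture delivery, and the function name are assumptions.

```python
# Hypothetical sketch of the time-out policy described above: after a speech
# result arrives, wait briefly for a gesture lattice only if inking is in progress.
import queue

TIMEOUT_S = 1.0  # assumed length of the short time-out


def integrate_speech(speech_lattice,
                     gesture_queue: "queue.Queue",
                     inking_in_progress: bool):
    """Return (speech_lattice, gesture_lattice) for multimodal integration,
    or (speech_lattice, None) to treat the input as unimodal speech."""
    if not inking_in_progress:
        return speech_lattice, None              # no pen activity: unimodal speech
    try:
        gesture_lattice = gesture_queue.get(timeout=TIMEOUT_S)
        return speech_lattice, gesture_lattice   # combine both lattices
    except queue.Empty:
        return speech_lattice, None              # gesture never arrived in time
```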
38
FST Processing of Multi-modal Inputs
Multi-modal integration is modeled by a 3-tape finite-state device over:
- the speech stream (words)
- the gesture stream (gesture symbols)
- their combined meaning (meaning symbols)
The device takes speech and gesture as inputs and produces the meaning output. It is simulated by two transducers:
- G:W, which aligns the gesture stream with the speech stream
- G_W:M, which takes the composite alphabet of speech and gesture symbols as input and outputs meaning
The speech and gesture inputs are first composed with G:W; the result, G_W, is then composed with G_W:M. (A toy composition sketch follows.)
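The sketch below is a generic toy illustration of finite-state transducer composition, the operation this cascade relies on. It is not a reconstruction of MATCH's actual G:W and G_W:M transducers; the states, the composite symbol "here+Garea", and the meaning tags are invented.

```python
# Toy illustration of transducer composition: an epsilon-free transducer is a set
# of transitions (state, in_symbol, out_symbol, next_state) plus start/final states.

def compose(t1, t2):
    """Compose two epsilon-free transducers so the output tape of t1
    becomes the input tape of t2."""
    trans = [((q1, q2), a, c, (r1, r2))
             for (q1, a, b, r1) in t1["trans"]
             for (q2, b2, c, r2) in t2["trans"]
             if b == b2]
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "trans": trans,
    }


def transduce(t, symbols):
    """Return every output sequence the transducer accepts for `symbols`."""
    paths = [(t["start"], [])]
    for sym in symbols:
        paths = [(r, out + [o])
                 for (state, out) in paths
                 for (q, a, o, r) in t["trans"]
                 if q == state and a == sym]
    return [out for (state, out) in paths if state in t["finals"]]


# T1: fuse a word with an accompanying gesture symbol (composite alphabet).
t1 = {"start": 0, "finals": {1},
      "trans": [(0, "show", "show", 0),
                (0, "here", "here+Garea", 1)]}
# T2: map composite symbols to meaning symbols.
t2 = {"start": 0, "finals": {1},
      "trans": [(0, "show", "<cmd:show>", 0),
                (0, "here+Garea", "<loc>", 1)]}

cascade = compose(t1, t2)
print(transduce(cascade, ["show", "here"]))   # [['<cmd:show>', '<loc>']]
```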
39
Multi-modal Grammar
The input word and gesture streams generate an XML representation of the meaning (eps: epsilon). [Example XML output referring to entity identifier id1 omitted]
40
Multi-modal Grammar