1
Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan
2
Introduction
The term "multi-modal": a general description of an application that can be operated in multiple input/output modes.
E.g. input: voice, pen, gesture, facial expression. Output: voice, graphical output.
[Also see the supplementary slides on Alex and Arthur's discussion of the definition.]
3
Multi-modal Dialogue (MMD) in Personal Navigation Systems
Motivation of this presentation:
- Navigation systems give MMD an interesting scenario, and a case for why MMD is useful.
Structure of this presentation: three system papers
- AT&T MATCH: speech and pen input, with pen gestures
- SpeechWorks Walking Directions System: speech and stylus input
- Univ. of Saarland REAL: speech and pen input; both GPS and a magnetic tracker were used
4
Multi-modal Language Processing for Mobile Information Access
5
Overall Function
A working city guide and navigation system:
- Easy access to restaurant and subway information
- Runs on a Fujitsu pen computer
- Users are free to give speech commands or draw on the display with a stylus
6
Types of Inputs
Speech input:
- "show cheap italian restaurants in chelsea"
Simultaneous speech and pen input:
- Circle an area and say "show cheap italian restaurants in neighborhood" at the same time
Functionalities include restaurant reviews and subway routing.
7
Input Overview
Speech input:
- Uses the AT&T Watson speech recognition engine
Pen input (electronic ink):
- Allows the use of pen gestures, which can be complex pen input
- Special aggregation techniques are used for these gestures
Inputs are combined using lattice combination.
8
Pen Gesture and Speech Input
For example:
U: "How do I get to this place?"
S: "Where do you want to go from?"
U: "25th St & 3rd Avenue"
9
Summary
Interesting aspects of the system:
- Illustrates a real-life scenario where multi-modal inputs can be used
- Design issue: how should different inputs be used together?
- Algorithmic issue: how should different inputs be combined?
10
Multi-modal Spoken Dialog with Wireless Devices
11
Overview
Work by SpeechWorks, jointly conducted by speech recognition and user interface researchers.
Two distinct elements:
Speech recognition:
- In an embedded domain, which speech recognition paradigm should be used? Embedded, network, or distributed speech recognition?
User interface:
- How to "situationalize" the application?
12
Overall Function
Walking Directions application:
- Assumes the user is walking in an unknown city
- Runs on a Compaq iPAQ 3765 Pocket PC
Users can:
- Select a city and start/end addresses
- Display a map and control the display
- Display directions, including interactive directions as a list of steps
The system accepts speech and stylus input, but not pen gestures.
13
Choice of Speech Recognition Paradigm
Embedded speech recognition:
- Only simple commands can be used, due to computational limits
Network speech recognition:
- Requires bandwidth; the network connection can sometimes be cut off
Distributed speech recognition:
- The client handles the front-end; the server handles decoding (see the sketch below)
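As a rough illustration of the distributed paradigm (a sketch, not the paper's implementation), the code below splits recognition into a client-side front-end that turns audio into per-frame features and a server-side decoder stub. The frame size, the toy log-energy feature, and the `decode_on_server` placeholder are all assumptions.

```python
# Hypothetical sketch of the distributed speech recognition split:
# the client computes compact features, and only those are shipped to the server.
import numpy as np

FRAME_SIZE = 400   # assumed 25 ms frames at 16 kHz


def client_front_end(audio: np.ndarray) -> np.ndarray:
    """Client side: turn raw samples into per-frame feature vectors.
    A toy log-energy feature stands in for a real MFCC front-end."""
    n_frames = len(audio) // FRAME_SIZE
    frames = audio[: n_frames * FRAME_SIZE].reshape(n_frames, FRAME_SIZE)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    return log_energy.reshape(-1, 1)   # shape: (n_frames, feature_dim)


def decode_on_server(features: np.ndarray) -> str:
    """Server side: placeholder for the actual decoder (acoustic + language model)."""
    return "<decoded transcript>"


if __name__ == "__main__":
    audio = np.random.randn(16000)          # one second of fake audio
    features = client_front_end(audio)      # cheap computation on the handheld
    print(decode_on_server(features))       # heavy decoding happens on the server
```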
14
User Interface Situationalization
Potential scenarios:
- Sitting at a desk
- Getting out of a cab, building, or subway and preparing to walk somewhere
- Walking somewhere with hands free
- Walking somewhere carrying things
- Driving somewhere in heavy traffic
- Driving somewhere in light traffic
- Being the passenger in a car
- Being in a highly noisy environment
15
Their Conclusion
The balance of audio and visual information can be reduced to four complementary configurations:
Single-modal:
1. Visual mode
2. Audio mode
Multi-modal:
3. Visual dominant
4. Audio dominant
16
A Glance at the UI
17
Summary
Interesting aspects:
- A thorough discussion of how speech recognition can be used in an embedded domain
- A discussion of how users would use the dialogue application
18
Multi-modal Dialog in a Mobile Pedestrian Navigation System
19
Overview
A pedestrian navigation system with two components:
- IRREAL: indoor navigation system, using a magnetic tracker
- ARREAL: outdoor navigation system, using GPS
20
Speech Input/Output
Speech input: HTK; IBM ViaVoice Embedded and Logox were being evaluated
Speech output: Festival
21
Visual Output
Both 2D and 3D spatialization are supported.
22
Interesting Aspects
The system is tailored for elderly people:
- Speaker clustering to improve the recognition rate for elderly users
- Model selection: choose between two model sets based on likelihood (a sketch of likelihood-based selection follows below)
  - Elderly models
  - Normal adult models
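A minimal sketch of likelihood-based model selection, assuming per-frame features scored under two candidate model sets; the diagonal-Gaussian scoring and the model parameters are illustrative assumptions, not the system's actual acoustic models.

```python
# Hypothetical sketch: score an utterance under "elderly" and "adult" model sets
# and keep whichever assigns the higher likelihood, as described on the slide.
import numpy as np


def log_likelihood(features: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    per_frame = -0.5 * (np.log(2 * np.pi * var) + (features - mean) ** 2 / var)
    return float(per_frame.sum())


def select_model(features: np.ndarray, models: dict) -> str:
    """Return the name of the model set that scores the utterance highest."""
    scores = {name: log_likelihood(features, m["mean"], m["var"])
              for name, m in models.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    dim = 13
    models = {
        "elderly": {"mean": np.zeros(dim), "var": np.ones(dim)},
        "adult":   {"mean": np.full(dim, 0.5), "var": np.ones(dim)},
    }
    utterance = np.random.randn(200, dim)  # fake feature frames
    print("Selected model set:", select_model(utterance, models))
```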
23
Conclusion
Aspects of multi-modal dialogue:
- What kinds of inputs should be used?
- How can speech and other inputs be combined and interact?
- How would users use the system?
- How should the system respond to the users?
24
Supplements: the Definition of Multi-modal Dialogue, and How MATCH Combines Multi-modal Inputs
25
Definition of Multi-modal Dialogue
In the "Introduction" slide, Arthur's definition of a multi-modal application:
- A general description of an application that can be operated in multiple input/output modes.
Alex's comment: "So how about the laptop? Will you consider it as a multi-modal application?"
26
I am stunned! Alex makes some sense!
The laptop example shows that we expect a "multi-modal application" to allow, in some way, two different modes to operate simultaneously. So, even though a laptop allows both mouse and keyboard input, it does not fit what people call a multi-modal application.
27
A Further Refinement
It is still important to consider a multi-modal application as a generalization of a single-modal application. This allows thinking about how to deal with situations where a particular mode fails.
28
How Can Multi-modal Inputs Be Combined?
How is speech input? A simple click-to-speak interface is used; the output is a speech lattice.
29
How Are Pen Gestures Input?
Strokes can contain:
- Lines and arrows
- Handwritten words
- Selections of entities on the screen
A standard template-based algorithm is used; arrow heads and marks are also extracted.
Recognition covers 285 words:
- Attributes of the restaurants: "cheap", "chinese"
- Zones or points of interest: "soho", "empire"
- 10 basic gesture marks: lines, arrows, areas, points, and the question mark
The input is broken into a lattice of strokes.
30
Pen Input Representation
Gestures are represented as FORM MEANING (NUMBER TYPE) SEM (see the record-type sketch below):
- FORM: physical form of the gesture, e.g. area, point, line, arrow
- MEANING: meaning of the form, e.g. an "area" could be loc(ation) or sel(ection)
- NUMBER: number of entities in the selection, e.g. 1, 2, 3, or many
- TYPE: the type of entities, e.g. res(taurant) and theater
- SEM: placeholder for the specific contents of a gesture, e.g. the points that make up an area, or the identifiers of an object
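As a rough sketch (not MATCH's actual data structures), the attributes above can be pictured as a small record type. The field names follow the slide; the class name and the example values are assumed.

```python
# Hypothetical sketch of a gesture symbol carrying the
# FORM MEANING (NUMBER TYPE) SEM attributes described on the slide.
from dataclasses import dataclass, field
from typing import List


@dataclass
class GestureSymbol:
    form: str          # physical form: "area", "point", "line", "arrow"
    meaning: str       # interpretation of the form: "loc" or "sel"
    number: int        # number of entities in the selection: 1, 2, 3, ...
    entity_type: str   # type of entities: "res" (restaurant), "theater"
    sem: List[str] = field(default_factory=list)  # specific contents, e.g. object identifiers


# An area gesture that selects two restaurants (identifiers are made up):
two_restaurants = GestureSymbol(
    form="area",
    meaning="sel",
    number=2,
    entity_type="res",
    sem=["id1", "id2"],
)
print(two_restaurants)
```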
31
Example: first area gesture and second area gesture [gesture lattice figures omitted]
32
Example (cont.)
First gesture: either a location (0->1->2->3->7) or the restaurant (0->1->2->4->5->6->7)
Second gesture: either a location (8->9->10->16) or two restaurants (8->9->11->12->13->16)
An aggregate numerical expression from gestures 1 and 2 passes through ->14->15
33
Example (cont.)
The user says: "show Chinese restaurant in this and this neighborhood" (two locations are specified)
34
Example (cont.)
The user says: "Tell me about this place and these places" (two restaurants are specified)
35
Example (cont.)
Not covered here: if the user says "these three restaurants", the program needs to aggregate two gestures together. This is covered by "Deixis and Conjunction in Multimodal Systems" by Michael Johnston. In brief: gestures are combined, forming new paths in the lattice.
36
How Are Multi-modal Inputs Integrated?
Issues:
1. Timing of inputs
2. How are inputs processed? (FSTs) Details can be found in "Finite-state Multimodal Parsing and Understanding" and "Tight-coupling of Multimodal Language Processing with Speech Recognition"
3. Multi-modal grammars
37
Timing of Inputs
MATCH takes the speech and gesture lattices and creates a meaning lattice. A time-out scheme is used: when the user hits the click-to-speak button and the speech result arrives, MATCH waits for the gesture lattice within a short time-out if inking is in progress; otherwise MATCH treats the input as unimodal. The same applies for the gesture lattice. (A sketch of this time-out logic follows.)
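A minimal sketch of such a time-out policy, reconstructed from the description above rather than from MATCH's code; the time-out length, the queue-based gesture delivery, and the function name are assumptions.

```python
# Hypothetical sketch of the time-out policy described above: after a speech
# result arrives, wait briefly for a gesture lattice only if inking is in progress.
import queue

TIMEOUT_S = 1.0  # assumed length of the short time-out


def integrate_speech(speech_lattice,
                     gesture_queue: "queue.Queue",
                     inking_in_progress: bool):
    """Return (speech_lattice, gesture_lattice) for multimodal integration,
    or (speech_lattice, None) to treat the input as unimodal speech."""
    if not inking_in_progress:
        return speech_lattice, None              # no pen activity: unimodal speech
    try:
        gesture_lattice = gesture_queue.get(timeout=TIMEOUT_S)
        return speech_lattice, gesture_lattice   # combine both lattices
    except queue.Empty:
        return speech_lattice, None              # gesture never arrived in time
```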
38
FST Processing of Multi-modal Inputs
Multi-modal integration is modeled by a 3-tape finite-state device over:
- the speech stream (words)
- the gesture stream (gesture symbols)
- their combined meaning (meaning symbols)
The device takes speech and gesture as inputs and produces the meaning output. It is simulated by two transducers:
- G:W, which aligns the gesture stream with the speech stream
- G_W:M, which takes the composite alphabet of speech and gesture symbols as input and outputs meaning
The speech and gesture inputs are first composed with G:W; the result, G_W, is then composed with G_W:M. (A toy composition sketch follows.)
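The sketch below is a generic toy illustration of finite-state transducer composition, the operation this cascade relies on. It is not a reconstruction of MATCH's actual G:W and G_W:M transducers; the states, the composite symbol "here+Garea", and the meaning tags are invented.

```python
# Toy illustration of transducer composition: an epsilon-free transducer is a set
# of transitions (state, in_symbol, out_symbol, next_state) plus start/final states.

def compose(t1, t2):
    """Compose two epsilon-free transducers so the output tape of t1
    becomes the input tape of t2."""
    trans = [((q1, q2), a, c, (r1, r2))
             for (q1, a, b, r1) in t1["trans"]
             for (q2, b2, c, r2) in t2["trans"]
             if b == b2]
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "trans": trans,
    }


def transduce(t, symbols):
    """Return every output sequence the transducer accepts for `symbols`."""
    paths = [(t["start"], [])]
    for sym in symbols:
        paths = [(r, out + [o])
                 for (state, out) in paths
                 for (q, a, o, r) in t["trans"]
                 if q == state and a == sym]
    return [out for (state, out) in paths if state in t["finals"]]


# T1: fuse a word with an accompanying gesture symbol (composite alphabet).
t1 = {"start": 0, "finals": {1},
      "trans": [(0, "show", "show", 0),
                (0, "here", "here+Garea", 1)]}
# T2: map composite symbols to meaning symbols.
t2 = {"start": 0, "finals": {1},
      "trans": [(0, "show", "<cmd:show>", 0),
                (0, "here+Garea", "<loc>", 1)]}

cascade = compose(t1, t2)
print(transduce(cascade, ["show", "here"]))   # [['<cmd:show>', '<loc>']]
```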
39
Multi-modal Grammar
The input word and gesture streams generate an XML representation of the meaning (eps: epsilon). [Example XML output referring to entity identifier id1 omitted]
40
Multi-modal Grammar