Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan
Introduction
The term "multi-modal" is a general description of an application that can be operated in multiple input/output modes.
- Input: voice, pen, gesture, facial expression
- Output: voice, graphical output
Multi-modal Dialogue (MMD) in Personal Navigation Systems
Motivation of this presentation
- Navigation systems give MMD an interesting scenario: a case where MMD is genuinely useful.
Structure of this presentation: three system papers
- AT&T MATCH: speech and pen input, with pen gestures
- SpeechWorks walking-directions system: speech and stylus input
- Univ. of Saarland REAL: speech and pen input; both GPS and a magnetic tracker are used
Multi-modal Language Processing for Mobile Information Access
Overall Function
- A working city guide and navigation system
- Easy access to restaurant and subway information
- Runs on a Fujitsu pen computer
- Users are free to give speech commands or draw on the display with a stylus
Types of Inputs
- Speech input: "show cheap italian restaurants in chelsea"
- Simultaneous speech and pen input: circle an area and say "show cheap italian restaurants in neighborhood" at the same time (see the sketch below)
- Other functionality includes reviews and subway routing
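To make the simultaneous-input idea concrete, here is a minimal sketch, not taken from the MATCH paper, of fusing a circled region with a parsed spoken query. The restaurant records, the parsed constraints, and the point_in_polygon helper are all invented for illustration.

```python
# Toy fusion of a pen gesture (circled area) with a spoken query.
# All data and helper names are hypothetical, not part of MATCH.

def point_in_polygon(point, polygon):
    """Ray-casting test: is (x, y) inside the circled polygon?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Pen input: the circled neighborhood, as a polygon of map points.
circled_area = [(0, 0), (4, 0), (4, 4), (0, 4)]

# Speech input: "show cheap italian restaurants in neighborhood",
# assumed already parsed into attribute constraints.
spoken_constraints = {"price": "cheap", "cuisine": "italian"}

restaurants = [
    {"name": "Trattoria A", "price": "cheap", "cuisine": "italian", "pos": (1, 1)},
    {"name": "Bistro B", "price": "expensive", "cuisine": "french", "pos": (2, 3)},
    {"name": "Osteria C", "price": "cheap", "cuisine": "italian", "pos": (9, 9)},
]

# Fusion: keep restaurants that satisfy the spoken constraints AND
# fall inside the area indicated by the pen gesture.
matches = [
    r["name"] for r in restaurants
    if all(r[k] == v for k, v in spoken_constraints.items())
    and point_in_polygon(r["pos"], circled_area)
]
print(matches)  # ['Trattoria A']
```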
Input Overview
- Speech input: uses the AT&T Watson speech recognition engine
- Pen input (electronic ink): allows the use of pen gestures, which can be complex pen input; special aggregation techniques are used for these gestures
- Inputs are combined using lattice combination (a simplified sketch follows below)
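The system combines inputs at the lattice level. The toy sketch below substitutes n-best lists for lattices and simply sums log scores over compatible speech/gesture pairs; the hypotheses, scores, and compatibility rule are invented here and do not reproduce the paper's actual algorithm.

```python
# Toy stand-in for multimodal lattice combination: instead of true
# lattices, combine n-best speech and gesture hypotheses by summing
# log scores and keeping only compatible pairs.
speech_nbest = [
    ("show restaurants in this area", -2.1),
    ("show restaurants in this aria", -3.5),
]

gesture_nbest = [
    ("area:chelsea", -0.7),          # circled region read as Chelsea
    ("point:restaurant_42", -1.9),   # a single tap on one restaurant
]

def compatible(speech, gesture):
    """A deictic phrase like 'this area' needs an area-type gesture."""
    if "this area" in speech:
        return gesture.startswith("area:")
    return True

joint = [
    (s, g, s_score + g_score)
    for s, s_score in speech_nbest
    for g, g_score in gesture_nbest
    if compatible(s, g)
]
best = max(joint, key=lambda x: x[2])
print(best)
# ('show restaurants in this area', 'area:chelsea', -2.8)
```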
Pen Gesture and Speech Input
For example:
U: "How do I get to this place?"
S: "Where do you want to go from?"
U: "25th St & 3rd Avenue"
Summary
Interesting aspects of the system:
- Illustrates a real-life scenario where multi-modal inputs can be used
- Design issue: how should different inputs be used together?
- Algorithmic issue: how should different inputs be combined?
Multi-modal Spoken Dialog with Wireless Devices
Overview
- Work by SpeechWorks, jointly conducted by speech recognition and user interface teams
- Two distinct elements:
  - Speech recognition: in an embedded domain, which paradigm should be used? Embedded, network, or distributed speech recognition?
  - User interface: how to "situationalize" the application?
Overall Function
Walking Directions Application
- Assumes the user is walking in an unfamiliar city
- Runs on a Compaq iPAQ 3765 PocketPC
- Users can:
  - Select a city and start/end addresses
  - Display a map and control the display
  - Display directions, including interactive directions in the form of a list of steps
- Accepts speech input and stylus input, but not pen gestures
Choice of speech recognition paradigm
- Embedded speech recognition: only simple commands can be used, due to computational limits
- Network speech recognition: bandwidth is required, and the network connection can sometimes be cut off
- Distributed speech recognition: the client handles the acoustic front-end, the server handles decoding (see the sketch below)
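A rough sketch of the distributed split, assuming the client computes frame-level spectral features and the server runs the decoder. The framing parameters and the stubbed server_decode function are placeholders, not the SpeechWorks implementation.

```python
# Sketch of the distributed-speech-recognition split: the client does
# the acoustic front-end, the server does the decoding.  The feature
# extraction (framed log-energy spectra) is a crude placeholder and
# server_decode() is a stub.
import numpy as np

def client_front_end(samples, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Runs on the handheld: turn raw audio into compact feature frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(spectrum + 1e-10))
    return np.array(frames)  # far fewer bytes than the raw waveform

def server_decode(features):
    """Runs on the server: placeholder for the real decoder."""
    return f"<decoded {features.shape[0]} frames of dim {features.shape[1]}>"

# One second of fake audio standing in for the microphone signal.
audio = np.random.randn(8000)
features = client_front_end(audio)   # this is what crosses the wireless link
print(server_decode(features))
```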
User Interface Situationalization
Potential scenarios:
- Sitting at a desk
- Getting out of a cab, building, or subway and preparing to walk somewhere
- Walking somewhere with hands free
- Walking somewhere carrying things
- Driving somewhere in heavy traffic
- Driving somewhere in light traffic
- Being a passenger in a car
- Being in a highly noisy environment
Their conclusion
The balance of audio and visual information can be reduced to four complementary modes:
- Single-modal: 1. visual mode; 2. audio mode
- Multi-modal: 3. visual-dominant; 4. audio-dominant
A glance at the UI
Summary
Interesting aspects:
- A good discussion of how speech recognition can be used in an embedded domain
- A good discussion of how users would actually use the dialogue application
Multi-modal Dialog in a Mobile Pedestrian Navigation System
Overview
Pedestrian navigation system with two components:
- IRREAL: indoor navigation system, using a magnetic tracker
- ARREAL: outdoor navigation system, using GPS
Speech Input/Output
- Speech input: HTK; IBM ViaVoice Embedded and Logox were being evaluated
- Speech output: Festival
Visual output
- Both 2D and 3D spatialization are supported
Interesting aspects
- Tailors the system to elderly users
- Speaker clustering to improve the recognition rate for elderly speakers
- Model selection: choose between two model sets based on likelihood (see the sketch below)
  - Elderly models
  - Normal adult models
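As a hedged illustration of likelihood-based model selection, the sketch below scores an utterance under two single-Gaussian "models" and picks the higher-scoring one. The real system's acoustic models are not described here; all parameters and data are fabricated.

```python
# Toy likelihood-based selection between an "elderly" and a
# "normal adult" model, each represented here as one diagonal
# Gaussian over feature frames.  Parameters are invented.
import numpy as np

def diag_gaussian_loglik(frames, mean, var):
    """Total log-likelihood of frames under a diagonal Gaussian."""
    return np.sum(
        -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    )

models = {
    "elderly": (np.array([1.0, -0.5]), np.array([1.5, 2.0])),
    "normal_adult": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
}

# Feature frames extracted from the incoming utterance (fabricated).
utterance = np.array([[0.9, -0.4], [1.1, -0.6], [0.8, -0.3]])

# Score the utterance under each model and pick the more likely one;
# decoding would then proceed with that model's recognizer (not shown).
scores = {
    name: diag_gaussian_loglik(utterance, mean, var)
    for name, (mean, var) in models.items()
}
chosen = max(scores, key=scores.get)
print(chosen, scores)
```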
Conclusion
Aspects of multi-modal dialogue:
- What kinds of inputs should be used?
- How can speech and other inputs be combined and interact?
- How will users actually use the system?
- How should the system respond to the users?