Designing Robust Multimodal Systems for Diverse Users and Mobile Environments
Sharon Oviatt
Center for Human Computer Communication, Department of Computer Science, OGI

Introduction to Perceptive Multimodal Interfaces
– Multimodal interfaces recognize combined natural human input modes (e.g., speech & pen, speech & lip movements)
– A radical departure from GUIs in basic features, interface design & architectural underpinnings
– Rapid development of bimodal systems in the 1990s
– New fusion & language processing techniques
– Diversification of mode combinations & applications
– More general & robust hybrid architectures

Advantages of Multimodal Interfaces
– Flexibility & expressive power
– Support for users’ preferred interaction style
– Accommodate more users, tasks & environments
– Improved error handling & robustness
– Support for new forms of computing, including mobile & pervasive interfaces
– Permit multifunctional & tailored mobile interfaces, adapted to user, task & environment

The Challenge of Robustness: Unimodal Speech Technology’s Achilles’ Heel
– Recognition errors currently limit commercialization of speech technology, especially for:
  – Spontaneous interactive speech
  – Diverse speakers & speaking styles (e.g., accented speech)
  – Speech in natural field environments (e.g., mobile use)
– A 20-50% drop in recognition accuracy is typical under real-world usage conditions

Improved Error Handling in Flexible Multimodal Interfaces
– Users can avoid errors through mode selection
– Users’ multimodal language is simplified, which reduces the complexity of natural language processing & avoids errors
– Users switch modes after system errors, which undercuts error spirals & facilitates recovery
– Multimodal architectures potentially can support “mutual disambiguation” of input signals (see the sketch below)

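As a rough sketch of how mutual disambiguation can work (the labels, semantic types, and probabilities below are hypothetical; QuickSet's actual fusion unifies typed feature structures, as noted on the Processing & Architecture slide), the two recognizers' n-best lists are jointly reranked and semantically incompatible pairings are filtered out, so evidence from one mode can pull a lower-ranked hypothesis in the other mode up to the top:

```python
# Toy sketch of mutual disambiguation between speech and gesture
# n-best lists. Labels, types, and probabilities are hypothetical.

from itertools import product

# Each hypothesis: (interpretation, semantic command type, probability)
speech_nbest = [
    ("zoom to Main St", "zoom", 0.45),   # speech recognizer's top guess
    ("pan to Main St", "pan", 0.35),     # the command the user actually gave
    ("pan to Maine St", "pan", 0.20),
]
gesture_nbest = [
    ("arrow", "pan", 0.55),              # an arrow gesture implies a pan
    ("line", "barrier", 0.45),
]

def fuse(speech, gestures):
    """Jointly rerank all semantically compatible speech-gesture pairs."""
    joint = [
        (s_label, g_label, s_p * g_p)
        for (s_label, s_type, s_p), (g_label, g_type, g_p)
        in product(speech, gestures)
        if s_type == g_type              # incompatible pairs are discarded
    ]
    return sorted(joint, key=lambda h: h[2], reverse=True)

print(fuse(speech_nbest, gesture_nbest)[0])
# ('pan to Main St', 'arrow', 0.1925): the correct 'pan' reading, ranked
# second by the speech recognizer alone, is pulled up to first because
# no gesture hypothesis is compatible with the misrecognized 'zoom'.
```

This is the sense in which each mode disambiguates the other: neither signal alone is reliable, but the joint interpretation can still be correct.
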
Example of Mutual Disambiguation: QuickSet Interface during Multimodal “PAN” Command
Processing & Architecture
– Speech & gestures processed in parallel
– Statistically ranked unification of semantic interpretations
– Multi-agent architecture coordinates signal recognition, language processing & multimodal integration

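A minimal sketch of the unification step (a toy dict-based version; the actual system unifies typed feature structures over a multi-agent architecture and ranks the surviving interpretations by combined recognizer scores): partial semantic frames from each mode merge when none of their features conflict.

```python
# Toy sketch of semantic unification: merge partial frames from the
# speech and gesture interpreters, failing on conflicting features.
# Real typed feature structures also consult a type hierarchy.

def unify(fs1, fs2):
    """Return the merged frame, or None if any feature conflicts."""
    merged = dict(fs1)
    for key, value in fs2.items():
        if key in merged and merged[key] != value:
            return None                  # conflict: unification fails
        merged[key] = value
    return merged

# Speech supplies the command type; the pen gesture supplies the location.
speech_frame  = {"command": "pan", "object": "map"}
gesture_frame = {"location": (45.52, -122.68)}   # hypothetical map point

print(unify(speech_frame, gesture_frame))
# {'command': 'pan', 'object': 'map', 'location': (45.52, -122.68)}
```

Statistically ranked unification then orders all unifiable speech-gesture pairs by their combined probabilities, as in the reranking sketch above.
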
General Research Questions
– To what extent can a multimodal system support mutual disambiguation of input signals?
– How much is robustness improved in a multimodal system, compared with a unimodal one?
– In what usage contexts, and for which user groups, is robustness most enhanced by a multimodal system?
– What are the asymmetries between modes in disambiguation likelihoods?

Study 1: Research Method
– QuickSet testing with map-based tasks (community fire & flood management)
– 16 users: 8 native speakers & 8 accented speakers (varied Asian, European & African accents)
– Research design: completely crossed factorial with between-subjects factors:
  (1) Speaker status (accented vs native)
  (2) Gender
– Corpus of 2,000 multimodal commands processed by QuickSet

Videotape: Multimodal system processing for accented and mobile users

Study 1: Results
– 1 in 8 multimodal commands succeeded due to mutual disambiguation (MD) of input signals
– MD levels were significantly higher for accented speakers than for native ones: 15% vs 8.5% of utterances
– The ratio of speech pull-ups to total signal pull-ups differed between groups: .65 for accented vs .35 for native speakers
– Results replicated across signal-level & parse-level MD

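To make the metric concrete (my paraphrase; the studies' exact scoring definition may differ in detail), a command is scored as an MD case when the fused interpretation is correct even though at least one component recognizer was wrong at the top of its own n-best list:

\[
\text{MD rate} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\,\text{fused interpretation}_i \text{ correct} \;\wedge\; \text{speech}_i \text{ or gesture}_i \text{ wrong at rank 1}\,\right]
\]

On this reading, "1 in 8 commands" corresponds to an overall MD rate of about 12.5%, which sits between the 15% accented and 8.5% native group rates.
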
Table 1: Mutual Disambiguation Rates for Native versus Accented Speakers

Table 2: Recognition Rate Differentials between Native and Accented Speakers for Speech, Gesture and Multimodal Commands

Study 1: Results (cont.)
– Compared to traditional speech processing, spoken language processed within a multimodal architecture yielded a 41.3% reduction in the total speech error rate
– No gender or practice effects were found in MD rates

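As a worked illustration with hypothetical baseline numbers (the slide does not give the absolute error rates), a 41.3% relative reduction means the multimodal architecture removes 41.3% of the speech-only errors:

\[
\text{relative reduction} \;=\; \frac{E_{\text{speech-only}} - E_{\text{multimodal}}}{E_{\text{speech-only}}}, \qquad \text{e.g.}\;\; \frac{0.300 - 0.176}{0.300} \approx 41.3\%.
\]
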
Study 2: Research Method
– QuickSet testing with the same 100 map-based tasks
– Main study:
  – 16 users with a high-end microphone (close-talking, noise-canceling)
  – Research design: completely crossed factorial:
    (1) Usage context: stationary vs mobile (within subjects)
    (2) Gender
– Replication:
  – 6 users with a low-end microphone (built-in, no noise cancellation)
  – Stationary vs mobile use compared

Study 2: Research Analyses
– Corpus of 2,600 multimodal commands
– Signal amplitude, background noise level & signal-to-noise ratio (SNR) estimated for each command (see the sketch below)
– Mutual disambiguation & multimodal system recognition rates analyzed in relation to the dynamic signal data

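As a rough sketch of how per-command SNR can be estimated (a generic RMS-based approach; the slide does not describe the study's actual signal-processing pipeline), given segmented command audio and a nearby stretch of background noise:

```python
# Minimal sketch of per-command SNR estimation from audio samples,
# assuming each command and a nearby background-noise segment can be
# isolated. A generic RMS-based estimate, not the study's pipeline.

import numpy as np

def rms(samples: np.ndarray) -> float:
    """Root-mean-square amplitude of an audio segment."""
    return float(np.sqrt(np.mean(np.square(samples, dtype=np.float64))))

def snr_db(command: np.ndarray, noise: np.ndarray) -> float:
    """Estimate SNR in decibels: 20 * log10(signal RMS / noise RMS)."""
    return 20.0 * np.log10(rms(command) / rms(noise))

# Hypothetical example: a 1-second command at 16 kHz plus background noise.
rng = np.random.default_rng(0)
noise_floor = 0.01 * rng.standard_normal(16_000)
speech = 0.1 * np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))
command_audio = speech + noise_floor

print(f"SNR ~ {snr_db(command_audio, noise_floor):.1f} dB")
```
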
Mobile user with hand-held system & close-talking headset in a moderately noisy environment (40-60 dB noise)

Mobile research infrastructure, with user instrumentation and researcher field station

Study 2: Results
– 1 in 7 multimodal commands succeeded due to mutual disambiguation of input signals
– MD levels were significantly higher during mobile than stationary system use: 16% vs 9.5% of utterances
– Results replicated across signal-level and parse-level MD

Table 3: Mutual Disambiguation Rates during Stationary and Mobile System Use

Table 4: Recognition Rate Differentials during Stationary and Mobile System Use for Speech, Gesture and Multimodal Commands

Study 2: Results (cont.)
– Compared to traditional speech processing, spoken language processed within a multimodal architecture yielded a 19-35% reduction in the total speech error rate (for the noise-canceling & built-in microphones, respectively)
– No gender effects were found in MD rates

Conclusions
– Multimodal architectures can support mutual disambiguation & improved robustness over unimodal processing
– Error rate reduction can be substantial: 20-40%
– Multimodal systems can reduce or close the recognition rate gap for challenging users (accented speakers) & usage contexts (mobile)
– Error-prone recognition technologies can be stabilized within a multimodal architecture, so that they function more reliably in real-world contexts

Future Directions & Challenges
– Intelligently adaptive processing, tailored for mobile usage patterns & diverse users
– Improved language & dialogue processing techniques, and hybrid multimodal architectures
– Novel mobile & pervasive multimodal concepts
– Break the robustness barrier: reduce error rates
(For more information—