Speech recognition, understanding and conversational interfaces Alexander Rudnicky School of Computer Science

Presentation transcript:

Outline Speech Types of speech interfaces Speech systems and their structure Designing speech interfaces Some applications –SpeechWear –Communicator

Speech as a signal
The difference between speech and sound
–“CD” quality vs. intelligible quality
–high-quality audio is sampled at 44.1 / 48 kHz
–desirable speech bandwidth: 0-8 kHz, 16 bits
–at 16 kHz sampling and 16 bits/sample: 256 kbps (tethered mic)
–telephone: 64 kbps (and lower)
Compression:
–MPEG: 64 kbps/channel and up (but not speech-optimal)
–CELP: 16 kbps … 2.4 kbps (optimized for speech)
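The bit rates on this slide follow from sampling rate times bits per sample. The sketch below reproduces that arithmetic. The function name is illustrative, and the 8 kHz / 8-bit telephone figures are the usual digital-telephony values, assumed here because the slide only gives the resulting 64 kbps.

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kbit/s (1 kbit = 1000 bits)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# A 0-8 kHz speech band needs a sampling rate of at least 2 x 8 kHz (Nyquist).
wideband_speech = pcm_bit_rate_kbps(16_000, 16)          # tethered microphone
telephone = pcm_bit_rate_kbps(8_000, 8)                  # assumed 8 kHz, 8-bit samples
cd_stereo = pcm_bit_rate_kbps(44_100, 16, channels=2)    # "CD" quality, two channels

print(wideband_speech, telephone, cd_stereo)
```

Against these uncompressed rates, a CELP coder at 2.4 to 16 kbps is a large reduction relative to the 256 kbps wideband speech signal.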

Speech for communication The difference between speech and language Speech recognition and speech understanding

Computers and speech Transcription –dictation, information retrieval Command and control –data entry, device control, navigation Information access –airline schedules, stock quotes Problem solving –travel planning, logistics

Speech system architecture SIGNAL PROCESSING DECODING UNDERSTANDING DISCOURSE ACTION

Varieties of speech systems

A generic speech system
Input: speech
Components: Signal processing, Decoder, Parser, Post parser, Dialog manager, Domain agents, Language Generator, Speech synthesizer
Output: speech, display, effector

Decoding speech
Signal processing: reduce dimensionality of signal; noise conditioning
Decoder: transcribe speech to words
Acoustic models, language models: corpus-based statistical models
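One common way to picture how a decoder uses both kinds of model is as a search for the word sequence with the best combined score: the acoustic model scores how well the words match the signal, and the language model scores how plausible the word sequence is. The sketch below is a toy version of that idea, not the Sphinx decoder; the candidate transcriptions and all log-probability values are made up for illustration.

```python
# Hypothetical log-probabilities for three candidate transcriptions of one utterance.
hypotheses = {
    "recognize speech":   {"acoustic": -12.0, "language": -4.0},
    "wreck a nice beach": {"acoustic": -11.5, "language": -9.0},
    "recognize peach":    {"acoustic": -12.5, "language": -7.5},
}


def combined_score(scores, lm_weight=1.0):
    """log P(signal | words) + lm_weight * log P(words)."""
    return scores["acoustic"] + lm_weight * scores["language"]


def decode(candidates, lm_weight=1.0):
    """Return the candidate word sequence with the highest combined score."""
    return max(candidates, key=lambda words: combined_score(candidates[words], lm_weight))


print(decode(hypotheses))                  # both models contribute
print(decode(hypotheses, lm_weight=0.0))   # acoustic model alone
```

With these illustrative numbers, the language model overrides a slightly better acoustic match in favor of the more plausible word sequence. A real decoder searches a far larger hypothesis space rather than scoring a fixed list.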

Creating models for recognition Acoustic models Language models Speech data Text data Train Transcribe*
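As a rough illustration of training a language model from text data, the sketch below estimates bigram probabilities by relative frequency. The three-sentence corpus is invented for the example, and the model is unsmoothed, so it assigns no probability to unseen word pairs; practical recognizers use much larger corpora and smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | word) by relative frequency from raw text."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence-boundary markers let the model learn how utterances start and end.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {nxt: count / total for nxt, count in counter.items()}
    return model


corpus = [
    "i would like to fly to boston",
    "i would like to go to boston on friday",
    "i want to fly to pittsburgh",
]
model = train_bigram_model(corpus)
print(model["i"])    # distribution over words that follow "i"
print(model["to"])   # distribution over words that follow "to"
```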

Understanding speech
Parser: extract semantic content from utterance
Post parser: introduce context and world knowledge into interpretation
Grammar, Context, Domain Agents
Grounding, knowledge engineering
Ontology design, language acquisition

Interacting with the user
Dialog manager: guide interaction through task; map user inputs and system state into actions
Domain agents: interact with back-end(s); interpret information using domain knowledge
Task schemas, Database, Live data (e.g. Web), Domain expert, Context
Task analysis, Knowledge engineering

Communicating with the user Language Generator Speech synthesizer Display Generator Action Generator Decide what to say to user (and how to phrase it)

Speech recognition and understanding Sphinx system –speaker-independent –continuous speech –large vocabulary ATIS system –air travel information retrieval –context management film clip

Command and control systems Small vocabularies, fixed syntax –OPEN WINDOW –MOVE OBJECT to –Applications: data entry (e.g., zip codes), process control (e.g., electron microscope, darkroom equipment) Large vocabulary, fixed syntax –Web browsing (?)

SpeechWear Vehicle inspection task –USMC mechanics, fixed inspection form –Wearable computer (COTS components) –html-based task representation film clip

Information access Moderate to very large vocabulary –IVR and frame based systems Commercial systems: –Nuance: –SpeechWorks: –lots of others..

IVR and frame-based systems Interactive voice response (IVR) –interactions specified by a graph (typically a tree) Frame systems –ergodic graphs –states defined by multi-item forms

Graph-based systems Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan,... What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal,.....
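A graph-based (IVR) interaction like the bank example can be sketched as a tree of prompts in which each recognized keyword selects the next node. The data structure and function names below are assumptions made for illustration, and the leaf prompts are placeholders; only the two menu prompts come from the slide.

```python
def leaf(text):
    """A terminal node: a prompt with no further choices (placeholder content)."""
    return {"prompt": text, "choices": {}}


IVR_TREE = {
    "prompt": "Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan.",
    "choices": {
        "balance": leaf("(balance dialog)"),
        "hours": leaf("(hours dialog)"),
        "loan": {
            "prompt": ("What type of loan are you interested in? "
                       "Please say one of the following: Mortgage, Car, Personal."),
            "choices": {
                "mortgage": leaf("(mortgage dialog)"),
                "car": leaf("(car loan dialog)"),
                "personal": leaf("(personal loan dialog)"),
            },
        },
    },
}


def run_ivr(tree, recognized_words):
    """Walk the tree, one recognized keyword per turn; return the prompts played."""
    node = tree
    prompts = [node["prompt"]]
    for word in recognized_words:
        next_node = node["choices"].get(word.lower())
        if next_node is None:
            # Out-of-menu input: stay on the same node and repeat its prompt.
            prompts.append("Sorry, I did not understand. " + node["prompt"])
            continue
        node = next_node
        prompts.append(node["prompt"])
    return prompts


for prompt in run_ivr(IVR_TREE, ["Loan", "Car"]):
    print(prompt)
```

Because every turn only accepts the keywords offered at the current node, the recognizer can use a small vocabulary at each step, which fits the well-structured, shallow tasks this kind of system is used for.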

Frame-based systems I would like to fly to Boston –I’d like to go to Boston on Friday, … When would you like to fly? Destination_City: Boston Departure_Date: ______ Departure_Time: ______ Preferred_Airline: ______...
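The frame on this slide can be sketched as a set of slots that any utterance may fill in any order, with the system prompting for the first slot still empty. The slot names come from the slide; the regular-expression patterns, the city list, and all prompts other than "When would you like to fly?" are toy assumptions standing in for a real parser.

```python
import re

SLOTS = ["Destination_City", "Departure_Date", "Departure_Time", "Preferred_Airline"]

PROMPTS = {
    "Destination_City": "Where would you like to fly?",
    "Departure_Date": "When would you like to fly?",
    "Departure_Time": "What time would you like to leave?",
    "Preferred_Airline": "Do you have a preferred airline?",
}

# Toy patterns; a real system would use a grammar-based parser instead.
PATTERNS = {
    "Destination_City": re.compile(r"\bto (boston|pittsburgh|chicago)\b", re.I),
    "Departure_Date": re.compile(
        r"\bon (monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b", re.I),
}


def update_frame(frame, utterance):
    """Fill every slot whose pattern matches; one utterance may fill several."""
    for slot, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            frame[slot] = match.group(1).title()
    return frame


def next_prompt(frame):
    """Ask for the first unfilled slot, or return None when the form is complete."""
    for slot in SLOTS:
        if frame.get(slot) is None:
            return PROMPTS[slot]
    return None


frame = update_frame(dict.fromkeys(SLOTS), "I would like to fly to Boston")
print(frame)
print(next_prompt(frame))
```

Unlike the graph-based menu, the user here decides how much to say per turn: "I'd like to go to Boston on Friday" fills two slots at once and the system skips straight to the next missing item.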

Frame-based systems [Figure: several forms, each with a set of unfilled slots] Transition on keyword or phrase

Some problems
IVR systems work great, but only for well-structured (& “shallow”) tasks
Frame systems are good for “tasks” that correspond to a single form leading to an action
Neither approach does well with more complex problem-solving activities

Dialog Systems
Problem solving activity; complex task
–Order of progression through task depends on user goals (which can change) and system state (a back-end retrieval) and is not predictable.
Track progress and help task along
–mixed-initiative dialog
Discourse phenomena
–Users expect to “converse” with the system

Carnegie Mellon Communicator A dialog system that supports complex problem solving in a travel planning domain –create an itinerary using air schedule, hotel and car information –186 U.S. airports (>140k enplanements/yr) currently: >500 world airports Web-based data resources –Live and cached flight information –Airport, airline, etc. information

Value schema/handlers value transform receptors Domain Agent

Compound schema value transform Value_3 Value_1 Value_2 Domain Agent e.g. SQL query +

Schema ordering Value i Value j Value k Schema i Schema j Schema k Destination airport Date Time Flight Leg Value transform Available flights Database lookup
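One possible reading of the schema diagrams is that individual values (destination airport, date, time) accumulate in a flight-leg schema, and once the schema is complete a value transform turns it into a database lookup for available flights. The sketch below illustrates that reading only. The table layout, column names, flight numbers, and function names are all hypothetical, and nothing here is taken from the actual Communicator implementation.

```python
import sqlite3

LEG_VALUES = ("destination_airport", "date", "time")


def leg_is_complete(leg):
    """A flight-leg schema is ready once every component value is filled."""
    return all(leg.get(name) is not None for name in LEG_VALUES)


def to_query(leg):
    """Value transform: turn a completed schema into a parameterized SQL query."""
    sql = ("SELECT flight_no FROM flights "
           "WHERE destination = ? AND date = ? AND departure_time >= ? "
           "ORDER BY departure_time")
    return sql, (leg["destination_airport"], leg["date"], leg["time"])


def available_flights(db, leg):
    """Run the lookup, or return None while the schema is still incomplete."""
    if not leg_is_complete(leg):
        return None
    sql, params = to_query(leg)
    return [row[0] for row in db.execute(sql, params)]


# Hypothetical in-memory back end with made-up flights.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flights "
           "(flight_no TEXT, destination TEXT, date TEXT, departure_time TEXT)")
db.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", [
    ("XX100", "BOS", "2000-06-02", "08:15"),
    ("XX200", "BOS", "2000-06-02", "17:40"),
    ("XX300", "ORD", "2000-06-02", "09:00"),
])

leg = {"destination_airport": "BOS", "date": "2000-06-02", "time": None}
print(available_flights(db, leg))   # schema incomplete, no lookup yet
leg["time"] = "12:00"
print(available_flights(db, leg))   # lookup runs once all values are present
```

Deferring the lookup until the schema is complete means the order in which the user supplies the values does not matter.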

Carnegie Mellon Communicator CMU Communicator –Call: –the information is accurate; you can use it for your own travel planning...

User-aware speech interfaces Predictable behavior on the system’s part Users communicate at different levels (see Chars.html)

User-aware speech interfaces Content: task-centric utterances Possibility: What can I do? Orientation: Where are we? Navigation: moving through the task space Control: verbose/terse, listen! Customization: define this word

Speech interface guidelines Speech recognition is errorful System state is often opaque to the user (see pInGuidelines/SpInGuidelines.html)

Interface guidelines State transparency Input control Error recovery Error detection Error correction Log performance Application integration

Summary Speech and language communication Dialog structure Interface design