Speech Technology. HOT! What are the big players in the area up to? Google – http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html


Speech Technology

HOT!

What are the big players in the area up to?
– Google – http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html
– Microsoft – enabled-world/
– Apple – heats-up-the-race-toward-a-voice-activated/
– IBM
– Nuance – speechify-apps/
– Voxeo

Apple, and the case of Siri
– Siri: B06O4
– Review of Siri: SkAU7c&feature=watch_response

Types of dialog systems
by modality
– text-based
– spoken
– graphical user interface
– multi-modal
by device
– telephone-based systems
– PDA systems
– in-car systems
– robot systems
– desktop/laptop systems
   native
   in-browser systems
   in-virtual machine
– in-virtual environment
– robots
by style
– command-based
– menu-driven
– natural language
by initiative
– system initiative
– user initiative
– mixed initiative
by application
– information service
– command-and-control
– entertainment
– education/tutorial
– edutainment
– reminder systems
– companion systems
– healthcare
– eldercare
– assistive/access systems

More about application types
Information providing systems:
– weather reports
– stock quotes
– timetables
– ...
Transaction-based systems:
– calendar functions
– shopping
– financial transactions
– travel reservations
– ...

Why Voice?

Why voice?
– Wireless devices have small screens and limited input capabilities.
– A telephone keypad can give users only a limited number of choices.
– Speech technology is improving.
– The exchange of information between a person and a computer is becoming more like a real conversation.
– Users want hands-free or eyes-free use.
– From a business viewpoint, voice applications open up a host of new revenue opportunities.
– There are many more telephones than computers with the potential to access the Internet.

Traditional Interactive Voice Response (IVR)

Speech versus Touch Tone

Architecture 1

Architecture 2

Today
– Presentation of project ideas
– TTS evaluation
– Short intro to XML
– Speech technology standards overview
– Speech Synthesis Markup Language (SSML)
– Presentation of home assignment 3: ASR evaluation

Project ideas?

Intro to XML

W3C Speech Standards Torbjörn Lager

VoiceXML – a part of the web
[Diagram: web servers deliver VoiceXML to a VoiceXML browser (ASR, TTS, interpreter) and HTML to an HTML browser.]

The place of speech technology
"… speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies." (Tim Berners-Lee)

The What and Why of Standards
Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages.
Advantages:
– developers can create applications using the standard languages that are portable across a variety of platforms;
– products from different vendors are able to interact with each other;
– a community of experts evolves around the standard and is available to develop products and services based on the standard.
Disadvantages:
– some developers feel that standards may inhibit creativity and stall the introduction of superior technology.
However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough. Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.

World Wide Web Consortium

W3C Speech Standards
– Speech Recognition Grammar Specification (SRGS): what the user can say
– Semantic Interpretation for Speech Recognition (SISR): what the user means
– Speech Synthesis Markup Language (SSML): what the user hears
– VoiceXML: dialog management, what the system is to do

Speech Recognition Grammar Specification (SRGS)
Covers both speech and DTMF (Dual-Tone Multi-Frequency) input. (DTMF is valuable in noisy conditions or when the social context makes it awkward to speak.)
Grammars can be specified in either an XML or an equivalent augmented BNF (ABNF) syntax.
– Speech recognition is an inherently uncertain process. Recognizers may report confidence values.
– If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (N-best results).
What about statistical language models? Not covered by SRGS!

Semantic Interpretation for Speech Recognition (SISR)
The grammar maps several spoken variants onto one semantic value:
– yes, yeah, you bet, oui → yes
– no, nope, no way → no
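The idea on this slide can be sketched outside of SRGS/SISR as a plain lookup table. This is only an illustration in Python, not SISR syntax; the names `SEMANTIC_VALUES` and `interpret` are hypothetical and do not come from the slides.

```python
from typing import Optional

# Hypothetical table: recognized phrase -> canonical semantic value.
SEMANTIC_VALUES = {
    "yes": "yes", "yeah": "yes", "you bet": "yes", "oui": "yes",
    "no": "no", "nope": "no", "no way": "no",
}

def interpret(utterance: str) -> Optional[str]:
    """Return the semantic value of a recognized phrase, or None if it is out of grammar."""
    return SEMANTIC_VALUES.get(utterance.strip().lower())
```

The dialog manager then only has to deal with two values, however many ways of saying yes or no the grammar accepts.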

Semantic Interpretation for Speech Recognition (SISR)
Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result: { drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } }

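SISR grammar for the pizza order, rule by rule (rule names are taken from the rules.* references in the tags; tags are shown in braces)
– root rule: I would like a [drink] { out.drink = new Object(); out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize; } and [pizza] { out.pizza=rules.pizza; }
– kindofdrink: coke | pepsi | coca cola { out="coke"; }
– foodsize: { out="medium"; } then an optional size: small { out="small"; } | medium | large { out="large"; } | regular { out="medium"; }
– tops: { out=new Array; } [top] { out.push(rules.top); } and [top] { out.push(rules.top); }
– top: anchovies | pepperoni | mushroom { out="mushrooms"; } | mushrooms
– drink: [foodsize] [kindofdrink] { out.drinksize=rules.foodsize; out.type=rules.kindofdrink; }
– pizza: [number] [foodsize] { out.pizzasize=rules.foodsize; out.number=rules.number; } pizzas with [tops] { out.topping=rules.tops; }
– number: { out=1; } then one of: a | one | two { out=2; } | three { out=3; }
Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result: { drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } }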
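To see how the tags build up the result, here is a rough Python sketch that produces the same structure for the example utterance. It is not an SRGS/SISR processor: the grammar is flattened into one regular expression (a swapped-in technique), and names such as `ORDER` and `interpret_order` are hypothetical.

```python
import re
from typing import Optional

# Per-rule value tables, mirroring the defaults and overrides in the tags.
DRINKS = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}
SIZES = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
TOPPINGS = {"anchovies": "anchovies", "pepperoni": "pepperoni",
            "mushroom": "mushrooms", "mushrooms": "mushrooms"}
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}

# Assumption: the whole grammar flattened into a single pattern.
ORDER = re.compile(
    r"^i would like a "
    r"(?:(?P<dsize>small|medium|large|regular) )?"
    r"(?P<drink>coke|pepsi|coca cola) and "
    r"(?P<number>a|one|two|three) "
    r"(?:(?P<psize>small|medium|large|regular) )?"
    r"pizzas with (?P<tops>.+)$"
)

def interpret_order(utterance: str) -> Optional[dict]:
    """Return the semantic result for an in-grammar order, else None."""
    m = ORDER.match(utterance.strip().lower())
    if m is None:
        return None
    tops = m["tops"].split(" and ")
    if any(t not in TOPPINGS for t in tops):
        return None
    return {
        "drink": {
            "liquid": DRINKS[m["drink"]],
            # A missing size falls back to "medium", like the default tag in foodsize.
            "drinksize": SIZES.get(m["dsize"], "medium"),
        },
        "pizza": {
            "number": NUMBERS[m["number"]],
            "pizzasize": SIZES.get(m["psize"], "medium"),
            "topping": [TOPPINGS[t] for t in tops],
        },
    }
```

Calling `interpret_order` on the example utterance yields the same drink and pizza structure as the result shown on the slide.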

Foundational
– Grammar (CFG, PSG)
– Automata theory (FSMs, FSTs, etc.)
– Logic
– Phonetics
– Linguistics
– Computer science

Speech Synthesis Markup Language (SSML)
The key concepts of SSML are
– interoperability, or interacting with other markup languages (VoiceXML, etc.);
– consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and
– internationalization, or enabling speech output in a large number of languages within or across documents.

Speech Synthesis Markup Language (SSML) – An Example
Example text: For English, press one. Para español, oprima el dos.
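The markup of this example is not preserved in the transcript. A plausible reconstruction, wrapped in a small Python check, might look as follows. The choice of `s` elements and the language tags `en-US` and `es-MX` are assumptions; only the `speak` root element, its namespace and `xml:lang` come from the SSML 1.0 Recommendation.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed markup for the two prompts on the slide.
SSML = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s>For English, press one.</s>
  <s xml:lang="es-MX">Para español, oprima el dos.</s>
</speak>"""

def sentence_languages(ssml_text):
    """Return (language, text) for each s element; the language defaults to the root's xml:lang."""
    root = ET.fromstring(ssml_text)
    default_lang = root.get(XML_LANG)
    return [(s.get(XML_LANG, default_lang), s.text)
            for s in root.iter("{%s}s" % SSML_NS)]
```

This illustrates the internationalization point: one document can switch language per sentence, and the synthesizer is expected to pick a suitable voice for each.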

Text Structure: p and s Elements
A p element represents a paragraph. An s element represents a sentence.
Example text: This is the first sentence of the paragraph. Here's another sentence.

The phoneme Element
The phoneme element provides a phonemic/phonetic pronunciation for the contained text.
Example text: tomato

The sub Element
The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.
Example text: W3C
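A small sketch of what "replaces the contained text for pronunciation" means in practice. The helper `spoken_text` and the sample sentence are hypothetical; the `alias` attribute is the one described on the slide.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/10/synthesis}"

# Sample document (an assumption, not from the slides): the written form is "W3C".
DOC = ('<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
       'The <sub alias="World Wide Web Consortium">W3C</sub> publishes SSML.</speak>')

def spoken_text(ssml_text):
    """Return the text a synthesizer would speak, using alias values for sub elements."""
    parts = []

    def walk(element):
        if element.tag == NS + "sub":
            parts.append(element.get("alias", ""))   # spoken form replaces the content
        else:
            if element.text:
                parts.append(element.text)
            for child in element:
                walk(child)
        if element.tail:
            parts.append(element.tail)               # text following the element

    walk(ET.fromstring(ssml_text))
    return " ".join("".join(parts).split())
```

The document keeps the written form (W3C) for display while the synthesizer receives the expanded form.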

The voice Element
The voice element is a production element that requests a change in speaking voice. A selection of attributes is:
– gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
– age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text.
– name: optional attribute indicating a processor-specific voice name to speak the contained text.
Example text: Mary had a little lamb, Its fleece was white as snow. I want to be like Mike.

The emphasis Element
The emphasis element requests that the contained text be spoken with emphasis.
Example text: That is a big car! That is a huge bank account!

The break Element
The break element is an empty element that controls the pausing or other prosodic boundaries between words.
Example text: Take a deep breath then continue. Press 1 or wait for the tone. I didn't hear you! Please repeat.
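The break markup is lost in the transcript. The sketch below assumes one way the example could be marked up (the positions of the `break` elements and their `time` and `strength` values are assumptions) and extracts the explicitly requested pause lengths. The helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/10/synthesis}"
_TIME = re.compile(r"^(\d*\.?\d+)(ms|s)$")

# Assumed markup for the example text on the slide.
PROMPT = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Take a deep breath <break/> then continue.
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you! <break strength="weak"/> Please repeat.
</speak>"""

def time_to_seconds(value):
    """Convert a CSS2-style time value such as "250ms" or "3s" to seconds."""
    match = _TIME.match(value)
    if match is None:
        raise ValueError("not a time value: %r" % value)
    number, unit = float(match.group(1)), match.group(2)
    return number / 1000.0 if unit == "ms" else number

def explicit_pauses(ssml_text):
    """Return the pause length in seconds of every break that carries a time attribute."""
    root = ET.fromstring(ssml_text)
    return [time_to_seconds(b.get("time"))
            for b in root.iter(NS + "break") if b.get("time") is not None]
```

Breaks without a `time` attribute leave the pause length to the synthesis processor, so they do not appear in the returned list.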

The prosody Element
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:
– pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
– contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
– range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
– rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
– duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
– volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.

The prosody Element (cont’d)
Pitch contour. The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value).
Example text: good morning
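The contour format described above can be read with a few lines of Python. The contour string used here, "(0%,+20Hz) (10%,+30%) (40%,+10Hz)", is an assumed value for the "good morning" example, since the attribute itself is not preserved in the transcript; `parse_contour` is a hypothetical helper.

```python
import re

# One "(time position,target)" pair: a percentage, a comma, then a pitch target.
_PAIR = re.compile(r"\(\s*(\d+(?:\.\d+)?)%\s*,\s*([^)\s]+)\s*\)")

def parse_contour(contour):
    """Split a contour attribute into (percent of duration, pitch target) pairs."""
    return [(float(position), target) for position, target in _PAIR.findall(contour)]
```

The targets are kept as strings because they may be absolute ("120Hz"), relative ("+20Hz", "+30%") or labels ("high"). How the pitch moves between the targets is left to the processor.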

Today
– Project reminder
– Presentation of the results of the TTS evaluation
– Speech Synthesis Poetry Slam
– Wrapping up TTS (stages of TTS)
– Presentation of home assignment 3: ASR evaluation
– Automatic speech recognition (ASR)
– Natural language understanding (NLU)
– Speech Recognition Grammar Specification (SRGS)
– Semantic Interpretation for Speech Recognition (SISR)
– Thursday's Lab session

Architecture 1

Wrapping up TTS
Stages of TTS:
– Structure analysis (sentence splitting)
– Text normalisation
– Text to phoneme conversion
– Prosody analysis
– Waveform production
Speech Synthesis Markup Language
– enables developers to override default behavior

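TTS stages and SSML elements
Stage: SSML elements
– Structure analysis (sentence splitting): p, s, and the punctuation marks . ? !
– Text normalisation: say-as, sub
– Text to phoneme conversion: phoneme
– Prosody analysis: emphasis, break, prosody, and the punctuation marks . ? !
– Waveform production: voice, audio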

Prosody analysis
Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses. Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on the parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be de-stressed. The synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory. Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules for different languages and varieties may be quite different. For example, the prosody of American, British, Indian, and Jamaican English differs.

Speech Recognition (ASR)

Architecture 1

ASR Input and Output
A speech recognizer is a component with the following inputs and outputs:
Input
– A grammar or multiple grammars as defined by the SRGS specification. These grammars inform the recognizer of the words and patterns of words to listen for.
– An audio stream that may contain speech content that matches the grammar(s).
– Parameters: timeouts, recognition thresholds, or N-best result counts.
Output
– Descriptions of results that indicate details about the speech content detected by the speech recognizer. Recognizers will include at least a transcription of any detected words.
– Errors and other performance information such as confidence.
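How the threshold and N-best parameters shape the output can be sketched as follows. This models no particular recognizer; `Hypothesis` and `n_best` are hypothetical names, and the confidence scale of 0.0 to 1.0 is an assumption.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    transcription: str   # the words the recognizer believes it heard
    confidence: float    # assumed scale: 0.0 (no confidence) to 1.0 (certain)

def n_best(hypotheses: List[Hypothesis], n: int, threshold: float) -> List[Hypothesis]:
    """Drop hypotheses below the recognition threshold and return the n most confident ones."""
    kept = [h for h in hypotheses if h.confidence >= threshold]
    kept.sort(key=lambda h: h.confidence, reverse=True)
    return kept[:n]
```

With hypotheses "hello" (0.9), "hollow" (0.4) and "yellow" (0.2), a threshold of 0.3 and n = 2 keep "hello" and "hollow"; an empty result would correspond to a no-match event that the dialog has to handle.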

SRGS

hello
s -> "hello"

SRGS
hello goodbye
s -> "hello"
s -> "goodbye"
s -> "hello" | "goodbye"

SRGS
hello how are you
s -> "hello" ("how are you")

SRGS
hello
s -> "hello"
s -> "hello" s
s -> "hello"+
NOTE: Listing is no longer possible (the rule is recursive, so the set of accepted utterances is infinite).

SRGS hello goodbye

SRGS
hello goodbye
s -> greeting+
greeting -> "hello" | "goodbye"
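The two rules above can be tried out with a tiny recognizer for word sequences. This is only a sketch in Python, not an SRGS implementation: `greeting+` is rewritten as the recursive pair of alternatives "greeting s" and "greeting", and the names `GRAMMAR`, `derive` and `accepts` are hypothetical.

```python
# Each rule maps to a list of alternatives; each alternative is a sequence of symbols.
GRAMMAR = {
    "s": [["greeting", "s"], ["greeting"]],     # s -> greeting+ written recursively
    "greeting": [["hello"], ["goodbye"]],
}

def derive(symbol, words, pos):
    """Yield every position reachable after matching symbol against words[pos:]."""
    if symbol not in GRAMMAR:                   # terminal: must equal the next word
        if pos < len(words) and words[pos] == symbol:
            yield pos + 1
        return
    for alternative in GRAMMAR[symbol]:
        positions = [pos]
        for part in alternative:
            positions = [end for start in positions
                         for end in derive(part, words, start)]
        yield from positions

def accepts(utterance):
    """True if the whole utterance can be derived from the start rule s."""
    words = utterance.lower().split()
    return len(words) in derive("s", words, 0)
```

For example, `accepts("hello goodbye hello")` holds, while "hello world" and the empty utterance are rejected, because every derivation needs at least one greeting and nothing but greetings.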

SRGS Boston Philadelphia Fargo Florida North Dakota New York

SRGS + SISR hello hello hi

SRGS + SISR
The grammar maps several spoken variants onto one semantic value:
– yes, yeah, you bet, oui → yes
– no, nope, no way → no

SISR
Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result: { drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } }

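SISR grammar for the pizza order, rule by rule (rule names are taken from the rules.* references in the tags; tags are shown in braces)
– root rule: I would like a [drink] { out.drink={}; out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize; } and [pizza] { out.pizza=rules.pizza; }
– kindofdrink: coke | pepsi | coca cola { out="coke"; }
– foodsize: { out="medium"; } then an optional size: small { out="small"; } | medium | large { out="large"; } | regular { out="medium"; }
– tops: { out=[]; } [top] { out.push(rules.top); } and [top] { out.push(rules.top); }
– top: anchovies | pepperoni | mushroom { out="mushrooms"; } | mushrooms
– drink: [foodsize] [kindofdrink] { out.drinksize=rules.foodsize; out.type=rules.kindofdrink; }
– pizza: [number] [foodsize] { out.pizzasize=rules.foodsize; out.number=rules.number; } pizzas with [tops] { out.topping=rules.tops; }
– number: { out=1; } then one of: a | one | two { out=2; } | three { out=3; }
Utterance: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result: { drink: { liquid:"coke", drinksize:"medium" }, pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } }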