Download presentation
Presentation is loading. Please wait.
1
VISIONS, TECHNOLOGY, AND BUSINESS OF TALKING MACHINES Roberto Pieraccini, CTO, Tell-Eureka Corporation 535 West 34 th Street New York, NY 10001 +1 646 792 2744 roberto@telleureka.com http://www.telleureka.com
2
The vision
3
Recreating the Speech Chain DIALOG SEMANTICS SYNTAX LEXICON MORPHOLOG Y PHONETICS VOCAL-TRACT ARTICULATORS INNER EAR ACOUSTIC NERVE SPEECH RECOGNITION DIALOG MANAGEMENT SPOKEN LANGUAGE UNDERSTANDING SPEECH SYNTHESIS
4
The technology
5
Talking Machines: First Steps into Spoken Language Technology Joseph Faber (1835) Von Kempelen (1791) Homer Dudley Bell Labs (1939)
6
Speech Recognition: the Early Years 1952 – Automatic Digit Recognition (AUDREY) Davis, Biddulph, Balashek (Bell Laboratories)
7
1960’s – Speech Processing and Digital Computers AD/DA converters and digital computers start appearing in the labs James Flanagan Bell Laboratories
8
The Illusion of Segmentation... or... Why Speech Recognition is so Difficult m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR NP VP (user:Roberto (attribute:telephone-num value:7360474))
9
The Illusion of Segmentation... or... Why Speech Recognition is so Difficult m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR NP VP (user:Roberto (attribute:telephone-num value:7360474)) errors Intra-speaker variability Noise/reverberation Coarticulation Context-dependency Word confusability Word variations Speaker Dependency Multiple Interpretations Limited vocabulary Ellipses and Anaphors rules
10
1969 – Whither Speech Recognition? […] General purpose speech recognition seems far away. Social-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish. […] It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but no sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour. […] Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach). The Journal of the Acoustical Society of America, June 1969 J. R. Pierce Executive Director, Bell Laboratories
11
1971-1976: The ARPA SUR project In spite of the anti-speech recognition campaign headed by the Pierce Commission ARPA launches into a 5 year program on Spoken Understanding Research REQUIREMENTS: 1000 word vocabulary, 90%understanding rate, near real time on a 100 MIPS machine 4 Systems built by the end of the program SDC (24%) BBN’s HWIM (44%) CMU’s Hearsay II (74%) CMU’s HARPY (95% -- 80 times real time!) HARPY was based on an engineering approach search on a network representing all the possible utterances Lack of a scientific evaluation approach Speech Understanding: too early for its time The project was not extended. Raj Reddy -- CMU LESSON LEARNED: Hand-built knowledge does not scale up Need of a global “optimization” criterion
12
Vintage Speech Recognition
13
1970’s – Dynamic Time Warping The Brute Force of the Engineering Approach TEMPLATE (WORD 7) UNKNOWN WORD T.K. Vyntsyuk (1969) H. Sakoe, S. Chiba (1970) Isolated Words Speaker Dependent Connected Words Speaker Independent Sub-Word Units
14
1980s -- The Statistical Approach Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s Purely statistical approach pursued by Fred Jelinek and Jim Baker at IBM T.J.Watson Research Foundations of modern speech recognition engines Fred Jelinek S1S1 S2S2 S3S3 a 11 a 12 a 22 a 23 a 33 Acoustic HMMs Word Tri-grams No Data Like More Data Whenever I fire a linguist, our system performance improves (1988) Some of my best friends are linguists (2004) Jim Baker
15
1980-1990 – The statistical approach becomes ubiquitous Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceeding of the IEEE, Vol. 77, No. 2, February 1989.
16
1980s-1990s – The Power of Evaluation Pros and Cons of DARPA programs + Continuous incremental improvement - Loss of “bio-diversity” SPOKEN DIALOG INDUSTRY SPEECHWORKS NUANCE MIT SRI TECHNOLOGY VENDORS PLATFORM INTEGRATORS APPLICATION DEVELOPERS HOSTING TOOLS STANDARDS 1997 1995 1996 1998 1999 2000 2001 2002 2003 2004 2005
17
The business of speech
18
Voice User Interface (VUI) Design—the Quantum Leap in Dialog Systems 1995 -- The WildFire Effect Change of perspective: From technology driven to user centered RESEARCH: Natural Language free form Commercial: Task completion and usability. Persona: the personality of the application (TTS vs. Recording) Speech recognition accuracy is important, but success is determined by the VUI. The importance of a repeatable, streamlined, teachable, development process
19
The Speech Application Lifecycle 23 45 6 7 8 9 10 requirements VUI design usability 1 VUI development speech science high level system design system engineering integration partial deployment full deployment Analyst VUI Designer Speech Scientist VUI Designer Architect, App Developer Engineer Project Manager
20
Get Origin Account Get Destination Account Get Amount Enter Transfer amount > origin account? Play Wrong Amount Message YES Play Confirmation confirmed? What is wrong? Go to Main Menu NO YES NO amount destination account origin account Voice User Interface Design Get AmountInteraction Module PROMPTS TypeWordingSource InitialPlease say the amount you would like to transfer from yourget_amount_I_1.wav TTS to yourget_amount_I_2.wav TTS in dollars and cents.get_amount_I_3.wav Retry 1Please say the amount you would like to transfer from yourget_amount_I_1.wav TTS to yourget_amount_I_2.wav TTS in dollars and cents.get_amount_I_3.wav Retry 2Please say the amount you would like to have transferred, like one hundred dollars and fifty cents.get_amount_R_2_1.wav Timeout 1 I'm sorry, I didn't hear you.get_amount_T_1_1.wav Please say the amount you would like to transfer from yourget_amount_I_1.wav TTS to yourget_amount_I_2.wav TTS Timeout 2 I didn't hear you this time either. Please say the amount you would like to have transferred, like one hundred dollars and fifty cents.get_amount_T_2_1.wav Help Please say how much do you wish to transfer. You can say the amount in dollars and cents, like, for instance, one hundred dollars and fifty cents.get_amount_H.wav ACTIONS CONDITIONACTION if amount greater than amount in Go to "Play Wrong Amount Message" elseGo to "Play Confirmation"
21
Speech Science: Tuning for performance Recognition Correctly Recognize Mis- recognize In Vocabulary Out of Vocabulary Falsely Reject False rejection Falsely Accept Correctly Reject Correct rejection False acceptance - out Accept Confirm False acceptance - in False confirmation Accept Confirm Correct acceptance Correct confirmation
22
Speech Science: Tuning for performance utt#sub-err%fa-err%fr-err%rej%OOV%fa-oov% WaitPowerBothUp-2175.8800 0 WaitHowMuchSnow175.8811.765.8823.5329.4140 MissingOneChannel224.55009.09 0 WPAllChannels234.350 8.74.350 PictureBack273.7 7.41 50 WaitFindInputSource293.450013.79 0 PictureProb333.0312.1200 100 DM Utt# = Number of utterances Sub-err% = percent of in-voc utterances wrongly recognized Fa-err% = percent of utterances wrongly accepted Fr-err% = percent of utterances wrongly rejected Rej% = total percent of all utterances rejected OOV% = percent of out-voc utterances Fa-oov% = percent of out-voc utterances wrongly accepted ACTION -Prioritize grammars that need improvement -Use transcriptions to improve grammars
23
The Architectural Evolution of Spoken Dialog Native Code Proprietary IVR Systems Standard Clients (VoiceXML) Standard Application servers 1994199820002005
24
Telephony Platform The Voice Web Web Server Voice Browser Internet ASR TTS Telephone VoiceXML /SALT MRCP SSML, SRGF EMMA? SCXML? CCXML
25
The Evolution of the Interface and the Research-Industry Chasm Natural Language Directed Dialog 1994199619982000200220042006 Research Systems a-la DARPA Communicator Small Vocabulary Menu Based Large Vocabulary, Dialog Modules SLU: Statistical Language Understanding Spoken dialog as an anthropomorphic system Spoken dialog as a tool
26
The evolution of the market and the industry TECHNOLOGY VENDORS SPEECH RECOGNITION, TTS PLATFORM INTEGRATORS IVR, VoiceXML, CTI,… TOOLS – AUTHORING, TUNING, PREPACKAGED APPLICATIONS APPLICATION DEVELOPERS PROFESSIONAL SERVICES HOSTING 600 to 1,000M$ revenue > 8000 apps worldwide New evolving standards guarantee interoperability of engines and platforms.
27
Third generation dialog systems LOW MEDIUM HIGH COMPLEXITY FLIGHT STATUS STOCK TRADING PACKAGE TRACKING FLIGHT/TRAIN RESERVATION BANKING CUSTOMER CARE TECHNICAL SUPPORT 1 st Generation INFORMATIONAL 2 nd Generation TRANSACTIONAL 3 RD Generation PROBLEM SOLVING
28
2005 -- Spoken Dialog goes to Saturday Night Live
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.