
1 Speech Recognition: A 50 Year Retrospective
Paper at ASA 2004 in Honor of the Contributions of James Flanagan
Raj Reddy, School of Computer Science, Carnegie Mellon University, Pittsburgh. November 15, 2004
It is a pleasure for me to be part of this session to honor the many contributions of Jim Flanagan. Providing a 50 year retrospective on speech recognition research is a daunting task; fortunately, my task has been made easier by the presentations of Doctors Furui and Zue.

2 Speech Recognition
Objective: Recognize, interpret, and execute spoken language input to computers
Background: AT&T, CMU, IBM, and MIT have been working on the problem for over 40 years
Other Key Contributors: BBN, Dragon Systems, Kurzweil, SRI, Japan Inc., Europe Inc.
Research and Development Level of Effort: About $200 million/year worldwide
Long Term Goal: Make speech the preferred mode of communication with computers
Speech: The long term objective has always been to make speech the preferred mode of human-computer interaction. Many researchers in the US, Japan, and Europe have made substantial contributions.

3 Why Has Speech Recognition Been Difficult?
Too Many Sources of Variability:
Noise
Microphones
Speakers
Different Speech Sounds
Different Pronunciations
Non-Grammaticality
Imprecision of Language
Why speech: Human-level speech recognition has proved to be an elusive goal because of the many sources of variability that affect speech: from noise, microphone, and speaker variability to variability at the phonetic, lexical, and grammatical levels.

4 Why Has Speech Recognition Been Difficult? (cont.)
Too Many Sources of Knowledge:
Acoustics
Phonetics and Phonology
Lexical Information
Syntax
Semantics
Context
Task Dependent Knowledge
Why cont: While many knowledge sources are available to help in the decoding task, the problem of using them is analogous to getting experts who speak different languages to collaborate with each other.

5 Syntax: Use of Sentence Structure
How do we incorporate syntax into a recognition algorithm? Recognize the state and sub-select the vocabulary (see the sketch below).
Example: video from Here! Hear! (1968)
Imposing constraints on sentence and lexical structure reduces ambiguity.
Syntax: While isolated word recognition systems primarily used acoustic knowledge, some systems in the late 60s used mechanisms to represent and use syntactic and semantic knowledge.
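
The idea of "recognize the state and sub-select vocabulary" can be illustrated with a small sketch. The grammar, words, and scoring function below are hypothetical, not from the 1968 system; a real recognizer would score candidates against acoustic models rather than a lookup table.

```python
# A finite-state grammar in which each state exposes only the words
# that may legally follow it, so the matcher scores a handful of
# candidates instead of the whole lexicon. All names are illustrative.

GRAMMAR = {
    # state: {word: next_state}
    "START":     {"move": "DIRECTION", "turn": "DIRECTION", "stop": "END"},
    "DIRECTION": {"left": "END", "right": "END", "forward": "END"},
}

def recognize(acoustic_score, state="START"):
    """Greedy decode: at each state, score only the grammar-legal words."""
    words = []
    while state != "END":
        candidates = list(GRAMMAR[state])      # vocabulary sub-selection
        best = max(candidates, key=acoustic_score)
        words.append(best)
        state = GRAMMAR[state][best]
    return words

# Stand-in acoustic scores for one utterance of "move left"
scores = {"move": 0.9, "turn": 0.4, "stop": 0.2,
          "left": 0.7, "right": 0.6, "forward": 0.1}
print(recognize(scores.get))  # -> ['move', 'left']
```

Because only two or three words are legal at any point, an acoustically confusable word outside the current state's vocabulary can never be hypothesized.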

6 Here hear: The following video clip from 1968 shows a voice-controlled robot using constraints at the sentence and lexical levels to reduce ambiguity. <<Show video>>

7 Semantics: Use of Task-Level Knowledge
What is semantics in the context of ASR, and how do we harness its power? Convert knowledge into constraints that limit the search space.
Video: Hearsay (1973) chess task. Task semantics constrains the commands (and the vocabulary) to only the legal moves (see the sketch below).
Lesson: Task-level semantics can provide powerful constraints in situations like chess, but much less so in information retrieval and medical diagnosis.
Semantics: As vocabularies became larger, leading to greater ambiguity and perplexity, we had to use context-specific knowledge to reduce branching factors.
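
As a concrete sketch of task-level semantics acting as a constraint, the snippet below uses the python-chess library (an assumption for illustration; Hearsay-I of course predates it) to restrict the recognizer's candidate set to the moves that are legal in the current position.

```python
# Task semantics as a vocabulary constraint: only the moves legal in
# the current board position are admitted as recognition hypotheses.
import chess

def legal_commands(board):
    """Map each legal move's spoken form (SAN) to the move itself."""
    return {board.san(move): move for move in board.legal_moves}

def recognize_move(board, acoustic_score):
    """Pick the legal move whose spoken form best matches the audio;
    acoustic_score is a stand-in for a real acoustic matcher."""
    commands = legal_commands(board)
    return commands[max(commands, key=acoustic_score)]

board = chess.Board()
print(len(legal_commands(board)))  # 20 candidates at the start, not thousands
```

This is also why the lesson on the slide holds: in chess the legal-command set is small and enumerable, while in information retrieval or medical diagnosis almost any utterance is "legal", so the constraint buys far less.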

8 Hearsay: The following video clip of Hearsay-I from 1973 shows a voice-controlled chess machine in which task-level semantics provides powerful constraints: in any given board position, only a limited set of commands can be given. <<Show video>>

9 Representation: FSG and HMMs
How do we effectively use all the disparate sources of knowledge?
Blackboard model using the hypothesize-and-test paradigm (Hearsay system)
Represent linguistic, lexical, phonological, and acoustic-phonetic knowledge as a single integrated FSG (Dragon system)
Example from the Dragon and Harpy systems (see the sketch below)
Compiling all knowledge into an integrated network permits efficient execution.
Lesson: An integrated representation provides a single abstract model, leading to great conceptual simplicity.
Representation: A major problem in the 70s was how to represent and use diverse sources of knowledge effectively. Two models emerged. The blackboard model uses a hypothesize-and-test paradigm, in which one knowledge source proposes possible alternatives, and other knowledge sources accept or reject these choices. The second model, used by the Dragon and Harpy systems, uses a single integrated finite state graph to represent linguistic, lexical, phonological, and acoustic knowledge.
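
The compilation step can be sketched as follows: splice each word arc of the grammar into a chain of phone arcs taken from the lexicon, yielding one flat network. The grammar, lexicon, and pronunciations below are illustrative, not Harpy's actual tables.

```python
# Compile a word-level grammar plus a pronunciation lexicon into a
# single phone-level finite-state graph, in the spirit of Dragon/Harpy.

LEXICON = {                            # word -> phone sequence (illustrative)
    "move": ["m", "uw", "v"],
    "left": ["l", "eh", "f", "t"],
}
GRAMMAR = [("START", "move", "DIR"),   # word arcs: (from, word, to)
           ("DIR", "left", "END")]

def compile_network(grammar, lexicon):
    """Replace each word arc with a chain of phone arcs."""
    arcs = []
    for i, (src, word, dst) in enumerate(grammar):
        prev = src
        phones = lexicon[word]
        for j, phone in enumerate(phones):
            nxt = dst if j == len(phones) - 1 else f"{word}_{i}_{j}"
            arcs.append((prev, phone, nxt))
            prev = nxt
    return arcs

for arc in compile_network(GRAMMAR, LEXICON):
    print(arc)  # ('START', 'm', 'move_0_0'), ..., ('left_1_2', 't', 'END')
```

Once everything lives in one graph, the decoder needs only a single search procedure over phone arcs, which is the conceptual simplicity the slide refers to.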

10 Harpy-1: This video clip of the Harpy system from 1976 shows that a single abstract finite state representation leads to great conceptual simplicity. <<Show video>>

11 Search: Beam Search
Optimal search requires consideration of every path, at huge cost! Given that the probability estimates are approximate anyway, why not ignore unpromising alternatives?
Example: beam search from the Harpy system (see the sketch below)
Speed-up by ignoring unpromising alternatives; eliminates backtracking.
Lesson: Beam search improved speed by one to two orders of magnitude, with little degradation of accuracy, compared to best-first search techniques such as branch-and-bound and A*.
Search: Once we have a finite state representation, searching all the paths to determine the optimal choice can be expensive.
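
Here is a minimal sketch of the pruning idea, run over a phone-arc network like the one compiled above. The scoring function and beam width are illustrative, and real decoders also add self-loop arcs so a phone can span several frames.

```python
# Time-synchronous beam search: advance all hypotheses one frame at a
# time and keep only those within `beam` of the current best score.
# No backtracking is ever needed.

def beam_search(arcs, n_frames, score, beam=6.0):
    hyps = {"START": (0.0, [])}       # state -> (log prob, phone history)
    for t in range(n_frames):
        nxt = {}
        for state, (lp, hist) in hyps.items():
            for src, phone, dst in arcs:
                if src != state:
                    continue
                new_lp = lp + score(phone, t)  # stand-in acoustic log-likelihood
                if dst not in nxt or new_lp > nxt[dst][0]:
                    nxt[dst] = (new_lp, hist + [phone])
        if not nxt:
            break
        best = max(lp for lp, _ in nxt.values())
        # the pruning step: this is where the 10x-100x speedup comes from
        hyps = {s: v for s, v in nxt.items() if v[0] >= best - beam}
    return hyps
```

Unlike A* or branch-and-bound, nothing is ever re-expanded; hypotheses that fall outside the beam are simply abandoned, which is safe in practice precisely because the probability estimates are approximate.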

12 Harpy-2: This video from the Harpy system shows that the use of beam search results in a significant speed-up. <<Show video>>

13 Speaker Independent Recognition: Use of Large Data Sets
Is speaker-specific training essential for high performance? No. Equivalent performance can be obtained from multi-speaker training data, which usually requires 3 to 10 times more data than speaker-specific training but leads to a more robust system (see the sketch below).
Example: Kai-Fu Lee video
Lesson: One hour of speech from each of 100 different speakers can lead to a more robust and equally accurate system than 10 hours of speech from one speaker!
Speaker: As the need arose for systems that can be used by open populations, we developed learning techniques that use very large data sets.
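
A minimal sketch of the data-pooling idea, using the hmmlearn library and random numbers in place of real acoustic features (both are assumptions for illustration; Sphinx used its own discrete-HMM training): feature sequences from many speakers are concatenated and a single shared model is fit to all of them.

```python
import numpy as np
from hmmlearn import hmm  # assumption: third-party library, pip install hmmlearn

rng = np.random.default_rng(0)
# Stand-ins for one utterance from each of 100 speakers,
# 50 frames x 13-dim features apiece (where real MFCCs would go)
sequences = [rng.normal(size=(50, 13)) for _ in range(100)]

X = np.concatenate(sequences)          # pooled multi-speaker training data
lengths = [len(s) for s in sequences]  # per-utterance boundaries

# One model trained on everyone, instead of one model per speaker
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=5)
model.fit(X, lengths)
```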

14 Sphinx: This video clip of the Sphinx system, circa 1988, shows that HMM-based learning systems can be used effectively to create speaker independent recognition systems. <<Show video>>

15 Unlimited Vocabulary Dictation: Statistical Language Modeling
Can a system be used for unlimited vocabulary dictation? Trigram and N-gram language models provide a flexible representation (see the sketch below).
Examples: WSJ Dictation (1994); Unlimited Vocabulary Dictation (1995)
Lesson: Given a large enough corpus of data, statistical language modeling can lead to respectable system performance.
Unlimited: For the first 30 years, the holy grail of speech recognition research was unlimited vocabulary dictation! With the emergence of computers a thousand times faster than those of the 70s, and with more powerful language modeling, it became possible to demonstrate real-time unlimited vocabulary dictation systems.
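
The core of the statistical approach can be sketched in a few lines: estimate P(w | u, v) from trigram counts over a corpus. Real dictation systems used careful smoothing and backoff; the add-one smoothing below is just a stand-in.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams (u, v, w) and their bigram contexts (u, v)."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    return tri, bi

def trigram_prob(tri, bi, u, v, w, vocab_size):
    """Add-one smoothed estimate of P(w | u, v)."""
    return (tri[(u, v, w)] + 1) / (bi[(u, v)] + vocab_size)

tri, bi = train_trigrams([["the", "index", "rose"],
                          ["the", "index", "fell"]])
print(trigram_prob(tri, bi, "the", "index", "rose", vocab_size=6))  # 0.25
```

The decoder multiplies these language-model probabilities with the acoustic scores, so a word that fits the preceding two words gets a strong prior even when the acoustics are ambiguous.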

16 Lin Chase: The next two video clips illustrate dictation in the contexts of the Wall Street Journal task and the voice email task. <<Show video>>

17 Voice email: <<Show video>>

18 Non-Grammaticality in Spoken Language
Unlike written language, spoken language tends to be non-grammatical, including non-verbal disfluencies. Semantic case frame parsing addresses this (see the sketch below).
Example: air travel information from an open population (1994); Wayne Ward video
Lesson: Conventional NL parsing breaks down for spoken language; we need to use less rigid structures.
Non-grammaticality: Fluent-speech phenomena such as ums, aahs, and other disfluencies are pervasive in spoken language. They require different parsing techniques than those used for written language.
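
A minimal sketch of the case-frame idea (illustrative patterns, not Wayne Ward's actual parser): look for slot fillers anywhere in the utterance and let everything else, including the ums and aahs, fall through.

```python
import re

# An air-travel case frame: each slot has a pattern that finds its
# filler wherever it occurs; the cities and days are illustrative.
AIR_TRAVEL_FRAME = {
    "destination": re.compile(r"\bto (boston|denver|pittsburgh)\b"),
    "origin":      re.compile(r"\bfrom (boston|denver|pittsburgh)\b"),
    "day":         re.compile(r"\b(monday|tuesday|wednesday)\b"),
}

def parse(utterance):
    """Fill whichever slots match; disfluencies are simply ignored."""
    slots = {}
    for slot, pattern in AIR_TRAVEL_FRAME.items():
        m = pattern.search(utterance.lower())
        if m:
            slots[slot] = m.group(1)
    return slots

print(parse("um I need a uh flight to Boston on Tuesday"))
# -> {'destination': 'boston', 'day': 'tuesday'}
```

A conventional grammar-based parser would reject this string outright; the frame-based parser still recovers the meaning.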

19 Wayne-Ward: This video clip, circa 1994, by Wayne Ward illustrates the use of case frame parsing and semantic frames in the recognition task. <<Show video>>

20 Landmarks
Dragon Dictate and Naturally Speaking
IBM ViaVoice dictation
Nuance-based Tellme 800 services allow voice queries for directory information, stocks, sports, news, weather, and horoscopes
Microsoft Speech Server, e.g. voice dialing
Landmarks: The emergence of commercial products in the last 10 years has been a most gratifying development. Dragon Naturally Speaking, IBM ViaVoice, the Nuance-based Tellme 800 service, and Microsoft Speech Server are noteworthy landmarks.

21 On the Need for Interdisciplinary Teams
Signal Processing: Fourier Transforms, DFT, FFT
Acoustics: physics of sounds and speech; vocal tract model
Phonetics and Linguistics: sounds (acoustic-phonetics), words (lexicon), grammar (syntax), meaning (semantics)
Statistics: probability theory, Hidden Markov Models, clustering, dynamic programming
AI and Pattern Recognition: knowledge representation and search, approximate matching, natural language processing
Human Computer Interaction: cognitive science, design, social networks
Computer Science: hardware, parallel systems, algorithms, optimization
Interdisciplinary: In summary, speech recognition has been a grand adventure for the past 50 years. The task has been difficult because it requires the expertise of many interdisciplinary teams.

22 Future Challenges
Unrehearsed Spontaneous Speech
Non-Native Speakers of English
Dynamic Learning from Sparse Data: new words, new speakers, new grammatical forms, new languages
No Silver Bullet on the Horizon! 50 more years? A million times greater computational power, memory, and bandwidth?
Future: We still have a long way to go before we can satisfactorily handle problems such as unrehearsed spontaneous speech. But with a million times more computational power potentially on the horizon, we may be able to solve such problems.

23 Speech Research and Jim Flanagan
Pervasive Influence Across the Spectrum of Speech Research
A Source of Encouragement and Inspiration
Flanagan: Jim Flanagan's many contributions to speech research over the past 60 years have left a lasting legacy. Thank you, Jim, for being a continuous source of inspiration.

