ISSUES IN SPEECH RECOGNITION Shraddha Sharma

Contents:
Introduction
What is speech recognition?
Terminology of speech recognition
Why do we want speech recognition?
What is speech?
Difficulties with ASR
Solutions for the difficulties of ASR

Speech Recognition: Definition: the process of interpreting human speech by a computer. A more technical definition, due to Jurafsky: ASR is the task of building systems that map acoustic signals to a string of words.

Terminology of Speech Recognition:
Speaker Dependent Recognition: the recognition system is designed to work with just one or a small number of individual speakers.
Speaker Independent Recognition: these systems are designed to work with all speakers from a given linguistic community.
Large Vocabulary Recognition: it is very difficult to build accurate large-vocabulary, speaker-independent systems.

Small Vocabulary Recognition: typically recognition of a few keywords such as digits or a set of commands. Example: voice-operated telephone number dialing.
Isolated Word Recognition: systems which can only recognize individual words that are preceded and followed by a relatively long period of silence.

Connected Word Recognition: systems which can recognize a limited sequence of words spoken in succession, e.g. "ninety-eight thirty-five four thousand".
Continuous Speech Recognition: these systems recognize speech as it occurs, in real time. Such systems usually work with a large vocabulary, but with moderate accuracy.

Why do we want speech recognition? The main goal of speech recognition is to provide efficient ways for humans to communicate with computers. Speech recognition is important not because it is natural for us to communicate via speech, but because in some cases it is the most efficient way to interface with a computer.

Applications of speech recognition:
1. Telephone applications
2. Hands-free operation
3. Applications for the physically handicapped
4. Dictation
5. Translation
6. Environmental control

What is speech? When humans speak, we let air pass from our lungs through our mouth and nasal cavity, and this air stream is restricted and shaped by our tongue and lips. This produces contractions and expansions of the air: an acoustic wave, a sound. The sounds we form, the vowels and consonants, are usually called phones. Phones are combined into words. However, speech is more than a sequence of phones that forms words.

The term speech signal within ASR refers to the analog electrical representation of the contractions and expansions of the air. The analog signal is converted into a digital representation by sampling the continuous analog signal. A high sampling rate in the A/D conversion gives a more accurate description of the analog signal, but also leads to higher storage consumption.
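As a rough illustration of this trade-off, the sketch below compares the storage cost of one minute of digitized speech at a few common sampling rates (assuming 16-bit mono PCM samples; both the rates and the format are illustrative assumptions).

```python
# Illustrative only: storage needed for one minute of digitized speech
# at a few common sampling rates, assuming 16-bit mono PCM samples.
BYTES_PER_SAMPLE = 2   # 16-bit linear PCM
SECONDS = 60

for rate_hz in (8_000, 16_000, 44_100):
    megabytes = rate_hz * BYTES_PER_SAMPLE * SECONDS / 1_000_000
    print(f"{rate_hz:>6} Hz sampling -> about {megabytes:.1f} MB per minute")
```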

Difficulties with ASR:
1. Human comprehension of speech compared with ASR
2. Body language
3. Noise
4. Spoken language / written language
5. Continuous speech
6. Channel variability
7. Speaker variability
8. Amount of data & search space
9. Ambiguity

Human comprehension of speech compared to ASR: humans use the knowledge they have about the speaker and the subject. Words are not arbitrarily sequenced together; there is a grammatical structure and redundancy that humans use to predict words not yet spoken. In ASR we only have the speech signal. We can construct a model of the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge: knowledge of the speaker and encyclopedic knowledge.

Body language: a human speaker communicates not only with speech, but also with body signals - hand waving, eye movements, posture, etc. This information is completely missed by ASR.
Noise: speech is uttered in an environment of sounds. Unwanted information in the speech signal is called noise. In ASR we have to identify and filter out these noises from the speech signal.

Spoken language / written language:
1. Written communication is usually a one-way communication, but speech is dialogue-oriented.
2. Disfluencies in speech: normal speech is filled with hesitations, repetitions, changes of subject in the middle of an utterance, slips of the tongue, etc.
3. The grammar of spoken language is quite different from that of written language at many different levels.

Continuous speech: natural speech is continuous; it does not have pauses between the words, so the recognition of continuously spoken speech is significantly more difficult. The complexity of ASR is caused mainly by three properties of continuous speech:
1. Word boundaries
2. Coarticulatory effects
3. Content words

Channel variability: another aspect of variability is the context in which the acoustic wave is uttered. Here we have the problem of noise that changes over time, of different kinds of microphones, and of everything else that affects the content of the acoustic wave on its way from the speaker to the discrete representation in a computer. This phenomenon is called channel variability.

Speaker variability: all speakers have their own special voices, due to their unique physical bodies and personalities. The voice is not only different between speakers; there are also wide variations within one specific speaker. Some of these variations are:
1. Realization
2. Speaking style
3. The sex of the speaker

4. Anatomy of the vocal tract
5. Speed of speech
6. Regional and social dialects
Regional dialects involve features of pronunciation, vocabulary and grammar which differ according to the geographical area the speaker comes from. Social dialects are distinguished by features of pronunciation, vocabulary and grammar according to the social group of the speaker.

Amount of data and search space: communication with a computer via a microphone produces a large amount of speech data every second. This has to be matched to groups of phones, the sounds; groups of phones build up words, and words build up sentences. The number of possible sentences is enormous. We therefore also minimize our lexicon, i.e. the set of words. This introduces another problem, called out-of-vocabulary (OOV): the intended word is not in the lexicon. An ASR system has to handle out-of-vocabulary words in a robust way.
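A minimal sketch of one common way to stay robust to out-of-vocabulary input is to map anything outside the lexicon to a special unknown token; the tiny lexicon and the token name below are purely illustrative assumptions.

```python
# Map words outside a (tiny, made-up) lexicon to an <unk> token so that
# downstream models never see a word they have no entry for.
LEXICON = {"call", "home", "office", "dial", "nine"}
UNK = "<unk>"

def normalize(words):
    return [w if w in LEXICON else UNK for w in words]

print(normalize("call my office".split()))   # ['call', '<unk>', 'office']
```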

Ambiguity: natural language has an inherent ambiguity, i.e. we cannot always decide which of a set of words is actually intended. There are two ambiguities that are particular to ASR:
1. Homophones
2. Word boundary ambiguity

The term homophones refers to words that sound the same but have different orthography, e.g. "two" and "too". They are two unrelated words that just happen to sound the same.

Word boundary ambiguity: when a sequence of groups of phones is turned into a sequence of words, we sometimes encounter word boundary ambiguity. Word boundary ambiguity occurs when there are multiple ways of grouping the phones into words.
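The sketch below makes both ambiguities concrete: given a phone string and a small pronunciation lexicon (both invented for illustration, not a real phone set), it enumerates every way of grouping the phones into known words, and each pronunciation may in turn map to several homophonous words.

```python
# Toy pronunciations (invented, not a real phone set) showing that one phone
# string can be segmented into words in more than one way.
LEXICON = {
    "ay": ["I", "eye"],          # homophones share a pronunciation
    "ay s": ["ice"],
    "s k r iy m": ["scream"],
    "k r iy m": ["cream"],
}

def segmentations(phones, start=0):
    """Return every way of splitting phones[start:] into lexicon entries."""
    if start == len(phones):
        return [[]]
    results = []
    for end in range(start + 1, len(phones) + 1):
        chunk = " ".join(phones[start:end])
        if chunk in LEXICON:
            for rest in segmentations(phones, end):
                results.append([chunk] + rest)
    return results

for seg in segmentations("ay s k r iy m".split()):
    print(" + ".join("/".join(LEXICON[c]) for c in seg))
# Prints both readings: "I/eye + scream" and "ice + cream".
```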

Solutions for issues in speech recognition: a general solution to many of the above problems effectively requires human knowledge and experience, and would thus require advanced artificial intelligence technologies to be implemented on a computer. In particular, statistical language models are often employed for disambiguation and for improving recognition accuracy.

Language Model: the choice of language model has a significant impact on the recognition process. The constraints provided by a language model can substantially improve system performance and reduce the size of the search space. There are four types of language model:
1. Uniform LM: every word in a sentence is equally probable.

2. Stochastic LM: trigram, bigram and unigram models.
3. Finite-state LM: a simple artificial language that models all legal sentences using a single network.
4. Other possible LMs: context-free, unification, statistical tree-based and case-frame grammars.
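As a concrete illustration of the stochastic case, here is a minimal bigram model estimated from a toy corpus by maximum likelihood (the corpus is invented and no smoothing is applied; real systems use far more data and proper smoothing).

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> call home </s>",
    "<s> call the office </s>",
    "<s> dial the office </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w_prev, w):
    """Maximum-likelihood estimate of P(w | w_prev), without smoothing."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

print(p_bigram("call", "home"))    # 0.5: 'call' is followed by 'home' in 1 of 2 cases
print(p_bigram("the", "office"))   # 1.0 in this tiny corpus
```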

Fundamental equation of speech recognition:
P(w) -> the a priori probability of the word sequence w; it is computed from the language model.
P(y|w) -> the conditional probability of the acoustic sequence y given the word sequence w; it comes from the acoustic model.
P(y) -> the probability of the acoustic sequence.
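These three quantities combine through Bayes' rule into the standard decision rule, written out below in the slide's notation; since P(y) does not depend on w, it can be dropped from the maximization.

```latex
\hat{w} \;=\; \arg\max_{w} P(w \mid y)
       \;=\; \arg\max_{w} \frac{P(y \mid w)\, P(w)}{P(y)}
       \;=\; \arg\max_{w} P(y \mid w)\, P(w)
```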

Combining language & acoustic models: probability theory suggests that acoustic and language probabilities can be combined through multiplication, but in practice some weighting is necessary. To balance the two, the term P(w) is replaced with P(w)^l, where l typically lies between 2 and 5. Here l is the language model weight; it is chosen to optimize recognition performance in CSR.
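In practice this combination is done in the log domain; the short sketch below (scores and weight are made-up values, purely for illustration) shows how the language model weight l rescales log P(w) relative to log P(y|w) when ranking two competing hypotheses.

```python
import math

def combined_score(log_p_acoustic, log_p_lm, lm_weight=3.0):
    """log of P(y|w) * P(w)^l, i.e. log P(y|w) + l * log P(w)."""
    return log_p_acoustic + lm_weight * log_p_lm

# Two hypothetical word sequences competing for the same audio.
hyp_a = combined_score(log_p_acoustic=math.log(1e-12), log_p_lm=math.log(1e-3))
hyp_b = combined_score(log_p_acoustic=math.log(5e-12), log_p_lm=math.log(1e-4))
print("best hypothesis:", "A" if hyp_a > hyp_b else "B")  # A wins on language model score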

Improve the acoustic models so that they better represent the statistics of the true incoming audio data. Continuous recognition requires a lot of CPU power. Isolated-word recognizers can run on slower machines, but only because pausing between words tells the recognizer where each word starts and stops. In continuous speech, any word could potentially start and stop at any time, so the system has to search through and consider every possible start time and end time for every possible word, and find the sequence that fits best.

Design your grammar using words that differ by multiple phonemes, and you should get good results.
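One way to apply this advice is to compare the phone sequences of your command words and flag pairs that differ by only a single phone; the transcriptions below are rough, invented examples used only to illustrate the check.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return dp[-1]

# Rough, invented phone transcriptions for a small command vocabulary.
COMMANDS = {
    "lights on":  "l ay t s o n".split(),
    "lights off": "l ay t s o f".split(),
    "open door":  "ow p en d or".split(),
}

for w1 in COMMANDS:
    for w2 in COMMANDS:
        if w1 < w2 and edit_distance(COMMANDS[w1], COMMANDS[w2]) <= 1:
            print(f"easily confused pair: {w1!r} / {w2!r}")
```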

THANK YOU…