Automatic Speech Recognition

Lecture 7: Automatic Speech Recognition (CS 4705)

What is speech recognition?
  Transcribing words?
  Understanding meaning?
Today:
  Overview of ASR issues
  Building an ASR system
  Using an ASR system
  Future research

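“It’s hard to ... recognize speech / wreck a nice beach”
  Speaker variability: within and across speakers
  Recording environment varies with respect to noise
  The transcription task must handle all of this and produce a transcript of what was said, from limited, noisy information in the speech signal
    Success: low word error rate (WER)
    WER = (S + I + D) / N * 100, where S, I, and D count the substituted, inserted, and deleted words and N is the number of words in the reference transcript
    Example: “Thesis test” recognized for “This is a test” has 1 substitution and 2 deletions over 4 reference words: 75% WER
  The understanding task must do more: get from words to meaning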
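The WER formula above can be sketched in code. This is a minimal illustration, not a standard scoring tool: the function name `word_error_rate` is made up here, and it assumes whitespace tokenization, case-insensitive matching, and no punctuation handling. The minimum number of substitutions, insertions, and deletions is found with the usual edit-distance dynamic program over words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (S + I + D) / N * 100 (illustrative helper)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = fewest word edits that turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return 100.0 * dist[-1][-1] / len(ref)


# The slide's example: reference "This is a test", recognized "Thesis test"
print(word_error_rate("This is a test", "Thesis test"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.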

Measure concept accuracy (CA) of a string in terms of how accurately the domain concepts mentioned in the string, and their values, are recognized
  Reference: “I want to go from Boston to Baltimore on September 29”
    Domain concept    Value
    source city       Boston
    target city       Baltimore
    travel date       September 29
  Score the recognized string:
    “Go from Boston to Washington on December 29” (1/3 = 33% CA)
    “Go to Boston from Baltimore on September 29”
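Concept accuracy can be sketched as a comparison of concept/value pairs. The sketch below assumes the concepts have already been extracted from the reference and recognized strings into dictionaries (that extraction step is not shown), and the function name `concept_accuracy` is illustrative only.

```python
def concept_accuracy(reference: dict, recognized: dict) -> float:
    """Percent of reference concepts whose recognized value matches exactly."""
    if not reference:
        raise ValueError("reference must contain at least one concept")
    correct = sum(
        1 for concept, value in reference.items()
        if recognized.get(concept) == value
    )
    return 100.0 * correct / len(reference)


reference = {
    "source city": "Boston",
    "target city": "Baltimore",
    "travel date": "September 29",
}
# Concepts assumed extracted from "Go from Boston to Washington on December 29"
recognized = {
    "source city": "Boston",
    "target city": "Washington",
    "travel date": "December 29",
}
print(round(concept_accuracy(reference, recognized)))  # one concept of three is right
```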

Again, the Noisy Channel Model
  Source -> Noisy Channel -> Decoder
  Input to channel: spoken sentence s
  Output from channel: an observation O
  Decoding task: find \hat{s} = \arg\max_s P(s \mid O)
  Using Bayes' Rule: \hat{s} = \arg\max_s \frac{P(O \mid s) \, P(s)}{P(O)}
  And since P(O) doesn't change for any hypothetical s: \hat{s} = \arg\max_s P(O \mid s) \, P(s)
  P(O|s) is the observation likelihood, or Acoustic Model, and P(s) is the prior, or Language Model
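As a toy illustration of the decoding rule, the sketch below picks the candidate sentence with the highest P(O|s) P(s), working in log space so the product becomes a sum. The candidate list and all scores are invented for the example; a real recognizer would get them from its acoustic and language models.

```python
def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate s maximizing log P(O|s) + log P(s)."""
    return max(candidates, key=lambda s: acoustic_logprob[s] + lm_logprob[s])


candidates = ["recognize speech", "wreck a nice beach"]

# Invented scores: the acoustics slightly prefer the wrong string,
# but the language model strongly prefers the plausible one.
acoustic_logprob = {"recognize speech": -10.0, "wreck a nice beach": -9.5}
lm_logprob = {"recognize speech": -4.0, "wreck a nice beach": -9.0}

print(decode(candidates, acoustic_logprob, lm_logprob))
```

P(O) never appears in the code because it is the same for every candidate and so cannot change which one wins.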

What do we need to build/use an ASR system?
  Corpora for training and testing of components
  Feature extraction component
  Pronunciation Model
  Acoustic Model
  Language Model
  Algorithms to search the hypothesis space efficiently

Training and Test Corpora
  Collect corpora appropriate for the recognition task at hand
    Small speech corpus + phonetic transcription, to associate sounds with symbols (Acoustic Model)
    Large (>= 60 hrs) speech corpus + orthographic transcription, to associate words with sounds (Acoustic Model)
    Very large text corpus, to identify unigram and bigram probabilities (Language Model)

Representing the Signal
  What parameters (features) of the speech input
    Can be extracted automatically
    Will preserve phonetic identity and distinguish it from other phones
    Will be independent of speaker variability and channel conditions
    Will not take up too much space
  Speech representations (for [ae] in had):
    Waveform: change in sound pressure over time
    LPC Spectrum: component frequencies of a waveform
    Spectrogram: overall view of how frequencies change from phone to phone

Speech captured by microphone and sampled (digitized); this may not capture all vital information
Signal divided into frames
Power spectrum computed to represent energy in different bands of the signal
  LPC spectrum, Cepstra, PLP
  Each frame’s spectral features represented by a small set of numbers
Frames clustered into ‘phone-like’ groups (phones in context), using Gaussian or other models
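The framing and power-spectrum steps can be sketched with the standard library alone. This is only an illustration under several assumptions: the function names, the Hamming window, the frame length, and the hop size are choices made here, not taken from the slides, and the LPC, cepstral, and PLP features mentioned above are not computed. The direct DFT is slow; a real front end would use an FFT.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop):
    """Cut the sampled signal into overlapping fixed-length frames."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def power_spectrum(frame):
    """Energy per frequency bin (0 .. n/2) of one Hamming-windowed frame."""
    n = len(frame)  # assumes n >= 2
    windowed = [
        x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
        for k, x in enumerate(frame)
    ]
    spectrum = []
    for b in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * b * k / n)
            for k, x in enumerate(windowed)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Toy input: a 1000 Hz sine sampled at 8000 Hz (values assumed for the demo)
signal = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
frames = split_into_frames(signal, frame_len=32, hop=16)
spectrum = power_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(len(frames), peak_bin)  # bin b corresponds to b * 8000 / 32 Hz
```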

Why does this work?
  Different phonemes have different spectral characteristics
Why doesn’t it work?
  Phonemes can have different properties in different acoustic contexts, spoken by different people ...
  Example: “Nice white rice”

Pronunciation Model
  Models likelihood of a word given a network of candidate phone hypotheses (weighted phone lattice)
    Allophones: butter vs. but
    Multiple pronunciations for each word
  Lexicon may be a weighted automaton or a simple dictionary
  Words come from all corpora; pronunciations from a pronouncing dictionary or TTS system

Acoustic Models
  Model likelihood of phones or subphones given spectral features and prior context
  Use pronunciation models
  Usually represented as an HMM
    Set of states representing phones or other subword units
    Transition probabilities on states: how likely is it to see one phone after seeing another?
    Observation/output likelihoods: how likely is a spectral feature vector to be observed from phone state i, given phone state i-1?

Initial estimates for
  Transition probabilities between phone states
  Observation probabilities associating phone states with acoustic examples
Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
  I.e., we tell the HMM the ‘right’ answers: which words to associate with which sequences of sounds
Iteratively retrain the transition and observation probabilities by running the training data through the model and scoring the output until there is no improvement

Language Model
  Models likelihood of a word given the prior word, and of an entire sentence
  Ngram models:
    Build the LM by calculating bigram or trigram probabilities from a text training corpus
    Smoothing issues are very important for real systems
  Grammars
    Finite state grammar, Context Free Grammar (CFG), or semantic grammar
  Out of Vocabulary (OOV) problem
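A bigram model of this kind can be sketched as simple counting over a text corpus. The two-sentence corpus below is a toy built from the lecture's travel examples, the function name is illustrative, and add-one (Laplace) smoothing is used only because it is the simplest option; real systems typically use stronger smoothing, and this sketch does nothing about OOV words.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Return (prob, vocab), where prob(prev, word) is add-one smoothed."""
    context_counts = Counter()  # how often each word appears as a left context
    bigram_counts = Counter()   # how often each (prev, word) pair appears
    vocab = set()               # everything that can be predicted, incl. </s>
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Add one to every bigram count so unseen pairs keep some probability
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

    return prob, vocab


corpus = [
    "I want to go from Boston to Baltimore",
    "I want to go to Chicago",
]
prob, vocab = train_bigram_lm(corpus)
print(prob("i", "want"), prob("i", "chicago"))  # seen bigram vs. unseen bigram
```

The probability of a whole sentence is then the product of its bigram probabilities, from `<s>` through `</s>`.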

Entropy H(X): the amount of information in an LM or grammar
  How many bits will it take on average to encode a choice or a piece of information?
  More likely things will take fewer bits to encode
Perplexity 2^H: a measure of the weighted mean number of choice points in, e.g., a language model
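Both quantities are short to compute for a single probability distribution. The sketch below uses made-up distributions over next-word choices; the function names are illustrative.

```python
import math


def entropy(probs):
    """H = -sum p * log2(p): average bits needed to encode one choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def perplexity(probs):
    """2^H: the weighted average number of choices."""
    return 2 ** entropy(probs)


uniform = [1 / 8] * 8               # eight equally likely next words
skewed = [0.5, 0.25, 0.125, 0.125]  # a few likely words dominate

print(entropy(uniform), perplexity(uniform))
print(entropy(skewed), perplexity(skewed))
```

For the uniform case the perplexity equals the number of choices; skewing the distribution lowers both entropy and perplexity, which matches the point that more likely things take fewer bits to encode.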

Search/Decoding
  Find the best hypothesis, the s that maximizes P(O|s) P(s), given
    Lattice of subword units (Acoustic Model)
    Segmentation of all paths into possible words (Pronunciation Model)
    Probabilities of word sequences (Language Model)
  Produces a huge search space: how to reduce it?
    Lattice minimization and determinization
    Forward algorithm: sum of all paths leading to a state
    Viterbi algorithm: max of all paths leading to a state
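The sum-versus-max contrast between the Forward and Viterbi algorithms is easiest to see side by side on a tiny HMM. Everything in the example below (two states, two observation symbols, and all probabilities) is invented for illustration; the functions work on plain probabilities, whereas a real decoder would use log probabilities over much larger lattices.

```python
def forward(obs, states, start, trans, emit):
    """Total probability of obs: SUM over all paths leading to each state."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Best single path: MAX over all paths leading to each state."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        step = {}
        for s in states:
            # Keep only the most probable way of arriving in state s
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda pair: pair[0],
            )
            step[s] = (prob * emit[s][o], path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])


# Toy HMM (all numbers are assumptions made for this example)
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

total = forward(["a", "b"], states, start, trans, emit)
best_prob, best_path = viterbi(["a", "b"], states, start, trans, emit)
print(total, best_prob, best_path)
```

The Viterbi score can never exceed the Forward score, since the best path is one of the paths being summed.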

Forward-backward (Baum-Welch, Expectation-Maximization) algorithm: computes probability of sequence at any state in search space
Beam search: prune the lattice

Varieties of Speech Recognition
  Dimension         Range
  Mode              Isolated words --> continuous
  Style             Read, prepared, spontaneous
  Enrollment        Speaker-dependent or independent
  Vocabulary size   <20 --> 5K --> 60K --> ~1M
  Language Model    Finite state, ngrams, CFGs, CSGs
  Perplexity        <10 --> >100
  SNR               >30dB (high) --> <10dB (low)
  Input device      Telephone, microphones

Challenges for Transcription
  Robustness to channel characteristics and noise
  Portability to new applications
  Adaptation: to speakers, to environments
  LMs: simple ngrams need help
  Confidence measures
  OOV words
  New speaking styles/genres
  New applications

Challenges for Understanding
  Recognizing communicative ‘problems’
    ASR errors
    User corrections
    Disfluencies and self-repairs
  Possibilities:
    Recognizing speaker emotion
    Identifying speech acts: okay
    Locating topic boundaries for topic tracking, audio browsing, speech data mining

An Unsuccessful Dialogue
  S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
  U: I'd like to find a train from Ch- to Chicago from Baltimore.
  S: Sorry, I can't understand you. Please repeat your utterance.
  U: I'd like to find a train from Baltimore to Chicago.
  S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
  U: I would like to leave Saturday at eight in the morning.
Note that the previous speaker was not a native speaker of English; non-native speakers are often harder for recognizers to understand. Here, however, is a native speaker who has a terrible time with TOOT, which never seems to understand ‘Baltimore’. Note the different ways she produces that city name during the dialogue. You’ll hear some clear examples of hyperarticulation.
$TOOT/non_adapt/MixedImplicit/award/ -- TOOT can never understand Baltimore

  S: You can say the arrival city name, such as “New York City.”
  U: Baltimore to Chicago.
  S: You can say the arrival city name, such as “New York City.”
  U: Baltimore to Chicago. (I hate it.)
  U: Train from Baltimore ...
  S: ... I heard you s...
  U: to Chicago.... Hello?
  S: You can say the name of your departure city, such as “New York City.”
  U: Baltimore.
Note how she changes her pronunciation of “Baltimore”, hoping TOOT will finally hear it clearly enough!

  U: Bal-ti-more....
  S: You can say...
  U: Baltimore.
  S: Do you want to continue and find a new set of trains?
  U: Yes.
  S: You can say "yes" or "I am done here" to find a new set of trains.
  S: Thank you for using AT&T Amtrak train time table system. See you next time.
  U: I nev-

Summary
  ASR technology relies upon a large number of phenomena and techniques we’ve already seen to convert sound into words
    Phonetic/phonological, morphological, and lexical events
    FSAs, Ngrams, Dynamic programming algorithms
  Better modeling of linguistic phenomena will be needed to improve performance on transcription and especially on understanding
  For next class: we’ll start talking about larger structures in language above the word (Ch 8)

Disfluencies and Self-Repairs
  Disfluencies abound in spontaneous speech
    every 4.6s in radio call-in (Blackmer & Mitton ‘91)
    hesitation: Ch- change strategy.
    filled pause: Um Baltimore.
    self-repair: Ba- uh Chicago.
  Hard to recognize
    Ch- change strategy. --> to D C D C today ten fifteen.
    Um Baltimore. --> From Baltimore ten.
    Ba- uh Chicago. --> For Boston Chicago.
Kasl & Mahl: 41% more filled pauses in audio only vs. face-to-face; Oviatt: 8.83 to 5.50% disfluencies in phone conversations vs. non
/n/u118/exp98/adapt/MixedImplicit/mccoy/task1 line 198, rehesitation (mor- morning --> not really)
/n/u118/exp98/adapt/UserNoConfirm/sgoel/task1 line 1315, filled pause (um baltimore --> from baltimoreten)
/n/u118/exp98/adapt/UserNoConfirm/selina/task1, line 1132, repair (Ba- uh Chicago --> for Boston Chicago)