Lecture 10: Speech Recognition (II)
October 28, 2004
Dan Jurafsky
LING 138/238 / SYMBSYS 138: Intro to Computer Speech and Language Processing
Outline for ASR this week
- Acoustic Phonetics: Using Praat
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
  - Feature Extraction
  - Acoustic Model
  - Lexicon/Pronunciation Model
  - Decoder
  - Language Model
- Evaluation
Summary from Tuesday
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system
  - Feature Extraction: 39 “MFCC” features
  - Acoustic Model: Gaussians for computing p(o|q)
  - Lexicon/Pronunciation Model: HMM
- Next time, Decoding: how to combine these to compute words from speech!
Speech Recognition Architecture
Speech Waveform → Spectral Feature Vectors → Phone Likelihoods P(o|q) → Words
1. Feature Extraction (Signal Processing)
2. Acoustic Model: Phone Likelihood Estimation (Gaussians or Neural Networks)
3. HMM Lexicon
4. Language Model (N-gram Grammar)
5. Decoder (Viterbi or Stack Decoder)
The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
Noisy channel model
The best word string is the one that maximizes the product of the likelihood and the prior:

    Ŵ = argmax over W in L of P(O|W) P(W)

where P(O|W) is the likelihood (acoustic model) and P(W) is the prior (language model).
The noisy channel model
Applying Bayes' rule and ignoring the denominator (which is the same for every candidate sentence) leaves us with two factors: P(Source) and P(Signal|Source).
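To make those two factors concrete, here is a minimal sketch (not from the lecture) of the noisy-channel decision rule over a toy candidate list; the log-probability numbers are invented purely for illustration.

```python
# Toy illustration of the noisy-channel decision rule: pick the sentence W
# maximizing P(O|W) * P(W), i.e. log P(O|W) + log P(W).
# The log-probabilities below are made up for illustration only.
candidates = {
    # sentence: (log P(O|W), log P(W))
    "recognize speech":   (-120.0, -9.2),
    "wreck a nice beach": (-118.5, -14.7),
}

best = max(candidates, key=lambda w: sum(candidates[w]))
print(best)  # "recognize speech": the prior outweighs the slightly better likelihood
```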
Five easy pieces
- Feature extraction
- Acoustic Modeling
- HMMs, Lexicons, and Pronunciation
- Decoding
- Language Modeling
ASR Lexicon: Markov Models for pronunciation
The Hidden Markov model
Formal definition of HMM
- States: a set of states Q = q_1, q_2, ..., q_N
- Transition probabilities: a set of probabilities A = a_01, a_02, ..., a_n1, ..., a_nn. Each a_ij represents P(q_j | q_i), the probability of moving from state i to state j.
- Observation likelihoods: a set of likelihoods B = b_i(o_t), the probability that state i generated observation o_t
- Special non-emitting initial and final states
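A minimal container for this definition might look like the following sketch; the names and structure are my own, not a standard toolkit API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HMM:
    """Bare-bones HMM matching the slide's definition (a sketch, not a toolkit).

    states:  the emitting states q_1..q_N (non-emitting start/end kept implicit)
    init:    init[i] = probability of starting in state i
    trans:   trans[i][j] = a_ij = P(q_j | q_i)
    obs_lik: obs_lik(i, o_t) = b_i(o_t), likelihood of observation o_t in state i
    """
    states: List[str]
    init: List[float]
    trans: List[List[float]]
    obs_lik: Callable[[int, object], float]
```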
Pieces of the HMM
- Observation likelihoods (‘b’), p(o|q), represent the acoustics of each phone and are computed by the Gaussians (the “Acoustic Model”, or AM)
- Transition probabilities represent the probability of different pronunciations (different sequences of phones)
- States correspond to phones
Pieces of the HMM
- Actually, I lied when I said states correspond to phones
- States usually correspond to triphones
  - CHEESE (phones): ch iy z
  - CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#
Pieces of the HMM
- Actually, I lied again when I said states correspond to triphones
- In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone.
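As a concrete (and simplified) sketch of that expansion, assuming the #-x+y triphone notation from the previous slide and ignoring the state tying that real systems use:

```python
def phones_to_states(phones):
    """Expand a phone string into triphones, then into 3 subphone states each
    (beginning, middle, end). Simplified: real systems cluster/tie triphone
    states that behave alike."""
    padded = ["#"] + list(phones) + ["#"]
    states = []
    for prev, ph, nxt in zip(padded, padded[1:], padded[2:]):
        tri = f"{prev}-{ph}+{nxt}"
        states += [tri + "_beg", tri + "_mid", tri + "_end"]
    return states

print(phones_to_states(["ch", "iy", "z"]))  # CHEESE: 3 triphones x 3 states = 9 states
```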
A real HMM
HMMs: what’s the point again?
- The HMM is used to compute P(O|W), i.e. the likelihood of the acoustic sequence O given a string of words W, as part of our generative model
- We do this for every possible sentence of English and then pick the most likely one
- How? Decoding, which we’ll get to by the end of today
The Three Basic Problems for HMMs
(From the classic formulation by Larry Rabiner after Jack Ferguson)
- Problem 1 (Evaluation): Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
- Problem 2 (Decoding): Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A, B, π), how do we choose a corresponding state sequence Q = (q_1, q_2 ... q_T) that is optimal in some sense (i.e. best explains the observations)?
- Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
The Evaluation Problem
- Computing the likelihood of the observation sequence
- Why is this hard? Imagine the HMM for “need” above, with subphones:
  n0 n1 n2 iy3 iy4 iy5 d6 d7 d8
  and that each state has a loopback.
- Given a series of 350 ms (35 observations), possible alignments of states to observations include:
  001112223333334444555555666666777778888
  000011112223345555556666666666667777788
  000000011111111122223333334444455556678
  000111223344555666666666677777778888888
- We would have to sum over all these alignments to compute P(O|W)
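To see how fast the number of alignments grows, here is a quick back-of-the-envelope count. It assumes (my simplification, not the slide's) that the 9 subphone states must be visited in order and each must cover at least one frame:

```python
from math import comb

# An alignment of 9 left-to-right states to T frames (states visited in order,
# each covering at least one frame) is fixed by choosing where the 8 state
# advances happen among the T-1 frame boundaries: C(T-1, 8) alignments.
for T in (10, 20, 35):
    print(T, comb(T - 1, 8))
# 35 frames already gives 18,156,204 alignments for this one word.
```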
Given a word string W, compute P(O|W)
Sum over all possible sequences of states Q:

    P(O|W) = Σ_Q P(O, Q | W) = Σ_Q P(O|Q) P(Q|W)
Summary: Computing the observation likelihood P(O|λ)
- Why can't we do an explicit sum over all paths? Because it's intractable: O(N^T)
- What do we do instead? The Forward Algorithm: O(N^2 T)
- I won't give this here, but it uses dynamic programming to compute P(O|λ)
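For the curious, here is a minimal sketch of the standard forward recursion (not covered on the slide); real systems work in log space or with scaling to avoid numerical underflow.

```python
def forward(init, trans, obs_lik, observations):
    """Forward algorithm: P(O | lambda) in O(N^2 T) time by dynamic programming,
    instead of summing over all N^T state sequences.
    init[i]       = pi_i          trans[i][j] = a_ij
    obs_lik(i, o) = b_i(o)
    """
    N = len(init)
    # Initialization: start in state i and emit the first observation
    alpha = [init[i] * obs_lik(i, observations[0]) for i in range(N)]
    # Recursion: sum over all predecessor states at each time step
    for o in observations[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(N)) * obs_lik(j, o)
                 for j in range(N)]
    # Termination: sum over all possible final states
    return sum(alpha)
```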
The Decoding Problem
- Given observations O = (o_1 o_2 ... o_T) and HMM λ = (A, B, π), how do we choose the best state sequence Q = (q_1, q_2 ... q_T)?
- The forward algorithm computes P(O|W)
- We could find the best W by running the forward algorithm for each W in L and picking the W maximizing P(O|W)
- But we can't do this, since the number of sentences is O(W^T)
- Instead:
  - Viterbi Decoding: a dynamic programming modification of the forward algorithm
  - A* Decoding: search the space of all possible sentences using the forward algorithm as a subroutine
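A minimal sketch of Viterbi decoding over a single HMM, to show how close it is to the forward algorithm (sum becomes max, plus backpointers); real decoders add log probabilities, beam pruning, and the between-word transitions shown on the next slides.

```python
def viterbi(init, trans, obs_lik, observations):
    """Best state sequence and its probability for one HMM
    (same interface as the forward() sketch above)."""
    N = len(init)
    v = [init[i] * obs_lik(i, observations[0]) for i in range(N)]
    backptrs = []
    for o in observations[1:]:
        new_v, col = [], []
        for j in range(N):
            # Best predecessor state for landing in state j at this time step
            best_i = max(range(N), key=lambda i: v[i] * trans[i][j])
            new_v.append(v[best_i] * trans[best_i][j] * obs_lik(j, o))
            col.append(best_i)
        v, backptrs = new_v, backptrs + [col]
    # Trace the backpointers from the most probable final state
    last = max(range(N), key=lambda i: v[i])
    path = [last]
    for col in reversed(backptrs):
        path.append(col[path[-1]])
    return list(reversed(path)), v[last]
```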
Viterbi: the intuition
Viterbi: Search
Viterbi: Word Internal
Viterbi: Between words
Language Modeling
- The noisy channel model expects P(W), the probability of the sentence
- We saw that this is also used in the decoding process, as the probability of transitioning from one word to another
- The model that computes P(W) is called the language model
The Chain Rule
Recall the definition of conditional probability:
    P(A|B) = P(A, B) / P(B)
Rewriting:
    P(A, B) = P(A|B) P(B)
Or:
    P(A, B) = P(B|A) P(A)
The Chain Rule, more generally
    P(A, B, C, D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
For "The big red dog was":
    P(The) * P(big|the) * P(red|the big) * P(dog|the big red) * P(was|the big red dog)
Better: P(The | <Beginning of sentence>), usually written as P(The | <s>)
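A tiny numeric illustration of assembling that product; the conditional probabilities are invented, and the point is only the bookkeeping.

```python
# Chain-rule decomposition of P(The big red dog was), with made-up
# conditional probabilities purely to show how the product is assembled.
cond_probs = [
    ("The", "<s>",             0.08),
    ("big", "The",             0.02),
    ("red", "The big",         0.01),
    ("dog", "The big red",     0.05),
    ("was", "The big red dog", 0.20),
]

p = 1.0
for word, history, prob in cond_probs:
    p *= prob            # multiply in P(word | history)
print(p)                 # ~1.6e-07: the histories quickly become unique and unseen
```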
General case
The word sequence from position 1 to n is written w_1^n.
So the probability of a sequence is:
    P(w_1^n) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) ... P(w_n|w_1^{n-1})
             = Π_{k=1..n} P(w_k | w_1^{k-1})
Unfortunately
This doesn't help, since we'll never be able to get enough data to compute the statistics for those long prefixes:
    P(lizard | the, other, day, I, was, walking, along, and, saw, a)
Markov Assumption
Make the simplifying assumption:
    P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | a)
Or maybe:
    P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | saw, a)
Markov Assumption
So for each component in the product, replace it with the approximation (assuming a prefix of N):
    P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
N-Grams
"The big red dog"
- Unigrams: P(dog)
- Bigrams: P(dog | red)
- Trigrams: P(dog | big red)
- Four-grams: P(dog | the big red)
In general, we'll be dealing with P(Word | Some fixed prefix)
Computing bigrams
Bigram probabilities are estimated from counts:
    P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
In actual cases it's slightly more complicated because of zeros, but we won't worry about that today.
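A minimal sketch of that count-and-divide estimate on a toy corpus; smoothing for the zero-count problem is deliberately left out, as on the slide.

```python
from collections import Counter

def bigram_probs(sentences):
    """Maximum-likelihood bigram estimates:
    P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(words[:-1])           # every word that can start a bigram
        bigrams.update(zip(words, words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = bigram_probs(["I want to eat Chinese food", "I want lunch"])
print(probs[("i", "want")])   # 1.0: both toy sentences continue "I" with "want"
print(probs[("want", "to")])  # 0.5
```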
Counts from the Berkeley Restaurant Project
BeRP Bigram Table
Some observations
The following numbers are very informative. Think about what they capture:
- P(want | I) = .32
- P(to | want) = .65
- P(eat | to) = .26
- P(food | Chinese) = .56
- P(lunch | eat) = .055
Generation
Choose N-grams according to their probabilities and string them together:
    I want
      want to
        to eat
          eat Chinese
            Chinese food
              food .
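The same idea as a runnable sketch, using a made-up bigram table in the spirit of the BeRP numbers above (these are not the actual BeRP probabilities).

```python
import random

# Toy bigram table; the probabilities are invented for illustration.
bigrams = {
    "<s>":     {"i": 1.0},
    "i":       {"want": 1.0},
    "want":    {"to": 0.65, "lunch": 0.35},
    "to":      {"eat": 1.0},
    "eat":     {"chinese": 0.5, "lunch": 0.5},
    "chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
    "lunch":   {"</s>": 1.0},
}

def generate(bigrams, max_len=20):
    """Sample one word at a time from P(w_n | w_{n-1}) until </s> is drawn."""
    word, out = "<s>", []
    for _ in range(max_len):
        words, weights = zip(*bigrams[word].items())
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(bigrams))  # e.g. "i want to eat chinese food"
```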
Learning: setting all the parameters in an ASR system
Given:
- a training set: wavefiles & word transcripts for each sentence
- a hand-built HMM lexicon
- initial acoustic models stolen from another recognizer
Then:
1. Train an LM on the word transcripts + other data
2. For each sentence, create one big HMM by combining all the word HMMs together
3. Use the Viterbi algorithm to align the HMM against the data, resulting in a phone labeling of the speech
4. Train new Gaussian acoustic models
5. Iterate (go to 2)
Word Error Rate
    Word Error Rate = 100 * (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
Alignment example:
    REF:  portable ****  PHONE  UPSTAIRS  last  night  so
    HYP:  portable FORM  OF     STORES    last  night  so
    Eval:          I     S      S
    WER = 100 * (1 + 2 + 0) / 6 = 50%
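A minimal sketch of computing WER with the standard minimum-edit-distance alignment; it reproduces the 50% above, though it reports only the total error count, not the per-word I/S/D labels.

```python
def wer(ref, hyp):
    """Word error rate = 100 * (S + D + I) / (words in the reference)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))  # 50.0
```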
Summary: ASR Architecture
Five easy pieces: ASR Noisy Channel architecture
- Feature Extraction: 39 “MFCC” features
- Acoustic Model: Gaussians for computing p(o|q)
- Lexicon/Pronunciation Model (HMM): what phones can follow each other
- Language Model: N-grams for computing p(w_i | w_{i-1})
- Decoder: the Viterbi algorithm, dynamic programming for combining all these to get a word sequence from speech!