Automatic Speech Recognition Introduction


Automatic Speech Recognition Introduction
Readings: Jurafsky & Martin, Sections 7.1–2; HLT Survey, Chapter 1

The Human Dialogue System

Computer Dialogue Systems
signal → Audition → signal → Automatic Speech Recognition → words → Natural Language Understanding → logical form → Management / Planning → Natural Language Generation → words → Text-to-Speech → signal


Parameters of ASR Capabilities
Different types of tasks come with different difficulties:
Speaking mode: isolated words vs. continuous speech
Speaking style: read vs. spontaneous
Enrollment: speaker-independent vs. speaker-dependent
Vocabulary: small (< 20 words) vs. large (> 20,000 words)
Language model: finite-state vs. context-sensitive
Perplexity: small (< 10) vs. large (> 100)
Signal-to-noise ratio: high (> 30 dB) vs. low (< 10 dB)
Transducer: high-quality microphone vs. telephone

The Noisy Channel Model
Message + Channel = Signal
Decoding model: find Message* = argmax P(Message | Signal)
But how do we represent each of these things?

ASR using HMMs
Try to solve P(Message | Signal) by breaking the problem up into separate components.
Most common method: hidden Markov models (HMMs).
Assume that a message is composed of words.
Assume that words are composed of sub-word parts (phones).
Assume that phones have some sort of acoustic realization.
Use probabilistic models for matching acoustics to phones and phones to words.
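The decoding rule from the noisy-channel slide can be sketched as a toy decoder: given a table of acoustic scores P(Signal | Word) for one fixed input and a table of word priors P(Word), pick the word maximizing their product. Every probability value here is invented purely for illustration.

```python
# Toy illustration of the decoding rule: W* = argmax_W P(X | W) P(W).
# All probability values are hypothetical.

acoustic = {        # P(signal | word) for one fixed input signal
    "home": 0.6,
    "hum":  0.3,
}
language = {        # P(word), a hypothetical unigram prior
    "home": 0.02,
    "hum":  0.001,
}

def decode(acoustic, language):
    """Return the word maximizing P(X | W) * P(W)."""
    return max(acoustic, key=lambda w: acoustic[w] * language[w])

print(decode(acoustic, language))  # "home": 0.6*0.02 beats 0.3*0.001
```

Real systems apply the same argmax, but over astronomically many word sequences, which is why search (Viterbi, A*) is needed later in these slides.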

HMMs: The Traditional View
Example: "go home". A Markov-model backbone composed of phones (hidden, because we don't know the correspondences): g o h o m, aligned to acoustic observations x0 … x9. Each line in the diagram represents a probability estimate (more later).

HMMs: The Traditional View
Same backbone (g o h o m) and observations (x0 … x9). Even with the same word hypothesis, there can be different alignments. Also, we have to search over all word hypotheses.

HMMs as Dynamic Bayesian Networks
Example: "go home". A Markov-model backbone composed of phones, with one state per frame: q0=g q1=o q2=o q3=o q4=h q5=o q6=o q7=o q8=m q9=m, emitting acoustic observations x0 … x9.

HMMs as Dynamic Bayesian Networks
Same network. ASR asks: what is the best assignment to q0 … q9 given x0 … x9?

Hidden Markov Models & DBNs
(The same model drawn two ways: the DBN representation and the Markov-model state-transition representation.)

Parts of an ASR System
Feature Calculation → Acoustic Modeling (phones, e.g., k @) → SEARCH → "The cat chased the dog"
Pronunciation model: cat: k@t, dog: dog, mail: mAl, the: D&, DE, …
Language model: cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …

Parts of an ASR System
Feature Calculation: produces the acoustics (x_t)
Acoustic Modeling: maps acoustics to phones
Pronunciation Modeling: maps phones to words (cat: k@t, dog: dog, mail: mAl, the: D&, DE, …)
Language Modeling: strings words together (cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …)

Feature calculation

Feature calculation
(Spectrogram: frequency vs. time.) Find the energy at each time step in each frequency channel.

Feature calculation
Take an inverse discrete Fourier transform of the log channel energies to decorrelate the frequency channels.

Feature calculation
Input: the speech waveform. Output: a sequence of feature vectors, one per time step, e.g. (-0.1 0.3 1.4 -1.2 2.3 2.6 …), (0.2 0.0 1.2 -1.2 4.4 2.2 …), (-6.1 -2.1 3.1 2.4 1.0 2.2 …), …
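The three feature-calculation steps above (frame the signal, find per-channel energies, apply a DCT to decorrelate) can be sketched as follows. This is a deliberately simplified stand-in: linear frequency bands instead of a mel filterbank, and arbitrary frame sizes; real front ends produce MFCCs from mel-spaced filters.

```python
import numpy as np

def features(signal, frame_len=256, hop=128, n_bands=8, n_ceps=6):
    """Toy front end: per-frame log band energies, then a DCT
    (the 'inverse DFT' step from the slides) to decorrelate them."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    feats = []
    for fr in frames:
        spec = np.abs(np.fft.rfft(fr)) ** 2               # power spectrum
        bands = np.array_split(spec, n_bands)              # crude filterbank
        logE = np.log(np.array([b.sum() for b in bands]) + 1e-10)
        # DCT-II of the log energies -> decorrelated cepstral features
        n = np.arange(n_bands)
        ceps = np.array([np.sum(logE * np.cos(np.pi * k * (2 * n + 1)
                                              / (2 * n_bands)))
                         for k in range(n_ceps)])
        feats.append(ceps)
    return np.vstack(feats)

sig = np.sin(2 * np.pi * 440 * np.arange(4000) / 8000)  # 0.5 s of a 440 Hz tone
F = features(sig)
print(F.shape)  # (30, 6): 30 frames, 6 cepstral coefficients each
```

The output matches the slide's picture: one short feature vector per time step, ready for the acoustic model.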

Robust Speech Recognition
Different schemes have been developed for dealing with noise and reverberation.
Additive noise: reduce the effects of particular frequencies.
Convolutional noise: remove the effects of linear filters (cepstral mean subtraction).
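Cepstral mean subtraction is short enough to show directly: a fixed linear channel multiplies the spectrum, so it becomes an additive constant in the log-cepstral domain, and subtracting the per-utterance mean cancels it. The 13-dimensional features and the channel offset below are invented for the demonstration.

```python
import numpy as np

def cepstral_mean_subtraction(feats):
    """Subtract each cepstral dimension's per-utterance mean.
    A fixed convolutional channel adds a constant vector in the
    cepstral domain, so this removes it."""
    return feats - feats.mean(axis=0, keepdims=True)

# A constant channel offset disappears after CMS:
clean = np.random.randn(100, 13)          # hypothetical clean cepstra
channel = np.full(13, 2.5)                # hypothetical channel offset
observed = clean + channel
print(np.allclose(cepstral_mean_subtraction(observed),
                  cepstral_mean_subtraction(clean)))  # True
```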

Now what?
We have a sequence of feature vectors (-0.1 0.3 1.4 …; 0.2 0.1 1.2 …; …). How do we get from these to words ("That you …")?

Machine Learning!
Pattern recognition with HMMs turns the feature-vector sequence (-0.1 0.3 1.4 …; …) into words ("That you …").

Hidden Markov Models (again!)
P(state_{t+1} | state_t): pronunciation/language models
P(acoustics_t | state_t): acoustic model

Acoustic Model
Assume that you can label each feature vector with a phonetic label (e.g., dh, a, t).
Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g., a neural network):
N_a(μ, Σ) = P(x | state = a)
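A minimal sketch of this Gaussian acoustic model, assuming frames have already been labelled by phone; the labelled frames below are synthetic, drawn from two well-separated distributions so the per-phone models are easy to tell apart.

```python
import numpy as np

def fit_diag_gaussian(examples):
    """ML estimates of a diagonal-covariance Gaussian for one phone."""
    X = np.asarray(examples)
    return X.mean(axis=0), X.var(axis=0) + 1e-6   # variance floor

def log_likelihood(x, mean, var):
    """log N(x; mean, diag(var)) -- the acoustic score log P(x | phone)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
# Hypothetical labelled 13-dim frames for two phones, 'a' and 'dh'
frames_a  = rng.normal(0.0, 1.0, size=(200, 13))
frames_dh = rng.normal(3.0, 1.0, size=(200, 13))
model_a  = fit_diag_gaussian(frames_a)
model_dh = fit_diag_gaussian(frames_dh)

x = frames_a[0]   # a frame that actually came from phone 'a'
print(log_likelihood(x, *model_a) > log_likelihood(x, *model_dh))  # True
```

In an HMM these per-phone scores become the emission probabilities P(acoustics_t | state_t) from the previous slide.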

Building up the Markov Model
Start with a model for each phone: each phone state has a self-loop with transition probability p and an arc to the next state with probability 1 − p.
Typically, we use 3 states per phone to give a minimum-duration constraint, but we ignore that here.

Building up the Markov Model
The pronunciation model gives the connections between phones within words, each phone keeping its own self-loop probability (p_dh, p_a, p_t, …); e.g., "that" = dh a t.
Multiple pronunciations: a word can branch into alternative phone paths, as the diagram's second word shows.

Building up the Markov Model
The language model gives the connections between words (e.g., a bigram grammar): after "that" (dh a t), transition to "he" (h iy) with probability p(he | that), or to "you" (y uw) with probability p(you | that).
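A bigram grammar of this kind can be estimated by maximum likelihood from counts. The three-sentence corpus below is invented; a real language model would be trained on millions of words and smoothed.

```python
from collections import Counter

def bigram_lm(corpus):
    """MLE bigram probabilities P(w2 | w1) from a toy corpus:
    count(w1 w2) / count(w1), with sentence-boundary markers."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            bigrams[(w1, w2)] += 1
            unigrams[w1] += 1
    return lambda w2, w1: bigrams[(w1, w2)] / unigrams[w1]

p = bigram_lm(["that he said", "that you said", "that you heard"])
print(p("you", "that"))  # 2/3: "that" is followed by "you" in 2 of 3 sentences
print(p("he", "that"))   # 1/3
```

These conditional probabilities are exactly the p(you | that), p(he | that) arcs on the slide's word graph.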

ASR as Bayesian Inference
(Diagram: states q1 q2 q3 for word w1 over frames x1 x2 x3, with word branches after "that": h iy with p(he | that), y uw with p(you | that), sh uh d.)
argmax_W P(W|X)
= argmax_W P(X|W) P(W) / P(X)
= argmax_W P(X|W) P(W)
= argmax_W Σ_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X|Q) P(Q|W) P(W)

ASR Probability Models
Three probability models:
P(X|Q): acoustic model
P(Q|W): duration/transition/pronunciation model
P(W): language model
The language and pronunciation model structures are inferred from prior knowledge; the other models are learned from data (how?).

Parts of an ASR System (with their probability models)
Feature Calculation → Acoustic Modeling: P(X|Q) → SEARCH → "The cat chased the dog"
Pronunciation Modeling: P(Q|W) — cat: k@t, dog: dog, mail: mAl, the: D&, DE, …
Language Modeling: P(W) — cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …

EM for ASR: The Forward-Backward Algorithm
Determine "state occupancy" probabilities, i.e., (probabilistically) assign each data vector to a state.
Calculate new transition probabilities and new means and standard deviations (emission probabilities) using those assignments.
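The occupancy step can be sketched with the forward-backward recursions on a tiny discrete-emission HMM (all parameter values invented). The result γ[t, i] = P(q_t = i | X) is the soft assignment the reestimation formulas then average over.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """State-occupancy probabilities gamma[t, i] = P(q_t = i | X)
    for a discrete-emission HMM -- the E-step quantities used to
    reestimate transitions and emissions."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))                 # forward probabilities
    beta = np.zeros((T, N))                  # backward probabilities
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

A  = np.array([[0.7, 0.3], [0.4, 0.6]])    # toy transition probabilities
B  = np.array([[0.9, 0.1], [0.2, 0.8]])    # toy emission probabilities
pi = np.array([0.5, 0.5])
gamma = forward_backward(A, B, pi, [0, 1, 0])
print(np.allclose(gamma.sum(axis=1), 1.0))  # each frame's occupancies sum to 1
```

Real implementations work in the log domain or with per-frame scaling, since these products underflow on utterance-length sequences.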


Search
When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
all possible word sequences W, and
all possible segmentations/alignments of W and X.
Generally, this is done by searching the space of W.
Viterbi search: a dynamic-programming approach that looks for the most likely path.
A* search: an alternative method that keeps a stack of hypotheses.
If |W| is large, pruning becomes important.
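A minimal Viterbi decoder for a toy discrete-emission HMM, done in the log domain so path probabilities don't underflow; the model values are invented and there is no word graph or pruning, just the core dynamic program.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path via dynamic programming (log domain)."""
    T, N = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))                 # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)       # backpointers
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA      # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])    # toy transitions
B  = np.array([[0.9, 0.1], [0.2, 0.8]])    # toy emissions
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1]))     # [0, 0, 1, 1]
```

The same recursion runs over the composed phone/word graph in a real decoder; beam pruning simply drops states whose delta falls too far below the frame's best.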

How to train an ASR system
Have a speech corpus at hand; it should have word (and preferably phone) transcriptions.
Divide it into training, development, and test sets.
Develop models of prior knowledge: a pronunciation dictionary and a grammar.
Train acoustic models, possibly realigning the corpus phonetically.

How to train an ASR system
Test on your development data (baseline).
Think real hard; figure out some neat new modification; retrain the system component.
Test on your development data again.
Lather, rinse, repeat.
Then, at the end of the project, test on the test data (only once).

Judging the quality of a system
Usually, ASR performance is judged by the word error rate:
ErrorRate = 100 × (Subs + Ins + Dels) / N_words
REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I
100 × (1 S + 1 I + 1 D) / 5 = 60%
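The word error rate can be computed with a standard edit-distance dynamic program (substitutions, insertions, and deletions each cost 1); running it on the slide's REF/REC pair reproduces the 60% figure.

```python
def wer(ref, hyp):
    """Word error rate: 100 * edit_distance(ref, hyp) / len(ref),
    counting substitutions, insertions, and deletions."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```

One deletion (I), one substitution (TO → TWO), and one insertion (NOW) give 3 errors over 5 reference words.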

Judging the quality of a system
Word error rate assumes that all errors are equal, and there is a bit of a mismatch between the optimization criterion and the error measurement.
Other (task-specific) measures are sometimes used: task completion, concept error rate.