Automatic Speech Recognition Introduction


The Human Dialogue System


Computer Dialogue Systems (block diagram)
Components: Audition, Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Management (Mgmt.), Planning, Natural Language Generation (NLG), Text-to-speech.
Labels on the connections between components: signal, words, logical form.

Parameters of ASR Capabilities
Different types of tasks with different difficulties:
- Speaking mode (isolated words / continuous speech)
- Speaking style (read / spontaneous)
- Enrollment (speaker-independent / speaker-dependent)
- Vocabulary (small: < 20 words / large: > 20,000 words)
- Language model (finite state / context sensitive)
- Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
- Transducer (high-quality microphone / telephone)

The Noisy Channel Model (Shannon)
A message passes through a noisy channel: Signal = Channel + Message.
Decoding model: find Message* = argmax P(Message|Signal)
But how do we represent each of these things?

What are the basic units for acoustic information?
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable. Words are good units for small-vocabulary SR, but not a good choice for large-vocabulary and continuous SR:
- Each word is treated individually, which implies a large amount of training data and storage.
- The recognition vocabulary may contain words that never occurred in the training data.
- It is expensive to model interword coarticulation effects.

Why phones are better units than words: an example

"SAY BITE AGAIN" spoken so that the phonemes are separated in time Recorded sound spectrogram

"SAY BITE AGAIN" spoken normally

And why phones are still not the perfect choice Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent). However, each word is not a sequence of independent phonemes! Our articulators move continuously from one position to another. The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc. Different realizations of a phoneme are called allophones.

Example: different spectrograms for “eh”

Triphone model
Each triphone captures facts about the preceding and following phone.
- Monophone: p, t, k
- Triphone: iy-p+aa
- a-b+c means “phone b, preceded by phone a, followed by phone c”
In practice, systems use on the order of 100,000 3phones, and the 3phone model is the one currently used (e.g. Sphinx).
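
The a-b+c notation above can be illustrated with a short Python sketch. The function name `to_triphones` and the `sil` boundary symbol are assumptions made for this example; they are not taken from the slides or from any particular toolkit.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbour in a-b+c notation.

    `boundary` is a hypothetical padding symbol used at utterance edges.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Phone string for "cat", written with illustrative phone symbols.
cat_triphones = to_triphones(["k", "ae", "t"])
print(cat_triphones)
```

Each phone yields exactly one triphone label, so the sequence length is unchanged; only the inventory of distinct units grows.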

Parts of an ASR System
- Feature Calculation: produces acoustic vectors (x_t)
- Acoustic Modeling: maps acoustics to 3phones (e.g. k, @)
- Pronunciation Modeling: maps 3phones to words (cat: k@t; dog: dog; mail: mAl; the: D&, DE; …)
- Language Modeling: strings words together (cat dog: 0.00002; cat the: 0.0000005; the cat: 0.029; the dog: 0.031; the mail: 0.054; …)

Feature calculation interpretations

Feature calculation (figure axes: frequency vs. time): find the energy at each time step in each frequency channel.

Feature calculation (figure axes: frequency vs. time): take the Inverse Discrete Fourier Transform to decorrelate frequencies.
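
The two steps above (energy per time step and frequency channel, then an inverse DFT) can be sketched with NumPy. This is a simplified illustration, not the Sphinx frontend: the frame length, hop size, test tone and function names are all assumptions, and the mel filterbank used for real MFCCs is left out.

```python
import numpy as np


def frame_signal(signal, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one row per time step)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def log_spectrum(frames):
    """Log energy in each frequency channel of each frame."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)


def cepstrum(log_spec, n_coeffs=13):
    """Inverse DFT of the log spectrum; keep the low-order coefficients."""
    return np.fft.irfft(log_spec, axis=1)[:, :n_coeffs]


sample_rate = 16000                               # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)              # one second of a test tone
frames = frame_signal(signal, frame_len=400, hop=160)  # 25 ms frames, 10 ms hop
features = cepstrum(log_spectrum(frames))
print(features.shape)
```

With a 10 ms hop this gives roughly 100 feature vectors per second of audio.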

Feature calculation
Input: (figure)
Output: acoustic observation vectors
  -0.1 0.3 1.4 -1.2 2.3 2.6 …
  0.2 0.1 1.2 -1.2 4.4 2.2 …
  0.2 0.0 1.2 -1.2 4.4 2.2 …
  -6.1 -2.1 3.1 2.4 1.0 2.2 …

Robust Speech Recognition
Different schemes have been developed for dealing with noise and reverberation:
- Additive noise: reduce effects of particular frequencies
- Convolutional noise: remove effects of linear filters (cepstral mean subtraction)
Cepstrum: the Fourier transform of the LOGARITHM of the spectrum.
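
Why cepstral mean subtraction removes a linear filter: filtering multiplies the spectrum, so after taking the logarithm it adds a constant offset to every cepstral vector, and subtracting the per-utterance mean cancels that offset. The sketch below checks this on synthetic data; the random "cepstra" and the offset are invented for illustration.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from every cepstral vector."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)


rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 13))        # made-up cepstra: 100 frames x 13 coeffs
channel_offset = rng.normal(size=(1, 13)) # constant offset added by a fixed linear filter
filtered = clean + channel_offset

# After normalisation the filtered and clean utterances coincide.
same = np.allclose(cepstral_mean_subtraction(filtered),
                   cepstral_mean_subtraction(clean))
print(same)
```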

How do we map from vectors to word sequences? -0.1 0.3 1.4 -1.2 2.3 2.6 … 0.2 0.1 1.2 -1.2 4.4 2.2 … 0.2 0.0 1.2 -1.2 4.4 2.2 … -6.1 -2.1 3.1 2.4 1.0 2.2 … ??? “That you” …

HMM (again)! Pattern recognition with HMMs maps the vectors to words:
  -0.1 0.3 1.4 -1.2 2.3 2.6 …
  0.2 0.1 1.2 -1.2 4.4 2.2 …
  0.2 0.0 1.2 -1.2 4.4 2.2 …
  -6.1 -2.1 3.1 2.4 1.0 2.2 …
Result: “That you” …

ASR using HMMs
Try to solve P(Message|Signal) by breaking the problem up into separate components. Most common method: Hidden Markov Models.
- Assume that a message is composed of words
- Assume that words are composed of sub-word parts (3phones)
- Assume that 3phones have some sort of acoustic realization
- Use probabilistic models for matching acoustics to phones to words

Creating HMMs for word sequences: Context independent units 3phones

“Need” 3phone model

Hierarchical system of HMMs: HMMs of triphones are combined into a higher-level HMM of a word, and word HMMs are connected by the language model.

To simplify, let’s now ignore lower level HMM Each phone node has a “hidden” HMM (H2MM)

HMMs for ASR: “go home”
Markov model backbone composed of sequences of 3phones: g o h o m (hidden because we don’t know the correspondences).
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
Each line in the figure represents a probability estimate (more later).

HMMs for ASR go home Markov model backbone composed of phones (hidden because we don’t know correspondences) g o h o m x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 Acoustic observations Even with same word hypothesis, can have different alignments (red arrows). Also, have to search over all word hypotheses

For every HMM (in the hierarchy): compute the max probability sequence
X = acoustic observations, (3)phones, phone sequences
W = (3)phones, phone sequences, word sequences
COMPUTE: argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X) = argmax_W P(X|W) P(W)
P(X) does not depend on W, so it can be dropped from the maximization.
(Figure: phone sequences th a t, h iy, y uw and sh uh d, linked by p(he|that) and p(you|that).)

Search
When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
- All possible (3phone, word, etc.) sequences
- All possible segmentations/alignments of W and X
Generally, this is done by searching the space of W:
- Viterbi search: dynamic programming approach that looks for the most likely path
- A* search: alternative method that keeps a stack of hypotheses around
If |W| is large, pruning becomes important. We also need to estimate transition probabilities.

Training: speech corpora
- Have a speech corpus at hand
  - Should have word (and preferably phone) transcriptions
  - Divide into training, development, and test sets
- Develop models of prior knowledge
  - Pronunciation dictionary
  - Grammar, lexical trees
- Train acoustic models
  - Possibly realigning corpus phonetically

Acoustic Model
(Figure: acoustic vectors labelled with the phones dh, a, t.)
Assume that you can label each vector with a phonetic label. Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks): N_a(μ, Σ) gives P(X | state = a).
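
A minimal sketch of the "collect the examples of a phone and build a Gaussian" step, assuming a diagonal covariance for simplicity. The two-dimensional training vectors and the helper names are invented for illustration; real systems use 39-dimensional features and Gaussian mixtures.

```python
import numpy as np


def fit_diag_gaussian(vectors):
    """Estimate the mean and per-dimension variance from a phone's examples."""
    x = np.asarray(vectors, dtype=float)
    return x.mean(axis=0), x.var(axis=0) + 1e-6  # variance floor for stability


def log_likelihood(x, mean, var):
    """log P(x | state) under a diagonal-covariance Gaussian."""
    x = np.asarray(x, dtype=float)
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))


# Hypothetical labelled feature vectors (2-D to keep the example readable).
labelled = {
    "dh": [[-0.1, 0.3], [0.0, 0.2], [-0.2, 0.4]],
    "a": [[1.4, -1.2], [1.2, -1.0], [1.5, -1.3]],
}
models = {phone: fit_diag_gaussian(v) for phone, v in labelled.items()}

observation = [1.3, -1.1]
scores = {phone: log_likelihood(observation, *m) for phone, m in models.items()}
best_phone = max(scores, key=scores.get)
print(best_phone)
```

The observation lies close to the examples of "a", so that model assigns it the higher log likelihood.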

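Pronunciation model
The pronunciation model gives connections between phones and words.
(Figure: a phone chain dh, a, t with transition probabilities p_dh and 1-p_dh, p_a and 1-p_a, p_t and 1-p_t.)
Multiple pronunciations (tomato) are drawn as a network with alternative branches (phone labels in the figure: t, ow, ow, ey, t, m, ah, ah).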

Training models for a sound unit

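Language Model
The language model gives connections between words (e.g., bigrams: probabilities of two-word sequences).
(Figure: the phones dh a t of “that” connect to h iy (“he”) with probability p(he|that) and to y uw (“you”) with probability p(you|that).)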
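
Bigram probabilities such as p(he|that) can be estimated from counts. The sketch below uses plain maximum-likelihood estimates with no smoothing; the four-sentence corpus and the `<s>` / `</s>` boundary markers are assumptions for the example.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count histories and bigrams, padding with sentence boundary markers."""
    history, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        history.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return history, bigrams


def bigram_prob(w1, w2, history, bigrams):
    """Maximum-likelihood estimate of p(w2 | w1); 0.0 for an unseen history."""
    if history[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / history[w1]


# Tiny invented corpus, only to make the counts easy to check by hand.
corpus = ["that he said", "that you said", "that he did", "he said that"]
history, bigrams = train_bigrams(corpus)
p_he = bigram_prob("that", "he", history, bigrams)
p_you = bigram_prob("that", "you", history, bigrams)
print(p_he, p_you)
```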

Lexical trees
  START     S-T-AA-R-TD
  STARTING  S-T-AA-R-DX-IX-NG
  STARTED   S-T-AA-R-DX-IX-DD
  STARTUP   S-T-AA-R-T-AX-PD
  START-UP  S-T-AA-R-T-AX-PD
(Figure: a tree over the phones S T AA R TD DX IX NG DD AX PD, with leaves start, starting, started, startup, start-up.)
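
A lexical tree shares common phone prefixes between words, which can be sketched as a nested-dictionary trie. The `#words` key used to mark word ends and the helper names are choices made for this illustration only.

```python
def build_lexical_tree(lexicon):
    """Build a prefix tree over phones; '#words' lists the words ending at a node."""
    root = {}
    for word, pronunciation in lexicon.items():
        node = root
        for phone in pronunciation.split("-"):
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)
    return root


def count_nodes(tree):
    """Number of phone nodes in the tree (word markers are not counted)."""
    return sum(1 + count_nodes(child)
               for key, child in tree.items() if key != "#words")


lexicon = {
    "START": "S-T-AA-R-TD",
    "STARTING": "S-T-AA-R-DX-IX-NG",
    "STARTED": "S-T-AA-R-DX-IX-DD",
    "STARTUP": "S-T-AA-R-T-AX-PD",
    "START-UP": "S-T-AA-R-T-AX-PD",
}
tree = build_lexical_tree(lexicon)
flat_phones = sum(len(p.split("-")) for p in lexicon.values())
print(count_nodes(tree), "tree nodes versus", flat_phones, "phones without sharing")
```

Sharing the S-T-AA-R prefix means the search only has to score those phones once for all five words.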

Judging the quality of a system
Usually, ASR performance is judged by the word error rate:
ErrorRate = 100*(Subs + Ins + Dels) / Nwords
  REF: I WANT TO  GO HOME ***
  REC: * WANT TWO GO HOME NOW
  SC:  D C    S   C  C    I
100*(1S+1I+1D)/5 = 60%
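
The error counts in the formula above come from a minimum-edit-distance alignment between the reference and the recognized word sequence. The following is one way to sketch it in Python; the function name `wer_counts` is an assumption, and scoring tools may break ties between equally good alignments differently.

```python
def wer_counts(reference, hypothesis):
    """Return (subs, ins, dels, error_rate_percent) from a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total errors, subs, ins, dels) for ref[:i] versus hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)          # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                candidates = [(t, s, n, d)]            # correct word
            else:
                candidates = [(t + 1, s + 1, n, d)]    # substitution
            t, s, n, d = cost[i][j - 1]
            candidates.append((t + 1, s, n + 1, d))    # insertion
            t, s, n, d = cost[i - 1][j]
            candidates.append((t + 1, s, n, d + 1))    # deletion
            cost[i][j] = min(candidates)               # fewest total errors wins
    total, subs, ins, dels = cost[-1][-1]
    return subs, ins, dels, 100.0 * total / len(ref)


result = wer_counts("I WANT TO GO HOME", "WANT TWO GO HOME NOW")
print(result)
```

For the slide's example this reproduces one substitution, one insertion and one deletion over five reference words.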

Judging the quality of a system
- Usually, ASR performance is judged by the word error rate
  - This assumes that all errors are equal
  - There is also a bit of a mismatch between the optimization criterion and the error measurement
- Other (task-specific) measures are sometimes used
  - Task completion
  - Concept error rate

Sphinx4 http://cmusphinx.sourceforge.net This will be a practical intro to understanding speech recognition focused on the interface of Sphinx

Sphinx4 Implementation Basic flow chart of how the components fit together


Frontend: feature extractor
The frontend is the first component of the system to see the data. It does signal processing to enhance the signal and extracts features.

Frontend: feature extractor
- The feature vectors are Mel-Frequency Cepstral Coefficients (MFCCs). Different formats exist, but MFCCs are common.
- The important point is that this is a way of transforming an analog signal into digital feature vectors of 39 numbers representing phonetic sounds.
- Observations are taken every 10 ms, giving 100 feature vectors a second.
- Ch. 9.3 in Jurafsky & Martin has a nice description of the process.

Hidden Markov Models (HMMs)
HMMs used for speech recognition have three main components:
- Acoustic observations: the feature vectors.
- Hidden states: phones, partial phones and words, which we are trying to figure out.
- Observation likelihoods: the probability of a feature vector being generated by a hidden state (phone, etc.), P(features | phone).

Hidden Markov Models (HMMs): “Six”
- HMMs are finite state machines, and we can depict them graphically.
- An HMM consists of emitting states like S1 plus start and end states.
- Transitions between states are weighted by their probability.
- They flow left to right: a state can transition to itself or forward (this captures the sequential nature of speech).
- Phones can vary widely in duration; self-loops account for this variability.
(A left-to-right HMM is called a Bakis network.)
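
The left-to-right structure described above corresponds to a transition matrix with non-zero entries only on the diagonal (self-loops) and just above it (forward moves). A small sketch, assuming three emitting states per unit and a self-loop probability of 0.6, both chosen arbitrarily for illustration:

```python
import numpy as np


def bakis_transitions(n_states, self_loop=0.6):
    """Transition matrix for n emitting states followed by one absorbing end state."""
    a = np.zeros((n_states + 1, n_states + 1))
    for s in range(n_states):
        a[s, s] = self_loop          # stay in the same state (models duration)
        a[s, s + 1] = 1 - self_loop  # move forward to the next state
    a[n_states, n_states] = 1.0      # end state absorbs
    return a


transitions = bakis_transitions(3)
print(transitions)

# With self-loop probability p, the expected number of frames spent in a
# state is 1 / (1 - p); at one frame per 10 ms this sets the typical duration.
expected_frames = 1.0 / (1.0 - 0.6)
```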

Sphinx4 Implementation the linguist generates search graph

Linguist
Constructs the search graph of HMMs from:
- Acoustic model
- Statistical language model ~or~ grammar
- Dictionary
The language model or grammar must contain the same words as the dictionary. The acoustic model must contain the same phone set as the dictionary.

Acoustic Model
- Constructs the HMMs of phones (the HMMs just described)
- Produces observation likelihoods
- Contains the acoustic info
It uses probability density functions (PDFs) and Gaussian mixtures to create flexible models of phonetic sounds, which are then used to compute the observation likelihoods P(observation | phone). For more on PDFs and Gaussian methods see Jurafsky & Martin ch. 9.4.2.

Acoustic Model
- Constructs the HMMs for units of speech
- Produces observation likelihoods
- Sampling rate is critical! (WSJ vs. WSJ_8k)
All models are marked with a sampling rate, and it must match the sampling rate of the audio in the application, i.e. you can’t train on 16 kHz data and then recognize 8 kHz data in the wild. You will get horrible results.
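
A mismatch like this is easy to catch before decoding. The sketch below reads the sampling rate from a WAV header with Python's standard `wave` module and compares it with the rate the acoustic model expects. The helper names and the demo file are assumptions for illustration and are not part of Sphinx4.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Read the sampling rate (Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def check_sample_rate(path, model_rate_hz):
    """Raise if the audio does not match the acoustic model's sampling rate."""
    audio_rate = wav_sample_rate(path)
    if audio_rate != model_rate_hz:
        raise ValueError(
            f"audio is {audio_rate} Hz but the acoustic model expects {model_rate_hz} Hz"
        )
    return audio_rate


def write_silence(path, rate_hz, n_samples=800):
    """Write a short mono 16-bit WAV file of silence (used only for the demo)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * n_samples)


demo_path = os.path.join(tempfile.mkdtemp(), "demo_8k.wav")
write_silence(demo_path, 8000)
print(check_sample_rate(demo_path, 8000))  # matches an 8 kHz model
```

Calling `check_sample_rate(demo_path, 16000)` on the same file raises `ValueError`, which is the case the slide warns about.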

Acoustic Model
- Constructs the HMMs for units of speech
- Produces observation likelihoods
- Sampling rate is critical! (WSJ vs. WSJ_8k)
- Other models: TIDIGITS, RM1, AN4, HUB4
Creating acoustic models is a lot of work, so usually we use ones that are already available. Different models are available, trained on different vocabularies, sampling rates, languages, etc. They can be found in sphinx4 in /models/acoustic; read the READMEs in their folders for details.

Language Model: word likelihoods
Contains information about how likely certain words are to occur.

Language Model: ARPA format
Example:
  1-grams:
  -3.7839 board -0.1552
  -2.5998 bottom -0.3207
  -3.7839 bunch -0.2174
  2-grams:
  -0.7782 as the -0.2717
  -0.4771 at all 0.0000
  -0.7782 at the -0.2915
  3-grams:
  -2.4450 in the lowest
  -0.5211 in the middle
  -2.4450 in the on
- A common format is ARPA, which can be produced using the CMU-Cambridge Statistical Language Modeling Toolkit.
- It commonly contains tables of probabilities of 1-, 2- and 3-grams.
- N-grams are listed one per line, preceded by the log of the conditional probability and followed by the log of the backoff weight ONLY for those N-grams that form a prefix of longer N-grams in the model.
- The numbers are negative because they are base-10 logarithms of probabilities, which are less than 1.
- See Jurafsky & Martin p. 313.
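
To make the line layout concrete, here is a small parser for a single ARPA entry, applied to lines from the example above. It is a sketch: the function name is made up, and a full reader would also handle the file header and the section markers.

```python
def parse_arpa_line(line, order):
    """Split one n-gram entry into (words, log10 probability, log10 backoff or None)."""
    fields = line.split()
    log_prob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # The backoff weight is present only when the n-gram prefixes a longer one.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return words, log_prob, backoff


bigram = parse_arpa_line("-0.4771 at all 0.0000", order=2)
trigram = parse_arpa_line("-2.4450 in the lowest", order=3)

# Convert the base-10 log back to a conditional probability.
p_all_given_at = 10 ** bigram[1]
print(bigram, trigram, round(p_all_given_at, 4))
```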

Grammar (example: command language)
  public <basicCmd> = <startPolite> <command> <endPolite>;
  public <startPolite> = (please | kindly | could you ) *;
  public <endPolite> = [ please | thanks | thank you ];
  <command> = <action> <object>;
  <action> = (open | close | delete | move);
  <object> = [the | a] (window | file | menu);
- A way of specifying what words can be used and how
- Java Speech API Grammar Format (JSGF)
- An alternative to a statistical language model like n-grams
- Notation: * = 0 or many, [] = optional, () = grouping, | = or
http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/jsgf/JSGFGrammar.html
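
To see which sentences this grammar accepts, the rules can be hand-translated into a regular expression. This is only an illustration of the notation (it is not how Sphinx4 processes JSGF), and the lowercase and whitespace normalization is an assumption.

```python
import re

# One pattern per JSGF rule, mirroring *, [], () and | from the grammar.
START_POLITE = r"(?:(?:please|kindly|could you)\s+)*"        # (...) * : zero or more
ACTION = r"(?:open|close|delete|move)"
OBJECT = r"(?:(?:the|a)\s+)?(?:window|file|menu)"            # [the | a] is optional
END_POLITE = r"(?:\s+(?:please|thanks|thank you))?"          # [...] is optional
BASIC_CMD = re.compile(START_POLITE + ACTION + r"\s+" + OBJECT + END_POLITE)


def accepts(sentence):
    """True if the sentence is generated by the <basicCmd> rule."""
    normalized = " ".join(sentence.lower().split())
    return BASIC_CMD.fullmatch(normalized) is not None


for example in ["please open the window thanks", "open file", "window open"]:
    print(example, "->", accepts(example))
```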

Dictionary Maps words to phoneme sequences Defines what words will be available for recognition Maps these words to phoneme sequences which are then used in creating HMMs

Dictionary
Example from cmudict.06d:
  POULTICE   P OW L T AH S
  POULTICES  P OW L T AH S IH Z
  POULTON    P AW L T AH N
  POULTRY    P OW L T R IY
  POUNCE     P AW N S
  POUNCED    P AW N S T
  POUNCEY    P AW N S IY
  POUNCING   P AW N S IH NG
  POUNCY     P UW NG K IY
- Defines what words will be available for recognition
- The CMU Pronouncing Dictionary is widely used
- Contains over 100,000 words and their transcriptions

Sphinx4 Implementation The SearchManager uses the Features and the SearchGraph to find the best fit path

Search Graph We can represent a partial diagram of the digit recognition task like so

Search Graph Another representation of the same idea

Search Graph
Can be statically or dynamically constructed. For small-vocabulary tasks, the entire search graph for the model can be computed ahead of time. For larger applications, a partial search graph would likely be constructed ahead of time and dynamically expanded at runtime.

Sphinx4 Implementation Then comes the decoder which constructs the search manager

Decoder Maps feature vectors to search graph Job of the decoder is to use feature vectors from the frontend in conjunction with the search graph generated by the linguist to generate a result

Search Manager Searches the graph for the “best fit” The decoder calls the search manager to search the graph for the best fit

Search Manager Searches the graph for the “best fit” P(sequence of feature vectors| word/phone) aka. P(O|W) -> “how likely is the input to have been generated by the word” For a given word or phone we want to determine the P(sequence of feature vectors | word or phone)

Possible alignments of the phones f ay v to 10 observations:
  f ay ay ay ay v v v v v
  f f ay ay ay ay v v v v
  f f f ay ay ay ay v v v
  f f f f ay ay ay ay v v
  f f f f ay ay ay ay ay v
  f f f f f ay ay ay ay v
  f f f f f f ay ay ay v
  …
We could calculate the probability of every possible alignment for a given word given a set of observations, but this would take exponential time.
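
The standard way around enumerating every alignment is dynamic programming. The sketch below uses the forward algorithm (a technique the slides do not name) to compute P(O|W) by summing over all alignments, and checks it against brute-force enumeration on a tiny case. The transition matrix and the random observation likelihoods are invented for illustration.

```python
import itertools

import numpy as np


def forward_likelihood(trans, emit):
    """P(O | model): sum over all state paths; emit[t, s] = P(o_t | state s)."""
    alpha = np.zeros(emit.shape[1])
    alpha[0] = emit[0, 0]                      # paths must start in the first state
    for t in range(1, emit.shape[0]):
        alpha = (alpha @ trans) * emit[t]      # one step per frame, no enumeration
    return float(alpha[-1])                    # paths must end in the last state


def brute_force_likelihood(trans, emit):
    """Same quantity by enumerating every state sequence (exponential in frames)."""
    n_frames, n_states = emit.shape
    total = 0.0
    for path in itertools.product(range(n_states), repeat=n_frames):
        if path[0] != 0 or path[-1] != n_states - 1:
            continue
        p = emit[0, path[0]]
        for t in range(1, n_frames):
            p *= trans[path[t - 1], path[t]] * emit[t, path[t]]
        total += p
    return float(total)


# Left-to-right model over the states f, ay, v (last state absorbs).
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
emit = rng.uniform(0.1, 1.0, size=(6, 3))      # made-up likelihoods for 6 frames

fast = forward_likelihood(trans, emit)
slow = brute_force_likelihood(trans, emit)
print(fast, slow)
```

Both functions return the same value, but the forward pass does a fixed amount of work per frame while the enumeration visits 3^6 candidate paths here and grows exponentially with the number of frames.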

Viterbi Search
- The search manager commonly uses the Viterbi algorithm, a dynamic-programming graph search for finding the most likely sequence of hidden states (phones) in an HMM based on a sequence of observations over time (O1, O2, O3, …).
- More in the appendix in these slides.
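
A compact version of the Viterbi recursion, written in the log domain, is sketched below. The three-state model (f, ay, v) and the observation likelihoods for five frames are invented so that the best path can be verified by hand.

```python
import numpy as np


def viterbi(log_trans, log_emit):
    """Most likely state path that starts in state 0 and ends in the last state."""
    n_frames, n_states = log_emit.shape
    score = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    score[0, 0] = log_emit[0, 0]
    for t in range(1, n_frames):
        cand = score[t - 1][:, None] + log_trans   # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)          # best predecessor of each state
        score[t] = np.max(cand, axis=0) + log_emit[t]
    state = n_states - 1
    path = [state]
    for t in range(n_frames - 1, 0, -1):           # follow back-pointers
        state = int(back[t, state])
        path.append(state)
    path.reverse()
    return path, float(score[-1, n_states - 1])


trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.8, 0.1, 0.1],                  # made-up P(o_t | state)
                 [0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.1, 0.8]])
with np.errstate(divide="ignore"):                 # log(0) = -inf marks forbidden moves
    log_trans = np.log(trans)

best_path, best_log_prob = viterbi(log_trans, np.log(emit))
phones = ["f", "ay", "v"]
print([phones[s] for s in best_path])
```

Unlike the exhaustive enumeration, the work grows linearly with the number of frames.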

Pruner
Uses algorithms to weed out low-scoring paths during decoding.
- Viterbi is more efficient than brute force, but can still be slow on large search graphs.
- The search manager often uses a pruner to narrow the possible paths and speed up search.
- It commonly prunes based on an absolute maximum number of paths or a probability threshold relative to the currently most probable path.
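
Both pruning criteria mentioned above fit in a few lines. In this sketch, hypotheses are a dictionary from a path label to its log score; the function name, parameter names and scores are illustrative and do not correspond to the Sphinx4 pruner API.

```python
def prune(hypotheses, beam_width, max_active):
    """Keep paths within `beam_width` of the best log score, at most `max_active`."""
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    # Relative threshold: drop anything too far below the current best path.
    survivors = {h: s for h, s in hypotheses.items() if s >= best - beam_width}
    # Absolute limit: keep only the top-scoring paths.
    ranked = sorted(survivors.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_active])


# Made-up log scores for four competing partial paths.
active = {"five": -10.0, "nine": -12.0, "fine": -11.0, "six": -40.0}
kept = prune(active, beam_width=5.0, max_active=2)
print(kept)
```

A tighter beam speeds up the search but risks discarding the path that would have turned out best later on.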

Result Words! Finally, the result is the words contained in the best fit path through the search graph

Word Error Rate
The most common metric of recognition accuracy: the number of modification operations (insertions, deletions and substitutions) required to transform the recognized sentence into the reference sentence.

Word Error Rate example
Reference: “This is a reference sentence.”
Result: “This is neuroscience.”

Word Error Rate
Reference: “This is a reference sentence.”
Result: “This is neuroscience.”
Requires 2 deletions and 1 substitution. Errors: 1 deletion (a), 1 deletion (sentence), 1 substitution (neuroscience for reference).


Word Error Rate
Reference: “This is a reference sentence.”
Result: “This is neuroscience.”
Alignment of “a reference sentence”: D S D
(2 deletions + 1 substitution) / reference length of 5 = 0.6, i.e. 60%

Installation details http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4 Student report on NLP course web site