Published by Joan Walters. Modified over 9 years ago.
1
Automatic Speech Recognition Introduction
2
The Human Dialogue System
4
Computer Dialogue Systems
Stages: Audition -> Automatic Speech Recognition -> Natural Language Understanding -> Dialogue Management / Planning -> Natural Language Generation -> Text-to-speech
Data passed between stages: signal -> words -> logical form -> words -> signal
5
Computer Dialogue Systems
Stages: Audition -> ASR -> NLU -> Dialogue Mgmt. / Planning -> NLG -> Text-to-speech
Data passed between stages: signal -> words -> logical form -> words -> signal
6
Parameters of ASR Capabilities
Different types of tasks with different difficulties:
–Speaking mode (isolated words/continuous speech)
–Speaking style (read/spontaneous)
–Enrollment (speaker-independent/dependent)
–Vocabulary (small < 20 words/large > 20k words)
–Language model (finite state/context sensitive)
–Signal-to-noise ratio (high > 30 dB/low < 10 dB)
–Transducer (high quality microphone/telephone)
7
The Noisy Channel Model (Shannon)
[Diagram: a message passes through a noisy channel]
Channel + message = Signal
Decoding model: find Message* = argmax P(Message|Signal)
But how do we represent each of these things?
8
What are the basic units for acoustic information?
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable.
Words are good units for small-vocabulary SR, but not a good choice for large-vocabulary and continuous SR:
–Each word is treated individually, which implies a large amount of training data and storage.
–The recognition vocabulary may contain words that never appeared in the training data.
–It is expensive to model interword coarticulation effects.
9
Why phones are better units than words: an example
10
"SAY BITE AGAIN" spoken so that the phonemes are separated in time (recorded sound and spectrogram)
11
"SAY BITE AGAIN" spoken normally
12
And why phones are still not the perfect choice Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent). However, each word is not a sequence of independent phonemes! Our articulators move continuously from one position to another. The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc. Different realizations of a phoneme are called allophones.
13
Example: different spectrograms for “eh”
14
Triphone model
Each triphone captures facts about the preceding and following phone.
Monophone: p, t, k
Triphone: iy-p+aa
a-b+c means “phone b, preceded by phone a, followed by phone c”
In practice, systems use on the order of 100,000 3phones, and the 3phone model is the one currently used (e.g. Sphinx).
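The a-b+c notation can be illustrated with a short sketch that expands a phone sequence into triphone labels. The function name and the "sil" padding symbol for word edges are assumptions made for this example, not something the slides specify.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into a-b+c triphone labels.

    `boundary` is a hypothetical padding symbol used at the sequence edges.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Three monophones become three context-dependent units.
labels = to_triphones(["k", "ae", "t"])
print(labels)
```

Each phone keeps its identity (the middle symbol) while its label records both neighbours, which is why the number of distinct units grows so quickly.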
15
Parts of an ASR System
Feature Calculation: produces acoustic vectors (x_t)
Acoustic Modeling: maps acoustics to 3phones (e.g. k, @)
Pronunciation Modeling: maps 3phones to words
  cat: k@t
  dog: dog
  mail: mAl
  the: D&, DE
  …
Language Modeling: strings words together
  cat dog: 0.00002
  cat the: 0.0000005
  the cat: 0.029
  the dog: 0.031
  the mail: 0.054
  …
16
Feature calculation interpretations
17
Feature calculation
[Spectrogram: frequency vs. time]
Find energy at each time step in each frequency channel
18
Feature calculation
[Spectrogram: frequency vs. time]
Take Inverse Discrete Fourier Transform to decorrelate frequencies
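A minimal sketch of these two steps for a single frame, in pure Python. It assumes a naive DFT for the channel energies and uses a DCT-II on the log energies, a real-valued transform often used in place of the inverse DFT because the log spectrum is real and symmetric. It leaves out windowing and the mel filterbank that a real front end would apply; the function names and the test tone are made up for illustration.

```python
import cmath
import math


def power_spectrum(frame):
    """Energy in each frequency channel of one frame (naive O(n^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def cepstral_coefficients(frame, num_coeffs):
    """Log channel energies followed by a DCT-II to decorrelate them."""
    log_energy = [math.log(p + 1e-10) for p in power_spectrum(frame)]
    m = len(log_energy)
    return [
        sum(e * math.cos(math.pi * k * (i + 0.5) / m)
            for i, e in enumerate(log_energy))
        for k in range(num_coeffs)
    ]


# Hypothetical 32-sample frame holding a pure tone in frequency channel 4.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
energies = power_spectrum(frame)
features = cepstral_coefficients(frame, 6)
```

Running this once per time step yields one feature vector per frame, which is the kind of sequence the later slides show.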
19
Feature calculation
Input: [speech signal]
Output: acoustic observation vectors
-0.1 0.3 1.4 -1.2 2.3 2.6 …
0.2 0.1 1.2 -1.2 4.4 2.2 …
-6.1 -2.1 3.1 2.4 1.0 2.2 …
0.2 0.0 1.2 -1.2 4.4 2.2 …
…
20
Robust Speech Recognition
Different schemes have been developed for dealing with noise and reverberation:
–Additive noise: reduce effects of particular frequencies
–Convolutional noise: remove effects of linear filters (cepstral mean subtraction)
Cepstrum: Fourier transform of the logarithm of the spectrum
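Cepstral mean subtraction can be sketched in a few lines. A fixed linear filter multiplies the spectrum, so after the logarithm it becomes a constant offset added to every cepstral vector; subtracting the per-utterance mean removes that offset. The function name and the toy vectors below are assumptions for illustration.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean taken over one utterance."""
    if not frames:
        return []
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


# Hypothetical two-dimensional cepstral vectors for a short utterance.
clean = [[1.0, 2.0], [3.0, 6.0], [2.0, 1.0]]
# The same utterance through a channel that adds a constant cepstral offset.
filtered = [[c0 + 0.5, c1 - 1.5] for c0, c1 in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_filtered = cepstral_mean_subtraction(filtered)
```

After normalization the two versions of the utterance give the same features up to rounding, which is the sense in which the channel effect is removed.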
21
How do we map from vectors to word sequences?
-0.1 0.3 1.4 -1.2 2.3 2.6 …
0.2 0.1 1.2 -1.2 4.4 2.2 …
-6.1 -2.1 3.1 2.4 1.0 2.2 …
0.2 0.0 1.2 -1.2 4.4 2.2 …
-> ??? -> “That you” …
22
HMM (again)!
-0.1 0.3 1.4 -1.2 2.3 2.6 …
0.2 0.1 1.2 -1.2 4.4 2.2 …
-6.1 -2.1 3.1 2.4 1.0 2.2 …
0.2 0.0 1.2 -1.2 4.4 2.2 …
-> Pattern recognition with HMMs -> “That you” …
23
ASR using HMMs
Try to solve P(Message|Signal) by breaking the problem up into separate components
Most common method: Hidden Markov Models
–Assume that a message is composed of words
–Assume that words are composed of sub-word parts (3phones)
–Assume that 3phones have some sort of acoustic realization
–Use probabilistic models for matching acoustics to phones to words
24
Creating HMMs for word sequences: Context independent units 3phones
25
“Need” 3phone model
26
Hierarchical system of HMMs HMM of a triphone Higher level HMM of a word Language model
27
To simplify, let’s now ignore lower level HMM
Each phone node has a “hidden” HMM (H²MM)
28
HMMs for ASR
Word sequence: go home
Markov model backbone composed of sequences of 3phones (hidden because we don’t know correspondences): g o h o m
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
One alignment of backbone states to observations: g o o o o o o h m m
Each line represents a probability estimate (more later)
29
HMMs for ASR
Word sequence: go home
Markov model backbone composed of phones (hidden because we don’t know correspondences): g o h o m
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
Even with same word hypothesis, can have different alignments (red arrows). Also, have to search over all word hypotheses
30
For every HMM (in hierarchy): compute the max probability sequence
[Diagram: word network “that” -> “he” / “you” with transitions p(he|that), p(you|that); phone nodes th a t, h iy, y uw, h iy, sh uh d]
X = acoustic observations, (3)phones, phone sequences
W = (3)phones, phone sequences, word sequences
COMPUTE:
argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X) = argmax_W P(X|W) P(W)
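The last step of the derivation drops P(X) because it is the same for every candidate W. A small sketch of the resulting decision rule follows, assuming the language-model values listed earlier in these slides serve as P(W) and using acoustic likelihoods P(X|W) that are made up purely for illustration.

```python
import math

# Language-model values P(W) as listed on the "Parts of an ASR System" slide.
prior = {"the cat": 0.029, "the dog": 0.031, "the mail": 0.054}
# Hypothetical acoustic likelihoods P(X|W) for one utterance (invented numbers).
likelihood = {"the cat": 1e-5, "the dog": 4e-5, "the mail": 1e-6}


def decode(prior, likelihood):
    """Return argmax_W P(X|W) P(W), computed in the log domain."""
    # P(X) does not depend on W, so it is left out of the comparison.
    return max(prior, key=lambda w: math.log(likelihood[w]) + math.log(prior[w]))


best = decode(prior, likelihood)
```

With these numbers the acoustic evidence outweighs the higher prior of "the mail", so the rule picks "the dog". Working with log probabilities avoids underflow once many small factors are multiplied.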
31
Search
When trying to find W* = argmax_W P(W|X), need to look at (in theory)
–All possible (3phone, word, etc.) sequences
–All possible segmentations/alignments of W and X
Generally, this is done by searching the space of W
–Viterbi search: dynamic programming approach that looks for the most likely path
–A* search: alternative method that keeps a stack of hypotheses around
If |W| is large, pruning becomes important
Need also to estimate transition probabilities
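A minimal sketch of Viterbi search over a single small HMM, assuming discrete observation symbols and a non-empty observation sequence. The left-to-right model for the phones f, ay, v and all of its probabilities are invented for the example; a real decoder would run the same recursion over the full search graph with continuous acoustic likelihoods.

```python
import math


def log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start, trans, emit):
    """Return (log probability, state path) of the most likely path."""
    score = {s: log(start.get(s, 0)) + log(emit[s].get(observations[0], 0))
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda p: score[p] + log(trans[p].get(s, 0)))
            new_score[s] = (score[prev] + log(trans[prev].get(s, 0))
                            + log(emit[s].get(obs, 0)))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return score[last], path[::-1]


# Hypothetical left-to-right HMM; every number here is made up.
states = ["f", "ay", "v"]
start = {"f": 1.0}
trans = {"f": {"f": 0.5, "ay": 0.5}, "ay": {"ay": 0.5, "v": 0.5}, "v": {"v": 1.0}}
emit = {
    "f": {"F": 0.8, "A": 0.1, "V": 0.1},
    "ay": {"F": 0.1, "A": 0.8, "V": 0.1},
    "v": {"F": 0.1, "A": 0.1, "V": 0.8},
}
log_prob, path = viterbi(["F", "F", "A", "A", "V"], states, start, trans, emit)
```

The dynamic program keeps only the best-scoring path into each state at each time step, so the cost grows linearly with the number of observations rather than with the number of possible alignments.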
32
Training: speech corpora
Have a speech corpus at hand
–Should have word (and preferably phone) transcriptions
–Divide into training, development, and test sets
Develop models of prior knowledge
–Pronunciation dictionary
–Grammar, lexical trees
Train acoustic models
–Possibly realigning corpus phonetically
33
Acoustic Model
-0.1 0.3 1.4 -1.2 2.3 2.6 …
0.2 0.1 1.2 -1.2 4.4 2.2 …
-6.1 -2.1 3.1 2.4 1.0 2.2 …
0.2 0.0 1.2 -1.2 4.4 2.2 …
Phone labels: dh a a t
Assume that you can label each vector with a phonetic label
Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks): a Gaussian N_a for phone a, giving P(X|state=a)
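The "collect the examples and build a Gaussian" step can be sketched as follows. To stay short, the example assumes one-dimensional features and a single Gaussian per phone; practical systems use vector features and usually mixtures. The labelled values and function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussians(labeled_frames):
    """Fit a 1-D Gaussian (mean, variance) to the frames of each phone label."""
    by_phone = defaultdict(list)
    for phone, value in labeled_frames:
        by_phone[phone].append(value)
    models = {}
    for phone, values in by_phone.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        models[phone] = (mean, max(var, 1e-6))  # variance floor avoids division by zero
    return models


def log_likelihood(x, model):
    """log P(x | state) under a 1-D Gaussian."""
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical labelled scalar features (phone label, value).
data = [("dh", -0.1), ("dh", 0.3), ("aa", 2.3), ("aa", 2.6), ("aa", 2.2)]
models = fit_gaussians(data)
scores = {phone: log_likelihood(2.4, m) for phone, m in models.items()}
```

A new frame is then scored against every phone model, and those scores are the observation likelihoods the HMM search consumes.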
34
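Pronunciation model
Pronunciation model gives connections between phones and words
[Diagram: phone HMM chain dh -> a -> t, with self-loop probabilities p_dh, p_a, p_t and exit probabilities 1-p_dh, 1-p_a, 1-p_t]
Multiple pronunciations (tomato):
[Diagram: pronunciation network t, (ow | ah), m, (ey | ah), t, ow]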
35
Training models for a sound unit
36
Language Model
Language model gives connections between words (e.g., bigrams: probability of two-word sequences)
[Diagram: “that” (dh a t) -> “he” (h iy) with p(he|that); “that” -> “you” (y uw) with p(you|that)]
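Bigram probabilities such as p(he|that) and p(you|that) can be estimated by counting. The sketch below uses plain maximum-likelihood estimates without smoothing, and the three-sentence corpus is invented for illustration; the sentence boundary markers are a common convention assumed here.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Maximum-likelihood estimates of p(w2 | w1), without smoothing."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])          # every word that has a successor
        bigram_counts.update(zip(words, words[1:]))
    return {pair: count / history_counts[pair[0]]
            for pair, count in bigram_counts.items()}


# Hypothetical toy corpus.
corpus = ["that he said", "that you said", "that you did"]
model = train_bigram_model(corpus)
p_you = model[("that", "you")]
p_he = model[("that", "he")]
```

Unseen word pairs get no entry at all here, which is why real language models add smoothing or backoff.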
37
Lexical trees
START     S-T-AA-R-TD
STARTING  S-T-AA-R-DX-IX-NG
STARTED   S-T-AA-R-DX-IX-DD
STARTUP   S-T-AA-R-T-AX-PD
START-UP  S-T-AA-R-T-AX-PD
[Tree: shared prefix S -> T -> AA -> R, branching to TD (start); DX -> IX, then NG (starting) or DD (started); T -> AX -> PD (startup, start-up)]
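A lexical tree is a prefix tree over phone sequences, so words that begin with the same phones share nodes. The sketch below builds one from the five pronunciations on this slide; the nested-dictionary representation and the function names are choices made for the example.

```python
def build_lexical_tree(lexicon):
    """Build a prefix tree over phone sequences; words are stored where they end."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones.split("-"):
            node = node["children"].setdefault(phone, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def count_nodes(node):
    """Number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())


lexicon = {
    "START": "S-T-AA-R-TD",
    "STARTING": "S-T-AA-R-DX-IX-NG",
    "STARTED": "S-T-AA-R-DX-IX-DD",
    "STARTUP": "S-T-AA-R-T-AX-PD",
    "START-UP": "S-T-AA-R-T-AX-PD",
}
tree = build_lexical_tree(lexicon)
flat_phones = sum(len(p.split("-")) for p in lexicon.values())
tree_phones = count_nodes(tree) - 1  # exclude the root
```

For these entries the flat list holds 33 phone instances while the tree needs 12 phone nodes, so the search evaluates the shared prefix S-T-AA-R only once.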
38
Judging the quality of a system
Usually, ASR performance is judged by the word error rate
ErrorRate = 100*(Subs + Ins + Dels) / Nwords
REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I
100*(1S+1I+1D)/5 = 60%
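The counts of substitutions, insertions and deletions come from a minimum edit distance alignment between the reference and the recognized words. A minimal sketch, assuming whitespace tokenization and a non-empty reference (the function name is hypothetical):

```python
def word_error_rate(reference, hypothesis):
    """100 * (Subs + Ins + Dels) / Nwords via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)


wer = word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW")
print(wer)
```

Because insertions count as errors but the denominator is the reference length, the rate can exceed 100% for very poor output.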
39
Judging the quality of a system
Usually, ASR performance is judged by the word error rate
This assumes that all errors are equal
–Also, a bit of a mismatch between optimization criterion and error measurement
Other (task specific) measures sometimes used
–Task completion
–Concept error rate
40
Sphinx4
http://cmusphinx.sourceforge.net
41
Sphinx4 Implementation
43
Frontend Feature extractor
44
Frontend Feature extractor Mel-Frequency Cepstral Coefficients (MFCCs) Feature vectors
45
Hidden Markov Models (HMMs) Acoustic Observations
46
Hidden Markov Models (HMMs) Acoustic Observations Hidden States
47
Hidden Markov Models (HMMs) Acoustic Observations Hidden States Acoustic Observation likelihoods
48
Hidden Markov Models (HMMs) “Six”
49
Sphinx4 Implementation
50
Linguist
Constructs the search graph of HMMs from:
–Acoustic model
–Statistical Language model ~or~
–Grammar
–Dictionary
51
Acoustic Model Constructs the HMMs of phones Produces observation likelihoods
52
Acoustic Model Constructs the HMMs for units of speech Produces observation likelihoods Sampling rate is critical! WSJ vs. WSJ_8k
53
Acoustic Model Constructs the HMMs for units of speech Produces observation likelihoods Sampling rate is critical! WSJ vs. WSJ_8k TIDIGITS, RM1, AN4, HUB4
54
Language Model Word likelihoods
55
Language Model
ARPA format example:
1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
56
Grammar (example: command language)
public = ;
public = (please | kindly | could you ) *;
public = [ please | thanks | thank you ];
= ;
= (open | close | delete | move);
= [the | a] (window | file | menu);
57
Dictionary Maps words to phoneme sequences
58
Dictionary
Example from cmudict.06d
POULTICE   P OW L T AH S
POULTICES  P OW L T AH S IH Z
POULTON    P AW L T AH N
POULTRY    P OW L T R IY
POUNCE     P AW N S
POUNCED    P AW N S T
POUNCEY    P AW N S IY
POUNCING   P AW N S IH NG
POUNCY     P UW NG K IY
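Entries in this layout (a word followed by its phones, separated by whitespace) are easy to load into a lookup table. The sketch below assumes exactly that layout and does not handle comment lines or alternate-pronunciation markers; the function name is hypothetical.

```python
def parse_dictionary(text):
    """Parse lines of the form 'WORD PH1 PH2 ...' into {word: [phones]}."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            entries[parts[0]] = parts[1:]
    return entries


sample = """\
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCING P AW N S IH NG
"""
pronunciations = parse_dictionary(sample)
```

The recognizer uses such a table in the other direction as well, expanding every word in the vocabulary into the phone sequence whose HMMs are chained together.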
59
Sphinx4 Implementation
60
Search Graph
62
Can be statically or dynamically constructed
63
Sphinx4 Implementation
64
Decoder Maps feature vectors to search graph
65
Search Manager Searches the graph for the “best fit”
66
Search Manager
Searches the graph for the “best fit”
P(sequence of feature vectors | word/phone), a.k.a. P(O|W) -> “how likely is the input to have been generated by the word”
67
f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…
68
Viterbi Search
Time: O1 O2 O3
69
Pruner Uses algorithms to weed out low scoring paths during decoding
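One common way to weed out low scoring paths is beam pruning: keep only the hypotheses whose score lies within a fixed margin of the current best. The sketch below is a generic illustration under that assumption, not a description of how Sphinx4's pruner is implemented; the hypotheses and log scores are invented.

```python
def beam_prune(hypotheses, beam_width):
    """Keep hypotheses whose log score is within `beam_width` of the best one."""
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam_width}


# Hypothetical partial paths with made-up log scores.
active = {"g o h": -12.0, "g o m": -13.5, "n o m": -25.0}
survivors = beam_prune(active, beam_width=5.0)
```

A narrow beam makes decoding faster but risks discarding a path that would have won later, so the width trades speed against search errors.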
70
Result Words!
71
Word Error Rate
Most common metric
Measures the number of modifications needed to transform the recognized sentence into the reference sentence
72
Word Error Rate Reference: “This is a reference sentence.” Result: “This is neuroscience.”
73
Word Error Rate Reference: “This is a reference sentence.” Result: “This is neuroscience.” Requires 2 deletions, 1 substitution
75
Word Error Rate Reference: “This is a reference sentence.” Result: “This is neuroscience.” D S D
76
Installation details
http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4
Student report on NLP course web site