The Acoustic/Lexical model: Exploring the phonetic units; Triphones/Senones in action. Ofer M. Shir Speech Recognition Seminar, 15/10/2003 Leiden Institute of Advanced Computer Science
Theoretical Background – Unit Selection When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable. Words are good units for small-vocabulary SR, but not a good choice for large-vocabulary continuous SR: Each word is treated individually, with no data sharing, which implies a large amount of training data and storage. The recognition vocabulary may contain words that never appeared in the training data. It is expensive to model interword coarticulation effects.
Theoretical Background - Phonemes The alternative unit is the phoneme. Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent). However, a word is not simply a sequence of independent phonemes! Our articulators move continuously from one position to another. The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc. Different realizations of a phoneme are called allophones.
Theoretical Background - Triphones The Triphone model is a phonetic model which takes into consideration both the left and the right neighbouring phonemes. Triphones are an example of allophones. This model captures the most important coarticulatory effects, which makes it a very powerful model. The cost: context-dependent models generally increase the number of parameters, so training becomes much harder. Notice that in English there are more than 100,000 triphones! So far, however, we have assumed that every triphone context is different. We are motivated to find instances of similar contexts and merge them.
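A minimal sketch of how a phone sequence expands into triphone contexts, and of where a count above 100,000 can come from. The function name `to_triphones`, the `SIL` boundary symbol and the pronunciation W AH N for "ONE" are assumptions made for illustration. The count is the naive 50^3 upper bound from the "about 50 phonemes" figure above, not a measured inventory.

```python
def to_triphones(phones, boundary="SIL"):
    """Expand a phone sequence into (left, base, right) triphone contexts.

    `boundary` is a hypothetical silence symbol used to pad the word edges.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


# Assumed pronunciation of "ONE": W AH N
triphones = to_triphones(["W", "AH", "N"])

# Naive upper bound on distinct triphones with about 50 phonemes:
# every phoneme in every (left, right) context.
n_phonemes = 50
upper_bound = n_phonemes ** 3
```

Each base phone appears once per context, so the same phoneme AH in a different word would generally yield a different triphone.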
Theoretical Background - Senones Recall that each allophone model is an HMM, made of states, transitions and probability distributions; the bottom line is that some distributions can be tied. The basic idea is clustering, but rather than clustering the HMM models themselves, we shall cluster only the HMM states. Each cluster represents a set of similar Markov states, and is called a Senone. Senones provide not only improved recognition accuracy, but also a pronunciation-optimization capability.
Theoretical Background – Senonic Trees Reminder: a decision tree is a binary tree which classifies target objects by asking Yes/No questions in a hierarchical manner. The senonic decision tree classifies Markov states of triphones, represented in the training data, by asking linguistic questions. => The leaves of the senonic trees are the possible senones.
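A toy sketch of a senonic decision tree, to make the Yes/No questioning concrete. The phone classes, the questions, the tree shape and the senone names are all hypothetical; a real tree is grown from training data. Each internal node asks whether the left or right context belongs to a phone class, and each leaf names a senone.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical linguistic classes used as questions.
NASALS = frozenset({"M", "N", "NG"})
VOWELS = frozenset({"AA", "AH", "IY", "UW", "AX", "IX"})


@dataclass
class Node:
    senone: Optional[str] = None               # set only on leaves
    side: Optional[str] = None                 # "left" or "right" context
    phone_class: Optional[FrozenSet[str]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def classify(node, left, right):
    """Walk the tree for one triphone state and return the leaf's senone."""
    while node.senone is None:
        context = left if node.side == "left" else right
        node = node.yes if context in node.phone_class else node.no
    return node.senone


# Q1: is the left context a nasal?  Q2: is the right context a vowel?
tree = Node(side="left", phone_class=NASALS,
            yes=Node(senone="senone_0"),
            no=Node(side="right", phone_class=VOWELS,
                    yes=Node(senone="senone_1"),
                    no=Node(senone="senone_2")))

senone = classify(tree, left="W", right="AH")
```

Because the questions are about phone classes rather than specific triphones, a context absent from the training data still reaches some leaf.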
Sphinx III, A Short Review – Front End
Feature extraction (7-frame speech window around the current frame): cepstrum (12 elements), time-derivative cepstrum, second time-derivative cepstrum (12 elements), power (3 elements); 39 elements per feature vector.
Feature vectors and their analysis are inputs into the Gaussian mixtures fitting process.
Gaussian mixtures (39 elements): mean, variance, determinant.
Fetch phonetic data (senones!) from these Gaussian mixtures, using the well-trained machine: senones data (scoring table).
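A hedged sketch of how a 39-element feature vector could be scored against one senone's Gaussian mixture. It assumes diagonal covariances (suggested by the per-element "variance" above, but not confirmed by the slides). The function names and the toy mixture parameters are hypothetical and are not Sphinx III's API.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def senone_log_score(x, weights, means, variances):
    """Log-likelihood of x under a mixture, via log-sum-exp for stability."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


DIM = 39                                   # feature vector size from the slide
x = [0.0] * DIM                            # toy feature vector
weights = [0.5, 0.5]                       # toy two-component mixture
means = [[0.0] * DIM, [1.0] * DIM]
variances = [[1.0] * DIM, [1.0] * DIM]
score = senone_log_score(x, weights, means, variances)
```

Computing such a score for every active senone in a frame gives one row of a scoring table in the sense used above.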
Sphinx III – the implementation Handling a single word: evaluating each HMM according to the input, using the Viterbi Search. Every phone gets an HMM (5-state HMM), whose states map to senones. [Figure: word HMMs for ONE (W AH N), TWO (T UW), THREE (TH R IY).]
The Viterbi Search - basics Instantaneous score: how well a given HMM state matches the feature vector. Path: A sequence of HMM states traversed during a given segment of feature vectors. Path-score: Product of instantaneous scores and state transition probabilities corresponding to a given path. The Viterbi search: An efficient lattice structure and algorithm for computing the best path score for a given segment of feature vectors.
The Viterbi Search - demo [Figure: lattice of HMM states over time.] Initial state initialized with path-score = 1.0
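The Viterbi Search (demo-contd.)
[Figure: lattice of HMM states over time. Legend: state with best path-score; state with path-score < best; state without a valid path-score.]
$$P_j(t) = \max_i \left[ P_i(t-1)\, a_{ij}\, b_j(t) \right]$$
where $P_j(t)$ is the total path-score ending up at state $j$ at time $t$, $a_{ij}$ is the state transition probability from $i$ to $j$, and $b_j(t)$ is the score for state $j$, given the input at time $t$.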
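The recursion for the best path-score can be sketched as follows. This is an illustrative implementation, not the Sphinx III code. The function name, the argument layout and the toy 3-state left-to-right model are assumptions, and the sketch multiplies the initial path-score by the first frame's instantaneous score.

```python
def viterbi(init, trans, emit):
    """Best path-score and state path.

    init[j]     : initial path-score of state j
    trans[i][j] : state transition probability a_ij
    emit[t][j]  : instantaneous score b_j(t) of state j for frame t
    """
    n = len(init)
    scores = [init[j] * emit[0][j] for j in range(n)]
    backptrs = []
    for t in range(1, len(emit)):
        new_scores, ptrs = [], []
        for j in range(n):
            # Best predecessor i for state j at time t.
            best_i = max(range(n), key=lambda i: scores[i] * trans[i][j])
            new_scores.append(scores[best_i] * trans[best_i][j] * emit[t][j])
            ptrs.append(best_i)
        scores = new_scores
        backptrs.append(ptrs)
    # Trace back from the best final state.
    last = max(range(n), key=lambda j: scores[j])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return scores[last], path


# Toy left-to-right model; the initial state starts with path-score 1.0.
init = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1, 0.1],
        [0.2, 0.8, 0.1],
        [0.1, 0.2, 0.9]]
best_score, best_path = viterbi(init, trans, emit)
```

Products of probabilities shrink quickly over long utterances, so practical decoders usually work with log scores and replace the products by sums.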
Continuous Speech Recognition Add transitions from word ends to beginnings, and run the Viterbi Search. [Figure: word HMMs for ONE (W AH N), TWO (T UW), THREE (TH R IY), with transitions from word ends to word beginnings.]
Cross-Word Triphone Modeling Sphinx III uses “triphone” or “phoneme-in-context” HMMs. Example word: ONE (W AH N). Context-dependent AH HMM. Separate N HMM instances for each possible right context. Inherited left context is propagated along with path-scores, and dynamically modifies the state model; remember to inject the left context into the entry state.
Sphinx-III - Lexical Tree Structure
Nodes shared if triphone Senone-Sequence-ID (SSID) identical:
START      S-T-AA-R-TD
STARTING   S-T-AA-R-DX-IX-NG
STARTED    S-T-AA-R-DX-IX-DD
STARTUP    S-T-AA-R-T-AX-PD
START-UP   S-T-AA-R-T-AX-PD
[Figure: lexical tree of these pronunciations, with shared nodes and leaves start, starting, started, startup, start-up.]
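A simplified sketch of prefix sharing in a lexical tree, using the five pronunciations from this slide. It shares nodes on identical phone labels only, which is an assumption made for brevity: the slide shares nodes when the triphone SSIDs are identical, so the real tree can split nodes this sketch merges. The helper names and the `#words` marker key are hypothetical.

```python
def build_lextree(lexicon):
    """Build a nested-dict prefix tree keyed by phone labels."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        # Words ending at this node ("#words" is a hypothetical marker key).
        node.setdefault("#words", []).append(word)
    return root


def count_nodes(tree):
    """Count phone nodes, ignoring the word markers."""
    return sum(1 + count_nodes(child)
               for key, child in tree.items() if key != "#words")


lexicon = {
    "START":    ["S", "T", "AA", "R", "TD"],
    "STARTING": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "STARTED":  ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "STARTUP":  ["S", "T", "AA", "R", "T", "AX", "PD"],
    "START-UP": ["S", "T", "AA", "R", "T", "AX", "PD"],
}
tree = build_lextree(lexicon)
shared = count_nodes(tree)                       # nodes after sharing
flat = sum(len(p) for p in lexicon.values())     # nodes without sharing
```

In this toy case the five pronunciations hold 33 phones in total, while the tree needs far fewer nodes because the common prefix S-T-AA-R is stored once.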
Cross-Word Triphones (left context) Root nodes replicated for left context. Nodes are shared if SSIDs are identical. [Figure: the lexical tree for start, starting, started, startup, start-up; left-contexts feed S-models for different left contexts at the root, which lead to the rest of the lextree.]
Cross-Word Triphones (right context) Leaf node: triphones for all right contexts. HMM states for triphones; picking states; composite states (average of component states); composite SSID model.
Sphinx III, the Acoustic Model – File List Summary
mdef.c – definition of the basic phone and triphone HMMs, and the mapping of each HMM state to a senone and to its transition matrix.
dict.c – pronunciation dictionary structure.
hmm.c – implements HMM evaluation using the Viterbi Search, which means fetching the best senone score. Note that the HMM data structures, defined in hmm.h, are hardwired to 2 possible HMM topologies: 3-state and 5-state left-to-right HMMs.
lextree.c – lexical tree search.
Presentation Resources:
Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Raj Reddy: Spoken Language Processing: A Guide to Theory, Algorithm and System Development (hardcover, 980 pages; Prentice Hall PTR; 1st edition, April 25, 2001). Chapters 9, 13.
Hwang, M., Huang, X., Alleva, F.: “Predicting Unseen Triphones with Senone”.
Hwang et al.: “Shared Distribution Hidden Markov Models for Speech Recognition”.
Hwang et al.: “Subphonetic Modeling with Markov States – Senones”.
Sphinx-III documentation: a presentation made by Mosur Ravishankar, found in the /doc/ folder of the sphinx-III package.
“Sphinx-III bible”: a presentation made by Edward Lin.
“I shall never believe that God plays dice with the world, but maybe machines should play dice with human capabilities…” John Doe