State Tying for Context Dependent Phoneme Models
K. Beulen, E. Bransch, H. Ney
Lehrstuhl für Informatik VI, RWTH Aachen – University of Technology, D-52056 Aachen
EUROSPEECH 97, September 1997
Presented by Hsu Ting-Wei, 2006.05.01
Reference
J. J. Odell, The Use of Context in Large Vocabulary Speech Recognition, Ph.D. Thesis, Cambridge University, Cambridge, March 1995.
羅應順, Construction of a Baseline Recognition System for Spontaneous Mandarin Speech (自發性中文語音基本辨認系統之建立), Institute of Communication Engineering, National Chiao Tung University, June 2005.
The HTK Book, HHEd section.
1. Introduction
In this paper several modifications of two well-known methods for parameter reduction of Hidden Markov Models by state tying are described. The two methods are:
- a data-driven method which clusters triphone states with a bottom-up algorithm
- a top-down method which grows decision trees for triphone states
(Diagram: state tying is done either bottom-up / data-driven or top-down / decision tree.)
1. Introduction (cont.)
We investigate the following aspects:
- the possible reduction of the word error rate by state tying
- the consequences of different distance measures for the data-driven approach
- modifications of the original decision tree approach, such as node merging
Corpus
Test / evaluation corpora:
- 5000 word vocabulary of the WSJ November 92 task
- 3000 word vocabulary of the VERBMOBIL '95 task
Reduction of the word error rate by state tying, compared to simple triphone models:
- 14% for the WSJ task
- 5% for the VERBMOBIL task
2. State tying
Aim: reduce the number of parameters of the speech recognition system without a significant degradation in modeling accuracy.
Steps:
1. Establish the triphone list of the training corpus.
2. Estimate the mean and variance of each triphone state by using a segmentation of the training data.
3. Subdivide the triphone states into subsets according to their central phoneme and their position within the phoneme model (see the sketch below).
4. Inside these sets, tie the states together according to a distance measure.
Additionally, it has to be ensured that every model contains a sufficient amount of training data.
(Diagram: two syllables, ㄅㄧㄠ and ㄅㄧㄩ, sharing tied state means mean1 / mean2.)
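A minimal sketch of the subdivision in step 3, grouping hypothetical triphone states by central phoneme and state position; the left-central+right triphone notation and the TriphoneState fields are assumptions for illustration, not from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TriphoneState:
    name: str         # e.g. "b-i+au", assumed left-central+right notation
    state_index: int  # position of the state within the phoneme HMM
    mean: list        # state mean estimated from the training segmentation
    var: list         # state variance estimated from the training segmentation

def subdivide(states):
    """Group triphone states by (central phoneme, state position);
    tying is later performed only inside each of these subsets."""
    subsets = defaultdict(list)
    for s in states:
        central = s.name.split("-")[1].split("+")[0]
        subsets[(central, s.state_index)].append(s)
    return subsets
```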
3. Data-driven method
Two steps:
1. Triphone states that are very similar according to a distance measure are clustered together (only states seen at least 50 times in the training corpus are used).
2. States which do not contain enough data are clustered together with their nearest neighbor.
Drawback: for triphones which were not observed in the training corpus, no tied model is available, so backing-off models are needed. Usually these models are simple generalizations of the triphones, such as diphones or monophones.
Cluster criteria (see the sketch below):
- the approximate divergence of two states
- the log-likelihood difference
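A hedged sketch of the two cluster criteria for diagonal-covariance Gaussian states: a symmetric Kullback-Leibler divergence as a stand-in for the paper's approximate divergence, and the decrease in log-likelihood caused by merging two states. The exact definitions used in the paper may differ.

```python
import numpy as np

def sym_kl_divergence(mu1, var1, mu2, var2):
    """Symmetric Kullback-Leibler divergence between two diagonal Gaussians
    (stand-in for the paper's approximate divergence)."""
    return 0.5 * np.sum(var1 / var2 + var2 / var1 - 2.0
                        + (mu1 - mu2) ** 2 * (1.0 / var1 + 1.0 / var2))

def log_likelihood_loss(n1, mu1, var1, n2, mu2, var2):
    """Decrease in total log-likelihood when two Gaussian states are merged
    into a single diagonal Gaussian (additive constants cancel);
    n1 and n2 are the state occupation counts."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # pooled second moments give the variance of the merged state
    var = (n1 * (var1 + mu1 ** 2) + n2 * (var2 + mu2 ** 2)) / n - mu ** 2
    return 0.5 * (n * np.sum(np.log(var))
                  - n1 * np.sum(np.log(var1))
                  - n2 * np.sum(np.log(var2)))
```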
3. Data-driven method (cont.)
(Flow chart on the slide: data-driven clustering, followed by state tying and mixture incrementing.)
4. Decision tree method
Construction steps (see the sketch below):
1. Collect all states in the root of the tree.
2. Find the binary question which yields the maximum gain in log-likelihood and split the data into two parts, one answering "Yes" and the other "No".
3. Repeat step 2 until the maximum gain in log-likelihood falls below a threshold.
Advantage: no backing-off models are needed, because with the decision trees one can find a generalized model for every triphone state in the recognition vocabulary.
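A rough sketch of the split criterion, assuming every triphone state is summarized by its occupation count, mean and variance and modeled by a single diagonal Gaussian. The dictionary layout of the states and the representation of questions as predicates are assumptions for illustration.

```python
import numpy as np

def pooled_gaussian(states):
    """Pool the sufficient statistics of a set of triphone states; each state
    is a dict with keys 'count', 'mean', 'var' plus its phonetic context."""
    n = sum(s["count"] for s in states)
    mean = sum(s["count"] * s["mean"] for s in states) / n
    second = sum(s["count"] * (s["var"] + s["mean"] ** 2) for s in states) / n
    return n, mean, second - mean ** 2

def node_log_likelihood(states):
    """Log-likelihood of a node's data under a single diagonal Gaussian with
    maximum-likelihood parameters (irrelevant constants dropped)."""
    n, _, var = pooled_gaussian(states)
    dim = len(var)
    return -0.5 * n * (dim * (1.0 + np.log(2.0 * np.pi)) + np.sum(np.log(var)))

def best_question(states, questions):
    """Return the binary question with the highest gain in log-likelihood and
    the gain itself; `questions` maps question names to predicates on a state."""
    parent = node_log_likelihood(states)
    best, best_gain = None, float("-inf")
    for name, q in questions.items():
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue
        gain = node_log_likelihood(yes) + node_log_likelihood(no) - parent
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```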
4. Decision tree method (cont.)
Example question: Is the left context a word boundary?
Bias: most questions ask for a very specific phoneme property (e.g. whether the context is a liquid such as l or r), so most triphone states belong to the right subtree.
Heterogeneous leaves: leaves for which all the questions from the root to the leaf were answered with "No".
(Table on the slide: the number of triphone states and the number of observations belonging to each such leaf.)
4. Decision tree method (cont.)
We also tested the following modifications of the original method:
4.1 No ad-hoc subdivision
4.2 Different triphone lists
4.3 Cross validation
4.4 Gender dependent (GD) coupling
4.5 Node merging
4.6 Full covariance matrix
4.1 No ad-hoc subdivision
Instead of one distinct tree for every phoneme and state position, a single tree is grown. Either:
- split every node until it contains only states with one central phoneme, or
- split every node until every word in the vocabulary can be discriminated.
In the experiments we found that such heterogeneous nodes are very rare and do not introduce any ambiguities in the lexicon.
4.2 Different triphone lists
(Why the counts are larger than with no tying (780): they are counted per state.)
Four triphone lists were compared; list 4 contains those triphones from the training corpus which can also be found in the test lexicon (the descriptions of lists 1-3 appear only on the original slide).
4.3 Cross validation
Triphone list 1 (triphones with at least 50 observations) was used to estimate the Gaussian models of the tree nodes.
Triphone list 2 (triphones with at least 20 observations) was used to cross-validate the splits.
At every node, the split with the highest gain in log-likelihood was made which also achieved a positive gain for the triphones of list 2 (see the sketch below).
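Building on the split-gain sketch in the previous section (it reuses node_log_likelihood from there), a cross-validated split selection could look roughly like this; the held-out list corresponds to triphone list 2.

```python
def cross_validated_question(train_states, heldout_states, questions):
    """Choose the question with the highest gain on the training list that
    also yields a positive gain on the held-out (cross-validation) list."""
    parent_train = node_log_likelihood(train_states)
    parent_held = node_log_likelihood(heldout_states)
    best, best_gain = None, float("-inf")
    for name, q in questions.items():
        yes_t = [s for s in train_states if q(s)]
        no_t = [s for s in train_states if not q(s)]
        yes_h = [s for s in heldout_states if q(s)]
        no_h = [s for s in heldout_states if not q(s)]
        if not (yes_t and no_t and yes_h and no_h):
            continue
        gain_t = node_log_likelihood(yes_t) + node_log_likelihood(no_t) - parent_train
        gain_h = node_log_likelihood(yes_h) + node_log_likelihood(no_h) - parent_held
        if gain_h > 0.0 and gain_t > best_gain:
            best, best_gain = name, gain_t
    return best, best_gain
```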
4.4 Gender dependent (GD) coupling
Constructing gender dependent decision trees directly has the advantage of gender specific models, but the disadvantage that the training data for the tree construction is halved.
With GD coupling, every tree node instead contains two separate models for male and female data. The log-likelihood of the node data can then be calculated as the sum of the log-likelihoods of the two models (see the sketch below).
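A tiny sketch of the coupled node score, reusing node_log_likelihood from the earlier sketch; the 'gender' field on each state is an assumption for illustration.

```python
def gd_node_log_likelihood(states):
    """Gender dependent coupling: score a node as the sum of the
    log-likelihoods of its separate male and female models."""
    total = 0.0
    for gender in ("m", "f"):
        subset = [s for s in states if s.get("gender") == gender]
        if subset:  # guard against a node with no data for one gender
            total += node_log_likelihood(subset)
    return total
```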
4.4 Gender dependent (GD) coupling (cont.)
4.5 Node merging
The merged node represents the triphone states for which the disjunction of the conjoined answers (one conjunction of question answers per merged leaf) is true. In this way every possible combination of questions can be constructed.
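A toy illustration of this logic, assuming each leaf is described by the list of (question, expected answer) pairs on its path from the root; this representation is an assumption, not the paper's implementation.

```python
def leaf_predicate(path):
    """A leaf corresponds to the conjunction of its question answers;
    `path` is a list of (question, expected_answer) pairs."""
    return lambda state: all(q(state) == answer for q, answer in path)

def merged_predicate(leaf_paths):
    """A merged node is the disjunction of the conjunctions of the merged
    leaves, so arbitrary combinations of questions can be expressed."""
    leaves = [leaf_predicate(p) for p in leaf_paths]
    return lambda state: any(leaf(state) for leaf in leaves)
```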
4.6 Full covariance matrix
The diagonal covariance matrices of the Gaussian models were replaced by full covariance matrices. This modification results in a large increase in the number of parameters of the decision tree, so a smoothing method is applied (formula on the original slide).
4.6 Full covariance matrix (cont.)
5. Conclusion
Two well-known methods for parameter reduction of Hidden Markov Models by state tying were described. Several modifications of these methods were investigated, and some of them yield clear reductions in word error rate.
State tying in HTK – the HHEd command
Presented by Hsu Ting-Wei, 2006.05.08
Reference: HTK – HHEd (page 18)
HHEd is an HMM definition editor which will clone models into context-dependent sets, apply a variety of parameter tyings and increment the number of mixture components in specified distributions. HERest and HVite can then be used to re-estimate and test the edited models.
The single biggest problem in building context-dependent HMM systems is always data insufficiency. For continuous density systems, the balance between model complexity and available training data is achieved by tying parameters together.
Reference: HTK – HHEd (cont.)
3.3 Creating Tied-State Triphones
Given a set of monophone HMMs, the final stage of model building is to create context-dependent triphone HMMs. This is done in two steps:
3.3.1 Step 9 - Making Triphones from Monophones: firstly, the monophone transcriptions are converted to triphone transcriptions and a set of triphone models is created by copying the monophones and re-estimating.
3.3.2 Step 10 - Making Tied-State Triphones: secondly, similar acoustic states of these triphones are tied to ensure that all state distributions can be robustly estimated.
Reference: HTK – HHEd (cont.)
3.3.1 Step 9 - Making Triphones from Monophones
(Annotations on the slide: the transition matrices are tied; state tying is performed in the next step; the statistics are stored for the state clustering in the next step.)
Reference: HTK – HHEd (cont.)
(Flow on the slide: HLEd converts the monophone master label file into the triphone master label file and the triphone list; presenter's note: is the triphone list an output file? HHEd then clones the models.)
WB: define an inter-word (word boundary) label
TC: convert all phoneme labels to triphone labels
CL: clone an HMM list, e.g. A will be cloned 3 times and B will be cloned 2 times
TI: tie the specified items (in this step, the transition matrices of the cloned triphones)
Reference: HTK – HHEd (cont.)
3.3.2 Step 10 - Making Tied-State Triphones
When estimating these models, many of the variances in the output distributions will have been floored since there will be insufficient data associated with many of the states. The last step in the model building process is to tie states within triphone sets in order to share data and thus be able to make robust parameter estimates.
HHEd provides two mechanisms which allow states to be clustered and then each cluster tied:
- data-driven clustering
- tree-based clustering
The tied models are then re-estimated with HERest.
Reference: HTK – HHEd (cont.)
Data-Driven Clustering
Data-driven clustering is performed by the TC and NC commands. For single Gaussians, a weighted Euclidean distance between the means is used, and for tied-mixture systems a Euclidean distance between the mixture weights is used.
NC command: N-cluster the states listed in the itemList and tie each cluster i as macro macroi, where i is 1, 2, 3, ..., N. The set of states in the itemList is divided into N clusters using a furthest-neighbor hierarchical clustering algorithm (a rough sketch follows below):
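The algorithm itself is given in the HTK Book and does not survive on the slide; the following is a rough sketch of furthest-neighbor (complete-linkage) agglomerative clustering with a simplified inter-state distance, not the exact HTK implementation.

```python
import numpy as np

def state_distance(s1, s2):
    """Simplified inter-state distance; for single Gaussians the HTK Book
    uses a weighted Euclidean distance between the means (the exact
    weighting is not reproduced here)."""
    return float(np.linalg.norm(s1["mean"] - s2["mean"]))

def furthest_neighbor_cluster(states, n_clusters):
    """Complete-linkage (furthest neighbor) agglomerative clustering:
    start with one cluster per state and repeatedly merge the pair of
    clusters with the smallest maximum inter-state distance g(i, j)."""
    clusters = [[s] for s in states]
    while len(clusters) > n_clusters:
        best_pair, best_dist = None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                g = max(state_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if g < best_dist:
                    best_pair, best_dist = (i, j), g
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```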
Reference: HTK – HHEd (cont.)
Here g(i,j) is the inter-group distance between clusters i and j, defined as the maximum distance between any state in cluster i and any state in cluster j. The calculation of the inter-state distance depends on the type of HMMs involved: single-mixture Gaussians, fully tied mixture systems, and all other model types each use their own distance (the formulas from the HTK Book are not reproduced on the slide).
Reference: HTK – HHEd (cont.)
Data-Driven Clustering
TC command: cluster all states in the given itemList and tie them as macroi, where i is 1, 2, 3, .... This command is identical to the NC command described above, except that the number of clusters is varied such that the maximum within-cluster distance is less than the value given by f.
One limitation of the data-driven clustering procedure described above is that it does not deal with triphones for which there are no examples in the training data.
Reference: HTK – HHEd (cont.)
Tree-Based Clustering. Commands used in the edit script:
RO: set the outlier threshold (uses the statistics saved in step 9)
TR: set the trace (verbosity) flag
QS: define questions
TB: decision tree clustering of states
AU: synthesize new, unseen triphones
CO: compact a set of HMMs
ST: save the trees
One of the advantages of using decision tree clustering is that it allows previously unseen triphones to be synthesised.
Reference: HTK – HHEd (cont.)
TB: decision tree clustering of states; the steps are:
1. Each set of states defined by the final argument is pooled to form a single cluster.
2. Each question in the question set loaded by the QS commands is used to split the pool into two sets. The use of two sets rather than one allows the log likelihood of the training data to be increased, and the question which maximizes this increase is selected for the first branch of the tree.
3. The process is then repeated until the increase in log likelihood achievable by any question at any node is less than the threshold specified by the first argument (350.0 in this case).
(Presenter's note: the states matching a QS question form sets A and B; the stored statistics are used to evaluate the log likelihood for A, B and A+B; if the log likelihood of A+B is larger than that of A and B, the sets are merged.)
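Tying this back to the split-gain sketch from the first half of the talk, the TB growing and stopping rule can be pictured roughly as follows; best_question and node_log_likelihood refer to that earlier sketch, and the threshold value 350.0 is the one quoted above.

```python
def grow_tree(states, questions, threshold=350.0):
    """Recursively split a pooled set of states until no question yields a
    log-likelihood increase of at least `threshold` (cf. the TB command)."""
    name, gain = best_question(states, questions)
    if name is None or gain < threshold:
        return {"leaf": states}  # this node becomes one tied state
    q = questions[name]
    yes = [s for s in states if q(s)]
    no = [s for s in states if not q(s)]
    return {"question": name,
            "yes": grow_tree(yes, questions, threshold),
            "no": grow_tree(no, questions, threshold)}
```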
Reference: HTK – HHEd (cont.)
CO newList
For example, suppose that models A, B, C and D were currently loaded and A and B were identical. Then the command would tie HMMs A and B, set the physical name of B to A, and output the new HMM list to the newList file. This command is used mainly after performing a sequence of parameter tying commands.
TC vs. TB
Data-driven clustering (TC command) versus tree-based clustering (TB command):
- TC uses a distance metric between states; TB uses a log likelihood criterion.
- TC supports any type of output distribution; TB only supports single-Gaussian continuous density output distributions.
- The TB command can also be used to cluster whole models.
Reference: HTK – HHEd (cont.)
(Figure on the slide: the model set after step 9 and step 10.)