Learning Structured Models for Phone Recognition
Slav Petrov, Adam Pauls, Dan Klein
Acoustic Modeling
Motivation
  Standard acoustic models impose many structural constraints
  We propose an automatic approach
  Use TIMIT dataset
  MFCC features
  Full covariance Gaussians
  (Young and Woodland, 1994)
Phone Classification ??????????
æ
HMMs for Phone Classification
Temporal Structure
Standard subphone/mixture HMM
  Temporal Structure
  Gaussian Mixtures
  Model          Error rate
  HMM Baseline   25.1%
Our Model vs. Standard Model
  Single Gaussians
  Fully Connected
Hierarchical Baum-Welch Training
  Error rates shown in the figure: 32.1%, 28.7%, 25.6%, 23.9%
  HMM Baseline: 25.1%
  5 Split rounds: 21.4%
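The split step behind hierarchical training can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each round splits every state into two children, that a parent's transition mass is shared almost evenly between the children, and that the children's means are nudged apart so that Baum-Welch retraining (not shown) can differentiate them. The function name `split_states`, the `noise` parameter, and the 1-D means are hypothetical simplifications.

```python
import random

def split_states(trans, means, noise=0.01, seed=0):
    """Split every state of a fully connected HMM into two children.

    trans[i][j] is P(next = j | current = i); means[i] is the Gaussian mean
    of state i (1-D here for brevity). State i becomes children 2i and 2i+1.
    All names and the perturbation scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(trans)
    new_trans = []
    for i in range(n):
        for _child in range(2):
            row = []
            for j in range(n):
                # Share the parent's mass between the two children of j,
                # slightly unevenly so that retraining can tell them apart.
                share = 0.5 + rng.uniform(-noise, noise)
                row.extend([trans[i][j] * share, trans[i][j] * (1.0 - share)])
            new_trans.append(row)
    new_means = []
    for m in means:
        # Move the two children a small distance apart around the parent.
        eps = noise * (abs(m) + 1.0)
        new_means.extend([m - eps, m + eps])
    return new_trans, new_means

# Start from a single state and apply five split rounds.
trans, means = [[1.0]], [0.0]
for _ in range(5):
    trans, means = split_states(trans, means)
    # In a real system, Baum-Welch retraining would run here.
print(len(means))
```

Under the doubling assumption, five rounds starting from one state give 32 substates; each row of the new transition matrix still sums to one because the two children of a state together receive exactly the parent's mass.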
Phone Classification Results
  Method                                      Error Rate
  GMM Baseline (Sha and Saul, 2006)           26.0%
  HMM Baseline (Gunawardana et al., 2005)     25.1%
  SVM (Clarkson and Moreno, 1999)             22.4%
  Hidden CRF (Gunawardana et al., 2005)       21.7%
  Our Work                                    21.4%
  Large Margin GMM (Sha and Saul, 2006)       21.1%
Phone Recognition ?????????
Standard State-Tied Acoustic Models
No more State-Tying
No more Gaussian Mixtures
Fully connected internal structure
Fully connected external structure
Refinement of the /ih/-phone
Refinement of the /l/-phone
Hierarchical Refinement Results
  HMM Baseline: 41.7%
  5 Split Rounds: 28.4%
Merging
  Not all phones are equally complex
  Compute log likelihood loss from merging
  [Figure: split model vs. model merged at one node, over time steps t-1, t, t+1]
Merging Criterion
  [Figure: split and merged models over time steps t-1, t, t+1]
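The exact merging criterion did not survive on these slides, so the following is only a hedged sketch of the general idea of scoring a merge by its log likelihood loss. It assumes that, for each frame t, we already have the log likelihood of the data with the two substates merged at that frame only, and that the total loss is the sum of the per-frame differences. The names `merge_loss`, `cheapest_merges`, and the `fraction` parameter are hypothetical.

```python
def merge_loss(split_loglik, merged_logliks):
    """Total log likelihood lost by a merge, summed over frames.

    split_loglik: log likelihood of the data under the split model.
    merged_logliks: one value per frame t, the log likelihood when the two
    substates are merged at frame t only (an assumed approximation).
    """
    return sum(split_loglik - m for m in merged_logliks)

def cheapest_merges(losses, fraction=0.5):
    """Return the candidate merges with the smallest loss.

    losses maps a candidate merge to its loss; `fraction` is an
    illustrative knob, not a value taken from the talk.
    """
    ranked = sorted(losses, key=losses.get)
    return ranked[: int(len(ranked) * fraction)]

# Toy numbers: merging the /ae/ split costs least, so it is undone first.
losses = {"ih": merge_loss(-10.0, [-10.5, -10.25]), "l": 3.0, "ae": 0.1, "d": 2.0}
print(cheapest_merges(losses))
```

Splits whose removal costs little likelihood are the ones a simple phone does not need, which is how merging lets complex phones keep more states than simple ones.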
Split and Merge Results
  Split Only: 28.4%
  Split & Merge: 27.3%
HMM states per phone
Alignment Results
  Hand Aligned: 27.3%
  Auto Aligned: 26.3%
Alignment State Distribution
Inference
  State sequence: d1 - d6 - d6 - d4 - ae5 - ae2 - ae3 - ae0 - d2 - d2 - d3 - d7 - d5
  Phone sequence: d - d - d - d - ae - ae - ae - ae - d - d - d - d - d
  Transcription: d - ae - d
  Viterbi, Variational, ???
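The three levels on this slide are related by two simple projections, sketched below: dropping the substate index turns a state sequence into a phone sequence, and collapsing runs of identical phones gives the transcription. The helper names are hypothetical, and the sketch assumes substate labels are a phone name followed by a numeric index.

```python
from itertools import groupby

def phone_of(state):
    """Strip the trailing substate index: 'ae5' -> 'ae'."""
    return state.rstrip("0123456789")

def transcribe(states):
    """Map a frame-level state sequence to a phone transcription."""
    phones = [phone_of(s) for s in states]
    # Collapse each run of identical consecutive phones into one symbol.
    return [p for p, _ in groupby(phones)]

states = "d1 d6 d6 d4 ae5 ae2 ae3 ae0 d2 d2 d3 d7 d5".split()
print(transcribe(states))  # ['d', 'ae', 'd']
```

Many different state sequences map to the same transcription, which is why the single best state sequence need not yield the best transcription.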
Variational Inference
  Variational Approximation: [equation]
  [symbol]: Posterior edge marginals
  Solution: [equation]
  Viterbi: 26.3%
  Variational: 25.1%
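Since the equations on this slide are lost, the following is a sketch under stated assumptions rather than the authors' formula. It assumes the approximation factorizes over phone-level edges, with each factor obtained by summing substate posterior edge marginals within phones and normalizing by the source phone's mass, and it assumes the best frame-level phone path is then found by a Viterbi pass over those factors. The input `edge_post` (which would come from forward-backward, not shown) and all function names are hypothetical.

```python
import math
from collections import defaultdict

def phone_edge_model(edge_post, phone_of):
    """Sum substate edge posteriors into phone-level edge scores.

    edge_post[t][(s, s2)] is the posterior probability of substate s at
    frame t and s2 at frame t+1. Returns the phone marginals at frame 0
    and, per edge position, the conditional score q[(x, x2)].
    """
    start, q = {}, []
    for t, edges in enumerate(edge_post):
        joint, node = defaultdict(float), defaultdict(float)
        for (s, s2), p in edges.items():
            joint[(phone_of[s], phone_of[s2])] += p
            node[phone_of[s]] += p
        if t == 0:
            start = dict(node)
        q.append({e: p / node[e[0]] for e, p in joint.items() if p > 0})
    return start, q

def best_phone_path(edge_post, phone_of):
    """Viterbi over the phone-level approximation (one phone per frame)."""
    start, q = phone_edge_model(edge_post, phone_of)
    delta = {x: math.log(p) for x, p in start.items() if p > 0}
    back = []
    for qt in q:
        new, ptr = {}, {}
        for (x, x2), p in qt.items():
            if x in delta:
                score = delta[x] + math.log(p)
                if x2 not in new or score > new[x2]:
                    new[x2], ptr[x2] = score, x
        back.append(ptr)
        delta = new
    path = [max(delta, key=delta.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy case: the single best substate path is b0-b0 (0.4), but the phone
# 'a' holds more total posterior mass (0.3 + 0.3) across its substates.
phone_of = {"a0": "a", "a1": "a", "b0": "b"}
edge_post = [{("a0", "a1"): 0.3, ("a1", "a0"): 0.3, ("b0", "b0"): 0.4}]
print(best_phone_path(edge_post, phone_of))  # ['a', 'a']
```

The toy case shows the motivation: decoding over phones pools probability that the substate-level Viterbi path splits across many refinements of the same phone.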
Phone Recognition Results
  Method                                                      Error Rate
  State-Tied Triphone HMM (HTK) (Young and Woodland, 1994)    27.7%
  Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)     27.1%
  Our Work                                                    26.1%
  Bayesian Triphone HMM (Ming and Smith, 1998)                25.6%
  Heterogeneous classifiers (Halberstadt and Glass, 1998)     24.4%
Conclusions
  Minimalist, Automatic Approach
    Unconstrained
    Accurate
  Phone Classification
    Competitive with state-of-the-art discriminative methods despite being generative
  Phone Recognition
    Better than standard state-tied triphone models
Thank you!