State Tying for Context Dependent Phoneme Models. K. Beulen, E. Bransch, H. Ney. Lehrstuhl für Informatik VI, RWTH Aachen – University of Technology, D-52056 Aachen.

Presentation transcript:

State Tying for Context Dependent Phoneme Models
K. Beulen, E. Bransch, H. Ney
Lehrstuhl für Informatik VI, RWTH Aachen – University of Technology, D-52056 Aachen
EUROSPEECH 97, Sep. 1997
Presented by Hsu Ting-Wei

2 Reference
J. J. Odell, The Use of Context in Large Vocabulary Speech Recognition, Ph.D. Thesis, Cambridge University, Cambridge, March 1995.
羅應順, Construction of a Basic Recognition System for Spontaneous Mandarin Speech (自發性中文語音基本辨認系統之建立), Institute of Communication Engineering, National Chiao Tung University, June 2005.

3 1. Introduction
In this paper several modifications of two well-known methods for parameter reduction of Hidden Markov Models by state tying are described. The two methods are:
– A data-driven method which clusters triphone states with a bottom-up algorithm.
– A top-down method which grows decision trees for triphone states.
[Slide figure: overview relating state tying to the data-driven (bottom-up) and decision-tree (top-down) approaches]

4 1. Introduction (cont.)
We investigate the following aspects:
– the possible reduction of the word error rate by state tying,
– the consequences of different distance measures for the data-driven approach, and
– modifications of the original decision tree approach, such as node merging.

5 Corpus
Test corpus:
– 5000 word vocabulary of the WSJ November 92 task
Evaluation corpus:
– 3000 word vocabulary of the VERBMOBIL '95 task
Reduction of the word error rate by state tying, compared to simple triphone models:
– 14% for the WSJ task
– 5% for the VERBMOBIL task

6 2. State tying
Aim: reduce the number of parameters of the speech recognition system without a significant degradation in modeling accuracy.
Steps:
1. Establish the triphone list of the training corpus.
2. Estimate the mean and variance of each triphone state by using a segmentation of the training data.
3. Subdivide the triphone states into subsets according to their central phoneme and their position within the phoneme model. Inside these subsets, states are tied together according to a distance measure.
4. Additionally, it has to be ensured that every model has a sufficient amount of training data.
[Slide figure: two triphone states, e.g. ㄅㄧㄠ and ㄅㄧㄩ, with means mean1 and mean2 being tied into one shared state]
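A minimal sketch of step 3 (the data layout and names here are illustrative assumptions, not the paper's implementation): triphone states are first grouped by central phoneme and state position, and tying is only ever applied inside one of these subsets.

```python
from collections import defaultdict

# Hypothetical triphone-state statistics from step 2:
# (left context, central phoneme, right context, state position) -> count, mean, variance.
triphone_states = {
    ("b", "i", "au", 1): {"count": 120, "mean": [0.3, -0.1], "var": [1.0, 0.8]},
    ("b", "i", "u",  1): {"count":  35, "mean": [0.4, -0.2], "var": [1.1, 0.7]},
    ("s", "i", "au", 1): {"count":  60, "mean": [0.9,  0.5], "var": [0.9, 1.2]},
}

# Step 3: subdivide by (central phoneme, position within the phoneme model).
subsets = defaultdict(dict)
for (left, center, right, pos), stats in triphone_states.items():
    subsets[(center, pos)][(left, center, right, pos)] = stats

for key, members in subsets.items():
    print(key, "->", sorted(members))
```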

7 3. Data driven method
Two steps:
– Triphone states that are very similar according to a distance measure are clustered together (only states seen >= 50 times in the training corpus).
– States which do not contain enough data are clustered together with their nearest neighbor.
Drawback:
– For triphones which were not observed in the training corpus, no tied model is available (backing-off models are needed).
– Usually these backing-off models are simple generalizations of the triphones, such as diphones or monophones.
Cluster criteria:
– the approximative divergence of two states
– the log-likelihood difference
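As a rough illustration of the second criterion (a sketch, not necessarily the paper's exact formula), the log-likelihood loss of tying two single-Gaussian states can be computed from their counts, means, and diagonal variances; the smaller the loss, the more similar the states.

```python
import numpy as np

def state_loglik(n, var):
    """Log-likelihood of n frames under a diagonal Gaussian with ML variance `var`."""
    return -0.5 * n * np.sum(np.log(2.0 * np.pi * var) + 1.0)

def merge_cost(n1, mu1, var1, n2, mu2, var2):
    """Log-likelihood lost when two states are tied into one pooled Gaussian."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # Variance of the pooled data from the two states' first and second moments.
    var = (n1 * (var1 + mu1**2) + n2 * (var2 + mu2**2)) / n - mu**2
    return state_loglik(n1, var1) + state_loglik(n2, var2) - state_loglik(n, var)

# Toy example with two-dimensional features.
mu1, var1 = np.array([0.3, -0.1]), np.array([1.0, 0.8])
mu2, var2 = np.array([0.4, -0.2]), np.array([1.1, 0.7])
print(merge_cost(120, mu1, var1, 35, mu2, var2))
```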

8 3. Data driven method (cont.)
Data driven clustering flow chart (the chart itself is not reproduced in this transcript; a sketch of the procedure follows):
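The loop below is an assumed reading of that flow chart, reusing merge_cost and numpy from the previous sketch: frequent states are merged greedily as long as the cheapest merge stays below a threshold, and rare states are then attached to the cluster they are cheapest to merge with. The thresholds and data layout are illustrative assumptions.

```python
import numpy as np

def merged_stats(n1, mu1, var1, n2, mu2, var2):
    """Sufficient statistics (count, mean, diagonal variance) of two pooled states."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    var = (n1 * (var1 + mu1**2) + n2 * (var2 + mu2**2)) / n - mu**2
    return n, mu, var

def bottom_up_cluster(states, min_count=50, stop_cost=200.0):
    """Greedy bottom-up clustering; `states` maps an id to (count, mean, var) numpy stats."""
    frequent = {k: v for k, v in states.items() if v[0] >= min_count}
    rare = {k: v for k, v in states.items() if v[0] < min_count}
    clusters = {(k,): v for k, v in frequent.items()}

    # Repeatedly merge the pair of clusters whose merge loses the least log-likelihood.
    while len(clusters) > 1:
        keys = list(clusters)
        cost, a, b = min((merge_cost(*clusters[x], *clusters[y]), x, y)
                         for i, x in enumerate(keys) for y in keys[i + 1:])
        if cost > stop_cost:            # stop once even the cheapest merge is too costly
            break
        clusters[a + b] = merged_stats(*clusters[a], *clusters[b])
        del clusters[a], clusters[b]

    # Attach every rare state to its nearest (cheapest-to-merge) cluster.
    for k, v in rare.items():
        best = min(clusters, key=lambda c: merge_cost(*clusters[c], *v))
        clusters[best + (k,)] = merged_stats(*clusters[best], *v)
        del clusters[best]
    return clusters
```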

9 4. Decision tree method
Construction steps:
1. Collect all states in the root of the tree.
2. Find the binary question which gives the maximum gain in log-likelihood.
3. Split the data into two parts, one for the answer "Yes" and the other for "No".
4. Repeat from step 2 until the maximum gain in log-likelihood falls below a threshold.
Advantage:
– No backing-off models are needed, because with the decision trees one can find a generalized model for every triphone state in the recognition vocabulary.
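A hedged sketch of step 2: with a single diagonal Gaussian per node, the gain of a question is the log-likelihood of the two child sets minus that of the parent, each computed from the pooled statistics of the states falling on that side. The example question at the end is an illustrative assumption, not one of the paper's question sets.

```python
import numpy as np

def set_loglik(counts, means, varis):
    """Log-likelihood of several states pooled into one diagonal Gaussian."""
    means = [np.asarray(m, dtype=float) for m in means]
    varis = [np.asarray(v, dtype=float) for v in varis]
    n = sum(counts)
    mu = sum(c * m for c, m in zip(counts, means)) / n
    var = sum(c * (v + m**2) for c, m, v in zip(counts, means, varis)) / n - mu**2
    return -0.5 * n * np.sum(np.log(2.0 * np.pi * var) + 1.0)

def split_gain(states, question):
    """Gain in log-likelihood when `question` splits a node's states into yes/no sets."""
    yes = [s for s in states if question(s)]
    no = [s for s in states if not question(s)]
    if not yes or not no:
        return -np.inf
    def ll(subset):
        return set_loglik([s["count"] for s in subset],
                          [s["mean"] for s in subset],
                          [s["var"] for s in subset])
    return ll(yes) + ll(no) - ll(states)

# Illustrative question (phoneme classes are assumed): "Is the left context a vowel?"
is_left_vowel = lambda s: s["left"] in {"a", "e", "i", "o", "u"}
```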

10 4. Decision tree method (cont.)
[Slide figure: example decision tree; the root question is "Is the left context a word boundary?", and each leaf is annotated with the number of triphone states and the number of observations which belong to it]
Bias: most questions ask for a very specific phoneme property, so most triphone states end up in the right ("No") subtree.
Heterogeneous leaf: a leaf for which all the questions from the root to the leaf were answered by "No".

11 4. Decision tree method (cont.)
We also tested the following modifications of the original method:
– 4.1 No ad-hoc subdivision
– 4.2 Different triphone lists
– 4.3 Cross validation
– 4.4 Gender dependent (GD) coupling
– 4.5 Node merging
– 4.6 Full covariance matrix

4.1 No ad-hoc subdivision
Instead of one distinct tree for every phoneme and state, a single tree is grown:
– Split every node until it contains only states with one central phoneme.
– Split every node until every word in the vocabulary can be discriminated.
In the experiments we found that such heterogeneous nodes are very rare and do not introduce any ambiguities in the lexicon.

4.2 Different triphone lists
The triphone lists contain those triphones from the training corpus which can also be found in the test lexicon.

4.3 Cross validation
Triphone list 1 (50 observations) was used to estimate the Gaussian models of the tree nodes.
Triphone list 2 (20 observations) was used to cross-validate the splits.
At every node, the split with the highest gain in log-likelihood was made which also achieved a positive gain for the triphones of list 2.
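A minimal sketch of this cross-validated split selection, reusing split_gain from the decision-tree sketch above; the two arguments are the node's states drawn from triphone list 1 and list 2, respectively.

```python
def best_cross_validated_split(states_list1, states_list2, questions):
    """Highest-gain question on list 1 that also yields a positive gain on list 2."""
    best_q, best_gain = None, 0.0
    for q in questions:
        gain1 = split_gain(states_list1, q)
        if gain1 > best_gain and split_gain(states_list2, q) > 0.0:
            best_q, best_gain = q, gain1
    return best_q  # None means the node is not split any further
```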

4.4 Gender dependent (GD) coupling
Advantage: gender dependent decision trees can be constructed.
Disadvantage: the training data for the tree construction is halved.
With coupling, every tree node instead contains two separate models for male and female data. The log-likelihood of the node data can then be calculated as the sum of the log-likelihoods of the two models.
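In code form (a sketch under the same pooled-Gaussian assumptions, reusing set_loglik from the decision-tree sketch), the node score used when growing the coupled tree becomes:

```python
def gd_node_loglik(male_states, female_states):
    """Node log-likelihood with GD coupling: one Gaussian per gender, scores summed."""
    def pooled_ll(states):
        return set_loglik([s["count"] for s in states],
                          [s["mean"] for s in states],
                          [s["var"] for s in states])
    return pooled_ll(male_states) + pooled_ll(female_states)
```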

4.4 Gender dependent (GD) coupling (cont.)

4.5 Node merging
The merged node represents the triphone states for which the disjunction of the conjoined answers is true.
So every possible combination of questions can be constructed.
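As a small illustration (the phoneme sets and questions here are made up), each original leaf corresponds to the conjunction of its answers along the path from the root, and the merged node accepts a state whenever any of these conjunctions, i.e. their disjunction, holds.

```python
# Path 1: "left context is a vowel?" = yes AND "right context is /n/?" = yes
leaf_a = lambda s: s["left"] in {"a", "e", "i", "o", "u"} and s["right"] == "n"
# Path 2: "left context is a vowel?" = no AND "right context is /n/?" = yes
leaf_b = lambda s: s["left"] not in {"a", "e", "i", "o", "u"} and s["right"] == "n"

# The merged node is the disjunction of the two conjunctions.
merged = lambda s: leaf_a(s) or leaf_b(s)

print(merged({"left": "t", "right": "n"}))  # True: covered by path 2
```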

4.6 Full covariance matrix
The diagonal covariance matrices of the Gaussian models are replaced by full covariance matrices.
This modification results in a large increase in the number of parameters of the decision tree.
Smoothing method:
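The smoothing formula from the slide is not reproduced in this transcript. Purely as an illustration of one possible scheme (an assumption, not necessarily the paper's method), a full covariance matrix can be interpolated with its own diagonal to keep the estimate well conditioned:

```python
import numpy as np

def smooth_covariance(full_cov, weight=0.5):
    """Interpolate a full covariance with its diagonal (assumed smoothing scheme)."""
    diagonal_part = np.diag(np.diag(full_cov))   # keep only the variances
    return weight * full_cov + (1.0 - weight) * diagonal_part
```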

4.6 Full covariance matrix (cont.)

20 5. Conclusion
Two well-known methods for parameter reduction of Hidden Markov Models by state tying were described.
Several modifications of these state tying methods were investigated, and some of them produce good results.