ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY Viet Bac Le*, Laurent Besacier*, Tanja Schultz**


ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY Viet Bac Le*, Laurent Besacier*, Tanja Schultz** * CLIPS-IMAG Laboratory, UMR CNRS 5524 BP 53, Grenoble Cedex 9, FRANCE ** Interactive Systems Laboratories Carnegie Mellon University, Pittsburgh, PA, USA Presenter: Hsu-Ting Wei

2 Outline 1. Introduction 2. Acoustic-phonetic unit similarities –2.1. Phoneme Similarity –2.2. Polyphone Similarity –2.3. Clustered Polyphonic Model Similarity 3. Experiments and results 4. Conclusion

3 1. Introduction There are at least 6,900 languages in the world. We are interested in new techniques and tools for the rapid portability of automatic speech recognition (ASR) systems when only limited resources are available.

4 1. Introduction (cont.) In crosslingual acoustic modeling, previous approaches have been limited to context-independent models. –Monophone acoustic models in the target language were initialized using seed models from a source language. –Then, these initial models could be rebuilt or adapted using training data from the target language. Cross-lingual context-dependent acoustic model portability and adaptation remain to be investigated.

5 1. Introduction (cont.) 1996, J. Kohler –J. Kohler used HMM distances to calculate the similarity between two monophone models. –Method: To measure the distance or similarity of two phoneme models, a relative entropy-based distance metric is used. During training, the parameters of the mixture Laplacian density phoneme models are estimated. Further, for each phoneme a set of phoneme tokens X is extracted from a test or development corpus. Given two phoneme models λi and λj and the sets of phoneme tokens (observations) Xi and Xj, the distance between model λi and model λj is defined by: d(λi, λj) = (1/Tj) [log P(Xj | λj) − log P(Xj | λi)], where Tj is the total number of frames in Xj; the symmetric distance is obtained as [d(λi, λj) + d(λj, λi)] / 2.
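This relative entropy-based distance can be sketched as follows, assuming the per-token log-likelihood scores have already been computed with the trained models (the function and variable names here are illustrative, not from the paper):

```python
def model_distance(loglik_own, loglik_cross, num_frames):
    """d(lambda_i, lambda_j): average per-frame log-likelihood gap when the
    tokens X_j of phoneme j are scored by their own model (lambda_j) versus
    the other model (lambda_i)."""
    return (loglik_own - loglik_cross) / num_frames

def symmetric_distance(d_ij, d_ji):
    """Symmetrize the distance so it can be used for comparing or
    clustering phoneme models."""
    return 0.5 * (d_ij + d_ji)
```

The symmetrization matters because log P(Xj | λi) and log P(Xi | λj) generally differ, while a similarity measure used for clustering should not depend on argument order.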

6 1. Introduction (cont.) 2000, B. Imperl –proposed a triphone similarity estimation method based on phoneme distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones. –Method: confusion matrix

7 1. Introduction (cont.) 2001, T. Schultz –proposed PDTS (Polyphone Decision Tree Specialization) to overcome the problem that the context mismatch across languages increases dramatically for wider contexts. –PDTS: a data-driven method. In PDTS, the clustered multilingual polyphone decision tree is adapted to the target language by restarting the decision-tree growing process according to the limited adaptation data in the target language.

8 1. Introduction (cont.) In this paper, we investigate a new method for this crosslingual transfer process. We do not reuse the existing decision tree in the source language but build a new decision tree with only a small amount of data from the target language. Then, based on the acoustic-phonetic unit similarities, several crosslingual transfer and adaptation processes are applied.

9 2. Acoustic-phonetic unit similarities 2.1. Phoneme Similarity –2.1.1 Data-driven methods –2.1.2 Proposed knowledge-based method 2.1.1 Data-driven methods –The acoustic similarity between two phonemes can be obtained automatically by calculating the distance between their two acoustic models: HMM distance (entropy), Kullback-Leibler distance, Bhattacharyya distance, Euclidean distance, or a confusion matrix.
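As one concrete instance of the distances listed above, here is a minimal sketch of the Bhattacharyya distance between two diagonal-covariance Gaussians (a common single-Gaussian view of an HMM state distribution); the names and the diagonal-covariance assumption are illustrative, not taken from the paper:

```python
import math

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians,
    summed over feature dimensions; equals 0 for identical densities."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)  # average variance for this dimension
        d += 0.125 * (m1 - m2) ** 2 / v          # mean-separation term
        d += 0.5 * math.log(v / math.sqrt(v1 * v2))  # variance-mismatch term
    return d
```

The distance is symmetric in its two arguments, which makes it directly usable as a phoneme-to-phoneme similarity table entry without the symmetrization step needed for likelihood-based distances.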

10 2. Acoustic-phonetic unit similarities (cont.) 2.1.2 Proposed knowledge-based method –Step 1: Top-down classification –Step 2: Bottom-up estimation Step 1: Top-down classification –Figure 1 shows a hierarchical graph where each node is classified into different layers: k = number of layers; Gi = user-defined similarity value for layer i (i = 0...k-1). –In our work, we investigated several settings of k and Gi and set G = {0.9; 0.45; 0.25; 0.1; 0.0} with k = 5, based on a cross-evaluation in crosslingual acoustic modeling experiments.

11 2. Acoustic-phonetic unit similarities (cont.) Step 2: Bottom-up estimation –The similarity of two phonemes is read off from the layer at which they meet in the hierarchy; for example, d( [i], [u] ) = G2 (= 0.25 in this experiment).
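The two steps can be sketched as follows. The classification paths below are hypothetical examples standing in for the paper's Figure 1 hierarchy, and the exact layer-to-G mapping is an assumption; it is chosen so that the slide's example d([i], [u]) = G2 = 0.25 comes out correctly:

```python
# G_i values per layer (i = 0..4, k = 5), as set in the paper.
G = [0.9, 0.45, 0.25, 0.1, 0.0]

# Hypothetical top-down classification paths (root to leaf); the real
# hierarchy is the one shown in Figure 1 of the paper.
PATHS = {
    "i": ["phoneme", "vowel", "high", "front", "i"],
    "u": ["phoneme", "vowel", "high", "back", "u"],
    "p": ["phoneme", "consonant", "plosive", "bilabial", "p"],
}

def similarity(a, b):
    """Bottom-up estimation: count how many classification nodes the two
    phonemes share from the root, then map that shared depth to a G value
    (deeper shared node -> higher similarity)."""
    shared = 0
    for x, y in zip(PATHS[a], PATHS[b]):
        if x != y:
            break
        shared += 1
    return G[len(G) - shared]
```

With these paths, [i] and [u] share three nodes (phoneme, vowel, high) and get G2 = 0.25, an identical phoneme pair gets G0 = 0.9, and phonemes sharing only the root get G4 = 0.0.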

12 2. Acoustic-phonetic unit similarities (cont.) 2.2 Polyphone Similarity –Let S be the phoneme set in the source language and T be the phoneme set in the target language. Finally, find the nearest source polyphone Ps*.
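A sketch of how a phoneme-level similarity could be lifted to polyphones to find the nearest source polyphone Ps*. The position-wise averaging here is an assumption for illustration, since the transcript omits the paper's exact combination formula, and `phoneme_sim` stands for any of the phoneme similarities from Section 2.1:

```python
def polyphone_similarity(p_t, p_s, phoneme_sim):
    """Compare a target polyphone with a source polyphone position by
    position (left context, center phoneme, right context) and average
    the phoneme-level similarities."""
    return sum(phoneme_sim(a, b) for a, b in zip(p_t, p_s)) / len(p_t)

def nearest_polyphone(p_t, source_polyphones, phoneme_sim):
    """Ps*: the source polyphone with the highest combined similarity
    to the target polyphone."""
    return max(source_polyphones,
               key=lambda p_s: polyphone_similarity(p_t, p_s, phoneme_sim))
```

In practice the phoneme similarity would come from the knowledge-based hierarchy or a data-driven distance table, so a source polyphone with a slightly mismatched context can still be selected as the closest seed model.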

13 2. Acoustic-phonetic unit similarities (cont.) 2.3 Clustered Polyphonic Model Similarity –Since the number of polyphones in a language is very large, a limited training corpus usually does not cover enough occurrences of every polyphone. –A decision tree-based clustering (Figure 3) or an agglomerative clustering procedure (Figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model.

14 2. Acoustic-phonetic unit similarities (cont.) 2.3 Clustered Polyphonic Model Similarity (cont.)

15 3. Experiments and results Syllable-based ASR system. In order to build a polyphonic decision tree and to adapt the crosslingual acoustic models: –Baseline: 2.25 hours of Vietnamese speech data spoken by 8 speakers were used. –Crosslingual methods: speech data from six languages were used to build these models: Arabic, Chinese, English, German, Japanese and Spanish. The models were then adapted with Vietnamese speech.

16 3. Experiments and results (cont.) [Presenter's diagram annotations, translated:] Baseline initial model: Vietnamese speakers speaking Vietnamese (2.5 hours). Crosslingual initial models: speakers of six languages speaking their own native languages (perhaps 100 hours), adapted by adding Vietnamese speech from Vietnamese speakers (2.5 hours). The resulting new models are then tested.

17 3. Experiments and results (cont.) As the number of clustered sub-models increases, the SER of the baseline system increases proportionally, since the amount of data per model decreases due to the limited training data. However, the crosslingual system is able to overcome this problem by indirectly using data from other languages.

18 3. Experiments and results (cont.) The influence of the adaptation data size and the number of speakers on the baseline system and on two methods of phoneme similarity estimation: –knowledge-based –data-driven, using a confusion matrix. We find that the knowledge-based method outperforms the data-driven method.

19 4. Conclusion This paper presents different methods of estimating the similarities between two acoustic-phonetic units. The potential of our method is demonstrated, even though the use of trigrams in the syllable-based language modeling might be insufficient to obtain acceptable error rates.