Advances in WP2 Torino Meeting – 9-10 March 2006 (www.loquendo.com)


2 Activities on WP2 since last meeting
Study of innovative NN adaptation methods
– Models: Linear Hidden Networks
Test on project adaptation corpora:
– WSJ0 adaptation component
– WSJ1 Spoke-3 component
– Hiwire Non-Native Corpus

3 Speech Databases for Speaker Adaptation
WSJ0 (standard ARPA, 1993, LDC, 1000$):
– Large-vocabulary (5K words) continuous speech database
– Test set: 8 speakers, ~40 utterances each, read speech, bigram LM
– Adaptation set: the same 8 speakers, 40 utterances each
WSJ1 (1994, LDC, 1500$):
– Similar to WSJ0, same vocabulary and LM
– SPOKE-3: standard case study of adaptation to non-native speakers
– 10 speakers, 40 adaptation utterances, 40 test utterances
Hiwire Non-Native Speaker database:
– Collected within the project
– 80 speakers, each reading 100 utterances

4 LIN Adaptation for HMM/NN
LIN stands for "linear input network".
LIN is a classical technique for speaker and channel adaptation in HMM/NN systems [Neto 1996].
The LIN is placed before an MLP already trained in a speaker-independent way (SI-MLP).
The input space is rotated by a linear transform, to bring the target conditions closer to the training conditions.
The linear transform is implemented as a linear neural network inserted between the input layer and the 1st hidden layer.

5 LIN Adaptation
(Diagram: the speech signal parameters feed the LIN, which feeds the input layer of the speaker-independent MLP (SI-MLP); the 1st and 2nd hidden layers lead to the output layer, which produces the emission probabilities of the acoustic-phonetic units.)

6 LIN Training
The global SI-MLP+LIN system is trained with vocal material from the target speaker.
The LIN is initialized with an identity matrix.
LIN weights are trained with error back-propagation through the global net.
The original NN weights are kept frozen.
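The training recipe above can be sketched in a few lines of numpy. This is a minimal illustration, not the project's actual implementation: the toy network sizes, the random "pretrained" SI-MLP weights, and the labeled adaptation frames are all hypothetical. The LIN is a square matrix initialized to the identity; back-propagation runs through the whole (frozen) net, but only the LIN is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy speaker-independent MLP (SI-MLP): 4 inputs -> 8 hidden -> 3 classes.
# The weights stand in for a pretrained net and stay frozen during adaptation.
d, h, c = 4, 8, 3
W1 = rng.standard_normal((h, d)); b1 = np.zeros(h)
W2 = rng.standard_normal((c, h)); b2 = np.zeros(c)

# LIN: linear transform on the input, initialized with an identity matrix.
A = np.eye(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lin_step(x, target, lr=0.05):
    """One back-propagation step that updates only the LIN matrix A."""
    global A
    z = A @ x                        # LIN-rotated input
    a1 = sigmoid(W1 @ z + b1)        # frozen 1st hidden layer
    y = softmax(W2 @ a1 + b2)        # frozen output layer (emission probs)
    t = np.zeros(c); t[target] = 1.0
    d_logits = y - t                 # cross-entropy gradient at the output
    d_a1 = W2.T @ d_logits           # propagated through the frozen net...
    d_pre1 = d_a1 * a1 * (1.0 - a1)  # ...through the sigmoid...
    d_z = W1.T @ d_pre1              # ...down to the LIN output
    A -= lr * np.outer(d_z, x)       # only the LIN weights move
    return -np.log(y[target] + 1e-12)

# Hypothetical adaptation material from one target speaker: labeled frames.
frames = [(rng.standard_normal(d), int(rng.integers(c))) for _ in range(20)]
W1_before = W1.copy()
losses = [sum(lin_step(x, t) for x, t in frames) for _ in range(30)]
```

After a few epochs the adaptation loss drops while the SI-MLP weights are untouched, which is the whole point of the scheme: the speaker-independent model is preserved and only the input rotation is learned.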

7 LHN Adaptation
LHN stands for "linear hidden network".
The activations of the last hidden layer are linearly transformed to improve the acoustic matching of the adaptation material.
The activation values of a hidden layer represent an internal structure of the input pattern, in a space more suitable for classification and adaptation.
The linear transform is implemented as a linear neural network layer inserted between the last hidden layer and the output layer.

8 LHN Adaptation
(Diagram: as in the LIN figure, but the LHN is inserted between the 2nd hidden layer and the output layer of the speaker-independent MLP.)

9 LHN Training
The global SI-MLP+LHN system is trained with vocal material from the target speaker.
The LHN is initialized with an identity matrix.
LHN weights are trained with error back-propagation through the last layer of weights.
The original NN weights are kept frozen.
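The LHN variant can be sketched the same way; again this is an illustrative numpy toy with hypothetical sizes and data, not the project code. The identity-initialized transform now sits on the last hidden activations, and the error needs to be back-propagated only through the final weight layer, which makes each adaptation step cheaper than for LIN.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy SI-MLP as before: 4 inputs -> 8 hidden -> 3 classes, weights frozen.
d, h, c = 4, 8, 3
W1 = rng.standard_normal((h, d)); b1 = np.zeros(h)
W2 = rng.standard_normal((c, h)); b2 = np.zeros(c)

# LHN: linear layer on the last hidden activations, identity-initialized.
H = np.eye(h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lhn_step(x, target, lr=0.05):
    """One back-propagation step that updates only the LHN matrix H.
    The error is propagated only through the last layer of weights (W2)."""
    global H
    a1 = sigmoid(W1 @ x + b1)     # frozen hidden layer
    z = H @ a1                    # LHN-transformed activations
    y = softmax(W2 @ z + b2)      # frozen output layer
    t = np.zeros(c); t[target] = 1.0
    d_logits = y - t
    d_z = W2.T @ d_logits         # back-prop through the last weight layer only
    H -= lr * np.outer(d_z, a1)   # only the LHN weights move
    return -np.log(y[target] + 1e-12)

# Hypothetical adaptation frames from one target speaker.
frames = [(rng.standard_normal(d), int(rng.integers(c))) for _ in range(20)]
W2_before = W2.copy()
losses = [sum(lhn_step(x, t) for x, t in frames) for _ in range(30)]
```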

10 Paper at ICASSP 2006
ADAPTATION OF HYBRID ANN/HMM MODELS USING LINEAR HIDDEN TRANSFORMATIONS AND CONSERVATIVE TRAINING
Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface and Renato De Mori

11 WSJ0 LIN-LHN Adaptation
Train: standard WSJ0 SI-84 training set, 16 kHz
SI test: 8 speakers, ~40 sentences for each speaker
Vocabulary: 5K words, with a standard bigram LM
Adaptation: the same 8 speakers of the SI test, with 40 adaptation sentences each
(Table: per-speaker word accuracy for speakers WV1_440 to WV1_447 with the baseline, LIN-adapted and LHN-adapted models; the per-speaker figures are not preserved in this transcript.)
Average error reduction: LIN adaptation 10.5%, LHN adaptation 20.0%

12 WSJ1 – SPOKE-3 LIN-LHN Adaptation
Spoke-3 is the standard WSJ1 case study for evaluating adaptation to non-native speakers.
There are 10 non-native speakers (40 adaptation sentences and ~40 test sentences each).
Train: standard WSJ0 SI-84 training set, 16 kHz
Vocabulary: 5K words, with a standard bigram LM
(Table: per-speaker word accuracy for speakers 4N0, 4N1, 4N3, 4N4, 4N5, 4N8, 4N9, 4NA, 4NB, 4NC with the baseline, LIN-adapted and LHN-adapted models; the per-speaker figures are not preserved in this transcript.)
Average error reduction: LIN adaptation 14.2%, LHN adaptation 43.5%
Example sentence: "THE FEMALE PRODUCES A LITTER OF TWO TO FOUR YOUNG IN NOVEMBER AND DECEMBER"

13 Comments on WSJ0 – WSJ1 Results
LIN does work for speaker adaptation: E.R. 10.5% on WSJ0 and 14.2% on WSJ1.
However, with LIN, performance in some cases does not improve, or even decreases.
LHN is a more powerful method: E.R. 20.0% on WSJ0 and 43.5% on WSJ1.
With LHN, performance always improves.
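The E.R. figures quoted above are relative error reductions, i.e. the fraction of the baseline word error rate removed by adaptation. A one-line helper makes the convention explicit; the WER values in the example are hypothetical, since the slides' per-speaker numbers are not reproduced here.

```python
def error_reduction(wer_baseline, wer_adapted):
    """Relative error reduction (%) of an adapted model over its baseline."""
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline

# Hypothetical example: a drop from 10% to 8% WER is a 20% error reduction.
er = error_reduction(10.0, 8.0)
```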

14 Hiwire Non-Native Corpus (1)
The database consists of English sentences uttered by non-native speakers of French, Italian, Greek and Spanish origin (plus an additional set of extra-European speakers).
The uttered sentences belong to a command language used by aircraft pilots.
The vocabulary contains 134 words.
Each speaker has pronounced one list of 100 sentences.

15 Hiwire Non-Native Corpus (2)
Corpus composition:
French speakers: 31
Italian speakers: 20
Greek speakers: 20
Spanish speakers: 10
World speakers: 10

16 Experimental Conditions
Starting models: standard Loquendo ASR EN-US telephone models (8 kHz), trained on the LDC Macrophone set
Adaptation: first 50 utterances of each speaker
Test: last 50 utterances of each speaker
LM: Hiwire grammar (134-word vocabulary)
Signal processing: down-sampling to 8 kHz

17 Results on Hiwire Corpus
Recognition model: ANN/HMM
Adaptation model: LIN – LHN
(Table: for each nationality (French, Italian, Greek, Spanish, World) and the total, the number of speakers and the word accuracy (WA) and error reduction (E.R. %) of the default, LIN-adapted and LHN-adapted models; the figures are not preserved in this transcript.)

18 Discussion
Adaptation of the acoustic models also gives a good contribution in the case of non-native speakers.
State-of-the-art LIN is a feasible and practical way to adapt hybrid NN-HMM models.
LHN (transformation of hidden-layer activations) is a new NN adaptation method introduced in the project.
LHN outperforms LIN.

19 Experiments for Year 2
Speaker adaptation tests on project test sets:
– WSJ0
– WSJ1 Spoke-3
– Hiwire non-native
Tests with different techniques:
– LIN
– New NN adaptation methods

20 Workplan
Selection of suitable benchmark databases (m6)
Baseline set-up for the selected databases (m8)
LIN adaptation method implemented and tested on the benchmarks (m12)
Experimental results on the Hiwire database with LIN (m18)
Innovative NN adaptation methods and algorithms for acoustic modeling, with experimental results (m21)
Further advances on new adaptation methods (m24)
Unsupervised adaptation: algorithms and experimentation (m33)