Distributed Learning of Multilingual DNN Feature Extractors using GPUs


Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
{ymiao, haoz1, fmetze}@cs.cmu.edu

Introduction
As the state of the art for speech recognition, DNNs are particularly suitable for multilingual and cross-lingual ASR. A multilingual DNN is trained over a group of languages, with its hidden layers shared across the languages. Given a new language, the shared hidden layers act as a deep feature extractor.
[Figure: multilingual hybrid DNN with a shared feature extractor and language-specific softmax layers; the shared layers are reused as a feature extractor for the target language.]
Goal. With multiple GPUs available, we aim to parallelize the learning of the feature extractor over large amounts of multilingual training data.
Highlight. We study how parallelization affects the quality of the feature extractors. Feature extractor learning is robust to infrequent thread synchronization, so time-synchronous model averaging achieves good speed-up.

DistLang: Distribution by Languages
1. Basic Idea
Each GPU trains a DNN as a language-specific feature extractor. On the target language, each speech frame is fed into these separate extractors, and the resulting feature vectors are fused into a single feature representation for the target DNN.
[Figure: each source language is assigned to its own GPU; the resulting language-specific extractors feed the target-language DNN.]

2. Two Methods for Feature Fusion
FeatConcat: concatenate the outputs of the language-specific feature extractors into a single vector.
FeatMix: fuse the feature vectors via a linear weighted combination. The combined feature vector is computed as
    y = a_1 x_1 + a_2 x_2 + ... + a_N x_N + b
where x_n is the feature vector from the n-th extractor, a_n is the weight for the features from the n-th extractor, and b is a bias vector. (A code sketch of both fusion schemes is given after the DistModel section below.)

3. Pros & Cons
+ No communication cost, so the speed-up is essentially perfect.
+ Inclusion of new source languages is easy; there is no need to retrain from scratch.
- The number of GPUs is fixed by the number of source languages.

DistModel: Distribution by Model
The training data of each language is partitioned evenly across the GPUs. After a specified number of mini-batches (the averaging interval), the feature extractors from the individual GPUs are averaged into a unified model, and the averaged parameters are sent back to each GPU as the new starting model for subsequent training. This is a time-synchronous method; however, on this particular feature learning task, DistModel is robust to averaging intervals as large as 2000 mini-batches.
[Figure: each language's 90 hours of training data is split into 30-hour shards across 3 GPUs; the per-GPU extractors are averaged into a unified extractor once every averaging interval.]
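To make the DistModel recipe concrete, here is a minimal single-process NumPy sketch of time-synchronous model averaging. It stands in a toy linear model for the DNN feature extractor and simulates the per-GPU workers with plain loops; the worker count, averaging interval, learning rate, and synthetic data are illustrative assumptions, not the settings used on the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature extractor": a single weight matrix trained with plain SGD on a
# least-squares objective. The real extractor is a multi-layer DNN; the
# averaging logic is the same.
DIM_IN, DIM_OUT = 20, 10
NUM_WORKERS = 3            # one worker per GPU
AVERAGING_INTERVAL = 200   # mini-batches between synchronizations (illustrative)
NUM_SYNCS = 10
BATCH_SIZE = 64
LR = 0.05

# Synthetic data standing in for the multilingual training set, which is
# partitioned evenly across the workers.
W_true = rng.normal(size=(DIM_IN, DIM_OUT))

def sample_batch():
    x = rng.normal(size=(BATCH_SIZE, DIM_IN))
    y = x @ W_true + 0.1 * rng.normal(size=(BATCH_SIZE, DIM_OUT))
    return x, y

def sgd_step(W, x, y, lr=LR):
    """One mini-batch SGD step on the squared error 0.5 * ||xW - y||^2."""
    grad = x.T @ (x @ W - y) / len(x)
    return W - lr * grad

# Time-synchronous model averaging (DistModel): every worker starts from the
# same parameters, trains independently for AVERAGING_INTERVAL mini-batches,
# then the workers' parameters are averaged and broadcast back as the new
# starting point.
W_global = np.zeros((DIM_IN, DIM_OUT))
for sync in range(NUM_SYNCS):
    replicas = [W_global.copy() for _ in range(NUM_WORKERS)]
    for k in range(NUM_WORKERS):
        for _ in range(AVERAGING_INTERVAL):
            x, y = sample_batch()
            replicas[k] = sgd_step(replicas[k], x, y)
    W_global = np.mean(replicas, axis=0)   # model averaging
    err = np.linalg.norm(W_global - W_true) / np.linalg.norm(W_true)
    print(f"sync {sync + 1}: relative parameter error = {err:.3f}")
```

Because parameters are exchanged only once per averaging interval, a larger interval directly reduces communication, which is why the evaluation below shows better speed-up as the interval grows.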
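For the two DistLang fusion schemes described above, the following sketch applies FeatConcat and FeatMix to dummy extractor outputs. The per-extractor feature dimensionality, the equal combination weights a_n, and the zero bias b are hypothetical placeholder values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXTRACTORS = 3   # one language-specific extractor per source language
FEAT_DIM = 128       # per-extractor feature dimensionality (illustrative)

# Stand-ins for the outputs of the language-specific extractors for one
# target-language speech frame.
feats = [rng.normal(size=FEAT_DIM) for _ in range(NUM_EXTRACTORS)]

def feat_concat(feature_vectors):
    """FeatConcat: stack the per-extractor features into one long vector."""
    return np.concatenate(feature_vectors)

def feat_mix(feature_vectors, weights, bias):
    """FeatMix: linear weighted combination y = sum_n a_n * x_n + b."""
    return sum(a * x for a, x in zip(weights, feature_vectors)) + bias

# Hypothetical fusion parameters (equal weights, zero bias).
a = np.full(NUM_EXTRACTORS, 1.0 / NUM_EXTRACTORS)
b = np.zeros(FEAT_DIM)

print(feat_concat(feats).shape)     # (384,)  -- NUM_EXTRACTORS * FEAT_DIM
print(feat_mix(feats, a, b).shape)  # (128,)  -- same size as one extractor's output
```

FeatConcat grows the fused feature dimensionality with the number of source languages, while FeatMix keeps it equal to a single extractor's output size.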
Datasets and Experimental Setup
1. Two evaluation conditions on the BABEL corpus:
   Tagalog    - IARPA-babel106-v0.2f
   Cantonese  - IARPA-babel101-v0.4c
   Turkish    - IARPA-babel105b-v0.4
   Pashto     - IARPA-babel104b-v0.4aY
   Vietnamese - IARPA-babel101-v0.4c
   Bengali    - IARPA-babel103b-v0.4b

   Condition      Source languages (total hours)                             Target language
   Preliminary    Cantonese, Turkish and Pashto (226 Hr)                     10 Hr set of Tagalog
   Larger-scale   Cantonese, Turkish, Pashto, Tagalog and Vietnamese (460 Hr)   10 Hr set of Bengali

2. Protocol. We measure WERs on the target language, using an identical DNN architecture on top of the various feature extractors.

3. Metrics.
   WER(%) of the hybrid DNN model on a 2-hour testing set of the target language.
   Speed-up: the ratio of the training time on a single GPU to the training time on multiple GPUs.

Preliminary Evaluation
WER% and speed-up of DistModel as the averaging interval increases (DistModel-N denotes an averaging interval of N mini-batches, on 3 GPUs):

   Methods          WER%   Speed-up
   Single GPU       49.3   ----
   DistModel-600    50.5   1.9
   DistModel-1000          2.2
   DistModel-2000   50.8   2.5

- With a larger averaging interval we obtain monotonically better speed-up; 2000 mini-batches seems to be a good tradeoff point.
- Applied to a monolingual DNN on Tagalog FullLP, the enlarged WER degradation shows that DistModel is particularly useful for multilingual DNN training.

WER% of DistLang with the two feature fusion methods:

   Methods                 Feature Dim   WER%
   DistLang - FeatConcat   1024          61.4
   DistLang - FeatMix      1024          61.6
   DistLang - FeatConcat   341           60.3
   DistLang - FeatMix      341           60.7

- DistLang always gives a speed-up of about 3.0.
- It is worse than DistModel, partly because of its language dependence.
- FeatConcat is slightly better than FeatMix.

Larger-Scale Evaluation
WER% and speed-up of DistModel as the number of GPUs increases:

   Methods              WER%   Speed-up
   Monolingual DNN      72.5   ---
   Single GPU           65.7
   DistModel - 3 GPUs   66.2   2.4
   DistModel - 4 GPUs   66.7   3.1
   DistModel - 5 GPUs   66.8   3.4

- Consistent acceleration, although the improvement is not linear.
- Pooling more GPUs degrades WERs on the target language; this degradation might be mitigated by further optimization.

Acknowledgements
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.