Distributed Learning of Multilingual DNN Feature Extractors using GPUs


Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
{ymiao, haoz1, fmetze}@cs.cmu.edu

Introduction
As the state of the art for speech recognition, DNNs are particularly suitable for multilingual and cross-lingual ASR. A multilingual DNN is trained over a group of languages, with its hidden layers shared across the languages. Given a new language, the shared hidden layers act as a deep feature extractor.
[Figure: multilingual hybrid DNN with a shared feature extractor and language-specific softmax layers; the shared layers are reused as a feature extractor for the target language.]
Goal. With multiple GPUs available, we aim to parallelize the learning of the feature extractor over large amounts of multilingual training data.
Highlight. We study how parallelization affects the quality of the feature extractors. Feature extractor learning is robust to infrequent thread synchronization, so time-synchronous model averaging achieves good speed-up.

DistLang: Distribution by Languages
1. Basic Idea
Each GPU trains a DNN as a language-specific feature extractor. On the target language, each speech frame is fed into these separate extractors, and the resulting feature vectors are fused into a single feature representation for the target DNN.
[Figure: each source language is assigned to its own GPU; the resulting language-specific extractors feed the target-language DNN.]

2. Two Methods for Feature Fusion
FeatConcat: concatenate the outputs of the language-specific feature extractors into a single vector.
FeatMix: fuse the feature vectors via a linear weighted combination. The combined feature vector is computed as
    y = a_1 x_1 + a_2 x_2 + ... + a_N x_N + b
where x_n is the feature vector from the n-th extractor, a_n is the weight for the features from the n-th extractor, and b is a bias vector. (A code sketch of both fusion schemes is given after the DistModel section below.)

3. Pros & Cons
+ No communication cost, so the speed-up is essentially perfect.
+ Inclusion of new source languages is easy; there is no need to retrain from scratch.
- The number of GPUs is fixed by the number of source languages.

DistModel: Distribution by Model
The training data of each language is partitioned evenly across the GPUs. After a specified number of mini-batches (the averaging interval), the feature extractors from the individual GPUs are averaged into a unified model, and the averaged parameters are sent back to each GPU as the new starting model for subsequent training. This is a time-synchronous method; however, on this particular feature learning task, DistModel is robust to averaging intervals as large as 2000 mini-batches.
[Figure: each language's 90 hours of training data is split into 30-hour shards across 3 GPUs; the per-GPU extractors are averaged into a unified extractor once every averaging interval.]
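To make the DistModel recipe concrete, here is a minimal single-process NumPy sketch of time-synchronous model averaging. It stands in a toy linear model for the DNN feature extractor and simulates the per-GPU workers with plain loops; the worker count, averaging interval, learning rate, and synthetic data are illustrative assumptions, not the settings used on the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature extractor": a single weight matrix trained with plain SGD on a
# least-squares objective. The real extractor is a multi-layer DNN; the
# averaging logic is the same.
DIM_IN, DIM_OUT = 20, 10
NUM_WORKERS = 3            # one worker per GPU
AVERAGING_INTERVAL = 200   # mini-batches between synchronizations (illustrative)
NUM_SYNCS = 10
BATCH_SIZE = 64
LR = 0.05

# Synthetic data standing in for the multilingual training set, which is
# partitioned evenly across the workers.
W_true = rng.normal(size=(DIM_IN, DIM_OUT))

def sample_batch():
    x = rng.normal(size=(BATCH_SIZE, DIM_IN))
    y = x @ W_true + 0.1 * rng.normal(size=(BATCH_SIZE, DIM_OUT))
    return x, y

def sgd_step(W, x, y, lr=LR):
    """One mini-batch SGD step on the squared error 0.5 * ||xW - y||^2."""
    grad = x.T @ (x @ W - y) / len(x)
    return W - lr * grad

# Time-synchronous model averaging (DistModel): every worker starts from the
# same parameters, trains independently for AVERAGING_INTERVAL mini-batches,
# then the workers' parameters are averaged and broadcast back as the new
# starting point.
W_global = np.zeros((DIM_IN, DIM_OUT))
for sync in range(NUM_SYNCS):
    replicas = [W_global.copy() for _ in range(NUM_WORKERS)]
    for k in range(NUM_WORKERS):
        for _ in range(AVERAGING_INTERVAL):
            x, y = sample_batch()
            replicas[k] = sgd_step(replicas[k], x, y)
    W_global = np.mean(replicas, axis=0)   # model averaging
    err = np.linalg.norm(W_global - W_true) / np.linalg.norm(W_true)
    print(f"sync {sync + 1}: relative parameter error = {err:.3f}")
```

Because parameters are exchanged only once per averaging interval, a larger interval directly reduces communication, which is why the evaluation below shows better speed-up as the interval grows.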
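For the two DistLang fusion schemes described above, the following sketch applies FeatConcat and FeatMix to dummy extractor outputs. The per-extractor feature dimensionality, the equal combination weights a_n, and the zero bias b are hypothetical placeholder values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXTRACTORS = 3   # one language-specific extractor per source language
FEAT_DIM = 128       # per-extractor feature dimensionality (illustrative)

# Stand-ins for the outputs of the language-specific extractors for one
# target-language speech frame.
feats = [rng.normal(size=FEAT_DIM) for _ in range(NUM_EXTRACTORS)]

def feat_concat(feature_vectors):
    """FeatConcat: stack the per-extractor features into one long vector."""
    return np.concatenate(feature_vectors)

def feat_mix(feature_vectors, weights, bias):
    """FeatMix: linear weighted combination y = sum_n a_n * x_n + b."""
    return sum(a * x for a, x in zip(weights, feature_vectors)) + bias

# Hypothetical fusion parameters (equal weights, zero bias).
a = np.full(NUM_EXTRACTORS, 1.0 / NUM_EXTRACTORS)
b = np.zeros(FEAT_DIM)

print(feat_concat(feats).shape)     # (384,)  -- NUM_EXTRACTORS * FEAT_DIM
print(feat_mix(feats, a, b).shape)  # (128,)  -- same size as one extractor's output
```

FeatConcat grows the fused feature dimensionality with the number of source languages, while FeatMix keeps it equal to a single extractor's output size.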
Datasets and Experimental Setup
1. Two evaluation conditions on the BABEL corpus:
   Tagalog    - IARPA-babel106-v0.2f
   Cantonese  - IARPA-babel101-v0.4c
   Turkish    - IARPA-babel105b-v0.4
   Pashto     - IARPA-babel104b-v0.4aY
   Vietnamese - IARPA-babel101-v0.4c
   Bengali    - IARPA-babel103b-v0.4b

   Condition      Source languages (total hours)                             Target language
   Preliminary    Cantonese, Turkish and Pashto (226 Hr)                     10 Hr set of Tagalog
   Larger-scale   Cantonese, Turkish, Pashto, Tagalog and Vietnamese (460 Hr)   10 Hr set of Bengali

2. Protocol. We measure WERs on the target language, using an identical DNN architecture on top of the various feature extractors.

3. Metrics.
   WER(%) of the hybrid DNN model on a 2-hour testing set of the target language.
   Speed-up: the ratio of the training time on a single GPU to the training time on multiple GPUs.

Preliminary Evaluation
WER% and speed-up of DistModel as the averaging interval increases (DistModel-N denotes an averaging interval of N mini-batches, on 3 GPUs):

   Methods          WER%   Speed-up
   Single GPU       49.3   ----
   DistModel-600    50.5   1.9
   DistModel-1000          2.2
   DistModel-2000   50.8   2.5

- With a larger averaging interval we obtain monotonically better speed-up; 2000 mini-batches seems to be a good tradeoff point.
- Applied to a monolingual DNN on Tagalog FullLP, the enlarged WER degradation shows that DistModel is particularly useful for multilingual DNN training.

WER% of DistLang with the two feature fusion methods:

   Methods                 Feature Dim   WER%
   DistLang - FeatConcat   1024          61.4
   DistLang - FeatMix      1024          61.6
   DistLang - FeatConcat   341           60.3
   DistLang - FeatMix      341           60.7

- DistLang always gives a speed-up of about 3.0.
- It is worse than DistModel, partly because of its language dependence.
- FeatConcat is slightly better than FeatMix.

Larger-Scale Evaluation
WER% and speed-up of DistModel as the number of GPUs increases:

   Methods              WER%   Speed-up
   Monolingual DNN      72.5   ---
   Single GPU           65.7
   DistModel - 3 GPUs   66.2   2.4
   DistModel - 4 GPUs   66.7   3.1
   DistModel - 5 GPUs   66.8   3.4

- Consistent acceleration, although the improvement is not linear.
- Pooling more GPUs degrades WERs on the target language; this degradation might be mitigated by further optimization.

Acknowledgements
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.