Multiple parallel hidden layers and other improvements to recurrent neural network language modeling ICASSP 2013 Diamantino Caseiro, Andrej Ljolje AT&T.

Presentation transcript:

Multiple parallel hidden layers and other improvements to recurrent neural network language modeling ICASSP 2013 Diamantino Caseiro, Andrej Ljolje AT&T Labs Research, 180 Park Avenue, Florham Park, NJ 07932, USA 2014/08/14 Presenter: 陳思澄

Outline
• Introduction
• Output layer decomposition
• Multiple parallel hidden layers
• ASR results
• Conclusions

Introduction
• Recurrent neural network language models (RNNLMs) have been shown to outperform most other advanced language modeling techniques; however, they suffer from high computational complexity.
• In this paper, we present techniques for building faster and more accurate RNNLMs.
• In particular, we show that Brown clustering of the vocabulary is much more effective than other word-classing techniques.
• We also present an algorithm for converting an ensemble of RNNLMs into a single model that can be further tuned or adapted.

• Recurrent neural network language model architecture (figure not reproduced in this transcript).
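Since the architecture figure is not reproduced here, the following is a minimal stand-in sketch of one forward step of an Elman-style RNNLM of the kind this work builds on. The names U (input weights), R (recurrent weights), V (output weights) and s (hidden state) are my own labels, chosen to match the notation used later in these slides, not the paper's code.

```python
import numpy as np

def rnnlm_step(x_onehot, s_prev, U, R, V):
    """One forward step of a simple (Elman-type) recurrent NN language model.

    x_onehot : (v,)  one-hot vector for the current word
    s_prev   : (h,)  hidden state from the previous time step
    U : (h, v) input-to-hidden weights
    R : (h, h) recurrent hidden-to-hidden weights
    V : (v, h) hidden-to-output weights
    Returns the new hidden state and P(next word | history).
    """
    s = 1.0 / (1.0 + np.exp(-(U @ x_onehot + R @ s_prev)))  # sigmoid hidden layer
    z = V @ s                                               # output activations, cost ~ h*v
    z -= z.max()                                            # for numerical stability
    p = np.exp(z) / np.exp(z).sum()                         # softmax over the vocabulary
    return s, p
```

The `V @ s` product at the end is the h × v term whose cost the next slide addresses.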

Output layer decomposition
• The computational complexity of an RNNLM is very high in both training and testing. Given a vocabulary of size v and h hidden units, the runtime complexity is dominated by the h × v output-layer computation.
• A very effective technique reduces this complexity by a factor of up to roughly √v / 2: partition the vocabulary into word classes and decompose P(w | context) = P(c(w) | context) · P(w | c(w), context), so that only the class layer and the words inside c(w) have to be evaluated. With about √v classes, the per-word output cost drops from h·v to roughly 2h√v.
• Words are ranked by their frequency and then assigned, in rank order, to classes so that the total frequency of each class is close to uniform.
• This technique is attractive because it is linear in the size of the corpus and it creates balanced classes; however, it does not take any measure of modeling accuracy into account.
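As an illustration of the decomposition above, here is a hedged sketch that builds roughly uniform-frequency classes and evaluates the factored probability. The variable and matrix names (word2class, class_members, Vc, Vw) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def frequency_bins(word_counts, n_classes):
    """Assign words (ranked by frequency) to classes of roughly uniform total frequency."""
    order = sorted(word_counts, key=word_counts.get, reverse=True)
    budget = sum(word_counts.values()) / n_classes
    word2class, c, acc = {}, 0, 0.0
    for w in order:
        word2class[w] = c
        acc += word_counts[w]
        if acc >= budget and c < n_classes - 1:   # move on to the next class
            c, acc = c + 1, 0.0
    return word2class

def class_factored_prob(s, w, word2class, class_members, Vc, Vw):
    """P(w | context) = P(c(w) | context) * P(w | c(w), context).

    s             : (h,) hidden state
    w             : index of the predicted word
    word2class    : dict word index -> class id
    class_members : dict class id -> list of word indices in that class
    Vc            : (n_classes, h) hidden-to-class weights
    Vw            : (v, h) hidden-to-word weights (only the rows of c(w) are touched)
    """
    c = word2class[w]
    zc = Vc @ s
    p_class = np.exp(zc - zc.max())
    p_class /= p_class.sum()
    members = class_members[c]
    zw = Vw[members] @ s
    p_word = np.exp(zw - zw.max())
    p_word /= p_word.sum()
    return p_class[c] * p_word[members.index(w)]
```

Only the class layer and the output rows belonging to the word's own class are evaluated, which is the source of the speed-up over the full softmax.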

• We compare the speed and accuracy of this technique with Brown clustering.
• Since Brown clustering groups words that share similar bigram contexts, we expect it to yield lower perplexity than frequency binning.

Multiple parallel hidden layers (MPHL)
• It has been shown that linear interpolation of many RNNLMs greatly reduces perplexity.
• While effective, an ensemble of models offers few tuning options beyond optimizing the interpolation weights.
• Both the linear and the geometric-average model combination methods are well justified: each computes the model with the minimum average Kullback-Leibler divergence to all models in the ensemble (with the divergence taken in opposite directions in the two cases).
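To make the two combination rules concrete, here is a minimal sketch of linear and weighted geometric (log-linear) interpolation of two next-word distributions over the same vocabulary. The renormalization in the geometric case is needed because a geometric mean of normalized distributions is not itself normalized.

```python
import numpy as np

def linear_interp(p1, p2, lam):
    """Weighted arithmetic mean of two next-word distributions (same vocabulary)."""
    return lam * p1 + (1.0 - lam) * p2

def geometric_interp(p1, p2, lam):
    """Weighted geometric mean, renormalized (log-linear interpolation)."""
    q = (p1 ** lam) * (p2 ** (1.0 - lam))
    return q / q.sum()
```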

• Consider the weighted geometric average of two models:
P(w | context) ∝ P1(w | context)^λ · P2(w | context)^(1−λ)
• Replacing each Pi by the definition of the exponential model (equation 4) and canceling the normalization terms:
P(w | context) ∝ exp( λ · V1[w]·s1 + (1−λ) · V2[w]·s2 )
where s1 and s2 are the hidden-layer activations of the two sub-models and Vi[w] is the row of output weights for word w.
• Finally: the combination is itself a single softmax model whose hidden layer is the concatenation [s1; s2] and whose output matrix is [λV1, (1−λ)V2].
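The identity sketched above can be checked numerically: the renormalized weighted geometric average of two softmax outputs equals a single softmax applied to the scaled output matrices and the concatenated hidden states. The shapes below are arbitrary; this is a verification sketch, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
h1, h2, v, lam = 5, 7, 20, 0.3
V1, V2 = rng.normal(size=(v, h1)), rng.normal(size=(v, h2))
s1, s2 = rng.normal(size=h1), rng.normal(size=h2)

p1, p2 = softmax(V1 @ s1), softmax(V2 @ s2)

# Renormalized weighted geometric average of the two output distributions.
geo = (p1 ** lam) * (p2 ** (1.0 - lam))
geo /= geo.sum()

# Equivalent single model: scaled output matrices, concatenated hidden state.
V_merged = np.hstack([lam * V1, (1.0 - lam) * V2])      # shape (v, h1 + h2)
s_merged = np.concatenate([s1, s2])
merged = softmax(V_merged @ s_merged)

assert np.allclose(geo, merged)   # both formulations give the same distribution
```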

• Matrices V1 and V2 are linearly scaled by λ and 1−λ, respectively.
• R1 and R2 are combined into an (h1+h2) × (h1+h2) block-diagonal matrix.
• The recurrent weights between hidden-layer nodes of different sub-models are set to zero.
• This means that there are fewer independent parameters than in a single large model trained from scratch.
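A sketch of the corresponding parameter merge, using the slide's names V1, V2, R1, R2 (input weights, biases and the class layer are omitted for brevity); the function itself is illustrative rather than the paper's code.

```python
import numpy as np

def merge_rnnlm_params(V1, V2, R1, R2, lam):
    """Merge two RNNLM sub-models into a single set of parameters.

    V1, V2 : (v, h1) and (v, h2) hidden-to-output matrices
    R1, R2 : (h1, h1) and (h2, h2) recurrent matrices
    Returns the merged output matrix of shape (v, h1+h2) and a block-diagonal
    recurrent matrix of shape (h1+h2, h1+h2); the off-diagonal blocks (weights
    between hidden units of different sub-models) are initialized to zero.
    """
    V = np.hstack([lam * V1, (1.0 - lam) * V2])
    h1, h2 = R1.shape[0], R2.shape[0]
    R = np.zeros((h1 + h2, h1 + h2))
    R[:h1, :h1] = R1
    R[h1:, h1:] = R2
    return V, R
```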

• An alternative implementation of the combination keeps the hidden layers separate and averages the last layer before computing the softmax.
• This was our chosen implementation because it parallelizes trivially (see the sketch below).
Table 3. Perplexity of MPHL, linear and log-linear interpolation of multiple models on the Penn corpus. All models use 200 frequency classes. (Table values not reproduced in this transcript.)
Table 4. Perplexity of MPHL, linear and log-linear interpolation of multiple models on the Penn corpus. All models use 200 Brown classes. (Table values not reproduced in this transcript.)
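A sketch of the alternative implementation mentioned above, under the assumption that "the last layer before computing the softmax" means each sub-model's weighted output activations: the hidden layers stay separate (so each sub-model's recurrence can run in its own thread or process), and only the summed activations go through a single softmax. The function name and signature are illustrative.

```python
import numpy as np

def mphl_predict(hidden_states, output_mats, weights):
    """Combine sub-models by averaging their weighted output activations
    before a single softmax; the hidden layers themselves stay separate.

    hidden_states : list of (h_i,) hidden states, one per sub-model
    output_mats   : list of (v, h_i) output matrices
    weights       : interpolation weights (summing to 1)
    """
    z = sum(w * (V @ s) for w, V, s in zip(weights, output_mats, hidden_states))
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()
```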