Welcome IBM Watson researcher Dr. 詹益毅 to National Taiwan Normal University

Structured Output Layer Neural Network Language Model
Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, François Yvon (ICASSP 2011)
Presenter: 許曜麒

Outline
Introduction
Related Work
Structured Output Layer Neural Network Language Model
  Word clustering
  Training
Experimental Setup
Experimental Results
Conclusions and Future Work

Introduction (1/2) Neural network language models (NNLMs) are based on the idea of continuous word representation: distributionally similar words are represented as neighbors in a continuous space. Both the neural network approach and class-based models have been shown to be among the few approaches that provide significant recognition improvements over n-gram baselines for large-scale speech recognition tasks. Probably the major bottleneck with NNLMs is the computation of posterior probabilities in the output layer.
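To make that bottleneck concrete, a standard NNLM output layer normalizes over the entire output vocabulary. A sketch in generic notation (not the paper's exact symbols), where h is the hidden representation of the (n-1)-word history and V the output vocabulary:

P(w_t = i \mid h) = \frac{\exp(b_i + v_i^{\top} h)}{\sum_{j=1}^{|V|} \exp(b_j + v_j^{\top} h)}

Every probability evaluation therefore costs time proportional to |V|, which is exactly what short-lists and the SOUL hierarchy described below are designed to reduce.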

Introduction (2/2) Short-list NNLMs, which estimate probabilities only for the few thousand most frequent words, were proposed as a practical workaround for this problem. As opposed to standard NNLMs, SOUL NNLMs make it feasible to estimate n-gram probabilities for vocabularies of arbitrary size. As a result, all the vocabulary words, and not just the words in the short-list, can benefit from the improved prediction capabilities of the NNLMs.

Related Work (1/2) Short-list NNLMs are only used to predict a limited number of words, so the probability distribution must be normalized with a standard back-off LM, which is still needed to handle words outside the short-list. To handle large output vocabularies, a hierarchical structure of the output layer is used instead. In a nutshell, the output vocabulary is first clustered and represented by a binary tree.
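The slide does not spell out the normalization; a hedged sketch of the usual short-list scheme (Schwenk-style), where L denotes the short-list, P_{NN} the neural network distribution over L, and P_{B} the standard back-off LM:

P(w \mid h) =
\begin{cases}
P_{NN}(w \mid h) \cdot \sum_{v \in L} P_{B}(v \mid h) & \text{if } w \in L \\
P_{B}(w \mid h) & \text{otherwise}
\end{cases}

In this scheme the back-off LM both covers out-of-short-list words and supplies the probability mass that the neural network redistributes over the short-list.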

Related Work (2/2) Each internal node of the tree holds a word cluster, which is divided into two sub-clusters, and so on; leaves correspond to words at the end of this recursive representation of the vocabulary. A shortcoming of this approach is the recursive binary structure: if one word is poorly clustered, the error affects all the internal nodes (or clusters) that lead to this word.

SOUL NNLM (1/2) We assume that the output vocabulary is structured by a clustering tree, where each word belongs to only one class and its associated sub-classes.
[Figure: clustering tree for the example sentence "I see you" (w1, w2, w3): each word is reached by descending from its class c1 through sub-classes c2, ..., cD down to the word level.]

SOUL NNLM (2/2) Words in the short-list are a special case, since each of them represents its own class without sub-classes (D = 1 in this case).
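Putting these two slides together, the probability of a word decomposes along its path in the clustering tree; a sketch of the factorization, using the class notation from the figure (the paper's exact indexing may differ slightly):

P(w_t \mid h) = P\big(c_1(w_t) \mid h\big) \prod_{d=2}^{D} P\big(c_d(w_t) \mid h, c_1(w_t), \ldots, c_{d-1}(w_t)\big)

Here c_1(w_t), ..., c_{D-1}(w_t) are the class and sub-classes on the path to w_t and the last level identifies the word itself; for short-list words D = 1, so the top-level softmax predicts the word directly.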

SOUL NNLM – Word clustering
Step 1: Train a standard NNLM with the short-list as an output.
Step 2: Reduce the dimension of the context space using a principal component analysis (final dimension is 10 in our experiments).
Step 3: Perform the recursive K-means word clustering based on the distributed representation induced by the context space (except for words in the short-list).
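A minimal Python sketch of steps 2 and 3, assuming word_vectors holds the continuous word representations taken from the trained NNLM's context space; the helper name build_clustering_tree and the depth/branching parameters are illustrative, not taken from the paper:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def build_clustering_tree(vectors, word_ids, depth, branches, min_size=2):
    # Recursively K-means-cluster words to form the hierarchical output structure.
    if depth == 0 or len(word_ids) <= min_size:
        return list(word_ids)  # leaf: the words in this (sub-)class
    km = KMeans(n_clusters=min(branches, len(word_ids)), n_init=10)
    labels = km.fit_predict(vectors)
    tree = {}
    for c in range(km.n_clusters):
        mask = labels == c
        tree[c] = build_clustering_tree(
            vectors[mask],
            [w for w, keep in zip(word_ids, mask) if keep],
            depth - 1, branches, min_size)
    return tree

# Stand-in for the NNLM context-space word representations (random here, for illustration).
word_vectors = np.random.default_rng(0).normal(size=(5000, 200))
# Step 2: reduce to 10 dimensions with PCA.
reduced = PCA(n_components=10).fit_transform(word_vectors)
# Step 3: recursive K-means on the reduced vectors (short-list words would be excluded upstream).
tree = build_clustering_tree(reduced, list(range(len(reduced))), depth=3, branches=4)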

SOUL NNLM – Training Training of an NNLM is performed by maximizing the log-likelihood of the training data; this optimization is carried out by stochastic back-propagation. The training time per epoch for a SOUL model is only about 1.5 times longer than for an 8k short-list NNLM, and equal to that of a 12k short-list NNLM.
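For reference, the maximized objective over a training corpus w_1, ..., w_T is the usual one (regularization terms, if any, omitted):

\mathcal{L}(\theta) = \sum_{t=1}^{T} \log P\big(w_t \mid w_{t-n+1}, \ldots, w_{t-1}; \theta\big)

For SOUL, each term expands into the class-path factorization above, so a stochastic back-propagation update only touches the softmax layers on the path of the target word, which helps explain why the per-epoch time stays close to that of a short-list model.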

SOUL NNLM – Fig (1/2) [figure-only slide; diagram not reproduced in the transcript]

SOUL NNLM – Fig (2/2) [figure-only slide; diagram not reproduced in the transcript]

Experimental Setup To segment Chinese phrases into words, we use the simple longest-match segmentation algorithm based on the word vocabulary used in previous LIMSI Mandarin Chinese STT systems [13]. The baseline LM is a word-based 4-gram LM.
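A minimal sketch of greedy longest-match (maximum-matching) segmentation, assuming vocab is the word list of the STT system; this illustrates the general algorithm, not LIMSI's exact implementation:

def longest_match_segment(text, vocab, max_word_len=4):
    # Greedy left-to-right segmentation: at each position take the longest
    # vocabulary word that matches, falling back to a single character.
    words, i = [], 0
    while i < len(text):
        match = text[i]  # fallback: single character
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

vocab = {"台灣", "師範", "大學", "師範大學", "語言", "模型"}
print(longest_match_segment("台灣師範大學語言模型", vocab))
# ['台灣', '師範大學', '語言', '模型']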

Experimental Results (1/3) State-of-the-art n-gram language models are rarely of an order larger than 4: the gain from increasing the n-gram order from 4 to 5 is almost negligible, while the size of the models increases drastically. For NNLMs, in contrast, an increase in context length at the input layer results in at most linear growth in complexity, so training longer-context neural network models is still feasible.

Experimental Results (2/3) Contrary to classical back-off n-gram LMs, increasing the NNLM context length significantly improves the results, both in terms of perplexity and CER, without any major impact on training and probability-computation time.

Experimental Results (3/3) The gains attained with SOUL NNLMs correspond to relative improvements of 23% in perplexity and 7-9% in CER. SOUL NNLMs also outperform short-list NNLMs because they predict all words in the vocabulary (with all other parameters kept the same). The most significant improvement with SOUL models is obtained for the longer-context (6-gram) NNLM configuration.

Conclusions and Future Work The SOUL model combines two techniques that have been shown to improve STT system performance for large-scale tasks, namely neural network and class-based language models. This approach allows training neural network LMs with full vocabularies, without confining their predictive power to limited short-lists. As for future work, languages such as Russian or Arabic, though completely different in grammar and morphology, are both characterized by a large number of word forms for a given lemma, which results in vocabularies several times larger than those used for Chinese or English.