Improving Language-Universal Feature Extraction with Deep Maxout and Convolutional Neural Networks
Yajie Miao, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
{ymiao, fmetze}@cs.cmu.edu

1. Introduction

- DNNs have become the state of the art for speech recognition, and they provide an architecture particularly suitable for multilingual and cross-lingual ASR.
- DNN-based language-universal feature extraction (LUFE) was proposed in [1]: a multilingual DNN is learned with hidden layers shared across languages, while each language keeps its own input features and softmax output layer.
- On a new (target) language, the shared hidden layers act as a deep feature extractor, and hybrid DNN models are built over the feature representations from this extractor. This realizes cross-language knowledge transfer.
- Goal: improve LUFE via maxout and convolutional networks, so that the extractor generates sparse and invariant feature representations.

[Diagram: LUFE with per-language inputs (Lang 1/2/3 input) and softmax layers (Lang 1/2/3 softmax) over shared hidden layers; on the target language, the shared layers serve as the feature extractor for a hybrid DNN (source vs. target).]

2. LUFE with Convolutional Networks

- CNN inputs: 11 frames of 30-dimensional filterbank (fbank) features.
- Convolution is applied only along the frequency axis.
- Network structure: 11x30 → 100x11x5 → 200x100x4 → 1024:1024:1024, with a pooling size of 2.

[Diagram: LUFE with CNNs.]

3. Sparse Feature Extraction

Maxout Networks for Sparse Feature Extraction
- Maxout networks [3] partition the hidden units into groups; each group outputs the maximum value within it as its activation.
- After the maxout network is trained, sparse representations can be generated from any of the maxout layers via a non-maximum masking operation, which zeros every unit that is not the maximum of its group.
- Non-maximum masking happens only during the feature extraction stage; the training stage always applies max-pooling. (A minimal sketch of both operations follows after this section.)

[Diagram: LUFE with maxout: maxout network, maxout layer, non-max masking, hybrid DNN.]

Measuring Sparsity
- Rectifier networks [4] also generate features containing exact zeros.
- Sparsity is quantified with the population sparsity metric [5]: given one speech frame, let f_m be its feature representation; pSparsity is computed from f_m and averaged over all frames, with lower values indicating sparser representations.

Combination of CNN and Maxout Networks
- Keep the convolutional layers unchanged and replace the fully connected layers with maxout layers.
- This yields feature representations that are both invariant and sparse (see the extractor sketch below).
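The following is a minimal NumPy sketch of the two maxout operations described above (max-pooling at training time, non-maximum masking at feature-extraction time), together with one plausible form of the pSparsity measure. The function names and the exact formula (the mean L1/L2 norm ratio per frame, in the spirit of the population sparsity measure of [5]) are assumptions for illustration, not the poster's actual code.

```python
import numpy as np

def maxout_forward(z, group_size=2):
    """Training-time maxout activation: keep only the maximum of each group of units."""
    groups = z.reshape(z.shape[0], -1, group_size)
    return groups.max(axis=2)                       # (n_frames, n_units // group_size)

def maxout_masked(z, group_size=2):
    """Feature-extraction-time non-maximum masking: zero every unit that is not
    the maximum of its group, keeping the original (now sparse) dimensionality."""
    groups = z.reshape(z.shape[0], -1, group_size)
    mask = groups == groups.max(axis=2, keepdims=True)
    return (groups * mask).reshape(z.shape)         # (n_frames, n_units), mostly zeros

def p_sparsity(features, eps=1e-12):
    """Assumed population-sparsity measure: mean L1/L2 norm ratio per frame;
    lower values correspond to sparser representations."""
    l1 = np.abs(features).sum(axis=1)
    l2 = np.sqrt((features ** 2).sum(axis=1)) + eps
    return float((l1 / l2).mean())

# Toy example: 100 frames with 1024 maxout pre-activations each.
z = np.random.randn(100, 1024)
h = maxout_forward(z)                    # what the next layer sees during training
print("dense features:  pSparsity =", p_sparsity(z))
print("masked features: pSparsity =", p_sparsity(maxout_masked(z)))
```

On random inputs the masked maxout features give a clearly lower value than the dense pre-activations, consistent with the table in Section 5, where sparser extractors have lower pSparsity.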
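And this is a rough PyTorch sketch of how the combined Maxout-CNN extractor could be wired up: a convolution front end that convolves and pools only along the frequency axis, shared maxout fully connected layers, and one output layer per source language, as in the multilingual LUFE setup of Section 1. The class names, filter widths, the ReLU after each convolution, and the maxout group size are assumptions; only the overall layout (11x30 fbank input, frequency-only convolution with pooling size 2, three 1024-unit shared layers, per-language softmax layers) follows the poster.

```python
import torch
import torch.nn as nn


class MaxoutLinear(nn.Module):
    """Fully connected maxout layer: the activation is the max over each group of linear units."""
    def __init__(self, in_dim, out_dim, group_size=2):
        super().__init__()
        self.group_size = group_size
        self.linear = nn.Linear(in_dim, out_dim * group_size)

    def forward(self, x):
        z = self.linear(x)                             # (batch, out_dim * group_size)
        z = z.view(x.size(0), -1, self.group_size)     # (batch, out_dim, group_size)
        return z.max(dim=2).values                     # (batch, out_dim)


class MaxoutCnnExtractor(nn.Module):
    """Frequency-only convolution front end + shared maxout layers + per-language heads."""
    def __init__(self, senones_per_lang, feat_dim=1024):
        super().__init__()
        # Input is treated as a 1 x 11 x 30 "image": 11 stacked frames of 30-dim fbank.
        # Kernels of shape (1, k) convolve along the frequency axis only.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 100, kernel_size=(1, 8)), nn.ReLU(),    # filter width 8: assumption
            nn.MaxPool2d(kernel_size=(1, 2)),                    # pooling size 2 on frequency
            nn.Conv2d(100, 200, kernel_size=(1, 4)), nn.ReLU(),  # filter width 4: assumption
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        conv_out = 200 * 11 * self._freq_dim(30)
        self.shared = nn.Sequential(                             # language-universal layers
            MaxoutLinear(conv_out, feat_dim),
            MaxoutLinear(feat_dim, feat_dim),
            MaxoutLinear(feat_dim, feat_dim),
        )
        # One output layer (softmax over senones) per source language.
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in senones_per_lang])

    @staticmethod
    def _freq_dim(n_freq):
        # Frequency dimension after the assumed conv/pool stack above.
        n = (n_freq - 8 + 1) // 2    # conv width 8, then pool 2
        n = (n - 4 + 1) // 2         # conv width 4, then pool 2
        return n

    def forward(self, x, lang):
        # x: (batch, 11, 30) stacked fbank frames of one source language.
        h = self.conv(x.unsqueeze(1)).flatten(1)    # frequency-only convolution features
        h = self.shared(h)                          # sparse, invariant shared representation
        return self.heads[lang](h)                  # senone logits for that language


# Example: three source languages with per-language tied-state counts (illustrative order).
model = MaxoutCnnExtractor(senones_per_lang=[1867, 1854, 1985])
logits = model(torch.randn(4, 11, 30), lang=0)
print(logits.shape)   # torch.Size([4, 1867])
```

For the cross-lingual step, the trained conv and shared stacks would be frozen and reused as the feature extractor on the target language, with a new hybrid DNN trained on top of their output.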
4. Experimental Setup

BABEL corpus: Base Period languages
- Tagalog (IARPA-babel106-v0.2f)
- Cantonese (IARPA-babel101-v0.4c)
- Turkish (IARPA-babel105b-v0.4)
- Pashto (IARPA-babel104b-v0.4aY)

Statistics             Tagalog (target)   Cantonese   Turkish   Pashto
# training speakers    132                120         121
training (hours)       10.7               17.8        9.8
dict size              8k                 7k          12k
# tied states          1920               1867        1854      1985

5. Experiment Results and Observations

On the target language Tagalog, the identical DNN topology is used for the hybrid systems built over the different feature extractors. We report WERs (%) on a 2-hour Tagalog testing set; pSparsity is computed as an average over the entire Tagalog training set.

Models                 WER (%)   pSparsity
Monolingual baseline
  Monolingual DNN      70.8      -----
  Monolingual CNN      68.2      -----
LUFE
  DNN-LUFE             69.6      21.3
  CNN-LUFE             67.1      20.4
  Rectifier-LUFE       68.2      10.7
  Maxout-LUFE          67.5      17.7
  Maxout-CNN-LUFE      65.9      16.6

More Comparison
- Applying LUFE consistently improves over the monolingual DNN.
- The CNN extractor outperforms the DNN extractor by 2.5% absolute WER.
- Maxout networks generate sparse features and better WERs; rectifier networks output even sparser features but worse WERs. Over-sparsification may hurt speech recognition performance.
- Combining maxout and CNN results in the best feature extractor.

6. References

[1] J. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in Proc. ICASSP, 2013.
[2] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in Proc. ICASSP, pp. 8614-8618, 2013.
[3] Y. Miao, F. Metze, and S. Rawat, “Deep maxout networks for low-resource speech recognition,” in Proc. ASRU, 2013.
[4] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. AISTATS, 2011.
[5] J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, and A. Y. Ng, “Sparse filtering,” in Proc. NIPS, 2011.

Acknowledgements

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.