Cluster-Based Senone Selection for the Efficient Calculation of Deep Neural Network Acoustic Models
Jun-Hua Liu1,2, Zheng-Hua Ling1, Si Wei2, Guo-Ping Hu3, and Li-Rong Dai1
1 National Engineering Laboratory for Speech and Language Processing, USTC
2 Research Department, iFLYTEK Co. LTD.
3 Key Laboratory of Ministry of Public Security for Intelligent Speech Technology
2016-10-17

Good morning everyone. My name is Jun-Hua Liu, and in this report I will present our work on the efficient calculation of deep neural network acoustic models.

Outline
- Introduction
- Proposed Methods
- Experiments
- Conclusion

The report is organized into four sections: introduction, proposed methods, experimental results, and conclusion.

Introduction
Problem
- State-of-the-art ASR systems utilize various neural networks, e.g. DNNs, RNNs, etc.
- The decoding process is time-consuming, since the neural network calculation is very slow.
- One of the burdens is the large size of the output layer.

First of all, let me show you the problem we are trying to deal with. Current state-of-the-art automatic speech recognition systems are usually based on various neural networks, such as DNNs and RNNs. However, the decoding process is very time-consuming, since the neural network calculation is very slow. Part of the computational burden is due to the large output layer: for example, in a typical 6x2048 DNN with 6004 output senones (tied HMM states), the parameters of the last layer occupy about 35% of the total weights. In this work, we focus on how to efficiently calculate the senone posteriors in the DNN framework; the method can easily be extended to RNNs and other architectures.
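As a rough sanity check on the 35% figure (using the baseline topology described later: (72+3)x11 input features, six 2048-unit hidden layers, 6004 output senones; biases ignored for simplicity):

```python
# Rough weight count for the baseline 6x2048 DNN with 6004 output senones.
input_dim = (72 + 3) * 11          # 825-dimensional spliced input
hidden_dim, hidden_layers = 2048, 6
num_senones = 6004

input_weights = input_dim * hidden_dim                   # ~1.7M
hidden_weights = (hidden_layers - 1) * hidden_dim ** 2   # ~21.0M
output_weights = hidden_dim * num_senones                # ~12.3M

total = input_weights + hidden_weights + output_weights
print(f"total weights ~= {total / 1e6:.1f}M")                 # ~35.0M
print(f"output-layer share ~= {output_weights / total:.0%}")  # ~35%
```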

Introduction
Current DNN optimization approaches
- Reduce hidden-layer computation
  - Low-rank weight factorization / SVD [1, 2]
  - Reshaping the DNN structure by node pruning [3]
- Reduce output-layer computation
  - Lazy evaluation during decoding, i.e. only calculating senone posteriors when needed [4]

There are two approaches to optimizing DNN computation. The first tries to reduce the hidden-layer computation, for example by low-rank weight decomposition (SVD) or by reshaping the DNN structure through node pruning. The second approach reduces the output-layer computation, where lazy evaluation is often employed: a senone posterior is calculated only when it is needed during Viterbi decoding. However, performance degrades, because the softmax posteriors are then estimated over an unstable subset of senones that depends on the decoding network and decoding parameters.
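For illustration only (this is the related low-rank idea of [1, 2], not the method proposed here), a minimal numpy sketch of restructuring a large weight matrix into two thinner factors:

```python
import numpy as np

# Approximate a big output weight matrix W (2048 x 6004) by two thinner
# factors A (2048 x r) and B (r x 6004), so the per-frame cost drops from
# O(m*n) to O(r*(m+n)). A random matrix stands in for trained weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 6004))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 256                              # retained rank (hypothetical value)
A = U[:, :r] * s[:r]                 # 2048 x r
B = Vt[:r, :]                        # r x 6004

h = rng.standard_normal(2048)        # a last-hidden-layer activation
y_full = h @ W                       # original output pre-activations
y_fast = (h @ A) @ B                 # two smaller matrix products
err = np.linalg.norm(y_fast - y_full) / np.linalg.norm(y_full)
print(f"relative error at rank {r}: {err:.3f}")
```

On a random matrix the rank-256 error is large; trained acoustic-model weight matrices are far more compressible, which is what [1, 2] exploit.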

Introduction
Motivation
- DNN output posteriors are dominated by a few senones with large posteriors.
- Only compute the senones with large posteriors, and set the remaining outputs to a floor value.

In our experimental analysis, the DNN output posteriors are dominated by a few senones with very large values, while the others are close to 0. As a demonstration, we sorted the posteriors of every frame in descending order and averaged them over frames. The accumulated posterior of the top 60 senones exceeds 0.9, and that of the top 120 exceeds 0.95. So we wondered whether it is possible to compute only these large senone posteriors and set the remaining outputs to a floor value.
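The kind of analysis described above can be sketched as follows (not the authors' code; synthetic softmax outputs stand in for real DNN posteriors, so the printed numbers will not reproduce the 0.9/0.95 values):

```python
import numpy as np

# Sort each frame's senone posteriors in descending order, average the sorted
# vectors over frames, and inspect the cumulative sum of the average.
rng = np.random.default_rng(0)
num_frames, num_senones = 1000, 6004
logits = rng.standard_normal((num_frames, num_senones)) * 4.0   # peaky softmax
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)

sorted_post = np.sort(post, axis=1)[:, ::-1]     # descending order per frame
avg_sorted = sorted_post.mean(axis=0)            # average sorted posterior
cum = np.cumsum(avg_sorted)
for n in (60, 120):
    print(f"top-{n} accumulated posterior: {cum[n - 1]:.3f}")
```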

Outline
- Introduction
- Proposed Methods
- Experiments
- Conclusion

The report is organized into four sections: introduction, proposed methods, experimental results, and conclusion.

Proposed methods
Question: how to determine the senones with large posteriors for each frame?
Solution (see the sketch below):
- Cluster all frame features
- Estimate the senones with large posteriors for each cluster
- Select senones for each frame according to its cluster
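Below is a minimal sketch of how the selected senone subsets might be used when computing one frame's posteriors at decode time. It assumes the cluster centroids and per-cluster senone index lists from the following slides are already available, replaces the two-level hierarchy with a flat nearest-centroid search for brevity, and all names are illustrative rather than the authors' implementation:

```python
import numpy as np

def senone_posteriors(hidden, centroids, cluster_senones, W_out, b_out,
                      floor=1e-6):
    """hidden: last-hidden-layer linear output for one frame."""
    # 1) Assign the frame to its nearest cluster centroid.
    c = int(np.argmin(np.linalg.norm(centroids - hidden, axis=1)))
    idx = cluster_senones[c]               # senones selected for cluster c
    # 2) Compute the output layer only for the selected senones.
    logits = hidden @ W_out[:, idx] + b_out[idx]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # 3) Floor the remaining senones instead of computing them.
    post = np.full(W_out.shape[1], floor)
    post[idx] = probs
    return post
```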

Proposed methods
Cluster-based senone selection: DNN structure

Proposed methods
Cluster
- Two-level K-means clustering for all frames
- Input feature: last hidden layer linear output
- Two-level clustering for efficiency
- C1 = C2 = 32 in our experiment
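A rough sketch of what this two-level clustering could look like, assuming the hierarchy is "C1 top-level k-means clusters, each split again into C2 sub-clusters"; this is an interpretation of the slide, not the authors' implementation, and synthetic features stand in for the real last-hidden-layer activations:

```python
import numpy as np
from sklearn.cluster import KMeans

def two_level_kmeans(features, c1=32, c2=32, seed=0):
    """Cluster frames into up to c1*c2 clusters; return centroids and labels."""
    top = KMeans(n_clusters=c1, random_state=seed, n_init=4).fit(features)
    centroids, labels = [], np.empty(len(features), dtype=int)
    for i in range(c1):
        mask = top.labels_ == i
        sub_feats = features[mask]
        sub = KMeans(n_clusters=min(c2, len(sub_feats)),
                     random_state=seed, n_init=4).fit(sub_feats)
        labels[mask] = len(centroids) + sub.labels_
        centroids.extend(sub.cluster_centers_)
    return np.vstack(centroids), labels

# Toy demo; the real system clusters frames from 325 hours of speech.
features = np.random.default_rng(0).standard_normal((10000, 2048))
centroids, frame_cluster = two_level_kmeans(features)
print(centroids.shape)   # up to (32 * 32, 2048) cluster centroids
```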

Proposed methods
Estimate senones with large posteriors
- Suppose there are $K_c$ frames in the c-th cluster
- Compute the average output $\bar{\boldsymbol{y}}^{L} = \frac{1}{K_c} \sum_{t=1}^{K_c} \boldsymbol{y}^{L}(t)$
- Sort $\bar{\boldsymbol{y}}^{L}$ and select the largest $N_c$ senones, whose accumulated posterior exceeds a predefined confidence threshold $\alpha$

$\alpha$ is a control parameter.
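A minimal numpy sketch of this per-cluster selection rule; it assumes the per-frame output posteriors and frame-to-cluster assignments are already available, and the function and variable names are illustrative:

```python
import numpy as np

def select_senones(posteriors, frame_cluster, num_clusters, alpha=0.97):
    """For each cluster, keep the smallest senone set whose accumulated
    average posterior exceeds alpha."""
    cluster_senones = []
    for c in range(num_clusters):
        avg = posteriors[frame_cluster == c].mean(axis=0)   # average output
        order = np.argsort(avg)[::-1]                       # descending sort
        cum = np.cumsum(avg[order])
        n_c = int(np.searchsorted(cum, alpha)) + 1          # smallest N_c
        cluster_senones.append(order[:n_c])
    return cluster_senones

# Toy usage with synthetic posteriors and random cluster assignments.
rng = np.random.default_rng(0)
logits = rng.standard_normal((2000, 6004)) * 4.0
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)
sets = select_senones(post, rng.integers(0, 16, size=2000), num_clusters=16)
print([len(s) for s in sets[:4]])     # selected-set sizes for a few clusters
```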

Outline
- Introduction
- Proposed Methods
- Experiments
- Conclusion

The report is organized into four sections: introduction, proposed methods, experimental results, and conclusion.

Experiments
Experimental setup
- We evaluate the method on both ASR and KWS tasks.
- Dataset: spontaneous Mandarin continuous telephone speech
  - 325 hours for clustering and senone selection
  - 1.5 hours of speech for development
  - 2.3 hours of speech for LVCSR evaluation
  - 14 hours of speech for KWS evaluation (274 keywords)
- Baseline DNN structure: (72+3)x11 input features, 6x2048 hidden layers, and 6004 output senones

Experiments
Evaluation criteria
- ACC/CRR for LVCSR performance
- Recall rate and F1 for KWS performance
- Real-time (RT) factor to indicate speed performance
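For reference (the definition is not given on the slide but is the conventional one), the real-time factor is the ratio of processing time to audio duration:

$$\mathrm{RT} = \frac{T_{\text{processing}}}{T_{\text{audio}}}, \qquad \mathrm{RT} < 1 \Rightarrow \text{faster than real time}$$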

Experiments
Parameter tuning on the development data

Influence of the selection confidence α (M: average number of selected output senones over all frames):

α       | Baseline | 0.95  | 0.96  | 0.97  | 0.98
M       | 6004     | 725   | 824   | 973   | 1220
ACC(%)  | 75.80    | 75.38 | 75.63 | 75.74 | 75.86

Influence of the floor value V_f:

V_f     | Baseline | 1.0E-05 | 1.0E-06 | 1.0E-07 | (?)
ACC(%)  | 75.80    | 70.40   | 75.90   | 75.86   | 68.82

α is the predefined accumulated-posterior threshold. We found that it is feasible to compute only the large-posterior senones and set the others to a floor value, so we next test the method in real applications.

Experiments
ASR and KWS results (single CPU of an Intel Xeon Dual Core E5-2620)
- LVCSR performance is kept, while KWS performance is slightly degraded.
- Last layer: 6004 -> 2222 outputs; parameters: 35M -> 27.2M.
- The whole speech processing is accelerated by about 13% relative (RT 0.3084 -> 0.2680 for LVCSR).

Methods   | LVCSR Task      | KWS Task
          | CRR(%)   RT     | Recall(%)  F1(%)  RT
Baseline  | 75.98    0.3084 | 83.22      74.74  0.3095
Proposed  | 75.99    0.2680 | 83.10      74.59  0.2669

The hardware is a single CPU of an E5-2620. The LVCSR performance is kept, while the KWS performance is slightly degraded. The reason may be that the small-posterior senones are all set to the same floor value, so the distinctions between them are eliminated and candidates other than the best path are affected. However, the degradation is very small and acceptable in practice. More importantly, the whole speech processing is accelerated by about 13% relative.

Outline
- Introduction
- Proposed Methods
- Experiments
- Conclusion

The report is organized into four sections: introduction, proposed methods, experimental results, and conclusion.

Conclusion
- In this work, we presented an efficient method to calculate DNN output senone posteriors.
- It can be combined with low-rank weight factorization, and extended to RNNs.
- Future work will focus on other clustering strategies and on DNN retraining with senone selection.

References
[1] T. N. Sainath, et al., "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. ICASSP, 2013, pp. 6655-6659.
[2] J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Proc. Interspeech, 2013.
[3] T. He, Y. Fan, Y. Qian, T. Tan, and K. Yu, "Reshaping deep neural network for fast decoding by node-pruning," in Proc. ICASSP, 2014, pp. 245-249.
[4] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen, "Accurate and compact large vocabulary speech recognition on mobile devices," in Proc. Interspeech, 2013, pp. 662-665.

Thanks & Questions