RECURRENT NEURAL NETWORKS FOR VOICE ACTIVITY DETECTION

Presentation transcript:

RECURRENT NEURAL NETWORKS FOR VOICE ACTIVITY DETECTION Authors: Thad Hughes and Keir Mierle Speaker: 佳樺

Outline INTRODUCTION RECURRENT NEURAL NETWORK MODEL TRAINING PROCEDURE EXPERIMENTS CONCLUSION

What is voice activity detection? In communications: it reduces energy consumption. In speech recognition: it reduces the impact of noise.

INTRODUCTION One way to build a VAD system involves two GMMs: one modeling speech frames and one modeling non-speech frames. A problem inherent to many current VAD techniques is that frame classification and temporal smoothing cannot be easily optimized simultaneously. HMM VAD systems typically have a small number of hidden states; processing each frame independently fails to account for the lack of temporal conditional independence of speech frames, and the discrete HMM state space implies that the model cannot "remember" much about the past. RNNs address these limitations: they can be discriminatively optimized for frame classification while simultaneously learning temporal smoothing.

Quadratic nodes Our RNN nodes compute quadratic functions of their inputs, followed by an optional non-linearity: $V(x) = f(x^T W_Q x + \omega_L^T x + \omega_B)$, where $W_Q$ is an upper-triangular sparse matrix with weights for the quadratic terms, $\omega_L$ is a vector of linear weights similar to those in MLPs, and $\omega_B$ is a scalar bias.
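A minimal NumPy sketch of a single quadratic node, assuming the formula above; the function and parameter names (quadratic_node, W_Q, w_L, w_B) are our own illustration, not code from the paper.

```python
import numpy as np

def quadratic_node(x, W_Q, w_L, w_B, f=np.tanh):
    # V(x) = f(x^T W_Q x + w_L^T x + w_B); pass f=None to skip the
    # optional non-linearity. W_Q is expected to be upper-triangular
    # and sparse, w_L is a linear weight vector, w_B is a scalar bias.
    pre_activation = x @ W_Q @ x + w_L @ x + w_B
    return pre_activation if f is None else f(pre_activation)

# Toy usage with a 3-dimensional input:
x = np.array([0.5, -1.0, 0.2])
W_Q = np.triu(np.full((3, 3), 0.1))  # upper-triangular quadratic weights
w_L = np.array([0.3, -0.2, 0.1])
print(quadratic_node(x, W_Q, w_L, w_B=0.05))
```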

VAD RNN architecture and initialization (Fig. 1: diagram of the network; each node shown is a quadratic node)

VAD RNN architecture and initialization Let $H1[T]$ denote the vector of outputs of the nodes in layer $H1$ at timestep $T$. Then for every node $n$ in layer $H1$, the input vector $x_n[T]$ and the sparsity pattern of $W_Q$ (other layers follow the same pattern) are: $x_n[T] = \begin{bmatrix} H0[T] \\ H0[T-1] \\ H1[T-1] \end{bmatrix}$, $\mathrm{sparsity}(W_Q) = \begin{bmatrix} I_{3\times 3} & I_{3\times 3} & 0 \\ 0 & I_{3\times 3} & 0 \\ 0 & 0 & 0 \end{bmatrix}$
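To make this layout concrete, here is a small sketch (our own naming, and assuming three nodes per layer, hence the 3x3 identity blocks) that builds the input vector fed to an H1 node and the corresponding block sparsity mask for $W_Q$.

```python
import numpy as np

def h1_node_input(H0_T, H0_Tm1, H1_Tm1):
    # x_n[T] = [H0[T]; H0[T-1]; H1[T-1]]: current lower-layer outputs
    # plus the previous timestep's lower- and same-layer outputs.
    return np.concatenate([H0_T, H0_Tm1, H1_Tm1])

def wq_sparsity_mask(n=3):
    # Block pattern from the slide: quadratic terms couple H0[T] with
    # itself and with H0[T-1], and H0[T-1] with itself; every block
    # involving H1[T-1] is zero.
    I, Z = np.eye(n), np.zeros((n, n))
    return np.block([[I, I, Z],
                     [Z, I, Z],
                     [Z, Z, Z]])
```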

VAD RNN architecture and initialization Compute the difference between the RNN's output node $N_{Output}[T]$ and a slightly delayed target output: $ERROR[T] = N_{Output}[T] - Target[T - \Delta]$, with a fixed delay $\Delta = 10$.
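A one-function sketch of the delayed-error computation above (illustrative naming; the first Δ frames, which have no valid target yet, are simply dropped).

```python
import numpy as np

def delayed_errors(outputs, targets, delay=10):
    # ERROR[T] = output[T] - target[T - delay] for T >= delay.
    outputs, targets = np.asarray(outputs), np.asarray(targets)
    return outputs[delay:] - targets[:len(targets) - delay]
```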

TRAINING PROCEDURE Input: 13-dimensional PLP features, without deltas or double-deltas. The target output at each timestep is a single value indicating whether the frame $\Delta$ timesteps ago is speech or non-speech; the targets are generated by forced alignment of the audio with a human transcription using a monophone acoustic model.

TRAINING PROCEDURE We use the Ceres Solver to minimize the sum of the squares of all errors. Its automatic differentiation computes exact first derivatives, which greatly simplifies experimentation. Training proceeds in two stages (see the sketch below): first, we fix all recurrent parameters and train only the feedforward parameters (those associated with vertical arrows in Fig. 1); then we optimize all the parameters together, including the $\omega_L$ weights controlling the tapped delay line.
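The paper trains with the Ceres Solver, a C++ nonlinear least-squares library with automatic differentiation; the Python sketch below only illustrates the two-stage schedule using SciPy, with rnn_residuals standing in as a hypothetical function that unrolls the RNN over the training data and returns the vector of delayed per-frame errors.

```python
import numpy as np
from scipy.optimize import least_squares

def two_stage_fit(params0, rnn_residuals, is_recurrent):
    # params0: initial parameter vector; is_recurrent: boolean mask
    # marking the recurrent (tapped-delay-line) parameters.
    params0 = np.asarray(params0, dtype=float)

    def stage1_residuals(ff_params):
        # Stage 1: recurrent parameters stay at their initial values;
        # only the feedforward parameters are free to move.
        full = params0.copy()
        full[~is_recurrent] = ff_params
        return rnn_residuals(full)

    stage1 = least_squares(stage1_residuals, params0[~is_recurrent])
    warm = params0.copy()
    warm[~is_recurrent] = stage1.x

    # Stage 2: optimize everything jointly, including recurrent weights.
    return least_squares(rnn_residuals, warm).x
```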

EXPERIMENTS The RNN was trained on 1,000 utterances and tested on another 1,000 utterances; approximately half the frames are labeled as speech.

EXPERIMENTS FA (false accept): a frame that should not have passed detection is accepted. FR (false reject): a frame that should have passed detection is rejected. The ASR system can recover more easily from false accept (FA) errors than from false reject (FR) errors. $RNN_A$: trained on 27,000 utterances averaging 4.4 seconds.

EXPERIMENTS (results figures 1 and 2)

CONCLUSION We have shown that our RNN architecture can outperform considerably larger GMM-based systems on VAD tasks, reducing the per-frame false alarm rate by 26%, increasing overall recognition speed by 17%, and yielding a modest 1% relative decrease in the word error rate. Our RNN architecture, with multiple layers and quadratic nodes, also seems to outperform traditional MLP-like RNNs.