RECURRENT NEURAL NETWORKS FOR VOICE ACTIVITY DETECTION

Authors: Thad Hughes and Keir Mierle
Speaker: 佳樺

Outline: INTRODUCTION, RECURRENT NEURAL NETWORK MODEL, TRAINING PROCEDURE, EXPERIMENTS, CONCLUSION

What is voice activity detection?
In communications: it reduces energy consumption. In speech recognition: it reduces the impact of noise.

INTRODUCTION
One way to build a VAD system involves two GMMs: one models speech frames, the other models non-speech frames.
A problem inherent to many current VAD techniques is that frame classification and temporal smoothing cannot be easily optimized simultaneously.
HMM VAD systems typically have a small number of hidden states. Processing each frame independently fails to account for the lack of temporal conditional independence of speech frames, and the discrete HMM state space implies that the model cannot “remember” much about the past.
RNNs address these limitations: they can be discriminatively optimized for frame classification while simultaneously learning temporal smoothing.
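To make the two-stage structure concrete, here is a minimal sketch of a two-GMM frame classifier followed by a separate smoothing pass. It is an illustration only: the 1-D features, the GMM parameters, and the majority-vote smoother are all assumptions (the slides mention HMM-based smoothing but give no details), and the names `gmm_log_likelihood`, `classify_frames`, and `smooth` are hypothetical.

```python
import math

def gmm_log_likelihood(x, components):
    """Log-likelihood of scalar x under a 1-D GMM.

    components is a list of (weight, mean, variance) tuples.
    """
    total = 0.0
    for weight, mean, var in components:
        density = math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
        total += weight * density
    return math.log(total)

def classify_frames(features, speech_gmm, non_speech_gmm):
    """Stage 1: label each frame independently (True = speech)."""
    return [
        gmm_log_likelihood(x, speech_gmm) > gmm_log_likelihood(x, non_speech_gmm)
        for x in features
    ]

def smooth(decisions, half_window=1):
    """Stage 2: majority vote over a sliding window (an assumed smoother)."""
    smoothed = []
    for t in range(len(decisions)):
        window = decisions[max(0, t - half_window): t + half_window + 1]
        smoothed.append(2 * sum(window) > len(window))
    return smoothed

# Hypothetical model parameters and features, for illustration only.
speech_gmm = [(0.5, 2.0, 1.0), (0.5, 4.0, 1.0)]
non_speech_gmm = [(1.0, 0.0, 1.0)]
features = [0.1, 0.2, 3.0, 0.3, 3.5, 2.5, 0.0]

raw = classify_frames(features, speech_gmm, non_speech_gmm)
smoothed = smooth(raw)
print(raw)
print(smoothed)
```

The classifier and the smoother are fitted and tuned separately here, which is the limitation the slide points out: nothing in stage 1 knows how stage 2 will treat its decisions.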

Quadratic nodes
Our RNN nodes compute quadratic functions of their inputs, followed by an optional non-linearity:
$V(x) = f(x^T W_Q x + \omega_L^T x + \omega_B)$
$W_Q$ is an upper-triangular sparse matrix with weights for the quadratic terms. $\omega_L$ is a vector of linear weights similar to those in MLPs. $\omega_B$ is a scalar bias.
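The quadratic node formula can be sketched in a few lines of plain Python. This is a sketch under stated assumptions, not the authors' implementation: the sparse upper-triangular matrix is stored as a dictionary keyed by (i, j) with i <= j, `tanh` is an assumed choice for the optional non-linearity f, and all weights are made up for illustration.

```python
import math

def quadratic_node(x, w_q, w_l, w_b, f=math.tanh):
    """Evaluate V(x) = f(x^T W_Q x + w_L^T x + w_B) for one node.

    w_q: dict {(i, j): weight} with i <= j (sparse upper-triangular W_Q).
    w_l: list of linear weights, same length as x.
    w_b: scalar bias.
    f:   non-linearity; tanh here is an assumption, not from the slides.
    """
    quadratic = sum(w * x[i] * x[j] for (i, j), w in w_q.items())
    linear = sum(wi * xi for wi, xi in zip(w_l, x))
    return f(quadratic + linear + w_b)

def identity(a):
    """Use this as f to skip the optional non-linearity."""
    return a

# Hypothetical inputs and weights, for illustration only.
x = [1.0, 2.0, 3.0]
w_q = {(0, 0): 0.5, (0, 2): -1.0, (1, 1): 0.25}
w_l = [0.1, 0.2, 0.3]
w_b = -0.4

linear_output = quadratic_node(x, w_q, w_l, w_b, f=identity)
squashed_output = quadratic_node(x, w_q, w_l, w_b)
print(linear_output, squashed_output)
```

Storing only the non-zero entries of $W_Q$ keeps the cost of one node proportional to the number of quadratic terms rather than to the square of the input size.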

VAD RNN architecture and initialization
quadratic node

If $H1[T]$ denotes the vector of the outputs of the nodes in layer H1 at timestep $T$, then for all nodes $n$ in layer H1, the input vector $x_n[T]$ and the sparsity pattern of $W_Q$ are given below; other layers follow the same pattern:
$x_n[T] = \begin{bmatrix} H0[T] \\ H0[T-1] \\ H1[T-1] \end{bmatrix}, \quad \mathrm{sparsity}(W_Q) = \begin{bmatrix} I_{3\times 3} & I_{3\times\ldots} & I_{3\times\ldots} & \ldots \end{bmatrix}$
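The way each node's input vector is assembled from the current and previous timesteps can be sketched as follows. The layer width of 3 is inferred from the $I_{3\times 3}$ block, the zero initial state is an assumption, and `toy_node` is a hypothetical stand-in for a trained quadratic node.

```python
import math

def node_input(h0_now, h0_prev, h1_prev):
    """Stack H0[T], H0[T-1], and H1[T-1] into the input vector x_n[T]."""
    return list(h0_now) + list(h0_prev) + list(h1_prev)

def run_layer(h0_sequence, node_fn, width=3):
    """Run layer H1 over a sequence of H0 output vectors.

    node_fn(n, x) returns the output of node n for input vector x.
    Previous-timestep states start at zero (an assumption).
    """
    h0_prev = [0.0] * width
    h1_prev = [0.0] * width
    outputs = []
    for h0_now in h0_sequence:
        x = node_input(h0_now, h0_prev, h1_prev)
        h1_now = [node_fn(n, x) for n in range(width)]
        outputs.append(h1_now)
        h0_prev, h1_prev = h0_now, h1_now
    return outputs

def toy_node(n, x):
    """Hypothetical node: squashes the mean input plus a per-node offset."""
    return math.tanh(sum(x) / len(x) + 0.1 * n)

# Hypothetical H0 outputs for three timesteps.
h0_sequence = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.2], [0.5, -0.5, 0.0]]
outputs = run_layer(h0_sequence, toy_node)
print(outputs)
```

Every node in the layer sees the same 9-element vector; what differs from node to node is the set of weights applied to it.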

Compute the difference of the RNN’s output node $N_{Output}[T]$ with a slightly delayed target output:
$ERROR[T] = N_{Output}[T] - Target[T - \Delta]$
with a fixed delay $\Delta = 10$.
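A small sketch of the delayed error computation is shown below. Skipping the first $\Delta$ timesteps, where no delayed target exists yet, is an assumption, as are the function names and the toy data.

```python
def delayed_errors(outputs, targets, delta=10):
    """ERROR[T] = N_Output[T] - Target[T - delta] for every T >= delta.

    Timesteps before delta are skipped because Target[T - delta]
    is undefined there (an assumption, not stated in the slides).
    """
    return [outputs[t] - targets[t - delta] for t in range(delta, len(outputs))]

def sum_squared_error(errors):
    """Training objective: the sum of the squares of all errors."""
    return sum(e * e for e in errors)

# Hypothetical targets: 5 non-speech frames followed by 15 speech frames.
targets = [0.0] * 5 + [1.0] * 15
# An ideal network reproduces the targets exactly 10 frames late.
perfect = [0.0] * 10 + targets[:10]

errors = delayed_errors(perfect, targets)
print(len(errors), sum_squared_error(errors))
```

The delay lets the network see a few frames of future context before it has to commit to a decision about a given frame.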

TRAINING PROCEDURE
Input: 13-dimensional PLP features, without deltas or double-deltas.
The target output at each timestep is a single value indicating whether the frame $\Delta$ timesteps ago is speech or non-speech. Targets are generated by forced alignment of the audio with a human transcription using a monophone acoustic model.

TRAINING PROCEDURE
Ceres Solver is used to minimize the sum of the squares of all errors.
Automatic differentiation computes the exact first derivatives of the errors, which greatly simplifies experimentation.
Two-stage training: first, we fix all recurrent parameters and train only the feedforward parameters (those associated with vertical arrows in Fig. 1). Then we optimize all the parameters together, including the $\omega_L$ weights controlling the tapped delay line.
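The two-stage schedule can be illustrated on a toy problem. Everything below is an assumption made for illustration: the model is a one-parameter-each recurrence y[t] = a * x[t] + r * y[t-1] rather than the VAD network, and a simple derivative-free coordinate search stands in for Ceres Solver and automatic differentiation.

```python
def simulate(params, inputs):
    """Toy recurrent model: y[t] = a * x[t] + r * y[t-1], with y[-1] = 0."""
    y, prev = [], 0.0
    for x in inputs:
        prev = params["a"] * x + params["r"] * prev
        y.append(prev)
    return y

def loss(params, inputs, targets):
    """Sum of squared errors between model outputs and targets."""
    return sum((y - t) ** 2 for y, t in zip(simulate(params, inputs), targets))

def coordinate_search(params, free, inputs, targets, step=0.5, iters=60):
    """Adjust only the parameters named in `free`; all others stay fixed.

    A trial step is accepted only if it lowers the loss, so the loss
    never increases. The step is halved when no trial helps.
    """
    params = dict(params)
    best = loss(params, inputs, targets)
    for _ in range(iters):
        improved = False
        for name in free:
            for direction in (1.0, -1.0):
                trial = dict(params)
                trial[name] += direction * step
                value = loss(trial, inputs, targets)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step *= 0.5
    return params, best

# Hypothetical data generated from known parameters.
inputs = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
targets = simulate({"a": 0.8, "r": 0.5}, inputs)

start = {"a": 0.0, "r": 0.3}
loss0 = loss(start, inputs, targets)
# Stage 1: recurrent weight r is frozen, only feedforward weight a is trained.
stage1, loss1 = coordinate_search(start, ["a"], inputs, targets)
# Stage 2: all parameters are optimized together.
stage2, loss2 = coordinate_search(stage1, ["a", "r"], inputs, targets)
print(loss0, loss1, loss2)
```

Freezing the recurrent weight first gives the second stage a reasonable starting point, which is the motivation the slide gives for training in two passes.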

EXPERIMENTS

EXPERIMENTS
Trained on 1000 utterances and tested on another 1000 utterances. Approximately half the frames are labeled as speech.

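EXPERIMENTS
FA (false accept): an input that should not have passed recognition but did.
FR (false reject): an input that should have passed recognition but did not.
An ASR system can recover more easily from false accept (FA) errors than from false reject (FR) errors.
$RNN_A$: 27,000 utterances averaging 4.4 seconds.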
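The FA and FR definitions can be turned into per-frame rates as sketched below. The normalization (FA over non-speech frames, FR over speech frames) is an assumed convention, since the slides do not define it, and the labels are made up for illustration.

```python
def fa_fr_rates(predicted, reference):
    """Per-frame false-accept and false-reject rates.

    predicted, reference: equal-length sequences of booleans (True = speech).
    FA rate = non-speech frames predicted as speech / all non-speech frames.
    FR rate = speech frames predicted as non-speech / all speech frames.
    Assumes the reference contains at least one frame of each class.
    """
    fa = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fr = sum(1 for p, r in zip(predicted, reference) if r and not p)
    non_speech = sum(1 for r in reference if not r)
    speech = sum(1 for r in reference if r)
    return fa / non_speech, fr / speech

# Hypothetical frame labels, for illustration only.
reference = [False, False, True, True, True, False, True, False]
predicted = [False, True, True, True, False, False, True, False]

fa_rate, fr_rate = fa_fr_rates(predicted, reference)
print(fa_rate, fr_rate)
```

Because FR errors are harder for the recognizer to recover from, a VAD operating point would typically be chosen to keep the FR rate low and accept a somewhat higher FA rate.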

CONCLUSION
We have shown that our RNN architecture can outperform considerably larger GMM-based systems on VAD tasks, reducing the per-frame false alarm rate by 26% and increasing overall recognition speed by 17%, with a modest 1% relative decrease in the word error rate.
Our RNN architecture, with multiple layers and quadratic nodes, also seems to outperform traditional MLP-like RNNs.
