Presentation is loading. Please wait.

Presentation is loading. Please wait.

Feature Mapping FOR SPEAKER Diarization IN NOisy conditions

Similar presentations


Presentation on theme: "Feature Mapping FOR SPEAKER Diarization IN NOisy conditions"— Presentation transcript:

1 Feature Mapping FOR SPEAKER Diarization IN NOisy conditions
Weixin Zhu1, Wu Guo1, Guoping Hu2 1National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R.China 2Key Laboratory of Intelligent Speech Technology, Ministry of Public Security, Hefei, China DNN-based feature mapping have been demonstrated as a valid method for speech enhancement in noisy condition, in this paper, we first used it in speaker diarization , enhanced PLP (with DNN-based feature mapping) can achieve about 7% relative diarization error rate (DER) reduction in IFLY-DIAR-II test set which is a noisy data set, the results shows the effectiveness of the enhanced features. Consensus clustering using different clustering result can improve the cluster purity, and consensus clustering using PLP and Enhanced PLP can achieve about 21% relative DER reduction in IFLY-DIAR-II test set, the results demonstrate the effectiveness of combining the DNN-based feature mapping and consensus clustering. This regression DNN performs as a highly non-linear regression function to obtain clean speech features from noisy speech features. Training of DNN: Since it is difficult and expensive to collect so much noisy training data from real scenarios, the noisy training data is produced by corrupting the clean speech data with different types of noise at various signal-to-noise-ratio (SNR) levels. The acoustic context information along both time axis (with multiple neighboring frames) and frequency axis (with full frequency bins) is utilized by DNN. The DNN output contains the same number of frames as the input, using the minimizing mean squared error (MMSE) object function: Consensus clustering is to access the stability of the discovered clusters and discard the unstable clusters. When it is used for cluster purification, it can remove the incorrect assignments of speech segments in speaker clusters before cluster modeling Consensus clustering process: Firstly, we obtain the cluster results of two sub-systems using different features. For instance, system 1 is using PLP, system 2 is using Enhanced PLP. Secondly, we get the common cluster results from two sub-systems to obtain more purified clusters, in this example, seg 2 and seg 4 are removed. And then, the pure speaker segments (seg 1,3, 5) are used to train initial speaker probability models, and agglomerative hierarchical clustering is followed using the initial speaker probability models. Finally, the label of impure speaker segments is labeled in the Viterbi re-segmentation step. Data base: IFLY-DIAR-II, Chinese talk shows; duration vary from 20 minutes to one hour; number of speakers ranges from 2 to 9 ; corrupted by music, laughter, applause, or other noises;171 recordings for training set, 171 recordings for development set, 367 recordings for test set. Features:13 dimensional PLP or MFCC features, plus first and second derivatives,39 dimensional Enhanced PLP. Regression DNN:IFLYTEK-HIFICM database is used for training, is composed of clean Mandarin utterances; Additive white Gaussian noise (AWGN), music, applause and laughter are used as noise signals, SNR level from 0dB to 20dB with an increment of 5dB; a context window of 11 frames; the DNN architecture was Tabel 1. Experimental results for different features Tabel 2. Results of consensus clustering based on different acoustic feature combinations without consensus clustering in the IFLY-DIAR-II test set. ACPR:average cluster purity rate The enhanced PLP can achieve about 7% relative DER reduction and about 1.75% absolute ACPR improvement compared to PLP or MFCC, the results demonstrate the effectiveness of the enhanced speech features obtained by regression DNN. Compared to PLP-MFCC, PLP-EnPLP can achieve 9.15% relative DER reduction and 2.23% absolute ACPR improvement, it shows that the enhanced features can add complementary information to the original short-term features. The results in Table 1 and Table 2 show that PLP-EnPLP can achieve 21.26% relative DER reduction and 6.97% absolute ACPR improvement compared to the original PLP feature, it shows the validity of combining the DNN-based feature mapping and consensus clustering. Experiments ABSTRCT Consensus Clustering SYSTEM FRAMEWORK Feature mapping: a regression DNN is used to obtain cleaner features in noisy conditions and we use the enhanced features in clustering. Consensus Clustering: we access the reliability of the segments obtained in initial clustering stage, impure segments are dropped. SYSTEM DER(%) ACPR(%) MFCC-EnPLP 7.82 91.29 PLP-EnPLP 7.74 91.41 PLP-MFCC 8.52 89.18 Feature DER(%) ACPR(%) PLP 9.83 84.44 MFCC 9.89 84.36 EnPLP 9.11 86.18 Conclusions DNN-based feature mapping


Download ppt "Feature Mapping FOR SPEAKER Diarization IN NOisy conditions"

Similar presentations


Ads by Google