Time-Frequency Analysis of Vocal Source Signal for Speaker Recognition
Nengheng Zheng, P.C. Ching and Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR of China

Abstract
This article investigates the importance of vocal source information for speaker recognition. We propose a novel feature extraction scheme that exploits the time-frequency properties of the LP residual signal. The new feature, named Wavelet Octave Coefficients of Residues (WOCOR), provides additional speaker-discriminative power and is demonstrated to improve the overall performance of a speaker recognition system when combined with the conventional vocal tract feature, the MFCCs.

Speaker Specific Vocal Source Signal
- Vocal tract features, e.g., MFCC and LPCC, are widely used in speaker recognition systems.
- The vibration mechanism of the vocal cords is speaker dependent.
- We aim to capture the time-frequency properties of the glottal source.

Speech production
[Figure: Glottal pulses u(n) -> Vocal tract H(z) -> Speech signal s(n)]

Source-tract separation by LP inverse filtering
- Estimate the AR coefficients of V(z) by linear prediction analysis.
- Inverse filter s(n) to obtain the output e(n).
- e(n) is highly related to u(n) and is speaker dependent.
[Figure: s(n) and e(n)]

Feature Extraction With Time-Frequency Analysis on the Residual Signal
[Figure: block diagram. Blocks: VAD and Pitch Tracking; LP Inverse Filtering; Pitch Synchronous Wavelet Transform; Time-Frequency Feature Generation. Signals: s(n), e(n), F0, W_e(a,b), WOCOR]

Voice activity detection and pitch tracking
- Only voiced segments are of interest.
- Energy and zero-crossing detection for VAD.
- Cepstrum analysis for pitch tracking.

LP inverse filtering
- Inverse filter the voiced speech frames.

Pitch synchronous wavelet transform
- Exact pitch tracking by detecting the bursts in the residual signal.
- Wavelet transform on every two pitch cycles of the residual signal, with one pitch cycle of overlap.

Time-frequency feature generation
- First, divide the wavelet coefficients into octave groups.
- Second, generate the feature vector, named the first-order Wavelet Octave Coefficients of Residues (WOCOR_1).
- Furthermore, to obtain more temporal detail, divide each octave group into sub-groups and generate the higher-order WOCOR_α.

Experiments

Corpus
- Read Cantonese HK ID numbers
- 40 male speakers
- 4 enrollment and 6 testing sessions
- Microphone and telephone speech

Baseline system
- MFCC_D_A
- 128-component GMM

Recognition results
[Figure: Recognition error rate with WOCOR_α]
[Table: Recognition error rate with fused source-tract information. Columns: Performance, IDER (%), EER (%). Rows: MFCC_D_A (MIC, TEL); MFCC_D_A + WOCOR_4 (MIC, TEL)]
- Information fusion at the score level, with w_t experimentally determined.

Comments
- Temporal details of the vocal source signal are useful for speaker recognition.
- In telephone speech, the relative improvement from source information is 24% for identification and 14% for verification; in microphone speech, it is 22% for identification and 11% for verification.

Conclusion
The vocal source time-frequency information provides additional speaker-discriminative power and improves the overall performance of the speaker recognition system.

Acknowledgement
This effort was partially supported by a research grant awarded by the Hong Kong Research Grant Council. The authors wish to acknowledge Dr. Frank Soong for instructive discussions and suggestions during this work.
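
The source-tract separation step described on this poster (estimate the AR coefficients by linear prediction analysis, then inverse filter s(n) to obtain the residual e(n)) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the autocorrelation method with a Levinson-Durbin recursion, applies no windowing or pre-emphasis, and runs on a synthetic toy signal (an impulse train passed through a hypothetical two-pole filter). The function names `lp_coefficients` and `inverse_filter`, the LP order, and the toy filter values are all assumptions made for illustration.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r(0..max_lag) of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lp_coefficients(frame, order):
    """Estimate predictor coefficients a_1..a_p (autocorrelation method).

    The predictor is s_hat(n) = sum_k a_k * s(n - k), solved with the
    Levinson-Durbin recursion. Returns (coefficients, prediction_error).
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # silent or degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err


def inverse_filter(signal, coeffs):
    """Residual e(n) = s(n) - sum_k a_k * s(n - k), with s(n) = 0 for n < 0."""
    residual = []
    for n, sample in enumerate(signal):
        predicted = sum(a_k * signal[n - k]
                        for k, a_k in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(sample - predicted)
    return residual


# Toy usage (assumed values): a pulse train u(n) with a period of 50 samples
# excites a stable two-pole filter standing in for the vocal tract.
pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
speech = []
for n, u in enumerate(pulses):
    s1 = speech[n - 1] if n >= 1 else 0.0
    s2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(1.3 * s1 - 0.6 * s2 + u)

coeffs, _ = lp_coefficients(speech, order=2)
residual = inverse_filter(speech, coeffs)   # should resemble the pulse train
```

On this toy signal the residual closely follows the excitation pulses, which mirrors the poster's point that e(n) is highly related to u(n). Real speech would typically need framing, a higher LP order, and the voiced-segment selection listed in the feature extraction steps.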