IIT Bombay
17th National Conference on Communications, Jan. 2011, Bangalore, India (Sp Pr. 1, P3)

Detection of Burst Onset Landmarks in Speech Using Rate of Change of Spectral Moments

A. R. Jayan, P. S. Rajath Bhat, P. C. Pandey
{arjayan, rajathbhat,
EE Dept, IIT Bombay
30th January, 2011

PRESENTATION OUTLINE
1. Introduction
   ▪ Speech landmarks
   ▪ Landmark detection
   ▪ Clear speech
   ▪ Automated speech intelligibility enhancement
2. Methodology
   ▪ Band energy parameters
   ▪ Spectral moments
   ▪ Rate of change function
3. Evaluation and results
   ▪ VCV utterances
   ▪ Sentences
4. Conclusion

1. INTRODUCTION

Speech landmarks
Regions associated with spectral transitions that contain important information for speech perception

Landmarks and related events [Park, 2008]

Segment type   Landmark       Description
Vowel          Vowel (V)      Vowel nucleus
Glide          Glide (G)      Slow formant transitions
Consonant      Glottis (g)    Vocal fold vibration
               Sonorant (s)   Nasal closure / release
               Burst (b)      Turbulence noise

Landmark detection

Processing
▪ Extraction of parameters characterizing the landmark
▪ Computation of the rate of change (ROC) of the parameters
▪ Locating the landmark using the ROC(s)

Applications
▪ Intelligibility enhancement
▪ Speech recognition
▪ Vocal tract shape estimation

Clear speech
Speech produced with clear articulation when talking to a hearing-impaired listener or in a noisy environment

More intelligible for
▪ Hearing-impaired listeners (~17% higher, Picheny et al., 1985)
▪ Listeners in noisy environments (Payton et al., 1994)
▪ Non-native listeners (Bradlow and Bent, 2002)
▪ Children with learning disabilities (Bradlow et al., 2003)

Pronounced acoustic landmarks

Example: 'The book tells a story', conversational (Conv.) vs. clear speech (Recordings from

Automated speech intelligibility enhancement

Automated detection of landmarks
▪ High detection rate with low false detections
▪ Good temporal accuracy (5-10 ms)
▪ Computational efficiency

Modification of speech characteristics
▪ Intensity / duration / spectral modifications around landmarks, with minimal perceptual distortion of the acoustic cues in the speech signal

Problems in stop consonant perception
▪ Transient sound with low intensity
▪ Severely affected by noise / hearing impairment

Stop landmarks: closure, burst onset, onset of voicing

Example: /apa/

Some of the earlier landmark detection techniques
▪ Liu (1996): Rate-of-rise measures of parameters from a set of fixed spectral bands (speech recognition; g, s, b landmarks; 80 TIMIT sentences; detection rate: 84 % at ms, 50 % at 5-10 ms)
▪ Salomon et al. (2002): Temporal parameters related to periodicity, envelope, and spectral fine structure (speech recognition; onsets and offsets of vowels, sonorants, and consonants; 120 TIMIT sentences; detection rate: 90 % at 20 ms)
▪ Sainath and Hazan (2006): Sinusoidal model parameters (speech segmentation; 453 TIMIT sentences; word error rate: 20 %)
▪ Niyogi & Sondhi (2002): Stop landmark detection using total energy, energy above 3 kHz, and Wiener entropy (speech recognition; stop consonants; 320 TIMIT sentences; detection rate: 90 % at 20 ms)
▪ Jayan & Pandey (2009): Stop landmark detection using GMM parameters (speech enhancement; 50 TIMIT sentences; detection rate: 73 % at 5 ms)

Improving landmark detection

Parameters
▪ Capturing spectral transitions
▪ Adaptation to speech variability

Rate of change measure
▪ Range of parameter variations
▪ Correlation among parameters

Adaptive time steps
▪ Small time step for abrupt variations
▪ Large time step for slow variations

Objective of the present investigation
Detection of burst landmarks for automated intelligibility enhancement

2. METHODOLOGY

Band energy parameters
Log of spectral peaks in three bands
▪ b1: kHz
▪ b2: kHz
▪ b3: kHz

Magnitude spectrum (10 kHz sampling) computed using a 512-point DFT with a 6 ms Hanning window, 1 frame per ms, and smoothed by a 20-point moving average.

The smoothed magnitude spectrum X(n, k) is used for calculating the log of the spectral peak in band i
(n = time index, k = frequency index)
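
The band energy computation described on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: the band edges did not survive in the slide text, so `BANDS_HZ` holds placeholder values; the 20-point moving average is assumed to run along the time index; and the log is taken as 20 log10 of the peak magnitude. The function names are hypothetical.

```python
import numpy as np

FS = 10_000       # sampling rate in Hz (from the slide)
NFFT = 512        # DFT size (from the slide)
WIN_LEN = 60      # 6 ms Hanning window at 10 kHz
HOP = 10          # one frame per ms at 10 kHz
SMOOTH_LEN = 20   # moving-average length (from the slide)

# ASSUMPTION: placeholder band edges; the actual edges are missing from the slide.
BANDS_HZ = ((0, 1500), (1500, 3500), (3500, 5000))


def smoothed_spectrum(x):
    """Return the smoothed magnitude spectrum X(n, k), shape (n_frames, NFFT // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(WIN_LEN)
    n_frames = 1 + (len(x) - WIN_LEN) // HOP
    frames = np.stack(
        [x[i * HOP:i * HOP + WIN_LEN] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, n=NFFT, axis=1))
    # ASSUMPTION: smoothing is applied along the time index n.
    # Requires at least SMOOTH_LEN frames for mode="same" to keep the length.
    kernel = np.ones(SMOOTH_LEN) / SMOOTH_LEN
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, mag
    )


def band_log_peaks(X):
    """Return the log spectral peak (dB) per band, shape (n_frames, 3)."""
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    peaks = []
    for lo, hi in BANDS_HZ:
        in_band = (freqs >= lo) & (freqs < hi)
        # ASSUMPTION: 20*log10 of the peak magnitude; small offset avoids log(0).
        peaks.append(20.0 * np.log10(X[:, in_band].max(axis=1) + 1e-12))
    return np.stack(peaks, axis=1)
```

For a 100 ms input at 10 kHz this yields 95 frames, one per millisecond, each with three band values.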

Example: Band energy parameters for /aga/
[Figure: (a) speech waveform, (b) band energies; horizontal axis: time (ms)]

Spectral moments
(n = time index, k = frequency index, N = DFT size)
▪ Normalized spectrum
▪ Centroid: frequency of energy concentration
▪ Variance: spread of energy around the centroid
▪ Skewness: measure of spectral symmetry
▪ Kurtosis: measure of spectral peakiness
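
The equations for the moments did not survive in the slide text. The sketch below uses the standard textbook definitions, treating the normalized magnitude spectrum of each frame as a probability distribution over frequency. Whether the authors normalize magnitude or power, and whether kurtosis is reported with 3 subtracted, is not stated here, so those choices are assumptions, as is the function name.

```python
import numpy as np


def spectral_moments(X, fs=10_000):
    """Return [centroid, variance, skewness, kurtosis] per frame.

    X is a non-negative spectrum of shape (n_frames, n_bins) covering 0..fs/2.
    ASSUMPTION: standard definitions on the normalized magnitude spectrum.
    """
    X = np.asarray(X, dtype=float)
    freqs = np.linspace(0.0, fs / 2.0, X.shape[1])

    # Normalized spectrum: each frame sums to 1.
    p = X / (X.sum(axis=1, keepdims=True) + 1e-12)

    # Centroid: mean frequency under p.
    centroid = (p * freqs).sum(axis=1)

    # Central moments around the centroid.
    dev = freqs[None, :] - centroid[:, None]
    variance = (p * dev ** 2).sum(axis=1)
    sigma = np.sqrt(variance) + 1e-12

    # Skewness and kurtosis are normalized by powers of the standard deviation.
    skewness = (p * dev ** 3).sum(axis=1) / sigma ** 3
    kurtosis = (p * dev ** 4).sum(axis=1) / sigma ** 4

    return np.stack([centroid, variance, skewness, kurtosis], axis=1)
```

With these definitions a flat spectrum has its centroid at fs/4, zero skewness, and kurtosis close to 1.8, while a spectrum with a single sharp peak on a low floor has much higher kurtosis.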

Example: Band energy parameters & spectral moments for /aga/
[Figure: (a) waveform, panels (b), (c), (d); horizontal axis: time (ms)]

Measures of rate of change

● First difference based rate of change (ROC)
  K = time step

● Mahalanobis distance based rate of change (ROC-MD)
  A single measure indicative of the overall variation, accounting for parameter range and correlation effects
  y(n) = parameter set at time n
  K = time step
  Σ = covariance matrix, pre-calculated using the parameter set from segments with energy above a threshold
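
The two measures can be illustrated with the sketch below. The slide's equations are missing from the extracted text, so the forms used here are assumptions: the first difference is taken as y(n) - y(n - K), and ROC-MD as the Mahalanobis norm of that difference vector under the pre-calculated covariance matrix. All function names are hypothetical.

```python
import numpy as np


def roc(y, K):
    """First difference with time step K: y(n) - y(n - K).

    ASSUMPTION: backward difference; the first K frames are set to zero.
    Works for a 1-D track or a (n_frames, n_params) array.
    """
    y = np.asarray(y, dtype=float)
    d = np.zeros_like(y)
    d[K:] = y[K:] - y[:-K]
    return d


def covariance_from_active_frames(Y, energy, threshold):
    """Covariance of the parameter set over frames with energy above a threshold."""
    Y = np.asarray(Y, dtype=float)
    active = np.asarray(energy) > threshold
    return np.cov(Y[active], rowvar=False)


def roc_md(Y, K, cov):
    """Mahalanobis distance between y(n) and y(n - K) for each frame n."""
    d = roc(Y, K)                      # shape (n_frames, n_params)
    cov_inv = np.linalg.inv(cov)
    # Quadratic form d^T * cov_inv * d, evaluated per frame.
    q = np.einsum("ni,ij,nj->n", d, cov_inv, d)
    return np.sqrt(np.maximum(q, 0.0))
```

With an identity covariance matrix ROC-MD reduces to the Euclidean length of the difference vector. A non-trivial covariance rescales each parameter by its spread and discounts correlated changes, which is the range and correlation compensation the slide refers to.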

Detection of voicing offset and onset
▪ Band energy in Hz
▪ ROC(n) computed with time step 50 ms
▪ Voicing offset [g-]: ROC(n) below -12 dB
▪ Voicing onset [g+]: ROC(n) above +12 dB

Burst onset landmark detection
Most prominent peak in ROC-MD(n) between g- and g+

Example: /aga/
[Figure: (a) waveform, (b) ROC-MD, (c) ROC; horizontal axis: time (ms)]
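
A minimal sketch of this search for a single VCV token is given below. It makes several assumptions that the slide does not confirm: g- is taken as the strongest ROC fall below -12 dB, g+ as the strongest rise above +12 dB after g-, and the "most prominent peak" is simplified to the maximum of ROC-MD strictly between the two. The function name and arguments are hypothetical, and both inputs are assumed to be sampled at one frame per ms.

```python
import numpy as np


def detect_burst(roc_db, roc_md, threshold_db=12.0):
    """Return (g_minus, burst, g_plus) frame indices, or None if not found.

    roc_db : ROC of the voicing band energy, in dB, one value per frame
    roc_md : ROC-MD of the parameter set, one value per frame
    """
    roc_db = np.asarray(roc_db, dtype=float)
    roc_md = np.asarray(roc_md, dtype=float)

    # Voicing offset g-: strongest fall below the negative threshold.
    below = np.flatnonzero(roc_db < -threshold_db)
    if below.size == 0:
        return None
    g_minus = below[np.argmin(roc_db[below])]

    # Voicing onset g+: strongest rise above the positive threshold after g-.
    above = np.flatnonzero(roc_db[g_minus:] > threshold_db) + g_minus
    if above.size == 0:
        return None
    g_plus = above[np.argmax(roc_db[above])]
    if g_plus - g_minus < 2:
        return None

    # Burst onset: maximum of ROC-MD strictly between g- and g+.
    burst = g_minus + 1 + np.argmax(roc_md[g_minus + 1:g_plus])
    return int(g_minus), int(burst), int(g_plus)
```

Restricting the search to the interval between g- and g+ means that large ROC-MD values inside the vowels cannot be reported as bursts.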

3. EVALUATION & RESULTS

Effects of rate of change functions & parameters on burst detection

ROC and parameters
1) ROC(BE): Sum of normalized ROCs of [E_b1, E_b2, E_b3]
2) ROC-MD(BE): ROC-MD of [E_b1, E_b2, E_b3]
3) ROC-MD(SM): ROC-MD of [F_c, F_σ, F_k, F_s]
4) ROC-MD(BE,SM): ROC-MD of [E_b1, E_b2, E_b3, F_c, F_σ, F_k, F_s]

Material: VCV utterances, TIMIT sentences
Time steps: 3, 6 ms
Temporal accuracies: 3, 5, 10, 15, 20 ms
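
Scoring at several temporal accuracies can be illustrated as follows. This is a hypothetical scoring routine, not the authors' evaluation code: a reference burst counts as detected if any detected landmark lies within the tolerance, and one-to-one matching is not enforced. The times in the usage lines are made-up illustrative values.

```python
def detection_rates(reference_ms, detected_ms, tolerances_ms=(3, 5, 10, 15, 20)):
    """Return {tolerance: detection rate in %} for the given landmark times.

    ASSUMPTION: a reference landmark is a hit if at least one detection
    falls within +/- tolerance of it.
    """
    rates = {}
    for tol in tolerances_ms:
        hits = sum(
            any(abs(d - r) <= tol for d in detected_ms) for r in reference_ms
        )
        rates[tol] = 100.0 * hits / len(reference_ms)
    return rates


# Illustrative values only, not data from the presentation.
reference = [100, 300, 500, 700]
detected = [102, 308, 520, 900]
rates = detection_rates(reference, detected)
```

Detections that match no reference landmark, such as the one at 900 ms here, would be counted separately as insertions.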

VCV utterances
▪ 6 stop consonants (b, d, g, p, t, k)
▪ 3 vowel contexts (a, i, u)
▪ 10 speakers (5 M, 5 F)
▪ 180 tokens

TIMIT sentences
▪ 5 speakers (2 M, 3 F)
▪ 10 sentences from each speaker
▪ 238 tokens

Insertion rates (%) by error type

Error type               ROC(BE)   ROC-MD(BE)   ROC-MD(SM)   ROC-MD(BE,SM)
Vowel / semivowel
Frication
Glottal stops / clicks   4         3            3            4

4. CONCLUSION
▪ Increasing the time step reduced detection accuracy.
▪ The Mahalanobis distance based ROC was more effective than the first-difference based ROC.
▪ Spectral moments were useful as additional parameters for improving burst-onset detection.

Thank you