IIT Bombay
14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India
Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan,
EE Dept, IIT Bombay
3rd February, 2008
PRESENTATION OUTLINE
1. Introduction
   ▪ Acoustic properties of clear speech
   ▪ Landmark detection
   ▪ Need for high time resolution
2. Automated landmark detection with high resolution
   ▪ Pass 1
   ▪ Pass 2
3. Experimental results
4. Summary and conclusion
1. INTRODUCTION
Acoustic properties of clear speech
Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
Examples (conversational and clear versions): 'the book tells a story', 'the boy forgot his book'
Intelligibility of clear speech
▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners and listening conditions
Acoustic differences between clear and conversational speech
Sentence level
▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
▪ Larger variation in fundamental frequency
▪ More pauses, longer pause durations
Word level
▪ Fewer sound deletions
▪ More sound insertions
Phonetic level
▪ Context-dependent, non-linear increase in segment durations
▪ More targeted vowel formants
▪ Increase in consonant intensity
Improvement in intelligibility of conversational speech by incorporating properties of clear speech
Consonant-vowel intensity ratio (CVR) enhancement
▪ Increasing energy of the consonant segment
Consonant duration enhancement
▪ Lengthening CV and VC transitions (burst duration, VOT, formant transition)
Challenges
▪ Accurate detection of regions for modification
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without changing the overall speaking rate: transition regions are lengthened with a corresponding decrease in steady-state segments
Intelligibility enhancement using properties of clear speech
Hazan & Simpson, 1998
▪ manually labeled VCV and sentences
▪ intensity modification of stop burst +12 dB, frication +6 dB, nasal +6 dB
▪ spectral modification by filtering
Colotte & Laprie, 2000
▪ automated method for identifying regions based on mel-cepstral analysis
▪ stops and unvoiced fricatives amplified by +4 dB
▪ transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
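The intensity modifications quoted above are level changes in dB. As a minimal sketch of what such a change means for the waveform, the snippet below converts a dB value to a linear amplitude factor and applies it to one labeled segment. The helper names (`db_to_gain`, `amplify_segment`), the toy signal, and the segment boundaries are hypothetical and are not taken from the cited studies.

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

def amplify_segment(samples, start, end, gain_db):
    """Return a copy of samples with samples[start:end] scaled by gain_db."""
    g = db_to_gain(gain_db)
    out = list(samples)
    for i in range(start, end):
        out[i] *= g
    return out

# Toy signal: indices 2..3 stand in for a (hypothetical) stop-burst segment.
signal = [0.1, 0.1, 0.02, 0.02, 0.5, 0.5]
boosted = amplify_segment(signal, 2, 4, 12.0)   # +12 dB on the burst only
```

A +12 dB change multiplies the amplitude by roughly 4, and +6 dB by roughly 2, so a weak burst can be raised substantially relative to the neighbouring vowel without touching the vowel samples.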
Landmark detection
Speech landmarks
▪ Regions containing important information for speech perception
▪ Associated with spectral transitions
Landmark types
1. Abrupt-consonantal (AC): tight constrictions of primary articulators
2. Abrupt (A): fast glottal or velum activity
3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V): vowel landmarks
Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)
Landmarks (figure)
Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0-0.4, , , , , kHz
▪ Parameter: first difference of maximum energy (log) in each spectral band; time-step = 50 ms at the coarse level, 26 ms at the fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal landmarks, sonorant closures and releases, stop closures and releases
Application: extraction of features for supporting speech recognition
Detection rate vs. temporal resolution (figure; values shown: 73 %, 83 %, 88 %, 44 %)
▪ Uses same processing for all types of landmarks
Niyogi & Sondhi, 2002
▪ for stop consonants
▪ total energy and energy above 3 kHz in log scale
▪ measure of spectral flatness
▪ non-linear operator optimized for burst detection
Salomon et al., 2002
▪ Hilbert-transform-based envelope to extract temporal parameters
▪ spectral information
▪ adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
▪ wavelet-transform-based decomposition
▪ energy variations in 6 bands
Need for high temporal resolution and detection rate
Application dependent
▪ Speech recognition: analysis is performed around landmarks for parameter extraction
   ▪ high accuracy
   ▪ moderate temporal resolution (20-30 ms)
▪ Intelligibility enhancement: landmark regions are modified
   ▪ high temporal resolution (< 5 ms)
   ▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
Landmark type dependent
▪ Short-duration events (bursts) need high time resolution
▪ Voicing onsets/offsets may not require as much resolution, as signal properties remain the same for a long duration
Factors limiting detection rate and temporal resolution
▪ Effectiveness of parameters in capturing acoustic variations
   ▪ short-time energy variation in spectral bands: weak bursts may not get detected
   ▪ centroid frequency not well defined during low-energy segments
   ▪ fixed band boundaries may not adapt to speech variability
▪ Smoothing performed during parameter extraction
   ▪ temporal smoothing of the spectrum affects time resolution
▪ Type of distance measure
   ▪ first difference operation not optimized for all types of landmarks
   ▪ a time-step of 10 ms is too large for burst detection
▪ Effect of noise on parameters
▪ Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
▪ Separate detectors are required for each phonetic class
▪ Each detector must use the method most suited to the phonetic event
Objective
Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement
2. AUTOMATED LANDMARK DETECTION
Landmark detection using spectral peaks and centroids
Pass 1
Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ frame rate 1 ms
Parameters
▪ maximum energy in each spectral band, every 1 ms
▪ band centroids estimated in each band, every 1 ms
▪ features similar to formant peaks and formant frequencies
▪ can be estimated easily
▪ not much affected by noise
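The Pass 1 parameter extraction can be sketched roughly as follows, using the band edges, sampling rate, frame length, FFT size, and frame rate listed on the slide. The Hamming window, the power-weighted centroid definition, the small floor added before taking the logarithm, and the function name `band_peaks_and_centroids` are assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np

FS = 10_000        # sampling frequency (samples/s), from the slide
FRAME_LEN = 60     # 6 ms frame at 10 kHz
HOP = 10           # one frame every 1 ms
NFFT = 512         # FFT size, from the slide
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_peaks_and_centroids(x):
    """Per-frame peak energy (dB) and centroid frequency (Hz) in each band."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    window = np.hamming(FRAME_LEN)            # window choice is an assumption
    peaks_db = np.zeros((n_frames, len(BANDS_HZ)))
    centroids = np.zeros((n_frames, len(BANDS_HZ)))
    for m in range(n_frames):
        frame = x[m * HOP: m * HOP + FRAME_LEN] * window
        power = np.abs(np.fft.rfft(frame, NFFT)) ** 2
        for b, (lo, hi) in enumerate(BANDS_HZ):
            sel = (freqs >= lo) & (freqs < hi)
            p = power[sel]
            peaks_db[m, b] = 10.0 * np.log10(p.max() + 1e-12)
            # Power-weighted mean frequency within the band (assumed definition).
            centroids[m, b] = np.sum(freqs[sel] * p) / (np.sum(p) + 1e-12)
    return peaks_db, centroids

# Toy input: 50 ms of a 1 kHz tone, which falls in the 0.4-1.2 kHz band.
t = np.arange(500) / FS
peaks, cents = band_peaks_and_centroids(np.sin(2 * np.pi * 1000 * t))
```

For the toy tone, the second band carries the largest peak energy and its centroid sits close to 1 kHz, which is the sense in which these parameters resemble formant peaks and formant frequencies.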
Peak energy
Centroid frequency
Rate-of-rise functions
Transition index
▪ tracks simultaneous variation of energy and centroid
▪ centroids given less weighting in low-energy areas
Example: /uka/, peak & centroid contours (figure, frequency axes in kHz)
Example: /uka/, peak & centroid ROR contours, time step = 26 ms (figure, frequency axes in kHz)
Example: /uka/, transition index derived from RORs with time step = 26 ms (figure)
Example: /uka/, transition index derived from RORs with time step = 4 ms (figure)
▪ Less sensitive to slow transitions
Problems
Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ also detects slowly varying events (higher detection rate)
Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions
Pass 2: analyze landmarks detected in Pass 1 with a small time-step
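The trade-off between large and small time steps can be illustrated on a synthetic contour. The sketch below assumes a symmetric-difference rate-of-rise, `x[n + K/2] - x[n - K/2]`, on a contour sampled every 1 ms; the exact ROR definition used in the talk is not given in the text, and the contour, the 6 dB threshold, and the function name are hypothetical.

```python
import numpy as np

def rate_of_rise(x, step):
    """Symmetric difference x[n + step/2] - x[n - step/2]; zero at the edges."""
    x = np.asarray(x, dtype=float)
    h = step // 2
    ror = np.zeros_like(x)
    if h > 0:
        ror[h:-h] = x[2 * h:] - x[:-2 * h]
    return ror

# Toy energy contour in dB, one value per ms:
# a slow 20 dB rise over 60 ms, then an abrupt 20 dB jump at 150 ms.
contour = np.zeros(200)
contour[20:80] = np.linspace(0.0, 20.0, 60)
contour[80:150] = 20.0
contour[150:] = 40.0

ror26 = rate_of_rise(contour, 26)   # large time step
ror4 = rate_of_rise(contour, 4)     # small time step

THRESHOLD_DB = 6.0                  # hypothetical detection threshold
slow_seen_large = ror26[50] > THRESHOLD_DB
slow_seen_small = ror4[50] > THRESHOLD_DB
# Width (in ms) of the region where the abrupt jump gives the full ROR value.
plateau_large = int(np.sum(ror26 > 19.9))
plateau_small = int(np.sum(ror4 > 19.9))
```

With the 26 ms step the slow rise exceeds the threshold, but the abrupt jump produces a flat 26 ms wide maximum, so its position is ambiguous. With the 4 ms step the maximum is only 4 ms wide, but the slow rise stays far below the threshold.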
Improving temporal resolution: Pass 2
▪ 40 ms window centered around burst landmarks detected in Pass 1
▪ decomposed to 6 levels by discrete Meyer wavelet
▪ detail (high frequency) contents in the lower two levels used for localizing bursts
Parameters
▪ short-time energy variation
▪ zero crossing rate
Procedure
▪ Compute normalized RORs with a time-step of 3 ms
▪ Compute a new transition index T_ez(n)
▪ Relocate the landmark to the location corresponding to the peak in T_ez(n)
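The relocation step can be sketched as below, under several loudly stated assumptions. The wavelet decomposition is omitted and the parameters are computed directly on the waveform. The formula for T_ez(n) is not available in the text, so the product of the two normalized, positive RORs used here is only a stand-in and not the authors' definition. The 3 ms analysis frame and all function names are hypothetical.

```python
import numpy as np

FS = 10_000       # samples/s, as in Pass 1
STEP = 30         # 3 ms ROR time-step, in samples
FRAME = 30        # hypothetical 3 ms analysis frame, in samples
HALF_WIN = 200    # half of the 40 ms search window, in samples

def sliding_sum(v, n):
    """Sum of v over a sliding n-sample frame, same length as v."""
    return np.convolve(v, np.ones(n), mode="same")

def rate_of_rise(p, step):
    """Symmetric difference over `step` samples; zero at the edges."""
    h = step // 2
    ror = np.zeros(len(p))
    ror[h:-h] = p[2 * h:] - p[:-2 * h]
    return ror

def normalized_positive(ror):
    """Keep rises only and scale the largest rise to 1."""
    pos = np.maximum(ror, 0.0)
    peak = pos.max()
    return pos / peak if peak > 0 else pos

def relocate(x, landmark):
    """Move a Pass 1 landmark (sample index) to the peak of the stand-in index."""
    x = np.asarray(x, dtype=float)
    energy = sliding_sum(x ** 2, FRAME)
    sign = np.signbit(x).astype(int)
    crossings = np.abs(np.diff(sign, prepend=sign[0]))
    zcr = sliding_sum(crossings, FRAME) / FRAME
    # Assumed combination: product of normalized energy and ZCR rises.
    t_ez = (normalized_positive(rate_of_rise(energy, STEP))
            * normalized_positive(rate_of_rise(zcr, STEP)))
    lo = max(landmark - HALF_WIN, 0)
    hi = min(landmark + HALF_WIN, len(x))
    return lo + int(np.argmax(t_ez[lo:hi]))

# Toy token: silence (closure) followed by a rapidly alternating "burst"
# starting at sample 530; Pass 1 is assumed to have placed the landmark 7 ms late.
n = np.arange(530, 1000)
x = np.zeros(1000)
x[530:] = 0.5 * np.where(n % 2 == 0, 1.0, -1.0)
new_landmark = relocate(x, 600)
```

On this toy token the relocated landmark lands within one sample (0.1 ms) of the true onset. Real bursts are noisier, so the achievable accuracy depends on the frame length and on how the two parameters are actually combined.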
Relocating stop landmarks (figures)
3. EXPERIMENTAL RESULTS
Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ total 54 tokens
Pass 1 (table): detections for /p/, /t/, /k/ by initial vowel (a, i, u) at temporal resolutions of 30 ms, 20 ms, 10 ms, and 5 ms, with Det. % per column (values missing in source)
Pass 2 (table): detections for /p/, /t/, /k/ by initial vowel (a, i, u) at temporal resolutions of 10 ms, 7 ms, 5 ms, and 3 ms, with Det. % per column (values missing in source)
Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens
Detection rates (table): Det. (%) by pass at 30 ms, 20 ms, and 10 ms for phoneme classes Stop (548), Fricative (266), Nasal (154), Vowel (614), S. vowel (213), and overall det. (%) (values missing in source)
Localization error (figure)
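A detection rate at a given temporal resolution can be computed along the following lines: a reference landmark counts as detected when an unused detected landmark lies within the tolerance. The one-to-one nearest-match rule, the function name, and the landmark times are illustrative assumptions; they are not the authors' scoring procedure or data.

```python
def detection_rate(reference_ms, detected_ms, tol_ms):
    """Percentage of reference landmarks matched within +/- tol_ms."""
    unused = sorted(detected_ms)
    hits = 0
    for ref in sorted(reference_ms):
        candidates = [d for d in unused if abs(d - ref) <= tol_ms]
        if candidates:
            best = min(candidates, key=lambda d: abs(d - ref))
            unused.remove(best)      # each detection may match only once
            hits += 1
    return 100.0 * hits / len(reference_ms)

# Hypothetical landmark times in ms (not taken from the experiments).
reference = [100, 250, 400, 560]
detected = [102, 262, 395, 700]

rates = {tol: detection_rate(reference, detected, tol) for tol in (30, 10, 5, 3)}
```

In this toy case the rate falls from 75 % at 30 ms to 25 % at 3 ms, which is the pattern the tables report: the tighter the required resolution, the fewer landmarks count as detected.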
4. SUMMARY & CONCLUSION
Pass 2 improves temporal resolution of stop landmarks
▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement at 5 ms resolution
▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
Possible reasons for the marginal improvement in sentences
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may exceed 30 ms
▪ the 40 ms window used in Pass 2 may need modification
▪ errors in the manual labels
Future work: evaluation of the method in the presence of noise