Published by Jessica Garrett. Modified over 9 years ago.
1
Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, pcpandey, vinod}@ee.iitb.ac.in
EE Dept, IIT Bombay
14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India
3rd February, 2008
2
PRESENTATION OUTLINE
1. Introduction
   ▪ Acoustic properties of clear speech
   ▪ Landmark detection
   ▪ Need for high time resolution
2. Automated landmark detection with high resolution
   ▪ Pass 1
   ▪ Pass 2
3. Experimental results
4. Summary and conclusion
3
1. INTRODUCTION
Acoustic properties of clear speech
Clear speech: Speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
Examples - http://www.acoustics.org/press/145th/clr-spch-tab.htm
‘the book tells a story’, ‘the boy forgot his book’ [Conversational, Clear]
Intelligibility of clear speech
▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners & listening conditions
4
Acoustic differences between clear and conversational speech
Sentence level
▪ Reduced speaking rate (conv: 200 wpm, clr: 100 wpm)
▪ Larger variation in fundamental frequency
▪ Increased number of pauses, longer pause durations
Word level
▪ Fewer sound deletions
▪ More sound insertions
Phonetic level
▪ Context dependent, non-linear increase in segment durations
▪ More targeted vowel formants
▪ Increase in consonant intensity
5
Improvement in intelligibility of conversational speech by incorporating properties of clear speech
Consonant–vowel intensity ratio (CVR) enhancement
▪ Increasing energy of consonant segment
Consonant duration enhancement
▪ Increasing the duration of CV and VC transitions (burst duration, VOT, formant transition)
Challenges
▪ Accurate detection of regions for modification
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without altering the overall speaking rate: lengthening of transition regions with a corresponding decrease in steady-state segments
6
Intelligibility enhancement using properties of clear speech
Hazan & Simpson, 1998
▪ manually labeled VCV syllables and sentences
▪ intensity modification of stop burst +12 dB, frication +6 dB, nasal +6 dB
▪ spectral modification by filtering
Colotte & Laprie, 2000
▪ automated method for identifying regions based on mel-cepstral analysis
▪ stops and unvoiced fricatives amplified by +4 dB
▪ transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
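The intensity modifications quoted above are gains in decibels. As a minimal sketch, assuming the gain is applied as a plain amplitude scaling of a labeled segment, the conversion from dB to a linear factor can be written as follows. The function names, the toy signal, and the segment boundaries are hypothetical and are not taken from the cited studies.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude scale factor."""
    return 10.0 ** (gain_db / 20.0)

def amplify_segment(samples, start, end, gain_db):
    """Return a copy of samples with samples[start:end] scaled by gain_db."""
    factor = db_to_amplitude(gain_db)
    out = list(samples)
    for i in range(start, end):
        out[i] = out[i] * factor
    return out

# Hypothetical toy signal: indices 2..4 stand in for a labeled stop burst.
signal = [0.0, 0.1, 0.02, -0.03, 0.01, 0.5, -0.4]
boosted = amplify_segment(signal, 2, 5, 12.0)  # +12 dB on the burst only
```

A +12 dB gain corresponds to an amplitude factor of roughly 4, and +6 dB to roughly 2; samples outside the segment are left unchanged.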
7
Landmark detection
Speech landmarks
▪ Regions containing important information for speech perception
▪ Associated with spectral transitions
Landmark types
1. Abrupt-consonantal (AC) - Tight constrictions of primary articulators
2. Abrupt (A) - Fast glottal or velum activity
3. Non-abrupt (N) - Semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V) - Vowel landmarks
Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)
8
Landmarks
9
Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0-0.4, 0.8-1.5, 1.2-2.0, 2.0-3.5, 3.5-5.0, 5.0-8 kHz
▪ Parameter: First difference of maximum energy (log) in each spectral band; time-step = 50 ms at coarse level, 26 ms at fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal landmarks, sonorant closures and releases, and stop closures and releases
Application: Extraction of features for supporting speech recognition
10
Detection rate vs. temporal resolution
[Figure: detection rates 73 %, 83 %, 88 %, 44 %]
Uses same processing for all types of landmarks
11
Niyogi & Sondhi, 2002
▪ for stop consonants
▪ total energy & energy above 3 kHz in log scale
▪ measure of spectral flatness
▪ non-linear operator optimized for burst detection
Salomon et al., 2002
▪ Hilbert transform based envelope to extract temporal parameters
▪ spectral information
▪ adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
▪ wavelet transform based decomposition
▪ energy variations in 6 bands
12
Need for high temporal resolution and detection rate
Application dependent
Speech recognition: Analysis is performed around landmarks for parameter extraction
▪ high accuracy
▪ moderate temporal resolution (20-30 ms)
Intelligibility enhancement: Modify landmark regions
▪ high temporal resolution (< 5 ms)
▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
Landmark type
▪ Short duration events (bursts) need high time resolution
▪ Voicing onsets/offsets may not require this much resolution, as signal properties remain the same for a long duration
13
Factors limiting detection rate and temporal resolution
▪ Effectiveness of parameters in capturing acoustic variations
   ▪ short-time energy variation in spectral bands: weak bursts may not get detected
   ▪ centroid frequency: not well defined during low energy segments
   ▪ fixed band boundaries: may not adapt to speech variability
▪ Smoothing performed during parameter extraction
   ▪ temporal smoothing of the spectrum affects time resolution
▪ Type of distance measure
   ▪ first difference operation not optimized for all types of landmarks
   ▪ time-step of 10 ms is too high for burst detection
▪ Effect of noise on parameters
14
▪ Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
▪ Separate detectors are required for each phonetic class
▪ Each detector must use a method most suited for the phonetic event
Objective
Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement
15
2. AUTOMATED LANDMARK DETECTION
16
Landmark detection using spectral peaks and centroids
Pass 1
Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ frame rate 1 ms
Parameters
▪ maximum energy in each spectral band, every 1 ms
▪ band centroids estimated in each band, every 1 ms
▪ features similar to formant peaks and formant frequencies
▪ can be estimated easily
▪ not much affected by noise
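The Pass 1 parameter extraction can be sketched as below. The sampling rate, FFT size, frame length, frame rate, and band edges come from the slide; the Hamming window, the small floor inside the logarithm, and the energy-weighted definition of the band centroid are assumptions made for illustration and may differ from the authors' implementation. The function name is hypothetical.

```python
import numpy as np

FS = 10_000        # sampling frequency in samples/s (from the slide)
FRAME_LEN = 60     # 6 ms frame at 10 k samples/s
HOP = 10           # 1 ms frame rate
NFFT = 512         # 512-point FFT, frame is zero-padded
BAND_EDGES_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_peaks_and_centroids(x):
    """Return (peaks_db, centroids_hz), each of shape (n_frames, n_bands)."""
    x = np.asarray(x, dtype=float)
    window = np.hamming(FRAME_LEN)              # assumed analysis window
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)   # bin frequencies, 0..5000 Hz
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    peaks = np.zeros((n_frames, len(BAND_EDGES_HZ)))
    cents = np.zeros_like(peaks)
    for m in range(n_frames):
        frame = x[m * HOP : m * HOP + FRAME_LEN] * window
        power = np.abs(np.fft.rfft(frame, NFFT)) ** 2
        for b, (lo, hi) in enumerate(BAND_EDGES_HZ):
            # Bands are half-open [lo, hi); the top band also keeps the Nyquist bin.
            upper = freqs <= hi if hi >= FS / 2 else freqs < hi
            idx = (freqs >= lo) & upper
            p = power[idx]
            peaks[m, b] = 10.0 * np.log10(p.max() + 1e-12)              # peak energy, dB
            cents[m, b] = np.sum(freqs[idx] * p) / (np.sum(p) + 1e-12)  # centroid, Hz
    return peaks, cents

# Usage on a synthetic 1 kHz tone, 50 ms long.
t = np.arange(500) / FS
tone = np.sin(2 * np.pi * 1000 * t)
peaks_db, centroids_hz = band_peaks_and_centroids(tone)
```

For the tone, the largest peak energy falls in the 0.4–1.2 kHz band in every frame, and that band's centroid sits close to 1 kHz.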
17
Peak energy
Centroid frequency
Rate-of-rise functions
Transition index
▪ tracks simultaneous variation of energy and centroid
▪ centroids given less weighting in low energy areas
18
Example: /uka/
Peak & centroid contours
[Figure panels: 0-0.4 kHz, 0.4-1.2 kHz, 1.2-2.0 kHz, 2.0-3.5 kHz, 3.5-5.0 kHz]
19
Example: /uka/
Peak & centroid ROR contours, time step = 26 ms
[Figure panels: 0-0.4 kHz, 0.4-1.2 kHz, 1.2-2.0 kHz, 2.0-3.5 kHz, 3.5-5.0 kHz]
20
Example: /uka/
Transition index derived from RORs with time step = 26 ms
21
Example: /uka/
Transition index derived from RORs with time step = 4 ms
Less sensitive to slow transitions
22
Problems
Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ detects slowly varying events also (higher detection rate)
Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions
Pass 2: Analyze landmarks detected in Pass 1 with a small time-step
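The trade-off between large and small time steps can be illustrated on a synthetic energy contour sampled every 1 ms. This is a sketch under stated assumptions: the rate-of-rise is taken here as a symmetric difference across the time step, and the contour (a 20 dB abrupt jump followed by a 20 dB rise spread over 60 ms) and the 5 dB threshold are hypothetical values chosen for the demonstration.

```python
import numpy as np

def rate_of_rise(contour, step):
    """Symmetric difference over `step` frames (assumed ROR definition)."""
    c = np.asarray(contour, dtype=float)
    half = step // 2
    ror = np.zeros_like(c)
    # ror[n] = c[n + half] - c[n - half] wherever both indices exist
    ror[half : len(c) - half] = c[2 * half :] - c[: len(c) - 2 * half]
    return ror

# Synthetic contour in dB, one value per 1 ms frame.
contour = np.zeros(400)
contour[100:] += 20.0                                           # abrupt rise at 100 ms
contour[200:260] += np.linspace(0.0, 20.0, 60, endpoint=False)  # slow rise over 60 ms
contour[260:] += 20.0

ror_26 = rate_of_rise(contour, 26)   # large time step
ror_4 = rate_of_rise(contour, 4)     # small time step

THRESHOLD_DB = 5.0  # hypothetical detection threshold
slow_detected_26 = ror_26[180:300].max() > THRESHOLD_DB
slow_detected_4 = ror_4[180:300].max() > THRESHOLD_DB

# Number of frames over which the abrupt rise produces its maximum ROR.
plateau_26 = int(np.sum(ror_26[50:150] >= 19.9))
plateau_4 = int(np.sum(ror_4[50:150] >= 19.9))
```

With the 26 ms step the slow rise exceeds the threshold but the abrupt rise is smeared over a 26-frame plateau; with the 4 ms step the abrupt rise is confined to 4 frames while the slow rise stays below the threshold.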
23
Improving temporal resolution: Pass 2
▪ 40 ms window centered around burst landmarks detected in Pass 1
▪ decomposed to 6 levels by discrete Meyer wavelet
▪ detail (high frequency) contents in the lower two levels used for localizing bursts
Parameters
▪ short time energy variation
▪ zero crossing rate
Compute normalized RORs with a time-step of 3 ms
Get a new transition index T_ez(n)
Relocate landmark to the location corresponding to the peak in T_ez(n)
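The relocation step can be sketched as follows. The 40 ms window, the 3 ms time step, and the two parameters (short-time energy and zero-crossing rate) follow the slide. The wavelet decomposition stage is omitted, so the features are computed on the raw window rather than on the detail coefficients. The expression combining the two RORs into the transition index is not reproduced in this transcript, so a plain sum of the normalized RORs is assumed here. All function names, the 3 ms feature frame, and the synthetic test signal are hypothetical.

```python
import numpy as np

FS = 10_000  # samples/s, as in Pass 1

def short_time_features(x, frame_len=30, hop=10):
    """Log energy (dB) and zero-crossing rate per frame, one frame every 1 ms."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.zeros(n_frames)
    zcr = np.zeros(n_frames)
    for m in range(n_frames):
        f = x[m * hop : m * hop + frame_len]
        energy[m] = 10.0 * np.log10(np.sum(f ** 2) + 1e-12)
        zcr[m] = np.mean(np.abs(np.diff(np.sign(f))) > 0)
    return energy, zcr

def normalized_ror(contour, step):
    """Forward difference over `step` frames, scaled to unit peak magnitude."""
    ror = np.zeros_like(contour)
    ror[:-step] = contour[step:] - contour[:-step]
    peak = np.max(np.abs(ror))
    return ror / peak if peak > 0 else ror

def relocate_landmark(x, landmark, win_ms=40, step_ms=3, frame_len=30, hop=10):
    """Refine a Pass 1 landmark (sample index) inside a window centered on it."""
    half = int(win_ms * FS / 1000) // 2
    start = max(0, landmark - half)
    seg = np.asarray(x[start : landmark + half], dtype=float)
    energy, zcr = short_time_features(seg, frame_len, hop)
    step = step_ms * FS // (1000 * hop)          # time step in frames
    # Assumed combination rule: sum of the two normalized RORs.
    t_ez = normalized_ror(energy, step) + normalized_ror(zcr, step)
    # Attribute the rise to the start of the later frame in the difference.
    return start + (int(np.argmax(t_ez)) + step) * hop

# Synthetic test: weak 100 Hz voice bar during closure, noise burst from sample 530.
rng = np.random.default_rng(0)
n = np.arange(1000)
x = 0.01 * np.sin(2 * np.pi * 100 * n / FS)
true_onset = 530
x[true_onset:] = 0.5 * rng.standard_normal(1000 - true_onset)
refined = relocate_landmark(x, landmark=500)   # Pass 1 estimate is 3 ms early
```

In this synthetic case the refined landmark should land within a few milliseconds of the burst onset; on real speech the behaviour would depend on the omitted wavelet stage and on the actual transition index.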
24
Relocating stop landmarks
25
Relocating stop landmarks
26
Relocating stop landmarks
27
3. EXPERIMENTAL RESULTS
Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ Total 54 tokens

Pass 1
Stop            30 ms      20 ms      10 ms      5 ms
Initial vowel   a  i  u    a  i  u    a  i  u    a  i  u
/p/             -  -  -    -  -  -    -  -  -    1  1  2
/t/             -  -  -    -  -  -    -  -  -    1  1  2
/k/             -  -  -    1  -  -    1  -  1    3  3  3
Det. %          100        98.1       96.3       68.5

Pass 2
Stop            10 ms      7 ms       5 ms       3 ms
Initial vowel   a  i  u    a  i  u    a  i  u    a  i  u
/p/             -  -  -    -  -  -    -  -  -    -  -  -
/t/             -  -  -    -  -  -    -  -  -    -  1  -
/k/             -  -  -    -  -  -    -  -  -    -  -  -
Det. %          100        100        100        98.1
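The Det. % row of the Pass 1 table can be recomputed from the per-cell entries, assuming each entry is a count of landmarks not detected within the given tolerance and "-" means none. The dictionary layout and function name below are hypothetical; the counts are those of the Pass 1 table.

```python
TOTAL_TOKENS = 54  # 2 speakers x 3 stops x 3 initial vowels x 3 final vowels

# Missed detections per stop for initial vowels a, i, u ("-" read as 0), Pass 1.
pass1_misses = {
    "30 ms": {"p": [0, 0, 0], "t": [0, 0, 0], "k": [0, 0, 0]},
    "20 ms": {"p": [0, 0, 0], "t": [0, 0, 0], "k": [1, 0, 0]},
    "10 ms": {"p": [0, 0, 0], "t": [0, 0, 0], "k": [1, 0, 1]},
    "5 ms":  {"p": [1, 1, 2], "t": [1, 1, 2], "k": [3, 3, 3]},
}

def detection_rate(misses, total=TOTAL_TOKENS):
    """Percentage of tokens detected, given per-stop miss counts."""
    missed = sum(sum(counts) for counts in misses.values())
    return 100.0 * (total - missed) / total

pass1_rates = {tol: round(detection_rate(m), 1) for tol, m in pass1_misses.items()}
```

Under this reading the computed rates are 100.0, 98.1, 96.3, and 68.5 %, matching the Det. % values reported for Pass 1.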
28
Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens

Detection rates, Det. (%)
Phoneme class       30 ms             20 ms             10 ms
                    Pass 1  Pass 2    Pass 1  Pass 2    Pass 1  Pass 2
Stop (548)          94      96        82      86        62      66
Fricative (266)     95      95        90      90        76      79
Nasal (154)         80      79        70      70        53      51
Vowel (614)         77      79        70      71        58      57
S. vowel (213)      69      70        68      67        60      61
Overall det. (%)    84.1    85.7      76.4    78.0      61.7    63.0

[Figure: Localization error]
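As a consistency check on the sentence-level table, the overall detection rate can be approximated as the token-weighted mean of the per-class rates. This is a sketch under assumptions: the fricative rates at 30 ms and 20 ms and the nasal rate at 20 ms are taken to be shared between Pass 1 and Pass 2, since only one value per pair appears in the transcript, and the per-class percentages are rounded, so small deviations from the reported overall values are expected. Variable and function names are hypothetical.

```python
# Token counts per phoneme class (from the table).
counts = {"stop": 548, "fricative": 266, "nasal": 154, "vowel": 614, "s_vowel": 213}

# Per-class detection rates (%), columns: 30 ms P1, P2, 20 ms P1, P2, 10 ms P1, P2.
rates = {
    "stop":      [94, 96, 82, 86, 62, 66],
    "fricative": [95, 95, 90, 90, 76, 79],   # 30 ms and 20 ms values assumed shared
    "nasal":     [80, 79, 70, 70, 53, 51],   # 20 ms value assumed shared
    "vowel":     [77, 79, 70, 71, 58, 57],
    "s_vowel":   [69, 70, 68, 67, 60, 61],
}

reported_overall = [84.1, 85.7, 76.4, 78.0, 61.7, 63.0]

def weighted_overall(column):
    """Token-weighted mean detection rate for one column of the table."""
    total = sum(counts.values())
    return sum(counts[c] * rates[c][column] for c in counts) / total

computed_overall = [weighted_overall(j) for j in range(6)]
max_deviation = max(abs(c - r) for c, r in zip(computed_overall, reported_overall))
```

The weighted means stay within a few tenths of a percentage point of the reported overall rates, which is compatible with rounding of the per-class values.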
29
4. SUMMARY & CONCLUSION
Pass 2 improves temporal resolution of stop landmarks
▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement for 5 ms resolution
▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
Possible reasons
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may be above 30 ms
▪ use of 40 ms window in Pass 2, may need modification
▪ errors in the manual labels
Future work: Evaluation of the method in presence of noise