IIT Bombay, 14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India


Slide 1: Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, pcpandey, vinod}@ee.iitb.ac.in
EE Dept., IIT Bombay
14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India
3rd February, 2008

Slide 2: PRESENTATION OUTLINE
1. Introduction
   • Acoustic properties of clear speech
   • Landmark detection
   • Need for high time resolution
2. Automated landmark detection with high resolution
   • Pass 1
   • Pass 2
3. Experimental results
4. Summary and conclusion

Slide 3: 1. INTRODUCTION
Acoustic properties of clear speech
Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
Examples (http://www.acoustics.org/press/145th/clr-spch-tab.htm), each in a conversational and a clear version:
   ▪ ‘the book tells a story’
   ▪ ‘the boy forgot his book’
Intelligibility of clear speech
   ▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
   ▪ More intelligible for different classes of listeners and listening conditions

Slide 4: Acoustic differences between clear and conversational speech
• Sentence level
   ▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
   ▪ Larger variation in fundamental frequency
   ▪ More pauses, longer pause durations
• Word level
   ▪ Fewer sound deletions
   ▪ More sound insertions
• Phonetic level
   ▪ Context-dependent, non-linear increase in segment durations
   ▪ More targeted vowel formants
   ▪ Increase in consonant intensity

Slide 5: Improvement in intelligibility of conversational speech by incorporating properties of clear speech
• Consonant–vowel intensity ratio (CVR) enhancement: increasing the energy of the consonant segment
• Consonant duration enhancement: lengthening CV and VC transitions (burst duration, VOT, formant transition)
Challenges
• Accurate detection of regions for modification
• Analysis-modification-synthesis with low processing artifacts
• Processing without changing the overall speaking rate: an increase in transition regions with a corresponding decrease in steady-state segments

Slide 6: Intelligibility enhancement using properties of clear speech
Hazan & Simpson, 1998
• Manually labeled VCV syllables and sentences
• Intensity modification: stop burst +12 dB, frication +6 dB, nasal +6 dB
• Spectral modification by filtering
Colotte & Laprie, 2000
• Automated method for identifying regions based on mel-cepstral analysis
• Stops and unvoiced fricatives amplified by +4 dB
• Transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
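The intensity modifications quoted on this slide are given in decibels. As a minimal sketch, assuming the gain is applied as a plain amplitude scaling of an already labeled segment, the conversion is factor = 10^(dB/20). The function names `db_to_amplitude` and `amplify_segment` and the toy signal are illustrative, not taken from the cited work.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)

def amplify_segment(samples, start, end, gain_db):
    """Return a copy of samples with samples[start:end] scaled by gain_db."""
    factor = db_to_amplitude(gain_db)
    out = list(samples)
    for i in range(start, end):
        out[i] *= factor
    return out

# Toy signal: samples 1..2 stand in for a labeled stop burst (hypothetical labels)
signal = [0.0, 0.1, -0.1, 0.5, -0.5, 0.0]
boosted = amplify_segment(signal, 1, 3, 12.0)   # +12 dB on the "burst"
```

A +12 dB gain is roughly a fourfold increase in amplitude and +6 dB roughly a doubling, so a real implementation would also need to guard against clipping.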

Slide 7: Landmark detection
Speech landmarks
• Regions containing important information for speech perception
• Associated with spectral transitions
Landmark types
1. Abrupt-consonantal (AC): tight constrictions of primary articulators
2. Abrupt (A): fast glottal or velum activity
3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V): vowel landmarks
• Abrupt (~68%)
• Vocalic (~29%)
• Non-abrupt (~3%)

Slide 8: Landmarks (figure)

Slide 9: Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0-0.4, 0.8-1.5, 1.2-2.0, 2.0-3.5, 3.5-5.0, 5.0-8.0 kHz
▪ Parameter: first difference of the maximum (log) energy in each spectral band; time-step = 50 ms at the coarse level, 26 ms at the fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal, sonorant closure and release, and stop closure and release landmarks
Application: extraction of features for supporting speech recognition

Slide 10: Detection rate vs. temporal resolution (figure; values shown: 73 %, 83 %, 88 %, 44 %)
• Uses the same processing for all types of landmarks

Slide 11
Niyogi & Sondhi, 2002
• For stop consonants
• Total energy and energy above 3 kHz, in log scale
• Measure of spectral flatness
• Non-linear operator optimized for burst detection
Salomon et al., 2002
• Hilbert-transform-based envelope to extract temporal parameters
• Spectral information
• Adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
• Wavelet-transform-based decomposition
• Energy variations in 6 bands

Slide 12: Need for high temporal resolution and detection rate
• Application dependent
• Speech recognition: analysis is performed around landmarks for parameter extraction
   ▪ high accuracy
   ▪ moderate temporal resolution (20-30 ms)
• Intelligibility enhancement: landmark regions are modified
   ▪ high temporal resolution (< 5 ms)
   ▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
• Landmark type
   ▪ Short-duration events (bursts) need high time resolution
   ▪ Voicing onsets/offsets may not require as much resolution, as signal properties remain the same for a long duration

Slide 13: Factors limiting detection rate and temporal resolution
▪ Effectiveness of parameters in capturing acoustic variations
   ▪ short-time energy variation in spectral bands: a weak burst may not get detected
   ▪ centroid frequency is not well defined during low-energy segments
   ▪ fixed band boundaries may not adapt to speech variability
▪ Smoothing performed during parameter extraction
   ▪ temporal smoothing of the spectrum affects time resolution
▪ Type of distance measure
   ▪ first-difference operation is not optimized for all types of landmarks
   ▪ a time-step of 10 ms is too large for burst detection
▪ Effect of noise on parameters

Slide 14
• Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
• Separate detectors are required for each phonetic class
• Each detector must use the method best suited to the phonetic event
Objective
Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement

Slide 15: 2. AUTOMATED LANDMARK DETECTION

Slide 16: Landmark detection using spectral peaks and centroids
Pass 1
• Spectrum divided into five non-overlapping bands
   ▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
   ▪ sampling frequency 10 k samples/s
   ▪ 512-point FFT on 6 ms frames
   ▪ one frame every 1 ms
Parameters
   ▪ maximum energy in each spectral band, every 1 ms
   ▪ band centroids estimated in each band, every 1 ms
   ▪ features similar to formant peaks and formant frequencies
   ▪ can be estimated easily
   ▪ not much affected by noise
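The Pass 1 parameters can be sketched for a single frame as follows. The band edges, the 10 kHz sampling rate, the 6 ms frame and the 512-point transform length come from the slide. The Hamming window, the power-weighted centroid and the function names are assumptions made for illustration. A plain DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math

FS = 10_000      # sampling frequency in samples/s (from the slide)
NFFT = 512       # transform length (from the slide)
FRAME = 60       # 6 ms frame at 10 kHz
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def power_spectrum(frame):
    """Power spectrum (bins 0..NFFT/2) of a Hamming-windowed, zero-padded frame."""
    n = len(frame)
    win = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(frame, win)]
    spec = []
    for k in range(NFFT // 2 + 1):
        # Only the first n samples are non-zero, so the sum stops at n
        acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / NFFT) for i in range(n))
        spec.append(abs(acc) ** 2)
    return spec

def band_peak_and_centroid(spec):
    """Return one (peak energy in dB, centroid in Hz) pair per band."""
    df = FS / NFFT   # frequency spacing between bins
    out = []
    for lo, hi in BANDS_HZ:
        ks = [k for k in range(len(spec)) if lo <= k * df < hi]
        total = sum(spec[k] for k in ks)
        peak_db = 10 * math.log10(max(spec[k] for k in ks) + 1e-12)
        # Assumed definition: power-weighted mean frequency within the band
        centroid = sum(k * df * spec[k] for k in ks) / total if total > 0 else (lo + hi) / 2
        out.append((peak_db, centroid))
    return out

# One 6 ms frame of a 1.6 kHz tone: the third band should dominate
frame = [math.sin(2 * math.pi * 1600 * i / FS) for i in range(FRAME)]
features = band_peak_and_centroid(power_spectrum(frame))
```

Moving the frame start by 10 samples at a time gives the 1 ms frame rate listed above and one peak contour and one centroid contour per band.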

Slide 17
• Peak energy
• Centroid frequency
• Rate-of-rise functions
• Transition index
   ▪ tracks simultaneous variation of energy and centroid
   ▪ centroids given less weighting in low-energy areas
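The defining equations for the rate-of-rise (ROR) functions and the transition index are not part of this transcript. The sketch below therefore uses assumed forms that match only the verbal description: a symmetric difference over a chosen time-step for the ROR, and an index that adds the energy ROR to a centroid ROR weighted by the relative energy of the frame. The names `rate_of_rise` and `transition_index`, the normalizations and the weighting are hypothetical, not the authors' formulas.

```python
def rate_of_rise(x, step):
    """Symmetric difference x[n + step//2] - x[n - step//2], edges clamped."""
    half = step // 2
    last = len(x) - 1
    return [x[min(n + half, last)] - x[max(n - half, 0)] for n in range(len(x))]

def transition_index(peaks_db, centroids_hz, step):
    """Hypothetical index: per band, |energy ROR| plus energy-weighted |centroid ROR|.

    peaks_db and centroids_hz hold one contour per band, one value per 1 ms frame.
    """
    n_frames = len(peaks_db[0])
    total = [0.0] * n_frames
    for e, c in zip(peaks_db, centroids_hz):
        e_min = min(e)
        span = (max(e) - e_min) or 1.0
        r_e = rate_of_rise(e, step)
        r_c = rate_of_rise(c, step)
        e_scale = max(abs(v) for v in r_e) or 1.0
        c_scale = max(abs(v) for v in r_c) or 1.0
        for n in range(n_frames):
            weight = (e[n] - e_min) / span   # near 0 in low-energy frames
            total[n] += abs(r_e[n]) / e_scale + weight * abs(r_c[n]) / c_scale
    return total

# Toy input: one band with a 40 dB jump at frame 50 and a constant centroid
energy = [[20.0] * 50 + [60.0] * 50]
centroid = [[800.0] * 100]
t = transition_index(energy, centroid, step=4)
```

With this toy input the index is zero except in the few frames around frame 50, which is where a peak-picking stage would place the landmark.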

Slide 18: Example: /uka/, peak and centroid contours (figure; panels: 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz)

Slide 19: Example: /uka/, peak and centroid ROR contours, time step = 26 ms (figure; panels: 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz)

Slide 20: Example: /uka/, transition index derived from RORs with time step = 26 ms (figure)

Slide 21: Example: /uka/, transition index derived from RORs with time step = 4 ms (figure)
• Less sensitive to slow transitions

Slide 22: Problems
Large time step (> 20 ms)
   ▪ detects with less temporal accuracy
   ▪ also detects slowly varying events (higher detection rate)
Small time step (< 5 ms)
   ▪ detects abrupt transitions with good resolution
   ▪ misses slow transitions
Pass 2: analyze the landmarks detected in Pass 1 with a small time-step
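The trade-off on this slide can be reproduced with two toy energy contours, one with an abrupt 40 dB jump and one with the same rise spread over 40 ms. This is a sketch with synthetic data, not the authors' signals. The symmetric-difference ROR is an assumed form, and the time-steps of 26 ms and 4 ms are the ones used in the /uka/ examples.

```python
def rate_of_rise(x, step):
    """Symmetric difference x[n + step//2] - x[n - step//2], edges clamped."""
    half = step // 2
    last = len(x) - 1
    return [x[min(n + half, last)] - x[max(n - half, 0)] for n in range(len(x))]

# Energy contours in dB, one value per 1 ms frame (synthetic)
abrupt = [20.0] * 100 + [60.0] * 100                               # 40 dB jump at 100 ms
slow = [20.0] * 80 + [20.0 + i for i in range(40)] + [60.0] * 80   # 40 dB ramp over 40 ms

results = {}
for step in (26, 4):
    for name, contour in (("abrupt", abrupt), ("slow", slow)):
        ror = rate_of_rise(contour, step)
        peak = max(ror)
        # Number of frames sharing the peak value: a crude measure of localization spread
        width = sum(1 for v in ror if v == peak)
        results[(step, name)] = (peak, width)
```

For the abrupt jump both time-steps reach the full 40 dB, but the peak is spread over 26 frames with the large step and only 4 frames with the small one. For the slow ramp the ROR peak drops from 26 dB to 4 dB, so a fixed threshold tuned for bursts would miss it at the small step.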

Slide 23: Improving temporal resolution: Pass 2
   ▪ 40 ms window centered around burst landmarks detected in Pass 1
   ▪ decomposed to 6 levels by the discrete Meyer wavelet
   ▪ detail (high-frequency) contents in the lower two levels used for localizing bursts
Parameters
   ▪ short-time energy variation
   ▪ zero-crossing rate
Compute normalized RORs with a time-step of 3 ms
Get a new transition index T_ez(n) (equation not included in the transcript)
Relocate the landmark to the location corresponding to the peak in T_ez(n)
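A rough sketch of the relocation step follows, under several stated assumptions. The wavelet decomposition is omitted and the features are computed directly on the 40 ms window. Features are taken on 1 ms frames. The ROR is a backward difference over 3 frames. T_ez(n) is taken as the sum of the two normalized RORs, because the slide's defining equation is not available. All function names are hypothetical.

```python
FS = 10_000   # samples/s, as in Pass 1
HOP = 10      # 1 ms feature frames (assumed)
STEP = 3      # ROR time-step of 3 ms, expressed in frames

def frame_features(x):
    """Short-time energy and zero-crossing count for consecutive 1 ms frames."""
    energy, zcr = [], []
    for start in range(0, len(x) - HOP + 1, HOP):
        frame = x[start:start + HOP]
        energy.append(sum(s * s for s in frame))
        zcr.append(sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0))
    return energy, zcr

def normalized_ror(x, step):
    """Backward difference x[n] - x[n - step], scaled so the largest magnitude is 1."""
    ror = [x[n] - x[max(n - step, 0)] for n in range(len(x))]
    scale = max(abs(v) for v in ror) or 1.0
    return [v / scale for v in ror]

def relocate(window, window_start):
    """Return the sample index of the peak of the assumed combined index."""
    energy, zcr = frame_features(window)
    r_e = normalized_ror(energy, STEP)
    r_z = normalized_ror(zcr, STEP)
    t_ez = [a + b for a, b in zip(r_e, r_z)]   # assumed combination, not the slide's formula
    return window_start + t_ez.index(max(t_ez)) * HOP

# Toy 40 ms window: silent closure, then a burst-like alternating signal from sample 230
window = [0.0] * 230 + [0.5 * (-1) ** n for n in range(170)]
new_landmark = relocate(window, window_start=0)
```

In the method described on the slide, the same features would be computed on the wavelet detail signals of the lower two levels rather than on the raw window.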

Slide 24: Relocating stop landmarks (figure)

Slide 25: Relocating stop landmarks (figure)

Slide 26: Relocating stop landmarks (figure)

Slide 27: 3. EXPERIMENTAL RESULTS
Test material: VCV syllables
   ▪ 2 speakers (1 male, 1 female)
   ▪ 3 stop consonants (/p/, /t/, /k/)
   ▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
   ▪ Total 54 tokens

Pass 1
Stop           30 ms        20 ms        10 ms        5 ms
Initial vowel  a   i   u    a   i   u    a   i   u    a   i   u
/p/            -   -   -    -   -   -    -   -   -    1   1   2
/t/            -   -   -    -   -   -    -   -   -    1   1   2
/k/            -   -   -    1   -   -    1   -   1    3   3   3
Det. %         100          98.1         96.3         68.5

Pass 2
Stop           10 ms        7 ms         5 ms         3 ms
Initial vowel  a   i   u    a   i   u    a   i   u    a   i   u
/p/            -   -   -    -   -   -    -   -   -    -   -   -
/t/            -   -   -    -   -   -    -   -   -    -   1   -
/k/            -   -   -    -   -   -    -   -   -    -   -   -
Det. %         100          100          100          98.1
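The detection percentages in the Pass 1 table are consistent with reading each table entry as a count of missed tokens out of the 54 VCV tokens. That reading is an assumption, since the transcript does not label the entries. A short check, with the per-stop counts summed over the three initial vowels:

```python
TOKENS = 54   # 2 speakers x 3 stops x 3 initial vowels x 3 final vowels

# Assumed missed detections per stop at each temporal resolution (ms), Pass 1
pass1_misses = {
    30: {"p": 0, "t": 0, "k": 0},
    20: {"p": 0, "t": 0, "k": 1},
    10: {"p": 0, "t": 0, "k": 2},
    5:  {"p": 4, "t": 4, "k": 9},
}

def detection_rate(misses, total=TOKENS):
    """Percentage of tokens detected, given a dict of miss counts."""
    return 100.0 * (total - sum(misses.values())) / total

rates = {res: round(detection_rate(m), 1) for res, m in pass1_misses.items()}
```

Under this reading the 5 ms column has 17 misses, leaving 37 of 54 tokens, or 68.5 %. The single /t/ entry at 3 ms in Pass 2 gives 53 of 54, or 98.1 %.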

Slide 28
Test material: TIMIT sentences
   ▪ 5 speakers (2 male, 3 female)
   ▪ 10 sentences per speaker
   ▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
   ▪ total 418 tokens

Detection rates, Det. (%)
Phoneme class      30 ms             20 ms             10 ms
                   Pass 1   Pass 2   Pass 1   Pass 2   Pass 1   Pass 2
Stop (548)         94       96       82       86       62       66
Fricative (266)    95       95       90       90       76       79
Nasal (154)        80       79       70       70       53       51
Vowel (614)        77       79       70       71       58       57
S. vowel (213)     69       70       68       67       60       61
Overall det. (%)   84.1     85.7     76.4     78.0     61.7     63.0

Localization error (figure)
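The overall detection row appears to be the token-weighted mean of the per-class rates. That is an assumption about how it was computed, and it is checked here for the Pass 1, 30 ms column using the class counts given in parentheses.

```python
# (number of tokens, Pass 1 detection % at 30 ms) per phoneme class, from the table
classes = {
    "stop": (548, 94),
    "fricative": (266, 95),
    "nasal": (154, 80),
    "vowel": (614, 77),
    "s_vowel": (213, 69),
}

def overall_rate(table):
    """Token-weighted mean of the per-class detection percentages."""
    total = sum(n for n, _ in table.values())
    return sum(n * pct for n, pct in table.values()) / total

overall = overall_rate(classes)   # close to the 84.1 reported in the table
```

The small difference from the tabulated value is what rounding of the per-class percentages would produce. The same calculation on the other columns also lands within a few tenths of a percent of the reported overall figures.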

Slide 29: 4. SUMMARY & CONCLUSION
Pass 2 improves temporal resolution of stop landmarks
   ▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement for 5 ms resolution
   ▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
Possible reasons
   ▪ reduced closure duration in sentences
   ▪ unreleased bursts
   ▪ errors in Pass 1 may be above 30 ms
   ▪ use of a 40 ms window in Pass 2 may need modification
   ▪ errors in the manual labels
Future work: evaluation of the method in the presence of noise

