1 New Acoustic-Phonetic Correlates
Sorin Dusan and Larry Rabiner
sdusan@caip.rutgers.edu
Center for Advanced Information Processing, Rutgers University, Piscataway, New Jersey, U.S.A.
ASAT Meeting, Rutgers University, NJ, Oct. 13, 2006

2 OUTLINE
If more knowledge from speech perception and acoustic-phonetic studies is integrated into ASR, these systems should perform better.
Two types of acoustic-phonetic correlates are evaluated in this study, with links to studies of vowel/consonant perception:
A. Evaluate the distribution of information (that is, the relevance for vowel classification) of various acoustic patterns and features:
   - Static MFCC features outside the currently accepted vowel (phoneme) boundaries
   - Segmental durational features
   - Dynamic MFCC features with two slopes
B. Evaluate the temporal correlation between maximum spectral transition positions and phone boundaries:
   - Compute the spectral transition measure (STM) using static MFCC features and find its peaks for the training part of TIMIT, which contains 172,460 between-phone boundaries.
   - Analyze the deviation between these peaks and the phone boundaries.

3 METHODS
A. Evaluate the vowel information within and outside vowel boundaries, as done with listeners by Strange et al. (1976) and Furui (1986), but by performing automatic ML classification of 9 vowels coarticulated in three left-consonant and three right-consonant contexts. Evaluate 8 acoustic patterns:
   1. Spectral feature vector at the center of the vowel in CV or VC biphones.
   2. Spectral feature vector at 20 ms after vowel onset in CV biphones or 20 ms before vowel offset in VC biphones.
   3. Spectral feature vector at the CV or VC transition position.
   4. A vector containing the overall slope of each spectral feature, computed on a 40 ms interval centered at the CV or VC transition position.
   5. A vector containing the slopes of each spectral feature, computed on 20 ms intervals on the left and on the right side of the given CV or VC transition position. This vector can discriminate between monotonic and non-monotonic spectral transitions between phonemes (Dusan, 2005).
   6. Spectral feature vector at the center of the preceding consonant in CV biphones or the following consonant in VC biphones.
   7. A vector containing the vowel and the consonant durations in CV or VC biphones. This vector accounts for both the intrinsic duration of vowels and the vowel durational effect due to coarticulation with consonants.
   8. Spectral feature vector at the beginning of the consonant in CV biphones or at the end of the consonant in VC biphones.
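Patterns 4 and 5 can be illustrated with a short sketch. It is not the authors' implementation: it assumes a 10 ms frame step for the spectral features, a least-squares line fit for each slope, and hypothetical helper names (`ls_slope`, `slope_patterns`). Under those assumptions the 40 ms interval spans five frames and each 20 ms side spans three.

```python
def ls_slope(values, step_ms=10.0):
    """Least-squares slope (feature units per ms) of equally spaced samples."""
    n = len(values)
    times = [i * step_ms for i in range(n)]
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def slope_patterns(frames, t, half=2, step_ms=10.0):
    """Return (overall, double) slope vectors around transition frame t.

    frames: list of per-frame feature vectors (e.g. static MFCCs).
    half:   frames on each side of t (2 frames = 20 ms at an assumed 10 ms step).
    overall: one slope per feature over the whole interval (pattern 4).
    double:  left-side slopes followed by right-side slopes (pattern 5).
    """
    n_feat = len(frames[0])
    overall, left, right = [], [], []
    for k in range(n_feat):
        track = [frames[i][k] for i in range(t - half, t + half + 1)]
        overall.append(ls_slope(track, step_ms))
        left.append(ls_slope(track[: half + 1], step_ms))
        right.append(ls_slope(track[half:], step_ms))
    return overall, left + right


# Toy example: a single feature that rises and then falls around frame 2.
frames = [[0.0], [1.0], [2.0], [1.0], [0.0]]
overall, double = slope_patterns(frames, t=2)
print(overall, double)
```

In this toy case the overall slope is close to zero, while the two-slope vector has a positive left slope and a negative right slope. That is the kind of non-monotonic transition pattern 5 is meant to capture and pattern 4 cannot.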

4 METHODS
B. Investigate the relation between the perceptual critical points (Furui, 1986) for consonant and syllable identification and the phone boundaries by analyzing the temporal correlation between the maximum spectral transition positions and the phone boundaries:
   - Compute the spectral transition measure (STM) from the dynamic (delta) MFCC features. The dynamic features are computed using the first 10 static MFCC features (excluding the energy).
   - Find the peaks of the STM for the training part of TIMIT, which contains 172,460 between-phone boundaries.
   - Compute the deviation between the positions of the peaks and the phone boundaries.
   - Quantify this deviation in bins of 0, 10, 20, 30, and 40 ms.
If the STM peaks are in close proximity to phone boundaries, then the perceptual critical points are also in close proximity to phone boundaries, which could have implications for ASR.
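The STM and peak-picking steps might look like the sketch below. The slide does not give the exact formula, so this assumes a Furui-style definition: the mean of the squared regression (delta) coefficients across the MFCC features. The window of +/-2 frames, the edge handling, the peak threshold, and the function names are all illustrative assumptions.

```python
def delta(frames, t, k=2):
    """Regression (delta) coefficient of each feature at frame t over +/-k frames.

    Frames beyond either end are replaced by the nearest edge frame (an assumption).
    """
    n_feat = len(frames[0])
    last = len(frames) - 1
    den = sum(n * n for n in range(-k, k + 1))
    out = []
    for i in range(n_feat):
        num = sum(n * frames[min(max(t + n, 0), last)][i]
                  for n in range(-k, k + 1))
        out.append(num / den)
    return out


def stm(frames, k=2):
    """Assumed STM: mean squared delta coefficient at each frame."""
    n_feat = len(frames[0])
    return [sum(a * a for a in delta(frames, t, k)) / n_feat
            for t in range(len(frames))]


def find_peaks(curve, threshold=0.0):
    """Frame indices of local maxima above threshold (first frame of a plateau)."""
    return [t for t in range(1, len(curve) - 1)
            if curve[t] > threshold
            and curve[t] > curve[t - 1]
            and curve[t] >= curve[t + 1]]


# Toy signal: two features that jump between frame 5 and frame 6.
frames = [[0.0, 0.0]] * 6 + [[1.0, -1.0]] * 6
curve = stm(frames)
print(find_peaks(curve))
```

With real data, `frames` would hold the first 10 static MFCCs per frame, and the peak positions would then be compared with the TIMIT phone boundaries.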

5 DISTRIBUTION OF INFORMATION
Figure 1. Vowel classification scores in left- and right-consonant contexts for the 8 patterns.

6 DISTRIBUTION OF INFORMATION
Figure 2. Vowel classification scores for the static MFCC patterns in left- and right-consonant contexts.

7 DISTRIBUTION OF INFORMATION
Figure 3. Vowel classification scores in left-consonant contexts for combinations of all 8 patterns.
Figure annotation: 5.8% (~38% relative error reduction)

8 DISTRIBUTION OF INFORMATION
Figure 4. Vowel classification scores in right-consonant contexts for combinations of all 8 patterns.
Figure annotation: 6.9% (~37% relative error reduction)

9 STM PEAKS AND PHONE BOUNDARIES
Figure 5. Example 1: (a) Speech with manual phone boundaries, (b) STM with automatically detected phone boundaries, (c) STM and missed boundaries, (d) STM and inserted boundaries.
Frame step = 10 ms

10 STM PEAKS AND PHONE BOUNDARIES
Table 1. Results of the automatic phone boundary detection based on the STM function. Approximately 85% of the manually located phone boundaries are automatically detected.

           Total boundaries   Detected boundaries   Missed boundaries   Inserted boundaries
           (manual)           (automatic)           (automatic)         (automatic)
Count      172,460            145,950               26,510              48,566
Percent    100%               84.6%                 15.4%               28.2%

Frame step = 10 ms
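The slides do not state how STM peaks were paired with manual boundaries, so the following is only one plausible scoring scheme. It assumes a greedy nearest-peak match within a tolerance of `max_dev` frames (a hypothetical parameter). The last lines recompute the percentages in Table 1 from its counts, all taken relative to the 172,460 manual boundaries.

```python
def match_boundaries(manual, peaks, max_dev=4):
    """Pair each manual boundary with the nearest unused peak (assumed rule).

    manual, peaks: frame indices. max_dev: largest accepted deviation in frames.
    Returns (deviations of detected boundaries, missed count, inserted count).
    """
    unused = sorted(peaks)
    deviations = []
    missed = 0
    for b in sorted(manual):
        best = min(unused, key=lambda p: abs(p - b), default=None)
        if best is not None and abs(best - b) <= max_dev:
            deviations.append(abs(best - b))
            unused.remove(best)      # each peak can match only one boundary
        else:
            missed += 1
    return deviations, missed, len(unused)  # leftover peaks count as insertions


# Toy example with three manual boundaries and four STM peaks.
deviations, missed, inserted = match_boundaries([10, 20, 30], [10, 21, 50, 60])
print(deviations, missed, inserted)

# Recompute the Table 1 percentages from the reported counts.
total, detected, missed_n, inserted_n = 172460, 145950, 26510, 48566
for name, count in [("detected", detected), ("missed", missed_n),
                    ("inserted", inserted_n)]:
    print(name, round(100.0 * count / total, 1))
```

Detected and missed boundaries add up to the manual total (145,950 + 26,510 = 172,460), while inserted boundaries are extra peaks with no manual counterpart, which is why the three percentages sum to more than 100%.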

11 STM PEAKS AND PHONE BOUNDARIES
Figure 6. Normalized histogram showing the absolute deviation between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located boundaries.
Frame step = 10 ms

12 STM PEAKS AND PHONE BOUNDARIES
Figure 7. Normalized histogram showing the absolute deviation between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located boundaries.
Frame step = 10 ms

13 STM PEAKS AND PHONE BOUNDARIES
An analysis of the time difference between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located phone boundaries is shown in Table 2.

Table 2. Percentage of the 145,950 detected boundaries that are within various time spans of the manually located phone boundaries.

              Within 10 ms   Within 20 ms   Within 30 ms   Within 40 ms
Percentage    70%            89%            95%            97%
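Cumulative figures like those in Table 2 can be derived from per-boundary deviations as sketched below. The function name and the toy deviations are illustrative; the 10 ms frame step is the one stated on the slides.

```python
def within_spans(deviations_frames, spans_ms=(0, 10, 20, 30, 40), step_ms=10):
    """Percentage of detected boundaries whose deviation is at most each span.

    deviations_frames: absolute deviations in frames between each detected
    boundary and its manually located counterpart.
    """
    n = len(deviations_frames)
    return {span: 100.0 * sum(1 for d in deviations_frames
                              if d * step_ms <= span) / n
            for span in spans_ms}


# Toy deviations (in frames) for five detected boundaries.
print(within_spans([0, 1, 1, 2, 5]))
```

Because the spans are cumulative, each percentage includes all smaller deviations. A boundary that deviates by more than 40 ms, like the last one in the toy list, is counted in none of them.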

14 CONCLUSIONS
- The new acoustic patterns (two located outside vowel boundaries, the double-slope dynamic pattern at the boundary, and the durational pattern) contain significant vowel information.
- There is more information about the vowel identity at the beginning (in CV) or end (in VC) of the adjacent consonants than at the center of these consonants.
- The STM peaks are in close proximity to phone boundaries: 27% within 0 ms, 70% within 10 ms, 89% within 20 ms, 95% within 30 ms, and 97% within 40 ms of the manually located phone boundaries.
- The current study complements Furui’s perceptual study and shows that the phone boundaries are in close proximity to the maximum spectral transition positions and thus to the perceptual critical points.

