New Acoustic-Phonetic Correlates
Sorin Dusan and Larry Rabiner
Center for Advanced Information Processing, Rutgers University, Piscataway, New Jersey, U.S.A.
ASAT Meeting, Rutgers University, NJ, Oct. 13, 2006

OUTLINE

If more knowledge from speech perception and acoustic-phonetic studies is integrated into ASR, these systems should provide better performance. Two types of acoustic-phonetic correlates are evaluated in this study, with links to studies of vowel/consonant perception:

A. Evaluate the distribution of information (i.e., the relevance for vowel classification) of various acoustic patterns and features:
   - Static MFCC features outside the currently accepted vowel (phoneme) boundaries
   - Segmental durational features
   - Dynamic MFCC features with two slopes

B. Evaluate the temporal correlation between maximum spectral transition positions and phone boundaries:
   - Compute the spectral transition measure (STM) using static MFCC features and find its peaks for the training part of TIMIT, which contains 172,460 between-phone boundaries.
   - Analyze the deviation between these peaks and the phone boundaries.

METHODS

A. Evaluate the vowel information within and outside vowel boundaries, as done with human listeners by Strange et al. and by Furui (1986), but here by performing automatic ML classification of 9 vowels coarticulated in three left- and three right-consonant contexts. Evaluate 8 acoustic patterns (a code sketch of three of them follows this list):

1. Spectral feature vector at the center of the vowel in CV or VC biphones.
2. Spectral feature vector at 20 ms after vowel onset in CV biphones, or 20 ms before vowel offset in VC biphones.
3. Spectral feature vector at the CV or VC transition position.
4. A vector containing the overall slope of each spectral feature, computed on a 40 ms interval centered at the CV or VC transition position.
5. A vector containing the slopes of each spectral feature computed on 20 ms intervals on the left and right sides of the given CV or VC transition position. This vector can discriminate between monotonic and non-monotonic spectral transitions between phonemes (Dusan, 2005).
6. Spectral feature vector at the center of the preceding consonant in CV biphones, or of the following consonant in VC biphones.
7. A vector containing the vowel and consonant durations in CV or VC biphones. This vector accounts for both the intrinsic duration of vowels and the vowel durational effect due to coarticulation with consonants.
8. Spectral feature vector at the beginning of the consonant in CV biphones, or at the end of the consonant in VC biphones.
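As one possible reading of patterns 3-5 above, the sketch below extracts the transition-frame vector, the overall 40 ms slope, and the left/right double slope from a precomputed matrix of static MFCC vectors (one row per 10 ms frame). The matrix `mfcc`, the transition frame index `t`, and the simple endpoint-difference slope estimate are illustrative assumptions; the original study may have used regression-based slopes.

```python
import numpy as np

FRAME_STEP_MS = 10  # assumed analysis frame step (10 ms, as used elsewhere in the slides)

def pattern_at_transition(mfcc, t):
    """Pattern 3: static spectral feature vector at the CV/VC transition frame t."""
    return mfcc[t]

def overall_slope(mfcc, t, span_ms=40):
    """Pattern 4: per-coefficient slope over a 40 ms window centered at the transition."""
    half = span_ms // (2 * FRAME_STEP_MS)               # frames on each side of t
    return (mfcc[t + half] - mfcc[t - half]) / span_ms  # coefficient change per ms

def double_slope(mfcc, t, span_ms=20):
    """Pattern 5: separate 20 ms slopes to the left and right of the transition,
    which can distinguish monotonic from non-monotonic spectral transitions."""
    k = span_ms // FRAME_STEP_MS
    left = (mfcc[t] - mfcc[t - k]) / span_ms
    right = (mfcc[t + k] - mfcc[t]) / span_ms
    return np.concatenate([left, right])
```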

METHODS

B. Investigate the relation between the perceptual critical points for consonant and syllable identification (Furui, 1986) and the phone boundaries by analyzing the temporal correlation between the maximum spectral transition positions and the phone boundaries:

- Compute the spectral transition measure (STM) from the dynamic (delta) MFCC features. The dynamic features are computed using the first 10 static MFCC features (excluding the energy). A code sketch follows this slide.
- Find the peaks of the STM for the training part of TIMIT, which contains 172,460 between-phone boundaries.
- Compute the deviation between the positions of the peaks and the phone boundaries.
- Quantize this deviation into bins of 0, 10, 20, 30, and 40 ms.
- If the STM peaks are in close proximity to phone boundaries, then the perceptual critical points are also in close proximity to phone boundaries, which could have implications for ASR.
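Below is a minimal sketch, not the original implementation, of the STM and peak-picking procedure described above. The file name, the exact delta computation (here librosa's default local regression), the way per-coefficient deltas are combined into a single STM value per frame, and the peak-picking threshold are all assumptions.

```python
import numpy as np
import librosa
from scipy.signal import find_peaks

# Hypothetical input file; the 10 ms frame step matches the one stated in the slides.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=11, hop_length=int(0.010 * sr))

static = mfcc[1:11]                    # drop c0 (energy), keep the first 10 static MFCCs
delta = librosa.feature.delta(static)  # dynamic (delta) features per coefficient

# Combine the per-coefficient deltas into one transition value per frame.
# Using the mean of squared deltas here is an assumption about the exact STM formula.
stm = np.mean(delta ** 2, axis=0)

# Candidate phone boundaries are the local maxima of the STM contour.
peaks, _ = find_peaks(stm, height=0.5 * np.mean(stm))  # threshold is an assumption
boundary_times_ms = peaks * 10                         # frame index -> ms at a 10 ms step
```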

DISTRIBUTION OF INFORMATION

Figure 1. Vowel classification scores in left- and right-consonant contexts for the 8 patterns.

DISTRIBUTION OF INFORMATION

Figure 2. Vowel classification scores for the static MFCC patterns in left- and right-consonant contexts.

DISTRIBUTION OF INFORMATION

Figure 3. Vowel classification scores in left-consonant contexts for combinations of all 8 patterns (improvement of 5.8%, ~38% relative error reduction).

DISTRIBUTION OF INFORMATION

Figure 4. Vowel classification scores in right-consonant contexts for combinations of all 8 patterns (improvement of 6.9%, ~37% relative error reduction).

STM PEAKS AND PHONE BOUNDARIES

Figure 5. Example 1: (a) speech with manual phone boundaries, (b) STM with automatically detected phone boundaries, (c) STM and missed boundaries, (d) STM and inserted boundaries. Frame step = 10 ms.

STM PEAKS AND PHONE BOUNDARIES

Table 1. Results of the automatic phone boundary detection based on the STM function. Approximately 85% of the manually located phone boundaries are detected automatically. Frame step = 10 ms.

|         | Total Boundaries (Manual) | Detected Boundaries (Automatic) | Missed Boundaries (Automatic) | Inserted Boundaries (Automatic) |
|---------|---------------------------|---------------------------------|-------------------------------|---------------------------------|
| Count   | 172,460                   | 145,950                         | 26,510                        | 48,566                          |
| Percent | 100%                      | 84.6%                           | 15.4%                         | 28.2%                           |
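Counts like those in Table 1 can be obtained by matching each manually located boundary to the nearest unused STM peak; unmatched manual boundaries are misses and unmatched peaks are insertions. The sketch below shows one plausible matching rule; the 40 ms tolerance and the greedy nearest-neighbour assignment are assumptions and may differ from the procedure actually used.

```python
import numpy as np

def match_boundaries(manual_ms, detected_ms, tol_ms=40):
    """Greedy nearest-neighbour matching of detected STM peaks to manual boundaries.
    Returns (detected, missed, inserted, per-boundary absolute deviations in ms)."""
    manual = np.sort(np.asarray(manual_ms, dtype=float))
    detected = np.sort(np.asarray(detected_ms, dtype=float))
    used = np.zeros(len(detected), dtype=bool)
    deviations = []
    for m in manual:
        if len(detected) == 0:
            break
        dist = np.abs(detected - m)
        dist[used] = np.inf               # each peak may match at most one boundary
        j = int(np.argmin(dist))
        if dist[j] <= tol_ms:
            used[j] = True
            deviations.append(float(dist[j]))
    n_detected = len(deviations)
    n_missed = len(manual) - n_detected
    n_inserted = int((~used).sum())       # peaks with no matching manual boundary
    return n_detected, n_missed, n_inserted, deviations
```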

STM PEAKS AND PHONE BOUNDARIES

Figure 6. Normalized histogram showing the absolute deviation between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located boundaries. Frame step = 10 ms.

STM PEAKS AND PHONE BOUNDARIES

Figure 7. Normalized histogram showing the absolute deviation between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located boundaries. Frame step = 10 ms.

STM PEAKS AND PHONE BOUNDARIES

An analysis of the time difference between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located phone boundaries is shown in Table 2.

Table 2. Percentage of the 145,950 detected boundaries that fall within various time spans of the manually located phone boundaries.

|            | Within 10 ms | Within 20 ms | Within 30 ms | Within 40 ms |
|------------|--------------|--------------|--------------|--------------|
| Percentage | 70%          | 89%          | 95%          | 97%          |
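Given the per-boundary deviations from a matching step such as the one sketched after Table 1, the "within X ms" percentages in Table 2 reduce to cumulative proportions. A minimal sketch follows; the function and variable names are illustrative.

```python
import numpy as np

def within_span_percentages(deviations_ms, spans_ms=(10, 20, 30, 40)):
    """Percentage of matched boundaries whose absolute deviation is within each span."""
    dev = np.asarray(deviations_ms, dtype=float)
    return {span: 100.0 * float(np.mean(dev <= span)) for span in spans_ms}

# Applied to the deviations of the 145,950 matched boundaries, this would yield
# the values reported in Table 2 (70%, 89%, 95%, 97%).
```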

CONCLUSIONS

- The new acoustic patterns (two located outside the vowel boundaries, the double-slope dynamic pattern at the boundary, and the durational pattern) contain significant vowel information.
- There is more information about the vowel identity at the beginning (in CV) or end (in VC) of the adjacent consonants than at the center of these consonants.
- The STM peaks are in close proximity to phone boundaries: 27% within 0 ms, 70% within 10 ms, 89% within 20 ms, 95% within 30 ms, and 97% within 40 ms of the manually located phone boundaries.
- The current study complements Furui's perceptual study and shows that the phone boundaries are in close proximity to the maximum spectral transition positions and thus to the perceptual critical points.