Automated Detection of Speech Landmarks Using Gaussian Mixture Modeling

Automated Detection of Speech Landmarks Using Gaussian Mixture Modeling
A. R. Jayan, P. C. Pandey
{arjayan, pcpandey}@ee.iitb.ac.in
EE Dept., IIT Bombay
February 2008

A. R. Jayan and P. C. Pandey, "Automated detection of speech landmarks using Gaussian mixture modeling," Frontiers of Research on Speech and Music (FRSM-08), Feb. 20-21, 2008, Jadavpur University, Kolkata, India.

Abstract: Landmarks in the speech signal are regions with abrupt spectral variations. Automated detection of these regions is important for several applications in speech processing. The performance of landmark detection using parameters extracted from predefined spectral bands is generally limited by speaker-related spectral variability. This paper presents a landmark detection technique that adapts to the acoustic properties of speech. Parameters are extracted from a Gaussian mixture model (GMM) of the smoothed spectral envelope, and a single rate-of-rise function, obtained from the set of GMM parameters, is used for locating landmark regions. The method was evaluated using manually labeled VCV syllables and sentences. It detected 85% of stop release bursts in VCV syllables and 82% of those in sentences within 5 ms of the manually located landmarks.

Address: SPI Lab, EE Dept., IIT Bombay, Powai, Mumbai 400 076, India
Web: http://www.ee.iitb.ac.in/~spilab
E-mail: {arjayan, pcpandey}@ee.iitb.ac.in

PRESENTATION OUTLINE
1. Introduction
2. Gaussian Mixture Modeling (GMM)
3. Experimental results
4. Summary and conclusion

1. INTRODUCTION

Speech landmarks
▪ Regions containing important information for speech perception
▪ Associated with spectral transitions

Landmark types
1. Abrupt-consonantal (AC): tight constrictions of primary articulators
2. Abrupt (A): fast glottal or velum activity
3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V): vowel landmarks

Relative frequency of occurrence: abrupt ~68%, vocalic ~29%, non-abrupt ~3%

Applications of landmark detection
▪ Feature extraction for supporting speech recognition
▪ Intelligibility enhancement

[Figure: example of landmarks in a speech signal]

Earlier studies on automated landmark detection

▪ Schutte and Glass, 2005
  - Mel-frequency cepstral coefficients, support vector machines (SVMs)
  - Application: extraction of features for speech recognition
▪ Sainath and Hazen, 2006
  - Sinusoidal model, short-time energy, signal harmonicity
▪ Liu, 1996
  - 512-point DFT on 6 ms frames, frame shift 1 ms
  - 20-point moving average along time to get smooth parameter tracks
  - First difference of the maximum spectral component in 6 spectral bands
  - Application: extraction of features for speech recognition

[Table: detection time (ms) and detection rate (%) reported in earlier studies]

Factors limiting detection rate and temporal resolution

▪ Effectiveness of parameters in capturing acoustic variations
  - Short-time energy variation in spectral bands: weak bursts may not get detected
  - Centroid frequency: not well defined during low-energy segments
  - Fixed band boundaries: may not adapt to speech variability
▪ Temporal smoothing of parameter tracks
  - Time resolution affected
▪ Detection operation
  - First-difference operation not optimized for all types of landmarks
  - Time step of 10 ms may be too large for burst detection
▪ Effect of noise on parameters
  - Cepstral features: sensitive to noise
  - Band energy or spectral peaks: not much affected
  - Band centroids: sensitive to noise

Need for high temporal resolution and detection rate

▪ Application dependent
  - Speech recognition: analysis is performed around landmarks for parameter extraction; landmarks must be detected with high accuracy but only moderate temporal resolution (20-30 ms)
  - Intelligibility enhancement: landmark regions are modified, so detection needs good temporal resolution (0-5 ms); there is some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
▪ Landmark type
  - Short-duration events (bursts) need high time resolution
  - Voicing onsets/offsets may not require high resolution, as signal properties remain the same for a long duration

Enhancing landmark regions

Improvement in intelligibility of conversational speech by incorporating properties of clear speech:
▪ Consonant-vowel intensity ratio (CVR) enhancement: increasing the energy of the consonant segment (see the sketch after this list)
▪ Consonant duration enhancement: lengthening CV and VC transitions (burst duration, VOT, formant transitions)

Challenges
▪ Accurate detection of the regions to be modified
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without increasing the overall speaking rate: transition regions are lengthened with a corresponding shortening of steady-state segments
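
The sketch below illustrates the CVR enhancement idea from the list above: a detected consonant segment is boosted by a fixed gain, with short crossfades at its edges to avoid clicks. The segment indices, the 4 dB gain (borrowed from the Colotte & Laprie study cited below), and the fade length are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def enhance_cvr(signal, c_start, c_end, gain_db=4.0, fade=32):
    """Hedged sketch of consonant-vowel intensity ratio (CVR) enhancement.

    signal         : 1-D float array of speech samples
    c_start, c_end : sample indices of the detected consonant segment
                     (e.g., from a burst landmark detector); the segment
                     is assumed to be longer than 2 * fade samples
    gain_db        : boost applied to the consonant segment (assumed value)
    fade           : crossfade length in samples (an implementation choice)
    """
    out = signal.astype(float).copy()
    gain = 10.0 ** (gain_db / 20.0)

    # Smooth gain contour: ramp up, hold, ramp down.
    g = np.full(c_end - c_start, gain)
    ramp = np.linspace(1.0, gain, fade)
    g[:fade] = ramp
    g[-fade:] = ramp[::-1]

    out[c_start:c_end] *= g
    return out
```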

Earlier studies on intelligibility enhancement

▪ Colotte & Laprie, 2000
  - Identifying regions based on mel-cepstral analysis
  - Stops and unvoiced fricatives amplified by +4 dB
  - Transition segments time-scaled by 1.8-2.0 (TD-PSOLA)
▪ Skowronski & Harris, 2006
  - Spectral transition measure based voiced/unvoiced classification
  - Energy redistribution in voiced/unvoiced segments (ERVU)
  - Amplifying low-energy regions critical to intelligibility
▪ Jayan & Pandey, 2007
  - Variation of maximum energy and centroid in 5 spectral bands
  - VC and CV transition segments expanded, steady-state segments compressed → less temporal masking by the nearby vowel
  - Intensity scaling of transition segments
  - Overall speech duration kept unaltered

Fixed spectral band based landmark detection

▪ Spectrum divided into five non-overlapping bands: 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz
▪ Sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames, frame shift 1 ms
▪ Peak spectral component and band centroid computed in each band every 1 ms (related to formant peaks and formant frequencies)

Parameters (see the sketch below)
▪ Peak energy
▪ Centroid frequency
▪ Rate-of-rise functions
▪ Transition index
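
A minimal sketch of the fixed-band parameter extraction above: for one analysis frame, the peak spectral component and the centroid frequency of each of the five bands are computed from the FFT power spectrum. The band edges and frame settings follow the slide; the function and variable names are illustrative.

```python
import numpy as np

# Band edges in Hz, as given on the slide.
BANDS = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_parameters(frame, fs=10000, nfft=512):
    """Peak energy and centroid frequency per fixed spectral band.

    frame : one 6 ms analysis frame (60 samples at 10 kHz),
            zero-padded to nfft by the FFT call below
    Returns (peaks, centroids), each a list with one entry per band.
    """
    spec = np.abs(np.fft.rfft(frame, nfft)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

    peaks, centroids = [], []
    for lo, hi in BANDS:
        sel = (freqs >= lo) & (freqs < hi)
        band = spec[sel]
        peaks.append(band.max())                        # peak spectral component
        centroids.append((freqs[sel] * band).sum() / (band.sum() + 1e-12))
    return peaks, centroids
```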

Limitations
Only 60% of release bursts in VCV syllables were detected within 5 ms of manual labels.

Possible reasons
▪ Poor approximation of formant peaks and frequencies by the maximum energy and centroid in spectral bands with fixed boundaries
▪ Temporal smoothing performed on parameter tracks

Gaussian Mixture Modeling (GMM)
▪ Provides a parametric representation of the smoothed spectrum
▪ Can be used to extract formant-like features: Gaussian mean → formant frequency, amplitude → formant peak, variance → formant bandwidth
▪ Abrupt spectral variations result in abrupt variations in the Gaussian parameters
▪ Parameter extraction by smoothing in the spectral domain, with no smoothing in the temporal domain → improved temporal resolution

Earlier studies on Gaussian modeling of speech spectra

▪ Zolfaghari & Robinson, 1996
  - Cepstrally smoothed speech spectrum modeled by GMMs
  - Formant analysis, formant vocoder
  - Formant tracks followed those of an LPC-based tracker, with higher formant bandwidths
▪ Stuttle & Gales, 2002
  - Low-pass filtered speech spectrum modeled by GMMs
  - GMM features used alongside MFCC features in speech recognition
  - GMM parameters found effective in noisy environments
▪ Omar et al., 2001
  - Used a Gaussian model of phonetic boundaries
  - Improvement in phoneme recognition accuracy
▪ Lindblom & Samuelsson, 2003
  - Bounded-support expectation maximization (EMBS) algorithm for modeling speech source spectra

Objective
Automated detection of landmarks for stop consonants with high temporal resolution, using Gaussian mixture modeling of speech spectra and landmark detection based on the GMM parameters.

2. GAUSSIAN MIXTURE MODELING

▪ Speech signal sampled at 10 k samples/s
▪ 512-point DFT on 6 ms frames, frame shift = 1 ms
▪ Spectral smoothing by low-pass filtering the spectral envelope; filter impulse response: 20-point raised-cosine window
▪ GMM approximation of the smoothed spectral envelope; parameter extraction by the expectation-maximization (EM) algorithm (a sketch follows below)

Initialization
▪ Means: equal spacing along the frequency-bin index k
▪ Equal mixture weights = 1/M
▪ Equal standard deviations = N/(2M)
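
A minimal sketch of the EM fit described above, assuming the smoothed spectral envelope of one frame is treated as an unnormalized density over the frequency-bin index k (so that Gaussian means, amplitudes, and variances track formant frequencies, peaks, and bandwidths). The initialization follows the slide; the rest is a standard 1-D EM recursion, not the paper's exact code.

```python
import numpy as np

def fit_spectral_gmm(envelope, M=4, n_iter=50):
    """Fit an M-component 1-D GMM to a smoothed spectral envelope by EM.

    envelope : magnitude spectral envelope of one frame (N bins),
               treated as an unnormalized density over bin index k
    Returns (weights, means, std devs), one entry per component.
    """
    N = len(envelope)
    k = np.arange(N, dtype=float)
    p = envelope / (envelope.sum() + 1e-12)   # normalize to a density over bins

    # Initialization as on the slide: equally spaced means,
    # equal weights 1/M, equal standard deviations N/(2M).
    mu = (np.arange(M) + 0.5) * N / M
    w = np.full(M, 1.0 / M)
    sigma = np.full(M, N / (2.0 * M))

    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each frequency bin.
        g = np.exp(-0.5 * ((k[None, :] - mu[:, None]) / sigma[:, None]) ** 2)
        g /= sigma[:, None] * np.sqrt(2 * np.pi)
        r = w[:, None] * g
        r /= r.sum(axis=0, keepdims=True) + 1e-12

        # M-step: updates weighted by the spectral density p(k).
        mass = (r * p[None, :]).sum(axis=1)
        w = mass
        mu = (r * p[None, :] * k[None, :]).sum(axis=1) / (mass + 1e-12)
        var = (r * p[None, :] * (k[None, :] - mu[:, None]) ** 2).sum(axis=1) / (mass + 1e-12)
        sigma = np.sqrt(np.maximum(var, 1e-6))

    return w, mu, sigma
```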

Gaussian parameters
▪ Gaussian amplitudes consistent during vowel, consonant, and silence segments
▪ Means and variances not well defined during low-energy segments
▪ Parameter tracks therefore derived using the Gaussian amplitudes

Detection of burst landmarks
▪ Rate-of-rise (ROR) function derived from the Gaussian amplitudes, excluding that of the first Gaussian
▪ Normalized to the 0-1 range, 10-point median filtering
▪ Square-root operation to make the ROR more sensitive to burst onsets (see the sketch below)
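
A hedged sketch of the ROR computation: the slides specify the ingredients (amplitudes of all Gaussians except the first, 0-1 normalization, 10-point median filtering, square root) but not the exact rate-of-rise operator, so it is assumed here to be a first difference of the summed amplitudes over a few-frame step.

```python
import numpy as np
from scipy.signal import medfilt

def burst_ror(amps, step=5):
    """Hedged sketch of the burst ROR track.

    amps : (T, M) array of Gaussian amplitudes (mixture weights),
           one row per 1 ms frame
    step : time step in frames for the rate-of-rise difference
           (an assumed value; the slide does not specify it)
    """
    s = amps[:, 1:].sum(axis=1)            # exclude the first Gaussian
    ror = np.zeros_like(s)
    ror[step:] = s[step:] - s[:-step]      # rise over `step` frames
    ror = np.clip(ror, 0, None)
    ror /= ror.max() + 1e-12               # normalize to the 0-1 range
    ror = medfilt(ror, kernel_size=11)     # ~10-point median filtering (odd kernel)
    return np.sqrt(ror)                    # emphasize weak burst onsets
```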

3. RESULTS AND DISCUSSION

▪ Number of Gaussians for modeling decided by computing the normalized mean squared error e(n) between the smoothed spectrum and the Gaussian-modeled spectrum (see the sketch below)
▪ e(n) computed for vowels /a/, /i/, /u/ and fricatives /v/, /z/, /f/, /s/
▪ Voiced sounds modeled more accurately
▪ Not much improvement from increasing the number of Gaussians above 3
▪ Selected M = 4, to model 4 significant vocal tract resonances

Normalized mean squared error vs. number of Gaussian components:

Phoneme    1      2      3      4      5
/a/        0.22   0.08   0.06   0.05   0.04
/i/        0.45   -      -      -      -
/u/        0.35   0.12   0.07   -      -
/v/        0.18   0.03   -      -      -
/z/        0.49   0.10   0.01   -      -
/f/        0.43   0.28   0.20   0.19   -
/s/        0.77   0.16   0.13   0.11   -
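
A sketch of the model-order experiment above, reusing fit_spectral_gmm() from the earlier sketch. The exact error normalization is not given on the slide; e = Σ(S − S_gmm)² / ΣS² over unit-area spectra is an assumption.

```python
import numpy as np

def normalized_mse(envelope, M_values=(1, 2, 3, 4, 5), n_iter=50):
    """Normalized modeling error vs. number of Gaussian components.

    envelope : smoothed spectral envelope of one frame (N bins)
    Returns {M: error}, where the error definition is assumed to be
    e = sum((S - S_gmm)**2) / sum(S**2) on unit-area spectra.
    """
    N = len(envelope)
    k = np.arange(N, dtype=float)
    S = envelope / (envelope.sum() + 1e-12)

    errors = {}
    for M in M_values:
        w, mu, sigma = fit_spectral_gmm(envelope, M=M, n_iter=n_iter)
        # Reconstruct the modeled envelope as the mixture density over bins.
        S_gmm = sum(
            wi * np.exp(-0.5 * ((k - mi) / si) ** 2) / (si * np.sqrt(2 * np.pi))
            for wi, mi, si in zip(w, mu, sigma)
        )
        errors[M] = np.sum((S - S_gmm) ** 2) / np.sum(S ** 2)
    return errors
```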

Evaluation using VCV syllables
Test material: 3 vowels /a/, /i/, /u/ × 6 stops /b/, /d/, /g/, /p/, /t/, /k/ × 6 speakers (3 male, 3 female) → 108 manually labeled tokens.

[Figure panels: signal, spectrogram, GMM spectrogram, Gaussian amplitude, mean, and variance tracks, ROR]

Evaluation using sentences
Test material: 15 Marathi sentences, 1 speaker, with 98 manually labeled tokens. Example: 'kamal, kithi kam karthe?'

[Figure panels: signal, Gaussian amplitude, mean, and variance tracks, ROR with detected burst landmarks]

Comparison of results
M1: GMM-based method. M2: maximum spectral component and centroid in spectral bands with fixed boundaries.

[Plots: detection rate vs. temporal resolution, for VCV syllables and for sentences]

Observations
▪ M1 outperforms M2 in detection rate for temporal resolution < 10 ms
▪ M2 detects more landmarks than M1, but with lower temporal resolution

4. SUMMARY & CONCLUSION

Landmark detection using Gaussian parameters was investigated:
▪ Good temporal resolution compared to parameters extracted from spectral bands with fixed boundaries
▪ Most of the landmarks detected within 10 ms of manual labels

Future work:
▪ Evaluation of the landmark detection method in the presence of noise
▪ The method is more computation-intensive; real-time landmark detection needs further investigation