IIT Bombay, 14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India

Presentation transcript:

Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, EE Dept, IIT Bombay
3rd February, 2008

PRESENTATION OUTLINE
1. Introduction
   • Acoustic properties of clear speech
   • Landmark detection
   • Need for high time resolution
2. Automated landmark detection with high resolution
   • Pass 1
   • Pass 2
3. Experimental results
4. Summary and conclusion

1. INTRODUCTION
Acoustic properties of clear speech
Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
Examples (conversational vs. clear): ‘the book tells a story’, ‘the boy forgot his book’
Intelligibility of clear speech
▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners and listening conditions

Acoustic differences between clear and conversational speech
• Sentence level
  ▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
  ▪ Larger variation in fundamental frequency
  ▪ More pauses, with longer pause durations
• Word level
  ▪ Fewer sound deletions
  ▪ More sound insertions
• Phonetic level
  ▪ Context-dependent, non-linear increase in segment durations
  ▪ More targeted vowel formants
  ▪ Increase in consonant intensity

Improvement in intelligibility of conversational speech by incorporating properties of clear speech
• Consonant-vowel intensity ratio (CVR) enhancement: increasing the energy of the consonant segment
• Consonant duration enhancement: lengthening CV and VC transitions (burst duration, VOT, formant transition)
Challenges
• Accurate detection of regions for modification
• Analysis-modification-synthesis with low processing artifacts
• Processing without altering the overall speaking rate: lengthening of transition regions with a corresponding decrease in steady-state segments
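As a rough illustration of the CVR measure mentioned on this slide, the sketch below computes the ratio of consonant RMS to vowel RMS in dB. The slides do not state how CVR was computed, so the definition used here, the function names `rms` and `cvr_db`, and the toy segments are all assumptions for illustration.

```python
import math

def rms(segment):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in segment) / len(segment))

def cvr_db(consonant, vowel):
    """Consonant-vowel intensity ratio in dB.

    Assumed definition: 20*log10 of the RMS ratio (not taken from the slides).
    """
    return 20 * math.log10(rms(consonant) / rms(vowel))

# Toy segments (hypothetical): a weak consonant next to a stronger vowel.
consonant = [0.1, -0.1, 0.1, -0.1]
vowel = [1.0, -1.0, 1.0, -1.0]

# A negative value means the consonant is weaker than the vowel.
print(round(cvr_db(consonant, vowel), 1))
```

CVR enhancement in this sense would raise the value toward 0 dB by amplifying only the consonant segment.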

Intelligibility enhancement using properties of clear speech
Hazan & Simpson, 1998
• Manually labeled VCVs and sentences
• Intensity modification: stop burst +12 dB, frication +6 dB, nasal +6 dB
• Spectral modification by filtering
Colotte & Laprie, 2000
• Automated method for identifying regions based on mel-cepstral analysis
• Stops and unvoiced fricatives amplified by +4 dB
• Transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
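To make the dB figures on this slide concrete, the sketch below converts a level change in dB to an amplitude scale factor and applies it to a marked region of a signal. The helper names `db_to_gain` and `amplify` are hypothetical; the cited studies may have applied their gains differently (for example with smoothing at the region boundaries).

```python
def db_to_gain(db):
    """Amplitude gain factor for a level change given in dB."""
    return 10 ** (db / 20)

def amplify(samples, start, end, db):
    """Return a copy with samples[start:end] scaled by the given dB gain."""
    g = db_to_gain(db)
    return [s * g if start <= i < end else s for i, s in enumerate(samples)]

# Gains quoted on the slide, expressed as amplitude factors.
for label, db in [("stop burst", 12), ("frication", 6), ("nasal", 6),
                  ("stops/unvoiced fricatives (Colotte & Laprie)", 4)]:
    print(f"{label}: +{db} dB -> x{db_to_gain(db):.2f}")
```

A gain of +6 dB roughly doubles the amplitude and +12 dB roughly quadruples it, which is why accurate region boundaries matter: a misplaced boundary amplifies the wrong samples by a large factor.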

Landmark detection
Speech landmarks
• Regions containing important information for speech perception
• Associated with spectral transitions
Landmark types
1. Abrupt-consonantal (AC): tight constrictions of primary articulators
2. Abrupt (A): fast glottal or velum activity
3. Non-abrupt (N): semivowel landmarks, less vocal tract constriction
4. Vocalic (V): vowel landmarks
Proportion of landmarks by type
• Abrupt (~68%)
• Vocalic (~29%)
• Non-abrupt (~3%)

[Figure: Landmarks]

Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0-0.4, , , , , kHz
▪ Parameter: first difference of the maximum (log) energy in each spectral band; time-step = 50 ms at the coarse level, 26 ms at the fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal landmarks, sonorant closures and releases, and stop closures and releases
Application: extraction of features for supporting speech recognition

[Figure: detection rate vs. temporal resolution; values shown: 73 %, 83 %, 88 %, 44 %]
Uses the same processing for all types of landmarks

Niyogi & Sondhi, 2002
• For stop consonants
• Total energy and energy above 3 kHz, in log scale
• Measure of spectral flatness
• Non-linear operator optimized for burst detection
Salomon et al., 2002
• Hilbert transform based envelope to extract temporal parameters
• Spectral information
• Adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
• Wavelet transform based decomposition
• Energy variations in 6 bands

Need for high temporal resolution and detection rate
• Application dependent
• Speech recognition: analysis is performed around landmarks for parameter extraction
  ▪ high accuracy
  ▪ moderate temporal resolution (20-30 ms)
• Intelligibility enhancement: modify landmark regions
  ▪ high temporal resolution (< 5 ms)
  ▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
• Landmark type
  ▪ short-duration events (bursts) need high time resolution
  ▪ voicing onsets/offsets may not require this much resolution, as signal properties remain the same for a long duration

Factors limiting detection rate and temporal resolution
▪ Effectiveness of parameters in capturing acoustic variations
  ▪ short-time energy variation in spectral bands: weak bursts may not get detected
  ▪ centroid frequency: not well defined during low-energy segments
  ▪ fixed band boundaries may not adapt to speech variability
▪ Smoothing performed during parameter extraction
  ▪ temporal smoothing of the spectrum affects time resolution
▪ Type of distance measure
  ▪ first-difference operation is not optimized for all types of landmarks
  ▪ a time-step of 10 ms is too large for burst detection
▪ Effect of noise on parameters

• Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
• Separate detectors are required for each phonetic class
• Each detector must use the method most suited to its phonetic event
Objective
Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement

2. AUTOMATED LANDMARK DETECTION

Landmark detection using spectral peaks and centroids
Pass 1
• Spectrum divided into five non-overlapping bands
  ▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
  ▪ sampling frequency: 10 k samples/s
  ▪ 512-point FFT on 6 ms frames
  ▪ frame shift: 1 ms (one frame every 1 ms)
Parameters
  ▪ maximum energy in each spectral band, every 1 ms
  ▪ band centroids estimated in each band, every 1 ms
  ▪ features similar to formant peaks and formant frequencies
  ▪ can be estimated easily
  ▪ not much affected by noise
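The Pass 1 parameters can be sketched for a single frame as follows. The band edges, sampling rate, FFT size, and frame length come from the slide; the Hamming window, the energy-weighted definition of the centroid, and the names `power_spectrum` and `band_features` are assumptions, since the slides do not give these details. A naive DFT is used only to keep the sketch free of dependencies.

```python
import cmath
import math

FS = 10_000        # sampling frequency in samples/s (from the slide)
NFFT = 512         # FFT size (from the slide)
FRAME_LEN = 60     # 6 ms frame at 10 k samples/s
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def power_spectrum(frame):
    """One-sided power spectrum of a Hamming-windowed, zero-padded frame."""
    n = len(frame)
    # Hamming window: an assumption, the slides do not name the window.
    win = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(frame, win)]
    spec = []
    for k in range(NFFT // 2 + 1):
        acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / NFFT) for i in range(n))
        spec.append(abs(acc) ** 2)
    return spec

def band_features(frame):
    """Return [(peak_energy_dB, centroid_Hz), ...], one pair per band."""
    spec = power_spectrum(frame)
    df = FS / NFFT                      # frequency spacing of the FFT bins
    feats = []
    for lo, hi in BANDS_HZ:
        bins = [k for k in range(len(spec)) if lo <= k * df < hi]
        energy = [spec[k] for k in bins]
        peak_db = 10 * math.log10(max(energy) + 1e-12)
        total = sum(energy)
        # Assumed centroid: energy-weighted mean frequency within the band.
        centroid = (sum(k * df * e for k, e in zip(bins, energy)) / total
                    if total > 0 else (lo + hi) / 2)
        feats.append((peak_db, centroid))
    return feats

# Usage: a 1 kHz tone should peak in the 0.4-1.2 kHz band.
frame = [math.sin(2 * math.pi * 1000 * i / FS) for i in range(FRAME_LEN)]
for (lo, hi), (peak, cen) in zip(BANDS_HZ, band_features(frame)):
    print(f"{lo}-{hi} Hz: peak {peak:.1f} dB, centroid {cen:.0f} Hz")
```

Running `band_features` on frames shifted by 10 samples (1 ms) would give the per-band peak and centroid contours at the 1 ms rate stated on the slide.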

• Peak energy
• Centroid frequency
• Rate-of-rise (ROR) functions
• Transition index
  ▪ tracks simultaneous variation of energy and centroid
  ▪ centroids given less weighting in low-energy areas

Example: /uka/
[Figure: peak and centroid contours, one panel per band]

Example: /uka/
[Figure: peak and centroid ROR contours, one panel per band; time step = 26 ms]

Example: /uka/
Transition index derived from RORs with time step = 26 ms

Example: /uka/
Transition index derived from RORs with time step = 4 ms
Less sensitive to slow transitions

Problems
Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ also detects slowly varying events (higher detection rate)
Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions
Pass 2: analyze landmarks detected in Pass 1 with a small time-step
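The trade-off between large and small time steps can be reproduced on synthetic contours. The sketch below assumes a simple backward-difference rate-of-rise, ROR(n) = x(n) − x(n − step), on a parameter track sampled every 1 ms; the exact ROR definition used in the slides is not preserved, and the two test contours are made up for illustration.

```python
def ror(track, step):
    """Rate of rise: track[n] - track[n - step] (one value per 1 ms frame)."""
    return [track[n] - track[n - step] for n in range(step, len(track))]

# Hypothetical log-energy contours in dB, one value per 1 ms frame.
abrupt = [0.0] * 50 + [20.0] * 50                                    # 20 dB jump
slow = [0.0] * 50 + [20.0 * k / 40 for k in range(40)] + [20.0] * 10  # 20 dB over 40 ms

for step in (26, 4):
    r_abrupt, r_slow = ror(abrupt, step), ror(slow, step)
    # Number of frames over which the abrupt event holds its maximum ROR:
    # a wide plateau means the event is poorly localized in time.
    width = sum(1 for v in r_abrupt if v == max(r_abrupt))
    print(f"step {step} ms: abrupt peak {max(r_abrupt)} dB, "
          f"slow peak {max(r_slow)} dB, plateau {width} frames")
```

With the 26 ms step both events give a large ROR, but the abrupt jump is smeared over a plateau as wide as the step. With the 4 ms step the jump is sharply localized while the slow transition nearly disappears, which is the motivation for the two-pass scheme.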

Improving temporal resolution: Pass 2
▪ 40 ms window centered around burst landmarks detected in Pass 1
▪ decomposed to 6 levels by the discrete Meyer wavelet
▪ detail (high-frequency) contents in the lower two levels used for localizing bursts
Parameters
▪ short-time energy variation
▪ zero-crossing rate
Compute normalized RORs with a time-step of 3 ms
Get a new transition index T_ez(n)
Relocate the landmark to the location corresponding to the peak in T_ez(n)
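A minimal sketch of the relocation idea in Pass 2 follows. It is not the authors' method: the wavelet decomposition is omitted, and because the defining equation for the transition index did not survive in this transcript, the index here is assumed to be the sum of the two normalized RORs. The names `log_energy`, `zcr`, `ror`, and `relocate`, and the synthetic test signal, are hypothetical.

```python
import math

FS = 10_000                  # samples/s, as in Pass 1
HALF_WIN = int(0.020 * FS)   # 40 ms search window -> +/- 20 ms
STEP = int(0.003 * FS)       # 3 ms time-step for the RORs

def log_energy(seg):
    """Short-time energy of a segment in dB (small floor avoids log of zero)."""
    return 10 * math.log10(sum(s * s for s in seg) / len(seg) + 1e-10)

def zcr(seg):
    """Zero-crossing rate: sign changes per sample in the segment."""
    return sum(1 for a, b in zip(seg, seg[1:]) if a * b < 0) / len(seg)

def ror(x, n, feature):
    """Feature over the STEP samples after n minus the STEP samples before n."""
    return feature(x[n:n + STEP]) - feature(x[n - STEP:n])

def relocate(x, coarse):
    """Move a coarse landmark (sample index) to the peak of the assumed index."""
    lo = max(STEP, coarse - HALF_WIN)
    hi = min(len(x) - STEP, coarse + HALF_WIN)
    idx = list(range(lo, hi + 1))
    e = [ror(x, n, log_energy) for n in idx]
    z = [ror(x, n, zcr) for n in idx]
    e_max = max(abs(v) for v in e) or 1.0
    z_max = max(abs(v) for v in z) or 1.0
    # Assumed combination rule: sum of the normalized energy and ZCR RORs.
    t = [ev / e_max + zv / z_max for ev, zv in zip(e, z)]
    return idx[t.index(max(t))]

# Synthetic "burst": silence, then a high-frequency signal starting at sample 500.
x = [0.0] * 500 + [(-1.0) ** i for i in range(500)]
coarse = 580                 # hypothetical Pass 1 estimate, 8 ms late
fine = relocate(x, coarse)
print(f"coarse error {abs(coarse - 500) / FS * 1000:.1f} ms, "
      f"fine error {abs(fine - 500) / FS * 1000:.1f} ms")
```

The search is confined to the 40 ms window, so a Pass 1 error larger than 20 ms cannot be corrected by this step.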

[Figures: relocating stop landmarks (three slides)]

3. EXPERIMENTAL RESULTS
Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ Total 54 tokens
[Table, Pass 1: detection (%) for /p/, /t/, /k/ by initial vowel (/a/, /i/, /u/) at 30, 20, 10, and 5 ms; values not preserved]
[Table, Pass 2: detection (%) for /p/, /t/, /k/ by initial vowel (/a/, /i/, /u/) at 10, 7, 5, and 3 ms; values not preserved]

Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens
[Table, detection rates: Det. (%) at 30, 20, and 10 ms by pass and phoneme class: Stop (548), Fricative (266), Nasal (154), Vowel (614), Semivowel (213), overall; values not preserved]
[Figure: localization error]
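Detection rate at a given temporal resolution can be scored as the share of manually labeled landmarks that have a detected landmark within the tolerance. The sketch below illustrates that scoring under this assumed matching rule; the slides do not describe the scoring procedure, the name `detection_rate` is hypothetical, and the landmark times are invented, not results from the experiments.

```python
def detection_rate(reference_ms, detected_ms, tol_ms):
    """Percentage of reference landmarks with a detection within +/- tol_ms."""
    hits = sum(1 for r in reference_ms
               if any(abs(r - d) <= tol_ms for d in detected_ms))
    return 100.0 * hits / len(reference_ms)

# Hypothetical landmark times in ms (not data from the slides).
reference = [100.0, 250.0, 400.0, 620.0]
detected = [103.0, 262.0, 399.0, 655.0]

for tol in (30, 20, 10, 5):
    print(f"{tol} ms: {detection_rate(reference, detected, tol):.0f} %")
```

A stricter evaluation would also count insertions (detections with no nearby reference landmark), which matter for intelligibility enhancement because modifying a falsely detected region can introduce distortion.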

4. SUMMARY & CONCLUSION
Pass 2 improves temporal resolution of stop landmarks
▪ Significant improvement in stop burst localization in VCV syllables: 30 % improvement at 5 ms resolution
▪ Marginal improvement in sentences: 4 % improvement for stop landmarks at 10 ms resolution
Possible reasons
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may be above 30 ms
▪ the 40 ms window used in Pass 2 may need modification
▪ errors in the manual labels
Future work: evaluation of the method in the presence of noise