MIL Speech Seminar: TRACHEOESOPHAGEAL SPEECH REPAIR
Arantza del Pozo, CUED Machine Intelligence Laboratory, November 20th 2006

OUTLINE
 Speech repair
 Tracheoesophageal (TE) speech
   Laryngectomy
   Acoustic properties
   Main limitations
 Excitation repair
   Previous attempts
   Adopted approach
   Baseline system
   Enhanced system
   Results
 Duration repair
   Preliminary experiments
   Regression tree modelling
   Improving TE recognition
   Fixing recognition artifacts
   Results
 Conclusions and future work

SPEECH REPAIR
 A speech repair system uses a speech model to detect deviant features in the input speech and applies correction algorithms to them

Laryngectomy
 Laryngectomy is a surgical procedure involving the removal of the larynx, i.e. the vocal cords, epiglottis and tracheal rings
 Speech rehabilitation after laryngectomy:
   Esophageal speech
   TE speech
   Electrolaryngeal speech
 TE speech is the most frequently used voice restoration technique after laryngectomy

Acoustic properties of TE speech
 Highly variable and deviant voicing source
 Lower F0 (female speakers) and higher jitter and shimmer
 More high-frequency noise; lower harmonic-to-noise ratio (HNR), glottal-to-noise excitation ratio (GNE) and band-energy difference (BED)
 Some evidence of higher formant values in Spanish and Dutch TE speech
 Shorter maximum phonation time, longer vowel durations and slower speaking rates
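The jitter and shimmer measures mentioned above can be made concrete with a minimal sketch of the standard "local" definitions. This assumes per-cycle pitch periods and peak amplitudes have already been extracted (e.g. by a pitch-marking algorithm); the function name is illustrative, not from the original work.

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer from per-cycle measurements.

    periods    -- successive pitch-period lengths (seconds)
    amplitudes -- per-cycle peak amplitudes
    Returns (jitter, shimmer), each the mean absolute difference
    between consecutive cycles normalised by the mean value.
    """
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return float(jitter), float(shimmer)
```

Both values are zero for a perfectly periodic voice and grow with cycle-to-cycle irregularity, which is why they come out higher for TE speech.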

Main limitations of TE speech
 Inability to properly control the EXCITATION:
   deviant glottal waveforms
   irregular pitch and amplitude contours
   higher turbulence noise
   spectral envelope deviations caused by coupling
 DURATION deviations caused by the disconnection between the lungs and the vocal tract:
   more pauses
   longer vowels
   slower rates
   rushes before breaks

Previous excitation repair attempts
 Qi et al.: resynthesis of female TE words with a synthetic glottal waveform and with smoothed and raised F0
 Replacement of the voice source and conversion of spectral envelopes
 Limitations of previous repair attempts:
   only the most obvious deviant features have been tackled
   evaluation has been limited to sustained vowels and words
   only a small number of TE speakers and voice qualities have been tested
   the degree of perceptual enhancement has not been quantified

Adopted approach
 DATA:
   13 TE speakers (11 male, 2 female), patients of the Speech and Language Therapy Department of Addenbrooke's Hospital, Cambridge
   control group of 11 normal speakers (8 male, 3 female)
 BASELINE SYSTEM: glottal resynthesis; jitter and shimmer reduction
 ENHANCED SYSTEM: spectral envelope smoothing and tilt reduction
 The deviant features targeted (voice source, jitter & shimmer, spectral envelope) are corrected, and the repair is assessed by perceptual evaluation

Baseline system
 Glottal resynthesis: breathiness reduction
 Jitter and shimmer reduction: roughness reduction
 (The slide diagram showed the source-filter synthesis chain: glottal source, vocal tract (VT) filter, lip radiation)
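The jitter and shimmer reduction step can be sketched as median smoothing of the per-cycle period and amplitude contours before resynthesis. This is an illustration of the contour-smoothing idea only, not the actual baseline system, which resynthesises the glottal excitation itself.

```python
import numpy as np

def median_smooth(x, k=5):
    """Median-smooth a per-cycle contour (pitch periods or peak
    amplitudes). Replacing each value by the median of a k-cycle
    window removes the cycle-to-cycle irregularity perceived as
    roughness while leaving slow, intended contour changes intact."""
    x = np.asarray(x, dtype=float)
    half = k // 2
    padded = np.pad(x, half, mode='edge')  # edge-pad so output length matches input
    return np.array([np.median(padded[i:i + k]) for i in range(len(x))])
```

Applied to both the period and amplitude contours, this drives the jitter and shimmer measures towards the normal-speaker range.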

Enhanced system (1/2)
 Resynthesised speech still has a harsh quality, caused by deviations in the TE spectral envelopes (SE)
 Spectral envelope analysis:
   higher standard deviation of formant gains, frequencies and bandwidths, and higher spectral distortion
   lower relative gain difference between the 1st and 3rd formants, and lower spectral tilt

Enhanced system (2/2)
 Enhancement algorithm:
   to reduce differences between consecutive estimated SEs: LSF median smoothing
   to decrease spectral tilt: low-pass filtering
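The LSF median-smoothing step can be sketched as follows, assuming line spectral frequencies have already been estimated frame by frame; the window length k is an arbitrary illustrative choice.

```python
import numpy as np

def smooth_lsf_tracks(lsf, k=5):
    """Median-smooth each line spectral frequency (LSF) track over time.

    lsf -- (n_frames, order) array, one row of LSFs per analysis frame.
    Each coefficient track is smoothed independently over a k-frame
    window; because the element-wise median of ordered LSF vectors
    remains ordered, the resynthesised LPC filters stay stable.
    """
    lsf = np.asarray(lsf, dtype=float)
    half = k // 2
    padded = np.pad(lsf, ((half, half), (0, 0)), mode='edge')
    out = np.empty_like(lsf)
    for t in range(lsf.shape[0]):
        out[t] = np.median(padded[t:t + k], axis=0)
    return out
```

This reduces the frame-to-frame spectral-envelope variation identified in the analysis while leaving a steady envelope untouched.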

Results
 Perceptual tests (pairwise listener judgements; percentages read across the original / baseline / enhanced columns):

                            original   baseline   enhanced
  "more breathy"             82.69%     17.31%       -
  "harsher"                     -       73.72%     26.28%
  "more normal speaker"      58.33%     41.67%       -
  "more normal speaker"         -       38.78%     61.22%

Preliminary experiments
 Duration deviations: more pauses, longer vowels, slower rates, rushes before breaks
 Possible duration repair approaches:
   Rule-based: reduce pauses, shorten vowels, increase speech rate, lengthen phones before breaks, etc.
    - difficult to obtain adequate reduction/increase rates
    - breaks sentence rhythm
   Transplantation of average normal phone durations, with phone durations obtained by Forced Alignment (FA)
    - overall improvement which increased the naturalness of TE sentences
    - sentence rhythm was preserved
 The duration repair algorithm is an automation of the transplantation experiment
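The transplantation experiment can be sketched as computing a per-phone time-scaling factor from the forced alignment. This is a hypothetical helper, not the original code; a repair system would then apply the factors with a PSOLA-style time-scale modifier.

```python
def transplant_durations(phones, mean_dur):
    """Per-phone time-scaling factors that replace each aligned TE
    phone duration with the mean duration of that phone in the
    normal control data.

    phones   -- list of (label, start, end) tuples from forced
                alignment, times in seconds
    mean_dur -- dict mapping phone label -> mean normal duration (s)
    Returns a list of (label, factor) pairs; e.g. a factor of 0.5
    halves the duration of an over-long TE phone.
    """
    factors = []
    for label, start, end in phones:
        te_dur = end - start
        target = mean_dur.get(label, te_dur)  # leave unseen phones unchanged
        factors.append((label, target / te_dur))
    return factors
```

Because every phone is rescaled towards its normal mean rather than by a global rate, sentence rhythm is preserved, which is the advantage over the rule-based approach noted above.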

Regression tree modelling (1/2)
 Classification and regression trees (CART) are widely used for duration modelling in TTS systems; the features employed are extracted from text:
   phone identity
   identities of previous and next phones
   position of the syllable in the word
   position of the word in the sentence
   number of syllables before/after a break
   type of lexical stress
   lexical stress type of previous and next syllables
   ...
 A speech repair framework constrains the possible feature space to features that can be recognised from the signal
 For TE speech repair it is assumed that only phone recognition is viable, so features relying on word, syllable or lexical stress information cannot be used

Regression tree modelling (2/2)
 Several CART trees were built with different features
 Explored features:
   phone identity
   identities of previous and next phones
   position of the phone in the sentence
   pitch and energy (as an attempt to incorporate some stress information)
 Short pauses (SP) are not regarded as phones and are modelled independently
 Trees:
   T1 (F1): phone identity
   T2 (F2): F1 + previous & next phone identities (broad class)
   T3 (F3): F2 + position of phone in sentence
   T4 (F4): F3 + pitch (positive/negative/no slope)
   T5 (F5): F4 + energy (positive/negative/no slope)
   TSP: number of phones since the previous SP and until the next SP
 Performance measured as the Mean Squared Error (MSE) between the normal mean durations used for transplantation and the predicted values: T3 > T2 > T1 > T5 > T4
 Substituting T3+TSP predicted durations into TE sentences with FA phone segmentation was almost indistinguishable from transplantation
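As an illustration of the CART approach, the sketch below fits a regression tree to toy data and scores it with the MSE criterion from the slide. scikit-learn stands in for whatever CART toolkit was actually used, and the integer feature encodings are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one row per phone token. Columns are hypothetical encodings
# of the T3 feature set: phone identity, broad class of the previous
# phone, broad class of the next phone, relative sentence position.
X = np.array([
    [3, 1, 2, 0.10],
    [3, 2, 1, 0.50],
    [7, 1, 1, 0.90],
    [7, 2, 2, 0.30],
])
# Targets: mean duration (seconds) of each phone in normal control
# data, i.e. the values used in the transplantation experiment.
y = np.array([0.08, 0.09, 0.14, 0.12])

tree = DecisionTreeRegressor().fit(X, y)
pred = tree.predict(X)
# Selection criterion from the slide: MSE between normal mean
# durations and the tree's predictions.
mse = float(np.mean((pred - y) ** 2))
```

In the real experiments the trees were compared on held-out data, which is how T4 and T5 (pitch and energy features) could end up worse than the simpler T3.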

Improving TE recognition (1/2)
 There is little work on automatic TE speech recognition: Haderlein et al. (2004) adapted a speech recogniser trained on normal speech to single TE speakers by unsupervised HMM interpolation and obtained an average word accuracy of 36.4%
 The focus here is on improving TE phone recognition
 Novel performance measures take recognition (r), segmentation (s) and duration prediction (p) errors into account by comparing the recogniser output (REC) against the forced alignment (FA)

Improving TE recognition (2/2)
 Explored systems:
   Baseline (BL): monophone HMMs trained on WSJCAM0
   R1: BL + CMN + CMLLR
   R2: R1 + MAP
   R3: R1 + bigram LM
   R4: R1 + trigram LM
   R5: CUHTK 2003 BN LVCSR system + CMLLR, with phone-level output
 Results: R5 > R4 > R3 > R1 > R2 (tabulated as SPC [%] and SPE [ms] for BL and R1-R5)

Fixing recognition artifacts
 Using the best recognised labels for duration repair still produced artifacts
 Method for robust duration modification (RM): take recognition confidence into account
 Confidence is computed from:
   TE phone duration probability distributions
   recogniser confidence scores
   phone confusions (taken into account in R4)
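The core of the confidence idea can be sketched as interpolating between the observed TE duration and the predicted normal duration. This is a simplification: the actual RM method combines duration distributions, recogniser scores and phone-confusion statistics into the confidence value, which is taken as given here.

```python
def robust_target_duration(te_dur, predicted, confidence):
    """Interpolate between the observed TE phone duration and the
    predicted normal duration according to recognition confidence
    in [0, 1]: phones recognised with low confidence are left almost
    unmodified, so recognition errors cannot cause large, audible
    duration artifacts.
    """
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - confidence) * te_dur + confidence * predicted
```

With confidence 0 the phone keeps its TE duration; with confidence 1 it is fully rescaled to the predicted value.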

Results
 Objective evaluation: MSE between repaired sentences and the target transplanted durations, giving R5+RM > R5 > R4+RM > R4 > original TE durations
 Subjective evaluation: perceptual test
   preference test: R4 48%, R5 52%
   rank (1-5) test over the conditions O, T, R, with pairwise >/=/< comparisons for T-M, M-O and T-O

CONCLUSIONS AND FUTURE WORK
 Deviant TE excitation and duration features have been identified and repaired
 The synthetic quality of the excitation-repaired speech nullifies the results in some cases
 Future work:
   improve the quality of excitation resynthesis
   improve the TE speech recognition step
   attempt text-based features for duration modelling