Analysis and Synthesis of Shouted Speech
Tuomo Raitio, Jouni Pohjalainen, Manu Airaksinen, Paavo Alku, Antti Suni, Martti Vainio

Shout
Shouting is the loudest mode of vocal communication.
It is used to increase the signal-to-noise ratio (SNR) when communicating over interfering noise or across a distance.
Shouting is also used to express emotions or intentions.

Properties of shout
Shout is produced by raising the subglottal pressure and increasing the vocal fold tension.
In effect, shout is characterized by:
Increased sound pressure level (SPL)
Increased fundamental frequency (f0)
Increased amplitudes in the mid-frequencies (1–4 kHz)
Increased duration and energy of vowels
Decreased duration and energy of consonants
Less accurate articulation

Why perform shout synthesis?
Although shouting is used rarely, it is an essential part of human vocal communication.
Shout synthesis may be required, e.g., for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters.

In this study…
Normal and shouted speech was recorded.
Properties of normal and shouted speech were analyzed.
Methods for producing natural-sounding HMM-based synthetic shout were investigated.

Recording of normal and shouted speech
Normal and shouted speech was recorded in an anechoic chamber.
22 Finnish speakers
24 sentences of normal speech and shout from each speaker
A total of 1056 sentences
Subjects were asked to use a very loud voice when shouting.
In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes.

Acoustic analysis of shout
The following acoustic properties were analyzed from the recorded shouted and normal speech:
Sound pressure level (SPL)
Duration
Fundamental frequency (f0)
Spectrum
Properties of the voice source: shape of the glottal pulse, H1–H2 parameter, NAQ parameter
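To make the voice source parameters listed above concrete, the sketch below shows how NAQ and H1–H2 could be computed from one frame of an estimated glottal flow signal in Python (NumPy). It is a minimal illustration, assuming the glottal flow has already been obtained by glottal inverse filtering; the FFT length and the harmonic search tolerance are assumptions, and this is not the analysis code used in the study.

```python
# Minimal sketch (not the study's analysis code) of two voice source parameters,
# assuming `glottal_flow` is one frame of an inverse-filtered glottal flow signal.
import numpy as np

def naq(glottal_flow, f0, fs):
    """Normalized amplitude quotient: NAQ = f_ac / (d_peak * T), where f_ac is
    the peak-to-peak flow amplitude, d_peak the magnitude of the negative peak
    of the flow derivative, and T = 1/f0 the fundamental period."""
    d_flow = np.diff(glottal_flow) * fs               # flow derivative
    f_ac = glottal_flow.max() - glottal_flow.min()    # AC flow amplitude
    d_peak = -d_flow.min()                            # negative peak magnitude
    return f_ac / (d_peak * (1.0 / f0))

def h1_h2(glottal_flow, f0, fs, nfft=4096, tol=0.2):
    """H1-H2: level difference (dB) between the first and second harmonics of
    the voice source spectrum (frame assumed shorter than nfft)."""
    spec = np.abs(np.fft.rfft(glottal_flow * np.hanning(len(glottal_flow)), nfft))
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)

    def harmonic_level(k):
        # strongest spectral peak within +/- tol*f0 of the k-th harmonic
        band = (freqs > k * f0 * (1 - tol)) & (freqs < k * f0 * (1 + tol))
        return 20.0 * np.log10(spec[band].max() + 1e-12)

    return harmonic_level(1) - harmonic_level(2)
```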

Acoustic analysis of shout – Results
On average (speech → shout):
SPL increased 21 dB for females and 22 dB for males
Sentence duration increased 20% for females and 24% for males
f0 increased 71% for females and 152% for males
The spectrum was emphasized in the 1–4 kHz area

[Figure: results for female and male speakers, shown for overall, voiced, and unvoiced segments]

Problems…
Differences between normal speech and shout are large.
This causes problems in many speech processing algorithms:
Due to the high f0, accurate estimation of the speech spectrum is difficult.
This is caused by the biasing effect of the sparse harmonic structure of the shouted voice source.
Linear prediction (LP) in particular is prone to this type of bias.

Spectrum estimation of shout
The biasing effect of the harmonics must be reduced.
For this purpose, e.g., weighted linear prediction (WLP) can be used.
In WLP, the effect of the excitation on the spectrum is reduced.
This is done by weighting the squared prediction residual with a specific weighting function.

LP vs. weighted linear prediction (WLP)
Conventional LP finds the predictor coefficients a_k that minimize the sum of squared prediction errors:
E_{LP} = \sum_n \Big( s(n) - \sum_{k=1}^{p} a_k s(n-k) \Big)^2
Weighted LP minimizes the same error, but with each squared term weighted by a temporal weighting function W(n):
E_{WLP} = \sum_n W(n) \Big( s(n) - \sum_{k=1}^{p} a_k s(n-k) \Big)^2
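To make the two criteria above concrete, the following Python (NumPy) sketch solves both sets of normal equations for a single speech frame. It is a minimal illustration under the assumption that W(n) is the short-time energy (STE) of the p samples preceding n, as in STE-WLP; it is not the code used in the study, and the small ridge term is only a numerical safeguard.

```python
# Minimal sketch: conventional LP vs. STE-weighted LP for one frame of speech
# (`frame` is a 1-D float array; prediction order p and STE window length = p
# are illustrative choices).
import numpy as np

def lp(frame, p):
    """Conventional autocorrelation-method LP coefficients a_1..a_p."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def ste_wlp(frame, p):
    """STE-WLP: each squared residual is weighted by the short-time energy of
    the p preceding samples, reducing the effect of the excitation on the
    estimated spectrum (as described on the previous slides)."""
    n = len(frame)
    # delayed sample matrix: column i holds s(n-i), i = 0..p
    X = np.array([np.concatenate((np.zeros(i), frame[:n - i]))
                  for i in range(p + 1)]).T
    # W(n) = energy of the p samples preceding n
    energy = np.convolve(frame ** 2, np.ones(p), mode="full")
    w = np.concatenate(([0.0], energy))[:n]
    C = X.T @ (X * w[:, None])        # C[i, j] = sum_n W(n) s(n-i) s(n-j)
    return np.linalg.solve(C[1:, 1:] + 1e-9 * np.eye(p), C[1:, 0])
```

Calling lp(frame, p) and ste_wlp(frame, p) on the same frame yields two all-pole models whose spectral envelopes can then be compared via their frequency responses.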

Spectrum estimation of shout
The following spectrum estimation methods were compared for normal speech and shout:
1. Conventional linear prediction (LP)
2. WLP with STE weight (STE-WLP)
3. WLP with AME weight (AME-WLP)
STE – short-time energy
AME – attenuation of the main excitation
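For comparison, the AME weight attenuates the residual around the main excitation of each glottal cycle instead of using signal energy. The sketch below shows one hedged way such a weighting function could be constructed; the dip value, dip length, ramp length, and the reliance on precomputed glottal closure instants (GCIs) are illustrative assumptions rather than the parameters used in the study.

```python
# Hedged sketch of an AME-style weight: close to one away from excitations,
# attenuated for a short region after each glottal closure instant.
import numpy as np

def ame_weight(n_samples, gci_indices, dip=0.05, dip_len=30, ramp_len=10):
    """Weight of length n_samples: ramps down to `dip` for `dip_len` samples
    after each GCI, then ramps back up to one (all parameters assumed)."""
    w = np.ones(n_samples)
    for gci in gci_indices:
        down = np.linspace(1.0, dip, ramp_len)        # ramp down
        hold = np.full(dip_len, dip)                  # attenuated region
        up = np.linspace(dip, 1.0, ramp_len)          # ramp back up
        shape = np.concatenate((down, hold, up))
        end = min(gci + len(shape), n_samples)
        w[gci:end] = np.minimum(w[gci:end], shape[:end - gci])
    return w
```

A weight built this way could replace the STE weight in the weighted normal equations of the previous sketch.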

LP vs. WLP in resynthesis
Subjective listening tests indicate that:
WLP-AME performs best with normal speech
WLP-STE performs best with shout
[Figure: listening test results for LP, WLP-STE, and WLP-AME]

LP vs. WLP in HMM-based speech synthesis
Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation).
[Figure: listening test results for female and male voices]

Synthesis of shout (1)
HMM-based synthesis is a very flexible means of producing different speaking styles, such as shout.
[Diagram: Speech data → Training → Statistical model; Text → Synthesis (using the statistical model) → Synthetic speech]

Synthesis of shout (2)
It is difficult to obtain large amounts of shout data, enough for constructing a TTS voice.
[Diagram: shout data]

Synthesis of shout (3)
Statistical adaptation of the normal speech model was used to generate synthetic shouted speech.
[Diagram: Speech data → Training → Statistical model; Shout data → Adaptation of the statistical model; Text → Synthesis → Synthetic shout]

Synthesis of shout (4)
Alternatively, using a simple voice conversion technique, the synthetic speech can be converted into shouted speech.
[Diagram: Speech data → Training → Statistical model; Text → Synthesis → Voice conversion (using shout data) → Synthetic shout]
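The slides describe the conversion only as a "simple voice conversion technique", so its details are not given here. Purely as an illustration of the idea, the sketch below imposes two of the measured speech-to-shout differences on vocoder parameter streams: an f0 scale factor (1.7, roughly mirroring the female f0 increase reported earlier) and a fixed emphasis of the 1–4 kHz band of the spectral envelope. Both values and the parameter representation are assumptions, not the authors' method.

```python
# Illustrative sketch only (not the authors' voice conversion method):
# raise the f0 contour and emphasize the 1-4 kHz band of each spectral
# envelope frame before passing the parameters to the vocoder.
import numpy as np

def convert_to_shout(f0_track, spectral_envelopes, fs, f0_scale=1.7, emphasis_db=6.0):
    """f0_track: per-frame f0 values; spectral_envelopes: linear-magnitude
    envelope frames of shape (frames, bins) covering 0..fs/2."""
    converted_f0 = f0_track * f0_scale                 # raise fundamental frequency

    n_bins = spectral_envelopes.shape[1]
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    band = (freqs >= 1000.0) & (freqs <= 4000.0)
    gain = np.ones(n_bins)
    gain[band] = 10 ** (emphasis_db / 20.0)            # boost the 1-4 kHz band
    converted_env = spectral_envelopes * gain          # broadcast over frames

    return converted_f0, converted_env
```

The converted f0 track and envelopes would then be fed to the vocoder in place of the normal speech parameters.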

Evaluation (1)
The following speech types were selected for the test:
1. Natural normal speech
2. Natural shout
3. Synthetic normal speech
4. Synthetic shout (adapted)
5. Synthetic shout (voice conversion)

Evaluation (2)
MOS-style listening test in which the following properties were rated:
1. How would you rate the quality of the speech sample?
2. How much does the sample resemble shouting?
3. How much effort did the speaker use to produce the speech?
Scale from 1 to 5 with verbal anchors.
The loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL.
11 test subjects evaluated 50 samples each.
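Since the slide states that loudness was normalized before rating, here is a minimal sketch of one way to do it. Equalizing the RMS level of every sample is an assumption made for illustration only; the slides do not specify the normalization method or target level.

```python
# Minimal sketch of loudness normalization (RMS-based; method and target level assumed).
import numpy as np

def normalize_rms(samples, target_dbfs=-26.0):
    """Scale a waveform (float array in [-1, 1]) to a common RMS level."""
    rms = np.sqrt(np.mean(samples ** 2))
    target_rms = 10 ** (target_dbfs / 20.0)
    return samples * (target_rms / (rms + 1e-12))
```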

Results – Naturalness
Shout synthesis is rated lower in quality than normal speech synthesis (as expected).
[Figure: naturalness ratings for normal synthesis vs. shout synthesis]

Results – Impression of shouting
The impression of shouting is, however, fairly well preserved.
[Figure: shouting impression ratings for natural shout vs. synthetic shout]

Results – Vocal effort
Adaptation produces a better impression of the vocal effort used than the voice conversion method.
[Figure: vocal effort ratings for adapted shout vs. voice conversion shout]

Summary (1)
Synthesis of shout is challenging for many reasons:
1. It is difficult to obtain large amounts of shout data with consistent quality.
2. Differences between normal speech and shout are large, which causes problems in many speech processing algorithms.
In this work, the biasing effect of high-pitched shout was reduced by using weighted linear prediction (WLP) methods.
Subjective listening tests show that WLP models work better with shout than conventional LP.

Summary (2)
In this study, synthetic shout was produced with two different techniques:
1. Adaptation
2. Voice conversion of the synthetic normal speech
The methods were rated equal in quality.
The impression of shouting and the use of vocal effort were better preserved in the adapted shout.

Thank you!
[Audio samples: male and female]
