Published by Clinton Cross. Modified over 9 years ago.
Analysis and Synthesis of Shouted Speech
Tuomo Raitio, Jouni Pohjalainen, Manu Airaksinen, Paavo Alku, Antti Suni, Martti Vainio
Statistical Parametric Speech Synthesis Utilizing Glottal Inverse Filtering Based Vocoding Analysis and Synthesis of Shouted Speech Raitio, Suni, Pohjalainen, Airaksinen, Vainio, Alku
Shout
Shout is the loudest mode of vocal communication.
It is used for increasing the signal-to-noise ratio (SNR) when communicating over interfering noise or over a distance.
Shouting is also used for expressing emotions or intentions.
Properties of shout
Shout is produced by raising the subglottal pressure and increasing the vocal fold tension.
In effect, shout is characterized by:
- Increased sound pressure level (SPL)
- Increased fundamental frequency (f0)
- Increased amplitudes in mid-frequencies (1–4 kHz)
- Increased duration and energy of vowels
- Decreased duration and energy of consonants
- Less accurate articulation
Why perform shout synthesis?
Fortunately, shouting is used rarely, but it is an essential part of human vocal communication.
Shout synthesis may be required, e.g., for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters.
In this study…
- Normal and shouted speech were recorded
- Properties of normal and shouted speech were analyzed
- Methods for producing natural-sounding HMM-based synthetic shout were investigated
Recording of normal and shouted speech
Normal and shouted speech were recorded in an anechoic chamber.
- 22 Finnish speakers
- 24 sentences of speech and shout from each speaker
- A total of 1056 sentences (22 speakers × 24 sentences × 2 styles)
Subjects were asked to use a very loud voice when shouting.
In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes.
Acoustic analysis of shout
The following acoustic properties were analyzed from the recorded shouted and normal speech:
- sound pressure level (SPL)
- duration
- fundamental frequency (f0)
- spectrum
- properties of the voice source: shape of the glottal pulse, H1-H2 parameter, NAQ parameter
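For reference, the two voice source parameters named above are commonly defined as follows. These definitions come from the general literature on glottal source analysis and are an assumption here, since the slide does not spell them out. H_1 and H_2 denote the amplitudes of the first and second harmonics of the glottal source spectrum, f_ac the peak-to-peak amplitude of the glottal flow pulse, d_peak the magnitude of the negative peak of the flow derivative, and T the length of the fundamental period.

```latex
\mathrm{H1\text{-}H2} = 20 \log_{10} \frac{|H_1|}{|H_2|} \ \text{dB},
\qquad
\mathrm{NAQ} = \frac{f_{\mathrm{ac}}}{d_{\mathrm{peak}} \, T}
```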
Acoustic analysis of shout – Results
On average (from speech to shout):
- SPL increased by 21 dB for females and 22 dB for males
- Sentence duration increased by 20% for females and 24% for males
- f0 increased by 71% for females and 152% for males
- The spectrum was emphasized in the 1–4 kHz range
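To put these figures on a more intuitive scale, the short sketch below converts the reported SPL differences into sound pressure ratios and the f0 increases into semitones. The function names are hypothetical, and the 20·log10 convention for sound pressure is assumed.

```python
import math


def db_to_pressure_ratio(db):
    """Sound pressure ratio for a level difference in dB (20*log10 convention)."""
    return 10 ** (db / 20)


def percent_increase_to_semitones(percent):
    """Musical interval in semitones for a relative frequency increase in percent."""
    return 12 * math.log2(1 + percent / 100)


# Averages reported on the slide: (label, SPL increase in dB, f0 increase in %)
for label, spl_db, f0_pct in [("female", 21, 71), ("male", 22, 152)]:
    print(
        f"{label}: pressure x{db_to_pressure_ratio(spl_db):.1f}, "
        f"f0 +{percent_increase_to_semitones(f0_pct):.1f} semitones"
    )
```

Under these conventions, the reported increases correspond to roughly 11 to 13 times the sound pressure, and to an f0 rise of about 9 semitones for females and about 16 semitones for males.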
[Figure: panels Overall, Voiced, Unvoiced; Female and Male]
Problems…
Differences between normal speech and shout are large.
This induces problems in many speech processing algorithms:
- Due to the high f0, accurate estimation of the speech spectrum is difficult
- This is due to the biasing effect of the sparse harmonic structure of the shouted voice source
- Linear prediction (LP) in particular is prone to this type of bias
Spectrum estimation of shout
The biasing effect of the harmonics must be reduced.
For this purpose, e.g., weighted linear prediction (WLP) can be used.
In WLP, the effect of the excitation on the spectrum is reduced.
This is done by weighting the squared residual with a specific function.
LP vs. weighted linear prediction (WLP)
Conventional LP:
Weighted LP:
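In the standard formulation, both methods predict the sample x_n from the p previous samples and differ only in the error that is minimized. The following is a sketch of that formulation; the notation (x_n, a_k, p, w_n) is assumed here rather than taken from the slide. Conventional LP minimizes the plain squared residual, while WLP weights each squared residual with a function w_n:

```latex
E_{\mathrm{LP}} = \sum_{n} \Bigl( x_n - \sum_{k=1}^{p} a_k x_{n-k} \Bigr)^{2},
\qquad
E_{\mathrm{WLP}} = \sum_{n} w_n \Bigl( x_n - \sum_{k=1}^{p} a_k x_{n-k} \Bigr)^{2}
```

Setting w_n = 1 for all n reduces WLP to conventional LP.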
Spectrum estimation of shout
The following spectrum estimation methods were compared for normal speech and shout:
1. Conventional linear prediction (LP)
2. WLP with STE weight (STE-WLP)
3. WLP with AME weight (AME-WLP)
STE – short-time energy
AME – attenuation of the main excitation
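As an illustration of the first two methods, the sketch below solves the weighted normal equations with NumPy. It is a minimal example under stated assumptions, not the authors' implementation: the STE weight is taken to be the energy of the `m` samples preceding each sample (the window length `m` is a free parameter here), all function names are hypothetical, and the AME weight is omitted because the slides do not define it. With no weights, `wlp` reduces to covariance-method LP.

```python
import numpy as np


def ste_weight(x, m):
    """Short-time energy weight: w[n] = x[n-1]^2 + ... + x[n-m]^2."""
    x = np.asarray(x, dtype=float)
    # Running sum of squares, shifted by one sample so that only past samples count.
    energy = np.convolve(x ** 2, np.ones(m))[: len(x) - 1]
    return np.concatenate(([0.0], energy))


def wlp(x, order, weights=None):
    """Predictor coefficients a_1..a_p minimizing sum_n w[n] * residual[n]^2."""
    x = np.asarray(x, dtype=float)
    n = np.arange(order, len(x))
    w = np.ones(len(x)) if weights is None else np.asarray(weights, dtype=float)
    # Column k-1 holds the delayed samples x[n-k].
    past = np.column_stack([x[n - k] for k in range(1, order + 1)])
    wn = w[n]
    R = past.T @ (wn[:, None] * past)  # weighted covariance matrix
    r = past.T @ (wn * x[n])           # weighted cross-correlation vector
    return np.linalg.solve(R, r)


def synth_ar2(num_samples, a1=1.3, a2=-0.6, seed=0):
    """Toy test signal: a stable second-order autoregressive process."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(num_samples)
    x = np.zeros(num_samples)
    for i in range(2, num_samples):
        x[i] = a1 * x[i - 1] + a2 * x[i - 2] + e[i]
    return x


x = synth_ar2(5000)
print("LP:     ", wlp(x, 2))
print("STE-WLP:", wlp(x, 2, ste_weight(x, 3)))
```

On this noise-driven toy signal both estimates should land close to the true coefficients (1.3, -0.6). The toy signal has no harmonic excitation, so it only checks that the solver works and does not reproduce the bias reduction discussed in the slides.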
LP vs. WLP in resynthesis
Subjective listening tests indicate that:
- AME-WLP performs best with normal speech
- STE-WLP performs best with shout
[Figure legend: LP, STE-WLP, AME-WLP]
LP vs. WLP in HMM-based speech synthesis
Subjective listening tests indicate that STE-WLP is preferred in the synthesis of shout (by adaptation).
[Figure: panels Female, Male]
Synthesis of shout (1)
HMM-based synthesis is a very flexible means of producing different speaking styles, such as shout.
[Diagram labels: Speech data, Training, Statistical model, Text, Synthesis, Synthetic speech]
Synthesis of shout (2)
It is difficult to obtain enough shout data to construct a TTS voice.
[Diagram label: Shout data]
Synthesis of shout (3)
Statistical adaptation of the normal speech model was used to generate synthetic shouted speech.
[Diagram labels: Speech data, Training, Statistical model, Shout data, Adaptation, Text, Synthesis, Synthetic shout]
Synthesis of shout (4)
Alternatively, using a simple voice conversion technique, the synthetic speech can be converted into shouted speech.
[Diagram labels: Speech data, Training, Statistical model, Text, Synthesis, Shout data, Voice conversion, Synthetic shout]
Evaluation (1)
The following speech types were selected for the test:
1. Natural normal speech
2. Natural shout
3. Synthetic normal speech
4. Synthetic shout (adapted)
5. Synthetic shout (voice conversion)
Evaluation (2)
MOS-style listening test in which the following properties were rated:
1. How would you rate the quality of the speech sample?
2. How much does the sample resemble shouting?
3. How much effort did the speaker use for producing the speech?
Scale from 1 to 5 with verbal anchors.
The loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL.
11 test subjects evaluated 50 samples each.
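The two processing steps implied by this setup can be sketched as follows. The slides do not say how loudness was normalized or how the ratings were aggregated, so both choices below are assumptions: RMS-level normalization stands in for the loudness normalization, and the mean opinion score is reported with a normal-approximation 95% confidence interval. The function names and the example ratings are hypothetical.

```python
import numpy as np


def normalize_rms(signal, target_rms=0.1):
    """Scale a signal so that its RMS level equals target_rms (assumed method)."""
    x = np.asarray(signal, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0.0:
        return x.copy()  # silent input: nothing to scale
    return x * (target_rms / rms)


def mean_opinion_score(ratings):
    """Mean rating and half-width of a normal-approximation 95% interval."""
    r = np.asarray(ratings, dtype=float)
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return r.mean(), half_width


# Hypothetical ratings on the 1-5 scale for one speech type.
mos, ci = mean_opinion_score([4, 5, 3, 4, 4, 2, 5, 3])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Equalizing the level before the test removes the most obvious cue, so that listeners have to judge shouting and vocal effort from spectral and prosodic properties instead.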
Results – Naturalness
Shout synthesis is rated lower in quality compared to normal speech synthesis (as expected).
[Figure labels: Normal synthesis, Shout synthesis]
Results – Impression of shouting
The impression of shouting is, however, fairly well preserved.
[Figure labels: Natural shout, Synthetic shout]
Results – Vocal effort
Adaptation produces a better impression of the vocal effort used than the voice conversion method.
[Figure labels: Adapted shout, Voice conversion shout]
Summary (1)
Synthesis of shout is challenging for many reasons:
1. It is difficult to obtain large amounts of shout data with consistent quality
2. Differences between normal speech and shout are large, which induces problems in many speech processing algorithms
In this work, the biasing effect of high-pitched shout was reduced by using weighted linear prediction (WLP) methods.
Subjective listening tests show that WLP models work better with shout than conventional LP.
Summary (2)
In this study, synthetic shout was produced with two different techniques:
1. Adaptation
2. Voice conversion of the synthetic normal speech
The methods were rated equal in quality.
The impression of shouting and the use of vocal effort were better preserved in the adapted shout.
Thank you!
Samples: Male, Female