Implementation of a Speech Analysis-Synthesis Toolbox using the Harmonic plus Noise Model. Didier Cadic (1), engineering student, supervised by Olivier Cappé (1), Maurice Charbit (1), Gérard Chollet (1), Eric Moulines (1); presented here by Guido Aversano (1,2). (1) Département TSI, ENST, Paris, France. (2) IIASS, Vietri sul Mare (SA), Italy.
Plan of the presentation Text-to-speech: classic methods; HNM model; Analysis; Synthesis; Analysis-Synthesis examples; Conclusions
Text-To-Speech by concatenation Examples realized on the AT&T web site: English, male; English, female (vocal server example); English, female (another vocal server example); German, male; French, female
Text-To-Speech by concatenation 2 major challenges: smooth connection between acoustic units; flexible prosody
TD-PSOLA method Analysis: pitch estimation; pitch-synchronous windowing. Synthesis: rearrangement of frames.
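The analysis and synthesis steps listed on this slide can be sketched in a few lines. This is only an illustration under strong assumptions that are not in the slides: the pitch marks are already known, the pitch period is constant, and `rate` below 1 means slower playback. The function name `psola_time_stretch` and all parameters are hypothetical.

```python
import numpy as np

def psola_time_stretch(x, marks, rate):
    """Time-stretch x by re-using pitch-synchronous frames (constant period assumed)."""
    marks = np.asarray(marks)
    period = int(np.median(np.diff(marks)))   # assumed constant pitch period, in samples
    out_len = int(len(x) / rate)
    y = np.zeros(out_len)
    win = np.hanning(2 * period + 1)          # two-period window centred on a pitch mark
    t_s = marks[0]                            # current synthesis mark
    while t_s + period < out_len:
        # Pick the analysis frame whose mark is closest to the matching input time.
        m = marks[np.argmin(np.abs(marks - t_s * rate))]
        if m - period >= 0 and m + period < len(x) and t_s - period >= 0:
            y[t_s - period : t_s + period + 1] += win * x[m - period : m + period + 1]
        t_s += period                         # pitch is kept: the hop equals the period
    return y

# Toy example: a sine with an 80-sample period, slowed down to half speed.
P = 80
x = np.sin(2 * np.pi * np.arange(1600) / P)
marks = np.arange(P, 1520, P)
y = psola_time_stretch(x, marks, rate=0.5)
```

Frames are repeated (or dropped) rather than resampled, which is why the pitch stays the same while the duration changes.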
TD-PSOLA method Some very good-quality results: Singing, original Singing, modified Time-scaling Cello, original Cello, modified Pitch-shifting
TD-PSOLA method Artifacts appearing in non-voiced sounds: "rain", original; "rain", 0.5 rate; "ss", original; "ss", slowed down (classic method); "ss", slowed down (improved)
Phase Vocoder method Intuitive description: Compression/stretching of (narrow-band) spectrogram’s time-frequency scales… time-scaling pitch-shifting
Phase Vocoder method Examples : "rain", male voice Slow-motion by Vocoder (PSOLA : ) "The quick fox …", female voice Slow-motion by Vocoder Main problem : phase coherence is lost in the synthesized signal
TD-PSOLA and Vocoder allow basic prosodic modifications. The problem of unit concatenation for TTS is not solved. Other kinds of modifications (timbre, denoising, …) should be considered. We need a parametric model
Harmonic plus Noise Model (HNM) Main assumption: stationary segments of a speech signal can always be seen as the superposition of a periodic part and a noisy part
HNM Model Modelling: $S(t) = H(t) + B(t)$, where $H(t) = \sum_k A_k \cos(2\pi k f_0 t + \varphi_k)$ and $B(t)$ = white noise passed through an AR filter
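A minimal sketch of this model for a single stationary frame, assuming the AR filter is written as 1 / (1 + c_1 z^-1 + ... + c_p z^-p) with a separate noise gain. The function name `synth_hnm_frame` and its parameters are illustrative, not taken from the toolbox.

```python
import numpy as np

def synth_hnm_frame(f0, amps, phases, ar_coeffs, noise_gain, fs, n, seed=0):
    """Return (harmonic part, noise part) of one frame of n samples."""
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)
    t = np.arange(n) / fs
    k = np.arange(1, len(amps) + 1)
    # Harmonic part: sum over k of A_k cos(2 pi k f0 t + phi_k).
    harmonic = (amps[:, None]
                * np.cos(2 * np.pi * k[:, None] * f0 * t + phases[:, None])).sum(axis=0)
    # Noise part: white Gaussian noise through an all-pole (AR) filter.
    rng = np.random.default_rng(seed)
    e = noise_gain * rng.standard_normal(n)
    noise = np.zeros(n)
    for i in range(n):
        acc = e[i]
        for j, c in enumerate(ar_coeffs, start=1):
            if i - j >= 0:
                acc -= c * noise[i - j]
        noise[i] = acc
    return harmonic, noise

# Two harmonics of a 100 Hz pitch plus first-order AR noise, 20 ms at 8 kHz.
h, b = synth_hnm_frame(100.0, [1.0, 0.5], [0.0, 0.0], [-0.9], 0.1, 8000, 160)
s = h + b
```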
HNM analysis of a frame 1. Pitch estimation: spectral comb method
HNM analysis of a frame 1. Pitch estimation: Good results are obtained. In some cases the method erroneously returns f0/2. Possibility of tracking… ("aka…aga")
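The slides do not give the details of the spectral comb, so the following is one possible reading rather than the toolbox's method. For each candidate f0, it averages the magnitude spectrum at the harmonic positions k·f0 and subtracts the average at the midpoints (k − 0.5)·f0. The subtraction is an added assumption meant to penalise octave errors. Even so, a comb at f0/2 still hits every harmonic of f0, which is why sub-harmonic errors remain a natural failure mode. The name `comb_pitch` and all default values are hypothetical.

```python
import numpy as np

def comb_pitch(frame, fs, f_lo=80.0, f_hi=400.0, f_max=1500.0, step=1.0, nfft=8192):
    """Return the candidate f0 (Hz) with the highest comb score."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), nfft))
    hz_per_bin = fs / nfft

    def mean_mag(freqs):
        # Average spectral magnitude at the bins closest to the given frequencies.
        return mag[np.rint(freqs / hz_per_bin).astype(int)].mean()

    best_f0, best_score = f_lo, -np.inf
    for f0 in np.arange(f_lo, f_hi + step, step):
        k = np.arange(1, int(f_max // f0) + 1)
        # Comb teeth on the harmonics, "anti-teeth" halfway between them.
        score = mean_mag(k * f0) - mean_mag((k - 0.5) * f0)
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0

# Synthetic voiced frame: five harmonics of 200 Hz, 128 ms at 8 kHz.
fs = 8000
t = np.arange(1024) / fs
frame = sum(a * np.cos(2 * np.pi * 200.0 * (i + 1) * t)
            for i, a in enumerate([1.0, 0.8, 0.6, 0.4, 0.2]))
f0_est = comb_pitch(frame, fs)
```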
HNM analysis of a frame 2. Harmonic part: extraction of amplitudes. Least squares method: $H(t) = \sum_k \left[ a_k \cos(2\pi k f_0 t) + b_k \sin(2\pi k f_0 t) \right]$, with $\min_{a_k, b_k} \| s(t) - H(t) \|^2$
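Because H(t) is linear in a_k and b_k once f0 is fixed, the minimisation is an ordinary linear least squares problem. A minimal sketch using NumPy follows; the function name `harmonic_least_squares` and the choice of the number of harmonics are illustrative assumptions.

```python
import numpy as np

def harmonic_least_squares(s, f0, n_harm, fs):
    """Fit a_k, b_k for k = 1..n_harm; return (a, b, fitted harmonic part)."""
    t = np.arange(len(s)) / fs
    k = np.arange(1, n_harm + 1)
    arg = 2 * np.pi * f0 * np.outer(t, k)            # shape (samples, harmonics)
    basis = np.hstack([np.cos(arg), np.sin(arg)])    # columns: cosines, then sines
    coef, *_ = np.linalg.lstsq(basis, s, rcond=None)
    return coef[:n_harm], coef[n_harm:], basis @ coef

# Frame containing exactly two harmonics of 150 Hz, 50 ms at 8 kHz.
fs = 8000
t = np.arange(400) / fs
s = 0.7 * np.cos(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
a, b, h = harmonic_least_squares(s, 150.0, 2, fs)
residual = s - h
```

The heuristic gain correction g(DV) is not reproduced here, since its formula is not given.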
HNM analysis of a frame 2. Extraction of amplitudes Problem: the noisy part makes a non-zero contribution to the spectral power. Solution: gain correction for the harmonics, using a heuristic formula g(DV), where DV is the estimated degree of voicing.
HNM analysis of a frame 2. Extraction of amplitudes Residual: $R(t) = s(t) - H(t)$
HNM analysis of a frame 2.Extraction of amplitudes Possibility of improving harmonic estimation
HNM analysis of a frame 3. AR filter estimation for the residual: linear prediction method. $R(t) = (B_g * F)(t)$, where $B_g$ = Gaussian white noise and $F(t)$ = AR filter, $F(z) = \dfrac{1}{a_0 + a_1 z^{-1} + \dots + a_N z^{-N}}$
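The slide names linear prediction without saying which variant is used. The sketch below uses the autocorrelation method and solves the normal equations directly. It assumes the normalisation a_0 = 1 with a separate gain, which is a convention chosen here for illustration. The function name `lpc_autocorr` is hypothetical.

```python
import numpy as np

def lpc_autocorr(r_sig, order):
    """Return (a, gain) with a[0] == 1 so that F(z) = gain / sum_j a[j] z^-j."""
    r_sig = np.asarray(r_sig, dtype=float)
    n = len(r_sig)
    # Autocorrelation for lags 0..order.
    ac = np.array([r_sig[: n - lag] @ r_sig[lag:] for lag in range(order + 1)])
    # Toeplitz normal equations: sum_j a_j ac[|i - j|] = -ac[i], for i = 1..order.
    R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
    alpha = np.linalg.solve(R, -ac[1:])
    err_energy = ac[0] + alpha @ ac[1:]          # energy of the prediction error
    return np.concatenate(([1.0], alpha)), np.sqrt(err_energy / n)

# Check on a synthetic AR(2) residual with known coefficients [1, -1.3, 0.6].
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
r = np.zeros_like(e)
for i in range(len(e)):
    r[i] = e[i] + 1.3 * (r[i - 1] if i >= 1 else 0.0) - 0.6 * (r[i - 2] if i >= 2 else 0.0)
a_est, gain_est = lpc_autocorr(r, 2)
```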
HNM Synthesis Interpolation of each harmonic between two successive frames: $H(t) = \sum_k \left[ a_k(t) \cos(2\pi k f_0(t) t) + b_k(t) \sin(2\pi k f_0(t) t) \right] = \sum_k A_k(t) \cos \varphi_k(t)$. $A_k(t_a)$ and $\varphi_k(t_a)$ are known at the analysis instants $t_a$; $\varphi_k'(t_a) = 2\pi k f_0(t_a)$ is known from the pitch analysis.
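Moving from the cosine/sine form to the amplitude/phase form relies on the identity a cos θ + b sin θ = A cos(θ + φ), with A = sqrt(a² + b²) and φ = atan2(−b, a). A small sketch of that conversion follows; the function name `to_amp_phase` is illustrative.

```python
import numpy as np

def to_amp_phase(a, b):
    """Convert (a_k, b_k) into (A_k, phi_k) such that a cos(x) + b sin(x) = A cos(x + phi)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.hypot(a, b), np.arctan2(-b, a)

a_k = np.array([0.3, -1.2])
b_k = np.array([0.4, 0.5])
A_k, phi_k = to_amp_phase(a_k, b_k)
```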
HNM Synthesis When the pitch is estimated erroneously (usually as f0/2), the harmonic correspondence problem between frames is solved by introducing fictitious harmonics
HNM Synthesis $A_k(t) \cos \varphi_k(t)$: linear interpolation for the amplitude $A_k$; unwrapping + cubic interpolation for the phase $\varphi_k$
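One well-known way to combine unwrapping with cubic phase interpolation is the McAulay-Quatieri scheme. The slides do not say that this exact scheme is used, so the sketch below is an assumption. It picks the integer number of 2π turns M that gives the smoothest cubic matching phase and frequency at both frame boundaries. Frequencies are in radians per sample, and the function name `interp_harmonic` is hypothetical.

```python
import numpy as np

def interp_harmonic(A0, A1, phi0, phi1, w0, w1, T):
    """Return (signal, phase) over samples 0..T between two analysis instants."""
    n = np.arange(T + 1)
    amp = A0 + (A1 - A0) * n / T                        # linear amplitude interpolation
    # Unwrapping: number of 2*pi turns that gives the smoothest phase track.
    M = np.round(((phi0 + w0 * T - phi1) + (w1 - w0) * T / 2) / (2 * np.pi))
    d = phi1 + 2 * np.pi * M - phi0 - w0 * T
    alpha = 3 * d / T**2 - (w1 - w0) / T
    beta = -2 * d / T**3 + (w1 - w0) / T**2
    phase = phi0 + w0 * n + alpha * n**2 + beta * n**3  # cubic phase polynomial
    return amp * np.cos(phase), phase

# Harmonic gliding from 100 Hz to 110 Hz over 80 samples at 8 kHz.
w0 = 2 * np.pi * 100 / 8000
w1 = 2 * np.pi * 110 / 8000
sig, phase = interp_harmonic(1.0, 0.8, 0.3, -1.0, w0, w1, 80)
```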
HNM Synthesis Noisy part: generation of normally distributed random numbers, then AR filtering (abrupt changes of coefficients between 2 windows have no noticeable effect…)
HNM Synthesis Results "Carottes": synthesized, original. "Lawyer": synthesized, original. Tuba: synthesized, original. "wazi": synthesized, original. a-e-i-o-u: synthesized, original. Singing: synthesized, original.
HNM Synthesis Results Discours: synthesized, original. "aka aga": synthesized, original. Dussolier: synthesized, original. Andie: synthesized, original, noisy part. "coiffe": synthesized, original.
Synthesis with time-stretching Synthesis instants ($t_s$), analysis instants ($t_a$). The following parameters remain unchanged: noisy part parameters; the pitch; the amplitudes $A_k$ of the harmonics
Synthesis with time-stretching Phase adaptation: simple resampling of the phase trajectories, or "harmonic" rephasing. a-e-i-o-u: original; slow-motion with phase "stretching"; slow-motion with "harmonic" rephasing
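The slides do not define "harmonic" rephasing, so the sketch below is only one plausible reading. The analysis instants are mapped to synthesis instants by dividing durations by the rate, and each harmonic phase is rebuilt by integrating k·f0 along the new time axis, so that harmonic k always advances k times as fast as the fundamental. The function name `stretch_instants_and_phases` and the convention that `rate` below 1 means slower are assumptions.

```python
import numpy as np

def stretch_instants_and_phases(t_a, f0, phi_start, n_harm, rate):
    """Return (t_s, phases), with phases[i, k-1] the phase of harmonic k at t_s[i]."""
    t_a = np.asarray(t_a, dtype=float)
    t_s = t_a[0] + (t_a - t_a[0]) / rate          # synthesis instants
    k = np.arange(1, n_harm + 1)
    phases = np.zeros((len(t_s), n_harm))
    phases[0] = phi_start
    for i in range(1, len(t_s)):
        mean_f0 = 0.5 * (f0[i - 1] + f0[i])       # pitch is unchanged by the stretch
        phases[i] = phases[i - 1] + 2 * np.pi * k * mean_f0 * (t_s[i] - t_s[i - 1])
    return t_s, phases

# Five frames, 10 ms apart, constant 100 Hz pitch, played at half speed.
t_a = np.arange(5) * 0.01
t_s, phases = stretch_instants_and_phases(t_a, np.full(5, 100.0), np.zeros(3), 3, 0.5)
```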
Final results Original 1 Synthesized with rate : "carottes" : "lawyer" : tuba : "wazi" : singing : "a-e-i-o-u" : Dussolier : Discours : Andie : "aka aga": "coiffe" :
Conclusions Good results, showing the method's potential for different applications, including TTS. Future work will include other kinds of modifications (pitch shifting, timbre, etc.)