High Quality Voice Morphing
Hui Ye & Steve Young
Cambridge University Engineering Department
August 2004
Baseline System
- Pitch Synchronous Harmonic Model for speech representation and modification.
- Modification examples: pitch scale 1.7; time scale 1.7.
Transform-based Conversion
[Diagram]
- Training: Source Speaker and Target Speaker -> Extract Spectral Envelope -> Time alignment -> Estimate Transforms
- Conversion: Source Speaker -> Extract Spectral Envelope -> Spectral envelopes conversion -> Converted speech
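The "Estimate Transforms" step can be illustrated with a minimal sketch. It assumes the source and target spectral-envelope vectors have already been time-aligned frame by frame, and it fits a single global affine transform by least squares. The slides do not state how many transforms are used or their exact form, so the function names `estimate_affine_transform` and `apply_affine_transform` and the single-transform simplification are assumptions made for illustration only.

```python
import numpy as np

def estimate_affine_transform(src, tgt):
    """Fit W by least squares so that [src, 1] @ W approximates tgt.

    src, tgt: arrays of shape (n_frames, dim), assumed time-aligned
    so that row i of src corresponds to row i of tgt.
    Returns W of shape (dim + 1, dim); the last row is the bias.
    """
    # Append a constant column so the fit includes a bias term.
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, tgt, rcond=None)
    return W

def apply_affine_transform(src, W):
    """Convert source envelope vectors with a previously fitted W."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    return X @ W
```

Because the fit minimizes the average squared error over all aligned frames, the converted vectors tend toward an average of the target vectors they were trained against, which is the smoothing effect the later slides describe.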
Analysis of the distortion in the baseline system suggested 3 problem areas:
1. Spectral Distortion
2. Unnatural Phase Dispersion
3. Transformation of Unvoiced Sounds
Solutions have therefore been developed in each of these areas.
1. Spectral Distortion
- Formant structure has been transformed.
- Spectral details are lost due to reduced LSF dimensionality.
- Spectral peaks are broadened by the averaging effect of least-squares error estimation.
Spectral Residual Selection
- Idea: reintroduce the lost spectral details into the converted envelopes.
- Use a codebook selection method to construct a residual.
- Post-filtering applies a perceptual filter to the converted spectral envelope.
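The codebook selection idea can be sketched as a nearest-neighbour lookup. The slides do not describe how the codebook is built or how entries are matched, so everything below is an assumption: each hypothetical codebook entry pairs a key vector (in the same space as the converted envelope features) with a stored log-spectral residual, the entry with the closest key is chosen for each frame, and its residual is added back to the converted log envelope.

```python
import numpy as np

def select_residuals(converted, keys, residuals):
    """Pick one stored residual per frame by nearest codebook key.

    converted: (n_frames, dim) converted envelope feature vectors.
    keys:      (n_entries, dim) codebook keys (assumed layout).
    residuals: (n_entries, n_bins) log-spectral residual per entry.
    Returns (selected_residuals, chosen_indices).
    """
    # Squared Euclidean distance from every frame to every key.
    diff = converted[:, None, :] - keys[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    idx = dist.argmin(axis=1)
    return residuals[idx], idx

def add_residual(log_envelope, selected):
    """Reintroduce detail by adding the residual in the log domain."""
    return log_envelope + selected
```

Working in the log-spectral domain is a design choice of this sketch, not something the slides state; it simply makes "envelope plus residual" an addition.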
2. Unnatural Phase Dispersion
- In the baseline system, the converted spectral envelope was combined with the original phases. This results in converted speech with a “harsh” quality.
- Spectral magnitudes and phases of human speech are highly correlated.
- Simultaneously modelling the magnitudes and phases and then converting them both via a single unified transform is extremely difficult.
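Phase Prediction
If we can predict the waveform shape, then we can predict the phases.
[Diagram]
- Training: target spectral envelopes and target waveforms St -> GMM and Estimator -> template signals T1, ..., TM
- Prediction: spectral envelope vt -> soft classifier P(C1|vt), ..., P(CM|vt) -> weighted template signals -> predicted waveform St’ -> extract phases Φt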
Phase Prediction Implementation
The set of template signals (codebook entries) T=[T1,…,TM] can be estimated by minimizing the waveform shape prediction error.
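One way to read the template estimation step is as an ordinary least-squares problem. The sketch below assumes the prediction error is the summed squared difference between each target waveform St and the posterior-weighted sum of templates, that is sum over t of ||St - sum over m of P(Cm|vt) Tm||^2, and that every waveform frame has been normalized to a common length. Neither the exact error definition nor the frame normalization appears in the slides, so both are assumptions, as are the function names.

```python
import numpy as np

def estimate_templates(posteriors, waveforms):
    """Least-squares template signals T (one row per mixture class).

    posteriors: (n_frames, M) with row t holding P(C1|vt)..P(CM|vt).
    waveforms:  (n_frames, L) target waveform frames St, assumed to be
                normalized to a common length L.
    Returns T of shape (M, L) minimizing ||waveforms - posteriors @ T||^2.
    """
    T, *_ = np.linalg.lstsq(posteriors, waveforms, rcond=None)
    return T

def predict_phases(posteriors, templates):
    """Predict waveform shapes St' and read the phases off their spectra."""
    predicted = posteriors @ templates          # St' for every frame
    spectrum = np.fft.rfft(predicted, axis=1)   # one-sided spectrum
    return np.angle(spectrum)                   # phases in radians
```

At conversion time only `predict_phases` is needed: the posteriors come from the soft classifier applied to the converted envelope, and the resulting phases replace the copied source phases.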
Phase Prediction Result: phase prediction vs copying source phases
Method                 SNR
Phase prediction       7.2
Copy source phases     3.2
[Figure: waveforms of the original signal, phase prediction, and copied source phases; axes: Amplitude vs Time]
Phase Prediction Result: phase prediction vs phase codebook
Method                 SNR
Phase prediction       7.2
Phase codebook         6.1
[Figure: waveforms of the original signal, phase prediction, and phase codebook; axes: Amplitude vs Time]
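The slides report SNR values without defining the measure or its unit. A common choice for comparing a reconstructed waveform against the original is the ratio of signal energy to error energy, expressed in decibels. The sketch below implements that definition as an assumption; the function name `waveform_snr_db` is hypothetical and the slides' figures may have been computed differently.

```python
import numpy as np

def waveform_snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between two equal-length waveforms.

    The "noise" is the sample-wise difference between the reference
    (original) signal and the estimate (reconstruction).
    Identical inputs give zero noise energy and therefore infinity.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```

For example, a constant reference of 1.0 reconstructed as 0.9 has an error energy one hundredth of the signal energy, which this function reports as about 20 dB.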
3. Transforming Unvoiced Sounds
- In our baseline system, the unvoiced sounds are not transformed.
- In reality, many unvoiced sounds have some vocal tract colouring which affects the speech characteristics.
- A unit selection approach was therefore developed to transform the unvoiced sounds.
Experiments
- Training data: OGI Voice Corpus, 12 speakers, each with about 5 minutes of parallel speech data.
- Four conversion tasks: male to male, male to female, female to male, female to female.
Subjective Evaluation
Test                                   Baseline system   Enhanced system
ABX (identify target speaker)          86.4%             91.8%
Preference (which is more natural)     38.9%             61.1%
Examples: Voice Transformation with parallel training data
[Examples grid. Columns: Source; Baseline + shifted pitch; Enhanced + target prosody; Target. Rows: M to F; F to M; F to F; M to M.]
Unknown Speaker Voice Transformation
- No pre-existing training data is available from the source speaker, although there is still a reasonable amount of speech data from the designated target speaker.
- Use speech recognition to create a mapping between the unknown input source speech and the target vectors.
[Examples. Columns: Source; Converted; Target. Rows: Female; Male.]
Summary
- A complete solution to the voice morphing problem has been developed which can deliver reasonable quality.
- However, there is still some way to go before these techniques can support high-fidelity studio applications.
Future Work
- Improve the quality of the converted speech
- Unknown speaker voice conversion
- Cross-language voice conversion