High Quality Voice Morphing


1 High Quality Voice Morphing
Hui Ye & Steve Young, Cambridge University Engineering Department, August 2004

2 Baseline System
Pitch Synchronous Harmonic Model for speech representation and modification. Pitch scale 1.7. Time scale 1.7.

3 Transform-based Conversion
Training: source speaker and target speaker → extract spectral envelope → time alignment → estimate transforms.
Conversion: source speaker → extract spectral envelope → spectral envelope conversion → converted speech.
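The time-alignment step pairs source and target frames before the transforms are estimated. The slide does not name the alignment method; the sketch below assumes dynamic time warping (DTW) over per-frame feature vectors with Euclidean distance. The function name `dtw_align` is hypothetical.

```python
import math


def dtw_align(src, tgt):
    """Return (i, j) index pairs aligning src frames to tgt frames.

    src and tgt are lists of equal-length feature vectors (lists of floats).
    Assumption: plain DTW with Euclidean frame distance and the usual
    three-way step pattern; the presentation does not specify the method.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the final cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both sequences
            (cost[i - 1][j], i - 1, j),          # advance source only
            (cost[i][j - 1], i, j - 1),          # advance target only
        ]
        _, i, j = min(steps)
    path.reverse()
    return path
```

Each returned pair gives one aligned source/target frame, which is the kind of paired data a least-squares transform estimate would need.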

4 Analysis of the distortion in the baseline suggested three problem areas:
1. Spectral Distortion
2. Unnatural Phase Dispersion
3. Transformation of Unvoiced Sounds
Solutions have therefore been developed in each of these areas.

5 Spectral Distortion
Formant structure has been transformed. Spectral details are lost due to reduced LSF dimensionality. Spectral peaks are broadened by the averaging effect of least-squares error estimation.

6 Spectral Residual Selection
Idea: reintroduce the lost spectral details to the converted envelopes. Use a codebook selection method to construct a residual. Post-filtering applies a perceptual filter to the converted spectral envelope.

7 2. Unnatural Phase Dispersion
In the baseline system, the converted spectral envelope was combined with the original phases. This results in converted speech with a “harsh” quality. Spectral magnitudes and phases of human speech are highly correlated. To simultaneously model the magnitudes and phases and then convert them both via a single unified transform is extremely difficult.

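8 Phase Prediction
If we can predict the waveform shape, then we can predict the phases.
[Diagram: a GMM soft classifier maps the target spectral envelope vt to posteriors P(C1|vt), …, P(CM|vt); these weight the template signals T1, …, TM to give the predicted waveform St', from which the phases Φt are extracted. An estimator fits the templates against the original signal St.]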

9 Phase Prediction Implementation
The set of template signals (codebook entries) T=[T1,…,TM] can be estimated by minimizing the waveform shape prediction error.
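The transcript does not preserve the error formula, so the following is only a sketch of one plausible reading: the predicted waveform is taken to be the posterior-weighted sum of the templates, the prediction error is the summed squared difference from the original frames, and phases are read from the DFT of the predicted waveform. It assumes each frame is a single pitch period resampled to a fixed length, so that DFT bin k corresponds to harmonic k. All function names are hypothetical.

```python
import cmath
import math


def predict_waveform(posteriors, templates):
    """Posterior-weighted sum of template signals (assumed prediction rule)."""
    length = len(templates[0])
    return [
        sum(p * t[n] for p, t in zip(posteriors, templates))
        for n in range(length)
    ]


def prediction_error(frames, posteriors_per_frame, templates):
    """Summed squared error between original frames and their predictions.

    Under the assumption above, this is the quantity that would be
    minimised over the templates during training.
    """
    total = 0.0
    for frame, posteriors in zip(frames, posteriors_per_frame):
        predicted = predict_waveform(posteriors, templates)
        total += sum((s - p) ** 2 for s, p in zip(frame, predicted))
    return total


def harmonic_phases(waveform, num_harmonics):
    """Phases of DFT bins 1..num_harmonics of a one-period waveform."""
    n_samples = len(waveform)
    phases = []
    for k in range(1, num_harmonics + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, x in enumerate(waveform)
        )
        phases.append(cmath.phase(coeff))
    return phases
```

Because the assumed prediction is linear in the templates, minimising this error is an ordinary least-squares problem; the fitting step itself is left out here.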

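10 Phase Prediction Result
Phase prediction vs. copying source phases
SNR: phase prediction 7.2; copying source phases 3.2
[Figure: waveforms (amplitude vs. time) of the original signal, the phase-predicted signal, and the signal with copied source phases.]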
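The slides report SNR figures without giving the formula or units. A minimal sketch, assuming the usual waveform signal-to-noise ratio in decibels between the original and the reconstructed signal (the function name `snr_db` is hypothetical):

```python
import math


def snr_db(original, reconstructed):
    """Waveform SNR in dB: signal energy over reconstruction-error energy.

    Assumption: this standard definition is what the slides mean by SNR;
    the presentation itself does not state the formula.
    """
    signal_energy = sum(s * s for s in original)
    noise_energy = sum((s - r) ** 2 for s, r in zip(original, reconstructed))
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / noise_energy)
```

Under this definition, a higher value means the reconstructed waveform shape is closer to the original.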

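11 Phase Prediction Result
Phase prediction vs. phase codebook
SNR: phase prediction 7.2; phase codebook 6.1
[Figure: waveforms (amplitude vs. time) of the original signal, the phase-predicted signal, and the phase-codebook signal.]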

12 3. Transforming Unvoiced Sounds
In our baseline system, the unvoiced sounds are not transformed. In reality, many unvoiced sounds have some vocal tract colouring which affects the speech characteristics. A unit selection approach was therefore developed to transform the unvoiced sounds.

13 Experiments
Training data: OGI Voice Corpus, 12 speakers, each with about 5 minutes of parallel speech data.
Four conversion tasks: male to male, male to female, female to male, female to female.

14 Subjective Evaluation
ABX test (identify target speaker): baseline system 86.4%, enhanced system 91.8%.
Preference test (which is more natural): baseline system 38.9%, enhanced system 61.1%.

15 Examples
Voice transformation with parallel training data.
[Audio examples for M to F, F to M, F to F, and M to M: source, baseline + shifted pitch, enhanced + target prosody, target.]

16 Unknown Speaker Voice Transformation
No pre-existing training data is available from the source speaker, although there is still a reasonable amount of speech data from the designated target speaker. Use speech recognition to create a mapping between the unknown input source speech and the target vectors.
[Audio examples: source, converted, and target, for female and male speakers.]

17 Summary
A complete solution to the voice morphing problem has been developed which can deliver reasonable quality. However, there is still some way to go before these techniques can support high-fidelity studio applications.
Future Work
Improve the quality of the converted speech
Unknown speaker voice conversion
Cross-language voice conversion

