Published by Penelope Chandler. Modified over 9 years ago.
VOICE CONVERSION METHODS FOR VOCAL TRACT AND PITCH CONTOUR MODIFICATION
Oytun Türk, Levent M. Arslan
R&D Dept., SESTEK Inc., and EE Eng. Dept., Boğaziçi University, İstanbul, Turkey
oytun@sestek.com.tr, arslanle@boun.edu.tr

MOTIVATION
Vocal tract and pitch characteristics have a dominant role in the perception of speaker identity.
Detailed modeling of the vocal tract and the pitch contour is required for high-quality voice conversion.
As the number of parameters increases, transformation of the vocal tract spectrum becomes problematic (distortion at the output).
This study proposes two methods for detailed modeling and transformation of the vocal tract spectra and the pitch contours.
The new methods are compared with existing ones in a subjective listening test.

VOICE CONVERSION
Formant frequencies, sinusoidal model parameters, and LSFs are used for transformation of the vocal tract spectrum.
Codebooks can be used to represent the mapping between the source and target speakers' acoustical spaces.
Vocal tract spectra and pitch are processed separately.

SELECTIVE PRE-EMPHASIS SYSTEM
Motivation
Pre-emphasis enhances the numerical properties of LPC analysis.
We combine pre-emphasis with perceptual sub-band processing to model the vocal tract spectra in detail.
Detailed analysis with a lower LP order is possible at high sampling rates.
Example: an LP order of 50 might be required at 44 kHz. An LP order of 24 is sufficient with selective pre-emphasis.

SELECTIVE PRE-EMPHASIS ANALYSIS
1. Bandpass filtering and frame-by-frame processing
2. LP analysis on each sub-band component
3. Full-band spectrum estimated as: [equation not included in this transcript]
4. k1: lower cut-off of the (i+1)th filter; k2: higher cut-off of the ith filter

SELECTIVE PRE-EMPHASIS SYNTHESIS
1. Use synthesis vocal tract and excitation spectra
2. Inverse Fourier transform
3. Perform overlap-add synthesis

EXAMPLE: LP vs. selective pre-emphasis based spectral estimation
[Figure not included in this transcript]

SELECTIVE PRE-EMPHASIS TRAINING
1. Same utterances from source and target speaker
2. Automatic alignment by sentence HMMs and/or manual alignment
3. Selective pre-emphasis
4. Generate LSF codebooks for each sub-band component

SELECTIVE PRE-EMPHASIS TRANSFORMATION
1. Selective pre-emphasis applied on the source utterance
2. Full-band excitation spectrum modified separately
3. Weighted average of corresponding target sub-band codebook entries
4. Full-band vocal tract spectrum estimated using selective pre-emphasis based synthesis

SEGMENTAL PITCH CONTOUR MODEL
1. Source and target utterances aligned, pitch contours extracted
2. Target pitch contours linearly interpolated in unvoiced segments
3. Corresponding target segment found for each voiced source pitch contour segment
4. Pitch contour extracted for the source utterance to be transformed
5. Minimum Mahalanobis distance source segments are found
6. Transformed contour is synthesized as a weighted average of the corresponding target segments

EVALUATIONS
Subjective test using 30 sentences and 50 words in Turkish, recorded at 44 kHz, 8 speakers (4 female, 4 male)
Vocal tract conversion: STASC, DWT, selective pre-emphasis
Pitch conversion: mean-variance, segmental
Transplantations: vocal tract, vocal tract + pitch
10 subjects listened to 112 triples of sound files.
Each triple consisted of a source recording, a target recording, and an output recording.
The output recording is either a source or target recording, the output of voice conversion, or the output of acoustic feature transplantation.
Scores: Identity (0.0, 0.5, 1.0), Confidence (1-5), Quality (1-5)
Means and interquartile ranges calculated

EVALUATIONS: Subjective Test Results
[Results figure not included in this transcript]

CONCLUSIONS
Lower scores are obtained when only the vocal tract is converted.
Confidence and quality scores decrease as the amount of processing increases.
STASC is more robust across different gender combinations.
Selective pre-emphasis performs well at a lower prediction order. It can be used to apply different amounts of resolution in different sub-bands.
Segmental pitch conversion improves identity scores.
Source, target, and third speakers were identified perfectly.

REFERENCES
[1] Turk, O., "New Methods For Voice Conversion", M.S. Thesis, Bogazici University, 2003.
[2] Gutierrez-Arriola, J.M., Hsiao, Y.S., Montero, J.M., Pardo, J.M., and Childers, D.G., "Voice Conversion Based On Parameter Transformation", in Proc. of the ICSLP 1998, Vol. 3, pp. 987-990, Sydney, Australia.
[3] Stylianou, Y., Cappe, O., and Moulines, E., "Continuous Probabilistic Transform for Voice Conversion", IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 2, 1998, pp. 131-142.
[4] Arslan, L.M., "Speaker Transformation Algorithm Using Segmental Codebooks", Speech Communication, Vol. 28, 1999, pp. 211-226.
[5] Kain, A.B., and Macon, M., "Personalizing A Speech Synthesizer by Voice Adaptation", in Proc. of the 3rd ESCA/COCOSDA International Speech Synthesis Workshop, 1998, pp. 225-230.
[6] Chappell, D.T., and Hansen, J.H.L., "Speaker-Specific Pitch Contour Modeling and Modification", in Proc. of the ICASSP 1998, Vol. II, pp. 885-888, Seattle, USA.
[7] Turk, O., and Arslan, L.M., "Subband Based Voice Conversion", in Proc. of the ICSLP 2002, Vol. 1, pp. 289-292, Denver, Colorado, USA.

EUROSPEECH 2003 (Interspeech 2003), 8th European Conference on Speech Communication and Technology, September 1-4, 2003, Geneva, Switzerland
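
ILLUSTRATION: pre-emphasis followed by LP analysis
The poster states that pre-emphasis improves the numerical properties of LPC analysis, and that LP analysis is run on each sub-band component. The sketch below shows only the generic building blocks: a first-order pre-emphasis filter and autocorrelation-method LP analysis with the Levinson-Durbin recursion. It is not the authors' selective pre-emphasis scheme. The function names, the coefficient `alpha`, and the single-band setup are assumptions made for illustration, and the bandpass filtering step is omitted.

```python
def pre_emphasis(x, alpha=0.95):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns (a, err) where A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    and err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("frame has no energy left to predict")
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def lp_analysis(frame, order, alpha=None):
    """LP analysis of one frame, optionally after pre-emphasis."""
    if alpha is not None:
        frame = pre_emphasis(frame, alpha)
    return levinson_durbin(autocorrelation(frame, order), order)


if __name__ == "__main__":
    # Decaying exponential: impulse response of a one-pole filter at 0.9.
    frame = [0.9 ** n for n in range(200)]
    coeffs, err = lp_analysis(frame, order=1)
    print(coeffs)
```

In a sub-band variant, `lp_analysis` would be called once per bandpass-filtered component with a modest order, which is the idea behind using an LP order of 24 instead of 50 at 44 kHz.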
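
ILLUSTRATION: segmental pitch contour transformation
Steps 5 and 6 of the segmental pitch contour model (find the minimum Mahalanobis distance source segments, then synthesize a weighted average of the corresponding target segments) can be sketched as follows. The poster does not specify how segments are represented or weighted, so several choices here are assumptions: segments are resampled to a fixed length, the covariance is taken as diagonal, and the weights decay exponentially with distance (controlled by the hypothetical parameters `n_best` and `gamma`).

```python
import math


def resample(contour, n):
    """Linearly resample a pitch contour segment to n points (n >= 2)."""
    if len(contour) == 1:
        return [float(contour[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def diag_variances(vectors, floor=1e-6):
    """Per-dimension variance of the codebook (diagonal covariance assumption)."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, floor)
            for d in range(dim)]


def mahalanobis(x, y, variances):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


def transform_segment(segment, src_codebook, tgt_codebook, n_best=2, gamma=1.0):
    """Weighted average of the target segments paired with the closest source segments."""
    variances = diag_variances(src_codebook)
    nearest = sorted((mahalanobis(segment, s, variances), i)
                     for i, s in enumerate(src_codebook))[:n_best]
    weights = [math.exp(-gamma * d) for d, _ in nearest]
    total = sum(weights)
    return [sum(w * tgt_codebook[i][k] for w, (_, i) in zip(weights, nearest)) / total
            for k in range(len(segment))]


if __name__ == "__main__":
    # Toy paired codebooks of 3-point pitch segments in Hz (made-up values).
    src = [[100, 110, 120], [150, 140, 130]]
    tgt = [[200, 210, 220], [250, 240, 230]]
    print(transform_segment(resample([100, 120], 3), src, tgt))
```

With `n_best=1` the result is simply the target segment paired with the nearest source segment; larger values blend several target segments, which smooths the output contour.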
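
ILLUSTRATION: mean-variance pitch conversion
The evaluation compares the segmental model against mean-variance pitch conversion, which the poster names but does not define. A common formulation, assumed here, shifts and scales the source f0 so that its mean and standard deviation match the target speaker's training statistics. Treating frames with f0 equal to 0 as unvoiced is also an assumption of this sketch.

```python
from statistics import mean, pstdev


def mean_variance_convert(source_f0, src_train_f0, tgt_train_f0):
    """Map source f0 values so their statistics match the target speaker.

    Frames with f0 <= 0 are treated as unvoiced and passed through as 0.0.
    Statistics are computed over voiced training frames only.
    """
    src_voiced = [f for f in src_train_f0 if f > 0]
    tgt_voiced = [f for f in tgt_train_f0 if f > 0]
    mu_s, sd_s = mean(src_voiced), pstdev(src_voiced)
    mu_t, sd_t = mean(tgt_voiced), pstdev(tgt_voiced)
    if sd_s == 0:
        raise ValueError("source training f0 has zero variance")
    scale = sd_t / sd_s
    return [(f - mu_s) * scale + mu_t if f > 0 else 0.0 for f in source_f0]


if __name__ == "__main__":
    # Made-up f0 values in Hz; 0 marks an unvoiced frame.
    src_train = [100, 110, 0, 120, 130]
    tgt_train = [200, 220, 240, 0, 260]
    print(mean_variance_convert([100, 0, 130], src_train, tgt_train))
```

Because this mapping only adjusts global statistics, it preserves the shape of the source contour, whereas the segmental model replaces local contour shapes with those of the target speaker.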