Toward a high-quality singing synthesizer with vocal texture control. Hui-Ling Lu, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA 94305, USA

Score-to-Singing system (block diagram). Inputs: score, lyrics, singing style. Rule system: lyrics-to-phoneme conversion, musical rules. Control parameters: phoneme, F_0, sound level, duration, vibrato. Sound synthesis: acoustic rendering, co-articulation rules, parametric database. Output: singing voice.

General sound synthesis approaches: physical modeling, spectral modeling, source-filter model.
Physical modeling. Pros: flexible/intuitive control, expressive, co-articulation easy. Cons: analysis/re-synthesis difficult, invasive measurements.
Spectral modeling. Pros: analysis/re-synthesis easy. Cons: less expressive, co-articulation difficult.

Contributions. (1) A pseudo-physical model for singing voice synthesis, which is an approximate physical model, can generate high-quality non-nasal singing voice, has analysis/re-synthesis ability, is computationally affordable, and provides flexible control of vocal textures. (2) An automatic analysis procedure for analysis/re-synthesis. (3) A parametric model for vocal texture control.

Outline: human voice production system; synthesis model; analysis procedure; vocal texture parametric model; vocal texture control demo; contributions and future directions.

The human voice production system (diagram labels): lungs, muscle force, vocal folds, pharyngeal cavity, velum, tongue hump, oral cavity, nasal cavity, oral sound output, nasal sound output.

Oscillation pattern of the vocal folds (figure labels: open phase, closed phase, opening period, closing period). The oscillation results from the balance of the subglottal pressure, the Bernoulli pressure, and the elastic restoring force of the vocal folds. Prephonatory position: the initial configuration of the vocal folds before oscillation begins.

Variation of vocal textures: pressed, normal, breathy.

Simplified human voice production model (block diagram: glottal source, aspiration noise, vocal tract filter, radiation). Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration. The model neglects this interaction because the glottal impedance is very high most of the time.

Source-filter type synthesis model (block diagrams). First form: glottal source and aspiration noise -> vocal tract filter -> radiation. Second form: derivative glottal wave and aspiration noise form the glottal excitation -> vocal tract filter -> voice output.

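Overview of the proposed synthesis model (block diagram). The transformed Liljencrants-Fant model supplies the derivative glottal wave, and the noise residual model supplies the high-passed aspiration noise. Together they form the glottal excitation, which passes through an all-pole filter to give the voice output.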

Transformed Liljencrants-Fant (LF) model. The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, R_d (the wave-shape control parameter).

Transformed Liljencrants-Fant (LF) model. The transformed LF model is an extension of the LF model. It provides a control interface that makes it easy to change the wave shape of the derivative glottal wave. Synthesis: R_d (wave-shape control parameter) -> mapping -> direct synthesis timing parameters -> LF model -> derivative glottal wave. Analysis: estimated derivative glottal wave -> LF fitting -> direct synthesis timing parameters -> inverse mapping -> R_d.
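The slide names a mapping between R_d and the direct synthesis timing parameters but does not state it. The sketch below is one possible form, assuming the regression formulas commonly attributed to Fant's transformed LF model (R_a, R_k, R_g predicted from R_d over roughly 0.3 to 2.7). The constants, the function names `rd_to_lf_timing` and `lf_shape_to_rd`, and the returned dictionary layout are assumptions, not taken from the slides.

```python
def rd_to_lf_timing(rd, f0):
    """Map the wave-shape parameter Rd to LF timing parameters.

    Assumption: regression constants as commonly cited for the
    transformed LF model; they are not given in the slides.
    """
    if not 0.3 <= rd <= 2.7:
        raise ValueError("Rd outside the range assumed for the regression")
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    t0 = 1.0 / f0              # fundamental period in seconds
    tp = t0 / (2.0 * rg)       # instant of peak glottal flow
    te = tp * (1.0 + rk)       # instant of maximum excitation
    ta = ra * t0               # effective return-phase duration
    return {"Ra": ra, "Rk": rk, "Rg": rg, "tp": tp, "te": te, "ta": ta, "T0": t0}


def lf_shape_to_rd(ra, rk, rg):
    """Inverse direction: recover Rd from fitted shape parameters."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)
```

For example, `rd_to_lf_timing(1.0, 100.0)` gives timing values for a 100 Hz cycle, and feeding the resulting R_a, R_k, R_g to `lf_shape_to_rd` returns the same R_d, which mirrors the synthesis and analysis directions on the slide.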

Liljencrants-Fant (LF) model

Noise residual model (block diagram): a Gaussian noise generator followed by amplitude modulation produces the noise residual. Diagram labels: A_n, noise floor B_n, GCI, L, summation (+).
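A minimal sketch of amplitude-modulated Gaussian noise synchronized to glottal closure instants (GCIs) is given here. The slide only names the blocks, so the envelope shape is an assumption: a constant floor `b_n` plus one Hann-shaped bump of height `a_n` per glottal period, placed `lag` of a period after each GCI and lasting `duty` of a period. The function name and all parameter names are hypothetical.

```python
import numpy as np


def noise_residual(gci, n_samples, a_n, b_n, lag=0.0, duty=0.5, seed=0):
    """Return (noise_residual, envelope) for GCI sample indices `gci`.

    Assumed envelope: noise floor b_n plus a Hann bump of height a_n
    in each glottal period (not specified in the slides).
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    env = np.full(n_samples, float(b_n))
    for start, end in zip(gci[:-1], gci[1:]):
        period = int(end - start)
        width = max(int(round(duty * period)), 1)
        lo = int(start) + int(round(lag * period))
        if lo >= n_samples:
            continue
        hi = min(lo + width, n_samples)
        bump = a_n * np.hanning(width + 2)[1:-1]  # strictly positive window
        env[lo:hi] += bump[: hi - lo]
    return env * noise, env
```

Raising `a_n` relative to `b_n` makes the noise more pulsed within each cycle, while raising `b_n` alone gives a steadier hiss.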

Vocal tract filter: an all-pole filter. The vocal tract is modeled as a series of concatenated uniform lossless cylindrical acoustic tubes, and sound waves are assumed to propagate as plane waves along the axis of the vocal tract. Figure labels: tube areas A_1, A_2, ..., A_N between the glottis and the lip end, A_lip, glottal flow U_g, lip flow U_lip, and lip-end coefficients 1-k_N and -k_N.
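To make the all-pole structure concrete, here is a small direct-form sketch that applies 1/A(z) to an excitation signal. It assumes the coefficient convention A(z) = 1 + a[1] z^-1 + ... + a[N] z^-N with a[0] = 1; the function name `all_pole_filter` is hypothetical and the slides do not specify an implementation.

```python
import numpy as np


def all_pole_filter(excitation, a):
    """Filter `excitation` with 1/A(z), assuming a[0] == 1.

    Difference equation: y[n] = x[n] - sum_{k=1..N} a[k] * y[n-k].
    """
    a = np.asarray(a, dtype=float)
    order = len(a) - 1
    y = np.zeros(len(excitation))
    for n, x in enumerate(excitation):
        acc = float(x)
        for k in range(1, min(n, order) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y
```

With `a = [1.0, -0.5]` and a unit impulse as input, the output is the geometric sequence 1, 0.5, 0.25, and so on, which is the impulse response of a single real pole at 0.5.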

Vocal tract filter. Kelly-Lochbaum junction between tubes of area A_m and A_{m+1}: scattering coefficient k_m, with junction gains 1+k_m, 1-k_m, k_m, and -k_m acting on the forward and backward waves U_m^+, U_m^-, U_{m+1}^+, and U_{m+1}^-. If the sampling period is T = 2τ, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter. The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method. τ: the propagation time for a sound wave to travel through one acoustic tube. N: the number of acoustic tubes, excluding the glottis and the lip end.
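The relation between tube areas, scattering coefficients, and all-pole coefficients can be sketched as follows. The sign convention k_m = (A_{m+1} - A_m) / (A_{m+1} + A_m) is an assumption (the slide's formula did not survive), and the sketch covers interior junctions only, leaving out the glottis and lip terminations. `step_up` is the standard lattice-to-polynomial recursion and `step_down` is its inverse, the direction the slide attributes to Durbin's method. All function names are hypothetical.

```python
import numpy as np


def scattering_from_areas(areas):
    """Scattering coefficients at the junctions of adjacent tubes.

    Assumed sign convention: k_m = (A_{m+1} - A_m) / (A_{m+1} + A_m).
    """
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])


def step_up(k):
    """Scattering coefficients -> all-pole denominator [1, a_1, ..., a_M]."""
    poly = np.array([1.0])
    for km in k:
        padded = np.concatenate([poly, [0.0]])
        poly = padded + km * padded[::-1]
    return poly


def step_down(poly):
    """All-pole denominator -> scattering coefficients (needs |k_m| < 1)."""
    poly = np.asarray(poly, dtype=float)
    k = []
    while len(poly) > 1:
        km = poly[-1]
        k.append(km)
        poly = ((poly - km * poly[::-1]) / (1.0 - km ** 2))[:-1]
    return np.array(k[::-1])
```

Because positive areas always give |k_m| < 1, the polynomial returned by `step_up` has all its roots inside the unit circle, so the resulting all-pole filter is stable.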

Overall synthesis model implementation (block diagram). Blocks: vocal texture model, transformed LF model, noise residual model, summation (+), output voice. Signals: degree of breathiness, glottal excitation strength E_e, fundamental frequency F_0, R_d. Other labels: 0.8, (No noise input).

Analysis procedure (flowchart). Desired voice recording -> source-filter de-convolution -> inverse-filtered glottal excitation -> de-noising by wavelet packet analysis, which yields the high-passed aspiration noise and the estimated derivative glottal wave -> fitting the estimated derivative glottal wave via the LF model -> LF model coefficients.

Source-filter de-convolution: synthesis model for analysis. Basic voicing waveform (a, b, OQ) -> low-pass filter -> KLGLOTT88 (KL) derivative glottal wave -> Nth-order all-pole vocal tract filter. Combined form: basic voicing waveform (a, b, OQ) -> (N+1)th-order all-pole filter.
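As an illustration of the basic voicing waveform, the sketch below uses the usual KLGLOTT88 polynomial form, glottal flow a t^2 - b t^3 during the open phase and zero otherwise, with the closure constraint a = b * OQ * T0 so that the flow returns to zero at the end of the open phase. The amplitude scaling (`a` as a free input), the sampling choices, and the function name are assumptions, and the low-pass filter stage named on the slide is left out.

```python
import numpy as np


def klglott88_cycle(f0, fs, oq, a=1.0):
    """One period of the basic voicing waveform and its derivative.

    Assumption: flow(t) = a*t^2 - b*t^3 for 0 <= t < OQ*T0, else 0,
    with b chosen so that the flow is zero again at t = OQ*T0.
    """
    t0 = 1.0 / f0
    n = int(round(fs * t0))
    t = np.arange(n) / fs
    t_open = oq * t0
    b = a / t_open
    is_open = t < t_open
    flow = np.where(is_open, a * t ** 2 - b * t ** 3, 0.0)
    dflow = np.where(is_open, 2.0 * a * t - 3.0 * b * t ** 2, 0.0)
    return t, flow, dflow
```

In this form the flow peaks two thirds of the way through the open phase, and the derivative reaches its most negative value just before closure, which is the main excitation instant.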

Source-filter deconvolution estimation flowchart. Input: voice signal after removing the low-frequency drift. GCI detection splits the signal into one-glottal-period segments. Phase I, loop for each period: loop over different OQ values, estimating the vocal tract filter and glottal source via SUMT; then select and store the 5 best estimates. Phase II, loop for each period: enforce continuity constraints via dynamic programming. Finally, smooth the vocal tract area by time averaging and linear interpolation. Output: estimated model parameter sequence.
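Phase II can be pictured as a shortest-path search over the stored candidates. The sketch below is a generic Viterbi-style dynamic program, not the exact procedure from the slides: it assumes each period has K candidate parameter vectors with a fit error, and that continuity is measured by the squared distance between consecutive candidates, weighted by a hypothetical `smooth_weight`.

```python
import numpy as np


def select_smooth_path(candidates, fit_errors, smooth_weight=1.0):
    """Pick one candidate per period, trading fit error against continuity.

    candidates: array of shape (P, K, D); fit_errors: array of shape (P, K).
    Returns the chosen candidate index for each of the P periods.
    """
    cand = np.asarray(candidates, dtype=float)
    err = np.asarray(fit_errors, dtype=float)
    n_periods, n_cand = err.shape
    cost = err[0].copy()
    back = np.zeros((n_periods, n_cand), dtype=int)
    for p in range(1, n_periods):
        diff = cand[p - 1][:, None, :] - cand[p][None, :, :]
        trans = smooth_weight * np.sum(diff ** 2, axis=2)  # (previous, current)
        total = cost[:, None] + trans
        back[p] = np.argmin(total, axis=0)
        cost = total[back[p], np.arange(n_cand)] + err[p]
    path = np.zeros(n_periods, dtype=int)
    path[-1] = int(np.argmin(cost))
    for p in range(n_periods - 1, 0, -1):
        path[p - 1] = back[p, path[p]]
    return path
```

With `smooth_weight=0` the result reduces to picking the best-fitting candidate in each period independently; larger weights favor parameter tracks that change slowly from period to period.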

Convex optimization formulation. Estimate the model parameters by minimizing the error between the basic voicing waveform and the one estimated by inverse filtering. Diagram: basic voicing waveform (a, b, OQ) -> (N+1)th-order all-pole filter; inverse filter.

Convex optimization formulation. The estimation is posed as a convex optimization problem: minimize an error objective subject to constraints (equations shown on the slide). The error for one glottal cycle is written in vector form, and the L2 norm is used. The problem can be solved by SUMT (sequential unconstrained minimization technique).
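Since the slide's objective and constraints are not recoverable here, the following is only a generic illustration of the SUMT idea on a stand-in problem: minimize ||A x - y||^2 subject to G x <= h. It uses a logarithmic barrier with an increasing weight `t` and damped Newton steps. The problem form, the function name, and every parameter are assumptions, not the formulation used in the talk.

```python
import numpy as np


def sumt_least_squares(A, y, G, h, x0, t0=1.0, mu=10.0, n_outer=8, n_inner=50):
    """Minimize ||A x - y||^2 s.t. G x <= h from a strictly feasible x0."""
    A, y, G, h = (np.asarray(v, dtype=float) for v in (A, y, G, h))
    x = np.asarray(x0, dtype=float).copy()

    def penalized(z, t):
        slack = h - G @ z
        if np.any(slack <= 0):
            return np.inf  # outside the feasible region
        r = A @ z - y
        return t * (r @ r) - np.sum(np.log(slack))

    t = t0
    for _ in range(n_outer):           # sequence of unconstrained problems
        for _ in range(n_inner):       # Newton iterations for fixed t
            slack = h - G @ x
            grad = 2.0 * t * A.T @ (A @ x - y) + G.T @ (1.0 / slack)
            hess = 2.0 * t * A.T @ A + G.T @ (G / slack[:, None] ** 2)
            step = -np.linalg.solve(hess, grad)
            decrement = -(grad @ step)
            if decrement < 1e-10:
                break
            alpha, f_cur = 1.0, penalized(x, t)
            while alpha > 1e-12 and penalized(x + alpha * step, t) > f_cur - 0.25 * alpha * decrement:
                alpha *= 0.5           # backtrack to stay feasible and descend
            if alpha <= 1e-12:
                break
            x = x + alpha * step
        t *= mu                        # tighten the barrier
    return x
```

For instance, minimizing (x - 2)^2 subject to x <= 1 from the starting point 0 drives x toward the constraint boundary at 1 as `t` grows.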

De-convolution result (synthetic data)

Effective analysis/re-synthesis. Baritone examples: normal phonation (original, KLGLOTT88) and pressed phonation (original, KLGLOTT88). Synthesis model: basic voicing waveform (a, b, OQ) -> low-pass filter -> KLGLOTT88 (KL) derivative glottal wave -> Nth-order all-pole vocal tract filter.

De-noising by Wavelet Packet Analysis. A noisy data record: X = f + W. De-noising by best-basis thresholding: transform the noisy data to another basis via wavelet packet analysis, X_B = f_B + W_B, then threshold out the smaller coefficients of X_B, assuming that f can be compactly represented in the new basis by a few large coefficients. Select the wavelet filter by an energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
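The energy compactness criterion and the thresholding step can be sketched on an arbitrary coefficient vector, independent of how the wavelet packet transform itself is computed. The function names `energy_compactness` and `hard_threshold`, and the choice of hard (rather than soft) thresholding, are assumptions for illustration.

```python
import numpy as np


def energy_compactness(coeffs, fraction=0.9):
    """1 / (number of largest coefficients needed to reach `fraction` of the energy)."""
    energy = np.sort(np.asarray(coeffs, dtype=float).ravel() ** 2)[::-1]
    total = energy.sum()
    if total == 0.0:
        return 0.0
    cumulative = np.cumsum(energy)
    needed = int(np.searchsorted(cumulative, fraction * total)) + 1
    return 1.0 / min(needed, len(energy))


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude is below `threshold`."""
    c = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(c) >= threshold, c, 0.0)
```

A basis that concentrates the signal in one dominant coefficient scores 1.0, while a basis that spreads the energy evenly over four coefficients scores 0.25, so the filter with the higher score would be preferred under this criterion.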

De-noising result (synthetic data)

Effective analysis/re-synthesis. Baritone examples (original vs. LF): normal phonation, pressed phonation.

Vocal texture control. The parametric vocal texture control model determines the parameterization of the glottal excitation needed to achieve the desired vocal texture. Control complexity is reduced by exploiting the correlations between the model parameters. Diagram: the desired vocal texture and the glottal excitation strength E_e set R_d (wave-shape control parameter) for the transformed LF model in non-breathy mode; in breathy mode they set both R_d and the noise residual model.

Vocal texture control (non-breathy mode): pressed and normal modes. The wave-shape control parameter R_d and the normalized glottal excitation strength E_e are highly correlated.

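Degree of pressedness (diagram). The degree of pressedness interpolates between the coefficient sets (a_press, b_press, c_press) and (a_normal, b_normal, c_normal) to give (a, b, c). These coefficients relate the glottal excitation strength E_e to the wave-shape control parameter R_d, which drives the transformed LF model to produce the glottal excitation.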

Vocal texture control (breathy mode). NHR is an indicator of the degree of breathiness. The contour of the noise strength is adjusted by NHR. Diagram: the desired vocal texture sets E_e and NHR; R_d drives the transformed LF model; the noise residual model (window, lag, duty cycle, gain, B_n = 1, A_n = 2.4138 * B_n + 0.213) is added (+) to form the glottal excitation. Other labels: NHR per glottal cycle, high-passed noise energy, glottal excitation strength E_e.

Overall synthesis model implementation (block diagram). Blocks: vocal texture model, transformed LF model, noise residual model, summation (+), glottal excitation, output voice. Signals: degree of breathiness, glottal excitation strength E_e, fundamental frequency F_0, R_d. Other label: 0.8.

Vocal texture control demo

Contributions. (1) A pseudo-physical model for singing voice synthesis, which is an approximate physical model, can generate high-quality non-nasal singing voice, has analysis/re-synthesis ability, is computationally affordable, and provides flexible control of vocal textures. (2) An automatic analysis procedure for analysis/re-synthesis. (3) A parametric model for vocal texture control.

Future research. Build a complete score-to-singing system using the proposed synthesis model; its associated analysis procedure will be used to construct the parametric database. Investigate the potential use of the source-filter deconvolution algorithm in low-bit-rate, high-quality speech coding. Explore the application of the analysis procedure to sound transformation of vocal textures.

Thank you!
