Toward a high-quality singing synthesizer with vocal texture control
Hui-Ling Lu, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University

Toward a high-quality singing synthesizer with vocal texture control. Hui-Ling Lu, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA 94305, USA

Score-to-Singing system: score, lyrics, and singing style enter a rule system (lyrics-to-phoneme conversion, musical rules, co-articulation rules) that, together with a parametric database, produces phoneme, F0, sound level, duration, and vibrato controls; acoustic rendering in the sound synthesis stage then generates the singing voice.

General sound synthesis approaches: physical modeling vs. spectral modeling (source-filter model).
- Physical modeling. Pros: flexible/intuitive control, expressive. Cons: analysis/re-synthesis difficult, requires invasive measurements, co-articulation difficult.
- Spectral modeling (source-filter model). Pros: analysis/re-synthesis easy, co-articulation easy. Cons: less expressive.

Contributions:
- A pseudo-physical model for singing voice synthesis, i.e., an approximate physical model that can generate high-quality non-nasal singing voices, has analysis/re-synthesis ability, is computationally affordable, and provides flexible control of vocal textures.
- An automatic analysis procedure for analysis/re-synthesis.
- A parametric model for vocal texture control.

Outline: human voice production system; synthesis model; analysis procedure; vocal texture parametric model; vocal texture control demo; contributions and future directions.

The human voice production system. [Diagram: lungs driven by muscle force; vocal folds; pharyngeal, oral, and nasal cavities; tongue hump and velum; oral and nasal sound outputs.]

Oscillation pattern of the vocal folds (open phase / closed phase; opening period / closing period). The oscillation results from the balancing of the subglottal pressure, the Bernoulli pressure, and the elastic restoring force of the vocal folds. Prephonatory position: the initial configuration of the vocal folds before the onset of oscillation.

Variation of vocal textures: pressed, normal, breathy.

Simplified human voice production model: glottal source → vocal tract filter → radiation, with aspiration noise added at the source. Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration; it is neglected here since the glottal impedance is very high most of the time.

Source-filter type synthesis model: the radiation is folded into the source, so the derivative glottal wave plus aspiration noise forms the glottal excitation, which is filtered by the vocal tract filter to give the voice output.

Overview of the proposed synthesis model: a transformed Liljencrants-Fant model generates the derivative glottal wave, and a noise residual model generates high-passed aspiration noise; their sum, the glottal excitation, drives an all-pole filter to produce the voice output.
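The excitation-plus-all-pole structure above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation; the one-pole coefficient in the example is an arbitrary placeholder standing in for a fitted vocal tract filter.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(excitation, ar_coeffs):
    """Drive an all-pole vocal tract filter with a glottal excitation.

    ar_coeffs = [a1, ..., aN] for H(z) = 1 / (1 - sum_k a_k z^{-k}).
    """
    a = np.concatenate(([1.0], -np.asarray(ar_coeffs, dtype=float)))
    return lfilter([1.0], a, excitation)

# Example: an impulse through a one-pole filter decays geometrically.
impulse = np.zeros(8)
impulse[0] = 1.0
y = synthesize(impulse, [0.5])
```

In the full model the excitation would be the transformed-LF derivative glottal wave plus the high-passed noise residual rather than an impulse.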

Transformed Liljencrants-Fant (LF) model: controls the wave shape of the derivative glottal wave via a single parameter, Rd (the wave-shape control parameter).

The transformed LF model is an extension of the LF model: it provides a control interface that makes it easy to change the wave shape of the derivative glottal wave. Synthesis: Rd → mapping → direct synthesis timing parameters → LF model → derivative glottal wave. Analysis: estimated derivative glottal wave → LF fitting → direct synthesis timing parameters → inverse mapping → Rd.

Liljencrants-Fant (LF) model
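The LF-model equations themselves did not survive transcription, so here is a sketch built from the standard published form of the model: an exponentially growing sinusoid up to the main excitation instant te, then an exponential return phase, with the return-phase constant and the growth constant solved from the usual implicit conditions. The default timing parameters are typical textbook values, not the thesis's, and the Rd-to-timing mapping of the transformed LF model is omitted.

```python
import numpy as np

def lf_wave(fs=16000, f0=110.0, Ee=1.0, tp=0.45, te=0.55, ta=0.02):
    """One period of the LF-model derivative glottal wave.

    tp (glottal flow peak), te (instant of maximum negative flow
    derivative), and ta (return-phase time constant) are normalized
    by the period; Ee is the excitation strength at te.
    """
    T0 = 1.0 / f0
    n = int(round(fs * T0))
    t = np.arange(n) / fs / T0          # normalized time in [0, 1)
    tc = 1.0                            # closure completes at period end
    wg = np.pi / tp

    # Return-phase constant: eps * ta = 1 - exp(-eps * (tc - te)).
    eps = 1.0 / ta
    for _ in range(50):                 # fixed-point iteration
        eps = (1.0 - np.exp(-eps * (tc - te))) / ta

    def wave(alpha):
        E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
        open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
        ret = -(Ee / (eps * ta)) * (np.exp(-eps * (t - te))
                                    - np.exp(-eps * (tc - te)))
        return np.where(t < te, open_phase, ret)

    # Choose alpha so the wave integrates to zero over the period
    # (the glottal flow returns to baseline): bisection on the mean.
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if wave(lo).mean() * wave(mid).mean() <= 0.0:
            hi = mid
        else:
            lo = mid
    return wave(0.5 * (lo + hi))

w = lf_wave()
```

By construction the negative peak of the wave is -Ee at te, and the zero-mean condition makes the integrated glottal flow return to baseline each cycle.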

Noise residual model: a Gaussian noise generator is amplitude-modulated pitch-synchronously; the modulation envelope has amplitude An in a window positioned relative to the glottal closure instant (GCI) with lag L, on top of a noise floor Bn.

Vocal tract filter: an all-pole filter. The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes with areas A1 ... AN plus a lip-end section, and sound waves are assumed to propagate as plane waves along the axis of the vocal tract, with glottal volume velocity Ug at one end and lip volume velocity Ulip at the other.

Kelly-Lochbaum junction: at the junction between adjacent tubes with areas Am and Am+1, the scattering coefficient is km = (Am+1 − Am)/(Am+1 + Am), and the forward and backward traveling waves Um+ and Um− are scattered with gains 1 + km, 1 − km, and ±km. Let τ be the propagation time for a sound wave to travel one acoustic tube, and N the number of tubes excluding the glottis and the lip end. If the sampling period is T = 2τ, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter, and the autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.
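The AR-to-scattering conversion can be sketched with the step-down (backward Levinson-Durbin) recursion, followed by the area recursion implied by the scattering definition above. Sign conventions for reflection coefficients differ between texts, so treat this as a sketch; the lip-end area is an arbitrary normalization.

```python
import numpy as np

def ar_to_reflection(ar):
    """Step-down recursion: AR coefficients [a1..aN] of
    A(z) = 1 + sum_k a_k z^{-k}  ->  reflection coefficients [k1..kN]."""
    a = list(np.asarray(ar, dtype=float))
    ks = []
    while a:
        k = a[-1]                       # k_m = a_m at order m
        if abs(k) >= 1.0:
            raise ValueError("unstable all-pole filter")
        ks.append(k)
        m = len(a)
        a = [(a[i] - k * a[m - 2 - i]) / (1.0 - k * k)
             for i in range(m - 1)]     # reduce to order m-1
    return ks[::-1]

def reflection_to_areas(ks, a_lip=1.0):
    """Tube areas from k_m = (A_{m+1} - A_m) / (A_{m+1} + A_m),
    working inward from the lip-end area."""
    areas = [a_lip]
    for k in reversed(ks):
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas[::-1]
```

For example, stepping up k = [0.5, -0.3] gives the AR coefficients [0.35, -0.3], and the step-down recursion recovers the original ks.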

Overall synthesis model implementation: the vocal texture model takes the degree of breathiness, the glottal excitation strength Ee, and the fundamental frequency F0, and outputs Rd for the transformed LF model (driven by Ee and F0) together with the noise residual model parameters; the transformed-LF wave and the noise residual are summed into the glottal excitation (with no noise input in non-breathy mode).

Analysis procedure: starting from the desired voice recording, source-filter de-convolution produces the inverse-filtered glottal excitation; de-noising by wavelet packet analysis extracts the high-passed aspiration noise; fitting the estimated derivative glottal wave with the LF model yields the LF model coefficients.

Source-filter de-convolution: synthesis model for analysis. The KLGLOTT88 (KL) derivative glottal wave is a basic voicing waveform (parameters a, b, OQ) followed by a low-pass filter; folding the low-pass filter into the Nth-order all-pole vocal tract filter gives an (N+1)th-order all-pole filter driven directly by the basic voicing waveform.
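In Klatt's KLGLOTT88 model the glottal flow over the open phase is U(t) = a·t² − b·t³, and its derivative is the KL derivative glottal wave; a sketch follows, with b fixed so the flow returns to zero exactly at closure. The parameter values are illustrative, not taken from the thesis.

```python
import numpy as np

def klglott88_dgw(fs=16000, f0=110.0, oq=0.6, a=1.0):
    """One period of the KLGLOTT88 derivative glottal wave:
    d/dt [a t^2 - b t^3] over the open phase (duration OQ*T0),
    zero after the abrupt closure; b = a / (OQ*T0) so that the
    flow returns to zero exactly at the closure instant."""
    T0 = 1.0 / f0
    t = np.arange(int(round(fs * T0))) / fs
    b = a / (oq * T0)
    e = 2.0 * a * t - 3.0 * b * t ** 2
    e[t >= oq * T0] = 0.0               # closed phase
    return e

e = klglott88_dgw()
```

The abrupt jump to zero at closure is what makes this waveform convenient for the convex de-convolution formulation: only three parameters (a, b, OQ) describe the source.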

Source-filter de-convolution estimation flowchart. Input: the voice signal after removing the low-frequency drift. Phase I: detect glottal closure instants (GCIs); then, for each glottal period, loop over different OQ values, estimate the vocal tract filter and glottal source via SUMT, and select and store the 5 best estimates. Phase II: enforce continuity constraints across periods via dynamic programming, then smooth the vocal tract area by time averaging and linear interpolation. Output: the estimated model parameter sequence.

Convex optimization formulation: estimate the (N+1)th-order all-pole filter by minimizing the error between the basic voicing waveform (a, b, OQ) and the waveform obtained by inverse-filtering the voice signal with that filter.

A convex optimization problem: minimize the error for one glottal cycle, written in vector form with the L2 norm, subject to the model constraints. The problem can be solved by SUMT (the sequential unconstrained minimization technique).
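SUMT replaces one constrained problem with a sequence of unconstrained ones whose penalty weight grows. Below is a minimal sketch of the exterior-penalty variant for a generic constrained least-squares problem; the actual objective and constraints of the de-convolution are not reproduced here, and the gradient-descent inner solver is deliberately simple.

```python
import numpy as np

def sumt_lsq(A, b, G, h, mu0=1.0, growth=10.0, outer=8, inner=500):
    """SUMT sketch (exterior penalty): minimize ||A x - b||^2 subject
    to G x <= h by solving a sequence of unconstrained problems with
    an increasing quadratic penalty on constraint violations."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]   # warm start, unconstrained
    mu = mu0
    for _ in range(outer):
        # Step size from a Lipschitz bound on the penalized gradient.
        lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2
                    + 2.0 * mu * np.linalg.norm(G, 2) ** 2)
        for _ in range(inner):
            viol = np.maximum(G @ x - h, 0.0)
            grad = 2.0 * A.T @ (A @ x - b) + 2.0 * mu * G.T @ viol
            x = x - lr * grad
        mu *= growth
    return x

# Tiny example: pull [1, 1] toward the half-plane x0 <= 0.5.
A = np.eye(2)
b = np.array([1.0, 1.0])
G = np.array([[1.0, 0.0]])
h = np.array([0.5])
x = sumt_lsq(A, b, G, h)
```

Warm-starting each unconstrained solve from the previous one is what makes the growing-penalty sequence cheap in practice.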

De-convolution result (synthetic data)

Effective analysis/re-synthesis (KLGLOTT88 model). Baritone examples: normal phonation, original vs. KLGLOTT88 re-synthesis; pressed phonation, original vs. KLGLOTT88 re-synthesis.

Analysis procedure: starting from the desired voice recording, source-filter de-convolution produces the inverse-filtered glottal excitation; de-noising by wavelet packet analysis extracts the high-passed aspiration noise; fitting the estimated derivative glottal wave with the LF model yields the LF model coefficients.

De-noising by wavelet packet analysis. A noisy data record: X = f + W. De-noising by best-basis thresholding: transform the noisy data into another basis via wavelet packet analysis, X_B = f_B + W_B, then threshold out the smaller coefficients of X_B, assuming that f can be compactly represented in the new basis by a few large coefficients. Select the wavelet filter by an energy-compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
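The full best-basis search over a wavelet packet tree is more involved, but the thresholding idea can be sketched with a plain multi-level Haar transform (an assumption standing in for the wavelet packet analysis and filter selection described above):

```python
import numpy as np

def haar_dwt(x):
    """Multi-level orthonormal Haar transform (len(x) a power of two).
    Returns [d1, d2, ..., dJ, aJ], details from finest to coarsest."""
    a = np.asarray(x, dtype=float)
    coeffs = []
    while len(a) > 1:
        coeffs.append((a[0::2] - a[1::2]) / np.sqrt(2.0))   # details
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)              # approximation
    coeffs.append(a)
    return coeffs

def haar_idwt(coeffs):
    """Inverse of haar_dwt (exact reconstruction)."""
    a = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2.0)
        up[1::2] = (a - d) / np.sqrt(2.0)
        a = up
    return a

def denoise(x, thresh):
    """Hard-threshold the detail coefficients: keep only the few
    large coefficients assumed to carry the signal f."""
    coeffs = haar_dwt(x)
    kept = [np.where(np.abs(c) > thresh, c, 0.0) for c in coeffs[:-1]]
    return haar_idwt(kept + [coeffs[-1]])

rng = np.random.default_rng(0)
f = np.concatenate([np.zeros(32), np.ones(32)])   # clean "signal"
x = f + 0.2 * rng.standard_normal(64)             # noisy record X = f + W
y = denoise(x, thresh=0.5)
```

Because the step signal compresses into a couple of large Haar coefficients while the noise energy spreads thinly over all of them, thresholding removes most of the noise energy.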

De-noising result (synthetic data)

Analysis procedure: starting from the desired voice recording, source-filter de-convolution produces the inverse-filtered glottal excitation; de-noising by wavelet packet analysis extracts the high-passed aspiration noise; fitting the estimated derivative glottal wave with the LF model yields the LF model coefficients.

Effective analysis/re-synthesis (LF model). Baritone examples: normal phonation, original vs. LF re-synthesis; pressed phonation, original vs. LF re-synthesis.

Vocal texture control: the parametric vocal texture control model determines the parameterization of the glottal excitation that achieves the desired vocal texture, reducing control complexity by exploiting correlations between the model parameters. In both the non-breathy and breathy modes, the desired texture and the glottal excitation strength Ee determine the wave-shape control parameter Rd of the transformed LF model; the breathy mode additionally drives the noise residual model.

Vocal texture control (non-breathy mode): pressed and normal modes. The wave-shape control parameter Rd and the normalized glottal excitation strength Ee are highly correlated.

The degree of pressedness interpolates between the coefficient sets (a_press, b_press, c_press) and (a_normal, b_normal, c_normal) to give (a, b, c), which map the glottal excitation strength Ee to the wave-shape control parameter Rd driving the transformed LF model to produce the glottal excitation.
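The interpolation can be sketched as below. The quadratic form Rd = a·Ee² + b·Ee + c and the coefficient values are purely hypothetical stand-ins: the slide gives only the (a, b, c) structure of the mapping, not its functional form or numbers.

```python
# Hypothetical coefficient sets for the Ee -> Rd mapping (illustrative only).
PRESSED = (0.2, -0.8, 0.9)
NORMAL = (0.1, -0.5, 1.3)

def rd_from_ee(ee, pressedness):
    """Interpolate (a, b, c) between the pressed and normal sets by the
    degree of pressedness in [0, 1], then evaluate the assumed quadratic
    Rd = a*Ee^2 + b*Ee + c."""
    a, b, c = (pressedness * p + (1.0 - pressedness) * n
               for p, n in zip(PRESSED, NORMAL))
    return a * ee ** 2 + b * ee + c
```

Since the interpolation is linear in the coefficients, Rd itself varies linearly with the degree of pressedness at any fixed Ee.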

Vocal texture control (breathy mode): the noise-to-harmonics ratio (NHR) is an indicator of the degree of breathiness, and the contour of the noise strength is adjusted by NHR. Given the desired vocal texture, Ee and NHR are set; Ee determines Rd for the transformed LF model, while NHR (measured per glottal cycle from the high-passed noise energy relative to the glottal excitation strength Ee) sets the noise residual model's gain, window lag, and duty cycle, with An = gain × Bn and Bn = 1. The noise residual and the transformed-LF wave are summed into the glottal excitation.

Overall synthesis model implementation: the vocal texture model maps the degree of breathiness, the glottal excitation strength Ee, and the fundamental frequency F0 to Rd for the transformed LF model (driven by Ee and F0) and to the noise residual model parameters; the transformed-LF wave and the noise residual are summed into the glottal excitation.

Vocal texture control demo

Contributions:
- A pseudo-physical model for singing voice synthesis, i.e., an approximate physical model that can generate high-quality non-nasal singing voices, has analysis/re-synthesis ability, is computationally affordable, and provides flexible control of vocal textures.
- An automatic analysis procedure for analysis/re-synthesis.
- A parametric model for vocal texture control.

Future research: build a complete score-to-singing system using the proposed synthesis model, with the associated analysis procedure used to construct the parametric database; investigate the source-filter de-convolution algorithm for low-bit-rate, high-quality speech coding; explore applying the analysis procedure to sound transformation of vocal textures.

Thank you!