National Cheng Kung University, Tainan, TAIWAN

Synthesis Unit and Question Set Definition for Mandarin HMM-based Singing Voice Synthesis
***Problems using technology for system goal***
Speaker notes: Hello, professors, senior and junior labmates, and classmates. The title of the master's thesis I am presenting today is: xxxxxxx. I am today's presenter: 純珊.
Student: Ju-Yun Cheng
Advisor: Prof. Chung-Hsien Wu
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, TAIWAN

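Outline
Introduction: Background, Motivation, Related work
Singing voice synthesis system
Evaluation
Discussion
Conclusion
Future work
Speaker notes: The outline covers (1) the background of the thesis, (2) its motivation and goals, (3) related research and references, (4) the problem to be solved and the proposed method, (5) the steps of the proposed method, (6) experiments and discussion, and (7) results.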

Introduction - Background
Speech and singing are both important ways to communicate and express emotion.
Speech synthesizers can already generate fluent and natural speech, even with personal characteristics.
Singing voice synthesis has recently become an emerging and popular research topic: it enables computers to sing any song without an actual human singer.

There are two main methods in corpus-based singing synthesis.
Sample-based approach (unit selection): appropriate sub-word units are selected from large speech databases.
Pros: high-quality speech at the waveform level.
Cons: requires a huge amount of recorded data; discontinuities; unstable quality; fixed voice characteristics.
Diagram blocks: lyrics, note, score editor, synthesis score, sample selection, concatenation, synthesis output, singer library.

Sample-based approach (unit selection)
Units are chosen from a singing voice corpus according to the lyrics of the song and the corresponding MIDI file [Zhou, 2008].
Vocaloid (Vocal + loid): a singing synthesizer developed by Yamaha Corporation, initially released in January 2004. It uses pitch conversion and timbre manipulation to concatenate samples smoothly.
[Zhou, 2008] :

There are two main methods in corpus-based singing synthesis.
Statistical approach (HMM-based): parameters are modeled with context-dependent HMMs, and waveforms are generated from the HMMs.
Pros: relatively little training data; smooth and stable quality; flexibility to control voice characteristics.
Cons: vocoder sound; over-smoothing.
Diagram blocks: singing waveform, labels, parameter extraction, singing parameters, acoustic model training, acoustic model parameters, parameter generation, waveform generation, synthesis output.

Statistical approach (HMM-based)
Sinsy: a free online singing voice synthesis service that provides Japanese and English versions.
Users obtain synthesized singing voices by uploading musical scores represented in MusicXML.

Other methods for singing voice synthesis
HNM (Harmonic plus Noise Model): the HNM parameters of a source syllable are used to synthesize singing syllables of diverse pitches and durations [Gu, 2008].
Speech-to-singing: the singing voice is synthesized by a parameter control model from the lyrics of a song and its musical score [Akagi, 2007]. The lyrics are converted into speech by TTS, and a melody control model then converts the speech signal into a singing voice by modifying the acoustic parameters [Cai, 2011].
[Gu, 2008] : Mandarin Singing-voice Synthesis Using an HNM Based Scheme
[Cai, 2011] : A Lyrics to Singing Voice Synthesis System with Variable Timbre

Introduction - Motivation
To synthesize a smooth and continuous singing voice, we chose the HMM-based method to build our singing voice synthesis system.
HMMs can model the temporal sequence of the singing voice.
Parameters are generated from an HMM composed by concatenating phoneme HMMs.
Diagram labels: HMM state sequence, state duration, spectral and lf0 parameters.

Introduction - Improvement in Sinsy
The following series of papers was written by the team that produced Sinsy.
[An HMM-based Singing Voice Synthesis System, 2006]: the first paper about an HMM-based singing voice synthesis system.
[HMM-based Singing Voice Synthesis System using Pitch-shifted Pseudo Training Data, 2010]: to increase the amount of F0 training data, pitch-shifted pseudo data can be prepared by shifting F0 up or down in half-tone steps.
[Recent Development of the HMM-based Singing Voice Synthesis System – Sinsy, 2010]: introduces the free online singing voice synthesis service.
[Pitch Adaptive Training For HMM-based Singing Voice Synthesis, 2012]: model-level normalization of pitch.

Singing voice synthesis system - features extraction
STRAIGHT [H. Kawahara 1997]
A high-quality analysis-synthesis method that offers high flexibility in parameter manipulation with no further degradation.
It extracts parameters with relatively good performance even in a non-professional recording environment.
Features: pitch, smoothed spectrum, aperiodic factors.
Diagram labels (analysis): waveform, fixed-point analysis F0 extraction, F0, smoothed spectrum, aperiodic factors.
Diagram labels (synthesis): mixed excitation with phase manipulation, synthetic waveform.

Singing voice synthesis system - Proposed method for Mandarin singing
Speech vs. Singing Pitch contour Database, Model definition, question set

Speech vs. Singing Music Score pitch: duration: key: tempo: beat:

Singing voice synthesis system - Proposed method for Mandarin singing
Different from Sinsy
Language: from Japanese to Mandarin
Database, model definition, question sets
Refinement
Japanese syllabary (hiragana): Japanese syllables are basically "consonant + vowel", with only five vowels.
Bopomofo: 37 existing symbols (21 initials, 16 finals).

Acoustic parameters Model Question sets linguistic info note info cue info Singing Database Different from Sinsy Different from TTS Only for Mandarin Specially for singing

Singing voice synthesis system - system structure
Training phase: singing voice database; excitation, spectral, and aperiodicity parameter extraction; labels and question set; training of HMMs; CART-based state tying; context-dependent HMMs and duration models.
Synthesis phase: musical score; conversion to labels; state selection by CART; parameter generation from HMMs; excitation, spectral, and aperiodicity generation; synthesis filter; synthesized singing voice.

Singing voice database construction: building a singing voice database for training and synthesis (MHMC Singing Voice Database, Mandarin singing).
Model definition: initial and final modification; medial modification; long duration models.
Question set definition for decision trees: modification for Mandarin.
Refinements: pitch coverage by pitch-shift pseudo data; vibrato.

Singing voice synthesis system - singing voice database construction
Singing corpus design process (diagram blocks): music score corpus, song selection, selected scores, phonetic transcription, singing signal, segmentation by phoneme, singing database.

Singing voice synthesis system - singing voice database construction
Song selection
Selecting scores: music books and internet versions.
Selection criteria and specialization: simple songs that do not require much singing skill; phone coverage.
Digitized data format: MusicXML.
Transposition to an appropriate pitch range.

Singing voice synthesis system - Model definition
Diagram labels: sheet music score, key in, MusicXML file, convert, MusicXML format.
MusicXML is an XML-based file format for representing Western musical notation. The format is proprietary, but fully and openly documented.
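Reading the note information out of a MusicXML score can be sketched as follows. This is a minimal illustration, not the thesis's converter: the embedded one-measure score and the function name `read_notes` are hypothetical, and only the standard MusicXML elements `divisions`, `pitch` (`step`, `alter`, `octave`), `rest`, `duration`, and `lyric/text` are read.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-measure score; real files carry many more elements.
SCORE = """<score-partwise version="3.0">
  <part id="P1">
    <measure number="1">
      <attributes><divisions>8</divisions></attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>8</duration>
        <lyric><text>小</text></lyric>
      </note>
      <note><rest/><duration>8</duration></note>
    </measure>
  </part>
</score-partwise>"""


def read_notes(xml_text):
    """Return (pitch name or None for a rest, length in quarter notes, lyric or None)."""
    root = ET.fromstring(xml_text)
    divisions = 1  # duration units per quarter note, updated from <attributes>
    notes = []
    for measure in root.iter("measure"):
        for elem in measure:
            if elem.tag == "attributes":
                div = elem.findtext("divisions")
                if div is not None:
                    divisions = int(div)
            elif elem.tag == "note":
                pitch = elem.find("pitch")
                name = None
                if pitch is not None:
                    # Assumes whole-semitone alterations (no microtones).
                    alter = int(pitch.findtext("alter", "0"))
                    accidental = "#" * alter if alter > 0 else "b" * -alter
                    name = pitch.findtext("step") + accidental + pitch.findtext("octave")
                quarters = int(elem.findtext("duration")) / divisions
                notes.append((name, quarters, elem.findtext("lyric/text")))
    return notes


print(read_notes(SCORE))
```

Grace notes, ties, and chords are ignored here; a real converter would need to handle them before producing labels.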

Singing voice synthesis system - singing voice database construction
Singer selection and data processing
Finding candidates to record a demo: 4 candidates.
Choosing the singer: accuracy of pitch; timbre.
Checking the recorded data: noise is not allowed to exceed the recording criterion.
Segmentation and normalization: segmentation by phoneme; the energy of the singing voice data is reduced so that the singing voice does not suddenly become loud; a pitch range that is too large leads to poor synthesis.

Singing voice synthesis system - singing voice database
NCKU Singing Voice Database
We chose 74 songs whose lyrics together cover all Mandarin phonemes.
Songs: nursery rhymes / children's songs
Total: 148 songs
Singer: one female
Pitch range: C4~B4; version 1, 2
Total time: about 102 minutes
Sample rate: 48 kHz
Resolution: 16 bits
Channels: mono
File name data: 小蜜蜂 (Little Bee), 兩隻老虎 (Two Tigers), 火車快飛 (Train, Fly Fast)

Label generation (diagram blocks)
Inputs: text, MusicXML, wav, transcription.
Processing: word to phone; extract score information (note absolute pitch, note type, measure, song settings); initial and final processing; long duration processing; riffs and runs processing; pause processing; note calculation; user-defined phrase units.
Outputs: linguistic information; note information (note pitch, note duration, song structure); cue information; label.

Initial and final processing
Tone: in singing, the main pitch of the note is more significant than the original tone of the word. E.g. 不: speech -> bu wuH wuL; singing -> bu wu.
Vowel: we define the phonemes by phonology. The medial (介音) is grouped with the rime rather than with the initial: when yi (ㄧ), wu (ㄨ), or yu (ㄩ) is the medial, the medial and the rime together are treated as one kind of final.
Diagram labels: speech, singing.

Initial and final processing
Single initial: a syllable that has only an initial and no final is pronounced with an added empty rime "帀".
Retroflex (捲舌音): ㄓ ㄔ ㄕ ㄖ + zr
Dental sibilant (平舌音): ㄗ ㄘ ㄙ + sr
Total phonemes: 59 (speech: 66)
Initials: ㄅ b, ㄆ p, ㄇ m, ㄈ f, ㄉ d, ㄊ t, ㄋ n, ㄌ l, ㄍ g, ㄎ k, ㄏ h, ㄐ j, ㄑ ch, ㄒ sh, ㄓ jr, ㄔ chr, ㄕ shr, ㄖ r, ㄗ tz, ㄘ tsz, ㄙ sz
Empty rimes: 帀1 zr, 帀2 sr
Finals: ㄧ yi, ㄨ wu, ㄩ yu, ㄚ a, ㄛ o, ㄜ e, ㄞ ai, ㄟ ei, ㄠ au, ㄡ ou, ㄢ an, ㄣ en, ㄤ ang, ㄥ ng, ㄦ er, ㄝ eh
Finals with medial: ㄧㄚ ia, ㄧㄝ ieh, ㄧㄠ iau, ㄧㄡ iou, ㄧㄢ ian, ㄧㄣ ien, ㄧㄤ iang, ㄧㄥ ing, ㄨㄚ ua, ㄨㄛ uo, ㄨㄞ uai, ㄨㄟ uei, ㄨㄢ uan, ㄨㄣ uen, ㄨㄤ uang, ㄨㄥ ung, ㄩㄝ iueh, ㄩㄢ iuan, ㄩㄣ iuen, ㄩㄥ iung
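The phoneme inventory on this slide can be written down as data, which also confirms the total of 59. The sketch below is illustrative only: the function name `syllable_units` and the convention of passing an empty string for a missing initial or final are assumptions, not part of the original system.

```python
# Phoneme symbols copied from the slide.
INITIALS = "b p m f d t n l g k h j ch sh jr chr shr r tz tsz sz".split()
EMPTY_RIMES = ["zr", "sr"]
FINALS = "yi wu yu a o e ai ei au ou an en ang ng er eh".split()
FINALS_WITH_MEDIAL = ("ia ieh iau iou ian ien iang ing ua uo uai uei uan "
                      "uen uang ung iueh iuan iuen iung").split()
PHONEMES = INITIALS + EMPTY_RIMES + FINALS + FINALS_WITH_MEDIAL

RETROFLEX = {"jr", "chr", "shr", "r"}  # take the empty rime zr
DENTAL = {"tz", "tsz", "sz"}           # take the empty rime sr


def syllable_units(initial, final=""):
    """Return the singing units of one syllable; '' means the part is absent."""
    if not final:
        # Single-initial syllable: add the matching empty rime.
        if initial in RETROFLEX:
            final = "zr"
        elif initial in DENTAL:
            final = "sr"
        else:
            raise ValueError("initial %r cannot stand alone" % initial)
    return [initial, final] if initial else [final]


print(len(PHONEMES))
print(syllable_units("shr"), syllable_units("", "uo"))
```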

Singing voice synthesis system - singing voice database
Phonetic coverage (chart categories: initial, final, final containing medial)
Phones: 59
Total phones: 15300
Total words: 8448
Songs: 148

Long duration model
Long notes are important for expressive singing. Short notes are over quickly and carry no special effects, whereas a long note gives the singer more room for expression.
Lengthening a short-duration model cannot fully represent a long note.
Half or whole note -> Final + "L"
Example lyrics: 一起飛飛就飛叫就叫
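The long-duration rule can be sketched as a naming function. This is an assumption-laden illustration: the slide only names half and whole notes, so treating every note of at least two quarter notes as long, and the name `final_model_name`, are choices made here for the example.

```python
LONG_NOTE_QUARTERS = 2.0  # a half note; a whole note is 4.0 quarter notes


def final_model_name(final, note_quarters):
    """Append 'L' to the final's model name when the note is a half note or longer."""
    if note_quarters >= LONG_NOTE_QUARTERS:
        return final + "L"
    return final


for quarters in (1.0, 2.0, 4.0):
    print(quarters, final_model_name("ei", quarters))
```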

Riffs and runs processing
A syllable may correspond to multiple notes; in that case the last tonal is repeated.
Pause processing
Pauses represent the breathing pauses or segment pauses of human singing.
When the singer pauses for longer than a threshold (> 0.3 seconds), a rest is inserted.
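Both rules can be sketched in a few lines. The sketch assumes that "the last tonal" means the syllable's final, that segment times are given in seconds, and that a rest is labeled `pau`; the function names and the label are hypothetical.

```python
PAUSE_THRESHOLD_S = 0.3  # threshold from the slide


def expand_melisma(initial, final, n_notes):
    """One unit list per note: initial + final on the first note, the final repeated after."""
    first = [initial, final] if initial else [final]
    return [first] + [[final] for _ in range(n_notes - 1)]


def insert_pauses(segments, threshold=PAUSE_THRESHOLD_S):
    """segments: (label, start_s, end_s) tuples sorted by time; adds 'pau' rests in long gaps."""
    out = []
    prev_end = None
    for label, start, end in segments:
        if prev_end is not None and start - prev_end > threshold:
            out.append(("pau", prev_end, start))
        out.append((label, start, end))
        prev_end = end
    return out


print(expand_melisma("f", "ei", 3))
print(insert_pauses([("a", 0.0, 1.0), ("b", 1.5, 2.0), ("c", 2.1, 2.5)]))
```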

Linguistic information
Phoneme: the current phoneme and the two {preceding, succeeding} phonemes
Syllable: number of phonemes in the {preceding, current, succeeding} syllable
Phrase: number of phonemes/syllables in the {preceding, current, succeeding} phrase
Song: average number of phonemes/syllables per measure in this song; number of phrases in this song
Riffs and runs
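The phoneme-level and syllable-level context factors listed above can be sketched as sliding windows. The padding symbol `x`, the use of 0 for a missing neighboring syllable, and the function names are assumptions made for this illustration.

```python
def phoneme_contexts(phonemes, pad="x"):
    """Return (prev2, prev1, current, next1, next2) for every phoneme."""
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


def syllable_sizes(syllables):
    """Return (# phonemes in preceding, current, succeeding syllable) for every syllable."""
    sizes = [len(s) for s in syllables]
    padded = [0] + sizes + [0]  # 0 marks a missing neighbor
    return [(padded[i], padded[i + 1], padded[i + 2]) for i in range(len(sizes))]


print(phoneme_contexts(["b", "wu", "f", "ei"]))
print(syllable_sizes([["b", "wu"], ["ei"], ["f", "ei"]]))
```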

Singing is the act of producing musical sounds with the voice; it augments regular speech with both tonality and rhythm.
Note pitch: pitches are compared as "higher" and "lower" in the sense associated with musical melodies.
Note duration: an amount of time or a particular time interval. It is the length of a note and one of the bases of rhythm.
Song structure: the overall musical form or structure that the song adopts; the order of a music score.

User-defined phrase units
Phrasing may be necessary for the singer to take catch breaths or to achieve a certain style.
In music, a phrase is defined as "a short passage or segment, often consisting of four measures or forming part of a smaller/larger unit".
We define the phrase unit according to the song structure; it is used in the outside label to represent breathing pauses.
Examples: 4 measures / phrase; 2 measures / phrase.

Note calculation
The basic score information is not enough to describe a note completely.
Relative pitch: the difference between the key note and the current note. The key note depends on the number of sharps or flats.
Note position: different positions in the measure or phrase may be expressed differently because of breathing. Units: note, 0.1 second, thirty-second note, %.
Note length: 0.1 second (absolute length), thirty-second note (relative length).
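These calculations can be sketched as below. Several points are assumptions for illustration: the key is taken to be a major key whose tonic is derived from the number of sharps (positive) or flats (negative), the relative pitch is counted upward from the key note, and the tempo is given in quarter notes per minute. The function names are hypothetical.

```python
STEP_TO_PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def tonic_pitch_class(fifths):
    """Pitch class of the major-key tonic; fifths > 0 counts sharps, < 0 counts flats."""
    return (7 * fifths) % 12


def relative_pitch(step, alter, fifths):
    """Semitones (0-11) from the key note up to the current note."""
    return (STEP_TO_PITCH_CLASS[step] + alter - tonic_pitch_class(fifths)) % 12


def note_lengths(quarters, tempo_bpm):
    """Return (length in thirty-second notes, length in 0.1-second units)."""
    thirty_seconds = round(quarters * 8)
    tenths_of_second = round(quarters * 600.0 / tempo_bpm)
    return thirty_seconds, tenths_of_second


print(relative_pitch("G", 0, 0), relative_pitch("G", 0, 1))
print(note_lengths(1, 120), note_lengths(2, 90))
```

The thirty-second-note length is independent of tempo, while the 0.1-second length changes with it, which is why both are useful as context factors.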

Note information
Note pitch: absolute pitch (C0-G9); relative pitch (0-11); pitch difference between the previous and current notes and between the current and next notes.
Note duration: length of the note by syllable, in thirty-second notes, and in units of 0.1 second.
Song structure: beat (2/4, 3/4, 4/4); tempo (90, 100, 120); key.
Position: counted by note, 0.1 second, thirty-second note, and percentage within the measure/phrase.
Number of phrases.

Singing voice synthesis system - Question sets definition
Question set definition for singing model clustering
(1) Phoneme (the current phoneme and the two {preceding, succeeding} phonemes): initial; final; with or without medial; pronunciation category of initials; pronunciation category of finals.
(2) Note: pitch, tempo, beat, duration, position.
(3) Phrase: number of phonemes/syllables in the preceding, current, and succeeding phrase.
(4) Song: number of phonemes/syllables; number of phrases.
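A phoneme question of type (1) can be sketched as a set of wildcard patterns matched against context labels. The sketch writes questions in the style of HTS question files (`QS "name" {patterns}`); the label layout `previous-current+next`, the question name, and the function names are assumptions, since the slides do not show the actual label format.

```python
from fnmatch import fnmatchcase


def current_phoneme_patterns(phonemes):
    """Wildcard patterns that match labels whose current phoneme is in the set."""
    return ["*-%s+*" % p for p in phonemes]


def format_question(name, patterns):
    """Render one question line in the assumed HTS-like style."""
    return 'QS "%s" {%s}' % (name, ",".join(patterns))


def answers_yes(label, patterns):
    """True when the context label matches any pattern of the question."""
    return any(fnmatchcase(label, p) for p in patterns)


retroflex = current_phoneme_patterns(["jr", "chr", "shr", "r"])
print(format_question("C-Initial_Retroflex", retroflex))
print(answers_yes("b-shr+zr", retroflex), answers_yes("shr-zr+b", retroflex))
```

During decision-tree clustering, each such question splits the set of context-dependent models into a "yes" and a "no" group.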

Singing voice synthesis system - Refinement
Pitch-shift pseudo data
To improve pitch coverage, nearby notes from other songs are shifted to the corresponding frequency (Hz).
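Shifting an F0 contour by a number of half tones follows the equal-temperament ratio 2^(n/12). The sketch below is a minimal illustration under the assumption that unvoiced frames are stored as 0.0; the function name `shift_f0` is hypothetical.

```python
def shift_f0(f0_hz, semitones):
    """Shift an F0 contour (Hz) by the given number of semitones; 0.0 marks unvoiced frames."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Hypothetical contour around C4 (about 261.63 Hz), shifted up by one half tone.
contour = [261.63, 0.0, 262.0]
print(shift_f0(contour, 1))
```

Only the F0 stream is changed in this sketch; how the spectral parameters of the shifted data are treated is not stated on the slide.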

Evaluation - Experimental conditions
Database condition
Number of songs: 148
Number of phonemes: 15300
Number of words: 8443
Number of notes: 9054
Total time: about 100 minutes
Mel-cepstral analysis condition
Sampling rate: 48 kHz
Frame shift: 5 ms
Window length: 25 ms
Window function: Blackman window
MGC order: 49 dim MFCCs
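The analysis conditions translate into sample counts as sketched below. The sketch uses the textbook Blackman window formula and assumes frames are taken only where a full window fits; the function names are hypothetical, and the actual analysis tool used in the thesis may frame the signal differently.

```python
import math

SAMPLE_RATE = 48000      # Hz
FRAME_SHIFT_S = 0.005    # 5 ms
WINDOW_LENGTH_S = 0.025  # 25 ms


def to_samples(seconds, rate=SAMPLE_RATE):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * rate))


def blackman(n):
    """Textbook Blackman window of length n."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]


def frame_count(n_samples, window, shift):
    """Number of full windows that fit into n_samples."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // shift


shift = to_samples(FRAME_SHIFT_S)
window = to_samples(WINDOW_LENGTH_S)
print(shift, window, frame_count(SAMPLE_RATE, window, shift))
```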

Evaluation Experiments settings Baseline
RQ: reduced question sets (duplicate questions, indirect questions, relative questions)
PS: pitch-shift pseudo data
VP: vibrato post-processing

Evaluation - Subjective evaluation
Pitch contour Synthesized (baseline) vs. Music score Synthesized (baseline) vs. Original singing

Evaluation - Subjective evaluation
Mean Opinion Score (MOS): 10 synthesized songs, 12 subjects; quality and intelligibility evaluation.
ABX test: a subject is presented with two known samples (A, the reference, and B, the alternative). X is randomly selected from A and B, and the subject identifies X as being either A or B.
Quality MOS scale: Excellent 5, Good 4, Fair 3, Poor 2, Bad 1.

Evaluation - Subjective
Quality evaluation
System     Mean   Variance
baseline   2.76   0.173
RQ         2.49   0.132
RQ+PS      3.04   0.141
Intelligibility evaluation
System     Mean   Variance
baseline   2.79   0.166
RQ         2.75   0.187
RQ+PS      3.11   0.008
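The mean and variance columns can be computed from raw ratings as sketched below. The ratings in the example are hypothetical and are not the experiment's data; the slides also do not say whether the population or the sample variance was reported, so the population variance used here is an assumption.

```python
from statistics import mean, pvariance


def mos_summary(ratings):
    """Return (mean, population variance) of MOS ratings on the 1-5 scale."""
    values = [float(r) for r in ratings]
    if any(v < 1 or v > 5 for v in values):
        raise ValueError("MOS ratings must lie between 1 and 5")
    return round(mean(values), 2), round(pvariance(values), 3)


# Hypothetical ratings from five listeners for one system.
print(mos_summary([3, 3, 4, 2, 3]))
```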

Demo: outside test
Systems: baseline, baseline+RQ, baseline+RQ+PS
Lyrics: 娃娃哭了叫媽媽 ("The doll cried and called for mama"); 推你摔下你又站起來 ("Pushed down, you stand up again")

Evaluation - Subjective
With the reduced question set (RQ), the quality and intelligibility scores are lower than the baseline.
The questions we removed included information that is important for classification.
Too few questions remained: 5364 -> 1257.
A better version of the reduced question set still needs to be found.

Preference test Natural- Testing vibrato
Different pitches and situations correspond to different vibrato settings.
Vibrato is not essential in children's songs.
Audio samples: original, vibrato.

Discussion
Singing corpus quality: too blurred; recording in a professional environment; singer's timbre.
Context factor coverage: too blurred; not enough training corpus; modeling with priority given to singing characteristics.

Conclusion
A Mandarin corpus-based singing voice synthesis system based on hidden Markov models (HMMs) was implemented.
We defined the Mandarin singing models and the question sets for model clustering.
We used three methods to refine our system: question set reduction, pitch-shift pseudo data, and vibrato post-processing.

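Demo
Inside test: original, our system
Outside test: original, our system
Songs: 火車快飛 (Train, Fly Fast), 小星星 (Little Star), 妹妹揹著洋娃娃 (Little Sister Carrying Her Doll), 三輪車 (Tricycle), 康定情歌 (Kangding Love Song), 蝴蝶 (Butterfly)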

Thanks for listening & comments
