1
National Cheng Kung University, Tainan, TAIWAN
Synthesis Unit and Question Set Definition for Mandarin HMM-based Singing Voice Synthesis
***Problems using technology for system goal***
Hello to my professors, senior and junior labmates, and classmates. The master's thesis I am presenting today is titled: xxxxxxx. I am today's presenter: 純珊
Student: Ju-Yun Cheng
Advisor: Prof. Chung-Hsien Wu
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, TAIWAN
2
Outline
Introduction
  Background
  Motivation
  Related work
Singing voice synthesis system
Evaluation
Discussion
Conclusion
Future work
The outline covers: 1. the background of the thesis; 2. its motivation and aims; 3. related research and references; 4. the problem to be solved and the proposed method; 5. the steps of the proposed method; 6. experiments and discussion; 7. results
3
Introduction - Background
Speech and singing are both important ways to communicate and express emotion.
Speech synthesizers can generate fluent, natural speech, even with personal characteristics.
Singing voice synthesis has recently become a popular, emerging research topic.
It enables computers to sing any song without an actual human performance.
4
Introduction - Background
There are two main methods in corpus-based singing synthesis.
Sample-based approach: unit selection
  Appropriate sub-word units are selected from large speech databases.
  Pros: high-quality speech at the waveform level
  Cons: requires a huge amount of recorded data; discontinuities; unstable quality; fixed voice characteristics
[Figure: unit-selection pipeline. Blocks: lyrics, note, score editor, synthesis score, sample selection, concatenation, synthesis output, singer library]
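The slide only states that sub-word units are selected from a database and concatenated. A common way to do this, sketched below as an assumption and not as the method of any system named here, is a dynamic-programming search that minimizes a target cost plus a join cost. The function names, the pitch-only costs, and the toy units are all hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so that target + join costs are minimal."""
    # best[i][j] = (cheapest total cost of a path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(prev, cand)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min] + target_cost(targets[i], cand), k_min))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Hypothetical costs: units are (name, pitch_hz) pairs, targets are pitches in Hz.
def pitch_target_cost(target_hz, unit):
    return abs(target_hz - unit[1])


def pitch_join_cost(prev_unit, unit):
    return 0.1 * abs(prev_unit[1] - unit[1])


TARGETS = [262.0, 294.0]
CANDIDATES = [[("a1", 260.0), ("a2", 300.0)], [("b1", 290.0), ("b2", 262.0)]]
chosen = select_units(TARGETS, CANDIDATES, pitch_target_cost, pitch_join_cost)
```

Real systems use richer costs (spectral distance, duration, phonetic context); the search structure stays the same.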
5
Introduction - Background
Sample-based approach: unit selection
  Units are chosen from a singing voice corpus using the lyrics of the song and the corresponding MIDI file [Zhou, 2008]
Vocaloid
  A singing synthesizer developed by Yamaha Corporation, initially released in January 2004
  Uses pitch conversion and timbre manipulation to concatenate samples smoothly
  Name: Vocal + loid
[Zhou, 2008] :
6
Introduction - Background
There are two main methods in corpus-based singing synthesis.
Statistical approach: HMM-based
  Parameters are modeled with context-dependent HMMs, and waveforms are generated from the HMMs.
  Pros: relatively little training data; smooth and stable quality; flexible control of voice characteristics
  Cons: vocoder-like sound; over-smoothing
[Figure: HMM-based pipeline. Blocks: singing waveform, labels, parameter extraction, singing parameters, acoustic model training, acoustic model parameters, parameter generation, waveform generation, synthesis output]
7
Introduction - Background
Statistical approach: HMM-based
Sinsy
  A free online singing voice synthesis service that provides Japanese and English versions
  Users obtain synthesized singing voices by uploading musical scores represented in MusicXML
8
Introduction - Background
Other methods for singing voice synthesis
HNM (Harmonic plus Noise Model)
  HNM parameters of a source syllable are used to synthesize singing syllables of diverse pitches and durations [Gu, 2008]
Speech-to-singing
  Synthesizes a singing voice with a parameter control model, from the lyrics of a song and its musical score [Akagi, 2007]
  Lyrics are converted into speech by TTS; a melody control model then converts the speech signal into a singing voice by modifying the acoustic parameters [Cai, 2011]
[Gu, 2008] : Mandarin Singing-voice Synthesis Using an HNM Based Scheme
[Cai, 2011] : A Lyrics to Singing Voice Synthesis System with Variable Timbre
9
Introduction - Motivation
To synthesize a smooth and continuous singing voice, we chose the HMM-based method to build our singing voice synthesis system.
An HMM can model the temporal sequence of a singing voice.
Parameters are generated from an HMM composed by concatenating phoneme HMMs.
[Figure: HMM state sequence, state duration, spectral and lf0 parameters]
10
Introduction - Improvement in Sinsy
This is a series of papers written by the team that produced Sinsy.
[An HMM-based Singing Voice Synthesis System, 2006]
  The first paper on an HMM-based singing voice synthesis system
[HMM-based Singing Voice Synthesis System using Pitch-shifted Pseudo Training Data, 2010]
  To increase the amount of F0 training data, pitch-shifted pseudo data can be prepared by shifting F0 up or down in half tones (semitones)
[Recent Development of the HMM-based Singing Voice Synthesis System – Sinsy, 2010]
  Introduces the free online singing voice synthesis service
[Pitch Adaptive Training For HMM-based Singing Voice Synthesis, 2012]
  Model-level normalization of pitch
11
Singing voice synthesis system - features extraction
STRAIGHT [H. Kawahara 1997]
  A high-quality analysis-synthesis method that offers high flexibility in parameter manipulation with no further degradation
  Extracts parameters with relatively good performance even in a non-professional recording environment
  Features: pitch, smoothed spectrum, aperiodic factors
[Figure: STRAIGHT. Analysis blocks: waveform, fixed-point analysis F0 extraction, F0, smoothed spectrum, aperiodic factors. Synthesis blocks: mixed excitation with phase manipulation, synthetic waveform]
12
Singing voice synthesis system - Proposed method for Mandarin singing
Speech vs. singing: pitch contour
Database, model definition, question set
13
Singing voice synthesis system - Proposed method for Mandarin singing
Speech vs. singing: music score
Score attributes: pitch, duration, key, tempo, beat
14
Singing voice synthesis system - Proposed method for Mandarin singing
Different from Sinsy
  Language: from Japanese to Mandarin
  Database, model definition, question sets
  Refinement
Japanese Syllabary – hiragana
  Japanese syllables are basically "consonant + vowel"
  Only five vowels
Bopomofo
  37 existing symbols (21 initials, 16 finals)
15
Singing voice synthesis system - Proposed method for Mandarin singing
[Figure: overview of the proposed method. Blocks: singing database, acoustic parameters, model, question sets (linguistic info, note info, cue info). Annotations: different from Sinsy, different from TTS, only for Mandarin, specially for singing]
16
Singing voice synthesis system - system structure
Training phase: singing voice database; excitation, spectral, and aperiodicity parameter extraction; label; question set; training of HMMs; CART-based state tying; context-dependent HMMs & duration models
Synthesis phase: musical score; conversion; label; state selection by CART; parameter generation from HMM; excitation, spectral, and aperiodicity generation; synthesis filter; synthesized singing voice
17
Singing voice synthesis system - Proposed method for Mandarin singing
Singing voice database construction
  Building a singing voice database for training and synthesis
  MHMC Singing Voice Database
Mandarin singing model definition
  Initial and final modification
  Medial modification
  Long duration models
Question set definition for decision trees
  Modification for Mandarin
Refinements
  Pitch coverage by pitch-shift pseudo data
  Vibrato
18
Singing voice synthesis system - singing voice database construction
Singing voice database construction
  Singing corpus design process
[Figure: corpus design flow. Blocks: music score corpus, song selection, selected scores, phonetic transcription, singing signal, segmentation by phoneme, singing database]
19
Singing voice synthesis system - singing voice database construction
Song selection
  Selecting scores
    Sources: music books and internet versions
  Selection criteria and specialization
    Simple songs that do not require advanced singing skills
    Phone coverage
  Digitizing
    Data format: MusicXML
    Transposition to an appropriate pitch range
20
Singing voice synthesis system - Model definition
MusicXML file
[Figure: sheet music score, keyed in and converted to MusicXML format]
MusicXML is an XML-based file format for representing Western musical notation. The format is proprietary, but fully and openly documented.
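As an illustration of what a digitized score provides, the sketch below reads note pitch, type, duration, and lyric from a tiny MusicXML fragment with Python's standard `xml.etree.ElementTree`. The fragment and the `read_notes` helper are made up for this example; the element names (`note`, `pitch`, `step`, `octave`, `duration`, `type`, `rest`, `lyric`) are standard MusicXML. Accidentals (`alter`) are ignored to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-measure score: C4 quarter, quarter rest, G4 half.
SCORE = """<score-partwise version="3.1">
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key><fifths>0</fifths></key>
        <time><beats>4</beats><beat-type>4</beat-type></time>
      </attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>1</duration><type>quarter</type>
        <lyric><text>xiao</text></lyric>
      </note>
      <note>
        <rest/>
        <duration>1</duration><type>quarter</type>
      </note>
      <note>
        <pitch><step>G</step><octave>4</octave></pitch>
        <duration>2</duration><type>half</type>
        <lyric><text>mi</text></lyric>
      </note>
    </measure>
  </part>
</score-partwise>"""


def read_notes(xml_text):
    """Return one dict per note or rest, in score order."""
    root = ET.fromstring(xml_text)
    notes = []
    for measure in root.iter("measure"):
        for note in measure.findall("note"):
            pitch = note.find("pitch")
            if pitch is None:
                name = "rest"
            else:
                name = pitch.findtext("step") + pitch.findtext("octave")
            notes.append({
                "measure": measure.get("number"),
                "pitch": name,
                "type": note.findtext("type"),
                "duration": int(note.findtext("duration")),  # in <divisions> units
                "lyric": note.findtext("lyric/text"),  # None when no lyric is attached
            })
    return notes


notes = read_notes(SCORE)
```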
21
Singing voice synthesis system - singing voice database construction
Singer selection and data processing
  Finding candidates to record demos: 4 candidates
  Choosing the singer: accuracy of pitch; timbre
  Checking recorded data: noise must not exceed the recording criterion
  Segmentation and normalization
    Segmentation by phoneme
    Lower the energy of the singing voice data so that the voice does not suddenly become loud
    Too large a pitch range leads to poor synthesis
22
Singing voice synthesis system - singing voice database
NCKU Singing Voice Database
We chose 74 songs whose lyrics together cover all Mandarin phonemes.
Songs: nursery rhymes / children's songs
Total: 148 songs
Singer: one female
Pitch range: C4~B4
Versions: 1, 2
Total time: about 102 minutes
Sample rate: 48 kHz
Resolution: 16 bits
Channels: mono
File names (examples): 小蜜蜂 (Little Bee), 兩隻老虎 (Two Tigers), 火車快飛 (The Train Runs Fast)
23
Singing voice synthesis system - Model definition
[Figure: label generation pipeline. Inputs: text, MusicXML, wav. Processing blocks: word to phone, initial and final processing, long duration processing, riffs and runs processing, pause processing, transcription, extract score information (note absolute pitch, note type, measure, song settings), note calculation, user-defined phrase units. Outputs: linguistic information, note information (note pitch, note duration, song structure), cue information, label]
24
Singing voice synthesis system - Model definition
[Figure: label generation pipeline (same diagram as slide 23)]
25
Singing voice synthesis system - Model definition
Initial and final processing
Tone
  In singing, the pitch of the note is more significant than the original lexical tone of the word
  e.g. 不: speech -> bu wuH wuL; singing -> bu wu
Vowel
  We define the phonemes by phonology
  The medial (介音) is grouped with the rime rather than with the initial
  When yi (ㄧ), wu (ㄨ), or yu (ㄩ) is a medial, the medial and rime together are treated as one kind of final
[Figure: medial handling in speech vs. singing]
26
Singing voice synthesis system - Model definition
Initial and final processing
Single initial
  A syllable that has only an initial and no final is followed by an empty rime "帀" so that it can be pronounced
  Retroflex initials (捲舌音): ㄓ ㄔ ㄕ ㄖ + zr
  Flat-tongue (dental sibilant) initials (平舌音): ㄗ ㄘ ㄙ + sr
Total phonemes: 59 (speech: 66)
Initials (23)
  ㄅ b   ㄆ p   ㄇ m   ㄈ f   ㄉ d   ㄊ t   ㄋ n   ㄌ l   ㄍ g   ㄎ k   ㄏ h   ㄐ j   ㄑ ch
  ㄒ sh   ㄓ jr   ㄔ chr   ㄕ shr   ㄖ r   帀1 zr   ㄗ tz   ㄘ tsz   ㄙ sz   帀2 sr
Finals (16)
  ㄧ yi   ㄨ wu   ㄩ yu   ㄚ a   ㄛ o   ㄜ e   ㄞ ai   ㄟ ei   ㄠ au   ㄡ ou   ㄢ an   ㄣ en   ㄤ ang   ㄥ ng   ㄦ er   ㄝ eh
Finals with medial (20)
  ㄧㄚ ia   ㄧㄝ ieh   ㄧㄠ iau   ㄧㄡ iou   ㄧㄢ ian   ㄧㄣ ien   ㄧㄤ iang   ㄧㄥ ing
  ㄨㄚ ua   ㄨㄛ uo   ㄨㄞ uai   ㄨㄟ uei   ㄨㄢ uan   ㄨㄣ uen   ㄨㄤ uang   ㄨㄥ ung
  ㄩㄝ iueh   ㄩㄢ iuan   ㄩㄣ iuen   ㄩㄥ iung
27
Singing voice synthesis system - singing voice database
Phonetic coverage
[Figure: coverage of initials, finals, and finals containing a medial]
Phones: 59
Total phones: 15300
Total words: 8448
Songs: 148
28
Singing voice synthesis system - Model definition
Long duration model
  Long notes are important for expressive singing.
  Short notes end quickly, leaving no room for special effects.
  Long notes are different: they provide more room for expression.
  Lengthening a short-note model cannot fully reproduce a long note.
  Half or whole note -> final + "L"
  Examples: 一起飛, 飛就飛, 叫就叫
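The rule "half or whole note -> final + L" can be sketched as a small labeling function. The note-type strings follow MusicXML's `type` values, and the helper name `final_label` is hypothetical; the slide does not show how the original system implements this step.

```python
# Note types that trigger the separate long-duration model (from the slide).
LONG_NOTE_TYPES = {"half", "whole"}


def final_label(final, note_type):
    """Return the model name for a final, appending "L" on long notes."""
    return final + "L" if note_type in LONG_NOTE_TYPES else final


# 飛 (f + ei) on a quarter note keeps "ei"; on a half note it becomes "eiL".
labels = [final_label("ei", t) for t in ("quarter", "half", "whole")]
```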
29
Singing voice synthesis system - Model definition
[Figure: label generation pipeline (same diagram as slide 23)]
30
Singing voice synthesis system - Model definition
Riffs and runs processing
  One syllable corresponds to multiple notes
  The last final is repeated for the additional notes
Pause processing
  Represents the breathing pauses or segment pauses of human singing
  When the singer pauses for longer than a threshold (> 0.3 seconds), the pause is labeled as a rest
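A minimal sketch of these two rules follows. It assumes that "repeat the last tonal" means repeating the final once per extra note, and that segments are `(label, start_s, end_s)` tuples; both are assumptions, as are the helper names. Only the 0.3-second threshold comes from the slide.

```python
PAUSE_THRESHOLD_S = 0.3  # threshold stated on the slide


def expand_syllable(initial, final, n_notes):
    """Phone sequence for one syllable sung over n_notes notes.

    Assumption: the initial is sung once and the final is repeated per note.
    """
    phones = [initial] if initial else []
    return phones + [final] * n_notes


def insert_rests(segments, threshold=PAUSE_THRESHOLD_S):
    """Insert a ("rest", start, end) segment wherever the gap exceeds threshold."""
    out = []
    prev_end = None
    for label, start, end in segments:
        if prev_end is not None and start - prev_end > threshold:
            out.append(("rest", prev_end, start))
        out.append((label, start, end))
        prev_end = end
    return out


run = expand_syllable("f", "ei", 3)
with_rests = insert_rests([("a", 0.0, 1.0), ("b", 1.5, 2.0), ("c", 2.1, 2.5)])
```

Here the 0.5-second gap between "a" and "b" becomes a rest, while the 0.1-second gap before "c" does not.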
31
Singing voice synthesis system - Model definition
Linguistic information
Phoneme
  Current phoneme and the two {preceding, succeeding} phonemes
Syllable
  Number of phonemes in the {preceding, current, succeeding} syllable
Phrase
  Number of phonemes/syllables in the {preceding, current, succeeding} phrase
Song
  Average number of phonemes/syllables per measure in this song
  Number of phrases in this song
Riffs and runs
32
Singing voice synthesis system - Model definition
Singing is the act of producing musical sounds with the voice; it augments regular speech with both tonality and rhythm.
Note pitch
  Pitches are compared as "higher" and "lower" in the sense associated with musical melodies.
Note duration
  An amount of time or a particular time interval. It is the length of a note and one of the bases of rhythm.
Song structure
  The overall musical form or structure the song adopts
  The order of a music score
33
Singing voice synthesis system - Model definition
[Figure: label generation pipeline (same diagram as slide 23)]
34
Singing voice synthesis system - Model definition
[Figure: label generation pipeline (same diagram as slide 23)]
35
Singing voice synthesis system - Model definition
User-defined phrase units
  Phrasing may be necessary for the singer to take catch breaths or to achieve a certain style.
  In music, a phrase is defined as "a short passage or segment, often consisting of four measures or forming part of a smaller/larger unit".
  We defined the phrase unit depending on the song structure.
  Used in the outside label to represent breathing pauses
  Examples: 4 measures / phrase, 2 measures / phrase
36
Singing voice synthesis system - Model definition
Note calculation
  The basic information is not enough to describe a note completely.
Relative pitch
  The difference between the key note and the current note
  The key note depends on the number of sharps or flats
Note position
  Different note positions in the measure or phrase may have different expression due to breathing
  Units: note, 0.1 second, thirty-second note, %
Note length
  0.1 second (absolute length), thirty-second note (relative length)
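The relative-pitch calculation can be sketched as follows. It assumes a major key, so that the key note follows the circle of fifths from the number of sharps (positive) or flats (negative), and it uses the MIDI convention C4 = 60. The function names are hypothetical and the slides do not say how minor keys are handled.

```python
STEP_TO_PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_number(step, octave, alter=0):
    """Absolute pitch as a MIDI note number (C4 = 60, so C0 = 12 and G9 = 127)."""
    return 12 * (octave + 1) + STEP_TO_PITCH_CLASS[step] + alter


def tonic_pitch_class(fifths):
    """Key note of a major key: each sharp moves up a fifth (7 semitones)."""
    return (7 * fifths) % 12


def relative_pitch(step, octave, fifths, alter=0):
    """Semitones (0-11) from the key note up to the current note."""
    return (midi_number(step, octave, alter) - tonic_pitch_class(fifths)) % 12


# B4 in G major (one sharp) is the major third of the key: 4 semitones.
example = relative_pitch("B", 4, fifths=1)
```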
37
Singing voice synthesis system - Model definition
[Figure: label generation pipeline (same diagram as slide 23)]
38
Singing voice synthesis system - Model definition
Note information
Note pitch
  Absolute pitch (C0-G9), relative pitch (0-11), pitch difference between previous & current / current & next note
Note duration
  Length of note by syllable, thirty-second note, 0.1 second
Song structure
  Beat: 2/4, 3/4, 4/4
  Tempo: 90, 100, 120
  Key
Position
  Counted by note, 0.1 second, thirty-second note, percentage in the measure/phrase
  Number of phrases
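To make the duration and position units concrete, here is a small sketch that converts a note length given in quarter notes into thirty-second-note units and 0.1-second units, and expresses a position as a percentage of the measure. It assumes the tempo counts quarter notes per minute; the function names are hypothetical.

```python
def note_lengths(quarter_notes, tempo_bpm):
    """Length of a note in the relative and absolute units listed on the slide."""
    seconds = quarter_notes * 60.0 / tempo_bpm
    return {
        "thirty_second_notes": round(quarter_notes * 8),  # 8 per quarter note
        "tenths_of_second": round(seconds * 10),          # 0.1-second units
    }


def position_percent(onset_quarters, measure_quarters):
    """Onset of a note as a percentage of the measure length."""
    return round(100.0 * onset_quarters / measure_quarters)


half_note_at_120 = note_lengths(2, 120)    # 1.0 second long
third_beat_in_4_4 = position_percent(2, 4)
```

The same note therefore has a fixed relative length but an absolute length that changes with tempo, which is why both units appear as context features.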
39
Singing voice synthesis system - Question sets definition
Question set definition for singing model clustering
(1) Phoneme (current and the two {preceding, succeeding} phonemes)
  Final
    With or without medial
    Final pronunciation category
  Initial
    Initial pronunciation category
(2) Note
  Pitch
  Tempo
  Beat
  Duration
  Position
(3) Phrase
  Number of phonemes/syllables in the preceding, current, succeeding phrase
(4) Song
  Number of phonemes/syllables
  Number of phrases
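The slides do not show the question file itself. As a sketch, the snippet below writes phoneme-category questions in the `QS "name" {patterns}` form used by HTK/HTS decision-tree clustering, assuming HTK-style `left-current+right` context labels. The question names and the `make_question` helper are hypothetical; the phone lists come from the phoneme table in this deck.

```python
# Wildcard patterns for left (L), current (C), and right (R) phoneme context,
# assuming labels of the form left-current+right.
CONTEXT_PATTERNS = {"L": "{}-*", "C": "*-{}+*", "R": "*+{}"}

# Initial categories taken from the phoneme table (retroflex and flat-tongue).
INITIAL_CATEGORIES = {
    "Retroflex": ["jr", "chr", "shr", "r"],
    "FlatTongue": ["tz", "tsz", "sz"],
}


def make_question(name, phones, position="C"):
    """Build one QS line asking whether the phone at `position` is in `phones`."""
    pattern = CONTEXT_PATTERNS[position]
    items = ",".join(pattern.format(p) for p in phones)
    return 'QS "%s-%s" {%s}' % (position, name, items)


questions = [
    make_question(name, phones, position)
    for position in ("L", "C", "R")
    for name, phones in INITIAL_CATEGORIES.items()
]
```

Note, phrase, and song questions would follow the same pattern against the corresponding fields of the full context label.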
40
Singing voice synthesis system - Refinement
Pitch-shift pseudo data
  Pitch coverage: use nearby notes from other songs and shift them to the corresponding frequency (Hz)
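A minimal sketch of the frequency arithmetic behind pitch-shifted pseudo data: shifting by n semitones multiplies F0 by 2^(n/12) in equal temperament. It assumes an F0 track in Hz with 0 marking unvoiced frames and A4 = 440 Hz tuning; the function names are hypothetical, and the spectral side of the pseudo data is not covered.

```python
def note_hz(midi):
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def shift_f0(f0_hz, semitones):
    """Shift an F0 track by a number of semitones; unvoiced frames (0) stay 0."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Shift a B4 segment (with one unvoiced frame) down one semitone towards A#4.
shifted = shift_f0([note_hz(71), 0.0, note_hz(71)], -1)
```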
41
Singing voice synthesis system - Refinement
42
Evaluation - Experimental conditions
Database condition
  Number of songs: 148
  Number of phonemes: 15300
  Number of words: 8443
  Number of notes: 9054
  Total time: about 100 minutes
Mel-cepstral analysis condition
  Frame shift: 5 ms
  Window length: 25 ms
  Window function: Blackman window
  MGC order: 49 (49-dim MFCCs)
  Sampling rate: 48 kHz
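For orientation, the framing implied by these conditions can be sketched in plain Python: at 48 kHz, a 5 ms shift is 240 samples and a 25 ms window is 1200 samples. The Blackman coefficients below are the standard textbook ones; the helper names are hypothetical and the mel-cepstral analysis itself is not shown.

```python
import math

SAMPLE_RATE = 48000
FRAME_SHIFT = SAMPLE_RATE * 5 // 1000      # 5 ms  -> 240 samples
WINDOW_LENGTH = SAMPLE_RATE * 25 // 1000   # 25 ms -> 1200 samples


def blackman(n):
    """Standard Blackman window of length n."""
    return [
        0.42
        - 0.5 * math.cos(2 * math.pi * k / (n - 1))
        + 0.08 * math.cos(4 * math.pi * k / (n - 1))
        for k in range(n)
    ]


def windowed_frames(signal):
    """Split a signal into overlapping, Blackman-windowed frames."""
    window = blackman(WINDOW_LENGTH)
    frames = []
    for start in range(0, len(signal) - WINDOW_LENGTH + 1, FRAME_SHIFT):
        chunk = signal[start:start + WINDOW_LENGTH]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# One second of silence yields (48000 - 1200) // 240 + 1 frames.
frames = windowed_frames([0.0] * SAMPLE_RATE)
```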
43
Evaluation - Experiment settings
Baseline
RQ: reduced question sets (reduction targets: duplicate questions, indirect questions, relative questions)
PS: pitch-shift pseudo data
VP: vibrato post-processing
44
Evaluation - Subjective evaluation
Pitch contour
[Figure: synthesized (baseline) vs. music score]
[Figure: synthesized (baseline) vs. original singing]
45
Evaluation - Subjective evaluation
Mean Opinion Score (MOS)
  10 synthesized songs
  12 subjects
  Quality and intelligibility evaluation
ABX test
  A subject is presented with two known samples: A, the reference, and B, the alternative. X is randomly selected from A and B, and the subject identifies X as being either A or B.
Quality MOS scale: Excellent 5, Good 4, Fair 3, Poor 2, Bad 1
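As a sketch of how such listening-test results are summarized, the snippet below computes the mean and variance of a list of 1-5 ratings and the fraction of correct ABX identifications. The ratings and answers are made-up illustration values, not data from this study, and the use of population variance is an assumption since the slides do not say which variance was reported.

```python
from statistics import mean, pvariance


def mos_summary(ratings):
    """Mean opinion score and (population) variance of 1-5 ratings."""
    return round(mean(ratings), 2), round(pvariance(ratings), 3)


def abx_accuracy(answers, truth):
    """Fraction of trials where the subject identified X correctly."""
    correct = sum(1 for a, t in zip(answers, truth) if a == t)
    return correct / len(truth)


# Hypothetical ratings from five listeners for one system.
example_mos = mos_summary([3, 3, 2, 4, 3])
# Hypothetical ABX trials: subject answers vs. the true identity of X.
example_abx = abx_accuracy(["A", "B", "B", "A"], ["A", "B", "A", "A"])
```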
46
Evaluation - Subjective
Quality evaluation
  System     Mean   Variance
  baseline   2.76   0.173
  RQ         2.49   0.132
  RQ+PS      3.04   0.141
Intelligibility evaluation
  System     Mean   Variance
  baseline   2.79   0.166
  RQ         2.75   0.187
  RQ+PS      3.11   0.008
47
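Demo - Outside test
Systems: baseline, baseline+RQ, baseline+RQ+PS
Test lyrics: 娃娃哭了叫媽媽 ("The doll cried and called for mama"); 推你摔下你又站起來 ("Pushed down, you stand up again")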
48
Evaluation - Subjective
The quality and intelligibility scores of RQ are lower than those of the baseline
  The questions we removed included information that is important for clustering
  Too few questions remained: 5364 -> 1257
  A better version of the reduced question set needs to be found
49
Preference test (naturalness): testing vibrato
  Different pitches and situations call for different vibrato settings
  Vibrato is not essential in children's songs
[Figure: original vs. vibrato]
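The slides do not describe how vibrato post-processing is done. One simple possibility, shown only as an assumption, is to modulate the generated F0 contour with a sinusoid; the rate and depth defaults below are hypothetical settings, and only the 5 ms frame shift comes from the analysis conditions.

```python
import math


def add_vibrato(f0_hz, frame_shift_s=0.005, rate_hz=5.5, depth_cents=50.0):
    """Apply sinusoidal vibrato to an F0 track; unvoiced frames (0) stay 0."""
    out = []
    for i, f in enumerate(f0_hz):
        if f <= 0:
            out.append(0.0)
            continue
        # Pitch deviation in cents at this frame (1200 cents = one octave).
        cents = depth_cents * math.sin(2 * math.pi * rate_hz * i * frame_shift_s)
        out.append(f * 2.0 ** (cents / 1200.0))
    return out


# One second of a flat A4 contour (200 frames at a 5 ms shift), then vibrato.
flat = [440.0] * 200
vibrato = add_vibrato(flat)
```

Because depth is expressed in cents, the same setting gives a proportionally equal pitch excursion on low and high notes; different settings per pitch or context would be passed in through `rate_hz` and `depth_cents`.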
50
Discussion
Singing corpus quality
  Recording in a professional environment
  Singer's timbre
Context factor coverage
  Too blurred
  Not enough training corpus
  Modeled with priority given to singing characteristics
51
Conclusion
A Mandarin corpus-based singing voice synthesis system based on hidden Markov models (HMMs) was implemented.
We defined the Mandarin model definition for singing and the question sets for model clustering.
We used three methods to refine our system: question set reduction, pitch-shift pseudo data, and vibrato post-processing.
52
Demo
Inside test: original vs. our system
Outside test: original vs. our system
Songs: 火車快飛, 小星星, 妹妹揹著洋娃娃, 三輪車, 康定情歌, 蝴蝶
53
Thanks for listening & comments