National Cheng Kung University, Tainan, TAIWAN

Presentation transcript:

Synthesis Unit and Question Set Definition for Mandarin HMM-based Singing Voice Synthesis
Student: Ju-Yun Cheng
Advisor: Prof. Chung-Hsien Wu
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, TAIWAN
Note: Problems using technology for system goal
Speaker notes: Hello, professors, senior labmates, classmates, and junior labmates. The title of the master's thesis I am presenting today is: xxxxxxx. Today's presenter is 純珊.

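Outline
Introduction
  Background
  Motivation
  Related work
Singing voice synthesis system
Evaluation
Discussion
Conclusion
Future work
Speaker notes: The outline covers (1) the background of the thesis, (2) its motivation and goals, (3) related research and references, (4) the problems to be solved and the proposed method, (5) the steps of the proposed method, (6) experiments and discussion, and (7) results.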

Introduction - Background
Speech and singing are both important ways to communicate and express emotion.
Speech synthesizers can generate fluent, natural speech, even with personal characteristics.
Singing voice synthesis has recently become a popular, emerging research topic.
  It enables computers to sing any song without an actual human performance.
http://zh.wikipedia.org/wiki/%E5%88%9D%E9%9F%B3%E6%9C%AA%E4%BE%86

Introduction - Background
There are two main methods in corpus-based singing synthesis.
Sample-based approach: unit selection
  Appropriate sub-word units are selected from large speech databases.
  Pros: high-quality speech at the waveform level
  Cons: requires a huge amount of recorded data; discontinuities; unstable quality; fixed voice characteristics
[Figure: sample-based synthesis block diagram. Block labels: lyrics, note, score editor, synthesis score, singer library, sample selection, concatenation, synthesis output]

Introduction - Background
Sample-based approach: unit selection
  Units are chosen from a singing voice corpus using the lyrics of the song and the corresponding MIDI file [Zhou, 2008].
Vocaloid
  A singing synthesizer developed by Yamaha Corporation, initially released in January 2004.
  Uses pitch conversion and timbre manipulation to concatenate samples smoothly.
  Name: Vocal + loid
[Zhou, 2008]: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4620864

Introduction - Background
There are two main methods in corpus-based singing synthesis.
Statistical approach: HMM-based
  Parameters are modeled with context-dependent HMMs, and waveforms are generated from the HMMs.
  Pros: relatively little training data; smooth and stable quality; flexible control of voice characteristics
  Cons: vocoder sound; over-smoothing
[Figure: HMM-based synthesis block diagram. Block labels: singing waveform, labels, parameter extraction, singing parameters, acoustic model training, acoustic model parameters, parameter generation, waveform generation, synthesis output]

Introduction - Background
Statistical approach: HMM-based
Sinsy
  A free online singing voice synthesis service that provides Japanese and English versions.
  Users obtain synthesized singing voices by uploading musical scores represented in MusicXML.

Introduction - Background
Other methods for singing voice synthesis
HNM (Harmonic plus Noise Model)
  HNM parameters of a source syllable are used to synthesize singing syllables of diverse pitches and durations [Gu, 2008].
Speech-to-singing
  The singing voice is synthesized by a parameter control model from the lyrics of a song and its musical score [Akagi, 2007].
  Lyrics are converted into speech by TTS; a melody control model then converts the speech signal into a singing voice by modifying the acoustic parameters [Cai, 2011].
[Gu, 2008]: Mandarin Singing-voice Synthesis Using an HNM Based Scheme, http://guhy.csie.ntust.edu.tw/pap/11_JNL_Mandarin_Singing-voice_Synthesis_Using_an_HNM_Based_Scheme.pdf
[Cai, 2011]: A Lyrics to Singing Voice Synthesis System with Variable Timbre, http://link.springer.com/content/pdf/10.1007%2F978-3-642-23220-6_23.pdf

Introduction - Motivation
To synthesize a smooth and continuous singing voice, we chose the HMM-based method to build our singing voice synthesis system.
HMMs can model the temporal sequence of the singing voice.
  Parameters are generated from an HMM composed by concatenating phoneme HMMs.
[Figure labels: HMM state sequence, state duration, spectral and lf0 (log F0) parameters]

Introduction - Improvement in Sinsy
A series of papers written by the team behind Sinsy:
[An HMM-based Singing Voice Synthesis System, 2006]
  The first paper on an HMM-based singing voice synthesis system.
[HMM-based Singing Voice Synthesis System using Pitch-shifted Pseudo Training Data, 2010]
  To increase the amount of F0 training data, pitch-shifted pseudo data are prepared by shifting F0 up or down in semitone steps.
[Recent Development of the HMM-based Singing Voice Synthesis System – Sinsy, 2010]
  Introduces the free online singing voice synthesis service.
[Pitch Adaptive Training For HMM-based Singing Voice Synthesis, 2012]
  Model-level normalization of pitch.

Singing voice synthesis system - Feature extraction
STRAIGHT [H. Kawahara 1997]
  A high-quality analysis-synthesis method that offers high flexibility in parameter manipulation with no further degradation.
  Extracts parameters with relatively good performance even in a non-professional recording environment.
  Features: pitch, smoothed spectrum, aperiodic factors
[Figure: STRAIGHT block diagram. Block labels: waveform, analysis, fixed-point analysis F0 extraction, F0, smoothed spectrum, aperiodic factors, synthesis, mixed excitation with phase manipulation, synthetic waveform]

Singing voice synthesis system - Proposed method for Mandarin singing Speech vs. Singing Pitch contour Database, Model definition, question set

Singing voice synthesis system - Proposed method for Mandarin singing
Speech vs. Singing
Music score attributes: pitch, duration, key, tempo, beat

Singing voice synthesis system - Proposed method for Mandarin singing
Different from Sinsy
  Language: from Japanese to Mandarin
  Database, model definition, question sets
  Refinement
Japanese syllabary – hiragana
  Japanese syllables are basically "consonant + vowel", with only five vowels.
Bopomofo
  37 existing symbols (21 initials, 16 finals)

Singing voice synthesis system - Proposed method for Mandarin singing
[Figure labels: acoustic parameters; model; question sets; linguistic info, note info, cue info; singing database]
Different from Sinsy; different from TTS; only for Mandarin; specially for singing

Singing voice synthesis system - System structure
[Figure: system block diagram. Block labels are listed per phase.]
Training phase
  Singing voice database
  Excitation, spectral, and aperiodicity parameter extraction
  Label, question set
  Training of HMMs
  CART-based state tying
  Context-dependent HMMs & duration models
Synthesis phase
  Musical score
  Conversion to label
  State selection by CART
  Parameter generation from HMM
  Excitation, spectral, and aperiodicity generation
  Synthesis filter
  Synthesized singing voice

Singing voice synthesis system - Proposed method for Mandarin singing
Singing voice database construction
  Building a singing voice database for training and synthesis
  MHMC Singing Voice Database
Mandarin singing model definition
  Initial and final modification
  Medial modification
  Long duration models
Question set definition for decision trees
  Modification for Mandarin
Refinements
  Pitch coverage by pitch-shift pseudo data
  Vibrato

Singing voice synthesis system - Singing voice database construction
Singing corpus design process
[Figure: corpus design flow. Block labels: music score corpus, songs selection, selected scores, phonetic transcription, singing signal, segmentation by phoneme, singing database]

Singing voice synthesis system - Singing voice database construction
Songs selection
Selecting scores
  Sources: music books and internet versions
Selection criteria and specialization
  Simple songs that do not require advanced singing skills
  Phone coverage
Digitizing
  Data format: MusicXML
  Transposition to an appropriate pitch range

Singing voice synthesis system - Model definition
MusicXML file
[Figure labels: sheet music score, key in, convert, MusicXML format]
MusicXML is an XML-based file format for representing Western musical notation. The format is proprietary, but fully and openly documented.
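The following is a minimal sketch of how note information could be read from a MusicXML score with Python's standard library. The embedded score fragment, the function name extract_notes, and the output field names are illustrative assumptions, not the extraction tool used in this work.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written MusicXML fragment (illustrative only, not from the corpus).
SAMPLE = """<score-partwise version="3.1">
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key><fifths>0</fifths></key>
        <time><beats>4</beats><beat-type>4</beat-type></time>
      </attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>1</duration><type>quarter</type>
        <lyric><text>xiao</text></lyric>
      </note>
      <note>
        <pitch><step>G</step><octave>4</octave></pitch>
        <duration>2</duration><type>half</type>
        <lyric><text>mi</text></lyric>
      </note>
      <note><rest/><duration>1</duration><type>quarter</type></note>
    </measure>
  </part>
</score-partwise>"""


def extract_notes(xml_text):
    """Return one dict per <note>: measure number, pitch name, duration, type, lyric."""
    root = ET.fromstring(xml_text)
    notes = []
    for measure in root.iter("measure"):
        for note in measure.findall("note"):
            pitch = note.find("pitch")
            if pitch is None:
                name = "rest"  # a <note> without <pitch> is treated as a rest here
            else:
                step = pitch.findtext("step")
                alter = int(pitch.findtext("alter", default="0"))
                accidental = {-1: "b", 0: "", 1: "#"}.get(alter, "")
                name = f"{step}{accidental}{pitch.findtext('octave')}"
            notes.append({
                "measure": measure.get("number"),
                "pitch": name,
                "duration": int(note.findtext("duration")),
                "type": note.findtext("type"),
                "lyric": note.findtext("lyric/text"),  # None when no lyric is attached
            })
    return notes


for n in extract_notes(SAMPLE):
    print(n)
```

The sketch reads only the elements needed for note pitch, note type, and measure; a real score would also carry tempo and key changes that need separate handling.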

Singing voice synthesis system - Singing voice database construction
Singer selection and data processing
Finding candidates to record demos
  4 candidates
Choosing the singer
  Accuracy of pitch
  Timbre
Checking recorded data
  Noise must not exceed the recording criterion
Segmentation and normalization
  Segmentation by phoneme
  Reduce the energy of the singing voice data to avoid sudden jumps in loudness
  A pitch range that is too wide leads to poor synthesis

Singing voice synthesis system - Singing voice database
NCKU Singing Voice Database
We chose 74 songs whose lyrics together cover all Mandarin phonemes.
  Songs: nursery rhymes / children's songs
  Total: 148 songs
  Singer: one female
  Pitch range: C4~B4
  Version: 1, 2
  Total time: about 102 minutes
  Sample rate: 48 kHz
  Resolution: 16 bits
  Channels: mono
  File name data: 小蜜蜂 (Little Bee), 兩隻老虎 (Two Tigers), 火車快飛 (Train, Fly Fast)

Singing voice synthesis system - Model definition
[Figure: label generation flow. Block labels: text, MusicXML, wav; Word to Phone; Extract Scores Information; Note Absolute Pitch, Note Type, Measure, Song Settings; transcription; Initial and final Processing, Long duration Processing, Riffs and runs Processing, Pause Processing; Note Calculation, User-defined phrase units; Note Pitch, Note Duration, Song Structure; linguistic information, note information, cue information; Label]

Singing voice synthesis system - Model definition
Initial and final processing
Tone
  In singing, the pitch of the note matters more than the original lexical tone of the word, so the tone distinction is dropped.
  E.g. 不: speech -> bu wuH wuL; singing -> bu wu
Vowel
  We define the phonemes by phonology.
  The medial is grouped with the rime rather than with the initial.
  When yi (ㄧ), wu (ㄨ), or yu (ㄩ) is the medial, the medial and the rime together are treated as one kind of final.
  http://zh.wikipedia.org/wiki/%E4%BB%8B%E9%9F%B3
[Figure labels: 介音 (medial), speech, singing]

Singing voice synthesis system - Model definition
Initial and final processing
Single initial
  A syllable that has only an initial and no final is pronounced with an empty rime "帀".
  Retroflex initials (ㄓ ㄔ ㄕ ㄖ) + zr
  Flat-tongue initials (ㄗ ㄘ ㄙ) + sr
Total phonemes: 59 (speech: 66)
Initials
  ㄅ b, ㄆ p, ㄇ m, ㄈ f, ㄉ d, ㄊ t, ㄋ n, ㄌ l, ㄍ g, ㄎ k, ㄏ h, ㄐ j, ㄑ ch
  ㄒ sh, ㄓ jr, ㄔ chr, ㄕ shr, ㄖ r, 帀1 zr, ㄗ tz, ㄘ tsz, ㄙ sz, 帀2 sr
Finals
  ㄧ yi, ㄨ wu, ㄩ yu, ㄚ a, ㄛ o, ㄜ e, ㄞ ai, ㄟ ei, ㄠ au, ㄡ ou, ㄢ an, ㄣ en, ㄤ ang, ㄥ ng, ㄦ er, ㄝ eh
Finals with medial
  ㄧㄚ ia, ㄧㄝ ieh, ㄧㄠ iau, ㄧㄡ iou, ㄧㄢ ian, ㄧㄣ ien, ㄧㄤ iang, ㄧㄥ ing
  ㄨㄚ ua, ㄨㄛ uo, ㄨㄞ uai, ㄨㄟ uei, ㄨㄢ uan, ㄨㄣ uen, ㄨㄤ uang, ㄨㄥ ung
  ㄩㄝ iueh, ㄩㄢ iuan, ㄩㄣ iuen, ㄩㄥ iung

Singing voice synthesis system - Singing voice database
Phonetic coverage
[Figure categories: initial, final, final containing medial]
  Phones: 59
  Total phones: 15300
  Total words: 8448
  Songs: 148

Singing voice synthesis system - Model definition
Long duration model
  Long notes are important for expressive singing.
  Shorter notes end quickly, with no special effects; long notes give more room for expression.
  Lengthening a short-note model cannot fully represent a long note.
  Half or whole note -> Final + "L"
  Examples: 一起飛 ("fly together"), 飛就飛 ("just fly"), 叫就叫 ("just call")
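The rule "half or whole note -> Final + L" can be sketched as a small lookup. The function name and the note-type strings (which follow MusicXML's "half" and "whole" type names) are assumptions made for illustration.

```python
# Note types that select the separate long-duration model (from the rule above).
LONG_NOTE_TYPES = {"half", "whole"}


def final_model_name(final, note_type):
    """Append "L" to the final's model name when it is sung on a long note."""
    return final + "L" if note_type in LONG_NOTE_TYPES else final


print(final_model_name("ei", "half"))     # long note: uses the "L" variant
print(final_model_name("ei", "quarter"))  # short note: plain final model
```

Keeping a separate model name means the long-note variant is trained only on long notes instead of being stretched from short-note statistics.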

Singing voice synthesis system - Model definition
Riffs and runs processing
  A syllable may correspond to multiple notes.
  The last tonal is repeated.
Pause processing
  Represents the breathing pauses or segment pauses that occur in human singing.
  If the singer pauses for longer than a threshold (> 0.3 seconds), a rest is inserted.
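A minimal sketch of the pause rule, assuming the segmentation is available as (label, start, end) tuples in seconds. The tuple layout, the function name, and the "rest" label are hypothetical; only the 0.3-second threshold comes from the slide.

```python
PAUSE_THRESHOLD = 0.3  # seconds, as stated above


def insert_rests(segments, threshold=PAUSE_THRESHOLD):
    """Insert a ("rest", start, end) entry wherever the silent gap between
    two consecutive segments is longer than the threshold."""
    out = []
    prev_end = None
    for label, start, end in segments:
        if prev_end is not None and start - prev_end > threshold:
            out.append(("rest", prev_end, start))
        out.append((label, start, end))
        prev_end = end
    return out


segments = [("bu", 0.0, 0.5), ("wu", 0.5, 1.0), ("f", 1.5, 1.6)]
for seg in insert_rests(segments):
    print(seg)
```

In this example only the 0.5-second gap before "f" exceeds the threshold, so a single rest is added between 1.0 and 1.5 seconds.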

Singing voice synthesis system - Model definition
Linguistic information
Phoneme
  Current phoneme and the two {preceding, succeeding} phonemes
Syllable
  Number of phonemes in the {preceding, current, succeeding} syllable
Phrase
  Number of phonemes/syllables in the {preceding, current, succeeding} phrase
Song
  Average number of phonemes/syllables per measure in this song
  Number of phrases in this song
Riffs and runs
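The phoneme-level and syllable-level context described above can be sketched as follows. The padding symbol "x", the function names, and the tuple layout are assumptions for illustration and do not reproduce the actual label format.

```python
def phoneme_contexts(phonemes, pad="x"):
    """Return (prev2, prev1, current, next1, next2) for every phoneme,
    padding the sequence edges with a placeholder symbol."""
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


def syllable_sizes(syllables):
    """Return (preceding, current, succeeding) phoneme counts per syllable,
    using 0 where there is no neighbouring syllable."""
    counts = [0] + [len(s) for s in syllables] + [0]
    return [(counts[i - 1], counts[i], counts[i + 1])
            for i in range(1, len(counts) - 1)]


syllables = [["b", "wu"], ["f", "ei"], ["a"]]
flat = [p for syl in syllables for p in syl]
print(phoneme_contexts(flat)[0])  # context window of the first phoneme
print(syllable_sizes(syllables))
```

The same windowing idea extends to phrase-level and song-level counts by swapping the unit being counted.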

Singing voice synthesis system - Model definition
Singing is the act of producing musical sounds with the voice; it augments regular speech through the use of both tonality and rhythm.
Note pitch
  Pitches are compared as "higher" and "lower" in the sense associated with musical melodies.
Note duration
  An amount of time or a particular time interval: the length of a note and one of the bases of rhythm.
Song structure
  The overall musical form or structure that the song adopts
  The order of a music score

Singing voice synthesis system - Model definition
User-defined phrase units
  Phrasing may be necessary for the singer to take catch breaths or to achieve a certain style.
  In music, a phrase is defined as "a short passage or segment, often consisting of four measures or forming part of a smaller/larger unit".
  We define the phrase unit according to the song structure.
  Used in the outside label to represent breathing pauses.
  Examples: 4 measures / phrase, 2 measures / phrase
http://www.irenejackson.com/phrasing.html

Singing voice synthesis system - Model definition
Note calculation
  The basic information is not enough to describe a note completely.
Relative pitch
  The difference between the key note and the current note
  The key note depends on the number of sharps or flats
Note position
  Different note positions in the measure or phrase may be expressed differently because of breathing
  Units: note, 0.1 second, thirty-second note, %
Note length
  0.1 second (absolute length), thirty-second note (relative length)
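A minimal sketch of the relative pitch calculation, assuming the key is major and is given as a signed count of sharps (positive) or flats (negative), as in the MusicXML fifths element. The function names and the major-key assumption are illustrative; the source only states that the key note depends on the number of sharps or flats.

```python
# Pitch class (semitones above C) of each natural note name.
PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def tonic_pitch_class(fifths):
    """Key note of a major key with `fifths` sharps (>0) or flats (<0).
    Each added sharp moves the tonic up a perfect fifth (7 semitones)."""
    return (7 * fifths) % 12


def relative_pitch(step, alter, fifths):
    """Semitone distance (0-11) from the key note up to the current note."""
    return (PITCH_CLASS[step] + alter - tonic_pitch_class(fifths)) % 12


print(relative_pitch("G", 0, 0))   # G in C major
print(relative_pitch("G", 0, 1))   # G in G major (one sharp): the key note itself
print(relative_pitch("A", 0, -1))  # A in F major (one flat)
```

Folding the result into 0-11 makes the feature independent of the octave and of the transposition chosen for each song.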

Singing voice synthesis system - Model definition
Note information
Note pitch
  Absolute pitch (C0-G9), relative pitch (0-11), and the pitch difference between previous & current and between current & next notes
Note duration
  Length of note counted by syllable, thirty-second note, and 0.1 second
Song structure
  Beat: 2/4, 3/4, 4/4
  Tempo: 90, 100, 120
  Key
Position
  Counted by note, 0.1 second, thirty-second note, and percentage in the measure/phrase
  Number of phrases
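The two note-length units can be illustrated with a short sketch. It assumes that the tempo counts quarter notes per minute and that lengths are rounded to the nearest 0.1 second; both are assumptions, as are the function name and the note-type strings.

```python
# Length of each note type in thirty-second-note units (relative length).
THIRTY_SECONDS_PER_TYPE = {
    "whole": 32, "half": 16, "quarter": 8, "eighth": 4, "16th": 2, "32nd": 1,
}


def note_length(note_type, tempo_bpm):
    """Return (relative length in thirty-second notes,
    absolute length in units of 0.1 second)."""
    units_32 = THIRTY_SECONDS_PER_TYPE[note_type]
    seconds = (units_32 / 8) * (60.0 / tempo_bpm)  # 8 thirty-seconds per quarter
    return units_32, round(seconds * 10)


for tempo in (90, 100, 120):
    print(tempo, note_length("half", tempo))
```

The relative length is the same at every tempo, while the absolute length changes, which is why both units are kept as separate features.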

Singing voice synthesis system - Question sets definition
Question sets definition for singing model clustering
(1) Phoneme (current and the two {preceding, succeeding} phonemes)
  Final: with or without medial
  Initial
  Initials pronunciation category
  Finals pronunciation category
(2) Note
  Pitch, tempo, beat, duration, position
(3) Phrase
  Number of phonemes/syllables in the preceding, current, succeeding phrase
(4) Song
  Number of phonemes/syllables
  Number of phrases

Singing voice synthesis system - Refinement
Pitch-shift pseudo data
  Pitch coverage is improved by taking nearby notes from other songs and shifting them to the corresponding frequency (Hz).
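A minimal sketch of shifting an F0 contour by a whole number of semitones, using the equal-temperament ratio 2^(n/12). Representing unvoiced frames as 0 and leaving them unchanged is an assumption, and the function name is hypothetical.

```python
def shift_f0(f0_hz, semitones):
    """Shift every voiced F0 value (in Hz) by `semitones`.
    Frames with F0 <= 0 are treated as unvoiced and kept at 0."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [0.0, 261.63, 262.0, 0.0]  # roughly C4 with unvoiced edges
print(shift_f0(contour, 2))           # pseudo data two semitones higher
print(shift_f0(contour, -1))          # pseudo data one semitone lower
```

Shifting by only a few semitones keeps the pseudo data close to what the singer actually produced, which is why nearby notes are the natural source.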

Singing voice synthesis system - Refinement  

Evaluation
Experimental conditions
Database condition
  Number of songs: 148
  Number of phonemes: 15300
  Number of words: 8443
  Number of notes: 9054
  Total time: about 100 minutes
Mel-cepstral analysis condition
  Sampling rate: 48 kHz
  Frame shift: 5 ms
  Window length: 25 ms
  Window function: Blackman window
  MGC order: 49 dim MFCCs
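The analysis conditions translate into frame sizes in samples as sketched below. The Blackman window uses the standard textbook coefficients (0.42, 0.5, 0.08); whether the analysis tool used exactly this variant is an assumption, and the function names are hypothetical.

```python
import math


def frame_params(sample_rate=48000, shift_ms=5, window_ms=25):
    """Return (frame shift, window length) in samples."""
    return sample_rate * shift_ms // 1000, sample_rate * window_ms // 1000


def blackman(n):
    """Symmetric Blackman window of n samples (n >= 2)."""
    return [
        0.42
        - 0.5 * math.cos(2 * math.pi * i / (n - 1))
        + 0.08 * math.cos(4 * math.pi * i / (n - 1))
        for i in range(n)
    ]


shift, length = frame_params()
window = blackman(length)
print(shift, length)          # samples per frame shift and per window
print(len(window), max(window))
```

At 48 kHz, a 5 ms shift is 240 samples and a 25 ms window is 1200 samples, so consecutive frames overlap by 960 samples.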

Evaluation
Experiment settings
  Baseline
  RQ: Reduced Question sets (duplicate questions, indirect questions, relative questions)
  PS: Pitch-shift pseudo data
  VP: Vibrato post-processing

Evaluation - Subjective evaluation Pitch contour Synthesized (baseline) vs. Music score Synthesized (baseline) vs. Original singing

Evaluation - Subjective evaluation
Mean Opinion Score (MOS)
  10 synthesized songs
  12 subjects
  Quality and intelligibility evaluation
ABX test
  A subject is presented with two known samples (A, the reference, and B, the alternative). X is randomly selected from A and B, and the subject identifies X as being either A or B.
Quality / MOS
  Excellent: 5
  Good: 4
  Fair: 3
  Poor: 2
  Bad: 1
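A minimal sketch of summarizing MOS ratings as a mean and a variance. The ratings below are made-up placeholders, not the listeners' data, and the use of the population variance is an assumption because the slides do not say which variance was reported.

```python
import statistics


def mos_summary(ratings):
    """Return (mean, population variance) of 1-5 opinion scores."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must lie between 1 (Bad) and 5 (Excellent)")
    return statistics.mean(ratings), statistics.pvariance(ratings)


hypothetical_ratings = [3, 4, 2, 3]  # placeholder scores for one system
mean, variance = mos_summary(hypothetical_ratings)
print(mean, variance)
```

With 10 songs and 12 subjects, each system would contribute 120 such ratings per criterion if every subject rated every song.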

Evaluation - Subjective
Quality evaluation
  System    Mean  Variance
  baseline  2.76  0.173
  RQ        2.49  0.132
  RQ+PS     3.04  0.141
Intelligibility evaluation
  System    Mean  Variance
  baseline  2.79  0.166
  RQ        2.75  0.187
  RQ+PS     3.11  0.008

Demo
Outside test
Systems: baseline, baseline+RQ, baseline+RQ+PS
Lyrics: 娃娃哭了 叫媽媽 ("the doll cries, calling for mama"); 推你摔下 你又站起來 ("pushed down, you stand up again")

Evaluation - Subjective
The quality and intelligibility scores of RQ are lower than those of the baseline.
  The questions we removed included information that is important for classification.
  Too few questions remain: 5364 -> 1257.
  A better version of the reduced question set needs to be found.

Preference test
Naturalness: testing vibrato
  Different pitches and situations call for different vibrato settings.
  Vibrato is not essential in children's songs.
Samples: original, vibrato

Discussion
Singing corpus quality
  Too blurred
  Recording in a professional environment
  Singer's timbre
Context factor coverage
  Too blurred
  Not enough training corpus
  Model with priority given to singing characteristics

Conclusion
A Mandarin corpus-based singing voice synthesis system based on hidden Markov models (HMMs) was implemented.
We defined the Mandarin model units for singing and the question sets for model clustering.
We used three methods to refine our system: question set reduction, pitch-shift pseudo data, and vibrato post-processing.

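Demo
Inside test and outside test (original vs. our system)
Songs: 火車快飛 (Train, Fly Fast), 小星星 (Little Star), 妹妹揹著洋娃娃 (Little Sister Carrying a Doll), 三輪車 (Tricycle), 康定情歌 (Kangding Love Song), 蝴蝶 (Butterfly)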

Thanks for listening & comments