Mr. Darko Pekar, Speech Morphing Inc.


Towards natural voices created from very short recordings: new developments in NN parametric TTS and voice conversion
Mr. Darko Pekar, Speech Morphing Inc.

Contents
- Text-to-Speech development in SMI: HMM-TTS, Hybrid TTS, neural network based TTS
- Improvements to TTS: separate modelling of spectral details
- Voice conversion: simple approach

Hybrid TTS
Combine unit selection and HMM:
- Train HMM-TTS
- Parametrize the whole database
- Perform state-level alignment of the database
- Generate initial parameters using the HMM
- Search for optimal (real) segments (arrays of states) in the database; the search is based on both target and concatenation costs
- Collect the parameters of the selected segments
- Synthesize speech
Motives:
- The approach has the potential to reach high quality
- It remains flexible, since it is parametric
(Figure: all units (states) in the database)
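The segment search described above can be viewed as a Viterbi search over candidate database units, trading off target cost (distance to the HMM-generated parameters) against concatenation cost (mismatch between consecutive units). A minimal NumPy sketch, with plain Euclidean distances standing in for the real cost functions, which the slides do not specify:

```python
import numpy as np

def select_units(candidates, target_feats, w_concat=1.0):
    """Pick one candidate unit per position minimizing target +
    concatenation cost with dynamic programming (Viterbi search).

    candidates: list over positions; entry i is an array of shape
        (n_candidates_i, dim) with acoustic features of database units.
    target_feats: array (n_positions, dim) of HMM-generated targets.
    Returns the index of the chosen candidate at each position.
    """
    n = len(candidates)
    # target cost: distance of each candidate to the HMM prediction
    tcost = [np.linalg.norm(c - target_feats[i], axis=1)
             for i, c in enumerate(candidates)]
    best = [tcost[0]]          # accumulated cost per candidate
    back = []                  # backpointers per position
    for i in range(1, n):
        # concatenation cost between all consecutive candidate pairs
        ccost = np.linalg.norm(candidates[i][None, :, :] -
                               candidates[i - 1][:, None, :], axis=2)
        total = best[-1][:, None] + w_concat * ccost + tcost[i][None, :]
        back.append(total.argmin(axis=0))   # best predecessor per unit
        best.append(total.min(axis=0))
    # backtrack the optimal path
    path = [int(best[-1].argmin())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

In the real system the "candidates" would be state-level segments and the costs would combine spectral, f0 and duration terms, but the dynamic-programming structure is the same.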

Neural network based TTS
Architecture used:
- Text analysis produces phonetic content and prosodic context
- 563 linguistic features (LFs) are extracted based on that
- The LFs are fed into an NN which predicts state durations
- An acoustic NN predicts MGCs, f0 and aperiodicity per frame, based on the LFs and state durations
- Optional MLPG and post-filtering are applied
- The acoustic parameters are fed to a vocoder (WORLD)
(Diagram: phoneme-level linguistic features → duration NN → state durations → frame-level linguistic features → acoustic NN → acoustic features → waveform)
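The two-network pipeline above can be sketched end to end: a duration network maps phoneme-level linguistic features to per-state frame counts, those counts are unrolled to frame level, and an acoustic network predicts one acoustic vector per frame. The sketch below uses untrained random weights, a single hidden layer per network, and a hypothetical within-state position feature; layer sizes other than the 563 linguistic features are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """One-hidden-layer network: tanh hidden, linear output."""
    return np.tanh(x @ w1) @ w2

# 563 LFs as in the slides; 5 states/phoneme and a 60-dim acoustic
# frame (MGCs + f0 + aperiodicity) are illustrative assumptions
N_LF, N_STATES, N_ACOUSTIC = 563, 5, 60

# untrained weights stand in for the trained duration/acoustic NNs
w1_d, w2_d = rng.normal(size=(N_LF, 64)), rng.normal(size=(64, N_STATES))
w1_a, w2_a = rng.normal(size=(N_LF + 1, 64)), rng.normal(size=(64, N_ACOUSTIC))

def synthesize(phone_lf):
    """phone_lf: (n_phones, N_LF) linguistic features per phoneme."""
    frames = []
    # 1) duration NN predicts frames per state for each phoneme
    durs = np.maximum(1, mlp(phone_lf, w1_d, w2_d).round().astype(int))
    for lf, state_durs in zip(phone_lf, durs):
        for d in state_durs:
            for pos in range(d):
                # 2) frame-level input: LFs + within-state position
                frame_in = np.concatenate([lf, [pos / d]])
                frames.append(mlp(frame_in, w1_a, w2_a))
    # 3) this matrix would go through MLPG / post-filtering / WORLD
    return np.stack(frames)

acoustic = synthesize(rng.normal(size=(3, N_LF)))
```

Each row of `acoustic` corresponds to one 5 ms frame; in the real system these parameter trajectories are handed to the WORLD vocoder.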

Comparison: HMM vs. Hybrid vs. Neural networks

Separate modelling of spectral details
Reasons for smoothing:
- Averaging over similar phone realizations
- NNs are designed to generalize; fine details are treated as noise
Solution:
- Train the spectral details separately from the smooth spectrum
- Use the smooth spectrum as input when training the fine details
- Linguistic features can serve as an additional input
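The decomposition behind this idea can be sketched numerically: split each log-spectral frame into a smooth part (the target of the first acoustic network) and a residual detail part (the target of the second). A moving average over frequency bins stands in here for whatever smoothing the real system uses, which the slides do not specify:

```python
import numpy as np

def smooth(spec, width=9):
    """Smoothing stand-in: moving average over frequency bins."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 1, spec)

# random data stands in for real log spectra (frames x FFT bins)
rng = np.random.default_rng(1)
log_spec = rng.normal(size=(100, 257))

smooth_part = smooth(log_spec)        # training target of NNa1
details = log_spec - smooth_part      # training target of NNa2
# NNa2 learns the details from the smooth spectrum (+ optional LFs);
# at synthesis time: output = NNa1(LF) + NNa2(NNa1(LF), LF)
reconstructed = smooth_part + details
```

Because the split is exactly additive, the second network only has to model what the first one averages away, rather than re-learning the whole spectrum.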

Separate modelling of spectral details (pipeline)
Text analysis → phoneme-level linguistic features → phoneme/state duration prediction (NNd) → state durations → frame-level linguistic features → smooth spectrum prediction (NNa1) → smooth spectrum → spectral details prediction (NNa2) → spectral details → vocoder

Comparison: baseline vs. with spectral details

Voice conversion
Steps:
- Train an NN on 4+ hours of speech (talent voice)
- Record and annotate ~10 min of the new speaker
- Continue training the NN on those new samples
Problems:
- Overfitting
- Inability to generalize
- Alignment
- Less articulated training utterances
- Prosodic and phonetic coverage
(Diagram: 4 h of talent voice + 10 min of new speaker → synthesis with the new voice)
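The adaptation step above is continued training on a tiny dataset, where overfitting is the main risk. A common guard, sketched below, is to fine-tune with a much lower learning rate and to freeze the lower layers, updating only the output layer on the new speaker's data. The slides do not say which regularization SMI uses, so this freezing strategy, the tiny toy network, and the random stand-in data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def train(x, y, w1, w2, lr, steps, freeze_hidden=False):
    """SGD on a one-hidden-layer regression net (tanh hidden).
    freeze_hidden=True updates only the output layer."""
    for _ in range(steps):
        h = np.tanh(x @ w1)
        err = h @ w2 - y                     # prediction error
        if not freeze_hidden:
            g = (err @ w2.T) * (1 - h ** 2)  # backprop through tanh
            w1 -= lr * x.T @ g / len(x)
        w2 -= lr * h.T @ err / len(x)
    return w1, w2

# 1) pretrain on the large talent-voice corpus (stand-in random data)
x_big, y_big = rng.normal(size=(2000, 20)), rng.normal(size=(2000, 5))
w1, w2 = rng.normal(size=(20, 32)) * 0.1, rng.normal(size=(32, 5)) * 0.1
w1, w2 = train(x_big, y_big, w1, w2, lr=0.02, steps=200)

# 2) adapt on ~10 min of the new speaker: low LR, frozen hidden layer
x_new, y_new = rng.normal(size=(50, 20)), rng.normal(size=(50, 5))
w1, w2 = train(x_new, y_new, w1, w2, lr=0.002, steps=50,
               freeze_hidden=True)
```

With the hidden layer frozen, adaptation is a small linear problem in the pretrained hidden features, which is far harder to overfit with only minutes of speech than retraining the whole network.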

Comparison: original vs. 4 h TTS vs. 10 min VC
Voices: Trevor, President 1, President 2, Not president 3

Thank you!