High Quality Voice Morphing


High Quality Voice Morphing
Hui Ye & Steve Young
Cambridge University Engineering Department
August 2004

Baseline System
A Pitch-Synchronous Harmonic Model is used for speech representation and modification.
Audio examples: pitch scale x1.7, time scale x1.7.
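As an illustration of what a harmonic model of speech means, here is a minimal numpy sketch: a voiced frame is represented as a sum of harmonics of the fundamental F0, and pitch scaling resynthesizes the same harmonic amplitudes at a scaled F0. The function name, frame length, and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

def harmonic_frame(f0, amps, phases, n, fs=16000):
    """Synthesize one frame as a sum of harmonics of f0."""
    t = np.arange(n) / fs
    frame = np.zeros(n)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        frame += a * np.cos(2 * np.pi * k * f0 * t + p)
    return frame

f0, fs = 120.0, 16000
amps = np.array([1.0, 0.5, 0.25])     # toy harmonic amplitudes
phases = np.zeros(3)
orig = harmonic_frame(f0, amps, phases, 400, fs)
# Pitch-scale by 1.7: move the harmonics to 1.7*f0 while keeping the
# same amplitudes (i.e. the same spectral envelope) and phases.
shifted = harmonic_frame(1.7 * f0, amps, phases, 400, fs)
```

Time scaling in such a model amounts to repositioning the pitch-synchronous frames on a stretched time axis rather than changing the harmonics themselves.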

Transform-based Conversion
Training: extract the spectral envelopes of the source and target speakers, time-align them, and estimate the transforms.
Conversion: extract the source speaker's spectral envelopes, apply the spectral envelope conversion, and synthesize the converted speech.
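The training step above can be sketched as fitting a linear transform between time-aligned source and target envelope vectors by least squares. This is a toy stand-in with synthetic data; the actual system uses class-dependent transforms under a GMM, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy time-aligned envelope features: one row per aligned frame pair.
X = rng.normal(size=(200, 8))                       # source features
W_true = rng.normal(size=(8, 8))                    # hidden "true" mapping
Y = X @ W_true + 0.01 * rng.normal(size=(200, 8))   # target features

# Least-squares estimate of the conversion transform.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
converted = X @ W                                   # converted envelopes
```

The least-squares fit averages over all aligned frames, which is exactly the effect blamed later in the talk for broadening spectral peaks.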

Analysis of the distortion in the baseline suggested three problem areas:
1. Spectral distortion
2. Unnatural phase dispersion
3. Transformation of unvoiced sounds
Solutions have therefore been developed in each of these areas.

1. Spectral Distortion
The formant structure has been transformed.
Spectral details are lost due to the reduced LSF dimensionality.
Spectral peaks are broadened by the averaging effect of least-squares error estimation.

Spectral Residual Selection
Idea: reintroduce the lost spectral details into the converted envelopes.
A codebook selection method is used to construct a residual.
Post-filtering then applies a perceptual filter to the converted spectral envelope.
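The codebook selection idea can be sketched as follows: each codebook entry pairs a smooth envelope with the fine spectral detail (residual) that the low-dimensional representation discarded; at conversion time the residual of the nearest envelope entry is added back. A minimal sketch with synthetic vectors, assuming a simple Euclidean nearest-neighbour lookup:

```python
import numpy as np

def select_residual(env, env_codebook, res_codebook):
    """Return the stored residual whose envelope entry is closest to env."""
    dists = np.linalg.norm(env_codebook - env, axis=1)
    return res_codebook[np.argmin(dists)]

rng = np.random.default_rng(1)
env_cb = rng.normal(size=(16, 10))   # codebook of smooth envelopes
res_cb = rng.normal(size=(16, 10))   # the fine detail lost by each entry
query = env_cb[3] + 0.01 * rng.normal(size=10)   # a converted envelope
detail = select_residual(query, env_cb, res_cb)
restored = query + detail            # envelope with detail reintroduced
```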

2. Unnatural Phase Dispersion
In the baseline system, the converted spectral envelope was combined with the original phases, which gives the converted speech a "harsh" quality. The spectral magnitudes and phases of human speech are highly correlated, but modelling both simultaneously and converting them via a single unified transform is extremely difficult.

Phase Prediction
If we can predict the waveform shape, then we can predict the phases.
[Diagram: a GMM-based estimator soft-classifies the target spectral envelope v_t, producing posteriors P(C_1|v_t) … P(C_M|v_t); these weights combine the template signals T_1 … T_M into a predicted waveform S_t', from which the phases Φ_t are extracted.]

Phase Prediction Implementation
The set of template signals (codebook entries) T = [T_1, …, T_M] can be estimated by minimizing the waveform-shape prediction error.
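The prediction step in the diagram above can be sketched directly: the predicted pitch cycle is the posterior-weighted sum of the templates, and its phases come from a Fourier transform of that predicted waveform. The posteriors here are hard-coded stand-ins for GMM outputs, and template estimation itself is omitted.

```python
import numpy as np

def predict_waveform(posteriors, templates):
    """Predicted pitch cycle: posterior-weighted sum of template signals."""
    return posteriors @ templates

M, n = 4, 64
rng = np.random.default_rng(2)
templates = rng.normal(size=(M, n))       # stand-ins for T_1 ... T_M
post = np.array([0.7, 0.2, 0.05, 0.05])   # stand-in for P(C_m | v_t)

s_pred = predict_waveform(post, templates)
phases = np.angle(np.fft.rfft(s_pred))    # predicted harmonic phases
```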

Phase Prediction Result: phase prediction vs. copying the source phases
SNR: phase prediction 7.2, copied source phases 3.2.
[Waveform plots (amplitude vs. time) compare the original signal with the two reconstructions.]

Phase Prediction Result: phase prediction vs. phase codebook
SNR: phase prediction 7.2, phase codebook 6.1.
[Waveform plots (amplitude vs. time) compare the original signal with the codebook reconstruction.]
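The SNR figures quoted on these slides measure how well the predicted waveform shape matches the original. A minimal sketch of such a waveform-shape SNR (the exact definition used in the paper may differ):

```python
import numpy as np

def shape_snr(original, predicted):
    """SNR in dB of a predicted waveform against the original."""
    noise = original - predicted
    return 10 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

t = np.linspace(0, 1, 200, endpoint=False)
orig = np.sin(2 * np.pi * 3 * t)
pred = 0.9 * orig            # a prediction with a 10% amplitude error
snr = shape_snr(orig, pred)  # exactly 20.0 dB for this toy example
```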

3. Transforming Unvoiced Sounds
In our baseline system, the unvoiced sounds are not transformed. In reality, many unvoiced sounds carry some vocal tract colouring that affects the speech characteristics. A unit selection approach was therefore developed to transform the unvoiced sounds.
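Unit selection means choosing, for each incoming unvoiced segment, a stored target-speaker unit that balances a target cost (match to the input) against a concatenation cost (smoothness of adjacent choices). A greedy toy sketch, with all names, features, and the cost weighting being illustrative rather than the paper's formulation:

```python
import numpy as np

def select_units(source_feats, unit_feats, concat_weight=0.5):
    """Greedily pick, frame by frame, the stored unit minimising a
    target cost plus a simple concatenation cost to the previous unit."""
    chosen, prev = [], None
    for f in source_feats:
        cost = np.linalg.norm(unit_feats - f, axis=1)   # target cost
        if prev is not None:                            # join smoothness
            cost = cost + concat_weight * np.linalg.norm(
                unit_feats - unit_feats[prev], axis=1)
        prev = int(np.argmin(cost))
        chosen.append(prev)
    return chosen

units = np.array([[0.0], [1.0], [2.0]])   # toy inventory of unit features
frames = np.array([[0.1], [1.9]])         # incoming unvoiced frames
selected = select_units(frames, units)    # indices of the chosen units
```

A full system would use dynamic programming over the whole utterance rather than a greedy pass, but the cost trade-off is the same.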

Experiments
Training data: OGI Voice Corpus – 12 speakers, each with about 5 minutes of parallel speech data.
Four conversion tasks: male to male, male to female, female to male, female to female.

Subjective Evaluation
ABX test (identify the target speaker): baseline system 86.4%, enhanced system 91.8%.
Preference test (which is more natural): baseline system 38.9%, enhanced system 61.1%.

Examples: Voice Transformation with parallel training data
For each of the four tasks (M to F, F to M, F to F, M to M), audio examples compare the source, the baseline with shifted pitch, the enhanced system with target prosody, and the target.

Unknown Speaker Voice Transformation
No pre-existing training data is available from the source speaker, although there is still a reasonable amount of speech data from the designated target speaker. Speech recognition is used to create a mapping between the unknown input source speech and the target vectors.
[Audio examples: source (female), converted, target (male).]
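The recognition-based mapping can be caricatured as: classify each source frame into a phone-like class, then emit the target speaker's stored vector for that class. This toy sketch replaces the ASR step with a nearest-mean classifier; every name and value here is a hypothetical stand-in.

```python
import numpy as np

def convert_unknown(src_frames, class_means_src, class_means_tgt):
    """Stand-in for the ASR step: classify each source frame against
    per-class source means, then emit the stored target vector."""
    out = []
    for f in src_frames:
        c = int(np.argmin(np.linalg.norm(class_means_src - f, axis=1)))
        out.append(class_means_tgt[c])
    return np.array(out)

src_means = np.array([[0.0, 0.0], [5.0, 5.0]])   # two toy "phone" classes
tgt_means = np.array([[1.0, 1.0], [9.0, 9.0]])   # target speaker's versions
frames = np.array([[0.2, -0.1], [4.8, 5.3]])
converted = convert_unknown(frames, src_means, tgt_means)
```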

Summary
A complete solution to the voice morphing problem has been developed which can deliver reasonable quality. However, there is still some way to go before these techniques can support high-fidelity studio applications.
Future Work
Improve the quality of the converted speech.
Unknown speaker voice conversion.
Cross-language voice conversion.