A System for Hybridizing Vocal Performance


A System for Hybridizing Vocal Performance By Kim Hang Lau

Parameters of the singing voice

Parameters of the singing voice can be loosely classified as:
- Timbre
- Pitch contour
- Time contour (rhythm)
- Amplitude envelope (projection)

Vocal Modification

Vocal modification refers to the signal processing of live or recorded singing to achieve a different inflection and/or timbre. Commercially available units include:
- Intonation corrector
- Pitch/formant processor
- Harmonizer
- Vocoder
Of particular interest is the Auto-Tune intonation corrector from Antares, which will be used to benchmark some of our tests.

Objectives

- Prototype a system for vocal modification
- Modify a source vocal sample to match the time evolution, pitch contour and amplitude envelope of a similarly sung target vocal sample
- Simulate a transfer of singing techniques from a target vocalist to a source vocalist, thus hybridizing the vocal performance
Note that timbre is not included in the objectives, to limit the scope of this thesis.

Order of Presentation

- System overview
- Individual components
- System evaluation
- System limitations
- Conclusions and recommendations
We'll first present an overview of the prototype system, followed by the individual components. The overall system will then be evaluated and its limitations assessed. Finally, conclusions and recommendations.

System Overview

Three components:
- Pitch-marking
- Time-alignment
- Time/pitch/amplitude modification engine
Inspired by Verhelst's prototype system for the post-synchronization of speech utterances.
The system consists of three components: pitch-marking is applied to both the source and target vocal samples, generating pitch and amplitude-envelope information. The time-alignment unit generates the time-warping information that synchronizes the source to the target. Together with the pitch and amplitude information, modification parameters are generated and applied to the modification engine to modify the source vocal sample. To the best of my knowledge, this prototype system is an original contribution that suggests a new form of vocal modification. The individual components, however, are implementations and adaptations of existing techniques. The system is implemented in software using Matlab.

Targeted System Specifications

- Vocal performance: commercial singing
- Vocal pitch range: 60-1200 Hz
- Detection accuracy/resolution: 10 cents
- Detection dynamic range: 40 dB
- Sampling rate: 44.1 kHz and 48 kHz
- Time-scale modification: ±20%
- Pitch-scale modification: ±600 cents
The detection accuracy/resolution requirement is stringent, because the system has to handle minute pitch inflections like pitch jitter and vibrato. Singing vibrato generally occurs at 5.5-7.5 Hz with a depth of 100-200 cents; the system must detect and modify the signal to produce a smooth quasi-sinusoidal pitch contour at this frequency and depth. Detection dynamic range is with reference to normalized power. For good singers the dynamic range can be higher, but 40 dB is the average. The moderate time/pitch modification requirements are a result of the assumption that two similarly sung vocal samples are compared. Without further ado, I'll start presenting the individual components, starting with the pitch-marking system, followed by the modification engine, and then the time-alignment system.
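The pitch specifications above are expressed in cents, a logarithmic unit where 100 cents is one equal-tempered semitone and 1200 cents is an octave. The conversion between cents and frequency ratios can be sketched as:

```python
import math

def cents_to_ratio(cents):
    """Convert a pitch offset in cents to a frequency ratio (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)

def ratio_to_cents(ratio):
    """Inverse mapping: frequency ratio to a pitch offset in cents."""
    return 1200.0 * math.log2(ratio)
```

For example, the ±600-cent pitch-scale range corresponds to frequency ratios of roughly 0.707 to 1.414, and a 10-cent resolution to a frequency difference of about 0.6%.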

Component No.1 Pitch-marking

Pitch-marking and Glottal Closure Instants (GCIs)

[Figure: pitch-marks on a waveform, spaced one pitch period (P, P') apart; 5 ms scale]
Information generated from pitch-marking:
- Pitch period
- Amplitude envelope
- Voiced/unvoiced segment boundaries
Pitch-marking is the process of placing markers in the signal waveform at a pitch-synchronous rate for voiced sounds, and at a constant rate for unvoiced sounds. For application to the modification engine, these markers should ideally correspond to the time instant when the vocal tract is most excited during a cycle of vocal fold vibration. This instant is commonly accepted to be the instant when the glottis closes, hence Glottal Closure Instants. It is clear that the pitch period and amplitude envelope can be derived from pitch-marking.
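As a concrete illustration of that last point, here is a minimal sketch (not the thesis code; the function name and interface are illustrative) of deriving per-cycle pitch and amplitude-envelope contours from a list of pitch-mark sample indices:

```python
import numpy as np

def marks_to_contours(signal, marks, fs):
    """Derive per-cycle F0 and amplitude-envelope contours from pitch marks.

    `marks` holds sample indices of the detected pitch marks (GCIs).
    """
    marks = np.asarray(marks)
    periods = np.diff(marks) / fs          # local pitch period, in seconds
    f0 = 1.0 / periods                     # instantaneous fundamental, Hz
    # Amplitude envelope: peak magnitude within each pitch cycle.
    env = np.array([np.abs(signal[a:b]).max()
                    for a, b in zip(marks[:-1], marks[1:])])
    return f0, env
```

For a 100 Hz tone at 44.1 kHz with marks every 441 samples, this yields a flat 100 Hz contour and a near-unity envelope.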

Pitch-marking applying the Dyadic Wavelet Transform (DyWT)

Kadambe adapted Mallat's algorithm for edge detection in image signals to the detection of GCIs in speech signals. He assumed a correlation between edges in image signals and GCIs in speech signals, both being abrupt transition points in their respective domains. DyWT computation for dyadic scales 2^3 to 2^5 was sufficient for pitch-marking. If a particular peak detected in the DyWT matches for two consecutive scales, starting from a lower scale, that time instant is taken as a GCI.

[Figure: left, Mallat's illustration - the original signal and its wavelet coefficients at dyadic scales 2^1 to 2^5; right, Kadambe's illustration - a synthesized vowel with the 'true' GCIs marked, and the base-band coefficients at scales 2^4 and 2^5]
In the left-hand plot, an illustration from Mallat, the top level is the original signal and the subsequent plots are the wavelet coefficients at dyadic scales 2^1 to 2^5. It is clear that for every abrupt transition in the signal, the wavelet coefficients display a peak. On the right-hand side is an illustration from Kadambe. The original signal is a synthesized vowel, and the 'true' GCIs are marked at the top of every graph. Comparing the original signals, one quickly notices that every oscillation in the speech signal could be considered an abrupt transition point in Mallat's context. This is because a GCI is embedded in the speech by way of convolution, and is never explicitly manifested in the signal waveform. However, as illustrated in the bottom right of Kadambe's figure, the time response of the wavelet transform is still very desirable for pitch-marking when the higher harmonics are filtered out, and the wavelet filter band that contains the fundamental frequency can accurately define GCIs. I'll refer to this band as the base-band.

The proposed pitch-marking scheme

Detection principle:
- Detect the scale that contains the fundamental period
- Starting from a higher scale (of lower frequency), there is a considerable jump in frame power when this scale is encountered
Features:
- 4x decimation to support high sampling rates
- Frame-based processing and error correction for possible quasi-real-time detection
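The detection principle can be sketched as follows. This is an illustration, not the thesis implementation: FFT band masks stand in for the dyadic wavelet filter bank, and the band-edge convention, scale range and 12 dB jump threshold are assumptions of this sketch.

```python
import numpy as np

def select_baseband(frame, fs, scales=(6, 5, 4, 3, 2), jump_db=12.0):
    """Pick the dyadic scale whose band first shows a large frame-power jump.

    Scale 2**j is taken to cover roughly [fs/2**(j+1), fs/2**j) Hz. Scanning
    starts at the highest scale (lowest band); the first band whose power
    jumps by more than `jump_db` over the band below it is declared the
    base-band, i.e. the band containing the fundamental.
    """
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    power = {j: spec[(freqs >= fs / 2 ** (j + 1)) & (freqs < fs / 2 ** j)].sum() + 1e-12
             for j in scales}
    prev = None
    for j in scales:                       # lowest-frequency band first
        if prev is not None and 10 * np.log10(power[j] / prev) > jump_db:
            return j
        prev = power[j]
    return scales[0]
```

For instance, a frame of a 200 Hz tone sampled at 12.8 kHz lands in the scale-5 band ([200, 400) Hz under this convention). Note the scan must start one scale below the lowest expected fundamental band for the jump to be observable.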

The proposed pitch-marking system

The purpose of showing this system is to illustrate the complexity of the logic that controls the behavior of the pitch-marking.

Comparison of results with Auto-Tune

[Figures: pitch-marking output of the proposed system vs. Auto-Tune]

Component No.2 The Modification Engine

Time/pitch/amplitude modification engine

The engine is driven by four control contours:
- time-modification factor (a function of n)
- pitch-modification factor (a function of n)
- amplitude-modification factor (a function of n)
- D(n): time-warping function

TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add)

A time-domain splicing overlap-add method, used in prosodic modification of speech. TD-PSOLA is implemented for the modification engine. In the analysis stage, short-time analysis signals are extracted from the signal via windows centered at the pitch-marks, as illustrated in the diagram. Pitch-scale modification is attained by narrowing (or widening) the spacing between pitch-marks before overlap-add. Time-scale modification is performed by repeating or discarding short-time analysis signals in the synthesis stage. The size and type of window are important analysis parameters. In general, windows with reasonable spectral behavior can be used, and the size of the window should be approximately two times the local pitch period. The commonly used Hanning window was chosen.
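The analysis/synthesis procedure just described can be sketched as a minimal constant-factor pitch shifter. This is illustrative, not the thesis engine: it assumes the pitch marks are already available and uses a single fixed pitch-scale factor rather than a time-varying contour.

```python
import numpy as np

def psola_pitch_shift(x, marks, beta):
    """Shift pitch by factor `beta` with TD-PSOLA (beta > 1 raises pitch).

    Two-period Hanning-windowed segments centered at the analysis pitch
    marks are re-placed at synthesis marks spaced P/beta apart and
    overlap-added.
    """
    marks = np.asarray(marks)
    periods = np.diff(marks)
    y = np.zeros(len(x))
    t_syn = float(marks[0])
    while t_syn < marks[-1]:
        i = int(np.argmin(np.abs(marks - t_syn)))    # nearest analysis mark
        P = int(periods[min(i, len(periods) - 1)])   # local pitch period
        a, b = marks[i] - P, marks[i] + P
        c = int(round(t_syn))
        if a >= 0 and b <= len(x) and c - P >= 0 and c + P <= len(y):
            y[c - P:c + P] += x[a:b] * np.hanning(2 * P)
        t_syn += P / beta                            # synthesis mark spacing
    return y
```

With beta = 1 the Hanning overlap-add reconstructs the input almost exactly, which is a useful sanity check; with beta = 2 the synthesis marks, and hence the glottal pulses, occur twice as often.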

Evaluation of the modification engine

[Audio examples: the original sample, our modification engine (TD-PSOLA), and Auto-Tune]
In this test example, a female vocal sample was pitch-shifted to sustain a constant pitch. Auto-Tune has advanced methods for handling pitch transitions, manifested by its non-instantaneous modification. The emphasis here, however, is to show that Auto-Tune does not preserve the formants well.

Component No.3 Time-alignment

Time-alignment

- Based on Verhelst's prototype system that applies Dynamic Time Warping (DTW)
- He claimed that the basic local constraint produces the most accurate time-warping path
- Exponential increase in computation as the length of the comparison increases
- Accuracy deteriorates as the length of the comparison increases
In order to find the modification parameters, the source and target time events have to be time-aligned. DTW compares speaker-independent parameters, i.e. LPC cepstral parameters, and makes use of dynamic programming to search for the best match between a target spoken word and a series of arbitrarily spoken words. If two similarly spoken words are compared, a time-warping path that synchronizes the source speech to the target speech is formulated. The local constraints limit the search path.

Adaptations from Verhelst's method

- Time-alignment is performed on a voiced/unvoiced segmental basis
- DTW for voiced segments
- Linear Time Warping (LTW) for unvoiced segments
- Global constraints are introduced to further reduce computation
- Synchronization of voiced/unvoiced segments is required, which is manually edited in the current implementation
LTW was chosen for unvoiced segments to confine our modification to voiced sounds only, because the manipulation of unvoiced sounds can be more complicated than that of voiced sounds. Voiced/unvoiced segment boundaries are generated by the pitch-marking stage, but dissimilarity between source and target, and the limitations of pitch-marking, will rarely yield the same number of segments for the source and the target. A simple method was used to manually edit these segment boundaries.
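The dynamic-programming search at the heart of DTW can be sketched as follows, using the basic local constraint (predecessors (i-1, j), (i, j-1), (i-1, j-1)) and a Euclidean frame distance; the feature choice and interface here are illustrative, not the thesis implementation:

```python
import numpy as np

def dtw_align(A, B):
    """Align feature sequences A (source) and B (target) with DTW.

    Returns (total_cost, path), where path is a list of
    (source_frame, target_frame) pairs from (0, 0) to (len(A)-1, len(B)-1).
    """
    A, B = np.asarray(A, float), np.asarray(B, float)
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])   # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end of both sequences to recover the warping path.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: D[s])
    return D[n, m], path[::-1]
```

Note how the basic local constraint allows a single source frame to map to arbitrarily many target frames (and vice versa), which is exactly the unbounded expansion/compression behavior discussed under system limitations.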

Manipulation of modification parameters

Simple smoothing of the time- and pitch-modification contours using linear-phase FIR low-pass filters is performed before feeding them to the modification engine.
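A sketch of such a smoother, using a windowed-sinc linear-phase FIR low-pass; the tap count and cutoff are illustrative assumptions, not the thesis values. Edge-padding plus 'valid' convolution keeps the smoothed contour time-aligned with the input.

```python
import numpy as np

def smooth_params(p, num_taps=31, cutoff=0.05):
    """Smooth a modification-parameter contour with a linear-phase FIR
    low-pass (windowed-sinc design; `cutoff` is in cycles/sample)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff * n) * np.hamming(num_taps)  # symmetric taps
    h /= h.sum()            # unity DC gain: constant contours pass unchanged
    # Pad with edge values to avoid transients at the contour ends.
    pad = num_taps // 2
    padded = np.concatenate([np.full(pad, p[0]), p, np.full(pad, p[-1])])
    return np.convolve(padded, h, mode='valid')
```

Because the taps are symmetric, the filter has exactly linear phase, so the smoothed contour is not shifted in time relative to the pitch marks it controls.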

The Prototype System

With this, I shall present the final prototype system.

System Evaluation: case 1

System Evaluation: case 2

System Limitations

Segmentation:
- Lack of a reliable technique for voiced/unvoiced segmentation
- Segmentation and classification of different vocal sounds is the key to devising rules for modification
Modification engine:
- Lacks the capability to handle pitch transitions; totally dependent on the pitch-marking stage

System Limitations

Pitch-marking:
- The proposed system lacks robustness
- Despite the desirable time response of the wavelet filter bank, its frequency response is not capable of isolating harmonics effectively and efficiently
Time-alignment:
- The DTW basic local constraint allows unbounded time expansion and compression, which often causes distortions in the synthesized vocal sample

Conclusions and Recommendations

The current system works well for slow, continuous singing. Further improvements to the individual components are recommended to handle greater dynamic changes in the vocal signal, thereby extending the current good results to a wider range of singing styles.

Questions & Answers

Wavelet filter bank

Dyadic Spline Wavelet

Wide-band analysis

DTW local constraints

Calculation of pitch-marks

DyWT