A System for Hybridizing Vocal Performance

By Kim Hang Lau

Parameters of the singing voice
Parameters of the singing voice can be loosely classified as timbre, pitch contour, time contour (rhythm), and amplitude envelope (projection).

Vocal Modification
Vocal modification refers to the signal processing of live or recorded singing to achieve a different inflection and/or timbre. Commercially available units include intonation correctors, pitch/formant processors, harmonizers, and vocoders. Of particular interest is the Auto-Tune intonation correction system from Antares, which will be used to benchmark some of our tests.

Objectives
Prototype a system for vocal modification. Modify a source vocal sample to match the time evolution, pitch contour, and amplitude envelope of a similarly sung target vocal sample. This simulates a transfer of singing technique from a target vocalist to a source vocalist, hence a hybridized vocal performance. Note that timbre is excluded from the objectives to limit the scope of this thesis.

Order of Presentation
System overview; individual components; system evaluation; system limitations; conclusions and recommendations.
We’ll first present an overview of the prototype system, followed by the individual components. The overall system will then be evaluated and its limitations assessed, followed by conclusions and recommendations.

System Overview
Three components: pitch-marking, time-alignment, and a time/pitch/amplitude modification engine. Inspired by Verhelst’s prototype system for the post-synchronization of speech utterances.
The system consists of three components. Pitch-marking is applied to both the source and the target vocal sample and generates pitch and amplitude-envelope information. The time-alignment unit generates the time-warping information that synchronizes the source to the target. From the time-warping, pitch, and amplitude information, modification parameters are generated and applied to the modification engine to modify the source vocal sample. To my knowledge, this prototype system is an original contribution that suggests a new form of vocal modification. The individual components are, however, implementations and adaptations of existing techniques. The system is implemented in software using Matlab.

Targeted System Specifications
Vocal performance: commercial singing
Vocal pitch range: […] Hz
Detection accuracy/resolution: 10 cents
Detection dynamic range: 40 dB
Sampling rate: 44.1 kHz and 48 kHz
Time-scale modification: ±20%
Pitch-scale modification: ±600 cents
The requirement on detection accuracy/resolution is stringent because the system has to handle minute pitch inflections such as pitch jitter and vibrato. Singing vibrato generally occurs between […] Hz at […] cents. The system must be able to detect and modify pitch so as to produce a smooth quasi-sinusoidal pitch contour at this frequency and depth. Detection dynamic range is given with reference to normalized power. Good singers can have a higher dynamic range, but 40 dB is the average. The moderate time/pitch modification requirements result from the assumption that two similarly sung vocal samples are compared.
Without further ado, I’ll start presenting the individual components, beginning with the pitch-marking system, followed by the modification engine, and then the time-alignment system.
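As a worked illustration of the cents figures in the specification (not part of the original slides): an interval of c cents corresponds to a frequency ratio of 2^(c/1200). The helper names below are hypothetical.

```python
import math

def cents_to_ratio(cents):
    """Frequency ratio for an interval in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)

def ratio_to_cents(ratio):
    """Interval in cents for a given frequency ratio."""
    return 1200.0 * math.log2(ratio)

# Pitch-scale modification range of +/-600 cents (half an octave each way).
shift_up = cents_to_ratio(600)     # about 1.414
shift_down = cents_to_ratio(-600)  # about 0.707

# A 10-cent resolution is a frequency change of roughly 0.58 %.
resolution = cents_to_ratio(10)    # about 1.0058
```

So the 10-cent target means resolving frequency differences of well under one percent, which is why the notes call the requirement stringent.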

Component No.1 Pitch-marking

Pitch-marking and Glottal Closure Instants (GCIs)
[Figure: speech waveform with pitch-marks; annotations P, P’, 5 ms]
Information generated from pitch-marking: pitch period, amplitude envelope, and voiced/unvoiced segment boundaries.
Pitch-marking is the process of placing markers in the signal waveform at a pitch-synchronous rate for voiced sounds and at a constant rate for unvoiced sounds. For use with the modification engine, these markers should ideally correspond to the time instant at which the vocal tract is most strongly excited during a cycle of vocal fold vibration. This instant is commonly accepted to be the moment the glottis closes, hence Glottal Closure Instants. It is clear that pitch period and amplitude envelope can be derived from the pitch-marks.
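A minimal sketch of how pitch period and amplitude envelope could be read off a list of pitch-marks. It assumes the marks are sample indices inside one voiced segment and takes the peak absolute value per cycle as the envelope; both choices, and the function name, are illustrative assumptions rather than the author's method.

```python
import math

def pitch_info_from_marks(signal, marks, fs):
    """Per-cycle pitch period (s), f0 (Hz) and peak amplitude.

    marks: increasing sample indices of consecutive pitch-marks
           within a single voiced segment (assumed input).
    """
    periods, f0, envelope = [], [], []
    for start, end in zip(marks[:-1], marks[1:]):
        period = (end - start) / fs          # distance between marks
        periods.append(period)
        f0.append(1.0 / period)
        envelope.append(max(abs(x) for x in signal[start:end]))
    return periods, f0, envelope

# Synthetic check: a 220.5 Hz tone at 44.1 kHz has a 200-sample period.
fs = 44100
tone = [0.5 * math.sin(2 * math.pi * n / 200) for n in range(800)]
periods, f0, envelope = pitch_info_from_marks(tone, [0, 200, 400, 600], fs)
```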

Pitch-marking applying Dyadic Wavelet Transform (DyWT)
Kadambe adapted Mallat’s algorithm for edge detection in image signals to the detection of GCIs in speech signals, assuming a correlation between edges in images and GCIs in speech, both of which are considered abrupt transition points in their respective signals. DyWT computation for dyadic scales 2^3 to 2^5 was sufficient for pitch-marking. If a particular peak detected in the DyWT matches across two consecutive scales, starting from a lower scale, that time instant is taken as a GCI.
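The cross-scale matching rule can be sketched as follows, assuming the peak positions at each scale have already been detected. The tolerance parameter is an assumption added for illustration, since the slide does not say how close two peaks must be to count as a match.

```python
def match_peaks_across_scales(peaks_low, peaks_high, tol):
    """Keep peaks of the lower scale that reappear in the next scale.

    peaks_low, peaks_high: sample positions of DyWT peaks at two
    consecutive dyadic scales. tol: allowed offset in samples
    (hypothetical parameter).
    """
    gcis = []
    for p in peaks_low:
        if any(abs(p - q) <= tol for q in peaks_high):
            gcis.append(p)
    return gcis

# Peaks at scale 2^3 and 2^4 (made-up positions for illustration).
scale3 = [100, 150, 300, 420, 500]
scale4 = [101, 299, 502]
candidates = match_peaks_across_scales(scale3, scale4, tol=3)
```

Here the peaks at 150 and 420 have no counterpart at the next scale and are discarded, leaving 100, 300 and 500 as GCI candidates.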

[Figure: Mallat (left): original signal and wavelet coefficients at scales 2^1 to 2^5. Kadambe (right): original signal, scales 2^4 and 2^5, with the base-band marked.]
The left-hand plot is an illustration from Mallat. At the top is the original signal, and the subsequent plots are the wavelet coefficients at dyadic scales 2^1 to 2^5. It is clear that for every abrupt transition in the signal, the wavelet coefficients display a peak. On the right-hand side is an illustration from Kadambe. The original signal is a synthesized vowel, and the ‘true’ GCIs are marked at the top of every graph. Comparing the original signals, one quickly notices that every oscillation in the speech signal could be considered an abrupt transition point in Mallat’s context. This is because a GCI is embedded in the speech by way of convolution and is never explicitly manifested in the signal waveform. However, as shown at the bottom right of Kadambe’s illustration, the time response of the wavelet transform is still very desirable for pitch-marking when higher harmonics are filtered out, and the wavelet filter band that contains the fundamental frequency can accurately define GCIs. I’ll refer to this band as the base-band.

The proposed pitch-marking scheme
Detection principle: detect the scale that contains the fundamental period. Starting from a higher scale (of lower frequency), there is a considerable jump in frame power when this scale is encountered.
Features: 4X decimation to support high sampling rates; frame-based processing and error correction for possible quasi-real-time detection.
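A rough sketch of the detection principle, assuming the frame power in each wavelet band is already available. The dictionary layout, the function name, and the jump threshold of 4 are hypothetical; the slides only say that the jump is "considerable".

```python
def find_base_band(frame_powers, jump_ratio=4.0):
    """Return the scale exponent whose band shows the first power jump.

    frame_powers: {scale_exponent: frame power in that band}.
    Scales are scanned from the highest (lowest frequency) downward,
    and the first band whose power is at least jump_ratio times the
    power of the previous, higher scale is returned.
    """
    scales = sorted(frame_powers, reverse=True)
    for higher, current in zip(scales[:-1], scales[1:]):
        floor = max(frame_powers[higher], 1e-12)  # avoid a zero reference
        if frame_powers[current] >= jump_ratio * floor:
            return current
    return None  # no jump found in this frame

# Made-up frame powers: bands 2^5 and 2^4 lie below the fundamental.
powers = {5: 0.001, 4: 0.002, 3: 0.35, 2: 0.40, 1: 0.10}
base_band = find_base_band(powers)
```

In this sketch a frame whose highest scale already contains the fundamental returns None, which a real system would need to handle separately.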

The proposed pitch-marking system
The purpose of showing this system is to illustrate the sheer size of the logic that controls the behavior of the pitch-marking.

Comparisons of results with Auto-Tune
[Figure panels: proposed system; Auto-Tune]

Component No.2 The Modification Engine

Time/pitch/amplitude modification engine
[…](n): time-modification factor
[…](n): pitch-modification factor
[…](n): amplitude-modification factor
D(n): time-warping function
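One plausible way to turn aligned pitch and amplitude tracks into the three factors is sketched here. It assumes (none of this is stated in the slides) that the warping function maps each source frame to a target frame, that the pitch and amplitude factors are target-to-source ratios, and that the time factor is the local slope of the warping function. All names are hypothetical, and all frames are assumed voiced.

```python
def modification_factors(src_f0, src_amp, tgt_f0, tgt_amp, warp):
    """Per-frame time, pitch and amplitude factors.

    warp[i] is the target frame aligned with source frame i
    (non-decreasing). Factors above 1 mean the source must be
    stretched, raised in pitch, or made louder.
    """
    last = len(warp) - 1
    time_f, pitch_f, amp_f = [], [], []
    for i, j in enumerate(warp):
        lo, hi = max(i - 1, 0), min(i + 1, last)
        # Local slope of the warp: target frames per source frame.
        time_f.append((warp[hi] - warp[lo]) / (hi - lo) if hi > lo else 1.0)
        pitch_f.append(tgt_f0[j] / src_f0[i])
        amp_f.append(tgt_amp[j] / src_amp[i])
    return time_f, pitch_f, amp_f

# Four source frames aligned to six target frames (made-up values).
time_f, pitch_f, amp_f = modification_factors(
    src_f0=[200.0] * 4, src_amp=[0.5] * 4,
    tgt_f0=[220.0] * 6, tgt_amp=[0.25] * 6,
    warp=[0, 2, 4, 5],
)
```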

TD-PSOLA (Time-domain Pitch Synchronous Overlap-Add)
A time-domain splicing overlap-add method used in the prosodic modification of speech.
TD-PSOLA is implemented for the modification engine. In the analysis stage, short-time analysis signals are extracted from the signal via windows centered at the pitch-marks, as illustrated in the diagram. Pitch-scale modification is attained by changing the distance between pitch-marks before overlap-add: narrowing the distance raises the pitch and widening it lowers the pitch. Time-scale modification is performed by repeating or discarding short-time analysis signals in the synthesis stage. The size and type of window are important analysis parameters. In general, windows with reasonable spectral behavior can be used, and the size of the window should be approximately 2 times the local pitch period. The commonly used Hanning window was chosen.
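The procedure can be sketched in a few lines for the simplest case. This is a simplified illustration, not the thesis implementation: it assumes a fully voiced signal, constant time and pitch factors, integer pitch-marks, and it normalizes by the summed window weights, which is a choice made here for convenience.

```python
import math

def td_psola(signal, marks, time_factor=1.0, pitch_factor=1.0):
    """Minimal TD-PSOLA with constant factors (voiced input only)."""
    n_out = int(len(signal) * time_factor)
    out = [0.0] * n_out
    weight = [0.0] * n_out
    periods = [b - a for a, b in zip(marks[:-1], marks[1:])]
    periods.append(periods[-1])          # reuse last period at the end
    t = marks[0] * time_factor           # first synthesis mark
    k = 0
    while t < n_out:
        src_time = t / time_factor
        # Advance to the analysis mark nearest to the mapped time.
        while (k + 1 < len(marks)
               and abs(marks[k + 1] - src_time) <= abs(marks[k] - src_time)):
            k += 1
        p = periods[k]
        centre = int(round(t))
        for i in range(-p, p + 1):       # window spans about 2 periods
            s, d = marks[k] + i, centre + i
            if 0 <= s < len(signal) and 0 <= d < n_out:
                w = 0.5 * (1.0 + math.cos(math.pi * i / p))  # Hann window
                out[d] += w * signal[s]
                weight[d] += w
        t += p / pitch_factor            # synthesis mark spacing
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, weight)]

# Pulse-like test signal with a 100-sample period, marks at each pulse.
signal = [math.exp(-(n % 100) / 10.0) for n in range(2000)]
marks = list(range(100, 2000, 100))
unchanged = td_psola(signal, marks)
octave_up = td_psola(signal, marks, pitch_factor=2.0)
stretched = td_psola(signal, marks, time_factor=1.5)
```

With both factors at 1 the interior of the signal is reproduced, a pitch factor of 2 halves the spacing of the synthesis marks so the output repeats every 50 samples, and a time factor of 1.5 reuses analysis frames to lengthen the output.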

Evaluation of the modification engine
[Panels: original; TD-PSOLA (our modification engine); Auto-Tune]
In this test example, a female vocal sample was pitch-shifted to sustain a constant pitch. Auto-Tune has advanced methods for handling pitch transitions, which manifest as non-instantaneous modification. The emphasis here, however, is to show that Auto-Tune does not preserve the formants well.

Component No.3 Time-alignment

Time-alignment
Based on Verhelst’s prototype system, which applies Dynamic Time Warping (DTW). Verhelst claimed that the basic local constraint produces the most accurate time-warping path. Computation increases rapidly as the length of comparison increases (for unconstrained DTW, in proportion to the product of the two sequence lengths), and accuracy deteriorates as the length of comparison increases.
In order to find the modification parameters, the source and target time events have to be time-aligned. DTW compares speaker-independent parameters, i.e. the LPC cepstral parameters, and makes use of dynamic programming to search for the best match between a target spoken word and a series of arbitrary spoken words. If two similarly spoken words are compared, a time-warping path that synchronizes the source speech to the target speech is formulated. These constraints limit the search path.
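A minimal DTW sketch with the basic local constraint, in which each cell may be reached from its left, lower, or diagonal neighbour. For brevity it compares scalar features with an absolute difference; the system described here compares LPC cepstral vectors, so the distance function is a stand-in.

```python
def dtw_path(source, target, dist=lambda a, b: abs(a - b)):
    """Return (total cost, warping path) between two feature sequences."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):                   # n * m cells are filled
        for j in range(m):
            d = dist(source[i], target[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = inf
            if i > 0:
                best = min(best, cost[i - 1][j])
            if j > 0:
                best = min(best, cost[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = d + best
    # Backtrack from the end point to recover the path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path

total, path = dtw_path([1, 2, 3, 4], [1, 1, 2, 3, 3, 4])
```

In the example the target repeats two values, and the path absorbs the repeats by holding the source index while the target index advances.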

Adaptations from Verhelst’s method
Time-alignment is performed on a voiced/unvoiced segmental basis: DTW for voiced segments and Linear Time Warping (LTW) for unvoiced segments. Global constraints are introduced to further reduce computation. Synchronization of voiced/unvoiced segments is required and is manually edited in the current implementation.
LTW was chosen for unvoiced segments to confine our modification to voiced sounds only, because the manipulation of unvoiced sounds can be more complicated than that of voiced sounds. Voiced/unvoiced segment boundaries are generated by the pitch-marking stage, but dissimilarity between source and target and the limitations of pitch-marking mean that the source and the target will generally not yield the same number of segments. A simple method was used to manually edit these segment boundaries.
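Linear time warping of one unvoiced segment amounts to a straight-line mapping between the segment boundaries. The sketch below assumes frame indices and inclusive boundaries; the function name and rounding to the nearest frame are illustrative choices.

```python
def linear_time_warp(src_start, src_end, tgt_start, tgt_end):
    """Map source frames src_start..src_end onto tgt_start..tgt_end."""
    span = src_end - src_start
    if span == 0:
        return [tgt_start]
    scale = (tgt_end - tgt_start) / span   # constant stretch factor
    return [round(tgt_start + (i - src_start) * scale)
            for i in range(src_start, src_end + 1)]

# A 5-frame source segment mapped onto a 9-frame target segment.
mapping = linear_time_warp(10, 14, 20, 28)
```

Unlike DTW, the stretch factor is constant across the segment, so no feature comparison is needed inside unvoiced regions.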

Manipulation of modification parameters
Simple smoothing of […](n) and […](n) using linear-phase FIR low-pass filters is performed before feeding them to the modification engine.
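A sketch of this kind of smoothing, assuming a short symmetric (hence linear-phase) Hann-shaped kernel applied with the delay compensated and the edges held constant. The kernel shape, the tap count, and the edge handling are assumptions; the slides do not give the filter design.

```python
import math

def smooth_fir(values, num_taps=9):
    """Smooth a parameter track with a symmetric low-pass FIR kernel.

    num_taps is assumed odd so the kernel has a centre tap and the
    output stays aligned with the input.
    """
    half = num_taps // 2
    taps = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 1) / (num_taps + 1))
            for k in range(num_taps)]
    total = sum(taps)
    taps = [t / total for t in taps]      # unity gain at DC
    out = []
    for n in range(len(values)):
        acc = 0.0
        for k, t in enumerate(taps):
            # Clamp indices so the track is extended by its end values.
            idx = min(max(n + k - half, 0), len(values) - 1)
            acc += t * values[idx]
        out.append(acc)
    return out

# A single-frame glitch in an otherwise flat factor track is spread out.
glitch = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
smoothed = smooth_fir(glitch)
```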

The Prototype System
With this, I shall present the final …

System Evaluation: case 1

System Evaluation: case 2

System Limitations (segmentation, modification engine)
Segmentation: there is no reliable technique for voiced/unvoiced segmentation. Segmentation and classification of different vocal sounds are the key to devising rules for modification.
Modification engine: it lacks the capability to handle pitch transitions and depends entirely on the pitch-marking stage.

System Limitations (pitch-marking, time-alignment)
Pitch-marking: the proposed system lacks robustness. Despite the desirable time response of the wavelet filter bank, its frequency response cannot isolate harmonics effectively and efficiently.
Time-alignment: the DTW basic local constraint allows unlimited time expansion and compression, which often causes distortions in the synthesized vocal sample.

Conclusions and Recommendations
The current system works well for slow and continuous singing. Further improvements to the individual components are recommended to handle greater dynamic changes in the vocal signal, thereby extending the current good results to a wider range of singing styles.

Questions & Answers

Wavelet filter bank

Dyadic Spline Wavelet

Wide-band analysis

DTW local constraints

Calculation of pitch-marks
