1
WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
HALİL İBRAHİM KURU
2
OUTLINE Introduction, Related Work & Background, Model Details (General Model Structure, Dilated Causal Convolutions, Specialized Softmax Distribution, Gated Activation, Conditional WaveNet), Results, Conclusion
3
INTRODUCTION A technique for generating raw audio
Inspired by neural autoregressive generative models, in particular PixelRNN (van den Oord et al., 2016a), a similar work on images that we saw in the last presentation session. WaveNet applies the PixelCNN approach (van den Oord et al., 2016a;b) to wideband audio signals.
4
RELATED WORK PixelCNN (van den Oord et al., 2016a;b)
Two convolution stacks: 1) a vertical stack that covers all pixels above the current one, and 2) a horizontal stack that covers all pixels to its left in the current row. Masked convolutions prevent the model from seeing future pixels. Gated activation units.
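As a rough illustration (not from the paper's code), a PixelCNN-style "type A" mask for a 3x3 filter can be built as follows: in raster order, the centre pixel and everything after it are zeroed, so the masked convolution never sees the current or future pixels.

```python
import numpy as np

# Hypothetical sketch of a PixelCNN "type A" mask for a 3x3 filter.
k = 3
mask = np.ones((k, k))
mask[k // 2, k // 2:] = 0   # zero the centre pixel and pixels to its right
mask[k // 2 + 1:, :] = 0    # zero every row below the centre
# mask is now [[1, 1, 1],
#              [1, 0, 0],
#              [0, 0, 0]]
```

Multiplying the filter weights elementwise by this mask before each convolution enforces the raster-order dependency structure.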
5
MODEL DETAILS Input: x = {x1, ..., xT}, time series data
The joint probability factorizes as p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t−1}) Aim: generate x_t given all the previous time steps x_{<t} Conditioning property: each audio sample x_t is conditioned on the samples at all previous timesteps Causal convolutions: the main building block of WaveNet The acoustic sample the network produces at time step t depends only on data before t; generation of new data cannot depend on future data BUT causal convolutions lead to a problem: increasing the receptive field requires increasing the number of layers and the filter size Solution: dilated causal convolutions
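The causality constraint can be sketched in a few lines. The following is an illustrative NumPy implementation (an assumption for this presentation, not the paper's code): the input is left-padded with zeros so each output sample depends only on current and past inputs.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] depends only on x[:t+1].

    Left-padding with (filter_size - 1) zeros keeps the output the
    same length as the input while forbidding any look-ahead.
    """
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])          # a simple two-tap averaging filter
y = causal_conv1d(x, w)           # y[0] uses only x[0]; no future leakage
```

Note that y[0] averages x[0] with the zero padding, confirming that no future sample ever contributes to an output.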
6
Dilated Causal Convolution
A convolution where the filter is applied over an area larger than its length by skipping input values with a certain step Simple explanation: a convolution with a larger filter derived from the original filter by dilating it with zeros Size of output = size of input Dilation is doubled at every layer up to a limit, then the pattern repeats: 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512
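The skipping of input values can be made concrete with a small NumPy sketch (illustrative only, not the paper's implementation): with dilation d, the output at time t combines x[t], x[t−d], x[t−2d], and so on, still touching only past samples.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Dilated causal 1-D convolution.

    Output y[t] = sum_i w[i] * x[t - i*dilation], with zero padding
    on the left so the output has the same length as the input.
    """
    k = len(w)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[i] * padded[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.arange(1.0, 9.0)                       # [1, 2, ..., 8]
y = dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2)
# each output is x[t] + x[t-2], skipping the immediately previous sample
```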
7
Pros of Dilation Enables networks to have very large receptive fields with just a few layers Preserves the input resolution throughout the network Exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016)
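The exponential-growth claim can be checked with simple arithmetic. Assuming filter size 2 and the dilation schedule from the previous slide (both assumptions based on the slides, not fixed by this code), each layer with dilation d adds d samples of context:

```python
def receptive_field(dilations, filter_size=2):
    """Receptive field of a stack of dilated causal convolutions.

    Each layer with dilation d and filter size k contributes
    (k - 1) * d extra samples of context; the +1 counts the
    current sample itself.
    """
    return 1 + sum((filter_size - 1) * d for d in dilations)

stack = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
rf_one = receptive_field(stack)       # one stack: 1024 samples
rf_three = receptive_field(stack * 3) # three repeated stacks: 3070 samples
```

So ten layers already see 1024 samples, where ten undilated layers of the same filter size would see only 11.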
8
Softmax and Activations
Softmax distribution Original case: audio data is a sequence of 16-bit integers, with range [−32,768, 32,767], so the softmax output would be a vector of 65,536 values A μ-law companding transformation (ITU-T, 1988) reduces the softmax output to 256 values Gated activation units: z = tanh(W_{f,k} * x) ⊙ σ(W_{g,k} * x), where k is the layer index, f and g denote filter and gate respectively, and W is a learnable convolution filter
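The companding transform itself is f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255, as in the paper; the quantization and rounding details below are an illustrative assumption, not the paper's exact code.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand x in [-1, 1] nonlinearly, then quantize to mu+1 = 256 levels."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)   # integers 0..255

def mu_law_decode(q, mu=255):
    """Invert the quantization and the companding."""
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

samples = np.array([-1.0, 0.0, 0.5, 1.0])
codes = mu_law_encode(samples)       # 256-way classes fed to the softmax
reconstructed = mu_law_decode(codes) # close to the original samples
```

The nonlinearity spends more of the 256 levels near zero, where speech signals concentrate, so the 8-bit reconstruction sounds much better than linear 8-bit quantization.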
9
CONDITIONING WaveNet accepts an additional input h; the model becomes p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t−1}, h)
Ability to guide WaveNet’s generation to produce audio with the required characteristics
10
Global vs. Local Conditioning
Global: a single latent representation h that influences the output distribution across all timesteps, e.g. a speaker embedding in a TTS model Local: a time series h_t, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model; the time series is first upsampled to the audio sample rate with a transposed convolutional network, y = f(h), and the resulting V_{f,k} * y terms enter the gated activation through 1×1 convolutions: z = tanh(W_{f,k} * x + V_{f,k} * y) ⊙ σ(W_{g,k} * x + V_{g,k} * y)
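A minimal sketch of local conditioning, under stated assumptions: the paper learns the upsampling y = f(h) with a transposed convolution, whereas np.repeat below is a crude stand-in just to match lengths, and the scalar v_f, v_g stand in for the 1×1 convolutions V_{f,k}, V_{g,k}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T, frames = 8, 2               # 8 audio samples, 2 slow feature frames

x_f = rng.standard_normal(T)   # stand-in for W_{f,k} * x (filter path)
x_g = rng.standard_normal(T)   # stand-in for W_{g,k} * x (gate path)
h = rng.standard_normal(frames)  # slow features, e.g. linguistic frames
y = np.repeat(h, T // frames)  # naive upsampling to one value per sample
v_f, v_g = 0.3, -0.2           # stand-ins for the 1x1 convs V_{f,k}, V_{g,k}

# Conditioned gated activation: the extra V*y terms shift both the
# filter and the gate at every timestep.
z = np.tanh(x_f + v_f * y) * sigmoid(x_g + v_g * y)
```

Because tanh is bounded by 1 and the sigmoid gate lies in (0, 1), the conditioned activation z stays inside (−1, 1) regardless of the conditioning signal.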
11
RESULTS Text-to-Speech Music Multi-Speaker Speech Generation
Evaluation: subjective paired comparison tests and mean opinion score (MOS) tests
12
Multi-Speaker Speech Generation
WaveNet conditioned only on the speaker, by feeding the speaker ID as a one-hot vector; not conditioned on text Generates non-existent but human-language-like words smoothly, with realistic-sounding intonations Lacks coherence over long ranges because of the limited receptive field (about 300 milliseconds, i.e. the last 2-3 phonemes) A single model can capture the characteristics of all 109 speakers when conditioned on the speaker ID It may mimic the acoustics and recording quality of the data, apart from the voice itself
13
Text-to-Speech Two datasets:
Google’s North American English (24.6 hours of speech, female speaker) and Mandarin Chinese (34.8 hours of speech, female speaker) Conditioned on linguistic features and log F0 Baseline speech synthesizers: HMM-driven unit-selection concatenative (Gonzalvo et al., 2016) and LSTM-RNN-based statistical parametric (Zen et al., 2016) Evaluation metrics: subjective paired comparison tests (choose one of two samples, or neutral, after listening to pairs of audio) and mean opinion score (MOS) tests (rate the naturalness on a 1-to-5 scale)
16
Music Datasets: MagnaTagATune (200 hours of music audio) and a YouTube piano dataset (60 hours of solo piano) Negative: the model did not enforce long-range consistency, which resulted in second-to-second variations in genre, instrumentation, volume and sound quality Positive: samples were often harmonic and aesthetically pleasing
17
CONCLUSION Operates directly at the waveform level
Combines causal filters with dilated convolutions, which lets receptive fields grow exponentially with depth; this is important for modelling the long-range temporal dependencies in audio signals Can be combined with other inputs (global or local conditioning) Good results on real-world problems Examples: WaveNet Generation Examples