An implementation of WaveNet
May 2018, Vassilis Tsiaras, Computer Science Department, University of Crete
Introduction
In September 2016, DeepMind presented WaveNet. WaveNet outperformed the best TTS systems, both parametric and concatenative, in Mean Opinion Score (MOS) tests. Before WaveNet, all Statistical Parametric Speech Synthesis (SPSS) methods modelled parameters of speech, such as cepstra, F0, etc. WaveNet revolutionized our approach to SPSS by directly modelling the raw waveform of the audio signal. DeepMind published a paper about WaveNet, but it did not reveal all the details of the network. Here an implementation of WaveNet is presented, which fills in some of the missing details.
Probability of speech segments
Let Ω_n denote the set of all possible sequences of length n over {0, 1, …, m−1}. Let P: Ω_n → [0,1] be a probability distribution which assigns higher values to speech sequences than to other sequences. Knowledge of the distribution P: Ω_n → [0,1] allows us to test whether a sequence x_1 x_2 ⋯ x_n is speech or not. Also, using random sampling methods, it allows us to generate sequences that with high probability look like speech. The estimation of P is easy for very small values of n (e.g., n = 1, 2).
(Figures: estimation of P(x_1), with random samples in green and speech samples in blue; value 0 corresponds to silence. Different views of P(x_1, x_2), estimated from speech samples of the Arctic database.)
Probability of speech segments
The estimation of P for very small values of n is easy, but it is not very useful, since the interdependence of speech samples whose time indices differ by more than n is ignored. To be useful for practical applications, the distribution P should be estimated for large values of n. However, the estimation of P becomes very challenging as n grows, due to the sparsity of data and to the extremely low values of P. In order to estimate P robustly, we take the following actions. The dynamic range of speech is reduced to the interval [−1, 1], and the speech is then quantized into a number of bins (usually q = 256). Based on the factorization P(x_1, …, x_T) = ∏_{t=1}^{T} P(x_t | x_1, …, x_{t−1}), we calculate the conditional probabilities P(x_t | x_1, …, x_{t−1}) instead of P(x_1, …, x_T). The conditional probability P(x_t | x_1, …, x_{t−1}) = P(x_1, …, x_t) / P(x_1, …, x_{t−1}) is numerically more manageable than P(x_1, …, x_t).
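The factorization above can be sketched in a few lines. This is a minimal illustration, not part of the deck; `sequence_log_prob` is a hypothetical helper showing why per-step conditionals are numerically more manageable than the tiny joint probability:

```python
import numpy as np

def sequence_log_prob(conditionals):
    """conditionals[t] = P(x_{t+1} | x_1, ..., x_t); returns log P(x_1, ..., x_T).

    Summing log-conditionals avoids the extremely small joint
    probabilities that direct estimation would have to represent.
    """
    return float(np.sum(np.log(conditionals)))
```

For example, two steps each with conditional probability 0.5 give a joint probability of 0.25, recovered here in log space.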
Dynamic range compression and quantization
Raw audio y_1 … y_t … y_T is first transformed into x_1 … x_t … x_T, where −1 < x_t < 1 for t ∈ {1, …, T}, using the μ-law transformation
x_t = sign(y_t) · ln(1 + μ|y_t|) / ln(1 + μ), where μ = 255.
Then x_t is quantized into 256 values. Finally, x_t is encoded as a one-hot vector.
Toy example (quantized into 4 bins):
signal: −2.2, −1.43, −0.77, −1.13, −0.58, −0.43, −0.67, …
μ-law transformed: −0.7, −0.3, 0.2, −0.1, 0.4, 0.6, 0.3, …
quantized bin: 0, 1, 2, 1, 2, 3, 2, …
The bin indices are then expanded into one-hot vectors, which form the input to WaveNet.
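The encoding pipeline above can be sketched in NumPy. This is a minimal sketch under the stated conventions (μ = 255, 256 bins); the function names are mine:

```python
import numpy as np

def mu_law_encode(audio, mu=255, bins=256):
    """Compress dynamic range with the mu-law transform, then quantize."""
    audio = np.clip(audio, -1.0, 1.0)
    transformed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] to integer bin indices 0 .. bins-1
    return ((transformed + 1) / 2 * (bins - 1)).astype(np.int64)

def one_hot(indices, bins=256):
    """Expand bin indices into one-hot vectors of shape (T, bins)."""
    return np.eye(bins, dtype=np.float32)[indices]
```

Silence (0.0) maps to the middle bin and the extremes ±1 map to the first and last bins.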
The conditional probability
The conditional probability P(x_t | x_1, …, x_{t−1}) is modelled with a categorical distribution where x_t falls into one of a number of bins (usually 256). A tabular representation of P(x_t | x_1, …, x_{t−1}) is infeasible, since it requires space proportional to 256^t. Instead, a function approximation of P is used. Well-known function approximators are neural networks. Recurrent and convolutional neural networks model the interdependence of the samples in a sequence and are ideal candidates to represent P(x_t | x_1, …, x_{t−1}). Recurrent neural networks usually work better than convolutional ones, but their computation cannot be parallelized across time. WaveNet uses one-dimensional causal convolutional neural networks to represent P(x_t | x_1, …, x_{t−1}).
WaveNet architecture – 1×1 convolutions
1×1 convolutions are used to change the number of channels. They do not operate in the time dimension, and they can be written as matrix multiplications. For a 1×1 convolution with 4 input channels and 3 output channels:
out[c_out, t] = Σ_{c_in=0}^{3} in[c_in, t] · filter[c_out, c_in]
(Figure: worked example multiplying the transposed filter matrix with the input signal, one time step at a time.)
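The formula above is exactly a matrix product applied at every time step. A minimal NumPy sketch (function name is mine):

```python
import numpy as np

# A 1x1 convolution over (channels, time) data is a matrix multiplication:
# each time step is mapped independently from in_channels to out_channels.
def conv1x1(x, weight):
    """x: (in_channels, time), weight: (out_channels, in_channels)."""
    return weight @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))   # 4 input channels, 10 time steps
w = rng.normal(size=(3, 4))    # 3 output channels
y = conv1x1(x, w)              # shape (3, 10)
```

Because each column of `x` is transformed independently, the output at time t depends only on the input at time t.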
Causal convolutions
Many machine learning libraries avoid filter flipping, computing a cross-correlation instead of a true convolution. For simplicity, we also avoid filter flipping. Causal convolutions do not consider future samples; therefore, all values of the filter kernel that correspond to future samples are zero. Filters of width 2 are causal: one tap weights the past sample and one the present sample.
(Figure: a worked width-3 convolution with the flipped filter, and the filter of a causal convolution with taps only on the past and present samples.)
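A minimal NumPy sketch of a causal convolution without filter flipping (function name is mine). The start of the signal is left-padded with zeros so that the output at time t depends only on samples up to t:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution (cross-correlation, no filter flipping).

    x: (time,), kernel: (width,). The output at time t depends only on
    x[t-width+1 .. t]; the sequence start is left-padded with zeros.
    """
    width = len(kernel)
    padded = np.concatenate([np.zeros(width - 1), x])
    return np.array([padded[t:t + width] @ kernel for t in range(len(x))])
```

With a width-2 kernel of ones, each output is the sum of the current and previous sample.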
Dilated convolutions
Dilated convolutions have longer receptive fields: a filter of width w with dilation d covers an equivalent width of (w − 1) · d + 1 samples. For example, a width-3 filter with dilation 2 has an equivalent width of 5, and with dilation 4 an equivalent width of 9. Efficient implementations of dilated convolutions do not materialize the equivalent filter with the inserted zeros.
(Figure: a worked example of a dilated convolution with dilation 2, and the equivalent zero-filled filters for dilations 2 and 4.)
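A minimal NumPy sketch of a causal dilated convolution (function name is mine). It skips `dilation − 1` samples between kernel taps rather than building the zero-filled equivalent filter:

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation=1):
    """Causal convolution with gaps of `dilation` between kernel taps."""
    width = len(kernel)
    receptive = (width - 1) * dilation + 1   # equivalent filter width
    padded = np.concatenate([np.zeros(receptive - 1), x])
    return np.array([
        sum(kernel[k] * padded[t + k * dilation] for k in range(width))
        for t in range(len(x))
    ])
```

With dilation 1 this reduces to the ordinary causal convolution; with dilation 2 each output combines the current sample and the sample two steps back.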
Causal convolutions – Matrix multiplications
Example of a causal convolution of width 2, with 4 input channels and 3 output channels:
out[c_out, t] = Σ_{c_in=0}^{3} Σ_{k=0}^{1} in[c_in, t+k−1] · filter[c_out, c_in, k]
A width-2 causal convolution is therefore the sum of two matrix multiplications: one filter matrix applied to the previous time step and one applied to the current time step.
(Figure: worked example with the two transposed filter matrices and the input signal.)
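The "sum of two matrix multiplications" view can be sketched directly in NumPy (function name is mine; the previous time step at t = 0 is taken as zero):

```python
import numpy as np

# A width-2 causal convolution equals two 1x1-style matrix multiplications:
# out[:, t] = w0 @ x[:, t-1] + w1 @ x[:, t], with x[:, -1] taken as zero.
def causal_conv_width2(x, w0, w1):
    """x: (in_channels, time); w0, w1: (out_channels, in_channels)."""
    x_prev = np.concatenate([np.zeros((x.shape[0], 1)), x[:, :-1]], axis=1)
    return w0 @ x_prev + w1 @ x
```

This is the form in which the convolution maps onto fast dense matrix kernels.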
Causal convolutions – Embedding
In the first layer, the input consists of one-hot vectors. Multiplying a filter matrix with a one-hot vector simply selects one of its columns, so the first causal convolution acts as an embedding lookup of the quantized samples.
(Figure: the same width-2 example as above, with one-hot input vectors.)
Dilated convolutions – Matrix multiplications
Example of a causal dilated convolution of width 2, dilation d = 2, with 4 input channels and 3 output channels. Dilation is applied in the time dimension:
out[c_out, t] = Σ_{c_in=0}^{3} Σ_{k=0}^{1} in[c_in, t+(k−1)·d] · filter[c_out, c_in, k]
(Figure: worked example with the two filter matrices applied at time offsets −2 and 0.)
Dilated convolutions – Matrix multiplications
Example of a causal dilated convolution of width 2, dilation d = 4, with 4 input channels and 3 output channels. Dilation is applied in the time dimension:
out[c_out, t] = Σ_{c_in=0}^{3} Σ_{k=0}^{1} in[c_in, t+(k−1)·d] · filter[c_out, c_in, k]
(Figure: worked example with the two filter matrices applied at time offsets −4 and 0.)
WaveNet architecture – Dilated convolutions
WaveNet models the conditional probability distribution P(x_t | x_1, …, x_{t−1}) with a stack of dilated causal convolutions.
(Figure: a stack of dilated causal convolutional layers — input, hidden layers with dilations 1, 2 and 4, and output with dilation 8.)
Stacked dilated convolutions enable very large receptive fields with just a few layers. The receptive field of the above example is (1 + 2 + 4 + 8) + 1 = 16. In WaveNet, the dilation is doubled for every layer up to a certain point and then the pattern is repeated: 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.
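The receptive-field arithmetic above is easy to sketch: with width-2 filters, each layer with dilation d adds d samples to the receptive field (function name is mine):

```python
# Receptive field of stacked dilated causal convolutions:
# each layer adds (filter_width - 1) * dilation samples.
def receptive_field(dilations, filter_width=2):
    return (filter_width - 1) * sum(dilations) + 1

# Dilation schedule of the original WaveNet: 1, 2, 4, ..., 512, repeated 5 times.
dilations = [2 ** i for i in range(10)] * 5

# The example from the slide: dilations 1, 2, 4, 8 give a receptive field of 16.
assert receptive_field([1, 2, 4, 8]) == 16
```

With the full schedule of 50 layers, the receptive field is 5116 samples.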
WaveNet architecture – Dilated convolutions
(Figure: example with dilations 1, 2, 4, 8, 1, 2, 4, 8 — two stacked blocks of dilated causal convolutional layers.)
WaveNet architecture – Residual connections
In order to train a WaveNet with more than 30 layers, residual connections are used. Residual networks were developed by researchers at Microsoft Research. They reformulated the mapping function g: x → g(x) between layers from g(x) = F(x) to g(x) = x + F(x). Residual networks have identity mappings, x, as skip connections, and inter-block activations F(x).
Benefits:
The residual F(x) can be learned more easily by the optimization algorithms.
The forward and backward signals can be propagated directly from one block to any other block.
The vanishing gradient problem is not a concern.
(Figure: two stacked residual blocks, each adding its weight-layer output F(x) to its input x.)
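The reformulation g(x) = x + F(x) can be sketched in a few lines. This is a minimal illustration, not the full WaveNet block; `weight` stands in for one weight layer inside F:

```python
import numpy as np

# Residual reformulation: instead of learning the full mapping g(x) = F(x),
# a block learns the residual F(x) and outputs g(x) = x + F(x).
def residual_block(x, weight):
    """Minimal sketch; `weight` plays the role of one weight layer in F."""
    fx = np.tanh(weight @ x)   # F(x): any differentiable sub-network
    return x + fx              # identity skip connection
```

Note that when F(x) = 0 the block is exactly the identity, which is why deep stacks of such blocks remain trainable.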
WaveNet architecture – Experts & Gates
WaveNet uses gated networks. For each output channel an expert is defined; experts may specialize in different parts of the input space. The contribution of each expert is controlled by a corresponding gate network. The components of the output vector are mixed in higher layers, creating a mixture of experts.
(Figure: two dilated convolutions feed a tanh expert branch and a σ gate branch, which are multiplied elementwise.)
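The expert/gate pair in the figure is the gated activation unit z = tanh(filter) ⊙ σ(gate). A minimal NumPy sketch (function names are mine):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Gated activation unit: z = tanh(filter_out) * sigmoid(gate_out).
# The tanh branch acts as the expert; the sigmoid branch is its gate,
# taking values in (0, 1) that scale the expert's contribution.
def gated_activation(filter_out, gate_out):
    return np.tanh(filter_out) * sigmoid(gate_out)
```

A strongly negative gate drives the output towards zero, effectively switching the expert off.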
WaveNet architecture – Output
WaveNet assigns to an input vector x_t a probability distribution using the softmax function
h(z)_i = e^{z_i} / Σ_{j=1}^{256} e^{z_j}, i = 1, …, 256.
Example with receptive field 4:
Input: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10
Output: p_4, p_5, p_6, p_7, p_8, p_9, p_10
Target: x_4, x_5, x_6, x_7, x_8, x_9, x_10
where p_4 = P(x_4 | x_1, x_2, x_3), p_5 = P(x_5 | x_2, x_3, x_4), ….
(Figure: the WaveNet output — a softmax probability distribution over the channels at each time step.)
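A minimal, numerically stable sketch of the softmax above (subtracting the maximum logit does not change the result but avoids overflow):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

The output is a proper categorical distribution: non-negative entries that sum to one, with the largest logit receiving the largest probability.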
WaveNet architecture – Loss function
Example with receptive field 4:
Input: x_1, …, x_10. Output: p_4, …, p_10. Target: x_4, …, x_10, where p_4 = P(x_4 | x_1, x_2, x_3), p_5 = P(x_5 | x_2, x_3, x_4), ….
During training, the estimate of the probability distribution p_t = P(x_t | x_{t−R}, …, x_{t−1}) is compared with the one-hot encoding of x_t. The difference between these two probability distributions is measured with the mean (across time) cross-entropy
H(x_4, …, x_T; p_4, …, p_T) = −(1/(T−3)) Σ_{t=4}^{T} x_t^⊤ log p_t = −(1/(T−3)) Σ_{t=4}^{T} Σ_i 1(i = x_t) log p_t(i).
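Because the target distribution is one-hot, the cross-entropy at each step reduces to the negative log-probability the network assigns to the correct bin. A minimal sketch (function name is mine):

```python
import numpy as np

def mean_cross_entropy(targets, probs):
    """Mean cross-entropy against one-hot targets.

    targets: (T,) integer bin indices; probs: (T, bins) softmax outputs.
    The one-hot inner product picks out log p_t(x_t) at each step.
    """
    rows = np.arange(len(targets))
    return -np.mean(np.log(probs[rows, targets]))
```

For uniform predictions over 4 bins the loss is log 4, the entropy of a fair 4-way choice.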
WaveNet – Audio generation
After training, the network is sampled to generate synthetic utterances. At each step during sampling, a value is drawn from the probability distribution computed by the network. This value is then fed back into the input, and a new prediction for the next step is made.
Example with receptive field 4 and 4 quantization channels:
Input: x_1, x_2, x_3. Output: p_4 = wavenet(x_1, x_2, x_3); sample: x_4 = 1.
Input: x_2, x_3, x_4. Output: p_5 = wavenet(x_2, x_3, x_4); sample: x_5 = 0.
(Figure: the probability distribution over the symbols 0, 1, 2, 3 at each step.)
WaveNet – Audio generation
Sampling methods:
Direct sampling: sample randomly from P(x).
Temperature sampling: sample randomly from a distribution adjusted by a temperature T, P_T(x) = (1/Z) P(x)^{1/T}, where Z is a normalizing constant.
Mode: take the most likely sample, argmax_x P(x).
Mean: take the mean of the distribution, E_P[x].
Top k: sample from an adjusted distribution that only permits the top k samples.
The generated samples x_t are scaled back to speech with the inverse μ-law transformation:
u = 2 x_t / (q − 1) − 1, converting x_t ∈ {0, 1, 2, …, q − 1} to u ∈ [−1, 1]
speech = sign(u) · ((1 + μ)^{|u|} − 1) / μ.
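Temperature sampling and the inverse μ-law can be sketched together. This is a minimal illustration under the conventions above (q = 256, μ = 255); the function names are mine:

```python
import numpy as np

def temperature_sample(probs, temperature=1.0, rng=None):
    """Sample a bin index from probabilities sharpened/flattened by T."""
    rng = rng if rng is not None else np.random.default_rng()
    p = probs ** (1.0 / temperature)
    p /= p.sum()                         # renormalize (the 1/Z constant)
    return rng.choice(len(p), p=p)

def mu_law_decode(bin_index, mu=255, bins=256):
    """Map a bin index back to a waveform sample via the inverse mu-law."""
    u = 2.0 * bin_index / (bins - 1) - 1.0   # {0..bins-1} -> [-1, 1]
    return np.sign(u) * ((1 + mu) ** np.abs(u) - 1) / mu
```

A temperature below 1 sharpens the distribution towards the mode; above 1 it flattens towards uniform sampling.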
Fast WaveNet – Audio generation
A naïve implementation of WaveNet generation requires time O(2^L) per sample, where L is the number of layers. Recently, Tom Le Paine et al. published their code for fast generation of sequences from trained WaveNets. Their algorithm uses queues to avoid redundant calculations of convolutions, and requires time O(L).
Basic WaveNet architecture
P(x_n | x_{n−1}, …, x_{n−R})
(Figure: 30 residual blocks, each followed by a 1×1 convolution; the skip connections, with 256 and 512 channels, are summed and passed through ReLU, 1×1 convolution, ReLU, 1×1 convolution and a softmax.)
Basic WaveNet architecture
(Figure: the full network. Pre-processing: one-hot input with d channels, followed by a causal convolution with r channels. 30 residual blocks with dilations from 1 to 512; each block contains a dilated convolution feeding tanh and σ branches that are multiplied together, a 1×1 convolution with r channels added to the identity mapping, and a skip connection through a 1×1 convolution with s channels. Post-processing: the summed skip connections pass through ReLU, 1×1 convolution with d channels, ReLU, 1×1 convolution with d channels, and a softmax producing the output and the loss. Parameters of the original WaveNet: d = 256, r = 512, s = 256 channels.)
WaveNet architecture – Global conditioning
P(x_n | x_{n−1}, …, x_{n−R}, s)
(Figure: a speaker id s is mapped through an embedding matrix E to a speaker embedding vector, which conditions every residual block alongside the inputs x_{n−4}, …, x_{n−1}; here the embedding has as many channels as there are embedding channels.)
WaveNet architecture – Global conditioning
P(x_n | x_{n−1}, …, x_{n−R}, s)
(Figure: the same scheme with the speaker embedding vector projected to the number of residual channels and shared across all residual blocks.)
WaveNet architecture – Local conditioning
P(x_n | x_{n−1}, …, x_{n−R}, h_n)
(Figure: linguistic features are upsampled to the audio sample rate, producing a conditioning vector h_n at every time step n, which is fed to every residual block.)
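Conditioning enters each residual block through its gated activation: the conditioning vector is projected and added to both the filter and the gate branches. A minimal sketch with assumed weight names (W for the signal path, V for the conditioning path), not taken from the slides:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Conditioned gated activation:
# z = tanh(W_f @ x + V_f @ h) * sigmoid(W_g @ x + V_g @ h),
# where h is the (upsampled) conditioning vector at the current time step.
def conditioned_gate(x, h, W_f, V_f, W_g, V_g):
    return np.tanh(W_f @ x + V_f @ h) * sigmoid(W_g @ x + V_g @ h)
```

For global conditioning, h is the same speaker embedding at every time step; for local conditioning, h varies with n.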
WaveNet architecture – Local conditioning
P(x_n | x_{n−1}, …, x_{n−R}, h_n)
(Figure: the linguistic features are mapped to an embedding at time n, which is fed to every residual block.)
WaveNet architecture – Local conditioning
P(x_n | x_{n−1}, …, x_{n−R}, h_n)
(Figure: acoustic features are upsampled and passed through tanh non-linearities before conditioning each residual block.)
WaveNet architecture – Local and global conditioning
P(x_n | x_{n−1}, …, x_{n−R}, h_n, s_n)
(Figure: upsampled acoustic features h_n and an upsampled speaker id s_n together condition every residual block.)
WaveNet architecture for TTS
(Figure: the same residual architecture as before, with a second input path — linguistic labels are upsampled and enter each residual block through 1×1 convolutions, alongside the speech input that passes through the one-hot encoding and the causal convolution.)
WaveNet architecture – Improvements
P(x_n | x_{n−1}, …, x_{n−R})
(Figure: the basic architecture, in which the skip connections of the 30 residual blocks are summed before the post-processing stack.)
WaveNet architecture – Improvements
P(x_n | x_{n−1}, …, x_{n−R})
(Figure: instead of summing the skip connections, the outputs of the 30 residual blocks with 512 channels are concatenated into a 512 × 30 tensor before the 1×1 convolution.)
Replacing the summation with concatenation gives up to a 10% increase in speed.
WaveNet architecture – Improvements
(Figure: the original architecture for comparison — separate dilated convolutions for the tanh and σ branches in each residual block; parameters of the original WaveNet: d = 256, r = 512, s = 256 channels, 30 residual blocks with dilations 1 to 512.)
WaveNet architecture – Improvements
(Figure: each residual block now uses a single 2×1 dilated convolution with 2r channels, split between the tanh and σ branches, and the skip connections are concatenated before the 1×1 convolution with s channels. Parameters of the original WaveNet: d = 256, r = 512, s = 256 channels, 30 residual blocks.)
WaveNet architecture – Improvements
(Figure: the improved architecture, with a 2×1 dilated convolution with 2r channels per block and concatenated skip connections. Parameters of the improved WaveNet: d = 256, r = 64, s = 256 channels, 40 to 80 residual blocks.)
WaveNet architecture – Improvements
Post-processing in the original WaveNet (8-bit quantization): ReLU, 1×1 convolution, ReLU, 1×1 convolution, softmax, producing P(x_n | x_{n−1}, …, x_{n−R}).
Post-processing in the new WaveNet (high-fidelity WaveNet, 16-bit quantization): ReLU, 1×1 convolution, ReLU, 1×1 convolution, producing the parameters of a discretized mixture of logistics
P(x | π, μ, s) = Σ_{i=1}^{K} π_i [σ((x + 0.5 − μ_i)/s_i) − σ((x − 0.5 − μ_i)/s_i)], where σ(x) = 1/(1 + e^{−x}).
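The mixture formula above assigns each quantization bin [x − 0.5, x + 0.5] the probability mass a logistic CDF places on it, summed over the K components. A minimal NumPy sketch (function name is mine):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def discretized_logistic_mixture(x, pi, mu, s):
    """P(x | pi, mu, s) for a discretized mixture of logistics.

    x: scalar sample value; pi, mu, s: (K,) mixture weights, means, scales.
    Each component's mass on the bin [x-0.5, x+0.5] is a CDF difference,
    since sigmoid is the CDF of the logistic distribution.
    """
    upper = sigmoid((x + 0.5 - mu) / s)
    lower = sigmoid((x - 0.5 - mu) / s)
    return float(np.sum(pi * (upper - lower)))
```

With a single standard logistic component, the probability of the central bin is σ(0.5) − σ(−0.5) ≈ 0.245.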
WaveNet architecture – Improvements
The original WaveNet minimizes the cross-entropy between the desired one-hot distribution (0, …, 0, 1, 0, …, 0) and the network prediction P(x_n | x_{n−1}, …, x_{n−R}) = (y_1, y_2, …, y_256). The new WaveNet maximizes the log-likelihood
(1/(T−R)) Σ_{n=R+1}^{T} log P(π, μ, s | x_n).
WaveNet architecture – Knowledge Distillation
Comments
According to S. Arik et al.:
The number of dilated modules should be ≥ 40.
Models trained with 48 kHz speech produce higher quality audio than models trained with 16 kHz speech.
The model needs a large number of iterations to converge.
The speech quality is strongly affected by the up-sampling method of the linguistic labels.
The Adam optimization algorithm is a good choice.
Conditioning: pentaphones + stress + continuous F0 + VUV.
IBM, in March 2017, announced a new industry record in word error rate for conversational speech recognition. IBM exploited the complementarity between recurrent and convolutional architectures by adding word- and character-based LSTM language models and a convolutional WaveNet language model. (G. Saon et al., English Conversational Telephone Speech Recognition by Humans and Machines.)
Generated audio samples for the original WaveNet
Basic WaveNet model trained with the cmu_us_slt_arctic-0.95-release.zip database (~40 min).
References
van den Oord, Aäron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray. WaveNet: A Generative Model for Raw Audio. arXiv.
Arik, Sercan; Chrzanowski, Mike; Coates, Adam; Diamos, Gregory; Gibiansky, Andrew; Kang, Yongguo; Li, Xian; Miller, John; Ng, Andrew; Raiman, Jonathan; Sengupta, Shubho; Shoeybi, Mohammad. Deep Voice: Real-time Neural Text-to-Speech. arXiv.
Arik, Sercan; Diamos, Gregory; Gibiansky, Andrew; Miller, John; Peng, Kainan; Ping, Wei; Raiman, Jonathan; Zhou, Yanqi. Deep Voice 2: Multi-Speaker Neural Text-to-Speech. arXiv.
Le Paine, Tom; Khorrami, Pooya; Chang, Shiyu; Zhang, Yang; Ramachandran, Prajit; Hasegawa-Johnson, Mark A.; Huang, Thomas S. Fast Wavenet Generation Algorithm. arXiv.
Ramachandran, Prajit; Le Paine, Tom; Khorrami, Pooya; Babaeizadeh, Mohammad; Chang, Shiyu; Zhang, Yang; Hasegawa-Johnson, Mark A.; Campbell, Roy H.; Huang, Thomas S. Fast Generation for Convolutional Autoregressive Models. arXiv.
WaveNet implementation in TensorFlow (author: Igor Babuschkin et al.).
Fast WaveNet implementation in TensorFlow (authors: Tom Le Paine, Pooya Khorrami, Prajit Ramachandran and Shiyu Chang).
van den Oord, Aäron, et al. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv.
Salimans, Tim; Karpathy, Andrej; Chen, Xi; Kingma, Diederik P. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications.