Artificial Neural Networks in Speech Processing

Artificial Neural Networks in Speech Processing
May 2017
Vassilis Tsiaras, Computer Science Department, University of Crete

Artificial Neural Networks
Artificial neural networks are a family of models inspired by biological neural networks and used as multivariate function approximators. In 1943, Warren McCulloch and Walter Pitts proposed a computational model of the neuron.
[Diagram: a neuron with inputs $x_1, x_2, x_3, x_4$, weights $w_1, w_2, w_3, w_4$, bias $b$, activation $\varphi$, and output $y$.]
$z = \sum_i w_i x_i + b$
$y = \varphi(z) = \begin{cases} 0, & z < 0 \\ 1, & z \geq 0 \end{cases}$
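As a minimal illustration of this neuron model (not part of the original slides), the following Python sketch computes the McCulloch-Pitts output for hypothetical weights and bias:

```python
import numpy as np

def mcculloch_pitts(x, w, b):
    """McCulloch-Pitts neuron: weighted sum followed by a step activation."""
    z = np.dot(w, x) + b           # z = sum_i w_i * x_i + b
    return 1.0 if z >= 0 else 0.0  # y = step(z)

# Example with hypothetical weights and bias (illustrative values only).
x = np.array([1.0, 0.0, 1.0, 1.0])
w = np.array([0.5, -0.3, 0.8, 0.1])
b = -1.0
print(mcculloch_pitts(x, w, b))    # prints 1.0, since z = 0.4 >= 0
```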

Activation functions
Step function: $\varphi(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$, with $\frac{d\varphi(x)}{dx} = 0$ for $x \neq 0$
Sigmoid $\sigma(x)$: $\varphi(x) = \dfrac{1}{1 + e^{-x}}$, with $\frac{d\varphi(x)}{dx} = \big(1 - \varphi(x)\big)\varphi(x)$
Hyperbolic tangent $\tanh(x)$: $\varphi(x) = \dfrac{2}{1 + e^{-2x}} - 1$, with $\frac{d\varphi(x)}{dx} = 1 - \varphi(x)^2$
Rectified linear unit (ReLU): $\varphi(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$, with $\frac{d\varphi(x)}{dx} = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$
Linear function: $\varphi(x) = x$, with $\frac{d\varphi(x)}{dx} = 1$
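A small numpy sketch of these activations and their derivatives, matching the formulas above (illustrative, not from the original slides):

```python
import numpy as np

def step(x):
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return (1.0 - s) * s            # (1 - phi(x)) * phi(x)

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def d_tanh(x):
    return 1.0 - tanh(x) ** 2       # 1 - phi(x)^2

def relu(x):
    return np.where(x < 0, 0.0, x)

def d_relu(x):
    return np.where(x < 0, 0.0, 1.0)
```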

Representation of artificial neural networks
[Diagram: a two-layer network drawn as a graph, with input $x = \big(x(1), x(2), x(3)\big)$, first-layer weights $W_1(i,j)$ and biases $b_1(j)$, hidden outputs $y_1(1), y_1(2), y_1(3)$, second-layer weights $W_2(i,j)$ and biases $b_2(j)$, and outputs $y_2(1), y_2(2)$. The same network is also drawn as a block diagram: $x \to (\times W_1 + b_1) \to \varphi_1 \to y_1 \to (\times W_2 + b_2) \to \varphi_2 \to y_2$, with $x \in \mathbb{R}^{1 \times d_{in}}$, $W_1 \in \mathbb{R}^{d_{in} \times d_1}$, $b_1 \in \mathbb{R}^{1 \times d_1}$, $z_1, y_1 \in \mathbb{R}^{1 \times d_1}$, $W_2 \in \mathbb{R}^{d_1 \times d_{out}}$, $b_2 \in \mathbb{R}^{1 \times d_{out}}$, and $z_2, y_2 \in \mathbb{R}^{1 \times d_{out}}$.]

Representation of artificial neural networks
$z_1 = x W_1 + b_1$, $\quad y_1 = \varphi_1(z_1)$
$z_2 = y_1 W_2 + b_2$, $\quad y_2 = \varphi_2(z_2)$
$\varphi_l(z_l) = \big(y_l(1), y_l(2), \ldots, y_l(d_l)\big)$, where $y_l(k) = \varphi_l\big(z_l(k)\big)$ for an element-wise activation, or, for the softmax activation, $y_l(k) = \dfrac{e^{z_l(k)}}{\sum_{i=1}^{d_l} e^{z_l(i)}}$.
[Block diagram as on the previous slide: $x \to (\times W_1 + b_1) \to \varphi_1 \to y_1 \to (\times W_2 + b_2) \to \varphi_2 \to y_2$.]
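For concreteness, a minimal numpy forward pass for this two-layer network, using the slide's row-vector convention; the layer sizes, activations, and random values below are hypothetical (not from the original slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2, phi1=np.tanh, phi2=softmax):
    """Two-layer forward pass: z_l = y_{l-1} W_l + b_l, y_l = phi_l(z_l)."""
    z1 = x @ W1 + b1
    y1 = phi1(z1)
    z2 = y1 @ W2 + b2
    y2 = phi2(z2)
    return y2

# Hypothetical dimensions: d_in = 3, d_1 = 4, d_out = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 2)), np.zeros((1, 2))
print(forward(x, W1, b1, W2, b2))    # 1x2 row of softmax probabilities
```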

Linear regression
Suppose that we want to predict a variable $t$ from a variable $x$.
We assume that the relationship between the independent variable $x$ and the dependent variable $t$ is of the form $t = q(x) + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$.
The function $q$ will be determined from a set of points $\{(x_i, t_i) \mid i = 1, \ldots, n\}$.
In order to determine $q$, its form is restricted. The simplest relationship between two variables is the linear one, $y = wx + b$ (linear regression).
In this case the parameters $w$ and $b$ are estimated from the data.
The mean squared error between the target values $t_i$ and the corresponding estimates $y_i$ is
$E = \frac{1}{n} \sum_{i=1}^{n} (t_i - w x_i - b)^2$
We seek the parameters $w$ and $b$ that minimize the error $E$.

Linear regression
Mean squared error: $E(w, b) = \frac{1}{n} \sum_{i=1}^{n} (t_i - w x_i - b)^2$
We seek the minimum of $E$ at the roots of the partial derivatives:
$\frac{\partial E}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - t_i)\, x_i = 0$
$\frac{\partial E}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - t_i) = 0$
From the above equations, the least squares estimators are:
$w = \dfrac{n \sum_{i=1}^{n} x_i t_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} t_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$
$b = \bar{t} - w \bar{x}$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i$.
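A short numpy sketch of these closed-form estimators (illustrative; the synthetic data and parameter values below are my own, not the slide's example):

```python
import numpy as np

def least_squares(x, t):
    """Closed-form least squares estimators for t ~ w*x + b."""
    n = len(x)
    w = (n * np.sum(x * t) - np.sum(x) * np.sum(t)) / \
        (n * np.sum(x ** 2) - np.sum(x) ** 2)
    b = np.mean(t) - w * np.mean(x)
    return w, b

# Synthetic check: recover w ~ 2, b ~ -1 from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 14, size=1500)
t = 2.0 * x - 1.0 + rng.normal(0, 0.5, size=1500)
print(least_squares(x, t))
```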

Linear regression - Example
We generate 1500 points from the function $t = q(x) + \varepsilon$, where $\varepsilon \sim N(0, 0.25)$; the plot of $q$ is shown in the left figure. We will estimate the parameters of the linear regression that minimizes the least squares error.
[Figures: plot of $q$ (left) and the samples $\{(x_i, t_i) \mid i = 1, \ldots, n\}$ (right).]

Linear regression - Example
From the normal equations we obtain $w = 0.897945$ and $b = 0.6412331$.
Although we estimated the parameters $w$ and $b$ from the normal equations, we will also compute them by minimizing the mean squared error with the steepest descent iterative method. We do this for two reasons:
1. Iterative methods can also be applied to non-linear regression.
2. The way real neurons work is closer to iterative methods than to the solution of algebraic equations.

Linear regression - Example
[Figures: sections of the graph of the error $E$ with the planes $b = 0.6412$ (i.e., $E(w, 0.6412)$ versus $w$) and $w = 0.8979$ (i.e., $E(0.8979, b)$ versus $b$), and contours of the mean squared error.]
The error $E(w, b)$ has a global minimum.

The steepest descent method
Let $\theta^{(k)} = \big(w^{(k)}, b^{(k)}\big)$ be a point of the $(w, b)$ plane and $p^{(k)}$ a vector of that plane. Let $\eta$ be a small positive number. Then $\theta^{(k)} + \eta\, p^{(k)}$ is a point of the $(w, b)$ plane near $\theta^{(k)}$.
From the Taylor expansion, after discarding the terms of second and higher order in $\eta$, we get
$E\big(\theta^{(k)} + \eta\, p^{(k)}\big) \approx E\big(\theta^{(k)}\big) + \eta\, {p^{(k)}}^{T} \cdot \nabla E\big(\theta^{(k)}\big)$
where $\nabla E\big(\theta^{(k)}\big) = \Big[\tfrac{\partial E}{\partial w}\big(\theta^{(k)}\big),\ \tfrac{\partial E}{\partial b}\big(\theta^{(k)}\big)\Big]^{T}$
The error decreases when the number $\eta\, {p^{(k)}}^{T} \cdot \nabla E\big(\theta^{(k)}\big)$ is negative.
One possible choice is $p^{(k)} = -\nabla E\big(\theta^{(k)}\big)$ and $\eta$ "small".
[Diagram: the points $\theta^{(k)}$ and $\theta^{(k)} + \eta\, p^{(k)}$ and the step $\eta\, p^{(k)}$ in the $(w, b)$ plane.]

The steepest descent method
Initialization: $\theta^{(0)} = \big[w^{(0)}, b^{(0)}\big] = \big[N(0, \sigma_w^2),\ 0\big]$
For $k = 0, \ldots, \text{numIts} - 1$:
  $\theta^{(k+1)} = \theta^{(k)} - \eta\, \nabla E\big(\theta^{(k)}\big)$
The method diverges when $\eta$ is not sufficiently small.
The vector $\nabla E\big(\theta^{(k)}\big)$ is perpendicular to the contour lines of $E$.
[Diagram: the iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}, \theta^{(4)}$ on the contour plot of $E$.]
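A minimal sketch of this loop for the linear-regression error $E(w, b)$, using the gradient formulas from the earlier slide; the learning rate, iteration count, and function name are my own illustrative choices:

```python
import numpy as np

def steepest_descent(x, t, eta=0.001, num_its=5000, sigma_w=0.1, seed=0):
    """Minimize E(w, b) = mean((t - w*x - b)^2) by steepest descent."""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(0, sigma_w), 0.0       # theta^(0) = [N(0, sigma_w^2), 0]
    n = len(x)
    for _ in range(num_its):
        err = w * x + b - t                  # (w x_i + b - t_i)
        dw = (2.0 / n) * np.sum(err * x)     # dE/dw
        db = (2.0 / n) * np.sum(err)         # dE/db
        w, b = w - eta * dw, b - eta * db    # theta^(k+1) = theta^(k) - eta * grad
    return w, b
```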

The steepest descent method with momentum
For each of the parameters $w, b$ we consider its velocity, $v_w$ and $v_b$.
Initialization: $\theta^{(0)} = \big[w^{(0)}, b^{(0)}\big] = \big[N(0, \sigma_w^2),\ 0\big]$
Initialization: $v^{(0)} = \big[v_w^{(0)}, v_b^{(0)}\big] = [0, 0]$
For $k = 0, \ldots, \text{numIts} - 1$:
  $v^{(k+1)} = \mu\, v^{(k)} - \eta\, \nabla E\big(\theta^{(k)}\big)$
  $\theta^{(k+1)} = \theta^{(k)} + v^{(k+1)}$
[Diagram: the gradient vectors $\nabla E$ and the accumulated velocity $v \approx \eta \times \text{sum of } \nabla E$; trajectories of steepest descent with and without momentum.]
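The same loop with a momentum term, as a sketch; the momentum coefficient μ = 0.9 and the other defaults are hypothetical:

```python
import numpy as np

def steepest_descent_momentum(x, t, eta=0.001, mu=0.9, num_its=5000,
                              sigma_w=0.1, seed=0):
    """Steepest descent with momentum for E(w, b) = mean((t - w*x - b)^2)."""
    rng = np.random.default_rng(seed)
    theta = np.array([rng.normal(0, sigma_w), 0.0])   # [w, b]
    v = np.zeros(2)                                   # [v_w, v_b]
    n = len(x)
    for _ in range(num_its):
        err = theta[0] * x + theta[1] - t
        grad = np.array([(2.0 / n) * np.sum(err * x),
                         (2.0 / n) * np.sum(err)])
        v = mu * v - eta * grad                       # v^(k+1) = mu v^(k) - eta grad
        theta = theta + v                             # theta^(k+1) = theta^(k) + v^(k+1)
    return theta                                      # [w, b]
```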

Linear neuron
We consider a linear neuron, whose output $y$ is given by the equation $y = wx + b$.
Using 1500 samples $\{(x_i, t_i) \mid i = 1, \ldots, 1500\}$, we train the neuron with the steepest descent method so as to minimize $E(w, b)$.
The optimal values of the parameters are $w = 0.92$ and $b = 0.5494$.
For $x \in [-1, 14]$, the output of the neuron is shown in the following figure.
[Figure: samples (dots), linear regression (dashed), and linear neuron output (solid); block diagram $x \to (\times w + b) \to y$.]

Linear neural network
We consider a neural network whose structure is shown in the following figure.
[Block diagram: $x \to (\times W_1 + b_1) \to \text{linear} \to (\times W_2 + b_2) \to \text{linear} \to y_2$, with $W_1 \in \mathbb{R}^{1 \times 1024}$ and $W_2 \in \mathbb{R}^{1024 \times 1}$; figure: samples, linear regression, and network output.]
No matter how many parameters it has, a linear network can only learn lines.

Non-linear neural network
[Block diagram: $x \to (\times W_1 + b_1) \to \tanh \to (\times W_2 + b_2) \to \text{linear} \to y_2$, with $z_1, y_1 \in \mathbb{R}^{1 \times 1024}$, $W_2 \in \mathbb{R}^{1024 \times 1}$, and $z_2 \in \mathbb{R}^{1 \times 1}$; figure: samples, linear regression, and network output.]

Non-linear neural network
[Block diagram: $x \to (\times W_1 + b_1) \to \varphi_1 \to (\times W_2 + b_2) \to \tanh \to (\times W_3 + b_3) \to \text{linear} \to y_3$, with $z_1, y_1 \in \mathbb{R}^{1 \times 128}$, $W_2 \in \mathbb{R}^{128 \times 256}$, $z_2, y_2 \in \mathbb{R}^{1 \times 256}$, $W_3 \in \mathbb{R}^{256 \times 1}$, and $z_3 \in \mathbb{R}^{1 \times 1}$; figure: samples, linear regression, and network output.]

Output of an artificial neural network
Input: $x = \big(x(1), \ldots, x(d_{in})\big) \in \mathbb{R}^{1 \times d_{in}}$
$z_1 = x W_1 + b_1$, $\quad y_1 = \varphi_1(z_1)$
$z_2 = y_1 W_2 + b_2$, $\quad y_2 = \varphi_2(z_2)$
$z_3 = y_2 W_3 + b_3$, $\quad y_3 = \varphi_3(z_3)$
Output: $y_3 = \big(y_3(1), \ldots, y_3(d_{out})\big) \in \mathbb{R}^{1 \times d_{out}}$
$y_3 = \varphi_3\Big(\varphi_2\big(\varphi_1(x W_1 + b_1)\, W_2 + b_2\big)\, W_3 + b_3\Big) = \alpha_\theta(x)$, i.e., the network is the function $\alpha\big(\cdot\,; W_1, b_1, W_2, b_2, W_3, b_3\big)$ mapping $x$ to $y_3$.
[Block diagram: $x \to (\times W_1 + b_1) \to \varphi_1 \to (\times W_2 + b_2) \to \varphi_2 \to (\times W_3 + b_3) \to \varphi_3 \to y_3$, with $W_1 \in \mathbb{R}^{d_{in} \times d_1}$, $W_2 \in \mathbb{R}^{d_1 \times d_2}$, $W_3 \in \mathbb{R}^{d_2 \times d_{out}}$, and $y_1 \in \mathbb{R}^{1 \times d_1}$, $y_2 \in \mathbb{R}^{1 \times d_2}$, $y_3 \in \mathbb{R}^{1 \times d_{out}}$.]

Cost function
Given a set of samples $\{(x_i, t_i) \mid i = 1, \ldots, n\}$ with $x_i \in \mathbb{R}^{1 \times d_{in}}$ and $t_i \in \mathbb{R}^{1 \times d_{out}}$, let $y_i = \alpha_\theta(x_i)$ be the output vector of the network.
The mean cost over all samples is $E_{data} = \frac{1}{n} \sum_{i=1}^{n} L(t_i, y_i)$.
In speech synthesis, the cost function is the mean squared error:
$L(t_i, y_i) = \frac{1}{2 d_{out}} \| t_i - y_i \|^2 = \frac{1}{2 d_{out}} \sum_{k=1}^{d_{out}} \big(t_i(k) - y_i(k)\big)^2$
In speech recognition, the cost function is the cross-entropy:
$L(t_i, y_i) = - \sum_{k=1}^{d_{out}} t_i(k) \log\big(y_i(k)\big)$
where $t_i = (0, \cdots, 0, 1, 0, \cdots, 0)^T$ is a one-hot encoding vector, equal to 1 at the position of the correct class and 0 elsewhere.
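A small numpy sketch of these two per-sample losses (illustrative values; the epsilon guard is my own addition):

```python
import numpy as np

def mse_loss(t, y):
    """Mean squared error: 1/(2*d_out) * sum_k (t(k) - y(k))^2."""
    d_out = t.shape[-1]
    return np.sum((t - y) ** 2) / (2.0 * d_out)

def cross_entropy_loss(t, y, eps=1e-12):
    """Cross-entropy for a one-hot target t and softmax output y."""
    return -np.sum(t * np.log(y + eps))   # eps avoids log(0)

t = np.array([0.0, 1.0, 0.0])             # one-hot target
y = np.array([0.1, 0.7, 0.2])             # softmax output
print(mse_loss(t, y), cross_entropy_loss(t, y))
```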

Regularization
When the number of parameters of a model is large compared to the number of samples, the model may overfit the data. In over-fitting, a statistical model describes random errors or noise instead of the relationship between the input and output data.
One way to avoid overfitting is to add regularization terms to the cost function:
$E = E_{data} + \lambda_1 L_1(\theta) + \lambda_2 L_2(\theta)$
where
$L_1(\theta) = \sum_{l=1}^{3} \| W_l \|_1 + \sum_{l=1}^{3} \| b_l \|_1$
$L_2(\theta) = \sum_{l=1}^{3} \| W_l \|_2^2 + \sum_{l=1}^{3} \| b_l \|_2^2$
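A sketch of these penalty terms in numpy; the function names and the default λ values are hypothetical:

```python
import numpy as np

def l1_penalty(weights, biases):
    """L1(theta) = sum_l ||W_l||_1 + sum_l ||b_l||_1."""
    return sum(np.sum(np.abs(W)) for W in weights) + \
           sum(np.sum(np.abs(b)) for b in biases)

def l2_penalty(weights, biases):
    """L2(theta) = sum_l ||W_l||_2^2 + sum_l ||b_l||_2^2."""
    return sum(np.sum(W ** 2) for W in weights) + \
           sum(np.sum(b ** 2) for b in biases)

def total_cost(E_data, weights, biases, lam1=1e-4, lam2=1e-4):
    """E = E_data + lambda_1 * L1(theta) + lambda_2 * L2(theta)."""
    return E_data + lam1 * l1_penalty(weights, biases) \
                  + lam2 * l2_penalty(weights, biases)
```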

Regularization with dropout
Another way to avoid overfitting is to randomly deactivate neurons during training.
Dropout, in contrast to $L_1$ or $L_2$ regularization, does not change the cost function.
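As an illustration (not from the slides), one common way to realize this is an inverted-dropout mask applied to a layer's activations during training; the drop probability below is hypothetical:

```python
import numpy as np

def dropout(y, p_drop=0.5, training=True, rng=np.random.default_rng()):
    """Randomly deactivate units of activation y with probability p_drop.

    Inverted dropout: surviving units are scaled by 1/(1 - p_drop) so that
    no rescaling is needed at test time.
    """
    if not training or p_drop == 0.0:
        return y
    mask = rng.random(y.shape) >= p_drop
    return y * mask / (1.0 - p_drop)
```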

Learning with backpropagation
The steepest descent method requires the computation of the gradient $\nabla E\big(\theta^{(k)}\big)$.
The regularized cost is
$E = E_{data} + \lambda_1 \sum_{l=1}^{3} \| W_l \|_1 + \lambda_1 \sum_{l=1}^{3} \| b_l \|_1 + \lambda_2 \sum_{l=1}^{3} \| W_l \|_2^2 + \lambda_2 \sum_{l=1}^{3} \| b_l \|_2^2$
and its gradient is
$\nabla_\theta E = \Big[\tfrac{\partial E}{\partial W_1}, \tfrac{\partial E}{\partial b_1}, \tfrac{\partial E}{\partial W_2}, \tfrac{\partial E}{\partial b_2}, \tfrac{\partial E}{\partial W_3}, \tfrac{\partial E}{\partial b_3}\Big]$
For the regularization terms, $\frac{\partial \| W_l \|_2^2}{\partial W_l} = 2 W_l$ and $\frac{\partial \| b_l \|_2^2}{\partial b_l} = 2 b_l$.
If all elements of $W_l$ are different from zero, then the derivative $\frac{\partial \| W_l \|_1}{\partial W_l}$ is defined and equal to $\mathrm{sign}(W_l)$; similarly, $\frac{\partial \| b_l \|_1}{\partial b_l} = \mathrm{sign}(b_l)$.
The computation of the partial derivatives $\frac{\partial E_{data}}{\partial W_1}, \frac{\partial E_{data}}{\partial b_1}, \frac{\partial E_{data}}{\partial W_2}, \frac{\partial E_{data}}{\partial b_2}, \frac{\partial E_{data}}{\partial W_3}, \frac{\partial E_{data}}{\partial b_3}$ is done using the chain rule and is reduced to the computation of the partial derivatives $\frac{\partial E_{data}}{\partial z_1}, \frac{\partial E_{data}}{\partial z_2}, \frac{\partial E_{data}}{\partial z_3}$.

Computation of the partial derivatives
We assume that for input $x_i$ the output of the neural network is
$y_{i,3} = \varphi_3\Big(\varphi_2\big(\varphi_1(x_i W_1 + b_1)\, W_2 + b_2\big)\, W_3 + b_3\Big)$
Then:
$\frac{\partial L(t_i, y_{i,3})}{\partial z_{i,3}} = \frac{\partial L(t_i, y_{i,3})}{\partial y_{i,3}} \odot \frac{\partial y_{i,3}}{\partial z_{i,3}} = \frac{\partial L(t_i, y_{i,3})}{\partial y_{i,3}} \odot \frac{\partial \varphi_3(z_{i,3})}{\partial z_{i,3}}$
$\frac{\partial L(t_i, y_{i,3})}{\partial z_{i,2}} = \Big(\frac{\partial L(t_i, y_{i,3})}{\partial z_{i,3}} \cdot W_3^T\Big) \odot \frac{\partial \varphi_2(z_{i,2})}{\partial z_{i,2}}$
$\frac{\partial L(t_i, y_{i,3})}{\partial z_{i,1}} = \Big(\frac{\partial L(t_i, y_{i,3})}{\partial z_{i,2}} \cdot W_2^T\Big) \odot \frac{\partial \varphi_1(z_{i,1})}{\partial z_{i,1}}$
$\frac{\partial E_{data}}{\partial b_3} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L(t_i, y_{i,3})}{\partial z_{i,3}}, \quad \frac{\partial E_{data}}{\partial b_2} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L(t_i, y_{i,3})}{\partial z_{i,2}}, \quad \frac{\partial E_{data}}{\partial b_1} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L(t_i, y_{i,3})}{\partial z_{i,1}}$
$\frac{\partial E_{data}}{\partial W_3} = \frac{1}{n} \sum_{i=1}^{n} y_{i,2}^T \cdot \frac{\partial L(t_i, y_{i,3})}{\partial z_{i,3}}, \quad \frac{\partial E_{data}}{\partial W_2} = \frac{1}{n} \sum_{i=1}^{n} y_{i,1}^T \cdot \frac{\partial L(t_i, y_{i,3})}{\partial z_{i,2}}, \quad \frac{\partial E_{data}}{\partial W_1} = \frac{1}{n} \sum_{i=1}^{n} x_i^T \cdot \frac{\partial L(t_i, y_{i,3})}{\partial z_{i,1}}$

Learning with backpropagation
$dW_3 = 0$, $dW_2 = 0$, $dW_1 = 0$, $db_3 = 0$, $db_2 = 0$, $db_1 = 0$
For $i = 1 : n$
  // Forward pass
  $z_{i,1} = x_i W_1 + b_1$, $\quad y_{i,1} = \varphi_1(z_{i,1})$
  $z_{i,2} = y_{i,1} W_2 + b_2$, $\quad y_{i,2} = \varphi_2(z_{i,2})$
  $z_{i,3} = y_{i,2} W_3 + b_3$, $\quad y_{i,3} = \varphi_3(z_{i,3})$
  // Backpropagation
  $D_{i,3} = \frac{\partial L(t_i, y_{i,3})}{\partial y_{i,3}} \odot \frac{\partial \varphi_3(z_{i,3})}{\partial z_{i,3}}$
  $D_{i,2} = \big(D_{i,3} \cdot W_3^T\big) \odot \frac{\partial \varphi_2(z_{i,2})}{\partial z_{i,2}}$
  $D_{i,1} = \big(D_{i,2} \cdot W_2^T\big) \odot \frac{\partial \varphi_1(z_{i,1})}{\partial z_{i,1}}$
  $db_3 = db_3 + D_{i,3}$, $\quad db_2 = db_2 + D_{i,2}$, $\quad db_1 = db_1 + D_{i,1}$
  $dW_3 = dW_3 + y_{i,2}^T \cdot D_{i,3}$, $\quad dW_2 = dW_2 + y_{i,1}^T \cdot D_{i,2}$, $\quad dW_1 = dW_1 + x_i^T \cdot D_{i,1}$
For $l = 1 : 3$  // SGD
  $W_l^{new} = W_l - \eta\big(dW_l + \lambda_1\, \mathrm{sign}(W_l) + 2 \lambda_2 W_l\big)$
  $b_l^{new} = b_l - \eta\big(db_l + \lambda_1\, \mathrm{sign}(b_l) + 2 \lambda_2 b_l\big)$
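A compact numpy sketch of this training step for the three-layer network; the choices of tanh hidden layers, a linear output, per-sample squared-error loss (dropping the 1/(2 d_out) factor), and the function name are my own illustrative assumptions, whereas the slide's algorithm is written for general activations and losses:

```python
import numpy as np

def train_step(X, T, params, eta=0.01, lam1=0.0, lam2=0.0):
    """One pass of the slide's backpropagation algorithm over n samples.

    X: (n, d_in), T: (n, d_out). Hidden activations: tanh; output: linear;
    per-sample loss: squared error, so dL/dy3 = y3 - t.
    """
    W1, b1, W2, b2, W3, b3 = params
    n = X.shape[0]
    dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
    dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
    dW3 = np.zeros_like(W3); db3 = np.zeros_like(b3)
    for i in range(n):
        x, t = X[i:i+1], T[i:i+1]                  # keep row-vector shapes
        # Forward pass
        z1 = x @ W1 + b1;  y1 = np.tanh(z1)
        z2 = y1 @ W2 + b2; y2 = np.tanh(z2)
        z3 = y2 @ W3 + b3; y3 = z3                 # linear output
        # Backpropagation
        D3 = (y3 - t)                              # dL/dy3 ⊙ dphi3/dz3 (linear)
        D2 = (D3 @ W3.T) * (1.0 - y2 ** 2)         # ⊙ dtanh/dz2
        D1 = (D2 @ W2.T) * (1.0 - y1 ** 2)         # ⊙ dtanh/dz1
        db3 += D3; db2 += D2; db1 += D1
        dW3 += y2.T @ D3; dW2 += y1.T @ D2; dW1 += x.T @ D1
    # Parameter update with L1/L2 regularization terms
    new = []
    for W, dW in [(W1, dW1), (W2, dW2), (W3, dW3)]:
        new.append(W - eta * (dW + lam1 * np.sign(W) + 2 * lam2 * W))
    for b, db in [(b1, db1), (b2, db2), (b3, db3)]:
        new.append(b - eta * (db + lam1 * np.sign(b) + 2 * lam2 * b))
    W1, W2, W3, b1, b2, b3 = new
    return W1, b1, W2, b2, W3, b3
```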

Recurrent neural networks
$z_i = y_{i-1} U + x_i W + b$
$y_i = \varphi(z_i)$
[Block diagram: a recurrent layer $x \to (\times W + b) \to (+\, y_{prev} \times U) \to \varphi \to y$, with $W \in \mathbb{R}^{d_{in} \times d_{out}}$, $b \in \mathbb{R}^{1 \times d_{out}}$, $U \in \mathbb{R}^{d_{out} \times d_{out}}$, and $z, y \in \mathbb{R}^{1 \times d_{out}}$; the same network unrolled in time: $y_0 \to y_1 \to y_2 \to y_3 \to y_4$ through $\times U$, with inputs $x_1, \ldots, x_4$ entering through $\times W + b$.]
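A minimal numpy sketch of this recurrence (illustrative; tanh is chosen here as the activation φ):

```python
import numpy as np

def rnn_forward(X, y0, W, U, b, phi=np.tanh):
    """Simple recurrent layer: z_i = y_{i-1} U + x_i W + b, y_i = phi(z_i).

    X: (n, d_in) sequence of inputs, y0: (1, d_out) initial state.
    Returns the list of states y_1, ..., y_n.
    """
    y = y0
    states = []
    for i in range(X.shape[0]):
        z = y @ U + X[i:i+1] @ W + b
        y = phi(z)
        states.append(y)
    return states
```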

Backpropagation through time
$dW = 0$, $dU = 0$, $db = 0$
// Forward pass
For $i = 1 : n$
  $z_i = y_{i-1} U + x_i W + b$
  $y_i = \varphi(z_i)$
// Backpropagation through time
$D_{n+1} = 0$
For $i = n : 1$
  $D_i = \Big(D_{i+1} \cdot U^T + \frac{\partial L(t_i, y_i)}{\partial y_i}\Big) \odot \frac{\partial \varphi(z_i)}{\partial z_i}$
  $db = db + D_i$
  $dW = dW + x_i^T \cdot D_i$
  $dU = dU + y_{i-1}^T \cdot D_i$
$dy_0 = D_1 \cdot U^T$
// Parameter update
$U^{new} = U - \eta\big(dU + \lambda_1\, \mathrm{sign}(U) + 2 \lambda_2 U\big)$
$W^{new} = W - \eta\big(dW + \lambda_1\, \mathrm{sign}(W) + 2 \lambda_2 W\big)$
$b^{new} = b - \eta\big(db + \lambda_1\, \mathrm{sign}(b) + 2 \lambda_2 b\big)$
$y_0^{new} = y_0 - \eta\big(dy_0 + \lambda_1\, \mathrm{sign}(y_0) + 2 \lambda_2 y_0\big)$
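A numpy sketch of this procedure with tanh as φ and a per-step squared-error loss (these choices and the function name are illustrative assumptions, not prescribed by the slide):

```python
import numpy as np

def bptt_step(X, T, y0, W, U, b, eta=0.01, lam1=0.0, lam2=0.0):
    """One BPTT update for z_i = y_{i-1} U + x_i W + b, y_i = tanh(z_i).

    X: (n, d_in), T: (n, d_out). Per-step loss: 0.5 * ||y_i - t_i||^2.
    """
    n = X.shape[0]
    ys = [y0]                                   # y_0, ..., y_n
    for i in range(n):                          # forward pass
        z = ys[-1] @ U + X[i:i+1] @ W + b
        ys.append(np.tanh(z))
    dW = np.zeros_like(W); dU = np.zeros_like(U); db = np.zeros_like(b)
    D_next = np.zeros_like(y0)                  # D_{n+1} = 0
    for i in range(n, 0, -1):                   # backpropagation through time
        dL_dy = ys[i] - T[i-1:i]                # dL/dy_i for squared error
        D = (D_next @ U.T + dL_dy) * (1.0 - ys[i] ** 2)
        db += D
        dW += X[i-1:i].T @ D
        dU += ys[i-1].T @ D
        D_next = D
    dy0 = D_next @ U.T                          # D_1 · U^T
    U  = U  - eta * (dU  + lam1 * np.sign(U)  + 2 * lam2 * U)
    W  = W  - eta * (dW  + lam1 * np.sign(W)  + 2 * lam2 * W)
    b  = b  - eta * (db  + lam1 * np.sign(b)  + 2 * lam2 * b)
    y0 = y0 - eta * (dy0 + lam1 * np.sign(y0) + 2 * lam2 * y0)
    return y0, W, U, b
```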

Convolutional neural networks

Speech synthesis with neural networks
Input: full context labels.
"Author of the ..." → pau ao th er ah v dh ax …
pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$.....
pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....
pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....
ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$.....
th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....
er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....
ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....
v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....
Each full context label is encoded into a fixed-length vector with values in the interval [0, 1].
Output: speech parameters: 40 mcep, lf0, vuv, bap, phi (+ delta + delta-delta). The parameters have been normalized to have mean 0 and standard deviation 1.

Speech synthesis with artificial neural networks

Speech synthesis with artificial neural networks
Architecture: the activation function of the last layer is linear; the cost function is the mean squared error.
Samples: enUS_flh_neutral_a018, enUS_flh_neutral_g256, enUS_flh_neutral_i033, enUS_flh_neutral_j257, enUS_flh_neutral_q178
Reference: Hashimoto et al. 2016, "Trajectory training considering global variance for speech synthesis based on neural networks".

Speech recognition with neural networks
Input: speech parameters, e.g., 24 mcep (+ delta + delta-delta). The parameters have been normalized to have mean 0 and standard deviation 1.
Output: phonemes, encoded as fixed-length one-hot vectors with elements 0 or 1.
"Author of the ..." → pau ao th er ah v dh ax …
Architecture: the activation function of the last layer is softmax; the cost function is the cross-entropy or CTC (Connectionist Temporal Classification).
[Figure: deep neural network architecture of Baidu's speech recognition system.]

2006 Landmark year
Before 2006 we could not train deep neural networks: DNNs were trained with random initialization and performed worse than shallow networks (up to 3 layers).
After 2006: algorithms for the efficient training of DNNs appeared.
Stacked Restricted Boltzmann Machines (RBM) or Deep Belief Networks (DBN): Hinton, Osindero & Teh, "A Fast Learning Algorithm for Deep Belief Nets", Neural Computation, 2006
Stacked Autoencoders (AE): Bengio, Lamblin, Popovici, Larochelle, "Greedy Layer-Wise Training of Deep Networks", NIPS, 2006
Sparse Representations or Sparse Autoencoders: Ranzato, Poultney, Chopra, LeCun, "Efficient Learning of Sparse Representations with an Energy-Based Model", NIPS, 2006
In addition: we have more data, and faster computers (GPUs, multi-core CPUs).

Computational power