Neural Network Introduction
Hung-yi Lee
Review: Supervised Learning
Training: from a hypothesis function set (the model), pick the "best" function f* using the training data, e.g. images x paired with their labels y (the label "2" for a picture of a 2).
Testing: apply the chosen f* to new inputs.
x: function input; y: function output.
Neural Network: Realize it
- What does the function hypothesis set (model) look like?
- What is the "best" function?
- How do we pick the "best" function?
(Up next: what the model looks like.)
Neural Network: Fully Connected Feedforward Network
Input (vector x) → Layer 1 → Layer 2 → ... → Layer L → Output (vector y)
The input nodes form the input layer, Layer L is the output layer, and the layers in between are the hidden layers. You can always connect the neurons in your own way.
Notation
Layer $l-1$ has $N_{l-1}$ nodes and layer $l$ has $N_l$ nodes.
$a_i^l$: output of neuron $i$ at layer $l$.
$a^l = [a_1^l, a_2^l, \dots, a_{N_l}^l]^T$: the output of one layer, a vector.
Notation
$w_{ij}^l$: the weight from neuron $j$ (layer $l-1$) to neuron $i$ (layer $l$).
$W^l$: the weight matrix from layer $l-1$ to layer $l$; its entry in row $i$, column $j$ is $w_{ij}^l$.
Notation
$b_i^l$: bias for neuron $i$ at layer $l$.
$b^l = [b_1^l, b_2^l, \dots, b_{N_l}^l]^T$: bias for all neurons in layer $l$.
Notation
$z_i^l$: input of the activation function for neuron $i$ at layer $l$.
$z^l = [z_1^l, z_2^l, \dots, z_{N_l}^l]^T$: input of the activation function for all the neurons in layer $l$.
Notation - Summary
$a_i^l$: output of a neuron; $a^l$: output of a layer.
$z_i^l$: input of activation function; $z^l$: input of activation function for a layer.
$w_{ij}^l$: a weight; $W^l$: a weight matrix.
$b_i^l$: a bias; $b^l$: a bias vector.
Relations between Layer Outputs
For each neuron: $z_i^l = \sum_j w_{ij}^l a_j^{l-1} + b_i^l$ and $a_i^l = \sigma(z_i^l)$, where $\sigma$ is the activation function.
In matrix form, for the whole layer:
$z^l = W^l a^{l-1} + b^l$
$a^l = \sigma(z^l)$ (with $\sigma$ applied elementwise)
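To make the matrix form concrete, here is a minimal Python/NumPy sketch of one layer's computation, assuming a sigmoid activation for σ; the layer sizes and values are made up for illustration.

    import numpy as np

    def sigmoid(z):
        """Logistic activation sigma, applied elementwise."""
        return 1.0 / (1.0 + np.exp(-z))

    # Toy sizes: layer l-1 has 3 neurons, layer l has 2 neurons.
    a_prev = np.array([0.5, -1.0, 2.0])   # a^{l-1}: output of layer l-1
    W = np.random.randn(2, 3)             # W^l: row i holds the weights w_{ij}^l into neuron i
    b = np.random.randn(2)                # b^l: bias vector for layer l

    z = W @ a_prev + b                    # z^l = W^l a^{l-1} + b^l
    a = sigmoid(z)                        # a^l = sigma(z^l)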
Function of Neural Network
The whole network is one function $y = f(x)$ from vector x to vector y, obtained by composing the layers:
$y = f(x) = \sigma(W^L \cdots \sigma(W^2\,\sigma(W^1 x + b^1) + b^2) \cdots + b^L)$
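A minimal sketch of the composed function, reusing sigmoid from the previous snippet and assuming weights and biases are the lists [W^1, ..., W^L] and [b^1, ..., b^L]:

    def forward(x, weights, biases):
        """y = f(x): apply a^l = sigmoid(W^l a^{l-1} + b^l) for l = 1, ..., L."""
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        return a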
Neural Network: Realize it
- What does the function hypothesis set (model) look like?
- What is the "best" function?
- How do we pick the "best" function?
(Up next: what the "best" function is.)
Format of Training Data
The input and output of the neural network model are vectors, so object x and label y should also be represented as vectors.
Example: handwritten digit recognition. Each pixel of a 28 x 28 image corresponds to one element of x (1 for ink, 0 otherwise), so x has 28 x 28 = 784 dimensions. The label y has 10 dimensions, one per digit; for the digit "2", the element for "2" is 1 and all the others are 0.
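A short sketch of this encoding; the variable names are illustrative, and the all-zeros image is a stand-in for a real scanned digit.

    import numpy as np

    # x: flatten a 28 x 28 binary image (1 for ink, 0 otherwise) into 784 dimensions.
    image = np.zeros((28, 28))        # stand-in for a real scanned digit
    x = image.reshape(784)

    # y-hat: represent the label "2" as a 10-dimensional vector with a single 1.
    y_hat = np.zeros(10)
    y_hat[2] = 1.0                    # the element for digit 2 is 1, all others are 0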
What is the "Best" Function?
Given training data $\{(x^r, \hat{y}^r)\}$, the "best" function f* is the one whose output $f(x^r)$ is closest to the target $\hat{y}^r$ for all training examples $x^r$.
Define C(f) to evaluate the badness of a function f, e.g. the total distance between $f(x^r)$ and $\hat{y}^r$ over the training set. C(f) is a "function of a function" (error function, cost function, objective function, ...). The best function f* is the one that minimizes C.
What is the "Best" Function?
The best function f* is the one that minimizes C(f). Do you like this definition of "best"?
Question: is the distance a good measure of closeness?
Reference: Golik, Pavel, Patrick Doetsch, and Hermann Ney. "Cross-entropy vs. squared error training: a theoretical and experimental comparison." INTERSPEECH, 2013.
What is the "Best" Function?
The network's weights and biases form a parameter set θ, and each θ picks out one function from the hypothesis function set, so the error function can be written as C(θ) (again a "function of a function").
Picking the "best" function f* is therefore the same as picking the "best" parameter set θ*: the one that minimizes C(θ). How do we find it?
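As a sketch, taking squared distance as the measure of closeness (one choice among those discussed above) and reusing forward from the earlier snippet; representing θ as a pair of weight and bias lists is an assumption for illustration.

    def cost(theta, xs, y_hats):
        """C(theta): total squared distance between f(x^r) and the target y-hat^r."""
        weights, biases = theta
        total = 0.0
        for x, y_hat in zip(xs, y_hats):
            y = forward(x, weights, biases)       # network output for example x^r
            total += np.sum((y - y_hat) ** 2)     # squared distance to the label vector
        return total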
Neural Network: Realize it
- What does the function hypothesis set (model) look like?
- What is the "best" function?
- How do we pick the "best" function?
(Up next: how to pick the "best" function.)
Possible Solution
Statement of the problem: there is a function C(θ), where θ is a set of parameters $\theta = \{\theta_1, \theta_2, \theta_3, \dots\}$. Find the θ* that minimizes C(θ).
Brute force? Enumerate all possible θ (infeasible).
Calculus? Find θ* such that $\nabla C(\theta^*) = 0$ (no closed-form solution here).
Gradient descent
Start from some initial parameters, then update them step by step so that C keeps decreasing: θ^0 → θ^1 → θ^2 → ...
Hopefully, with sufficient iterations, we can finally find θ* such that C(θ*) is minimized.
Gradient descent – one variable
For simplicity, first consider that θ has only one variable.
Randomly start at a point θ^0. Compute C(θ^0 - ε) and C(θ^0 + ε). If C(θ^0 + ε) < C(θ^0 - ε), move right: θ^1 = θ^0 + ε; otherwise move left: θ^1 = θ^0 - ε. Repeat.
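The probing rule is easy to state in code; this sketch fixes the step size ε and an iteration budget, both arbitrary choices.

    def descend_1d(C, theta, eps=1e-3, steps=5000):
        """Probe C on both sides of theta and step eps toward the smaller value."""
        for _ in range(steps):
            if C(theta + eps) < C(theta - eps):
                theta = theta + eps
            else:
                theta = theta - eps
        return theta

    # Example: minimizing C(theta) = (theta - 3)^2 from theta = 0 approaches theta* = 3.
    theta_star = descend_1d(lambda t: (t - 3.0) ** 2, 0.0)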
Gradient descent – two variables
Now suppose θ has two variables {θ_1, θ_2}. Given the current point in parameter space, consider a small circle (the red circle) around it: how do we find the point on the red circle with the smallest value of C(θ)?
Taylor series
Let h(x) be infinitely differentiable around x = x_0. Then
$h(x) = \sum_{k=0}^{\infty} \frac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \frac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$
Taylor series
Taylor series for h(x) = sin(x) around x_0 = π/4:
$\sin(x) = \sin\frac{\pi}{4} + \cos\frac{\pi}{4}\left(x - \frac{\pi}{4}\right) - \frac{\sin(\pi/4)}{2!}\left(x - \frac{\pi}{4}\right)^2 - \frac{\cos(\pi/4)}{3!}\left(x - \frac{\pi}{4}\right)^3 + \cdots$
The approximation is good around π/4.
Taylor series
One variable: when x is close to x_0,
$h(x) \approx h(x_0) + h'(x_0)(x - x_0)$
Multivariable: when x and y are close to x_0 and y_0,
$h(x, y) \approx h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$
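A quick numerical check of the one-variable approximation for sin around π/4 (the offsets are chosen arbitrarily):

    import numpy as np

    x0 = np.pi / 4
    for dx in [0.01, 0.1, 1.0]:
        x = x0 + dx
        approx = np.sin(x0) + np.cos(x0) * (x - x0)   # h(x0) + h'(x0)(x - x0)
        print(dx, np.sin(x), approx)   # nearly equal for small dx, drifting apart at dx = 1.0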
Gradient descent – two variables
Let (a, b) be the center of the red circle. If the radius is small, the first-order Taylor approximation holds on the circle:
$C(\theta) \approx C'(\theta) = C(a, b) + \frac{\partial C(a,b)}{\partial \theta_1}(\theta_1 - a) + \frac{\partial C(a,b)}{\partial \theta_2}(\theta_2 - b)$
Find θ_1 and θ_2 on the red circle that minimize C'(θ). Simple, right? C(a, b) is a constant, so we only need to minimize the inner product of the offset vector $(\theta_1 - a, \theta_2 - b)$ with the gradient $\left(\frac{\partial C(a,b)}{\partial \theta_1}, \frac{\partial C(a,b)}{\partial \theta_2}\right)$. To minimize C'(θ), point the offset in exactly the opposite direction of the gradient.
Gradient descent – two variables
Scaling the opposite-of-the-gradient direction by a factor η gives the update
$\theta_1 \leftarrow a - \eta\,\frac{\partial C(a,b)}{\partial \theta_1}, \qquad \theta_2 \leftarrow b - \eta\,\frac{\partial C(a,b)}{\partial \theta_2}$
The result is intuitive, isn't it? Step against the gradient.
Gradient descent – high dimension
The same argument works in the full space of the parameter set θ = {θ_1, θ_2, θ_3, ...}: replace the red circle with a small ball around the current point. The point with minimum C(θ) on the ball is found by stepping against the gradient, at $\theta - \eta \nabla C(\theta)$.
Gradient descent
Starting from initial parameters θ^0, repeatedly update
$\theta^{t+1} = \theta^t - \eta \nabla C(\theta^t)$
η is called the "learning rate". It should be small enough that the Taylor approximation stays valid, but not too small, or progress is slow.
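The update rule as a sketch; grad_C is assumed to be a function returning ∇C(θ) as an array.

    import numpy as np

    def gradient_descent(grad_C, theta0, eta=0.01, steps=1000):
        """Repeat theta^{t+1} = theta^t - eta * grad C(theta^t)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(steps):
            theta = theta - eta * grad_C(theta)
        return theta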
Gradient descent - Problem
Different initializations lead to different local minima.
Who is Afraid of Non-Convex Loss Functions? http://videolectures.net/eml07_lecun_wia/
Gradient descent - Problem
Toy example: different initializations lead to different local minima.
[Figure: a one-dimensional toy example plotting C against a parameter; starting points on opposite sides of a hump descend into different minima.]
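A made-up one-variable instance of the problem, reusing gradient_descent from the sketch above: C(θ) = θ^4 - 3θ^2 + θ has two local minima, and the two starting points settle into different ones.

    grad = lambda theta: 4 * theta**3 - 6 * theta + 1   # C'(theta) for C = theta^4 - 3 theta^2 + theta

    left  = gradient_descent(grad, -2.0)   # converges near theta = -1.30 (the global minimum)
    right = gradient_descent(grad,  2.0)   # converges near theta =  1.13 (a worse local minimum)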
Neural Network: Realize it
- What does the function hypothesis set (model) look like?
- What is the "best" function?
- How do we pick the "best" function?
(Up next: gradient descent for neural networks.)
Gradient descent for Neural Network
To apply gradient descent to a neural network, we need the partial derivative of C with respect to every parameter: each weight $w_{ij}^l$ and each bias $b_i^l$.
Chain Rule
Case 1: if $y = g(x)$ and $z = h(y)$, then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$.
Case 2: if $x = g(s)$, $y = h(s)$, and $z = k(x, y)$, then $\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.
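A worked instance of each case (the functions are chosen only for illustration, not taken from the slides), written as LaTeX:

    % Case 1: z = (\sin x)^2, i.e. y = \sin x and z = y^2.
    \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} = 2y \cdot \cos x = 2 \sin x \cos x

    % Case 2: x = s^2, y = 3s, z = xy, so z = 3s^3 and dz/ds should be 9s^2.
    \frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds}
                  + \frac{\partial z}{\partial y}\frac{dy}{ds}
                  = y \cdot 2s + x \cdot 3 = 6s^2 + 3s^2 = 9s^2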
Gradient descent for Neural Network
Example: a weight $w_{ij}^L$ into the output layer (from neuron j in layer L-1 to neuron i in layer L). By the chain rule,
$\frac{\partial C}{\partial w_{ij}^L} = \frac{\partial z_i^L}{\partial w_{ij}^L}\,\frac{\partial C}{\partial z_i^L}$
In $z_i^L = \sum_j w_{ij}^L a_j^{L-1} + b_i^L$, every term except $w_{ij}^L a_j^{L-1}$ is constant with respect to $w_{ij}^L$, so
$\frac{\partial z_i^L}{\partial w_{ij}^L} = a_j^{L-1}$
and, by the chain rule again, $\frac{\partial C}{\partial z_i^L} = \sigma'(z_i^L)\,\frac{\partial C}{\partial y_i}$, since the network output is $y_i = a_i^L = \sigma(z_i^L)$.
For the bias, $\frac{\partial z_i^L}{\partial b_i^L} = 1$ (the bias acts like a weight whose input is always "1"), so $\frac{\partial C}{\partial b_i^L} = \frac{\partial C}{\partial z_i^L}$.
Sum over layer L
For a weight $w_{ij}^{L-1}$ between layers L-2 and L-1, the same chain rule gives $\frac{\partial C}{\partial w_{ij}^{L-1}} = a_j^{L-2}\,\frac{\partial C}{\partial z_i^{L-1}}$. But $z_i^{L-1}$ now influences C through every neuron in layer L, so the chain rule sums over all of them:
$\frac{\partial C}{\partial z_i^{L-1}} = \sigma'(z_i^{L-1}) \sum_k w_{ki}^L\,\frac{\partial C}{\partial z_k^L}$
Sum over layer L-1, then layer L
One layer deeper, for parameters between layers L-3 and L-2, the recursion nests: the gradient sums over layer L-1, and each of those terms sums over layer L:
$\frac{\partial C}{\partial z_i^{L-2}} = \sigma'(z_i^{L-2}) \sum_j w_{ji}^{L-1}\,\sigma'(z_j^{L-1}) \sum_k w_{kj}^L\,\frac{\partial C}{\partial z_k^L}$
Summarizing what we have done
For parameters between layers L-1 and L: a direct application of the chain rule.
For parameters between layers L-2 and L-1: a sum over layer L.
For parameters between layers L-3 and L-2: nested sums over layers L-1 and L.
Computed naively, the work grows with every layer we move toward the input. There is an efficient way to compute the gradient: backpropagation, which computes each layer's $\partial C/\partial z$ once and reuses it for the layer below.
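A compact sketch of backpropagation under the conventions above, assuming sigmoid activations and the squared-error cost, and reusing sigmoid from the earlier snippet; the function and variable names are illustrative.

    import numpy as np

    def backprop(x, y_hat, weights, biases):
        """Gradients of C = ||y - y_hat||^2 for one example, by backpropagation."""
        sigma_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

        # Forward pass, storing every z^l and a^l.
        a, activations, zs = x, [x], []
        for W, b in zip(weights, biases):
            z = W @ a + b
            a = sigmoid(z)
            zs.append(z)
            activations.append(a)

        # Output layer: dC/dz^L = 2 (a^L - y_hat) * sigma'(z^L).
        delta = 2.0 * (activations[-1] - y_hat) * sigma_prime(zs[-1])

        # Walk backward, reusing each layer's dC/dz for the layer below.
        grads_W, grads_b = [], []
        for l in reversed(range(len(weights))):
            grads_W.insert(0, np.outer(delta, activations[l]))   # dC/dW^l = delta^l (a^{l-1})^T
            grads_b.insert(0, delta)                             # dC/db^l = delta^l
            if l > 0:
                delta = (weights[l].T @ delta) * sigma_prime(zs[l - 1])
        return grads_W, grads_b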
Reference for Neural Network
Chapter 2 of Neural Networks and Deep Learning: http://neuralnetworksanddeeplearning.com/chap2.html
LeCun, Yann A., et al. "Efficient BackProp." http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." http://www.iro.umontreal.ca/~bengioy/papers/YB-tricks.pdf
Thank you for listening!
Appendix
Layer-by-layer
[Figure: results comparing networks built layer by layer, with hidden layer sizes 20, 20-20, 20-20-20, and 20-20-20-20.]
What is the "Best" Function?
Different θ define different functions f, and different functions give different costs, so the objective function C can be written as a function of θ: C(θ).
Picking the best function f* from the hypothesis function set is the same as picking the best parameter set θ*, the one that minimizes C(θ). How do we find θ*?