Gradient-based Learning Applied to Document Recognition


1 Gradient-based Learning Applied to Document Recognition
Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, 1998.
Presented by Ofir Liba and Michael Kotlyar, Deep Learning Seminar 2016/17.

2 Outline
Introduction
Convolutional neural network - LeNet-5: net structure, loss function, back propagation
Conclusions & discussion

3 Gradient-based Learning Applied to Document Recognition
Introduction

4 Introduction Today's talk is about the basic ideas of a single, inspiring, industry-proven paper from the nineties… LeCun et al., "Gradient-Based Learning Applied to Document Recognition", 1998.

5 Introduction
Paper key message: better pattern recognition systems can be built by relying more on automatic learning and less on hand-designed heuristics.
Feature extraction (the hand-crafted approach): represent the input with low-dimensional vectors of hand-crafted features; the success of the algorithm then depends on the chosen class of features.

6 Introduction
The paper was published in 1998. At that time:
SVMs (and kernel learning) were quite popular.
Hand-crafted features (e.g. SIFT) were dominant.
MNIST (58k images) was a big and challenging dataset.
The paper became popular only in 2012, after the AlexNet CNN won the ImageNet challenge. Today, CNNs are everywhere.
At the time of this talk the paper had been cited about 6,008 times.

7 Gradient-based Learning Applied to Document Recognition
Convolutional neural network - LeNet-5

8 CNN
Problem description: isolated character recognition, with no hand-crafted feature extraction and no local model training.
Proposed solution: a convolutional neural network trained with gradient back-propagation.
Outcome: the LeNet-5 network achieves the lowest error rate at that time (0.7%).

9 CNN Convolutional networks combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial sub-sampling.

10 LeNet-5 – General
Input: 32x32 pixel image. The largest character is 20x20, so all important information lies within the center of the receptive fields of the highest-level feature detectors.
Cx: convolutional layer, Sx: subsampling layer, Fx: fully connected layer.
Black and white pixel values are normalized, e.g. white = -0.1 and black = 1.175, so that the mean of the pixels is roughly 0 and the standard deviation roughly 1. This accelerates learning.

11 LeNet-5 – Layer C1 – Feature Map
C1: convolutional layer with 6 feature maps of size 28x28 (C1_k, k = 1..6).
Each neuron/unit of C1 has a 5x5 receptive field in the input layer: sparse connections, shared weights.
Parameters: (5*5+1)*6 = 156
Connections: 28*28*(5*5+1)*6 = 122,304
If the layer were fully connected we would need (32*32+1)*(28*28)*6 parameters.
Sparse connections let us extract features with far fewer parameters; shared weights extract the same feature at every location in the input.
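As a quick sanity check of these counts, here is a minimal sketch in plain Python (the numbers follow directly from the formulas on this slide):

```python
# C1 parameter/connection counts (sketch).
k = 5 * 5 + 1                              # 5x5 kernel weights + 1 bias
maps = 6

params_c1 = k * maps                       # one shared kernel per feature map
connections_c1 = 28 * 28 * k * maps        # every output unit reuses that kernel
params_if_fully_connected = (32 * 32 + 1) * (28 * 28) * maps

print(params_c1)                  # 156
print(connections_c1)             # 122304
print(params_if_fully_connected)  # 4821600 -- why weight sharing matters
```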

12 LeNet-5 – Layer S2 – Subsampling
S2: subsampling layer with 6 feature maps of size 14x14.
Each unit has a 2x2 non-overlapping receptive field in C1.
The four inputs to a unit in S2 are added, then multiplied by a trainable coefficient, and added to a trainable bias. The result is passed through a sigmoidal function.
Once a feature has been detected, its exact location becomes less important; only its approximate position relative to other features is relevant. For example, once we know that the input image contains the endpoint of a roughly horizontal segment in the upper left area, a corner in the upper right area, and the endpoint of a roughly vertical segment in the lower portion of the image, we can tell the input image is a 7. Subsampling achieves this by reducing the resolution.
Parameters: 6*2 = 12
Connections: 14*14*(2*2+1)*6 = 5880
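A minimal numpy sketch of one such subsampling map (the coefficient and bias values below are arbitrary placeholders, not the trained ones):

```python
import numpy as np

def lenet_subsample(fmap, coeff, bias):
    """Sum each non-overlapping 2x2 block, scale by a trainable coefficient,
    add a trainable bias, and pass the result through a sigmoid."""
    h, w = fmap.shape
    pooled = fmap.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(coeff * pooled + bias)))

c1_map = np.random.randn(28, 28)                       # one 28x28 C1 feature map
s2_map = lenet_subsample(c1_map, coeff=0.5, bias=0.1)
print(s2_map.shape)                                    # (14, 14)
```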

13 LeNet-5 – Layer C3
C3: convolutional layer with 16 feature maps of size 10x10.
Each neuron/unit in C3 is connected to several 5x5 receptive fields at identical locations in a subset of the S2 maps.
This keeps the number of parameters and connections within reasonable bounds, and forces a break of symmetry in the network: different feature maps are forced to extract different (hopefully complementary) features because they get different sets of inputs.
Parameters: 1516 = (5x5x3+1)x6 + (5x5x4+1)x9 + (5x5x6+1)x1
Connections: 10x10x1516 = 151,600
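A short sanity check of the parameter breakdown above (a sketch; the 6/9/1 grouping is the connection scheme quoted on this slide):

```python
# C3 counts: 6 maps read 3 S2 maps each, 9 read 4 each, 1 reads all 6.
groups = [(6, 3), (9, 4), (1, 6)]      # (number of C3 maps, S2 maps each reads)
params_c3 = sum(n * (5 * 5 * s + 1) for n, s in groups)
connections_c3 = 10 * 10 * params_c3   # each 10x10 output position reuses them
print(params_c3, connections_c3)       # 1516 151600
```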

14 LeNet-5 – Layer S4
S4: subsampling layer with 16 feature maps of size 5x5.
Each neuron/unit in S4 is connected to a corresponding 2x2 receptive field in C3.
Parameters: 16x2 = 32
Connections: 5x5x(2x2+1)x16 = 2000

15 LeNet-5 – Layer C5
C5: convolutional layer with 120 feature maps of size 1x1.
Each neuron/unit in C5 is connected to all 16 of the 5x5 receptive fields in S4.
Parameters: (5x5x16+1) x 120 = 48,120
Connections: (5x5x16+1) x 120 = 48,120
Note: this layer is effectively fully connected; it is labeled as a convolutional layer because if the input were bigger, the feature maps in C5 would be bigger than 1x1.

16 LeNet-5 – Layer F6
F6: 84 fully connected units.
Parameters: 84x(120+1) = 10,164
Connections: 84x(120+1) = 10,164
84 = 7x12, the size of the stylized character images used by the output layer.
Activation: $x_i = f(a_i)$ with $f(a) = A \tanh(S a)$, a sigmoid-like scaled hyperbolic tangent with $A = 1.7159$ and $S = 2/3$.
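A tiny sketch of this scaled tanh, using the values of A and S quoted above:

```python
import numpy as np

A, S = 1.7159, 2.0 / 3.0

def f6_activation(a):
    """Scaled hyperbolic tangent f(a) = A * tanh(S * a)."""
    return A * np.tanh(S * a)

print(f6_activation(np.array([-3.0, 0.0, 3.0])))   # saturates near +/- A
```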

17 LeNet-5 – Output layer
Output layer: 10 RBF units, one for each digit class.
RBF = Radial Basis Function: $y_i = \sum_j (x_j - w_{ij})^2$
The parameters were chosen by hand and fixed to -1 or +1.
They were designed to represent a stylized image of the corresponding character class drawn on a 7x12 bitmap (hence the number 84).
The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF.
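A minimal sketch of this output layer (the +/-1 templates below are random placeholders, not the actual stylized bitmaps):

```python
import numpy as np

def rbf_output(x, W):
    """Penalty per class: y_i = sum_j (x_j - w_ij)^2."""
    return np.sum((x[None, :] - W) ** 2, axis=1)   # shape (10,)

W = np.sign(np.random.randn(10, 84))   # placeholder +/-1 templates, one per digit
x = np.tanh(np.random.randn(84))       # stand-in for the 84 F6 activations
penalties = rbf_output(x, W)
print(penalties.argmin())              # predicted class = smallest penalty
```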

18 Gradient-based Learning Applied to Document Recognition
Loss function

19 Loss Function
A loss function is a "tool" to measure our unhappiness with an outcome.
For a linear classifier, the class with the highest score is the outcome.
Poor job = high loss; great job = low loss.

20 SVM – Hinge Loss Function
The SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$.
$s = f(x_i, W)$ is the vector of class scores computed by the score function; $s_j = f(x_i, W)_j$ is the score of the j-th class.
SVM loss for the i-th example $x_i$ with correct label $y_i$:
$L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + \Delta)$
Total loss: the average of $L_i$ over all training examples.

21 SVM Hinge Loss – Example
Only 3 classes: [cat, dog, fish]. Our score vector is s = [13, ?, 11]; the correct label is $y_i = 0$ (cat), so $s_{y_i} = 13$, and the margin is $\Delta = 10$.
The dog score contributes 0 loss because its difference from the correct score is above the margin.
The fish score is close to the correct class score, so it contributes $\max(0, 11 - 13 + 10) = 8$, and the total loss is 8.
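Here is a short sketch that reproduces this worked example; the dog score is not given above, so -7 (the value used in the CS231n notes this example follows) is assumed:

```python
import numpy as np

def svm_hinge_loss(scores, correct, delta=10.0):
    """Multiclass hinge loss: sum over incorrect classes of
    max(0, s_j - s_correct + delta)."""
    margins = np.maximum(0.0, scores - scores[correct] + delta)
    margins[correct] = 0.0               # the correct class is not counted
    return margins.sum()

s = np.array([13.0, -7.0, 11.0])         # [cat, dog, fish]; -7 is assumed
print(svm_hinge_loss(s, correct=0))      # 8.0 -- only the fish term is nonzero
```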

22 Softmax Cross-Entropy Loss
The SVM output is uncalibrated and can be difficult to interpret; the softmax classifier gives an intuitive output: probabilities for the classes.
The softmax function takes a vector z of scores and squashes it to a vector of values between zero and one that sum to one:
$p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$
Cross-entropy loss, with $f_j$ meaning the j-th element of the vector of class scores f ($s = f(x_i, W)$ stays unchanged):
$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$
Cross-entropy "stretches" the probabilities, pushing probability mass toward the correct class.
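A numerically stable sketch of softmax and the cross-entropy loss, applied to the same score vector as the hinge example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max score for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(scores, correct):
    return -np.log(softmax(scores)[correct])

s = np.array([13.0, -7.0, 11.0])
print(softmax(s))                # class probabilities, sum to 1
print(cross_entropy_loss(s, 0))  # loss when the correct class is 0 (cat)
```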

23 Softmax vs SVM
Both start from the same score vector f; the difference is in the interpretation of the scores in f.
SVM: the correct class should have the highest score by a margin.
Softmax: the correct class should have the highest normalized log probability.
In practice, SVM and softmax are usually comparable.

24 Gradient-based Learning Applied to Document Recognition
Back Propagation

25 Backpropagation
Backpropagation is about understanding how changing the weights and biases in a network changes the loss function: a way of computing gradients of expressions through recursive application of the chain rule.
Problem statement: given some function f(x) and a vector of inputs x, we are interested in computing the gradient of f at x (i.e. ∇f(x)).

26 Backpropagation – Motivation
Here f corresponds to the loss function L.
Training data (fixed): $(x_i, y_i),\ i = 1 \ldots N$.
Weights and biases (controllable): W, b.
What we need is the gradient of L with respect to the weights: $\nabla_W L$.

27 Backpropagation – toy example

28 Backpropagation – Example

29 Backpropagation

30 Backpropagation

31 Backpropagation

32 Backpropagation

33 Backpropagation

34 Backpropagation

35 Backpropagation

36 Backpropagation

37 Backpropagation

38 Backpropagation
The derivative on each variable tells you the sensitivity of the whole expression to its value.

39 Backprop – LeNet
We want to find parameters W that minimize an error $E(f(x_0, W), y_0)$, where $y_0$ is the correct class.
We use gradient descent to update the weights: $W_t = W_{t-1} - \alpha \cdot \frac{\partial E}{\partial W}$
How do we compute the gradient of E with respect to the weights? The loss E is a cascade of functions, so we use the chain rule, starting from the last layer:
(*) $\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l} \cdot \frac{\partial y_l(w, y_{l-1})}{\partial y_{l-1}}$
(**) $\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l} \cdot \frac{\partial y_l(w, y_{l-1})}{\partial w_l}$
Notation: E is the loss; $y_l = f(y_{l-1}, w_l)$ is the output of layer l and $y_{l-1}$ is its input; $w_l$ are the layer's weights.
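As a minimal illustration of the weight-update rule, here is a sketch of gradient descent on a toy least-squares problem (not the LeNet training code):

```python
import numpy as np

# Gradient descent on a toy least-squares loss E(W) = ||X W - y||^2 / N,
# just to show the update W_t = W_{t-1} - alpha * dE/dW in action.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

W = np.zeros(3)
alpha = 0.1
for t in range(200):
    grad = 2.0 / len(y) * X.T @ (X @ W - y)   # dE/dW
    W = W - alpha * grad                      # the update rule above
print(W)                                      # close to [1.0, -2.0, 0.5]
```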

40 Backprop – LeNet (Caffe)
LeNet topology as implemented in Caffe.
We start from the gradient at the last layer, $\frac{\partial E}{\partial y_l}$.
We propagate the gradient back: $\frac{\partial E}{\partial y_l} \rightarrow \frac{\partial E}{\partial y_{l-1}}$.
At each layer we also compute the gradient of E with respect to that layer's weights, $\frac{\partial E}{\partial w_l}$.

41 Backprop – Softmax with LogLoss Layer
The last layer is softmax + log loss:
$E = -\log\frac{e^{y_{k_0}}}{\sum_k e^{y_k}} = -y_{k_0} + \log\left(\sum_{k=0}^{9} e^{y_k}\right)$
For all k = 0..9 except $k_0$ (the right class) we want to decrease $p_k$:
$\frac{\partial E}{\partial y_k} = \frac{e^{y_k}}{\sum_j e^{y_j}} = p_k$
For $k = k_0$ we want to increase $p_{k_0}$:
$\frac{\partial E}{\partial y_{k_0}} = -1 + \frac{e^{y_{k_0}}}{\sum_j e^{y_j}} = p_{k_0} - 1$
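A short sketch of this layer's forward and backward pass:

```python
import numpy as np

def softmax_logloss(scores, k0):
    """Returns the log loss and its gradient w.r.t. the scores:
    dE/dy_k = p_k for k != k0, and p_k0 - 1 for the correct class k0."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    loss = -np.log(p[k0])
    grad = p.copy()
    grad[k0] -= 1.0
    return loss, grad

scores = np.random.randn(10)          # stand-in for the 10 class scores
loss, grad = softmax_logloss(scores, k0=3)
print(loss, grad.sum())               # the gradient components sum to 0
```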

42 Backprop – Inner Product Layer
Forward: a fully connected (inner product) layer is just a matrix-vector multiplication, $y_l = W_l \, y_{l-1}$.
Backward, according to the chain rule (treating gradients as column vectors):
(*) $\frac{\partial E}{\partial y_{l-1}} = \left(\frac{\partial y_l}{\partial y_{l-1}}\right)^{T} \frac{\partial E}{\partial y_l} = W_l^{T}\, \frac{\partial E}{\partial y_l}$
(**) $\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l}\, y_{l-1}^{T}$
$\frac{\partial E}{\partial y_l}$ is already known from the layer above (ultimately from the output layer).
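A minimal sketch of the forward and backward pass of such a layer (column-vector convention, bias omitted for brevity; the 84/120 sizes are just the F6 example):

```python
import numpy as np

def fc_forward(W, x):
    return W @ x                       # y_l = W_l y_{l-1}

def fc_backward(W, x, dE_dy):
    dE_dx = W.T @ dE_dy                # (*)  gradient w.r.t. the layer input
    dE_dW = np.outer(dE_dy, x)         # (**) gradient w.r.t. the weights
    return dE_dx, dE_dW

W = np.random.randn(84, 120)           # e.g. F6: 84 outputs from 120 inputs
x = np.random.randn(120)
dE_dy = np.random.randn(84)            # known from the layer above
dE_dx, dE_dW = fc_backward(W, x, dE_dy)
print(dE_dx.shape, dE_dW.shape)        # (120,) (84, 120)
```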

43 Backprop – ReLU
Forward, rectified linear unit: $y_l = \max(0, y_{l-1})$. No weights.
Backward:
(*) $\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l} \cdot \frac{\partial y_l}{\partial y_{l-1}}$, with $\frac{\partial y_l}{\partial y_{l-1}} = 0$ if $y_{l-1} < 0$ and 1 otherwise, so
$\frac{\partial E}{\partial y_{l-1}} = \begin{cases} 0 & \text{if } y_{l-1} < 0 \\ \frac{\partial E}{\partial y_l} & \text{otherwise} \end{cases}$
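A tiny sketch: the gradient passes through only where the input was positive.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(x, dE_dy):
    return dE_dy * (x > 0)             # zero out gradient where input <= 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
dE_dy = np.ones_like(x)
print(relu_forward(x))                 # [0.  0.  0.  0.5 2. ]
print(relu_backward(x, dE_dy))         # [0.  0.  0.  1.  1. ]
```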

44 Backprop – Max Pooling Layer
Forward pass (k is the kernel size):
for (p = 0; p < k; p++) for (q = 0; q < k; q++) $y_l(x,y) = \max(y_l(x,y),\, y_{l-1}(x+p, y+q))$
Backward:
(*) => $\frac{\partial E}{\partial y_{l-1}(x+p, y+q)} = \begin{cases} 0 & \text{if } y_l(x,y) \ne y_{l-1}(x+p, y+q) \\ \frac{\partial E}{\partial y_l(x,y)} & \text{otherwise} \end{cases}$
i.e. the gradient is routed only to the input position that achieved the maximum.
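A minimal sketch of 2x2 max pooling and its backward pass on a single feature map:

```python
import numpy as np

def maxpool_forward(x, k=2):
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def maxpool_backward(x, dE_dy, k=2):
    h, w = x.shape
    dx = np.zeros_like(x)
    for i in range(h // k):
        for j in range(w // k):
            block = x[i*k:(i+1)*k, j*k:(j+1)*k]
            p, q = np.unravel_index(block.argmax(), block.shape)
            dx[i*k + p, j*k + q] = dE_dy[i, j]   # route grad to the argmax
    return dx

x = np.random.randn(4, 4)
y = maxpool_forward(x)
dx = maxpool_backward(x, np.ones_like(y))
print(y.shape, dx.sum())      # (2, 2) 4.0
```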

45 Backprop – Convolution Layer
Forward pass (m x m kernel, N x N input):
$y_l(i,j) = \sum_{a=0}^{m-1}\sum_{b=0}^{m-1} w(a,b)\, y_{l-1}(i+a, j+b)$
Backward: for the previous layer we need the partial derivative of E with respect to each neuron output, $\frac{\partial E}{\partial y_l(i,j)}$. For the weights, since each $w(a,b)$ occurs in every $y_l(i,j)$, we must sum over all of them (this corresponds to the weight sharing in the network!):
(**) $\frac{\partial E}{\partial w_l(a,b)} = \sum_{i=0}^{N-m}\sum_{j=0}^{N-m} \frac{\partial E}{\partial y_l(i,j)}\, \frac{\partial y_l(i,j)}{\partial w_l(a,b)} = \sum_{i=0}^{N-m}\sum_{j=0}^{N-m} \frac{\partial E}{\partial y_l(i,j)}\, y_{l-1}(i+a, j+b)$
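A naive sketch of the forward pass and the weight gradient above (single channel, no bias; loops rather than an optimized implementation):

```python
import numpy as np

def conv_forward(x, w):
    """'Valid' convolution: y(i,j) = sum_{a,b} w(a,b) x(i+a, j+b)."""
    N, m = x.shape[0], w.shape[0]
    out = np.zeros((N - m + 1, N - m + 1))
    for i in range(N - m + 1):
        for j in range(N - m + 1):
            out[i, j] = np.sum(w * x[i:i+m, j:j+m])
    return out

def conv_backward_w(x, dE_dy, m):
    """dE/dw(a,b) = sum over every output position that used w(a,b)."""
    dw = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            dw[a, b] = np.sum(dE_dy * x[a:a+dE_dy.shape[0], b:b+dE_dy.shape[1]])
    return dw

x = np.random.randn(8, 8)
w = np.random.randn(5, 5)
y = conv_forward(x, w)
dw = conv_backward_w(x, np.ones_like(y), m=5)
print(y.shape, dw.shape)      # (4, 4) (5, 5)
```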

46 Backprop – Convolution Layer (cont.)
Backward, gradient with respect to the layer input:
(*) $\frac{\partial E}{\partial y_{l-1}(i,j)} = \sum_{a=0}^{m-1}\sum_{b=0}^{m-1} \frac{\partial E}{\partial y_l(i-a, j-b)}\, \frac{\partial y_l(i-a, j-b)}{\partial y_{l-1}(i,j)} = \sum_{a=0}^{m-1}\sum_{b=0}^{m-1} \frac{\partial E}{\partial y_l(i-a, j-b)}\, w_l(a,b)$
This gives us the error at the previous layer. As we can see, it looks like a convolution again: the filter w is applied (flipped) to the gradient of the layer above.
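A naive sketch of this input gradient, written as a scatter: each input pixel collects contributions from every output it fed into, matching the sum above (hypothetical sizes, matching the 8x8 input and 5x5 kernel of the previous sketch):

```python
import numpy as np

def conv_backward_x(dE_dy, w, N):
    """dE/dx for a 'valid' convolution: dx(i+a, j+b) += dE/dy(i,j) * w(a,b)."""
    m = w.shape[0]
    dx = np.zeros((N, N))
    for i in range(dE_dy.shape[0]):
        for j in range(dE_dy.shape[1]):
            dx[i:i+m, j:j+m] += dE_dy[i, j] * w   # scatter through the kernel
    return dx

w = np.random.randn(5, 5)
dE_dy = np.ones((4, 4))       # upstream gradient for an 8x8 input, 5x5 kernel
dx = conv_backward_x(dE_dy, w, N=8)
print(dx.shape)               # (8, 8)
```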

47 Conclusions
In neural networks, backpropagation allows us to efficiently compute the gradients of a loss function with respect to the connections (weights) of the network.
Hand-crafted features should be replaced by automatically learned features.
Large systems can be trained by gradient-based methods with efficient backpropagation.

48 Discussion & Questions
Gradient convergence: magic?
Gradient vanishing
Activation functions: sigmoid vs. ReLU
GPUs
You???

49 Gradient Vanishing
Problem: the gradients of the network's output with respect to the parameters in the early layers become extremely small, so a large change in the value of those parameters doesn't have a big effect on the output.
Cause: the vanishing gradient problem depends on the choice of the activation function. Many common activation functions (e.g. sigmoid or tanh) 'squash' their input into a very small output range in a very non-linear fashion; for example, the sigmoid maps the real number line onto the "small" range [0, 1]. As a result, there are large regions of the input space which are mapped to an extremely small output range, and in those regions even a large change in the input produces only a small change in the output, hence the gradient is small. This becomes much worse when we stack multiple layers of such non-linearities on top of each other.
Solution: use activation functions which don't 'squash' the input space into a small region. A popular choice is the rectified linear unit (ReLU), which maps x to max(0, x).
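A tiny numerical illustration of the effect (a sketch, not from the paper): by the chain rule, the gradient reaching an early layer contains a product of activation derivatives, one per layer; sigmoid'(x) is at most 0.25, so that product shrinks geometrically, whereas ReLU's derivative is 1 on its active side and does not shrink it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.5
product_of_derivatives = 1.0
for layer in range(10):
    s = sigmoid(x)
    product_of_derivatives *= s * (1.0 - s)   # sigmoid'(x) = s(1-s) <= 0.25
    x = s                                     # feed the activation forward

print(product_of_derivatives)   # on the order of 1e-7 after only 10 layers
```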

50 Links Gradient-based Learning Applied to Document Recognition paper
CS231n: Convolutional Neural Networks for Visual Recognition

