1
Multinomial Regression and the Softmax Activation Function
Gary Cottrell
2
Notation reminder
We have N data points, or patterns, in the training set, with the pattern number as a superscript: $\{(x^1,t^1), (x^2,t^2), \ldots, (x^n,t^n), \ldots, (x^N,t^N)\}$, where $t^n$ is the target for pattern $n$. The x's are (usually) vectors of dimension $d$, so the $n$th input pattern is $x^n = (x^n_1, x^n_2, \ldots, x^n_d)^T$. For the output, the weighted sum of the input (a.k.a. the net input) is written $a = \sum_j w_j x_j$. If there are multiple outputs, $k = 1, \ldots, c$, we write $a_k = \sum_j w_{kj} x_j$, and then the $k$th output is $y_k = g(a_k)$, where $g$ is the activation function.
3
What if we have multiple classes?
E.g., 10 classes, as in the MNIST problem (to pick a random example!). Then it would be great to force the network to have outputs that are positive and sum to 1, i.e., to represent the probability distribution over the output categories. The softmax activation function does this.
4
Softmax
$$y_k = \frac{e^{a_k}}{\sum_{k'=1}^{c} e^{a_{k'}}}$$
Note that since the denominator is a sum over all categories, the sum of the outputs over all categories is 1. This is a generalization of the logistic to multiple categories, because softmax can be written as
$$y_k = \frac{1}{1 + \sum_{k' \neq k} e^{-(a_k - a_{k'})}}$$
(proof left to the reader!)
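As a quick illustration (my own, not from the slides), here is a minimal NumPy sketch of softmax applied to a vector of net inputs; the function name and the max-subtraction trick for numerical stability are my additions.

```python
import numpy as np

def softmax(a):
    """Softmax over a vector of net inputs a_k."""
    # Subtracting the max does not change the result (softmax is invariant to
    # adding a constant to every a_k) but avoids overflow in exp.
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
print(y, y.sum())   # the outputs are positive and sum to 1
```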
5
What is the right error function?
The network is a single layer of $c$ output units, one per category, each fully connected to the inputs. We use 1-out-of-c encoding for $c$ categories, meaning the target is 1 for the correct category and 0 everywhere else. So for 10 outputs, the target vector would look like $(0,0,1,0,0,0,0,0,0,0)$ if the target was the third category. Using Kronecker's delta, we can write this compactly as $t^n_k = \delta_{k\ell}$, where $\ell$ is the correct category for pattern $n$ (and $\delta_{k\ell} = 1$ if $k = \ell$, 0 otherwise).
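A tiny sketch (my own, not from the slides) of building the 1-out-of-c target vector described above:

```python
import numpy as np

def one_hot(category, c):
    """Return the 1-out-of-c target: 1 at the correct category, 0 everywhere else."""
    t = np.zeros(c)
    t[category] = 1.0
    return t

print(one_hot(2, 10))   # third category -> [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```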
6
What is the right error function?
Now, the conditional probability for the $n$th example is
$$p(t^n \mid x^n) = \prod_{k=1}^{c} (y^n_k)^{t^n_k}$$
Note that this picks out the $j$th output if the correct category is $j$, since $t_k$ will be 1 for the $j$th category and 0 for everything else. So the likelihood for the entire data set is:
7
What is the right error function?
So the likelihood for the entire data set is
$$L = \prod_{n=1}^{N} \prod_{k=1}^{c} (y^n_k)^{t^n_k}$$
and the error, the negative log likelihood, is
$$E = -\ln L = -\sum_{n=1}^{N} \sum_{k=1}^{c} t^n_k \ln y^n_k$$
This is also called the cross-entropy error.
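As a small illustration (not from the slides), the cross-entropy error for a batch of softmax outputs and 1-of-c targets can be computed directly from the formula above; the array names and the epsilon guard are my assumptions.

```python
import numpy as np

def cross_entropy(Y, T):
    """E = -sum_n sum_k t_k^n ln y_k^n, for Y, T of shape (N, c)."""
    eps = 1e-12                    # guard against log(0)
    return -np.sum(T * np.log(Y + eps))

Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # softmax outputs for two patterns
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # 1-of-c targets
print(cross_entropy(Y, T))        # -(ln 0.7 + ln 0.8)
```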
8
What is the right error function?
The minimum of this error is at $y^n_k = t^n_k$, where it takes the value
$$E_{\min} = -\sum_{n}\sum_{k} t^n_k \ln t^n_k$$
which will be 0 if $t$ is 0 or 1. ...but this still applies if $t$ is an actual probability between 0 and 1, and then the above quantity is not 0. So we can subtract it off of the error to get an error function that will hit 0 when minimized (i.e., when $y = t$):
$$E = -\sum_{n}\sum_{k} t^n_k \ln \frac{y^n_k}{t^n_k}$$
9
What is the right error function?
So, we need to go downhill in this error. Recall gradient descent says
$$\Delta w_{kj} = -\eta \frac{\partial E^n}{\partial w_{kj}} = -\eta \, \frac{\partial E^n}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}}$$
The second factor is just $x_j$, as before. For the first factor, for one pattern $n$, the derivative with respect to the net input has to take into account all of the outputs, because changing the net input to one output changes the activations of all the outputs, so
$$\frac{\partial E^n}{\partial a_k} = \sum_{k'} \frac{\partial E^n}{\partial y_{k'}} \frac{\partial y_{k'}}{\partial a_k}$$
10
What is the right error function?
So, in order to do gradient descent, we need the derivative
$$\frac{\partial E^n}{\partial a_k} = \sum_{k'} \frac{\partial E^n}{\partial y_{k'}} \frac{\partial y_{k'}}{\partial a_k}$$
The first factor, using $E^n = -\sum_{k'} t^n_{k'} \ln y^n_{k'}$, is
$$\frac{\partial E^n}{\partial y_{k'}} = -\frac{t^n_{k'}}{y^n_{k'}}$$
The second factor, using the definition of softmax, is
$$\frac{\partial y_{k'}}{\partial a_k} = y_{k'} (\delta_{kk'} - y_k)$$
11
What is the right error function?
Which leads to
$$\frac{\partial E^n}{\partial a_k} = -\sum_{k'} \frac{t^n_{k'}}{y^n_{k'}} \, y^n_{k'} (\delta_{kk'} - y^n_k) = y^n_k - t^n_k$$
(using the fact that $\sum_{k'} t^n_{k'} = 1$). So
$$\Delta w_{kj} = \eta \, (t^n_k - y^n_k) \, x^n_j$$
!!!! We get the delta rule again !!!!
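As a quick sanity check (my own, not from the slides), the result $\partial E^n / \partial a_k = y_k - t_k$ can be verified numerically with finite differences:

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

def cross_entropy(a, t):
    return -np.sum(t * np.log(softmax(a)))

a = np.array([0.5, -1.2, 2.0])          # net inputs for one pattern
t = np.array([0.0, 0.0, 1.0])           # 1-of-c target

analytic = softmax(a) - t               # dE/da_k = y_k - t_k
numeric = np.zeros_like(a)
eps = 1e-6
for k in range(len(a)):
    d = np.zeros_like(a)
    d[k] = eps
    numeric[k] = (cross_entropy(a + d, t) - cross_entropy(a - d, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```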
12
So, the "right" way to do multinomial regression is:
Start with a single-layer net of c output units, one per category, fully connected to the inputs.
Initialize the weights to 0.
Use the softmax activation function.
For each pass through the data (an epoch):
  Randomize the order of the patterns.
  Present a pattern, compute the output.
  Update the weights according to the delta rule.
Repeat until the error stops decreasing (enough) or a maximum number of epochs has been reached. (A sketch of this loop in code follows below.)
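Here is a minimal sketch of that training loop (my own, not from the slides), assuming inputs X of shape (N, d) and 1-of-c targets T of shape (N, c); the explicit bias terms and all names are my additions.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

def train(X, T, eta=0.1, max_epochs=100):
    N, d = X.shape
    c = T.shape[1]
    W = np.zeros((c, d))                     # initialize the weights to 0
    b = np.zeros(c)                          # bias terms (an assumption; not shown in the slides)
    for epoch in range(max_epochs):          # each pass through the data is an epoch
        for n in np.random.permutation(N):   # randomize the order of the patterns
            y = softmax(W @ X[n] + b)        # present a pattern, compute the output
            delta = T[n] - y                 # delta rule: (t_k - y_k)
            W += eta * np.outer(delta, X[n])
            b += eta * delta
    # (the slides also stop when the error stops decreasing enough; omitted here)
    return W, b
```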
13
Activation functions and Forward propagation
We've already seen three kinds of activation functions:
Binary threshold units
Logistic units
Softmax units
(Plots show each unit's output as a function of its net input.)
14
Activation functions and Forward propagation
Some more:
Rectified linear units (ReLU): $g(a) = \max(0, a)$
Leaky ReLU: $g(a) = a$ for $a > 0$, and $g(a) = \alpha a$ for $a \le 0$, with a small slope $\alpha$ (e.g., 0.01)
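Minimal NumPy sketches of the two activations just listed (function names and the default slope are my own):

```python
import numpy as np

def relu(a):
    """Rectified linear unit: max(0, a), elementwise."""
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    """Leaky ReLU: a for a > 0, alpha * a otherwise."""
    return np.where(a > 0, a, alpha * a)
```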
15
Activation functions and Forward propagation
Tanh: $g(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
Linear: $g(a) = a$ (the output is just the net input itself!)
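And the corresponding sketches for these two (again, names are my own):

```python
import numpy as np

def tanh(a):
    """Hyperbolic tangent activation; squashes the net input into (-1, 1)."""
    return np.tanh(a)

def linear(a):
    """Linear (identity) activation: the output is the net input itself."""
    return a
```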
16
Activation functions and Forward propagation
Stochastic: the logistic of the net input is treated as the probability of the output being 1 (else 0).
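A sketch of such a stochastic binary unit (my own, not from the slides): the logistic gives the probability of outputting 1, and the output is sampled from it.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def stochastic_unit(a, rng=None):
    """Output 1 with probability logistic(a), else 0, elementwise."""
    rng = rng if rng is not None else np.random.default_rng()
    a = np.asarray(a, dtype=float)
    p = logistic(a)                       # probability that each output is 1
    return (rng.random(a.shape) < p).astype(float)

print(stochastic_unit(np.array([-2.0, 0.0, 2.0])))   # e.g. [0. 1. 1.] (random)
```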
17
Activation functions and Forward propagation
These functions can be applied recursively, with the outputs of one layer becoming the inputs to the next, e.g. $z_j = g(\sum_i w_{ji} x_i)$ and then $y_k = g(\sum_j w_{kj} z_j)$. This is called forward propagation.
19
Activation functions and Forward propagation
Different layers need not have the same activation functions. One popular combination (not shown here): ReLU in the hidden layers, softmax at the output, as in the sketch below.
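As an illustration (not from the slides), here is a minimal forward-propagation sketch for a two-layer net using that popular combination; the weight shapes and all names are my assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

def forward(x, W1, b1, W2, b2):
    """Apply the activation functions layer by layer (forward propagation)."""
    a1 = W1 @ x + b1        # net input to the hidden layer
    z1 = relu(a1)           # hidden activations (ReLU)
    a2 = W2 @ z1 + b2       # net input to the output layer
    return softmax(a2)      # output probabilities over the c categories

# Example with d = 4 inputs, 5 hidden units, c = 3 outputs (random weights).
rng = np.random.default_rng(0)
y = forward(rng.normal(size=4),
            rng.normal(size=(5, 4)), np.zeros(5),
            rng.normal(size=(3, 5)), np.zeros(3))
print(y, y.sum())           # sums to 1
```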