Giansalvo EXIN Cirrincione unit #7/8
ERROR FUNCTIONS, part one. Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables. Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.
ERROR FUNCTIONS. Basic goal for TRAINING: to model the underlying generator of the data, so as to generalize to new data. The most general and complete description of the generator of the data is in terms of the probability density p(x, t) in the joint input-target space. For a set of training data {x^n, t^n} drawn independently from this distribution, the conditional density p(t|x) is the part modelled by the feed-forward neural network.
ERROR FUNCTIONS. The nature of the target data determines the error function: continuous targets (regression: prediction); continuous targets (classification: probability of class membership); discrete targets (classification: class membership).
Sum-of-squares error, OLS approach: c target variables t_k; the distributions of the target variables are independent; the distributions of the target variables are Gaussian, with noise ε_k ~ N(0, σ²) that doesn't depend on x or on k.
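A sketch of the maximum-likelihood argument these assumptions lead to, in the notation assumed here (y_k(x; w) is the network output for target t_k, pattern index n):

```latex
% Assumed noise model: t_k = y_k(x; w) + \epsilon_k, with \epsilon_k \sim N(0, \sigma^2)
p(t_k \mid x) = \frac{1}{(2\pi\sigma^2)^{1/2}}
    \exp\left\{ -\frac{\left( t_k - y_k(x; w) \right)^2}{2\sigma^2} \right\}

% Negative log-likelihood of the TS (independent patterns, independent targets):
-\ln L = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{c}
    \left( y_k(x^n; w) - t_k^n \right)^2 + Nc \ln\sigma + \frac{Nc}{2}\ln(2\pi)

% Dropping the w-independent terms and the overall 1/\sigma^2 factor
% leaves the sum-of-squares error:
E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c}
    \left( y_k(x^n; w) - t_k^n \right)^2
```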
Sum-of-squares error
w* minimizes E; the negative log-likelihood, evaluated at w*, is then minimized w.r.t. σ. The optimum value of σ² is proportional to the residual value of the sum-of-squares error function at its minimum. Of course, the use of a sum-of-squares error doesn't require the target data to have a Gaussian distribution. However, if we use this error, then the results cannot distinguish between the true distribution and any other distribution having the same mean and variance.
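The standard form of this result, reconstructed with the notation used above:

```latex
\sigma^2 = \frac{1}{Nc} \sum_{n=1}^{N} \sum_{k=1}^{c}
    \left( y_k(x^n; w^*) - t_k^n \right)^2
  = \frac{2}{Nc}\, E(w^*)
```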
Sum-of-squares error: training error (over all N patterns in the training set) and validation error (over all N' patterns in the test set), in normalized form: Ẽ = 0 means perfect prediction of the test data; Ẽ = 1 means the network is predicting the test data in the mean.
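A common form of the normalized error referred to here (a sketch; the use of the test-set mean target as the reference is an assumption of this reconstruction):

```latex
\tilde{E} = \frac{ \sum_{n} \left\| y(x^n; w^*) - t^n \right\|^2 }
                 { \sum_{n} \left\| t^n - \bar{t} \right\|^2 },
\qquad
\bar{t} = \frac{1}{N'} \sum_{n} t^n
```

With this normalization, Ẽ = 0 corresponds to perfect prediction and Ẽ = 1 to predicting the test data by their mean.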
linear output units (MLP, RBF): the network output is a linear combination of the activations of the M final-hidden-layer (or basis-function) units.
linear output units: the final-layer weights can be found by an exact (pseudo-inverse) solution of the linear least-squares problem, computed via the SVD (a sketch is given below).
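A minimal sketch of this idea (not the lecture's own code): with linear output units the final-layer weights solve a linear least-squares problem, which can be computed through the SVD-based pseudo-inverse. The names Z (matrix of hidden-unit activations) and T (matrix of targets) are illustrative.

```python
import numpy as np

def solve_output_weights(Z, T, rcond=1e-6):
    """Least-squares solution for the final-layer weights of a network
    with linear output units, via the SVD-based pseudo-inverse.

    Z : (N, M) hidden-unit activations (append a column of ones for a bias).
    T : (N, c) target values.
    Returns W of shape (M, c) such that Z @ W approximates T in the
    least-squares sense."""
    # np.linalg.pinv computes the Moore-Penrose pseudo-inverse via SVD,
    # which handles a rank-deficient Z gracefully.
    return np.linalg.pinv(Z, rcond=rcond) @ T

# Illustrative usage with random data (shapes only, not a real TS):
Z = np.hstack([np.tanh(np.random.randn(100, 5)), np.ones((100, 1))])
T = np.random.randn(100, 2)
W = solve_output_weights(Z, T)   # (6, 2)
Y = Z @ W                        # network outputs on the TS
```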
linear output units: a fast stage (exact solution for the final-layer weights) combined with a slow stage (iterative optimization of the remaining weights); the effect is a reduction of the number of iterations (smaller search space) at the price of a greater cost per iteration.
linear output units Suppose the TS target patterns satisfy an exact linear relation: If the final-layer weights are determined by OLS, then the outputs will satisfy the same linear constraint for arbitrary input patterns.
Interpretation of network outputs: for a network trained by minimizing a sum-of-squares error function, the outputs approximate the conditional averages of the target data. Consider the limit in which the size N of the TS goes to infinity.
Interpretation of network outputs
the regression of t_k conditioned on x
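In symbols, the result the slides refer to (standard form, valid under the assumptions listed on the next slide):

```latex
y_k(x; w^*) = \left\langle t_k \mid x \right\rangle
            = \int t_k \, p(t_k \mid x) \, dt_k
```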
KEY ASSUMPTIONS: the TS must be sufficiently large that it approximates an infinite TS; the network must be sufficiently general (enough weights that the minimum can actually be attained); training must be carried out so as to find the appropriate minimum of the cost. This result doesn't depend on the choice of network architecture, or indeed on whether a neural network is used at all. However, ANNs provide a framework for approximating arbitrary nonlinear multivariate mappings and can therefore in principle approximate the conditional average to arbitrary accuracy.
Interpretation of network outputs: zero-mean. The quantity t_k − ⟨t_k|x⟩ has zero mean conditioned on x, so the cross term drops out of the decomposition given below.
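The decomposition behind the zero-mean remark (a standard reconstruction): in the infinite-data limit the sum-of-squares error splits into two terms, because the cross term involving t_k − ⟨t_k|x⟩ vanishes.

```latex
E = \frac{1}{2} \sum_{k} \int \left\{ y_k(x; w) - \langle t_k \mid x \rangle \right\}^2 p(x)\, dx
  + \frac{1}{2} \sum_{k} \int \left\{ \langle t_k^2 \mid x \rangle
      - \langle t_k \mid x \rangle^2 \right\} p(x)\, dx
```

The second term is independent of w (it is the average variance of the target data), so the minimum is reached when y_k(x; w) = ⟨t_k|x⟩.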
Interpretation of network outputs: example
Interpretation of network outputs, example: the noise term is a RV drawn from a uniform distribution in the range (−0.1, 0.1); MLP trained with a sum-of-squares error.
Interpretation of network outputs: the sum-of-squares error function cannot distinguish between the true distribution and a Gaussian distribution having the same x-dependent mean and average variance.
ERROR FUNCTIONS, part two. Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables. Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.
We can exploit a number of results... Minimum error-rate decisions. Note that the network outputs need not be close to 0 or 1 if the class-conditional density functions are overlapping.
We can exploit a number of results... Outputs sum to 1: this can be enforced explicitly as part of the choice of network structure. The average of each output over all patterns in the TS should approximate the corresponding prior class probability. These estimated priors can be compared with the sample estimates of the priors obtained from the fractions of patterns in each class within the TS; differences are an indication that the network is not modelling the posterior probabilities accurately (a minimal check is sketched below).
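A minimal sketch of this consistency check, assuming the network outputs are collected in an array Y (one column per class) and the TS labels in an integer array labels; all names are illustrative.

```python
import numpy as np

def check_prior_consistency(Y, labels, n_classes):
    """Compare the average network outputs (estimates of the priors)
    with the sample priors given by the class fractions in the TS.

    Y      : (N, c) array of network outputs, one column per class.
    labels : (N,) integer class labels of the same N patterns.
    Returns (estimated_priors, sample_priors); large differences suggest
    the outputs are not modelling the posterior probabilities accurately."""
    estimated_priors = Y.mean(axis=0)
    sample_priors = np.bincount(labels, minlength=n_classes) / len(labels)
    return estimated_priors, sample_priors
```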
We can exploit a number of results... Compensating for different priors. Case: the priors expected when the network is in use (subscript 2) differ from those represented by the TS (subscript 1). The outputs are divided by the TS priors, multiplied by the new priors, and rescaled by a normalization factor; changes in priors can thus be accommodated without retraining (see the formula below).
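The standard correction, reconstructed here with subscript 1 for the TS priors and subscript 2 for the priors in use (ỹ_k denotes the corrected output): the class-conditional densities are unchanged, so the old priors are divided out, the new ones multiplied in, and the result renormalized over the classes.

```latex
\tilde{y}_k(x) = \frac{ y_k(x)\, P_2(C_k) / P_1(C_k) }
                      { \sum_{k'} y_{k'}(x)\, P_2(C_{k'}) / P_1(C_{k'}) }
```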
sum-of-squares for classification: every input vector in the TS is labelled by its class membership, represented by a set of target values t_k^n using a 1-of-c coding; the target is therefore a discrete random variable.
sum-of-squares for classification: if the outputs represent probabilities, they should lie in the range (0, 1) and should sum to 1. With a 1-of-c coding scheme the target values sum to unity for each pattern, so the outputs will satisfy the same constraint: for a network with linear output units and a s-o-s error, if the target values satisfy a linear constraint, then the outputs satisfy the same constraint for an arbitrary input. There is, however, no guarantee that the outputs lie in the range (0, 1). The s-o-s error is not the most appropriate for classification, because it is derived from ML on the assumption of Gaussian-distributed target data.
sum-of-squares for classification, two-class problem: either a 1-of-c coding with two output units, or an alternative approach with a single output.
Interpretation of hidden units (linear output units): total covariance matrix for the activations at the output of the final hidden layer, computed w.r.t. the TS.
Interpretation of hidden units (linear output units): between-class covariance matrix (standard definitions of both matrices are recalled below).
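The standard definitions assumed here, with z^n the vector of final-hidden-layer activations for pattern n, z̄ the overall mean and z̄_k the mean over the N_k patterns of class C_k (the precise class weighting entering the lecture's criterion may differ, as the later slide on 1-of-c coding points out):

```latex
S_T = \sum_{n} \left( z^n - \bar{z} \right) \left( z^n - \bar{z} \right)^{\mathrm{T}},
\qquad
S_B = \sum_{k} N_k \left( \bar{z}_k - \bar{z} \right) \left( \bar{z}_k - \bar{z} \right)^{\mathrm{T}}
```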
Interpretation of hidden units (linear output units), min/max: nothing here is specific to the MLP, or indeed to ANNs. The same result is obtained regardless of the functions z_j (of the weights), and it applies to any generalized linear discriminant in which the kernels are adaptive.
Interpretation of hidden units (linear output units), min/max: the weights in the final layer are adjusted to produce an optimum discrimination of the classes of input vectors by means of a linear transformation. Minimizing the error of this linear discriminant requires that the input data undergo a nonlinear transformation into the space spanned by the activations of the hidden units, in such a way as to maximize the discriminant function J.
Interpretation of hidden units (linear output units), min/max: with the 1-of-c target coding, the feature-extraction criterion is strongly weighted in favour of classes with larger numbers of patterns.
Cross-entropy for two classes: single output y, which we want to represent the posterior probability of class membership; target coding scheme; cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990).
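The two-class cross-entropy error referred to here, with the usual coding t^n = 1 for class C_1 and t^n = 0 for class C_2, and y^n = y(x^n; w) read as P(C_1 | x^n):

```latex
E = - \sum_{n} \left\{ t^n \ln y^n + \left( 1 - t^n \right) \ln\left( 1 - y^n \right) \right\}
```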
Cross-entropy for two classes: absolute minimum; logistic activation function for the output; BP. Natural pairing: sum-of-squares + linear output units; cross-entropy + logistic output unit.
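The natural pairing can be seen from the derivative of the error w.r.t. the output-unit activation a (with y = 1/(1 + e^{-a})): the logistic's derivative cancels the denominator coming from the cross-entropy, leaving the same simple form as sum-of-squares with linear outputs.

```latex
\frac{\partial E}{\partial a^n}
  = \frac{\partial E}{\partial y^n}\,\frac{\partial y^n}{\partial a^n}
  = \frac{ y^n - t^n }{ y^n \left( 1 - y^n \right) } \; y^n \left( 1 - y^n \right)
  = y^n - t^n
```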
Cross-entropy for two classes: with the 0/1 target coding the minimum value of the error is 0; the error doesn't vanish at its minimum when t^n is continuous in the range (0, 1), representing the probability of the input x^n belonging to class C_1.
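In that case the minimum is obtained at y^n = t^n and equals the entropy of the targets; subtracting it yields an error that does vanish at its minimum (a standard reconstruction):

```latex
E_{\min} = - \sum_{n} \left\{ t^n \ln t^n + \left( 1 - t^n \right) \ln\left( 1 - t^n \right) \right\}

E - E_{\min} = - \sum_{n} \left\{ t^n \ln \frac{y^n}{t^n}
  + \left( 1 - t^n \right) \ln \frac{1 - y^n}{1 - t^n} \right\}
```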
Cross-entropy for two classes, example (figure): class-conditional pdfs used to generate the TS (equal priors); dashed = Bayes. MLP with one input unit, five hidden units (tanh) and one output unit (logistic), trained with the cross-entropy error using BFGS.
sigmoid activation functions: single-layer and multi-layer networks; exponential family of distributions (e.g. Gaussian, binomial, Bernoulli, Poisson); hidden unit vs. output. The network output is given by a logistic sigmoid activation function acting on a weighted linear combination of the outputs of those hidden units which send connections to the output unit. Extension to the hidden units: provided such units use logistic sigmoids, their outputs can be interpreted as probabilities of the presence of corresponding features, conditioned on the inputs to the units.
properties of the cross-entropy error the error function depends on the relative errors of the outputs (its minimization tends to result in similar relative errors on both small and large targets) the s-o-s error function depends on the absolute errors (its minimization tends to result in similar absolute errors for each pattern) the cross-entropy error function performs better than s-o-s at estimating small probabilities
properties of the cross-entropy error: target coding scheme; Manhattan (city-block) error function; compared with s-o-s it gives much stronger weight to smaller errors and is better for incorrectly labelled data.
justification of the cross-entropy error: take the infinite-data limit and set the functional derivative w.r.t. y(x) to zero; as for s-o-s, the output of the network approximates the conditional average of the target data for the given input.
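A sketch of the argument: in the infinite-data limit the sum over patterns becomes an integral, and the stationarity condition on y(x) gives the conditional average.

```latex
E = - \int\!\!\int \left\{ t \ln y(x) + (1 - t) \ln\left( 1 - y(x) \right) \right\}
      p(t \mid x)\, p(x)\, dt\, dx

\frac{\delta E}{\delta y(x)} = 0
\;\;\Longrightarrow\;\;
y(x) = \int t \, p(t \mid x)\, dt = \langle t \mid x \rangle
```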
justification of the cross-entropy error target coding scheme
Multiple independent attributes: determine the probabilities of the presence or absence of a number of attributes (which need not be mutually exclusive), given the input x. Assumption: independent attributes. Multiple outputs: y_k represents the probability that the k-th attribute is present. With this choice of error function (written out below), the outputs should each have a logistic sigmoid activation function.
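With the independence assumption the conditional distribution of the targets factorizes, and the negative log-likelihood gives the error function meant above (standard form):

```latex
p(t \mid x) = \prod_{k=1}^{c} y_k^{\,t_k} \left( 1 - y_k \right)^{1 - t_k}

E = - \sum_{n} \sum_{k=1}^{c} \left\{ t_k^n \ln y_k^n
    + \left( 1 - t_k^n \right) \ln\left( 1 - y_k^n \right) \right\}
```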
Multiple independent attributes. HOMEWORK: show that the entropy measure E, derived for targets t_k = 0, 1, applies also in the case where the targets are probabilities with values in (0, 1). Do this by considering an extended data set in which each value t_k^n is replaced by a set of M values, of which a fraction t_k^n (i.e. M t_k^n values) is set to 1 and the remainder set to 0, and then applying E to this extended TS.
Cross-entropy for multiple classes (mutually exclusive classes), 1-of-c coding: the probability of observing the set of target values t_k^n = δ_kl, given an input vector x^n, is just the product given below. One output y_k for each class. The {y_k} are not independent, as a result of the constraint Σ_k y_k = 1. The absolute minimum w.r.t. {y_k^n} occurs when y_k^n = t_k^n for all k, n; the error measured relative to this minimum is a discrete Kullback-Leibler distance.
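In symbols (standard reconstruction of the formulas the slide refers to):

```latex
p(t^n \mid x^n) = \prod_{k=1}^{c} \left( y_k^n \right)^{t_k^n}
\;\;\Longrightarrow\;\;
E = - \sum_{n} \sum_{k=1}^{c} t_k^n \ln y_k^n

% Subtracting the minimum (reached at y_k^n = t_k^n) gives the
% discrete Kullback-Leibler form, which vanishes at the minimum:
E - E_{\min} = - \sum_{n} \sum_{k=1}^{c} t_k^n \ln \frac{y_k^n}{t_k^n}
```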
Cross-entropy for multiple classes: if the output values are to be interpreted as probabilities, they must lie in the range (0, 1) and sum to unity. This is achieved by the normalized exponential (softmax) activation function, a generalization of the logistic sigmoid.
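A minimal numerical sketch of the normalized exponential (names illustrative); with two classes it reduces to the logistic sigmoid of the activation difference.

```python
import numpy as np

def softmax(a):
    """Normalized exponential (softmax) over the last axis.
    a : array of output-unit activations a_k. Returns values in (0, 1)
    that sum to 1, so they can be read as posterior probabilities."""
    a = a - a.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# With two classes, softmax([a1, a2])[0] equals the logistic sigmoid
# of the activation difference 1 / (1 + exp(-(a1 - a2))):
a = np.array([0.7, -0.3])
print(softmax(a)[0], 1.0 / (1.0 + np.exp(-(a[0] - a[1]))))
```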
As with the logistic sigmoid, we can give a general motivation for the softmax by considering the posterior probability that a hidden unit activation z belongs to class C k. The outputs can be interpreted as probabilities of class membership, conditioned on the outputs of the hidden units.
Cross-entropy for multiple classes: BP training; the error is differentiated w.r.t. the inputs (activations) to all output units. Natural pairing: sum-of-squares + linear output units; 2-class cross-entropy + logistic output unit; c-class cross-entropy + softmax output units.
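The BP error signal at the output units for the softmax + multi-class cross-entropy pairing (a standard derivation, using ∂y_k/∂a_{k'} = y_k(δ_{kk'} − y_{k'}) and Σ_k t_k^n = 1):

```latex
\frac{\partial E^n}{\partial a_{k}^n}
  = \sum_{k'} \frac{\partial E^n}{\partial y_{k'}^n}\,
              \frac{\partial y_{k'}^n}{\partial a_{k}^n}
  = y_k^n - t_k^n
```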
Consider the cross-entropy error function for multiple classes, together with a network whose outputs are given by a softmax activation function, in the limit of an infinite data set. Show that the network output functions y_k(x) which minimize the error are given by the conditional averages of the target data. Hint: since the outputs are not independent, consider the functional derivative w.r.t. a_k(x) instead.