
Giansalvo EXIN Cirrincione unit #7/8

ERROR FUNCTIONS part one Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables. Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.

ERROR FUNCTIONS Basic goal for TRAINING: model the underlying generator of the data, so as to generalize to new data. The most general and complete description of the generator of the data is in terms of the probability density p(x, t) in the joint input-target space. For a set of training data {x^n, t^n} drawn independently from the same distribution, the likelihood factorizes over the patterns, and the conditional density p(t|x) is modelled by the feed-forward neural network (see the reconstruction below).
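The likelihood expression on this slide did not survive extraction; a standard reconstruction consistent with the text, with y(x; w) denoting the network mapping, is:

```latex
L = \prod_{n=1}^{N} p(\mathbf{t}^n, \mathbf{x}^n)
  = \prod_{n=1}^{N} p(\mathbf{t}^n \mid \mathbf{x}^n)\, p(\mathbf{x}^n)
\qquad\Longrightarrow\qquad
E = -\ln L = -\sum_{n=1}^{N} \ln p(\mathbf{t}^n \mid \mathbf{x}^n)
             - \sum_{n=1}^{N} \ln p(\mathbf{x}^n)
```

The second sum does not depend on the network parameters and is dropped, so minimizing E amounts to fitting the conditional density p(t|x) modelled by the network.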

ERROR FUNCTIONS The nature of the target variable determines the error function: continuous (regression: prediction), continuous (classification: probability of class membership), discrete (classification: class membership).

Sum-of-squares error (OLS approach). Assumptions:
- c target variables t_k
- the distributions of the target variables are independent
- the distributions of the target variables are Gaussian
- the error ε_k ∼ N(0, σ²), where σ doesn't depend on x or on k
A sketch of the resulting derivation is given below.
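The slide's equations were lost in extraction; a standard reconstruction under the assumptions listed above (σ² denotes the common noise variance) is:

```latex
t_k = h_k(\mathbf{x}) + \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0,\sigma^2)
\quad\Rightarrow\quad
p(t_k \mid \mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{1/2}}
  \exp\!\left(-\frac{\{y_k(\mathbf{x};\mathbf{w}) - t_k\}^2}{2\sigma^2}\right)

-\ln L = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\sum_{k=1}^{c}
  \{y_k(\mathbf{x}^n;\mathbf{w}) - t_k^n\}^2 \;+\; Nc\ln\sigma \;+\; \text{const}
```

Minimizing the negative log-likelihood with respect to w is therefore equivalent to minimizing the sum-of-squares error E = ½ Σ_n Σ_k {y_k(x^n; w) - t_k^n}².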

Sum-of-squares error

w* minimizes E with respect to the weights; the negative log-likelihood, computed at w*, is then minimized with respect to σ. The optimum value of σ² is proportional to the residual value of the sum-of-squares error function at its minimum. Of course, the use of a sum-of-squares error doesn't require the target data to have a Gaussian distribution. However, if we use this error, then the results cannot distinguish between the true distribution and any other distribution having the same mean and variance.
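One way to make "proportional to the residual error" explicit (a standard result under the assumptions above, with notation as before):

```latex
\sigma^2 = \frac{1}{Nc}\sum_{n=1}^{N}\sum_{k=1}^{c}
           \{y_k(\mathbf{x}^n;\mathbf{w}^*) - t_k^n\}^2
         = \frac{2}{Nc}\, E(\mathbf{w}^*)
```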

Sum-of-squares error: training vs. validation. The training error is summed over all N patterns in the training set; the normalized test error is summed over all N' patterns in the test set. E = 0: perfect prediction of the test data. E = 1: the network is predicting the test data in the mean.
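The formulas on this slide were lost in extraction; below is a minimal sketch of one common normalized test error consistent with the E = 0 / E = 1 interpretation above (function and variable names are illustrative):

```python
import numpy as np

def normalized_error(y_pred, t):
    """Normalized sum-of-squares test error, consistent with the slide:
    E = 0 -> perfect prediction of the test targets,
    E = 1 -> the network is only predicting the test-set mean."""
    t = np.asarray(t, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = np.sum((y_pred - t) ** 2)          # network prediction error
    baseline = np.sum((t - t.mean(axis=0)) ** 2)  # error of the mean predictor
    return residual / baseline

# Predicting the mean of the test targets gives E = 1
t = np.array([[1.0], [2.0], [3.0], [4.0]])
print(normalized_error(np.full_like(t, t.mean()), t))  # -> 1.0
```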

linear output units (diagram on the original slide: an MLP or RBF network with M hidden units feeding the linear output units)

linear output units: for fixed hidden-unit activations the final-layer weights can be obtained in closed form by least squares, solved via the pseudo-inverse (SVD).
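A minimal sketch of this closed-form step, assuming the hidden-unit activations are already fixed; all names and shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Z holds the final hidden-layer activations (plus a bias
# column) for N training patterns, T holds the corresponding target vectors.
N, M, c = 200, 5, 2
Z = np.hstack([rng.standard_normal((N, M)), np.ones((N, 1))])  # N x (M+1)
T = rng.standard_normal((N, c))                                # N x c

# Final-layer weights by ordinary least squares; numpy's lstsq uses an
# SVD-based solver, matching the slide's mention of the SVD.
W, *_ = np.linalg.lstsq(Z, T, rcond=None)

# Equivalent closed form via the Moore-Penrose pseudo-inverse of Z.
W_pinv = np.linalg.pinv(Z) @ T
print(np.allclose(W, W_pinv))  # True
```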

linear output units: the final-layer (linear) weights are fast to determine, while the hidden-layer (nonlinear) weights are slow, requiring iterative optimization. Treating the linear weights exactly gives:
- a reduction of the number of iterations (smaller search space)
- a greater cost per iteration

linear output units: suppose the TS target patterns satisfy an exact linear relation. If the final-layer weights are determined by OLS, then the outputs will satisfy the same linear constraint for arbitrary input patterns.

Interpretation of network outputs: for a network trained by minimizing a sum-of-squares error function, the outputs approximate the conditional averages of the target data. Consider the limit in which the size N of the TS goes to infinity.

Interpretation of network outputs

regression of t_k conditioned on x

KEY ASSUMPTIONS
- the TS must be sufficiently large that it approximates an infinite TS
- the network function must be sufficiently general (weights exist that attain the minimum)
- the training is carried out in such a way as to find the appropriate minimum of the cost
This result doesn't depend on the choice of network architecture, or even on whether a neural network is used at all. However, ANNs provide a framework for approximating arbitrary nonlinear multivariate mappings and can therefore, in principle, approximate the conditional average to arbitrary accuracy.

Interpretation of network outputs (zero-mean)

Interpretation of network outputs: example

Interpretation of network outputs: example. ε is a random variable drawn from a uniform distribution in the range (-0.1, 0.1); an MLP is trained with the sum-of-squares error.
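A hedged sketch of an experiment in the same spirit; the underlying mapping used on the original slide is not reproduced here, so an illustrative sine curve stands in, with the stated uniform noise in (-0.1, 0.1):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Illustrative data generator (not the slide's own function): a smooth
# function of x plus noise eps drawn uniformly from (-0.1, 0.1).
N = 300
x = rng.uniform(0.0, 1.0, size=(N, 1))
eps = rng.uniform(-0.1, 0.1, size=N)
t = np.sin(2 * np.pi * x[:, 0]) + eps

# MLP trained with a sum-of-squares error: its output should approximate the
# conditional average <t|x>, i.e. the noise-free underlying curve.
net = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh',
                   solver='lbfgs', max_iter=5000, random_state=0)
net.fit(x, t)

x_test = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
print(np.c_[net.predict(x_test), np.sin(2 * np.pi * x_test[:, 0])])
```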

Interpretation of network outputs: the sum-of-squares error function cannot distinguish between the true distribution and a Gaussian distribution having the same x-dependent mean and average variance.

ERROR FUNCTIONS part two Goal for REGRESSION: to model the conditional distribution of the output variables, conditioned on the input variables. Goal for CLASSIFICATION: to model the posterior probabilities of class membership, conditioned on the input variables.

We can exploit a number of results: minimum error-rate decisions. Note that the network outputs need not be close to 0 or 1 if the class-conditional density functions are overlapping.

We can exploit a number of results: minimum error-rate decisions; outputs sum to 1 (this constraint can be enforced explicitly as part of the choice of network structure). The average of each output over all patterns in the TS should approximate the corresponding prior class probability. These estimated priors can be compared with the sample estimates of the priors obtained from the fractions of patterns in each class within the TS; differences are an indication that the network is not modelling the posterior probabilities accurately.

We can exploit a number of results: minimum error-rate decisions; outputs sum to 1; compensating for different priors. Case: the priors expected when the network is in use differ from those represented by the TS (subscript 1 = TS, subscript 2 = use). The network outputs are corrected using the new priors and a normalization factor, so changes in priors can be accommodated without retraining (see the sketch below).
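A minimal sketch of the prior-compensation step described above: the network posteriors, estimated under the TS priors (subscript 1), are rescaled by the new priors (subscript 2) and renormalized. Function and argument names are illustrative:

```python
import numpy as np

def adjust_for_new_priors(posteriors, priors_ts, priors_use):
    """Rescale network posteriors (estimated under the training-set priors)
    to the priors expected when the network is in use, then renormalize.
    `posteriors` has shape (..., c)."""
    posteriors = np.asarray(posteriors, dtype=float)
    scaled = posteriors * (np.asarray(priors_use) / np.asarray(priors_ts))
    return scaled / scaled.sum(axis=-1, keepdims=True)

# Example: the TS was balanced, but class 0 is much rarer in use.
print(adjust_for_new_priors([0.7, 0.3], priors_ts=[0.5, 0.5],
                            priors_use=[0.1, 0.9]))
```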

sum-of-squares for classification: every input vector in the TS is labelled by its class membership, represented by a set of target values t_k^n with a 1-of-c coding; the target is a discrete RV.


sum-of-squares for classification:
- if the outputs represent probabilities, they should lie in the range (0,1) and should sum to 1
- in the case of a 1-of-c coding scheme, the target values sum to unity for each pattern, and so the outputs will satisfy the same constraint
- there is no guarantee that the outputs lie in the range (0,1)
- for a network with linear output units and s-o-s error, if the target values satisfy a linear constraint, then the outputs will satisfy the same constraint for an arbitrary input
The s-o-s error is not the most appropriate for classification, because it is derived from ML on the assumption of Gaussian-distributed target data.

sum-of-squares for classification: two-class problem. Either use 1-of-c coding with two output units, or an alternative approach with a single output.

Interpretation of hidden units (linear output units): total covariance matrix of the activations at the output of the final hidden layer, computed over the TS.

Interpretation of hidden units (linear output units): between-class covariance matrix.
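A sketch of the two covariance matrices named on these slides, computed from the final hidden-layer activations; the exact normalization on the original slides (sums vs. averages) is not recoverable from the transcript, so averages are assumed:

```python
import numpy as np

def covariance_matrices(Z, labels):
    """Total and between-class covariance matrices of the final hidden-layer
    activations Z (shape N x M), with `labels` giving each pattern's class."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    z_bar = Z.mean(axis=0)
    S_T = (Z - z_bar).T @ (Z - z_bar) / len(Z)           # total covariance

    S_B = np.zeros((Z.shape[1], Z.shape[1]))
    for k in np.unique(labels):
        Z_k = Z[labels == k]
        d = (Z_k.mean(axis=0) - z_bar)[:, None]
        S_B += (len(Z_k) / len(Z)) * (d @ d.T)           # between-class part
    return S_T, S_B
```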

Interpretation of hidden units (linear output units): minimizing the sum-of-squares error corresponds to maximizing a discriminant criterion. Nothing here is specific to the MLP, or indeed to ANNs: the same result is obtained regardless of the functions z_j (of the weights), and applies to any generalized linear discriminant in which the kernels are adaptive.

Interpretation of hidden units (linear output units): the weights in the final layer are adjusted to produce an optimum discrimination of the classes of input vectors by means of a linear transformation. Minimizing the error of this linear discriminant requires that the input data undergo a nonlinear transformation into the space spanned by the activations of the hidden units, in such a way as to maximize the discriminant function J.

Interpretation of hidden units (linear output units): with 1-of-c targets, the criterion gives a strong weighting of the feature extraction in favour of classes with larger numbers of patterns.

Cross-entropy for two classes: a single output is wanted, together with a target coding scheme and the cross-entropy error function shown below (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990).
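The error function itself did not survive extraction; the standard two-class cross-entropy, with the usual coding t^n = 1 for class C_1 and t^n = 0 for class C_2 and y^n = y(x^n; w), is:

```latex
E = -\sum_{n=1}^{N} \left\{ t^n \ln y^n + (1 - t^n)\ln(1 - y^n) \right\}
```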

Cross-entropy for two classes: absolute minimum of the error; logistic activation function for the output; BP training. Natural pairing:
- sum-of-squares + linear output units
- cross-entropy + logistic output unit
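A small numerical check of the natural pairing: with a logistic output unit and the cross-entropy above, the derivative of the per-pattern error with respect to the output activation a reduces to y - t, the same form obtained with sum-of-squares and a linear output. The values below are illustrative:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

a, t = 0.3, 1.0
y = logistic(a)

# Per-pattern cross-entropy as a function of the output activation a.
E = lambda a: -(t * np.log(logistic(a)) + (1 - t) * np.log(1 - logistic(a)))

# Central-difference estimate of dE/da versus the analytic form y - t.
eps = 1e-6
numeric = (E(a + eps) - E(a - eps)) / (2 * eps)
print(numeric, y - t)   # both approximately -0.4256
```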

Cross-entropy for two classes: with the 0/1 target coding the error at its absolute minimum equals 0; the minimum value doesn't vanish when t^n is continuous in the range (0,1), representing the probability of the input x^n belonging to class C_1.

Cross-entropy for two classes: example. Class-conditional pdfs used to generate the TS (equal priors); dashed curve = Bayes. The network is an MLP with one input unit, five hidden units (tanh) and one output unit (logistic), trained with the cross-entropy error by BFGS.

Sigmoid activation functions: single-layer and multi-layer cases; exponential family of distributions (e.g. Gaussian, binomial, Bernoulli, Poisson); hidden unit and output. The network output is given by a logistic sigmoid activation function acting on a weighted linear combination of the outputs of those hidden units which send connections to the output unit. Extension to the hidden units: provided such units use logistic sigmoids, their outputs can be interpreted as probabilities of the presence of the corresponding features, conditioned on the inputs to the units.

Properties of the cross-entropy error:
- the error function depends on the relative errors of the outputs (its minimization tends to result in similar relative errors on both small and large targets)
- the s-o-s error function depends on the absolute errors (its minimization tends to result in similar absolute errors for each pattern)
- the cross-entropy error function therefore performs better than s-o-s at estimating small probabilities

Properties of the cross-entropy error (0/1 target coding scheme): the resulting error behaves like a Manhattan error function; compared with s-o-s it gives much stronger weight to smaller errors and is better for incorrectly labelled data.

Justification of the cross-entropy error: in the infinite-data limit, set the functional derivative of E w.r.t. y(x) to zero; as for s-o-s, the output of the network approximates the conditional average of the target data for the given input.
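A sketch of the calculation referred to on this slide, in the infinite-data limit, where ⟨t|x⟩ denotes the conditional average of the target:

```latex
E = -\int\!\!\int \left\{ t \ln y(\mathbf{x}) + (1 - t)\ln\bigl(1 - y(\mathbf{x})\bigr) \right\}
      p(t \mid \mathbf{x})\, p(\mathbf{x})\, dt\, d\mathbf{x},
\qquad
\frac{\delta E}{\delta y(\mathbf{x})} = 0
\;\Rightarrow\;
y(\mathbf{x}) = \langle t \mid \mathbf{x} \rangle
```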

justification of the cross-entropy error target coding scheme

Multiple independent attributes: determine the probabilities of the presence or absence of a number of attributes (which need not be mutually exclusive), given an input x. Assumption: independent attributes, with multiple outputs, where y_k represents the probability that the k-th attribute is present. With this choice of error function, the outputs should each have a logistic sigmoid activation function.
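The corresponding error function, not preserved in the transcript, is the sum of independent two-class cross-entropies over the c attributes, obtained from p(t|x) = Π_k y_k^{t_k} (1 - y_k)^{1 - t_k}:

```latex
E = -\sum_{n=1}^{N}\sum_{k=1}^{c}
    \left\{ t_k^n \ln y_k^n + (1 - t_k^n)\ln(1 - y_k^n) \right\}
```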

Multiple independent attributes. HOMEWORK: show that the entropy measure E, derived for targets t_k = 0, 1, applies also in the case where the targets are probabilities with values in (0,1). Do this by considering an extended data set in which each pattern t_k^n is replaced by a set of M patterns, of which a fraction t_k^n (i.e. M·t_k^n patterns) is set to 1 and the remainder is set to 0, and then applying E to this extended TS.

Cross-entropy for multiple classes (mutually exclusive classes), with one output y_k for each class and a 1-of-c coding. The probability of observing the set of target values t_k^n = δ_kl, given an input vector x^n belonging to class C_l, is just Π_k (y_k^n)^{t_k^n}. The {y_k} are not independent, as a result of the constraint Σ_k y_k = 1. The absolute minimum w.r.t. {y_k^n} occurs when y_k^n = t_k^n for all k, n. For discrete targets, the error relative to its minimum takes the form of a Kullback-Leibler distance.
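The multi-class cross-entropy and its Kullback-Leibler form, reconstructed in standard notation since the slide's own equations were lost:

```latex
E = -\sum_{n=1}^{N}\sum_{k=1}^{c} t_k^n \ln y_k^n,
\qquad
E - E_{\min} = -\sum_{n=1}^{N}\sum_{k=1}^{c} t_k^n \ln\!\frac{y_k^n}{t_k^n}
```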

Cross-entropy for multiple classes: if the output values are to be interpreted as probabilities, they must lie in the range (0,1) and sum to unity. This is achieved with the normalized exponential (softmax) activation function, a generalization of the logistic sigmoid.
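A minimal sketch of the normalized exponential (softmax); subtracting the maximum activation is a standard numerical-stability step rather than something shown on the slide:

```python
import numpy as np

def softmax(a):
    """Normalized exponential (softmax) over the last axis; subtracting the
    maximum avoids overflow and does not change the result."""
    a = np.asarray(a, dtype=float)
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

y = softmax([2.0, 1.0, 0.1])
print(y, y.sum())   # outputs lie in (0,1) and sum to 1
```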

As with the logistic sigmoid, we can give a general motivation for the softmax by considering the posterior probability that a hidden-unit activation z belongs to class C_k. The outputs can be interpreted as probabilities of class membership, conditioned on the outputs of the hidden units.

Cross-entropy for multiple classes: BP training, with the error derivative taken over the inputs (activations) to all output units. Natural pairing:
- sum-of-squares + linear output units
- 2-class cross-entropy + logistic output unit
- c-class cross-entropy + softmax output units
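For completeness, the derivative that makes BP simple under this pairing (a standard result; a_k denotes the activation feeding output unit k):

```latex
y_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad
\frac{\partial E^n}{\partial a_k} = y_k^n - t_k^n
```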

Consider the cross-entropy error function for multiple classes, together with a network whose outputs are given by a softmax activation function, in the limit of an infinite data set. Show that the network output functions y_k(x) which minimize the error are given by the conditional averages of the target data. Hint: since the outputs are not independent, consider the functional derivative w.r.t. a_k(x) instead.