Biointelligence Laboratory, Seoul National University


Ch 5. Neural Networks (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Summarized by M.-O. Heo
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Contents
5.1 Feed-forward Network Functions
  5.1.1 Weight-space symmetries
5.2 Network Training
  5.2.1 Parameter optimization
  5.2.2 Local quadratic approximation
  5.2.3 Use of gradient information
  5.2.4 Gradient descent optimization
5.3 Error Backpropagation
  5.3.1 Evaluation of error-function derivatives
  5.3.2 A simple example
  5.3.3 Efficiency of backpropagation
  5.3.4 The Jacobian matrix
5.4 The Hessian Matrix
  5.4.1 Diagonal approximation
  5.4.2 Outer product approximation
  5.4.3 Inverse Hessian
  5.4.4 Finite differences
  5.4.5 Exact evaluation of the Hessian
  5.4.6 Fast multiplication by the Hessian

Feed-forward Network Functions
Goal: extend the linear model by making the basis functions depend on parameters, and allow these parameters to be adjusted along with the coefficients.
Forward propagation of information through the network:
- First-layer activations: a_j = Σ_i w_ji^(1) x_i + w_j0^(1), where the w_ji^(1) are weights and the w_j0^(1) are biases.
- Hidden units: z_j = h(a_j), with h(.) a differentiable nonlinear activation function.
- Output activations: a_k = Σ_j w_kj^(2) z_j + w_k0^(2), passed through an output activation function y_k = σ(a_k).
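A minimal NumPy sketch of this forward propagation for a two-layer network. The array names, the layer sizes, and the choice of tanh hidden units with sigmoid outputs are illustrative assumptions, not taken from the slides:

import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward-propagate one input vector through a two-layer network.
    x: (D,) input; W1: (M, D), b1: (M,) first layer; W2: (K, M), b2: (K,) second layer."""
    a1 = W1 @ x + b1                 # first-layer activations a_j
    z = np.tanh(a1)                  # hidden-unit outputs z_j = h(a_j)
    a2 = W2 @ z + b2                 # output activations a_k
    y = 1.0 / (1.0 + np.exp(-a2))    # output nonlinearity sigma (logistic sigmoid here)
    return y

# toy usage: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(rng.normal(size=3), W1, b1, W2, b2))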

Feed-forward Network Functions (Cont'd)
A key difference between the multilayer perceptron (MLP) and the perceptron:
- MLP: continuous sigmoidal nonlinearities in the hidden units
- Perceptron: step-function nonlinearities
Skip-layer connections
- A network with sigmoidal hidden units can always mimic skip-layer connections by using a sufficiently small first-layer weight (so that, over its operating range, the hidden unit is effectively linear) and then compensating with a large weight value from the hidden unit to the output.

Feed-forward Network Functions (Cont'd)
The approximation properties of feed-forward networks
- Universal approximation: a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided the network has a sufficiently large number of hidden units.
- Example: a two-layer network with 3 hidden units (tanh activation functions) and a linear output, fitted to N = 50 data points.
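A hedged sketch of the kind of experiment just described: fitting a two-layer network with 3 tanh hidden units and a linear output to N = 50 samples. The target function f(x) = x^2, the use of scipy.optimize.minimize, and all variable names are my own illustrative choices:

import numpy as np
from scipy.optimize import minimize

N, M = 50, 3                            # 50 data points, 3 tanh hidden units
x = np.linspace(-1, 1, N)
t = x ** 2                              # target function to approximate

def unpack(w):
    # flat parameter vector -> (first-layer weights, biases, second-layer weights, bias)
    return w[:M], w[M:2*M], w[2*M:3*M], w[3*M]

def predict(w, x):
    w1, b1, w2, b2 = unpack(w)
    z = np.tanh(np.outer(x, w1) + b1)   # hidden-unit outputs, shape (N, M)
    return z @ w2 + b2                  # linear output

def sse(w):
    return 0.5 * np.sum((predict(w, x) - t) ** 2)   # sum-of-squares error

w0 = np.random.default_rng(0).normal(size=3*M + 1)
res = minimize(sse, w0, method="L-BFGS-B")
print("final error:", sse(res.x))       # typically a small residual: 3 tanh units suffice here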

Feed-forward Network Functions - Weight-space symmetries
Sign-flip symmetry
- If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of that hidden unit is reversed.
- This can be compensated by also changing the sign of all of the weights leading out of that hidden unit.
- For M hidden units with tanh activations, since tanh(-a) = -tanh(a), there are M such sign-flip symmetries: any given weight vector is one of a set of 2^M equivalent weight vectors.

Feed-forward Network Functions - Weight-space symmetries (Cont'd)
Interchange symmetry
- We can interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values for a different hidden unit. This clearly leaves the network input-output mapping function unchanged.
- For M hidden units, any given weight vector therefore belongs to a set of M! equivalent weight vectors; combined with the sign-flip symmetry, the overall weight-space symmetry factor is M! 2^M. A numerical check of both symmetries is sketched below.
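Both symmetries can be verified numerically. The sketch below (network sizes and names are illustrative) flips the sign of all weights into and out of one tanh hidden unit, and separately swaps two hidden units, and checks that the network outputs are unchanged:

import numpy as np

def forward(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)       # tanh hidden units
    return W2 @ z + b2             # linear outputs

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)
y = forward(x, W1, b1, W2, b2)

# sign-flip symmetry: negate everything feeding into hidden unit 0 and everything leading out of it
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[0], b1f[0], W2f[:, 0] = -W1[0], -b1[0], -W2[:, 0]
assert np.allclose(y, forward(x, W1f, b1f, W2f, b2))

# interchange symmetry: swap hidden units 0 and 1 (rows of W1/b1, columns of W2)
perm = [1, 0, 2, 3]
assert np.allclose(y, forward(x, W1[perm], b1[perm], W2[:, perm], b2))
print("both symmetries leave the outputs unchanged")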

Network Training
Assume the target t has a Gaussian distribution with an x-dependent mean: p(t | x, w) = N(t | y(x, w), β⁻¹).
Likelihood function for i.i.d. observations: p(t | X, w, β) = Π_n N(t_n | y(x_n, w), β⁻¹).
Negative log-likelihood: (β/2) Σ_n {y(x_n, w) - t_n}² - (N/2) ln β + (N/2) ln(2π).
Maximizing the likelihood function is therefore equivalent to minimizing the sum-of-squares error function E(w) = (1/2) Σ_n {y(x_n, w) - t_n}².

Network Training (Cont'd)
The choice of output-unit activation function and matching error function:
- Standard regression problems: sum-of-squares error (the negative log-likelihood under a Gaussian), identity output activation.
- Multiple independent binary classification problems: cross-entropy error function, logistic-sigmoid output activation.
- Multiclass problems: multiclass cross-entropy error function, softmax output activation.
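A small sketch of the three error functions written directly from their standard definitions; the variable names and the toy values are illustrative:

import numpy as np

def sum_of_squares(y, t):
    # regression: identity outputs, E = 1/2 * sum_n (y_n - t_n)^2
    return 0.5 * np.sum((y - t) ** 2)

def binary_cross_entropy(y, t):
    # binary classification: logistic-sigmoid outputs y in (0, 1), targets t in {0, 1}
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(a, t_onehot):
    # multiclass: softmax over output activations a, one-hot targets
    y = np.exp(a - a.max(axis=1, keepdims=True))
    y /= y.sum(axis=1, keepdims=True)
    return -np.sum(t_onehot * np.log(y))

y, t = np.array([0.8, 0.2]), np.array([1.0, 0.0])
print(sum_of_squares(y, t), binary_cross_entropy(y, t))
print(multiclass_cross_entropy(np.array([[2.0, 1.0, 0.1]]), np.array([[1, 0, 0]])))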

Network Training - Parameter optimization
The error E(w) is a smooth continuous function of w, so its smallest value occurs at a point in weight space where the gradient vanishes.
- Global minimum
- Local minima
Most techniques involve choosing some initial value for the weight vector and then moving through weight space in a succession of steps of the form w^(τ+1) = w^(τ) + Δw^(τ).
Many algorithms make use of gradient information.

Network Training - Local quadratic approximation
Taylor expansion of E(w) around some point ŵ in weight space:
E(w) ≈ E(ŵ) + (w - ŵ)ᵀ b + (1/2)(w - ŵ)ᵀ H (w - ŵ),
where b = ∇E evaluated at ŵ (gradient) and H = ∇∇E evaluated at ŵ (Hessian matrix).
Local approximation to the gradient: ∇E ≈ b + H(w - ŵ).
In the particular case of a local quadratic approximation around a minimum w*, the gradient term vanishes:
E(w) ≈ E(w*) + (1/2)(w - w*)ᵀ H (w - w*).
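A numerical illustration of the quadratic approximation on a toy error surface; the function, the expansion point, and the use of finite differences for b and H are arbitrary choices of mine for illustration:

import numpy as np

def E(w):                      # toy non-quadratic "error function" of two weights
    return np.sin(w[0]) + w[0] * w[1] ** 2 + 0.1 * w[1] ** 4

def grad_hess(E, w, eps=1e-4):
    # gradient and Hessian by central differences
    n = len(w)
    b, H = np.zeros(n), np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        b[i] = (E(w + e) - E(w - e)) / (2 * eps)
        for j in range(n):
            f = np.zeros(n); f[j] = eps
            H[i, j] = (E(w + e + f) - E(w + e - f) - E(w - e + f) + E(w - e - f)) / (4 * eps ** 2)
    return b, H

w_hat = np.array([0.3, -0.7])
b, H = grad_hess(E, w_hat)
dw = np.array([0.05, -0.02])                      # small displacement w - w_hat
E_quad = E(w_hat) + b @ dw + 0.5 * dw @ H @ dw    # local quadratic model
print(E(w_hat + dw), E_quad)                      # nearly equal for small displacements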

Network Training - Use of gradient information
Without gradient information, the computational cost of finding the minimum of the quadratic approximation is O(W³), where W is the dimensionality of w: it requires O(W²) function evaluations, each of which costs O(W) steps.
In an algorithm that makes use of gradient information, the cost falls to O(W²): using error backpropagation, O(W) gradient evaluations are needed, and each such evaluation takes only O(W) steps.

Network Training - Gradient descent optimization
The simplest approach: a weight update comprising a small step in the direction of the negative gradient, w^(τ+1) = w^(τ) - η ∇E(w^(τ)), where η > 0 is the learning rate.
Batch methods
- Techniques that use the whole data set at once.
- For a sufficiently small step size, the error function decreases at each iteration unless the weight vector has arrived at a local or global minimum.
On-line version of gradient descent
- Sequential gradient descent or stochastic gradient descent: the error function is a sum over data points, E(w) = Σ_n E_n(w), and the update is based on one data point at a time, w^(τ+1) = w^(τ) - η ∇E_n(w^(τ)).
- Handles redundancy in the data much more efficiently, and offers the possibility of escaping from local minima. A toy comparison is sketched below.
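A minimal sketch contrasting the batch and stochastic updates on a toy sum-of-squares problem. A linear model is used purely to keep the gradient short; the data, learning rate, and iteration counts are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 data points, 3 features
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
eta = 0.005                                      # learning rate

def grad_n(w, n):                                # gradient of E_n = 1/2 (y_n - t_n)^2
    return (X[n] @ w - t[n]) * X[n]

# batch gradient descent: use the whole data set for each step
w = np.zeros(3)
for step in range(200):
    w -= eta * sum(grad_n(w, n) for n in range(len(t)))

# stochastic (sequential) gradient descent: one data point per update
w_sgd = np.zeros(3)
for epoch in range(20):
    for n in rng.permutation(len(t)):
        w_sgd -= eta * grad_n(w_sgd, n)

print(w, w_sgd)                                  # both approach the true weights [1, -2, 0.5]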

Error Backpropagation - Evaluation of error-function derivatives
1. Apply an input vector x_n to the network and forward propagate through the network, using a_j = Σ_i w_ji z_i and z_j = h(a_j), to find the activations of all the hidden and output units.
2. Evaluate δ_k = y_k - t_k for all the output units.
3. Backpropagate the δ's using δ_j = h'(a_j) Σ_k w_kj δ_k to obtain δ_j for each hidden unit in the network.
4. Use ∂E_n/∂w_ji = δ_j z_i to evaluate the required derivatives.
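A direct transcription of the four steps for a two-layer network with tanh hidden units, linear outputs, and a sum-of-squares error (the setting of the simple example on the next slide); array names and sizes are illustrative:

import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # 1. forward propagate
    a1 = W1 @ x + b1
    z = np.tanh(a1)                      # hidden-unit outputs
    y = W2 @ z + b2                      # linear output units
    # 2. output-unit errors: delta_k = y_k - t_k
    delta_k = y - t
    # 3. backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    delta_j = (1 - z ** 2) * (W2.T @ delta_k)
    # 4. required derivatives: dE/dw_kj = delta_k z_j, dE/dw_ji = delta_j x_i
    dW2 = np.outer(delta_k, z)
    dW1 = np.outer(delta_j, x)
    return dW1, delta_j, dW2, delta_k    # the bias gradients are just delta_j and delta_k

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
dW1, db1, dW2, db2 = backprop(rng.normal(size=3), rng.normal(size=2), W1, b1, W2, b2)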

Error Backpropagation - A simple example
A two-layer network with tanh hidden units, h(a) = tanh(a) so that h'(a) = 1 - h(a)², linear output units y_k = a_k, and a sum-of-squares error E_n = (1/2) Σ_k (y_k - t_k)² (y_k: output unit k, t_k: the corresponding target).
- Output errors: δ_k = y_k - t_k
- Hidden-unit errors: δ_j = (1 - z_j²) Σ_k w_kj δ_k
- Derivatives: ∂E_n/∂w_ji^(1) = δ_j x_i and ∂E_n/∂w_kj^(2) = δ_k z_j

Error Backpropagation - Efficiency of backpropagation
Computational efficiency of backpropagation: O(W).
An alternative approach for computing derivatives of the error function: finite differences,
∂E_n/∂w_ji = {E_n(w_ji + ε) - E_n(w_ji)} / ε + O(ε).
- The accuracy of the approximation can be improved by making ε smaller, until numerical round-off limits it.
- Accuracy is further improved by using symmetrical central differences, ∂E_n/∂w_ji = {E_n(w_ji + ε) - E_n(w_ji - ε)} / (2ε) + O(ε²).
- Each of the W derivatives requires a separate O(W) forward propagation, so the overall cost is O(W²).
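The central-difference check takes only a few lines. The sketch below compares a numerical derivative of a toy error function with its analytic gradient (which stands in for backpropagation); the function and ε are arbitrary choices:

import numpy as np

def E(w):                                    # toy error function of W weights
    return 0.5 * np.sum(w ** 2) + np.sum(np.sin(w))

def grad_E(w):                               # analytic gradient, standing in for backprop
    return w + np.cos(w)

w = np.random.default_rng(0).normal(size=5)
eps = 1e-6
num_grad = np.zeros_like(w)
for i in range(len(w)):                      # one O(W) evaluation pair per weight -> O(W^2) total
    e = np.zeros_like(w); e[i] = eps
    num_grad[i] = (E(w + e) - E(w - e)) / (2 * eps)   # symmetric central differences, error O(eps^2)

print(np.max(np.abs(num_grad - grad_E(w))))  # agreement up to numerical round-off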

Error Backpropagation - The Jacobian matrix
The technique of backpropagation can also be applied to the calculation of other derivatives.
The Jacobian matrix: its elements are the derivatives of the network outputs with respect to the inputs, J_ki = ∂y_k/∂x_i.
Useful, for example, when a trained network forms one module in a larger system and an error function E must be minimized with respect to a parameter w of that system: the Jacobian lets the error derivatives be propagated back through the module, ∂E/∂w = Σ_{k,j} (∂E/∂y_k)(∂y_k/∂z_j)(∂z_j/∂w).

Error Backpropagation - The Jacobian matrix (Cont'd)
The Jacobian gives a measure of the local sensitivity of the outputs to changes in each of the input variables: Δy_k ≈ Σ_i J_ki Δx_i.
- In general the network mapping is nonlinear, so the elements J_ki are not constants; the relation above is valid provided the |Δx_i| are small.
Evaluation of the Jacobian matrix: J_ki = ∂y_k/∂x_i = Σ_j w_ji^(1) ∂y_k/∂a_j, where ∂y_k/∂a_j is found by a backward recursion through the network (for a sigmoidal output activation function the recursion starts from ∂y_k/∂a_l = δ_kl σ'(a_l); for linear outputs it starts from δ_kl).
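A sketch evaluating the Jacobian J_ki = ∂y_k/∂x_i of a small two-layer network. Here it is computed by central differences over the inputs and checked against the closed form for this particular architecture; the backpropagation recursion above gives the same result more efficiently, and all names and sizes are illustrative:

import numpy as np

def forward(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)
    return W2 @ z + b2                               # linear outputs y_k

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# numerical Jacobian: J[k, i] = dy_k / dx_i
eps = 1e-6
J = np.zeros((2, 3))
for i in range(3):
    e = np.zeros(3); e[i] = eps
    J[:, i] = (forward(x + e, W1, b1, W2, b2) - forward(x - e, W1, b1, W2, b2)) / (2 * eps)

# analytic Jacobian for this tanh/linear architecture: J = W2 diag(h'(a)) W1
a = W1 @ x + b1
J_exact = W2 @ np.diag(1 - np.tanh(a) ** 2) @ W1
print(np.max(np.abs(J - J_exact)))                   # small: the two agree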

The Hessian Matrix
The Hessian matrix H = ∇∇E: the second derivatives of the error form its elements, H_ij = ∂²E/∂w_i ∂w_j.
Important roles:
- Describes the second-order properties of the error surface, used by several nonlinear optimization algorithms.
- Forms the basis of a fast procedure for re-training a feed-forward network following a small change in the training data.
- Its inverse is used to identify the least significant weights in a network (pruning algorithms).
- Appears in the Laplace approximation for a Bayesian neural network.
Various approximation schemes are used to evaluate the Hessian matrix for a neural network.

The Hessian Matrix - Diagonal approximation
The inverse of the Hessian is often what is needed, and the inverse of a diagonal matrix is trivial to evaluate, so the off-diagonal elements are replaced with zeros.
Consider an error function that is a sum of terms, one per pattern: E = Σ_n E_n.
The diagonal elements of the Hessian are ∂²E_n/∂w_ji² = (∂²E_n/∂a_j²) z_i², where, neglecting off-diagonal terms in the second-derivative recursion,
∂²E_n/∂a_j² ≈ h'(a_j)² Σ_k w_kj² ∂²E_n/∂a_k² + h''(a_j) Σ_k w_kj ∂E_n/∂a_k.

The Hessian Matrix - Outer product approximation (Levenberg-Marquardt)
For a sum-of-squares error E = (1/2) Σ_n (y_n - t_n)², the Hessian is
H = ∇∇E = Σ_n ∇y_n (∇y_n)ᵀ + Σ_n (y_n - t_n) ∇∇y_n.
- If the outputs y_n happen to be very close to the target values t_n, the second term is small and can be neglected.
- Alternatively, if the residuals y_n - t_n are uncorrelated with the second-derivative term, the whole second term averages to zero.
This gives the outer-product (Levenberg-Marquardt) approximation H ≈ Σ_n b_n b_nᵀ, with b_n = ∇y_n (equal to ∇a_n for linear output units).
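A sketch of the outer-product approximation for a toy model with a sum-of-squares error. The model, the data, and the comparison against a finite-difference Hessian are illustrative choices of mine; the gradients b_n are computed analytically to keep the example short:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
w_true = np.array([1.5, 2.0])

def y(w, x):                      # toy nonlinear model y(x, w) = w0 * tanh(w1 * x)
    return w[0] * np.tanh(w[1] * x)

t = y(w_true, x) + 0.01 * rng.normal(size=50)      # targets close to the model outputs

def grad_y(w, x):                 # b_n = grad_w y(x_n, w), one column per data point
    s = np.tanh(w[1] * x)
    return np.vstack([s, w[0] * x * (1 - s ** 2)])

w = w_true                                          # evaluate near the solution: small residuals
B = grad_y(w, x)                                    # shape (2, N)
H_outer = B @ B.T                                   # outer-product approximation sum_n b_n b_n^T

# exact Hessian of E(w) = 1/2 sum_n (y_n - t_n)^2 by central differences, for comparison
def E(w): return 0.5 * np.sum((y(w, x) - t) ** 2)
eps = 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei = np.zeros(2); ei[i] = eps
        ej = np.zeros(2); ej[j] = eps
        H[i, j] = (E(w + ei + ej) - E(w + ei - ej) - E(w - ei + ej) + E(w - ei - ej)) / (4 * eps ** 2)

print(H_outer)
print(H)                                            # close, because the residuals y - t are small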

The Hessian Matrix - Inverse Hessian
A procedure for approximating the inverse of the Hessian:
- First, write the outer-product approximation as H_N = Σ_{n=1}^{N} b_n b_nᵀ.
- Derive a sequential procedure for building up the Hessian by including data points one at a time: H_{L+1} = H_L + b_{L+1} b_{L+1}ᵀ.
- Applying the matrix-inversion (Sherman-Morrison) identity gives a sequential update for the inverse:
H_{L+1}⁻¹ = H_L⁻¹ - (H_L⁻¹ b_{L+1} b_{L+1}ᵀ H_L⁻¹) / (1 + b_{L+1}ᵀ H_L⁻¹ b_{L+1}).
- The initial matrix is H_0 = αI, so the procedure actually approximates the inverse of H + αI; the results are not sensitive to the precise value of the small quantity α.
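A sketch of this sequential update under the outer-product approximation, using the Sherman-Morrison identity. The b_n vectors below are random stand-ins for the output gradients, and the dimensions and α are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
W, N, alpha = 5, 200, 1e-3                     # weight-space dimension, data points, regulariser
b = rng.normal(size=(N, W))                    # stand-ins for b_n = grad_w y(x_n, w)

H_inv = np.eye(W) / alpha                      # inverse of the initial matrix H_0 = alpha * I
for n in range(N):                             # include one data point at a time
    v = H_inv @ b[n]
    H_inv -= np.outer(v, v) / (1.0 + b[n] @ v) # Sherman-Morrison update for H_L + b b^T

# compare with directly inverting the regularised outer-product approximation
H_direct = alpha * np.eye(W) + b.T @ b
print(np.max(np.abs(H_inv - np.linalg.inv(H_direct))))   # agrees up to numerical round-off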

The Hessian Matrix - Finite differences
Applying a symmetrical central-differences formulation directly to the error function,
∂²E/∂w_ji ∂w_lk = (1/4ε²) {E(w_ji+ε, w_lk+ε) - E(w_ji+ε, w_lk-ε) - E(w_ji-ε, w_lk+ε) + E(w_ji-ε, w_lk-ε)} + O(ε²),
requires O(W³) operations to evaluate the complete Hessian, since there are W² elements and each evaluation costs O(W).
Applying central differences instead to the first derivatives of the error function (obtained by backpropagation),
∂²E/∂w_ji ∂w_lk = (1/2ε) {∂E/∂w_ji evaluated at w_lk+ε - ∂E/∂w_ji evaluated at w_lk-ε} + O(ε²),
reduces the cost to O(W²).
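The cheaper of the two schemes, central differences applied to first derivatives, in a few lines. The toy error function and its analytic gradient (standing in for backpropagation) are illustrative assumptions:

import numpy as np

def E(w):                                    # toy error function
    return np.sum(np.sin(w)) + 0.5 * (w @ w) ** 2

def grad_E(w):                               # analytic gradient, standing in for backprop
    return np.cos(w) + 2.0 * (w @ w) * w

w = np.random.default_rng(0).normal(size=4)
eps = 1e-5
H = np.zeros((len(w), len(w)))
for k in range(len(w)):                      # one gradient pair (O(W) each) per column -> O(W^2)
    e = np.zeros_like(w); e[k] = eps
    H[:, k] = (grad_E(w + e) - grad_E(w - e)) / (2 * eps)

# analytic Hessian of the toy function, for comparison
H_exact = np.diag(-np.sin(w)) + 2.0 * (w @ w) * np.eye(len(w)) + 4.0 * np.outer(w, w)
print(np.max(np.abs(H - H_exact)))           # small: the scheme is accurate to O(eps^2)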

The Hessian Matrix - Exact evaluation of the Hessian
The Hessian can also be evaluated exactly, using an extension of the backpropagation technique used to evaluate the first derivatives.
For a two-layer network the elements fall into three blocks:
- both weights in the second layer,
- both weights in the first layer,
- one weight in each layer.

The Hessian Matrix - Fast multiplication by the Hessian
In many applications the Hessian is needed only in the form of a product vᵀH (or Hv), which can be evaluated directly in O(W) operations.
- Note that vᵀH = vᵀ∇(∇E), where ∇ is the gradient operator in weight space.
- Introduce the notation R{·} to denote the operator vᵀ∇, so that R{w} = v.
- The implementation of this algorithm involves introducing additional R{·} variables for the hidden units and for the output units; for each input pattern, the values of these quantities can be found by forward and backward propagation equations analogous to standard backpropagation.
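The exact R{·} forward/backward pass is architecture-specific and is not reproduced here. As a quick, hedged alternative that captures the same idea of never forming H explicitly, the product Hv can also be approximated by a central difference of gradients, Hv ≈ (∇E(w + εv) - ∇E(w - εv)) / (2ε), at the cost of two O(W) gradient evaluations; the toy function below is my own illustration, not the slides' method:

import numpy as np

def E(w):                                    # toy error function
    return np.sum(np.sin(w)) + 0.5 * (w @ w) ** 2

def grad_E(w):                               # analytic gradient, standing in for backprop
    return np.cos(w) + 2.0 * (w @ w) * w

def hessian_vector_product(grad, w, v, eps=1e-5):
    # Hv ~ (grad(w + eps*v) - grad(w - eps*v)) / (2*eps): two gradient calls, no W x W matrix
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

rng = np.random.default_rng(0)
w, v = rng.normal(size=4), rng.normal(size=4)
Hv = hessian_vector_product(grad_E, w, v)

H_exact = np.diag(-np.sin(w)) + 2.0 * (w @ w) * np.eye(4) + 4.0 * np.outer(w, w)
print(np.max(np.abs(Hv - H_exact @ v)))      # small: matches the exact Hessian-vector product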