Slide 1: Radial Basis Function Networks
Computer Science, KAIST

Slide 2: Contents
- Introduction
- Architecture
- Designing
- Learning strategies
- MLP vs RBFN

Slide 3: Introduction
A completely different approach (in contrast with the MLP): the design of a neural network is viewed as a curve-fitting (approximation) problem in a high-dimensional space.

Slide 4: Introduction
[Figure: In MLP]

Slide 5: Introduction
[Figure: In RBFN]

Slide 6: Radial Basis Function Network (Introduction)
- A kind of supervised neural network.
- The design of the network is treated as a curve-fitting problem.
- Learning: find the surface in multidimensional space that best fits the training data.
- Generalization: use this multidimensional surface to interpolate the test data.

Slide 7: Radial Basis Function Network (Introduction)
Approximate the target function with a linear combination of radial basis functions:
  $f(x) = \sum_i w_i h_i(x)$
where $h_i(x)$ is most often a Gaussian function.

Slide 8: Architecture
[Figure: network diagram. Input layer $x_1, \dots, x_n$; hidden layer of basis functions $h_1, \dots, h_m$; weights $w_1, \dots, w_m$; output $f(x)$.]

Slide 9: Three layers (Architecture)
- Input layer: source nodes that connect the network to its environment.
- Hidden layer: hidden units provide a set of basis functions (high dimensionality).
- Output layer: a linear combination of the hidden functions.

Slide 10: Radial basis function (Architecture)
  $h_j(x) = \exp\!\left(-\frac{\|x - c_j\|^2}{r_j^2}\right)$
  $f(x) = \sum_{j=1}^{m} w_j h_j(x)$
where $c_j$ is the center of a region and $r_j$ is the width of the receptive field.
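A minimal sketch of this forward pass in Python (the function and variable names, e.g. rbf_forward, are illustrative and not from the slides; one width per hidden unit is assumed):

import numpy as np

def rbf_forward(x, centers, widths, weights):
    # Evaluate f(x) = sum_j w_j * exp(-||x - c_j||^2 / r_j^2)
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distance to each center c_j
    h = np.exp(-d2 / widths ** 2)             # hidden-layer activations h_j(x)
    return weights @ h                        # linear output layer

# Example: three Gaussian units in a 2-D input space
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
widths = np.array([0.5, 0.5, 0.5])
weights = np.array([1.0, -0.5, 0.25])
print(rbf_forward(np.array([0.2, 0.1]), centers, widths, weights))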

Slide 11: Designing
Requires:
- selection of the radial basis function width parameter
- the number of radial basis neurons

Slide 12: Selection of the RBF width parameter (Designing)
- Not required for an MLP.
- A smaller width: the network can alert on test data that fall outside the trained regions.
- A larger width: a network of smaller size and faster execution.

Slide 13: Number of radial basis neurons (Designing)
- Chosen by the designer.
- Maximum number of neurons = the number of inputs.
- Minimum number of neurons = experimentally determined.
- More neurons: a more complex network, but a smaller tolerance.

Slide 14: Learning strategies
Two levels of learning:
- center and spread learning (or determination)
- output-layer weight learning
Keep the number of parameters as small as possible (curse of dimensionality).

Slide 15: Various learning strategies
They differ in how the centers of the radial basis functions of the network are specified:
1. Fixed centers selected at random
2. Self-organized selection of centers
3. Supervised selection of centers

Slide 16: Fixed centers selected at random (1) (Learning strategies)
- The RBFs of the hidden units are fixed.
- The locations of the centers may be chosen randomly from the training data set.
- Different values of centers and widths can be used for each radial basis function, so experimentation with the training data is needed.

Slide 17: Fixed centers selected at random (2) (Learning strategies)
- Only the output-layer weights need to be learned.
- Obtain the output-layer weights by the pseudo-inverse method.
- Main problem: a large training set is required for a satisfactory level of performance.
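A sketch of the pseudo-inverse step, assuming centers drawn at random from the training inputs and one width per unit (names such as fit_output_weights are illustrative, not from the slides):

import numpy as np

def design_matrix(X, centers, widths):
    # H[n, j] = exp(-||x_n - c_j||^2 / r_j^2)
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / widths ** 2)

def fit_output_weights(X, t, centers, widths):
    H = design_matrix(X, centers, widths)   # N x m design matrix
    return np.linalg.pinv(H) @ t            # w = H^+ t, the least-squares solution

# Usage: pick centers at random from the training inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
centers = X[rng.choice(len(X), size=10, replace=False)]
widths = np.full(10, 1.0)
w = fit_output_weights(X, t, centers, widths)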

Slide 18: Self-organized selection of centers (1) (Learning strategies)
Hybrid learning:
- self-organized learning to estimate the centers of the RBFs in the hidden layer, by means of clustering
- supervised learning to estimate the linear weights of the output layer, by the LMS algorithm

Slide 19: Self-organized selection of centers (2) (Learning strategies)
k-means clustering:
1. Initialization
2. Sampling
3. Similarity matching
4. Updating
5. Continuation
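The slide lists the sequential (online) k-means steps; a compact batch variant, which serves the same purpose of placing the centers, might look like this (illustrative code, not from the slides):

import numpy as np

def kmeans_centers(X, m, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)  # initialization
    for _ in range(n_iter):
        # similarity matching: assign each point to its nearest center
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d2, axis=1)
        # updating: move each center to the mean of its assigned points
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers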

Slide 20: Supervised selection of centers (Learning strategies)
- All free parameters of the network are changed by a supervised learning process.
- Error-correction learning using the LMS algorithm.

Slide 21: Learning formulas (Learning strategies)
Update formulas are given for:
- the linear weights (output layer)
- the positions of the centers (hidden layer)
- the spreads of the centers (hidden layer)
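In the standard supervised gradient-descent treatment these updates take the generic form below, with separate learning rates $\eta_1, \eta_2, \eta_3$ (a restatement of the usual formulas, not necessarily the slide's exact ones):

\begin{align}
  w_j(n+1) &= w_j(n) - \eta_1 \,\frac{\partial E(n)}{\partial w_j(n)} && \text{(linear weights)} \\
  c_j(n+1) &= c_j(n) - \eta_2 \,\frac{\partial E(n)}{\partial c_j(n)} && \text{(positions of centers)} \\
  r_j(n+1) &= r_j(n) - \eta_3 \,\frac{\partial E(n)}{\partial r_j(n)} && \text{(spreads of centers)}
\end{align}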

Slide 22: MLP vs RBFN

  MLP                              | RBFN
  ---------------------------------|---------------------------------
  Global hyperplane                | Local receptive field
  EBP (error back-propagation)     | LMS
  Local minima                     | Serious local minima
  Smaller number of hidden neurons | Larger number of hidden neurons
  Shorter computation time         | Longer computation time
  Longer learning time             | Shorter learning time

Slide 23: Approximation (MLP vs RBFN)
- MLP: a global network; all inputs cause an output.
- RBF: a local network; only inputs near a receptive field produce an activation, so the network can give a "don't know" output.

Slide 24: Gaussian Mixture
Given a finite number of data points $x^n$, $n = 1, \dots, N$, drawn from an unknown distribution, the probability density $p(x)$ of this distribution can be modeled by parametric methods:
- assume a known density function (e.g., Gaussian) to start with, then
- estimate its parameters by maximum likelihood.
For a data set of $N$ vectors $\chi = \{x^1, \dots, x^N\}$ drawn independently from the distribution $p(x|\theta)$, the joint probability density of the whole data set $\chi$ is given by
  $L(\theta) = p(\chi|\theta) = \prod_{n=1}^{N} p(x^n|\theta).$

Slide 25: Gaussian Mixture
$L(\theta)$ can be viewed as a function of $\theta$ for fixed $\chi$; in other words, it is the likelihood of $\theta$ for the given $\chi$. The technique of maximum likelihood sets the value of $\theta$ by maximizing $L(\theta)$. In practice, the negative logarithm of the likelihood, $E = -\ln L(\theta)$, is often considered instead, and the minimum of $E$ is found. For a normal distribution, the estimated parameters follow from analytic differentiation of $E$:
  $\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x^n, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x^n - \hat{\mu})(x^n - \hat{\mu})^{\mathrm{T}}.$

Slide 26: Gaussian Mixture
Non-parametric methods: histograms.
[Figure: an illustration of the histogram approach to density estimation. The set of 30 sample data points is drawn from the sum of two normal distributions, with means 0.3 and 0.8, standard deviations 0.1, and amplitudes 0.7 and 0.3 respectively. The original distribution is shown by the dashed curve, and the histogram estimates are shown by the rectangular bins.]
The number M of histogram bins within the given interval determines the width of the bins, which in turn controls the smoothness of the estimated density.

Slide 27: Gaussian Mixture
Density estimation by basis functions, e.g., kernel functions or k-nn.
[Figure: examples of kernel and K-nn approaches to density estimation: (a) kernel function, (b) K-nn.]

Slide 28: Discussions (Gaussian Mixture)
- The parametric approach assumes a specific form for the density function, which may be different from the true density, but the density function can be evaluated rapidly for new input vectors.
- Non-parametric methods allow very general forms of density function, but the number of variables in the model grows directly with the number of training data points, so the model cannot be evaluated rapidly for new input vectors.
- A mixture model combines both advantages: (1) it is not restricted to a specific functional form, and (2) the size of the model grows only with the complexity of the problem being solved, not with the size of the data set.

Slide 29: Gaussian Mixture
The mixture model is a linear combination of component densities $p(x|j)$ in the form
  $p(x) = \sum_{j=1}^{M} P(j)\, p(x|j),$
where the $P(j)$ are the mixing coefficients.

Slide 30: Gaussian Mixture
The key difference between the mixture-model representation and a true classification problem lies in the nature of the training data, since in this case we are not provided with any "class labels" to say which component was responsible for generating each data point. This is the so-called representation of "incomplete data". However, the technique of mixture modeling can be applied separately to each class-conditional density $p(x|C_k)$ in a true classification problem; in that case, each class-conditional density $p(x|C_k)$ is represented by an independent mixture model of the form
  $p(x|C_k) = \sum_{j=1}^{M} P(j)\, p(x|j).$

Slide 31: Gaussian Mixture
Analogous to class-conditional densities, and using Bayes' theorem, the posterior probabilities of the component densities can be derived as
  $P(j|x) = \frac{P(j)\, p(x|j)}{p(x)}.$
The value of $P(j|x)$ represents the probability that component $j$ was responsible for generating the data point $x$. Restricting attention to Gaussian distributions, each individual component density is given by
  $p(x|j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{\|x - \mu_j\|^2}{2\sigma_j^2}\right).$
The parameters of the Gaussian mixture are determined by (1) maximum likelihood or (2) the EM algorithm.
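A sketch of this posterior (responsibility) computation for spherical Gaussian components (illustrative names only; not code from the slides):

import numpy as np

def component_density(X, mu, sigma):
    # p(x|j) = (2*pi*sigma_j^2)^(-d/2) * exp(-||x - mu_j||^2 / (2*sigma_j^2))
    d = X.shape[1]
    d2 = np.sum((X - mu) ** 2, axis=1)
    return (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-d2 / (2 * sigma ** 2))

def responsibilities(X, priors, mus, sigmas):
    # P(j|x) = P(j) p(x|j) / sum_k P(k) p(x|k), by Bayes' theorem
    dens = np.stack([component_density(X, mus[j], sigmas[j])
                     for j in range(len(priors))], axis=1)   # N x M
    joint = dens * priors                                     # P(j) p(x|j)
    return joint / joint.sum(axis=1, keepdims=True)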

Slide 32: Gaussian Mixture
[Figure: representation of the mixture model in terms of a network diagram. For a component density $p(x|j)$, the lines connecting the inputs $x_i$ to the component $p(x|j)$ represent the elements $\mu_{ji}$ of the corresponding mean vector $\mu_j$ of component $j$.]

Slide 33: Maximum likelihood
The mixture density contains adjustable parameters $P(j)$, $\mu_j$ and $\sigma_j$, where $j = 1, \dots, M$. The negative log-likelihood for the data set $\{x^n\}$ is given by
  $E = -\ln L = -\sum_{n=1}^{N} \ln p(x^n) = -\sum_{n=1}^{N} \ln\left(\sum_{j=1}^{M} P(j)\, p(x^n|j)\right).$
Maximizing the likelihood is then equivalent to minimizing $E$. $E$ is differentiated with respect to the centres $\mu_j$ and the variances $\sigma_j$, as sketched below.
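Assuming the spherical Gaussian components defined on slide 31, the standard forms of these derivatives (a restatement, which may differ in notation from the slide's own equations) are:

\begin{align}
\frac{\partial E}{\partial \mu_j} &= \sum_{n=1}^{N} P(j|x^n)\,\frac{\mu_j - x^n}{\sigma_j^2}, \\
\frac{\partial E}{\partial \sigma_j} &= \sum_{n=1}^{N} P(j|x^n)\left(\frac{d}{\sigma_j} - \frac{\|x^n - \mu_j\|^2}{\sigma_j^3}\right).
\end{align}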

Slide 34: Maximum likelihood
Minimization of $E$ with respect to the mixing parameters $P(j)$ must be subject to the constraints $\sum_j P(j) = 1$ and $0 < P(j) < 1$. This can be handled by expressing $P(j)$ in terms of a set of $M$ auxiliary variables $\{\gamma_j\}$ such that
  $P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M} \exp(\gamma_k)}.$
This transformation is called the softmax function, and the minimization of $E$ with respect to $\gamma_j$ uses the chain rule in the form
  $\frac{\partial E}{\partial \gamma_j} = \sum_{k} \frac{\partial E}{\partial P(k)}\,\frac{\partial P(k)}{\partial \gamma_j}.$
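Carrying the chain rule through, with $\partial P(k)/\partial \gamma_j = \delta_{kj} P(j) - P(j) P(k)$, gives the usual result:

\begin{align}
\frac{\partial E}{\partial \gamma_j} = \sum_{n=1}^{N} \bigl( P(j) - P(j|x^n) \bigr).
\end{align}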

Slide 35: Maximum likelihood
Setting these derivatives to zero, we obtain the re-estimation formulas given below. These formulas give some insight into the maximum likelihood solution, but they do not provide a direct method for calculating the parameters, because they are expressed in terms of $P(j|x)$. They do, however, suggest an iterative scheme for finding the minimum of $E$.
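In their standard form (assuming the spherical Gaussian components above; a restatement rather than a verbatim copy of the slide), the stationarity conditions give

\begin{align}
\hat{\mu}_j &= \frac{\sum_n P(j|x^n)\, x^n}{\sum_n P(j|x^n)}, \\
\hat{\sigma}_j^2 &= \frac{1}{d}\,\frac{\sum_n P(j|x^n)\, \|x^n - \hat{\mu}_j\|^2}{\sum_n P(j|x^n)}, \\
\hat{P}(j) &= \frac{1}{N} \sum_n P(j|x^n).
\end{align}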

Slide 36: Maximum likelihood
We can make some initial guess for the parameters and use the formulas above to compute revised values of the parameters. Then, using $P(j|x^n)$, we estimate new parameters, and repeat the process until it converges.

Slide 37: The EM algorithm
The iterative process consists of (1) an expectation step and (2) a maximization step, which is why it is called the EM algorithm. We can write the change in the error $E$ in terms of the old and new parameters as
  $E^{\mathrm{new}} - E^{\mathrm{old}} = -\sum_{n} \ln \frac{p^{\mathrm{new}}(x^n)}{p^{\mathrm{old}}(x^n)}.$
Using $p^{\mathrm{new}}(x) = \sum_j P^{\mathrm{new}}(j)\, p^{\mathrm{new}}(x|j)$, we can rewrite this and then apply Jensen's inequality: given a set of numbers $\lambda_j \ge 0$ such that $\sum_j \lambda_j = 1$,
  $\ln \sum_j \lambda_j x_j \ge \sum_j \lambda_j \ln x_j.$

Slide 38: The EM algorithm
Taking $P^{\mathrm{old}}(j|x^n)$ as the $\lambda_j$, the change in $E$ is bounded: with $Q$ defined as the resulting sum, $E^{\mathrm{new}} \le E^{\mathrm{old}} + Q$, so $E^{\mathrm{old}} + Q$ is an upper bound on $E^{\mathrm{new}}$. As shown in the figure, minimizing $Q$ leads to a decrease of $E^{\mathrm{new}}$, unless $E^{\mathrm{new}}$ is already at a local minimum.
[Figure: schematic plot of the error function $E$ as a function of the new value $\theta^{\mathrm{new}}$ of one of the parameters of the mixture model. The curve $E^{\mathrm{old}} + Q(\theta^{\mathrm{new}})$ provides an upper bound on the value of $E(\theta^{\mathrm{new}})$, and the EM algorithm involves finding the minimum value of this upper bound.]
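Written out explicitly for the mixture model (a standard restatement under the assumptions above, not necessarily the slide's exact expression), the bound is

\begin{align}
E^{\mathrm{new}} - E^{\mathrm{old}}
  \;\le\; -\sum_{n}\sum_{j} P^{\mathrm{old}}(j|x^n)\,
       \ln\!\left(\frac{P^{\mathrm{new}}(j)\, p^{\mathrm{new}}(x^n|j)}
                        {P^{\mathrm{old}}(j|x^n)\, p^{\mathrm{old}}(x^n)}\right) \;=\; Q .
\end{align}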

6/10/ Visual Recognition39 Let’s drop terms in Q that depends on only old parameters, and rewrite Q as the smallest value for the upper bound is found by minimizing this quantity for the Gaussian mixture model, the quality can be we can now minimize this function with respect to ‘new’ parameters, and they are: The EM algorithm

Slide 40: The EM algorithm
For the mixing parameters $P^{\mathrm{new}}(j)$, the constraint $\sum_j P^{\mathrm{new}}(j) = 1$ can be enforced with a Lagrange multiplier $\lambda$ by minimizing the combined function
  $Z = Q + \lambda \left( \sum_j P^{\mathrm{new}}(j) - 1 \right).$
Setting the derivative of $Z$ with respect to $P^{\mathrm{new}}(j)$ to zero, and using $\sum_j P^{\mathrm{new}}(j) = 1$ and $\sum_j P^{\mathrm{old}}(j|x^n) = 1$, we obtain $\lambda = N$, and thus
  $P^{\mathrm{new}}(j) = \frac{1}{N} \sum_{n} P^{\mathrm{old}}(j|x^n).$
Since only the $\sum_n P^{\mathrm{old}}(j|x^n)$ term appears on the right-hand side, these results are ready for iterative computation.
Exercise 2: shown on the nets.
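Putting the E-step and M-step formulas together, a self-contained sketch of the full iteration for a mixture of M spherical Gaussians (illustrative code with assumed names such as em_gaussian_mixture, not from the slides) might look like this:

import numpy as np

def em_gaussian_mixture(X, M, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(N, size=M, replace=False)].astype(float)
    sigmas = np.full(M, X.std())
    priors = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities P_old(j|x_n)
        d2 = np.sum((X[:, None, :] - mus[None, :, :]) ** 2, axis=2)        # N x M
        dens = (2 * np.pi * sigmas ** 2) ** (-d / 2) * np.exp(-d2 / (2 * sigmas ** 2))
        R = dens * priors
        R /= R.sum(axis=1, keepdims=True)
        # M-step: new means, spreads, and mixing coefficients
        Nj = R.sum(axis=0)                                                 # effective counts
        mus = (R.T @ X) / Nj[:, None]
        d2 = np.sum((X[:, None, :] - mus[None, :, :]) ** 2, axis=2)
        sigmas = np.sqrt((R * d2).sum(axis=0) / (d * Nj))
        priors = Nj / N
    return priors, mus, sigmas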