Introduction to Radial Basis Function Networks

Introduction to Radial Basis Function Networks. Speaker: 虞台文

Content
  Overview
  The Models of Function Approximator
  The Radial Basis Function Networks
  RBFN's for Function Approximation
  The Projection Matrix
  Learning the Kernels
  Bias-Variance Dilemma
  The Effective Number of Parameters
  Model Selection
  Incremental Operations

Introduction to Radial Basis Function Networks Overview

Typical Applications of NNs: pattern classification, function approximation, time-series forecasting.

Function Approximation (diagram): an unknown function f and its approximator f̂.

Supervised Learning (diagram): the neural network's output is compared with the unknown function's output, and the difference (error) drives the learning.

Neural Networks as Universal Approximators Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy. Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366. Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators. Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257. Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.

Statistics vs. Neural Networks
  model ↔ network
  estimation ↔ learning
  regression ↔ supervised learning
  interpolation ↔ generalization
  observations ↔ training set
  parameters ↔ (synaptic) weights
  independent variables ↔ inputs
  dependent variables ↔ outputs
  ridge regression ↔ weight decay

Introduction to Radial Basis Function Networks The Model of Function Approximator

Linear Models: f(x) = Σ_j w_j φ_j(x), a weighted sum of fixed basis functions φ_j with weights w_j.

Linear Models (network diagram): the inputs x = (x1, …, xn) feed hidden units φ1, …, φm (feature extraction / transformation); the output y is the linearly weighted sum of the hidden-unit outputs with weights w1, …, wm.

Linear Models (same diagram): can you name some basis functions?

Example Linear Models: the polynomial basis and the Fourier series. Are they orthogonal bases?
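
As a quick numerical check of the orthogonality question, the sketch below samples both bases on [-1, 1] and looks at the off-diagonal entries of their Gram matrices; the grid, the interval, and the number of basis functions are arbitrary choices made for illustration.

```python
import numpy as np

# Evaluate each basis on a uniform grid over [-1, 1] (illustration only).
x = np.linspace(-1.0, 1.0, 201)

# Polynomial basis: 1, x, x^2, ..., x^5
poly = np.stack([x ** k for k in range(6)], axis=1)

# Fourier basis: 1, cos(pi x), sin(pi x), cos(2 pi x), sin(2 pi x), ...
fourier = np.stack(
    [np.ones_like(x)]
    + [f(k * np.pi * x) for k in range(1, 4) for f in (np.cos, np.sin)],
    axis=1,
)

# Off-diagonal Gram entries are ~0 for an orthogonal basis on this interval.
for name, Phi in [("polynomial", poly), ("Fourier", fourier)]:
    gram = Phi.T @ Phi / len(x)
    off_diag = np.abs(gram - np.diag(np.diag(gram))).max()
    print(f"{name:10s}: largest off-diagonal Gram entry = {off_diag:.3f}")
```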

Single-Layer Perceptrons as Universal Approximators (network diagram): with a sufficient number of sigmoidal hidden units, such a network can be a universal approximator.

Radial Basis Function Networks as Universal Approximators (network diagram): with a sufficient number of radial-basis-function hidden units, such a network can also be a universal approximator.

Non-Linear Models: f(x) = Σ_j w_j φ_j(x; θ_j), where the weights w_j multiply basis functions whose parameters θ_j are adjusted by the learning process.

Introduction to Radial Basis Function Networks The Radial Basis Function Networks

Radial Basis Functions. Three things define a radial function φ_i(x) = φ(||x - x_i||): the center x_i, the distance measure r = ||x - x_i||, and the shape (with width parameter σ).

Typical Radial Functions. Gaussian: φ(r) = exp(-r²/(2σ²)). Hardy's multiquadric: φ(r) = sqrt(r² + c²). Inverse multiquadric: φ(r) = 1/sqrt(r² + c²).
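
A small code sketch of these three radial functions (the width σ and shape parameter c are illustrative defaults, not values fixed by the slides):

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2))
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

def multiquadric(r, c=1.0):
    # Hardy's multiquadric: phi(r) = sqrt(r^2 + c^2)
    return np.sqrt(r ** 2 + c ** 2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = 1 / sqrt(r^2 + c^2)
    return 1.0 / np.sqrt(r ** 2 + c ** 2)

# r is the distance to the centre, r = ||x - x_i||
r = np.linspace(0.0, 3.0, 7)
print(gaussian(r, sigma=0.5))
print(multiquadric(r))
print(inverse_multiquadric(r))
```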

Gaussian Basis Function (plot for σ = 0.5, 1.0, 1.5)

Inverse Multiquadric (plot)

Basis {i: i =1,2,…} is `near’ orthogonal. Most General RBF + + +

Properties of RBFs: on-center, off-surround; analogous to the localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.

The Topology of RBF, as a function approximator (network diagram): the inputs x1, …, xn are projected onto the hidden units, and the outputs y1, …, ym interpolate the hidden-unit responses.

The Topology of RBF, as a pattern classifier (network diagram): the hidden units represent subclasses and the output units y1, …, ym represent classes.

Introduction to Radial Basis Function Networks RBFN’s for Function Approximation

The idea (plot): an unknown function to approximate, with training data sampled from it.

The idea (plot): basis functions (kernels) are placed along the input axis underneath the unknown function and the training data.

The idea (plot): the learned function is a weighted combination of the basis functions (kernels).

The idea (plot): the learned function evaluated at a non-training sample, on top of the basis functions (kernels).

The idea (plot): the learned function generalizes to non-training samples.

Radial Basis Function Networks as Universal Approximators. Given a training set {(x_k, d_k)}, the goal is Σ_j w_j φ_j(x_k) ≈ d_k for all k.

Learn the Optimal Weight Vector. Given the training set, find the weights w1, …, wm that achieve the goal f(x_k) ≈ d_k for all k.

Regularization. Over the training set, the goal becomes minimizing C(w) = Σ_k (f(x_k) - d_k)² + Σ_j λ_j w_j²; if regularization is unneeded, set λ_j = 0.

Learn the Optimal Weight Vector. Minimize C(w) = Σ_k (f(x_k) - d_k)² + Σ_j λ_j w_j².

Learn the Optimal Weight Vector. Define the target vector d = (d_1, …, d_p)^T and the weight vector w = (w_1, …, w_m)^T.

Learn the Optimal Weight Vector. Define the design matrix H with H_kj = φ_j(x_k), so the model outputs on the training set are Hw, and Λ = diag(λ_1, …, λ_m).

Learn the Optimal Weight Vector. Setting ∂C/∂w = 0 gives the normal equation (H^T H + Λ) ŵ = H^T d.

Learn the Optimal Weight Vector. The solution is ŵ = A^{-1} H^T d, where H is the design matrix and A^{-1} = (H^T H + Λ)^{-1} is the variance matrix.

Summary. Given the training set {(x_k, d_k)}, form the design matrix H and A = H^T H + Λ; the learned weight vector is ŵ = A^{-1} H^T d.
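
Putting the preceding slides together, a minimal sketch of the weight computation with Gaussian kernels on synthetic 1-D data; the centers, width, regularization value, and the target function are all assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training set {(x_k, d_k)}, k = 1..p
p = 50
x = np.sort(rng.uniform(0.0, 1.0, p))
d = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, p)

# Fixed Gaussian kernels: centres c_j and a common width sigma
centers = np.linspace(0.0, 1.0, 10)
sigma = 0.1
H = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))  # design matrix

# Ridge-regularized least squares: w = (H^T H + Lambda)^{-1} H^T d
lam = 1e-3
A = H.T @ H + lam * np.eye(H.shape[1])
w = np.linalg.solve(A, H.T @ d)

print("learned weights:", np.round(w, 3))
print("training SSE:", np.sum((d - H @ w) ** 2))
```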

Introduction to Radial Basis Function Networks The Projection Matrix

The Empirical-Error Vector (diagram): the unknown function supplies the targets d_k at the training inputs.

The Empirical-Error Vector (diagram): the error vector is e = d - ŷ = d - Hŵ.

Sum-Squared-Error. If λ = 0, the RBFN learning algorithm minimizes the SSE (equivalently the MSE): SSE = e^T e.

The Projection Matrix. The error vector can be written as e = P d, where P = I - H A^{-1} H^T is the projection matrix.
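
The equation images for this section are not in the transcript; as a sketch, the standard projection-matrix algebra in the notation of the cited Orr (1996) notes, with H, Λ, A, and d as defined on the slides above:

```latex
% Projection-matrix algebra (Orr's notation); H, Lambda, A, d as defined above.
A = H^{\top}H + \Lambda, \qquad \hat{w} = A^{-1}H^{\top}d
P = I_p - H A^{-1} H^{\top}
\hat{y} = H\hat{w} = (I_p - P)\,d, \qquad e = d - \hat{y} = P\,d
\mathrm{SSE} = e^{\top}e = d^{\top}P^{2}d
```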

Introduction to Radial Basis Function Networks Learning the Kernels

RBFN's as Universal Approximators (network diagram): given a training set, the m kernels φ1, …, φm feed the l outputs y1, …, yl through the weights w_ij.

What to Learn? The weights w_ij, the centers μ_j of the φ_j's, the widths σ_j of the φ_j's, and the number of φ_j's (a model-selection problem).

One-Stage Learning

One-Stage Learning. Simultaneously updating all three sets of parameters (weights, centers, widths) may be suitable for non-stationary environments or on-line settings.
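
To make the one-stage idea concrete, below is a hedged sketch of one joint gradient-descent step on the weights, centers, and widths of Gaussian kernels under a squared-error loss; the function names, learning rate, clamping, and the small synthetic demo are my own choices, not the slides':

```python
import numpy as np

def rbf_outputs(X, C, widths):
    # Gaussian hidden-unit outputs: Phi[k, j] = exp(-||x_k - c_j||^2 / (2 s_j^2))
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)        # (p, m)
    return np.exp(-d2 / (2.0 * widths ** 2))

def one_stage_step(X, d, w, C, widths, lr=0.1):
    # One joint gradient step on E = (1/2p) sum_k (f(x_k) - d_k)^2,
    # with f(x) = sum_j w_j phi_j(x).
    p = len(X)
    Phi = rbf_outputs(X, C, widths)                                 # (p, m)
    err = Phi @ w - d                                               # (p,)
    grad_w = Phi.T @ err / p                                        # dE/dw_j
    diff = X[:, None, :] - C[None, :, :]                            # (p, m, n)
    common = err[:, None] * w[None, :] * Phi                        # (p, m)
    grad_C = (common[:, :, None] * diff / widths[None, :, None] ** 2).sum(axis=0) / p
    grad_s = (common * (diff ** 2).sum(axis=2) / widths[None, :] ** 3).sum(axis=0) / p
    new_widths = np.maximum(widths - lr * grad_s, 1e-2)             # keep widths positive
    return w - lr * grad_w, C - lr * grad_C, new_widths

# Tiny synthetic demo (illustrative only; no claim about good convergence).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (40, 1))
d = np.sin(np.pi * X[:, 0])
w = rng.normal(scale=0.1, size=5)
C = rng.uniform(-1.0, 1.0, (5, 1))
widths = np.full(5, 0.5)
for _ in range(500):
    w, C, widths = one_stage_step(X, d, w, C, widths)
print("training MSE:", np.mean((rbf_outputs(X, C, widths) @ w - d) ** 2))
```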

Two-Stage Training (network diagram). Step 1 determines the centers μ_j, the widths σ_j, and the number of the φ_j's. Step 2 determines the weights w_ij, e.g., using batch learning.

Train the Kernels

Unsupervised Training (scatter plot: candidate kernel centers placed among the data points).

Methods
  Subset selection: random subset selection, forward selection, backward elimination
  Clustering algorithms: k-means, LVQ
  Mixture models: GMM

Subset Selection

Random Subset Selection. Randomly choose a subset of points from the training set as centers. This is sensitive to the initially chosen points, so adaptive techniques can be used to tune the centers, the widths, and the number of points.

Clustering Algorithms Partition the data points into K clusters.
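
A plain k-means sketch for choosing the centers, with a common nearest-neighbour heuristic for the widths; the iteration count, seed, and width scale are illustrative assumptions:

```python
import numpy as np

def kmeans_centers(X, k, n_iter=100, seed=0):
    # Plain k-means on the inputs; the resulting cluster centers serve as RBF centers.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                       # nearest-center assignment
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def nearest_center_widths(centers, scale=1.0):
    # A common heuristic: width proportional to the distance to the nearest other center.
    d = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    return scale * d.min(axis=1)

# Usage on random 2-D data (illustrative).
X = np.random.default_rng(1).normal(size=(200, 2))
C = kmeans_centers(X, k=4)
print(C, nearest_center_widths(C))
```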

Clustering Algorithms Is such a partition satisfactory?

Clustering Algorithms How about this?

Clustering Algorithms 1 + + 2 + 4 + 3

Introduction to Radial Basis Function Networks Bias-Variance Dilemma

Goal Revisited. Ultimate goal: generalization, i.e., minimize the prediction error. Goal of our learning procedure: minimize the empirical error.

Badness of Fit. Underfitting: a model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, producing excessive bias in the outputs. Overfitting: a model that is too complex may fit the noise, not just the signal, producing excessive variance in the outputs.

Underfitting/Overfitting Avoidance: model selection, jittering, early stopping, weight decay (regularization / ridge regression), Bayesian learning, combining networks.

Best Way to Avoid Overfitting: use lots of training data, e.g., 30 times as many training cases as there are weights in the network (for noise-free data, 5 times as many training cases as weights may be sufficient). Don't arbitrarily reduce the number of weights for fear of underfitting.

Badness of Fit Underfit Overfit

Badness of Fit Underfit Overfit

Bias-Variance Dilemma (though it is not really a dilemma). Underfit: large bias, small variance. Overfit: small bias, large variance.

More on overfitting: overfit models easily lead to predictions far beyond the range of the training data, and can produce wild predictions in multilayer perceptrons even with noise-free data. (Underfit: large bias, small variance. Overfit: small bias, large variance.)

Bias-Variance Dilemma (it is not really a dilemma): as the model fits the data more closely, bias decreases and variance increases, moving from underfit to overfit.

Bias-Variance Dilemma (diagram): the true model plus noise; the solutions obtained with training sets 1, 2, and 3 each lie in the set of functions the model can realize (which depends, e.g., on the number of hidden nodes used), and each differs from the true model by a bias. What are the mean and the variance of this bias?

Bias-Variance Dilemma (diagram): the spread of the solutions across training sets is the variance; their systematic offset from the true model is the bias. What are the mean and the variance of the bias?

Model Selection: reduce the variance by reducing the effective number of parameters, e.g., by reducing the number of hidden nodes (same diagram: noise, bias, variance, the true model, and the set of realizable functions).

Bias-Variance Dilemma. Goal: minimize the expected squared prediction error (same diagram as before).

Bias-Variance Dilemma. Goal: decompose the expected error; the noise term is a constant that learning cannot reduce.

Bias-Variance Dilemma

Bias-Variance Dilemma. Goal: the expected error splits into noise (which cannot be minimized), bias², and variance; minimize both bias² and variance.
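
For reference, the decomposition behind these slides, written out for y = f(x) + ε with E[ε] = 0, noise variance σ², and an estimator f̂_D trained on a random training set D:

```latex
% Bias-variance decomposition of the expected squared prediction error at a fixed x.
E_{D,\varepsilon}\bigl[(y - \hat{f}_D(x))^2\bigr]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\bigl(E_D[\hat{f}_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{E_D\bigl[(\hat{f}_D(x) - E_D[\hat{f}_D(x)])^2\bigr]}_{\text{variance}}
```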

Model Complexity vs. Bias-Variance (plot: bias², variance, and noise as functions of model complexity / capacity).

Bias-Variance Dilemma (plot): the total of noise, bias², and variance is smallest at an intermediate model complexity (capacity).

Example (Polynomial Fits)

Example (Polynomial Fits)

Example (Polynomial Fits): fits of degree 1, 5, 10, and 15 (plots).
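
A small experiment in the spirit of this example: fit polynomials of degree 1, 5, 10, and 15 to noisy samples of an assumed target (sin(πx) with Gaussian noise; the original slides' data are not given) and compare training and test error:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, x_train.size)  # noisy samples
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)                                        # noise-free targets

for degree in (1, 5, 10, 15):
    # High degrees with few points may trigger a conditioning warning; that is the point.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```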

Introduction to Radial Basis Function Networks The Effective Number of Parameters

Variance Estimation. Variance: σ² = E[(x - μ)²]; the mean μ is, in general, not available.

Variance Estimation. Estimating the mean from the data first and then the variance loses one degree of freedom: the sample variance divides by p - 1.

Simple Linear Regression

Simple Linear Regression: minimize the sum of squared residuals.

Mean Squared Error (MSE). Fitting the two regression parameters costs two degrees of freedom, so the variance estimate is SSE/(p - 2).

Variance Estimation. With m parameters in the model, m degrees of freedom are lost: the variance estimate is SSE/(p - m).

The Number of Parameters (network diagram): with m weights w1, …, wm, the number of degrees of freedom is m.

The Effective Number of Parameters (γ) (network diagram): γ is defined through the projection matrix P.

The Effective Number of Parameters (γ). Facts: γ = p - trace(P); without regularization trace(P) = p - m, so γ = m. (Proof via the projection matrix.)

Regularization. Penalize models with large weights: Cost = SSE + Σ_j λ_j w_j²; the penalty changes the effective number of parameters.

Regularization. Without penalty (λ_i = 0), there are m degrees of freedom available to minimize the SSE (cost); the effective number of parameters is γ = m.

Regularization. With penalty (λ_i > 0), the freedom to minimize the SSE is reduced; the effective number of parameters becomes γ < m.

Variance Estimation. With regularization, γ degrees of freedom are lost, where γ is the effective number of parameters: the variance estimate is SSE/(p - γ).

Variance Estimation using the effective number of parameters γ (see the summary below).
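
The equation images for these slides are missing; as a sketch, the usual definitions in the framework of the cited Orr (1996) notes (p training patterns, projection matrix P), to be checked against those notes rather than read as a transcription of the slides:

```latex
% Effective number of parameters and the corresponding variance estimate.
\gamma = p - \operatorname{trace}(P), \qquad
\hat{\sigma}^2 = \frac{\mathrm{SSE}}{p - \gamma} = \frac{d^{\top}P^{2}d}{\operatorname{trace}(P)}
```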

Introduction to Radial Basis Function Networks Model Selection

Model Selection
  Goal: choose the fittest model
  Criterion: least prediction error
  Main tools (to estimate model fitness): cross-validation, the projection matrix
  Methods: weight decay (ridge regression), pruning and growing RBFNs

Empirical Error vs. Model Fitness. Ultimate goal: generalization, i.e., minimize the prediction error. Goal of our learning procedure: minimize the empirical error (MSE), although what we really want to minimize is the prediction error.

Estimating Prediction Error. When you have plenty of data, use independent test sets: e.g., train different models on the same training set and choose the best model by comparing them on the test set. When data is scarce, use cross-validation or the bootstrap.

Cross-Validation. The simplest and most widely used method for estimating prediction error: partition the original set in several different ways and compute an average score over the different partitions. Variants: K-fold cross-validation, leave-one-out cross-validation, generalized cross-validation.

K-Fold CV. Split the set D of available input-output patterns into k mutually exclusive subsets D1, D2, …, Dk. Train and test the learning algorithm k times, each time training on D\Di and testing on Di.

K-Fold CV Available Data

K-Fold CV (diagram): the data are split into D1, D2, …, Dk; in each of the k rounds one Di serves as the test set and the remainder as the training set, giving an error estimate; the k estimates are averaged.
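
A generic sketch of the k-fold procedure just described; `fit` and `predict` are placeholder callables for whatever learner is being assessed:

```python
import numpy as np

def k_fold_cv_mse(X, y, k, fit, predict, seed=0):
    # Shuffle indices, split into k folds, train on k-1 folds, test on the held-out fold.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean((predict(model, X[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(scores))

# Tiny usage with a trivial "predict the training mean" placeholder learner.
X_demo = np.arange(20.0).reshape(-1, 1)
y_demo = X_demo[:, 0] + np.random.default_rng(1).normal(0.0, 1.0, 20)
mse = k_fold_cv_mse(X_demo, y_demo, k=5,
                    fit=lambda X, y: y.mean(),
                    predict=lambda model, X: np.full(len(X), model))
print("5-fold CV MSE of the mean predictor:", round(mse, 3))
```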

Leave-One-Out CV: a special case of k-fold CV. Split the p available input-output patterns into a training set of size p - 1 and a test set of size 1, and average the squared error on the left-out pattern over the p possible ways of partitioning.

Error Variance Predicted by LOO (a special case of k-fold CV). Notation: the available input-output patterns; the LOO training sets D_i; the function f_i learned using D_i as training set. The LOO estimate of the prediction-error variance averages the squared error on each left-out element.

Error Variance Predicted by LOO. For each D_i, f_i is the function with least empirical error for that training set. The LOO estimate is an index of the model's fitness, and we want to find a model that also minimizes it.

Error Variance Predicted by LOO. How can this estimate be computed? Are there efficient ways that avoid retraining p times?

Error Variance Predicted by LOO: the squared error on each left-out element can be obtained in closed form from a single fit (see the identity below).
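
The closed form referred to here is a standard identity for linear smoothers such as the ridge-regularized RBF weights (with the regularization held fixed); stated as a sketch in the notation used above, with P the projection matrix and P_kk its diagonal entries:

```latex
% Closed-form LOO estimate for a linear smoother; e = P d is the residual vector.
\hat{\sigma}^2_{\mathrm{LOO}}
  = \frac{1}{p}\sum_{k=1}^{p}\Bigl(\frac{(P d)_k}{P_{kk}}\Bigr)^{2}
  = \frac{1}{p}\, d^{\top}P\,\bigl(\operatorname{diag}(P)\bigr)^{-2}P\,d
```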

Generalized Cross-Validation

More Criteria Based on CV: GCV (generalized cross-validation), and Akaike-style information criteria: UEV (unbiased estimate of variance), FPE (final prediction error), BIC (Bayesian information criterion).

More Criteria Based on CV

More Criteria Based on CV
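
The formula slides for these criteria are missing from the transcript. As a sketch, the variance-estimate forms commonly given in Orr's RBF framework are shown below (p patterns, effective number of parameters γ, projection matrix P); BIC similarly penalizes γ with an additional ln p factor. Treat the exact constants as assumptions to be checked against the cited notes.

```latex
% Variance-estimate forms of the model-selection criteria (sketch).
\hat{\sigma}^2_{\mathrm{UEV}} = \frac{d^{\top}P^{2}d}{p-\gamma}, \qquad
\hat{\sigma}^2_{\mathrm{FPE}} = \frac{p+\gamma}{p-\gamma}\cdot\frac{d^{\top}P^{2}d}{p}, \qquad
\hat{\sigma}^2_{\mathrm{GCV}} = \frac{p\,d^{\top}P^{2}d}{(p-\gamma)^{2}}
```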

Regularization. Standard ridge regression penalizes models with large weights: Cost = SSE + λ Σ_j w_j².

Regularization (standard ridge regression, continued): the solution is ŵ = (H^T H + λI)^{-1} H^T d.

Solution Review: the weight solution and the projection matrix, which are used to compute the model selection criteria.

Example Width of RBF r = 0.5

Example Width of RBF r = 0.5

Example: Width of RBF r = 0.5. How can the optimal regularization parameter be determined effectively?

Optimizing the Regularization Parameter Re-Estimation Formula

Local Ridge Regression Re-Estimation Formula

Example — Width of RBF

Example — Width of RBF

Example — Width of RBF. There are two local minima. Using the above re-estimation formula, the search gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

Example — Width of RBF. There are two local minima.

Example — Width of RBF. There are two local minima.

Example — Width of RBF. RMSE: root mean squared error; in a real case, it is not available.

Example — Width of RBF. RMSE: root mean squared error; in a real case, it is not available.

Local Ridge Regression. Standard ridge regression uses a single global λ: Cost = SSE + λ Σ_j w_j². Local ridge regression uses one λ_j per kernel: Cost = SSE + Σ_j λ_j w_j².

Local Ridge Regression. λ_j → ∞ implies that φ_j(·) can be removed from the model.

The Solutions. Linear regression: ŵ = (H^T H)^{-1} H^T d. Standard ridge regression: ŵ = (H^T H + λI)^{-1} H^T d. Local ridge regression: ŵ = (H^T H + Λ)^{-1} H^T d with Λ = diag(λ_1, …, λ_m). The corresponding projection matrices are used to compute the model selection criteria.

Optimizing the Regularization Parameters: incremental operation. P: the current projection matrix. P_j: the projection matrix obtained by removing φ_j(·).

Optimizing the Regularization Parameters. Solve for the optimal λ_j, subject to λ_j ≥ 0.

Optimizing the Regularization Parameters Solve Subject to

Optimizing the Regularization Parameters. If the solution is λ_j → ∞, remove φ_j(·), then solve the same constrained problem for the remaining parameters.

The Algorithm
  Initialize the λ_i's, e.g., by performing standard ridge regression.
  Repeat the following until GCV converges:
    Randomly select j and compute its optimal λ_j.
    Perform local ridge regression with the updated λ_j.
    If GCV is reduced and λ_j → ∞, remove φ_j(·).

References Mark J. L. Orr (April 1996), Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html. Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI).

Introduction to Radial Basis Function Networks Incremental Operations