1
Introduction to Radial Basis Function Networks
Speaker: 虞台文
2
Content Overview
The Model of Function Approximator
The Radial Basis Function Networks
RBFN's for Function Approximation
The Projection Matrix
Learning the Kernels
Bias-Variance Dilemma
The Effective Number of Parameters
Model Selection
Incremental Operations
3
Introduction to Radial Basis Function Networks
Overview
4
Typical Applications of NN
Pattern Classification
Function Approximation
Time-Series Forecasting
5
Function Approximation
The unknown function $f$ and its approximator $\hat{f}$.
6
Supervised Learning: the unknown function and the neural network receive the same input, and the difference between their outputs (the error) is used to train the network.
7
Neural Networks as Universal Approximators
Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function to any desired degree of accuracy.
Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5).
Like feedforward neural networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators.
Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2).
Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2).
8
Statistics vs. Neural Networks
Statistics                Neural Networks
model                     network
estimation                learning
regression                supervised learning
interpolation             generalization
observations              training set
parameters                (synaptic) weights
independent variables     inputs
dependent variables       outputs
ridge regression          weight decay
9
Introduction to Radial Basis Function Networks
The Model of Function Approximator
10
Linear Models: $y(\mathbf{x}) = \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})$, with adjustable weights $w_j$ and fixed basis functions $\phi_j$.
11
Linear Models (network view): inputs $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ feed the hidden units $\phi_1, \phi_2, \ldots, \phi_m$ (decomposition / feature extraction / transformation), and the output unit forms the linearly weighted output $y = w_1\phi_1(\mathbf{x}) + \cdots + w_m\phi_m(\mathbf{x})$.
12
Linear Models (network view, continued): can you name some bases? The structure is the same: inputs $\mathbf{x} = (x_1, \ldots, x_n)$, hidden units $\phi_1, \ldots, \phi_m$, and the linearly weighted output $y$.
13
Example Linear Models: polynomial bases and Fourier series. Are they orthogonal bases?
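Illustrative sketch (not from the original slides): a linearly weighted model with fixed polynomial or Fourier bases, fit by least squares in NumPy. The toy data and the particular basis parameterizations are assumptions made only for this example.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Fixed polynomial basis: phi_j(x) = x**j, j = 0..degree."""
    return np.vstack([x**j for j in range(degree + 1)]).T

def fourier_basis(x, n_harmonics, period=2 * np.pi):
    """Fixed Fourier basis: 1, sin(k*w*x), cos(k*w*x), k = 1..n_harmonics."""
    omega = 2 * np.pi / period
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(k * omega * x))
        cols.append(np.cos(k * omega * x))
    return np.vstack(cols).T

# Linear model: y_hat = Phi @ w, with fixed bases and learned weights.
x = np.linspace(0.0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * np.random.randn(50)        # toy data (assumption)
Phi = polynomial_basis(x, degree=5)              # or fourier_basis(x, 3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least-squares weights
y_hat = Phi @ w                                  # linearly weighted output
```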
14
Single-Layer Perceptrons as Universal Approximators
With a sufficient number of sigmoidal hidden units, a single-hidden-layer perceptron (inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_m$) can be a universal approximator.
15
Radial Basis Function Networks as Universal Approximators
With a sufficient number of radial-basis-function units $\phi_1, \ldots, \phi_m$ (inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_m$), an RBFN can also be a universal approximator.
16
Non-Linear Models: $y(\mathbf{x}) = \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})$, where both the weights $w_j$ and the basis functions $\phi_j$ are adjusted by the learning process.
17
Introduction to Radial Basis Function Networks
The Radial Basis Function Networks
18
Radial Basis Functions
Three parameters for a radial function $\phi_i(\mathbf{x}) = \phi(\lVert \mathbf{x} - \mathbf{x}_i \rVert)$: the center $\mathbf{x}_i$, the distance measure $r = \lVert \mathbf{x} - \mathbf{x}_i \rVert$, and the shape $\phi$.
19
Typical Radial Functions
Gaussian: $\phi(r) = \exp\!\left(-\frac{r^2}{2\sigma^2}\right)$, $\sigma > 0$
Hardy multiquadratic: $\phi(r) = \sqrt{r^2 + c^2}$, $c > 0$
Inverse multiquadratic: $\phi(r) = \frac{1}{\sqrt{r^2 + c^2}}$, $c > 0$
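A short NumPy sketch of these radial functions, using the common parameterizations (width $\sigma$ for the Gaussian, shape parameter $c$ for the multiquadratics); the parameter names are assumptions, since the slide formulas are not reproduced in this transcript.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian radial function: exp(-r^2 / (2*sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadratic(r, c=1.0):
    """Hardy multiquadratic: sqrt(r^2 + c^2)."""
    return np.sqrt(r**2 + c**2)

def inverse_multiquadratic(r, c=1.0):
    """Inverse multiquadratic: 1 / sqrt(r^2 + c^2)."""
    return 1.0 / np.sqrt(r**2 + c**2)

# A radial basis unit centred at x_i: phi_i(x) = phi(||x - x_i||).
def rbf_unit(x, center, phi=gaussian, **kwargs):
    r = np.linalg.norm(np.atleast_2d(x) - center, axis=1)
    return phi(r, **kwargs)
```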
20
Gaussian Basis Function ($\sigma = 0.5, 1.0, 1.5$)
21
Inverse Multiquadratic
22
Most General RBF
The basis $\{\phi_i : i = 1, 2, \ldots\}$ is nearly orthogonal.
23
Properties of RBFs: On-Center, Off-Surround
Analogous to the localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.
24
The Topology of RBF as a function approximator: inputs $x_1, \ldots, x_n$ (feature vectors) are projected onto the hidden units, which interpolate to produce the outputs $y_1, \ldots, y_m$.
25
The Topology of RBF as a pattern classifier: inputs $x_1, \ldots, x_n$ (feature vectors) map to hidden units representing subclasses, and the output units $y_1, \ldots, y_m$ represent classes.
26
Introduction to Radial Basis Function Networks
RBFN’s for Function Approximation
27
The idea: an unknown function $y(x)$ to be approximated, together with training data sampled from it.
28
The idea: basis functions (kernels) are placed along the $x$-axis beneath the unknown function and the training data.
29
The idea: the learned function is built as a weighted sum of the basis functions (kernels).
30
The idea: the learned function, built from the basis functions (kernels), is evaluated at a non-training sample.
31
The idea: the learned function evaluated at the non-training sample.
32
Radial Basis Function Networks as Universal Approximators
Training set $T = \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}$. Goal: $\sum_{j=1}^{m} w_j \phi_j(\mathbf{x}^{(k)}) = y^{(k)}$ for all $k$.
33
Learn the Optimal Weight Vector
Training set $T = \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}$; goal: find weights $w_1, \ldots, w_m$ such that $\sum_{j=1}^{m} w_j \phi_j(\mathbf{x}^{(k)}) \approx y^{(k)}$ for all $k$.
34
Regularization
Training set $T = \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}$. Goal: minimize the regularized cost $C = \sum_{k=1}^{p}\left(y^{(k)} - \sum_{j=1}^{m} w_j \phi_j(\mathbf{x}^{(k)})\right)^2 + \sum_{j=1}^{m}\lambda_j w_j^2$ for all $k$. If regularization is not needed, set $\lambda_j = 0$.
35
Learn the Optimal Weight Vector
Minimize the regularized cost $C(\mathbf{w})$ defined above with respect to the weight vector $\mathbf{w} = (w_1, \ldots, w_m)^\top$.
36
Learn the Optimal Weight Vector
Define
37
Learn the Optimal Weight Vector
Define
38
Learn the Optimal Weight Vector
39
Learn the Optimal Weight Vector
Setting $\partial C/\partial\mathbf{w} = \mathbf{0}$ gives the normal equations $(\Phi^\top\Phi + \Lambda)\,\hat{\mathbf{w}} = \Phi^\top\mathbf{y}$, i.e., $\hat{\mathbf{w}} = A^{-1}\Phi^\top\mathbf{y}$, where $\Phi$ is the design matrix ($\Phi_{kj} = \phi_j(\mathbf{x}^{(k)})$), $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_m)$, and $A^{-1} = (\Phi^\top\Phi + \Lambda)^{-1}$ is the variance matrix.
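A minimal sketch of this computation: build the design matrix $\Phi$ from Gaussian kernels, form $A = \Phi^\top\Phi + \Lambda$, and solve for $\hat{\mathbf{w}} = A^{-1}\Phi^\top\mathbf{y}$. The Gaussian kernel, the toy data, and the random choice of centers are assumptions used only for illustration.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    """Phi[k, j] = exp(-||x_k - c_j||^2 / (2*sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_weights(Phi, y, lambdas):
    """Optimal weights w = (Phi^T Phi + Lambda)^(-1) Phi^T y."""
    A = Phi.T @ Phi + np.diag(lambdas)   # Lambda = diag(lambda_1..lambda_m)
    A_inv = np.linalg.inv(A)             # the "variance matrix"
    return A_inv @ Phi.T @ y, A_inv

# Toy usage (assumed data): p samples in 1-D, m kernels.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
centers = X[rng.choice(len(X), size=10, replace=False)]  # random subset as centers
Phi = design_matrix(X, centers, sigma=1.0)
w, A_inv = fit_weights(Phi, y, lambdas=np.full(10, 1e-3))
y_hat = Phi @ w
```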
40
Summary: given the training set and the design matrix $\Phi$, the optimal weight vector is $\hat{\mathbf{w}} = A^{-1}\Phi^\top\mathbf{y}$ with $A = \Phi^\top\Phi + \Lambda$.
41
Introduction to Radial Basis Function Networks
The Projection Matrix
42
The Empirical-Error Vector
Unknown Function
43
The Empirical-Error Vector
Unknown Function Error Vector
44
Sum-Squared-Error and the Error Vector
If $\lambda = 0$, the RBFN's learning algorithm simply minimizes the SSE (equivalently, the MSE): $\text{SSE} = \mathbf{e}^\top\mathbf{e}$ with error vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$.
45
The Projection Matrix: with $P = I_p - \Phi A^{-1}\Phi^\top$, the error vector is $\mathbf{e} = P\mathbf{y}$ and $\text{SSE} = \mathbf{y}^\top P^2\mathbf{y}$.
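A sketch for checking these identities numerically, assuming the same $\Phi$, $\mathbf{y}$, and $\lambda_j$ as in the earlier sketch; the example calls are left as comments.

```python
import numpy as np

def projection_matrix(Phi, lambdas):
    """P = I_p - Phi (Phi^T Phi + Lambda)^(-1) Phi^T."""
    p = Phi.shape[0]
    A_inv = np.linalg.inv(Phi.T @ Phi + np.diag(lambdas))
    return np.eye(p) - Phi @ A_inv @ Phi.T

# With w_hat = A^(-1) Phi^T y, the error vector is e = y - Phi @ w_hat = P @ y,
# and the sum-squared error is SSE = y^T P^2 y.
# Example check (assumes Phi, y, lambdas from the previous sketch):
# P = projection_matrix(Phi, np.full(Phi.shape[1], 1e-3))
# e = P @ y
# sse = y @ P @ P @ y
```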
46
Introduction to Radial Basis Function Networks
Learning the Kernels
47
RBFN’s as Universal Approximators
A network with inputs $x_1, \ldots, x_n$, kernels $\phi_1, \ldots, \phi_m$, weights $w_{ij}$, and outputs $y_1, \ldots, y_l$, trained on a training set.
48
What to Learn?
Weights $w_{ij}$
Centers $\boldsymbol{\mu}_j$ of the $\phi_j$
Widths $\sigma_j$ of the $\phi_j$
Number of $\phi_j$ (model selection)
49
One-Stage Learning
50
In one-stage learning, all three sets of parameters are updated simultaneously; this may be suitable for non-stationary environments or on-line settings.
51
Two-Stage Training
Step 1: determine the centers $\boldsymbol{\mu}_j$ of the $\phi_j$, the widths $\sigma_j$ of the $\phi_j$, and the number of $\phi_j$.
Step 2: determine the weights $w_{ij}$, e.g., using batch learning. (A sketch follows below.)
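A minimal sketch of the two-stage procedure, assuming random subset selection for Step 1 (one of the methods listed later) and the regularized batch least-squares solution from the earlier slides for Step 2; the Gaussian kernel and the fixed width are assumptions.

```python
import numpy as np

def two_stage_rbfn(X, y, m, sigma=1.0, lam=1e-3, seed=0):
    """Two-stage training sketch.
    Step 1: pick centers (here, a random subset of training points) and a width.
    Step 2: determine the weights by regularized batch least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)]            # Step 1
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma**2))                              # Gaussian kernels (assumption)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)     # Step 2
    return centers, w
```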
52
Train the Kernels
53
Unsupervised Training
54
Methods
Subset Selection: random subset selection, forward selection, backward elimination
Clustering Algorithms: K-means (KMEANS), LVQ
Mixture Models: GMM
55
Subset Selection
56
Random Subset Selection
Randomly choose a subset of points from the training set as centers. This is sensitive to the initially chosen points, so adaptive techniques are used to tune the centers, the widths, and the number of points.
57
Clustering Algorithms
Partition the data points into K clusters.
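A minimal sketch of such a partition using Lloyd's K-means algorithm (one of the clustering methods listed above); the resulting cluster means can serve as kernel centers.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: partition the data points into K clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```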
58
Clustering Algorithms
Is such a partition satisfactory?
59
Clustering Algorithms
How about this?
60
Clustering Algorithms
A partition into four clusters, each marked by its center.
61
Introduction to Radial Basis Function Networks
Bias-Variance Dilemma
62
Goal Revisited
Ultimate goal: generalization, i.e., minimize the prediction error.
Goal of our learning procedure: minimize the empirical error.
63
Badness of Fit
Underfitting: a model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, producing excessive bias in the outputs.
Overfitting: a model that is too complex may fit the noise as well as the signal, producing excessive variance in the outputs.
64
Underfitting/Overfitting Avoidance
Model selection
Jittering
Early stopping
Weight decay
Regularization
Ridge regression
Bayesian learning
Combining networks
65
Best Way to Avoid Overfitting
Use lots of training data, e.g., 30 times as many training cases as there are weights in the network; for noise-free data, 5 times as many training cases as weights may be sufficient. Don't arbitrarily reduce the number of weights for fear of underfitting.
66
Badness of Fit Underfit Overfit
67
Badness of Fit Underfit Overfit
68
Bias-Variance Dilemma
However, it is not really a dilemma.
Underfit: large bias, small variance.
Overfit: small bias, large variance.
69
Bias-Variance Dilemma
More on overfitting: it easily leads to predictions far beyond the range of the training data, and it can produce wild predictions in multilayer perceptrons even with noise-free data. (Underfit: large bias, small variance. Overfit: small bias, large variance.)
70
Bias-Variance Dilemma
It is not really a dilemma: as we fit more (moving from underfit toward overfit), bias decreases while variance increases.
71
Bias-Variance Dilemma
The mean of the bias = ? The variance of the bias = ?
Around the true model (plus noise), different training sets yield different solutions (solutions obtained with training sets 1, 2, 3), all drawn from a set of functions that depends, e.g., on the number of hidden nodes used.
72
Bias-Variance Dilemma
The mean of the bias = ? The variance of the bias = ?
The scatter of the solutions around the true model (plus noise) separates into a bias and a variance component; the set of functions depends, e.g., on the number of hidden nodes used.
73
Model Selection
To reduce the variance: reduce the effective number of parameters, e.g., reduce the number of hidden nodes.
74
Bias-Variance Dilemma
Goal: make the learned function close to the true model, accounting for noise and bias, over a set of functions that depends, e.g., on the number of hidden nodes used.
75
Bias-Variance Dilemma
Goal: minimize the expected squared prediction error; the noise contribution is a constant.
76
Bias-Variance Dilemma
77
Bias-Variance Dilemma
Goal: the expected squared prediction error decomposes into noise + bias² + variance. The noise term cannot be minimized, so we minimize both bias² and variance.
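The decomposition the slide refers to can be written out explicitly (the standard bias-variance decomposition; the notation below is supplied here, not taken from the slides):

```latex
% Expected squared prediction error at x, with y = f(x) + \varepsilon,
% \varepsilon zero-mean noise with variance \sigma^2, and \hat{f}_D learned from data set D:
E_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise (cannot be minimized)}}
  + \underbrace{\big(f(x) - E_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big[\big(\hat{f}_D(x) - E_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
```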
78
Model Complexity vs. Bias-Variance
As model complexity (capacity) grows, bias² decreases while variance increases; the noise level stays constant.
79
Bias-Variance Dilemma
The goal is to pick the model complexity (capacity) at which the sum of bias² and variance, above the constant noise floor, is smallest.
80
Example (Polynomial Fits)
81
Example (Polynomial Fits)
82
Example (Polynomial Fits)
Degree 1 Degree 5 Degree 10 Degree 15
83
Introduction to Radial Basis Function Networks
The Effective Number of Parameters
84
Variance Estimation
Mean: $\mu = E[x]$ (in general, not available). Variance: $\sigma^2 = E[(x - \mu)^2]$.
85
When the mean is estimated from the data ($\bar{x}$), the variance estimate $\hat\sigma^2 = \frac{1}{p-1}\sum_{k=1}^{p}(x_k - \bar{x})^2$ loses 1 degree of freedom.
86
Simple Linear Regression
87
Simple Linear Regression
Minimize $\sum_{k=1}^{p}(y_k - a - b x_k)^2$ over the two parameters $a$ and $b$.
88
Mean Squared Error (MSE)
Minimizing the MSE of a simple linear regression loses 2 degrees of freedom (the two fitted parameters): $\hat\sigma^2 = \frac{1}{p-2}\sum_{k=1}^{p}(y_k - \hat{a} - \hat{b}x_k)^2$.
89
Variance Estimation
With $m$ parameters in the model, $m$ degrees of freedom are lost: $\hat\sigma^2 = \frac{1}{p-m}\sum_{k=1}^{p}(y_k - \hat{y}_k)^2$.
90
The Number of Parameters
For an RBFN with weights $w_1, \ldots, w_m$ (inputs $\mathbf{x} = (x_1, \ldots, x_n)$), the number of degrees of freedom is $m$, the number of weights.
91
The Effective Number of Parameters ($\gamma$)
For the regularized RBFN (weights $w_1, \ldots, w_m$, inputs $x_1, \ldots, x_n$), the effective number of parameters is determined by the projection matrix.
92
The Effective Number of Parameters ($\gamma$)
Facts: $\gamma = p - \operatorname{trace}(P) = \operatorname{trace}(\Phi A^{-1}\Phi^\top)$. (Proof via the projection matrix $P = I_p - \Phi A^{-1}\Phi^\top$.)
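A small numerical check of these facts, reusing the Gaussian design matrix from the earlier sketches: the two expressions for $\gamma$ agree, and $\gamma = m$ when all $\lambda_j = 0$.

```python
import numpy as np

def effective_parameters(Phi, lambdas):
    """gamma = p - trace(P) = m - trace(A^(-1) Lambda)."""
    p, m = Phi.shape
    Lam = np.diag(lambdas)
    A_inv = np.linalg.inv(Phi.T @ Phi + Lam)
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    gamma_from_P = p - np.trace(P)
    gamma_from_A = m - np.trace(A_inv @ Lam)
    return gamma_from_P, gamma_from_A

# With lambdas = 0 the two values coincide at m; with lambdas > 0 they drop below m.
```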
93
Regularization
Penalize models with large weights: $C = \text{SSE} + \sum_{j=1}^{m}\lambda_j w_j^2$. The effective number of parameters: $\gamma = p - \operatorname{trace}(P)$.
94
Regularization
Without penalty ($\lambda_i = 0$), there are $m$ degrees of freedom available to minimize the SSE (cost); the effective number of parameters is $\gamma = m$.
95
Regularization
With penalty ($\lambda_i > 0$), the freedom to minimize the SSE is reduced; the effective number of parameters becomes $\gamma < m$.
96
Variance Estimation
With regularization, $\gamma$ (the effective number of parameters) degrees of freedom are lost.
97
Variance Estimation: $\hat\sigma^2 = \dfrac{\mathbf{y}^\top P^2\mathbf{y}}{p - \gamma}$, with the effective number of parameters $\gamma = p - \operatorname{trace}(P)$.
98
Introduction to Radial Basis Function Networks
Model Selection
99
Model Selection
Goal: choose the fittest model.
Criterion: least prediction error.
Main tools (to estimate model fitness): cross-validation, the projection matrix.
Methods: weight decay (ridge regression), pruning and growing RBFN's.
100
Empirical Error vs. Model Fitness
Ultimate goal: generalization, i.e., minimize the prediction error.
Goal of our learning procedure: minimize the empirical error (MSE), in the hope of also minimizing the prediction error.
101
Estimating Prediction Error
When you have plenty of data, use independent test sets: e.g., use the same training set to train different models and choose the best by comparing them on the test set. When data is scarce, use cross-validation or the bootstrap.
102
Cross-Validation: the simplest and most widely used method for estimating prediction error. Partition the original set in several different ways and compute an average score over the different partitions, e.g., K-fold cross-validation, leave-one-out cross-validation, generalized cross-validation.
103
K-Fold CV: split the set D of available input-output patterns into k mutually exclusive subsets D1, D2, …, Dk. Train and test the learning algorithm k times, each time training on D \ Di and testing on Di.
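A minimal K-fold CV sketch; `fit` and `predict` are placeholders for any training and prediction routines (for instance, a two-stage RBFN fit like the earlier sketch).

```python
import numpy as np

def k_fold_cv(X, y, k, fit, predict, seed=0):
    """Split D into k mutually exclusive subsets D1..Dk; train on D \\ Di, test on Di."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        err = y[test] - predict(model, X[test])
        scores.append(np.mean(err ** 2))    # estimate of sigma^2 on fold i
    return float(np.mean(scores))            # average over the k partitions
```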
104
K-Fold CV Available Data
105
K-Fold CV: in each round, one of D1, D2, D3, …, Dk is held out as the test set and the remaining subsets form the training set; each round produces an estimate of $\sigma^2$, and the estimates are averaged.
106
Leave-One-Out CV: a special case of k-fold CV. Split the p available input-output patterns into a training set of size p − 1 and a test set of size 1. Average the squared error on the left-out pattern over the p possible ways of partitioning.
107
Error Variance Predicted by LOO
LOO is a special case of k-fold CV. Given the available input-output patterns, each LOO training set Di omits one pattern; a function is learned using Di as training set, and the error-square for the left-out element is recorded. The estimate of the prediction-error variance using LOO is the average of these error-squares.
108
Error Variance Predicted by LOO
For a given model, the function learned on each Di is the one with least empirical error on Di. The LOO estimate of the prediction-error variance (the average error-square for the left-out elements) serves as an index of the model's fitness: we want to find a model that also minimizes this.
109
Error Variance Predicted by LOO
How can the LOO estimate (the average error-square for the left-out elements) be computed? Are there any efficient ways?
110
Error Variance Predicted by LOO
The error-square for the left-out element can be written in terms of the projection matrix, giving the closed form $\hat\sigma_{LOO}^2 = \dfrac{1}{p}\,\mathbf{y}^\top P\,\big(\operatorname{diag}(P)\big)^{-2} P\,\mathbf{y}$.
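A numerical sketch comparing the brute-force LOO estimate with this closed form; the ridge-regularized Gaussian design matrix is an assumption carried over from the earlier sketches.

```python
import numpy as np

def loo_variance_closed_form(Phi, y, lam):
    """sigma^2_LOO = y^T P diag(P)^(-2) P y / p  (via the projection matrix)."""
    p, m = Phi.shape
    P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
    e = P @ y
    return float(np.mean((e / np.diag(P)) ** 2))

def loo_variance_brute_force(Phi, y, lam):
    """Retrain p times, each time leaving one pattern out."""
    p, m = Phi.shape
    errs = []
    for k in range(p):
        keep = np.arange(p) != k
        w = np.linalg.solve(Phi[keep].T @ Phi[keep] + lam * np.eye(m),
                            Phi[keep].T @ y[keep])
        errs.append((y[k] - Phi[k] @ w) ** 2)
    return float(np.mean(errs))

# The two estimates should agree (up to numerical precision) for the same Phi, y, lam.
```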
111
Generalized Cross-Validation: $\hat\sigma_{GCV}^2 = \dfrac{p\,\mathbf{y}^\top P^2\mathbf{y}}{\big(\operatorname{trace}(P)\big)^2}$.
112
More Criteria Based on CV
GCV (generalized CV), Akaike's information criterion, UEV (unbiased estimate of variance), FPE (final prediction error), BIC (Bayesian information criterion).
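A hedged sketch of these criteria in the forms given by Orr (1996); the exact constants, especially for FPE and BIC, should be checked against that reference.

```python
import numpy as np

def model_selection_criteria(P, y):
    """Variance-style model selection criteria from the projection matrix P and targets y."""
    p = len(y)
    yPy = float(y @ P @ P @ y)           # y^T P^2 y (the SSE)
    gamma = p - np.trace(P)              # effective number of parameters
    uev = yPy / (p - gamma)                                  # unbiased estimate of variance
    gcv = p * yPy / np.trace(P) ** 2                         # generalized cross-validation
    fpe = (p + gamma) / (p - gamma) * yPy / p                # final prediction error (Akaike)
    bic = (p + (np.log(p) - 1.0) * gamma) / (p - gamma) * yPy / p  # BIC form as recalled from Orr (1996); verify
    return {"UEV": uev, "GCV": gcv, "FPE": fpe, "BIC": bic}
```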
113
More Criteria Based on CV
114
More Criteria Based on CV
115
Regularization (Standard Ridge Regression)
Penalize models with large weights: $C = \text{SSE} + \lambda\sum_{j=1}^{m} w_j^2$, with a single global regularization parameter $\lambda$.
116
Regularization (Standard Ridge Regression), continued: minimizing $C = \text{SSE} + \lambda\sum_{j=1}^{m} w_j^2$ gives $\hat{\mathbf{w}} = (\Phi^\top\Phi + \lambda I_m)^{-1}\Phi^\top\mathbf{y}$.
117
Solution Review: $\hat{\mathbf{w}} = A^{-1}\Phi^\top\mathbf{y}$, with the projection matrix $P = I_p - \Phi A^{-1}\Phi^\top$ used to compute the model selection criteria.
118
Example Width of RBF r = 0.5
119
Example Width of RBF r = 0.5
120
Example Width of RBF r = 0.5. How do we determine the optimal regularization parameter effectively?
121
Optimizing the Regularization Parameter
Re-Estimation Formula
122
Local Ridge Regression
Re-Estimation Formula
123
Example — Width of RBF
124
Example — Width of RBF
125
Example — Width of RBF: there are two local minima.
Using the above re-estimation formula, the search gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.
126
Example — Width of RBF: there are two local minima.
127
Example — Width of RBF: there are two local minima.
128
Example — Width of RBF RMSE: Root Mean Squared Error
In a real case, it is not available.
129
Example — Width of RBF RMSE: Root Mean Squared Error
In a real case, it is not available.
130
Local Ridge Regression
Standard ridge regression: $C = \text{SSE} + \lambda\sum_{j=1}^{m} w_j^2$ (one global $\lambda$). Local ridge regression: $C = \text{SSE} + \sum_{j=1}^{m}\lambda_j w_j^2$ (one $\lambda_j$ per basis function).
131
Local Ridge Regression
In local ridge regression, $\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed from the model.
132
The Solutions
Linear regression: $\hat{\mathbf{w}} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$.
Standard ridge regression: $\hat{\mathbf{w}} = (\Phi^\top\Phi + \lambda I_m)^{-1}\Phi^\top\mathbf{y}$.
Local ridge regression: $\hat{\mathbf{w}} = (\Phi^\top\Phi + \Lambda)^{-1}\Phi^\top\mathbf{y}$, $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_m)$.
The corresponding projection matrices are used to compute the model selection criteria.
133
Optimizing the Regularization Parameters
Incremental operation. P: the current projection matrix. Pj: the projection matrix obtained by removing $\phi_j(\cdot)$.
134
Optimizing the Regularization Parameters
Solve Subject to
135
Optimizing the Regularization Parameters
Solve Subject to
136
Optimizing the Regularization Parameters
Solve for the optimal $\lambda_j$ subject to the constraint; if the optimum is $\lambda_j \to \infty$, remove $\phi_j(\cdot)$.
137
The Algorithm
Initialize the $\lambda_i$'s, e.g., by performing standard ridge regression.
Repeat the following until GCV converges:
Randomly select $j$ and compute the optimal $\lambda_j$.
Perform local ridge regression with the updated $\lambda_j$.
If GCV is reduced, keep the update; if $\lambda_j \to \infty$, remove $\phi_j(\cdot)$.
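A structural sketch of this algorithm; `optimal_local_lambda` is a hypothetical helper standing in for the re-estimation formula (which is not reproduced in this transcript), and `gcv` follows the criterion defined earlier.

```python
import numpy as np

def gcv(P, y):
    p = len(y)
    return p * float(y @ P @ P @ y) / np.trace(P) ** 2

def local_ridge_selection(Phi, y, optimal_local_lambda, init_lam=1e-3, max_iter=1000, seed=0):
    """Sketch: randomly revisit the lambda_j's, keep changes that reduce GCV,
    and prune phi_j when lambda_j goes to infinity."""
    rng = np.random.default_rng(seed)
    p, m = Phi.shape
    lambdas = np.full(m, init_lam)       # initialize, e.g., from standard ridge regression
    active = np.ones(m, dtype=bool)

    def proj(lams, act):
        F = Phi[:, act]
        return np.eye(p) - F @ np.linalg.inv(F.T @ F + np.diag(lams[act])) @ F.T

    best = gcv(proj(lambdas, active), y)
    for _ in range(max_iter):
        j = rng.choice(np.flatnonzero(active))
        lam_j = optimal_local_lambda(Phi, y, lambdas, j)   # hypothetical re-estimation step
        trial = lambdas.copy()
        trial[j] = lam_j
        trial_active = active.copy()
        if np.isinf(lam_j):
            trial_active[j] = False                        # lambda_j -> infinity: remove phi_j
        score = gcv(proj(trial, trial_active), y)
        if score < best:                                   # keep the update only if GCV is reduced
            lambdas, active, best = trial, trial_active, score
    return lambdas, active
```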
138
References
Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks.
Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).
139
Introduction to Radial Basis Function Networks
Incremental Operations