Transfer functions: hidden possibilities for better neural networks
Włodzisław Duch and Norbert Jankowski
Department of Computer Methods, Nicholas Copernicus University, Torun, Poland
Why is this an important issue?
MLPs are universal approximators, so is there no need for other transfer functions (TF)? No: a wrong bias leads to poor results and complex networks.
Example of a 2-class problem: Class 1 inside the sphere, Class 2 outside.
MLP: at least N+1 hyperplanes, O(N^2) parameters.
RBF: 1 Gaussian, O(N) parameters.
Second example: Class 1 in the corner defined by the (1,1,...,1) hyperplane, Class 2 outside.
MLP: 1 hyperplane, O(N) parameters.
RBF: many Gaussians, O(N^2) parameters, poor approximation.
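The two examples can be checked numerically. The sketch below is only an illustration under assumed settings (dimension N = 5, uniform random inputs, a unit sphere, a hyperplane offset `theta` chosen by hand); none of these values come from the slides. It shows that one Gaussian node reproduces the spherical border and one fan-in node reproduces the corner border, each with N+1 parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                   # input dimension (illustrative)
X = rng.uniform(-1.0, 1.0, size=(2000, N))

# Problem 1: Class 1 inside a sphere of radius r centred at the origin.
r = 1.0
inside = np.linalg.norm(X, axis=1) < r

def gaussian_node(X, center, b):
    """One radial node: exp(-||x - center||^2 / b^2)."""
    return np.exp(-np.sum((X - center) ** 2, axis=1) / b ** 2)

# Thresholding a single Gaussian at its value on the sphere surface
# reproduces the spherical border: N centre coordinates + 1 width.
g = gaussian_node(X, np.zeros(N), b=1.0)
pred_sphere = g > np.exp(-r ** 2)

# Problem 2: Class 1 in the corner cut off by a hyperplane with normal (1,...,1).
theta = 0.5 * N                         # hypothetical offset of the hyperplane
corner = X.sum(axis=1) > theta

def sigmoid_node(X, w, theta, slope=50.0):
    """One fan-in node: logistic function of slope * (w.x - theta)."""
    return 1.0 / (1.0 + np.exp(-slope * (X @ w - theta)))

# A single hyperplane node needs N weights + 1 threshold.
pred_corner = sigmoid_node(X, np.ones(N), theta) > 0.5

print("sphere, 1 Gaussian node:  ", np.mean(pred_sphere == inside))
print("corner, 1 hyperplane node:", np.mean(pred_corner == corner))
```

Swapping the node types (hyperplanes for the sphere, Gaussians for the corner) would require many more nodes to reach comparable agreement, which is the point of the slide.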
Inspirations
The logical rule IF x_1 > 0 & x_2 > 0 THEN Class1 ELSE Class2 is properly represented by neither MLP nor RBF!
Result: on some datasets (cf. hypothyroid) decision trees and logical rules perform significantly better than MLPs!
Speed of learning and network complexity depend on the TF.
Fast learning requires flexible "brain modules": TF.
Biological inspirations: sigmoidal neurons are a crude approximation at the basic level of neural tissue.
Interesting brain functions are performed by interacting minicolumns, implementing complex functions.
Modular networks: networks of networks.
First step beyond single neurons: transfer functions providing flexible decision borders.
Transfer functions
Transfer function f(I(X)): an activation I(X) computed from the input vector X, followed by a scalar output o(I).
1. Fan-in, scalar product activation W·X: hyperplanes.
2. Distance functions as activations, for example Gaussian functions.
3. Mixed activation functions.
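A minimal sketch of the three activation types combined with two output functions follows. The function names and the particular mixed form (fan-in term plus a weighted distance term, with a hypothetical weight `alpha`) are assumptions made for illustration; the slide's own formulas were not preserved.

```python
import numpy as np

def fan_in(x, w, theta=0.0):
    """Scalar-product activation I(X) = W.X - theta; contours are hyperplanes."""
    return np.dot(w, x) - theta

def distance(x, t):
    """Distance activation I(X) = ||X - T||; contours are spheres around T."""
    return np.linalg.norm(x - t)

def mixed(x, w, t, alpha=1.0, theta=0.0):
    """One possible mixed activation: fan-in term plus a weighted distance term."""
    return fan_in(x, w, theta) + alpha * distance(x, t)

def logistic(I):
    """Sigmoidal output function."""
    return 1.0 / (1.0 + np.exp(-I))

def gaussian(I, b=1.0):
    """Localized output function with dispersion b."""
    return np.exp(-(I / b) ** 2)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
t = np.array([1.0, 0.0])

print(logistic(fan_in(x, w)))      # sigmoidal (MLP-style) node
print(gaussian(distance(x, t)))    # radial (RBF-style) node
print(logistic(mixed(x, w, t)))    # node with a mixed activation
```

Keeping activation and output as separate functions makes it easy to try combinations other than the two standard ones.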
Taxonomy - activation functions
Taxonomy - output functions
Taxonomy - TF
TF in Neural Networks
Choices:
1. Homogeneous NN: select the best TF, try several types. Ex: RBF networks; SVM kernels (today 50=>80% change).
2. Heterogeneous NN: one network, several types of TF. Ex: Adaptive Subspace SOM (Kohonen 1995), linear subspaces. Projections on a space of basis functions.
3. Input enhancement: adding f_i(X) to achieve separability. Ex: functional link networks (Pao 1989), tensor products of inputs; D-MLP model.
Heterogeneous:
1. Start from a large network with different TF, use regularization to prune.
2. Construct the network by adding nodes selected from a pool of candidates.
3. Use very flexible TF, force them to specialize.
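Input enhancement can be illustrated on XOR, which a single hyperplane cannot separate. The sketch below appends one tensor-product feature x1*x2 and trains a plain perceptron on both versions of the data. The choice of XOR, the ±1 encoding, and the helper names are assumptions for illustration, not taken from the slides.

```python
import numpy as np

# XOR on inputs in {-1, +1}: class is +1 when the two signs differ.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

def enhance(X):
    """Append the tensor-product feature f(X) = x1 * x2 to each input vector."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

def perceptron(X, y, epochs=20):
    """Plain perceptron rule with a bias term; returns weights and bias."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:      # misclassified: move the hyperplane
                w, b = w + yi * xi, b + yi
    return w, b

def accuracy(X, y, w, b):
    """Fraction of vectors whose predicted sign matches the label."""
    return float(np.mean(np.sign(X @ w + b) == y))

w_raw, b_raw = perceptron(X, y)
w_enh, b_enh = perceptron(enhance(X), y)
print("raw inputs:     ", accuracy(X, y, w_raw, b_raw))
print("enhanced inputs:", accuracy(enhance(X), y, w_enh, b_enh))
```

With the extra feature the classes are separated by a hyperplane that depends on x1*x2 alone, so the same simple node type succeeds once the input space is enriched.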
Most flexible TFs
Conical functions: mixed activations.
Lorentzian: mixed activations.
Bicentral: separable functions.
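As a rough sketch of a separable bicentral function, the code below builds a soft window in each dimension from a rising and a falling sigmoid and multiplies the windows together. The parametrization (centre `t`, half-width `b`, slope `s` per dimension, 3N parameters in total) is an assumption of this example; the formula from the slide was not preserved.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bicentral(x, t, b, s):
    """Soft window: product over dimensions of a rising and a falling sigmoid.

    t: window centre, b: half-width, s: slope, all per dimension
    (3N parameters; this parametrization is an assumption of the sketch).
    """
    left = sigmoid(s * (x - t + b))           # switches on near t - b
    right = 1.0 - sigmoid(s * (x - t - b))    # switches off near t + b
    return np.prod(left * right, axis=-1)

t = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])
s = np.array([10.0, 10.0])

print(bicentral(np.array([0.0, 0.0]), t, b, s))   # centre of the box: close to 1
print(bicentral(np.array([3.0, 0.0]), t, b, s))   # outside along x1: close to 0
```

Large slopes give nearly rectangular boxes suited to crisp logical rules, while small slopes give smooth, bell-like densities.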
Bicentral + rotations
6N parameters, most general.
Box in N-1 dim x rotated window.
Rotation matrix with band structure makes 2x2 rotations.
Some properties of TFs
For logistic functions:
Renormalization of a Gaussian gives a logistic function, where W_i = 4D_i/b_i^2.
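The relation W_i = 4D_i/b_i^2 can be verified numerically. The sketch assumes the renormalization takes the form G(X; D, b) / (G(X; D, b) + G(X; -D, b)) with G(X; D, b) = exp(-sum_i (x_i - D_i)^2 / b_i^2); this form is an assumption consistent with the stated weights, since the slide's equation was not preserved. Under it, the quadratic terms cancel in the ratio and only the linear term 4 D_i x_i / b_i^2 remains in the exponent.

```python
import numpy as np

def gaussian(x, D, b):
    """Gaussian centred at D with per-dimension dispersions b."""
    return np.exp(-np.sum(((x - D) / b) ** 2, axis=-1))

def renormalized_gaussian(x, D, b):
    """Gaussian at +D normalized by the sum of Gaussians at +D and -D."""
    g_plus, g_minus = gaussian(x, D, b), gaussian(x, -D, b)
    return g_plus / (g_plus + g_minus)

def logistic(x, W):
    """Logistic function of the fan-in activation W.x."""
    return 1.0 / (1.0 + np.exp(-(x @ W)))

rng = np.random.default_rng(1)
D = np.array([0.5, -1.0, 2.0])          # illustrative centre
b = np.array([1.0, 2.0, 1.5])           # illustrative dispersions
x = rng.normal(size=(100, 3))

W = 4.0 * D / b ** 2                    # weights stated on the slide
print(np.allclose(renormalized_gaussian(x, D, b), logistic(x, W)))
```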
Example of input transformation
Minkowski's distance function:
Sigmoidal activation changed to:
Adding a single input renormalizing the vector:
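The idea can be sketched as follows, assuming the extra input is x_r = sqrt(R^2 - ||X||^2), so that every extended vector has the same norm R. That specific form and the helper names are assumptions of this example, since the slide's equations were not preserved. For weight and input vectors of equal norm R, the fan-in activation satisfies W·X = R^2 - ||W - X||^2 / 2, so a sigmoidal node effectively becomes a distance-based one, and the Euclidean distance can then be replaced by a Minkowski distance with another exponent.

```python
import numpy as np

def minkowski(x, y, alpha=2.0):
    """Minkowski distance; alpha=2 is Euclidean, alpha=1 is Manhattan."""
    return np.sum(np.abs(x - y) ** alpha) ** (1.0 / alpha)

def renormalize(X, R=None):
    """Append one input x_r = sqrt(R^2 - ||x||^2) so every row has norm R."""
    sq = np.sum(X ** 2, axis=1)
    if R is None:
        R = np.sqrt(sq.max())          # smallest radius that fits all vectors
    extra = np.sqrt(np.maximum(R ** 2 - sq, 0.0))
    return np.column_stack([X, extra]), R

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(50, 3))
Xr, R = renormalize(X)
W = Xr[0]                              # hypothetical weight vector, also of norm R

# With ||W|| = ||X|| = R the fan-in activation is a function of distance only:
# W.X = R^2 - ||W - X||^2 / 2
dots = Xr @ W
dists = np.array([minkowski(W, x, alpha=2.0) for x in Xr])
print(np.allclose(dots, R ** 2 - 0.5 * dists ** 2))
print(np.allclose(np.linalg.norm(Xr, axis=1), R))
```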
Conclusions
Radial and sigmoidal functions are not the only choice.
StatLog report: large differences between RBF and MLP on many datasets.
Better learning cannot repair the wrong bias of a model.
Systematic investigation and taxonomy of TF is worthwhile.
Networks should select/optimize their functions.
Open questions:
Optimal balance between complex nodes and interactions (weights)?
How to train heterogeneous networks?
How to optimize nodes in constructive algorithms?
Hierarchical, modular networks: nodes that are networks themselves.
The End? Perhaps the beginning...