Neural Networks An Introduction Kasin Prakobwaitayakit Department of Electrical Engineering Chiangmai University
Brain and Machine The Brain: Pattern Recognition, Association, Complexity, Noise Tolerance. The Machine: Calculation, Precision, Logic.
The contrast in architecture The Von Neumann architecture uses a single processing unit; Tens of millions of operations per second Absolute arithmetic precision The brain uses many slow unreliable processors acting in parallel
Features of the Brain Ten billion neurons Average several thousand connections Hundreds of operations per second Reliability low Die off frequently (never replaced) Compensates for problems by massive parallelism
The biological inspiration The brain has been extensively studied by scientists. Vast complexity prevents all but rudimentary understanding. Even the behavior of an individual neuron is extremely complex
The biological inspiration Single “percepts” are distributed among many neurons. Localized parts of the brain are responsible for certain well-defined functions (e.g. vision, motion). Which features are integral to the brain's performance? Which are incidentals imposed by the fact of biology?
The Structure of Neurones
The Structure of Neurones A neurone has a cell body, a branching input structure (the dendrite) and a branching output structure (the axon). Axons connect to dendrites via synapses. Electro-chemical signals are propagated from the dendritic input, through the cell body, and down the axon to other neurones.
The Structure of Neurones A neurone only fires if its input signal exceeds a certain amount (the threshold) in a short time period. Synapses vary in strength: good connections allow a large signal, slight connections allow only a weak signal. Synapses can be either excitatory or inhibitory.
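A Classic Artificial Neuron (1) [Figure: neuron j with inputs a0, a1, a2, ..., an and a +1 input, weights wj0, wj1, wj2, ..., wjn, summation Sj, activation f(Sj), output Xj.]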
A Classic Artificial Neuron (2) All neurons contain an activation function which determines whether the signal is strong enough to produce an output. The figure shows several functions that could be used as an activation function.
Learning When the output is calculated, the desired output is then given to the program to modify the weights. After the modifications are done, the same inputs should produce outputs closer to the desired ones. Formula: Weight N = Weight N + learning rate * (Desired Output - Actual Output) * Input N
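A minimal sketch of the weight update described above, assuming the standard delta-rule form in which each weight changes by learning rate * (desired output - actual output) * input. The function names, the linear output unit and the sample values are illustrative assumptions, not part of the slides.

```python
def predict(weights, inputs):
    """Actual output of a single linear unit: the weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def update_weights(weights, inputs, desired, learning_rate=0.1):
    """Apply one delta-rule step and return the new list of weights."""
    error = desired - predict(weights, inputs)
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]


# Hypothetical example: repeated updates move the output toward the target.
weights = [0.0, 0.0]
inputs = [1.0, 2.0]
desired = 1.0
for _ in range(50):
    weights = update_weights(weights, inputs, desired)
final_output = predict(weights, inputs)
```

With these values each step halves the remaining error, so the output approaches the desired value quickly; a learning rate that is too large would make the updates overshoot instead.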
Tractable Architectures Feedforward Neural Networks Connections in one direction only Partial biological justification Complex models with constraints (Hopfield and ART). Feedback loops included Complex behaviour, limited by constraining architecture
Fig. 1: Multilayer Perceptron [Figure labels: Input Signals (External Stimuli), Input Layer, Adjustable Weights, Output Layer, Output Values.]
Types of Layer The input layer: Introduces input values into the network. No activation function or other processing. The hidden layer(s): Perform classification of features. In theory, two hidden layers are sufficient to solve any classification problem. Features imply more layers may be better.
Types of Layer (continued) The output layer. Functionally just like the hidden layers Outputs are passed on to the world outside the neural network.
A Simple Model of a Neuron [Figure: inputs y1, y2, y3, ..., yi with weights w1j, w2j, w3j, ..., wij feeding neuron j, which produces output O.] Each neuron has a threshold value. Each neuron has weighted inputs from other neurons. The input signals form a weighted sum. If the activation level exceeds the threshold, the neuron “fires”.
An Artificial Neuron [Figure: inputs y1, y2, y3, ..., yi with weights w1j, w2j, w3j, ..., wij feeding neuron j, whose activation passes through f(x) to give output O.] Each hidden or output neuron has weighted input connections from each of the units in the preceding layer. The unit performs a weighted sum of its inputs, and subtracts its threshold value, to give its activation level. The activation level is passed through a sigmoid activation function to determine the output.
Mathematical Definition Number all the neurons from 1 up to N. The output of the j'th neuron is $o_j$. The threshold of the j'th neuron is $\theta_j$. The weight of the connection from unit i to unit j is $w_{ij}$. The activation of the j'th unit is $a_j$. The activation function is written as $f(x)$.
Mathematical Definition Since the activation $a_j$ is given by the sum of the weighted inputs minus the threshold, we can write: $a_j = \sum_i w_{ij} o_i - \theta_j$ and $o_j = f(a_j)$.
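As an illustration of the definition above, the following sketch computes the activation as the weighted sum of the incoming outputs minus the threshold, then applies the activation function. The logistic choice for f and the numeric values are assumptions made for the example.

```python
import math


def logistic(x):
    """Logistic sigmoid, used here as an example activation function f."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, inputs, threshold, f=logistic):
    """Return (a_j, o_j), where a_j = sum_i w_ij * o_i - theta_j and o_j = f(a_j)."""
    activation = sum(w * o for w, o in zip(weights, inputs)) - threshold
    return activation, f(activation)


# Hypothetical unit with three incoming connections.
a_j, o_j = neuron_output(weights=[0.5, -1.0, 2.0], inputs=[1.0, 1.0, 0.5], threshold=0.5)
```

Here the weighted sum is 0.5 and the threshold is 0.5, so the activation is zero and the logistic output sits at its mid-point of 0.5.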
Activation functions Transforms neuron’s input into output. Features of activation functions: A squashing effect is required; this prevents accelerating growth of activation levels through the network. Simple and easy to calculate. Monotonically non-decreasing (order-preserving).
Standard activation functions The hard-limiting threshold function: corresponds to the biological paradigm (either fires or not). Sigmoid functions ('S'-shaped curves): the logistic function, $f(x) = \frac{1}{1 + e^{-ax}}$, and the hyperbolic tangent (symmetrical). Both functions have a simple derivative. Only the shape is important.
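The "simple" derivative of both sigmoid functions can be written in terms of the function's own output, which is what makes them cheap to use in training. The sketch below states the two closed forms and includes a numerical check; the helper names and the gain parameter default are illustrative assumptions.

```python
import math


def logistic(x, a=1.0):
    """Logistic function f(x) = 1 / (1 + exp(-a * x)) with gain a."""
    return 1.0 / (1.0 + math.exp(-a * x))


def logistic_derivative(x, a=1.0):
    """Derivative written through the output: a * f(x) * (1 - f(x))."""
    fx = logistic(x, a)
    return a * fx * (1.0 - fx)


def tanh_derivative(x):
    """Derivative of tanh written through the output: 1 - tanh(x) ** 2."""
    return 1.0 - math.tanh(x) ** 2


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate, used only to check the closed forms."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

Because the derivative reuses the value already computed in the forward pass, no extra exponentials are needed when the gradient is evaluated.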
Training Algorithms Adjust neural network weights to map inputs to outputs. Use a set of sample patterns where the desired output (given the inputs presented) is known. The purpose is to learn to generalize Recognize features which are common to good and bad exemplars
Back-Propagation A training procedure which allows multi-layer feedforward Neural Networks to be trained; Can theoretically perform “any” input-output mapping; Can learn to solve linearly inseparable problems.
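A small sketch of back-propagation on the Exclusive-OR problem, a classic linearly inseparable case. It is one possible implementation under stated assumptions: a single hidden layer of three sigmoid units, a squared-error measure, per-pattern updates, and a learning rate, epoch count and random seed chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass; each weight row ends with a bias weight (input fixed at +1)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output


def total_error(data, w_hidden, w_out):
    """Sum of squared errors over all patterns."""
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data)


def train(data, n_hidden=3, rate=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer network by gradient descent on squared error."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(x, w_hidden, w_out)
            # Output delta: error times the sigmoid derivative out * (1 - out).
            delta_out = (target - out) * out * (1.0 - out)
            # Hidden deltas: the output delta propagated back through w_out.
            delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
            for j, v in enumerate(hidden + [1.0]):
                w_out[j] += rate * delta_out * v
            for j, row in enumerate(w_hidden):
                for i, v in enumerate(x + [1.0]):
                    row[i] += rate * delta_hidden[j] * v
    return w_hidden, w_out


xor_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
w_hidden, w_out = train(xor_data)
```

Calling `total_error(xor_data, w_hidden, w_out)` before and after training shows how far the error has fallen. Gradient descent can still settle in a poor local minimum, so a different seed or more hidden units may be needed in practice.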
Activation functions and training For feedforward networks: A continuous function can be differentiated allowing gradient-descent. Back-propagation is an example of a gradient-descent technique. Reason for prevalence of sigmoid
Training versus Analysis Understanding how the network is doing what it does Predicting behaviour under novel conditions
Applications The properties of neural networks define where they are useful. Can learn complex mappings from inputs to outputs, based solely on samples Difficult to analyze: firm predictions about neural network behavior difficult; Unsuitable for safety-critical applications. Require limited understanding from trainer, who can be guided by heuristics.
Engine management The behaviour of a car engine is influenced by a large number of parameters: temperature at various points, fuel/air mixture, lubricant viscosity. A major company has used neural networks to dynamically tune an engine depending on current settings.
Signature recognition Each person's signature is different. There are structural similarities which are difficult to quantify. One company has manufactured a machine which recognizes signatures with a high level of accuracy. It considers speed in addition to gross shape, which makes forgery even more difficult.
Sonar target recognition Distinguish mines from rocks on sea-bed The neural network is provided with a large number of parameters which are extracted from the sonar signal. The training set consists of sets of signals from rocks and mines.
Stock market prediction “Technical trading” refers to trading based solely on known statistical parameters; e.g. previous price Neural networks have been used to attempt to predict changes in prices. Difficult to assess success since companies using these techniques are reluctant to disclose information.
Mortgage assessment Assess risk of lending to an individual. Difficult to decide on marginal cases. Neural networks have been trained to make decisions, based upon the opinions of expert underwriters. Neural network produced a 12% reduction in delinquencies compared with human experts.
Types of Problem Pattern Classification: Assign patterns to one of two or more classes. Regression: Predict the value of a continuous variable.
Pattern Classification Decide which class a particular pattern belongs to. In the most common case, there are only two classes. This implies that the neural network is modelling a step-function The most common use of Neural Networks.
Pattern Classification A feature is a measurement of some kind (a real number). Corresponds to inputs of neural network. A pattern is called a feature vector Points in N-dimensional space. Bifurcate feature space. Division is based on sample patterns.
Decision boundaries In simple cases, divide feature space by drawing a hyper-plane across it. Known as a decision boundary. Discriminant function: returns different values on opposite sides. Problems which can be thus classified are linearly separable.
Linear Separability [Figure: class A and class B points plotted on axes X1 and X2, separated by a decision boundary.]
Nearest neighbour New pattern is assigned the same class as its nearest neighbour. Can be improved by taking k nearest neighbours and assigning to the majority
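The nearest-neighbour rule and its k-neighbour majority vote can be sketched in a few lines. The function name `knn_classify` and the sample points are hypothetical; Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=1):
    """Assign the majority class among the k training points nearest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.9, 3.1)]
labels = ["A", "A", "A", "B", "B", "B"]
nearest = knn_classify(points, labels, (1.1, 1.0), k=1)
majority = knn_classify(points, labels, (3.0, 2.8), k=3)
```

Using an odd k with two classes avoids tied votes; with k = 1 the function reduces to the plain nearest-neighbour rule.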
Hyper-plane partitions A single Perceptron (i.e. output unit) with connections from each input can perform, and learn, a linear separation. Perceptrons have a step function activation. Units with a sigmoid activation also act as a linear discriminant, if interpreted correctly. Use activation mid-point
Hyper-plane partitions An extra layer models a convex hull (“an area with no dents in it”). Perceptron units can model this, but cannot learn it. Sigmoid units allow learning of convex hulls. Two layers add convex hulls together; this is sufficient to classify anything “sane”. In theory, further layers add nothing. In practice, extra layers may be better.
Different Non-Linearly Separable Problems Structure and types of decision regions: Single-Layer: half plane bounded by hyperplane. Two-Layer: convex open or closed regions. Three-Layer: arbitrary (complexity limited by number of nodes). [Figure columns, each showing classes A and B: Exclusive-OR Problem, Classes with Meshed Regions, Most General Region Shapes.]
Over-training With sufficient nodes can classify any training set exactly May have poor generalisation ability. Cross-validation with some patterns Typically 50% of training patterns Validation set error is checked each epoch Stop training if validation error goes up
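The stopping rule above can be sketched as a training loop that checks the validation error after every epoch. The callables `train_one_epoch` and `validation_error` are hypothetical placeholders for a real network, and the scripted error curve is made-up illustrative data.

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=100):
    """Run training epochs until the validation error goes up.

    train_one_epoch: callable that performs one pass over the training set.
    validation_error: callable returning the current validation-set error.
    Returns (epochs_run, best_error).
    """
    best_error = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error > best_error:
            return epoch, best_error  # validation error rose: stop training
        best_error = error
    return max_epochs, best_error


# Hypothetical stand-in for a real network: a scripted validation curve.
curve = iter([0.90, 0.60, 0.45, 0.40, 0.42, 0.50])
epochs_run, best = train_with_early_stopping(lambda: None, lambda: next(curve))
```

In a real setting the weights from the best epoch would be saved and restored, since the final epoch has already started to over-train.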
Training time How many epochs of training? Stop if the error fails to improve (has reached a minimum) Stop if the rate of improvement drops below a certain level Stop if the error reaches an acceptable level Stop when a certain number of epochs have passed
Rugby players & Ballet dancers [Figure: Height (m), axis from 1 to 2, plotted against Weight (Kg), axis from 50 to 100, with a region labelled “Ballet?”.]
Clustering: K-means Use Euclidean distance ||x - mean||. Randomly assign each point to one of K sets. Calculate the mean vector of each set. Reassign points to the set with the closest mean vector. Repeat until no further changes. Caution: scaling inputs is important.
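The K-means steps listed above can be sketched as follows. The handling of an empty set (re-seeding it with a random point), the iteration cap and the sample data are assumptions added for the example; they are not specified on the slide.

```python
import math
import random


def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k sets; returns (assignment, means)."""
    rng = random.Random(seed)
    # Step 1: randomly assign each point to one of the k sets.
    assignment = [rng.randrange(k) for _ in points]
    means = [None] * k
    for _ in range(max_iter):
        # Step 2: calculate the mean vector of each set.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Assumption: an empty set is re-seeded with a random point.
            means[c] = mean_vector(members) if members else list(rng.choice(points))
        # Step 3: reassign each point to the set with the closest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, means


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, means = k_means(points, k=2)
```

Because the distance is Euclidean, an input measured on a much larger scale than the others dominates the result, which is why scaling the inputs first matters.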
MLPs versus RBFs MLPs separate classes using hyper-planes. RBFs separate classes using hyper-spheres. MLPs may have one or more hidden layers; RBFs have just one. [Figure: decision regions in the X1-X2 plane for an MLP and for an RBF.]
MLPs versus RBFs (2) MLPs use distributed learning; RBFs use localized learning. RBFs usually require more hidden units to model the problem. RBFs are said to be more robust for novel data, and they train faster. However, RBFs can suffer from the ‘curse of dimensionality’.
3-Class Problem Classifying New Data using MLP LEGEND: Class A Class B Class C New Data (Unknown Class) Decision Boundary
3-Class Problem Classifying New Data using RBFN LEGEND: Class A Class B Class C New Data (Unknown Class) Decision Boundary
Radial Basis Function Architecture Centre pattern, stored in 1st layer weights. Distance measure, determines how far an input pattern is from the centre. Gaussian transfer function. [Figure labels: Inputs, 1st layer weights, Radial Units, 2nd layer weights, Outputs.]
A Radial Basis Function Neuron [Figure: inputs X1, X2, X3, ..., Xn and a +1 input, with centre weights ck1, ck2, ck3, ..., ckn.] Euclidean summation: $I_k = \|X - c_k\|$. Transfer function: $v_k = \varphi(I_k)$.
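A sketch of a single radial unit: the Euclidean distance from the input to the stored centre is passed through a transfer function. The Gaussian form exp(-I^2 / (2 * width^2)) and the `width` parameter are assumptions; it is one common parameterisation, not necessarily the one used in the slides.

```python
import math


def rbf_neuron(x, centre, width=1.0):
    """Output of one radial unit for input x and its stored centre."""
    distance = math.dist(x, centre)  # Euclidean distance from x to the centre
    return math.exp(-(distance ** 2) / (2.0 * width ** 2))  # Gaussian transfer


# Hypothetical values: the response is 1 at the centre and decays with distance.
at_centre = rbf_neuron([1.0, 2.0], [1.0, 2.0])
far_away = rbf_neuron([0.0, 0.0], [3.0, 4.0], width=5.0)
```

The unit responds strongly only to inputs near its centre, which is the localized behaviour that distinguishes it from a sigmoid unit.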
RBF 3 Stage Training 1. Find cluster centres by e.g. K-means clustering. 2. Find the width of the function (deviation) e.g. K-nearest neighbours. 3. Supervised training phase. Adjust 2nd layer weights to map input patterns onto the known output values.
Selecting Radial Centres Radial Sampling (or sub-sampling) randomly select centres from training points K-means centre assignment k means clustering Kohonen Self Organising Maps competitive learning (ref: Kohonen lecture)
Calculating Centre Widths Explicit Assignment. K Nearest Neighbours: need to specify the size of K. Isotropic: determined by the number of centres and how spread out they are, where d is the distance between the most distant centres and k is the number of centres.
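The isotropic rule can be illustrated with a commonly cited heuristic, width = d / sqrt(2k). This exact formula is an assumption, since the expression itself does not survive in the slide text; only the meanings of d and k are taken from it.

```python
import math
from itertools import combinations


def isotropic_width(centres):
    """Shared width d / sqrt(2 * k) for k centres (assumed heuristic).

    d is the distance between the two most distant centres; needs k >= 2.
    """
    k = len(centres)
    d = max(math.dist(a, b) for a, b in combinations(centres, 2))
    return d / math.sqrt(2.0 * k)


# Hypothetical centres: d = 5 and k = 2, so the width is 5 / 2.
width = isotropic_width([(0.0, 0.0), (3.0, 4.0)])
```

Every radial unit shares the same width under this rule, so widely spread centres give broad units and many tightly packed centres give narrow ones.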
Supervised phase Optimize second layer weights using the known outputs linear optimization using Pseudo inverse Can use backpropagation, quick propagation or delta-bar-delta instead
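The linear optimization of the second-layer weights can be sketched as a least-squares fit. This version solves the normal equations, which gives the same weights as the pseudo-inverse when the columns of the hidden-output matrix are linearly independent. The function names and the tiny data set are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_output_weights(hidden, targets):
    """Least-squares weights w solving (H^T H) w = H^T y.

    hidden: one row of radial-unit outputs per training pattern (matrix H).
    targets: the known output value for each pattern (vector y).
    """
    n = len(hidden[0])
    hth = [[sum(row[i] * row[j] for row in hidden) for j in range(n)] for i in range(n)]
    hty = [sum(row[i] * t for row, t in zip(hidden, targets)) for i in range(n)]
    return solve_linear_system(hth, hty)


# Hypothetical hidden-layer outputs and targets generated by weights [2, -1].
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, -1.0, 1.0]
weights = fit_output_weights(hidden, targets)
```

Since the output layer is linear in its weights, this fit is solved in one step rather than by iterative training, which is part of why RBF networks train quickly.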
Number of Patterns A simple formula gives a reasonable guideline: Use w/e training patterns, w = number of weights, e = desired accuracy Train until the error is less than e/2 on the training set The mathematical justification is quite complex (omitted!) In practice, can RARELY meet this criterion
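A short worked example of the guideline. It treats e as the allowed error fraction (for example 0.1 for 10%), which is an assumption about what "desired accuracy" means here; the network size is made up.

```python
def patterns_needed(num_weights, accuracy):
    """Guideline from the slide: use w / e training patterns."""
    return num_weights / accuracy


def training_error_target(accuracy):
    """Train until the training-set error is below e / 2."""
    return accuracy / 2.0


# Hypothetical network with 80 weights and e = 0.1.
needed = patterns_needed(80, 0.1)
target = training_error_target(0.1)
```

For this example the guideline asks for about 800 training patterns and a training error below 0.05, which shows why the criterion is rarely met for larger networks.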
Advantages & Disadvantages Advantages: Pattern recognition; solves problems with many inputs; damage to the network does not completely corrupt the output. Disadvantages: Slow compared with conventional computers; black-box model.
Conclusion True AI can be achieved, but our limited understanding of the human brain and the lack of suitable technology do not yet enable us to study this field further. It is highly possible that androids like Data could be created. What would happen if androids were superior to us?