COMP 2208 Dr. Long Tran-Thanh University of Southampton Neural Networks
Topics covered in the remaining lectures Classification: Neural networks (W7) K-NN (W8) Decision trees (W8) Search: Local search (W9) Reasoning: Bayes nets and Bayesian inference (W9) Sequential decision making: Markov decision processes (W9) Bandit theory (W10) Applied AI: Robotics + Vision (W10) Collaborative AI (W10)
But before neural nets: a little history John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon (1955)
A little history (cont’d) 7 main key requirements of AI: 1.Automatic computer 2.Language understanding 3.Usage of neuron nets 4.Computational efficiency 5.Self-improvement 6.Abstractions 7.Creativity The concept of learning John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon (1955) The concept of agent
Learning agents Environment Perception Behaviour Agent: anything capable of autonomous functioning in some environment (e.g., people, animals, robots, software agents) Learning: self-improvement through iterative actions / interactions Russell and Norvig book; Wooldridge and Jennings (1995)
Supervised vs. unsupervised learning Supervised learning: Given input X, predict output Y Input X Output Y Training set: a set of examples with correct input-output pairs dog woman? ? woman Labelled data: input data with its correct output
Supervised vs. unsupervised learning (cont’d) Unsupervised learning: Given input X, predict output Y Input X Output Y NO training sets: there is no labelled data Predict outcome of an investment (there’s no “correct” output) Common feature? (we don’t know the correct output) Semi-supervised learning (not covered): mix of supervised and unsupervised
Offline vs. online learning X1 X2 X3 … Xn Y1 Y2 Y3 … Yn Offline learning: all the inputs are available from the beginning Xn, …, X3, X2, X1 Yn, …, Y3, Y2, Y1 Online learning: inputs come into the system as a stream
Neural networks (finally)
What does a neural network do? Environment Perception Behaviour Categorize inputs Update belief model Update decision making policy Decision making Perception Behaviour
Idea: imitating human brains Why neural nets? 7 main key requirements of AI: 1.Automatic computer 2.Language understanding 3.Usage of neuron nets 4.Computational efficiency 5.Self-improvement 6.Abstractions 7.Creativity John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon (1955) Real neuron
Inspiration from the brain Warren McCulloch and Walter Pitts (1943) "A Logical Calculus of the Ideas Immanent in Nervous Activity". Contains key properties of real neurons: Synaptic weights Cumulative effect Threshold for activation "all or nothing" (neuron fires an output signal if the sum of inputs is above threshold) Neuron X1 X2 X3 Y: Output
The perceptron model (Rosenblatt, 1957) w1 w2 w3 neuron X1 X2 X3 Y: Output Threshold for activation "all or nothing" Cumulative effect Synaptic weights Self-training the weights
Nice! But how does it work? ? ? X1 X2 X3 Y: Output Intuition: Consider a black box that takes numerical inputs, does something to them, and gives a numerical output. We can observe: some input-output pairs Wouldn't it be great to know the generic relationship between inputs and outputs? Regression analysis: estimate relationship f from the observed data
Weighted sum of the inputs X1 X2 X3 Y: Output Idea 1: what if we consider f as a sum of the inputs? Y = X1 + X2 + X3 (not so expressive) Idea 2: if we allow the possibility of weighting each input differently, we gain some expressivity: Y = w1*X1 + w2*X2 + w3*X3 + b Vector form: Y = w·X + b … remind you of anything?
Weighted sum of the inputs (cont’d) 1-dimensional version (i.e., there is only 1 input): y = wx + b, the equation of a line on the plane y: dependent variable x: independent variable w: coefficient, rate, slope of line b: intercept (where the line crosses the y-axis) Higher dimensions (i.e., more than 1 input): hyperplanes Any straight-line (hyperplane) relationship between X and Y can be expressed by our black box
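The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_sum` and the numeric values are assumptions chosen for the example, not taken from the slides.

```python
def weighted_sum(inputs, weights, bias):
    """Return w1*x1 + ... + wn*xn + b for equal-length sequences."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# One input: the equation of a line, y = w*x + b (here w = 3, b = 1, x = 2).
y_line = weighted_sum([2.0], [3.0], 1.0)

# Three inputs: the same formula now describes a hyperplane.
y_plane = weighted_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.25)
```

With one input the function is exactly y = wx + b; with more inputs nothing changes except the length of the lists.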
Explaining the linear relationship: Weighted sum of the inputs (cont’d) Positive values of w mean that Y gets bigger as X gets bigger. Negative values of w mean that Y gets smaller as X gets bigger. The value of b tells us what Y should be when X = 0. Problem: in many cases, the relationship is not perfect Not all the points lie on the line Noisy data
Example: basketball ability vs. height
Linear regression No single straight line will match all of the X values (height) to the appropriate Y value (basketball skill). But we can imagine a "line of best fit" through the centre of the cloud of points that summarizes the relationship. This constitutes the statistical technique called linear regression.
Linear regression (cont’d) Which line is the best regression? / How to measure the efficiency of a particular regression line? Idea: method of least squares For a given line Y = wX + b, we can measure the differences between the actual Y values and those predicted by the line. The sum/average of the squared differences between the actual Y values and the predicted ones is a reasonable way to measure goodness of fit. Minimizing this value is the "training method" for regression analysis.
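A minimal sketch of the least-squares idea for a single input, assuming the closed-form solution for a line Y = wX + b. The data values are made up purely for illustration (they are not the lecture's basketball data), and the function names are hypothetical.

```python
def mean_squared_error(xs, ys, w, b):
    """Average squared difference between actual ys and the line's predictions."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def least_squares_line(xs, ys):
    """Return the (w, b) that minimise the MSE for a single-input line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Made-up illustrative data: heights in cm and a skill score.
heights = [160.0, 170.0, 180.0, 190.0]
skills = [45.0, 55.0, 50.0, 62.0]

w_fit, b_fit = least_squares_line(heights, skills)
mse_fit = mean_squared_error(heights, skills, w_fit, b_fit)

# A hand-picked line for comparison; its MSE cannot beat the fitted line's.
mse_guess = mean_squared_error(heights, skills, 0.3, 0.0)
```

Comparing `mse_fit` with `mse_guess` shows what "best fit" means here: no other straight line gives a smaller mean squared error on these points.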
Example: mean squared error Height (X) True ability (Y) Est. ability (Y’) Difference Squared diff Average of squared differences: Mean squared error (MSE) = ( )/4 = 100.5
Back to the basketball example Basketball skill = 0.3*Height
Back to our perceptrons w1 neuron X1 X2 X3 Y: Output w2 w3 f: activation function
Types of activation functions
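Two activation functions that these slides name explicitly are the all-or-nothing threshold and the sigmoid. The sketch below assumes a threshold at 0 by default and treats "at or above threshold" as firing; both choices are conventions picked for the example.

```python
import math


def threshold_activation(z, theta=0.0):
    """All-or-nothing: output 1 if z reaches the threshold theta, else 0."""
    return 1 if z >= theta else 0


def sigmoid_activation(z):
    """Smooth, differentiable squashing of z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


samples = [-2.0, 0.0, 2.0]
threshold_outputs = [threshold_activation(z) for z in samples]
sigmoid_outputs = [sigmoid_activation(z) for z in samples]
```

The threshold jumps from 0 to 1, while the sigmoid passes smoothly through 0.5 at z = 0, which is what makes it usable where a derivative is needed.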
Basketball example (again) Y = 1 Y = 0
Expressiveness of perceptrons Idea: consider the all-or-nothing threshold function: What sorts of problems can it solve? This suggests a mapping to True and False, i.e., logic problems.
Expressiveness of perceptrons (cont’d) AND gate OR gate
1 f = threshold function X1 X2 1 1 Y: Output f = threshold function X1 X2 1 1 Y: Output Expressiveness of perceptrons (cont’d) Perceptron as: AND gate Perceptron as: OR gate
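One way to realise the two gates with a threshold perceptron is sketched below. Both input weights are 1, as in the diagram; the bias values -1.5 (AND) and -0.5 (OR) are assumptions chosen for this example, and other values would work equally well.

```python
def threshold(z):
    """All-or-nothing activation: fire (1) when z is at least 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the threshold."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return threshold(z)


def and_gate(x1, x2):
    # Fires only when both inputs are 1: 1 + 1 - 1.5 >= 0.
    return perceptron([x1, x2], [1, 1], -1.5)


def or_gate(x1, x2):
    # Fires when at least one input is 1: 1 - 0.5 >= 0.
    return perceptron([x1, x2], [1, 1], -0.5)


# Rows are (x1, x2, AND, OR) for every input combination.
truth_table = [(a, b, and_gate(a, b), or_gate(a, b))
               for a in (0, 1) for b in (0, 1)]
```

The only difference between the two gates is the bias, i.e. how much total input is needed before the neuron fires.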
Training a perceptron So far so good, but how do we find the optimal weight values? Well, we can minimise the MSE… But how to do this? Hand-designing the weights: not very practical We want to train the network by showing it examples and somehow getting it to learn the relevant pattern. This is where the perceptron learning rule (delta rule, Widrow-Hoff rule) comes in.
The Widrow-Hoff learning rule Very simple idea: start with random weights. Present example input to the neuron and calculate the output. Compare output to target value (i.e., y), and nudge each weight slightly in the direction that would have helped to produce the correct output. Repeat until happy with performance. What you need to know is:
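The procedure above can be sketched as follows, assuming the common form of the update, w_i ← w_i + η (target − output) x_i, with a threshold output. The learning rate η = 0.1, the seeded random initialisation, the stopping rule and all function names are assumptions made for this example.

```python
import random


def predict(weights, bias, inputs):
    """Threshold perceptron output for one input vector."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


def train_perceptron(examples, learning_rate=0.1, max_epochs=1000, seed=0):
    """Nudge the weights after every misclassified example."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n_inputs = len(examples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # Move each weight in the direction that reduces the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # "happy with performance": a full pass with no errors
            break
    return weights, bias


# Training set for the AND gate: (inputs, target) pairs.
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
trained_weights, trained_bias = train_perceptron(and_examples)
learned_outputs = [predict(trained_weights, trained_bias, x)
                   for x, _ in and_examples]
```

Because AND is linearly separable, this loop is guaranteed to stop with all four examples classified correctly; the exact weights it ends on depend on the random starting point.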
Limitations of the perceptron model A perceptron cuts its input space into a "high output" (y = 1) region and a "low output" (y = 0) region. The cut is linear (straight line, hyperplane, etc.), so the perceptron can only solve linearly separable problems Linearly separable problems: the two regions can be separated by a single line (hyperplane) in the input space This means that there are problems a perceptron can’t solve
Limitations of the perceptron model (cont’d) Example: XOR gate (Minsky and Papert, 1969)
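The XOR limitation can be illustrated with a small brute-force search. This is a sketch rather than a proof: it only tries weights and biases on a coarse grid (from -2 to 2 in steps of 0.25, an arbitrary choice), and the function names are hypothetical.

```python
from itertools import product


def perceptron(x1, x2, w1, w2, b):
    """Single threshold perceptron with two inputs."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0


grid = [i * 0.25 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]
and_targets = [0, 0, 0, 1]


def count_solutions(targets):
    """Count grid settings (w1, w2, b) that reproduce the target truth table."""
    return sum(
        1
        for w1, w2, b in product(grid, repeat=3)
        if [perceptron(x1, x2, w1, w2, b) for x1, x2 in inputs] == targets
    )


xor_solutions = count_solutions(xor_targets)  # no setting works
and_solutions = count_solutions(and_targets)  # many settings work
```

The search finds many weight settings for AND and none for XOR, which matches the geometric argument: no single straight line separates (0,1) and (1,0) from (0,0) and (1,1).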
Limitations of the perceptron model (cont’d)
Multi-layered neural networks How can we overcome this issue? Possible solution: multi-layer neural nets Instead of having inputs feeding directly into output neurons, let’s add some intervening "hidden" neurons in between. The brain is certainly like that. Intuition: If we think of perceptrons as dividing a space into low vs high output with a single line... … then multiple perceptrons = multiple dividing lines Non-linear separation can be approximated by a set of straight lines
Multi-layered neural networks (cont’d)
f f X1 X2 1 1 Y Y f f f f Input layer Hidden layers Output layer Perceptrons feeding into other perceptrons... Our black box is quite complicated now; can approximate arbitrary functions given enough hidden neurons.
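As a concrete instance of perceptrons feeding into other perceptrons, the sketch below hand-wires a network with one hidden layer that computes XOR. The particular weights and biases are assumptions chosen for the example: one hidden unit acts as OR, the other as AND, and the output unit fires for "OR but not AND".

```python
def perceptron(inputs, weights, bias):
    """Threshold unit: 1 if the weighted sum plus bias is at least 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


def xor_network(x1, x2):
    # Hidden layer: two perceptrons looking at the same inputs.
    h_or = perceptron([x1, x2], [1, 1], -0.5)
    h_and = perceptron([x1, x2], [1, 1], -1.5)
    # Output layer: fires when OR is on and AND is off.
    return perceptron([h_or, h_and], [1, -1], -0.5)


xor_outputs = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
```

Each hidden unit draws one dividing line in the input space; the output unit combines the two half-planes into the band between them, a region that a single perceptron cannot carve out.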
Training multi-layered neural networks This sounds cool! But how can we train this complex black box? Idea 1: We could use the usual delta-rule approach to train the weights between the last hidden layer and the output layer. Input layer Hidden layer 1 Hidden layer N Output layer Issue: what about the weights of the other hidden layers? Solution: backpropagation of errors (Rumelhart, Hinton, and Williams, 1986)
The backpropagation method An extension of the delta rule: We build an error function such that: E = sum of squared differences between the actual and target output values. We employ a bit of calculus to calculate the partial derivative of E with respect to each weight (we use chain rule to do so) Input layer Hidden layer 1 Hidden layer N Output layer Use a differentiable activation function We can thus know which way we need to "nudge" each weight for a given training example. In practice: we use the sigmoid function
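The chain-rule calculation can be sketched for a tiny network with two inputs, two sigmoid hidden neurons and one sigmoid output, using E = (output − target)² for a single training example. The network size, parameter values and function names are assumptions made for this example; the finite-difference comparison at the end is only a sanity check on the derivatives.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return hidden activations and the network output."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def squared_error(params, x, target):
    return (forward(params, x)[1] - target) ** 2


def backprop(params, x, target):
    """Partial derivatives of E with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    # Output neuron: dE/dz = dE/dy * dy/dz, with sigmoid' = y * (1 - y).
    delta_out = 2.0 * (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    # Hidden neurons: propagate delta_out back through W2 (chain rule).
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def numeric_grad_W1(params, x, target, j, i, eps=1e-6):
    """Central finite-difference estimate of dE/dW1[j][i]."""
    W1, b1, W2, b2 = params
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[j][i] += eps
    minus[j][i] -= eps
    return (squared_error((plus, b1, W2, b2), x, target)
            - squared_error((minus, b1, W2, b2), x, target)) / (2 * eps)


# Arbitrary illustrative parameters: (W1, b1, W2, b2).
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, target = [1.0, 0.5], 1.0

analytic_W1 = backprop(params, x, target)[0]
max_gap = max(abs(analytic_W1[j][i] - numeric_grad_W1(params, x, target, j, i))
              for j in range(2) for i in range(2))
```

Subtracting a small multiple of each derivative from its weight is the "nudge" for that training example; `max_gap` being tiny indicates that the chain-rule derivatives agree with the numerical estimates.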
Some further issues of neural networks How fast should the learning rate be? How many hidden neurons do I need for a given problem? Some guidelines available but the only reliable approach is to try different values and see how it goes. How do I get things "just right"? Other issues: Large datasets Large input space Computational issues
Modern time neural nets Another historical sum up: 1. A long time ago in a galaxy far, far away.... (in the ’s) McCulloch-Pitts, perceptron, multi-layer neural nets 2. Minsky and Papert book (1969) The XOR counter example (… I feel disturbance in the force) They were mistakenly believed to have conjectured the same limitations for multi-layer NNs 3. Backpropagation (Hinton et al.) – 1980’s A new hope
Still historical sum up 4. Another disturbance: Real-world applications are very complex Requires new solutions to handle large data + complexity 5. Deep learning: Hinton et al., 2007 New heroes
Modern day neural nets: deep learning Main idea of deep learning: transform the input space into higher level abstractions with lower dimensions (unsupervised learning) Multi-layer architecture (typically with many hidden layers) – hence the name deep learning Each layer is responsible for a space transformation step By doing so, the complexity of non-linearity is decreased This is, however, very expensive. Needs to rely on new computational solutions: GPUs, grid computing
Acknowledgement Thanks to Dr. Brendan Neville for many slides + contents