Artificial Neural Network

1 Artificial Neural Network
[Figure: a feedforward network with input units, hidden units and output units. The output is compared with the target and the weights are updated.]

2 Artificial Neural Networks (ANN)
Perceptrons
Gradient Descent
Backpropagation

3 Biological Inspiration
Human brain has (est.) 10^11 neurons
Each connected to 10^4 others on average
Switching times on the order of 10^-3 sec.
  10^-10 sec. with computer systems
It takes 0.1 second for one to visually recognize one's mother
  How much time does a computer system need?

4 ANNs Are Simplifications!
Let's be realistic:
Biological neural systems are much more complex in connection structure!
Neurons communicate in parallel
  But most ANNs are run on sequential machines
Neurons output a complex series of spikes
  But ANNs output one value

5 Motivation For Studying ANNs
To model biological learning
  Can we do that without building much more complex ANNs?
To invent effective machine learning algorithms
  If this is the goal, does it matter whether the ANNs resemble biological systems?

6 Example: ANN for Car Driving
ALVINN (1993)
Travelled at 70 mph for 90 miles
Other vehicles present

7 When To Use ANNs
Input: attribute-value pairs
Output: a vector of discrete or real values
  quite flexible
Noisy data
Long training time acceptable
Fast response required in application
Black-box solution acceptable

8 Perceptron
One type of ANN is based on Perceptrons
Input: a vector of real values x1 … xn
Connections with weights w1 … wn
Linear combination: w1x1 + … + wnxn
Compare with threshold w0
  Output 1 if w1x1 + … + wnxn > w0
  Output –1 otherwise
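As a rough illustration of the rule above, here is a minimal Python sketch of a single perceptron. The function name `perceptron_output` and the sample weights, threshold and inputs are assumptions made for this example; they do not come from the slides.

```python
def perceptron_output(weights, threshold, x):
    """Return +1 if the weighted sum of the inputs exceeds the threshold, else -1."""
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > threshold else -1

# Illustrative values only (assumed, not from the slides).
print(perceptron_output([0.4, -0.2], 0.1, [1.0, 0.5]))  # weighted sum is about 0.3
print(perceptron_output([0.4, -0.2], 0.1, [0.0, 1.0]))  # weighted sum is -0.2
```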

9 Structure of a Perceptron
[Figure: a perceptron. Inputs x0 = 1, x1, x2, ..., xn are connected through weights w0 (threshold), w1, w2, ..., wn; the weights are to be trained.]

10 Linear Separability
The input represents an n-dimensional space
A perceptron represents a hyperplane
  That cuts the space into two classes: 1 and –1
Σ wixi can only represent linear hyperplanes
Examples that can be separated by linear hyperplanes are called linearly separable
Not all sets of examples are linearly separable

11 Example on Linear Separability
[Figure: two plots in the (x1, x2) plane. One shows examples that are linearly separable; the other shows examples that are not linearly separable (XOR).]

12 Expressive Power of Perceptrons
Many boolean functions can be represented
  Represent true by 1, false by –1
  e.g. x1 AND x2: make w0 = 0.8, w1 = w2 = 0.5
AND, OR, NAND, NOR can be represented by one perceptron
  Not XOR, which is not linearly separable
However, every boolean function can be represented by a network of perceptrons two levels deep
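The two-level claim can be checked with a small sketch. It reuses the AND weights from the slide; the OR and NAND weights and thresholds, and all function names, are assumptions chosen for illustration (other values would work too). XOR is built here as (x1 OR x2) AND (x1 NAND x2).

```python
def unit(w1, w2, w0, x1, x2):
    """Threshold unit: +1 if w1*x1 + w2*x2 > w0, else -1."""
    return 1 if w1 * x1 + w2 * x2 > w0 else -1

def AND(x1, x2):
    return unit(0.5, 0.5, 0.8, x1, x2)     # weights from the slide

def OR(x1, x2):
    return unit(0.5, 0.5, -0.8, x1, x2)    # assumed weights

def NAND(x1, x2):
    return unit(-0.5, -0.5, -0.8, x1, x2)  # assumed weights

def XOR(x1, x2):
    # Second level: one perceptron combines the outputs of two first-level perceptrons.
    return AND(OR(x1, x2), NAND(x1, x2))

# Print the truth tables with true = 1 and false = -1.
for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))
```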

13 Training Perceptrons
The hope is to adjust the weights so that the actual output matches the target
Initialize wi with random weights
Repeat as many times as necessary:
  Feed the perceptron with one example
  If misclassified, then adjust the weights
Until all classifications are correct

14 The Perceptron Training Rule
wi ← wi + Δwi where Δwi = η (t – o) xi
Learning rate η: small, positive, e.g. 0.1
Suppose xi = 0.8, target t = +1, output o = –1
  Δwi = 0.1 (1 – (–1)) 0.8 = 0.16
  This will bring o closer to the target +1
Proven: will converge within finite iterations to classify all linearly separable instances
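The training rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `predict` and `train_perceptron` are made up here, the weights start at zero rather than at random values so that the run is reproducible, and `w[0]` is the weight on a constant input x0 = 1 (so it plays the role of the negated threshold). Training stops after the first pass with no misclassification.

```python
def predict(w, x):
    """Output of a perceptron whose w[0] is the weight on the constant input x0 = 1."""
    total = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if total > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    w = [0.0] * (len(examples[0][0]) + 1)  # zero start for reproducibility (assumption)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:
                mistakes += 1
                xs = [1.0] + list(x)
                # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if mistakes == 0:  # every example is classified correctly
            break
    return w

# AND with true = 1 and false = -1, which is linearly separable.
and_data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
weights = train_perceptron(and_data)
print(all(predict(weights, x) == t for x, t in and_data))
```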

15 Gradient Descent, Principle
Can handle data that are not linearly separable
  Will converge to a best-fit approximation
Principle: gradient descent
  Define the error
  Then move to maximally reduce error (how?)
Linear unit: output with no threshold
  o(x) = w · x

16 Gradient Descent: Training Error
One common way to measure training error: use the squared difference between target and actual output, summed over all examples
  E(w) = ½ Σ_{d∈D} (td – od)²
where D = training examples, d = one example
td = target output for d, od = actual output for d
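A minimal sketch of this error measure for a linear unit o = w · x follows. It assumes the conventional factor of ½ in front of the sum (which matches the gradient used on the later slides); the function name and the sample data are illustrative only.

```python
def squared_error(w, examples):
    """E(w) = 1/2 * sum over examples d of (t_d - o_d)^2, with o_d = w . x_d."""
    total = 0.0
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        total += (t - o) ** 2
    return 0.5 * total

# Two made-up examples: outputs are 3.0 and 1.0, targets are 3.0 and 0.0.
print(squared_error([1.0, 2.0], [([1.0, 1.0], 3.0), ([1.0, 0.0], 0.0)]))
```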

17 Gradient Descent Calculation
Gradients can be measured by computing the derivative of E with respect to each wi
  ∇E(wi) = ∂E(w) / ∂wi
To maximally reduce E(w), choose the Δwi's:
  Δwi = – η ∇E(wi)
  where η is the learning rate (> 0, small)
wi ← wi + Δwi

18 Algorithm for Gradient Descent
We need to compute ∇E(w) efficiently
  As we need to compute it iteratively
∂E(w) / ∂wi = Σ_{d∈D} (td – od) (– xid)
  D is the set of all examples
  td and od are target and actual output for data d
  xid is the value of input xi for data d
Δwi = η Σ_{d∈D} (td – od) xid
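For reference, the gradient above can be derived in a few steps. This sketch assumes the squared error is defined with a factor of ½ and that the unit is linear, so that o_d = w · x_d.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_i}
 &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
 &= \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \mathbf{w}\cdot\mathbf{x}_d\bigr) \\
 &= \sum_{d \in D}(t_d - o_d)\,(-x_{id}),
\qquad\text{so}\qquad
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}
           = \eta\sum_{d \in D}(t_d - o_d)\,x_{id}.
\end{aligned}
```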

19 Procedure Gradient-Descent
Initialize wi with small random values
Until termination condition met Do
  For all i, Δwi ← 0
  For each example <x, t> Do
    Input x to compute o
    Δwi ← Δwi + η (t – o) xi
  For each linear unit weight wi Do
    wi ← wi + Δwi
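The procedure above might look as follows in Python. This is a sketch under assumptions: the termination condition is simply a fixed number of passes, the weights start at zero instead of small random values so that the result is reproducible, and the function name and toy data (targets generated by t = 1 + 2·x1, with a constant first input of 1) are made up for the example.

```python
def gradient_descent(examples, eta=0.05, epochs=500):
    n = len(examples[0][0])
    w = [0.0] * n                      # zero start (assumption, for reproducibility)
    for _ in range(epochs):            # termination: fixed number of passes (assumption)
        delta = [0.0] * n              # for all i, delta_w_i <- 0
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]       # accumulate over all examples
        w = [wi + di for wi, di in zip(w, delta)]      # one update per pass
    return w

toy_data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
print(gradient_descent(toy_data))  # should approach [1, 2]
```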

20 Remarks on Gradient-Descent
If η is too large, may over-step the minimum
  Variation: gradually reduce η over iterations
Weights are changed once every time all examples are considered
  Can be slow (thousands of steps)
  Can be trapped in a local minimum
Variation: use incremental (stochastic) gradient descent

21 Stochastic Gradient Descent
Initialize wi with small random values
Until termination condition met Do
  For each example <x, t> Do
    Input x to compute o
    wi ← wi + η (t – o) xi (Delta Rule)
Unlike the perceptron rule, the delta rule involves no threshold
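For comparison with the batch version, here is a sketch of the incremental (stochastic) variant, in which the weights change after every example. The same assumptions apply: a fixed number of passes as the termination condition, a zero start for reproducibility, and an invented function name and toy data set (t = 1 + 2·x1).

```python
def stochastic_gradient_descent(examples, eta=0.05, epochs=500):
    n = len(examples[0][0])
    w = [0.0] * n                      # zero start (assumption, for reproducibility)
    for _ in range(epochs):            # termination: fixed number of passes (assumption)
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit, no threshold
            # Delta rule: update immediately after each example.
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

toy_data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
print(stochastic_gradient_descent(toy_data))  # should approach [1, 2]
```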

22 Standard vs Stochastic GD
Stochastic GD defines the error per example, Ed(w), as opposed to ED(w) over all examples in standard GD
Standard GD takes bigger steps, but each step takes longer to compute
Stochastic GD can sometimes avoid falling into a local minimum

23 Perceptron Training vs Delta Rule
Perceptron training uses a threshold
  Delta Rule training does not use a threshold
Perceptron training converges to perfectly classify linearly separable data
  Delta Rule does not assume linear separability but only converges asymptotically
Remarks: linear programming can handle linearly separable data too, brilliantly!

24 Multilayer Networks, Introduction
Multiple inputs & outputs + hidden units allow us to represent more complex functions
Will linear units work?
We want a function that is:
  continuous
  nonlinear
  monotonically increasing

25 Sigmoid Function
Edward Tsang (Copyright), Wednesday, 26 April 2017
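The formula on this slide did not survive extraction. Assuming it is the standard logistic function σ(y) = 1 / (1 + e^–y), which is continuous, nonlinear and monotonically increasing, a minimal sketch is given below. The function names are illustrative. The derivative has the form σ(y)(1 – σ(y)), which is where the o (1 – o) factor in the backpropagation rule comes from.

```python
import math

def sigmoid(y):
    """Logistic function; intended for moderate y (math.exp overflows for very negative y)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d(sigmoid)/dy = sigmoid(y) * (1 - sigmoid(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))
```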

26 Sigmoid Unit
[Figure: a sigmoid unit. Inputs x0 = 1, x1, x2, ..., xn are connected through weights w0 (threshold), w1, w2, ..., wn; the weights are to be trained.]

27 Training Error Redefined
Training error: sum over all output units
  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (tkd – okd)²
There are multiple outputs, hence the sum over k ∈ outputs
tkd and okd are the kth target & actual outputs for example d
Target is to minimize E (as before)
(Stochastic) gradient descent can be used

28 Multilayer Networks, Details
The network structure is normally fixed
Feedforward vs recurrent
wji: weight on the input from node i to unit j (xji is that input)
δn: error term associated with unit n
  analogous to (t – o) in delta rule
Details of weight-tuning rule derivation omitted

29 Backpropagation Algorithm
Parameters: n_in inputs, n_hidden hidden units, n_out outputs
Initialize all weights wji with small random values
Until termination condition met Do
  For each training example <x, t> Do
    Input x to compute output of every node
    Propagate errors backward (elaborated next)

30 Propagating Errors Backward
For each output unit k, compute error δk
  δk ← ok (1 – ok) (tk – ok)
For each hidden unit h, compute error δh
  δh ← oh (1 – oh) Σ_{k∈outputs} wkh δk
Update each network weight wji
  wji ← wji + Δwji where Δwji = η δj xji
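The three steps above can be sketched for a network with one hidden layer of sigmoid units. Everything specific here is an assumption for illustration: the function names, the list-of-lists weight layout (one row of incoming weights per unit, with index 0 holding the weight on a constant input of 1), and the logistic sigmoid. One call performs a single stochastic update for one example.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    """x starts with the constant input 1; returns (inputs to output layer, outputs)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    hidden_in = [1.0] + hidden  # constant input for the output layer
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden_in))) for row in w_out]
    return hidden_in, out

def backprop_step(x, t, w_hidden, w_out, eta=0.1):
    hidden_in, out = forward(x, w_hidden, w_out)
    # delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit k
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # delta_h = o_h (1 - o_h) * sum_k w_kh delta_k for each hidden unit h
    delta_hidden = []
    for h in range(1, len(hidden_in)):
        back = sum(w_out[k][h] * delta_out[k] for k in range(len(out)))
        delta_hidden.append(hidden_in[h] * (1 - hidden_in[h]) * back)
    # w_ji <- w_ji + eta * delta_j * x_ji
    new_out = [[w + eta * delta_out[k] * hidden_in[j] for j, w in enumerate(row)]
               for k, row in enumerate(w_out)]
    new_hidden = [[w + eta * delta_hidden[h] * x[i] for i, w in enumerate(row)]
                  for h, row in enumerate(w_hidden)]
    return new_hidden, new_out
```

With a small learning rate, one step on a single example should move the output slightly closer to its target.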

31 Momentum
A variation of Backpropagation that updates the weights differently
Let Δwji(n) be the update to the weight from i to j in iteration n
  Δwji(n) = η δj xji + α Δwji(n – 1)
  where 0 ≤ α < 1 is the momentum
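A minimal sketch of the momentum update for a single weight is shown below. The function name and the parameter `grad_term` (standing for the product δj · xji) are hypothetical, and the numbers are illustrative. Part of the previous update is carried over, so repeated steps in the same direction grow larger.

```python
def momentum_update(weight, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """Return (new_weight, delta) with delta = eta * grad_term + alpha * prev_delta."""
    delta = eta * grad_term + alpha * prev_delta
    return weight + delta, delta

# Three steps with the same gradient term: each step is larger than the last.
w, d = 0.0, 0.0
steps = []
for _ in range(3):
    w, d = momentum_update(w, 1.0, d)
    steps.append(d)
print(steps)
```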

32 Convergence & Local Minima
Backpropagation will converge
  Using gradient descent
  … but not necessarily to the global minimum
Why does it work effectively in practice?
  Multi-dimensions provide "escape routes"?
Gradient descent over complex surfaces is still poorly understood

33 Expressive Power, Feedforward ANN
All boolean functions can be represented by two layers
  may need an exponential number of hidden units
Bounded continuous functions can be approximated by two layers
Arbitrary functions can be approximated by three layers

34 Hypothesis Space & Inductive Bias
Hypothesis space: the space of possible network weights
  the space is n-dimensional, where n is the number of weights in the network
  the space is continuous (contrast with decision trees and version spaces)
  this enables well-defined gradient descent
Inductive bias: difficult to characterize
  smooth interpolation between data points?

35 Hidden Layer Representation
Normally use few hidden units
Hidden units have lots of freedom
  Only input & output units are governed by data
Backpropagation defines hidden layer features that are not explicit in the input
E.g. given 8 inputs, 3 hidden units found the representation 000, 001, ...

36 Overfitting
When should Backpropagation stop?
  When error is less than some threshold?
  Over-training could lead to overfitting
Measuring overfitting: use validation data to measure generalization accuracy
Terminate learning once the validation error rises significantly, then restore the earlier weights
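One simple way to act on this is sketched below, assuming the validation error has been recorded after every training pass. The function name, the `patience` parameter (how many passes without improvement to tolerate) and the sample error values are all hypothetical; the slides do not specify a particular stopping criterion.

```python
def early_stopping(validation_errors, patience=3):
    """Return the index of the pass whose weights should be restored."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err   # remember the best weights so far
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` passes
    return best_epoch

# Made-up validation errors: they improve, then start to rise.
print(early_stopping([0.9, 0.5, 0.4, 0.45, 0.5, 0.6]))
```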

