1
Artificial Neural Network
[Figure: a network with input units, hidden units, and output units; the output is compared with the target and the weights are updated.]
Edward Tsang (Copyright), Wednesday, 26 April 2017
2
Artificial Neural Networks (ANN)
- Perceptrons
- Gradient Descent
- Backpropagation
3
Biological Inspiration
- The human brain has an estimated $10^{11}$ neurons
- Each is connected to $10^{4}$ others on average
- Switching times are on the order of $10^{-3}$ seconds
  - Compare $10^{-10}$ seconds with computer systems
- It takes 0.1 second for a person to visually recognize their mother
  - How much time does a computer system need?
4
ANNs Are Simplifications!
- Let's be realistic: biological neural systems are much more complex in connection structure!
- Neurons communicate in parallel
  - But most ANNs are run on sequential machines
- Neurons output a complex series of spikes
  - But ANNs output one value
5
Motivation For Studying ANNs
- To model biological learning
  - Can we do that without building much more complex ANNs?
- To invent effective machine learning algorithms
  - If this is the goal, does it matter whether the ANNs resemble biological systems?
6
Example: ANN for Car Driving
- ALVINN (1993)
- Travelled at 70 mph for 90 miles
- Other vehicles present
7
When To Use ANNs
- Input: attribute-value pairs
- Output: a vector of discrete or real values (quite flexible)
- Noisy data
- Long training time acceptable
- Fast response required in application
- Black-box solution acceptable
8
Perceptron
- One type of ANN is based on perceptrons
- Input: a vector of real values $x_1 \dots x_n$
- Connections with weights $w_1 \dots w_n$
- Linear combination: $w_1 x_1 + \dots + w_n x_n$
- Compare with threshold $w_0$
  - Output 1 if $w_1 x_1 + \dots + w_n x_n > w_0$
  - Output -1 otherwise
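The thresholding rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron_output` and the example weights, threshold, and inputs are assumptions chosen for the demonstration, not values from the slides.

```python
from typing import Sequence


def perceptron_output(weights: Sequence[float], threshold: float,
                      x: Sequence[float]) -> int:
    """Return +1 if w1*x1 + ... + wn*xn exceeds the threshold w0, else -1."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return 1 if weighted_sum > threshold else -1


# Hypothetical weights, threshold, and inputs, for illustration only.
print(perceptron_output([0.4, -0.2], 0.1, [1.0, 0.5]))  # weighted sum about 0.3, above 0.1
print(perceptron_output([0.4, -0.2], 0.1, [0.0, 1.0]))  # weighted sum -0.2, below 0.1
```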
9
Structure of a Perceptron
[Figure: a perceptron with a constant input x0 = 1 weighted by w0 (threshold) and inputs x1 … xn weighted by w1 … wn; the weights are to be trained.]
10
Linear Separability
- The input represents an n-dimensional space
- A perceptron represents a hyperplane
  - That cuts the space into two classes: 1 and -1
- $\sum_i w_i x_i$ can only represent linear decision surfaces (hyperplanes)
- Examples that can be separated by a hyperplane are called linearly separable
- Not all sets of examples are linearly separable
11
Example on Linear Separability
[Figure: two scatter plots of + and - examples over axes x1 and x2; panels: "Linearly separable" and "Not linearly separable (XOR)".]
12
Expressive Power of Perceptrons
- Many boolean functions can be represented
  - Represent true by 1, false by -1
  - E.g. $x_1$ AND $x_2$: make $w_0 = 0.8$, $w_1 = w_2 = 0.5$
- AND, OR, NAND, NOR can each be represented by one perceptron
- Not XOR, which is not linearly separable
- However, every boolean function can be represented by a network of perceptrons two levels deep
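A quick check of these claims, as a sketch. The AND weights (w0 = 0.8, w1 = w2 = 0.5) come from the slide; the OR and NAND weights and the particular two-level construction of XOR are assumptions picked for this illustration, and other choices would work equally well.

```python
from typing import Sequence


def perceptron(weights: Sequence[float], threshold: float,
               x: Sequence[float]) -> int:
    """Output +1 if the weighted sum exceeds the threshold, else -1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > threshold else -1


def and_gate(x1: int, x2: int) -> int:
    # Weights and threshold as given on the slide.
    return perceptron([0.5, 0.5], 0.8, [x1, x2])


def or_gate(x1: int, x2: int) -> int:
    # Assumed weights: only the threshold differs from AND.
    return perceptron([0.5, 0.5], -0.8, [x1, x2])


def nand_gate(x1: int, x2: int) -> int:
    # Assumed weights: negated inputs with a negative threshold.
    return perceptron([-0.5, -0.5], -0.8, [x1, x2])


def xor_two_level(x1: int, x2: int) -> int:
    # Two levels deep: XOR = (x1 OR x2) AND (x1 NAND x2).
    return and_gate(or_gate(x1, x2), nand_gate(x1, x2))


for a in (1, -1):
    for b in (1, -1):
        print(a, b, "AND:", and_gate(a, b), "XOR:", xor_two_level(a, b))
```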
13
Training Perceptrons
- The hope is to adjust the weights so that the actual output matches the target
- Initialize each $w_i$ with a random weight
- Repeat as many times as necessary:
  - Feed the perceptron with one example
  - If it is misclassified, then adjust the weights
- Until all classifications are correct
14
The Perceptron Training Rule
- $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$
- Learning rate $\eta$: small, positive, e.g. 0.1
- Suppose $x_i = 0.8$, target $t = +1$, output $o = -1$
  - $\Delta w_i = 0.1 \times (1 - (-1)) \times 0.8 = 0.16$
  - This will bring $o$ closer to the target +1
- Proven: will converge within a finite number of iterations to classify all linearly separable instances
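The training loop and the update rule can be sketched as follows. Several details here are assumptions made for a reproducible illustration: the weights start at zero rather than at random values, the threshold is folded in as a weight `w[0]` on a constant input x0 = 1 (so `w[0]` plays the role of the negated threshold), and the names `predict`, `weight_delta`, and `train_perceptron` are hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Perceptron output with w[0] as the weight on the constant input x0 = 1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


def weight_delta(eta: float, t: int, o: int, xi: float) -> float:
    """Perceptron training rule: delta w_i = eta * (t - o) * x_i."""
    return eta * (t - o) * xi


def train_perceptron(examples: Sequence[Example], eta: float = 0.1,
                     max_epochs: int = 100) -> List[float]:
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # assumption: zeros instead of random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:  # adjust the weights only when misclassified
                mistakes += 1
                w[0] += weight_delta(eta, t, o, 1.0)
                for i, xi in enumerate(x, start=1):
                    w[i] += weight_delta(eta, t, o, xi)
        if mistakes == 0:  # all classifications are correct
            break
    return w


# Boolean AND with true = 1 and false = -1 (a linearly separable set).
AND_EXAMPLES: List[Example] = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]

weights = train_perceptron(AND_EXAMPLES)
print(weights)
print([predict(weights, x) for x, _ in AND_EXAMPLES])
print(weight_delta(0.1, 1, -1, 0.8))  # the slide's worked example, about 0.16
```

On data that are not linearly separable the loop simply stops after `max_epochs` without reaching zero mistakes.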
15
Gradient Descent, Principle
- Can handle data that are not linearly separable
- Will converge to a best-fit approximation
- Principle: gradient descent
  - Define the error
  - Then move to maximally reduce the error (how?)
- Linear unit: output with no threshold, $o(\vec{x}) = \vec{w} \cdot \vec{x}$
16
Gradient Descent: Training Error
- One common way to measure training error: use the squared difference between target and actual output, summed over all examples:
  $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
- where D = the set of training examples, d = one example
- $t_d$ = target output for d, $o_d$ = actual output for d
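A minimal sketch of this error measure for a linear unit. It assumes the usual form with a factor of one half, E(w) = 1/2 Σ (t_d - o_d)^2, which is consistent with the gradient given on the later slides; the function names and the tiny data set are hypothetical.

```python
from typing import Sequence, Tuple

Example = Tuple[Sequence[float], float]


def linear_output(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear unit: o = w . x (no threshold)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def training_error(w: Sequence[float], examples: Sequence[Example]) -> float:
    """Half the sum of squared differences between target and actual output."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)


# Hypothetical weights and examples: outputs are 3.0 and 1.0, targets 3.0 and 2.0.
w = [1.0, 2.0]
examples = [([1.0, 1.0], 3.0), ([1.0, 0.0], 2.0)]
print(training_error(w, examples))  # 0.5 * (0^2 + 1^2) = 0.5
```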
17
Gradient Descent Calculation
- Gradients can be measured by computing the derivative of E with respect to each $w_i$:
  $\nabla E(w_i) = \partial E(\vec{w}) / \partial w_i$
- To maximally reduce $E(\vec{w})$, choose the $\Delta w_i$'s:
  $\Delta w_i = -\eta \nabla E(w_i)$, where $\eta$ is the learning rate (> 0, small)
- $w_i \leftarrow w_i + \Delta w_i$
18
Algorithm for Gradient Descent
- We need to compute $\nabla E(\vec{w})$ efficiently
  - As we need to compute it iteratively
- $\partial E(\vec{w}) / \partial w_i = \sum_{d \in D} (t_d - o_d)(-x_{id})$
  - D is the set of all examples
  - $t_d$ and $o_d$ are the target and actual output for data d
  - $x_{id}$ is the value of input $x_i$ for data d
- $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$
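The gradient on this slide follows from the squared-error definition by the chain rule. A short derivation, assuming the error has the form $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$ and the unit is linear, $o_d = \vec{w} \cdot \vec{x}_d$:

```latex
\begin{aligned}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
  &= \sum_{d \in D}(t_d - o_d)\,
     \frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr) \\
  &= \sum_{d \in D}(t_d - o_d)(-x_{id}), \\
\Delta w_i
  &= -\eta\,\frac{\partial E}{\partial w_i}
   = \eta\sum_{d \in D}(t_d - o_d)\,x_{id}.
\end{aligned}
```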
19
Procedure Gradient-Descent
Initialize each $w_i$ with small random values
Until termination condition met, Do
  For all i, $\Delta w_i \leftarrow 0$
  For each example <x, t>, Do
    Input x to compute o
    For each i, $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  For each linear unit weight $w_i$, Do
    $w_i \leftarrow w_i + \Delta w_i$
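The procedure translates fairly directly into Python. The sketch below makes a few assumptions for reproducibility: weights start at zero instead of small random values, the termination condition is a fixed number of passes, and the data set (targets generated by t = 2·x1 - 1, with a constant first input of 1) is invented for the illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]


def gradient_descent(examples: Sequence[Example], eta: float = 0.05,
                     epochs: int = 500) -> List[float]:
    """Batch gradient descent for a single linear unit o = w . x."""
    n = len(examples[0][0])
    w = [0.0] * n  # assumption: zeros instead of small random values
    for _ in range(epochs):  # assumed termination condition: fixed pass count
        delta = [0.0] * n  # for all i, delta w_i <- 0
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # input x to compute o
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]  # accumulate over all examples
        for i in range(n):
            w[i] += delta[i]  # weights change once per pass over the data
    return w


# Hypothetical data: first input is the constant 1, target t = 2 * x1 - 1.
DATA: List[Example] = [([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0),
                       ([1.0, 2.0], 3.0), ([1.0, 3.0], 5.0)]

print(gradient_descent(DATA))  # approaches [-1.0, 2.0]
```

With this data a much larger `eta` (roughly above 0.1) makes the updates overshoot and diverge, which is the over-stepping problem the next slide mentions.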
20
Remarks on Gradient-Descent
- If $\eta$ is too large, may over-step the minimum
  - Variation: gradually reduce $\eta$ over iterations
- Weights are changed once every time all examples have been considered
  - Can be slow (thousands of steps)
- Can be trapped in a local minimum
- Variation: use incremental gradient descent
21
Stochastic Gradient Descent
Initialize each $w_i$ with small random values
Until termination condition met, Do
  For each example <x, t>, Do
    Input x to compute o
    For each i, $w_i \leftarrow w_i + \eta (t - o) x_i$ (Delta Rule)
Unlike the perceptron rule, the delta rule involves no threshold
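A sketch of the incremental (stochastic) version, in which the weights are updated after every example by the delta rule rather than once per pass. As assumptions for a reproducible illustration, the weights start at zero, the examples are visited in a fixed order for a fixed number of passes, and the data set (t = 2·x1 - 1 with a constant first input of 1) is invented.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]


def stochastic_gradient_descent(examples: Sequence[Example], eta: float = 0.05,
                                epochs: int = 1000) -> List[float]:
    """Incremental gradient descent for a linear unit, using the delta rule."""
    n = len(examples[0][0])
    w = [0.0] * n  # assumption: zeros instead of small random values
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # no threshold applied
            for i in range(n):
                w[i] += eta * (t - o) * x[i]  # delta rule, applied immediately
    return w


# Hypothetical data: first input is the constant 1, target t = 2 * x1 - 1.
DATA: List[Example] = [([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0),
                       ([1.0, 2.0], 3.0), ([1.0, 3.0], 5.0)]

print(stochastic_gradient_descent(DATA))  # approaches [-1.0, 2.0]
```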
22
Standard vs Stochastic GD
- Stochastic GD defines a per-example error $E_d(\vec{w})$, as opposed to the error over all examples $E_D(\vec{w})$ in standard GD
- Standard GD takes bigger steps, but each step takes longer to compute
- Stochastic GD has a chance of not falling into a local minimum
23
Perceptron Training vs Delta Rule
- Perceptron training uses a threshold
  - Delta Rule training does not use a threshold
- Perceptron training converges to perfectly classify linearly separable data
  - The Delta Rule does not assume linear separability, but only converges asymptotically
- Remark: linear programming can handle linearly separable data too, brilliantly!
24
Multilayer Networks, Introduction
- Multiple inputs & outputs + hidden units
  - To allow us to represent more complex functions
- Will linear units work?
- We want a function that is:
  - continuous
  - nonlinear
  - monotonically increasing
25
Sigmoid Function
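The formula and plot on this slide did not survive in the transcript. The sketch below assumes the standard logistic function, σ(y) = 1 / (1 + e^(-y)), which is continuous, nonlinear, and monotonically increasing, and whose derivative σ(y)(1 - σ(y)) matches the o(1 - o) factor used in the backpropagation slides. The function names are hypothetical.

```python
import math


def sigmoid(y: float) -> float:
    """Standard logistic function (assumed form): 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y: float) -> float:
    """Derivative of the logistic function: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)


for y in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(y, round(sigmoid(y), 4), round(sigmoid_derivative(y), 4))
```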
26
Sigmoid Unit
[Figure: a sigmoid unit with a constant input x0 = 1 weighted by w0 (threshold) and inputs x1 … xn weighted by w1 … wn; the weights are to be trained.]
27
Training Error Redefined
- Training error: sum over all output units:
  $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$
- There are multiple outputs, hence $k \in outputs$
- $t_{kd}$ and $o_{kd}$ are the kth target and actual outputs for example d
- The goal is to minimize E (as before)
- (Stochastic) gradient descent can be used
28
Multilayer Networks, Details
- Normally the structure is fixed
- Feedforward vs recurrent
- $w_{ji}$: weight on the input $x_{ji}$ from node i to unit j
- $\delta_n$: error associated with unit n
  - Analogous to (t - o) in the delta rule
- Details of the derivation of the weight-tuning rule are omitted
29
Backpropagation Algorithm
Parameters: $n_{in}$ inputs, $n_{hidden}$ hidden units, $n_{out}$ outputs
Initialize all weights with small random values
Until termination condition met, Do
  For each training example <x, t>, Do
    Input x to compute the output of every node
    Propagate errors backward (elaborated on the next slide)
30
Propagating Errors Backward
- For each output unit k, compute error $\delta_k$:
  $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
- For each hidden unit h, compute error $\delta_h$:
  $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$
- Update each network weight $w_{ji}$:
  $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \delta_j x_{ji}$
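One backward pass for a network with a single hidden layer of sigmoid units can be sketched as below. This is an illustration under stated assumptions, not the deck's own code: the sigmoid is taken to be the standard logistic function, each unit has a threshold weight on a constant input of 1, the initial weights are fixed hypothetical values rather than random ones, and the names `forward` and `backprop_step` are invented.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(y: float) -> float:
    return 1.0 / (1.0 + math.exp(-y))


def forward(w_hidden: List[List[float]], w_out: List[List[float]],
            x: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Compute the output of every node; index 0 of each input list is the constant 1."""
    xb = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + hidden
    out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return xb, hb, out


def backprop_step(w_hidden: List[List[float]], w_out: List[List[float]],
                  x: Sequence[float], t: Sequence[float], eta: float = 0.5) -> List[float]:
    """Propagate errors backward for one example and update the weights in place."""
    xb, hb, out = forward(w_hidden, w_out, x)
    # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_kh delta_k (pre-update weights)
    delta_hidden = []
    for h in range(len(w_hidden)):
        o_h = hb[h + 1]
        downstream = sum(w_out[k][h + 1] * delta_out[k] for k in range(len(w_out)))
        delta_hidden.append(o_h * (1 - o_h) * downstream)
    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hb[j]
    for h, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[h] * xb[i]
    return out


# Hypothetical 2-2-1 network and a single training example.
w_hidden = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
w_out = [[0.05, -0.1, 0.2]]
print("before:", forward(w_hidden, w_out, [1.0, 0.0])[2])
for _ in range(200):
    backprop_step(w_hidden, w_out, [1.0, 0.0], [1.0])
print("after: ", forward(w_hidden, w_out, [1.0, 0.0])[2])  # moves toward the target 1.0
```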
31
Momentum
- A variation on Backpropagation that updates the weights differently
- Let $\Delta w_{ji}(n)$ be the update to the weight from i to j in iteration n
- $\Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji}(n - 1)$
- where $0 \le \alpha < 1$ is the momentum
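The momentum update can be sketched for a single weight. The function name `momentum_update` and the values η = 0.1 and α = 0.9 are assumptions for the illustration; `gradient_term` stands for the product δ_j · x_ji from the ordinary backpropagation update.

```python
def momentum_update(gradient_term: float, previous_delta: float,
                    eta: float = 0.1, alpha: float = 0.9) -> float:
    """Delta w(n) = eta * (delta_j * x_ji) + alpha * Delta w(n - 1)."""
    return eta * gradient_term + alpha * previous_delta


# With a constant gradient term, successive updates grow toward eta / (1 - alpha).
delta = 0.0
for n in range(1, 6):
    delta = momentum_update(1.0, delta)
    print(n, round(delta, 4))
```

With α = 0 this reduces to the plain update Δw = η · δ_j · x_ji.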
32
Convergence & Local Minima
- Backpropagation will converge
  - Using gradient descent
  - … but not necessarily to the global minimum
- Why does it work effectively in practice?
  - Multiple dimensions provide "escape routes"?
- Gradient descent over complex surfaces is still poorly understood
33
Expressive Power, Feedforward ANN
- All boolean functions can be represented by two layers
  - May need an exponential number of hidden units
- Bounded continuous functions can be approximated by two layers
- Arbitrary functions can be approximated by three layers
34
Hypothesis Space & Inductive Bias
- Hypothesis space: the space of possible network weights
  - The space is n-dimensional, where n is the number of weights in the network
  - The space is continuous
    - Contrast with decision trees and version spaces
  - This enables well-defined gradient descent
- Inductive bias: difficult to characterize
  - Smooth interpolation between data points?
35
Hidden Layer Representation
- Normally use few hidden units
- Hidden units have lots of freedom
  - Only input & output units are governed by data
- Backpropagation defines hidden layer features that are not explicit in the input
  - E.g. given 8 inputs …, 3 hidden units found the representation 000, 001, ...
36
Overfitting
- When should Backpropagation stop?
  - When the error is less than some threshold?
- Over-training could lead to overfitting
- Measuring overfitting: use validation data to measure generalization accuracy
- Terminate learning once the error on the validation data becomes significantly worse, then restore the earlier weights
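One way to turn this stopping rule into code is sketched below. The slide does not say how "significant" is judged; this sketch assumes a simple patience rule (stop after the validation error has failed to improve for a given number of consecutive epochs), and the function name, the `patience` parameter, and the error values are all hypothetical.

```python
from typing import Sequence, Tuple


def early_stopping(validation_errors: Sequence[float],
                   patience: int = 3) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors.

    best_epoch is the epoch whose weights should be restored; stop_epoch is
    where training halts under the assumed patience rule.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(validation_errors) - 1


# Hypothetical validation errors: improving at first, then rising (overfitting).
errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.40, 0.46, 0.50]
print(early_stopping(errors))  # (3, 6): restore the weights from epoch 3
```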