CS-424 Gregory Dudek Today’s Lecture Neural networks –Training Backpropagation of error (backprop) –Example –Radial basis functions
CS-424 Gregory Dudek Simple neural models Oldest ANN model is the McCulloch-Pitts neuron [1943]. –Inputs are +1 or -1, with real-valued weights. –If the sum of weighted inputs is > 0, then the neuron “fires” and gives +1 as an output. –Showed you can compute logical functions. –Relation to learning proposed (later!) by Donald Hebb [1949]. Perceptron model [Rosenblatt, 1958]. –Single-layer network with the same kind of neuron. Fires when the weighted input is above a threshold: ∑ x_i w_i > t. –Added a learning rule to allow weight selection. Not in text
CS-424 Gregory Dudek Perceptrons: Early motivation Von Neumann (1951), among others, worried about the theory behind how you get a network like the brain to learn things given: –Random (?) connections –Broken connections –Noisy signal. Boolean algebra & symbolic logic were not well suited to these questions. The perceptron (originally photo-perceptron) was used as a model for a cell that responds to illumination patterns. Not in text
CS-424 Gregory Dudek Perceptron nets
CS-424 Gregory Dudek Perceptron learning Perceptron learning: –Have a set of training examples (TS) encoded as input values (i.e. in the form of binary vectors). –Have a set of desired output values associated with these inputs. This is supervised learning. –Problem: how to adjust the weights to make the actual outputs match the training examples. NOTE: we do not allow the topology to change! [You should be thinking of a question here.] Intuition: when a perceptron makes a mistake, its weights are wrong. –Modify them to make the output bigger or smaller, as desired.
CS-424 Gregory Dudek Learning algorithm Desired output T_i, actual output O_i. Weight update formula (weight from unit j to unit i): W_j,i = W_j,i + k * x_j * (T_i - O_i), where k is the learning rate. If the examples can be learned (encoded), then the perceptron learning rule will find the weights. –How? Gradient descent. Key thing to prove is the absence of local minima.
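A minimal sketch of this update rule in NumPy (not from the slides; the threshold is folded in as an extra constant input, and the AND data set is just an illustration):

```python
import numpy as np

def perceptron_train(X, T, k=0.1, epochs=100):
    """Perceptron learning rule: W_j <- W_j + k * x_j * (T - O)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # constant input plays the role of the threshold t
    w = np.random.default_rng(0).normal(scale=0.1, size=Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, T):
            o = 1.0 if x @ w > 0 else -1.0      # the unit "fires" (+1) or not (-1)
            w += k * x * (target - o)           # zero update when the output is already correct
    return w

# AND on +/-1 inputs is linearly separable, so the rule converges:
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
T = np.array([-1, -1, -1, 1], dtype=float)
w = perceptron_train(X, T)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # should print [-1. -1. -1.  1.] after convergence
```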
CS-424 Gregory Dudek Perceptrons: what can they learn? Only linearly separable functions [Minsky & Papert, 1969]. In N dimensions, the decision boundary is a hyperplane.
CS-424 Gregory Dudek More general networks Generalize in 3 ways: –Allow continuous output values [0,1] –Allow multiple layers. This is key to learning a larger class of functions. –Allow a more complicated function than thresholded summation [why??] Generalize the learning rule to accommodate this: let’s see how it works.
CS-424 Gregory Dudek The threshold The key variant: –Change threshold into a differentiable function –Sigmoid, known as a “soft non-linearity” (silly). M = ∑ x_i w_i, O = 1 / (1 + e^(-k M))
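A small sketch of the sigmoid unit (assumed NumPy helper, not from the slides):

```python
import numpy as np

def sigmoid(m, k=1.0):
    """Soft threshold O = 1 / (1 + exp(-k*M)); larger k makes it closer to a hard step."""
    return 1.0 / (1.0 + np.exp(-k * m))

def unit_output(x, w, k=1.0):
    """Weighted sum M = sum_i x_i * w_i, pushed through the sigmoid."""
    return sigmoid(x @ w, k)
```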
CS-424 Gregory Dudek Recall: training For a single input-output layer, we could adjust the weights to get linear classification. –The perceptron computed a hyperplane over the space defined by the inputs. This is known as a linear classifier. By stacking layers, we can compute a wider range of functions. Compute the error derivative with respect to the weights. [Figure: network with inputs, a hidden layer, and an output.]
CS-424 Gregory Dudek “Train” the weights to correctly classify a set of examples (TS: the training set). Started with perceptron, which used summing and a step function, and binary inputs and outputs. Embellished by allowing continuous activations and a more complex “threshold” function. –In particular, we considered a sigmoid activation function, which is like a “blurred” threshold.
CS-424 Gregory Dudek The Gaussian Another continuous, differentiable function that is commonly used is the Gaussian function: Gaussian(x) = e^(-(x - c)^2 / σ^2), where σ is the width of the Gaussian and c its centre. The Gaussian is a continuous, differentiable version of the step function. [Figure: a step/boxcar function that is 1 in a band around c and 0 elsewhere.]
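A one-line version of this bump in NumPy (assumed form; the slide's exact normalisation isn't shown):

```python
import numpy as np

def gaussian(x, c=0.0, sigma=1.0):
    """Gaussian bump of width sigma centred at c: a smooth alternative to a hard 0/1 step."""
    return np.exp(-((x - c) ** 2) / sigma ** 2)

print(gaussian(np.linspace(-3, 3, 7)))   # near 1 at the centre, falling smoothly toward 0
```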
CS-424 Gregory Dudek What is learning? For a fixed set of weights w_1,...,w_n, f(x_1,...,x_n) = Sigma(x_1 w_1 + ... + x_n w_n) represents a particular scalar function of n variables. If we allow the weights to vary, then we can represent a family of scalar functions of n variables: F(x_1,...,x_n, w_1,...,w_n) = Sigma(x_1 w_1 + ... + x_n w_n). If the weights are real-valued, then the family of functions is determined by an n-dimensional parameter space, R^n. Learning involves searching in this parameter space.
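To make the "family of functions" idea concrete (an illustration with made-up numbers, not from the slides):

```python
import numpy as np

def F(x, w):
    """F(x_1..x_n, w_1..w_n) = Sigma(x_1*w_1 + ... + x_n*w_n), with Sigma the sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = np.array([0.5, -1.0, 2.0])
print(F(x, np.array([1.0, 0.0, 0.5])))    # one member of the family
print(F(x, np.array([-2.0, 1.0, 0.0])))   # another member: same input, different weights
# Learning = searching this n-dimensional weight space R^n for the member that fits the data.
```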
CS-424 Gregory Dudek Basis functions Here is another family of functions. In this case, the family is defined by a linear combination of basis functions {g_1, g_2, ..., g_n}. The input x could be scalar or vector valued. F(x, w_1,...,w_n) = w_1 g_1(x) + ... + w_n g_n(x)
CS-424 Gregory Dudek Combining basis functions We can build a network as follows:
g_1(x) --- w_1 ---\
g_2(x) --- w_2 ----\
  ...               ∑ --- f(x)
g_n(x) --- w_n ----/
E.g. from the basis {1, x, x^2} we can build quadratics: F(x, w_1, w_2, w_3) = w_1 + w_2 x + w_3 x^2
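A sketch of fitting such a combination by least squares (illustration only; the target quadratic and the noise are made up):

```python
import numpy as np

def combine(x, weights, basis):
    """F(x, w_1..w_n) = w_1*g_1(x) + ... + w_n*g_n(x)."""
    return sum(w * g(x) for w, g in zip(weights, basis))

basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]   # spans all quadratics

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 50)
ys = 2 - xs + 0.5 * xs ** 2 + rng.normal(scale=0.1, size=xs.shape)   # noisy samples of a quadratic
G = np.column_stack([g(xs) for g in basis])                          # one column per basis function
w_fit, *_ = np.linalg.lstsq(G, ys, rcond=None)
print(w_fit)                                                         # close to [2, -1, 0.5]
```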
CS-424 Gregory Dudek Receptive Field This construction can be generalized to an arbitrary vector space (e.g., R^n). It is often used to model what are called “localized receptive fields” in biological learning theory. –Such receptive fields are specially designed to represent the output of a learned function on a small portion of the input space. –How would you approximate an arbitrary continuous function using a sum of Gaussians, or a sum of piecewise constant functions of the sort described above?
CS-424 Gregory Dudek Backprop Consider sigmoid activation functions. We can examine the output of the net as a function of the weights. –How does the output change with changes in the weights? –Linear analysis: consider partial derivative of output with respect to weight(s). We saw this last lecture. –If we have multiple layers, consider effect on each layer as a function of the preceding layer(s). We propagate the error backwards through the net (using the chain rule for differentiation). Derivation on overheads [reference: DAA p. 212]
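A compact sketch of one backprop step for a single-hidden-layer sigmoid net with squared error (my own minimal illustration; the full derivation is on the overheads):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, lr=0.5):
    """One gradient-descent step, error E = 0.5 * (t - y)^2."""
    # Forward pass
    h = sigmoid(W1 @ x)                           # hidden activations
    y = sigmoid(W2 @ h)                           # output
    # Backward pass: propagate the error with the chain rule
    delta_out = (y - t) * y * (1 - y)             # dE/d(net input of output unit)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # dE/d(net input of hidden units)
    W2 -= lr * np.outer(delta_out, h)             # dE/dW2 = delta_out * h^T
    W1 -= lr * np.outer(delta_hid, x)             # dE/dW1 = delta_hid * x^T
    return W1, W2, 0.5 * np.sum((t - y) ** 2)

# Toy usage: 2 inputs -> 3 hidden -> 1 output
rng = np.random.default_rng(0)
W1, W2 = rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(200):
    W1, W2, err = backprop_step(x, t, W1, W2)
print(err)   # error shrinks toward 0
```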
CS-424 Gregory Dudek Backprop observations We can do gradient descent in weight space. What is the dimensionality of this space? –Very high: each weight is a free variable. There are as many dimensions as weights. A “typical” net might have hundreds of weights. Can we find the minimum? –It turns out that for multi-layer networks, the error space (often called the “energy” of the network) is NOT CONVEX. [so?] –Commonest approach: multiple restart gradient descent, i.e. try learning from various random initial weight distributions.
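One way multiple restarts are commonly organized (train_fn is a hypothetical interface for illustration, not something defined in the lecture):

```python
import numpy as np

def train_with_restarts(train_fn, n_restarts=10, seed=0):
    """Run gradient descent from several random initial weight settings and keep the best.

    train_fn(rng) is assumed to initialise weights from rng, train the network,
    and return (weights, final_error).
    """
    best_w, best_err = None, np.inf
    for i in range(n_restarts):
        w, err = train_fn(np.random.default_rng(seed + i))
        if err < best_err:   # keep whichever basin of the non-convex error surface went lowest
            best_w, best_err = w, err
    return best_w, best_err
```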
CS-424 Gregory Dudek Success? Stopping? We have a training algorithm (backprop). We might like to ask: –1. Have we done enough training (yet)? –2. How good is our network at solving the problem? –3. Should we try again to learn the problem (from the beginning)? The first 2 problems have standard answers: –Can’t just look at energy. Why not? Because we want to GENERALIZE across examples. “I understand multiplication: I know 3*6=18, 5*4=20.” –What’s 7*3? Hmmmm. –Must have additional examples to validate the training. Separate input data into 2 classes: training and testing sets. Can also use cross-validation.
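A sketch of the training/testing split idea (assumed NumPy helper, not from the slides):

```python
import numpy as np

def train_test_split(X, T, test_fraction=0.25, seed=0):
    """Hold out part of the data so generalization is measured on examples never used for training."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], T[train], X[test], T[test]

# Stop training when error on the held-out set stops improving, not when training error alone is low;
# k-fold cross-validation repeats the same idea over k different splits.
```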
CS-424 Gregory Dudek What can we learn? Any mapping from input to output units can be learned if we have enough hidden units with the right weights! In practice, having many weights makes learning difficult. The right representation is critical! Generalization depends on bias. –The hidden units form an internal representation of the problem; we want them to learn something general. Bad example: one hidden unit learns exactly one training example. –We want to avoid learning by table lookup.
CS-424 Gregory Dudek Representation Much learning can be equated with selecting a good problem representation. –If we have the right hidden layer, things become easy. Consider the problem of face recognition from photographs. Or fingerprints. –Digitized photos: a big array (256x256 or 512x512) of intensities. –How do we match one array to another? (Either manually or by computer.) –Key: measure important properties and use those as criteria for estimating similarity.
CS-424 Gregory Dudek Faces (an example) What is an important property to measure for faces? –Eye distance? –Average intensity BAD! –Nose width? –Forehead height? These measurements form the basis functions for describing faces. –BUT NOT NECESSARILY photographs!!! We don’t need to reconstruct the photo. Some information is not needed.
CS-424 Gregory Dudek Radial basis functions Use “blobs” summed together to create an arbitrary function. –A good kind of blob is a Gaussian: circular, variable width, can be easily generalized to 2D, 3D,....
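A sketch of summing Gaussian "blobs" to approximate an arbitrary 1-D function (illustration with made-up centres, widths, and a sin target):

```python
import numpy as np

def rbf(x, centres, widths, weights):
    """F(x) = sum_j w_j * exp(-(x - c_j)^2 / s_j^2): a sum of Gaussian blobs."""
    return sum(w * np.exp(-((x - c) ** 2) / s ** 2)
               for c, s, w in zip(centres, widths, weights))

xs = np.linspace(0, 2 * np.pi, 100)
centres = np.linspace(0, 2 * np.pi, 10)     # evenly spaced blob centres
widths = np.full(10, 0.7)                   # fixed blob width
G = np.exp(-((xs[:, None] - centres[None, :]) ** 2) / widths[None, :] ** 2)
weights, *_ = np.linalg.lstsq(G, np.sin(xs), rcond=None)           # fit blob heights by least squares
print(np.max(np.abs(rbf(xs, centres, widths, weights) - np.sin(xs))))   # small approximation error
```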
CS-424 Gregory Dudek Topology changes Can we get by with fewer connections? When every neuron in one layer is connected to every neuron in the next layer, we call the network fully connected. What if we allow signals to flow backwards to a preceding layer? Recurrent networks.
CS-424 Gregory Dudek Inductive bias? Where’s the inductive bias? –In the topology and architecture of the network. –In the learning rules. –In the input and output representation. –In the initial weights. Not in text