What is an artificial neural network?
A network of simple neuron-like computing elements that can learn to associate inputs with desired outputs.
- Recall the structure of real neurons, learned earlier: dendrites are where other neurons connect to the cell at synapses. These connections cause electrical changes that propagate to the cell body, where the information is integrated to produce a response in the form of spikes that propagate along the axon to other cells. Inputs can excite or inhibit the electrical activity of the cell.
- In a somewhat analogous way, each element in a neural net receives a number of inputs that are integrated to produce an output, and the inputs are integrated in a way that can increase or decrease the output of each unit.
- ANNs are typically organized in layers: an input layer, an output layer (which may have many output units), and a layer in between called the hidden layer, because it hides between the inputs we provide and the outputs we can observe.
Computing in an artificial neural network
How does each unit integrate its inputs to produce an output?
[Figure: a hidden unit H with two inputs I1 and I2, weights w1 and w2, threshold t, and output 1 or 0. If w1 · I1 + w2 · I2 > t, then H = 1; if false, H = 0.]
- To see more about how neural nets work, start with a simplified model of how the units in a network might integrate their inputs to produce an output; we will then generalize it in a couple of ways.
- Pull one hidden unit out of the network and first imagine the simple case of only two inputs, labeled I1 and I2. Each connection from an input unit to the hidden unit has an associated weight, which controls how much that input influences the behavior of H. Take each input, multiply it by its weight, and sum the resulting quantities: here, w1 · I1 + w2 · I2. Then compare the sum to a threshold: if it is above the threshold, the unit outputs 1; otherwise it outputs 0.
- Can such a simple network perform a useful function? This way of describing how units work will enable us to construct a very simple network that we can hand-simulate to get a concrete understanding of how it can perform a particular function.
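As a concrete illustration, here is a minimal sketch of such a threshold unit in Python (the function and variable names are illustrative, not from the slides):

```python
# A minimal sketch of the threshold unit described above.
def threshold_unit(inputs, weights, t):
    """Output 1 if the weighted sum of the inputs exceeds the threshold t, else 0."""
    s = sum(w * i for w, i in zip(weights, inputs))
    return 1 if s > t else 0

# Example: two inputs with weights w1 = w2 = 1 and threshold t = 0.5,
# mirroring the hidden units in the sibling/acquaintance network that follows.
print(threshold_unit([1, 0], [1, 1], 0.5))  # -> 1
print(threshold_unit([0, 0], [1, 1], 0.5))  # -> 0
```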
A (very!) simple neural network
[Figure: four inputs I1 (Jack), I2 (Jean), I3 (Pam), I4 (Paul); two hidden units H1 and H2, each with threshold > 0.5; two output units O1 (Acquaintances, threshold > 1.5) and O2 (Siblings, threshold > -1.5); weights on the connections are 1 or -1. Network inputs: 1 or 0, where valid input combinations have exactly two 1s. Network outputs: 1 or 0.]
- A very simple network with multiple layers: 4 inputs to the overall network, 2 hidden units, and 2 output units. Both the H and O units perform the simple computation just described, and each has 2 inputs (the H units from the input layer, the O units from the hidden layer). Each connection has a weight, all 1 or -1 to make things easier to hand-simulate. Weights can be negative; what is this analogous to in a real neuron? Inhibition. Each unit multiplies its inputs by the weights, adds them together, and compares the result to a threshold; each unit has its own threshold, written inside the unit.
- What does this network do? Attach meaning to the inputs and outputs. The inputs are 0 or 1 and are associated with real people (Jack, Jean, Pam, Paul). Say that in this application the valid combinations of inputs have exactly two 1s and two 0s. It happens that Jack and Jean are siblings, Pam and Paul are siblings, and the other combinations are just acquaintances.
- Given two 1s and two 0s as input, we want the network to tell us correctly whether the combination is siblings or acquaintances. Let's see what it does.
A (very!) simple neural network
[Figure: the same network, with inputs Jack = 1, Jean = 1, Pam = 0, Paul = 0, and the outputs marked "??".]
- What happens if we give Jack = Jean = 1 as inputs? Step through the computations together for this first example to see what the network does: H1 receives 1 + 1 = 2 > 0.5, so H1 = 1, while H2 receives 0 + 0 = 0, so H2 = 0. Then O1 receives 1 · 1 + 1 · 0 = 1, which does not exceed 1.5, so O1 = 0, and O2 receives (-1) · 1 + (-1) · 0 = -1 > -1.5, so O2 = 1: siblings.
A (very!) simple neural network
[Figure: the same network, with inputs Jack = 0, Jean = 0, Pam = 1, Paul = 1, and the outputs marked "??".]
- What if Pam = Paul = 1? What does the network produce? By symmetry, H1 = 0 and H2 = 1, so again O1 = 0 and O2 = 1: siblings.
- What would it take to get acquaintances? We need H1 = H2 = 1, which requires at least a single 1 in each pair of inputs (the only situation in which a hidden unit outputs 0 is when both of its inputs are 0).
A (very!) simple neural network
[Figure: the same network, with one 1 in each pair of inputs, e.g. Jack = 1 and Pam = 1.]
- O1 outputs a 1 if both H1 and H2 output 1 (the sum 2 exceeds the threshold 1.5), so the pair must be acquaintances.
- For O2, if both H1 and H2 output 1, the weighted sum is -2, which does not pass the threshold of -1.5; otherwise the sum passes, signaling siblings.
- Here we have weights on all the links and thresholds for each unit. Is there a way to simplify the picture a bit, so that we just have weights on the links and all the thresholds are something simple like > 0? (A small detail, but it comes up in the design of any real neural net.)
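The whole network is small enough to simulate directly. Here is a minimal sketch in Python, hand-coded from the weights and thresholds shown on the slides (function names are mine):

```python
# The sibling/acquaintance network, built from the threshold unit above.
def unit(inputs, weights, threshold):
    """Threshold unit: output 1 if the weighted sum exceeds the threshold."""
    return 1 if sum(w * i for w, i in zip(weights, inputs)) > threshold else 0

def siblings_network(jack, jean, pam, paul):
    h1 = unit([jack, jean], [1, 1], 0.5)    # fires if Jack or Jean is present
    h2 = unit([pam, paul], [1, 1], 0.5)     # fires if Pam or Paul is present
    o1 = unit([h1, h2], [1, 1], 1.5)        # acquaintances: both hidden units fire
    o2 = unit([h1, h2], [-1, -1], -1.5)     # siblings: at most one hidden unit fires
    return {"acquaintances": o1, "siblings": o2}

print(siblings_network(1, 1, 0, 0))  # Jack & Jean -> siblings
print(siblings_network(0, 0, 1, 1))  # Pam & Paul  -> siblings
print(siblings_network(1, 0, 1, 0))  # Jack & Pam  -> acquaintances
```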
Add "bias" units to simplify thresholds
[Figure: the same network with an extra bias input, always +1, connected to each hidden unit with weight -0.5, so the hidden thresholds become > 0. "Do the same for the output units" (whose thresholds are still > 1.5 and > -1.5).]
- Add a single extra input to the network that is a bit different from the others: it always has the value 1. Add a connection from it to each hidden unit, and associate with that connection a weight related to the threshold that was inside the hidden unit.
- Instead of testing I1 · w1 + I2 · w2 > t, we now test I1 · w1 + I2 · w2 - t > 0; the bias weight is the negative of the threshold. We can do the same for the output units: add an extra hidden unit that is always 1, with a connection to each output unit whose weight is related to that unit's threshold.
- Why do this? It simplifies the structure of the network: we just have weights, not separate thresholds. But there is one more generalization we want to make.
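Here is a minimal sketch of the bias trick, showing that folding the threshold into an always-on input with weight -t leaves the unit's behavior unchanged (names are illustrative):

```python
# Folding the threshold t into a bias weight -t on a constant input of 1,
# so every unit can use the same test: weighted sum > 0.
def unit_with_threshold(inputs, weights, t):
    return 1 if sum(w * i for w, i in zip(weights, inputs)) > t else 0

def unit_with_bias(inputs, weights):
    """inputs[0] is the constant bias input, always 1."""
    return 1 if sum(w * i for w, i in zip(weights, inputs)) > 0 else 0

# The two formulations agree: a threshold of 0.5 becomes a bias weight of -0.5.
print(unit_with_threshold([1, 0], [1, 1], 0.5))  # -> 1
print(unit_with_bias([1, 1, 0], [-0.5, 1, 1]))   # -> 1
```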
Computing in a "typical" neural network
How does each unit integrate its inputs to produce an output? Sum of weighted inputs → sigmoid function → output between 0 and 1.
[Figure: a unit H with bias input +1 (weight w0) and inputs I1, I2, ... (weights w1, w2, ...), plus a plot of the sigmoid response against the activation.]
- The quantity w0 · 1 + w1 · I1 + w2 · I2 + ... + wn · In, the sum of the weights times the inputs (and there could be more inputs), is often referred to as the activation value for the unit (a term sometimes also used with real neurons).
- As described so far, the unit is a step function: with the threshold at 0, the output jumps from 0 to 1 at activation = 0.
- Rather than a sudden jump in behavior with a slight change to the activation, make a smooth transition, still between 0 and 1: as the activation increases from a negative value, the response of the unit slowly increases, more steeply around 0, but with no sudden jump. This is the sigmoid function: 1/(1 + e^(-s)), where s is the sum of the weighted inputs (the activation).
- In the end, this is what we actually do in simple neural networks: pass the sum of the weighted inputs through the sigmoid function to get an output between 0 and 1.
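A minimal sketch of this "typical" unit in Python, combining the bias input with the sigmoid:

```python
# Weighted sum (including the bias input) passed through sigmoid 1/(1 + e^(-s)).
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_unit(inputs, weights):
    """inputs[0] is the constant bias input (always 1)."""
    activation = sum(w * i for w, i in zip(weights, inputs))
    return sigmoid(activation)

# Near activation 0 the output changes most steeply; far from 0 it saturates.
print(sigmoid_unit([1, 1, 0], [-0.5, 1, 1]))  # activation  0.5 -> ~0.62
print(sigmoid_unit([1, 0, 0], [-0.5, 1, 1]))  # activation -0.5 -> ~0.38
```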
Learning to recognize handwritten zip codes
LeCun et al. (1989). The system could recognize image samples provided by the US Postal Service with high accuracy (MNIST database).
- Neural nets first drew attention back in the 1960s, originally applied to "toy" problems. There was hype about their potential, but in the early days it didn't pan out, and for a long time there wasn't an efficient way to train networks to learn to solve problems of reasonable size. They gained traction again in the late 1980s. Here are three examples of work that suggested they really could solve more challenging problems.
- First, Yann LeCun convinced the US Post Office to share image data on handwritten zip codes (they were building systems to recognize handwritten zip codes to facilitate automatic sorting of mail). This became the so-called MNIST database (which we will work with tomorrow in lab). The system achieved a high level of performance at learning to correctly identify the digits in the image samples.
NETtalk learned phonemes from text
Sejnowski & Rosenberg (1989). [Figure: network mapping text to features of phonemes.]
- The second system that created a big splash was NETtalk, which learned phonemes from text; these could be provided as input to a speech synthesizer that could then speak the text.
- One-minute video with three phases: the beginning, before training; midway through training; and the performance of the final trained network (an old video from the time): https://www.youtube.com/watch?v=gakJlr3GecE
ALVINN learned to control steering actions
Pomerleau (1991). ALVINN learned to steer by observing a human driver. Multiple networks were used for different roads (e.g. dirt road, two-lane road, highway, up to 70 mph!). Input: a coarse image of the driving scene (960 inputs).
- Back to vision: in the early days of designing automated vehicles, ALVINN was created by Dean Pomerleau at CMU. The pixels of a coarse image of the driving scene were the input, and the output was the appropriate steering angle. The teacher was the driver: during training, the system viewed the road scene with a camera while recording the real driver's steering angle, and used that to train the network to perform analogous steering actions automatically. It performed quite well for the times, and helped convince computer vision researchers to consider approaches based on neural nets.
Learning to recognize input patterns
Feedforward processing: network weights can be learned from training examples (mapping inputs to correct outputs).
Back-propagation: an iterative algorithm that progressively reduces the error between the computed and desired output until performance is satisfactory. On each iteration:
(1) compute the output of the current network and assess its performance
(2) compute weight adjustments from the hidden to the output layer that reduce the output errors
(3) compute weight adjustments from the input to the hidden units that improve the hidden layer
(4) change the network weights, incorporating a rate parameter
- How do we train a network to solve a particular problem? So far we have seen how the network computes in the forward direction, from inputs to outputs. How can it learn a pattern of weights that enables it to produce the correct outputs for each set of inputs? It learns from a set of training examples (e.g. pairs of an input image and the correct digit label).
- Start with random values for the weights, and use an iterative strategy that progressively reduces the error between the computed and desired outputs until performance is good enough.
- How do we adjust the weights to improve performance? The most common strategy is the back-propagation algorithm. On each iteration, take the current network, compute its output for each input, and assess performance (measure the errors between computed and desired outputs). If it is not good enough, work backwards: determine changes to the weights from the hidden to the output layer that reduce the errors, then move further back and determine adjustments to the weights from the input to the hidden layer that improve the hidden layer's contribution to reducing the final output errors. After figuring out how to adjust all the weights, change them (typically) all at once; a sketch of one such iteration appears below.
- The final rate parameter controls how aggressively to make the changes. It is a scale factor: change a lot on each iteration, or be more conservative. One can graph the effect of different rates.
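As referenced above, here is a minimal sketch of a single back-propagation iteration for a network with one hidden layer of sigmoid units, assuming a squared-error measure (NumPy; bias inputs omitted for brevity, and all names are illustrative, not from the lab code):

```python
# One back-propagation iteration: forward pass, error, then weight updates.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(W1, W2, x, target, rate):
    """W1: (hidden, inputs), W2: (outputs, hidden); updates W1, W2 in place."""
    # Forward pass: input -> hidden -> output.
    h = sigmoid(W1 @ x)                    # hidden-layer responses
    y = sigmoid(W2 @ h)                    # network outputs
    error = y - target                     # computed vs. desired output
    # Backward pass: output-layer deltas, then hidden-layer deltas.
    delta_out = error * y * (1 - y)        # uses sigmoid'(s) = y(1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Change all the weights at once, scaled by the rate parameter.
    W2 -= rate * np.outer(delta_out, h)
    W1 -= rate * np.outer(delta_hid, x)
    return np.sum(error ** 2)              # squared error, to track progress
```

Repeating backprop_step over the training examples should drive the error down; the rate argument is the rate parameter described in the notes.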
Learning handwritten digits
MNIST database: 3000 28x28 images of handwritten digits. Start with random initial weights and use back-propagation to learn weights to recognize the digits.
[Figure: a sample 28x28 image feeding a network with one input unit for each pixel, 25 hidden units, and one output unit for each digit; select the output unit with the maximum response, e.g. 9.]
- An example of the structure of a network to recognize handwritten digits: one input unit per pixel (28 x 28 = 784 inputs), 25 hidden units, and one output unit for each digit. To classify an image, select the output unit with the maximum response.
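A minimal sketch of this network's forward pass and classification rule (NumPy; the weight matrices are assumed to have been learned already by back-propagation):

```python
# Forward pass and classification for the 784-25-10 digit network.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def classify_digit(W1, W2, image):
    """image: 28x28 array; W1: (25, 784); W2: (10, 25)."""
    x = image.reshape(784)      # one input unit per pixel
    h = sigmoid(W1 @ x)         # 25 hidden units
    y = sigmoid(W2 @ h)         # one output unit per digit, 0..9
    return int(np.argmax(y))    # select the output unit with maximum response
```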
Results: Learning handwritten digits
Overall classification accuracy: 93.1%. [Figure: correct examples, wrong examples, and a confusion matrix.]
- A taste of the results. Different kinds of information are useful: the overall classification accuracy, the number of iterations needed to reach it, a confusion matrix (which digits get mistaken for which), and ROC curves (we use thresholds to decide whether an output counts as 0 or 1; how does performance depend on the choice of threshold?).
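For reference, a minimal sketch of how a confusion matrix and the overall accuracy can be computed from a network's predictions (illustrative only; the lab may provide its own utilities):

```python
# Confusion matrix: counts[i, j] = number of images of digit i classified as j.
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes=10):
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1
    return counts

def accuracy(counts):
    """Overall classification accuracy: the fraction of mass on the diagonal."""
    return counts.trace() / counts.sum()
```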