1
1 CogNova Technologies Theory and Application of Artificial Neural Networks with Daniel L. Silver, PhD Copyright (c) 2014, All Rights Reserved
2
2 CogNova Technologies Seminar Outline DAY 1 v ANN Background and Motivation v Classification Systems and Inductive Learning v From Biological to Artificial Neurons v Learning in a Simple Neuron v Limitations of Simple Neural Networks v Visualizing the Learning Process v Multi-layer Feed-forward ANNs v The Back-propagation Algorithm DAY 2 v Generalization in ANNs v How to Design a Network v How to Train a Network v Mastering ANN Parameters v The Training Data v Post-Training Analysis v Pros and Cons of Back-prop v Advanced issues and networks
3
3 CogNova Technologies ANN Background and Motivation
4
4 CogNova Technologies Background and Motivation v Growth has been explosive since 1987 – educational institutions, industry, military – > 500 books on the subject – > 20 journals dedicated to ANNs – numerous popular, industry, and academic articles v A truly inter-disciplinary area of study v No longer a flash-in-the-pan technology
5
5 CogNova Technologies Background and Motivation v Computers and the Brain: A Contrast – Arithmetic: 1 brain = 1/10 pocket calculator – Vision: 1 brain = 1000 super computers – Memory of arbitrary details: computer wins – Memory of real-world facts: brain wins – A computer must be programmed explicitly – The brain can learn by experiencing the world
6
6 CogNova Technologies Background and Motivation
7
7 CogNova Technologies Background and Motivation Inherent Advantages of the Brain: "distributed processing and representation" – Parallel processing speeds – Fault tolerance – Graceful degradation – Ability to generalize
8
8 CogNova Technologies Background and Motivation History of Artificial Neural Networks v Creation: 1890: William James - defined a neuronal process of learning v Promising Technology: 1943: McCulloch and Pitts - earliest mathematical models 1954: Donald Hebb and IBM research group - earliest simulations 1958: Frank Rosenblatt - The Perceptron v Disenchantment: 1969: Minsky and Papert - perceptrons have severe limitations v Re-emergence: 1985: Multi-layer nets that use back-propagation 1986: PDP Research Group - multi-disciplined approach
9
9 CogNova Technologies ANN application areas... v Science and medicine: modeling, prediction, diagnosis, pattern recognition v Manufacturing: process modeling and analysis v Marketing and Sales: analysis, classification, customer targeting v Finance: portfolio trading, investment support v Banking & Insurance: credit and policy approval v Security: bomb, iceberg, fraud detection v Engineering: dynamic load scheduling, pattern recognition Background and Motivation
10
10 CogNova Technologies Classification Systems and Inductive Learning
11
11 CogNova Technologies Classification Systems and Inductive Learning Basic Framework for Inductive Learning [Figure: the Environment supplies Training Examples (x, f(x)) to the Inductive Learning System, which produces an Induced Model or Classifier; Testing Examples (x, h(x)) are fed to the induced model, and its Output Classification h(x) is compared with f(x): does h(x) ≈ f(x)?] A problem of representation and search for the best hypothesis, h(x).
12
12 CogNova Technologies Classification Systems and Inductive Learning Vector Representation & Discriminant Functions [Figure: class clusters A (*) and B (o) plotted in the "input or attribute space", with axes x1 = Age and x2 = Height]
13
13 CogNova Technologies Classification Systems and Inductive Learning Vector Representation & Discriminant Functions [Figure: the same A/B clusters with a separating line that crosses the x2 axis at -w0/w2] Linear Discriminant Function: f(X) = f(x1, x2) = w0 + w1x1 + w2x2 = 0, or WX = 0; f(x1, x2) > 0 => A, f(x1, x2) < 0 => B
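A minimal Python sketch (not from the original slides) of such a linear discriminant; the weight values and the Age/Height points below are illustrative assumptions:

```python
# Sketch of a linear discriminant f(x1, x2) = w0 + w1*x1 + w2*x2.
# The weights here are illustrative assumptions, not values from the seminar.

def linear_discriminant(x1, x2, w0=-10.0, w1=0.1, w2=0.04):
    """Return the signed value of f; > 0 maps to class A, <= 0 to class B."""
    return w0 + w1 * x1 + w2 * x2

def classify(x1, x2):
    return "A" if linear_discriminant(x1, x2) > 0 else "B"

if __name__ == "__main__":
    # Hypothetical (Age, Height-in-cm) points on either side of the line.
    for age, height in [(30, 180), (10, 120)]:
        print((age, height), "->", classify(age, height))
```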
14
14 CogNova Technologies Classification Systems and Inductive Learning v f(X) = WX = 0 will discriminate class A from B, v BUT... we do not know the appropriate values for: w0, w1, w2
15
15 CogNova Technologies Classification Systems and Inductive Learning We will consider one family of neural network classifiers: v continuous valued input v feed-forward v supervised learning v global error
16
16 CogNova Technologies From Biological to Artificial Neurons
17
17 CogNova Technologies From Biological to Artificial Neurons The Neuron - A Biological Information Processor v dendrites - the receivers v soma - neuron cell body (sums input signals) v axon - the transmitter v synapse - point of transmission v neuron activates after a certain threshold is met Learning occurs via electro-chemical changes in the effectiveness of the synaptic junction.
18
18 CogNova Technologies From Biological to Artificial Neurons An Artificial Neuron - The Perceptron v simulated on hardware or by software v input connections - the receivers v node, unit, or PE (processing element) simulates the neuron body v output connection - the transmitter v activation function employs a threshold or bias v connection weights act as synaptic junctions Learning occurs via changes in the value of the connection weights.
19
19 CogNova Technologies From Biological to Artificial Neurons An Artificial Neuron - The Perceptron v The basic function of the neuron is to sum its inputs and produce an output when the sum is greater than a threshold v An ANN node produces an output as follows: 1. Multiplies each component of the input pattern by the weight of its connection 2. Sums all weighted inputs and subtracts the threshold value => total weighted input 3. Transforms the total weighted input into the output using the activation function
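A minimal Python sketch of steps 1-3 for a single node, assuming a hard-threshold (step) activation; the input pattern, weights, and threshold are illustrative:

```python
# One artificial neuron: weight the inputs, sum them, subtract the threshold,
# then pass the total weighted input through a step activation function.

def step(a):
    # Step activation: fire (1) only when the total weighted input exceeds 0.
    return 1 if a > 0 else 0

def neuron_output(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))  # steps 1 and 2
    return step(total - threshold)                       # step 3

if __name__ == "__main__":
    print(neuron_output([1, 0, 1], [0.5, -0.2, 0.3], threshold=0.6))  # -> 1
```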
20
20 CogNova Technologies From Biological to Artificial Neurons [Figure: feed-forward network with input nodes I1-I4, a layer of hidden nodes, and output nodes O1-O2] "Distributed processing and representation" - a 3-layer network has 2 active layers
21
21 CogNova Technologies From Biological to Artificial Neurons The behaviour of an artificial neural network for any particular input depends upon: v structure of each node (activation function) v structure of the network (architecture) v weights on each of the connections.... these must be learned!
22
22 CogNova Technologies Learning in a Simple Neuron
23
23 CogNova Technologies Learning in a Simple Neuron - the "Full Meal Deal" H = {W | W ∈ R^(n+1)} [Figure: a perceptron with inputs x1 and x2 (Burger, Fries), bias input x0 = 1, and weights w0, w1, w2]
x1 x2 | y
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
where f(a) is the step function, such that: f(a) = 1 if a > 0; f(a) = 0 if a <= 0
24
24 CogNova Technologies Learning in a Simple Neuron Perceptron Learning Algorithm: 1. Initialize weights 2. Present a pattern and target output 3. Compute output: y = f(Σ_i w_i x_i) 4. Update weights: w_i ← w_i + Δw_i Repeat starting at 2 until acceptable level of error
25
25 CogNova Technologies Learning in a Simple Neuron Widrow-Hoff or Delta Rule for Weight Modification: Δw_i = η d x_i Where: η = learning rate (0 < η <= 1), typically set to 0.1; d = error signal = desired output - network output = t - y
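A sketch of the perceptron learning algorithm with this delta rule, trained on the "full meal deal" (AND-style) truth table above; the learning rate of 0.1 follows the slide, while the initial weights and the epoch limit are assumptions:

```python
import random

# Training data for the "full meal deal": output 1 only when both inputs are 1.
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

eta = 0.1                                           # learning rate, per the slide
w = [random.uniform(-0.5, 0.5) for _ in range(3)]   # w0 (bias), w1, w2 (assumed init)

def output(x1, x2):
    a = w[0] * 1 + w[1] * x1 + w[2] * x2   # x0 = 1 is the bias input
    return 1 if a > 0 else 0               # step activation

for epoch in range(1000):                  # on-line updates, pattern by pattern
    errors = 0
    for (x1, x2), t in patterns:
        y = output(x1, x2)
        d = t - y                          # error signal d = t - y
        if d != 0:
            errors += 1
            for i, xi in enumerate((1, x1, x2)):
                w[i] += eta * d * xi       # delta w_i = eta * d * x_i
    if errors == 0:                        # stop at an acceptable level of error
        break

print("learned weights:", w)
print([output(x1, x2) for (x1, x2), _ in patterns])   # expect [0, 0, 0, 1]
```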
26
26 CogNova Technologies Learning in a Simple Neuron Perceptron Learning - A Walk Through v The PERCEPT.XLS table represents 4 iterations through the training data for the "full meal deal" network v On-line weight updates v Varying the learning rate, η, will vary the training time
27
27 CogNova Technologies TUTORIAL #1 v Your ANN software package: A Primer v Develop and train a simple neural network to learn the OR function
28
28 CogNova Technologies Limitations of Simple Neural Networks
29
29 CogNova Technologies Limitations of Simple Neural Networks What is a Perceptron doing when it learns? v We will see it is often good to visualize network activity v A discriminant function is generated v Has the power to map input patterns to output class values v For 3-dimensional input, we must visualize a 3-D space divided by 2-D hyper-planes
30
30 CogNova Technologies EXAMPLE - Logical OR Function (a Simple Neural Network) What is an artificial neuron doing when it learns?
x1 x2 | y
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 1
y = f(w0 + w1x1 + w2x2) [Figure: the four points (0,0), (0,1), (1,0), (1,1) in the x1-x2 plane, separated by a single line]
31
31 CogNova Technologies Limitations of Simple Neural Networks The Limitations of Perceptrons (Minsky and Papert, 1969) v Able to form only linear discriminant functions; i.e. classes which can be divided by a line or hyper-plane v Most functions are more complex; i.e. they are non-linear or not linearly separable v This crippled research in neural net theory for 15 years....
32
32 CogNova Technologies EXAMPLE - Logical XOR Function (a Multi-layer Neural Network with a hidden layer of neurons)
x1 x2 | y
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
Two neurons are needed! Their combined results can produce a good classification. [Figure: the points (0,0), (0,1), (1,0), (1,1) in the x1-x2 plane; no single line separates the two classes]
33
33 CogNova Technologies EXAMPLE More complex multi-layer networks are needed to solve more difficult problems. [Figure: classes A and B separated by a more complex, non-linear decision boundary]
34
34 CogNova Technologies TUTORIAL #2 v Develop and train a simple neural network to learn the XOR function v Also see: http://www.neuro.sfc.keio.ac.jp/~masato/jv/sl/BP.html
35
35 CogNova Technologies Multi-layer Feed-forward ANNs
36
36 CogNova Technologies Multi-layer Feed-forward ANNs Over the 15 years (1969-1984) some research continued... v a hidden layer of nodes allowed combinations of linear functions v non-linear activation functions displayed properties closer to real neurons: – output varies continuously but not linearly – differentiable.... e.g. the sigmoid => a non-linear ANN classifier was possible
37
37 CogNova Technologies Multi-layer Feed-forward ANNs v However... there was no learning algorithm to adjust the weights of a multi-layer network - weights had to be set by hand. v How could the weights below the hidden layer be updated?
38
38 CogNova Technologies Visualizing Network Behaviour
39
39 CogNova Technologies Visualizing Network Behaviour v Pattern Space v Weight Space v Visualizing the process of learning – function surface in weight space – error surface in weight space [Figure: pattern space with axes x1, x2; weight space with axes w0, w1, w2]
40
40 CogNova Technologies The Back-propagation Algorithm
41
41 CogNova Technologies The Back-propagation Algorithm v 1986: the solution to multi-layer ANN weight update rediscovered v Conceptually simple - the global error is backward propagated to the network nodes, and weights are modified in proportion to their contribution v The most important ANN learning algorithm v Became known as back-propagation because the error is sent back through the network to correct all weights
42
42 CogNova Technologies The Back-propagation Algorithm v Like the Perceptron - the calculation of error is based on the difference between target and actual output, t - y v However, in BP it is the rate of change of the error which is the important feedback through the network => the generalized delta rule v Relies on the sigmoid activation function for communication
43
43 CogNova Technologies The Back-propagation Algorithm Objective: compute ∂E/∂w_ij for all weights Definitions: w_ij = weight from node i to node j; net_j = total weighted input of node j; y_j = output of node j; E = error for 1 pattern over all output nodes
44
44 CogNova Technologies The Back-propagation Algorithm Objective: compute ∂E/∂w_ij for all weights Four step process: 1. Compute how fast error changes as the output of node j is changed: ∂E/∂y_j 2. Compute how fast error changes as the total input to node j is changed: ∂E/∂net_j = ∂E/∂y_j · f'(net_j) 3. Compute how fast error changes as a weight coming into node j is changed: ∂E/∂w_ij = ∂E/∂net_j · y_i 4. Compute how fast error changes as the output of node i in the previous layer is changed: ∂E/∂y_i = Σ_j ∂E/∂net_j · w_ij
45
45 CogNova Technologies The Back-propagation Algorithm On-Line algorithm: 1. Initialize weights 2. Present a pattern and target output 3. Compute output: y_j = f(Σ_i w_ij y_i) for each node, layer by layer 4. Update weights: w_ij ← w_ij + Δw_ij, where Δw_ij = η δ_j y_i Repeat starting at 2 until acceptable level of error
46
46 CogNova Technologies The Back-propagation Algorithm Where, for the sigmoid activation function: For output nodes: δ_j = y_j (1 - y_j)(t_j - y_j) For hidden nodes: δ_j = y_j (1 - y_j) Σ_k δ_k w_jk
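A compact sketch of the on-line back-propagation loop using these sigmoid deltas, applied here to the XOR problem from Tutorial #2; the 2-3-1 network size, learning rate, and epoch count are assumptions chosen for illustration:

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

random.seed(1)
n_hidden = 3                                 # assumed network size (2-3-1)
eta = 0.5                                    # assumed learning rate
W_h = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(n_hidden)]
W_o = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR

def forward(x1, x2):
    h = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for w in W_h]
    y = sigmoid(W_o[0] + sum(W_o[j + 1] * h[j] for j in range(n_hidden)))
    return h, y

for epoch in range(20000):                   # on-line (per-pattern) updates
    for (x1, x2), t in data:
        h, y = forward(x1, x2)
        delta_o = y * (1 - y) * (t - y)                      # output node delta
        delta_h = [h[j] * (1 - h[j]) * delta_o * W_o[j + 1]  # hidden node deltas
                   for j in range(n_hidden)]
        for i, yi in enumerate([1.0] + h):                   # w += eta * delta_j * y_i
            W_o[i] += eta * delta_o * yi
        for j in range(n_hidden):
            for i, xi in enumerate((1.0, x1, x2)):
                W_h[j][i] += eta * delta_h[j] * xi

# Outputs typically approach [0, 1, 1, 0]; BP can occasionally settle in a
# local minimum, in which case a different random initialization helps.
print([round(forward(x1, x2)[1], 2) for (x1, x2), _ in data])
```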
47
47 CogNova Technologies The Back-propagation Algorithm Visualizing the bp learning process: The bp algorithm performs a gradient descent in weight space toward a minimum level of error using a fixed step size or learning rate The gradient is given by: ∂E/∂w_ij = the rate at which error changes as the weights change
48
48 CogNova Technologies The Back-propagation Algorithm Momentum Descent: v Minimization can be sped up if an additional term is added to the update equation: Δw_ij(t) = η δ_j y_i + α Δw_ij(t-1), where α = the momentum parameter v Thus: – Augments the effective learning rate to vary the amount a weight is updated – Analogous to the momentum of a ball - maintains direction – Rolls through small local minima – Increases the weight update when on a stable gradient
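A small sketch of the weight update with the momentum term added; η and α are assumed values here (α = 0.8 is the typical value listed later under Mastering ANN Parameters):

```python
# Weight update with momentum:
#   dw(t) = eta * delta_j * y_i  +  alpha * dw(t-1)
eta, alpha = 0.5, 0.8          # learning rate and momentum (assumed/typical values)
prev_dw = 0.0                  # dw(t-1), remembered between updates

def momentum_update(w, delta_j, y_i):
    global prev_dw
    dw = eta * delta_j * y_i + alpha * prev_dw   # gradient step plus momentum
    prev_dw = dw
    return w + dw

print(momentum_update(0.2, delta_j=0.1, y_i=1.0))   # 0.2 + 0.05 = 0.25
print(momentum_update(0.25, delta_j=0.1, y_i=1.0))  # 0.25 + 0.05 + 0.8*0.05 = 0.34
```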
49
49 CogNova Technologies The Back-propagation Algorithm Line Search Techniques: v Steepest and momentum descent use only gradient of error surface v More advanced techniques explore the weight space using various heuristics v Most common is to search ahead in the direction defined by the gradient
50
50 CogNova Technologies The Back-propagation Algorithm On-line vs. Batch algorithms: v The Batch (or cumulative) method reviews a set of training examples known as an epoch and computes the global error: E = Σ_p Σ_j ½ (t_pj - y_pj)^2 over all patterns p and output nodes j v Weight updates are based on this cumulative error signal v On-line is more stochastic and typically a little more accurate; batch is more efficient
51
51 CogNova Technologies The Back-propagation Algorithm Several Interesting Questions: v What is BP's inductive bias? v Can BP get stuck in a local minimum? v How does learning time scale with the size of the network & the number of training examples? v Is it biologically plausible? v Do we have to use the sigmoid activation function? v How well does a trained network generalize to unseen test cases?
52
52 CogNova Technologies TUTORIAL #3 v The XOR function revisited v Software package tutorial : Electric Cost Prediction
53
53 CogNova Technologies Generalization
54
54 CogNova Technologies Generalization v The objective of learning is to achieve good generalization to new cases, otherwise just use a look-up table. v Generalization can be defined as a mathematical interpolation or regression over a set of training points [Figure: a smooth curve f(x) fitted through the training points along x]
55
55 CogNova Technologies Generalization An Example: Computing Parity [Figure: a network with n bits of input plus a +1 bias, hidden threshold units (>0, >1, >2, ...), roughly (n+1)^2 weights, and a parity-bit output; there are 2^n possible examples] Can it learn from m examples to generalize to all 2^n possibilities?
56
56 CogNova Technologies Generalization [Figure: test error (up to 100%) vs. fraction of cases used during training (0.25, 0.50, 0.75, 1.0) for a network test of 10-bit parity (Denker et al., 1987)] When the number of training cases, m >> number of weights, then generalization occurs.
57
57 CogNova Technologies Generalization A Probabilistic Guarantee N = # hidden nodes, m = # training cases, W = # weights, ε = error tolerance (< 1/8) The network will generalize with 95% confidence if: 1. Error on the training set < ε/2 2. m ≥ (W/ε) log2(N/ε) Based on PAC theory => provides a good rule of practice.
58
58 CogNova Technologies Generalization Consider the 20-bit parity problem: v A 20-20-1 net has 441 weights v For 95% confidence that the net will predict with ε = 0.1, the rule above asks for m ≈ (441/0.1) × log2(20/0.1) ≈ 34,000 training examples v Not bad considering there are 2^20 = 1,048,576 possible examples
59
59 CogNova Technologies Generalization Training Sample & Network Complexity Based on the m ≈ W/ε relationship: a small W - to reduce the size of the required training sample a large W - to supply freedom to construct the desired function Optimum W => Optimum # Hidden Nodes
60
60 CogNova Technologies Generalization How can we control the number of effective weights? v Manually or automatically select the optimum number of hidden nodes and connections v Prevent over-fitting = over-training v Add a weight-cost term to the bp error equation
61
61 CogNova Technologies Generalization Over-Training v Is the equivalent of over-fitting a set of data points to a curve which is too complex v Occam’s Razor (1300s) : “plurality should not be assumed without necessity” v The simplest model which explains the majority of the data is usually the best
62
62 CogNova Technologies Generalization Preventing Over-training: v Use a separate test or tuning set of examples v Monitor error on the test set as network trains v Stop network training just prior to over-fit error occurring - early stopping or tuning v Number of effective weights is reduced v Most new systems have automated early stopping methods
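A sketch of the early-stopping loop described above; train_one_epoch, error_on, and the net object's copy_weights/restore_weights methods are hypothetical placeholders for whatever ANN package is in use:

```python
# Early stopping: train while monitoring error on a separate tuning/test set,
# and keep the weights from the epoch with the lowest tuning error.

def early_stopping_train(net, train_set, tuning_set,
                         train_one_epoch, error_on,
                         max_epochs=1000, patience=20):
    """train_one_epoch(net, data) and error_on(net, data) are assumed to be
    supplied by the ANN package; this loop only adds the stopping logic."""
    best_error = float("inf")
    best_weights = None
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)
        err = error_on(net, tuning_set)          # monitor error on the tuning set
        if err < best_error:
            best_error, best_weights = err, net.copy_weights()
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience: # over-fitting has begun
            break
    net.restore_weights(best_weights)            # roll back to the best point
    return best_error
```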
63
63 CogNova Technologies Generalization Weight Decay: an automated method of effective weight control v Adjust the bp error function to penalize the growth of unnecessary weights: E' = E + (λ/2) Σ_i w_i^2, where λ = the weight-cost parameter v Each weight w_i is decayed by an amount proportional to its magnitude; those not reinforced by training => 0
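A sketch of how the weight-cost term changes each weight update; λ = 0.1 is the typical value listed later under Mastering ANN Parameters, and the gradient argument stands in for plain back-propagation:

```python
# Weight decay: the bp error becomes  E' = E + (lambda/2) * sum(w_i ** 2),
# so each weight update gains an extra term  -eta * lambda * w_i  that shrinks
# (decays) every weight in proportion to its magnitude.
eta = 0.1      # learning rate
lam = 0.1      # weight-cost parameter (lambda), typical value from the seminar

def decayed_update(w, error_gradient):
    """error_gradient is dE/dw for this weight (from plain back-propagation)."""
    return w - eta * error_gradient - eta * lam * w

# A weight that receives no error signal is steadily decayed toward zero:
w = 0.8
for _ in range(5):
    w = decayed_update(w, error_gradient=0.0)
print(round(w, 4))   # 0.8 * 0.99^5 ≈ 0.7608
```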
64
64 CogNova Technologies TUTORIAL #4 v Generalization: Develop and train a BP network to learn the OVT function
65
65 CogNova Technologies Network Design & Training
66
66 CogNova Technologies Network Design & Training Issues Design: v Architecture of network v Structure of artificial neurons v Learning rules Training: v Ensuring optimum training v Learning parameters v Data preparation v and more....
67
67 CogNova Technologies Network Design
68
68 CogNova Technologies Network Design Architecture of the network: How many nodes? v Determines the number of network weights v How many layers? v How many nodes per layer? [Figure: input layer, hidden layer, output layer] v Automated methods: – augmentation (cascade correlation) – weight pruning and elimination
69
69 CogNova Technologies Network Design Architecture of the network: Connectivity? v Concept of model or hypothesis space v Constraining the number of hypotheses: – selective connectivity – shared weights – recursive connections
70
70 CogNova Technologies Network Design Structure of artificial neuron nodes v Choice of input integration: – summed, squared and summed – multiplied v Choice of activation (transfer) function: – sigmoid (logistic) – hyperbolic tangent – Gaussian – linear – soft-max
71
71 CogNova Technologies Network Design Selecting a Learning Rule v Generalized delta rule (steepest descent) v Momentum descent v Advanced weight space search techniques v The global error function can also vary: - normal - quadratic - cubic
72
72 CogNova Technologies Network Training
73
73 CogNova Technologies Network Training How do you ensure that a network has been well trained? v Objective: To achieve good generalization accuracy on new examples/cases v Establish a maximum acceptable error rate v Train the network using a validation test set to tune it v Validate the trained network against a separate test set which is usually referred to as a production test set
74
74 CogNova Technologies Network Training Approach #1: Large Sample - when the amount of available data is large... Divide the available examples randomly into a Training Set (70%), used to develop one ANN model, and a Test (Production) Set (30%), used to compute the test error; Generalization error = test error
75
75 CogNova Technologies Network Training Approach #2: Cross-validation - when the amount of available data is small... Divide the available examples into a Training Set (90%) and a Test (Production) Set (10%); repeat 10 times, using a different 10% as the test set each time, to develop 10 different ANN models and accumulate the test errors; Generalization error is determined by the mean test error and its std dev
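A sketch of the 10-fold cross-validation procedure; build_and_train and error_on are hypothetical stand-ins for the ANN package's training and evaluation calls:

```python
import statistics

def cross_validation(examples, build_and_train, error_on, k=10):
    """Split the examples into k folds; each fold serves once as the ~10% test
    set while the remaining ~90% trains a fresh ANN model. Returns the mean
    test error and its standard deviation (the generalization estimate)."""
    folds = [examples[i::k] for i in range(k)]          # simple round-robin split
    test_errors = []
    for i in range(k):
        test_set = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = build_and_train(train_set)               # one of the k ANN models
        test_errors.append(error_on(model, test_set))    # accumulate test errors
    return statistics.mean(test_errors), statistics.stdev(test_errors)
```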
76
76 CogNova Technologies Network Training How do you select between two ANN designs? v A statistical hypothesis test is required to ensure that a significant difference exists between the error rates of two ANN models v If the Large Sample method has been used, then apply McNemar's test* v If Cross-validation, then use a paired t test for the difference of two proportions *We assume a classification problem; if this is function approximation, then use a paired t test for the difference of means
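For the Large Sample case, a sketch of McNemar's test computed directly from the two models' disagreements on the shared test set (the continuity-corrected chi-square; values above about 3.84 indicate a significant difference at the 95% level). The example outcomes are made up:

```python
def mcnemar(model_a_correct, model_b_correct):
    """model_*_correct are parallel lists of booleans: whether each model
    classified each test case correctly. Returns the McNemar chi-square
    statistic; a value above ~3.84 suggests a significant difference at 95%."""
    b = sum(1 for a_ok, b_ok in zip(model_a_correct, model_b_correct) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(model_a_correct, model_b_correct) if b_ok and not a_ok)
    if b + c == 0:
        return 0.0                               # the models never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)       # continuity-corrected chi-square

# Illustrative (made-up) outcomes on 10 test cases:
a = [True, True, True, False, True, True, True, True, False, True]
b = [True, False, False, False, True, False, True, True, False, False]
print(round(mcnemar(a, b), 2))
```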
77
77 CogNova Technologies Network Training Mastering ANN Parameters (typical value and typical range): learning rate η - 0.1 (0.01 - 0.99); momentum α - 0.8 (0.1 - 0.9); weight-cost λ - 0.1 (0.001 - 0.5) Fine tuning: – adjust individual parameters at each node and/or connection weight – automatic adjustment during training
78
78 CogNova Technologies Network Training Network weight initialization v Random initial values +/- some range v Smaller weight values for nodes with many incoming connections v Rule of thumb: the initial weight range should shrink in proportion to the number of connections coming into a node
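A sketch of fan-in-scaled random initialization consistent with the rule of thumb above; the scaling constant is an assumption, since the slide's exact formula did not survive extraction:

```python
import random

def init_weights(fan_in, scale=2.0):
    """Initialize the incoming weights of one node to small random values in
    +/- (scale / fan_in), so nodes with many incoming connections start with
    smaller weights. The constant 'scale' is an assumed choice."""
    r = scale / fan_in
    return [random.uniform(-r, r) for _ in range(fan_in)]

print(init_weights(4))    # a node with 4 incoming connections
print(init_weights(20))   # many connections -> a much smaller initial range
```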
79
79 CogNova Technologies Network Training Typical Problems During Training [Figure: three curves of total error E vs. # iterations] Would like: a steady, rapid decline in total error But sometimes: (a) the error stalls - seldom a local minimum; reduce the learning or momentum parameter (b) the error fails to fall - reduce the learning parameters; may indicate the data is not learnable
80
80 CogNova Technologies Data Preparation
81
81 CogNova Technologies Data Preparation Garbage in Garbage out v The quality of results relates directly to quality of the data v 50%-70% of ANN development time will be spent on data preparation v The three steps of data preparation: – Consolidation and Cleaning – Selection and Preprocessing – Transformation and Encoding
82
82 CogNova Technologies Data Preparation Data Types and ANNs v Four basic data types: – nominal discrete symbolic (blue, red, green) – ordinal discrete ranking (1st, 2nd, 3rd) – interval measurable numeric (-5, 3, 24) – continuous numeric (0.23, -45.2, 500.43) v bp ANNs accept only continuous numeric values (typically in the 0 - 1 range)
83
83 CogNova Technologies Data Preparation Consolidation and Cleaning v Determine appropriate input attributes v Consolidate data into working database v Eliminate or estimate missing values v Remove outliers (obvious exceptions) v Determine prior probabilities of categories and deal with volume bias
84
84 CogNova Technologies Data Preparation Selection and Preprocessing v Select examples - random sampling; consider the number of training examples v Reduce attribute dimensionality – remove redundant and/or correlating attributes – combine attributes (sum, multiply, difference) v Reduce attribute value ranges – group symbolic discrete values – quantize continuous numeric values
85
85 CogNova Technologies Data Preparation Transformation and Encoding Nominal or Ordinal values v Transform to discrete numeric values v Encode the value 4 as follows: – one-of-N code (0 0 0 1 0) - five inputs – thermometer code (1 1 1 1 0) - five inputs – real value (0.4)* - one input if ordinal v Consider the relationship between values – (single, married, divorced) vs. (youth, adult, senior) * Target values should be in the 0.1 - 0.9 range, not 0.0 - 1.0
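A sketch of the three encodings for an ordinal value such as 4 on a 1-5 scale (positions counted from the low end, matching the thermometer example above):

```python
def one_of_n(value, n=5):
    # one-of-N: a single 1 in the position of the value -> n inputs
    return [1 if i == value else 0 for i in range(1, n + 1)]

def thermometer(value, n=5):
    # thermometer: 1s up to and including the value -> n inputs
    return [1 if i <= value else 0 for i in range(1, n + 1)]

def scaled_real(value):
    # single real-valued input if the attribute is ordinal; 4 -> 0.4
    return value / 10.0

print(one_of_n(4))      # [0, 0, 0, 1, 0]
print(thermometer(4))   # [1, 1, 1, 1, 0]
print(scaled_real(4))   # 0.4
```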
86
86 CogNova Technologies Data Preparation Transformation and Encoding Interval or continuous numeric values v De-correlate example attributes via normalization of values: – Euclidean: n = x / sqrt(sum of all x^2) – Percentage: n = x / (sum of all x) – Variance based: n = (x - (mean of all x)) / (standard deviation of all x) v Scale values using a linear transform if the data is uniformly distributed, or use a non-linear transform (log, power) if the distribution is skewed
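A plain-Python sketch of the three normalizations applied to one attribute column; the sample values are illustrative:

```python
import math

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]     # one attribute over all examples

def euclidean_norm(values):
    length = math.sqrt(sum(v * v for v in values))         # n = x / sqrt(sum x^2)
    return [v / length for v in values]

def percentage_norm(values):
    total = sum(values)                                     # n = x / sum(x)
    return [v / total for v in values]

def variance_based_norm(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]               # z-score

print([round(v, 3) for v in variance_based_norm(x)])
```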
87
87 CogNova Technologies Data Preparation Transformation and Encoding Interval or continuous numeric values Encode the value 1.6 as: – Single real-valued number (0.16)* - OK! – Bits of a binary number (010000) - BAD! – one-of-N quantized intervals (0 1 0 0 0) - NOT GREAT! - discontinuities – distributed (fuzzy) overlapping intervals (0.3 0.8 0.1 0.0 0.0) - BEST! * Target values should be in the 0.1 - 0.9 range, not 0.0 - 1.0
88
88 CogNova Technologies TUTORIAL #5 v Develop and train a BP network on real-world data v Also see slides covering Mitchell's Face Recognition example
89
89 CogNova Technologies Post-Training Analysis
90
90 CogNova Technologies Post-Training Analysis Examining the neural net model: v Visualizing the constructed model v Detailed network analysis Sensitivity analysis of input attributes: v Analytical techniques v Attribute elimination
91
91 CogNova Technologies Post-Training Analysis Visualizing the Constructed Model v Graphical tools can be used to display the output response as selected input variables are changed [Figure: response surface plotted against Size and Temp]
92
92 CogNova Technologies Post-Training Analysis Detailed network analysis v Hidden nodes form internal representation v Manual analysis of weight values often difficult - graphics very helpful v Conversion to equation, executable code v Automated ANN to symbolic logic conversion is a hot area of research
93
93 CogNova Technologies Post-Training Analysis Sensitivity analysis of input attributes v Analytical techniques – factor analysis – network weight analysis v Feature (attribute) elimination – forward feature elimination – backward feature elimination
94
94 CogNova Technologies The ANN Application Development Process Guidelines for using neural networks 1. Try the best existing method first 2. Get a big training set 3. Try a net without hidden units 4. Use a sensible coding for input variables 5. Consider methods of constraining network 6. Use a test set to prevent over-training 7. Determine confidence in generalization through cross-validation
95
95 CogNova Technologies Example Applications v Pattern Recognition (reading zip codes) v Signal Filtering (reduction of radio noise) v Data Segmentation (detection of seismic onsets) v Data Compression (TV image transmission) v Database Mining (marketing, finance analysis) v Adaptive Control (vehicle guidance)
96
96 CogNova Technologies Pros and Cons of Back-Prop
97
97 CogNova Technologies Pros and Cons of Back-Prop Cons: v Local minima - but not generally a concern v Seems biologically implausible v Space and time complexity: lengthy training times v It's a black box! I can't see how it's making decisions v Best suited for supervised learning v Works poorly on dense data with few input variables
98
98 CogNova Technologies Pros and Cons of Back-Prop Pros: v Proven training method for multi-layer nets v Able to learn any arbitrary function (XOR) v Most useful for non-linear mappings v Works well with noisy data v Generalizes well given sufficient examples v Rapid recognition speed v Has inspired many new learning algorithms
99
99 CogNova Technologies Other Networks and Advanced Issues
100
100 CogNova Technologies Other Networks and Advanced Issues v Variations in feed-forward architecture – jump connections to output nodes – hidden nodes that vary in structure v Recurrent networks with feedback connections v Probabilistic networks v General Regression networks v Unsupervised self-organizing networks
101
101 CogNova Technologies THE END Thanks for your participation!