1
Feed-Forward Neural Networks
Speaker: 虞台文
2
Content Introduction Single-Layer Perceptron Networks
Learning Rules for Single-Layer Perceptron Networks: Perceptron Learning Rule, Adaline Learning Rule, δ-Learning Rule. Multilayer Perceptron. Back Propagation Learning Algorithm.
3
Feed-Forward Neural Networks
Introduction
4
Historical Background
1943: McCulloch and Pitts proposed the first computational model of the neuron.
1949: Hebb proposed the first learning rule.
1958: Rosenblatt's work on perceptrons.
1969: Minsky and Papert exposed the limitations of the theory.
1970s: A decade of dormancy for neural networks.
1980s: Neural networks return (self-organization, back-propagation algorithms, etc.).
5
Nervous Systems. The human brain contains ~10^11 neurons.
Each neuron is connected to ~10^4 others. Some scientists have compared the brain to a "complex, nonlinear, parallel computer". The largest modern neural networks achieve a complexity comparable to the nervous system of a fly.
6
Neurons. The main purpose of neurons is to receive, analyze, and transmit information in the form of signals (electric pulses). When a neuron sends information, we say that it "fires".
7
Neurons. Acting through specialized projections known as dendrites and axons, neurons carry information throughout the neural network. This animation demonstrates the firing of a synapse from the pre-synaptic terminal of one neuron to the soma (cell body) of another.
8
A Model of Artificial Neuron
[Diagram: inputs x1, x2, …, xm = −1 with weights wi1, wi2, …, wim; threshold θi (bias); net input f(·); activation a(·); output yi]
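The model on this slide computes a weighted sum of the inputs, compares it with the threshold θi, and passes the result through an activation a(·). A minimal Python sketch of that model (the function name and the example weights are illustrative choices, not from the slides):

```python
def neuron_output(x, w, theta, activation=lambda net: 1 if net >= 0 else -1):
    """One artificial neuron: weighted sum of the inputs minus the
    threshold theta, passed through the activation a(.) to give y_i."""
    net = sum(wi * xi for wi, xi in zip(w, x))  # the net input
    return activation(net - theta)

# Example: two inputs with a hard-threshold (sgn) activation, i.e. an LTU.
y = neuron_output([1.0, 2.0], [0.5, -0.25], theta=0.0)  # net = 0, so y = 1
```

The default activation is the sgn step function used by the perceptron slides; a sigmoid can be passed in its place.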
10
Feed-Forward Neural Networks
Graph representation: nodes: neurons arrows: signal flow directions A neural network that does not contain cycles (feedback loops) is called a feed–forward network (or perceptron). x1 x2 xm y1 y2 yn
11
Layered Structure Output Layer Hidden Layer(s) Input Layer x1 x2 xm y1
x1 x2 xm y1 y2 yn Output Layer Hidden Layer(s) Input Layer
12
Knowledge and Memory x1 x2 xm y1 y2 yn
The output behavior of a network is determined by the weights. The weights are the memory of an NN; knowledge is distributed across the network. A large number of nodes increases the storage "capacity", ensures that the knowledge is robust, and provides fault tolerance. New information is stored by changing the weights. x1 x2 xm y1 y2 yn
13
Pattern Classification
output pattern y. Function: x → y. The NN's output is used to distinguish between and recognize different input patterns. Different output patterns correspond to particular classes of input patterns. Networks with hidden layers can be used to solve more complex problems than just linear pattern classification. x1 x2 xm y1 y2 yn input pattern x
14
Training. Training Set: pairs of input patterns x(i) = (xi1, xi2, …, xim) and desired outputs d(i) = (di1, di2, …, din).
Goal: adjust the weights so that the network outputs (yi1, yi2, …, yin) match the desired outputs for every training pair.
15
Generalization x1 x2 xm y1 y2 yn
A properly trained neural network may produce reasonable answers for input patterns not seen during training (generalization). Generalization is particularly useful for the analysis of "noisy" data (e.g., time series). x1 x2 xm y1 y2 yn
16
Generalization x1 x2 xm y1 y2 yn
A properly trained neural network may produce reasonable answers for input patterns not seen during training (generalization). Generalization is particularly useful for the analysis of "noisy" data (e.g., time series). x1 x2 xm y1 y2 yn without noise with noise
17
Applications Pattern classification Object recognition
Function approximation Data compression Time series analysis and forecast . . .
18
Feed-Forward Neural Networks
Single-Layer Perceptron Networks
19
The Single-Layered Perceptron
x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2
20
Training a Single-Layered Perceptron
Training Set x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2 Goal: d1 d2 dn
21
Learning Rules . . . x1 x2 xm= 1 y1 y2 yn xm-1 d1 d2 dn Training Set
Linear Threshold Units (LTUs) : Perceptron Learning Rule Linearly Graded Units (LGUs) : Widrow-Hoff learning Rule Training Set x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2 Goal: d1 d2 dn
22
Feed-Forward Neural Networks
Learning Rules for Single-Layered Perceptron Networks: Perceptron Learning Rule, Adaline Learning Rule, δ-Learning Rule
23
Perceptron: Linear Threshold Unit. [Diagram: inputs x1, x2, …, xm = −1; weights wi1, wi2, …, wim; threshold θi; activation sgn(·), a step function with outputs +1 and −1]
24
Perceptron: Linear Threshold Unit. Goal: find weights that classify every training pattern correctly. [Diagram: inputs x1, x2, …, xm = −1; weights wi1, wi2, …, wim; threshold θi; sgn activation with outputs +1 and −1]
25
Example. Goal: y = sgn(g(x)) with inputs x1, x2 and augmented input x3 = −1; Class 1 (+1), Class 2 (−1).
The decision line g(x) = 2x1 + x2 − 2 = 0 separates Class 1 from Class 2 (intercepts at x1 = 1 and x2 = 2).
26
Augmented input vector
Goal: Augmented input vector x = (x1, x2, x3)^T with x3 = −1. Class 1 (+1), Class 2 (−1).
27
Augmented input vector
Goal: Augmented input vector x1 x2 x3 (0,0,0) Class 1 (1, 2, 1) (1.5, 1, 1) (1,0, 1) x1 x2 x3= 1 2 1 y Class 2 (1, 2, 1) (2.5, 1, 1) (2,0, 1)
28
Augmented input vector
Goal: Augmented input vector A plane passes through the origin in the augmented input space. x1 x2 x3 (0,0,0) Class 1 (1, 2, 1) (1.5, 1, 1) (1,0, 1) x1 x2 x3= 1 2 1 y Class 2 (1, 2, 1) (2.5, 1, 1) (2,0, 1)
29
Linearly Separable vs. Linearly Non-Separable
AND: linearly separable. OR: linearly separable. XOR: linearly non-separable.
30
Goal. Given training sets T1 ⊆ C1 and T2 ⊆ C2 with elements of the form x = (x1, x2, …, xm−1, xm)^T, where x1, x2, …, xm−1 ∈ R and xm = −1. Assume T1 and T2 are linearly separable. Find w = (w1, w2, …, wm)^T such that w^T x ≥ 0 for every x ∈ T1 and w^T x < 0 for every x ∈ T2.
31
w^T x = 0 is a hyperplane passing through the origin of the augmented input space.
Goal. Given training sets T1 ⊆ C1 and T2 ⊆ C2 with elements of the form x = (x1, x2, …, xm−1, xm)^T, where x1, x2, …, xm−1 ∈ R and xm = −1. Assume T1 and T2 are linearly separable. Find w = (w1, w2, …, wm)^T such that w^T x ≥ 0 for every x ∈ T1 and w^T x < 0 for every x ∈ T2.
32
Observation x1 x2 Which w’s correctly classify x? w1 x w2 w6 w3 w4 w5
+ d = +1 d = −1 Observation x1 x2 Which w's correctly classify x? + w1 x w2 w6 w3 w4 w5 What trick can be used?
33
+ d = +1 d = −1 Observation x1 x2 Is this w ok? + w x w1x1 + w2x2 = 0
34
+ d = +1 d = −1 Observation x1 x2 w1x1 + w2x2 = 0 Is this w ok? + x w
35
Observation x1 x2 w1x1 + w2x2 = 0 Is this w ok? + x How to adjust w? Δw = ? d = +1 d = −1
36
Observation x1 x2 Is this w ok? x How to adjust w? Δw = ηx w d = +1
Reasonable? Before the update w^T x < 0; after, w^T x > 0.
37
+ d = +1 d = −1 Observation x1 x2 Is this w ok? + x How to adjust w? Δw = ηx w
Reasonable? Before the update w^T x < 0; after, w^T x > 0.
38
Observation Δw = ? +x or −x x2 Is this w ok? w x x1 d = +1 d = −1 +
39
Perceptron Learning Rule
Upon misclassification: Δw = +ηx when d = +1, Δw = −ηx when d = −1. Define error r = d − y: r = 0 when there is no error.
40
Perceptron Learning Rule
Define error r = d − y: r = 0 when there is no error; r = +2 when d = +1 and y = −1; r = −2 when d = −1 and y = +1.
41
Perceptron Learning Rule
Δw = η (d − y) x: learning rate η, error (d − y), input x.
42
Summary Perceptron Learning Rule
Based on the general weight learning rule: Δw = η (d − y) x. The weight is unchanged when the classification is correct (d = y) and updated when it is incorrect. Summary: Perceptron Learning Rule.
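The rule summarized above, Δw = η(d − y)x applied only on misclassification, can be sketched in a few lines of Python. The training loop and the AND-gate example are illustrative, not prescribed by the slides; the last input component −1 is the augmented bias input:

```python
def sgn(v):
    return 1 if v >= 0 else -1

def train_perceptron(samples, eta=0.5, epochs=100):
    """Perceptron learning rule on augmented inputs (last component -1).
    w <- w + eta*(d - y)*x; a correct classification (d == y) leaves
    the weight unchanged."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        errors = 0
        for x, d in samples:
            y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:
                errors += 1
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:      # converged (guaranteed if linearly separable)
            break
    return w

# AND gate, augmented with x3 = -1:
AND = [((0, 0, -1), -1), ((0, 1, -1), -1), ((1, 0, -1), -1), ((1, 1, -1), 1)]
w = train_perceptron(AND)
```

Because AND is linearly separable, the loop stops after a handful of epochs, illustrating the convergence theorem that follows.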
43
Summary Perceptron Learning Rule
Converge? Summary Perceptron Learning Rule x y . d +
44
Perceptron Convergence Theorem
Exercise: consult papers or textbooks for a proof of the theorem. Perceptron Convergence Theorem: if the given training set is linearly separable, the learning process will converge in a finite number of steps.
45
The Learning Scenario x2 x1 Linearly Separable. x(1) x(2) x(3) x(4) +
x(3) x(4)
46
The Learning Scenario x1 x2 + x(1) + x(2) w0 x(3) x(4)
47
The Learning Scenario x1 x2 + x(1) + x(2) w0 w1 w0 x(3) x(4)
48
The Learning Scenario x2 x1 x(1) w2 w1 x(2) w1 w0 x(3) x(4) w0 + +
49
The Learning Scenario x2 x1 w2 x(1) w3 w2 w1 x(2) w1 w0 x(3) x(4)
+ x(1) w3 w2 w1 + x(2) w0 w0 x(3) x(4)
50
w4 = w3 The Learning Scenario x2 x1 w2 x(1) w3 w2 w1 x(2) w1 w0
+ x(1) w3 w2 w1 + x(2) w0 w0 x(3) x(4)
51
The Learning Scenario x1 x2 + x(1) w + x(2) x(3) x(4)
52
The Learning Scenario x2 x1 x(1) w x(2) x(3) x(4)
The demonstration is in augmented space. The Learning Scenario x1 x2 + x(1) w + x(2) x(3) x(4) Conceptually, in augmented space, we adjust the weight vector to fit the data.
53
A weight in the shaded area will give correct classification for the positive example.
Weight Space w1 w2 w + x
54
Weight Space w2 w = x x w w1
A weight in the shaded area will give correct classification for the positive example. Weight Space w1 w2 w = x + x w
55
A weight not in the shaded area will give correct classification for the negative example.
Weight Space w1 w2 x w
56
Weight Space w2 w w = x x w1
A weight not in the shaded area will give correct classification for the negative example. Weight Space w1 w2 w w = x x
57
The Learning Scenario in Weight Space
+ x(2) + x(1) x(3) x(4)
58
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w1 x(3) x(4)
59
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w1 w1 x(3) x(4) w0
60
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w2 w1 w1 x(3) x(4) w0
61
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w2 =w3 w1 w1 x(3) x(4) w0
62
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w4 w2 =w3 w1 w1 x(3) x(4) w0
63
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w4 w2 =w3 w5 w1 w1 x(3) x(4) w0
64
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w4 w2 =w3 w5 w6 w1 w1 x(3) x(4) w0
65
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1 x(3) x(4) w0
66
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1 x(3) x(4) w0
67
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w9 w2 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1 x(3) x(4) w0
68
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w10 w9 w2 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1 x(3) x(4) w0
69
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w10 w9 w2 w11 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1 x(3) x(4) w0
70
The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 w11 + x(2) + x(1) w1 x(3) x(4) w0 Conceptually, in weight space, we move the weight into the feasible region.
71
Feed-Forward Neural Networks
Learning Rules for Single-Layered Perceptron Networks: Perceptron Learning Rule, Adaline Learning Rule, δ-Learning Rule
72
Adaline (Adaptive Linear Element)
Widrow [1962] x1 x2 xm wi1 wi2 wim .
73
Adaline (Adaptive Linear Element)
Under what condition is the goal reachable? Adaline (Adaptive Linear Element). Goal: Widrow [1962] x1 x2 xm wi1 wi2 wim .
74
LMS (Least Mean Square)
Minimize the cost function (error function): E(w) = ½ Σk (d(k) − w^T x(k))².
75
Gradient Descent Algorithm
Our goal is to go downhill. [Plot: error surface E(w) over (w1, w2) and its contour map]
76
Gradient Descent Algorithm
Our goal is to go downhill. How do we find the steepest descent direction? [Plot: error surface E(w) over (w1, w2) and its contour map]
77
Gradient Operator. Let f(w) = f(w1, w2, …, wm) be a function over R^m.
Define ∇f = (∂f/∂w1, ∂f/∂w2, …, ∂f/∂wm)^T, so that df = (∇f)^T Δw.
78
Gradient Operator. df = (∇f)^T Δw: df > 0 means go uphill; df = 0 means stay on the plain; df < 0 means go downhill.
79
The Steepest Descent Direction
To minimize f, we choose Δw = −η ∇f; then df = −η ‖∇f‖² ≤ 0, so every step goes downhill.
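Choosing Δw = −η∇f and iterating is the entire algorithm. A tiny sketch on a quadratic bowl (the example function and constants are illustrative, not from the slides):

```python
def grad_descent(grad, w, eta=0.1, steps=200):
    """Repeatedly move opposite the gradient: delta_w = -eta * grad f(w)."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# E(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1).
grad_E = lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)]
w_star = grad_descent(grad_E, [0.0, 0.0])
```

Each coordinate contracts toward the minimizer by a factor (1 − 2η) per step, so `w_star` ends up very close to (3, −1).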
80
LMS (Least Mean Square)
Minimize the cost function (error function): E(w) = ½ Σk (d(k) − w^T x(k))².
81
Adaline Learning Rule. Minimize the cost function (error function).
Weight Modification Rule: Δw = η (d − net) x, where net = w^T x is the linear output.
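The Adaline rule measures the error on the linear output net = wᵀx, before any threshold is applied. A sketch of the incremental (per-sample) update; the linear target d = 2x1 − x2 and the constants are illustrative choices, not from the slides:

```python
def lms_step(w, x, d, eta):
    """One incremental LMS (Widrow-Hoff) update.  The error (d - net)
    is taken on the *linear* output net = w.x."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * (d - net) * xi for wi, xi in zip(w, x)]

# Learn the linear map d = 2*x1 - x2 from consistent samples.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
           ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w = [0.0, 0.0]
for _ in range(500):
    for x, d in samples:
        w = lms_step(w, x, d, eta=0.1)
```

Because the samples are consistent with one linear map, cycling the incremental updates drives `w` to (2, −1).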
82
Learning Modes Batch Learning Mode: Incremental Learning Mode:
83
Summary: Adaline Learning Rule
Also known as the LMS Algorithm, or the Widrow-Hoff Learning Rule: Δw = η (d − w^T x) x. Does it converge?
84
LMS Convergence. Based on the independence theory (Widrow, 1976):
The successive input vectors are statistically independent.
At time t, the input vector x(t) is statistically independent of all previous samples of the desired response, namely d(1), d(2), …, d(t−1).
At time t, the desired response d(t) depends on x(t), but is statistically independent of all previous values of the desired response.
The input vector x(t) and desired response d(t) are drawn from Gaussian-distributed populations.
85
LMS Convergence. It can be shown that LMS converges if
0 < η < 2 / λmax,
where λmax is the largest eigenvalue of the correlation matrix Rx of the inputs.
86
LMS Convergence. It can be shown that LMS converges if 0 < η < 2 / λmax. Since λmax is hardly available, we commonly use 0 < η < 2 / tr[Rx]; the trace upper-bounds λmax, so this choice is safe (if conservative).
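The trace bound is convenient because tr[Rx] is just the mean squared norm of the input vectors, so no eigen-decomposition is needed. A sketch (the sample inputs are illustrative):

```python
def safe_eta_bound(inputs):
    """Conservative LMS step-size bound 2 / tr[R_x].  The trace of the
    input correlation matrix equals the mean squared input norm, and it
    upper-bounds lambda_max, so any eta below this bound also satisfies
    eta < 2 / lambda_max."""
    mean_sq_norm = sum(sum(xi * xi for xi in x) for x in inputs) / len(inputs)
    return 2.0 / mean_sq_norm

inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
bound = safe_eta_bound(inputs)  # tr[Rx] = (1 + 1 + 2 + 5) / 4 = 2.25
```

For these inputs the bound is 2 / 2.25 ≈ 0.889, so a step size such as η = 0.1 is well inside the stable range.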
87
Comparisons
            | Perceptron Learning Rule | Adaline (Widrow-Hoff)
Fundamental | Hebbian learning         | Gradient descent
Convergence | In finite steps          | Converges asymptotically
Constraint  | Linearly separable       | Linear independence
88
Feed-Forward Neural Networks
Learning Rules for Single-Layered Perceptron Networks: Perceptron Learning Rule, Adaline Learning Rule, δ-Learning Rule
89
Adaline x1 x2 xm wi1 wi2 wim .
90
Unipolar Sigmoid x1 x2 xm wi1 wi2 wim .
91
Bipolar Sigmoid x1 x2 xm wi1 wi2 wim .
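The two sigmoids above, and the derivative identities the δ-rule relies on, in a short sketch (written for steepness λ = 1):

```python
import math

def unipolar_sigmoid(net):
    """a(net) = 1 / (1 + exp(-net)); output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def bipolar_sigmoid(net):
    """a(net) = 2 / (1 + exp(-net)) - 1; output in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

# Derivatives expressed through the output y, which is what makes the
# delta rule cheap to compute:
#   unipolar: a'(net) = y * (1 - y)
#   bipolar:  a'(net) = (1 - y**2) / 2
```

Both derivatives vanish as y approaches its extremes, which is exactly the saturation effect discussed on the learning-efficacy slides below.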
92
Goal Minimize
93
Gradient Descent Algorithm
Minimize
94
The Gradient. Minimize E. The gradient ∂E/∂w depends on the activation function used.
95
Weight Modification Rule
Minimize Batch Learning Rule Incremental
96
The Learning Efficacy Minimize Exercise Sigmoid Adaline Unipolar
Bipolar Exercise
97
Learning Rule: Unipolar Sigmoid
Minimize E. Weight Modification Rule: Δwij = η (dj − yj) yj (1 − yj) xi, using a′(net) = y (1 − y) for the unipolar sigmoid.
98
Comparisons Batch Adaline Incremental Batch Sigmoid Incremental
99
The Learning Efficacy. Adaline: y = a(net) = net, so a′(net) = 1 (constant). Sigmoid: a′(net) depends on the output.
100
The Learning Efficacy
The learning efficacy of Adaline is constant, meaning that Adaline will never saturate: y = a(net) = net, a′(net) = 1.
101
The Learning Efficacy
The sigmoid saturates when its output nears either extreme, since a′(net) approaches zero there; Adaline does not.
102
Initialization for Sigmoid Neurons
Why? Before training, the weights must be sufficiently small: large weights push the sigmoid into its saturated region, where a′(net) ≈ 0 and learning stalls.
103
Feed-Forward Neural Networks
Multilayer Perceptron
104
Multilayer Perceptron
x1 x2 xm y1 y2 yn Output Layer Hidden Layer Input Layer
105
Multilayer Perceptron
Where does the knowledge come from? Classification Output Analysis Learning Input
106
How Does an MLP Work? Example: XOR is not linearly separable.
Is a single-layer perceptron workable? XOR 1 x1 x2
107
How Does an MLP Work? 00 01 11 Example: XOR x1 x2 x1 x2 x3 = −1 y1 y2 L1 1
1 XOR x1 x2 00 L2 L1 x1 x2 x3= 1 y1 y2 L2 01 11
108
How Does an MLP Work? Example: L1 1 XOR x1 x2 L3 00 L2 01 1 y1 y2 11
110
How Does an MLP Work? Example: y1 y2 y3 = −1 z y1 y2 x1 x2 x3 = −1 L3 L3 1
1 y1 y2 L2 L1 x1 x2 x3= 1
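In this construction the two hidden units implement the lines L1 and L2, and the output unit fires only for points between them. A hand-wired sketch with LTUs (the particular offsets 0.5 and 1.5 are an illustrative choice of L1 and L2, not taken from the slides):

```python
def sgn(v):
    return 1 if v >= 0 else -1

def xor_mlp(x1, x2):
    """Two hand-picked hidden LTUs carve the plane with lines L1 and L2;
    the output LTU fires exactly when the point lies between them."""
    y1 = sgn(x1 + x2 - 0.5)    # L1: +1 above the lower line
    y2 = sgn(x1 + x2 - 1.5)    # L2: +1 above the upper line
    return sgn(y1 - y2 - 0.5)  # +1 only when y1 = +1 and y2 = -1
```

On the four corners this reproduces XOR: only (0,1) and (1,0) fall between the two lines.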
111
Parity Problem Is the problem linearly separable? 1 000 001 010 011
1 000 001 010 011 100 101 110 111 x1 x2 x3 x1 x2 x3
112
Parity Problem 1 000 001 010 011 100 101 110 111 x3 x2 x1 x1 x2 x3 P1
1 000 001 010 011 100 101 110 111 x1 x2 x3 x3 P1 P2 x2 P3 x1
113
Parity Problem 1 000 001 010 011 100 101 110 111 x1 x2 x3 x1 x2 x3 P1 P2 P3 111 011 001 000
114
Parity Problem x1 x2 x3 x1 x2 x3 y1 y2 y3 111 011 001 000 P1 P2 P3 P1
115
Parity Problem y1 y2 y3 x1 x2 x3 P1 P2 P3 111 P4 011 001 000
116
Parity Problem y1 z y3 P4 y2 y1 y2 y3 P4 P1 x1 x2 x3 P2 P3
117
General Problem
118
General Problem
119
Hyperspace Partition L3 L2 L1
120
Region Encoding 001 000 L3 L2 L1 010 100 101 110 111
121
Hyperspace Partition & Region Encoding Layer
101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
122
Region Identification Layer
101 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
123
Region Identification Layer
001 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
124
Region Identification Layer
000 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
125
Region Identification Layer
110 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
126
Region Identification Layer
010 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
127
Region Identification Layer
100 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
128
Region Identification Layer
111 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
129
Classification 1 1 1 x1 x2 x3 001 101 000 110 010 100 111 L1 L2 L3
1 1 1 001 101 000 110 010 100 111 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3
130
Feed-Forward Neural Networks
Back Propagation Learning Algorithm
131
Activation Function — Sigmoid
[Plot: unipolar sigmoid over net, with a(0) = 0.5] Remember this: a′(net) = a(net)(1 − a(net)).
132
Supervised Learning Output Layer Hidden Layer Input Layer x1 x2 xm o1
Training Set Supervised Learning x1 x2 xm o1 o2 on d1 d2 dn Output Layer Hidden Layer Input Layer
133
Supervised Learning. Training Set: pairs (x(l), d(l)).
Goal: minimize the Sum of Squared Errors E = ½ Σl Σj (dj(l) − oj(l))². x1 x2 xm o1 o2 on d1 d2 dn
134
Back Propagation Learning Algorithm
x1 x2 xm o1 o2 on d1 d2 dn Learning on Output Neurons Learning on Hidden Neurons
135
Learning on Output Neurons
j i o1 oj on d1 dj dn wji ? ?
136
Learning on Output Neurons
j i o1 oj on d1 dj dn wji depends on the activation function
137
Learning on Output Neurons
j i o1 oj on d1 dj dn wji Using sigmoid,
138
Learning on Output Neurons
j i o1 oj on d1 dj dn wji Using sigmoid,
139
Learning on Output Neurons
j i o1 oj on d1 dj dn wji
140
Learning on Output Neurons
j i o1 oj on d1 dj dn wji How to train the weights connecting to output neurons?
141
Learning on Hidden Neurons
j k i wik wji ? ?
142
Learning on Hidden Neurons
j k i wik wji
143
Learning on Hidden Neurons
j k i wik wji ?
144
Learning on Hidden Neurons
j k i wik wji
145
Learning on Hidden Neurons
j k i wik wji
146
Back Propagation o1 oj on j k i d1 dj dn x1 xm
147
Back Propagation o1 oj on j k i d1 dj dn x1 xm
148
Back Propagation o1 oj on j k i d1 dj dn x1 xm
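Putting the output-layer and hidden-layer deltas together gives the full algorithm. A minimal sketch that trains a 2-input, one-hidden-layer MLP on XOR; the network size, learning rate, epoch count, and initialization are illustrative choices, not prescribed by the slides:

```python
import math, random

def train_xor_bp(hidden=3, eta=0.7, epochs=5000, seed=1):
    """Minimal 2-h-1 MLP trained by back-propagation on XOR.
    Output delta:  delta_o = (d - o) * o * (1 - o)
    Hidden delta:  delta_k = (w_ok * delta_o) * y_k * (1 - y_k)"""
    rng = random.Random(seed)
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    # Small random initial weights; the last column is the bias weight.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def forward(x):
        y = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
        o = sig(sum(wj * yj for wj, yj in zip(W2, y)) + W2[-1])
        return y, o

    def sse():
        return sum((d - forward(x)[1]) ** 2 for x, d in data)

    e0 = sse()
    for _ in range(epochs):
        for x, d in data:
            y, o = forward(x)
            delta_o = (d - o) * o * (1 - o)
            for k in range(hidden):
                # Hidden delta uses the pre-update output weight W2[k].
                delta_k = W2[k] * delta_o * y[k] * (1 - y[k])
                W1[k][0] += eta * delta_k * x[0]
                W1[k][1] += eta * delta_k * x[1]
                W1[k][2] += eta * delta_k        # bias input = 1
                W2[k] += eta * delta_o * y[k]
            W2[-1] += eta * delta_o
    return e0, sse()

e_before, e_after = train_xor_bp()
```

The returned pair compares the sum of squared errors before and after training; the error propagated backward through the hidden deltas is what gives the algorithm its name.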
149
Learning Factors: Initial Weights; Learning Constant (η); Cost Functions; Momentum; Update Rules; Training Data and Generalization; Number of Layers; Number of Hidden Nodes.
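Of these factors, momentum is easy to state precisely: the update keeps a fraction α of the previous step, Δw(t) = −η∇E + αΔw(t−1), which smooths the trajectory across flat regions and narrow ravines of the error surface. A sketch on a quadratic bowl (the example function and constants are illustrative):

```python
def momentum_descent(grad, w, eta=0.05, alpha=0.9, steps=500):
    """Gradient descent with a momentum term:
    delta_w(t) = -eta * grad E(w) + alpha * delta_w(t-1)."""
    delta = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        delta = [-eta * gi + alpha * di for gi, di in zip(g, delta)]
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Same bowl as before: E(w) = (w1 - 3)^2 + (w2 + 1)^2, minimum at (3, -1).
grad_E = lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)]
w_star = momentum_descent(grad_E, [0.0, 0.0])
```

With α = 0 this reduces to plain gradient descent; with α near 1 the iterate overshoots and oscillates briefly but still settles on the minimum.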
150
Reading Assignments
Shi Zhong and Vladimir Cherkassky, "Factors Controlling Generalization Ability of MLP Networks," in Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, Washington, DC, July.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, MIT Press, Cambridge, 1986.