1
CHAPTER 11: Multilayer Perceptrons
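National Yunlin University of Science and Technology (國立雲林科技大學), Graduate Institute of Computer Science and Information Engineering. Dr. Chuan-Yu Chang (張傳育). Office: EB 212 TEL: ext. 4516 Website: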
2
Neural Networks
Networks of processing units (neurons) with connections (synapses) between them.
The brain is composed of:
- Large number of neurons: $10^{10}$
- Large connectivity: $10^{5}$
- Parallel processing
- Distributed computation/memory
- Robust to noise, failures
3
Basic Models of Artificial neurons
An artificial neuron can be referred to as a processing element, node, or threshold logic unit. A neuron has four basic components:
- A set of synapses with associated synaptic weights.
- A summing device: each input is multiplied by its synaptic weight and the products are summed.
- An activation function, which limits the amplitude of the neuron output.
- A threshold, which is applied externally and lowers the cumulative input to the activation function.
4
Basic Models of Artificial neurons
5
Basic Models of Artificial neurons
The output of the linear combiner is where The output of the activation function is The output of the neuron is given by
6
Basic Models of Artificial neurons
The threshold (or bias) is incorporated into the synaptic weight vector wq for neuron q.
7
Basic Models of Artificial neurons
The effective internal activation potential is written as The output of neuron q is written as
8
Basic Activation Functions
The activation function (also called the transfer function) can be linear or nonlinear.
Linear (identity) activation function: for activation input $v$, the output is $y = f(v) = v$.
9
Basic Activation Functions
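Hard limiter (binary function, threshold function), output in {0, 1}.
The output of the binary hard limiter for activation input $v$ can be written as
$y = f(v) = \begin{cases} 0 & v < 0 \\ 1 & v \ge 0 \end{cases}$
[Figure: hard limiter activation function]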
10
Basic Activation Functions
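Bipolar (symmetric) hard limiter, output in {-1, 1}.
The output of the symmetric hard limiter for activation input $v$ can be written as
$y = f(v) = \begin{cases} -1 & v < 0 \\ 0 & v = 0 \\ 1 & v > 0 \end{cases}$
Sometimes referred to as the signum (or sign) function.
[Figure: symmetric limiter activation function]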
11
Basic Activation Functions
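Saturation linear function (piecewise linear function).
The output of the saturation linear function for activation input $v$ is given by
$y = f(v) = \begin{cases} 0 & v < -\tfrac{1}{2} \\ v + \tfrac{1}{2} & -\tfrac{1}{2} \le v \le \tfrac{1}{2} \\ 1 & v > \tfrac{1}{2} \end{cases}$
[Figure: saturation linear activation function]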
12
Basic Activation Functions
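Symmetric saturation linear function.
The output of the symmetric saturation linear function for activation input $v$ is given by
$y = f(v) = \begin{cases} -1 & v < -1 \\ v & -1 \le v \le 1 \\ 1 & v > 1 \end{cases}$
[Figure: symmetric saturation linear activation function]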
13
Basic Activation Functions
Sigmoid function (S-shaped function).
The output of the binary sigmoid function for activation input $v$ is given by
$y = f(v) = \dfrac{1}{1 + e^{-\alpha v}}$
where $\alpha$ is the slope parameter of the binary sigmoid function.
[Figure: binary sigmoid function]
14
Basic Activation Functions
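Sigmoid function (S-shaped function).
The bipolar sigmoid function (hyperbolic tangent sigmoid) for activation input $v$ is given by
$y = f(v) = \tanh(\alpha v) = \dfrac{e^{\alpha v} - e^{-\alpha v}}{e^{\alpha v} + e^{-\alpha v}}$
The hard limiter has no derivative at the origin, whereas the sigmoid is continuous and differentiable everywhere.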
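The activation functions on the preceding slides can be sketched in a few lines of NumPy. This is only an illustration: the function names, the slope argument `a`, and the convention that the hard limiters return their upper value at exactly zero are assumptions, not taken from the slides.

```python
import numpy as np

def linear(v):
    """Identity activation: y = v."""
    return np.asarray(v, dtype=float)

def hard_limiter(v):
    """Binary hard limiter: 1 where v >= 0, else 0 (assumed convention at 0)."""
    return np.where(np.asarray(v) >= 0, 1.0, 0.0)

def symmetric_hard_limiter(v):
    """Bipolar hard limiter: +1 where v >= 0, else -1 (assumed convention at 0)."""
    return np.where(np.asarray(v) >= 0, 1.0, -1.0)

def binary_sigmoid(v, a=1.0):
    """Binary sigmoid with slope parameter a, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a * np.asarray(v, dtype=float)))

def bipolar_sigmoid(v, a=1.0):
    """Hyperbolic tangent sigmoid, output in (-1, 1)."""
    return np.tanh(a * np.asarray(v, dtype=float))
```

Unlike the hard limiters, both sigmoids are differentiable, which is what gradient-based training relies on.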
15
Perceptron The perceptron is the basic processing element.
(Rosenblatt, 1962)
16
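What a Perceptron Does
Regression: $y = wx + w_0$
Classification: $y = 1(wx + w_0 > 0)$
[Figure: perceptron diagrams with input $x$, bias input $x_0 = +1$, weights $w$ and $w_0$, output $y$, and a unit labeled $s$]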
17
K Outputs
K parallel perceptrons. $x_j$, $j = 0, \dots, d$ are the inputs and $y_i$, $i = 1, \dots, K$ are the outputs. $w_{ij}$ is the weight of the connection from input $x_j$ to output $y_i$. When used for a K-class classification problem, there is a post-processing step to choose the maximum, or softmax if we need the posterior probabilities.
18
K Outputs
There are K perceptrons, each of which has a weight vector $w_i$, where $w_{ij}$ is the weight from input $x_j$ to output $y_i$:
$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x$, or in matrix form $y = Wx$
W is the K × (d + 1) weight matrix of $w_{ij}$.
Classification: during testing, we choose $C_i$ if $y_i = \max_k y_k$.
19
Training
Online (instances seen one by one) vs. batch (whole sample) learning:
- No need to store the whole sample
- Problem may change in time
- Wear and degradation in system components
Stochastic gradient descent: update after a single pattern.
Generic update rule (LMS rule): $\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$
20
Simple adaptive linear combiner
[Figure: simple adaptive linear combiner; the inputs include $x_0 = 1$ with weight $w_0 = b$ (bias)]
21
Simple adaptive linear combiner
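The difference between the desired response and the network response is
$e(k) = d(k) - w^T(k)\,x(k)$ (1)
The MSE criterion can be written as
$J(w) = \tfrac{1}{2} E\{e^2(k)\}$ (2)
Expanding Eq. (2):
$J(w) = \tfrac{1}{2} E\{d^2(k)\} - E\{d(k)\,x^T(k)\}\,w + \tfrac{1}{2}\,w^T E\{x(k)\,x^T(k)\}\,w = \tfrac{1}{2} E\{d^2(k)\} - p^T w + \tfrac{1}{2}\,w^T C_x w$ (3)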
22
Simple adaptive linear combiner
$p = E\{d(k)\,x(k)\}$: cross-correlation vector between the desired response and the input patterns.
$C_x = E\{x(k)\,x^T(k)\}$: covariance matrix for the input patterns.
In the vector space of the weights, the MSE surface $J(w)$ has a unique minimum. Accordingly, we can compute the gradient of the performance measure in Eq. (3) with respect to the weight vector $w$ and set the result equal to zero for the optimum conditions:
$\nabla_w J(w) = -p + C_x w = 0$ (4)
The optimal weights $w^*$ are obtained as
$w^* = C_x^{-1}\,p$ (5)
23
The LMS Algorithm Typical MSE surface of an adaptive linear combiner
24
The LMS Algorithm
Practical use of Eq. (5) is limited for two reasons:
- Evaluating the inverse of the covariance matrix is computationally very costly.
- Eq. (5) is not suitable for online modification of the weights, because in most cases the covariance matrix and the cross-correlation vector are not known a priori.
To resolve these problems, Widrow and Hoff developed the LMS algorithm:
- The goal is to obtain the optimal values of the synaptic weights, where $J(w)$ is minimum.
- Search the error surface using a gradient descent method to find the minimum value (where the gradient is zero).
- We can reach the bottom of the error surface by changing the weights in the direction of the negative gradient of the surface.
25
The LMS Algorithm
Because the gradient of the surface cannot be computed without knowledge of the input covariance matrix and the cross-correlation vector, these must be estimated during an iterative procedure. An estimate of the MSE gradient can be obtained by taking the gradient of the instantaneous error surface.
The gradient of $J(w)$ is approximated as
$\nabla_w J(w) \approx \dfrac{\partial}{\partial w(k)}\left[\tfrac{1}{2} e^2(k)\right] = -e(k)\,x(k)$ (6)
The learning rule for updating the weights using the steepest descent gradient method is
$w(k+1) = w(k) + \mu\,e(k)\,x(k)$ (7)
The learning rate $\mu$ specifies the magnitude of the update step for the weights in the negative gradient direction.
26
The LMS Algorithm
If the learning rate $\mu$ is chosen too small, the learning algorithm modifies the weights slowly and a relatively large number of iterations is required. If $\mu$ is set too large, the learning rule can become numerically unstable and the weights fail to converge.
27
The LMS Algorithm
The scalar form of the LMS algorithm can be written as
$w_i(k+1) = w_i(k) + \mu\,e(k)\,x_i(k)$ (8)
$e(k) = d(k) - \sum_{j} w_j(k)\,x_j(k)$ (9)
We must establish an upper bound on the learning rate parameter to ensure stability:
$0 < \mu < \dfrac{2}{\lambda_{\max}}$ (10)
where $\lambda_{\max}$ is the largest eigenvalue of the input covariance matrix $C_x$.
28
The LMS Algorithm
To make convergence of the LMS algorithm less sensitive to stability problems, the acceptable values of the learning rate are commonly bounded by
$0 < \mu < \dfrac{2}{\mathrm{trace}(C_x)}$ (11)
The bound in (11) is more conservative than the bound in (10), because
$\mathrm{trace}(C_x) = \sum_i \lambda_i \ge \lambda_{\max}$ (12)
29
The LMS Algorithm
Both (10) and (11) assume that we have at least an estimate of the input covariance matrix. In most practical cases such an estimate is difficult to obtain. Even if some estimate of the covariance matrix is available, the learning rate is frequently set to a fixed value. One of the major problems with a fixed learning rate is the accuracy of the results, so it is often more appropriate to let the learning rate change with time.
Robbins and Monro proposed a root-finding algorithm (stochastic approximation) that changes the learning rate as
$\mu(k) = \dfrac{\kappa}{k}$ (13)
where $\kappa$ is a very small constant.
Drawback: the learning rate decreases too quickly.
30
The LMS Algorithm
A reasonable solution is for $\mu$ to be large at the beginning of training and then gradually decrease as the network converges (schedule-type adjustment).
Darken and Moody: search-then-converge algorithm
- Search phase: $\mu$ is relatively large and almost constant.
- Converge phase: $\mu$ is decreased toward zero.
$\mu(k) = \dfrac{\mu_0}{1 + k/\tau}$ (14)
with $\mu_0 > 0$ and $\tau \gg 1$, typically $100 \le \tau \le 500$.
These methods of adjusting the learning rate are commonly called learning rate schedules.
31
The LMS Algorithm
Adaptive normalization approach (non-schedule-type)
$\mu$ is adjusted according to the input data at every time step:
$\mu(k) = \dfrac{\mu_0}{\lVert x(k) \rVert_2^2}$ (15)
where $\mu_0$ is a fixed constant. Stability is guaranteed if $0 < \mu_0 < 2$; the practical range is $0.1 \le \mu_0 \le 1$.
32
The LMS Algorithm
[Figure: comparison of two learning rate schedules, the stochastic approximation schedule of Eq. (13) and the search-then-converge schedule of Eq. (14), against a constant $\mu$]
33
Summary of the LMS algorithm
Step 1: Set $k = 1$, initialize the synaptic weight vector $w(k=1)$, and select values for $\mu_0$ and $\tau$.
Step 2: Compute the learning rate parameter $\mu(k) = \dfrac{\mu_0}{1 + k/\tau}$.
Step 3: Compute the error $e(k) = d(k) - w^T(k)\,x(k)$.
Step 4: Update the synaptic weights $w(k+1) = w(k) + \mu(k)\,e(k)\,x(k)$.
Step 5: If convergence is achieved, stop; else set $k = k + 1$ and go to Step 2.
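The five steps above can be sketched as a short NumPy routine. The function name, the default values of `mu0` and `tau`, the fixed epoch count used in place of a convergence test, and the synthetic data (loosely modeled on a three-component Gaussian input with a known linear model) are all assumptions made for illustration.

```python
import numpy as np

def lms_search_then_converge(X, d, mu0=0.05, tau=200.0, epochs=5):
    """LMS with a search-then-converge learning rate schedule.

    X: (Q, n) input vectors, d: (Q,) desired outputs.
    Returns the learned weight vector of shape (n,).
    """
    w = np.zeros(X.shape[1])             # Step 1: initial weights, k = 1
    k = 1
    for _ in range(epochs):              # fixed number of passes (assumed stop rule)
        for x_k, d_k in zip(X, d):
            mu = mu0 / (1.0 + k / tau)   # Step 2: learning rate schedule
            e = d_k - w @ x_k            # Step 3: instantaneous error
            w = w + mu * e * x_k         # Step 4: weight update
            k += 1                       # Step 5: next iteration
    return w

# Hypothetical data: zero-mean Gaussian inputs with variances 5, 1, 0.5
rng = np.random.default_rng(0)
b = np.array([1.0, 0.8, -1.0])
X = rng.normal(size=(1000, 3)) * np.sqrt([5.0, 1.0, 0.5])
d = X @ b
w = lms_search_then_converge(X, d)
```

Because the targets here are noise-free, the learned `w` should approach `b`; with noisy targets the decaying learning rate is what lets the estimate settle.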
34
Example : Parametric system identification
Input data consist of 1000 zero-mean Gaussian random vectors with three components. The bias is set to zero. The variances of the components of $x$ are 5, 1, and 0.5. The assumed linear model is given by $b = [1, 0.8, -1]^T$.
To generate the target values, the 1000 input vectors are used to form a matrix $X = [x_1\ x_2\ \dots\ x_{1000}]$, and the desired outputs are computed according to $d = b^T X$.
[Figure: block diagram in which the model $b$ maps input $x$ to output $d$]
The learning process was terminated when
[Figure: the progress of the learning rate parameter as it is adjusted according to the search-then-converge schedule]
35
Example (cont.) Parametric system identification: estimating a parameter vector associated with a dynamic model of a system given only input/output data from the system. The root mean square (RMS) value of the performance measure.
36
Training a Perceptron: Regression
In online learning, we do not write the error function over the whole sample but on individual instances. Starting from random initial weights, at each iteration we adjust the parameters a little bit to minimize the error, without forgetting what we have previously learned. If this error function is differentiable, we can use gradient descent.
In regression, the error on the single instance pair with index $t$, $(x^t, r^t)$, is
$E^t(w \mid x^t, r^t) = \tfrac{1}{2}(r^t - y^t)^2 = \tfrac{1}{2}\left[r^t - w^T x^t\right]^2$
and the online update is
$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$ (16)
where $\eta$ is the learning factor, which is gradually decreased in time. This is known as stochastic gradient descent.
37
Classification
Update rules can be derived for classification problems using logistic discrimination, where updates are done after each pattern. With two classes, for the single instance $(x^t, r^t)$ where $r^t = 1$ if $x^t \in C_1$ and $r^t = 0$ if $x^t \in C_2$, the single sigmoid output is
$y^t = \mathrm{sigmoid}(w^T x^t)$ (17)
The cross-entropy is
$E^t(w \mid x^t, r^t) = -r^t \log y^t - (1 - r^t)\log(1 - y^t)$ (18)
Using gradient descent, we get the following online update rule for $j = 0, \dots, d$:
$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$ (19)
38
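Classification
When there are K > 2 classes, for the single instance $(x^t, r^t)$ where $r_i^t = 1$ if $x^t \in C_i$ and $r_i^t = 0$ otherwise, the outputs are
$y_i^t = \dfrac{\exp(w_i^T x^t)}{\sum_k \exp(w_k^T x^t)}$ (20)
and the cross-entropy is
$E^t(\{w_i\}_i \mid x^t, r^t) = -\sum_i r_i^t \log y_i^t$ (21)
Using gradient descent, we get the following online update rule for $i = 1, \dots, K$, $j = 0, \dots, d$:
$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$ (22)
Update = Learning Factor · (Desired Output − Actual Output) · Input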
39
Classification Perceptron training algorithm implementing stochastic online gradient descent for the case with K > 2 classes.
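Since the algorithm figure is not reproduced here, the following is a minimal sketch of stochastic online gradient descent for K > 2 classes with softmax outputs. The function names, the initial weight range, the learning factor, the epoch count, and the three-cluster toy data are assumptions chosen for illustration.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def train_perceptron(X, R, eta=0.1, epochs=50, seed=0):
    """X: (N, d) inputs, R: (N, K) one-hot labels. Returns W of shape (K, d+1)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = R.shape[1]
    Xb = np.hstack([np.ones((N, 1)), X])            # x_0 = +1 carries the bias w_i0
    W = rng.uniform(-0.01, 0.01, size=(K, d + 1))   # small random initial weights
    for _ in range(epochs):
        for t in rng.permutation(N):                # instances in random order
            y = softmax(W @ Xb[t])                  # outputs y_i^t
            W += eta * np.outer(R[t] - y, Xb[t])    # update = eta * (r - y) * x
    return W

# Hypothetical toy problem: three well-separated Gaussian clusters in 2-D
rng = np.random.default_rng(1)
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centers])
labels = np.repeat(np.arange(3), 30)
R = np.eye(3)[labels]
W = train_perceptron(X, R)
pred = np.argmax(np.hstack([np.ones((90, 1)), X]) @ W.T, axis=1)
accuracy = float(np.mean(pred == labels))
```

At test time the class with the largest output is chosen, so the softmax itself is not needed for prediction, only for training.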
40
Learning Boolean AND y = s(x1 + x2 − 1.5)
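A quick check of the AND perceptron above, taking $s(\cdot)$ to be a step function that outputs 1 for positive input (an assumed convention; the function names are illustrative):

```python
def step(a):
    """Threshold activation: 1 if a > 0, else 0."""
    return 1 if a > 0 else 0

def and_perceptron(x1, x2):
    """y = s(x1 + x2 - 1.5): fires only when both inputs are 1."""
    return step(x1 + x2 - 1.5)

truth_table = [(x1, x2, and_perceptron(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
```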
41
XOR
No $w_0, w_1, w_2$ satisfy:
$w_0 \le 0$
$w_2 + w_0 > 0$
$w_1 + w_0 > 0$
$w_1 + w_2 + w_0 \le 0$
(Minsky and Papert, 1969)
42
Multilayer Perceptron
The multilayer perceptron is an artificial neural network structure and is a nonparametric estimator that can be used for classification and regression.
43
Multilayer Perceptron
The branches can only broadcast information in one direction. Synaptic weights can be adjusted according to a defined learning rule.
[Figure: h-p-m feedforward MLP neural network]
In general there can be any number of hidden layers in the architecture; however, from a practical perspective, only one or two hidden layers are used.
44
Multilayer Perceptron
Three-layer network:
- 1 output layer
- 2 hidden layers
45
Multilayer Perceptron
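The first layer has the weight matrix $W^{(1)}$.
The second layer has the weight matrix $W^{(2)}$.
The third layer has the weight matrix $W^{(3)}$.
Define a diagonal nonlinear operator matrix
$\Gamma[\cdot] \triangleq \mathrm{diag}\left[f(\cdot), f(\cdot), \dots, f(\cdot)\right]$ (23)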
46
Multilayer Perceptron
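The output of the first layer can be written as
$x_{out}^{(1)} = \Gamma^{(1)}\left[W^{(1)} x\right]$ (24)
The output of the second layer can be written as
$x_{out}^{(2)} = \Gamma^{(2)}\left[W^{(2)} x_{out}^{(1)}\right]$ (25)
The output of the third layer can be written as
$x_{out}^{(3)} = \Gamma^{(3)}\left[W^{(3)} x_{out}^{(2)}\right]$ (26)
Substituting (24) into (25), and then into (26), gives the final output
$y = \Gamma^{(3)}\left[W^{(3)}\,\Gamma^{(2)}\left[W^{(2)}\,\Gamma^{(1)}\left[W^{(1)} x\right]\right]\right]$ (27)
Once the network is in use the synaptic weights are fixed, so a training process must be carried out beforehand to adjust the weights properly.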
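The layer-by-layer composition above amounts to a short loop. The sketch below assumes tanh as the elementwise activation, omits explicit bias terms, and uses arbitrary layer sizes and random weights; none of these choices come from the slides.

```python
import numpy as np

def forward(x, weights, f=np.tanh):
    """Apply x_out = f(W @ x_in) once per layer and return the final output."""
    out = np.asarray(x, dtype=float)
    for W in weights:
        out = f(W @ out)      # Gamma[W x]: the same nonlinearity on every component
    return out

# Hypothetical h-p-m network with n inputs
rng = np.random.default_rng(0)
n, h, p, m = 5, 4, 3, 2
weights = [rng.uniform(-0.5, 0.5, size=(h, n)),   # first layer
           rng.uniform(-0.5, 0.5, size=(p, h)),   # second layer
           rng.uniform(-0.5, 0.5, size=(m, p))]   # third (output) layer
y = forward(rng.normal(size=n), weights)
```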
47
Backpropagation Learning Algorithm
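The standard BP algorithm for training the MLP NN is based on the steepest descent gradient approach applied to the minimization of an energy function representing the instantaneous error:
$E_q = \tfrac{1}{2}\left(d_q - x_{out}^{(3)}\right)^T\left(d_q - x_{out}^{(3)}\right) = \tfrac{1}{2}\sum_{h}\left(d_{qh} - x_{out,h}^{(3)}\right)^2$ (28)
where $d_q$ is the desired output for the q-th input and $x_{out}^{(3)} = y_q$ is the actual output of the MLP.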
48
Backpropagation Learning Algorithm (cont.)
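Using the steepest-descent gradient approach, the learning rule for a network weight in any one of the network layers is given by
$\Delta w_{ji}^{(s)} = -\mu^{(s)}\,\dfrac{\partial E_q}{\partial w_{ji}^{(s)}}$ (29)
where $s = 1, 2, 3$ indexes the layer and $\mu^{(s)}$ is the learning rate of layer $s$.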
49
Backpropagation Learning Algorithm (cont.)
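The weights in the output layer can be updated according to
$\Delta w_{ji}^{(3)} = -\mu^{(3)}\,\dfrac{\partial E_q}{\partial w_{ji}^{(3)}}$ (30)
Using the chain rule for the partial derivatives, (30) can be rewritten as
$\Delta w_{ji}^{(3)} = -\mu^{(3)}\,\dfrac{\partial E_q}{\partial v_j^{(3)}}\,\dfrac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}}$ (31)
where $v_j^{(3)} = \sum_h w_{jh}^{(3)}\,x_{out,h}^{(2)}$ is the net input of output neuron $j$.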
50
Backpropagation Learning Algorithm (cont.)
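$\dfrac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}} = \dfrac{\partial}{\partial w_{ji}^{(3)}}\left(\sum_h w_{jh}^{(3)}\,x_{out,h}^{(2)}\right) = x_{out,i}^{(2)}$ (32)
Substituting (28) for $E_q$:
$\dfrac{\partial E_q}{\partial v_j^{(3)}} = \dfrac{\partial}{\partial v_j^{(3)}}\left\{\tfrac{1}{2}\sum_h\left[d_{qh} - f\!\left(v_h^{(3)}\right)\right]^2\right\} = -\left[d_{qj} - f\!\left(v_j^{(3)}\right)\right] g\!\left(v_j^{(3)}\right)$ (33)
$\dfrac{\partial E_q}{\partial v_j^{(3)}} = -\left(d_{qj} - x_{out,j}^{(3)}\right) g\!\left(v_j^{(3)}\right) \triangleq -\delta_j^{(3)}$ (34)
where $g(\cdot) = f'(\cdot)$ and $\delta_j^{(3)}$ is the local error (delta).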
51
Backpropagation Learning Algorithm (cont.)
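Combining (31), (32), and (34), the learning rule for the weights in the output layer of the network is
$\Delta w_{ji}^{(3)} = \mu^{(3)}\,\delta_j^{(3)}\,x_{out,i}^{(2)}$ (35)
or
$w_{ji}^{(3)}(k+1) = w_{ji}^{(3)}(k) + \mu^{(3)}\,\delta_j^{(3)}\,x_{out,i}^{(2)}$ (36)
In the hidden layer, applying the steepest descent gradient approach gives
$\Delta w_{ji}^{(2)} = -\mu^{(2)}\,\dfrac{\partial E_q}{\partial w_{ji}^{(2)}} = -\mu^{(2)}\,\dfrac{\partial E_q}{\partial v_j^{(2)}}\,\dfrac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}}$ (37)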
52
Backpropagation Learning Algorithm (cont.)
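The second partial derivative on the right-hand side of (37) can be expressed as
$\dfrac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}} = x_{out,i}^{(1)}$ (38)
and the first as
$\dfrac{\partial E_q}{\partial v_j^{(2)}} = \dfrac{\partial E_q}{\partial x_{out,j}^{(2)}}\,\dfrac{\partial x_{out,j}^{(2)}}{\partial v_j^{(2)}} = \left(\sum_h \dfrac{\partial E_q}{\partial v_h^{(3)}}\,w_{hj}^{(3)}\right) g\!\left(v_j^{(2)}\right)$ (39)
$\dfrac{\partial E_q}{\partial v_j^{(2)}} = -\left(\sum_h \delta_h^{(3)}\,w_{hj}^{(3)}\right) g\!\left(v_j^{(2)}\right) \triangleq -\delta_j^{(2)}$ (40)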
53
Backpropagation Learning Algorithm (cont.)
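Combining equations (37), (38), and (40) yields
$\Delta w_{ji}^{(2)} = \mu^{(2)}\,\delta_j^{(2)}\,x_{out,i}^{(1)}$ (41)
or
$w_{ji}^{(2)}(k+1) = w_{ji}^{(2)}(k) + \mu^{(2)}\,\delta_j^{(2)}\,x_{out,i}^{(1)}$ (42)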
54
Backpropagation Learning Algorithm (cont.)
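Generalized weight update form:
$w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \mu^{(s)}\,\delta_j^{(s)}\,x_{out,i}^{(s-1)}$ (43)
where, for the output layer,
$\delta_j^{(s)} = \left(d_{qj} - x_{out,j}^{(s)}\right) g\!\left(v_j^{(s)}\right)$ (44)
and, for the hidden layers,
$\delta_j^{(s)} = \left(\sum_h \delta_h^{(s+1)}\,w_{hj}^{(s+1)}\right) g\!\left(v_j^{(s)}\right)$ (45)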
55
Backpropagation Learning Algorithm (cont.)
Standard BP algorithm
Step 1: Initialize the network synaptic weights to small random values.
Step 2: From the set of training input/output pairs, present an input pattern and calculate the network response.
Step 3: Compare the desired network response with the actual output of the network, and compute all the local errors using (44) and (45).
Step 4: Update the weights of the network according to (43).
Step 5: Repeat Steps 2 through 4 until the network reaches a predetermined level of accuracy in producing the adequate response for all the training patterns.
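The five steps can be sketched for a small fully connected network as follows. This is an illustration under assumptions: tanh is used for every layer (so $g(v) = 1 - f(v)^2$), the bias is handled by appending a constant 1 to each layer input, a fixed number of epochs replaces the accuracy test of Step 5, and the function name, learning rate, and toy target are hypothetical.

```python
import numpy as np

def train_bp(X, D, sizes, mu=0.1, epochs=200, seed=0):
    """Per-pattern backpropagation. sizes = [n_inputs, hidden..., n_outputs]."""
    rng = np.random.default_rng(seed)
    # Step 1: small random weights; the last column of each matrix is the bias
    Ws = [rng.uniform(-0.5, 0.5, size=(n_out, n_in + 1))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    history = []
    for _ in range(epochs):                        # Step 5: repeat Steps 2-4
        sq_err = 0.0
        for x, d in zip(X, D):
            # Step 2: forward pass, keeping each layer's bias-augmented input
            outs = [np.append(x, 1.0)]
            for W in Ws:
                outs.append(np.append(np.tanh(W @ outs[-1]), 1.0))
            y = outs[-1][:-1]
            sq_err += float(np.sum((d - y) ** 2))
            # Step 3: local error of the output layer
            delta = (d - y) * (1.0 - y ** 2)
            for s in range(len(Ws) - 1, -1, -1):
                grad = np.outer(delta, outs[s])
                if s > 0:
                    # local error of the layer below, using the old weights
                    z = outs[s][:-1]
                    delta = (Ws[s][:, :-1].T @ delta) * (1.0 - z ** 2)
                Ws[s] += mu * grad                 # Step 4: weight update
        history.append(sq_err / len(X))
    return Ws, history

# Hypothetical 1-D regression target
X = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)
D = 0.8 * np.sin(np.pi * X)
Ws, history = train_bp(X, D, sizes=[1, 8, 1])
```

`history` holds the mean squared error per epoch, which is a convenient quantity to monitor when deciding whether training has reached the desired accuracy.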
56
Backpropagation Learning Algorithm (cont.)
Some Practical Issues in Using Standard BP
Initialization of synaptic weights
- Initially set to small random values. If the weights are set too large, the neurons are likely to saturate.
- Heuristic algorithm (Ham, Kostanic 2001): set the weights to uniformly distributed random numbers in the interval from -0.5/fan-in to 0.5/fan-in.
57
Backpropagation Learning Algorithm (cont.)
Nguyen and Widrow's initialization algorithm (suitable for architectures with one hidden layer)
Define
- $n_0$: number of components in the input layer
- $n_1$: number of neurons in the hidden layer
- $\gamma$: scaling factor
Step 1: Compute the scaling factor according to
$\gamma = 0.7\,\sqrt[n_0]{n_1}$ (46)
Step 2: Initialize the weights $w_{ij}$ of the layer as random numbers between -0.5 and 0.5.
Step 3: Reinitialize the weights according to
$w_{ij} = \gamma\,\dfrac{w_{ij}}{\sqrt{\sum_{j=1}^{n_0} w_{ij}^2}}$ (47)
Step 4: For the i-th neuron in the hidden layer, set the bias to be a random number between $-w_{ij}$ and $w_{ij}$.
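A minimal sketch of this initialization is shown below. It follows the commonly cited form of the Nguyen-Widrow method, which is an assumption here: each hidden neuron's weight vector is rescaled to length $\gamma$, and the bias is drawn uniformly from $[-\gamma, \gamma]$, which may differ from the bias range stated on the slide. The function name is illustrative.

```python
import numpy as np

def nguyen_widrow_init(n0, n1, seed=0):
    """Initialize an (n1, n0) hidden-layer weight matrix and n1 biases."""
    rng = np.random.default_rng(seed)
    gamma = 0.7 * n1 ** (1.0 / n0)                      # Step 1: scaling factor
    W = rng.uniform(-0.5, 0.5, size=(n1, n0))           # Step 2: random weights
    W = gamma * W / np.linalg.norm(W, axis=1, keepdims=True)  # Step 3: rescale rows
    b = rng.uniform(-gamma, gamma, size=n1)             # Step 4 (assumed bias range)
    return W, b, gamma

W, b, gamma = nguyen_widrow_init(n0=3, n1=10)
```

After Step 3 every row of `W` has Euclidean norm `gamma`, so the hidden neurons start with comparable sensitivity to the input.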
58
Backpropagation Learning Algorithm (cont.)
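Network configuration and ability of the network to generalize
- The configuration of the MLP NN is determined by the number of hidden layers, the number of neurons in each hidden layer, and the activation function used by the neurons.
- Network performance depends less on the form of the activation function than on the number of hidden layers and the number of neurons in each hidden layer.
- The number of hidden-layer neurons is determined by trial and error.
- MLP NNs are generally designed with two hidden layers.
- In general, a larger number of hidden-layer neurons can guarantee good network performance, but an "over-designed" architecture tends to "over-fit" and the network loses its ability to generalize.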
59
Backpropagation Learning Algorithm (cont.)
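Example
This MLP network has one hidden layer with 50 neurons and is trained to approximate a nonlinear function.
(a) Training: the interval [0, 4] is sampled every 0.2, giving 21 points; target mean square error = 0.01. The network converged in only 5 epochs.
(b) Testing: the interval [0, 4] is sampled every 0.01, giving 401 points.
The network overfits the training data. The remedy is to reduce the number of neurons.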
60
Backpropagation Learning Algorithm (cont.)
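Independent validation
- Using the training data to evaluate the final performance quality of the network leads to overfitting.
- Independent validation can be used to avoid this problem.
- Split the available data into a training set and a testing set.
- First randomize the data, then divide it into two parts: the training set is used to update the network weights, and the testing set is used to evaluate the performance of the training.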
61
Backpropagation Learning Algorithm (cont.)
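Speed of convergence
- The convergence properties depend on the magnitude of the learning rate.
- To guarantee that the network converges and to avoid oscillation during training, the learning rate must be set to a fairly small value.
- If training starts far from the global minimum, many neurons saturate, the gradient becomes small, and training may even get stuck in a local minimum of the error surface, all of which slow convergence.
- Fast algorithms fall into two broad classes: (1) various heuristic improvements to the standard BP algorithm; (2) use of standard numerical optimization techniques.
- In addition, preprocessing and reduction of the input data can result in improved performance and faster learning: PCA, FFT, WT, ...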
62
Backpropagation Learning Algorithm (cont.)
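Backpropagation Learning Algorithm with Momentum Updating
The weights are updated in a direction that is a linear combination of the current gradient of the instantaneous error surface and the gradient obtained in the previous training step.
The weights are updated according to
$\Delta w_{ji}^{(s)}(k) = \mu^{(s)}\,\delta_j^{(s)}(k)\,x_{out,i}^{(s-1)}(k) + \alpha\,\Delta w_{ji}^{(s)}(k-1)$ (48)
or
$w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \mu^{(s)}\,\delta_j^{(s)}(k)\,x_{out,i}^{(s-1)}(k) + \alpha\left[w_{ji}^{(s)}(k) - w_{ji}^{(s)}(k-1)\right]$ (49)
The last term is the momentum term, with forgetting factor $\alpha$: the previous update is added to the current one, so learning speeds up when two successive updates point in the same direction and slows down otherwise.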
63
Backpropagation Learning Algorithm (cont.)
This type of learning improves convergence:
- If the training patterns contain some element of uncertainty, updating with momentum provides a sort of low-pass filtering by preventing rapid changes in the direction of the weight updates.
- It renders the training relatively immune to the presence of outliers or erroneous training pairs.
- It increases the rate of weight change, so the speed of convergence is increased.
64
Backpropagation Learning Algorithm (cont.)
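The preceding points can be expressed by the following update equation:
$\Delta w_{ji}^{(s)}(k) = -\mu^{(s)} \sum_{t=0}^{k} \alpha^{\,k-t}\,\dfrac{\partial E_q(t)}{\partial w_{ji}^{(s)}(t)}$ (50)
If the network is operating in a flat region of the error surface, the gradient does not change, so (50) can be approximated as
$\Delta w_{ji}^{(s)}(k) \approx -\mu^{(s)}\,\dfrac{\partial E_q}{\partial w_{ji}^{(s)}}\left(1 + \alpha + \alpha^2 + \cdots\right) = -\dfrac{\mu^{(s)}}{1 - \alpha}\,\dfrac{\partial E_q}{\partial w_{ji}^{(s)}}$ (51)
Since the forgetting factor $\alpha$ is always less than 1, the effective learning rate can be defined as
$\mu_{eff}^{(s)} = \dfrac{\mu^{(s)}}{1 - \alpha}$ (52)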
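The effective learning rate can be checked numerically: with a constant gradient, repeated momentum updates approach a step of size $\mu / (1 - \alpha)$ times the gradient. The function name and parameter values below are illustrative assumptions.

```python
def momentum_step_size(grad, mu=0.1, alpha=0.9, n_steps=200):
    """Iterate dw(k) = -mu * grad + alpha * dw(k-1) with a constant gradient."""
    dw = 0.0
    for _ in range(n_steps):
        dw = -mu * grad + alpha * dw
    return dw

mu, alpha = 0.1, 0.9
dw = momentum_step_size(1.0, mu=mu, alpha=alpha)
mu_eff = mu / (1.0 - alpha)   # effective learning rate in a flat region
```

With these values the step settles at ten times the plain gradient step, which is why momentum speeds up progress across flat regions of the error surface.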
65
Backpropagation Learning Algorithm (cont.)
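Batch Updating
Standard BP assumes that the weights are updated for each input/output training pair. Batch updating accumulates the corrections over many training patterns before updating the weights (it can be viewed as averaging the corrections from the individual I/O pairs and then applying the averaged correction).
Batch updating has the following advantages:
- Gives a much better estimate of the error surface.
- Provides some inherent low-pass filtering of the training patterns.
- Is suitable for more sophisticated optimization procedures.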
66
Backpropagation Learning Algorithm (cont.)
Search-Then-Converge Method
A heuristic strategy for speeding up BP.
- Search phase: the network is relatively far from the global minimum. The learning rate is kept sufficiently large and relatively constant.
- Converge phase: the network is approaching the global minimum. The learning rate is decreased at each iteration.
67
Backpropagation Learning Algorithm (cont.)
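Because in practice it is impossible to know how far the network is from the global minimum, the learning rate can be scheduled with one of the following two expressions:
$\mu(k) = \mu_0\,\dfrac{1}{1 + k/k_0}$ (53)
$\mu(k) = \mu_0\,\dfrac{1 + \dfrac{c}{\mu_0}\,\dfrac{k}{k_0}}{1 + \dfrac{c}{\mu_0}\,\dfrac{k}{k_0} + k_0\left(\dfrac{k}{k_0}\right)^2}$ (54)
Typically $1 < c/\mu_0 < 100$ and $100 < k_0 < 500$.
When $k \ll k_0$, the learning rate is approximately $\mu_0$ (search phase).
When $k \gg k_0$, the learning rate decreases at a rate between $1/k$ and $1/k^2$ (converge phase).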
68
Backpropagation Learning Algorithm (cont.)
Batch Updating with Variable Learning Rate
A simple heuristic strategy to increase the convergence speed of BP with batch updates:
- Increase the learning rate if the previous learning step decreased the total error function.
- If the error function has increased, the learning rate needs to be decreased.
69
Backpropagation Learning Algorithm (cont.)
The algorithm can be summarized as:
- If the error function over the entire training set has decreased, increase the learning rate by multiplying it by a number $\eta > 1$ (typically $\eta = 1.05$).
- If the error function has increased by more than some set percentage $\xi$, decrease the learning rate by multiplying it by a number $\chi < 1$ (typically $\chi = 0.7$).
- If the error function has increased by less than the percentage $\xi$, the learning rate remains unchanged.
Applying the variable learning rate to batch updating can significantly speed up convergence. However, the algorithm can easily become trapped in a local minimum; a minimum learning rate $\mu_{min}$ can be set.
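The three rules above reduce to a small helper that is called once per batch update. The function name, the default threshold `xi=0.04` (4 percent), and the floor `mu_min` are assumptions for illustration; only the multipliers 1.05 and 0.7 come from the slide.

```python
def adapt_learning_rate(mu, err_new, err_old,
                        inc=1.05, dec=0.7, xi=0.04, mu_min=1e-6):
    """Return the learning rate to use after comparing two batch errors."""
    if err_new < err_old:
        return mu * inc                   # error decreased: speed up
    if err_new > err_old * (1.0 + xi):
        return max(mu * dec, mu_min)      # error grew by more than xi: slow down
    return mu                             # small increase: leave unchanged

faster = adapt_learning_rate(0.1, err_new=0.9, err_old=1.0)
slower = adapt_learning_rate(0.1, err_new=1.1, err_old=1.0)
same = adapt_learning_rate(0.1, err_new=1.02, err_old=1.0)
```

In a training loop, implementations of this heuristic often also discard the weight update that caused the large error increase; that detail is not stated on the slide and is left out here.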
70
Homework Write a program using a MLP to classify the digits (0-9).
The number of neurons in the output layer should be equal to the number of digits. Each of the digits is represented as a 9x4 matrix of binary numbers. The input training patterns can be generated from each digit as a vector resulting from applying the vector operator to each matrix representing the digit. After the network is trained, introduce random noise into the digit representations, and test the performance of the neural network.
71
Multilayer Perceptrons
(Rumelhart et al., 1986)
72
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
The multilayer perceptron that solves the XOR problem. The hidden units and the output have the threshold activation function with threshold at 0.
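The decomposition of XOR into two AND-type hidden units and an OR-type output can be verified directly. The weights below are one possible choice that realizes it with threshold units (an assumption; the values in the original figure are not reproduced here), and the function names are illustrative.

```python
def step(a):
    """Threshold activation with threshold at 0: 1 if a > 0, else 0."""
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)      # hidden unit 1: x1 AND NOT x2
    z2 = step(-x1 + x2 - 0.5)     # hidden unit 2: NOT x1 AND x2
    return step(z1 + z2 - 0.5)    # output unit: z1 OR z2

xor_table = [(x1, x2, xor_mlp(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
```

The hidden layer maps the four input points into a space where the two classes become linearly separable, which a single perceptron cannot do on the raw inputs.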
73
Backpropagation
74
Radial Basis Function Neural Networks
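- Mimics the locally tuned behavior of axons in the cerebral cortex and has good mapping capability.
- Trained to perform a nonlinear mapping between the input and output vector spaces.
- The RBF NN is constructed from a function approximation point of view.
- The RBF NN consists of three layers of neurons: the input layer, the hidden layer (also called the nonlinear processing layer), and the output layer.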
75
Radial Basis Function Neural Networks (cont.)
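The output of the RBF NN is defined as
$y_i = \sum_{k=1}^{N} w_{ik}\,\phi_k(x, c_k) = \sum_{k=1}^{N} w_{ik}\,\phi_k\!\left(\lVert x - c_k \rVert_2\right)$ (55)
where $N$ is the number of neurons in the hidden layer, $x$ is the input vector, $c_k$ are the RBF centers in the input vector space, and $w_{ik}$ are the weights of the output layer.
$\phi_k(\cdot)$ can take many forms; one of them is marked on the slide as the most commonly used.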
76
Radial Basis Function Neural Networks (cont.)
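In the activation functions above, $\sigma$ controls the width of the RBF and is called the spread parameter. The centers $c_k$ are defined points that are assumed to perform an adequate sampling of the input vector space.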
77
Radial Basis Function Neural Networks (cont.)
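Each neuron in the hidden layer computes the Euclidean distance between the input $x_i$ and its center $c_i$, and passes this distance through $\phi_k$ to produce the hidden-layer output. These outputs are then multiplied by the corresponding weights and summed to give the output $y_i$. Because the neurons in the output layer do not interact with each other, $y_1$ to $y_m$ can be viewed as a multi-output network made up of several single-output networks that share one hidden layer.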
78
Radial Basis Function Neural Networks (cont.)
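Training the RBF NN with Fixed Centers
Two sets of parameters govern the mapping properties of the RBF NN:
- The weights between the hidden layer and the output layer, $w_{ik}$
- The centers of the radial basis functions, $c_k$
Broomhead suggested choosing the centers at random from the input samples. Enough centers should be chosen according to the PDF of the training samples, but it is hard to quantify how many centers are adequate. One can therefore start with a fairly large number of centers and then use a systematic method to remove those centers whose removal does not degrade the mapping ability of the network.
Once the centers are chosen, the network output can be written as
$y(q) = \sum_{k=1}^{N} w_k\,\phi\!\left(x(q), c_k\right), \quad q = 1, 2, \dots, Q$ (56)
where $Q$ is the total number of input samples and $N$ is the number of chosen centers.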
79
Radial Basis Function Neural Networks (cont.)
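Rewriting (56) in vector form gives
$\begin{bmatrix} y(1) \\ \vdots \\ y(Q) \end{bmatrix} = \begin{bmatrix} \phi(x(1), c_1) & \cdots & \phi(x(1), c_N) \\ \vdots & \ddots & \vdots \\ \phi(x(Q), c_1) & \cdots & \phi(x(Q), c_N) \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix}$ (57)
or
$y = \Phi\,w$ (58)
where $y$ is the actual output of the network, the entry of $\Phi$ in row $q$ and column $n$ depends on the Euclidean distance between the q-th input and the n-th center, and $w$ holds the weights of the output layer.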
80
Radial Basis Function Neural Networks (cont.)
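Because the centers $c_k$ are fixed, the mapping performed by the hidden layer is also fixed, so training the network only involves adjusting the output weights. The MSE between the actual and desired outputs can therefore be written as
$J(w) = \tfrac{1}{2}\sum_{q=1}^{Q}\left[y_d(q) - y(q)\right]^2 = \tfrac{1}{2}\left(y_d - y\right)^T\left(y_d - y\right)$ (59)
Substituting (58) into (59):
$J(w) = \tfrac{1}{2}\left(y_d - \Phi w\right)^T\left(y_d - \Phi w\right)$ (60)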
81
Radial Basis Function Neural Networks (cont.)
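Setting the gradient with respect to the weights to zero:
$\dfrac{\partial J(w)}{\partial w} = -\Phi^T y_d + \Phi^T \Phi\,w = 0$ (61)
$\Phi^T \Phi\,w = \Phi^T y_d$ (62)
$w = \left(\Phi^T \Phi\right)^{-1} \Phi^T y_d = \Phi^{+} y_d$ (63)
where $\Phi^{+}$ is the pseudo-inverse of $\Phi$.
If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired outputs of the network will be very small.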
82
Radial Basis Function Neural Networks (cont.)
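Equation (63) shows that, with the network centers fixed, training yields a closed-form solution, which means the RBF NN trains faster than BP. If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired outputs of the network will be very small; in fact, if (63) is used, the error in (60) is zero.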
83
Radial Basis Function Neural Networks (cont.)
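Example
Train an RBF NN to approximate a nonlinear function on the interval [0, 4]; hidden layer: 21 neurons.
Fig. (a), training: the sampling interval is 0.2, so the hidden layer has 21 neurons (the centers are these 21 sample values; Gaussian RBF, $\sigma = 0.2$).
Fig. (b), testing: the sampling interval is 0.01, giving 401 sample points.
There is slight over-fitting, but the result is much better than with BP.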
84
Radial Basis Function Neural Networks (cont.)
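Setting the spread parameter
The spread parameter $\sigma$ is usually obtained from the following heuristic:
$\sigma = \dfrac{d_{max}}{\sqrt{K}}$ (64)
where $d_{max}$ is the maximum Euclidean distance between the selected centers and $K$ is the number of centers.
Using (64), and referring to the general Gaussian form $\phi(x) = \exp(-x^2/\sigma^2)$, the radial basis function of a neuron in the hidden layer can be written as
$\phi(x, c_k) = \exp\!\left(-\dfrac{K}{d_{max}^2}\,\lVert x - c_k \rVert^2\right)$ (65)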
85
Radial Basis Function Neural Networks (cont.)
RBF NN training algorithm with fixed centers
Step 1: Choose the centers for the RBF functions.
Step 2: Calculate the spread parameter $\sigma$ for the RBF functions according to (64).
Step 3: Initialize the weights in the output layer of the network.
Step 4: Calculate the output of the neural network according to (56).
Step 5: Solve for the network weights, using (63).
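A minimal NumPy sketch of these steps follows. Several things are assumptions made for illustration: the Gaussian is taken as $\exp(-\lVert x - c \rVert^2 / \sigma^2)$, the spread is set to $d_{max}/\sqrt{K}$, the weights are obtained in one shot with the pseudo-inverse (so no explicit weight initialization is needed), and the target function, the seven evenly spaced centers, and the function names are hypothetical.

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Matrix of Gaussian responses: row q, column n is phi(||x(q) - c_n||)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def train_rbf_fixed_centers(X, y_d, centers):
    """Return output weights and spread for an RBF network with fixed centers."""
    K = len(centers)
    diffs = centers[:, None, :] - centers[None, :, :]
    d_max = np.sqrt((diffs ** 2).sum(axis=2)).max()   # largest center distance
    sigma = d_max / np.sqrt(K)                        # spread heuristic
    Phi = rbf_design_matrix(X, centers, sigma)        # hidden-layer outputs
    w = np.linalg.pinv(Phi) @ y_d                     # pseudo-inverse solution
    return w, sigma

# Hypothetical target on [0, 4], sampled every 0.2 (21 points)
X = np.linspace(0.0, 4.0, 21).reshape(-1, 1)
y_d = np.sin(X[:, 0])
centers = np.linspace(0.0, 4.0, 7).reshape(-1, 1)
w, sigma = train_rbf_fixed_centers(X, y_d, centers)
y = rbf_design_matrix(X, centers, sigma) @ w
```

Using fewer centers than training points keeps the design matrix reasonably well conditioned; with one center per training point the fit becomes exact but the linear system is much more sensitive numerically.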
86
Radial Basis Function Neural Networks (cont.)
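Training the RBF NN using the Stochastic Gradient Approach
In an RBF NN with fixed centers, only the output-layer weights can be adjusted, so training is simple. However, it is not easy to find a suitable number of centers among the input vectors. A large number of centers must therefore be selected from the input data, so the network architecture remains quite complex even for fairly simple problems.
With the stochastic gradient approach, all three sets of network parameters can be adjusted: the weights, the center positions, and the RBF widths.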
87
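Define the instantaneous error cost function
$J(n) = \tfrac{1}{2}\,e^2(n) = \tfrac{1}{2}\left[y_d(n) - \sum_{k=1}^{N} w_k(n)\,\phi\!\left(x(n), c_k(n)\right)\right]^2$ (66)
If a Gaussian RBF is chosen, (66) becomes
$J(n) = \tfrac{1}{2}\left[y_d(n) - \sum_{k=1}^{N} w_k(n)\,\exp\!\left(-\dfrac{\lVert x(n) - c_k(n) \rVert^2}{\sigma_k^2(n)}\right)\right]^2$ (67)
Using the steepest descent method, $J(n)$ is differentiated with respect to $w$, $c$, and $\sigma$ in turn. The update equations for the network parameters are:
$w(n+1) = w(n) - \mu_w\,\dfrac{\partial J(n)}{\partial w}$ (68)
$c_k(n+1) = c_k(n) - \mu_c\,\dfrac{\partial J(n)}{\partial c_k}$ (69)
$\sigma_k(n+1) = \sigma_k(n) - \mu_\sigma\,\dfrac{\partial J(n)}{\partial \sigma_k}$ (70)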
88
Radial Basis Function Neural Networks (cont.)
RBF NN training algorithm with the stochastic gradient-based method
Step 1: Choose the centers for the RBF functions randomly from the input vectors.
Step 2: Calculate the initial value of the spread parameter for the RBF functions according to (64).
Step 3: Initialize the weights in the output layer of the network to some small random values.
Step 4: Present an input vector, and compute the network output according to
89
Radial Basis Function Neural Networks (cont.)
Step 5: Update the network parameters, where
Step 6: Stop if the network has converged; else, go back to Step 4.
90
Example
- Chuan-Yu Chang, Yuh-Shyan Tsai, I-Lien Wu, "Integrating Validation Incremental Neural Network And Radial-Basis Function Neural Network For Segmenting Prostate In Ultrasound Images," International Journal of Innovative Computing, Information and Control, Vol. 7, Number 6, June 2011.
- Chuan-Yu Chang, Hung-Jen Wang, Shif-Yu Fu, "Texture Image Classification using Modular RBF Neural Networks," Journal of Electronic Imaging, Vol. 19(1), .
- Chuan-Yu Chang, Yue-Fong Lei, Chin-Hsiao Tseng and Shyang-Rong Shih, "Thyroid Segmentation and Volume Estimation in Ultrasound Images," IEEE Transactions on Biomedical Engineering, 57(6), , 2010.
- Chuan-Yu Chang and Hung-Rung, "Application of Principal Component Analysis to a Radial Basis Function Committee Machine for Face Recognition," International Journal of Innovative Computing, Information and Control, 5(11), , 2009.
91
Regression
[Figure: MLP for regression, showing the forward pass from the input $x$ and the backward pass of the error]
92
Regression with Multiple Outputs
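[Figure: MLP with inputs $x_j$, first-layer weights $w_{hj}$, hidden units $z_h$, second-layer weights $v_{ih}$, and outputs $y_i$]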
94
Sample training data shown as '+', where $x^t \sim U(-0.5, 0.5)$ and $y^t = f(x^t) + N(0, 0.1)$. $f(x) = \sin(6x)$ is shown by a dashed line. The evolution of the fit of an MLP with two hidden units after 100, 200, and 300 epochs is drawn.
95
The mean square error on training and validation sets as a function
of training epochs.
96
[Figure panels: $w_h x + w_0$, $z_h$, and $v_h z_h$]
(a) The hyperplanes of the hidden unit weights on the first layer, (b) hidden unit outputs, and (c) hidden unit outputs multiplied by the weights on the second layer. Two sigmoid hidden units slightly displaced, one multiplied by a negative weight, implement a bump when added. With more hidden units, a better approximation is attained.
97
Two-Class Discrimination
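When there are two classes, one output unit suffices:
$y^t = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$
which approximates $P(C_1 \mid x^t)$, with $P(C_2 \mid x^t) \equiv 1 - y^t$.
The error function in this case is
$E(W, v \mid X) = -\sum_t \left[r^t \log y^t + (1 - r^t)\log(1 - y^t)\right]$
The update equations implementing gradient descent are
$\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t (1 - z_h^t)\,x_j^t$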
98
Multiclass Discrimination
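In a (K > 2)-class classification problem, there are K outputs
$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$
We use softmax to indicate the dependency between classes:
$y_i^t = \dfrac{\exp o_i^t}{\sum_k \exp o_k^t}$, which approximates $P(C_i \mid x^t)$
The error function is
$E(W, V \mid X) = -\sum_t \sum_i r_i^t \log y_i^t$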
99
Multiclass Discrimination
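We get the update equations using gradient descent:
$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t$
$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t (1 - z_h^t)\,x_j^t$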
100
Multiple Hidden Layers
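An MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks. For regression, if we have a multilayer perceptron with two hidden layers, we write
$z_{1h} = \mathrm{sigmoid}(w_{1h}^T x) = \mathrm{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \dots, H_1$
$z_{2l} = \mathrm{sigmoid}(w_{2l}^T z_1) = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \dots, H_2$
$y = v^T z_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$
where $w_{1h}$ and $w_{2l}$ are the first- and second-layer weights, $z_{1h}$ and $z_{2l}$ are the units on the first and second hidden layers, and $v$ are the third-layer weights.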
101
Training Procedures Improving Convergence
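Gradient descent converges slowly.
Momentum: take a running average by incorporating the previous update in the current change, as if there were a momentum due to previous updates:
$\Delta w_i^t = -\eta\,\dfrac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$
Adaptive learning rate: the learning rate is kept large when learning takes place and is decreased when learning slows down:
$\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$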
102
Overfitting/Overtraining
Number of weights: H (d+1)+(H+1)K As complexity increases, training error is fixed but the validation error starts to increase and the network starts to overfit.
103
As training continues, the validation error starts to increase and
the network starts to overfit.
104
Structured MLP Convolutional networks (Deep learning)
(Le Cun et al, 1989)
105
Weight Sharing
106
Hints Invariance to translation, rotation, size Virtual examples
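Augmented error: $E' = E + \lambda_h E_h$
If $x'$ and $x$ are the "same": $E_h = \left[g(x \mid \theta) - g(x' \mid \theta)\right]^2$
Approximation hint:
$E_h = \begin{cases} 0 & \text{if } g(x \mid \theta) \in [a_x, b_x] \\ \left(g(x \mid \theta) - a_x\right)^2 & \text{if } g(x \mid \theta) < a_x \\ \left(g(x \mid \theta) - b_x\right)^2 & \text{if } g(x \mid \theta) > b_x \end{cases}$
(Abu-Mostafa, 1995)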
107
Tuning the Network Size
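Destructive: weight decay
$\Delta w_i = -\eta\,\dfrac{\partial E}{\partial w_i} - \lambda w_i, \qquad E' = E + \dfrac{\lambda}{2}\sum_i w_i^2$
Constructive: growing networks (Ash, 1989) (Fahlman and Lebiere, 1989)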
108
Bayesian Learning Consider weights wi as random vars, prior p(wi)
Weight decay, ridge regression, regularization cost=data-misfit + λ complexity More about Bayesian methods in chapter 14
109
Dimensionality Reduction
Autoencoder networks
111
Learning Time
Applications:
- Sequence recognition: speech recognition
- Sequence reproduction: time-series prediction
- Sequence association
Network architectures:
- Time-delay networks (Waibel et al., 1989)
- Recurrent networks (Rumelhart et al., 1986)
112
Time-Delay Neural Networks
In a time delay neural network, previous inputs are delayed in time so as to synchronize with the final input, and all are fed together as input to the system.
113
Recurrent Networks In a recurrent network, units have self-connections or connections to units in the previous layers. This recurrency acts as a short-term memory and lets the network remember what happened in the past.
114
Unfolding in Time
115
Deep Networks Layers of feature extraction units
Can have local receptive fields as in convolution networks, or can be fully connected Can be trained layer by layer using an autoencoder in an unsupervised manner No need to craft the right features or the right basis functions or the right dimensionality reduction method; learns multiple layers of abstraction all by itself given a lot of data and a lot of computation Applications in vision, language processing, ...