Pseudoinverse Learning Algorithm for Feedforward Neural Networks
Guo, Ping
Department of Computer Science & Engineering, The Chinese University of Hong Kong, Hong Kong
Supervisor: Professor Michael Lyu
Markers: Professor L.W. Chan and I. King
June 11, 2015
2 Introduction
- Feedforward neural network
  - Widely used for pattern classification and universal approximation
  - Supervised learning task
  - Back-propagation (BP) algorithm used to train the neural network
  - Poor convergence rate and local minima problem
  - Learning-factor problem (learning rate, momentum constant)
  - Time-consuming computation for some tasks with BP
- Pseudoinverse learning algorithm
  - Batch-way learning
  - Matrix inner product and pseudoinverse operations
3 Network Structure (a)
- Multilayer neural network (mathematical expression)
- Input matrix and output matrix
- Connection weight matrix
- Nonlinear activation function
- Network mapping function (with two hidden layers)
4 Network Structure (b)
- Multilayer neural network (mathematical expression)
- Denote the output of the l-th layer
- Network output
- Goal: find the weight matrices from the training data set
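The layer-by-layer mapping described on these two slides can be sketched as follows. This is a minimal illustration, not the presenter's code: the names `X`, `weights` and `forward`, the choice of `tanh` as the nonlinear activation, and the linear output layer are all assumptions made for the example.

```python
import numpy as np

def forward(X, weights, activation=np.tanh):
    """Propagate the input matrix X (one sample per row) through the network.

    Assumption: every layer except the last applies the activation;
    the last weight matrix is applied linearly to give the network output.
    """
    Y = X                                # output of layer 0 is the input itself
    for W in weights[:-1]:
        Y = activation(Y @ W)            # output of the next layer
    return Y @ weights[-1]               # network output

# Two hidden layers: three weight matrices (shapes are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))              # 5 samples, 3 inputs
weights = [rng.normal(size=(3, 4)),
           rng.normal(size=(4, 4)),
           rng.normal(size=(4, 2))]
O = forward(X, weights)
print(O.shape)
```

With this layout the training problem is to choose the entries of `weights` so that `forward(X, weights)` matches the target output matrix.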
5 Pseudoinverse Solution (a)
- Existence of the solution (linear algebra theorem)
- Best approximation solution (theorem)
  - The best approximate solution of the matrix equation is the pseudoinverse solution
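The best-approximation property can be checked numerically. The sketch below assumes a generic linear system `A X = B` (the symbols `A`, `B` and `X` are chosen for the example, not taken from the slide) and compares the pseudoinverse solution with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))              # more equations than unknowns
B = rng.normal(size=(8, 2))

# Pseudoinverse solution of A X = B.
X_pinv = np.linalg.pinv(A) @ B

# Reference least-squares solution.
X_lstsq = np.linalg.lstsq(A, B, rcond=None)[0]

# Residual of the pseudoinverse solution versus a slightly perturbed candidate.
residual = np.linalg.norm(A @ X_pinv - B)
perturbed = np.linalg.norm(A @ (X_pinv + 0.01) - B)
print(np.allclose(X_pinv, X_lstsq), residual <= perturbed)
```

When the system has no exact solution, the pseudoinverse solution is the one with the smallest residual norm, which is why any perturbation of it fits worse.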
6 Pseudoinverse Solution (b)
- Minimize the error function
- Learning task
  - If Y is full rank, the above equation holds.
  - The learning task becomes raising the rank of Y.
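One way to see why the rank of Y matters is the short derivation below. It is a sketch under assumed notation: Y is the output matrix of the last hidden layer (one sample per row), W the output weight matrix, O the target output matrix, and "full rank" is read as full row rank.

```latex
\begin{align}
E &= \lVert Y W - O \rVert^{2}, \\
W = Y^{+} O \;\Longrightarrow\; Y W - O &= \left( Y Y^{+} - I \right) O .
\end{align}
% If the rows of Y are linearly independent (full row rank), then Y Y^{+} = I,
% so the residual vanishes and E = 0 for any target matrix O.
```

Under this reading, each additional layer is useful only insofar as it brings Y Y^{+} closer to the identity.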
7 Pseudoinverse Learning Algorithm
1. Let
2. Compute
3. If yes, go to step 6; if no, go to the next step.
4. Let; feed this as input to the next layer, compute
5. Compute, and go to step 3.
6. Let
7. Stop training. Real network output is
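The seven steps can be sketched in code, with the caveat that the formulas are not legible here, so the details are assumptions: the stopping test is taken to be "is Y Y^+ close enough to the identity", each hidden weight matrix is taken to be the pseudoinverse of the current layer output, and the output weights are taken to be the pseudoinverse solution. The names `pil_train`, `pil_predict`, `tol`, `rcond` and `max_layers`, and the use of `tanh`, are hypothetical.

```python
import numpy as np

def pil_train(X, O, max_layers=10, tol=1e-8, rcond=1e-10, activation=np.tanh):
    """Sketch of pseudoinverse learning; returns the list of weight matrices."""
    Y = X                                        # step 1: first layer input
    Y_pinv = np.linalg.pinv(Y, rcond)            # step 2 (rcond: numerical safeguard, assumed)
    weights = []
    for _ in range(max_layers):
        # Step 3 (assumed test): is Y @ Y_pinv already the identity?
        err = np.linalg.norm(Y @ Y_pinv - np.eye(Y.shape[0])) ** 2
        if err < tol:
            break
        W = Y_pinv                               # step 4 (assumed): hidden weights
        weights.append(W)
        Y = activation(Y @ W)                    # input to the next layer
        Y_pinv = np.linalg.pinv(Y, rcond)        # step 5
    weights.append(Y_pinv @ O)                   # step 6: output weights
    return weights                               # step 7: stop training

def pil_predict(X, weights, activation=np.tanh):
    """Real network output for input X using the trained weights."""
    Y = X
    for W in weights[:-1]:
        Y = activation(Y @ W)
    return Y @ weights[-1]

# Usage: fit sin(x) on 8 points (a bias column is added as an assumption).
x = np.linspace(0.0, 2.0 * np.pi, 8)
X = np.column_stack([x, np.ones_like(x)])
O = np.sin(x).reshape(-1, 1)
weights = pil_train(X, O)
pred = pil_predict(X, weights)
print(len(weights) - 1, np.max(np.abs(pred - O)))
```

Note that there is no learning rate or momentum constant anywhere in the loop; the only tunable quantities in this sketch are the tolerances, which are safeguards added for the example.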
8 Add and Delete Sample (a)
- Efficient computation
- Greville's theorem
- Add a sample: calculate the k-th pseudoinverse matrix from the (k-1)-th
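Greville's theorem updates a known pseudoinverse when one column is appended, without recomputing it from scratch. The sketch below implements the standard column form and, as an assumption about the data layout (one sample per row), applies it to a new sample through transposes. The function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def greville_append_column(A, A_pinv, a, tol=1e-10):
    """Return pinv([A | a]) given A_pinv = pinv(A) (Greville's update)."""
    a = a.reshape(-1, 1)
    d = A_pinv @ a                       # coordinates of a in terms of A
    c = a - A @ d                        # part of a outside the range of A
    if np.linalg.norm(c) > tol:
        b = c.T / (c.T @ c)              # a adds a new direction
    else:
        b = (d.T @ A_pinv) / (1.0 + d.T @ d)   # a already lies in range(A)
    return np.vstack([A_pinv - d @ b, b])

def add_sample(X, X_pinv, x_new):
    """Append sample x_new as a new row of X, using pinv(X.T) == pinv(X).T."""
    return greville_append_column(X.T, X_pinv.T, x_new).T

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))              # 5 samples, 3 features
x_new = rng.normal(size=3)
updated = add_sample(X, np.linalg.pinv(X), x_new)
direct = np.linalg.pinv(np.vstack([X, x_new]))
print(np.allclose(updated, direct))
```

The update costs a few matrix-vector products per added sample, which is where the efficiency claimed on the slide comes from.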
9 Add and Delete Sample (b)
- Efficient computation
- Delete a sample: calculate the k-th pseudoinverse matrix from the (k+1)-th
- Bordering algorithm
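Deleting a sample can also be done without a full recomputation. The sketch below is not the bordering algorithm named on the slide; it is a swapped-in alternative that simply inverts Greville's column update, written under the assumption that the removed column (or, via transposes, the removed sample row) is the last one. Function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def greville_remove_last_column(A_full_pinv, a, tol=1e-8):
    """Given pinv([A | a]) and the column a, return pinv(A)."""
    a = a.reshape(-1, 1)
    G = A_full_pinv[:-1, :]              # rows belonging to the kept columns
    b = A_full_pinv[-1:, :]              # row belonging to the removed column
    s = (b @ a).item()                   # equals 1 when a is outside range(A)
    if abs(1.0 - s) < tol:
        # a contributed a new direction: project b out of the rows of G.
        return G - (G @ b.T) @ b / (b @ b.T).item()
    # a was a combination of the other columns: undo the rank-one correction.
    d = (G @ a) / (1.0 - s)
    return G + d @ b

def remove_last_sample(X_full_pinv, x_last):
    """Remove the last row (sample) of the data matrix, via transposes."""
    return greville_remove_last_column(X_full_pinv.T, x_last).T

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))              # samples to keep
x_last = rng.normal(size=3)              # sample to delete
X_full = np.vstack([X, x_last])
downdated = remove_last_sample(np.linalg.pinv(X_full), x_last)
print(np.allclose(downdated, np.linalg.pinv(X)))
```

To delete a sample that is not the last one, the rows of the data matrix (and the matching columns of its pseudoinverse) can first be permuted so that it becomes the last.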
10 Numerical Examples (a) Function Mapping
(1) sin(x) (smooth function)
(2) Nonlinear function: 8-D input, 3-D output
(3) Smooth function
(4) Piecewise smooth function
11 Numerical Examples (b) Function Mapping
Table 1. Generalization ability test results (20 training samples, 100 test samples).
Columns: Input range; Generalized E; RMSE; Max deviation
- Example 1: input range 0 to 2π
- Example 2
- Example 3: input range 0 to π
Table 2. Generalization ability test results (5 or 50 training samples, 100 test samples).
Columns: Input range; Test no. N; Generalized E; RMSE; Max deviation
- Example 1: input range 0 to 2π
- Example 4: input range 0 to 2π
12 Numerical Examples (c) Function Mapping
[Figure: four plots of Output versus Input; "*" marks training data, "o" marks test data. Panels: Example 1; Example 3; Example 4, 5 training samples; Example 4, 20 training samples.]
13 Numerical Examples (d) Real-world data set
- Software reliability growth model: Sys1 data
- 54 samples in total, partitioned into 37 training samples and 17 test samples
- "*": training data; "o": test data
14 Numerical Examples (e) Real-world data set
- Software reliability growth model: Sys1 data
- "o": level-0 output; "+": level-1 output
- Stacked generalization test: the level-0 output is the level-1 input
- Generalization is poor
15 Discussion
- Local minima can be avoided by certain initialization.
- No user-selected parameters, so the "learning factor" problem is avoided.
- A differentiable activation function is not necessary.
- Batch-way learning, so training is fast.
- Provides an effective method for investigating some computation-intensive techniques.
- Further work: find techniques that improve generalization when noisy data are present.
16 Thanks
End of Presentation
Q & A