Statistical Mechanics of Online Learning for Ensemble Teachers
Seiji Miyoshi (Kobe City College of Technology)
Masato Okada (University of Tokyo, RIKEN BSI)
SUMMARY
We analyze the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K teachers (ensemble teachers), and the student. Calculating the generalization error of the student analytically using statistical mechanics in the framework of on-line learning, we prove that when the learning rate satisfies η < 1, the larger the number and the variety of the ensemble teachers are, the smaller the generalization error is; when η > 1, the properties are completely reversed. If the variety of the K teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η → 0 and K → ∞.
BACKGROUND (1/2)
Batch learning
– given examples are used more than once
– the student comes to give correct answers for all examples used in training
– requires a long time and large memory
On-line learning
– examples once used are discarded
– the student cannot give correct answers for all examples used in training
– large memory is not necessary
– it is possible to follow a time-variant teacher
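To make the two protocols concrete, here is a minimal sketch in Python (illustrative, not from the slides; the teacher, learning rate, and sample sizes are assumptions): batch learning sweeps a stored set of examples repeatedly, while on-line learning uses each fresh example once and discards it.

    import numpy as np

    # Minimal sketch contrasting batch and on-line learning
    # for a noiseless linear perceptron (illustrative setup).
    rng = np.random.default_rng(0)
    N, eta = 100, 0.3
    B = rng.standard_normal(N)                      # teacher weights
    X = rng.standard_normal((500, N)) / np.sqrt(N)  # stored examples
    y = X @ B

    # Batch learning: the same 500 stored examples, many passes.
    J_batch = np.zeros(N)
    for _ in range(100):
        for x, t in zip(X, y):
            J_batch += eta * (t - J_batch @ x) * x

    # On-line learning: each example drives one update, then is discarded.
    J_online = np.zeros(N)
    for _ in range(50_000):
        x = rng.standard_normal(N) / np.sqrt(N)
        J_online += eta * (B @ x - J_online @ x) * x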
BACKGROUND (2/2) / PURPOSE
In most cases in actual human society, a student can observe examples from two or more teachers who differ from each other.
– To analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher
– To discuss the relationship between the number and the variety of the ensemble teachers and the generalization error
MODEL (1/4)
[Figure: the true teacher A, the ensemble teachers B_1, B_2, …, B_K around it, and the student J]
A, B_1, B_2, …, and J are linear perceptrons with noise.
The student J learns B_1, B_2, … in turn; J cannot learn A directly.
MODEL (2/4)
Output of true teacher (linear perceptron with Gaussian noise):
  y_A = A · x + n_A,  n_A ∼ N(0, σ_A²)
Outputs of ensemble teachers (linear perceptrons with Gaussian noises):
  y_{B_k} = B_k · x + n_{B_k},  n_{B_k} ∼ N(0, σ_B²),  k = 1, …, K
Output of student (linear perceptron with Gaussian noise):
  y_J = J · x + n_J,  n_J ∼ N(0, σ_J²)
MODEL (3/4)
Inputs: x = (x_1, …, x_N), components with mean 0 and variance 1/N
Initial value of student: J⁰, components with mean 0 and variance 1
True teacher: A, components with mean 0 and variance 1
Ensemble teachers: B_1, …, B_K, components with mean 0 and variance 1
N → ∞ (thermodynamic limit)
Order parameters
– length of student: l, with |J| = l √N
– direction cosines: R_J between A and J, R_{B_k} between A and B_k, q_{kk'} between B_k and B_{k'}, R_{B_k J} between B_k and J
MODEL (4/4)
The student learns the K ensemble teachers in turn. When teacher B_k is presented at step m, the student is updated by the gradient method on the squared error:
  J^{m+1} = J^m + f_k^m x^m,  f_k^m = η ( y_{B_k}^m − y_J^m )
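The whole model can be simulated directly; the sketch below is illustrative and not from the slides. In particular, constructing each B_k as R_B A plus an independent random part is an assumption: it yields direction cosine ≈ R_B to the true teacher and ≈ q = R_B² = 0.49 between teachers, matching the parameter values used on the later slides.

    import numpy as np

    # Sketch of the full model: true teacher A, K noisy ensemble
    # teachers B_k around A, and a student J that learns the B_k
    # in turn (J never sees A directly).
    rng = np.random.default_rng(1)
    N, K, eta = 1000, 3, 0.3
    sigma_B, sigma_J = np.sqrt(0.1), np.sqrt(0.2)
    R_B = 0.7

    A = rng.standard_normal(N)
    B = [R_B * A + np.sqrt(1 - R_B**2) * rng.standard_normal(N)
         for _ in range(K)]                      # hypothetical construction
    J = rng.standard_normal(N)

    for m in range(20 * N):
        k = m % K                                # teachers are used in turn
        x = rng.standard_normal(N) / np.sqrt(N)  # input, components of variance 1/N
        y_B = B[k] @ x + sigma_B * rng.standard_normal()  # noisy teacher output
        y_J = J @ x + sigma_J * rng.standard_normal()     # noisy student output
        J += eta * (y_B - y_J) * x               # gradient step on squared error

    # Direction cosine between true teacher and student.
    print(A @ J / (np.linalg.norm(A) * np.linalg.norm(J)))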
GENERALIZATION ERROR
A goal of statistical learning theory is to obtain the generalization error theoretically.
Generalization error = mean of the error over the distribution of new inputs.
Here the error is the squared error between the outputs of the true teacher and the student, and the mean is computed over a multiple Gaussian distribution.
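Written out (a reconstruction consistent with the order parameters used on the later slides, not copied from this slide):

    \epsilon \equiv \frac{1}{2}\left(y_A - y_J\right)^2,
    \qquad
    \epsilon_g = \langle \epsilon \rangle
               = \frac{1}{2}\left(1 - 2 r_J + l^2 + \sigma_A^2 + \sigma_J^2\right),

where r_J ≡ R_J l and the average is taken over the multiple Gaussian distribution of the internal potentials A · x and J · x (unit and l² variances, covariance r_J) together with the independent noises.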
Differential equations that describe the dynamical behaviors of the order parameters have been obtained, based on self-averaging in the thermodynamic limit, as follows:
1. To simplify the analysis, auxiliary order parameters are introduced, e.g. r_J ≡ R_J l, so that N r_J = A · J.
2. A is multiplied to both sides of the update rule J^{m+1} = J^m + f_k^m x^m, giving (with y^m ≡ A · x^m)
   N r_J^{m+1} = N r_J^m + f_k^m y^m.
3. The updates for N dt inputs are summed up,
   N r_J^{m+1} = N r_J^m + f_k^m y^m
   N r_J^{m+2} = N r_J^{m+1} + f_k^{m+1} y^{m+1}
   ⋮
   N r_J^{m+N dt} = N r_J^{m+N dt−1} + f_k^{m+N dt−1} y^{m+N dt−1},
   and since the sum of the N dt terms f y self-averages as N → ∞, the deterministic differential equation dr_J/dt = ⟨ f y ⟩ is obtained.
Simultaneous differential equations in deterministic form, which describe the dynamical behaviors of the order parameters, are obtained in this way.
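A sketch of the resulting system, re-derived here from the update rule and the three steps above rather than copied from the slide (noise variances σ_B², σ_J² as defined in the model):

    \frac{dr_J}{dt} = \eta \left( \frac{1}{K}\sum_{k} R_{B_k} - r_J \right)

    \frac{dr_{B_k J}}{dt} = \eta \left( \frac{1}{K}\sum_{k'} q_{k k'} - r_{B_k J} \right),
    \qquad q_{kk} \equiv 1

    \frac{dl^2}{dt} = 2\eta \left( \frac{1}{K}\sum_{k} r_{B_k J} - l^2 \right)
                    + \eta^2 \left( \frac{1}{K}\sum_{k} \left(1 - 2 r_{B_k J}\right) + l^2 + \sigma_B^2 + \sigma_J^2 \right)

Note that σ_A² does not enter the dynamics, since the student never learns A directly; it only shifts the generalization error.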
Analytical solutions of the order parameters follow by integrating these equations.
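For homogeneous ensemble teachers (R_{B_k} = R_B, and q_{kk'} = q for k ≠ k'), the linear equations above integrate to exponentials, e.g. (again a reconstruction, not the slide's own formulas):

    r_J(t) = R_B + \left( r_J(0) - R_B \right) e^{-\eta t},
    \qquad
    r_{B_k J}(t) = \frac{1 + (K-1)\,q}{K}
                 + \left( r_{B_k J}(0) - \frac{1 + (K-1)\,q}{K} \right) e^{-\eta t},

and l²(t) then follows from its linear first-order equation, relaxing with rate η(2 − η).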
Dynamical behaviors of the generalization error, R and l (η = 0.3, K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2)
[Figure: time evolution for the student and for the ensemble teachers]
The student becomes cleverer than any single member of the ensemble teachers. The larger the variety of the ensemble teachers is, the nearer the student comes to the true teacher.
Steady state analysis (t → ∞)
– If η < 0 or η > 2: the generalization error and the length of the student diverge.
– If 0 < η < 2: they converge to steady values.
  If η < 1, the more teachers exist or the richer the variety of the teachers is, the cleverer the student can become.
  If η > 1, the fewer teachers exist or the poorer the variety of the teachers is, the cleverer the student can become.
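The condition 0 < η < 2 can be read off from the l² equation sketched earlier (a reconstruction, not the slide's own algebra): collecting the l² terms gives

    \frac{dl^2}{dt} = -\eta\,(2 - \eta)\,l^2 + (\text{terms independent of } l^2),

so l², and with it the generalization error, converges only when η(2 − η) > 0, i.e. 0 < η < 2, and diverges when η < 0 or η > 2.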
Steady values of the generalization error, R and l (K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2)
[Figure: steady values as functions of η for several degrees of teacher variety q]
Rich variety is good when η < 1; poor variety is good when η > 1.
Steady values of the generalization error, R and l (q = 0.49, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2)
[Figure: steady values as functions of η for several numbers of teachers K]
Many teachers are good when η < 1; few teachers are good when η > 1.