Statistical Mechanics of Online Learning for Ensemble Teachers
Seiji Miyoshi (Kobe City College of Tech.), Masato Okada (Univ. of Tokyo, RIKEN BSI)

Presentation transcript:

1 Statistical Mechanics of Online Learning for Ensemble Teachers. Seiji Miyoshi (Kobe City College of Tech.), Masato Okada (Univ. of Tokyo, RIKEN BSI)

2 S U M M A R Y We analyze the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K ensemble teachers, and the student. Calculating the generalization error of the student analytically using statistical mechanics in the framework of on-line learning, we prove that when the learning rate satisfies η < 1, the larger the number K is and the more variety the ensemble teachers have, the smaller the generalization error is; when η > 1, the properties are completely reversed. If the variety of the K teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η → 0 and K → ∞.

3 B A C K G R O U N D (1/2) Batch learning –given examples are used more than once –the student comes to give correct answers for all the examples –requires a long time and a large memory On-line learning –examples once used are discarded –the student cannot give correct answers for all the examples used in training –a large memory is not necessary –it is possible to follow a time-variant teacher

4 B A C K G R O U N D (2/2) P U R P O S E In most cases in actual human society, a student can observe examples from two or more teachers who differ from each other. Purpose: to analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher, and to discuss the relationship between the number and the variety of the ensemble teachers and the generalization error.

5 M O D E L (1/4) A true teacher A, ensemble teachers B_1, B_2, ..., B_K, and a student J. The student J learns B_1, B_2, ... in turn; J cannot learn A directly. A, B_1, B_2, ..., and J are all linear perceptrons with noise.

6 M O D E L (2/4) Output of the true teacher (a linear perceptron with Gaussian noise): y_A = A·x + n_A. Outputs of the ensemble teachers (linear perceptrons with Gaussian noises): y_{B_k} = B_k·x + n_{B_k}, k = 1, ..., K. Output of the student (a linear perceptron with Gaussian noise): y_J = J·x + n_J.

7 M O D E L (3/4) Inputs x: N-dimensional, each component with mean 0 and variance 1/N. Initial value of the student J^0, true teacher A, and ensemble teachers B_k: components drawn independently with mean 0 and variance 1, so each vector has length ≈ √N. N → ∞ (thermodynamic limit). Order parameters –length of the student: l = |J|/√N –direction cosines: R_{B_k} between A and B_k, q_{kk'} between B_k and B_{k'}, and R between A and J
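To make the setup concrete, here is a minimal numerical sketch in Python; the variable names are mine, a finite N stands in for the thermodynamic limit, and the i.i.d. Gaussian choices are assumptions consistent with the statistics above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 3   # input dimension (finite stand-in for N -> infinity), number of ensemble teachers

# True teacher A, ensemble teachers B_k, student J^0:
# components drawn i.i.d. from N(0, 1), so each vector has length ~ sqrt(N).
A = rng.normal(size=N)
B = rng.normal(size=(K, N))
J = rng.normal(size=N)

# Order parameters
l = np.linalg.norm(J) / np.sqrt(N)                              # length of the student
R = A @ J / (np.linalg.norm(A) * np.linalg.norm(J))             # direction cosine between A and J
R_B = B @ A / (np.linalg.norm(B, axis=1) * np.linalg.norm(A))   # direction cosines between A and each B_k
```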

8 M O D E L (4/4) fkmfkm Gradient method Squared errors Student learns K ensemble teachers in turn.

9 GENERALIZATION ERROR A goal of statistical learning theory is to obtain the generalization error theoretically. Generalization error = mean of the error over the distribution of new inputs. Here the error is the squared error between the true teacher's output and the student's output for a new input; since these outputs obey a multiple Gaussian distribution, the mean can be calculated analytically.
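Rather than carrying out the Gaussian average analytically, the sketch can estimate the generalization error by Monte Carlo over fresh inputs; the 1/2 factor in the squared error is an assumed convention.

```python
sigma_A2 = 0.0   # noise variance of the true teacher (the later slides use 0.0)

def generalization_error(J, n_samples=10_000):
    """Monte Carlo estimate of eps_g = < (1/2)(y_A - y_J)^2 > over new inputs and noise."""
    x = rng.normal(scale=1.0 / np.sqrt(N), size=(n_samples, N))
    y_A = x @ A + rng.normal(scale=np.sqrt(sigma_A2), size=n_samples)
    y_J = x @ J + rng.normal(scale=np.sqrt(sigma_J2), size=n_samples)
    return 0.5 * np.mean((y_A - y_J) ** 2)
```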

10 Differential equations that describe the dynamical behaviors of the order parameters have been obtained based on self-averaging in the thermodynamic limit, as follows:
1. To simplify the analysis, the auxiliary order parameter r_J ≡ A·J/N is introduced.
2. A is multiplied to both sides of the update rule J^{m+1} = J^m + f_k^m x^m, giving N r_J^{m+1} = N r_J^m + f_k^m y^m with y^m = A·x^m. Writing this for N dt successive inputs gives the chain N r_J^{m+2} = N r_J^{m+1} + f_k^{m+1} y^{m+1}, ..., N r_J^{m+N dt} = N r_J^{m+N dt−1} + f_k^{m+N dt−1} y^{m+N dt−1}.
3. Summing the chain, the N dt accumulated terms self-average to their mean in the limit N → ∞, which yields the deterministic differential equation dr_J/dt = ⟨ f_k y ⟩ with time t = m/N.
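Self-averaging can also be checked numerically: the sample-to-sample spread of an order parameter such as r_J = A·J/N shrinks as N grows, so each realization follows the same deterministic curve. A rough check, reusing K, eta, sigma_B2, and sigma_J2 from the sketch above:

```python
def r_J_after(N, T=5, seed=0):
    """Run the learning for T*N steps at dimension N and return r_J = A.J/N at time t = T."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=N)
    B = rng.normal(size=(K, N))
    J = rng.normal(size=N)
    for m in range(T * N):
        x = rng.normal(scale=1.0 / np.sqrt(N), size=N)
        f = eta * ((B[m % K] @ x + rng.normal(scale=np.sqrt(sigma_B2)))
                   - (J @ x + rng.normal(scale=np.sqrt(sigma_J2))))
        J = J + f * x
    return A @ J / N

# The spread over random realizations decreases with N (self-averaging):
print(np.std([r_J_after(100, seed=s) for s in range(10)]))
print(np.std([r_J_after(1000, seed=s) for s in range(10)]))
```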

11 Simultaneous differential equations in deterministic form, which describe the dynamical behaviors of the order parameters.

12 Analytical solutions of order parameters

13 Dynamical behaviors of the generalization error, R and l ( η=0.3, K=3, R_B=0.7, σ_A²=0.0, σ_B²=0.1, σ_J²=0.2 ). The student becomes cleverer than any single member of the ensemble teachers. The larger the variety of the ensemble teachers is, the nearer the student and the true teacher are. (Curves shown for the student and the ensemble teachers.)
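This setting can be reproduced in simulation by building ensemble teachers with the prescribed direction cosine R_B = 0.7 to the true teacher; with independent orthogonal parts, their mutual direction cosine comes out as q ≈ R_B² = 0.49, the value used on the last slide. The construction below is my own choice, not something the slides specify:

```python
R_B_target = 0.7
A_hat = A / np.linalg.norm(A)

def make_teacher():
    """Ensemble teacher with direction cosine R_B to A and length sqrt(N)."""
    e = rng.normal(size=N)
    e -= (e @ A_hat) * A_hat               # keep only the part orthogonal to A
    e *= np.sqrt(N) / np.linalg.norm(e)    # rescale to length sqrt(N)
    return R_B_target * np.sqrt(N) * A_hat + np.sqrt(1.0 - R_B_target**2) * e

B = np.array([make_teacher() for _ in range(K)])   # step() picks these up

J = rng.normal(size=N)                             # fresh student
for m in range(20 * N):                            # track R and l up to t = 20
    J = step(J, m)
    if (m + 1) % N == 0:
        l = np.linalg.norm(J) / np.sqrt(N)
        R = A @ J / (np.linalg.norm(A) * np.linalg.norm(J))
        print(f"t={(m + 1) // N:2d}  R={R:.3f}  l={l:.3f}")
```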

14 Steady state analysis ( t → ∞ ) ・If η < 0 or η > 2, the generalization error and the length of the student diverge. ・If 0 < η < 2, they converge: if η < 1, the more teachers there are, or the richer the variety of the teachers is, the cleverer the student can become; if η > 1, the fewer teachers there are, or the poorer the variety of the teachers is, the cleverer the student can become.
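The stability boundary can be glimpsed in the same simulation: for 0 < η < 2 the length of the student settles, while outside that range it grows without bound. A quick check with a short horizon so the divergent case stays finite:

```python
for eta in (0.5, 1.5, 2.5):      # step() reads the module-level eta
    J = rng.normal(size=N)
    for m in range(5 * N):
        J = step(J, m)
    print(f"eta={eta}  l={np.linalg.norm(J) / np.sqrt(N):.2e}")
```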

15 Steady value of the generalization error, R and l ( K=3, R_B=0.7, σ_A²=0.0, σ_B²=0.1, σ_J²=0.2 ). Rich variety is good (η < 1)! Poor variety is good (η > 1)!

16 Steady value of the generalization error, R and l ( q=0.49, R_B=0.7, σ_A²=0.0, σ_B²=0.1, σ_J²=0.2 ). Many teachers are good (η < 1)! Few teachers are good (η > 1)!