Statistical Mechanics of Online Learning for Ensemble Teachers
Seiji Miyoshi (Kobe City College of Technology)
Masato Okada (University of Tokyo, RIKEN BSI)
SUMMARY
We analyze the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K teachers (ensemble teachers), and the student. Calculating the generalization error of the student analytically using statistical mechanics in the framework of on-line learning, we prove that when the learning rate satisfies η < 1, the larger the number and the variety of the ensemble teachers are, the smaller the generalization error is; when η > 1, the properties are completely reversed. If the variety of the K teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η → 0 and K → ∞.
BACKGROUND (1/2)
Batch learning
– given examples are used more than once
– the student comes to give correct answers for all examples used in training
– requires a long time and large memory
On-line learning
– examples once used are discarded
– the student cannot give correct answers for all examples used in training
– large memory is not necessary
– it is possible to follow a time-variant teacher
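To make the two protocols concrete, here is a minimal sketch in Python (illustrative, not from the slides; the teacher, learning rate, and sample sizes are assumptions): batch learning sweeps a stored set of examples repeatedly, while on-line learning uses each fresh example once and discards it.

    import numpy as np

    # Minimal sketch contrasting batch and on-line learning
    # for a noiseless linear perceptron (illustrative setup).
    rng = np.random.default_rng(0)
    N, eta = 100, 0.3
    B = rng.standard_normal(N)                      # teacher weights
    X = rng.standard_normal((500, N)) / np.sqrt(N)  # stored examples
    y = X @ B

    # Batch learning: the same 500 stored examples, many passes.
    J_batch = np.zeros(N)
    for _ in range(100):
        for x, t in zip(X, y):
            J_batch += eta * (t - J_batch @ x) * x

    # On-line learning: each example drives one update, then is discarded.
    J_online = np.zeros(N)
    for _ in range(50_000):
        x = rng.standard_normal(N) / np.sqrt(N)
        J_online += eta * (B @ x - J_online @ x) * x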
BACKGROUND (2/2) / PURPOSE
In most cases in actual human society, a student can observe examples from two or more teachers who differ from each other.
– To analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher
– To discuss the relationship between the number and the variety of the ensemble teachers and the generalization error
MODEL (1/4)
[Figure: the true teacher A, the ensemble teachers B_1, B_2, …, B_K around it, and the student J]
A, B_1, B_2, …, and J are linear perceptrons with noise.
The student J learns B_1, B_2, … in turn; J cannot learn A directly.
MODEL (2/4)
Output of true teacher (linear perceptron with Gaussian noise):
  y_A = A · x + n_A,  n_A ∼ N(0, σ_A²)
Outputs of ensemble teachers (linear perceptrons with Gaussian noises):
  y_{B_k} = B_k · x + n_{B_k},  n_{B_k} ∼ N(0, σ_B²),  k = 1, …, K
Output of student (linear perceptron with Gaussian noise):
  y_J = J · x + n_J,  n_J ∼ N(0, σ_J²)
MODEL (3/4)
Inputs: x = (x_1, …, x_N), components with mean 0 and variance 1/N
Initial value of student: J⁰, components with mean 0 and variance 1
True teacher: A, components with mean 0 and variance 1
Ensemble teachers: B_1, …, B_K, components with mean 0 and variance 1
N → ∞ (thermodynamic limit)
Order parameters
– length of student: l, with |J| = l √N
– direction cosines: R_J between A and J, R_{B_k} between A and B_k, q_{kk'} between B_k and B_{k'}, R_{B_k J} between B_k and J
MODEL (4/4)
The student learns the K ensemble teachers in turn. When teacher B_k is presented at step m, the student is updated by the gradient method on the squared error:
  J^{m+1} = J^m + f_k^m x^m,  f_k^m = η ( y_{B_k}^m − y_J^m )
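The whole model can be simulated directly; the sketch below is illustrative and not from the slides. In particular, constructing each B_k as R_B A plus an independent random part is an assumption: it yields direction cosine ≈ R_B to the true teacher and ≈ q = R_B² = 0.49 between teachers, matching the parameter values used on the later slides.

    import numpy as np

    # Sketch of the full model: true teacher A, K noisy ensemble
    # teachers B_k around A, and a student J that learns the B_k
    # in turn (J never sees A directly).
    rng = np.random.default_rng(1)
    N, K, eta = 1000, 3, 0.3
    sigma_B, sigma_J = np.sqrt(0.1), np.sqrt(0.2)
    R_B = 0.7

    A = rng.standard_normal(N)
    B = [R_B * A + np.sqrt(1 - R_B**2) * rng.standard_normal(N)
         for _ in range(K)]                      # hypothetical construction
    J = rng.standard_normal(N)

    for m in range(20 * N):
        k = m % K                                # teachers are used in turn
        x = rng.standard_normal(N) / np.sqrt(N)  # input, components of variance 1/N
        y_B = B[k] @ x + sigma_B * rng.standard_normal()  # noisy teacher output
        y_J = J @ x + sigma_J * rng.standard_normal()     # noisy student output
        J += eta * (y_B - y_J) * x               # gradient step on squared error

    # Direction cosine between true teacher and student.
    print(A @ J / (np.linalg.norm(A) * np.linalg.norm(J)))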
GENERALIZATION ERROR
A goal of statistical learning theory is to obtain the generalization error theoretically.
Generalization error = mean of the error over the distribution of new inputs.
Here the error is the squared error between the outputs of the true teacher and the student, and the mean is computed over a multiple Gaussian distribution.
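Written out (a reconstruction consistent with the order parameters used on the later slides, not copied from this slide):

    \epsilon \equiv \frac{1}{2}\left(y_A - y_J\right)^2,
    \qquad
    \epsilon_g = \langle \epsilon \rangle
               = \frac{1}{2}\left(1 - 2 r_J + l^2 + \sigma_A^2 + \sigma_J^2\right),

where r_J ≡ R_J l and the average is taken over the multiple Gaussian distribution of the internal potentials A · x and J · x (unit and l² variances, covariance r_J) together with the independent noises.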
Differential equations that describe the dynamical behaviors of the order parameters have been obtained, based on self-averaging in the thermodynamic limit, as follows:
1. To simplify the analysis, auxiliary order parameters are introduced, e.g. r_J ≡ R_J l, so that N r_J = A · J.
2. A is multiplied to both sides of the update rule J^{m+1} = J^m + f_k^m x^m, giving (with y^m ≡ A · x^m)
   N r_J^{m+1} = N r_J^m + f_k^m y^m.
3. The updates for N dt inputs are summed up,
   N r_J^{m+1} = N r_J^m + f_k^m y^m
   N r_J^{m+2} = N r_J^{m+1} + f_k^{m+1} y^{m+1}
   ⋮
   N r_J^{m+N dt} = N r_J^{m+N dt−1} + f_k^{m+N dt−1} y^{m+N dt−1},
   and since the sum of the N dt terms f y self-averages as N → ∞, the deterministic differential equation dr_J/dt = ⟨ f y ⟩ is obtained.
Simultaneous differential equations in deterministic form, which describe the dynamical behaviors of the order parameters, are obtained in this way.
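A sketch of the resulting system, re-derived here from the update rule and the three steps above rather than copied from the slide (noise variances σ_B², σ_J² as defined in the model):

    \frac{dr_J}{dt} = \eta \left( \frac{1}{K}\sum_{k} R_{B_k} - r_J \right)

    \frac{dr_{B_k J}}{dt} = \eta \left( \frac{1}{K}\sum_{k'} q_{k k'} - r_{B_k J} \right),
    \qquad q_{kk} \equiv 1

    \frac{dl^2}{dt} = 2\eta \left( \frac{1}{K}\sum_{k} r_{B_k J} - l^2 \right)
                    + \eta^2 \left( \frac{1}{K}\sum_{k} \left(1 - 2 r_{B_k J}\right) + l^2 + \sigma_B^2 + \sigma_J^2 \right)

Note that σ_A² does not enter the dynamics, since the student never learns A directly; it only shifts the generalization error.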
Analytical solutions of the order parameters follow by integrating these equations.
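For homogeneous ensemble teachers (R_{B_k} = R_B, and q_{kk'} = q for k ≠ k'), the linear equations above integrate to exponentials, e.g. (again a reconstruction, not the slide's own formulas):

    r_J(t) = R_B + \left( r_J(0) - R_B \right) e^{-\eta t},
    \qquad
    r_{B_k J}(t) = \frac{1 + (K-1)\,q}{K}
                 + \left( r_{B_k J}(0) - \frac{1 + (K-1)\,q}{K} \right) e^{-\eta t},

and l²(t) then follows from its linear first-order equation, relaxing with rate η(2 − η).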
Dynamical behaviors of the generalization error, R and l (η = 0.3, K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2)
[Figure: time evolution for the student and for the ensemble teachers]
The student becomes cleverer than any single member of the ensemble teachers. The larger the variety of the ensemble teachers is, the nearer the student comes to the true teacher.
Steady state analysis (t → ∞)
– If η < 0 or η > 2: the generalization error and the length of the student diverge.
– If 0 < η < 2: they converge to steady values.
  If η < 1, the more teachers exist or the richer the variety of the teachers is, the cleverer the student can become.
  If η > 1, the fewer teachers exist or the poorer the variety of the teachers is, the cleverer the student can become.
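The condition 0 < η < 2 can be read off from the l² equation sketched earlier (a reconstruction, not the slide's own algebra): collecting the l² terms gives

    \frac{dl^2}{dt} = -\eta\,(2 - \eta)\,l^2 + (\text{terms independent of } l^2),

so l², and with it the generalization error, converges only when η(2 − η) > 0, i.e. 0 < η < 2, and diverges when η < 0 or η > 2.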
Steady values of the generalization error, R and l (K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2)
[Figure: steady values as functions of η for several degrees of teacher variety q]
Rich variety is good when η < 1; poor variety is good when η > 1.
Steady values of the generalization error, R and l (q = 0.49, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2)
[Figure: steady values as functions of η for several numbers of teachers K]
Many teachers are good when η < 1; few teachers are good when η > 1.