Weak Convergence of Random Free Energy in Information Theory
Sumio Watanabe, Tokyo Institute of Technology

Presentation transcript:

1 Weak Convergence of Random Free Energy in Information Theory Sumio Watanabe Tokyo Institute of Technology

2 Contents
1. Background
2. Main Theorem
3. Outline of Proof
4. Applications and Future Study
Identification Problem ≡ Mathematical Physics with a Random Hamiltonian

3 Example: Classical Spin System (Background 1)
[Figure: an unknown Boltzmann machine with visible units x and hidden spins s_i, s_j coupled by weights w_{ij}; a learner observes samples of the visible units and learns the weights.]
p(x|w) = (1/Z(w)) Σ_{hidden} exp( - Σ_{(i,j)} w_{ij} s_i s_j )
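To make the example concrete, here is a minimal brute-force sketch (mine, not from the talk) of the marginal p(x|w) of a tiny Boltzmann machine; the network size, the random weights, and the function names are illustrative assumptions.

    import itertools
    import numpy as np

    def marginal(x, W, n_hidden):
        """Brute-force p(x|w): sum over hidden spins, divided by Z(w)."""
        n = len(x) + n_hidden

        def boltzmann_weight(s):
            # exp(-E(s)) with E(s) = sum_{i<j} w_ij s_i s_j
            e = sum(W[i, j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
            return np.exp(-e)

        # Numerator: visibles clamped to x, sum over hidden configurations.
        num = sum(boltzmann_weight(x + h) for h in itertools.product([-1, 1], repeat=n_hidden))
        # Partition function Z(w): sum over all spin configurations.
        Z = sum(boltzmann_weight(s) for s in itertools.product([-1, 1], repeat=n))
        return num / Z

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4))
    print(marginal((1, -1), W, n_hidden=2))   # p(x|w) for visible state x = (1,-1)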

4 Identification Problem (Background 2)
An unknown classical information source q(x) emits samples X_1, X_2, ..., X_n. A learning system with model p(x|w) and prior φ(w) observes them and outputs the estimated distribution p(x | X_1, X_2, ..., X_n). How close is the estimate, measured by the relative entropy
D(q||p) ≡ ∫ dx q(x) [ log q(x) - log p(x | X_1, X_2, ..., X_n) ] = ?

5 Random Free Energy and Relative Entropy (Background 3)
Definition (random free energy = minus the log-likelihood of the system):
F(X_1, X_2, ..., X_n) ≡ - log ∫ p(X_1|w) p(X_2|w) ··· p(X_n|w) φ(dw) + Σ_{i=1}^n log q(X_i)
Relation between F and D(q||p):
D( q(X_{n+1}) || p(X_{n+1} | X_1, X_2, ..., X_n) ) = F(X_1, X_2, ..., X_{n+1}) - F(X_1, X_2, ..., X_n)
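The relation above can be checked numerically. Below is a minimal sketch under assumed toy choices (mine, not from the talk): q = N(0,1), p(x|w) = N(w,1), φ = N(0,1); F is evaluated by quadrature on a w-grid and the expectation over X_{n+1} by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(1)
    w = np.linspace(-6, 6, 2001)                  # quadrature grid for the parameter
    dw = w[1] - w[0]
    phi = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)  # prior φ(w) = N(0,1)

    def log_p(x, w):                              # model p(x|w) = N(w,1)
        return -0.5 * (x - w)**2 - 0.5 * np.log(2 * np.pi)

    def log_q(x):                                 # true source q = N(0,1)
        return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

    def F(xs):
        # normalized free energy: -log ∫ Π_i [p(X_i|w)/q(X_i)] φ(w) dw
        ll = sum(log_p(x, w) - log_q(x) for x in xs)
        return -np.log(np.sum(np.exp(ll) * phi) * dw)

    xs = list(rng.normal(size=10))                # X_1,...,X_n ~ q
    xnew = rng.normal(size=2000)                  # Monte Carlo over X_{n+1}
    D = np.mean([F(xs + [x]) for x in xnew]) - F(xs)
    print("D(q||p) ≈", D)                         # nonnegative, shrinking with n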

6 Identifiability and Singularities (Background 4)
A learning system p(x|w) is called identifiable if p(x|w_1) = p(x|w_2) (∀x) ⇒ w_1 = w_2.
A system that identifies hidden structure is non-identifiable: {w ; p(x|w) = p(x|w_0)} is an analytic set with singularities.
Remark. Define w_1 ~ w_2 ⇔ "p(x|w_1) = p(x|w_2) (∀x)" on W = {w}. The quotient W/~ is not a manifold, because of these singularities.

7 Mathematical Definitions (Main Theorem 1)
X : a random variable on R^N with p.d.f. q(x).
W : a real d-dimensional manifold.
L^2(q) = { f ; ∫ f(x)^2 q(x) dx < ∞ } : a real Hilbert space.
φ(w) : a p.d.f. on W of class C_0^∞, so that φ(w) dw is a probability distribution on W.

8 Mathematical Definitions (Main Theorem 2)
H(·,w) : an L^2(q)-valued real analytic function on W such that E_X[ e^{-H(X,w)} ] = 1 (∀w), e.g. H(x,w) = log q(x) - log p(x|w). [ ⇒ K(w) ≡ E_X[ H(X,w) ] ≥ 0 ]
W_0 ≡ { w ∈ supp φ ; K(w) = 0 } ≠ ∅.
Given X_1, X_2, ..., X_n i.i.d., the random free energy is
F = - log ∫ exp( - Σ_{i=1}^n H(X_i, w) ) φ(w) dw

9 Gelfand's Zeta Function (Main Theorem 3)
ζ(z) = ∫ K(w)^z φ(w) dw : holomorphic in Re(z) > 0.
Difficulty: {w ; K(w) = 0} is an analytic set with singularities.
Theorem (Atiyah, Sato, Bernstein, Björk, Kashiwara).
(1) ζ(z) can be analytically continued to a meromorphic function on the entire complex plane.
(2) All its poles are negative rational numbers: 0 > -λ_1 > -λ_2 > -λ_3 > ···, with orders m_1, m_2, m_3, ...
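A worked example of the theorem (my illustration, not on the slide): take K(w_1, w_2) = w_1^2 w_2^4 with φ uniform on [0,1]^2. Then

\[
\zeta(z) = \int_0^1\!\!\int_0^1 (w_1^2 w_2^4)^z \, dw_1\, dw_2
         = \frac{1}{2z+1}\cdot\frac{1}{4z+1},
\]

which continues meromorphically to all of C with simple poles at z = -1/2 and z = -1/4, both negative rationals; the largest pole gives λ_1 = 1/4 with order m_1 = 1.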

10 Main Theorem (Main Theorem 4)
F - λ_1 log n + (m_1 - 1) log log n → F*  (n → ∞),
where the convergence is in law and F* can be represented by a limit of an empirical process on W_0.
Corollary. If E[ D(q||p) ] has an asymptotic expansion in 1/n, then
E[ D(q||p) ] = λ_1/n + o(1/n).
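A Monte Carlo sketch of the leading term (my toy model, not from the talk): q = N(0,1), p(x|w) = N(w^2, 1), φ uniform on [-1,1]. Then H(x,w) = -x w^2 + w^4/2, K(w) = w^4/2, and ζ(z) = 2^{-z}/(4z+1) has its largest pole at -1/4, so E[F] should grow like (1/4) log n.

    import numpy as np

    rng = np.random.default_rng(2)
    w = np.linspace(-1, 1, 2001)              # supp φ = [-1,1], uniform prior φ = 1/2
    dw = w[1] - w[0]

    def free_energy(xs):
        # H(x,w) = -x w^2 + w^4/2 for q = N(0,1), p(x|w) = N(w^2, 1)
        # F = -log ∫ exp(-Σ_i H(X_i,w)) φ(w) dw, by quadrature on the w-grid
        h = -xs.sum() * w**2 + len(xs) * w**4 / 2
        return -np.log(np.sum(np.exp(-h) * 0.5) * dw)

    for n in [100, 1000, 10000]:
        Fbar = np.mean([free_energy(rng.normal(size=n)) for _ in range(200)])
        print(n, Fbar, 0.25 * np.log(n))      # E[F] ≈ (1/4) log n, up to O(1)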

11 Hironaka Resolution Theorem (Proof Outline 1)
[Figure: a proper analytic map g: U → W sends U_0 = g^{-1}(W_0) to the zero set W_0 of K; K is pulled back to normal crossing form.]
Locally, K(g(u)) = a(u) u_1^{2s_1} u_2^{2s_2} ··· u_d^{2s_d}.

12 Resolution Theorem (Proof Outline 2)
Theorem (H. Hironaka 1964; M. F. Atiyah 1970). Let K(w) ≥ 0 be a real analytic function defined in a neighborhood of 0 ∈ W ⊂ R^d. Then there exist an open set W, a real analytic manifold U, and a proper analytic map g: U → W such that
(1) g: U - U_0 → W - W_0 is an isomorphism, and
(2) for each P ∈ U there are local coordinates (u_1, u_2, ..., u_d) centered at P such that, locally near P,
K(g(u)) = a(u) u_1^{2s_1} u_2^{2s_2} ··· u_d^{2s_d},
where a(u) > 0 is an analytic function and each s_i ≥ 0 is an integer.
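A one-blow-up instance of (2) (my example): K(w_1, w_2) = (w_1^2 + w_2^2)^2 is singular at the origin, and in the chart g(u_1, u_2) = (u_1, u_1 u_2) of the blow-up,

\[
K(g(u)) = \bigl(u_1^2 + u_1^2 u_2^2\bigr)^2 = (1 + u_2^2)^2\, u_1^4,
\]

which is the normal crossing form with a(u) = (1 + u_2^2)^2 > 0, s_1 = 2, s_2 = 0; the symmetric chart (u_1 u_2, u_2) covers the rest, and g is an isomorphism off u_1 = 0.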

13 Division of the Partition Function (Proof Outline 3)
Because supp φ is compact and g is a proper map, we may assume W = ∪_α U_α, a finite union whose pairwise intersections have measure zero. In each U_α,
K(g_α(u_α)) = a(u) u_1^{2s_1} u_2^{2s_2} ··· u_d^{2s_d},
φ_α(u_α) = Σ b_α(u) u_1^{k_1} u_2^{k_2} ··· u_d^{k_d}
(both the s_i and the k_i depend on α), and the partition function divides as
exp(-F) = Σ_α ∫_{U_α} exp[ - Σ_{i=1}^n H(X_i, g_α(u_α)) ] φ_α(u_α) du_α.
Hereafter α is omitted and K(u) ≡ K(g(u)) is used.

14 b-Function (Proof Outline 4)
For the zeta function ζ(z) = ∫ K(w)^z φ(w) dw, there exist a differential operator P(w, ∂_w, z) and a polynomial b(z) such that
P(w, ∂_w, z) K(w)^{z+1} = b(z) K(w)^z.
The analytic continuation is carried out using this b-function. If K(w) is a polynomial, then there exists an algorithm to calculate b(z) (Oaku, 1997).
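The simplest instance (a standard example, not on the slide): for K(w) = w^2, take P = (1/4) ∂_w^2. Then

\[
\tfrac{1}{4}\,\partial_w^2\,(w^2)^{z+1}
= \tfrac{1}{4}(2z+2)(2z+1)\,w^{2z}
= (z+1)\bigl(z+\tfrac{1}{2}\bigr)(w^2)^z,
\]

so b(z) = (z+1)(z+1/2). Writing ζ(z) = (1/b(z)) ∫ [P K^{z+1}] φ dw and integrating by parts shifts z to z+1, so poles of ζ can occur only at zeros of b(z) and their shifts by negative integers (here -1/2 appears as the largest pole when φ(0) > 0).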

15 Ideals of Local Analytic Functions (Proof Outline 5)
Lemma 1. Let u ↦ H(·,u) be a real analytic function on U. There exist an open set U_α ⊂ U and a finite set of analytic functions { g_j(u), h_j(·,u) ; j = 1, 2, ..., J } on U_α such that
(1) T(u) ≥ I (∀u ∈ U_α), where T_{jk}(u) ≡ ∫ h_j(x,u) h_k(x,u) q(x) dx, and
(2) H(x,u) = Σ_{j=1}^J g_j(u) h_j(x,u).

16 Decomposition of the Random Hamiltonian (Proof Outline 6)
Σ_{i=1}^n H(X_i, u) = n K(u) + ( n K(u) )^{1/2} σ_n(u),
where
σ_n(u) ≡ (1/n^{1/2}) Σ_{i=1}^n r(X_i, u),  r(x,u) ≡ ( H(x,u) - K(u) ) / K(u)^{1/2}.
By Lemma 1 and the identity K(u) = ∫ { H(x,u) + e^{-H(x,u)} - 1 } q(x) dx, r(x,u) is well defined even where K(u) = 0.
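The decomposition itself is elementary algebra once r is defined; substituting the definitions gives

\[
(nK(u))^{1/2}\,\sigma_n(u)
= (nK(u))^{1/2}\cdot\frac{1}{n^{1/2}}\sum_{i=1}^n \frac{H(X_i,u)-K(u)}{K(u)^{1/2}}
= \sum_{i=1}^n H(X_i,u) \;-\; nK(u).
\]

Since E_X[r(X,u)] = 0, σ_n(u) is a centered normalized sum, exactly the object to which the empirical-process limit theorem of the next slide applies.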

17 Donsker's Empirical Process (Proof Outline 7)
The empirical process σ_n(u) ≡ (1/n^{1/2}) Σ_{i=1}^n r(X_i, u) converges to a tight Gaussian process σ(·):
E_{X_1,...,X_n}[ f(σ_n) ] → E_σ[ f(σ) ]
for every bounded continuous functional f on L^∞(supp φ). This is the central limit theorem in a Banach space.

18 Poles of the Zeta Function (Proof Outline 8)
In each local chart,
K(u) = a(u) u_1^{2s_1} u_2^{2s_2} ··· u_d^{2s_d},  φ(u) = Σ b(u) u_1^{k_1} u_2^{k_2} ··· u_d^{k_d},
and ζ(z) = Σ_α ∫ K(u)^z φ(u) du. The largest pole -λ and its order m are given by
λ = min_j ( k_j + 1 ) / ( 2 s_j ),  m = #{ j ; λ = ( k_j + 1 ) / ( 2 s_j ) }.
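In code the pole computation is just a minimum over exponent ratios; this helper (the function name and the example exponents are mine) evaluates λ and m exactly with rational arithmetic.

    from fractions import Fraction

    def largest_pole(charts):
        """λ and m from normal-crossing exponents.
        charts: list of (s, k) pairs, one per local chart, with
        K ~ Π u_j^{2 s_j} and φ ~ Π u_j^{k_j} in that chart."""
        lam, m = None, 0
        for s, k in charts:
            ratios = [Fraction(kj + 1, 2 * sj) for sj, kj in zip(s, k) if sj > 0]
            lam_c = min(ratios)                 # λ of this chart
            m_c = ratios.count(lam_c)           # its multiplicity
            if lam is None or lam_c < lam:      # smaller λ dominates globally
                lam, m = lam_c, m_c
            elif lam_c == lam:                  # same λ: take the maximal order
                m = max(m, m_c)
        return lam, m

    # Chart 1: K ~ u1^4, φ ~ 1.  Chart 2: K ~ u1^2 u2^2, φ ~ u2^2.
    print(largest_pole([((2, 0), (0, 0)), ((1, 1), (0, 2))]))   # (1/4, 1)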

19 Zeta Function and State Density (Proof Outline 9)
Split the local coordinates into (u, v), where u = (u_j), j ∈ J, are the coordinates attaining the minimum, and set L(u) ≡ Π_{j∈J} u_j^{2s_j}.
Partial zeta function: ∬ L(u)^z φ(u,v) du dv, with pole -λ of order m.
By the inverse Mellin transform, the state density behaves, as t → 0, like
∬ δ( t - L(u) ) φ(u,v) du dv ≅ t^{λ-1} (-log t)^{m-1} ∫ φ(0,v) dv.
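In the simplest one-dimensional case (my illustration): L(u) = u^{2s} on [0,1] with φ ≡ 1. The partial zeta function is ∫_0^1 u^{2sz} du = 1/(2sz+1), with pole -λ = -1/(2s) of order m = 1, and the state density can be computed directly:

\[
v(t) = \int_0^1 \delta\bigl(t - u^{2s}\bigr)\, du
     = \frac{1}{2s}\, t^{\frac{1}{2s}-1}
     = \frac{1}{2s}\, t^{\lambda-1},
\]

matching t^{λ-1} (-log t)^{m-1} with m = 1; the logarithmic factor appears only when several coordinates attain the minimum.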

20 Partition Function and Empirical Process (Proof Outline 10)
Partition function:
Z = ∬ exp( -n K(u,v) + ( n K(u,v) )^{1/2} σ_n(u,v) ) φ(u,v) du dv
  → ∬ (t/n)^{λ-1} ( -log(t/n) )^{m-1} exp[ -t K(0,v) + ( t K(0,v) )^{1/2} σ(0,v) ] φ(0,v) dv dt/n   (n → ∞).
Characteristic function of F: for sufficiently small ε > 0,
E[ { Z n^λ / (log n)^{m-1} }^{iε} ] → const.
Partition function ← state density ← zeta function. Q.E.D.

21 Information Science & Mathematical Physics (Applications and Future Study 1)
Identification of an unknown information source = statistical physics with a random Hamiltonian.
Identification of hidden structure = the Hamiltonian has singularities.
⇒ The singularities make the state density singular.

22 Model Identification (Applications and Future Study 2)
The free energy F = F(p, φ, X_1, X_2, ..., X_n) is determined by the model p(x|w), the prior φ(w), and the samples. By computing F from the samples for candidate models and comparing them, the true distribution is identified.

23 Poles and Orders of the Zeta Function (Applications and Future Study 3)
1. If φ(w) > 0 at W_0, then 0 < λ ≤ d/2.
2. 1 ≤ m ≤ d.
3. If φ(w) is Jeffreys' prior, then λ ≥ d/2.
4. If ζ(z) has a pole -λ', then λ ≤ λ'.

24 Concrete Learning Systems (Applications and Future Study 4)
1. Neural networks, true model with H_0 hidden units, learning model with H:
p(y|x,w) = (1/(2π)^{1/2}) exp( - || y - Σ_h a_h f(b_h · x + c_h) ||^2 / 2 ),
2λ ≤ H_0 (M+N+1) + (H - H_0) min(M+1, N).
2. Gaussian mixtures, true model with H_0 components, learning model with H:
p(x|w) = Σ_h a_h exp( - || x - b_h ||^2 / 2 ),
2λ ≤ H_0 + (M-1)H/2 + (M-3)/2.
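The two bounds are easy to evaluate; a small sketch (function names and the example sizes H_0, H, M, N are mine) comparing them with the regular-model value λ = d/2.

    def nn_bound(H0, H, M, N):
        # slide's neural-network bound: 2λ ≤ H0(M+N+1) + (H-H0) min(M+1, N)
        return (H0 * (M + N + 1) + (H - H0) * min(M + 1, N)) / 2

    def gmm_bound(H0, H, M):
        # slide's Gaussian-mixture bound: 2λ ≤ H0 + (M-1)H/2 + (M-3)/2
        return (H0 + (M - 1) * H / 2 + (M - 3) / 2) / 2

    H0, H, M, N = 2, 5, 3, 2                 # true units, model units, input/output dims
    d = H * (M + N + 1)                      # parameter count of the network
    print("network: λ ≤", nn_bound(H0, H, M, N), "vs regular d/2 =", d / 2)
    print("mixture: λ ≤", gmm_bound(H0, H, M))

For these sizes the singular bound (9.0) is well below the regular value d/2 = 15.0, which is what makes E[D(q||p)] = λ/n smaller than for a regular model of the same dimension.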

25 Future Study (Applications and Future Study 5)
1. Hypothesis testing: q(x) = p(x|w_0) with w_0 near a singularity.
2. Large systems: the thermodynamic limit.
3. Replica method: f(z) = E[ exp(zF) ].
4. Generalization to non-commutative systems.