The “Second Law” of Probability: Entropy Growth in the Central Limit Theorem. Keith Ball.

Presentation transcript:

The “Second Law” of Probability: Entropy Growth in the Central Limit Theorem. Keith Ball

The second law of thermodynamics Joule and Carnot studied ways to improve the efficiency of steam engines. Is it possible for a thermodynamic system to move from state A to state B without any net energy being put into the system from outside? A single experimental quantity, dubbed entropy, made it possible to decide the direction of thermodynamic changes.

The second law of thermodynamics The entropy of a closed system increases with time. The second law applies to all changes: not just thermodynamic. Entropy measures the extent to which energy is dispersed: so the second law states that energy tends to disperse.

The second law of thermodynamics Maxwell and Boltzmann developed a statistical model of thermodynamics in which the entropy appeared as a measure of “uncertainty”. Uncertainty should be interpreted as “uniformity” or “lack of differentiation”.

The second law of thermodynamics

Closed systems become progressively more featureless. We expect that a closed system will approach an equilibrium with maximum entropy.

Information theory Shannon showed that a noisy channel can communicate information with almost perfect accuracy, up to a fixed rate: the capacity of the channel. The (Shannon) entropy of a probability distribution: if the possible states have probabilities p_1, p_2, p_3, … then the entropy is H = p_1 log_2(1/p_1) + p_2 log_2(1/p_2) + p_3 log_2(1/p_3) + …. Entropy measures the number of (YES/NO) questions that you expect to have to ask in order to find out which state has occurred.

Information theory You can distinguish 2^k states with k (YES/NO) questions. If the states are equally likely, then this is the best you can do. It costs k questions to identify a state from among 2^k equally likely states.

Information theory It costs k questions to identify a state from among 2^k equally likely states. It costs log_2 n questions to identify a state from among n equally likely states: that is, to identify a state with probability 1/n.

Probability    Questions
1/n            log_2 n
p              log_2(1/p)

The entropy

State    Probability    Questions        Uncertainty
S_1      p_1            log_2(1/p_1)     p_1 log_2(1/p_1)
S_2      p_2            log_2(1/p_2)     p_2 log_2(1/p_2)
S_3      p_3            log_2(1/p_3)     p_3 log_2(1/p_3)
…

Entropy = p_1 log_2(1/p_1) + p_2 log_2(1/p_2) + p_3 log_2(1/p_3) + …
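
A quick numerical illustration (my addition, not from the slides) of the formula above; the helper name entropy_bits and the example distributions are arbitrary:

```python
import numpy as np

# Shannon entropy H = sum_i p_i * log2(1/p_i), as in the table above.
def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # states with p_i = 0 contribute nothing
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 4 equally likely states -> 2.0 questions
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))  # -> 1.75 questions on average
print(entropy_bits([1.0]))                      # a certain outcome -> 0.0
```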

Continuous random variables For a random variable X with density f the entropy is Ent(X) = -∫ f log f. The entropy behaves nicely under several natural processes: for example, the evolution governed by the heat equation.

If the density f measures the distribution of heat in an infinite metal bar, then f evolves according to the heat equation ∂f/∂t = ∂²f/∂x². The entropy increases, and its rate of increase is the Fisher information: d/dt Ent(f) = ∫ (∂f/∂x)²/f = J(f).
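
A short derivation (my addition, not on the slide), assuming f is smooth, positive and decays fast enough for the boundary terms in the integrations by parts to vanish:

```latex
\begin{aligned}
\frac{d}{dt}\operatorname{Ent}(f)
  &= -\int f_t\,(1+\log f)
   = -\int f_{xx}\,\log f
   \qquad\Bigl(\text{since } \int f_t = \tfrac{d}{dt}\int f = 0\Bigr)\\
  &= \int f_x\cdot\frac{f_x}{f}
   = \int \frac{f_x^{2}}{f}
   = J(f) \;\ge\; 0 .
\end{aligned}
```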

The central limit theorem If X_i are independent copies of a random variable with mean 0 and finite variance, then the normalized sums S_n = (X_1 + X_2 + … + X_n)/√n converge to a Gaussian (normal) with the same variance. Most proofs give little intuition as to why.

The central limit theorem Among random variables with a given variance, the Gaussian has largest entropy. Theorem (Shannon-Stam) If X and Y are independent and identically distributed, then the normalized sum (X + Y)/√2 has entropy at least that of X and Y.

Idea The central limit theorem is analogous to the second law of thermodynamics: the normalized sums have increasing entropy, which drives them towards an “equilibrium” of maximum entropy.

The central limit theorem Linnik (1959) gave an information-theoretic proof of the central limit theorem using entropy. He showed that for appropriately smoothed random variables, if S_n remains far from Gaussian, then S_{n+1} has larger entropy than S_n. This gives convergence in entropy for the smoothed random variables but does not show that entropy increases with n.

Problem (folklore, or Lieb (1978)): Is it true that Ent(S_n) increases with n? Shannon-Stam shows that it increases as n goes from 1 to 2 (hence 2 to 4 and so on). Carlen and Soffer found uniform estimates for the entropy jump from 1 to 2. It wasn't known that entropy increases from 2 to 3. The difficulty is that you can't express the sum of 3 independent random variables in terms of the sum of 2: you can't add 3/2 independent copies of X.

The Fourier transform? The simplest proof (conceptually) of the central limit theorem uses the FT. If X has density f whose FT is φ, then the FT of the density of S_n = (X_1 + … + X_n)/√n is φ(t/√n)^n. The problem is that the entropy cannot easily be expressed in terms of the FT. So we must stay in real space instead of Fourier space.
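
For completeness, a sketch (my addition) of the Fourier argument for a variable with mean 0 and variance σ²:

```latex
\varphi\!\left(\frac{t}{\sqrt n}\right)^{\!n}
  = \left(1 - \frac{\sigma^{2} t^{2}}{2n} + o\!\left(\frac{1}{n}\right)\right)^{\!n}
  \;\longrightarrow\; e^{-\sigma^{2} t^{2}/2},
```

which is the transform of the Gaussian with variance σ²; but there is no comparably simple expression for Ent(S_n) in terms of φ.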

Example: Suppose X is uniformly distributed on the interval between 0 and 1. Its density is f(x) = 1 for 0 ≤ x ≤ 1 and 0 otherwise. When we add two copies, the density of the sum is the triangular function f_2(x) = x for 0 ≤ x ≤ 1, f_2(x) = 2 - x for 1 ≤ x ≤ 2, and 0 otherwise.

For 9 copies the density is a spline defined by 9 different polynomials on different parts of the range. The central polynomial (for example) is a polynomial of degree 8, and its logarithm is not something one can integrate explicitly.
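
A numerical illustration (my addition, not from the talk): convolving the uniform density on a grid and computing the differential entropy of the normalized sums shows the entropy climbing towards the Gaussian maximum. The grid spacing, the cutoff and the number of summands are arbitrary choices.

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)               # density of X ~ Uniform[0,1] on the grid

def entropy(density, dx):
    """Differential entropy -integral of g log g for a density sampled on a grid."""
    g = density[density > 1e-12]
    return float(-np.sum(g * np.log(g)) * dx)

g = f.copy()
for n in range(1, 6):
    # g approximates the density of X_1 + ... + X_n;
    # Ent(S_n) = Ent(X_1 + ... + X_n) - (1/2) log n, since S_n = (sum)/sqrt(n).
    print(f"n = {n}:  Ent(S_n) ~ {entropy(g, dx) - 0.5 * np.log(n):.4f}")
    g = np.convolve(g, f) * dx    # density of the (n+1)-fold sum

# The Gaussian with the same variance (1/12) has the maximal entropy:
print(f"Gaussian: {0.5 * np.log(2 * np.pi * np.e / 12):.4f}")
```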

The second law of probability A new variational approach to entropy gives quantitative measures of entropy growth and proves the “second law”. Theorem (Artstein, Ball, Barthe, Naor) If X_i are independent copies of a random variable with finite variance, then the normalized sums S_n = (X_1 + … + X_n)/√n have increasing entropy.

Starting point (used by many authors): instead of considering entropy directly, we study the Fisher information J(X) = ∫ (f′)²/f. Among random variables with variance 1, the Gaussian has the smallest Fisher information, namely 1. The Fisher information should decrease as a process evolves.
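
A quick check (my addition) of the stated minimum for the standard Gaussian:

```latex
g(x) = \frac{e^{-x^{2}/2}}{\sqrt{2\pi}},
\qquad
\frac{g'(x)}{g(x)} = -x,
\qquad
J(g) = \int \frac{(g')^{2}}{g} = \int x^{2}\,g(x)\,dx = 1 .
```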

The connection (we want) between entropy and Fisher information is provided by the Ornstein-Uhlenbeck process (de Bruijn, Bakry and Emery, Barron). Recall that if the density of X(t) evolves according to the heat equation, then d/dt Ent(X(t)) = J(X(t)). The heat equation can be solved by running a Brownian motion from the initial distribution. The Ornstein-Uhlenbeck process is like Brownian motion but run in a potential which keeps the variance constant.

The Ornstein-Uhlenbeck process A discrete analogue: You have n sites, each of which can be ON or OFF. At each time, pick a site (uniformly) at random and switch it. X(t) = (number ON) - (number OFF).
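
A minimal simulation sketch of this discrete analogue (my addition; the parameters and the all-OFF start are arbitrary choices):

```python
import random

def run_chain(n=100, steps=2000, seed=0):
    """n ON/OFF sites; at each step flip a uniformly random site; record X(t) = #ON - #OFF."""
    rng = random.Random(seed)
    sites = [False] * n               # start with everything OFF, so X(0) = -n
    path = []
    for _ in range(steps):
        sites[rng.randrange(n)] ^= True
        on = sum(sites)
        path.append(on - (n - on))
    return path

path = run_chain()
print(path[:10], "...", path[-10:])   # the path drifts towards 0 and then fluctuates around it
```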

The Ornstein-Uhlenbeck process A typical path of the process.

The Ornstein-Uhlenbeck evolution The density evolves according to the modified diffusion equation ∂f/∂t = ∂²f/∂x² + ∂/∂x(xf). From this, d/dt Ent(X(t)) = J(X(t)) - 1, the gap between the information of X(t) and that of the Gaussian. As t → ∞ the evolutes approach the Gaussian of the same variance.
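
A derivation sketch of this identity (my addition), assuming enough smoothness and decay to integrate by parts freely:

```latex
\begin{aligned}
\frac{d}{dt}\operatorname{Ent}(f)
  &= -\int f_t\,\log f
   = -\int \bigl(f_{xx} + (xf)_x\bigr)\log f\\
  &= \int \frac{f_x^{2}}{f} + \int x\,f_x
   = J(f) - \int f
   = J(f) - 1 .
\end{aligned}
```

In particular the standard Gaussian g, with g′ = -xg, satisfies g″ + (xg)′ = 0, so it is the stationary state.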

The entropy gap can be found by integrating the information gap along the evolution. In order to prove entropy increase, it suffices to prove that the information decreases with n. It was known (Blachman-Stam) that J(S_2) ≤ J(S_1).
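
In symbols (my addition, writing X(t) for the Ornstein-Uhlenbeck evolute of X and G for the Gaussian of the same variance), integrating the identity from the previous slide from t = 0 to ∞ gives

```latex
\operatorname{Ent}(G) - \operatorname{Ent}(X)
  = \int_0^{\infty} \bigl(J(X(t)) - 1\bigr)\,dt ,
```

so entropy comparisons follow from comparisons of the information all along the evolution.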

Main new tool: a variational description of the information of a marginal density. If w is a density on R^n and e is a unit vector, then the marginal in direction e has density h(t) = ∫_{x·e=t} w.

Main new tool: The density h is a marginal of w, and the relevant integrand is non-negative if h has concave logarithm. Densities with concave logarithm have been widely studied in high-dimensional geometry, because they naturally generalize convex solids.

The Brunn-Minkowski inequality Let A(x) be the cross-sectional area of a convex body at position x. Then log A is concave. The function A is a marginal of the body.

The Brunn-Minkowski inequality We can replace the body by a function with concave logarithm. If w has concave logarithm, then so does each of its marginals. If the density h is a marginal of w, the inequality tells us something about the concavity of log h in terms of the Hessian of log w.

The Brunn-Minkowski inequality If the density h is a marginal of w, the inequality tells us something about the concavity of log h in terms of the Hessian of log w. We rewrite a proof of the Brunn-Minkowski inequality so as to provide an explicit relationship between the two. The expression involving the Hessian is a quadratic form whose minimum is the information of h. This gives rise to the variational principle.

The variational principle Theorem If w is a density and e is a unit vector, then the information of the marginal in the direction e is J(h) = min_p ∫ (div(pw))²/w, where the minimum is taken over vector fields p satisfying p·e = 1 at each point.
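
A sanity check (my addition) in dimension one, where the constraint forces p ≡ 1 and the marginal h is w itself:

```latex
\operatorname{div}(pw) = w',
\qquad
\int \frac{\bigl(\operatorname{div}(pw)\bigr)^{2}}{w}
  = \int \frac{(w')^{2}}{w}
  = J(w) = J(h) ,
```

so the variational formula collapses to the usual Fisher information.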

Technically we have gained because h(t) is itself an integral, which is awkward to have in a denominator (as it is in J(h) = ∫ h′²/h). The real point is that we get to choose p. Instead of choosing the optimal p, which yields the intractable formula for the information, we choose a non-optimal p with which we can work.

Proof of the variational principle. The marginal is h(t) = ∫_{x·e=t} w, so h′(t) = ∫_{x·e=t} ∂w/∂e. If p satisfies p·e = 1 at each point, then we can realise the derivative as h′(t) = ∫_{x·e=t} div(pw), since the part of the divergence perpendicular to e integrates to 0 by the Gauss-Green (divergence) theorem.

Hence, applying the Cauchy-Schwarz inequality on each hyperplane, J(h) = ∫ h′²/h ≤ ∫ (div(pw))²/w. There is equality if div(pw) = (h′/h)(x·e) · w at each point. This divergence equation has many solutions: for example we might try the electrostatic field solution. But this does not decay fast enough at infinity to make the divergence theorem valid.

The right solution for p is a flow in the direction of e which transports between the probability measures induced by w on hyperplanes perpendicular to e. For example, if w is 1 on a triangle and 0 elsewhere, the flow is as shown. (The flow is irrelevant where w = 0.)

If w(x_1, x_2, …, x_n) = f(x_1)f(x_2)…f(x_n), then the density of the normalized sum S_n is the marginal of w in the direction of (1,1,…,1). The density of the (n-1)-fold normalized sum is the marginal in the direction of (0,1,…,1) or (1,0,1,…,1) or …. Thus, we can extract both sums as marginals of the same density. This deals with the “difficulty”. How do we use it?
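
A one-line verification (my addition; ψ denotes an arbitrary test function) that the marginal in the unit direction e = (1,…,1)/√n is exactly the density of S_n:

```latex
S_n = X\cdot e,
\qquad
\mathbb{E}\,\psi(S_n)
  = \int_{\mathbb{R}^{n}} \psi(x\cdot e)\,w(x)\,dx
  = \int_{\mathbb{R}} \psi(t)\Bigl(\int_{x\cdot e = t} w\Bigr)dt
  = \int_{\mathbb{R}} \psi(t)\,h(t)\,dt .
```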

Proof of the second law. Choose an (n-1)-dimensional vector field p that (nearly) achieves the minimum in the variational principle for the (n-1)-fold normalized sum. In n-dimensional space, put a “copy” of p on each set of n-1 coordinates. Add them up to get a vector field with which to estimate the information of the n-fold sum.

As we cycle the “gap” (the omitted coordinate) we also cycle the coordinates of p and the coordinates upon which it depends. Add these functions and normalise, so that the resulting vector field P satisfies the constraint of the variational principle in the direction (1,1,…,1)/√n.

We use P to estimate J(n). The problem reduces to showing that if each y_i has mean zero and does not depend on the i-th coordinate, then E(Σ_i y_i)² ≤ (n-1) Σ_i E(y_i²). The trivial estimate (using the Cauchy-Schwarz inequality) gives n instead of n-1. We can improve it because the y_i have special properties: there is a small amount of orthogonality between them because y_i is independent of the i-th coordinate.
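
A minimal check (my notation) that the constant n-1 here is the right one: for n = 2 the cross term vanishes by independence and the mean-zero assumption, and the estimate is an equality:

```latex
y_1 = g(x_2),\quad y_2 = h(x_1),\quad \mathbb{E}g = \mathbb{E}h = 0
\;\Longrightarrow\;
\mathbb{E}(y_1 + y_2)^{2} = \mathbb{E}y_1^{2} + \mathbb{E}y_2^{2}
  = (n-1)\sum_i \mathbb{E}y_i^{2} .
```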

The second law of probability Theorem If X_i are independent copies of a random variable with finite variance, then the normalized sums S_n = (X_1 + … + X_n)/√n have increasing entropy.