Vesselin I. Dimitrov on Shannon-Jaynes Entropy and Fisher Information, July 13, 2007

Presentation transcript:

ON SHANNON-JAYNES ENTROPY AND FISHER INFORMATION
Vesselin I. Dimitrov
Idaho Accelerator Center, Idaho State University, Pocatello, ID

The MAXENT Paradigm
State of knowledge: probability density.
Quantification: information content / information distance.
Data: constraints.
MAXENT (constructive): minimize information content subject to the available data/constraints.
MAXENT (learning rule): minimize information distance subject to the available data/constraints.
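In symbols, the learning-rule version amounts to a constrained minimization of the following kind (a minimal sketch in generic notation; S, C, d and m are placeholders, not necessarily the talk's symbols):

\min_{p}\; S[p, m] \quad \text{subject to} \quad \int p(x)\,dx = 1, \qquad \int C(x)\,p(x)\,dx = d ,

where m(x) is the prior and S[p, m] is the adopted information distance; the constructive version replaces S[p, m] by a measure of the information content of p alone.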

Motivation: Physics Laws as Rules of Inference
Mechanics: free particle (T. Toffoli '98); particle in a force field (*).
Electrodynamics: free field (*); field with sources (*).
Thermodynamics: entropy (E. Jaynes 1957); Fisher information (Frieden, Plastino, Plastino & Soffer 1999); general criterion (Plastino & Plastino 1997, T. Yamano 2000).
Quantum Mechanics: non-relativistic (Frieden '89, ?); relativistic (?).
Quantum Field Theory: (?).
Why would universal physics laws exist? Who is enforcing them? The more universal a law is, the less likely it is to actually exist. The laws of thermodynamics were once thought to have universal validity but turned out to express little more than handling one's ignorance in a relatively honest way (Jaynes '57). Moreover, the most fundamental aspects of the laws of thermodynamics are insensitive to the particular choice of the information distance!

Problems with the Standard Characterizations
Shannon-Jaynes entropy: problems with continuous variables; in hindsight, similar problems even with discrete variables; the constraints are not likely to be in the form of true average values.
Shannon-Jaynes distance (Shore & Johnson): Karbelkar '86 points out that there is a problem with the constraints as average values; Uffink '95 takes issue with S&J's additivity axiom, which I believe is misguided (Caticha '06).
The real problem with S&J's derivation: S&J prove that the correct form of the relevant functional is H = g(F), where g(.) is a monotonic function. Then, however, they declare H and F equivalent and require additivity for F only, thus excluding the possibility that F factorizes and g(xy) = g(x) + g(y). This ignored possibility is precisely the general Renyi case!
Ván's derivation: apparently the first to consider dependence also on the derivatives of p(x); derives the functional as a linear combination of an S&J term and a Fisher-information term; the derivation is deficient, handling functionals as if they were functions.
A new characterization from scratch is highly desirable!
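For reference, the "ignored possibility" is realized by the standard Renyi relative entropy (textbook form, not necessarily the talk's notation):

D_{\nu}(p\,\|\,m) \;=\; \frac{1}{\nu-1}\,\ln \int p(x)^{\nu}\, m(x)^{1-\nu}\, dx .

Here F[p,m] = \int p^{\nu} m^{1-\nu}\,dx factorizes over independent subsystems, and g(x) = \ln(x)/(\nu-1) satisfies g(xy) = g(x) + g(y), so H = g(F) is additive even though F itself is not.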

Inadmissible and Admissible Variations
Variations of this kind should not be allowed because they are physically insignificant, yet they are included if we write [expression on slide]. Excluding the former and allowing the latter variations is achieved by writing [expression on slide].

Axioms & Consequences
Locality: [equation on slide]
Expandability: [equation on slide]
Additivity (simple): [equation on slide]
Information "distance" interpretation: [equation on slide]
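The slide's equations are not reproduced in the transcript; for orientation, axioms of this type are usually written along the following lines (a hedged paraphrase in generic notation, not the talk's exact statements):

\text{Locality:}\quad S[p, m] \;=\; \int dx\; F\big(p(x), \nabla p(x), m(x), \nabla m(x), x\big),
\qquad \text{(an integral of a local density, here allowed to depend on derivatives)}

\text{Additivity (simple):}\quad S[p_1 p_2,\; m_1 m_2] \;=\; S[p_1, m_1] + S[p_2, m_2] \quad \text{for independent subsystems,}

\text{Distance interpretation:}\quad S[p, m] \;\ge\; 0, \quad \text{with equality if and only if } p = m .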

Axiom (ii): Conditional probabilities

Axiom (ii): Conditional probabilities (continued)
The term with ln m can cause trouble with reproducing Kolmogorov's rule unless either m = const or else f_{,2} = 0, which implies c_0 = 0.

General Consequences of Additivity
a) [equation on slide]
b) [equation on slide]

Particular Parameter Choices
ν = 1: (Fisher distance)
c_0 = 0, c_1 = (ν−1)^{-1}: (Renyi distance)
c_0 << 1: c_0 = b(ν−1), c_1 = a(ν−1)^{-1}, ν → 1: (Ván-type distance)
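For orientation, the "Fisher distance" case is conventionally the relative Fisher information (standard form, possibly differing from the talk's normalization):

I(p\,\|\,m) \;=\; \int p(x)\,\Big|\nabla \ln \frac{p(x)}{m(x)}\Big|^{2}\, dx ,

which vanishes exactly when p and m coincide up to normalization.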

Variational Equation

Ván-Type Updating Rule
Ván-type distance: [equation on slide]
Ván-type updating rule: [equation on slide], a non-linear Schrödinger equation of the Bialynicki-Birula–Mycielski type.
If μ(x) is determined at an earlier stage of updating from a prior μ_0(x) and a constraint C_0(x), then [equation on slide] and any subsequent constraints are additive.
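For reference, the Bialynicki-Birula–Mycielski equation referred to here is the Schrödinger equation with a logarithmic nonlinearity (standard form; the coupling b and potential V are generic symbols, not the talk's):

i\hbar\,\frac{\partial \psi}{\partial t} \;=\; -\frac{\hbar^{2}}{2m}\,\nabla^{2}\psi \;+\; V(x)\,\psi \;-\; b\,\ln\!\big(|\psi|^{2}\big)\,\psi .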

Fisher Updating Rule
Fisher distance: [equation on slide]
Fisher updating rule: [equation on slide], a Schrödinger equation in which the "quantum potential" of the prior adds to the constraint.
If μ(x) is determined at an earlier stage of updating from a prior μ_0(x) and a constraint C_0(x), then [equation on slide] and any subsequent constraints are additive.
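The "quantum potential" of a density μ(x) is usually understood in the Bohmian sense (standard form, constants shown only for illustration):

Q_{\mu}(x) \;=\; -\frac{\hbar^{2}}{2m}\,\frac{\nabla^{2}\sqrt{\mu(x)}}{\sqrt{\mu(x)}} .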

Shannon-Jaynes Updating Rule
Shannon-Jaynes distance: [equation on slide]
Shannon-Jaynes updating rule: the usual exponential-factor updating rule.
If μ(x) is determined at an earlier stage of updating from a prior μ_0(x) and a constraint C_0(x), then [equation on slide] and any subsequent constraints are additive.
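In standard notation (a sketch of the usual conventions, not necessarily the talk's normalization), the Shannon-Jaynes (Kullback-Leibler) distance and the resulting exponential update read:

S[p\,\|\,m] \;=\; \int p(x)\,\ln\frac{p(x)}{m(x)}\,dx , \qquad
p(x) \;=\; \frac{1}{Z}\, m(x)\, e^{-\lambda C(x)}, \quad Z = \int m(x)\, e^{-\lambda C(x)}\,dx ,

with the Lagrange multiplier λ fixed by the constraint \int C(x)\,p(x)\,dx = d.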

Uniqueness: Consistency with Probability Theory (i)
Having prior knowledge about the joint distribution of x and y in the form m(x,y) includes knowing the conditional distribution of x given y. If we now obtain some piece of information about y, e.g. a constraint fixing some average over y to a value d, we could proceed in two ways:
a) start with the prior m(y) and the data constraint and obtain p(y);
b) apply the learning rule to the joint distribution, updating m(x,y) to p(x,y).
Consistency requires that the marginal of the updated joint distribution is the same as p(y). This is the same as requiring that the conditional distribution should remain unchanged.
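In symbols (a minimal restatement of the two requirements just described, in generic notation):

\int p(x, y)\, dx \;=\; p(y) \qquad \text{and, equivalently,} \qquad p(x \mid y) \;=\; m(x \mid y),

i.e. updating the joint distribution on information about y alone must reproduce the update of the marginal and leave the conditional of x given y untouched.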

Consistency upon updating: Example 1
[Diagram: Researchers A and B start from the same prior m(x,y) and update along different routes.] Consistency upon updating requires that both routes give the same result [equation on slide].

Uniqueness: Consistency with Probability Theory (ii)
A sufficient (but not necessary) condition is the "strong additivity" of S [equation on slide]. A sufficient (and probably necessary) condition for the above is that the updating amounts to multiplying the prior by a function of the constraint [equation on slide]. This is basically one of Khinchin's axioms discriminating Renyi's ν = 1 against all other values, thus singling out the Shannon-Jaynes distance. This condition rules out the gradient-dependent terms (i.e. pins down the value of c_0 to zero), but does not restrict Renyi's ν.
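For orientation, "strong additivity" is conventionally written as a chain-rule property, and the multiplicative-update condition can be sketched as follows (standard forms, not necessarily the talk's exact statements; f below is a generic function):

S\big[p(x,y)\,\|\,m(x,y)\big] \;=\; S\big[p(y)\,\|\,m(y)\big] \;+\; \int p(y)\, S\big[p(x\mid y)\,\|\,m(x\mid y)\big]\, dy ,
\qquad p(x) \;\propto\; m(x)\, f\big(C(x)\big) .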

Consistency upon updating: Example 1 (continued)
[Diagram: Researchers A, B and C start from the same prior m(x,y) and update along different routes.] Consistency upon updating requires that all routes give the same result [equation on slide].

Consistency upon updating: Example 1 (continued)
Renyi with c_0 = 0: [equation on slide]. This excludes all ν ≠ 1 and leaves us with the Shannon-Jaynes distance!

Consistency upon updating: Example 2
Consistency upon updating requires that [equation on slide].

Intermediate Summary
The Shannon-Jaynes relative entropy is singled out, after all, by consistency requirements, even when dependence on the derivatives is allowed, as THE information-distance measure to be used in a learning rule.
In practical applications, however, one rarely has prior knowledge in the form of a prior distribution; it is more likely to be in the form of data. In such cases the updating rule is useless, but one can still use the derived unique information distance to identify least-informative priors.

Least-Informative Priors – Constructive Approach
Identify a generic operation invoking "information loss".
Notice that the information loss should be (at least for small information content) proportional to the information content, unless negative information is tolerated.
Look for the least-informative prior (under the given data constraints) as the one least sensitive to the operation above. The sensitivity should be defined consistently with the adopted information distance.
The generic operation is coarse-graining; the information distance is the Shannon-Jaynes one.
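One common way to make "coarse-graining" concrete, consistent with the next slide's identification of coarse-graining with added noise, is convolution with a narrow smoothing kernel (an illustrative sketch; the Gaussian kernel and width ε are assumptions, not the talk's definition):

p_{\varepsilon}(x) \;=\; \int G_{\varepsilon}(x - y)\, p(y)\, dy , \qquad
G_{\varepsilon}(u) \;=\; \frac{1}{\sqrt{2\pi\varepsilon}}\, e^{-u^{2}/(2\varepsilon)} ,

which is the density of x plus independent Gaussian noise of variance ε.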

Coarse-Graining = Adding Noise
Coarse-graining is equivalent to adding noise: [equation on slide].
"Distance" between the original and the "noisy" probability distributions: [equation on slide].
For the Shannon-Jaynes distance: [equation on slide] (just another instance of the de Bruijn identity).
Essentially the same result holds for general Renyi ν, as well as for the symmetrized SJ distance or the symmetric Bernardo-Rueda "intrinsic discrepancy".
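For reference, the classical de Bruijn identity, of which the slide's (unreproduced) result is said to be an instance, relates differential entropy under Gaussian smoothing to Fisher information (standard statement):

\frac{d}{dt}\, h\big(X + \sqrt{t}\,Z\big) \;=\; \tfrac{1}{2}\, J\big(X + \sqrt{t}\,Z\big), \qquad Z \sim \mathcal{N}(0, 1)\ \text{independent of } X ,

where h is the differential entropy and J the Fisher information of the smoothed variable.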

Least-Informative Priors: Constructive MAXENT
Least-informative prior: the probability density least sensitive to coarse-graining, subject to all available data constraints: [equation on slide].
Putting [the substitution on the slide, presumably p = ψ²] here and varying ψ, we obtain the Euler-Lagrange equation [equation on slide]. This is a Schrödinger equation; non-relativistic Quantum Mechanics appears to be based on applying constructive MAXENT to a system with given average kinetic energy, i.e. temperature!
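One standard way such a construction goes (a hedged sketch in the spirit of Frieden's and Reginatto's derivations; λ and E are Lagrange multipliers introduced here for illustration, not the talk's symbols): minimize the Fisher information of p subject to normalization and an average-value constraint on a potential V. Writing p = ψ² with real ψ,

I[p] \;=\; \int \frac{|\nabla p|^{2}}{p}\, dx \;=\; 4 \int |\nabla \psi|^{2}\, dx ,

and the Euler-Lagrange equation of\; 4\!\int\! |\nabla\psi|^{2} dx \;+\; \lambda\!\int\! V \psi^{2} dx \;-\; E\!\int\! \psi^{2} dx \;is

-\nabla^{2}\psi(x) \;+\; \tfrac{\lambda}{4}\, V(x)\,\psi(x) \;=\; \tfrac{E}{4}\,\psi(x) ,

a stationary Schrödinger-type equation with the multipliers playing the role of coupling and energy.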

Least-Informative Priors: Consistency Condition
[Diagram: MAXENT A takes a prior plus data to a posterior; MAXENT B takes data to a least-informative prior; the information "distance" S(p_1, p_2) is measured between two distributions p_1 and p_2.]
The posterior is the closest distribution to the prior subject to the data constraints; "closest" is with regard to the information distance. The LI prior is the one least sensitive to coarse-graining subject to the data constraints; "least sensitive" is with regard to the information distance.

Least-Informative Priors: Consistency Condition
[Diagram: MAXENT A applied with data yields P_2(x); MAXENT B applied with no data, followed by MAXENT B with data, yields P_1(x).] Consistency condition: [equation on slide].
Checks OK? In some cases, maybe! In general, not! MAXENT B with no data = uninformative prior = problem!

Conclusion: The MAXENT Approach
[Diagram: MAXENT A (updating rule) takes a prior m(x) and a constraint C(x) to a posterior p(x); MAXENT B (constructive rule) takes a constraint C(x) to a distribution p(x).]
The updating rule is based on the Shannon-Jaynes relative entropy. The constructive rule is based on Fisher information for the "location" parameter, used as a measure of sensitivity to coarse-graining.

Summary
An axiomatic characterization of both constructive and learning MAXENT is possible, based on the notion of information distance.
The usual requirements of smoothness, consistency with probability theory and additivity narrow down the possible form of the information distance to a Renyi-type one with an exponential Fisher-type modification factor.
The additional requirement that different ways of feeding the same information into the method produce the same results pins down the value of Renyi's parameter to 1 and excludes the derivative terms, singling out the Shannon-Jaynes distance as the correct one.
Since the information distance does not measure information content, I propose to take the information content to be proportional to the sensitivity of the probability distribution to coarse-graining. If this is accepted, a unique constructive procedure for priors subject to the available information results, involving the minimization of Fisher information under constraints.
The constructive procedure, if taken seriously, has far-reaching implications for the nature of those physics laws which can be formulated as variational principles. They may turn out to have very little to do with physics per se.

Thoughts
If we agree to use probability distributions (wave functions) to encode all the knowledge we have about a system, there is no way of treating new information in the form of a constraint in any exclusive way: once the PD gets updated, all the mean values it predicts are treated on the same footing. Then obtaining new information basically means feeding in data that are generally inconsistent with what the PD already predicts for the particular observable. The behavior of MAXENT makes sense: the rule gives preference to the most up-to-date "data". This would be an insurmountable conceptual difficulty if we could actually obtain information in the form of mean values, which we can't.

Fisher Distance Proper and Quantum Mechanics
QM connection: [equation on slide]
Distance based on the Fisher metric: [equation on slide]
Minimizing the Fisher-metric distance = maximizing the QM overlap.
QM inference based on the Hilbert-space metric (the same as the Fisher metric) is equivalent to using MAXENT with the ν = 1/2 Renyi distance, and thus it is guaranteed to eventually produce inconsistencies, i.e. paradoxes!
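The link between the ν = 1/2 Renyi distance and the quantum-mechanical overlap can be made explicit (standard identities, written here for orientation): with real non-negative amplitudes ψ_p = √p and ψ_q = √q,

D_{1/2}(p\,\|\,q) \;=\; -2\,\ln \int \sqrt{p(x)\,q(x)}\;dx \;=\; -2\,\ln \langle \psi_p \mid \psi_q \rangle ,

so minimizing D_{1/2} is the same as maximizing the overlap; the corresponding Riemannian geometry on probability space is the Fisher-Rao metric.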

Variation for General Functional Form