Fisher Information and Applications (MLCV Reading Group, 3 Mar 2016)

Motivation
Connections to statistics, information theory, and differential geometry.
Broad interest from different research communities, and many potential applications, e.g. machine learning, computer vision, neuroscience, economics.
Below: a few image-based shape applications.

Detect Circles in Iris Images
A Fisher-Rao Metric for Curves Using the Information in Edges, S. J. Maybank, Journal of Mathematical Imaging and Vision, 2015.

Detect Lines in Paracatadioptric Images
A Fisher-Rao Metric for Paracatadioptric Images of Lines, S. J. Maybank, S. Ieng, R. Benosman, IJCV, 2012.

Shape Analysis
Figure copied from M. Bauer, M. Bruveris, P. W. Michor, Uniqueness of the Fisher-Rao metric on the space of smooth densities, arXiv preprint.
And many, many other applications ...

Our Focus
Basics of Fisher information
Connection to MLE, Cramer-Rao bound
Connections to KL divergence, mutual information
Connection to differential geometry
Applications? Papers in NIPS, IJCV, ...
Warning: this part is somewhat equation-heavy.

Fisher Information
A parametric family of distributions $\{p(x;\theta)\}$, each member indexed by the parameter $\theta$.
Denote the log-likelihood $\ell(\theta; x) = \log p(x;\theta)$.
Denote the 1st-order derivative (the score) $s(\theta; x) = \partial \ell(\theta; x)/\partial\theta$ for 1-D $\theta$, and in general $\nabla_\theta \ell(\theta; x)$.
The Fisher information associated with $\theta$ is $I(\theta) = \mathbb{E}\big[\big(\partial \ell(\theta; x)/\partial\theta\big)^2\big]$.
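To make the definition concrete, here is a minimal NumPy sketch (my own example, not from the slides) that estimates the Fisher information of a Bernoulli distribution as the variance of the score and compares it with the closed form $I(p) = 1/(p(1-p))$.

```python
import numpy as np

def bernoulli_score(x, p):
    # d/dp log p(x; p), with log-likelihood x*log(p) + (1-x)*log(1-p)
    return x / p - (1 - x) / (1 - p)

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=200_000)

# Fisher information as the variance of the score (the score has zero mean)
I_mc = np.var(bernoulli_score(x, p_true))
I_exact = 1.0 / (p_true * (1 - p_true))

print(f"Monte-Carlo I(p) = {I_mc:.3f}, closed form = {I_exact:.3f}")
```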

Fisher Information (single-parameter scenario)
Fisher information is the variance of the score: since the score has zero mean, $I(\theta) = \mathrm{Var}[s(\theta;x)] = \mathbb{E}[s(\theta;x)^2]$.
It can be interpreted as the curvature of the log-likelihood function: $I(\theta) = -\mathbb{E}\big[\partial^2 \ell(\theta;x)/\partial\theta^2\big]$.
Theorem (Cramer-Rao bound): the variance of any unbiased estimator $\hat\theta$ built from an i.i.d. random sample of size $n$ from the underlying distribution $p(x;\theta)$ is bounded by $\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,I(\theta)}$.

Fisher Information (single-parameter scenario)
Remarks:
– When the curvature is low (an almost flat log-likelihood curve), expect large deviation of $\hat\theta$ from the true $\theta$.
– When the curvature is high (a peaky log-likelihood), expect small deviation of $\hat\theta$ from the true $\theta$.
Example: consider the Poisson distribution $p(x;\lambda) = \lambda^x e^{-\lambda}/x!$.
– We have $\ell(\lambda;x) = x\log\lambda - \lambda - \log x!$ and $\partial^2\ell/\partial\lambda^2 = -x/\lambda^2$, thus $I(\lambda) = \mathbb{E}[x]/\lambda^2 = 1/\lambda$, and the Cramer-Rao bound is $\mathrm{Var}(\hat\lambda) \ge \lambda/n$.
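A small simulation sketch (mine, under the Poisson model above) illustrating the bound: the MLE of the Poisson rate is the sample mean, and its variance over repeated samples is close to the Cramer-Rao bound $\lambda/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 4.0, 50, 20_000

# MLE of a Poisson rate is the sample mean; estimate its variance by simulation
samples = rng.poisson(lam, size=(trials, n))
lam_hat = samples.mean(axis=1)

var_mle = lam_hat.var()
cr_bound = lam / n          # 1 / (n * I(lambda)) with I(lambda) = 1/lambda

print(f"Var(lambda_hat) = {var_mle:.4f}, Cramer-Rao bound = {cr_bound:.4f}")
```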

Fisher Information (multi-parameter scenario)
Now $\theta \in \mathbb{R}^m$ and the score is the gradient $\nabla_\theta \ell(\theta;x)$.
Remarks:
– The score function has zero mean under $p(x;\theta)$, i.e. $\mathbb{E}[\nabla_\theta \ell(\theta;x)] = 0$.
– Also, the Fisher information matrix is $I(\theta) = \mathbb{E}\big[\nabla_\theta \ell\,\nabla_\theta \ell^{\top}\big] = \mathrm{Cov}[\nabla_\theta \ell]$.
– We have $I(\theta) = -\mathbb{E}\big[\nabla^2_\theta \ell(\theta;x)\big]$.

Fisher Information (multi-parameter scenario)
Theorem (multi-parameter Cramer-Rao bound): the covariance of any unbiased vector-valued estimator $\hat\theta$ built from an i.i.d. random sample of size $n$ from the underlying distribution $p(x;\theta)$ is bounded by $\mathrm{Cov}(\hat\theta) \succeq \frac{1}{n} I(\theta)^{-1}$ (in the positive-semidefinite ordering).
Example (exponential family of distributions): $p(x;\theta) = h(x)\exp\big(\theta^{\top}T(x) - A(\theta)\big)$.
Since $\ell(\theta;x) = \theta^{\top}T(x) - A(\theta) + \log h(x)$, we have $\nabla_\theta \ell = T(x) - \nabla A(\theta)$ and $\nabla^2_\theta \ell = -\nabla^2 A(\theta)$. Then $I(\theta) = \nabla^2 A(\theta)$, the Hessian of the log-partition function.
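As an illustration of the exponential-family result, the sketch below (my own choice of a 3-category categorical distribution in natural parameters, not from the slides) checks that the Hessian of the log-partition function $A$ matches the covariance of the sufficient statistics, i.e. the Fisher information matrix.

```python
import numpy as np

# Categorical distribution over {0, 1, 2} with natural parameters (theta_1, theta_2),
# category 0 as reference: A(theta) = log(1 + exp(theta_1) + exp(theta_2)).
theta = np.array([0.4, -0.7])

def probs(theta):
    w = np.concatenate(([1.0], np.exp(theta)))
    return w / w.sum()

p = probs(theta)                                   # [p0, p1, p2]

# Fisher information = Hessian of A = Cov of sufficient statistics T(x) = (1[x=1], 1[x=2])
I_hess = np.diag(p[1:]) - np.outer(p[1:], p[1:])

rng = np.random.default_rng(0)
x = rng.choice(3, p=p, size=200_000)
T = np.stack([(x == 1), (x == 2)]).astype(float)   # shape (2, N)
I_cov = np.cov(T)

print("Hessian of A:\n", I_hess)
print("Cov of sufficient stats (Monte Carlo):\n", I_cov)
```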

Our Focus
Basics of Fisher information
Connection to MLE, Cramer-Rao bound
Connections to KL divergence, mutual information
Connection to differential geometry
Applications? Papers in NIPS, IJCV, ...

KL-divergence
Entropy: $H(p) = -\sum_x p(x)\log p(x)$.
Relative entropy (i.e. KL-divergence): $\mathrm{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$.
Remark:
– $\mathrm{KL}(p\,\|\,q) \ge 0$, with equality iff $p = q$.
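A tiny numerical example (mine, with arbitrary small discrete distributions) computing entropy and KL divergence; it also shows that KL is not symmetric.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log(p))      # H(p)
kl_pq = np.sum(p * np.log(p / q))     # KL(p || q) >= 0
kl_qp = np.sum(q * np.log(q / p))     # note: KL is not symmetric

print(f"H(p) = {entropy:.4f}, KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
```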

KL-divergence of the Exponential Family
Recall $p(x;\theta) = h(x)\exp\big(\theta^{\top}T(x) - A(\theta)\big)$.
We have $\mathrm{KL}\big(p_\theta\,\|\,p_{\theta'}\big) = A(\theta') - A(\theta) - (\theta' - \theta)^{\top}\nabla A(\theta)$.
Taylor expansion around $\theta' = \theta$ gives $\mathrm{KL}\big(p_\theta\,\|\,p_{\theta'}\big) \approx \tfrac{1}{2}(\theta' - \theta)^{\top} I(\theta)(\theta' - \theta)$:
KL is roughly quadratic for exponential-family distributions, where the quadratic form is given by the Fisher information matrix.
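A quick check of the quadratic approximation (my own sketch, using the Bernoulli family in its natural log-odds parameter): for nearby parameters the exact KL is close to $\tfrac{1}{2}\,\Delta\theta^2\, I(\theta)$.

```python
import numpy as np

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta, dtheta = 0.5, 0.1                 # natural parameters (log-odds)
p, q = sigmoid(theta), sigmoid(theta + dtheta)

# In natural parameters, I(theta) = A''(theta) with A(theta) = log(1 + e^theta), i.e. p(1-p)
fisher = p * (1 - p)
kl_exact = kl_bernoulli(p, q)
kl_quad = 0.5 * fisher * dtheta**2

print(f"exact KL = {kl_exact:.6f}, quadratic approx = {kl_quad:.6f}")
```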

Mutual Information
By definition, $I(x, y) = \mathrm{KL}\big(p(x,y)\,\|\,p(x)p(y)\big) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$.
Remarks:
– We have $I(x, y) \ge 0$.
– $I(x, y) \to 0$ when $p(x,y) \to p(x)p(y)$.
– $I(x, y)$ grows toward its maximum as $p(x,y)$ moves farther away from $p(x)p(y)$.
– Fisher information provides a 2nd-order approximation of MI (via the quadratic expansion of KL above).
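As a concrete illustration (mine, with an arbitrary 2x3 joint table), mutual information can be computed directly as the KL divergence between the joint and the product of its marginals.

```python
import numpy as np

# Joint distribution p(x, y) over a 2x3 table (sums to 1)
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)

# I(x, y) = KL( p(x, y) || p(x) p(y) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(x, y) = {mi:.4f} nats")
```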

Our Focus
Basics of Fisher information
Connection to MLE, Cramer-Rao bound
Connections to KL divergence, mutual information
Connection to differential geometry
Applications? Papers in NIPS, IJCV, ...

Fisher Information Metric
A statistical manifold: $\mathcal{M} = \{p(x;\theta) : \theta \in \Theta \subset \mathbb{R}^m\}$, a family of distributions viewed as a manifold with coordinates $\theta$.
For the exponential family there are two coordinate systems: the natural parameters $\theta$ and the expectation parameters $\eta$, with $\eta = \mathbb{E}_\theta[T(x)] = \nabla A(\theta)$.
Remarks:
– The Fisher information matrix $G(\theta) = I(\theta)$ defines a metric on $\mathcal{M}$, since $\mathrm{KL}(p_\theta\,\|\,p_{\theta + d\theta}) \approx \tfrac{1}{2}\,d\theta^{\top} G(\theta)\, d\theta$ for small $d\theta$.
– $G(\theta)$ plays the role of the Riemannian metric tensor of diff. geometry.

Fisher Information Metric
Remarks:
– $G(\theta)$ is a positive-definite matrix of size $m \times m$.
– We also have $G(\theta) = -\mathbb{E}\big[\nabla^2_\theta \ell(\theta;x)\big]$.
– For the exponential family it is the Hessian of $A$: $G(\theta) = \nabla^2 A(\theta)$.
– It is also the Jacobian matrix of the coordinate change $\theta \mapsto \eta$, since $\partial\eta/\partial\theta = \nabla^2 A(\theta) = G(\theta)$.
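The remarks above can be verified numerically. The sketch below (my own, for the Bernoulli family in its natural parameter, with an arbitrary finite-difference step) checks that the Fisher metric, the Hessian of $A$, and the Jacobian of the mean-parameter map $\eta(\theta) = \nabla A(\theta)$ coincide.

```python
import numpy as np

def A(theta):
    # log-partition function of a Bernoulli in natural (log-odds) parameterization
    return np.log1p(np.exp(theta))

def eta(theta):
    # expectation parameter eta = dA/dtheta = E[x] = sigmoid(theta)
    return 1.0 / (1.0 + np.exp(-theta))

theta, h = 0.8, 1e-4
p = eta(theta)

g_metric = p * (1 - p)                                         # Fisher metric G(theta)
hess_A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2   # Hessian of A (finite diff.)
jac_eta = (eta(theta + h) - eta(theta - h)) / (2 * h)          # Jacobian of eta(theta)

print(f"G = {g_metric:.6f}, Hess A = {hess_A:.6f}, d eta/d theta = {jac_eta:.6f}")
```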

Fisher Information Metric (information divergence)
Consider the Bregman divergence generated by a strictly convex function $F$: $D_F(\theta, \theta') = F(\theta) - F(\theta') - (\theta - \theta')^{\top}\nabla F(\theta')$.
By Taylor expansion of $F$, $D_F(\theta, \theta') \approx \tfrac{1}{2}(\theta - \theta')^{\top}\nabla^2 F(\theta')(\theta - \theta')$ for nearby arguments; with $F = A$ the quadratic form is the Fisher information matrix, and $D_A(\theta', \theta) = \mathrm{KL}(p_\theta\,\|\,p_{\theta'})$.
This explains why learning is m-projection (i.e. the model sits in the 2nd argument of KL).
It can be shown that KL is the only valid Bregman divergence here.
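To make the Bregman-divergence connection concrete, here is a small sketch (my own, again for the Bernoulli family) verifying that the Bregman divergence generated by the log-partition function $A$ reproduces the KL divergence between the corresponding distributions, with the model parameter appearing as the first Bregman argument and the second KL argument.

```python
import numpy as np

def A(theta):
    return np.log1p(np.exp(theta))          # Bernoulli log-partition function

def grad_A(theta):
    return 1.0 / (1.0 + np.exp(-theta))     # mean parameter (sigmoid)

def bregman_A(t1, t2):
    # Bregman divergence D_A(t1, t2) = A(t1) - A(t2) - (t1 - t2) * A'(t2)
    return A(t1) - A(t2) - (t1 - t2) * grad_A(t2)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, theta2 = 0.3, -1.1
p, q = grad_A(theta), grad_A(theta2)

# KL(p_theta || p_theta') equals the Bregman divergence D_A(theta', theta)
print(f"KL = {kl_bernoulli(p, q):.6f}, Bregman D_A = {bregman_A(theta2, theta):.6f}")
```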

Our Focus
Basics of Fisher information
Connection to MLE, Cramer-Rao bound
Connections to KL divergence, mutual information
Connection to differential geometry
Applications? Papers in NIPS, IJCV, ...

Applications (a very partial list)
Optimal Neural Population Codes for High-dimensional Stimulus Variables, Z. Wang, A. Stocker, D. Lee, NIPS'13.
A Fisher-Rao Metric for Curves Using the Information in Edges, S. J. Maybank, Journal of Mathematical Imaging and Vision, 2015.
A Fisher-Rao Metric for Paracatadioptric Images of Lines, S. J. Maybank, S. Ieng, R. Benosman, IJCV'12.
Information Geometry and Minimum Description Length Networks, Ke et al., ICML'15.
Space-Time Local Embeddings, Ke et al., NIPS'15.
An Information Geometry of Statistical Manifold Learning, Ke et al., ICML'14.
Scaling Up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix, Grosse & Salakhutdinov, ICML'15.