Information Based Criteria for Design of Experiments

Presentation transcript:

Information Based Criteria for Design of Experiments -AC

Design of Experiments
- Initial design of experiments: random, factorial, and Latin hypercube designs
- Sequential design of experiments: information-based methods
- Why does choosing the point of highest variance to build the model work?
- Selecting the point with the maximal prediction variance is equivalent to maximizing the Shannon information / mutual information
- This is all detailed in MacKay '92 ("Information-Based Objective Functions for Active Data Selection")

Information and Entropy
- Proposed by Shannon in 1948, extended to continuous variables by Jaynes
- The entropy of a random variable is…
- Also thought of as the expectation of "surprise"
- Typically dealt with in communications, spoken of in terms of bits
- Less probable events carry more surprise
- "Information" is cast in a way that describes changes/differences in entropy: the change in the expectation of surprise is a measure of information
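
The entropy expression itself appears only as an image on the original slide; the standard definitions it refers to (Shannon entropy for discrete variables, differential entropy for continuous ones) are

    H(X) = -\sum_i p(x_i)\,\ln p(x_i)
    \qquad\text{and}\qquad
    H(X) = -\int p(x)\,\ln p(x)\,dx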

Why Use the Variance to Find the Next Point? (MacKay '92)
- Prior (posterior) distribution of the weights given the data:
- Distribution as additional data is observed:
- Expectation of the cross entropy (expected information gained from the observed y*):
- Note: this is the same as the Shannon information / mutual information
- This is also the Kullback–Leibler divergence: the amount of information gained when revising prior beliefs to the posterior
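
The equations on this slide are images in the original. In the setting MacKay describes, the quantity of interest is the expected Kullback–Leibler divergence from the current posterior p(w | D) to the updated posterior p(w | D, y*), averaged over the predictive distribution of the not-yet-observed y*, which equals the mutual information between the weights and the new observation:

    E_{y^*}\!\left[\int p(\mathbf{w}\mid D, y^*)\,
        \ln\frac{p(\mathbf{w}\mid D, y^*)}{p(\mathbf{w}\mid D)}\,d\mathbf{w}\right]
    \;=\; I(\mathbf{w};\,y^*)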

Evaluation of the Information
- We split up the following…
- The entropy should decrease as data are added
- Some measure over w is commonly defined for the entropy (unused here)
- Note: the entropy of the updated posterior depends on data we do not yet have, y*
- Note: the entropy is sometimes taken with respect to some measure m(w); this can be taken to be uniform, m(w) = m, but the point of expanding the cross entropy is also to show that this measure has no consequence on the final result for the information…
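
Written out (the slide's own expression is an image), the split is the expected drop in entropy between the current posterior and the updated one,

    \Delta S \;=\; S\big[p(\mathbf{w}\mid D)\big]
      \;-\; E_{y^*}\Big[S\big[p(\mathbf{w}\mid D, y^*)\big]\Big],
    \qquad
    S[p] \;=\; -\int p(\mathbf{w})\,\ln\frac{p(\mathbf{w})}{m(\mathbf{w})}\,d\mathbf{w}.

Because E_{y*}[p(w | D, y*)] = p(w | D), the m(w) terms cancel in this difference, which is the sense in which the measure has no consequence on the final result.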

Analytical Evaluation of Gaussian Entropy
- The prior term can be evaluated analytically when using Gaussian distributions (here we will use m(w) = 1)
- However, the updated posterior must be evaluated/approximated; there are a couple of ways to accomplish this
- Note: the entropy of a Gaussian posterior depends only on the precision matrix
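
The analytic expression (an image on the slide) is the entropy of a multivariate Gaussian with precision matrix A, i.e. covariance A^{-1}:

    S \;=\; \tfrac{1}{2}\ln\det\!\big(2\pi e\,A^{-1}\big)
      \;=\; \tfrac{n}{2}\big(1 + \ln 2\pi\big) \;-\; \tfrac{1}{2}\ln\det A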

Approximation of Entropy
- The entropy of the prior is given by the Gaussian expression above
- For the 1-D Gaussian, n = 1, this follows the intuition: if the variance of a Gaussian distribution is higher, there is a higher expectation that we will be surprised
- One approximation to the entropy at the unobserved step is a simple approximation to the updated "covariance matrix" (by adding the contribution of a possible point x*)
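
The two expressions on this slide are images in the original. For n = 1 the Gaussian entropy reduces to a function of the variance alone, and, following MacKay '92 for a linearized model with homoscedastic noise of precision beta and output sensitivity g = grad_w y(x*), the precision matrix after a hypothetical measurement at x* is approximated by a rank-one update:

    S \;=\; \tfrac{1}{2}\ln\big(2\pi e\,\sigma^2\big),
    \qquad
    A_{N+1} \;\approx\; A_N + \beta\,\mathbf{g}\,\mathbf{g}^{\mathsf T},
    \quad \mathbf{g} = \nabla_{\mathbf{w}}\,y(\mathbf{x}^*)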

Evaluation of Information
- The change in entropy between two steps is now approximated using the "matrix determinant lemma"
- This term is the prediction variance…
- If we maximize the variance, we maximize the information gain…
- This is a "D-optimal design"
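
Spelled out (the slide's derivation is an image), the matrix determinant lemma turns the approximate change in entropy into a monotone function of the marginal prediction variance sigma_y^2(x*) = g^T A_N^{-1} g:

    \Delta S \;\approx\; \tfrac{1}{2}\ln\det A_{N+1} - \tfrac{1}{2}\ln\det A_N
    \;=\; \tfrac{1}{2}\ln\frac{\det\!\big(A_N + \beta\,\mathbf{g}\mathbf{g}^{\mathsf T}\big)}{\det A_N}
    \;=\; \tfrac{1}{2}\ln\!\big(1 + \beta\,\mathbf{g}^{\mathsf T}A_N^{-1}\mathbf{g}\big)

Since this is increasing in the prediction variance, choosing the candidate with the largest variance maximizes the expected information gain; equivalently, it maximizes the determinant of the updated precision matrix, which is the D-optimality criterion.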

Notes
- This is a general framework, assuming homoscedasticity
- There may be extensions to neural networks, which may need tweaking to fit properly
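
A minimal Python sketch of the selection loop described above, for a Bayesian linear model. The Gaussian basis functions, the values of the prior precision alpha and the noise precision beta, and the candidate grid are all illustrative assumptions, not taken from the slides.

    import numpy as np

    def features(x, centers, width=0.5):
        # Gaussian basis functions phi(x); an assumed model class, not from the slides.
        return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

    centers = np.linspace(0.0, 1.0, 8)
    alpha, beta = 1e-2, 25.0                 # assumed prior precision and (homoscedastic) noise precision

    candidates = np.linspace(0.0, 1.0, 200)  # pool of possible experiments x*
    Phi = features(candidates, centers)

    A = alpha * np.eye(len(centers))         # posterior precision matrix of the weights
    design = []                              # points selected so far

    for step in range(10):
        A_inv = np.linalg.inv(A)
        # Marginal prediction variance sigma_y^2(x*) = phi(x*)^T A^{-1} phi(x*) for each candidate
        pred_var = np.einsum('ij,jk,ik->i', Phi, A_inv, Phi)
        # Expected information gain 0.5*ln(1 + beta*sigma_y^2) is monotone in the variance,
        # so the max-variance candidate is the D-optimal choice
        best = int(np.argmax(pred_var))
        design.append(candidates[best])
        # Rank-one update of the precision matrix for the chosen point
        g = Phi[best]
        A = A + beta * np.outer(g, g)

    print("selected design points:", np.round(design, 3))

Because no observed y values enter the variance computation, this linear, homoscedastic case lets the whole design be planned before any measurements are taken; with heteroscedastic noise or a nonlinear model the precision update would have to be revisited.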

Backup Slides: Information Theory…

Information Entropy
- Proposed by Shannon in 1948, extended to continuous variables by Jaynes
- The information content is the negative logarithm of the probability density function (for continuous variables)
- Less probable events carry more surprise, or information
- The joint information of two independent events is additive
- The entropy of a variable is the expectation of its information content, also thought of as the expectation of "surprise"
- Note: the following assumes the probability density of the given variables is essentially zero outside of a given range
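
The formulas referred to here are images in the original; the corresponding standard statements are

    h(x) \;=\; -\ln p(x),
    \qquad
    h(x, y) \;=\; h(x) + h(y)\ \text{ for independent events},

with the entropy H(X) = E[h(x)] as given earlier.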

Types of Entropy
- Joint entropy
- Conditional entropy
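
The definitions (images on the slide) are, for continuous variables,

    H(X, Y) \;=\; -\iint p(x, y)\,\ln p(x, y)\,dx\,dy,
    \qquad
    H(X\mid Y) \;=\; -\iint p(x, y)\,\ln p(x\mid y)\,dx\,dy \;=\; H(X, Y) - H(Y)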

Mutual Information
- The mutual information is defined as the information gained about a random variable X by observing another random variable Y
- The conditional entropy can be thought of as the amount of information needed to describe a variable X once Y is given (if this entropy is small, Y contains a lot of information about X)
- We will use the concept of mutual information and extend it to the amount of information about parameters (X as placeholders) when an experimental output is observed (Y as placeholders)

Mutual Information
- Mutual information can be written…
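
The expressions this slide refers to are images in the original; the standard forms are

    I(X; Y) \;=\; H(X) - H(X\mid Y)
    \;=\; H(X) + H(Y) - H(X, Y)
    \;=\; \iint p(x, y)\,\ln\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy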