1 2016 METHODS IN COMPUTATIONAL NEUROSCIENCE COURSE
MARINE BIOLOGICAL LABORATORY, WOODS HOLE, MA
Brief introduction to probability theory, information measures, and latent variables
Uri Eden, BU Department of Mathematics and Statistics
August 1, 2016

2 Probability and statistics
Probability theory uses probability models to describe and predict features of experimental data. e.g., Assuming 55% of voters favor Clinton, what range of outcomes could we expect in a random poll of 1000 voters? Statistics uses data to estimate and infer features of a probability model that could have generated the data. e.g., If 550 of 1000 randomly polled voters favor Clinton, what is a range of reasonable values for the fraction of voters in the population who favor her?

3 A case study Iyengar and Liu (1997) examined the activity of retinal neurons grown in culture under constant light and environmental conditions. Spontaneous spiking activity of these neurons was recorded. The objective is to develop a probability model that accurately describes the stochastic structure of this activity; in particular, we seek a model of the interspike interval (ISI) distribution.


6 Discrete vs continuous random variables
Discrete variables can only take on values in a discrete (finite or countably infinite) set. e.g., the result of a coin flip, $X \in \{H, T\}$, or the number of spikes in an interval, $X \in \{0, 1, 2, \dots\}$.
Continuous variables take on values in continuous intervals. e.g., the phase of a rhythm at which a spike occurs, $X \in (-\pi, \pi]$, or the interspike interval between two spikes, $X \in [0, \infty)$.

7 Discrete variables are defined by a pmf or CDF
Probability mass function (pmf): $p(x) = \Pr(X = x)$, with $\sum_x p(x) = 1$.
Cumulative distribution function (CDF): $F(x) = \Pr(X \le x) = \sum_{x_i \le x} p(x_i)$.

8 Continuous variables are defined by a pdf or CDF
Probability density function (pdf): $f(x)$, with $\Pr(X = x) = 0$, $\Pr(a < X < b) = \int_a^b f(x)\,dx$, and $\int_x f(x)\,dx = 1$.
Cumulative distribution function (CDF): $F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(u)\,du$.

9 Expected values
Discrete variables: $E[X] = \sum_x x\,p(x)$ and $E[g(X)] = \sum_x g(x)\,p(x)$.
Continuous variables: $E[X] = \int_x x\,f(x)\,dx$ and $E[g(X)] = \int_x g(x)\,f(x)\,dx$.
In both cases, writing $\mu = E[X]$, the variance is $\mathrm{Var}(X) = E[(X - \mu)^2]$.

10 Question What probability model should we use to describe the retinal spiking data?

11 Nonparametric model: the empirical CDF
The empirical CDF (eCDF) is a function of $x$ giving the fraction of data less than or equal to $x$: $\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(x_i \le x)$.
Dvoretzky–Kiefer–Wolfowitz (DKW) inequality: $\Pr\left(\sup_x |\hat{F}(x) - F(x)| > \varepsilon\right) \le 2 e^{-2 n \varepsilon^2}$. Setting $\varepsilon = 1.36/\sqrt{n}$ makes the right-hand side 0.05, so the band $\hat{F}(x) \pm 1.36/\sqrt{n}$ covers the true CDF with probability at least 0.95.
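
A minimal sketch (not from the original slides) of the eCDF and its DKW confidence band in Python; the simulated exponential ISIs and the variable names (isis, eps) are illustrative stand-ins for the retinal data.

```python
import numpy as np

def ecdf(data):
    """Return sorted values and the empirical CDF evaluated at them."""
    x = np.sort(np.asarray(data, dtype=float))
    F_hat = np.arange(1, len(x) + 1) / len(x)
    return x, F_hat

# Simulated ISIs (exponential, mean 30.8 ms) standing in for the retinal data
rng = np.random.default_rng(0)
isis = rng.exponential(scale=30.8, size=1000)

x, F_hat = ecdf(isis)
n = len(isis)

# DKW 95% band: choose eps so that 2*exp(-2*n*eps**2) = 0.05  (eps ~ 1.36/sqrt(n))
eps = np.sqrt(np.log(2 / 0.05) / (2 * n))
lower, upper = np.clip(F_hat - eps, 0, 1), np.clip(F_hat + eps, 0, 1)
```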

12 Expected values of eCDF
The eCDF defines a discrete probability model that places probability mass $\frac{1}{n}$ at each observed data point, so
$E[X] = \sum_x x\,\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$
$E[g(X)] = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$
$\mathrm{Var}(X) = \sum_x (x - E[X])^2\,\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$

13 Parametric models: common discrete distributions
Discrete uniform: $p(x) = \frac{1}{n}$, $x \in \{1, \dots, n\}$
Bernoulli: $p(x) = p^x (1-p)^{1-x}$, $x \in \{0, 1\}$
Binomial: $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x \in \{0, \dots, n\}$
Poisson: $p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$, $x \in \{0, 1, 2, \dots\}$

14 Parametric models: common continuous distributions
Uniform: $f(x) = \frac{1}{b-a}$, $x \in [a, b]$
Normal: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $x \in (-\infty, \infty)$
Exponential: $f(x) = \frac{1}{\theta}\, e^{-x/\theta}$, $x \in [0, \infty)$
Gamma: $f(x) \propto x^{\alpha-1} e^{-\beta x}$, $x \in [0, \infty)$

15 Nonparametric vs parametric models
Nonparametric models and methods require fewer assumptions about the data and typically involve less complicated estimation methods.
Parametric models and methods tend to have more statistical power, often have well-studied, well-behaved properties, involve parameters that often have useful interpretations, and tend to scale better to large, high-dimensional datasets.

16 Parametric probability models for ISI data
Exponential model: $f(x) = \frac{1}{\theta}\, e^{-x/\theta}$. Which model (which value of $\theta$) should we pick?

17 Model fitting – parameter estimation
Two related perspectives on parameter estimation:
Minimize some cost function, e.g. $\hat{\theta} = \arg\min_{\hat{\theta}} E[(\hat{\theta} - \theta)^2]$.
Select estimators with known properties and distributions, e.g. unbiasedness, minimum variance, asymptotic normality.
Best of both worlds: maximum likelihood.

18 Maximum likelihood
The likelihood is the joint probability of the data, viewed as a function of the model parameters.
Exponential model: $f(x) = \frac{1}{\theta}\, e^{-x/\theta}$
Likelihood: $L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta}$

19 Maximum log likelihood
The log likelihood is maximized at the same point and is often easier to work with.
Likelihood: $L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta}$
Log likelihood: $\log L(\theta) = -\sum_{i=1}^{n} \left( \log\theta + \frac{x_i}{\theta} \right)$

20 Maximum likelihood
Log likelihood: $\log L(\theta) = -\sum_{i=1}^{n} \left( \log\theta + \frac{x_i}{\theta} \right)$
Setting the derivative to zero: $\frac{d \log L(\theta)}{d\theta} = -\sum_{i=1}^{n} \left( \frac{1}{\theta} - \frac{x_i}{\theta^2} \right) = -\frac{n}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat{\theta} = \frac{\sum_i x_i}{n} = 30.8$ ms
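
A short sketch (not from the slides) of the exponential ML fit, checking the closed-form estimate $\hat{\theta} = \bar{x}$ against a direct numerical maximization; the data are a simulated stand-in for the retinal ISIs.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
isis = rng.exponential(scale=30.8, size=1000)  # simulated stand-in for the retinal ISIs

def neg_log_likelihood(theta, x):
    # -log L(theta) for the exponential model f(x) = (1/theta) * exp(-x/theta)
    return np.sum(np.log(theta) + x / theta)

theta_closed_form = isis.mean()  # closed-form ML estimate: the sample mean

# Numerical check by maximizing the log likelihood directly
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1e3), args=(isis,), method="bounded")
print(theta_closed_form, res.x)  # the two estimates should agree closely
```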

21 Goodness-of-fit
Goodness-of-fit is assessed across multiple measures. One intuitive approach is to compare the model to the data directly, e.g. with a Kolmogorov–Smirnov (KS) plot of the empirical CDF $\hat{F}(x)$ against the model CDF $F(x)$.
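
A sketch of a KS plot comparing the fitted exponential CDF to the eCDF (assumptions: matplotlib is available, and the simulated ISIs below stand in for the real data); a well-fitting model should track the 45-degree line.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
isis = rng.exponential(scale=30.8, size=1000)  # stand-in data
theta_hat = isis.mean()                        # ML estimate from the previous step

x = np.sort(isis)
F_emp = np.arange(1, len(x) + 1) / len(x)      # empirical CDF at each sorted ISI
F_model = 1.0 - np.exp(-x / theta_hat)         # fitted exponential CDF

plt.plot(F_model, F_emp, label="model vs. empirical")
plt.plot([0, 1], [0, 1], "k--", label="perfect fit")
plt.xlabel("Model CDF")
plt.ylabel("Empirical CDF")
plt.legend()
plt.show()
```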

22 Interval estimation Any point estimate for a parameter is almost certainly not exactly correct, and it is hard to attach confidence to a single value. Instead we want a statement about a reasonable range of parameter values and our confidence in that range. The likelihood provides an intuitive approach: examine the curvature of $\log L(\theta)$ at $\hat{\theta}_{ML}$.

23 Fisher information
Fisher information quantifies the information contained in the data about a model parameter.
Expected Fisher information: $I(\theta) = -E\left[ \frac{d^2 \log L(\theta)}{d\theta^2} \right]$
Observed Fisher information: $\hat{I}(\theta) = -\left. \frac{d^2 \log L(\theta)}{d\theta^2} \right|_{\theta = \hat{\theta}_{ML}}$
Cramér–Rao bound for any unbiased estimator: $\mathrm{var}(\hat{\theta}) \ge I(\theta)^{-1}$. ML estimators achieve this bound and have a normal distribution asymptotically, giving an approximate 95% CI for $\theta$: $\hat{\theta}_{ML} \pm 2\, I(\theta)^{-1/2}$.

24 Fisher information
For our exponential model of the retinal ISI data:
$\frac{d \log L(\theta)}{d\theta} = -\sum_{i=1}^{n} \left( \frac{1}{\theta} - \frac{x_i}{\theta^2} \right)$
$\frac{d^2 \log L(\theta)}{d\theta^2} = \sum_{i=1}^{n} \left( \frac{1}{\theta^2} - \frac{2 x_i}{\theta^3} \right)$
$I(\theta) = -E\left[ \frac{d^2 \log L(\theta)}{d\theta^2} \right] = -\frac{n}{\theta^2} + \frac{2 n \theta}{\theta^3} = \frac{n}{\theta^2}$
$\hat{I}(\theta) = -\left. \frac{d^2 \log L(\theta)}{d\theta^2} \right|_{\theta = \hat{\theta}_{ML}} = -\frac{n}{\bar{x}^2} + \frac{2 n \bar{x}}{\bar{x}^3} = \frac{n}{\bar{x}^2}$
$\mathrm{var}(\hat{\theta}) \ge \frac{\theta^2}{n}$, so a 95% CI for $\theta$ is $\bar{x} \pm \frac{2\bar{x}}{\sqrt{n}} = [28.9,\ 32.8]$ ms.

25 Information theory Claude Shannon – A mathematical theory of communication (1948). Information theory was formulated to quantify communication across a noisy channel, and its measures define particular properties of probability models. Caveats: goodness-of-fit is often neglected, and information-theoretic statistics are rarely the most statistically powerful; they are most useful when they have intuitive interpretations.

26 Information theory: entropy
For discrete probability models, entropy is one of many measures that quantify variability/dispersion.
Self-information (surprisal) of an outcome: $-\log p(x)$. It is larger for less likely outcomes, and the information of independent events is additive.
Entropy is the expected surprisal: $H(X) = -E[\log p(X)] = -\sum_x p(x) \log p(x)$. It is maximized when each outcome is equally surprising, and is almost always expressed using $\log_2$ (bits).
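
A small sketch of the entropy computation for a discrete pmf; the example pmfs are made up, and the uniform pmf attains the maximum, $\log_2 n$ bits.

```python
import numpy as np

def entropy_bits(p):
    """Entropy -sum(p * log2(p)) of a discrete pmf, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 outcomes (maximum)
print(entropy_bits([0.9, 0.05, 0.03, 0.02]))   # well below 2 bits: one outcome dominates
```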

27 Information theory: differential entropy
Differential entropy, $H(X) = -E[\log f(X)] = -\int_x f(x) \log f(x)\,dx$, is a not-quite-analogous measure of dispersion for continuous probability models. It is typically computed using $\log_2$ (in bits), depends on the units of $X$, and can take on negative values. It is harder to interpret, but smaller values suggest probability mass confined to smaller volumes.

28 Differential entropy of our ISI model
$H(X) = -\int_0^\infty \frac{1}{\theta} e^{-x/\theta} \log\!\left( \frac{1}{\theta} e^{-x/\theta} \right) dx = \log\theta + \frac{1}{\theta} \int_0^\infty \frac{x}{\theta} e^{-x/\theta}\, dx = \log\theta + \frac{1}{\theta} E[X] = \log\theta + 1 = 4.43$ nats $= 6.39$ bits.
This does NOT mean that retinal spiking contains 6.4 bits of information: $H(X) = -3.6$ bits if $\theta$ is expressed in seconds. Watch out for the many nonsense statements related to information theory!
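
A sketch that verifies the analytic value numerically and illustrates the units dependence (assumption: $\theta = 30.8$ ms, as estimated earlier).

```python
import numpy as np
from scipy.integrate import quad

def exponential_diff_entropy_bits(theta):
    """Differential entropy (bits) of an exponential pdf with mean theta, by numerical integration."""
    f = lambda x: (1.0 / theta) * np.exp(-x / theta)
    integrand = lambda x: -f(x) * np.log2(f(x))
    value, _ = quad(integrand, 0, np.inf)
    return value

print(exponential_diff_entropy_bits(30.8))    # about  6.39 bits with theta in ms
print(exponential_diff_entropy_bits(0.0308))  # about -3.6 bits for the same model with theta in s
```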

29 Information theory: KL divergence
The Kullback–Leibler (KL) divergence is a distance-like measure between two distributions:
$D_{\mathrm{KL}}(X \| Y) = -E_X\!\left[ \log \frac{P_Y(X)}{P_X(X)} \right]$
Discrete variables: $D_{\mathrm{KL}}(X \| Y) = -\sum_x p_X(x) \log \frac{p_Y(x)}{p_X(x)}$
Continuous variables: $D_{\mathrm{KL}}(X \| Y) = -\int_x f_X(x) \log \frac{f_Y(x)}{f_X(x)}\, dx$
Some properties: it is asymmetric, and $D_{\mathrm{KL}}(X \| Y) \ge 0$. Interpretation (discrete): if $P_X$ is correct, it is the information gained by using $P_X$ instead of the incorrect $P_Y$. It has many relations to other information measures.
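
A small sketch of the discrete KL divergence (natural log, so the result is in nats); p and q are made-up pmfs over the same outcomes.

```python
import numpy as np

def kl_divergence_nats(p, q):
    """D_KL(p || q) in nats for discrete pmfs on the same support (q > 0 wherever p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence_nats(p, q), kl_divergence_nats(q, p))  # nonzero and unequal: KL is asymmetric
```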

30 Joint distributions In most experiments, we measure multiple variables. If the variables are random, their relationship to each other is described by their joint distribution, and most statistical analyses relate to the estimation of some form of joint or conditional distribution. This brings some familiar and some new challenges: mixing discrete and continuous variables can lead to additional issues that require measure-theoretic approaches, and more variables require high-dimensional analysis methods.

31 Joint distributions Imagine that for the retinal data we find that the light intensity ($X$, fractional change) varied in time and influenced the ISIs ($Y$, in ms).

32 Joint distributions
Since both variables are continuous, we can describe the distribution of probability mass using a joint density $f(x, y)$, with $f(x, y) \ge 0$ and $\int_x \int_y f(x, y)\, dy\, dx = 1$.
Joint CDF: $F(a, b) = \int_{x < a} \int_{y < b} f(x, y)\, dy\, dx$

33 Marginalization
The marginal distribution of a set of variables is computed by integrating out any additional variables over their full support:
$f(x) = \int_y f(x, y)\, dy, \qquad f(y) = \int_x f(x, y)\, dx$

34 Conditional distributions
Conditional distributions characterize the distribution of one set of variables at specific values of other variables:
$P(Y \mid X = x) = \frac{P(X = x,\, Y)}{P(X = x)}$

35 Conditional distributions
$P(Y \mid X)$, written as a function of the variable $X$, expresses the conditional distribution $P(Y \mid X = x)$ at each value $X = x$:
$P(Y \mid X) = \frac{P(X, Y)}{P(X)}, \qquad P(X, Y) = P(Y \mid X)\, P(X)$

36 Independence Variables are independent when knowledge of one provides no information about the other:
$P(Y \mid X) = P(Y) \;\Leftrightarrow\; \frac{P(X, Y)}{P(X)} = P(Y) \;\Leftrightarrow\; P(X, Y) = P(X)\, P(Y)$

37 Linear models
Linear models define a conditional distribution of a response variable as a function of one or more predictor variables: $Y = \beta_0 + \beta_1 X + \varepsilon$. Typically we assume a Gaussian noise model, $Y \mid X \sim N(\beta_0 + \beta_1 X,\, \sigma)$. Note that this is linear with respect to the parameters, but $X$ and $Y$ can be any nonlinear functions of measured variables. Estimating the parameters by maximum likelihood gives
$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
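
A sketch of fitting the linear model with the closed-form ML/least-squares estimates above; the data are simulated, and the coefficient values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-0.5, 0.5, size=500)               # stand-in for light intensity (fractional change)
y = 3.4 - 1.2 * x + rng.normal(0, 0.3, size=500)   # stand-in for log ISI

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
sigma = np.std(y - (beta0 + beta1 * x))            # ML estimate of the residual noise scale
print(beta0, beta1, sigma)
```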

38 Linear fit of light intensity – ISI relationship
𝑦 π‘₯

39 Linear fit of light intensity – log(ISI) relationship
π‘₯ 𝑦 log⁑(𝑦) π‘₯

40 Linear models
The basic principles of linear modeling contain most basic statistical methods as special cases:
z-tests, t-tests – linear model with a binary predictor
ANOVA – linear model with categorical predictors
Multiple regression – linear model with multiple predictors
Quadratic/polynomial/spline/wavelet/etc. regression – linear model with predictors that are functions of selected variables
Spectral estimation – linear model with sinusoidal predictors
Autocorrelation analysis / AR models – linear models with past signal values as predictors

41 More information theory
Conditional entropy – the expected remaining entropy in $Y$ once $X$ is known:
$H(Y \mid X) = -E[\log P(Y \mid X)] = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}$
Mutual information – the expected change in entropy of $Y$ from learning the value of $X$:
$I(X; Y) = -E\!\left[ \log \frac{P(X)\, P(Y)}{P(X, Y)} \right] = -\sum_{x,y} p(x, y) \log \frac{p(x)\, p(y)}{p(x, y)} = H(Y) - H(Y \mid X) = D_{\mathrm{KL}}(P(X, Y) \| P(X) P(Y))$
For the retinal dataset: $H(Y) = 3.15$ nats, $H(Y \mid X) = 2.97$ nats, $I(X; Y) = 0.18$ nats $= 0.27$ bits.
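
A sketch of a simple plug-in (histogram) estimate of mutual information from paired samples; the bin count and simulated data are illustrative, and plug-in estimates are biased upward for small samples.

```python
import numpy as np

def mutual_information_bits(x, y, bins=20):
    """Plug-in mutual information estimate (bits) from a 2D histogram of paired samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

# Correlated pair (stand-in for intensity and log ISI) vs an independent shuffle
rng = np.random.default_rng(3)
x = rng.uniform(-0.5, 0.5, size=5000)
y = 3.4 - 1.2 * x + rng.normal(0, 0.3, size=5000)
print(mutual_information_bits(x, y), mutual_information_bits(x, rng.permutation(y)))
```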

42 Decoding problem Imagine that, for part of the experiment, we neglected to record the light intensity level through time. Can we estimate it from the ISIs?

43 Decoding using the likelihood
Our linear model describes the conditional distribution of the log ISI given the light intensity: $\log(Y) \mid X \sim N(\beta_0 + \beta_1 X,\, \sigma)$. Without any probability model for the intensity, we could estimate its value for each ISI by maximum likelihood:
$L(x_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\log y_t - \beta_0 - \beta_1 x_t)^2}{2\sigma^2}}$
$\frac{d \log L(x_t)}{d x_t} = \frac{\beta_1 (\log y_t - \beta_0 - \beta_1 x_t)}{\sigma^2} = 0 \;\Rightarrow\; \hat{x}_t = \frac{\log y_t - \beta_0}{\beta_1}$
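
A sketch of the likelihood (point) decoder implied above, inverting the fitted linear relationship ISI by ISI; the coefficient values and log ISIs passed in are hypothetical.

```python
import numpy as np

def decode_intensity_ml(log_isi, beta0, beta1):
    """ML (point) estimate of the intensity for each log ISI under the linear-Gaussian model."""
    return (np.asarray(log_isi, dtype=float) - beta0) / beta1

# Hypothetical fitted coefficients and a few observed log ISIs
print(decode_intensity_ml([3.2, 3.5, 3.9], beta0=3.4, beta1=-1.2))
```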

44 Bayes' rule
Bayes' rule is a consequence of the definition of conditional distributions:
$P(X, Y) = P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y) \;\Rightarrow\; P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$
The posterior is proportional to the prior times the likelihood.

45 Bayesian decoder
Model $X$ as an independent normal process.
Prior: $X \sim N(\mu_x,\, \sigma_x)$
Likelihood: $\log(Y) \mid X \sim N(\beta_0 + \beta_1 X,\, \sigma)$
Posterior: $f(X \mid \log Y) \propto f(\log Y \mid X)\, f(X) \propto \exp\!\left( -\frac{(\log y - \beta_0 - \beta_1 x)^2}{2\sigma^2} - \frac{(x - \mu_x)^2}{2\sigma_x^2} \right)$
$X \mid \log Y \sim N\!\left( \frac{\beta_1 \sigma_x^2 (\log y - \beta_0) + \sigma^2 \mu_x}{\beta_1^2 \sigma_x^2 + \sigma^2},\; \frac{\sigma_x^2 \sigma^2}{\beta_1^2 \sigma_x^2 + \sigma^2} \right)$
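
A sketch of the conjugate Gaussian update above; the prior parameters mu_x and sigma_x, the coefficients, and the example log ISIs are all hypothetical.

```python
import numpy as np

def decode_intensity_bayes(log_isi, beta0, beta1, sigma, mu_x, sigma_x):
    """Posterior mean and s.d. of the intensity for each log ISI (Gaussian prior and likelihood)."""
    log_isi = np.asarray(log_isi, dtype=float)
    denom = beta1**2 * sigma_x**2 + sigma**2
    post_mean = (beta1 * sigma_x**2 * (log_isi - beta0) + sigma**2 * mu_x) / denom
    post_sd = np.sqrt(sigma_x**2 * sigma**2 / denom)
    return post_mean, post_sd

# Hypothetical parameter values; a tight prior pulls the estimates toward mu_x
print(decode_intensity_bayes([3.2, 3.5, 3.9], beta0=3.4, beta1=-1.2,
                             sigma=0.3, mu_x=0.0, sigma_x=0.2))
```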

46 Decoding ISI data (figure: ISI data, likelihood-decoder intensity, and Bayesian-decoder intensity, each plotted against spike index)

47 Latent variables
Informal definition: any variable that is not directly observed, but which provides additional insight about the data, is called latent.
One formal definition: latent variables are ones that make the data conditionally independent:
$P(y_{1:n}) \ne P(y_1) \cdots P(y_n)$, but $P(y_{1:n} \mid x_{1:m}) = P(y_1 \mid x_{1:m}) \cdots P(y_n \mid x_{1:m})$
Latent variables are often used to model unobserved factors influencing data, low-dimensional representations of high-dimensional data, unobserved dynamics underlying data, ...

48 The state-space paradigm
An unobserved (latent) state process $x_k$ undergoes stochastic dynamics, $f(x_k \mid x_{k-1})$, that influence an observation process $y_k$ through $f(y_k \mid x_k)$.
Filtering problem: estimate the state $x_k$ given observations up to the current time, $y_{0:k}$.
Smoothing problem: estimate the state $x_k$ given all observations, $y_{0:T}$.
In our example, we may assume the light intensity varies smoothly in time: $x_k - x_{k-1} \sim N(0, \epsilon)$.

49 Filters and smoothers
Using basic probability theory, we can compute the posterior of the latent state given all observations up to the current time, or given all observations in the experiment.
Filter (one-step prediction, then update):
$f(x_k \mid y_{1:k-1}) = \int_{-\infty}^{\infty} f(x_k \mid x_{k-1})\, f(x_{k-1} \mid y_{1:k-1})\, dx_{k-1}$
$f(x_k \mid y_{1:k}) \propto f(y_k \mid x_k)\, f(x_k \mid y_{1:k-1}) = f(y_k \mid x_k) \int_{-\infty}^{\infty} f(x_k \mid x_{k-1})\, f(x_{k-1} \mid y_{1:k-1})\, dx_{k-1}$
Smoother:
$f(x_k \mid y_{1:K}) = f(x_k \mid y_{1:k}) \int_{-\infty}^{\infty} \frac{f(x_{k+1} \mid y_{1:K})\, f(x_{k+1} \mid x_k)}{f(x_{k+1} \mid y_{1:k})}\, dx_{k+1}$
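
A sketch of these recursions in the special linear-Gaussian case: a scalar Kalman filter plus RTS smoother matching the random-walk state model and the linear-Gaussian log-ISI observation model. All parameter values and data below are simulated and illustrative, not the course's implementation.

```python
import numpy as np

def kalman_filter_smoother(log_isi, beta0, beta1, sigma, eps, x0=0.0, v0=1.0):
    """Scalar Kalman filter and RTS smoother for x_k | x_{k-1} ~ N(x_{k-1}, eps^2)
    and observations log(y_k) | x_k ~ N(beta0 + beta1 * x_k, sigma^2)."""
    log_isi = np.asarray(log_isi, dtype=float)
    n = len(log_isi)
    xf, vf = np.zeros(n), np.zeros(n)                        # filtered means and variances
    x_pred, v_pred = x0, v0
    for k in range(n):
        if k > 0:
            x_pred, v_pred = xf[k - 1], vf[k - 1] + eps**2   # one-step prediction
        gain = v_pred * beta1 / (beta1**2 * v_pred + sigma**2)
        xf[k] = x_pred + gain * (log_isi[k] - beta0 - beta1 * x_pred)
        vf[k] = (1 - gain * beta1) * v_pred
    xs, vs = xf.copy(), vf.copy()                            # backward (smoothing) pass
    for k in range(n - 2, -1, -1):
        c = vf[k] / (vf[k] + eps**2)
        xs[k] = xf[k] + c * (xs[k + 1] - xf[k])
        vs[k] = vf[k] + c**2 * (vs[k + 1] - (vf[k] + eps**2))
    return xf, vf, xs, vs

# Simulated smoothly varying intensity and log ISIs as a stand-in for the real recording
rng = np.random.default_rng(4)
x_true = np.cumsum(rng.normal(0, 0.05, size=200))
log_isi = 3.4 - 1.2 * x_true + rng.normal(0, 0.3, size=200)
xf, vf, xs, vs = kalman_filter_smoother(log_isi, beta0=3.4, beta1=-1.2, sigma=0.3, eps=0.05)
```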

50 Latent variables (figure: ISI data, filter estimate of intensity, and smoother estimate of intensity, each plotted against spike index)

51 Take home points
Most data analysis techniques are fundamentally concerned with expressing a (joint) probability model for the data.
Nonparametric and parametric modeling methods are complementary and can provide robust and interpretable descriptions of the data.
Maximum likelihood provides an often statistically optimal approach for estimating model parameters.
Information measures describe properties of a probability model, which can in turn provide insight into the data.
Latent variable methods use knowledge or assumptions about unobserved variables to improve understanding of observed data.

