2016 METHODS IN COMPUTATIONAL NEUROSCIENCE COURSE
MARINE BIOLOGICAL LABORATORY, WOODS HOLE, MA
Brief introduction to probability theory, information measures, and latent variables
Uri Eden, BU Department of Mathematics and Statistics
August 1, 2016

Probability and statistics
Probability theory uses probability models to describe and predict features of experimental data. E.g., assuming 55% of voters favor Clinton, what range of outcomes could we expect from a random poll of 1000 voters?
Statistics uses data to estimate and infer features of a probability model that could have generated the data. E.g., if 550 voters in a random poll of 1000 favor Clinton, what is a range of reasonable values for the fraction of voters in the population who favor her?

A case study
Iyengar and Liu (1997) examined the activity of retinal neurons grown in culture under constant light and environmental conditions. Spontaneous spiking activity of these neurons was recorded. The objective is to develop a probability model that accurately describes the stochastic structure of this activity. In particular, we seek a model of the interspike interval (ISI) distribution.

Discrete vs. continuous random variables
Discrete variables can only take on values in a discrete (finite or countably infinite) set, e.g.:
- Result of a coin flip, $X \in \{H, T\}$
- Number of spikes in an interval, $X \in \{0, 1, 2, \ldots\}$
Continuous variables take on values in continuous intervals, e.g.:
- Phase of a rhythm at which a spike occurs, $X \in (-\pi, \pi]$
- Interspike interval between two spikes, $X \in [0, \infty)$

Discrete variables are defined by a pmf or CDF
Probability mass function (pmf): $p(x) = \Pr(X = x)$, with $\sum_x p(x) = 1$.
Cumulative distribution function (CDF): $F(x) = \Pr(X \le x) = \sum_{x_i \le x} p(x_i)$.
[Figure: example pmf $p(x)$ and CDF $F(x)$ as functions of $x$.]

Continuous variables are defined by a pdf or CDF
Probability density function (pdf): $f(x)$, with $\Pr(X = x) = 0$, $\Pr(a < X < b) = \int_a^b f(x)\,dx$, and $\int f(x)\,dx = 1$.
Cumulative distribution function (CDF): $F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(u)\,du$.
[Figure: example pdf $f(x)$ and CDF $F(x)$ as functions of $x$.]

Expected values
Discrete variables: $E[X] = \sum_x x\,p(x)$, $\quad E[g(X)] = \sum_x g(x)\,p(x)$
Continuous variables: $E[X] = \int x f(x)\,dx$, $\quad E[g(X)] = \int g(x) f(x)\,dx$
Variance: $\mathrm{Var}(X) = E\big[(X - \mu)^2\big]$, where $\mu = E[X]$.

Question
What probability model should we use to describe the retinal spiking data?

Nonparametric model: the empirical CDF
The empirical CDF (eCDF) is a function of $x$ that computes the fraction of data less than or equal to $x$:
$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}$
Dvoretzky-Kiefer-Wolfowitz (DKW) inequality: $\Pr\!\left(\sup_x |\hat{F}(x) - F(x)| > \varepsilon\right) \le 2 e^{-2 n \varepsilon^2}$, so in particular $\Pr\!\left(\sup_x |\hat{F}(x) - F(x)| > \tfrac{1.36}{\sqrt{n}}\right) \le 0.05$.
[Figure: eCDF $\hat{F}(x)$ of the retinal ISIs as a function of $x$.]
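The slides do not include code, but a minimal sketch of the eCDF and a 95% DKW band is straightforward. This assumes NumPy and uses a synthetic exponential sample as a stand-in for the retinal ISI data, which are not reproduced here.

```python
import numpy as np

def ecdf(data):
    """Return sorted sample values and the empirical CDF evaluated at them."""
    x = np.sort(data)
    F_hat = np.arange(1, len(x) + 1) / len(x)
    return x, F_hat

# Synthetic stand-in for the recorded ISIs (ms); sample size is arbitrary.
rng = np.random.default_rng(0)
isi = rng.exponential(scale=30.8, size=1000)

x, F_hat = ecdf(isi)

# 95% DKW band: sup_x |F_hat(x) - F(x)| <= sqrt(log(2/0.05) / (2n)) ~ 1.36/sqrt(n) with prob. >= 0.95
n = len(isi)
eps = np.sqrt(np.log(2 / 0.05) / (2 * n))
lower, upper = np.clip(F_hat - eps, 0, 1), np.clip(F_hat + eps, 0, 1)
```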

Expected values under the eCDF
The eCDF defines a discrete probability model that places probability mass $\frac{1}{n}$ at each observed data point:
$E[X] = \sum_x x\,\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$
$E[g(X)] = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$
$\mathrm{Var}(X) = \sum_x \big(x - E[X]\big)^2\,\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Parametric models: common discrete distributions
Discrete uniform: $p(x) = \frac{1}{n}$, $\quad x \in \{1, \ldots, n\}$
Bernoulli: $p(x) = p^x (1-p)^{1-x}$, $\quad x \in \{0, 1\}$
Binomial: $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$, $\quad x \in \{0, \ldots, n\}$
Poisson: $p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$, $\quad x \in \{0, 1, 2, \ldots\}$

Parametric models: common continuous distributions
Uniform: $f(x) = \frac{1}{b-a}$, $\quad x \in [a, b]$
Normal: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$, $\quad x \in (-\infty, \infty)$
Exponential: $f(x) = \frac{1}{\theta}\, e^{-x/\theta}$, $\quad x \in [0, \infty)$
Gamma: $f(x) \propto x^{\alpha-1} e^{-\beta x}$, $\quad x \in [0, \infty)$
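As a convenience, all of these standard families are available in scipy.stats. A small sketch of evaluating them on a grid; the parameter values below are arbitrary illustrations, not estimates from the retinal data.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 200, 400)   # grid of candidate ISI values (ms)

# Continuous families from the slide (scipy parameterizes the exponential by its mean via `scale`)
f_unif  = stats.uniform.pdf(x, loc=0, scale=200)   # Uniform on [0, 200]
f_norm  = stats.norm.pdf(x, loc=50, scale=20)      # Normal(mu=50, sigma=20)
f_expon = stats.expon.pdf(x, scale=30.0)           # Exponential with mean theta = 30
f_gamma = stats.gamma.pdf(x, a=2.0, scale=15.0)    # Gamma(alpha=2, scale=1/beta=15)

# Discrete families
k = np.arange(0, 20)
p_binom   = stats.binom.pmf(k, n=19, p=0.3)
p_poisson = stats.poisson.pmf(k, mu=5.0)
```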

Nonparametric vs. parametric models
Nonparametric models and methods:
- require fewer assumptions about the data
- typically involve less complicated estimation methods
Parametric models and methods:
- tend to have more statistical power
- often have well-studied, well-behaved properties
- involve parameters that often have useful interpretations
- tend to scale better to large, high-dimensional datasets

Parametric probability models for ISI data
Exponential model: $f(x) = \frac{1}{\theta}\, e^{-x/\theta}$
Which model (which value of $\theta$) should we pick?
[Figure: candidate exponential densities $f(x)$ over the ISI histogram; x-axis: ISI (ms).]

Model fitting – parameter estimation
Two related perspectives on parameter estimation:
- Minimize some cost function, e.g. $\hat{\theta} = \arg\min_{\hat{\theta}} E\big[(\hat{\theta} - \theta)^2\big]$
- Select estimators with known properties and distributions, e.g. unbiasedness, minimum variance, asymptotic normality
Best of both worlds: maximum likelihood.

Maximum likelihood
The likelihood is the joint probability of the data, viewed as a function of the model parameters.
Exponential model: $f(x) = \frac{1}{\theta}\, e^{-x/\theta}$
Likelihood: $L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta}$
[Figure: $L(\theta)$ vs. $\theta$; the values are on the order of $10^{-1870}$.]

Maximum log likelihood
The log likelihood is maximized at the same point and is often easier to work with.
Likelihood: $L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta}$
Log likelihood: $\log L(\theta) = -\sum_{i=1}^{n} \left( \log\theta + \frac{x_i}{\theta} \right)$
[Figure: $\log L(\theta)$ vs. $\theta$.]

Maximum likelihood
Setting the derivative of the log likelihood to zero gives the ML estimate:
$\frac{d \log L(\theta)}{d\theta} = -\sum_{i=1}^{n} \left( \frac{1}{\theta} - \frac{x_i}{\theta^2} \right) = -\frac{n}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat{\theta} = \frac{\sum_{i=1}^{n} x_i}{n} = 30.8 \text{ ms}$
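A minimal sketch of this fit in code. It uses a synthetic exponential sample in place of the recorded ISIs (the 30.8 ms value on the slide comes from the actual data) and checks numerically that the log likelihood peaks at the sample mean.

```python
import numpy as np

def exp_loglik(theta, x):
    """Log likelihood of an Exponential(mean=theta) model: -sum(log(theta) + x_i/theta)."""
    return -np.sum(np.log(theta) + x / theta)

rng = np.random.default_rng(0)
isi = rng.exponential(scale=30.8, size=1000)   # synthetic stand-in for the recorded ISIs (ms)

# Closed-form ML estimate derived on the slide: the sample mean.
theta_hat = isi.mean()

# Numerical check: the log likelihood should peak (approximately) at theta_hat.
grid = np.linspace(0.5 * theta_hat, 2.0 * theta_hat, 401)
ll = np.array([exp_loglik(t, isi) for t in grid])
assert np.isclose(grid[ll.argmax()], theta_hat, rtol=0.01)
```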

Goodness-of-fit
Goodness-of-fit is assessed across multiple measures. One intuitive approach is to compare the model to the data, e.g. with a Kolmogorov-Smirnov (KS) plot of the model CDF $F(x)$ against the empirical CDF $\hat{F}(x)$.
[Figure: fitted density $f(x)$ over the ISI histogram (ISI in ms), and the KS plot of $F(x)$ vs. $\hat{F}(x)$.]
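One way to build the KS comparison, as a sketch: evaluate the fitted exponential CDF at the sorted data and compare it with the empirical CDF (again with a synthetic stand-in for the data).

```python
import numpy as np

rng = np.random.default_rng(0)
isi = rng.exponential(scale=30.8, size=1000)   # synthetic stand-in for the recorded ISIs (ms)

x = np.sort(isi)
F_emp = np.arange(1, len(x) + 1) / len(x)      # empirical CDF at the sorted data
F_mod = 1.0 - np.exp(-x / isi.mean())          # fitted exponential CDF, theta_hat = sample mean

# KS plot: model CDF vs. empirical CDF; a well-fitting model hugs the 45-degree line.
ks_stat = np.max(np.abs(F_mod - F_emp))
print(f"KS statistic: {ks_stat:.3f}")
```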

Interval estimation
Any point estimate for a parameter is almost certainly incorrect, and it is hard to express confidence about it. Instead, we want a statement about a reasonable range of parameter values and our confidence in that range.
The likelihood provides an intuitive approach: examine the curvature of $\log L(\theta)$ at $\hat{\theta}_{ML}$.
[Figure: $\log L(\theta)$ vs. $\theta$, with the curvature at $\hat{\theta}_{ML}$ indicated.]

Fisher information
The Fisher information quantifies the information contained in the data about a model parameter.
Expected Fisher information: $I(\theta) = -E\left[\frac{d^2 \log L(\theta)}{d\theta^2}\right]$
Observed Fisher information: $\hat{I}(\theta) = -\left.\frac{d^2 \log L(\theta)}{d\theta^2}\right|_{\theta = \hat{\theta}_{ML}}$
Cramér-Rao bound for any unbiased estimator: $\mathrm{var}(\hat{\theta}) \ge I(\theta)^{-1}$
ML estimators achieve this bound and have a normal distribution asymptotically.
Approximate 95% CI for $\theta$: $\hat{\theta}_{ML} \pm 2\, \hat{I}(\theta)^{-1/2}$

Fisher information for the exponential ISI model
For our exponential model of the retinal ISI data:
$\frac{d \log L(\theta)}{d\theta} = -\sum_{i=1}^{n} \left( \frac{1}{\theta} - \frac{x_i}{\theta^2} \right)$
$\frac{d^2 \log L(\theta)}{d\theta^2} = \sum_{i=1}^{n} \left( \frac{1}{\theta^2} - \frac{2 x_i}{\theta^3} \right)$
Expected: $I(\theta) = -E\left[\frac{d^2 \log L(\theta)}{d\theta^2}\right] = -\frac{n}{\theta^2} + \frac{2 n \theta}{\theta^3} = \frac{n}{\theta^2}$
Observed: $\hat{I}(\theta) = -\left.\frac{d^2 \log L(\theta)}{d\theta^2}\right|_{\hat{\theta}_{ML}} = -\frac{n}{\bar{x}^2} + \frac{2 n \bar{x}}{\bar{x}^3} = \frac{n}{\bar{x}^2}$
Cramér-Rao bound: $\mathrm{var}(\hat{\theta}) \ge \frac{\theta^2}{n}$
Approximate 95% CI for $\theta$: $\bar{x} \pm 2\frac{\bar{x}}{\sqrt{n}} = [28.9,\, 32.8]$ ms
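A sketch of the corresponding interval computation. The synthetic sample stands in for the recording, so the printed interval will not match the [28.9, 32.8] ms on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
isi = rng.exponential(scale=30.8, size=1000)   # synthetic stand-in (ms)

n = len(isi)
theta_hat = isi.mean()              # ML estimate
obs_info = n / theta_hat**2         # observed Fisher information for the exponential model
se = 1.0 / np.sqrt(obs_info)        # = theta_hat / sqrt(n)
ci = (theta_hat - 2 * se, theta_hat + 2 * se)
print(f"theta_hat = {theta_hat:.1f} ms, approx. 95% CI = [{ci[0]:.1f}, {ci[1]:.1f}] ms")
```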

Information theory
Claude Shannon, "A Mathematical Theory of Communication" (1948), formulated information theory to quantify communication across a noisy channel. Information-theoretic quantities describe particular properties of probability models.
Caveats: goodness-of-fit is often neglected; information-theoretic statistics are rarely the most powerful statistically; these statistics can be useful when they have intuitive interpretations.

Information theory: entropy
For discrete probability models, entropy is one of many measures that quantify variability/dispersion.
Self-information (surprisal) of an outcome: $-\log p(x)$. It is larger for less likely outcomes, and the information of independent events is additive.
Entropy is the expected surprisal:
$H(X) = -E[\log p(X)] = -\sum_x p(x) \log p(x)$
It is maximized when every outcome is equally likely, and is almost always expressed using $\log_2$ (bits).

Information theory: differential entropy
$H(X) = -E[\log f(X)] = -\int f(x) \log f(x)\,dx$
A not-quite-analogous measure of dispersion for continuous probability models. Typically computed using $\log_2$ (in bits). It depends on the units of $X$ and can take on negative values. It is harder to interpret, but smaller values suggest that the probability mass is confined to smaller volumes.

Differential entropy of our ISI model
$H(X) = -\int_0^\infty \frac{1}{\theta} e^{-x/\theta} \log\!\left(\frac{1}{\theta} e^{-x/\theta}\right) dx = \log\theta + \frac{1}{\theta}\int_0^\infty \frac{x}{\theta} e^{-x/\theta}\,dx = \log\theta + \frac{1}{\theta} E[X] = \log\theta + 1 = 4.43 \text{ nats} = 6.39 \text{ bits}$
This does NOT mean that retinal spiking contains 6.4 bits of information: $H(X) = -3.6$ bits if $\theta$ is expressed in seconds. Watch out for the many nonsense statements related to information theory!
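A quick numerical check of this calculation, as a sketch: compare the closed form $\log\theta + 1$ with a Monte Carlo estimate of $-E[\log f(X)]$, and note how the answer shifts when the units change.

```python
import numpy as np

theta_ms = 30.8
rng = np.random.default_rng(1)
x = rng.exponential(scale=theta_ms, size=200_000)

def exp_logpdf(x, theta):
    """Log density of an Exponential(mean=theta) model."""
    return -np.log(theta) - x / theta

H_mc    = -np.mean(exp_logpdf(x, theta_ms))     # Monte Carlo estimate, in nats
H_exact = np.log(theta_ms) + 1.0                # closed form: ~4.43 nats (~6.39 bits)
H_sec   = np.log(theta_ms / 1000.0) + 1.0       # same model with theta in seconds (~ -3.6 bits)

print(H_mc, H_exact, H_exact / np.log(2), H_sec / np.log(2))
```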

Information theory: KL divergence
The Kullback-Leibler (KL) divergence is a distance-like measure between two distributions:
$D_{\mathrm{KL}}(X \| Y) = -E_X\!\left[\log \frac{P_Y(X)}{P_X(X)}\right]$
Discrete variables: $D_{\mathrm{KL}}(X \| Y) = -\sum_x p_X(x) \log \frac{p_Y(x)}{p_X(x)}$
Continuous variables: $D_{\mathrm{KL}}(X \| Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)}\,dx$
Some properties:
- It is asymmetric, and $D_{\mathrm{KL}}(X \| Y) \ge 0$.
- Interpretation (discrete): if $P_X$ is correct, it is the information gained by using $P_X$ instead of the incorrect $P_Y$.
- It has many relations to other information measures.
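A sketch of the discrete form, assuming two pmfs defined on the same support; the dice example is purely illustrative.

```python
import numpy as np

def kl_divergence(p_x, p_y):
    """D_KL(X || Y) = sum_x p_X(x) * log(p_X(x) / p_Y(x)), for pmfs on a common support."""
    p_x = np.asarray(p_x, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    mask = p_x > 0                      # terms with p_X(x) = 0 contribute nothing
    return np.sum(p_x[mask] * np.log(p_x[mask] / p_y[mask]))

# Example: a fair vs. a biased six-sided die
p_fair   = np.full(6, 1 / 6)
p_biased = np.array([0.25, 0.25, 0.15, 0.15, 0.10, 0.10])
print(kl_divergence(p_biased, p_fair), kl_divergence(p_fair, p_biased))  # asymmetric, both >= 0
```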

Joint distributions
In most experiments, we measure multiple variables. If the variables are random, their relationship to each other is described by their joint distribution. Most statistical analyses relate to the estimation of some form of joint or conditional distribution.
Some similar and some new challenges arise:
- Combinations of discrete and continuous variables can lead to additional issues that require measure-theoretic approaches.
- More variables require high-dimensional analysis methods.

Joint distributions
Imagine that for the retinal data we find that the light intensity ($X$) varied in time and influenced the ISIs ($Y$).
[Figure: scatter plot of ISIs (ms) against light intensity (fractional change).]

Joint distributions
Since both variables are continuous, we can describe the distribution of probability mass using a joint density $f(x, y)$:
$f(x, y) \ge 0, \qquad \int_x \int_y f(x, y)\,dy\,dx = 1$
Joint CDF: $F(a, b) = \int_{x < a} \int_{y < b} f(x, y)\,dy\,dx$

Marginalization
The marginal distribution of a set of variables is computed by integrating out any additional variables over their full support:
$f(x) = \int_y f(x, y)\,dy \qquad f(y) = \int_x f(x, y)\,dx$

Conditional distributions
Conditional distributions characterize the distribution of one set of variables at specific values of other variables:
$P(Y \mid X = x) = \frac{P(X = x, Y)}{P(X = x)}$

Conditional distributions
$P(Y \mid X)$, written as a function of the variable $X$, expresses the conditional distribution $P(Y \mid X = x)$ at each value $X = x$:
$P(Y \mid X) = \frac{P(X, Y)}{P(X)} \qquad P(X, Y) = P(Y \mid X)\,P(X)$
[Figure: conditional densities of $y$ at several values of $x$.]

Independence
Variables are independent when knowledge of one provides no information about the other:
$P(Y \mid X) = P(Y) \;\Leftrightarrow\; \frac{P(X, Y)}{P(X)} = P(Y) \;\Leftrightarrow\; P(X, Y) = P(X)\,P(Y)$

Linear models
Linear models define a conditional distribution of a response variable as a function of one or more predictor variables:
$Y = \beta_0 + \beta_1 X + \varepsilon$
Typically, we assume a Gaussian noise model: $Y \mid X \sim N(\beta_0 + \beta_1 X,\, \sigma^2)$.
Note that this is linear with respect to the parameters, but $X$ and $Y$ can be any nonlinear functions of measured variables.
Estimate the parameters by maximum likelihood:
$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
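A sketch of the closed-form ML (least-squares) fit, using simulated light-intensity and log-ISI values in place of the actual recordings; the coefficients used to simulate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 0.3, size=500)                  # simulated light intensity (fractional change)
y = 3.2 - 1.5 * x + rng.normal(0.0, 0.4, size=500)  # simulated log-ISI; coefficients are made up

# Closed-form ML estimates from the slide
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
sigma_hat = np.sqrt(np.mean((y - beta0 - beta1 * x) ** 2))  # ML estimate of the noise s.d.
print(beta0, beta1, sigma_hat)
```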

Linear fit of the light intensity – ISI relationship
[Figure: scatter of ISI ($y$) vs. light intensity ($x$) with the fitted line.]

Linear fit of the light intensity – log(ISI) relationship
[Figure: scatter of $y$ vs. $x$ and of $\log(y)$ vs. $x$ with the fitted line.]

Linear models
The basic principles of linear modeling contain most basic statistical methods as special cases:
- z-tests, t-tests: linear model with a binary predictor
- ANOVA: linear model with categorical predictors
- Multiple regression: linear model with multiple predictors
- Quadratic/polynomial/spline/wavelet/etc. regression: linear model with predictors that are functions of selected variables
- Spectral estimation: linear model with sinusoidal predictors
- Autocorrelation analysis/AR models: linear models with past signal values as predictors

More information theory
Conditional entropy: the expected remaining entropy in $Y$ once $X$ is known,
$H(Y \mid X) = -E[\log P(Y \mid X)] = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}$
Mutual information: the expected change in the entropy of $Y$ from learning the value of $X$,
$I(X; Y) = -E\!\left[\log \frac{P(X) P(Y)}{P(X, Y)}\right] = -\sum_{x,y} p(x, y) \log \frac{p(x)\, p(y)}{p(x, y)} = H(Y) - H(Y \mid X) = D_{\mathrm{KL}}\big(P(X, Y) \,\|\, P(X) P(Y)\big)$
For the retinal dataset: $H(Y) = 3.15$ nats, $H(Y \mid X) = 2.97$ nats, $I(X; Y) = 0.18$ nats $= 0.27$ bits.
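A sketch of one common plug-in estimate: discretize the two variables, form the joint histogram, and compute the mutual information from it. The bin count and simulated data here are illustrative, not the values behind the numbers on the slide.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in estimate of I(X;Y) in nats from a 2-D histogram of the samples."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)       # marginal of X (column vector)
    p_y = p_xy.sum(axis=0, keepdims=True)       # marginal of Y (row vector)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

rng = np.random.default_rng(3)
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(size=5000)             # correlated, so I(X;Y) > 0
print(mutual_information(x, y),
      mutual_information(x, rng.normal(size=5000)))  # second value near 0 (up to plug-in bias)
```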

Decoding problem
Imagine that in our retinal ISI data we neglected to record the light intensity level through time for part of the experiment. Can we estimate it from the ISIs?

Decoding using the likelihood
Our linear model describes the conditional distribution of the log ISI given the light intensity:
$\log(Y) \mid X \sim N(\beta_0 + \beta_1 X,\, \sigma^2)$
Without any probability model for the intensity, we could estimate its value for each ISI using ML:
$L(x_t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(\log y_t - \beta_0 - \beta_1 x_t)^2}{2\sigma^2}\right)$
$\frac{d \log L(x_t)}{d x_t} = \frac{\beta_1 (\log y_t - \beta_0 - \beta_1 x_t)}{\sigma^2} = 0 \;\Rightarrow\; \hat{x}_t = \frac{\log y_t - \beta_0}{\beta_1}$
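A minimal sketch of the ML decoder; the coefficient values and ISIs below are hypothetical placeholders for the fitted values and the recorded data.

```python
import numpy as np

def ml_decode_intensity(isi_ms, beta0, beta1):
    """ML estimate of the light intensity for each ISI: x_hat = (log(y) - beta0) / beta1."""
    return (np.log(isi_ms) - beta0) / beta1

# Hypothetical fitted coefficients; in practice these come from the linear fit above.
beta0, beta1 = 3.2, -1.5
isi_ms = np.array([22.0, 35.0, 41.0, 18.0])
print(ml_decode_intensity(isi_ms, beta0, beta1))
```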

Bayes' rule
Bayes' rule is a consequence of the definition of conditional distributions:
$P(X, Y) = P(Y \mid X)\, P(X)$ and $P(X, Y) = P(X \mid Y)\, P(Y)$, so $P(X \mid Y)\, P(Y) = P(Y \mid X)\, P(X)$, i.e.,
$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$
The posterior is proportional to the prior times the likelihood.

Bayesian decoder
Model $X$ as an independent normal process.
Prior: $X \sim N(\mu_x,\, \sigma_x^2)$
Likelihood: $\log(Y) \mid X \sim N(\beta_0 + \beta_1 X,\, \sigma^2)$
Posterior: $f(X \mid \log Y) \propto f(\log Y \mid X)\, f(X) \propto \exp\!\left(-\frac{(\log y - \beta_0 - \beta_1 x)^2}{2\sigma^2} - \frac{(x - \mu_x)^2}{2\sigma_x^2}\right)$
$X \mid \log Y \sim N\!\left(\frac{\beta_1 \sigma_x^2 (\log y - \beta_0) + \sigma^2 \mu_x}{\beta_1^2 \sigma_x^2 + \sigma^2},\; \frac{\sigma_x^2 \sigma^2}{\beta_1^2 \sigma_x^2 + \sigma^2}\right)$
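A sketch of this posterior computation, with hypothetical parameter values standing in for the fitted ones.

```python
import numpy as np

def bayes_decode_intensity(isi_ms, beta0, beta1, sigma, mu_x, sigma_x):
    """Posterior mean and s.d. of the intensity given one ISI, for the Gaussian prior/likelihood above."""
    log_y = np.log(isi_ms)
    denom = beta1**2 * sigma_x**2 + sigma**2
    post_mean = (beta1 * sigma_x**2 * (log_y - beta0) + sigma**2 * mu_x) / denom
    post_var = (sigma_x**2 * sigma**2) / denom
    return post_mean, np.sqrt(post_var)

# Hypothetical parameters; in practice they come from the linear fit and a prior over intensity.
mean, sd = bayes_decode_intensity(np.array([22.0, 35.0, 41.0]), beta0=3.2, beta1=-1.5,
                                  sigma=0.4, mu_x=0.0, sigma_x=0.3)
print(mean, sd)
```

Note how the posterior mean shrinks the likelihood-only estimate toward the prior mean, with the amount of shrinkage set by the relative sizes of the noise and prior variances.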

Decoding ISI data
[Figure: ISI (ms), likelihood-decoded intensity, and Bayesian-decoded intensity plotted against spike index.]

Latent variables
Informal definition: any variable that is not directly observed, but which provides additional insight about the data, is called latent.
One formal definition: latent variables are variables that make the data conditionally independent,
$P(y_{1:n}) \ne P(y_1) \cdots P(y_n)$ but $P(y_{1:n} \mid x_{1:m}) = P(y_1 \mid x_{1:m}) \cdots P(y_n \mid x_{1:m})$
Latent variables are often used to model unobserved factors influencing data, low-dimensional representations of high-dimensional data, unobserved dynamics underlying data, ...

The state-space paradigm
An unobserved (latent) state process $x_k$ undergoes stochastic dynamics, $f(x_k \mid x_{k-1})$, which influence an observation process $y_k$ through $f(y_k \mid x_k)$.
Filtering problem: estimate the state $x_k$ given the observations up to the current time, $y_{1:k}$.
Smoothing problem: estimate the state $x_k$ given all the observations, $y_{1:K}$.
In our example, we may assume the light intensity varies smoothly in time: $x_k \mid x_{k-1} \sim N(x_{k-1},\, \epsilon)$.

Filters and smoothers
Using basic probability theory, we can compute the posterior of the latent state given all observations up to the current time (filtering) or given all observations in the experiment (smoothing):
$f(x_k \mid y_{1:k}) \propto f(y_k \mid x_k, y_{1:k-1})\, f(x_k \mid y_{1:k-1})$
$f(x_k \mid y_{1:k-1}) = \int_{-\infty}^{\infty} f(x_k \mid x_{k-1}, y_{1:k-1})\, f(x_{k-1} \mid y_{1:k-1})\, dx_{k-1}$
Combining the two and using the Markov structure of the model:
$f(x_k \mid y_{1:k}) \propto f(y_k \mid x_k) \int_{-\infty}^{\infty} f(x_k \mid x_{k-1})\, f(x_{k-1} \mid y_{1:k-1})\, dx_{k-1}$
Smoother:
$f(x_k \mid y_{1:K}) = f(x_k \mid y_{1:k}) \int_{-\infty}^{\infty} \frac{f(x_{k+1} \mid y_{1:K})\, f(x_{k+1} \mid x_k)}{f(x_{k+1} \mid y_{1:k})}\, dx_{k+1}$
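For a linear-Gaussian version of this model (random-walk state, Gaussian observation of the log ISI as above), the integrals have closed forms and the recursions reduce to the Kalman filter and RTS smoother. A minimal one-dimensional sketch, with hypothetical parameter values and simulated data:

```python
import numpy as np

def kalman_filter_smoother(log_y, beta0, beta1, sigma, eps, x0=0.0, v0=1.0):
    """1-D Kalman filter + RTS smoother for x_k = x_{k-1} + N(0, eps),
    log y_k = beta0 + beta1 * x_k + N(0, sigma^2)."""
    n = len(log_y)
    xf, vf = np.zeros(n), np.zeros(n)       # filtered means and variances
    x_pred, v_pred = x0, v0 + eps
    for k in range(n):
        # Update with observation k
        K = v_pred * beta1 / (beta1**2 * v_pred + sigma**2)     # Kalman gain
        xf[k] = x_pred + K * (log_y[k] - beta0 - beta1 * x_pred)
        vf[k] = (1 - K * beta1) * v_pred
        # Predict forward to the next step
        x_pred, v_pred = xf[k], vf[k] + eps
    # Backward (RTS) smoothing pass
    xs, vs = xf.copy(), vf.copy()
    for k in range(n - 2, -1, -1):
        J = vf[k] / (vf[k] + eps)
        xs[k] = xf[k] + J * (xs[k + 1] - xf[k])
        vs[k] = vf[k] + J**2 * (vs[k + 1] - (vf[k] + eps))
    return xf, vf, xs, vs

# Hypothetical parameters and a simulated log-ISI sequence driven by a smooth intensity
rng = np.random.default_rng(4)
x_true = np.cumsum(rng.normal(0, 0.05, size=200))
log_y = 3.2 - 1.5 * x_true + rng.normal(0, 0.4, size=200)
xf, vf, xs, vs = kalman_filter_smoother(log_y, beta0=3.2, beta1=-1.5, sigma=0.4, eps=0.05**2)
```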

Latent variables and the ISI data
[Figure: ISI (ms), filter estimate of intensity, and smoother estimate of intensity plotted against spike index.]

Take-home points
- Most data analysis techniques are fundamentally concerned with expressing a (joint) probability model for the data.
- Nonparametric and parametric modeling methods are complementary and can provide robust and interpretable descriptions of the data.
- Maximum likelihood provides an approach to estimating model parameters that is often statistically optimal.
- Information measures describe properties of a probability model, which can in turn provide insight into the data.
- Latent variable methods use knowledge or assumptions about unobserved variables to improve understanding of observed data.