Bayesian regularization of learning
Sergey Shumsky, NeurOK Software LLC
Scientific methods
Induction (F. Bacon): from data to models, via machine learning. Deduction (R. Descartes): from models to data, via mathematical modeling.
Outline
- Learning as ill-posed problem
  - General problem: data generalization
  - General remedy: model regularization
- Bayesian regularization. Theory
  - Hypothesis comparison
  - Model comparison
  - Free Energy & EM algorithm
- Bayesian regularization. Practice
  - Hypothesis testing
  - Function approximation
  - Data clustering
Problem statement
Learning is an inverse, ill-posed problem: from data back to a model. Hence the learning paradoxes: how can finite data justify infinitely many predictions? How can future predictions be optimized? How can the regular be separated from the accidental in the data? The remedy is regularization of learning: choosing the optimal model complexity.
Well-posed problem
A problem is well-posed if its solution is unique and stable. Hadamard (1900s), Tikhonov (1960s).
Learning from examples
Problem: find the hypothesis h generating the observed data D within a model H. The problem is well-defined only if the solution is insensitive to noise in the data (Hadamard) and to the learning procedure (Tikhonov).
Learning is an ill-posed problem
Example: function approximation. The fitted function is sensitive both to noise in the data and to the learning procedure.
Moreover, the solution is non-unique: different hypotheses fit the same data equally well.
Problem regularization
Main idea: restrict the admissible solutions, sacrificing precision for stability. The question is how to choose the restriction.
Statistical learning practice
Split the data into a learning set and a validation set, and tune by cross-validation. Cross-validation leads naturally to ensembles of models; the Bayesian approach treats such ensembles systematically, as in the sketch below.
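To make the cross-validation-to-ensemble step concrete, here is a minimal sketch: one model per fold, predictions averaged. The choice of ridge regression and the parameter names (k, alpha) are illustrative assumptions, not from the slides.

```python
import numpy as np

def kfold_ensemble_predict(X, y, X_new, k=5, alpha=1.0, seed=0):
    """Train one ridge-regression model per cross-validation fold
    and average their predictions over the resulting ensemble."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    preds = []
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        Xt, yt = X[train], y[train]
        # Ridge solution: w = (Xt'Xt + alpha I)^(-1) Xt'y
        w = np.linalg.solve(Xt.T @ Xt + alpha * np.eye(X.shape[1]), Xt.T @ yt)
        preds.append(X_new @ w)
    return np.mean(preds, axis=0)              # ensemble average
```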
Statistical learning theory
Learning is inverse probability. Probability theory (Bernoulli, 1713): given the model H and a hypothesis h, derive the data D. Learning theory (Bayes, ~1750): given the data D, infer the hypothesis h within the model H.
Bayesian learning
Bayes' rule turns the prior P(h|H) into the posterior P(h|D,H) = P(D|h,H) P(h|H) / P(D|H), where the normalizer P(D|H) is the evidence.
Coin-tossing game
Example: in a coin-tossing game the hypotheses h of the model H are the possible head probabilities of the coin.
Monte Carlo simulations of the coin-tossing game (a sketch follows).
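A minimal Monte Carlo sketch of the game, assuming a conjugate Beta(1, 1) prior over the head probability; the true bias 0.7 and the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate the game: unknown head probability theta, Beta(1, 1) prior.
theta_true = 0.7
tosses = rng.random(100) < theta_true            # observed data D
heads = int(tosses.sum())

# Conjugate update: posterior over theta is Beta(1 + heads, 1 + tails).
a, b = 1 + heads, 1 + len(tosses) - heads

# Monte Carlo: sample hypotheses from the posterior.
samples = rng.beta(a, b, size=10_000)
print(f"posterior mean {samples.mean():.3f}, 95% interval "
      f"[{np.quantile(samples, 0.025):.3f}, {np.quantile(samples, 0.975):.3f}]")
```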
Bayesian regularization
The most probable (MAP) hypothesis minimizes -log P(h|D,H) = -log P(D|h,H) - log P(h|H) + const: a learning error term plus a regularization term. Example: in function approximation, the first term is the data misfit, the second a penalty on the hypothesis.
Minimal Description Length
The most probable hypothesis has the shortest total code length for the data and the hypothesis, since code length corresponds to probability via L = -log P. Example: an optimal prefix code assigns shorter codewords to more probable symbols, as in the sketch below. Rissanen (1978).
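To illustrate the code-length/probability correspondence, a standard Huffman construction (a textbook sketch, not taken from the slides): for dyadic probabilities the resulting codeword lengths match -log2 p exactly.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword lengths of an optimal prefix (Huffman) code."""
    heap = [(p, [s]) for s, p in probs.items()]
    lengths = dict.fromkeys(probs, 0)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # merged symbols gain one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
for s, L in huffman_lengths(probs).items():
    print(s, L, "bits;", f"-log2 p = {-log2(probs[s]):.0f}")
```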
Data complexity
Complexity: K(D|H) = min_h L(h, D|H). Code length: L(h, D) = L(D|h) + L(h), i.e. the coded data plus the decoding program. Kolmogorov (1965).
Complex = unpredictable
Prediction error ~ L(h, D) / L(D). Random data is incompressible; conversely, compression means predictability. Example: block coding. Solomonoff (1978).
Universal prior
All 2^L programs of length L are equiprobable: P(h) = 2^{-L(h)}. Under this prior, Bayes (~1750) meets data complexity: the most probable explanation is the shortest description L(h, D). Solomonoff (1960).
Statistical ensemble
The ensemble achieves a shorter description length. Proof: sum_h P(h, D) >= max_h P(h, D), hence -log sum_h P(h, D) <= min_h L(h, D). Corollary: ensemble predictions are superior to the most probable prediction.
Ensemble prediction
Predictions of all hypotheses are averaged with their posterior weights: P(y|D) = sum_h P(y|h) P(h|D), as in the sketch below.
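A minimal sketch of ensemble versus most-probable prediction for a discrete hypothesis set (three candidate coin biases, assumed for illustration): the posterior-weighted average hedges between hypotheses instead of committing to the MAP pick.

```python
import numpy as np

# Discrete hypotheses: three candidate head probabilities for a coin.
thetas = np.array([0.2, 0.5, 0.8])
prior = np.full(3, 1 / 3)

heads, tails = 6, 4                              # observed data D
lik = thetas**heads * (1 - thetas)**tails
posterior = prior * lik / np.sum(prior * lik)    # P(h|D)

p_ensemble = np.sum(posterior * thetas)          # sum_h P(y|h) P(h|D)
p_map = thetas[np.argmax(posterior)]             # most probable hypothesis
print(f"ensemble: {p_ensemble:.3f}  MAP: {p_map:.3f}")
```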
Model comparison
Bayes' rule one level up: the posterior of a model is P(H|D) ∝ P(D|H) P(H), where the evidence P(D|H) = sum_h P(D|h,H) P(h|H) plays the role of the likelihood of the model.
Statistics: Bayes vs. Fisher
Fisher: maximize the likelihood P(D|h). Bayes: maximize the evidence P(D|H).
Historical outlook
1920s-60s: parametric statistics, asymptotic in N. Fisher (1912).
1960s-80s: non-parametric statistics, Chentsov (1962); regularization of ill-posed problems, Tikhonov (1963); non-asymptotic learning, Vapnik (1968); algorithmic complexity, Kolmogorov (1965); statistical physics of disordered systems, Gardner (1988).
Statistical physics
The probability of a hypothesis corresponds to a microstate; the optimal model to a macrostate.
Free energy
As a log of a sum: F = -log Z. As a sum of logs: F = E - TS, mean energy minus entropy of a distribution P over hypotheses. The two forms are connected by the variational bound written out below.
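A standard identity, written out with the temperature set to 1; reading the energy as E(h) = L(h, D) is my interpretation of the slide, consistent with the MDL slides above.

```latex
F \;=\; -\log Z \;=\; -\log \sum_h e^{-E(h)}
\qquad \text{(log of a sum)}

F[P] \;=\; \underbrace{\sum_h P(h)\,E(h)}_{\text{energy } E}
\;-\; \underbrace{\Big(-\sum_h P(h)\log P(h)\Big)}_{\text{entropy } S}
\;\ge\; -\log Z
\qquad \text{(sum of logs)}
```

Equality holds at the Gibbs distribution P(h) = e^{-E(h)} / Z; with E(h) = L(h, D) this Gibbs distribution is exactly the Bayesian posterior P(h|D).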
EM algorithm: main idea
Introduce an independent distribution P over hypotheses and minimize the free energy F[P] by coordinate descent. Iterations: the E-step minimizes F over P for a fixed model; the M-step minimizes F over the model for a fixed P.
EM algorithm
E-step: estimate the posterior for the given model.
M-step: update the model for the given posterior.
Bayesian regularization: examples
Hypothesis testing; function approximation; data clustering.
Hypothesis testing
Problem: given noisy observations y, is the theoretical value h0 true? Model H: Gaussian noise and a Gaussian prior.
Optimal model: phase transition
The optimal confidence in h0 switches abruptly between finite and infinite: a phase transition.
Threshold effect
When the deviation of the data from h0 is below a Student-coefficient threshold, the hypothesis h0 is accepted as true; above the threshold, corrections to h0 appear. A sketch of the comparison follows.
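A sketch of this test as Bayesian model comparison: H0 fixes h = h0 exactly, H1 puts a Gaussian prior of width tau around h0, and the evidence ratio then depends only on the sample mean. The parameter values (sigma, tau, n) are illustrative assumptions.

```python
import numpy as np

def norm_logpdf(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

def log_bayes_factor(y, h0, sigma, tau):
    """log P(D|H1) - log P(D|H0) for H0: h = h0 exactly, vs
    H1: h ~ N(h0, tau^2), with observations y_i ~ N(h, sigma^2).
    Only the sample mean matters: ybar ~ N(h0, sigma^2/n (+ tau^2))."""
    n, ybar = len(y), np.mean(y)
    return (norm_logpdf(ybar, h0, np.sqrt(sigma**2 / n + tau**2))
            - norm_logpdf(ybar, h0, sigma / np.sqrt(n)))

rng = np.random.default_rng(0)
h0, sigma, tau, n = 0.0, 1.0, 1.0, 25
for shift in (0.0, 0.2, 0.5, 1.0):           # true deviation from h0
    y = rng.normal(h0 + shift, sigma, n)
    print(f"shift {shift}: log BF = {log_bayes_factor(y, h0, sigma, tau):+.2f}")
```

The log Bayes factor turns positive only once |ybar - h0| exceeds a threshold of order sigma/sqrt(n): the threshold effect of the slide.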
Function approximation
Problem: given noisy data y(x), find an approximation h(x). Model: a noise distribution and a prior over functions.
Optimal model
The optimal model is found by free energy minimization.
Saddle point approximation
The free energy is evaluated by expanding around the best hypothesis, so it becomes a function of that best hypothesis.
EM learning
E-step: the optimal hypothesis (weights). M-step: the optimal regularization. A sketch follows.
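One concrete instance of this E/M alternation is the Gaussian-linear case: weights with a Gaussian prior of precision alpha, noise of precision beta, and EM re-estimation of both hyperparameters by evidence maximization. This is the standard construction written as a sketch; the variable names are mine, not the slides'.

```python
import numpy as np

def bayes_ridge_em(X, y, n_iter=50):
    """EM maximization of the evidence for linear regression:
    noise precision beta, Gaussian prior w ~ N(0, 1/alpha).
    E-step: Gaussian posterior over weights; M-step: update alpha, beta."""
    n, d = X.shape
    alpha, beta = 1.0, 1.0                      # initial hyperparameters
    for _ in range(n_iter):
        # E-step: posterior N(mu, Sigma) of the weights.
        Sigma = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
        mu = beta * Sigma @ X.T @ y
        # M-step: re-estimate hyperparameters (optimal regularization).
        alpha = d / (mu @ mu + np.trace(Sigma))
        resid = y - X @ mu
        beta = n / (resid @ resid + np.trace(X @ Sigma @ X.T))
    return mu, alpha, beta
```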
Laplace prior
A Laplace prior prunes weights: some are driven exactly to zero, while the surviving weights are equisensitive (the error is equally sensitive to each of them).
Laplace regularization
E-step: weight estimation. M-step: regularization update. A sketch follows.
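The pruning effect of the Laplace prior shows up as the soft-thresholding proximal step. A minimal ISTA sketch, under the assumption of a squared-error learning term (lam standing in for the regularization strength):

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal step of the Laplace (L1) penalty: weights within t of
    zero are pruned exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by iterative
    shrinkage-thresholding (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                 # gradient of the error term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

At the minimum every surviving weight satisfies |X_j'(y - Xw)| = lam: the equisensitivity noted on the previous slide.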
Clustering
Problem: given noisy data x, find prototypes (a mixture density approximation). How many clusters? Model: a noise distribution around the prototypes.
Optimal model
Free energy minimization, iterating the E-step and the M-step.
EM algorithm
E-step: compute posterior cluster memberships. M-step: update the mixture parameters. A sketch follows.
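A minimal 1-D sketch of these updates for a Gaussian mixture; the initialization and numerical guards are illustrative choices.

```python
import numpy as np

def gmm_em(x, M, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture with M clusters.
    E-step: responsibilities (posterior cluster memberships);
    M-step: update weights pi, prototypes mu and variances var."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, M, replace=False)            # prototypes
    var = np.full(M, x.var())
    pi = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: r[i, m] = P(cluster m | x_i), computed stably in logs.
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the mixture parameters.
        Nm = r.sum(axis=0) + 1e-12                  # guard empty clusters
        pi = Nm / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nm
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nm + 1e-6
    return pi, mu, var
```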
How many clusters?
The free energy as a function of the number of clusters M has a minimum at the optimal number of clusters, as in the sketch below.
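Continuing the sketch above (it reuses gmm_em), the number of clusters can be chosen by a saddle-point approximation of the free energy: negative log likelihood plus a complexity penalty. Using the BIC form of that penalty is my assumption; the slides minimize the free energy itself.

```python
import numpy as np

def free_energy_bic(x, pi, mu, var):
    """BIC-style saddle-point approximation of the free energy:
    -log likelihood + (k/2) log n, with k free parameters."""
    logp = (-0.5 * (x[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
    loglik = np.logaddexp.reduce(logp, axis=1).sum()
    k = 3 * len(mu) - 1                    # pi (M-1), mu (M), var (M)
    return -loglik + 0.5 * k * np.log(len(x))

# Two well-separated Gaussian clusters; scan candidate model sizes M.
x = np.random.default_rng(1).normal([-3.0, 3.0], 1.0, (200, 2)).ravel()
F = {M: free_energy_bic(x, *gmm_em(x, M)) for M in range(1, 8)}
print("optimal M:", min(F, key=F.get))     # expected: 2
```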
Simulations: uniform data (figure: optimal model M).
Simulations: Gaussian data (figure: optimal model M).
Simulations: Gaussian mixture (figure: optimal model M).
Summary
Learning is an ill-posed problem; the remedy is regularization.
Bayesian learning has built-in regularization (the model assumptions): the optimal model has the minimal description length, equivalently the minimal free energy.
Practical payoff: learning algorithms with built-in optimal regularization, derived from first principles (as opposed to cross-validation).