
MLPR - Questions

–Can you go through integration, differentiation etc.?
–Why do we need priors? Difference between prior and posterior.
–What does Bayesian actually mean?
–Whoa. Lots of distributions spinning around in my head. Where, what, who, when, how?
–What does conjugate prior mean?
–What does argmax mean?
–Is prior P(p) similar to the maximum likelihood distribution P(D|p)?
–Do we need to know how to do the integrals? They look really tough. Why do we need to do these integrals anyway? What does it do?
–What is the difference between max likelihood and max posterior? When should I use what? Practically that is…
–Relate gradient computation and intuitive gradient direction.
–Give an exam example of Bayesian model selection.
–Given a question or problem, how should we proceed?
–Can you give us an example of this stuff in action?
–That PCA stuff, structural equation models etc. WT*?
–What should I be reading? What are the important bits?

Can you go through integration, differentiation etc. No. Sorry.

Why do we need priors? Difference between prior and posterior. Have data D. Have questions Q. Must connect. Cox's axioms show reasonable connection = probabilistic. Build probability model P(D, Q). Problem solved. This is (in a general sense) the prior connection between D and Q. Reality: can't really write down P(D, Q). But can write down a model P(D|θ) that depends on unknown parameters θ. Also write P(Q|θ). Good. Now relate data to question through θ. P(D|θ) is the likelihood. Turns out this is not enough. To get the full distribution relating D and Q we also need P(θ). This is the prior. Without it, what can we say? Different θ = different predictions P(Q|θ). Which should we use? Could pick a particular θ (e.g. maximum likelihood). But this is cheating. Replacing unknown with known. Self-deceptive. Not probabilistic.

Why do we need priors? Difference between prior and posterior. What should we do? "Integrate out" or "marginalise": P(Q|D) = ∫ P(Q|θ) P(θ|D) dθ. Note the rules of probability for (real-valued) random variables are really simple. Product = independence. Conditioning. Marginalisation. Must integrate to one. Cumulative density. That's it. The rest (e.g. Bayes' theorem) follows.

What does Bayesian actually mean? Doing the above. Recognising the need for priors. Doing all the sums using rules of probability. That’s it.

Lots of distributions spinning around in my head. What do I use when? Usually as likelihoods:
–Bernoulli – two options. Happens once.
–Categorical – many options. Happens once.
–Binomial – two options. Happens many times. Keep count.
–Multinomial – many options. Happens many times. Keep count.
–Uniform – only use on strictly bounded real quantities.
–Gaussian – use on potentially unbounded real quantities.
Usually as priors:
–Gamma – use on positive unbounded quantities (e.g. a variance).
–Beta distribution – bounded quantities between 0 and 1 (e.g. probabilities).
–Dirichlet distribution – multiple positive quantities that must add up to 1 (e.g. probabilities).
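A quick sketch of the families above, drawing one sample from each with NumPy (all parameter values here are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Likelihood-style distributions
bernoulli = rng.binomial(n=1, p=0.3)                 # two options, happens once
categorical = rng.choice(3, p=[0.2, 0.5, 0.3])       # many options, happens once
binomial = rng.binomial(n=10, p=0.3)                 # two options, 10 trials, keep count
multinomial = rng.multinomial(10, [0.2, 0.5, 0.3])   # many options, 10 trials, keep counts
uniform = rng.uniform(0.0, 1.0)                      # strictly bounded real quantity
gaussian = rng.normal(0.0, 1.0)                      # potentially unbounded real quantity

# Prior-style distributions
variance = rng.gamma(shape=2.0, scale=1.0)           # positive unbounded (e.g. a variance)
prob = rng.beta(a=2.0, b=2.0)                        # bounded in (0, 1) (e.g. a probability)
probs = rng.dirichlet([1.0, 1.0, 1.0])               # positive quantities summing to 1

assert 0.0 <= prob <= 1.0 and variance > 0.0
assert abs(probs.sum() - 1.0) < 1e-12
```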

What does conjugate prior mean? A prior class is conjugate to a specific likelihood class. Prior class = set of prior distributions, e.g. all Dirichlet distributions. Likelihood class = set of likelihoods, e.g. all categorical/multinomial likelihoods. Conjugate iff the posterior class is a subset of the prior class. Why useful? When we see some data, all we need to do is update the parameters of the prior to get the posterior.
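A minimal sketch of this for the Beta–Bernoulli pair (the function name and the numbers are made up for illustration): with a Beta(a, b) prior and coin-flip data, the posterior is again a Beta, with the counts simply added to the prior parameters.

```python
def beta_bernoulli_update(a, b, data):
    """a, b: Beta prior pseudo-counts; data: iterable of 0/1 outcomes.
    Conjugacy means the posterior is Beta(a + #heads, b + #tails)."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

# Start from a uniform prior Beta(1, 1) and observe 7 heads, 3 tails.
a_post, b_post = beta_bernoulli_update(1, 1, [1] * 7 + [0] * 3)
print(a_post, b_post)              # 8 4
print(a_post / (a_post + b_post))  # posterior mean = 8/12 ≈ 0.667
```

No integral was needed: seeing data only moved the prior's parameters, which is exactly why conjugate priors are convenient.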

What does argmax mean? s = argmax_t f(t) means: find the argument that maximises f(t). In other words, what value does t take where f attains its maximum? Set s to that value.
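A tiny illustration (the function f here is made up): note that argmax returns the *input* at which the maximum occurs, not the maximum value itself.

```python
import numpy as np

def f(t):
    return -(t - 2.0) ** 2  # peaks at t = 2, where f(t) = 0

ts = np.linspace(0.0, 4.0, 401)  # grid of candidate arguments
s = ts[np.argmax(f(ts))]         # argmax: the t where f is largest
print(s)                         # → 2.0 (the argument, not the value 0.0)
```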

Is prior P(p) similar to the maximum likelihood distribution P(D|p)? No. Best not to think of a "maximum likelihood distribution". Think of likelihoods. Think of maximum likelihood values for parameters. Strictly speaking the term likelihood is used for P(D|θ) as a function of the parameters θ. It is not a distribution over θ. It doesn't normalise. The prior P(θ) is a completely different concept.

Do we need to know how to do the integrals? They look really tough. Understand what we are doing with the integrals even if you can't actually do them. The Bernoulli/Beta integrals are not that hard and are done in the lecture; they are 1D. Gaussian integrals are painful: multi-dimensional. But adding, taking products, integrating variables out and multiplying by constants all conserve Gaussianity. So we can cheat: don't do the integrals, just match moments (means, covariances).

Do we need to know how to do the integrals? They look really tough. Suppose we have a really complicated model (here all quantities are vectors) that says that x is Gaussian with mean A*y (A a matrix) and covariance S, and y is the sum of two Gaussian variables r, s with mean 0 and covariances T_1 and T_2. Everything stays Gaussian, so moment matching gives the answer directly: y has mean 0 and covariance T_1 + T_2, and marginally x has mean 0 and covariance A (T_1 + T_2) A^T + S. No integral tables needed.
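A sketch of the moment-matching shortcut on a 2-D toy version of this model (the matrices A, S, T_1, T_2 below are arbitrary illustrative values), with a Monte Carlo check that the matched covariance really is what the integral would have given:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D instances of the quantities in the model (values are arbitrary).
A  = np.array([[1.0, 0.5], [0.0, 2.0]])
S  = np.array([[0.5, 0.0], [0.0, 0.5]])
T1 = np.array([[1.0, 0.2], [0.2, 1.0]])
T2 = np.array([[2.0, 0.0], [0.0, 0.5]])

# Moment matching: y = r + s is Gaussian with mean 0, covariance T1 + T2;
# x = A y + noise is Gaussian with mean A*0 = 0, covariance A (T1 + T2) A^T + S.
cov_x = A @ (T1 + T2) @ A.T + S

# Monte Carlo check of the same marginalisation integral.
n = 200_000
r = rng.multivariate_normal(np.zeros(2), T1, size=n)
s = rng.multivariate_normal(np.zeros(2), T2, size=n)
noise = rng.multivariate_normal(np.zeros(2), S, size=n)
x = (A @ (r + s).T).T + noise

print(cov_x)          # exact answer by moment matching
print(np.cov(x.T))    # sample covariance, close to cov_x
```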

Why do we need to do these integrals anyway? What does it do? A probability weighted integration (which is what all these are) basically says: We don't know what the parameter is, so we need to consider all possible values for it. But not all possible values are equal, some are more probable than others. So we need to weight each possible value by its probability. To consider all possible values we need to sum out over all these possibilities. In other words, for each possible parameter value we work out the implication of the model given that parameter value. Then we combine all these possibilities together in a sensible way (weighted average) to get the resulting belief about the thing we are interested in.
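The weighted-average idea can be seen numerically. A rough sketch for the coin example (flat prior, 7 heads and 3 tails, all numbers made up): for each candidate bias θ we work out the implication (probability of heads = θ), weight it by its posterior probability, and sum over a grid.

```python
import numpy as np

# Observed coin data: 7 heads, 3 tails; flat prior over the bias theta.
heads, tails = 7, 3
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)  # grid of possible parameter values
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                  # flat prior (unnormalised)
likelihood = theta**heads * (1 - theta)**tails
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta        # normalise so it integrates to one

# "Consider all values, weight each by its probability":
# P(next flip = heads | D) = integral of theta * P(theta | D) d theta
p_heads = (theta * posterior).sum() * dtheta
print(p_heads)  # ≈ (heads + 1) / (heads + tails + 2) = 2/3, not the ML value 0.7
```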

What is the difference between max likelihood and max posterior? When should I use what? Practically that is… Very little difference. ML: argmax_θ log P(D|θ). MPost: argmax_θ ( log P(D|θ) + log P(θ) ). Always use maximum posterior, if you must maximise. Use "priors" that avoid the extremes. Still has its problems.
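A sketch of the practical difference on a small sample (numbers invented for illustration), using the closed-form maximisers for a coin with a Beta(a, b) prior:

```python
# Coin bias from a small sample: 3 flips, all heads.
heads, tails = 3, 0

# Maximum likelihood: argmax log P(D | theta) has the closed form h / n.
theta_ml = heads / (heads + tails)

# Maximum posterior with a Beta(a, b) prior that avoids the extremes:
# argmax (log P(D | theta) + log P(theta)) = (h + a - 1) / (n + a + b - 2).
a, b = 2, 2
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)

print(theta_ml)   # 1.0 -- claims the coin can never land tails
print(theta_map)  # 0.8 -- pulled away from the extreme by the prior
```

With lots of data the two estimates agree; with little data the ML estimate happily sits at an extreme, which is why the slide says to prefer maximum posterior if you must maximise.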

Relate gradient computation and intuitive gradient direction. In two lectures' time.

Give an exam example of Bayesian model selection. Will provide a worked example of this.

Given a question or problem, how should we proceed? Real life:
–Look at the data.
–Decide on the problem.
–Decide on the method (generative, predictive).
–Decide on the model.
–Decide on the inference method (usually an approximation).
–Do it.
Exam?

Can you give us an example of this stuff in action? After next lecture.

That PCA stuff, structural equation models etc. What's that all about? Don't worry. The PPCA stuff is used to motivate the PCA algorithm (possibly badly). The other examples (SEM) were there just for information. Really all you need to know at this stage is the PCA algorithm and why it produces a lower-dimensional representation of the data.

What should I be reading? What are the important bits? Vital: the point. The high-level concepts, why we do what we do (Bishop ). A good idea of the form of the basic distributions (the Bernoulli, the Binomial, the Multinomial, the Beta, the Dirichlet). The Gaussian distribution: the form, what the mean and covariance are, what an eigen-decomposition of the covariance means (will say more next lecture; see also the notes online for this). The concept of exponential family: Bishop Chapter 2. Regression chapter (understanding at a high level the material in 3.3 and 3.4, but not the maths details; see also the notes online for this). It's as important to know what you don't know as what you do!