1
MLPR - Questions
2
Can you go through integration, differentiation etc.?
Why do we need priors? Difference between prior and posterior.
What does Bayesian actually mean?
Whoa. Lots of distributions spinning around in my head. Where, what, who, when, how?
What does conjugate prior mean?
What does argmax mean?
Is prior P(p) similar to the maximum likelihood distribution P(D|p)?
Do we need to know how to do the integrals? They look really tough. Why do we need to do these integrals anyway? What do they do?
What is the difference between max likelihood and max posterior? When should I use what? Practically, that is…
Relate gradient computation and intuitive gradient direction.
Give an exam example of Bayesian model selection.
Given a question or problem, how should we proceed?
Can you give us an example of this stuff in action?
That PCA stuff, structural equation models etc. WT*.
What should I be reading? What are the important bits?
3
Can you go through integration, differentiation etc.? No. Sorry.
4
Why do we need priors? Difference between prior and posterior. Have data D. Have questions Q. Must connect. Cox's axioms show any reasonable connection is probabilistic. Build a probability model P(D, Q). Problem solved. This is (in a general sense) the prior connection between D and Q. Reality: can't really write down P(D, Q). But can write down a model P(D | θ) that depends on unknown parameters θ. Also write P(Q | θ). Good. Now the data is related to the question through θ. P(D | θ) is the likelihood. Turns out this is not enough. To get the full distribution relating D and Q we also need P(θ). This is the prior. Without it, what can we say? Different θ = different predictions P(Q | θ). Which should we use? Could pick a particular θ (e.g. the maximum likelihood value). But this is cheating: replacing an unknown with a known. Self-deceptive. Not probabilistic.
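A sketch of that connection in symbols (my notation, and assuming D and Q are independent given θ): the prior P(θ) is what lets the two model pieces be joined,

P(D, Q) = \int P(D \mid \theta)\, P(Q \mid \theta)\, P(\theta)\, d\theta ,
\qquad
P(Q \mid D) \propto \int P(Q \mid \theta)\, P(D \mid \theta)\, P(\theta)\, d\theta .

Without P(θ) neither integral can even be written down.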
5
Why do we need priors? Difference between prior and posterior. What should we do? "Integrate out" or "marginalise". Note the rules of probability for (real-valued) random variables are really simple (sketched in symbols below):
–Product = independence.
–Conditioning.
–Marginalisation.
–Must integrate to one.
–Cumulative density.
That's it. The rest (e.g. Bayes' theorem) follows.
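A minimal sketch of those rules in symbols (my notation, for densities p):

p(a, b) = p(a)\, p(b) \text{ iff } a, b \text{ independent}   (product = independence)
p(a, b) = p(a \mid b)\, p(b)   (conditioning)
p(a) = \int p(a, b)\, db   (marginalisation)
\int p(a)\, da = 1   (must integrate to one)
F(a) = \int_{-\infty}^{a} p(a')\, da'   (cumulative density)

Bayes' theorem, p(b \mid a) = p(a \mid b)\, p(b) / p(a), indeed follows from applying the conditioning rule in both directions.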
6
What does Bayesian actually mean? Doing the above. Recognising the need for priors. Doing all the sums using rules of probability. That’s it.
7
Lots of distributions spinning around in my head. What do I use when?
Usually as likelihoods:
–Bernoulli - two options. Happens once.
–Multivariate (i.e. categorical) - many options. Happens once.
–Binomial - two options. Happens many times. Keep count.
–Multinomial - many options. Happens many times. Keep count.
–Uniform - only use on strictly bounded real quantities.
–Gaussian - use on potentially unbounded real quantities.
Usually as priors:
–Gamma - use on positive unbounded quantities (e.g. a variance).
–Beta distribution - bounded quantities between 0 and 1 (e.g. probabilities).
–Dirichlet distribution - multiple positive quantities that must add up to 1 (e.g. probabilities).
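A minimal sketch, assuming scipy is available, of evaluating each of these; every parameter value here is made up purely for illustration.

from scipy import stats

stats.bernoulli.pmf(1, p=0.3)                                # two options, happens once
stats.multinomial.pmf([0, 1, 0], n=1, p=[0.2, 0.3, 0.5])     # many options, happens once (categorical)
stats.binom.pmf(7, n=10, p=0.3)                              # two options, many times, keep count
stats.multinomial.pmf([2, 3, 5], n=10, p=[0.2, 0.3, 0.5])    # many options, many times, keep count
stats.uniform.pdf(0.4, loc=0.0, scale=1.0)                   # strictly bounded real quantity
stats.norm.pdf(1.5, loc=0.0, scale=1.0)                      # potentially unbounded real quantity
stats.gamma.pdf(2.0, a=3.0, scale=1.0)                       # positive unbounded quantity (e.g. a variance)
stats.beta.pdf(0.7, a=2.0, b=5.0)                            # quantity between 0 and 1 (e.g. a probability)
stats.dirichlet.pdf([0.2, 0.3, 0.5], alpha=[1.0, 2.0, 3.0])  # positive quantities summing to 1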
8
What does conjugate prior mean? A prior class is conjugate to a specific likelihood class. Prior class = a set of prior distributions, e.g. all Dirichlet distributions. Likelihood class = a set of likelihoods, e.g. all multinomial likelihoods. Conjugate iff the posterior class is a subset of the prior class. Why is this useful? When we see some data, all we need to do is update the parameters of the prior to get the posterior.
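A concrete sketch using the Beta–Bernoulli case (the one whose integrals are done in the lecture; notation mine): with a Beta(α, β) prior on a coin's heads probability θ, and data D of k heads in n flips,

p(\theta \mid D) \propto \underbrace{\theta^{k} (1-\theta)^{n-k}}_{P(D \mid \theta)} \; \underbrace{\theta^{\alpha - 1} (1-\theta)^{\beta - 1}}_{P(\theta)}
\;\Rightarrow\; \theta \mid D \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k).

The posterior is again a Beta, so "updating the parameters of the prior" is all that was needed.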
9
What does argmax mean? s = argmax_t f(t) means: find the argument that maximises f(t). In other words, what value does t take where f attains its maximum? Set s to that value.
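A tiny sketch of the same idea numerically, for a made-up f on a grid of candidate values (assuming numpy):

import numpy as np

t = np.linspace(-5, 5, 1001)    # candidate arguments
f = -(t - 3.0) ** 2             # a made-up f; its maximum is at t = 3
s = t[np.argmax(f)]             # argmax: the argument where f is largest
print(s, f.max())               # s is about 3.0; the maximum value itself is about 0.0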
10
Is prior P(p) similar to the maximum likelihood distribution P(D|p)? No. Best not to think of a "maximum likelihood distribution". Think of likelihoods. Think of maximum likelihood values for parameters. Strictly speaking, the term likelihood is used for P(D | θ) as a function of the parameters θ. It is not a distribution over θ. It doesn't normalise. The prior P(θ) is a completely different concept.
11
Do we need to know how to do the integrals? They look really tough. Understand what we are doing with the integrals even if you can't actually do them. The integrals in the Bernoulli–Beta material are not that hard and are done in the lecture. They are 1D. Gaussian integrals are painful. Multi-dimensional. But adding, taking products, integrating variables out, and multiplying by constants all conserve Gaussianity. So we can cheat: don't do the integrals, just match moments (means, covariances).
12
Do we need to know how to do the integrals? They look really tough. Suppose we have a really complicated model (here all quantities are vectors) that says that x is Gaussian with mean A*y (A a matrix) and covariance S, and that y is the sum of two Gaussian variables r and s, each with mean 0 and with covariances T_1 and T_2.
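The slide stops at the set-up; as a hedged completion (mine, not from the slide, and assuming r and s are independent), moment matching gives the marginals without any explicit integration. Sums of independent zero-mean Gaussians are Gaussian, so

y = r + s \sim \mathcal{N}(0,\; T_1 + T_2),
\qquad
x \sim \mathcal{N}\!\left(0,\; A (T_1 + T_2) A^{\top} + S\right),

using E[x] = A\,E[y] = 0 and \mathrm{Cov}[x] = A\,\mathrm{Cov}[y]\,A^{\top} + S.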
13
Why do we need to do these integrals anyway? What do they do? A probability-weighted integration (which is what all of these are) basically says: we don't know what the parameter is, so we need to consider all possible values for it. But not all possible values are equal; some are more probable than others. So we weight each possible value by its probability. To consider all possible values we sum (integrate) over all these possibilities. In other words, for each possible parameter value we work out the implication of the model given that parameter value. Then we combine all these possibilities together in a sensible way (a weighted average) to get the resulting belief about the thing we are interested in.
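A minimal sketch of that weighted average in code (a made-up coin-flip example on a grid of parameter values, assuming numpy):

import numpy as np

theta = np.linspace(0.01, 0.99, 99)           # candidate parameter values (coin bias)
prior = np.ones_like(theta)                   # flat prior, unnormalised
k, n = 7, 10                                  # observed data: 7 heads in 10 flips
likelihood = theta**k * (1 - theta)**(n - k)  # P(D | theta) at each candidate value
posterior = prior * likelihood
posterior /= posterior.sum()                  # normalise so the weights sum to one

# Question Q: probability that the next flip is heads.
p_next_head = np.sum(theta * posterior)       # weighted average over all theta
print(p_next_head)                            # close to (k + 1) / (n + 2), about 0.667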
14
What is the difference between max likelihood and max posterior? When should I use what? Practically, that is… Very little difference. ML: argmax_θ log P(D | θ). MPost: argmax_θ ( log P(D | θ) + log P(θ) ). Always use maximum posterior, if you must maximise. Use "priors" that avoid the extremes. Still has its problems.
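As a concrete sketch (the Beta–Bernoulli case again; the numbers are mine, not from the slide): with k heads in n flips and a Beta(α, β) prior,

\hat{\theta}_{\mathrm{ML}} = \frac{k}{n},
\qquad
\hat{\theta}_{\mathrm{MAP}} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}.

With α = β = 1 (a flat prior) the two coincide. With, say, α = β = 2 and k = 0, maximum likelihood gives the extreme value 0 while the maximum posterior gives 1/(n + 2): that is what a prior that "avoids the extremes" buys you.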
15
Relate gradient computation and intuitive gradient direction. In two lectures' time.
16
Give an exam example of Bayesian model selection. A worked example of this will be provided.
17
Given a question or problem, how should we proceed? Real life:
–Look at the data.
–Decide on the problem.
–Decide on a method (generative, predictive).
–Decide on a model.
–Decide on an inference method (usually an approximation).
–Do it.
Exam?
18
Can you give us an example of this stuff in action? After next lecture.
19
That PCA stuff, structural equation models etc. What's that all about? Don't worry. The PPCA stuff is used to motivate the PCA algorithm (possibly badly). The other examples (SEM) were there just for information. Really, all you need to know at this stage is the PCA algorithm and why it gives a lower-dimensional representation.
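A minimal sketch of the PCA algorithm itself (synthetic data, assuming numpy): centre the data, form the covariance, eigen-decompose it, keep the top directions, project.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 points in 5 dimensions (synthetic)
Xc = X - X.mean(axis=0)              # centre the data
C = Xc.T @ Xc / (len(X) - 1)         # sample covariance matrix (5 x 5)

evals, evecs = np.linalg.eigh(C)     # eigen-decomposition (eigenvalues ascending)
order = np.argsort(evals)[::-1]      # largest eigenvalue = direction of most variance
W = evecs[:, order[:2]]              # keep the top 2 principal directions

Z = Xc @ W                           # lower-dimensional representation (100 x 2)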
20
What should I be reading? What are the important bits?
Vital - the point. The high-level concepts, why do what. (Bishop 1.1, 1.2, 1.3.)
A good idea of the form of the basic distributions (the Bernoulli, the Binomial, the Multinomial, the Beta, the Dirichlet).
The Gaussian distribution: the form, what the mean and covariance are, and what an eigen-decomposition of the covariance means (more next lecture; see also the notes online for this).
The concept of the exponential family. Bishop Chapter 2.
Regression: Bishop 3.1 and 3.6 (understand the high-level material in 3.3 and 3.4, but not the maths details; see also the notes online for this).
It's as important to know what you don't know as what you do!