
1 CSC321 Lecture 8: The Bayesian way to fit models (Geoffrey Hinton)

2 The Bayesian framework
The Bayesian framework assumes that we always have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
– The likelihood term takes into account how probable the observed data is given the parameters of the model. It favors parameter settings that make the data likely, so it fights the prior. With enough data, the likelihood term always wins.

3 A coin tossing example
Suppose you know nothing about coins except that each toss produces a head with some unknown probability p and a tail with probability 1-p.
Suppose we observe 100 tosses and there are 53 heads. What is p?
The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable. This maximum likelihood estimate is p = 53/100 = 0.53.
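A minimal sketch (not from the lecture) of the frequentist answer: a grid search over candidate values of p confirms that the likelihood of 53 heads in 100 tosses peaks at p = 53/100. The grid and variable names are purely illustrative.

```python
# Maximum-likelihood estimate of the coin's head probability by grid search.
import numpy as np

h, n = 53, 100
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = h * np.log(p_grid) + (n - h) * np.log(1 - p_grid)

p_ml = p_grid[np.argmax(log_lik)]
print(p_ml)       # ~0.53, i.e. h / n
print(h / n)      # 0.53
```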

4 Some problems with picking the parameters that are most likely to generate the data
What if we only tossed the coin once and we got 1 head?
– Is p=1 a sensible answer? Surely p=0.5 is a much better answer.
Is it reasonable to give a single answer?
– If we don't have much data, we are unsure about p.
– Our computations will work much better if we take this uncertainty into account.

5 Using a distribution over parameters
Start with a prior distribution over p.
Multiply the prior probability of each parameter value by the probability of observing a head given that value.
Then renormalize to get the posterior distribution.
[Figure: probability density over p on [0, 1]; the uniform prior has height 1 and area 1.]

6 Let's do it again
Start with a prior distribution over p (the posterior from the previous toss).
Multiply the prior probability of each parameter value by the probability of observing a tail given that value.
Then renormalize to get the posterior distribution.
[Figure: probability density over p on [0, 1] after one head and one tail; area = 1.]

7 Let's do it another 98 times
After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior).
[Figure: probability density over p on [0, 1], now sharply peaked near 0.53; area = 1.]
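A minimal sketch (not from the lecture) of the update described on slides 5-7, using a discretized grid of values for p: start from a uniform prior, multiply by the likelihood of each observed toss, and renormalize. The grid resolution and variable names are illustrative choices.

```python
# Grid-based Bayesian update for the coin: prior * likelihood, then renormalize.
import numpy as np

p = np.linspace(0.001, 0.999, 999)   # grid of candidate values for p
posterior = np.ones_like(p)          # uniform prior
posterior /= posterior.sum()

tosses = [1] * 53 + [0] * 47         # 53 heads (1) and 47 tails (0)
for t in tosses:
    likelihood = p if t == 1 else (1 - p)
    posterior *= likelihood          # combine prior with the likelihood term
    posterior /= posterior.sum()     # renormalize

print(p[np.argmax(posterior)])       # peak of the posterior, ~0.53
```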

8 Bayes' Theorem
For a weight vector W and observed data D, the joint probability can be factored either way: p(D) p(W | D) = p(D, W) = p(W) p(D | W).
Rearranging gives the posterior probability of the weight vector:
  p(W | D) = p(W) p(D | W) / p(D)
where p(W) is the prior probability of weight vector W, p(D | W) is the probability of the observed data given W, and p(D) just normalizes the posterior.

9 Why we maximize sums of log probs
We want to maximize the product of the probabilities of a set of independent events:
– Assume the output errors on different training cases are independent.
– Assume the priors on the weights are independent.
Because the log function is monotonic, maximizing this product is the same as maximizing the sum of the log probabilities.
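A hedged illustration (not from the lecture) of why sums of log probabilities are used in practice: the product of many small probabilities underflows to zero in floating point, while the sum of their logs stays finite, and because log is monotonic both are maximized by the same parameter setting.

```python
# Product of many probabilities vs. sum of their logs.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.2, size=2000)   # probabilities of independent events

print(np.prod(probs))          # underflows to 0.0
print(np.sum(np.log(probs)))   # a large negative but finite number
```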

10 The Bayesian interpretation of weight decay
Take the negative log of the posterior: -log p(W | D) = -log p(D | W) - log p(W) + const.
Assuming that the model makes a Gaussian prediction, -log p(D | W) is the sum of squared residuals divided by 2σ_D² (plus a constant).
Assuming a Gaussian prior for the weights, -log p(W) is the sum of squared weights divided by 2σ_W² (plus a constant).
So finding the most probable weights is the same as minimizing
  E = Σ_c (y_c - d_c)² + (σ_D² / σ_W²) Σ_i w_i²,
i.e. squared error plus weight decay, with the weight-decay coefficient equal to the ratio of the two variances.
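A minimal sketch (not from the lecture) of the cost derived above: squared error plus weight decay, with the decay coefficient equal to the ratio of the output-noise variance to the weight-prior variance. The function name and the toy numbers are invented for illustration.

```python
# Negative log posterior (up to additive constants), rescaled by 2*sigma_d**2.
import numpy as np

def map_cost(y, d, w, sigma_d=1.0, sigma_w=10.0):
    squared_error = np.sum((y - d) ** 2)
    weight_decay = (sigma_d ** 2 / sigma_w ** 2) * np.sum(w ** 2)
    return squared_error + weight_decay

y = np.array([0.9, 0.2, 0.4])   # model's predictions
d = np.array([1.0, 0.0, 0.5])   # correct answers
w = np.array([0.3, -1.2, 0.7])  # weights
print(map_cost(y, d, w))
```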

11 Maximum Likelihood Learning
Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model's prediction. This is maximum likelihood.
If the model predicts y and the correct answer is d, then under a Gaussian with variance σ², -log p(d | y) = (d - y)² / (2σ²) + const.
[Figure: a Gaussian centered at the model's prediction y, with the correct answer d marked on the axis.]
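A small numerical check (not from the lecture, and it assumes SciPy is available) that the negative Gaussian log-likelihood of the correct answers equals the squared error scaled by 1/(2σ²), plus a constant.

```python
# Squared error vs. negative Gaussian log-likelihood of the correct answers.
import numpy as np
from scipy.stats import norm

sigma = 1.0
y = np.array([0.9, 0.2, 0.4])   # model's predictions
d = np.array([1.0, 0.0, 0.5])   # correct answers

neg_log_lik = -np.sum(norm.logpdf(d, loc=y, scale=sigma))
squared_error = np.sum((d - y) ** 2)
constant = 0.5 * len(d) * np.log(2 * np.pi * sigma ** 2)

print(neg_log_lik)
print(squared_error / (2 * sigma ** 2) + constant)   # same value
```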

12 Maximum A Posteriori Learning
This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term.
Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior.
[Figure: a zero-mean Gaussian prior p(w) over a weight w.]
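Similarly, a small check (not from the lecture, again assuming SciPy) that the log probability of the weights under a zero-mean Gaussian prior is the negative sum of squared weights scaled by 1/(2σ_W²), plus a constant.

```python
# Sum of squared weights vs. log probability under a zero-mean Gaussian prior.
import numpy as np
from scipy.stats import norm

sigma_w = 1.0
w = np.array([0.3, -1.2, 0.7])

log_prior = np.sum(norm.logpdf(w, loc=0.0, scale=sigma_w))
constant = -0.5 * len(w) * np.log(2 * np.pi * sigma_w ** 2)
print(log_prior)
print(-np.sum(w ** 2) / (2 * sigma_w ** 2) + constant)   # same value
```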

13 Full Bayesian Learning
Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings.
– This is extremely computationally intensive for all but the simplest models (it is feasible for a biased coin).
To make predictions, let each different setting of the parameters make its own prediction, and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters.
– This is also computationally intensive.
The full Bayesian approach allows us to use complicated models even when we do not have much data.
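A minimal sketch (not from the lecture) of full Bayesian prediction for the biased coin: every value of p on the grid makes its own prediction for the next toss, and the predictions are combined, weighted by the posterior probability of that value.

```python
# Posterior-weighted prediction of the next toss for the biased coin.
import numpy as np

p = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(p) / p.size          # uniform prior over the grid

h, n = 53, 100                            # 53 heads in 100 tosses
posterior = prior * p**h * (1 - p)**(n - h)
posterior /= posterior.sum()

p_next_head = np.sum(posterior * p)       # each p predicts p; weight by posterior
print(p_next_head)                        # ~0.53, the posterior mean (h+1)/(n+2)
```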

14 Overfitting: A frequentist illusion?
If you do not have much data, you should use a simple model, because a complex one will overfit.
– This is true, but only if you assume that fitting a model means choosing a single best setting of the parameters.
– If you use the full posterior over parameter settings, overfitting disappears!
– With little data, you get very vague predictions because many different parameter settings have significant posterior probability.

15 A classic example of overfitting
Which model do you believe?
– The complicated model fits the data better.
– But it is not economical and it makes silly predictions.
But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution?
– Now we get vague and sensible predictions.
There is no reason why the amount of data should influence our prior beliefs about the complexity of the model.
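A hedged sketch (not from the lecture) of the idea on slide 15, using Bayesian linear regression on fifth-order polynomial features with a Gaussian prior over the coefficients; the training data, the prior precision alpha, and the noise precision beta are invented for illustration. Away from the few data points the predictive standard deviation is large, i.e. the predictions are vague rather than silly.

```python
# Bayesian fit of a fifth-order polynomial: predictive mean and variance.
import numpy as np

def poly_features(x, degree=5):
    return np.vander(x, degree + 1, increasing=True)   # [1, x, x^2, ..., x^5]

rng = np.random.default_rng(0)
x_train = np.array([-1.0, -0.5, 0.0, 0.4, 1.0])
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(x_train.size)

alpha, beta = 1.0, 100.0             # prior precision on weights, noise precision
Phi = poly_features(x_train)

# Posterior over the coefficients (standard Bayesian linear regression).
S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)
m = beta * S @ Phi.T @ y_train

# Predictive mean and variance on a test grid.
x_test = np.linspace(-2, 2, 9)
Phi_test = poly_features(x_test)
pred_mean = Phi_test @ m
pred_var = 1.0 / beta + np.sum(Phi_test @ S * Phi_test, axis=1)

for x, mu, var in zip(x_test, pred_mean, pred_var):
    print(f"x={x:+.1f}  mean={mu:+.2f}  std={np.sqrt(var):.2f}")
```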

