Bayesian models of inductive learning


Bayesian models of inductive learning Tom Griffiths UC Berkeley Josh Tenenbaum MIT Charles Kemp MIT

What to expect
What you’ll get out of this tutorial:
- Our view of what Bayesian models have to offer cognitive science.
- In-depth examples of basic and advanced models: how the math works & what it buys you.
- Some (not extensive) comparison to other approaches.
- Opportunities to ask questions.
What you won’t get:
- Detailed, hands-on how-to.
Where you can learn more:
- http://bayesiancognition.com
- Trends in Cognitive Sciences, July 2006, special issue on “Probabilistic Models of Cognition”.

Outline
Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian inference (Josh)
- Graphical models, causal inference and learning (Tom)
Afternoon
- Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
- Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)

Why Bayes?
The problem of induction: How does the mind form inferences, generalizations, models or theories about the world from impoverished data?
Induction is ubiquitous in cognition:
- Vision (+ audition, touch, or other perceptual modalities)
- Language (understanding, production)
- Concepts (semantic knowledge, “common sense”)
- Causal learning and reasoning
- Decision-making and action (production, understanding)
Bayes gives a general framework for explaining how induction can work in principle, and perhaps, how it does work in the mind….

Grammar G → Phrase structure S → Utterance U
P(S | G): top-down. P(U | S): bottom-up.
P(S | U, G) ∝ P(U | S) x P(S | G)

“Universal Grammar” → Grammar → Phrase structure → Utterance → Speech signal
P(grammar | UG)
P(phrase structure | grammar)
P(utterance | phrase structure)
P(speech | utterance)
Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG) (cf. Chater and Manning, 2006)

The approach
Key concepts
- Inference in probabilistic generative models
- Hierarchical probabilistic models, with inference at all levels of abstraction
- Structured knowledge representations: graphs, grammars, predicate logic, schemas, theories
- Flexible structures, with complexity constrained by Bayesian Occam’s razor
- Approximate methods of learning and inference: Expectation-Maximization (EM), Markov chain Monte Carlo (MCMC)
Much recent progress!
- Computational resources to implement and test models that we could dream up but not realistically imagine working with
- New theoretical tools let us develop models that we could not clearly conceive of before.

Vision as probabilistic parsing (Han and Zhu, 2006)

Word learning on planet Gazoob “tufa” Can you pick out the tufas?

Learning word meanings Whole-object principle Shape bias Taxonomic principle Contrast principle Basic-level bias Principles Structure Data

Causal learning and reasoning Principles Structure Data

Goal-directed action (production and comprehension) (Wolpert et al., 2003)

Marr’s Three Levels of Analysis
Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?”
Algorithm: Cognitive psychology
Implementation: Neurobiology
Cognitive psychology traditionally focuses on the 2nd level. But this is unsatisfying, for the same reason as in vision:
- Ad hoc models, with arbitrary assumptions and free parameters.
- Lots of different models with little sense of how they all fit together, or what the real differences are.
Cognitive neuroscience has focused on the link between levels 2 and 3. But that doesn’t address what is to me the biggest mystery: how we are able to succeed in these inductive inference tasks, given that induction has been called a puzzle, a paradox, a scandal, going back to Plato and Aristotle. Scandal though it might be, we do these things on a daily basis. Explaining that requires level 1. But outside of vision or language, there is not very much work on level 1, or on the link from level 1 to levels 2 and 3. Such accounts lack explanatory adequacy: they describe how the mind works, but don’t explain why it works that way, or why these inference strategies lead to success in the real world.

Alternative approaches to inductive learning and inference Associative learning Connectionist networks Similarity to examples Toolkit of simple heuristics Constraint satisfaction Analogical mapping

Summary: Why Bayes?
A unifying framework for explaining cognition.
- How people can learn so much from such limited data.
- Strong quantitative models with minimal ad hoc assumptions.
- Why algorithmic-level models work the way they do.
A framework for understanding how structured knowledge and statistical inference interact.
- How structured knowledge guides statistical inference, and may itself be acquired through statistical means.
- What forms knowledge takes, at multiple levels of abstraction.
- What knowledge must be innate, and what can be learned.
- How flexible knowledge structures may grow as required by the data, with complexity controlled by Occam’s razor.

Outline
Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian inference (Josh)
- Graphical models, causal inference and learning (Tom)
Afternoon
- Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
- Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)

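Bayes’ rule
For any hypothesis h and data d,
$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in H} P(d \mid h')\, P(h')}$
Posterior probability: P(h | d). Likelihood: P(d | h). Prior probability: P(h). The denominator is a sum over the space H of alternative hypotheses.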

Bayesian inference
Bayes’ rule: an example
Data: John is coughing
Some hypotheses:
1. John has a cold
2. John has emphysema
3. John has a stomach flu
Prior P(h) favors 1 and 3 over 2
Likelihood P(d|h) favors 1 and 2 over 3
Posterior P(h|d) favors 1 over 2 and 3
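The coughing example can be sketched numerically. The slide gives only the ordering of the priors and likelihoods, so the numbers below are illustrative assumptions chosen to respect that ordering, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a finite hypothesis space."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # sum over alternative hypotheses
    return {h: value / evidence for h, value in unnormalized.items()}

# Assumed values: the prior favors cold and stomach flu over emphysema,
# the likelihood of coughing favors cold and emphysema over stomach flu.
priors = {"cold": 0.60, "emphysema": 0.05, "stomach flu": 0.35}
likelihoods = {"cold": 0.80, "emphysema": 0.90, "stomach flu": 0.10}

post = posterior(priors, likelihoods)
for hypothesis, probability in post.items():
    print(f"P({hypothesis} | coughing) = {probability:.3f}")
```

Only the cold is favored by both the prior and the likelihood, so it dominates the posterior under these assumed numbers.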

Coin flipping
Basic Bayes: data = HHTHT or HHHHH; compare two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Parameter estimation (Model fitting): compare many hypotheses in a parameterized family P(H) = q: infer q
Model selection: compare qualitatively different hypotheses, often varying in complexity: P(H) = 0.5 vs. P(H) = q

Coin flipping HHTHT HHHHH What process produced these sequences?

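Comparing two simple hypotheses
Contrast simple hypotheses:
h1: “fair coin”, P(H) = 0.5
h2: “always heads”, P(H) = 1.0
Bayes’ rule: with two hypotheses, use odds form:
$\frac{P(h_1 \mid D)}{P(h_2 \mid D)} = \frac{P(D \mid h_1)}{P(D \mid h_2)} \times \frac{P(h_1)}{P(h_2)}$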

Comparing two simple hypotheses
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = ?
P(D|H2) = 0, P(H2) = 1-?

Comparing two simple hypotheses
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000

Comparing two simple hypotheses
D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000

Comparing two simple hypotheses
D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
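The three comparisons above can be worked through with the odds form of Bayes' rule. This is a minimal sketch using the priors given on the slides (999/1000 for the fair coin); the function names are hypothetical, and flips are assumed independent.

```python
import math
from fractions import Fraction

def likelihood(seq, p_heads):
    """P(seq | coin with P(H) = p_heads), flips assumed independent."""
    result = Fraction(1)
    for flip in seq:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result

def posterior_odds(seq, prior_h1=Fraction(999, 1000)):
    """Posterior odds P(H1 | D) / P(H2 | D): fair coin vs. always heads."""
    numerator = likelihood(seq, Fraction(1, 2)) * prior_h1
    denominator = likelihood(seq, Fraction(1)) * (1 - prior_h1)
    if denominator == 0:
        return math.inf  # the data rule out "always heads" entirely
    return numerator / denominator

for seq in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(seq, float(posterior_odds(seq)))
```

A single tail makes the odds infinite in favor of the fair coin; five heads still leave odds of about 31 to 1 for the fair coin; ten heads bring the odds to roughly even.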

The role of intuitive theories The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works. Easy to imagine how a trick all-heads coin could work: high prior probability. Hard to imagine how a trick “HHTHT” coin could work: low prior probability.

Coin flipping
Basic Bayes: data = HHTHT or HHHHH; compare two hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Parameter estimation (Model fitting): compare many hypotheses in a parameterized family P(H) = q: infer q
Model selection: compare qualitatively different hypotheses, often varying in complexity: P(H) = 0.5 vs. P(H) = q

Parameter estimation
Assume data are generated from a parameterized model: P(H) = q, where q generates each flip d1, d2, d3, d4.
What is the value of q?
- each value of q is a hypothesis H
- requires inference over infinitely many hypotheses

Model selection
Assume hypothesis space of possible models:
- Fair coin: P(H) = 0.5, generating d1, d2, d3, d4
- P(H) = q: unknown q generating d1, d2, d3, d4
- Hidden Markov model: hidden states s1, s2, s3, s4 generating d1, d2, d3, d4, with si ∈ {Fair coin, Trick coin}
Which model generated the data?
- requires summing out hidden variables
- requires some form of Occam’s razor to trade off complexity with fit to the data.

Parameter estimation vs. Model selection across learning and development
- Causality: learning the strength of a relation vs. learning the existence and form of a relation
- Language acquisition: learning a speaker's accent, or frequencies of different words vs. learning a new tense or syntactic rule (or learning a new language, or the existence of different languages)
- Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species)
- Intuitive physics: learning the mass of an object vs. learning about gravity or angular momentum
- Intuitive psychology: learning a person’s beliefs or goals vs. learning that there can be false beliefs, or that visual access is valuable for establishing true beliefs

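A hierarchical learning framework
Levels: model → parameters → data
Parameter estimation: the step from model to parameters

A hierarchical learning framework
Levels: model class → model → parameters → data
Model selection: the step from model class to model
Parameter estimation: the step from model to parameters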

Bayesian parameter estimation
Assume data are generated from a model: P(H) = q, where q generates each flip d1, d2, d3, d4.
What is the value of q?
- each value of q is a hypothesis H
- requires inference over infinitely many hypotheses

Some intuitions D = 10 flips, with 5 heads and 5 tails. q = P(H) on next flip? 50% Why? 50% = 5 / (5+5) = 5/10. Why? “The future will be like the past” Suppose we had seen 4 heads and 6 tails. P(H) on next flip? Closer to 50% than to 40%. Why? Prior knowledge.

Integrating prior knowledge and data Posterior distribution P(q | D) is a probability density over q = P(H) Need to work out likelihood P(D | q ) and specify prior distribution P(q )

Likelihood and prior
Likelihood: Bernoulli distribution
P(D | q) = q^NH (1-q)^NT
NH: number of heads
NT: number of tails
Prior: P(q) ∝ ?

Some intuitions
D = 10 flips, with 5 heads and 5 tails. q = P(H) on next flip? 50%
Why? 50% = 5 / (5+5) = 5/10.
Why? Maximum likelihood: the estimate q = NH / (NH+NT) maximizes P(D | q).
Suppose we had seen 4 heads and 6 tails. P(H) on next flip? Closer to 50% than to 40%.
Why? Prior knowledge.

A simple method of specifying priors
Imagine some fictitious trials, reflecting a set of previous experiences
- a strategy often used with neural networks or for building invariance into machine vision.
e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
In fact, this is a sensible statistical idea...

Likelihood and prior
Likelihood: Bernoulli(q) distribution
P(D | q) = q^NH (1-q)^NT
NH: number of heads
NT: number of tails
Prior: Beta(FH,FT) distribution
P(q) ∝ q^(FH-1) (1-q)^(FT-1)
FH: fictitious observations of heads
FT: fictitious observations of tails

Shape of the Beta prior

Shape of the Beta prior
[Figure: Beta(FH,FT) densities over q, for FH = 0.5, FT = 0.5 and for FH = 0.5, FT = 2]
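Since the plots did not survive in this transcript, the shapes can be recovered by evaluating the Beta density directly. This is a minimal sketch using only the standard library; `beta_pdf` is a hypothetical helper and is valid only for 0 < q < 1.

```python
import math

def beta_pdf(q, fh, ft):
    """Density of Beta(fh, ft) at q, for 0 < q < 1."""
    log_norm = math.lgamma(fh + ft) - math.lgamma(fh) - math.lgamma(ft)
    log_kernel = (fh - 1) * math.log(q) + (ft - 1) * math.log(1 - q)
    return math.exp(log_norm + log_kernel)

grid = [0.05, 0.25, 0.5, 0.75, 0.95]
for fh, ft in [(0.5, 0.5), (0.5, 2)]:
    values = ", ".join(f"{beta_pdf(q, fh, ft):.2f}" for q in grid)
    print(f"Beta({fh}, {ft}) at q = {grid}: {values}")
```

With FH = FT = 0.5 the density is U-shaped, piling mass near q = 0 and q = 1; with FH = 0.5, FT = 2 it is concentrated near q = 0.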

Bayesian parameter estimation
P(q | D) ∝ P(D | q) P(q) = q^(NH+FH-1) (1-q)^(NT+FT-1)
Posterior is Beta(NH+FH, NT+FT): same form as prior!
expected P(H) = (NH+FH) / (NH+FH+NT+FT)

Conjugate priors
A prior p(q) is conjugate to a likelihood function p(D | q) if the posterior has the same functional form as the prior.
- Parameter values in the prior can be thought of as a summary of “fictitious observations”.
- Different parameter values in the prior and posterior reflect the impact of observed data.
- Conjugate priors exist for many standard models (e.g., all exponential family models)

Some examples
e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair
After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%
Prior knowledge too weak

But… flipping thumbtacks
e.g., F = {4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads
After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%
Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
Suppose F = { }: After seeing 1 head, 0 tails, P(H) on next flip = 1 / (1+0) = 100%
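The coin and thumbtack examples all use the same rule: add the fictitious counts F to the observed counts and take the proportion of heads. A minimal sketch, with a hypothetical function name:

```python
def predictive_heads(n_heads, n_tails, f_heads, f_tails):
    """Expected P(H) on the next flip under a Beta(f_heads, f_tails) prior."""
    return (n_heads + f_heads) / (n_heads + f_heads + n_tails + f_tails)

examples = [
    ("coin, F = {1000 H, 1000 T}, data 4 H / 6 T", (4, 6, 1000, 1000)),
    ("coin, F = {3 H, 3 T}, data 4 H / 6 T", (4, 6, 3, 3)),
    ("tack, F = {4 H, 3 T}, data 2 H / 0 T", (2, 0, 4, 3)),
    ("F = { }, data 1 H / 0 T", (1, 0, 0, 0)),
]
for label, counts in examples:
    print(f"{label}: P(H) = {predictive_heads(*counts):.4f}")
```

The empty-prior case corresponds to the maximum likelihood estimate and jumps to certainty after a single flip, which is the hasty conclusion the slide warns about.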

Origin of prior knowledge Tempting answer: prior experience Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails

Problems with simple empiricism
- Haven’t really seen 2000 coin flips, or any flips of a thumbtack: prior knowledge is stronger than raw experience justifies.
- Haven’t seen exactly equal number of heads and tails: prior knowledge is smoother than raw experience justifies.
- Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins: prior knowledge is more structured than raw experience.

A simple theory “Coins are manufactured by a standardized procedure that is effective but not perfect, and symmetric with respect to heads and tails. Tacks are asymmetric, and manufactured to less exacting standards.” Justifies generalizing from previous coins to the present coin. Justifies smoother and stronger prior than raw experience alone. Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.

A hierarchical Bayesian model
physical knowledge → FH, FT → Coins: q ~ Beta(FH,FT) → Coin 1 (q1), Coin 2 (q2), ..., Coin 200 (q200) → flips d1 d2 d3 d4 for each coin
Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (FH, FT).
Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about FH, FT.

Stability versus Flexibility
Can all domain knowledge be represented with conjugate priors?
Suppose you flip a coin 25 times and get all heads. Something funny is going on …
But with F = {1000 heads, 1000 tails}, P(heads) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
How do we balance stability and flexibility?
Stability: 6 heads, 4 tails → q ~ 0.5
Flexibility: 25 heads, 0 tails → q ~ 1

A hierarchical Bayesian model
fair/unfair? → FH, FT → q → d1 d2 d3 d4
Higher-order hypothesis: is this coin fair or unfair?
Example probabilities:
P(fair) = 0.99
P(q | fair) is Beta(1000,1000)
P(q | unfair) is Beta(1,1)
25 heads in a row propagates up, affecting q and then P(fair | D):
$\frac{P(\text{fair} \mid 25\ \text{heads})}{P(\text{unfair} \mid 25\ \text{heads})} = \frac{P(25\ \text{heads} \mid \text{fair})}{P(25\ \text{heads} \mid \text{unfair})} \times \frac{P(\text{fair})}{P(\text{unfair})} \approx 9 \times 10^{-5}$
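The posterior odds on this slide can be checked by integrating q out under each Beta prior. The sketch below uses the standard closed form for a Beta-Bernoulli marginal likelihood, B(NH+FH, NT+FT) / B(FH, FT), computed in log space; the function names are hypothetical, and the priors and P(fair) = 0.99 are the example values from the slide.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(n_heads, n_tails, fh, ft):
    """log P(sequence | q ~ Beta(fh, ft)), with q integrated out."""
    return log_beta(n_heads + fh, n_tails + ft) - log_beta(fh, ft)

def posterior_odds_fair(n_heads, n_tails, p_fair=0.99):
    """Posterior odds P(fair | D) / P(unfair | D)."""
    log_fair = log_marginal(n_heads, n_tails, 1000, 1000)
    log_unfair = log_marginal(n_heads, n_tails, 1, 1)
    return math.exp(log_fair - log_unfair) * p_fair / (1 - p_fair)

print(f"25 heads, 0 tails: odds fair/unfair = {posterior_odds_fair(25, 0):.1e}")
print(f"6 heads, 4 tails:  odds fair/unfair = {posterior_odds_fair(6, 4):.1f}")
```

Under these assumptions 25 heads in a row overturn the 99-to-1 prior in favor of fairness, while 6 heads and 4 tails leave the fair hypothesis strongly favored: the model is stable for ordinary data and flexible for surprising data.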

Summary: Bayesian parameter estimation
- Learning the parameters of a generative model as Bayesian inference.
- Conjugate priors are an elegant way to represent simple kinds of prior knowledge.
- Hierarchical Bayesian models integrate knowledge across instances of a system, or different systems within a domain, and can represent richer, more abstract knowledge.

Some questions Learning isn’t just about parameter estimation How do we learn the functional form of a variable’s distribution? How do we learn model structure, or theories with the expressiveness of predicate logic? Can we “grow” levels of abstraction?

Coin flipping
Basic Bayes: data = HHTHT or HHHHH; compare two hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Parameter estimation: compare many hypotheses in a parameterized family P(H) = q: infer q
Model selection: compare qualitatively different hypotheses, often varying in complexity: P(H) = 0.5 vs. P(H) = q

A hierarchical learning framework
Levels: model class → model → parameters → data
Model selection: the step from model class to model
Model fitting: the step from model to parameters

Bayesian model selection
Fair coin, P(H) = 0.5, generating d1 d2 d3 d4 vs. P(H) = q, with unknown q generating d1 d2 d3 d4
Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = q?

Comparing simple and complex hypotheses
P(H) = q is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = q
- for any observed sequence D, we can choose q such that D is at least as probable as under P(H) = 0.5 (and usually more probable)

Comparing simple and complex hypotheses
[Figure: probability of D = HHHHH under q = 0.5]

Comparing simple and complex hypotheses
[Figure: probability of D = HHHHH under q = 1.0 and under q = 0.5]

Comparing simple and complex hypotheses
[Figure: probability of D = HHTHT under q = 0.6 and under q = 0.5]
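The comparisons in these figures reduce to evaluating the Bernoulli likelihood at particular values of q. A minimal sketch, with a hypothetical function name:

```python
def sequence_likelihood(seq, q):
    """P(seq | P(H) = q) for independent flips."""
    n_heads = seq.count("H")
    n_tails = seq.count("T")
    return q ** n_heads * (1 - q) ** n_tails

for seq, q_values in [("HHHHH", [0.5, 1.0]), ("HHTHT", [0.5, 0.6])]:
    for q in q_values:
        print(f"P({seq} | q = {q}) = {sequence_likelihood(seq, q):.5f}")
```

In both cases a value of q tuned to the data (1.0 for HHHHH, 0.6 for HHTHT) assigns the sequence higher probability than q = 0.5, which is why maximized likelihood alone always favors the more flexible hypothesis.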

Comparing simple and complex hypotheses
P(H) = q is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = q
- for any observed sequence X, we can choose q such that X is at least as probable as under P(H) = 0.5 (and usually more probable)
How can we deal with this?
- Some version of Occam’s razor?
- Bayes: automatic version of Occam’s razor follows from the “law of conservation of belief”.

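Comparing simple and complex hypotheses
$\frac{P(h_1 \mid D)}{P(h_0 \mid D)} = \frac{P(D \mid h_1)}{P(D \mid h_0)} \times \frac{P(h_1)}{P(h_0)}$
The “evidence” or “marginal likelihood”: the probability that randomly selected parameters from the prior would generate the data. For a hypothesis h with parameter q,
$P(D \mid h) = \int_0^1 P(D \mid q, h)\, p(q \mid h)\, dq$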
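A small worked case shows how the marginal likelihood penalizes the flexible hypothesis. The slide does not specify the prior on q, so this sketch assumes q ~ Uniform(0, 1) for the complex hypothesis, in which case the integral has the closed form NH! NT! / (NH+NT+1)!. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def evidence_fair(seq):
    """P(D | fair coin): every sequence of length n has probability 1/2^n."""
    return Fraction(1, 2 ** len(seq))

def evidence_uniform_q(seq):
    """P(D | P(H) = q) with q ~ Uniform(0, 1) integrated out (assumed prior)."""
    n_heads, n_tails = seq.count("H"), seq.count("T")
    return Fraction(factorial(n_heads) * factorial(n_tails),
                    factorial(n_heads + n_tails + 1))

for seq in ["HHTHT", "HHHHH"]:
    ratio = evidence_uniform_q(seq) / evidence_fair(seq)
    print(f"{seq}: P(D | q model) / P(D | fair) = {ratio} = {float(ratio):.2f}")
```

Under this assumed prior, HHTHT favors the fair coin (ratio 8/15) even though some q fits it better, while HHHHH favors the flexible hypothesis (ratio 16/3).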

Bayesian Occam’s Razor
[Figure: p(D = d | M) plotted over all possible data sets d, for models M1 and M2]
For any model M, $\sum_{d} p(D = d \mid M) = 1$
Law of “conservation of belief”: A model that can predict many possible data sets must assign each of them low probability.
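Conservation of belief can be checked by brute force on the coin example: enumerate all 32 sequences of five flips and sum their probabilities under each model. The flexible model here assumes q ~ Uniform(0, 1), which is an illustrative choice and not taken from the slides; the names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def p_fair(seq):
    """P(seq | fair coin)."""
    return Fraction(1, 2 ** len(seq))

def p_flexible(seq):
    """P(seq | P(H) = q), q ~ Uniform(0, 1) integrated out (assumed prior)."""
    n_heads, n_tails = seq.count("H"), seq.count("T")
    return Fraction(factorial(n_heads) * factorial(n_tails),
                    factorial(n_heads + n_tails + 1))

sequences = ["".join(flips) for flips in product("HT", repeat=5)]
total_fair = sum(p_fair(s) for s in sequences)
total_flexible = sum(p_flexible(s) for s in sequences)

print("total under fair coin:", total_fair)
print("total under flexible model:", total_flexible)
print("HHHHH:", p_fair("HHHHH"), "vs", p_flexible("HHHHH"))
print("HHTHT:", p_fair("HHTHT"), "vs", p_flexible("HHTHT"))
```

Both totals are exactly 1. The flexible model spends more of its probability on extreme sequences such as HHHHH, so it has less left for balanced-looking sequences such as HHTHT.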

Ockham’s Razor in curve fitting

[Figure: p(D = d | M) over data sets D for models M1, M2 and M3, with the observed data marked]
M1: A model that is too simple is unlikely to generate the data.
M3: A model that is too complex can generate many possible data sets, so it is unlikely to generate this particular data set at random.

[Figure: p(D = d | M) over data sets D for models M1, M2 and M3, with the observed data marked; assume Gaussian parameter priors, Gaussian likelihoods (noise)]

For best fitting version of each model:
Model   Prior                     Likelihood
M1      high                      low
M2      medium                    high
M3      very very very very low   very high

(assuming Gaussian noise, and Gaussian priors on parameters) (Ghahramani)

(Ghahramani)

Hierarchical Bayesian learning with flexibly structured models
Learning context-free grammars for natural language (Stolcke & Omohundro; Griffiths and Johnson; Perfors et al.).
Learning complex concepts.
“fruit” <= or( and( color > 0.2, color < 0.4, size > 1, size < 4 ), and( color > 0.5, color < 0.65, size > 2, size < 7 ), …)
Navarro (2006): Nonparametric model
Goodman et al. (2006): Probabilistic context-free grammar for rule-based concepts

The “blessing of abstraction”
Often easier to learn at higher levels of abstraction
- Easier to learn that you have a biased coin than to learn its bias.
- Easier to learn causal structure than causal strength.
- Easier to learn that you are hearing two languages (vs. one), or to learn that language has a hierarchical phrase structure, than to learn how any one language works.
Why? Hypothesis space gets smaller as you go up.
But the total hypothesis space gets bigger when we add levels of abstraction (e.g., model selection).
Can make better (more confident, more accurate) predictions by increasing the size of the hypothesis space, if we introduce good inductive biases.

Summary
Three kinds of Bayesian inference
- Comparing two simple hypotheses
- Parameter estimation: the importance and subtlety of prior knowledge
- Model selection: Bayesian Occam’s razor, the blessing of abstraction
Key concepts
- Probabilistic generative models
- Hierarchies of abstraction, with statistical inference at all levels
- Flexibly structured representations