Part III: Learning structured representations. Hierarchical Bayesian models
Universal Grammar → Grammar → Phrase structure → Utterance → Speech signal. Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Outline Learning structured representations –grammars –logical theories Learning at multiple levels of abstraction
A historical divide: structured representations with innate knowledge (Chomsky, Pinker, Keil, ...) vs. unstructured representations with learning (McClelland, Rumelhart, ...)
The same divide as a 2 × 2: innate knowledge vs. learning, structured vs. unstructured representations. Chomsky and Keil occupy the structured/innate cell; McClelland and Rumelhart the unstructured/learning cell. The remaining target is structure learning: structured representations acquired by learning.
Representations: grammars, logical theories, causal networks (e.g., asbestos → lung cancer → coughing, chest pain)
Representations: semantic networks relating chemicals, diseases, bio-active substances, and biological functions (relations: cause, affect, disrupt, interact with); phonological rules
How to learn a representation R: search for the R that maximizes P(R|D) ∝ P(D|R) P(R). Prerequisites: put a prior over a hypothesis space of Rs; decide how observable data are generated from an underlying R. R can be (almost) anything: a grammar, a logical theory, a causal network, ...
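As a sketch of that search in code (the names and the exhaustive search are hypothetical; the hypothesis space, prior, and likelihood are whatever the chosen representation calls for):

```python
def log_posterior(rep, data, log_prior, log_likelihood):
    """Unnormalized log posterior: log P(R|D) = log P(D|R) + log P(R) + const."""
    return log_prior(rep) + log_likelihood(data, rep)

def best_representation(hypotheses, data, log_prior, log_likelihood):
    """Exhaustive search over a (small) hypothesis space of representations R."""
    return max(hypotheses,
               key=lambda rep: log_posterior(rep, data, log_prior, log_likelihood))
```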
Context-free grammar. Rules: S → N VP; VP → V; VP → V N; N → "Alice"; N → "Bob"; V → "scratched"; V → "cheered". Example parses: [S [N Alice] [VP [V cheered]]] ("Alice cheered") and [S [N Alice] [VP [V scratched] [N Bob]]] ("Alice scratched Bob").
Probabilistic context-free grammar: the same rules, each with a probability; a parse's probability is the product of the probabilities of the rules it uses. Parse of "Alice cheered": probability = 1.0 × 0.5 × 0.6 = 0.3. Parse of "Alice scratched Bob": probability = 1.0 × 0.5 × 0.4 × 0.5 × 0.5 = 0.05.
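A sketch of that computation in code. The rule probabilities below are illustrative placeholders (only the products shown on the slide are given there), but the mechanics are the same: a derivation's probability is the product of its rule probabilities.

```python
# PCFG as a dict: left-hand side -> list of (right-hand side, probability).
pcfg = {
    "S":  [(("N", "VP"), 1.0)],
    "VP": [(("V",), 0.6), (("V", "N"), 0.4)],
    "N":  [(("Alice",), 0.5), (("Bob",), 0.5)],
    "V":  [(("scratched",), 0.5), (("cheered",), 0.5)],
}

def rule_prob(lhs, rhs):
    """Probability of the rule lhs -> rhs under the grammar above."""
    return dict(pcfg[lhs])[tuple(rhs)]

def derivation_prob(rules_used):
    """Probability of a derivation = product of the probabilities of its rules."""
    p = 1.0
    for lhs, rhs in rules_used:
        p *= rule_prob(lhs, rhs)
    return p

# Derivation of "Alice scratched Bob":
p = derivation_prob([("S", ["N", "VP"]), ("N", ["Alice"]),
                     ("VP", ["V", "N"]), ("V", ["scratched"]), ("N", ["Bob"])])
# 1.0 * 0.5 * 0.4 * 0.5 * 0.5 = 0.05
```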
The learning problem. Grammar G: S → N VP; VP → V; VP → V N; N → "Alice"; N → "Bob"; V → "scratched"; V → "cheered". Data D: Alice scratched. Bob scratched. Alice scratched Alice. Alice scratched Bob. Bob scratched Alice. Bob scratched Bob. Alice cheered. Bob cheered. Alice cheered Alice. Alice cheered Bob. Bob cheered Alice. Bob cheered Bob.
Grammar learning: search for the G that maximizes P(G|D) ∝ P(D|G) P(G). Prior: P(G). Likelihood: P(D|G), assuming that sentences in the data are independently generated from the grammar. (Horning 1969; Stolcke 1994)
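Because the sentences are assumed independent given the grammar, the likelihood factors into a product over sentences. A minimal sketch, with `grammar_log_prior` and `sentence_log_prob` as hypothetical stand-ins for a prior over grammars and a PCFG parser/marginalizer:

```python
def grammar_log_posterior(grammar, sentences, grammar_log_prior, sentence_log_prob):
    """log P(G|D) = log P(G) + sum_i log P(sentence_i | G) + const,
    assuming sentences are generated independently from G."""
    return grammar_log_prior(grammar) + sum(
        sentence_log_prob(s, grammar) for s in sentences)
```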
Experiment Data: 100 sentences (Stolcke, 1994)...
Generating grammar vs. model solution
Predicate logic: a compositional language. For all x and y, if y is the sibling of x then x is the sibling of y. For all x, y, and z, if x is the ancestor of y and y is the ancestor of z, then x is the ancestor of z.
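The same two statements written in predicate-logic notation:

```latex
\forall x\,\forall y\; \bigl(\mathrm{Sibling}(x,y) \rightarrow \mathrm{Sibling}(y,x)\bigr)
\qquad
\forall x\,\forall y\,\forall z\; \bigl(\mathrm{Ancestor}(x,y) \wedge \mathrm{Ancestor}(y,z) \rightarrow \mathrm{Ancestor}(x,z)\bigr)
```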
Learning a kinship theory (Hinton, Quinlan, ...). Data D: Sibling(victoria, arthur), Sibling(arthur, victoria), Ancestor(chris, victoria), Ancestor(chris, colin), Parent(chris, victoria), Parent(victoria, colin), Uncle(arthur, colin), Brother(arthur, victoria), ... The task is to find a theory T that accounts for these facts.
Learning logical theories: search for the T that maximizes P(T|D) ∝ P(D|T) P(T). Prior: P(T). Likelihood: P(D|T), assuming that the data include all facts that are true according to T. (Conklin and Witten; Kemp et al. 08; Katz et al. 08)
Theory-learning in the lab (cf. Krueger 1979). Observations: R(f,k), R(f,c), R(k,c), R(f,l), R(k,l), R(c,l), R(f,b), R(k,b), R(c,b), R(l,b), R(f,h), R(k,h), R(c,h), R(l,h), R(b,h).
Theory-learning in the lab. A compact transitive theory: R(f,k). R(k,c). R(c,l). R(l,b). R(b,h). R(X,Z) ← R(X,Y), R(Y,Z). This theory entails all fifteen observed pairs: f,k; f,c; k,c; f,l; k,l; c,l; f,b; k,b; c,b; l,b; f,h; k,h; c,h; l,h; b,h.
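A sketch of how the compact theory generates the full observation set: start from the five base facts and apply the transitivity rule until nothing new is added (a hypothetical helper for illustration, not the model of Kemp et al.):

```python
def entailed_facts(base_facts):
    """Close a set of R(x, y) facts under the rule R(X, Z) <- R(X, Y), R(Y, Z)."""
    facts = set(base_facts)
    changed = True
    while changed:
        changed = False
        new = {(x, w) for (x, y) in facts for (z, w) in facts if y == z}
        if not new <= facts:
            facts |= new
            changed = True
    return facts

base = {("f", "k"), ("k", "c"), ("c", "l"), ("l", "b"), ("b", "h")}
print(sorted(entailed_facts(base)))   # all 15 pairs, e.g. ('c', 'b'), ('f', 'h'), ...
```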
Results: learning time tracks theory complexity (theory length; cf. Goodman) across the transitive and exception-to-transitivity conditions. (Kemp et al. 08)
Conclusion: Part 1 Bayesian models can combine structured representations with statistical inference.
Outline Learning structured representations –grammars –logical theories Learning at multiple levels of abstraction
Vision (Han and Zhu, 2006)
Motor Control (Wolpert et al., 2003)
Causal learning (Kelley; Cheng; Waldmann). Schema: chemicals → diseases → symptoms. Causal models: asbestos → lung cancer → coughing, chest pain; mercury → minamata disease → muscle wasting. Contingency data: Patient 1: asbestos exposure, coughing, chest pain. Patient 2: mercury exposure, muscle wasting.
Universal Grammar → Grammar [P(grammar | UG)] → Phrase structure [P(phrase structure | grammar)] → Utterance [P(utterance | phrase structure)] → Speech signal [P(speech | utterance)]. Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Hierarchical Bayesian model. Universal Grammar U → Grammar G [P(G|U)] → Phrase structures s_1, ..., s_6 [P(s|G)] → Utterances u_1, ..., u_6 [P(u|s)]. A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy: P({u_i}, {s_i}, G | U) = P({u_i} | {s_i}) P({s_i} | G) P(G | U)
Top-down inferences: infer {s_i} given {u_i} and G: P({s_i} | {u_i}, G) ∝ P({u_i} | {s_i}) P({s_i} | G)
Bottom-up inferences: infer G given {s_i} and U: P(G | {s_i}, U) ∝ P({s_i} | G) P(G | U)
Simultaneous learning at multiple levels: infer G and {s_i} given {u_i} and U: P(G, {s_i} | {u_i}, U) ∝ P({u_i} | {s_i}) P({s_i} | G) P(G | U)
Word learning as a hierarchy. Words in general: whole-object bias, shape bias. Individual words: "gavagai", "duck", "monkey", "car". Data.
A hierarchical Bayesian model for coins. Physical knowledge → (F_H, F_T) → per-coin biases θ_i ~ Beta(F_H, F_T) for Coin 1, Coin 2, ..., Coin 200 → flip data d_1, d_2, d_3, d_4, ... Qualitative physical knowledge (symmetry) can influence estimates of the continuous parameters (F_H, F_T). This explains why 10 flips of each of 200 coins are better than 2000 flips of a single coin: they are more informative about F_H, F_T.
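A minimal generative sketch of this hierarchy: shared hyperparameters (F_H, F_T) define a Beta prior, each coin's bias θ_i is drawn from it, and each coin contributes a few flips. The parameter values and flip counts below are illustrative, not taken from the slides.

```python
import random

def sample_coins(F_H=20.0, F_T=20.0, n_coins=200, flips_per_coin=10, rng=random):
    """theta_i ~ Beta(F_H, F_T); each flip ~ Bernoulli(theta_i).

    Many coins with a few flips each are informative about (F_H, F_T),
    because every coin is an independent draw from Beta(F_H, F_T)."""
    data = []
    for _ in range(n_coins):
        theta = rng.betavariate(F_H, F_T)          # coin-specific bias
        flips = [rng.random() < theta for _ in range(flips_per_coin)]
        data.append(sum(flips))                    # number of heads for this coin
    return data

heads_per_coin = sample_coins()
```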
Word learning: "This is a dax." "Show me the dax." 24-month-olds show a shape bias; 20-month-olds do not. (Landau, Smith & Gleitman)
Is the shape bias learned? Smith et al. (2002) trained 17-month-olds on labels for 4 artificial categories ("wib", "lug", "zup", "div"). After 8 weeks of training, 19-month-olds show the shape bias: "This is a dax." "Show me the dax."
Learning about feature variability (cf. Goodman)
A hierarchical model. Meta-constraints M → Bags in general: color varies across bags but not much within bags → Bag proportions: mostly red, mostly brown, mostly green, mostly yellow, mostly blue, ? → Data.
A hierarchical Bayesian model. Meta-constraints M → Bags in general: within-bag variability α = 0.1, overall color distribution β = [0.4, 0.4, 0.2] → Bag proportions: [1,0,0], [0,1,0], [1,0,0], [0,1,0], [.1,.1,.8], ... → Data: [6,0,0], [0,6,0], [6,0,0], [0,6,0], [0,0,1], ...
A hierarchical Bayesian model. Meta-constraints M → Bags in general: within-bag variability α = 5, overall color distribution β = [0.4, 0.4, 0.2] → Bag proportions: [.5,.5,0], [.4,.4,.2], ... → Data: [3,3,0], [0,0,1], ...
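A sketch of the marbles hierarchy: each bag's color proportions θ_i are drawn from a Dirichlet distribution whose parameters factor into a concentration α (within-bag variability) and an overall color distribution β. Small α (as on the first of the two slides above) yields nearly pure bags; large α yields bags whose proportions sit close to β. Function and variable names here are illustrative.

```python
import numpy as np

def sample_bags(alpha, beta, n_bags=5, draws_per_bag=6, seed=0):
    """theta_i ~ Dirichlet(alpha * beta); counts_i ~ Multinomial(draws_per_bag, theta_i)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    thetas = rng.dirichlet(alpha * beta, size=n_bags)      # per-bag color proportions
    counts = [rng.multinomial(draws_per_bag, t) for t in thetas]
    return thetas, counts

# Low alpha: bags are nearly pure; colors vary across bags but not within.
thetas_low, counts_low = sample_bags(alpha=0.1, beta=[0.4, 0.4, 0.2])
# High alpha: each bag's proportions stay close to the overall distribution beta.
thetas_high, counts_high = sample_bags(alpha=5.0, beta=[0.4, 0.4, 0.2])
```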
Shape of the Beta prior
A hierarchical Bayesian model. Meta-constraints M → Bags in general → Bag proportions → Data.
Learning about feature variability. Meta-constraints M → Categories in general → Individual categories → Data.
"wib", "lug", "zup", "div"
"dax"; "wib", "lug", "zup", "div"
Model predictions: choice probability for "Show me the dax."
Where do priors come from? Meta-constraints M → Categories in general → Individual categories → Data.
Knowledge representation
Children discover structural form Children may discover that –Social networks are often organized into cliques –The months form a cycle –“Heavier than” is transitive –Category labels can be organized into hierarchies
A hierarchical Bayesian model. Meta-constraints M → F: form (e.g., tree) → S: structure → D: data (features of mouse, squirrel, chimp, gorilla).
Structural forms: order, chain, ring, partition, hierarchy, tree, grid, cylinder.
P(S|F,n): generating structures. P(S|F,n) = 0 if S is inconsistent with F; otherwise each structure is weighted by the number of nodes it contains: P(S|F,n) ∝ θ^|S|, where |S| is the number of nodes in S. (Example structures over mouse, squirrel, chimp, gorilla.)
Simpler forms are preferred. P(S|F,n): generating structures from forms. For entities A, B, C, D, the prior P(S|F) over all possible graph structures S is compared for the chain and grid forms.
A hierarchical Bayesian model. Meta-constraints M → F: form (e.g., tree) → S: structure → D: data (features of mouse, squirrel, chimp, gorilla).
p(D|S): generating feature data. Intuition: features should be smooth over graph S (relatively smooth vs. not smooth).
p(D|S): generating feature data (Zhu, Lafferty & Ghahramani). Let f_i be the feature value at node i; features are drawn from a Gaussian prior whose log-density penalizes Σ_{i,j} w_{ij} (f_i − f_j)², i.e., large differences between neighboring nodes i and j.
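A sketch of the smoothness intuition: score a feature vector by how much it changes across the edges of the graph S. The graph, feature values, and the exact log-density form are illustrative (cf. Zhu, Lafferty & Ghahramani).

```python
import numpy as np

def smoothness_penalty(edges, f):
    """Sum of squared differences of feature values across the edges of the graph."""
    f = np.asarray(f, dtype=float)
    return sum((f[i] - f[j]) ** 2 for i, j in edges)

# A chain graph over four nodes: 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
smooth_f = [0.0, 0.1, 0.2, 0.3]      # varies gradually over the graph: small penalty
rough_f  = [0.0, 1.0, 0.0, 1.0]      # jumps between neighbors: large penalty
print(smoothness_penalty(edges, smooth_f))   # 0.03
print(smoothness_penalty(edges, rough_f))    # 3.0

# An unnormalized log p(f | S) under a Gaussian graph prior would then be
# proportional to -smoothness_penalty(edges, f), up to constants and scaling.
```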
A hierarchical Bayesian model. Meta-constraints M → F: form (e.g., tree) → S: structure → D: data (features of mouse, squirrel, chimp, gorilla).
Feature data: results (animals × features; judges × cases).
Developmental shifts: structures recovered from 5, 20, and 110 features.
Similarity data: results (colors).
Relational data. Meta-constraints M → Form (e.g., cliques) → Structure → Data.
Relational data: results. Primates ("x dominates y"), Bush cabinet ("x tells y"), prisoners ("x is friends with y").
Why structural form matters Structural forms support predictions about new or sparsely-observed entities.
Experiment: form discovery. Cliques (n = 8/12); chain (n = 7/12).
Universal structure grammar U → Form → Structure → Data (mouse, squirrel, chimp, gorilla; e.g., "warm").
A hypothesis space of forms: each form paired with its generating process.
Conclusions: Part 2. Hierarchical Bayesian models provide a unified framework that helps to explain: how abstract knowledge is acquired, and how abstract knowledge is used for induction.
Outline Learning structured representations –grammars –logical theories Learning at multiple levels of abstraction
Handbook of Mathematical Psychology, 1963