
Explorations in language learnability using probabilistic grammars and child-directed speech
Amy Perfors & Josh Tenenbaum, MIT; Terry Regier, U Chicago
Acknowledgments: Tom Griffiths, Charles Kemp, the Computational Cognitive Science group at MIT, and all the researchers whose work I'll discuss. Thanks also: Adam Albright, Jeff Elman, Danny Fox, Ted Gibson, Sharon Goldwater, Mark Johnson, Jay McClelland, Raj Singh, Ken Wexler, Fei Xu, NSF.

Everyday inductive leaps
How can people learn so much about the world from such limited evidence?
- Kinds of objects and their properties
- Meanings and forms of words, phrases, and sentences
- Causal relations
- Intuitive theories of physics, psychology, …
- Social structures, conventions, and rules
The goal: a general-purpose computational framework for understanding how people make these inductive leaps, and how they can be successful: the "biochemistry" of the mind.

The approach: reverse engineering induction
The big questions:
- How can abstract knowledge guide generalization from sparsely observed data?
- What is the form and content of abstract knowledge, across different domains?
- How could abstract knowledge itself be acquired?
A computational toolkit for addressing these questions:
- Bayesian inference in probabilistic generative models.
- Probabilistic models defined over structured representations: graphs, grammars, schemas, predicate logic, lambda calculus, functional programs.
- Hierarchical probabilistic models, with inference at multiple levels of abstraction and multiple timescales.

A hierarchical generative model for language: “UG” (?) → Grammar → Phrase structure → Utterance → Speech signal, with conditional probabilities P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance).
(cf. Chater & Manning, TiCS 2006)
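To make the generative chain concrete, here is a minimal sketch (not from the talk) of the middle two steps: sampling a phrase structure from a grammar, and reading off an utterance from the phrase structure. The toy grammar, rule probabilities, and category names are illustrative assumptions.

```python
import random

# Toy PCFG: each nonterminal maps to a list of (expansion, probability) pairs.
# The grammar and categories are illustrative assumptions, not the talk's grammar.
toy_grammar = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["det", "n"], 0.7), (["pro"], 0.3)],
    "VP": [(["v", "NP"], 0.6), (["v"], 0.4)],
}

def sample_tree(grammar, symbol="S"):
    """Grammar -> phrase structure: sample a tree top-down."""
    if symbol not in grammar:                  # terminal (part-of-speech tag)
        return symbol
    expansions, probs = zip(*grammar[symbol])
    children = random.choices(expansions, weights=probs)[0]
    return (symbol, [sample_tree(grammar, c) for c in children])

def yield_of(tree):
    """Phrase structure -> utterance: read off the terminal string."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [w for c in children for w in yield_of(c)]

tree = sample_tree(toy_grammar)
print(tree)                       # e.g. ('S', [('NP', ['det', 'n']), ('VP', ['v'])])
print(" ".join(yield_of(tree)))   # e.g. "det n v"
```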

Vision as probabilistic parsing: Scene graph → Surface configuration → Image, with P(scene), P(surfaces | scene), P(image | config).
(Han & Zhu, 2006; cf. Zhu, Yuanhao & Yuille, NIPS 2006)

Learning about categories, labels, and hidden properties: Form → Structure → Data, with P(form), P(structure | form), P(data | structure). [Figure: a form such as a tree with species at the leaf nodes (mouse, squirrel, chimp, gorilla), structured by categories such as rodent, primate, and animal, generating the observed data.]

Learning causal theories: Abstract Principles → Causal Structure → Event Data.
Example abstract principles: Magnets attract metal. Every magnet has a north pole and a south pole; opposite magnetic poles attract, like magnetic poles repel. Behaviors can cause diseases; diseases can cause symptoms.
(Tenenbaum, Griffiths, Kemp, Niyogi, et al.)

Goal-directed action (production and comprehension) (Wolpert, Doya and Kawato, 2003)

UG (?) → Grammar → Phrase structure → Utterance → Speech signal, with P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance).
(cf. Chater and Manning, 2006)

The “Poverty of the Stimulus” argument, in generic form:
Children acquiring language infer the correct forms of complex syntactic constructions for which they have little or no direct evidence. They avoid simple but incorrect generalizations that would be consistent with their data, preferring much subtler rules that just happen to be correct.
How do they do this? They must have some inductive bias, some abstract knowledge about how language works, leading them to prefer the correct hypotheses even in the absence of direct supporting data. That abstract knowledge is UG.

A “Poverty of the Stimulus” argument: aux-fronting in complex interrogatives.
Data:
- Simple declarative: The girl is happy. They are eating.
- Simple interrogative: Is the girl happy? Are they eating?
Hypotheses:
- H1 (linear): move the first auxiliary in the sentence to the beginning.
- H2 (hierarchical): move the auxiliary in the main clause to the beginning.
Generalization:
- Complex declarative: The girl who is sleeping is happy.
- Complex interrogative: Is the girl who is sleeping happy? [via H2]; *Is the girl who sleeping is happy? [via H1]
=> Inductive constraint: induction of specific grammatical rules must be guided by abstract constraints that prefer certain hypotheses over others, e.g., that syntactic rules are defined over hierarchical phrase structures rather than the linear order of words.
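To see why the two hypotheses come apart only on complex sentences, here is a minimal sketch (not part of the original slides); the tiny auxiliary list, the pre-split "parse", and the helper names are illustrative assumptions.

```python
AUX = {"is", "are", "do", "does"}

def h1_linear(tokens):
    """H1: front the first auxiliary, wherever it occurs in the word string."""
    i = next(idx for idx, w in enumerate(tokens) if w in AUX)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def h2_hierarchical(subject_phrase, main_aux, predicate):
    """H2: front the main-clause auxiliary. The sentence arrives already split
    into (subject phrase, main aux, predicate), standing in for a real parse."""
    return [main_aux] + subject_phrase + predicate

# Complex declarative: "the girl who is sleeping is happy"
subject = ["the", "girl", "who", "is", "sleeping"]
declarative = subject + ["is", "happy"]

print(" ".join(h1_linear(declarative)))
# -> "is the girl who sleeping is happy"   (ungrammatical)
print(" ".join(h2_hierarchical(subject, "is", ["happy"])))
# -> "is the girl who is sleeping happy"   (correct)
```

On simple sentences such as "the girl is happy", both rules give the same output, which is exactly why the child's data underdetermine the choice between them.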

Hierarchical phrase structure. [Figure: two candidate structural analyses of a sentence, labeled "No" and "Yes".]

UG (hierarchical phrase-structure grammars, …) → Grammar → Phrase structure → Utterance → Speech signal, with P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance).
Recent objections: with the right kind of statistics we can remove the need for such inductive constraints (e.g., Pullum, Elman, Reali & Christiansen). That is NOT the question here!
(cf. Chater and Manning, 2006)

Our argument
The question: What form do constraints take and how do they arise? (When) must they be innately specified as part of the initial state of the language faculty?
The claim: It is possible that, given the data of child-directed speech and certain innate domain-general capacities, an unbiased ideal learner can recognize the hierarchical phrase structure of language; perhaps this inductive constraint need not be innately specified in the language faculty.
Assumed domain-general capacities:
- Can represent grammars of various types: hierarchical, linear, …
- Can evaluate the Bayesian probability of a grammar given a corpus.

How? By inferring that a hierarchical phrase-structure grammar offers the best tradeoff between simplicity and fit to natural language data.
Evaluating candidate grammars based on simplicity is an old idea. Chomsky (MMH, 1951): "As a first approximation to the notion of simplicity, we will here consider shortness of grammar as a measure of simplicity, and will use such notations as will permit similar statements to be coalesced…. Given the fixed notation, the criteria of simplicity governing the ordering of statements are as follows: that the shorter grammar is the simpler, and that among equally short grammars, the simplest is that in which the average length of derivation of sentences is least."
LSLT applies this idea to a multi-level generative system.

Ideal learnability analyses
A long history of related work: Gold, Horning, Angluin, Berwick, Muggleton, Chater & Vitanyi, …
Differences from our analysis:
- Prior analyses are typically based on simplicity metrics that are either arbitrary or not computable. Bayes has several advantages: it gives a rational, objective way to trade off simplicity and data fit; it prescribes ideal inferences from any amount of data, not just the infinite limit; and it naturally handles ambiguity, noise, and missing data.
- Prior analyses are mostly theorems; our work is mostly empirical exploration.
- Prior analyses typically considered highly simplified languages or an idealized corpus: infinite data, with all grammatical sentence types observed eventually and empirical frequencies given by the true grammar. The child's corpus is very different: a small finite sample of sentence types from a very complex language, with a distribution that might depend on many other factors (semantics, pragmatics, performance, ...).

The landscape of learnability analyses: can X be learned from data?
- Ideal learner, ideal data
- Realistic learner, ideal data
- Ideal learner, realistic data (our focus here)
- Realistic learner, realistic data

The Bayesian model
T: type of grammar (context-free, regular, flat, 1-state), with an unbiased (uniform) prior over types
G: specific grammar
D: data

The Bayesian model
T: type of grammar (context-free, regular, flat, 1-state)
G: specific grammar
D: data
The score for a grammar combines fit to data (the "likelihood", P(D | G)) with simplicity (the "prior", P(G | T)): P(G | D, T) ∝ P(D | G) P(G | T).
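As a worked illustration of how the two terms combine, the comparison is naturally done in log space. This is a sketch with made-up numbers, not the scores actually computed in the talk.

```python
# Hypothetical log-probability scores (negative numbers; closer to zero is better).
# The values are illustrative assumptions, not the talk's results.
grammars = {
    "CFG-S": {"log_prior": -350.0, "log_likelihood": -11500.0},
    "REG-N": {"log_prior": -900.0, "log_likelihood": -11400.0},
}

def log_posterior(scores):
    """Unnormalized log posterior: log P(G | D, T) = log P(G | T) + log P(D | G) + const."""
    return scores["log_prior"] + scores["log_likelihood"]

for name, scores in grammars.items():
    print(name, log_posterior(scores))
print("preferred grammar:", max(grammars, key=lambda g: log_posterior(grammars[g])))
```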

Bayesian learning: trading fit vs. simplicity. [Figure: data D and three candidate grammars G, ordered from simplest with the poorest fit to most complex with the best fit.]

Bayesian learning: trading fit vs. simplicity (cf. the Subset Principle). [Figure: data D and candidate hypotheses covering the data with T = 1 region, T = 2 regions, and T = 13 regions; the prior goes from highest to low as the number of regions grows, while the likelihood goes from low to highest.]

Bayesian learning: trading fit vs. simplicity. The balance between fit and simplicity should be sensitive to the amount of data observed (cf. the Subset Principle).
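A small numeric illustration of this point (the numbers are assumptions, not the talk's): the prior is a fixed cost, while the log likelihood grows with each additional observation, so a tighter-fitting but more complex grammar can overtake a simpler one as data accumulate.

```python
# Hypothetical scores: grammar A is simpler, grammar B fits each sentence better.
log_prior = {"A": -50.0, "B": -200.0}              # fixed complexity cost
log_lik_per_sentence = {"A": -6.0, "B": -5.0}      # avg. log P(sentence | grammar)

def log_posterior(g, n_sentences):
    # Unnormalized: log P(G | D) = log P(G) + sum of per-sentence log likelihoods.
    return log_prior[g] + n_sentences * log_lik_per_sentence[g]

for n in (10, 100, 200, 1000):
    winner = max(("A", "B"), key=lambda g: log_posterior(g, n))
    print(f"n = {n:4d}: prefer grammar {winner}")
# With little data the simpler grammar A wins; past the crossover (here n = 150),
# grammar B's tighter fit outweighs its larger complexity cost.
```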

The prior: measuring the simplicity of a grammar
A probabilistic grammar for grammars (cf. Horning): grammars with more rules and more non-terminals have lower prior probability. The prior is defined in terms of:
- n = number of nonterminals
- P_k = number of productions expanding nonterminal k
- θ_k = probabilities for the expansions of nonterminal k
- N_i = number of symbols in production i
- V = vocabulary size
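The slide's actual formula is not reproduced in this transcript. The following is a minimal sketch of a prior in this spirit, under assumed choices: geometric distributions over the counts n, P_k, and N_i, each symbol in a production drawn uniformly from the nonterminals plus the V-word vocabulary, and the production probabilities θ_k left out for simplicity.

```python
import math

def log_geometric(k, p=0.5):
    """log P(k) under a geometric distribution over k = 1, 2, 3, ..."""
    return (k - 1) * math.log(1 - p) + math.log(p)

def log_prior(grammar, vocab_size):
    """Grammar: dict mapping each nonterminal to a list of productions, where each
    production is a list of symbols. Assumed form of the prior: more nonterminals,
    more productions, and longer productions all cost probability, so bigger
    grammars get lower priors."""
    nonterminals = list(grammar)
    lp = log_geometric(len(nonterminals))             # choose n
    n_symbols = len(nonterminals) + vocab_size        # choices for each symbol slot
    for productions in grammar.values():
        lp += log_geometric(len(productions))         # choose P_k
        for prod in productions:
            lp += log_geometric(len(prod))            # choose N_i
            lp += -len(prod) * math.log(n_symbols)    # choose each symbol uniformly
    return lp

tiny = {"S": [["NP", "VP"]], "NP": [["det", "n"], ["pro"]], "VP": [["v", "NP"], ["v"]]}
print(log_prior(tiny, vocab_size=25))
```

The design mirrors the description-length intuition: every extra rule or symbol multiplies in another factor less than one, lowering the prior.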

The likelihood: measuring the fit of a grammar
The probability of the corpus being generated from the grammar:
- Grammars that assign long derivations to sentences will be less probable. Example: for the sentence type "pro aux det n", the probability of the parse is the product of the probabilities of the rules used in the derivation, e.g. 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016.
- Grammars that generate sentences not observed in the corpus will be less probable, because they "waste" probability mass.
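A minimal sketch of the computation this describes; the grammar, rule probabilities, and parse below are illustrative assumptions rather than the grammars scored in the talk.

```python
import math

# Probability of each rule within its nonterminal's expansion options (illustrative).
rule_probs = {
    ("S", ("NP", "VP")):   0.5,
    ("NP", ("pro",)):      0.25,
    ("VP", ("aux", "NP")): 1.0,
    ("NP", ("det", "n")):  0.25,
    # ... remaining rules of the hypothetical grammar
}

def derivation_prob(rules_used):
    """P(parse) = product of the probabilities of the rules in the derivation."""
    p = 1.0
    for rule in rules_used:
        p *= rule_probs[rule]
    return p

def corpus_log_likelihood(parsed_corpus):
    """log P(D | G) = sum over sentence tokens of the log parse probability."""
    return sum(math.log(derivation_prob(parse)) for parse in parsed_corpus)

parse = [("S", ("NP", "VP")), ("NP", ("pro",)),
         ("VP", ("aux", "NP")), ("NP", ("det", "n"))]
print(derivation_prob(parse))   # 0.5 * 0.25 * 1.0 * 0.25 = 0.03125
```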

Different grammar types, from linear to hierarchical:
- Flat grammar: a list of each sentence.
- 1-state grammar: anything is accepted.
- Regular grammar: rules of the form NT → t NT, NT → t.
- Context-free grammar: rules of the form NT → NT NT, NT → t NT, NT → NT, NT → t.
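One way to make the comparison concrete is to hold every grammar type in the same data structure, so the same prior and likelihood code can score them; what differs is only the shape of the productions each type allows. The rules below are illustrative assumptions, not the grammars from the talk.

```python
# Shared representation for every grammar type: nonterminal -> list of productions.
flat_grammar = {        # FLAT: one production per observed sentence type
    "S": [["pro", "aux", "det", "n"], ["wh", "aux", "det", "n", "part"]],
}
regular_grammar = {     # REGULAR: right-linear rules, NT -> t NT or NT -> t
    "S":  [["pro", "Q1"], ["wh", "Q1"]],
    "Q1": [["aux", "Q2"]],
    "Q2": [["det", "Q3"]],
    "Q3": [["n"]],
}
cfg = {                 # CONTEXT-FREE: productions may mix terminals and nonterminals
    "S":  [["NP", "VP"]],
    "NP": [["det", "n"], ["pro"]],
    "VP": [["aux", "NP"], ["v"]],
}

def is_right_linear(grammar):
    """Check the regular-grammar schema: every rule is NT -> t or NT -> t NT."""
    for productions in grammar.values():
        for p in productions:
            if len(p) == 1 and p[0] not in grammar:
                continue
            if len(p) == 2 and p[0] not in grammar and p[1] in grammar:
                continue
            return False
    return True

print(is_right_linear(regular_grammar))  # True
print(is_right_linear(cfg))              # False
```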

Hierarchical grammars, from simpler with a looser fit to the data to more complex with a tighter fit:
- CFG-S: designed to be as linguistically plausible (and as compact) as possible; 69 rules, 14 non-terminals.
- CFG-L: derived from CFG-S, with additional productions that put less probability mass on recursive productions (and hence overgenerate less); 120 rules, 14 non-terminals.

Linear grammars, from simplest with the poorest fit to most complex with an exact fit:
- 1-STATE: any sentence accepted (unigram probability model); 25 rules, 0 non-terminals.
- REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals.
- REG-M: mid-level regular grammar derived from the CFG; 169 rules, 13 non-terminals.
- REG-N: narrowest regular grammar derived from the CFG; 389 rules, 85 non-terminals.
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals.
Plus: local search refinements and automatic grammar construction based on machine learning methods (Goldwater & Griffiths, 2007).

Data
Child-directed speech (CHILDES database, Adam corpus from Brown, age range 2;3 to 5;2). Each word is replaced by its syntactic category, e.g.:
- det adj n v prop prep adj n (the baby bear discovers Goldilocks in his bed)
- wh aux det n part (what is the monkey eating?)
- pro aux det n comp v det n (this is the man who wrote the book)
- aux pro det adj n (is that a green car?)
- n comp aux adj aux vi (eagles that are alive do fly)
The final data comprise 2336 individual sentence types (corresponding to 21671 sentence tokens).
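A minimal sketch of this preprocessing step, counting sentence types over sentence tokens. The word-to-category lookup below is a hypothetical stand-in for the real tagging of the corpus.

```python
from collections import Counter

# Hypothetical lookup from word to syntactic category (a stand-in for real tagging).
lexicon = {"is": "aux", "that": "pro", "a": "det", "green": "adj", "car": "n",
           "what": "wh", "the": "det", "monkey": "n", "eating": "part"}

def sentence_type(utterance):
    """Map an utterance to its sequence of syntactic categories."""
    return " ".join(lexicon[w] for w in utterance.lower().rstrip("?.!").split())

utterances = ["is that a green car?", "what is the monkey eating?",
              "is that a green car?"]          # tokens may repeat
type_counts = Counter(sentence_type(u) for u in utterances)
print(type_counts)
# Counter({'aux pro det adj n': 2, 'wh aux det n part': 1})
```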

Results: full corpus. [Bar charts of the prior, likelihood, and posterior for each grammar (CFG-S, CFG-L, REG-B, REG-M, REG-N), arranged from simpler to tighter-fitting. Note: these are -log probabilities, so lower = better.]

Results: generalization. How well does each grammar predict unseen sentence forms, e.g., complex aux-fronted interrogatives?
[Table: for each sentence type, whether it appears in the corpus and whether each grammar (FLAT, RG-N, RG-M, RG-B, 1-ST, CFG-S, CFG-L) predicts it.]
- Simple declarative: Eagles do fly. (n aux vi)
- Simple interrogative: Do eagles fly? (aux n vi)
- Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
- Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
- Complex interrogative (ungrammatical form): *Are eagles that alive do fly? (aux n comp adj aux vi)

Results: first file (90 minutes). [Bar charts of the prior, likelihood, and posterior for each grammar (CFG-S, CFG-L, REG-B, REG-M, REG-N), arranged from simpler to tighter-fitting. Note: these are -log probabilities, so lower = better.]

Conclusions: poverty of the stimulus
An alternative to the standard version of linguistic nativism: hierarchical phrase structure is a crucial constraint on syntactic acquisition, but perhaps need not be specified innately in the language faculty. It can be inferred from child-directed speech by an ideal learner equipped with innate domain-general capacities to represent grammars of various types and to perform Bayesian inference.
Many open questions:
- How close have we come to finding the best grammars of each type?
- How would these results extend to richer representations of syntax, or to jointly learning multiple levels of linguistic structure?
- What are the prospects for computationally or psychologically realistic learning algorithms that might approximate this ideal learner?

Learnability of inductive constraints
Abstract domain knowledge may be acquired "top-down", by Bayesian inference at the top layer of a hierarchical model, and may be inferred from much less data than concrete details. This is very different from the "bottom-up" empiricist view of abstraction. To what extent can other core inductive constraints observed in cognitive development be acquired?
- Word learning: whole object bias (Markman); principle of contrast (Clark); shape bias (Smith)
- Causal reasoning: causal schemata (Kelley)
- Folk physics: objects are unified, persistent (Spelke)
- Number: counting principles (Gelman)
- Folk biology: taxonomic principle (Atran)
- Folk psychology: principle of rationality (Gergely)
- Predicability: M-constraint (Keil)
- Syntax: various aspects of UG (Chomsky)
- Phonology: faithfulness, markedness constraints (Prince, Smolensky)

What of the workshop hypothesis? “Statistical approaches should work within the framework of classical linguistics rather than supplant it.”

What of the workshop hypothesis? OK, with a minor change: “Classical linguistics should work within the frameworks of statistical approaches rather than ignore them.”

What of the workshop hypothesis? “Classical linguistics and statistical approaches have much to gain by working together.”
Classical linguistics suggests hypothesis spaces; statistical inference can evaluate these hypotheses in rigorous and powerful ways, using large-scale realistic corpora and whole grammars rather than just fragments of languages and grammars.
Bayesian approaches may be particularly worth investigating:
- A rational, objective metric trading off simplicity and fit: less arbitrary than traditional metrics, and better suited to learning from real data sources.
- Fits naturally with a multi-level generative system (e.g., LSLT).
- Recent progress in techniques for large-scale inference.
Potential long-term benefits:
- A method for studying the contents of UG: formalizing stimulus poverty arguments, and assessing which linguistic universals or constraints are best attributed to UG.
- A tool to show that linguistic theory "works" in an engineering sense.
- May help integration with other areas of cognitive science and AI.