Optimality in Cognition and Grammar
Paul Smolensky
Cognitive Science Department, Johns Hopkins University

Plan of lectures
1. Cognitive architecture: Symbols & optimization in neural networks
2. Optimization in grammar: HG → OT (from numerical to algebraic optimization in grammar)
3. OT and nativism: the initial state & neural/genomic encoding of UG
4. ?

The ICS Hypothesis
The Integrated Connectionist/Symbolic Cognitive Architecture (ICS):
–In higher cognitive domains, representations and functions are well approximated by symbolic computation
–The Connectionist Hypothesis is correct
–Thus, cognitive theory must supply a computational reduction of symbolic functions to PDP computation

Levels

The ICS Architecture
[roadmap diagram: symbolic level: Function ƒ (e.g. /kæt/ → [σ k [æt]]), Grammar G (constraints such as NOCODA; 'optimal' = constraint satisfaction), Representation, Algorithm A; connectionist level: activation patterns, connection weights, Harmony optimization / constraint satisfaction, spreading activation]

Representation
[diagram: the symbolic structure [σ k [æ t]] realized as an activation pattern of filler/role bindings: σ/rε, k/r0, æ/r01, t/r11]

Tensor Product Representations
[diagram: filler vectors A, B, X, Y (i, j, k ∊ {A, B, X, Y}) bound by the tensor product ⊗ to role vectors rε = (1; 0 0), r0 = (0; 1 1), r1 = (0; 1 −1), at depth 0 and depth 1]

Local tree realizations
[diagram of the resulting representations]

The ICS Isomorphism
[diagram: tensor product representations of the input and of the passive LF output (Aux, V, by, Agent, Patient; constituents B, C, D, E, F, G), related by the weight matrix W of a tensorial network]

Tensor Product Representations

Structuring operation | Structures (example)               | Connectionist formalization (vector operation)
Combining             | Sets: {c1, c2}                     | Vector sum: c1 + c2
Role/filler binding   | Strings, frames: AB = {A/r1, B/r2} | Tensor product ⊗: A ⊗ r1 + B ⊗ r2
Recursive embedding   | Trees: [A [B C]]                   | Recursive role vectors (r_left/right-child(x) = r_0/1 ⊗ r_x): A ⊗ r0 + [B ⊗ r0 + C ⊗ r1] ⊗ r1
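
The binding scheme in this table can be made concrete in a few lines of NumPy. This is a minimal sketch, not the networks from the slides: it assumes one-hot filler vectors for A and B, reuses the role vectors given two slides back, binds by the (flattened) tensor product, and exploits the orthogonality of the roles to unbind fillers by an inner product.

```python
import numpy as np

# Role vectors from the slides: root, left child, right child.
r_eps = np.array([1.0, 0.0, 0.0])
r_0   = np.array([0.0, 1.0, 1.0])
r_1   = np.array([0.0, 1.0, -1.0])

# Hypothetical filler vectors (one-hot here; any linearly independent set works).
A, B = np.eye(2)

def bind(filler, role):
    """Role/filler binding: the tensor product, flattened to a vector."""
    return np.kron(filler, role)

# The string/frame AB, with A in the first position and B in the second:
#   A ⊗ r_0 + B ⊗ r_1
s_AB = bind(A, r_0) + bind(B, r_1)

def unbind(vec, role, n_fillers=2):
    """Recover the filler bound to `role` (the roles above are mutually orthogonal)."""
    return vec.reshape(n_fillers, -1) @ (role / role.dot(role))

print(np.round(unbind(s_AB, r_0), 3))   # ~ A = [1, 0]
print(np.round(unbind(s_AB, r_1), 3))   # ~ B = [0, 1]

# Recursive embedding: deeper positions get roles built by the tensor product,
# e.g. r_01 = "left child of the right child" = r_0 ⊗ r_1 (the role of æ in
# [σ k [æ t]] on the Representation slide), and analogously r_11 for t.
r_01 = np.kron(r_0, r_1)
r_11 = np.kron(r_1, r_1)
```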

Formal Role/Filler Binding by Synchrony (Shastri & Ajjanagadde 1993) [Tesar & Smolensky 1994]
give(John, book, Mary) is encoded as
  s = r1 ⊗ [f_book + f_give-obj] + r2 ⊗ [f_giver + f_John] + r3 ⊗ [f_Mary + f_recipient]
[diagram: each role/filler binding, e.g. r1 ⊗ [f_book + f_give-obj], occupies its own time slot: binding by temporal synchrony]

The ICS Architecture
[roadmap diagram, as above]

Two Fundamental Questions
Harmony maximization is satisfaction of parallel, violable constraints.
2. What are the constraints? (knowledge representation)
Prior question:
1. What are the activation patterns — data structures — mental representations — evaluated by these constraints?

Representation
[diagram repeated: [σ k [æ t]] as filler/role bindings σ/rε, k/r0, æ/r01, t/r11]

Two Fundamental Questions
Harmony maximization is satisfaction of parallel, violable constraints.
2. What are the constraints? (knowledge representation)
Prior question:
1. What are the activation patterns — data structures — mental representations — evaluated by these constraints?

Constraints
NOCODA: A syllable has no coda. [Maori / French / English]
[diagram: the constraint is realized as a weight matrix W; in the activation pattern a[σ k [æt]] realizing ‘cat’, the coda t is a violation (*), so H(a[σ k [æt]]) = –sNOCODA < 0]

The ICS Architecture
[roadmap diagram, as above]

The ICS Architecture: Constraint Interaction ??
[roadmap diagram, as above]

Constraint Interaction I
ICS ⇒ grammatical theory: Harmonic Grammar (Legendre, Miyata, Smolensky 1990 et seq.)

Constraint Interaction I
H = H(k, σ) + H(σ, t)
  ONSET: H(k, σ) > 0 (Onset/k)   NOCODA: H(σ, t) < 0 (Coda/t)
The grammar generates the representation that maximizes H: this best-satisfies the constraints, given their differential strengths.
Any formal language can be so generated.
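
A toy version of this evaluation step, with hypothetical constraint strengths and a hypothetical candidate set, just to show the arithmetic: Harmony is a weighted sum of constraint satisfactions and violations, and the grammar's output is the candidate that maximizes it.

```python
# Hypothetical numeric strengths for two constraints.
STRENGTH = {"ONSET": 3.0, "NOCODA": 2.0}

def harmony(onsets_satisfied, codas_present):
    # Each satisfied ONSET contributes +strength; each coda (a NOCODA
    # violation) contributes -strength.
    return (STRENGTH["ONSET"] * onsets_satisfied
            - STRENGTH["NOCODA"] * codas_present)

# Two hypothetical parses of /kæt/ and their Harmonies.
candidates = {
    ".kæt.": harmony(onsets_satisfied=1, codas_present=1),   # H = 3 - 2 = +1
    ".ækt.": harmony(onsets_satisfied=0, codas_present=1),   # H = 0 - 2 = -2
}
best = max(candidates, key=candidates.get)
print(best, candidates[best])   # .kæt. 1.0
```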

The ICS Architecture: Constraint Interaction I: HG
[roadmap diagram, as above; 'optimal' is now maximal Harmony H]

Harmonic Grammar Parser
Simple, comprehensible network; simple grammar G: X → A B, Y → B A
Language processing: completion
[diagram: top-down completion (given X or Y at the root, fill in A B or B A below) and bottom-up completion (given A B or B A below, fill in X or Y at the root)]

The ICS Architecture
[roadmap diagram, as above]

Simple Network Parser
Fully self-connected, symmetric network (weight matrix W), like the previously shown network, except with 12 units; representations and connections shown below.

Harmonic Grammar Parser
Weight matrix for Y → B A: H(Y, —A) > 0, H(Y, B—) > 0

Harmonic Grammar Parser Weight matrix for X → A B

Harmonic Grammar Parser Weight matrix for entire grammar G
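
The following is a deliberately simplified sketch of such a parser, not the 12-unit distributed network on the slides: it uses one local unit per symbol-in-position, symmetric weights that reward the two rules of G and penalize incompatible fillers, and Hopfield-style settling that never decreases Harmony. Unit names, weight values, and biases are all hypothetical.

```python
import numpy as np

# Unit order: A1, B1 (first child position), A2, B2 (second position), X, Y (root).
units = ["A1", "B1", "A2", "B2", "X", "Y"]
idx = {u: i for i, u in enumerate(units)}

W = np.zeros((6, 6))

def connect(u, v, w):
    W[idx[u], idx[v]] = W[idx[v], idx[u]] = w

# Rule X -> A B: X supports A in position 1 and B in position 2.
connect("X", "A1", 2.0)
connect("X", "B2", 2.0)
# Rule Y -> B A.
connect("Y", "B1", 2.0)
connect("Y", "A2", 2.0)
# Competition: one filler per position, one root symbol.
connect("A1", "B1", -3.0)
connect("A2", "B2", -3.0)
connect("X", "Y", -3.0)

b = -1.0 * np.ones(6)          # bias: a unit stays off without support

def harmony(a):
    return 0.5 * a @ W @ a + b @ a

def complete(clamped, steps=20):
    """Greedy Harmony ascent: repeatedly set each unclamped unit to the 0/1
    value favored by its net input; for symmetric W this never lowers H."""
    a = np.zeros(6)
    for u in clamped:
        a[idx[u]] = 1.0
    for _ in range(steps):
        for i in range(6):
            if units[i] in clamped:
                continue
            net = W[i] @ a + b[i]
            a[i] = 1.0 if net > 0 else 0.0
    return {u: a[idx[u]] for u in units}, harmony(a)

print(complete({"A1", "B2"}))   # bottom-up: "A B" at the bottom -> X turns on
print(complete({"X"}))          # top-down: X at the root -> A1 and B2 turn on
```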

Bottom-up Processing
[animation: A and B presented in the child positions; the network fills in X or Y at the root]

Top-down Processing
[animation: X or Y presented at the root; the network fills in A B or B A in the child positions]

Scaling up Not yet … Still conceptual obstacles to surmount

Explaining Productivity
Approaching full-scale parsing of formal languages by neural-network Harmony maximization.
We have other networks (like PassiveNet) that provably compute recursive functions ⇒ productive competence. How to explain?

1. Structured representations

+ 2. Structured connections

= Proof of Productivity Productive behavior follows mathematically from combining –the combinatorial structure of the vectorial representations encoding inputs & outputs and –the combinatorial structure of the weight matrices encoding knowledge

Explaining Productivity I
–Intra-level decomposition (PSA): [A B] ⇝ {A, B}
–Inter-level decomposition (ICS): [A B] ⇝ {1, 0, −1, …, 1}
[diagram relating functions, semantics, and processes under PSA, ICS, and PSA & ICS]

Explaining Productivity II
–Intra-level decomposition (PSA): G ⇝ {X → AB, Y → BA}
–Inter-level decomposition (ICS): W(G) ⇝ {1, 0, −1, 0; …}
[diagram relating functions, semantics, and processes under PSA, ICS, and ICS & PSA]

The ICS Architecture
[roadmap diagram, as above]

The ICS Architecture: Constraint Interaction II
[roadmap diagram, as above]

Constraint Interaction II: OT
ICS ⇒ grammatical theory: Optimality Theory (Prince & Smolensky 1991, 1993/2004)

Constraint Interaction II: OT
Differential strength encoded in strict domination hierarchies (≫):
–Every constraint has complete priority over all lower-ranked constraints (combined)
–Approximate numerical encoding employs special (exponentially growing) weights
–“Grammars can’t count”
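
A small sketch of that pair of points (constraint names and violation counts are hypothetical): strict domination compares candidates lexicographically, and a numerical Harmony reproduces the same comparisons only if each constraint's weight exceeds the combined effect of everything ranked below it, e.g. exponentially growing weights.

```python
RANKING = ["ONSET", "NOCODA"]            # highest-ranked constraint first

def ot_beats(viol_a, viol_b):
    """True iff candidate a beats candidate b under strict domination."""
    for c in RANKING:
        if viol_a[c] != viol_b[c]:
            return viol_a[c] < viol_b[c]
    return False                         # tie

def exponential_harmony(viol, base=10):
    # Weight of the k-th constraint from the bottom of the ranking: base**k.
    # With base larger than any violation count, maximizing
    # H = -sum(weight * violations) mimics the strict-domination comparison.
    n = len(RANKING)
    return -sum(base ** (n - 1 - i) * viol[RANKING[i]] for i in range(n))

a = {"ONSET": 0, "NOCODA": 3}            # three NOCODA violations
b = {"ONSET": 1, "NOCODA": 0}            # one ONSET violation
print(ot_beats(a, b))                                       # True: ONSET dominates
print(exponential_harmony(a) > exponential_harmony(b))      # True: -3 > -10
```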

Constraint Interaction II: OT
“Grammars can’t count”: stress is on the initial heavy syllable iff the number of light syllables n obeys […]? No way, man.

Constraint Interaction II: OT
–Differential strength encoded in strict domination hierarchies (≫): the 1st innovation of OT, constraint ranking
–Constraints are universal (Con)
–Candidate outputs are universal (Gen)
–Human grammars differ only in how these constraints are ranked: ‘factorial typology’
The first true contender for a formal theory of cross-linguistic typology.
2nd innovation: ‘Faithfulness’

The Faithfulness/Markedness Dialectic
‘cat’: /kat/ → kæt, violating NOCODA (*) — why?
–FAITHFULNESS requires pronunciation = lexical form
–MARKEDNESS often opposes it
Markedness/Faithfulness dialectic ⇒ diversity:
–English: FAITH ≫ NOCODA
–Polynesian: NOCODA ≫ FAITH (~ French)
Another markedness constraint M: Nasal Place Agreement [‘Assimilation’] (NPA):
  labial: mb ≻ nb, ŋb;  coronal: nd ≻ md, ŋd;  velar: ŋg ≻ ŋb, ŋd
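
A miniature factorial typology, with a deliberately simplified candidate set and violation profile for /kat/: re-ranking the same two constraints yields the English-type and the Polynesian-type outcomes.

```python
# output -> violation counts, for input /kat/ (simplified, hypothetical)
CANDIDATES = {
    "kat": {"FAITH": 0, "NOCODA": 1},   # faithful, but has a coda
    "ka":  {"FAITH": 1, "NOCODA": 0},   # drops the coda: unfaithful
}

def optimal(ranking):
    def key(output):
        v = CANDIDATES[output]
        return tuple(v[c] for c in ranking)   # lexicographic = strict domination
    return min(CANDIDATES, key=key)

print(optimal(["FAITH", "NOCODA"]))   # 'kat' : English-type ranking
print(optimal(["NOCODA", "FAITH"]))   # 'ka'  : Polynesian-type ranking
```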

The ICS Architecture: Constraint Interaction II: OT
[roadmap diagram, as above; 'optimal' is now defined by strict domination ≫]

Optimality Theory
Diversity of contributions to theoretical linguistics:
–Phonology & phonetics
–Syntax
–Semantics & pragmatics
–… e.g., the following lectures.
Now: Can strict domination be explained by connectionism?

Case study: Syllabification in Berber
Plan: data, then OT grammar, Harmonic Grammar, network

Syllabification in Berber
Dell & Elmedlaoui 1985: Imdlawn Tashlhit Berber
–A syllable nucleus can be any segment
–But nucleus choice is driven by a universal preference for nuclei to be the highest-sonority segments

Berber syllable nuclei have maximal sonority

Segment class        | Example segments | Sonority son(ρ) | Berber examples
voiceless stops      | t, k             | 1               | .ra.tK.ti.
voiced stops         | d, b, g          | 2               | .bD.dL., .ma.ra.tGt.
voiceless fricatives | s, f, x          | 3               | .tF.tKt., .tX.zNt.
voiced fricatives    | z, γ             | 4               | .txZ.nakkʷ.
nasals               | n, m             | 5               | .tzMt., .tM.z….
liquids              | l, r             | 6               | .tR.gLt.
high vocoids         | i/y, u/w         | 7               | .rat.lult., .il.di.
low vowel            | a                | 8               | .tR.ba.

OT Grammar: BrbrOT (Prince & Smolensky ’93/04)
HNUC: A syllable nucleus is sonorous
ONSET: A syllable has an onset
Strict domination: ONSET ≫ HNUC

/txznt/          ONSET | HNUC
a. ☞ .tX.zNt.          | n x
b.   .tXz.nT.          | x! t
c.   .txZ.Nt.     *!   | n z

Harmonic Grammar: BrbrHG
HNUC (a syllable nucleus is sonorous): a nucleus of sonority s has Harmony = 2^s − 1, with s ∈ {1, 2, …, 8} ~ {t, d, f, z, n, l, i, a}
ONSET (*VV): Harmony = −2^8
Theorem. The global Harmony maxima are the correct Berber core syllabifications [of Dell & Elmedlaoui; no sonority plateaux, as in the OT analysis, here & henceforth]
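
Applying these Harmony values to the three tableau candidates makes the equivalence concrete. The sketch below writes each nucleus in upper case, takes sonority values from the segment table above, scores each nucleus as 2^s − 1, and charges −2^8 for every VV sequence (the *VV implementation of ONSET); HG's global maximum is the same candidate the OT tableau selects.

```python
SONORITY = {"t": 1, "d": 2, "x": 3, "f": 3, "z": 4, "n": 5, "l": 6, "i": 7, "a": 8}

def harmony(parse):
    """parse: syllabified string with each nucleus in upper case, e.g. 'tX.zNt'."""
    segs = parse.replace(".", "")
    h = sum(2 ** SONORITY[c.lower()] - 1 for c in segs if c.isupper())   # HNUC
    for left, right in zip(segs, segs[1:]):                              # ONSET as *VV
        if left.isupper() and right.isupper():
            h -= 2 ** 8
    return h

for parse in ("tX.zNt", "tXz.nT", "txZ.Nt"):
    print(parse, harmony(parse))
# tX.zNt   38   <- the global maximum: the Dell-Elmedlaoui parse
# tXz.nT    8
# txZ.Nt -210
```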

BrbrNet realizes BrbrHG
[diagram: the network’s connectivity implements ONSET and HNUC]

BrbrNet’s Global Harmony Maximum is the correct parse
(Contrasts with Goldsmith’s Dynamic Linear Models: Goldsmith & Larson ’90; Prince ’93)
For a given input string, a state of BrbrNet is a global Harmony maximum if and only if it realizes the syllabification produced by the serial Dell-Elmedlaoui algorithm.

BrbrNet’s Search Dynamics
Greedy local optimization:
–at each moment, make a small change of state so as to maximally increase Harmony
–(gradient ascent: mountain climbing in fog)
–guaranteed to construct a local maximum
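
In code, the greedy dynamics is just gradient ascent on the quadratic Harmony H(a) = ½ a·W·a + b·a with activations clipped to [0, 1]. The weights and biases below are placeholders, not BrbrNet's; the two-unit example simply shows ascent settling into a corner of the state space.

```python
import numpy as np

def settle(W, b, a0, rate=0.05, steps=2000):
    """Gradient ascent on H(a) = 0.5 * a.W.a + b.a, keeping a in [0, 1]."""
    a = a0.copy()
    for _ in range(steps):
        grad = W @ a + b            # dH/da for symmetric W
        a = np.clip(a + rate * grad, 0.0, 1.0)
    return a

def harmony(W, b, a):
    return 0.5 * a @ W @ a + b @ a

# Tiny example: two mutually inhibiting units, the first with a higher bias;
# ascent settles on the corner (1, 0), a local (here also global) maximum.
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
b = np.array([1.0, 0.5])
a_final = settle(W, b, np.full(2, 0.5))
print(a_final, harmony(W, b, a_final))   # -> approx. [1, 0], H = 1
```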

/txznt/ → .tx.znt. ‘you (sing.) stored’
[plot: the Harmony of the network state rising over time as BrbrNet settles on the parse]

The Hardest Case: /t.bx.ya/*
(* hypothetical, but compare t.bx.la.kkʷ ‘she even behaved as a miser’ [tbx.lakkʷ])

Subsymbolic Parsing
[animation: the activation values of the V (nucleus) units evolving as the network parses]

Parsing the sonority profile .a.tb.kf.zn.yay
[plot] Finds the best of infinitely many representations: 1024 corners/parses

BrbrNet has many Local Harmony Maxima
An output pattern in BrbrNet is a local Harmony maximum if and only if it realizes a sequence of legal Berber syllables (i.e., an output of Gen).
That is, every activation value is 0 or 1, and the sequence of values realizes a sequence of substrings taken from the syllable inventory {CV, CVC, #V, #VC}, where C = 0, V = 1, and # = word edge.
Greedy optimization avoids local maxima: why?
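
This characterization can be checked mechanically: with C = 0 and V = 1, a 0/1 pattern is a legal parse exactly when it consists of an optional word-initial onsetless syllable (#V or #VC) followed by CV or CVC syllables. A sketch using a regular expression (my encoding, not from the slides):

```python
import re

# Optional word-initial onsetless syllable, then any number of CV / CVC syllables.
LEGAL = re.compile(r"(1|10)?(01|010)*")

def is_local_maximum(pattern):
    """True iff the 0/1 pattern realizes a sequence of legal Berber syllables."""
    return pattern != "" and LEGAL.fullmatch(pattern) is not None

print(is_local_maximum("01001"))   # True:  .CV.CVC.  (e.g. .tX.zNt.)
print(is_local_maximum("10010"))   # True:  .#VC.CVC.
print(is_local_maximum("0110"))    # False: a VV sequence, not a legal parse
```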

HG ⇒ OT’s Strict Domination
Strict domination: baffling from a connectionist perspective? Explicable from a connectionist perspective?
–Exponential BrbrNet escapes local Harmony maxima
–Linear BrbrNet does not

Linear BrbrNet makes errors (~ Goldsmith-Larson network)
Error: /12378/ → … (correct: …) [plot comparing the network’s output with the correct parse]

Subsymbolic Harmony optimization can be stochastic
The search for an optimal state can employ randomness; the equations for units’ activation values have random terms:
–pr(a) ∝ e^{H(a)/T}
–T (‘temperature’) ~ randomness; T → 0 during the search
–Boltzmann Machine (Hinton & Sejnowski 1983, 1986); Harmony Theory (Smolensky 1983, 1986)
This can guarantee computation of the global optimum in principle. In practice: how fast? Exponential vs. linear BrbrNet.
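
A generic sketch of such a stochastic search for binary units, in the spirit of the Boltzmann machine / Harmony theory updates cited above: each unit turns on with probability sigmoid(net/T), and the temperature T is gradually lowered. The weights reuse the toy two-unit example from the gradient-ascent sketch; nothing here is BrbrNet's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def anneal(W, b, T0=2.0, Tmin=0.01, cooling=0.95, sweeps_per_T=20):
    """Simulated annealing on H(a) = 0.5 * a.W.a + b.a for binary units."""
    n = len(b)
    a = rng.integers(0, 2, n).astype(float)
    T = T0
    while T > Tmin:
        for _ in range(sweeps_per_T):
            for i in range(n):
                net = W[i] @ a - W[i, i] * a[i] + b[i]   # input from the other units
                p_on = 1.0 / (1.0 + np.exp(-net / T))    # pr(a_i = 1) ~ e^{H(on)/T}
                a[i] = 1.0 if rng.random() < p_on else 0.0
        T *= cooling                                     # lower the randomness
    return a

# Same toy weights as before: two mutually inhibiting units; annealing lands on
# the corner (1, 0), the global Harmony maximum, with high probability.
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
b = np.array([1.0, 0.5])
print(anneal(W, b))    # usually [1., 0.]
```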

Stochastic BrbrNet: Exponential can succeed ‘fast’ 5-run average

Stochastic BrbrNet : Linear can’t succeed ‘fast’ 5-run average

Stochastic BrbrNet (Linear) 5-run average

The ICS Architecture
[roadmap diagram, as above]