The mental representation of sentences: tree structures or state vectors?
Stefan Frank

With help from Rens Bod, Victor Kuperman, Brian Roark, and Vera Demberg

Understanding a sentence: the very general picture
A word sequence (e.g., "the cat is on the mat") is mapped, via "comprehension", onto a "meaning".

Sentence meaning: theories of mental representation
• logical form, e.g. ∃x,y: cat(x) ∧ mat(y) ∧ on(x,y)
• tree structure (over "cat on mat")
• conceptual network
• perceptual simulation
• state vector or activation pattern
Sentence "structure" … ?

Grammar-based vs. connectionist models
• Grammars account for the productivity and systematicity of language (Fodor & Pylyshyn, 1988; Marcus, 1998).
• Connectionism can explain why there is no (unlimited) productivity and no (pure) systematicity (Christiansen & Chater, 1999).
The debate (or battle) between the two camps focuses on particular (psycho)linguistic phenomena.

From theories to models: Probabilistic Context-Free Grammar (PCFG) versus Simple Recurrent Network (SRN)
• Implemented computational models can be evaluated and compared more thoroughly than 'mere' theories.
• Take a common grammar-based model and a common connectionist model.
• Compare their ability to predict empirical data (measurements of word-reading time).

Probabilistic Context-Free Grammar
• A context-free grammar with a probability for each production rule (conditional on the rule's left-hand side).
• The probability of a tree structure is the product of the probabilities of the rules involved in its construction.
• The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
• Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank).
• Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988−1989).
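To make the first two bullets concrete, here is a minimal Python sketch (not the parser used in the study) of how a tree's probability follows from the rule probabilities and how a sentence's probability sums over its parses; the toy rules and probability values are invented for illustration.

    from math import prod

    # Toy rule probabilities, conditional on each rule's left-hand side
    # (illustrative values, not induced from the WSJ treebank).
    rule_prob = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("DT", "NN")): 0.4,
        ("NP", ("PRP",)): 0.6,
        ("VP", ("VBZ", "NP")): 0.5,
        ("VP", ("VBZ", "PP")): 0.5,
    }

    def tree_probability(rules_used):
        # P(tree) = product of the probabilities of the rules in its derivation
        return prod(rule_prob[rule] for rule in rules_used)

    def sentence_probability(parses):
        # P(sentence) = sum of P(tree) over all grammatical parses of the sentence
        return sum(tree_probability(rules) for rules in parses)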

Inducing a PCFG
[Example parse tree from the WSJ treebank for "It has no bearing on our work force today."]
Rules extracted from the tree include:
S → NP VP .
NP → DT NN
NP → PRP$ NN NN
NN → bearing
NN → today
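A minimal sketch of rule induction by relative-frequency counting, using NLTK's Tree and its productions() method; the single toy tree stands in for the roughly 50,000 annotated WSJ sentences, and the bracketing shown is an assumption rather than the actual treebank entry.

    from collections import Counter
    from nltk import Tree

    # One toy tree standing in for the ~50,000 annotated WSJ sentences.
    treebank = [
        Tree.fromstring(
            "(S (NP (PRP It)) (VP (VBZ has) (NP (DT no) (NN bearing))) (. .))"
        ),
    ]

    def induce_pcfg(trees):
        # Estimate P(rule | left-hand side) by relative frequency.
        rule_counts = Counter()
        lhs_counts = Counter()
        for tree in trees:
            for production in tree.productions():
                rule_counts[production] += 1
                lhs_counts[production.lhs()] += 1
        return {rule: n / lhs_counts[rule.lhs()] for rule, n in rule_counts.items()}

    for rule, p in induce_pcfg(treebank).items():
        print(rule, p)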

Simple Recurrent Network (Elman, 1990)
• Feedforward neural network with recurrent connections.
• Processes sentences, word by word.
• Usually trained to predict the upcoming word (i.e., the input at t+1).
[Architecture: word input at t → hidden layer → output layer. The hidden activation at t−1 is copied back as additional input; the hidden layer thus forms a state vector representing the sentence up to word t, and the output layer gives estimated probabilities for the words at t+1.]
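A sketch of this architecture in modern PyTorch terms (nn.RNN implements exactly the Elman recurrence); the embedding layer, the layer sizes, and the class name are assumptions, not details of the network used in the study.

    import torch.nn as nn

    class SimpleRecurrentNetwork(nn.Module):
        # Elman-style SRN: the hidden activation at t-1 is fed back together
        # with the input at t, giving a state vector for the sentence up to word t.
        def __init__(self, vocab_size, hidden_size=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, vocab_size)

        def forward(self, word_ids):                 # word_ids: (batch, time)
            states, _ = self.rnn(self.embed(word_ids))
            return self.out(states)                  # scores for the word at t+1

    # Training with cross-entropy against the actual next word makes
    # softmax(scores) an estimate of Pr(w_{t+1} | w_1, ..., w_t).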

Word probability and reading times (Hale, 2001; Levy, 2008)
• Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it. Formally:
• A sentence is a sequence of words: w_1, w_2, w_3, …
• The time needed to read word w_t is logarithmically related to its probability in the 'context':
  RT(w_t) ~ −log Pr(w_t | context)   (the surprisal of w_t)
• If nothing else changes, the context is just the sequence of previous words:
  RT(w_t) ~ −log Pr(w_t | w_1, …, w_{t−1})
• Both PCFGs and SRNs can estimate Pr(w_t | w_1, …, w_{t−1}).
• So, can they predict word-reading times?
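A minimal sketch of the surprisal computation itself; whether the logarithm is natural or base-2 only changes the scale, so the choice of bits here is arbitrary.

    import math

    def surprisal(p_word_given_context):
        # -log2 Pr(w_t | w_1, ..., w_{t-1}), in bits
        return -math.log2(p_word_given_context)

    # A word the model assigns probability 0.05 in context is far more
    # surprising (and predicted to be read more slowly) than one assigned 0.5:
    print(surprisal(0.05))   # about 4.32 bits
    print(surprisal(0.5))    # 1.0 bit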

Testing surprisal theory (Demberg & Keller, 2008)
• Reading-time data:
  − Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
  − read by 10 subjects
  − eye-movement registration
  − first-pass RTs: fixation time on a word before any fixation on later words
• Computation of surprisal:
  − PCFG induced from the WSJ treebank
  − applied to the Dundee corpus sentences
  − using Brian Roark's incremental PCFG parser

Testing surprisal theory (Demberg & Keller, 2008)
But accurate word prediction is difficult because of:
• required world knowledge
• differences between the WSJ and The Independent:
  − 1988−'89 versus 2002
  − general WSJ articles versus Independent editorials
  − American English versus British English
  − only major similarity: both are in English
Result: no significant effect of word surprisal on RT, apart from the effects of Pr(w_t) and Pr(w_t | w_{t−1}).
Test for a purely 'structural' (i.e., non-semantic) effect by ignoring the actual words.

[The same example parse tree, now showing only the part-of-speech (pos) tags (PRP, VBZ, DT, NN, IN, PRP$, …) instead of the words "it has no bearing on our work force today".]

Testing surprisal theory (Demberg & Keller, 2008)
• 'Unlexicalized' (or 'structural') surprisal:
  − PCFG induced from WSJ trees with the words removed
  − surprisal estimated by parsing sequences of pos-tags (instead of words) of the Dundee corpus texts
  − independent of semantics, so more accurate estimation is possible
  − but probably a weaker relation with reading times
• Is a word's RT related to the predictability of its part-of-speech?
• Result: yes, a statistically significant (but very small) effect of pos-surprisal on word-RT.

Caveats
• Statistical analysis:
  − the analysis assumes independent measurements
  − surprisal theory is based on dependencies between words
  − so the analysis is inconsistent with the theory
• Implicit assumptions:
  − the PCFG forms an accurate language model (i.e., it gives high probability to the parts-of-speech that actually occur)
  − an accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times)

Solutions
• Sentence-level (instead of word-level) analysis:
  − both the PCFG and the statistical analysis assume independence between sentences
  − surprisal averaged over the pos-tags in the sentence
  − total sentence RT divided by sentence length (number of letters)
• Measure accuracy:
  a) of the language model: lower average surprisal → more accurate language model
  b) of the psycholinguistic model: RT and surprisal correlate more strongly → more accurate psycholinguistic model
• If a) and b) increase together, accurate language models are also accurate psycholinguistic models.
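A sketch of the sentence-level aggregation just described; the function name and argument layout are assumptions.

    def sentence_level(word_surprisals, word_rts, words):
        # Average the surprisal over the (pos-tags of the) words in the sentence,
        # and normalise the total reading time by sentence length in letters.
        mean_surprisal = sum(word_surprisals) / len(word_surprisals)
        n_letters = sum(len(w) for w in words)
        rt_per_letter = sum(word_rts) / n_letters
        return mean_surprisal, rt_per_letter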

Comparing the PCFG and the SRN
• PCFG:
  − train on the WSJ treebank (unlexicalized)
  − parse pos-tag sequences from the Dundee corpus
  − obtain a range of surprisal estimates by varying the 'beam-width' parameter, which controls parser accuracy
• SRN:
  − train on sequences of pos-tags (not the trees) from the WSJ
  − during training (at regular intervals), process the pos-tags from the Dundee corpus, obtaining a range of surprisal estimates
• Evaluation:
  − language model: average surprisal measures inaccuracy (and estimates language entropy)
  − psycholinguistic model: correlation between surprisals and RTs, just like Demberg & Keller
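A sketch of the two evaluation measures, assuming a plain Pearson correlation over the sentence-level values (the slides only say "correlation", so that choice is an assumption).

    import numpy as np
    from scipy.stats import pearsonr

    def evaluate_model(sentence_surprisals, sentence_rts):
        # (a) language model: mean surprisal measures inaccuracy
        #     (and estimates the entropy of the language)
        inaccuracy = float(np.mean(sentence_surprisals))
        # (b) psycholinguistic model: how strongly surprisal correlates with RT
        r, p = pearsonr(sentence_surprisals, sentence_rts)
        return inaccuracy, r, p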

Results
[Plots comparing the PCFG and the SRN.]

Preliminary conclusions
• Both models account for a statistically significant fraction of the variance in the reading-time data.
• The human sentence-processing system seems to be using an accurate language model.
• The SRN is the more accurate psycholinguistic model.
But the PCFG and the SRN together might form an even better psycholinguistic model.

Improved analysis
• Linear mixed-effects regression model (to take into account random effects of subject and item).
• Compare regression models that include:
  − surprisal estimates by the PCFG with the largest beam width
  − surprisal estimates by the fully trained SRN
  − both
• Also include: sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these.
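A sketch of such a regression in Python with statsmodels; the data file, the column names, and the restriction to by-subject random intercepts (statsmodels does not handle crossed subject and item effects as directly as lme4) are all assumptions, not the analysis actually reported.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical sentence-level data: one row per subject x sentence.
    df = pd.read_csv("sentence_level_data.csv")

    model = smf.mixedlm(
        "rt ~ surprisal_pcfg + surprisal_srn + sent_length + word_freq"
        " + forward_prob + backward_prob",
        data=df,
        groups=df["subject"],          # random intercept per subject
    )
    print(model.fit().summary())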

Results: estimated β-coefficients (and associated p-values) for the effect of surprisal

                              Regression model includes…
Effect of surprisal           PCFG            SRN             both
according to…
  PCFG                        0.45 (p<.02)                    −0.46 (p>.2)
  SRN                                         0.64 (p<.001)   … (p<.01)

Conclusions
• Both the PCFG and the SRN account for the reading-time data to some extent.
• But the PCFG does not improve on the SRN's predictions.
No evidence for tree structures in the mental representation of sentences.

Qualitative comparison
• Why does the SRN fit the data better than the PCFG?
• Is it more accurate on a particular group of data points, or does it perform better overall?
• Take the regression analyses' residuals (i.e., the differences between predicted and measured reading times):
  δ_i = |resid_i(PCFG)| − |resid_i(SRN)|
• δ_i is the extent to which data point i is predicted better by the SRN than by the PCFG.
• Is there a group of data points for which δ is larger than might be expected?
• Look at the distribution of the δs.
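A sketch of the δ computation, assuming the two regressions' residuals are available as NumPy arrays aligned by data point.

    import numpy as np

    def delta(resid_pcfg, resid_srn):
        # delta_i = |resid_i(PCFG)| - |resid_i(SRN)|
        # positive delta_i: data point i is predicted better by the SRN
        return np.abs(resid_pcfg) - np.abs(resid_srn)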

Possible distributions of δ
[Three sketched distributions of δ, each centred near 0:]
• symmetrical, mean is 0: no difference between the SRN and the PCFG (only random noise)
• right-shifted: overall better predictions by the SRN
• right-skewed: a particular subset of the data is predicted better by the SRN

Test for symmetry
• If the distribution of δ is asymmetric, particular data points are predicted better by the SRN than by the PCFG.
• The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p>.17).
• The SRN seems to be a more accurate psycholinguistic model overall.
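The slides do not say how the two samples for the Kolmogorov-Smirnov test were constructed; one plausible reading, sketched here purely as an assumption, is to compare the centred δs with their mirror image.

    import numpy as np
    from scipy.stats import ks_2samp

    def symmetry_test(deltas):
        # Compare the centred deltas with their reflection around the centre;
        # a small p-value would indicate an asymmetric distribution.
        centred = deltas - np.median(deltas)
        return ks_2samp(centred, -centred)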

Questions I cannot (yet) answer (but would like to)
• Why does the SRN fit the data better than the PCFG?
• Perhaps people…
  − are bad at dealing with long-distance dependencies in a sentence?
  − store information about the frequency of multi-word sequences?

Questions I cannot (yet) answer (but would like to)
In general:
• Is this SRN a more accurate psychological model than this PCFG?
• Are SRNs more accurate psychological models than PCFGs?
• Does connectionism make for more accurate psychological models than grammar-based theories?
• What kind of representations are used in human sentence processing?
Surprisal-based model evaluation may provide some answers.