1
The mental representation of sentences: Tree structures or state vectors?
Stefan Frank
S.L.Frank@uva.nl
2
With help from Rens Bod, Victor Kuperman, Brian Roark, and Vera Demberg
3
Understanding a sentence: the very general picture
[Diagram: a word sequence ("the cat is on the mat") is mapped by "comprehension" onto a "meaning"]
4
Sentence meaning: theories of mental representation
−logical form: ∃x,y: cat(x) ∧ mat(y) ∧ on(x,y)
−tree structure (e.g., a parse tree of "cat on mat")
−conceptual network
−perceptual simulation
−state vector or activation pattern
Sentence "structure"…?
5
Grammar-based vs. connectionist models
−Grammars account for the productivity and systematicity of language (Fodor & Pylyshyn, 1988; Marcus, 1998).
−Connectionism can explain why there is no (unlimited) productivity and (pure) systematicity (Christiansen & Chater, 1999).
−The debate (or battle) between the two camps focuses on particular (psycho-)linguistic phenomena, e.g.:
6
From theories to models
−Implemented computational models can be evaluated and compared more thoroughly than 'mere' theories.
−Take a common grammar-based model and a common connectionist model: Probabilistic Context-Free Grammar (PCFG) versus Simple Recurrent Network (SRN).
−Compare their ability to predict empirical data (measurements of word-reading time).
7
Probabilistic Context-Free Grammar
−A context-free grammar with a probability for each production rule (conditional on the rule's left-hand side).
−The probability of a tree structure is the product of the probabilities of the rules involved in its construction.
−The probability of a sentence is the sum of the probabilities of all its grammatical tree structures.
−Rules and their probabilities can be induced from a large corpus of syntactically annotated sentences (a treebank).
−Wall Street Journal treebank: approx. 50,000 sentences from WSJ newspaper articles (1988–1989).
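To make the two probability definitions concrete, here is a minimal Python sketch (not the parser used in the study): a toy rule table stands in for an induced grammar, and a parse is represented simply as the list of rules it uses.

```python
import math
from functools import reduce

# Toy PCFG: rule probabilities are conditional on the left-hand side,
# so the probabilities of all rules sharing a left-hand side sum to 1.
RULES = {
    ("S",  ("NP", "VP")):  1.0,
    ("NP", ("DT", "NN")):  0.6,
    ("NP", ("PRP",)):      0.4,
    ("VP", ("VBZ", "NP")): 1.0,
}

def tree_probability(rules_used):
    """Probability of one tree = product of the probabilities of its rules."""
    return reduce(lambda p, rule: p * RULES[rule], rules_used, 1.0)

def sentence_probability(parses):
    """Probability of a sentence = sum over the probabilities of all its parses."""
    return sum(tree_probability(rules_used) for rules_used in parses)

# One hypothetical parse, listed as the rules it uses.
parse = [("S", ("NP", "VP")), ("NP", ("DT", "NN")),
         ("VP", ("VBZ", "NP")), ("NP", ("PRP",))]
p = sentence_probability([parse])
print(p, "surprisal in bits:", -math.log2(p))
```

Real PCFG parsers compute the sum over parses with dynamic programming (e.g., the inside algorithm) rather than by enumerating trees.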
8
Inducing a PCFG
[Example: the WSJ parse tree for "It has no bearing on our work force today", with POS tags such as PRP, VBZ, DT, NN, IN, PRP$]
Rules read off the tree include: S → NP VP ., NP → DT NN, NP → PRP$ NN NN, NN → bearing, NN → today
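Rule induction amounts to relative-frequency estimation: count every rule occurrence in the treebank and divide by the count of its left-hand side. A minimal sketch, assuming trees are given as nested tuples; the helper names and the toy tree are illustrative, not the actual WSJ tools or format.

```python
from collections import Counter

def rules_in_tree(tree):
    """Yield (lhs, rhs) rules from a tree given as nested tuples,
    e.g. ("S", ("NP", ("PRP", "it")), ("VP", ...)).  Leaves are strings."""
    lhs = tree[0]
    children = tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules_in_tree(c)

def induce_pcfg(treebank):
    """Relative-frequency estimates: Pr(rule) = count(rule) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in rules_in_tree(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# A single toy tree standing in for the WSJ treebank.
toy_tree = ("S", ("NP", ("PRP", "it")),
                 ("VP", ("VBZ", "has"), ("NP", ("DT", "no"), ("NN", "bearing"))))
print(induce_pcfg([toy_tree]))
```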
9
Simple Recurrent Network (Elman, 1990)
−A feedforward neural network with recurrent connections.
−Processes sentences word by word: the input at time t is the current word; the hidden layer also receives a copy of its own activation at t−1.
−The hidden activation is a state vector representing the sentence up to word t.
−The output layer gives estimated probabilities for the words at t+1; the network is usually trained to predict the upcoming word (i.e., the input at t+1).
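A minimal numpy sketch of this architecture, with untrained random weights and made-up layer sizes; training by backpropagation and the word-prediction objective are not shown.

```python
import numpy as np

class SimpleRecurrentNetwork:
    """Elman (1990)-style SRN: the hidden layer receives the current input word
    plus a copy of its own activation from the previous time step."""
    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in  = rng.normal(0, 0.1, (hidden_size, vocab_size))
        self.W_rec = rng.normal(0, 0.1, (hidden_size, hidden_size))
        self.W_out = rng.normal(0, 0.1, (vocab_size, hidden_size))
        self.hidden = np.zeros(hidden_size)   # state vector for the sentence so far

    def step(self, word_onehot):
        """Process one word; return estimated probabilities for the next word."""
        self.hidden = np.tanh(self.W_in @ word_onehot + self.W_rec @ self.hidden)
        logits = self.W_out @ self.hidden
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()                 # softmax over the vocabulary

# Process a sentence prefix, word by word.
vocab = ["the", "cat", "is", "on", "mat"]
srn = SimpleRecurrentNetwork(len(vocab), hidden_size=10)
for w in ["the", "cat", "is"]:
    onehot = np.eye(len(vocab))[vocab.index(w)]
    next_word_probs = srn.step(onehot)
```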
10
Word probability and reading times (Hale, 2001; Levy, 2008)
−Surprisal theory: the more unexpected the occurrence of a word, the more time needed to process it.
−Formally: a sentence is a sequence of words w_1, w_2, w_3, … The time needed to read word w_t is logarithmically related to its probability in the 'context': RT(w_t) ~ −log Pr(w_t | context), where −log Pr(w_t | context) is the surprisal of w_t.
−If nothing else changes, the context is just the sequence of previous words: RT(w_t) ~ −log Pr(w_t | w_1, …, w_{t−1}).
−Both PCFGs and SRNs can estimate Pr(w_t | w_1, …, w_{t−1}). So can they predict word-reading times?
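In code, surprisal is just the negative log of the probability the model assigns to the word that actually occurred; the choice of log base 2 (bits) below is an assumption and only rescales the values.

```python
import math

def surprisal(p_word_given_context):
    """Surprisal of w_t: -log2 Pr(w_t | w_1, ..., w_{t-1})."""
    return -math.log2(p_word_given_context)

# A word the model assigned probability 0.02 in its context:
print(surprisal(0.02))   # about 5.6 bits -> a relatively long predicted reading time
```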
11
Testing surprisal theory (Demberg & Keller, 2008)
Reading-time data:
−Dundee corpus: approx. 2,400 sentences from The Independent newspaper editorials
−Read by 10 subjects, with eye-movement registration
−First-pass RTs: fixation time on a word before any fixation on later words
Computation of surprisal:
−PCFG induced from the WSJ treebank
−Applied to the Dundee corpus sentences
−Using Brian Roark's incremental PCFG parser
12
Testing surprisal theory (Demberg & Keller, 2008)
But accurate word prediction is difficult because of:
−required world knowledge
−differences between the WSJ and The Independent:
 −1988–'89 versus 2002
 −general WSJ articles versus Independent editorials
 −American English versus British English
 −only major similarity: both are in English
Result: no significant effect of word surprisal on RT, apart from the effects of Pr(w_t) and Pr(w_t | w_{t−1}).
Test for a purely 'structural' (i.e., non-semantic) effect by ignoring the actual words.
13
[The same parse tree with the words removed, leaving only part-of-speech (POS) tags: "it has no bearing on our work force today" becomes PRP VBZ DT NN IN PRP$ NN NN NN]
14
‘Unlexicalized’ (or ‘structural’) surprisal (Demberg & Keller, 2008)
Is a word's RT related to the predictability of its part-of-speech?
−PCFG induced from WSJ trees with the words removed
−Surprisal estimated by parsing sequences of POS tags (instead of words) of the Dundee corpus texts
−Independent of semantics, so more accurate estimation is possible
−But probably a weaker relation with reading times
Result: yes, a statistically significant (but very small) effect of POS surprisal on word RT.
15
Caveats
Statistical analysis:
−The analysis assumes independent measurements.
−Surprisal theory is based on dependencies between words.
−So the analysis is inconsistent with the theory.
Implicit assumptions:
−The PCFG forms an accurate language model (i.e., it gives high probability to the parts of speech that actually occur).
−An accurate language model is also an accurate psycholinguistic model (i.e., it predicts reading times).
16
Solutions
Sentence-level (instead of word-level) analysis:
−Both the PCFG and the statistical analysis assume independence between sentences.
−Surprisal averaged over the POS tags in the sentence.
−Total sentence RT divided by sentence length (number of letters).
Measure accuracy:
a) of the language model: lower average surprisal → more accurate language model
b) of the psycholinguistic model: RT and surprisal correlate more strongly → more accurate psycholinguistic model
If a) and b) increase together, accurate language models are also accurate psycholinguistic models.
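A sketch of the sentence-level aggregation described above, under exactly those assumptions (average POS surprisal per sentence; total reading time divided by the number of letters); the function names are illustrative.

```python
def sentence_surprisal(pos_surprisals):
    """Average surprisal over the POS tags of one sentence."""
    return sum(pos_surprisals) / len(pos_surprisals)

def sentence_reading_time(total_rt_ms, words):
    """Total sentence reading time divided by sentence length in letters."""
    n_letters = sum(len(w) for w in words)
    return total_rt_ms / n_letters

# Illustrative call: a four-word sentence read in 1200 ms in total.
print(sentence_surprisal([3.1, 4.7, 2.0, 5.2]),
      sentence_reading_time(1200, ["the", "cat", "is", "here"]))
```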
17
Comparing PCFG and SRN
PCFG:
−Train on the WSJ treebank (unlexicalized).
−Parse POS-tag sequences from the Dundee corpus.
−Obtain a range of surprisal estimates by varying the 'beam-width' parameter, which controls parser accuracy.
SRN:
−Train on the sequences of POS tags (not the trees) from the WSJ.
−During training (at regular intervals), process the POS tags from the Dundee corpus, obtaining a range of surprisal estimates.
Evaluation:
−Language model: average surprisal measures inaccuracy (and estimates language entropy).
−Psycholinguistic model: correlation between surprisals and RTs, just like Demberg & Keller.
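The two evaluation measures can be computed per model as in the sketch below, assuming parallel lists of per-sentence surprisal and reading-time values (names and the example numbers are made up for illustration); the original analyses may differ in detail.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(sentence_surprisals, sentence_rts):
    """Return the two accuracy measures for one model:
    - average surprisal: lower = more accurate language model (estimates entropy)
    - surprisal-RT correlation: stronger = more accurate psycholinguistic model"""
    avg_surprisal = float(np.mean(sentence_surprisals))
    r, p = pearsonr(sentence_surprisals, sentence_rts)
    return avg_surprisal, r, p

# Illustrative call with made-up values for five sentences:
print(evaluate([4.2, 5.1, 3.8, 6.0, 4.9], [31.0, 35.2, 29.5, 38.1, 33.0]))
```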
18
Results
[Plots of the results for the PCFG and the SRN]
19
Preliminary conclusions
−Both models account for a statistically significant fraction of variance in the reading-time data.
−The human sentence-processing system seems to be using an accurate language model.
−The SRN is the more accurate psycholinguistic model.
−But the PCFG and SRN together might form an even better psycholinguistic model.
20
Improved analysis
Linear mixed-effects regression model (to take into account random effects of subject and item).
Compare regression models that include:
−surprisal estimates by the PCFG with the largest beam width
−surprisal estimates by the fully trained SRN
−both
Also include: sentence length, word frequency, forward and backward transitional probabilities, and all significant two-way interactions between these.
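For illustration only, a hedged sketch of this kind of analysis in Python with statsmodels: the column names and input file are assumptions, only a random intercept per subject is shown (the actual analysis also had random effects of item, plus transitional probabilities and interactions), and the original work need not have used this software.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed: one row per observation with columns
# rt, surprisal_pcfg, surprisal_srn, length, frequency, subject, item.
df = pd.read_csv("reading_times.csv")      # hypothetical file

# Baseline predictors plus both surprisal estimates; random intercept per subject.
model = smf.mixedlm(
    "rt ~ length + frequency + surprisal_pcfg + surprisal_srn",
    data=df,
    groups=df["subject"],
)
result = model.fit()
print(result.summary())    # beta coefficients and p-values for the surprisal terms
```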
21
Results: estimated β-coefficients (and associated p-values).
Rows: effect of surprisal according to each model; columns: which surprisal estimates the regression model includes.

                   PCFG only       SRN only        both
PCFG surprisal     0.45 (p<.02)
SRN surprisal
22
Results: estimated β-coefficients (and associated p-values).
Rows: effect of surprisal according to each model; columns: which surprisal estimates the regression model includes.

                   PCFG only       SRN only        both
PCFG surprisal     0.45 (p<.02)
SRN surprisal                      0.64 (p<.001)
23
Results: estimated β-coefficients (and associated p-values).
Rows: effect of surprisal according to each model; columns: which surprisal estimates the regression model includes.

                   PCFG only       SRN only        both
PCFG surprisal     0.45 (p<.02)                    −0.46 (p>.2)
SRN surprisal                      0.64 (p<.001)    1.02 (p<.01)
24
Conclusions
−Both the PCFG and the SRN account for the reading-time data to some extent.
−But the PCFG does not improve on the SRN's predictions.
−No evidence for tree structures in the mental representation of sentences.
25
Qualitative comparison
Why does the SRN fit the data better than the PCFG? Is it more accurate on a particular group of data points, or does it perform better overall?
−Take the regression analyses' residuals (i.e., the differences between predicted and measured reading times).
−Compute δ_i = |resid_i(PCFG)| − |resid_i(SRN)|: δ_i is the extent to which data point i is predicted better by the SRN than by the PCFG.
−Is there a group of data points for which δ is larger than might be expected? Look at the distribution of the δs.
26
Possible distributions of δ
[Three sketched distributions, each on an axis marked 0:]
−symmetrical with mean 0: no difference between SRN and PCFG (only random noise)
−right-shifted: overall better predictions by the SRN
−right-skewed: a particular subset of the data is predicted better by the SRN
27
Test for symmetry
−If the distribution of δ is asymmetric, particular data points are predicted better by the SRN than by the PCFG.
−The distribution is not significantly asymmetric (two-sample Kolmogorov-Smirnov test, p>.17).
−So the SRN seems to be a more accurate psycholinguistic model overall.
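One way to implement the comparison on the last three slides: compute δ from the two regression models' residuals and compare its distribution with its own mirror image using a two-sample Kolmogorov-Smirnov test. Centering δ at its median before mirroring is an assumption about how the symmetry test was set up.

```python
import numpy as np
from scipy.stats import ks_2samp

def symmetry_test(resid_pcfg, resid_srn):
    """delta_i > 0 means data point i is predicted better by the SRN than by the PCFG."""
    delta = np.abs(np.asarray(resid_pcfg)) - np.abs(np.asarray(resid_srn))
    centered = delta - np.median(delta)      # remove any overall shift first
    # Compare the centered deltas with their mirror image: a significant
    # difference would indicate an asymmetric (e.g. right-skewed) distribution,
    # i.e. a particular subset of data points predicted better by the SRN.
    res = ks_2samp(centered, -centered)
    return delta.mean(), res.statistic, res.pvalue
```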
28
Questions I cannot (yet) answer (but would like to)
Why does the SRN fit the data better than the PCFG? Perhaps people…
−are bad at dealing with long-distance dependencies in a sentence?
−store information about the frequency of multi-word sequences?
29
Questions I cannot (yet) answer (but would like to)
In general:
−Is this SRN a more accurate psychological model than this PCFG?
−Are SRNs more accurate psychological models than PCFGs?
−Does connectionism make for more accurate psychological models than grammar-based theories?
−What kind of representations are used in human sentence processing?
Surprisal-based model evaluation may provide some answers.