1
Day 5: Entropy reduction models; connectionist models; course wrap-up Roger Levy University of Edinburgh & University of California – San Diego
2
Today Other information-theoretic models: Hale 2003 Connectionist models General discussion & course wrap-up
3
Hale 2003, 2005: Entropy reduction Another attempt to measure reduction in uncertainty arising from a word in a sentence The information contained in a grammar can be measured by the grammar’s entropy But grammars usually denote infinite sets of trees! Not a problem: Grenander 1967 shows how to calculate entropy of any* stochastic context-free branching process [skip details of matrix algebra…] *a probability distribution over an infinite set is not, however, guaranteed to have finite entropy (Cover & Thomas 1991, ch. 2)
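One compact way to do the skipped matrix-algebra step: for each nonterminal X, let h(X) be the entropy of the rule choice at X and A[X,Y] the expected number of Y symbols introduced by one rewrite of X; the derivation entropies then satisfy H = h + AH, i.e. H = (I − A)⁻¹h whenever the grammar is consistent. A minimal sketch in Python, using a made-up toy grammar (not one from the course):

```python
import numpy as np

# Hypothetical toy PCFG (illustrative only, not a grammar from the course).
# Each rule: (left-hand side, right-hand side symbols, probability).
rules = [
    ("S",  ["NP", "VP"],  1.0),
    ("NP", ["det", "n"],  0.7),
    ("NP", ["NP", "RC"],  0.3),   # recursive nominal postmodification
    ("RC", ["who", "VP"], 1.0),
    ("VP", ["v", "NP"],   0.6),
    ("VP", ["v"],         0.4),
]
nts = ["S", "NP", "RC", "VP"]
idx = {nt: i for i, nt in enumerate(nts)}

h = np.zeros(len(nts))              # h[X]: entropy (bits) of the rule choice at X
A = np.zeros((len(nts), len(nts)))  # A[X,Y]: expected number of Y's per rewrite of X
for lhs, rhs, p in rules:
    i = idx[lhs]
    h[i] -= p * np.log2(p)
    for sym in rhs:
        if sym in idx:
            A[i, idx[sym]] += p

# Derivation entropies satisfy H = h + A H, i.e. H = (I - A)^-1 h,
# provided the grammar is consistent (spectral radius of A below 1).
H = np.linalg.solve(np.eye(len(nts)) - A, h)
print({nt: round(float(H[idx[nt]]), 3) for nt in nts})
```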
4
Entropy reduction, cont. A partial input w 1…n rules out some of the possibilities in a grammar Uncertainty after a partial input can be characterized as the entropy of the grammar conditioned on w 1…n We can calculate this uncertainty efficiently by treating a partial input as a function from grammars to grammars (Lang 1988, 1989) [skip details of parsing theory…]
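Spelling out the uncertainty measure the slide refers to: if T ranges over the complete structures compatible with the partial input, the conditional entropy is

$$ H(T \mid w_{1 \dots n}) = -\sum_{T} P(T \mid w_{1 \dots n}) \log_2 P(T \mid w_{1 \dots n}) $$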
5
Entropy reduction, punchline Hale’s entropy reduction hypothesis: it takes effort to decrease uncertainty The amount of processing work associated with a word is the uncertainty-reduction it causes (if any) Note that entropy doesn’t always go down! more on this in a moment ER idea originally attributed to Wilson and Carroll (1954)
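In symbols (one common way to state the hypothesis, with H_i the conditional tree entropy after the first i words):

$$ \mathrm{ER}(w_i) = \max\bigl(0,\; H_{i-1} - H_i\bigr) $$

so words that leave entropy unchanged, or raise it, are assigned zero processing work.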
6
Intuitions for entropy reduction Consider a string composed of n independent, identically distributed (i.i.d.) variables, each carrying 1.5 bits of entropy (see the worked example below) The entropy of the “grammar” is 1.5n, for any n When you see w_i, the entropy goes from 1.5(n-i+1) to 1.5(n-i) So the ER of any word is 1.5 Note that ER(w_i) is not really related to P(w_i) (each X_i has 1.5 bits of entropy)
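For concreteness (the slide leaves the symbol distribution unspecified), suppose each X_i ranges over three symbols with probabilities 0.5, 0.25, 0.25, which gives exactly 1.5 bits per symbol:

$$ H(X_i) = -(0.5\log_2 0.5 + 0.25\log_2 0.25 + 0.25\log_2 0.25) = 1.5 $$

$$ H(X_1,\dots,X_n) = 1.5n, \qquad \mathrm{ER}(w_i) = 1.5(n-i+1) - 1.5(n-i) = 1.5 $$

whereas the surprisal of w_i is 1 or 2 bits depending on which symbol actually occurs.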
7
Intuitions for entropy reduction (2) Cases when entropy doesn’t go down: Consider the following PCFG: ROOT → a b 0.9 ROOT → x y 0.05 ROOT → x z 0.05 If you see a, the only possible resulting tree is ROOT → a b, so entropy goes to 0, for an ER of 0.57 If you see x, there are two equiprobable continuations, ROOT → x y and ROOT → x z, hence entropy goes to 1, for an ER of 0 Entropy goes up, and what’s more, the more likely continuation is predicted to be harder H(G) ≡ H(ROOT) = 0.57
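The numbers on the slide fall out directly:

$$ H(\mathrm{ROOT}) = -(0.9\log_2 0.9 + 0.05\log_2 0.05 + 0.05\log_2 0.05) \approx 0.57 $$

After a, the posterior over trees is degenerate, so the new entropy is 0 and ER = 0.57 − 0 = 0.57; after x, the posterior is (0.5, 0.5) over the two remaining trees, so the new entropy is 1 and ER = max(0, 0.57 − 1) = 0.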
8
Entropy reduction applied Hale 2003 looked at object relative clauses versus subject relative clauses One important property of tree entropy is that it is very high before recursive categories NP → NP RC is recursive because a relative clause can expand into just about any kind of sentence So transitions past the end of an NP have a big entropy reduction Subject RC: the reporter who sent the photographer to the editor Object RC: the reporter who the photographer sent to the editor
9
Subject RCs (Hale 2003)
10
Object RCs (Hale 2003)
11
Entropy reduction in RCs Crossing past the embedded subject NP is a big entropy reduction This means that reading the embedded verb is predicted to be harder in the object RC than in the subject RC No other probabilistic processing theory seems to get this data point right Pruning, attention shift, competition not applicable: no prominent ambiguity about what has been said Surprisal probably predicts the opposite: the embedded subject NP makes the context more constrained Some non-probabilistic theories* do get this right *e.g., Gibson 1998, 2000
12
Surprisal & Entropy reduction compared Surprisal and entropy reduction share some features… High-to-low uncertainty transitions are hard …and make some similar predictions… Constrained syntactic contexts are lower-entropy, thus will tend to be easier under ER ER also predicts facilitative ambiguity, since the resulting state is higher-entropy (so less uncertainty is reduced)
13
Surprisal & Entropy reduction, cont. …but they are also different… ER compares summary statistics of different probability distributions Surprisal implicitly compares two distributions point by point …and make some different predictions Surprisal predicts that more predictable words will be read more quickly in a given constrained context ER penalizes transitions across recursive states such as nominal postmodification (useful for some results in English relative clauses; see Hale 2003)
14
Future ideas Deriving some of these measures from more mechanistic views of sentence processing Information-theoretic measures on conditional probability distributions Connection with competition models: entropy of a c.p.d. as an index of amount of competition Closer links between specific theoretical measures and specific types of observable behavior Readers might use local entropy (not reduction) to gauge how carefully they must read
15
Today Other information-theoretic models: Hale 2003 Connectionist models General discussion & course wrap-up
16
Connectionist models: overview Connectionists have also had a lot to say about sentence comprehension Historically, there has been something of a sociological divide between the connectionists and the “grammars” people It’s useful to understand the commonalities and differences between the two approaches* *I know less about the connectionist literature than about the “grammars” literature!
17
What are “connectionist models”? A connectionist model is one that uses a neural network to represent knowledge of language and its deployment in real time Biological plausibility is often taken as motivation Mathematically, a generalization of the multiclass generalized linear models covered on Wednesday Two types of generalization: additional layers (give the flexibility of non-linear modeling); recurrence (allows modeling of time-dependence)
18
Review: (generalized) linear models A (generalized) linear model expresses a predictor as the dot product of a weight vector with the inputs For a categorical output variable, the class predictors can be thought of as drawing a decision boundary (i ranges from 1 to m for an m-class categorization problem; see the form below)
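Presumably the omitted slide equation was the standard multiclass (softmax) form, in which each class gets its own weight vector and the probabilities come from exponentiating and normalizing the linear predictors:

$$ \eta_i = \mathbf{w}_i \cdot \mathbf{x}, \qquad P(z_i \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_i \cdot \mathbf{x})}{\sum_{j=1}^{m} \exp(\mathbf{w}_j \cdot \mathbf{x})}, \qquad i = 1, \dots, m $$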
19
[Figure: non-linear decision boundary. Probabilistic interpretation: close to the boundary means P(z_1) ≈ 0.5]
20
Non-linear classification Extra network layers → non-linear decision boundaries Boundaries can be curved, discontinuous, … A double-edged sword: Expressivity can mean good fit using few parameters But difficult to train (local maxima in likelihood surface) Plus danger of overfitting
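A minimal illustration of why an extra layer buys non-linear boundaries: with hand-picked weights (a hypothetical toy setup, not anything from the course), a one-hidden-layer network separates the XOR pattern, which no single linear boundary can.

```python
import numpy as np

def step(a):
    return (a > 0).astype(float)  # hard threshold unit

# Hand-picked weights for XOR: the two hidden units carve out a band,
# and the output unit fires only inside it.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])       # columns = hidden units; each sums x1 + x2
b1 = np.array([-0.5, -1.5])       # h1 fires when x1 + x2 > 0.5, h2 when x1 + x2 > 1.5
W2 = np.array([1.0, -1.0])        # output fires when h1 - h2 > 0.5
b2 = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
hidden = step(X @ W1 + b1)
output = step(hidden @ W2 + b2)
print(output)   # [0. 1. 1. 0.] -- the XOR pattern: a non-linear decision boundary
```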
21
Schematic picture of multilayer nets [Diagram: input layer → hidden layer 1 → … → hidden layer K → output layer]
22
Recurrence and online processing The networks shown thus far are “feed-forward” – there are no feedback cycles between nodes (contrast with the CI model of McRae et al.) How to capture the incremental nature of online language processing in a connectionist model? Proposal by Elman 1990, 1991: let part of the internal state of the network feed back in as input later on Simple Recurrent Network (SRN)
23
Simple recurrent networks [Diagram: the input layer, together with a context layer holding a direct copy of a hidden layer’s activations from the previous time step, feeds into hidden layer J, then the hidden layers up to K, then the output layer]
24
Recurrence “unfolded” Unfolded over time, the hidden layer at each step becomes the context for the next step The hidden layer takes on a new role as “memory” of the sentence [Diagram: the network unrolled over time, reading inputs w_1, w_2, w_3, … and producing next-word outputs w_2, …, w_4; see the sketch below]
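A minimal sketch of the SRN update (hypothetical dimensions and random, untrained weights, just to show the copy-back mechanics): at each step the current input is combined with the previous hidden state, and the new hidden state is saved as the next step’s context.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 8       # hypothetical sizes

# Random, untrained weights -- only the wiring matters for this sketch.
W_in  = rng.normal(size=(vocab_size, hidden_size))   # input   -> hidden
W_ctx = rng.normal(size=(hidden_size, hidden_size))  # context -> hidden
W_out = rng.normal(size=(hidden_size, vocab_size))   # hidden  -> output

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def srn_step(word_id, context):
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                                  # one-hot input for the current word
    hidden = np.tanh(x @ W_in + context @ W_ctx)      # new hidden state
    probs = softmax(hidden @ W_out)                   # distribution over the next word
    return probs, hidden                              # hidden becomes the next context

context = np.zeros(hidden_size)                       # empty "memory" at the start
for word_id in [3, 1, 4]:                             # a made-up word-id sequence
    probs, context = srn_step(word_id, context)       # direct copy: hidden -> context
    print(f"after word {word_id}: most probable next word id = {int(probs.argmax())}")
```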
25
Connectionist models: overview (2) Connectionist work typically draws a closer connection between the acquisition process and results in “normal adult” sentence comprehension Highly evident in Christiansen & Chater 1999 The “grammars” modelers, in contrast, tend to assume perfect acquisition & representation e.g., Jurafsky 1996 & Levy 2005 assumed PCFGs derived directly from a treebank
26
Connectionist models: overview (3) Two major types of linking hypothesis between the online comprehension process and observables: Predictive hypothesis: next-word error rates should match reading times (starting from Elman 1990, 1991) Gravitational hypothesis (Tabor, Juliano, & Tanenhaus; Tabor & Tanenhaus): more like a competition model In both cases, latent outcomes of the acquisition process are of major interest That is, how does a connectionist network learn to represent things like the MV/RR ambiguity?
27
Connectionist models for psycholinguistics Words are presented to the network in sequence The model is trained to minimize next-word prediction error That is, there is no overt syntax or tree structure built into the model Performance of the model is evaluated on how well it predicts next words This metric turns out to be closely related to surprisal Example training sequence: n1 v5 # N3 n8 v2 V4 # …
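To make the relationship concrete: a network trained with a cross-entropy objective on next-word prediction is penalized at each word by (roughly) the negative log probability it assigned to that word, i.e. its own surprisal estimate:

$$ \mathrm{error}(w_i) \approx -\log_2 \hat{P}(w_i \mid w_{1 \dots i-1}) $$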
28
Christiansen & Chater 1999 Bach et al. 1986: cross-dependency recursion seems easier to process than nesting recursion Investigation: do connectionist models learn cross-dependency recursion better than nesting recursion? Cross-serial gloss: Aand has Jantje the teacher the marbles let help collect up. Nested gloss: Aand has Jantje the teacher the marbles up collect help let.
29
These were small artificial languages, so both networks ultimately learned well But the center-embedding recursions were learned more slowly (poor learning with 5 hidden units)
30
Connectionism summary Stronger association posited between constraints on acquisition and constraints on online processing among adults Evidence that cross-serial dependencies are easier for a network architecture to learn N.B. Steedman 1999 has critiqued neural networks for essentially learning n-gram surface structure; this would do better with cross-serial dependencies Underlying issue: do networks learn hierarchical structure? If not, what learning biases would lead them to?
31
Today Other information-theoretic models: Hale 2003 Connectionist models General discussion & course wrap-up
32
Course summary: Probabilistic models in psycholinguistics Differences in technical details of models should not obscure major shared features & differences Shared feature: use of probability to model incremental disambiguation computational standpoint: achieves broad coverage with ambiguity management psycholinguistic standpoint: evidence is strong that people do deploy probabilistic information online Examples: the complex houses vs. the corporation fires; the {crook/cop} arrested by the detective; the group (to the mountain) led
33
Course summary (2) Differences: connection with observable behavior Pruning: difficulty when a needed analysis is lost Competition: difficulty when multiple analyses have substantial support Reranking: difficulty when the distribution over analyses changes Next-word prediction (surprisal, connectionism) Note the close relationships among some of these approaches Pruning and surprisal are both special types of reranking Competition has conceptual similarity to attention shift
34
Course wrap-up: general ideas In most cases, the theories presented here are not mutually exclusive, e.g.: Surprisal and entropy-reduction are presented as full-parallel, but could also be limited-parallel Attention-shift effects could coexist with competition, surprisal, or entropy reduction In some cases, theories say very different things, e.g.: Ambiguity is treated very differently under competition vs. surprisal: the {daughter/son} of the colonel who shot himself… But even in these cases, different theories may have different natural domains of explanation
35
Course wrap-up: unresolved issues Serial vs. parallel sentence parsing More of a continuum than a dichotomy Empirically distinguishing serial vs. parallel is difficult Practical limitations of probabilistic models Our ability to estimate surface (n-gram) models is probably beyond that of humans (we have more data!) But our ability to estimate structured (syntactic, semantic) probabilistic models is poor! Less annotated data; no real-world semantics; unsupervised methods still poorly understood We’re still happy if monotonicities match human data
36
Course wrap-up: unexplored territory Best-first search strategies for modeling reading times (a possible basis for heavily serial computational parsing models) Entropy as a direct measure of various types of uncertainty Entropy of P(Tree | String) as a measure of uncertainty as to what has been said Entropy of P(w_i | w_1…i-1) as a measure of uncertainty as to what may yet be said Models of processing load combining probabilistic and non-probabilistic factors