
Computational Psycholinguistics Lecture 2: surprisal, incremental syntactic processing, and approximate surprisal
Florian Jaeger & Roger Levy
LSA 2011 Summer Institute, Boulder, CO, 12 July 2011

Comprehension: Theoretical Desiderata
How do we get from the input (e.g., "the boy will eat…") to a structured interpretation?
Realistic models of human sentence comprehension must account for:
Robustness to arbitrary input
Accurate disambiguation
Inference on the basis of incomplete input (Tanenhaus et al. 1995, Altmann & Kamide 1999, Kaiser & Trueswell 2004)
Processing difficulty that is differential and localized

Review: Garden-pathing under Jurafsky 1996
Scoring relative probability of incremental trees
An incremental tree is a fully connected sequence of nodes from the root category (typically S) to all the terminals (words) that have been seen so far
Nodes on the right frontier of an incremental tree are still "open" (could accrue further daughters)
What kind of uncertainty does the Jurafsky 1996 model of garden-pathing deal with? Uncertainty about what has already been said

Generalizing incremental disambiguation
Another type of uncertainty: uncertainty about what has not yet been said
Reading-time (Ehrlich & Rayner, 1981) and EEG (Kutas & Hillyard, 1980, 1984) evidence shows this affects processing rapidly
A good model should account for expectations about how this uncertainty will be resolved
The old man stopped and stared at the… woman? dog? view? statue?
The squirrel stored some nuts in the… tree

Non-probabilistic complexity
On the traditional view, resource limitations, especially memory, drive processing complexity
Gibson 1998, 2000 (DLT): multiple and/or more distant dependencies are harder to process
Easy: the reporter who attacked the senator
Hard: the reporter who the senator attacked

Probabilistic complexity: surprisal
Hale (2001) proposed that a word's complexity in sentence comprehension is determined by its surprisal
This idea can actually be traced back (at least) to Mandelbrot (1953)
(Cognitive science in the 1950s was extremely interesting; many ideas remain to be mined!)
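Writing the definition out (notation mine, consistent with the rest of the lecture): the surprisal of word $w_i$ is its negative log probability in context,

\[
\mathrm{surprisal}(w_i) \;=\; -\log P(w_i \mid w_1 \ldots w_{i-1}) \;=\; -\log \frac{P(w_1 \ldots w_i)}{P(w_1 \ldots w_{i-1})},
\]

so a word that is highly predictable from its context contributes little processing cost, and an unexpected word contributes a lot.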

The surprisal graph

Garden-pathing under surprisal
Another type of local syntactic ambiguity:
When the dog scratched the vet and his new assistant removed the muzzle.
Compare with:
When the dog scratched, the vet and his new assistant removed the muzzle.
When the dog scratched its owner the vet and his new assistant removed the muzzle.

A small PCFG for this sentence type

Two incremental trees

Surprisal for the two variants

Expectations versus memory
Suppose you know that some event class X has to happen in the future, but you don't know:
1. When X is going to occur
2. Which member of X it's going to be
The things W you see before X can give you hints about (1) and (2)
If expectations facilitate processing, then seeing W should generally speed processing of X
But you also have to keep W in memory and retrieve it at X
This could slow processing at X

Study 1: Verb-final domains
Konieczny 2000 looked at reading times at German final verbs in a self-paced reading experiment:
Er hat die Gruppe auf den Berg geführt (gloss: He has the group to the mountain led; "He led the group to the mountain")
Er hat die Gruppe geführt (gloss: He has the group led; "He led the group")
Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt (gloss: He has the group to the VERY BEAUTIFUL mountain led; "He led the group to the very beautiful mountain")

Locality predictions and empirical results
Locality-based models (Gibson 1998) predict difficulty at the final verb in longer clauses
But Konieczny found that final verbs were read faster in longer clauses:
Er hat die Gruppe geführt ("He led the group"): predicted easiest, read slowest
Er hat die Gruppe auf den Berg geführt ("He led the group to the mountain"): predicted harder, read faster
...die Gruppe auf den sehr schönen Berg geführt ("He led the group to the very beautiful mountain"): predicted hardest, read fastest

Predictions of surprisal (Levy 2008)
Er hat die Gruppe (auf den (sehr schönen) Berg) geführt
Locality-based models (e.g., Gibson 1998, 2000) would violate monotonicity
[Figure: locality-based difficulty (ordinal) vs. surprisal predictions for the final verb]

Deriving Konieczny's results
Once we've seen a PP goal we're unlikely to see another
So the expectation of seeing anything else (including the final verb) goes up
Seeing more = having more information; more information = more accurate expectations
p_i(w) obtained via a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
[Figure: incremental tree for "Er hat die Gruppe auf den Berg … geführt", with candidate continuations after the PP: NP? PP-goal? PP-loc? Verb? ADVP?]
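As a toy illustration of this sharpening effect (the numbers below are invented for illustration, not the NEGRA-derived PCFG probabilities used in the model): once a dependent type has been seen, its probability mass is redistributed over the remaining continuations, so the final verb becomes more expected.

```python
def next_category_dist(seen):
    """Toy conditional distribution over the next constituent in a German verb-final
    clause, given which preverbal dependent types have already occurred."""
    base = {"NP-acc": 0.30, "PP-goal": 0.20, "PP-loc": 0.10, "ADVP": 0.10, "Verb": 0.30}
    # Assume a dependent type does not recur once it has been seen:
    remaining = {cat: p for cat, p in base.items() if cat not in seen}
    total = sum(remaining.values())
    return {cat: p / total for cat, p in remaining.items()}

for seen in [set(), {"NP-acc"}, {"NP-acc", "PP-goal"}]:
    print(f"seen={sorted(seen)}  P(Verb next) = {next_category_dist(seen)['Verb']:.2f}")
# P(Verb next) rises (0.30 -> 0.43 -> 0.60), so the final verb's surprisal falls.
```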

Study 2: Final verbs, effect of dative (Konieczny & Döring 2003)
... daß der Freund DEM Kunden das Auto verkaufte (gloss: that the friend the client the car sold; "...that the friend sold the client a car...")
... daß der Freund DES Kunden das Auto verkaufte (gloss: that the friend the client the car sold; "...that the friend of the client sold a car...")
Locality: final verb predicted to be read faster in the DES condition
Observed: final verb read faster in the DEM condition

[Figure: incremental trees for the two conditions. After "daß der Freund DEM Kunden" (dative) vs. "daß der Freund DES Kunden" (genitive) plus "das Auto", the predicted next categories (NP-nom, NP-acc, NP-dat, PP, ADVP, Verb) differ, yielding different expectations for the final verb "verkaufte".]

Model results
Locality-based predictions: dem Kunden (dative) slower, des Kunden (genitive) faster
Model: ~30% greater expectation for the final verb in the dative condition; once again, the wrong monotonicity for locality
[Figure: observed reading time (ms) and model word probability P(w_i) at the final verb by condition]

Theoretical bases for surprisal
So far, we have simply stipulated that complexity ~ surprisal
To a mathematician, surprisal is a natural cost metric
But as cognitive scientists, we would like to derive surprisal from prior principles
I'll present three derivations of surprisal in this section

(1) Surprisal as relative entropy (Levy 2008)
Relative entropy: a fundamental information-theoretic measure of the distance between two probability distributions
Intuitively, the penalty paid by encoding one distribution with a different one
It turns out that the relative entropy between the distributions over interpretations before and after seeing w_i equals the surprisal of w_i
Surprisal can thus be thought of as reranking cost
Relative entropy was independently proposed as a measure of surprise in visual scene perception (Itti & Baldi 2005)
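Written out (following Levy 2008, with $T$ ranging over interpretations and assuming each interpretation determines the words it spans):

\[
D\!\left(P(T \mid w_{1\ldots i}) \,\middle\|\, P(T \mid w_{1\ldots i-1})\right)
= \sum_{T} P(T \mid w_{1\ldots i}) \log \frac{P(T \mid w_{1\ldots i})}{P(T \mid w_{1\ldots i-1})}
= -\log P(w_i \mid w_{1\ldots i-1}),
\]

since for every $T$ consistent with $w_{1\ldots i}$ the ratio inside the log equals $1/P(w_i \mid w_{1\ldots i-1})$, and the posterior weights sum to one.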

(2) Surprisal as optimal discrimination (Norris 2006)
Many theories of reading posit lexical access as the key bottleneck: E-Z Reader (Reichle et al., 1998); SWIFT (Engbert et al., 2005)
The same bottleneck should hold for auditory comprehension as well
Norris (2006)'s Bayesian Reader: lexical access involves a probabilistic judgment about the word's identity from noisy input
Certainty takes a "random walk" in probability space toward a decision threshold, and surprisal determines the starting point of the walk
Connections with the diffusion model (Ratcliff 1978) and MSPRT (Baum & Veeravalli 1994)
Also connections with cortical decision-process models (e.g., Usher & McClelland 2001)
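A minimal simulation of that linking idea (a sketch under my own simplifying assumptions, not Norris's actual Bayesian Reader): evidence accumulates noisily toward a decision threshold from a starting point set by the word's log prior probability, so lower-probability (higher-surprisal) words take longer on average to be identified.

```python
import math
import random

def decision_time(log_prior, threshold=5.0, drift=0.25, noise=1.0, rng=None):
    """Steps for accumulated log-odds evidence to cross the decision threshold,
    starting from the word's log prior (higher surprisal = lower starting point)."""
    rng = rng or random.Random()
    x, steps = log_prior, 0
    while x < threshold:
        x += drift + rng.gauss(0.0, noise)
        steps += 1
    return steps

for p in (0.2, 0.02, 0.002):
    rng = random.Random(42)
    times = [decision_time(math.log(p), rng=rng) for _ in range(2000)]
    print(f"P(word)={p:.3f}  mean identification time ~ {sum(times)/len(times):.1f} steps")
```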

(3) Surprisal as optimal preparation (Smith & Levy, 2008)
Are all RT differences best modeled as discrimination?
Intuitively, it makes sense to prepare for events you expect to happen
Such preparation allows increased average response speed
Smith & Levy (2008) formalize this intuition as an optimization of response speed against (fixed) preparation costs: let the brain choose response times, but faster is costlier
Adding scale-freeness (a unit's processing cost is the sum of the costs of its subunits) yields surprisal, under very general conditions
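A compressed sketch of the scale-freeness step (my paraphrase, not the full Smith & Levy derivation): if processing a unit that has probability $p$ costs $K(p)$, and a unit composed of independent subunits costs the sum of its subunits' costs, then

\[
K(p_1 p_2) \;=\; K(p_1) + K(p_2) \quad\Longrightarrow\quad K(p) \;=\; -k \log p \ \text{ for some } k > 0,
\]

i.e., under mild regularity conditions, cost is proportional to surprisal.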

Is probabilistic facilitation logarithmic? (Smith & Levy, 2008)
What I've shown you so far: more expected = faster
What the theoretical derivations promise: more expected = faster on a logarithmic scale
This has been established for frequency, but not for probability
A focused look at the subtleties of specific constructions may not be the best way to investigate this issue: highly refined probability distributions are challenging to estimate, and we need a lot of data to get a good view of the picture
Solution: a broad-coverage model, reading over free text

Log-probability: methods
Dataset: the Dundee Corpus (Kennedy et al., 2003), 50K words of British newspaper text, read by 10 speakers
Measures of interest:
"Frontier" fixations (all fixations beyond the farthest fixation thus far)
First fixations (frontier fixations falling on a new word)
[Figure: the words "fox jumped over the lazy dog" with frontier fixations and first fixations marked]

Deconfounding frequency & probability
Major confound: log-frequency, widely recognized to have a linear effect on RT
Unfortunately, frequency and probability are heavily correlated (ρ ≈ 0.8)
Fortunately, there's still a big cloud of data to help us discriminate between the two (N ≈ 200,000)

Log-probability: results
Facilitation is essentially linear in log-probability
This holds even after controlling conservatively for frequency and word-length effects
[Figure: binned median log-probabilities vs. frontier-fixation RTs, with a nonparametric regression curve]
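A sketch of this kind of control analysis (the file name and column names are hypothetical; the actual Smith & Levy analyses use more flexible nonparametric controls):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-fixation data: one row per frontier fixation.
df = pd.read_csv("dundee_fixations.csv")  # assumed columns: rt, log_prob, log_freq, word_len

# Does log-probability still predict RT once log-frequency and word length are controlled?
model = smf.ols("rt ~ log_prob + log_freq + word_len", data=df).fit()
print(model.summary())
```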

Aggregation across words & spillover
[Figures: eye-tracking and self-paced reading results]

When ambiguity facilitates comprehension
Sometimes, ambiguity seems to facilitate processing (Traxler et al. 1998; Van Gompel et al. 2001, 2005):
The daughter(i) of the colonel(j) who shot himself(*i/j) [unambiguous: slower]
The daughter(i) of the colonel(j) who shot herself(i/*j) [unambiguous]
The son(i) of the colonel(j) who shot himself(i/j) [ambiguous: faster]
Argued to be problematic for parallel constraint-based competition models (MacDonald, Pearlmutter, & Seidenberg 1994), though see the rebuttal by Green & Mitchell 2006

Traditional account: stochastic race model (Traxler et al. 1998; Van Gompel et al. 2001, 2005)
Sometimes the reader attaches the RC low… and everything's OK
But sometimes the reader attaches the RC high… and the continuation (himself) is anomalous
So we're seeing garden-pathing 'some' of the time
[Figure: [NP [NP the daughter] [PP of [NP the colonel]]] with the RC "who shot…" attached high or low, continuing with himself]

Surprisal as a parallel alternative
Assume a generative model where the choice between herself and himself is determined only by the antecedent's gender
Surprisal marginalizes over possible syntactic structures
[Figure: the high- and low-attachment structures for "the daughter of the colonel who shot…", each generating a reflexive whose form depends on the antecedent]

Ambiguity reduces the surprisal
The high attachment son…who shot… can contribute probability mass to himself, whereas daughter…who shot… cannot, so himself is less surprising in the ambiguous (son) sentence
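The marginalization behind this claim, written out (notation mine): letting $A \in \{\text{high}, \text{low}\}$ be the attachment site of the relative clause,

\[
P(\textit{himself} \mid \text{prefix}) \;=\; \sum_{A} P(A \mid \text{prefix})\; P(\textit{himself} \mid A, \text{prefix}).
\]

For the son… prefix both attachment terms can be nonzero, while for the daughter… prefix only the low attachment contributes, so $-\log P(\textit{himself} \mid \text{prefix})$ is lower in the ambiguous son… condition.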

Ambiguity/surprisal conclusion
Cases where ambiguity reduces difficulty aren't problematic for parallel constraint satisfaction, although they may be problematic for competition
Surprisal can be thought of as a revision of constraint-based theories with competition:
Same: a variety of constraints immediately brought to bear on syntactic comprehension
Different: the linking hypothesis from probabilistic constraints to behavioral observables

Competition versus surprisal: speculation
Swets et al. (submitted): question type can affect behavioral responses to ambiguous RCs (e.g., "Did the colonel get shot?")
Asking about the RC slowed RC reading time across the board
And speed of response interacted with question type: RC questions were answered slowest in the ambiguous condition
Speculation: comprehension is generally parallel and surprisal-based; competition emerges when the comprehender is forced into a serial channel

Memory constraints: a theoretical puzzle (Levy, Reali, & Griffiths, 2009, NIPS)
The number of logically possible analyses grows at least exponentially in sentence length
Exact probabilistic inference with context-free grammars can be done efficiently in O(n^3), but this requires probabilistic locality, limiting the conditioning context
Human parsing is linear, i.e. O(n), anyway
So we must be restricting attention to some subset of analyses
Puzzle: how to choose and manage this subset?
Previous efforts: k-best beam search
Here, we'll explore the particle filter as a model of limited-parallel approximate inference

The particle filter: general picture
Sequential Monte Carlo for incremental observations
Let x_i be observed data and z_i be unobserved states; for parsing, the x_i are words and the z_i are incremental structures
Suppose that after n-1 observations we have the distribution over interpretations P(z_{n-1} | x_{1…n-1})
After the next observation x_n, represent the next distribution P(z_n | x_{1…n}) inductively:
Approximate P(z_i | x_{1…i}) by samples
Sample z_n from P(z_n | z_{n-1}), and reweight by P(x_n | z_n)
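A generic bootstrap particle filter in this style (a sketch of the algorithm itself, not of the incremental PCFG parser in Levy, Reali, & Griffiths 2009; the transition and likelihood models are passed in as functions):

```python
import random

def particle_filter(observations, init, transition, likelihood, n_particles=100, rng=None):
    """Bootstrap particle filter: propagate, reweight, and resample at each observation.
    init() samples an initial state; transition(z, rng) samples from P(z_n | z_{n-1});
    likelihood(x, z) returns P(x_n | z_n)."""
    rng = rng or random.Random(0)
    particles = [init() for _ in range(n_particles)]
    for x in observations:
        # Propagate each particle through the transition model, then weight by the likelihood
        particles = [transition(z, rng) for z in particles]
        weights = [likelihood(x, z) for z in particles]
        if sum(weights) == 0:
            raise RuntimeError(f"all particles assign zero probability to {x!r}")
        # Multinomial resampling in proportion to the weights
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return particles
```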

Particle filter with probabilistic grammars
A toy PCFG:
S → NP VP (1.0)
NP → N (0.8)
NP → N RRC (0.2)
RRC → Part N (1.0)
VP → V N (1.0)
N → women (0.7)
N → sandwiches (0.3)
V → brought (0.4)
V → broke (0.3)
V → tripped (0.3)
Part → brought (0.1)
Part → broken (0.7)
Part → tripped (0.2)
Adv → quickly (1.0)
[Figure: two incremental parses of "women brought sandwiches": a main-verb analysis (S → NP VP) and a reduced-relative analysis (NP → N RRC), the latter continuing with "tripped" as the main verb]

Resampling in the particle filter
With the naïve particle filter, inferences are highly dependent on initial choices
Most particles wind up with small weights, and the region of dense posterior is poorly explored
This is especially bad for parsing, since the space of possible parses grows (at least) exponentially with input length
We handle this by resampling at each input word
[Figure: particle trajectories over the input, before and after resampling]

Simple garden-path sentences
The woman brought the sandwich from the kitchen tripped
Two analyses at "brought": MAIN VERB (it was the woman who brought the sandwich) vs. REDUCED RELATIVE (the woman was brought the sandwich)
The posterior is initially misled away from the ultimately correct interpretation
With a finite number of particles, recovery is not always successful

Solving a puzzle (Frazier & Rayner, 1982; Tabor & Hutchins, 2004)
A-S: Tom heard the gossip wasn't true.
A-L: Tom heard the gossip about the neighbors wasn't true.
U-S: Tom heard that the gossip wasn't true.
U-L: Tom heard that the gossip about the neighbors wasn't true.
Previous empirical finding: ambiguity induces difficulty… but so does the length of the ambiguous region
Our linking hypothesis: the proportion of parse failures at the disambiguating region should increase with sentence difficulty

Another example (Tabor & Hutchins 2004)
As the author wrote the essay the book grew.
As the author wrote the book grew.
As the author wrote the essay the book describing Babylon grew.
As the author wrote the book describing Babylon grew.

Resampling-induced drift
In the ambiguous region, the observed words aren't strongly informative (P(x_i | z_i) is similar across different z_i)
But due to resampling, the particle approximation of P(z_i | x_{1…i}) will drift, and one of the interpretations may be lost
The longer the ambiguous region, the more likely this is
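A minimal simulation of this effect (a sketch under my own assumptions, not the paper's parser): two interpretations start with equal numbers of particles, every word in the ambiguous region is equally likely under both, and resampling variance alone eventually drives one interpretation extinct; the longer the region, the more often that happens.

```python
import random

def fraction_lost(n_particles=20, region_length=5, n_runs=5000, seed=0):
    """Fraction of runs in which interpretation 'A' has no surviving particles
    after one round of uniform-weight resampling per word of the ambiguous region."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(n_runs):
        particles = ["A"] * (n_particles // 2) + ["B"] * (n_particles // 2)
        for _ in range(region_length):
            # Words in the ambiguous region are equally consistent with both interpretations,
            # so resampling weights are uniform; drift comes from sampling variance alone.
            particles = [rng.choice(particles) for _ in range(n_particles)]
        if "A" not in particles:
            lost += 1
    return lost / n_runs

for length in (2, 5, 10):
    print(f"ambiguous-region length {length}: P(interpretation lost) ~ {fraction_lost(region_length=length):.2f}")
```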

Model results
Ambiguity matters… but the length of the ambiguous region also matters!

Human results (offline rating study)