Day 4: Reranking/Attention shift; surprisal-based sentence processing Roger Levy University of Edinburgh & University of California – San Diego

Overview for the day: Reranking & attention shift; Crash course in information theory; Surprisal-based sentence processing.

Reranking & attention shift. Suppose an input prefix w_{1…i} determines a ranked set of incremental structural analyses, call it Struct(w_{1…i}). In general, adding a new word w_{i+1} to the input determines a new ranked set of analyses, Struct(w_{1…i+1}). A reranking theory attributes processing difficulty to some function comparing the two sets of structural analyses. An attention shift theory is a special case in which difficulty is predicted only when the highest-ranked analysis differs between Struct(w_{1…i}) and Struct(w_{1…i+1}).
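As a concrete illustration, here is a minimal code sketch (not from the slides) of the two linking hypotheses; the analysis labels and probabilities below are purely illustrative:

```python
# Minimal sketch of reranking vs. attention shift (illustrative only).
# A ranked analysis set Struct(w_1..i) is represented as a dict mapping an
# analysis label to its conditional probability P(T | w_1..i).

def reranking_cost(struct_old, struct_new, compare):
    """Generic reranking theory: difficulty is some function of the change
    from the old ranked set of analyses to the new one."""
    return compare(struct_old, struct_new)

def attention_shift_cost(struct_old, struct_new):
    """Attention shift: difficulty is predicted only when the highest-ranked
    analysis changes."""
    top = lambda struct: max(struct, key=struct.get)
    return 1.0 if top(struct_old) != top(struct_new) else 0.0

# Hypothetical numbers for "The warehouse fires ..." before and after "many":
before = {"NN-compound": 0.79, "NV-clause": 0.21}
after  = {"NN-compound": 0.35, "NV-clause": 0.65}

print(attention_shift_cost(before, after))  # 1.0: top analysis changed, difficulty predicted

# One possible generic reranking cost: total probability mass that moved.
tv_distance = lambda a, b: 0.5 * sum(abs(a[k] - b[k]) for k in a)
print(reranking_cost(before, after, tv_distance))  # 0.44
```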

Conceptual issues. Granularity: what precisely is specified in an incremental structural analysis? Ranking metric: how are analyses ranked? E.g., in terms of conditional probabilities P(T | w_{1…i}). Degree of parallelism: how many (and which) analyses are retained in Struct(w_{1…i})?

Crocker & Brants 2000 [brainstorming session]

Attention shift: an example. Parallel comprehension: two or more analyses are entertained simultaneously. Disambiguation comes at the following context, "many workers…". There is an extra cost (slower reading) at the disambiguating context, seen in eye-tracking (Frazier & Rayner 1987) and self-paced reading (MacDonald 1993). Example: The warehouse fires many workers each spring…

Pruning isn't enough. Jurafsky analyzed the NN/NV ambiguity of "warehouse fires" and concluded that no pruning could happen: the probability ratio of the two analyses was only about 3.8:1, too small for either analysis to be pruned.

Idea of attention shift. Suppose that a change in the top-ranked candidate induces empirically observed "difficulty". This is not the same as serial parsing, which doesn't even entertain alternate parses unless the current parse breaks down. Why would this happen? People could be gathering more information about the preferred parse, and need extra time to do this when the preferred parse changes. Or people could simply be surprised, and this could interrupt "normal reading processes".

Crocker & Brants 2000. They adopt an attention-shift linking hypothesis (page 660; unfortunately not stated very explicitly). Architectural aspects of their system: a bottom-up, incremental parsing architecture; some pruning at every "layer" from the bottom up; no lexicalization in the grammar. Other details skipped…

N/V ambiguity under attention shift. Crocker & Brants 2000: the relative strength of each interpretation changes from word to word.

N/V attention shift: which probabilities? This analysis relies on lexical & syntactic probabilities: P(fires|NN) is higher than P(fires|VBZ); P(NP -> Det NN NN) is low; and putting "many" after a subject NP is low-probability. Is this a satisfactory analysis? (cf. day 1!) MacDonald 1993 found no disambiguating-context difficulty when the noun (corporation) doesn't support the noun-compound analysis. These are, at the least, bilexical affinities. Example: The corporation fires many workers each spring.

Results from MacDonald 1993. Difficulty at "fires" only with "warehouse", not with "corporation"; the observed difficulty was delayed a bit (spillover). [Figure omitted: relative difficulty in the ambiguous case, by region.]

How to estimate parse probs. In an attention-shift model, conditional probabilities are of primary interest. "Warehouse fires" vs. "corporation fires" creates a practical problem: the model should include P(fires|warehouse,{NN,NV}) and P(fires|corporation,{NN,NV}), but no parsed corpus even contains "fires" in the same sentence as either of these words. What do we do here?

How to estimate parse probs (2). MacDonald 1993's approach: collect relevant quantitative norm data and correlate them with RTs. For "warehouse": head vs. modifying noun frequency, corresponding to P(NN|warehouse). For "fires": noun/verb ambiguous word usage, corresponding (indirectly) to P(fires|NN). For "warehouse fires": the modifier+head cooccurrence rate, corresponding to P(fires|warehouse,NN), and plausibility ratings as NN vs. NV ("how plausible is it to have a fire in a warehouse?" vs. "how plausible is it for a warehouse to fire someone?").

How to estimate parse probs (omit). We can use MacDonald's head vs. modifying frequency, plus the cooccurrence frequency, plus bigram and unigram frequencies, to determine P(NN) in each case. [Table omitted: MacDonald's estimates vs. corpus estimates.]

How to estimate parse probs (3). In the era of gigantic corpora (e.g., the Web), another approach is the counting method: to estimate P(NN|the warehouse fires), simply collect a sample of "the warehouse fires" and count how many of the hits are NN usages. There are many pitfalls: you often can't hold the external sentence context constant; you are vulnerable to the undisclosed workings of search engines; hand-filtering the results is imperative; and the method assumes human probability estimates will match corpus frequencies. BUT it gives access to huge amounts of data!

How to estimate parse probs (4). Crude method: use a corpus search (Google) to estimate P(NN|warehouse,fires). 21 instances of "warehouse fires" were found (excluding psycholinguistics hits!); all were NN usages, though in two of them the local context would also have permitted an NV reading: "I heard an interview on NPR of a Vieux Carre (French Quarter) native who explained how the warehouse fires started..."; "Not all the warehouse fires were so devastating,...". This is at least some evidence that P(NN|warehouse,fires) is above 0.5, which supports the attention-shift analysis.

Attention shift in MV/RR ambiguity? McRae et al.'s results also have an attention-shift interpretation (pursued by Narayanan & Jurafsky 2002): the point at which the top-ranked analysis shifts to the reduced relative differs for good patients ("the crook") vs. good agents ("the cop"). [Figure omitted.]

Reranking/attention shift summary. Reranking attributes difficulty to changes in the ranking over interpretations caused by a given word. Attention shift is a special form in which only changes in the highest-ranked candidate matter.

Overview for the day: Reranking & attention shift; Tiny introduction to information theory; Surprisal-based sentence processing.

Tiny intro to information theory. The Shannon information content, or surprisal, of an event x is h(x) = log2 (1/P(x)) = -log2 P(x) (sometimes called the entropy of event x). Example: for a bent coin with P(heads)=0.4, h(heads) = log2 (1/0.4) ≈ 1.32 bits. A loaded die with P(1)=0.4 likewise has h(1) ≈ 1.32 bits.

Tiny intro to information theory (2). The entropy of a discrete probability distribution is the expected value of its Shannon information content: H(X) = Σ_x P(x) log2 (1/P(x)). Example: the entropy of a fair coin is H = 0.5·log2 2 + 0.5·log2 2 = 1 bit. Our bent P(heads)=0.4 coin has entropy less than 1: H = 0.4·log2 (1/0.4) + 0.6·log2 (1/0.6) ≈ 0.97 bits.

Tiny intro to information theory (3). Our loaded die with P(1)=0.4 doesn't have its entropy completely determined yet: it depends on how the remaining probability mass of 0.6 is distributed over the other five faces, e.g., spread evenly vs. concentrated on a single face (see the sketch below). For comparison, a fair die has entropy log2 6 ≈ 2.58 bits.
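A short Python sketch of these quantities; the two die distributions at the end are illustrative choices for the missing examples, not necessarily the ones used in the original slides:

```python
from math import log2

def surprisal(p):
    """Shannon information content (surprisal) of an event with probability p, in bits."""
    return log2(1 / p)

def entropy(dist):
    """Entropy of a discrete distribution: the expected surprisal, in bits."""
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(0.4))       # bent coin, heads: ~1.32 bits
print(entropy([0.5, 0.5]))  # fair coin: 1 bit
print(entropy([0.4, 0.6]))  # bent coin: ~0.97 bits
print(entropy([1/6] * 6))   # fair die: ~2.58 bits

# P(1)=0.4 alone does not fix the die's entropy; it depends on how the
# remaining 0.6 is spread over the other faces (both distributions below
# are illustrative, not from the slides).
print(entropy([0.4, 0.12, 0.12, 0.12, 0.12, 0.12]))  # spread evenly: ~2.36 bits
print(entropy([0.4, 0.6, 0, 0, 0, 0]))               # concentrated on one face: ~0.97 bits
```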

Overview for the day: Reranking & attention shift; Crash course in information theory; Surprisal-based sentence processing.

Hale 2001, Levy 2005: surprisal. Let the difficulty of a word be its surprisal given its context: difficulty(w_i) ∝ -log2 P(w_i | w_{1…i-1}, context). This captures the expectation intuition: the more we expect an event, the easier it is to process. Many probabilistic formalisms, including PCFGs (Jelinek & Lafferty 1991, Stolcke 1995), can give us word surprisals.

Intuitions for surprisal & PCFGs. Consider the following PCFG, and calculate the surprisal at "destroyed" in the two sentences below. (A worked sketch of the calculation follows.)
P(S → NP VP) = 1.0
P(NP → DT N) = 0.4
P(NP → DT N N) = 0.3
P(NP → DT Adj N) = 0.3
P(VP → V) = 0.3
P(VP → V NP) = 0.4
P(VP → V PP) = 0.1
P(DT → the) = 0.3
P(N → warehouse) = 0.03
P(N → fires) = 0.02
P(V → fires) = 0.05
P(V → destroyed) = 0.04
the warehouse fires destroyed the neighborhood.
the fires destroyed the neighborhood.
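One way to work the example is via prefix probabilities (Jelinek & Lafferty 1991): the surprisal of a word is the negative log ratio of the prefix probability with and without it. The sketch below does this bookkeeping by hand for the two prefixes of interest, assuming the grammar is proper and that the unlisted lexical probability mass belongs to words that do not occur in these prefixes (so it drops out of the sums):

```python
from math import log2

# Rule probabilities from the slide (only those needed for the two prefixes).
p = {
    'S->NP VP': 1.0,
    'NP->DT N': 0.4, 'NP->DT N N': 0.3, 'NP->DT Adj N': 0.3,
    'VP->V': 0.3, 'VP->V NP': 0.4, 'VP->V PP': 0.1,
    'DT->the': 0.3, 'N->warehouse': 0.03, 'N->fires': 0.02,
    'V->fires': 0.05, 'V->destroyed': 0.04,
}

# Every VP rule begins with V, so once a prefix ends at the verb inside a VP,
# the VP contributes the sum of its rule probabilities (the rest of the VP is
# unobserved and marginalized out).
VP_ANY = p['VP->V'] + p['VP->V NP'] + p['VP->V PP']  # = 0.8

def surprisal(prefix_with_word, prefix_without_word):
    """Surprisal of the new word: -log2 of the ratio of prefix probabilities."""
    return -log2(prefix_with_word / prefix_without_word)

# ---- "the warehouse fires destroyed ..." ----
# Prefix "the warehouse fires" has two incremental analyses:
# (a) noun compound: [NP the warehouse fires] ...
nn = p['S->NP VP'] * p['NP->DT N N'] * p['DT->the'] * p['N->warehouse'] * p['N->fires']
# (b) noun + verb: [NP the warehouse] [VP fires ...]
nv = p['S->NP VP'] * p['NP->DT N'] * p['DT->the'] * p['N->warehouse'] * VP_ANY * p['V->fires']
prefix_warehouse_fires = nn + nv

# Only analysis (a) can continue with the verb "destroyed"; under (b) the next
# word would have to start an NP or PP, and "destroyed" is only a V here.
prefix_warehouse_fires_destroyed = nn * VP_ANY * p['V->destroyed']
print(surprisal(prefix_warehouse_fires_destroyed, prefix_warehouse_fires))  # ~6.8 bits

# ---- "the fires destroyed ..." ----
# Prefix "the fires": either a complete subject NP, or the start of a compound.
dt_n  = p['S->NP VP'] * p['NP->DT N']   * p['DT->the'] * p['N->fires']
dt_nn = p['S->NP VP'] * p['NP->DT N N'] * p['DT->the'] * p['N->fires']  # expects another N
prefix_the_fires = dt_n + dt_nn

prefix_the_fires_destroyed = dt_n * VP_ANY * p['V->destroyed']
print(surprisal(prefix_the_fires_destroyed, prefix_the_fires))  # ~5.8 bits
```

The surprisal at "destroyed" comes out higher after "the warehouse fires" than after "the fires", because much of the prefix probability after "the warehouse fires" sits on the noun+verb analysis, which cannot continue with another verb.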

Connection with reranking models. Levy 2005 shows that surprisal is a special form of reranking model: if the reranking cost is taken to be the KL divergence* between the old and new parse distributions, then the reranking cost turns out to be equivalent to the surprisal of the new word w_{i+1}. Representation neutrality is thus an interesting consequence of the surprisal theory. (*A measure of the penalty incurred by encoding one probability distribution with another.)
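A sketch of the key step, assuming the structures T range over complete analyses consistent with the prefix, so that P(T, w_{1…i+1}) = P(T, w_{1…i}) for every T in the support of the new distribution:

D_KL( P(T | w_{1…i+1}) || P(T | w_{1…i}) )
  = Σ_T P(T | w_{1…i+1}) · log2 [ P(T | w_{1…i+1}) / P(T | w_{1…i}) ]
  = Σ_T P(T | w_{1…i+1}) · log2 [ P(w_{1…i}) / P(w_{1…i+1}) ]
  = -log2 P(w_{i+1} | w_{1…i}),

i.e. exactly the surprisal of w_{i+1}, however the structures T are represented.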

Levy 2006: syntactically constrained contexts. In many cases, you know that you will have to encounter a particular category C, but you don't know when you'll encounter it, or which member of C will actually appear. Call these syntactically constrained contexts. In these contexts, the more information related to C you obtain, the sharper your expectations about C generally turn out to be. This is an interesting contrast with some non-probabilistic theories, which say that holding onto the related information is hard.

Constrained contexts: final verbs. Konieczny 2000 looked at reading times at German clause-final verbs:
Er hat die Gruppe geführt (word by word: He has the group led) = "He led the group"
Er hat die Gruppe auf den Berg geführt (He has the group to the mountain led) = "He led the group to the mountain"
Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt (He has the group to the VERY BEAUTIFUL mountain led) = "He led the group to the very beautiful mountain"

Locality predictions and empirical results. Locality-based models (Gibson 1998) predict difficulty for longer clauses, but Konieczny found that final verbs were read faster in longer clauses:
Er hat die Gruppe geführt ("He led the group"): predicted easiest, read slowest
Er hat die Gruppe auf den Berg geführt ("He led the group to the mountain"): predicted harder, read faster
...die Gruppe auf den sehr schönen Berg geführt ("He led the group to the very beautiful mountain"): predicted hardest, read fastest

Surprisal's predictions for: Er hat die Gruppe (auf den (sehr schönen) Berg) geführt.

Deriving Konieczny's results. Once we've seen a PP goal, we're unlikely to see another, so the expectation of seeing anything else (in particular the final verb) goes up. Seeing more = having more information; more information = more accurate expectations. For p_i(w), a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank) was used. [Incremental parse: Er (subject) + hat (Vfin) + VP containing [NP die Gruppe] [PP auf den Berg], with possible continuations: NP? PP-goal? PP-loc? ADVP? Verb (geführt)?]

Facilitative ambiguity and surprisal. Review of when ambiguity facilitates processing (Traxler et al. 1998; Van Gompel et al. 2001, 2005):
The daughter_i of the colonel_j who shot himself_{*i/j} (harder)
The daughter_i of the colonel_j who shot herself_{i/*j} (harder)
The son_i of the colonel_j who shot himself_{i/j} (easier)

Traditional account: probabilistic serial disambiguation. Sometimes the reader attaches the RC low… and everything's OK. But sometimes the reader attaches the RC high… and the continuation ("himself") is anomalous. So we're seeing garden-pathing 'some' of the time. [Tree: [NP [NP the daughter] [PP of [NP the colonel]]], with the RC "who shot…" attaching either high (to "the daughter", where "himself" fails) or low (to "the colonel", where "himself" is fine).]

Surprisal as a parallel alternative. Assume a generative model in which the choice between "herself" and "himself" is determined only by the antecedent's gender. Surprisal then marginalizes over the possible syntactic structures. [Two trees: the RC "who shot…" attached high to "the daughter" vs. low to "the colonel", each continuing with a reflexive.]

Ambiguity reduces the surprisal. With "son…who shot…", both attachments can contribute probability mass to "himself"; with "daughter…who shot…", only the low attachment can. A toy illustration follows.
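A toy numerical sketch of the marginalization (the attachment probabilities are hypothetical, purely for illustration):

```python
from math import log2

# Hypothetical prior probabilities of attaching the RC high vs. low
# after "... of the colonel who shot" (illustrative numbers only).
p_high, p_low = 0.3, 0.7

# Assume the reflexive's form is determined only by its antecedent's gender.
# "The daughter of the colonel who shot ...": himself is possible only under low attachment.
p_himself_daughter = p_high * 0.0 + p_low * 1.0
# "The son of the colonel who shot ...": himself is possible under either attachment.
p_himself_son = p_high * 1.0 + p_low * 1.0

print(-log2(p_himself_daughter))  # ~0.51 bits of surprisal at "himself"
print(-log2(p_himself_son))       # 0 bits: the ambiguous case is predicted to be easier
```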

Ambiguity/surprisal conclusion. Cases where ambiguity reduces difficulty aren't problematic for parallel constraint satisfaction, although they are problematic for competition. Attributing difficulty to surprisal rather than competition is a satisfactory revision of constraint-based theories.

Surprisal and garden paths: theory. Revisiting "the horse raced past the barn fell". After "the horse raced past the barn", assume two parses: the main-verb (MV) parse and the reduced-relative (RR) parse. Jurafsky 1996 estimated the probability ratio of these parses as 82:1. The surprisal differential of "fell" in the reduced versus unreduced conditions should thus be log2 83 ≈ 6.4 bits (assuming independence between RC reduction and the main verb). See the sketch below for where the 83 comes from.
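A sketch of the arithmetic, assuming the 82:1 ratio favors the MV parse and that only the RR parse allows "fell" to continue the sentence: in the reduced condition the RR parse carries 1/(82+1) of the prefix probability mass, while in the unreduced condition it carries essentially all of it, so

surprisal_reduced(fell) - surprisal_unreduced(fell)
  ≈ -log2 [ P(RR) / (P(RR) + P(MV)) ] - (-log2 1)
  = -log2 [ 1 / (1 + 82) ]
  = log2 83 ≈ 6.4 bits.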

Surprisal and garden paths: practice. An unlexicalized PCFG (estimated from the Brown corpus) gets the monotonicity of surprisals at the disambiguating word "fell" right, but there are some unwanted results too: the differential at "fell" goes in the right direction but is small, while the surprisals at "raced" are way too high. [Figure of word-by-word surprisals omitted.]

Surprisal and garden paths. "Raced" has high surprisal because the grammar is unlexicalized: there is no connection with "horse". Unfortunately, lexicalization in practice wouldn't help: "race" as a verb never co-occurs with "horse" in the Penn Treebank! The surprisal differential at "fell" is small for the same reason: the failure to account for the lexical preferences of "raced" means that the probability of the RR alternative is likely overestimated. Is surprisal a plausible source of explanation for the most dramatic garden-path effects? It still seems unclear.

Surprisal summary. Motivation: expectations affect processing. When people encounter something unexpected, they are surprised, and this translates into slower reading (= processing difficulty?). This intuition can be captured and formalized using tools from probability theory, information theory, and statistical NLP.

Tomorrow: other information-theoretic approaches to on-line sentence processing; a brief look at connectionist approaches to sentence processing; general discussion & course wrap-up.