Probabilistic Methods in Computational Psycholinguistics Roger Levy University of Edinburgh & University of California – San Diego

Course overview
Computational linguistics and psycholinguistics have a long-standing affinity.
Course focus: comprehension ("sentence processing")
CL: what formalisms and algorithms are required to obtain structural representations of a sentence (string)?
PsychoLx: how is knowledge of language mentally represented and deployed during comprehension?
Probabilistic methods have "taken CL by storm"
This course covers the application of probabilistic methods from CL to problems in psycholinguistics

Course overview (2)
Unlike most courses at ESSLLI, our data of primary interest are derived from psycholinguistic experimentation.
However, because we are using probabilistic methods, naturally occurring corpus data are also very important.
Linking hypothesis: people deploy probabilistic information derived from experience with (and thus reflecting) naturally occurring data.

Course overview (3)
Probabilistic-methods practitioners in psycholinguistics agree that humans use probabilistic information to disambiguate linguistic input,
but they disagree on the linking hypotheses between how probabilistic information is deployed and observable measures of online language comprehension:
Pruning models
Competition models
Reranking/attention-shift models
Information-theoretic models
Connectionist models

Course overview (4)
Outline of topics & core articles for the course:
1. Pruning approaches: Jurafsky 1996
2. Competition models: McRae et al.
3. Reranking/attention-shift models: Narayanan & Jurafsky
4. Information-theoretic models: Hale
5. Connectionist models: Christiansen & Chater 1999
Lots of other related articles and readings, plus some course-related software, are on the course website.
Look at the core article before each day of lecture.

Lecture format
I will be covering the major points of each core article.
Emphasis will be on the major conceptual building blocks of each approach.
I'll be lecturing from slides, but interrupting me (politely) with questions and discussion is encouraged.
At various points we'll have blackboard "brainstorming" sessions as well.
*footnotes down here are for valuable points I don't have time to emphasize in lecture, but feel free to ask about them

Today
Crash course in probability theory
Crash course in natural language syntax and parsing
Crash course in psycholinguistic methods
Pruning models: Jurafsky 1996

Probability theory: what? why?
Probability theory is the calculus of reasoning under uncertainty.
This makes it well-suited to modeling the process of language comprehension.
Language comprehension involves uncertainty about:
What has already been said: The girl saw the boy with the telescope. (who has the telescope?)
What has not yet been said: The children went outside to... (play? chat? …)

Crash course in probability theory
Event space Ω
A function P from subsets of Ω to real numbers such that:
Non-negativity: P(E) ≥ 0 for every E ⊆ Ω
Properness: P(Ω) = 1
Disjoint union: if E₁ ∩ E₂ = ∅, then P(E₁ ∪ E₂) = P(E₁) + P(E₂)
An improper function P, for which P(Ω) < 1, is called deficient

Probability: an example
Rolling a die has event space Ω = {1,2,3,4,5,6}.
If it is a fair die, we require of the function P: P({1}) = P({2}) = … = P({6}) = 1/6
Disjoint union means that this requirement completely specifies the probability distribution P.
For example, the event that a roll of the die comes out even is E = {2,4,6}. For a fair die, its probability is P(E) = P({2}) + P({4}) + P({6}) = 1/2.
Using disjoint union to calculate event probabilities is known as the counting method.
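As a minimal illustration, here is the counting method for the slide's fair-die example in Python (the language choice is ours, not the course's):

```python
from fractions import Fraction

# Event space for a fair die: each outcome gets probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in omega}

# The event "the roll comes out even" is the subset E = {2, 4, 6}.
E = {2, 4, 6}

# Disjoint union: the event's probability is the sum over its outcomes.
print(sum(P[outcome] for outcome in E))  # 1/2
```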

Joint and conditional probability
P(X,Y) is called a joint probability, e.g., the probability of a pair of dice coming out a particular way.
Two events are independent if the probability of the joint event is the product of the individual event probabilities: P(X,Y) = P(X)P(Y)
P(Y|X) is called a conditional probability. By definition, P(Y|X) = P(X,Y) / P(X)
This gives rise to Bayes' Theorem: P(Y|X) = P(X|Y)P(Y) / P(X)
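A small sketch that checks these definitions by counting, using a pair of fair dice (the particular events X and Y below are illustrative choices, not taken from the slides):

```python
from fractions import Fraction
from itertools import product

# Joint distribution over a pair of fair dice: 36 equiprobable outcomes.
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def prob(event):
    """P(event), where the event is a predicate over outcomes (x, y)."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

X = lambda x, y: x + y == 7            # event X: the dice sum to 7
Y = lambda x, y: x == 3                # event Y: the first die shows 3
XY = lambda x, y: X(x, y) and Y(x, y)  # joint event

# Independence: here P(X,Y) happens to equal P(X)P(Y).
assert prob(XY) == prob(X) * prob(Y)

# Conditional probability by definition, and the same value via Bayes' theorem.
p_Y_given_X = prob(XY) / prob(X)
assert p_Y_given_X == (prob(XY) / prob(Y)) * prob(Y) / prob(X)  # P(X|Y)P(Y)/P(X)
print(p_Y_given_X)  # 1/6
```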

Estimating probabilistic models
With a fair die, we can calculate event probabilities using the counting method.
But usually we can't deduce the probabilities of the subevents involved.
Instead, we have to estimate them (= statistics!).
Usually this involves assuming a probabilistic model with some free parameters,* and choosing the values of the free parameters to match empirically obtained data.
*(these are parametric estimation methods)

Maximum likelihood
Simpler example: a coin flip. Fair? Unfair?
Take a dataset of 20 coin flips: 12 heads and 8 tails.
Estimate the probability p that the next result is heads.
Method of maximum likelihood: choose parameter values (i.e., p) that maximize the likelihood* of the data.
Here, the maximum-likelihood estimate (MLE) is the relative-frequency estimate (RFE): p = 12/20 = 0.6.
*likelihood: the data's probability, viewed as a function of your free parameters
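A minimal sketch of the coin example: maximize the likelihood over a grid of candidate values of p and recover the relative-frequency estimate (NumPy is assumed to be available):

```python
import numpy as np

heads, tails = 12, 8  # the dataset from the slide: 20 flips, 12 heads

def log_likelihood(p):
    # The data's (log) probability, viewed as a function of the parameter p.
    return heads * np.log(p) + tails * np.log(1 - p)

grid = np.linspace(0.01, 0.99, 981)             # candidate values of p
p_mle = grid[np.argmax(log_likelihood(grid))]
print(round(float(p_mle), 3))  # 0.6, i.e. the relative-frequency estimate 12/20
```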

Issues in model estimation
Maximum-likelihood estimation has several problems:
It can't incorporate a belief that the coin is "likely" to be fair.
MLEs can be biased: try to estimate the number of words in a language from a finite sample; the MLE will always underestimate the number of words.
There are other estimation techniques (Bayesian, maximum-entropy, …) that have different advantages.
When we have "lots" of data,* the choice of estimation technique rarely makes much difference.
*unfortunately, we rarely have "lots" of data
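A toy simulation of the vocabulary-size point (the Zipf-like frequency distribution, vocabulary size, and sample size below are assumptions made just for illustration):

```python
import random

random.seed(0)
true_vocab = 50_000                                        # assumed true number of word types
zipf_weights = [1 / r for r in range(1, true_vocab + 1)]   # frequency proportional to 1/rank

# Draw a finite corpus sample and count the word types actually observed.
sample = random.choices(range(true_vocab), weights=zipf_weights, k=10_000)
observed_types = len(set(sample))

# The MLE of the number of word types is the number seen so far,
# which necessarily undershoots the true vocabulary size.
print(observed_types, "types observed; true vocabulary size:", true_vocab)
```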

Generative vs. Discriminative Models
Inference makes use of conditional probability distributions P(H|O): the probability of hidden structure given observations.
Discriminatively-learned models estimate this conditional distribution directly.
Generatively-learned models estimate the joint probability of observations and hidden structure, P(O,H); Bayes' theorem is then used to find the conditional distribution and do inference.

Generative vs. Discriminative Models in Psycholinguistics
Different researchers have also placed the locus of action at generative (joint) versus discriminative (conditional) models.
Are we interested in P(Tree|String) or P(Tree,String)?
This reflects a difference in ambiguity type:
P(Tree|String): uncertainty only about what has been said
P(Tree,String): uncertainty also about what may yet be said

Today
Crash course in probability theory
Crash course in natural language syntax and parsing
Crash course in psycholinguistic methods
Pruning models: Jurafsky 1996

Crash course in grammars and parsing
A grammar is a structured set of production rules.
Most commonly used for syntactic description, but also useful for semantics, phonology, …
E.g., context-free grammars:
S → NP VP
NP → Det N
VP → V NP
Det → the
N → dog
N → cat
V → chased
A grammar is said to license a derivation: e.g., this grammar licenses a derivation of "the dog chased the cat" ✓
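A minimal sketch of this grammar as a Python data structure, with a brute-force check of whether a string is licensed (this recognizer is only for illustration; the parsing strategies actually discussed follow on the next slides):

```python
# The toy CFG from the slide: left-hand side -> list of right-hand sides.
GRAMMAR = {
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N")],
    "VP":  [("V", "NP")],
    "Det": [("the",)],
    "N":   [("dog",), ("cat",)],
    "V":   [("chased",)],
}

def licensed(category, words):
    """True if the grammar licenses a derivation of `words` from `category`."""
    if category not in GRAMMAR:                    # terminal symbol
        return words == [category]
    return any(derives(rhs, words) for rhs in GRAMMAR[category])

def derives(categories, words):
    """True if the sequence of categories can jointly derive the word list."""
    if not categories:
        return not words
    head, rest = categories[0], categories[1:]
    return any(licensed(head, words[:i]) and derives(rest, words[i:])
               for i in range(len(words) + 1))

print(licensed("S", "the dog chased the cat".split()))   # True
print(licensed("S", "dog the chased cat the".split()))   # False
```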

Bottom-up parsing
VP → V NP
PP → P NP
S → NP VP
…
Fundamental operation: check whether a sequence of categories matches a rule's right-hand side.
Permits structure building inconsistent with global context.

Top-down parsing
S → NP VP
NP → Det N
Det → The
…
Fundamental operation: expand a predicted category into the right-hand side of a rule whose left-hand side matches it.
Permits structure building inconsistent with perceived input, or corresponding to as-yet-unseen input.

Ambiguity
There is usually more than one structural analysis for a (partial) sentence.
This corresponds to choices (non-determinism) in parsing:
VP can expand to V NP PP, or
VP can expand to V NP and then NP can expand to NP PP
The girl saw the boy with…

Serial vs. Parallel processing
A serial processing model is one that, when faced with a choice, chooses one alternative and discards the rest.
A parallel model is one where at least two alternatives are chosen and maintained.
A full parallel model is one where all alternatives are maintained.
A limited parallel model is one where some, but not necessarily all, alternatives are maintained.
A joke about the man with an umbrella that I heard…
*ambiguity goes as the Catalan numbers (Church and Patil 1982)

Dynamic programming
The number of parse trees for a sentence can grow exponentially in sentence length (Church & Patil 1982).
So sentence comprehension can't entail an exhaustive enumeration of possible structural representations.
But parsing can be made tractable by dynamic programming.
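To make the growth concrete, here is a quick computation of the Catalan numbers cited above, which count the binary bracketings of a string:

```python
from math import comb

def catalan(n):
    # C(n) = (2n choose n) / (n + 1): the number of binary bracketings of n + 1 items.
    return comb(2 * n, n) // (n + 1)

for n in (2, 5, 8, 11):
    print(n, catalan(n))
# 2 2
# 5 42
# 8 1430
# 11 58786
```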

Dynamic programming (2)
Dynamic programming = storage of partial results.
There are two ways to make an NP out of… …but the resulting NP can be stored just once in the parsing process.
Result: parsing time is polynomial (cubic for CFGs) in sentence length.
Still problematic for modeling human sentence processing.
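A compact CKY-style recognizer sketch makes the storage idea concrete; the grammar fragment below is assumed (and, as CKY requires, in Chomsky normal form):

```python
from collections import defaultdict

# Binary rules (children -> parent) and lexical rules for a toy fragment.
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cky_recognize(words):
    n = len(words)
    chart = defaultdict(set)     # chart[(i, j)] = categories spanning words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= LEXICAL.get(w, set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # split point
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        parent = BINARY.get((left, right))
                        if parent:
                            # Each (span, category) pair is stored only once,
                            # however many derivations produce it.
                            chart[(i, j)].add(parent)
    return "S" in chart[(0, n)]

print(cky_recognize("the dog chased the cat".split()))  # True
```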

Hybrid bottom-up and top-down
Many methods used in practice are combinations of top-down and bottom-up regimens:
Left-corner parsing: bottom-up parsing with top-down filtering
Earley parsing: strictly incremental top-down parsing with dynamic programming*
*solves problems of left-recursion that occur in top-down parsing

Probabilistic grammars
A (generative) probabilistic grammar is one that associates probabilities with rule productions; e.g., a probabilistic context-free grammar (PCFG) has rule productions annotated with probabilities.
Interpret P(NP → Det N) as P(Det N | NP), the probability of the expansion given the parent category.
Among other things, PCFGs can be used to achieve disambiguation among parse structures.

a man arrived yesterday
0.3  S → S CC S
0.15 VP → VBD ADVP
0.7  S → NP VP
0.4  ADVP → RB
0.35 NP → DT NN
Total probability: 0.7 × 0.35 × 0.15 × 0.3 × 0.03 × 0.02 × 0.4 × 0.07 = 1.85 × 10⁻⁷
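The product can be checked directly; a one-line sketch (the eight factors are the rule probabilities of the derivation, written out as on the slide):

```python
import math

rule_probs = [0.7, 0.35, 0.15, 0.3, 0.03, 0.02, 0.4, 0.07]
total = math.prod(rule_probs)
print(f"{total:.3g}")  # 1.85e-07, i.e. 1.85 x 10^-7
```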

Probabilistic grammars (2)
A derivation having zero probability corresponds to its being unlicensed in a non-probabilistic setting.
But "canonical" or "frequent" structures can be distinguished from "marginal" or "rare" structures via the derivation rule probabilities.
From a computational perspective, this allows probabilistic grammars to increase coverage (number + type of rules) while maintaining ambiguity management.

The probabilistic serial ↔ parallel gradient
Suppose two incremental interpretations I₁ and I₂ have probabilities p₁ > 0.5 > p₂ after seeing the last word wᵢ.
A full-serial model might keep I₁ at activation level 1 and discard I₂ (i.e., activation level 0).
A full-parallel model would keep both I₁ and I₂, at probabilities p₁ and p₂ respectively.
An intermediate model would keep I₁ at a₁ > p₁ and I₂ at a₂ < p₂.
(A "hyper-parallel" model might keep I₁ at 0.5 < a₁ < p₁ and I₂ at a₂ > p₂.)

Today
Crash course in probability theory
Crash course in natural language syntax and parsing
Crash course in psycholinguistic methods
Pruning models: Jurafsky 1996

Psycholinguistic methodology
The workhorses of psycholinguistic experimentation involve behavioral measures:
What choices do people make in various types of language-producing and language-comprehending situations? And how long do they take to make these choices?
Offline measures: rating sentences, completing sentences, …
Online measures: tracking people's eye movements, having people read words aloud, reading under (implicit) time pressure, …

Psycholinguistic methodology (2) [self-paced reading experiment demo now]

Psycholinguistic methodology (3)
Caveat: neurolinguistic experimentation is more and more widely used to study language comprehension.
Methods vary in temporal and spatial resolution.
People are more passive in these experiments: they sit back and listen to/read a sentence, word by word.
Strictly speaking, these are not behavioral measures.
The question of "what is difficult" becomes a little less straightforward.

Today
Crash course in probability theory
Crash course in natural language syntax and parsing
Crash course in psycholinguistic methods
Pruning models: Jurafsky 1996

Pruning approaches
Jurafsky 1996: a probabilistic approach to lexical access and syntactic disambiguation.
Main argument: sentence comprehension is probabilistic, construction-based, and parallel.
A probabilistic parsing model explains:
human disambiguation preferences
garden-path sentences
The probabilistic parsing model has two components:
constituent probabilities (a probabilistic CFG model)
valence probabilities

Jurafsky 1996
Every word is immediately and completely integrated into the parse of the sentence (i.e., full incrementality).
Alternative parses are ranked in a probabilistic model.
Parsing is limited-parallel: when an alternative parse has unacceptably low probability, it is pruned.
"Unacceptably low" is determined by beam search (described a few slides later).
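A sketch of what beam-style pruning looks like; the beam ratio and the candidate parses below are placeholder assumptions, not values from Jurafsky 1996:

```python
def prune(parses, beam_ratio=0.2):
    """Keep a parse only if its probability is within a fixed ratio
    of the best parse's probability (limited parallelism)."""
    best = max(prob for prob, _ in parses)
    return [(prob, tree) for prob, tree in parses if prob >= beam_ratio * best]

# Hypothetical incremental analyses of a sentence prefix, with probabilities.
candidates = [(0.60, "analysis A"), (0.25, "analysis B"), (0.02, "analysis C")]
print(prune(candidates))   # analysis C falls outside the beam and is pruned
```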

Jurafsky 1996: valency model
Whereas the constituency model makes use of only phrasal, not lexical, information, the valency model tracks lexical subcategorization, e.g.:
P(<NP PP> | discuss) = 0.24
P(<NP> | discuss) = 0.76
(in today's NLP, these are called monolexical probabilities)
In some cases, Jurafsky bins across categories:*
P(<NP XP[+pred]> | keep) = 0.81
P(<NP> | keep) = 0.19
where XP[+pred] can vary across AdjP, VP, PP, Particle, …
*valence probs are RFEs from Connine et al. (1984) and the Penn Treebank

Jurafsky 1996: syntactic model
The syntactic component of Jurafsky's model is just probabilistic context-free grammars (PCFGs).
Total probability (for the earlier example a man arrived yesterday): 0.7 × 0.35 × 0.15 × 0.3 × 0.03 × 0.02 × 0.4 × 0.07 = 1.85 × 10⁻⁷

Modeling offline preferences
Ford et al. (1982) found an effect of lexical selection in PP attachment preferences (offline, forced-choice):
The women discussed the dogs on the beach
NP-attachment (the dogs that were on the beach): 90%
VP-attachment (discussed while on the beach): 10%
The women kept the dogs on the beach
NP-attachment: 5%
VP-attachment: 95%
Broadly confirmed in an online attachment study by Taraban and McClelland 1988.

Modeling offline preferences (2)
Jurafsky ranks parses as the product of constituent and valence probabilities.
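A sketch of how the product ranking plays out for the two verbs; the valence probabilities are those quoted on the valency-model slide, while the two constituent probabilities are hypothetical placeholders (the slide's actual values are not reproduced in this transcript):

```python
# Constituent (PCFG) probabilities of the two attachment structures: placeholders.
CONSTITUENT = {"NP-attachment": 1.0e-9, "VP-attachment": 2.0e-9}

# Valence probabilities from the valency-model slide.
VALENCE = {
    "discuss": {"NP-attachment": 0.76,   # discuss <NP>; the PP modifies the NP
                "VP-attachment": 0.24},  # discuss <NP PP>
    "keep":    {"NP-attachment": 0.19,   # keep <NP>
                "VP-attachment": 0.81},  # keep <NP XP[+pred]>
}

for verb, valence in VALENCE.items():
    scores = {att: CONSTITUENT[att] * valence[att] for att in CONSTITUENT}
    print(verb, "->", max(scores, key=scores.get))
# discuss -> NP-attachment   (matching the 90% offline preference)
# keep -> VP-attachment      (matching the 95% offline preference)
```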

Modeling offline preferences (3)

Result
Ranking with respect to parse probability matches the offline preferences.
Note that only the monotonicity, not the degree, of preference is matched.