Statistical NLP Winter 2009

Statistical NLP Winter 2009
Lecture 13: word-sense disambiguation & semantic roles Roger Levy Thanks to Jason Eisner and Dan Klein for slides

A Concordance for “party” from www.webcorp.org.uk

A Concordance for “party” from www.webcorp.org.uk
thing. She was talking at a party thrown at Daphne's restaurant in have turned it into the hot dinner-party topic. The comedy is the selection for the World Cup party, which will be announced on May 1 in the 1983 general election for a party which, when it could not bear to to attack the Scottish National Party, who look set to seize Perth and that had been passed to a second party who made a financial decision the by-pass there will be a street party. "Then," he says, "we are going number-crunchers within the Labour party, there now seems little doubt political tradition and the same party. They are both relatively Anglophilic he told Tony Blair's modernised party they must not retreat into "warm "Oh no, I'm just here for the party," they said. "I think it's terrible A future obliges each party to the contract to fulfil it by be signed by or on behalf of each party to the contract." Mr David N

What Good are Word Senses?
thing. She was talking at a party thrown at Daphne's restaurant in have turned it into the hot dinner-party topic. The comedy is the selection for the World Cup party, which will be announced on May 1 in the 1983 general election for a party which, when it could not bear to to attack the Scottish National Party, who look set to seize Perth and that had been passed to a second party who made a financial decision the by-pass there will be a street party. "Then," he says, "we are going number-crunchers within the Labour party, there now seems little doubt political tradition and the same party. They are both relatively Anglophilic he told Tony Blair's modernised party they must not retreat into "warm "Oh no, I'm just here for the party," they said. "I think it's terrible A future obliges each party to the contract to fulfil it by be signed by or on behalf of each party to the contract." Mr David N

thing. She was talking at a party thrown at Daphne's restaurant in have turned it into the hot dinner-party topic. The comedy is the selection for the World Cup party, which will be announced on May 1 the by-pass there will be a street party. "Then," he says, "we are going "Oh no, I'm just here for the party," they said. "I think it's terrible in the 1983 general election for a party which, when it could not bear to to attack the Scottish National Party, who look set to seize Perth and number-crunchers within the Labour party, there now seems little doubt political tradition and the same party. They are both relatively Anglophilic he told Tony Blair's modernised party they must not retreat into "warm that had been passed to a second party who made a financial decision A future obliges each party to the contract to fulfil it by be signed by or on behalf of each party to the contract." Mr David N

Replace word w with sense s Splits w into senses: distinguishes this token of w from tokens with sense t Groups w with other words: groups this token of w with tokens of x that also have sense s

Word senses create groupings
number-crunchers within the Labour party, there now seems little doubt political tradition and the same party. They are both relatively Anglophilic he told Tony Blair's modernised party they must not retreat into "warm thing. She was talking at a party thrown at Daphne's restaurant in have turned it into the hot dinner-party topic. The comedy is the selection for the World Cup party, which will be announced on May 1 the by-pass there will be a street party. "Then," he says, "we are going "Oh no, I'm just here for the party," they said. "I think it's terrible an appearance at the annual awards bash , but feels in no fit state to -known families at a fundraising bash on Thursday night for Learning Who was paying for the bash? The only clue was the name Asprey, Mail, always hosted the annual bash for the Scottish Labour front- popular. Their method is to bash sense into criminals with a short, just cut off people's heads and bash their brains out over the floor,

Word senses create groupings
number-crunchers within the Labour party, there now seems little doubt political tradition and the same party. They are both relatively Anglophilic he told Tony Blair's modernised party they must not retreat into "warm thing. She was talking at a party thrown at Daphne's restaurant in have turned it into the hot dinner-party topic. The comedy is the selection for the World Cup party, which will be announced on May 1 the by-pass there will be a street party. "Then," he says, "we are going "Oh no, I'm just here for the party," they said. "I think it's terrible an appearance at the annual awards bash, but feels in no fit state to -known families at a fundraising bash on Thursday night for Learning Who was paying for the bash? The only clue was the name Asprey, Mail, always hosted the annual bash for the Scottish Labour front- popular. Their method is to bash sense into criminals with a short, just cut off people's heads and bash their brains out over the floor,

Semantics / Text understanding Many systems Axioms about TRANSFER apply to (some tokens of) throw Axioms about BUILDING apply to (some tokens of) bank Machine translation bass has Spanish translation of lubina or bajo depending on sense Info retrieval / Question answering / Text categ. Query or pattern might not match document exactly Backoff for just about anything what word comes next? (speech recognition, language ID, …) trigrams are sparse but tri-meanings might not be lexicalized probabilistic grammars (coming up later in the course) Speaker’s real intention is senses; words are a noisy channel

Cues to Word Sense Adjacent words (or their senses)
Grammatically related words (subject, object, …) Other nearby words Topic of document Sense of other tokens of the word in the same document

More details about word Senses
Words have multiple distinct meanings, or senses: Plant: living plant, manufacturing plant, … Title: name of a work, ownership document, form of address, material at the start of a film, … Many levels of sense distinctions Homonymy: totally unrelated meanings (river bank, money bank) Polysemy: related meanings (star in sky, star on tv) Systematic polysemy: productive meaning extensions (organizations to their buildings) or metaphor Sense distinctions can be extremely subtle (or not) Granularity of senses needed depends a lot on the task

Word Sense Disambiguation
Example: living plant vs. manufacturing plant How do we tell these senses apart? “context” Maybe it’s just text categorization Each word sense represents a topic Run the naive-bayes classifier from last class? Bag-of-words classification works ok for noun senses 90% on classic, shockingly easy examples (line, interest, star) 80% on senseval-1 nouns 70% on senseval-1 verbs The manufacturing plant which had previously sustained the town’s economy shut down after an extended labor strike.

More on WSD A key component of machine translation
English bank could be Russian банк or берег An “AI-hard” problem: Sam’s money and credit cards slipped out of his pocket as he fell off the boat. He struggled mightily and finally managed to get to the bank. Is bank a river bank or a money bank? Fortunately, simple features of local context are usually more discriminating than this!

Verb WSD Why are verbs harder? Verb Example: “Serve”
Verbal senses less topical More sensitive to structure, argument choice Verb Example: “Serve” [function] The tree stump serves as a table [enable] The scandal served to increase his popularity [dish] We serve meals for the homeless [enlist] He served his country [jail] He served six years for embezzlement [tennis] It was Agassi's turn to serve [legal] He was served by the sheriff

Various Approaches to WSD
Unsupervised learning Bootstrapping (Yarowsky 95) Clustering Indirect supervision From thesauri From WordNet From parallel corpora Supervised learning Most systems do some kind of supervised learning Many competing classification technologies perform about the same (it’s all about the knowledge sources you tap) Problem: training data available for only a few words

Resources WordNet SensEval SemCor OtherResources
Hand-build (but large) hierarchy of word senses Basically a hierarchical thesaurus SensEval A WSD competition, of which there have been 3 iterations Training / test sets for a wide range of words, difficulties, and parts-of-speech Bake-off where lots of labs tried lots of competing approaches SemCor A big chunk of the Brown corpus annotated with WordNet senses OtherResources The Open Mind Word Expert Parallel texts Flat thesauri

Knowledge Sources c w1 w2 wn . . .
So what do we need to model to handle “serve”? There are distant topical cues …. point … court ………………… serve ……… game … We could just use a Naïve Bayes classifier (same as texcat!) c w1 w2 wn . . .

Weighted Windows with NB
Distance conditioning Some words are important only when they are nearby …. as …. point … court ………………… serve ……… game … …. ………………………………………… serve as…………….. Distance weighting Nearby words should get a larger vote … court …… serve as……… game …… point boost relative position i

Better Features There are smarter features:
Argument selectional preference: serve NP[meals] vs. serve NP[papers] vs. serve NP[country] Subcategorization: [function] serve PP[as] [enable] serve VP[to] [tennis] serve <intransitive> [food] serve NP {PP[to]} Can capture poorly (but robustly) with local windows … but we can also use a parser and get these features explicitly Other constraints (Yarowsky 95) One-sense-per-discourse (only true for broad topical distinctions) One-sense-per-collocation (pretty reliable when it kicks in: manufacturing plant, flowering plant)

Complex Features with NB?
Example: So we have a decision to make based on a set of cues: context:jail, context:county, context:feeding, … local-context:jail, local-context:meals subcat:NP, direct-object-head:meals Not clear how build a generative derivation for these because the features cannot be independent: Choose topic, then decide on having a transitive usage, then pick “meals” to be the object’s head, then generate other words? How about the words that appear in multiple features? Hard to make this work (though maybe possible) No real reason to try Washington County jail served 11,166 meals last month - a figure that translates to feeding some 120 people three times daily for 31 days.

A Discriminative Approach
View WSD as a discrimination task (regression, really) Have to estimate multinomial (over senses) where there are a huge number of things to condition on History is too complex to think about this as a smoothing / back-off problem Many feature-based classification techniques out there We tend to need ones that output distributions over classes P(sense | context:jail, context:county, context:feeding, … local-context:jail, local-context:meals subcat:NP, direct-object-head:meals, ….)

Feature Representations
Features are indicator functions fi which count the occurrences of certain patterns in the input We map each input to a vector of feature predicate counts Washington County jail served 11,166 meals last month - a figure that translates to feeding some 120 people three times daily for 31 days. context:jail = 1 context:county = 1 context:feeding = 1 context:game = 0 … local-context:jail = 1 local-context:meals = 1 subcat:NP = 1 subcat:PP = 0 object-head:meals = 1 object-head:ball = 0

Linear Classifiers For a pair (c,d), we take a weighted vote for each class: There are many ways to set these weights Perceptron: find a currently misclassified example, and nudge weights in the direction of a correct classification Other discriminative methods usually work in the same way: try out various weights until you maximize some objective We’ll look at an elegant probabilistic framework for weight-setting: maximum entropy Feature Food Jail Tennis context:jail -0.5 * 1 +1.2 * 1 -0.8 * 1 subcat:NP +1.0 * 1 -0.3 * 1 object-head:meals +2.0 * 1 -1.5 * 1 object-head:years = 0 -1.8 * 0 +2.1 * 0 -1.1 * 0 TOTAL +3.5 +0.7 -2.6

Interpreting language
We’ve seen how to recover hierarchical structure from sentences via parsing Why on earth would we want that kind of structure? Insight: that structure is a major cue to sentence meaning We’ll look in the next few classes into how to recover aspects of those meanings

Interpreting language: today
Today we’ll look at two interrelated problems 1:Semantic Roles Different verbs/nouns assign different semantic properties to the phrases that stand in structural relations with them We want to figure out how those semantic properties are assigned, and recover them for input text 2: Discontinuous/nonlocal Dependencies In some cases, a given role is assigned to a semantic unit that is discontinuous in “surface” phrase structure Role assignment & interpretation can be improved by “reuniting” the fragments in these discontinuities

Semantic Role Labeling (SRL)
Characterize clauses as relations with roles: We want to know more than which NP is the subject: Relations like subject are syntactic; relations like agent or message are semantic Typical pipeline: Parse, then label roles Almost all errors locked in by parser Really, SRL is (or should be) quite a lot easier than parsing

SRL: applications You don’t have to look very far to see where SRL can be useful Example: Information Extraction: who bought what? Home Depot sells lumber Home Depot was sold by its parent company Home Depot sold its subsidiary company for $10 million Home Depot sold for $100 billion Home Depot’s lumber sold well last year.

SRL Example Gildea & Jurafsky 2002

Roles: PropBank / FrameNet
FrameNet: roles shared between verbs PropBank: each verb has it’s own roles PropBank more used, because it’s layered over the treebank (and so has greater coverage, plus parses) Note: some linguistic theories postulate even fewer roles than FrameNet (e.g total: agent, patient, instrument, etc.)

PropBank Example

FrameNet example

Shared Arguments

Path Features

Results Features: Gold vs parsed source trees
Path from target to filler Filler’s syntactic type, headword, case Target’s identity Sentence voice, etc. Lots of other second-order features Gold vs parsed source trees SRL is fairly easy on gold trees Harder on automatic parses

Motivation: Computing sentence meaning
How are the meanings of these sentences computed? A man arrived who I knew. Who was believed to know the answer? The story was about the children that the witch wanted to eat. Cries of pain could be heard from below. We discussed plans yesterday to redecorate the house. Add roadmap before getting anywhere! (say “I’m going to introduce the problem, then do xyz, ...”) ambiguity of “witch wanted to eat” sentence --and show how there’s no different CF tree structure for it, but how I would get two structures out of it This slide was up for too long!

Discontinuous string fragments, unified semantic meaning
These sentences have a common property. They all involve a unified semantic proposition broken up into two pieces across the sentence The intervening material is part of an unrelated proposition A man arrived who I knew. Who was believed to know the answer? The story was about the children that the witch wanted to eat. Cries of pain could be heard from below. We discussed plans yesterday to redecorate the house.

a man arrived yesterday
What is the state of the art in robust computation of linguistic meaning? Probabilistic context-free grammars trained on syntactically annotated corpora (Treebanks) yield robust, high-quality syntactic parse trees Nodes of these parse trees are often reliable indicators of phrases corresponding to semantic units (Gildea & Palmer 2002) Agent 0.7 0.15 0.35 0.4 0.3 0.03 0.02 0.07 Total: 0.7*0.35*0.15*0.3*0.03*0.02*0.4*0.07= 1.8510-7 Time use of tree probability was unclear use an ambiguous example for this slide to show that PCFGs do ambiguity resolution TARGET a man arrived yesterday

Dependency trees Alternatively, syntactic parse trees can directly induce dependency trees Can be interpreted pseudo-propositionally;high utility for Question Answering (Pasca and Harabagiu 2001)

Parse trees to dependency trees
A man who I knew arrived. I ate what you cooked.

Parse trees to dependency trees
A man who I knew arrived. A man arrived who I knew.

Limits of context-free parse trees
Agent? Agent? Agent

Recovering correct dependency trees
Context-free trees can’t transparently encode such non-local semantic dependency relations Treebanks typically do encode non-local dependencies in some form But nobody has used them, probably because: it’s not so clear how to do probability models for non-local dependencies Linguists think non-local dependencies are relatively rare, in English (unquantified, though!) Nobody’s really tried to do much deep semantic interpretation of parse trees yet, anyway

Today’s agenda How to quantify the frequency and difficulty of non-local dependencies? Are non-local dependencies really a big problem for English? For other languages? How can we reliably and robustly recover non-local dependencies so that: The assumptions and representation structure are as theory-neutral as possible? The recovery algorithm is independent of future semantic analysis we might want to apply? Mark transition points later in the talk according to this agenda.

Quantifying non-local dependencies
Kruijff (2002) compared English, German, Dutch, and Czech treebanks for frequency of nodes-with-holes. Result: English < {Dutch,German} < Czech This metric gives no similarity standard of comparison of two trees Also, this metric can be sensitive to the details of syntactic tree construction Typed dependency evaluation provides an alternative measure; initial results suggest good correlation Say explicitly that how we’re going to quantify non-local dependency by SCORING gold-standard dependencies against CF tree dependencies “Holes” thing was pretty damn unclear

Other work on nonlocal dependencies
Johnson 2002: pattern matching on CF tree parses Dienes & Dubey 2003: gap threading in a lexicalized PCFG Campbell 2004: recovery based on linguistic principles (non statistical) Schmid 2006: gap threading in an unlexicalized PCFG

Johnson 2002: labeled-bracketing style
* Previously proposed evaluation metric (Johnson 2002) requires correctly identified antecedent and correct string position of relocation This metric underdetermines dependency relations * totally unclear WHNP

Proposed new metric Given a sentence S and a context-free parse tree T for S, compare the dependency tree D induced from T with the correct dependency tree for S Dependency trees can be scored by the edges connecting different words S(arrived,man) NP(man,a) VP(arrived,yesterday)

Proposed new metric Dependencies induced from a CF tree will be counted as wrong when the true dependency is non-local S(arrived,man) NP(man,a) VP(arrived,knew) RC(knew,who) S(knew,I)   S(arrived,man) NP(man,a) NP(man,knew) RC(knew,who) S(knew,I)

Two types of non-local dependencies
Treebanks give us gold-standard dependency trees: Most dependencies can be read off the context-free tree structure of the annotation Non-local dependencies can be obtained from special treebank-specific annotations Two types of non-local dependency: Dislocated dependency A man arrived who I knew who I knew is related to man, NOT to arrived Shared dependency I promised to eat it I is related to BOTH promised AND eat

Nonlocal dependency annotation in treebanks
Penn treebanks annotate: null complementizers (which can mediate relativization) dislocated dependencies sharing dependencies

Nonlocal dependencies in treebanks
The NEGRA treebank (German) permits crossing dependencies during annotation, then algorithmically maps to a context-free representation. Null complementizers and shared dependencies are unannotated.

Nonlocal dependency, quantified
distinguish between “CF deps” this page and “CF deps” next page

Nonlocal dependency, quantified
(parsing done with state-of-the-art Charniak 2000 parser)

Cross-linguistic comparison
(parsing done with vanilla PCFG; sharing dependencies and relativizations excluded from non-local dependency)

Underlying dependency evaluation: conclusions
Non-local dependency errors increase in prominence as parser quality improves In German, non-local dependency a much more serious problem than for English. Context-free assumption less accurate on gold-standard trees for categories involving combinations of phrases (SBAR, S, VP) than for lower-level phrasal categories (NP, ADVP, ADJP)

Recovery of non-local dependencies
Context-free trees can directly recover only local (non-crossing) dependencies. To recover non-local dependencies, we have three options: Treat the parsing task as initially context-free, then correct the parse tree post-hoc (Johnson 2002; present work) Incorporate non-local dependency into the category structure of parse trees (Collins 1999; Dienes & Dubey 2003). Incorporate non-local dependency into the edge structure of parse trees (Plaehn 2000; elsewhere in my thesis)

Linguistically motivated tree reshaping
1. Null complementizers (mediate relativization) A. Identify sites for null node insertion B. Find best daughter position and insert. 2. Dislocated dependencies A. Identify dislocated nodes B. find original/“deep” mother node C. Find best daughter position in mother and insert 3. Shared dependencies A. Identify sites of nonlocal shared dependency B. Identify best daughter position and insert. C. Find controller for each control locus. (Inactive for NEGRA) (Active for NEGRA) null complementizer insertion is a surprise here insert some mini-trees on the right to explain what’s going on (Inactive for NEGRA)

Full example

A context-free parse

1a: null insertion sites

1b: location of null insertions

2a: identify dislocations

2b: identify origin sites

2c: insert dislocations in origins

3a: identify sites of non-local shared dependency

3b: insert non-local shared dependency sites

3c: find controllers of shared dependencies

Application of maximum entropy models
Discriminative classification model with well-founded probabilistic interpretation, close relationship to log-linear models and logistic regression Node-by-node discriminative classification of the form Quadratic regularization and thresholding by feature token count to prevent overfitting In node relocation and controller identification, candidate nodes are ranked by binary (yes/no) classification scores and the highest-ranked node is selected

Larger feature space Words in strings have only one primitive measure of proximity: linear order Tree nodes have several: precedence, sisterhood, domination Results in much richer feature space to explore w-2 w-1 w+1 w+2 A man arrived who I knew parent grandparent right-sister precedes

Feature types Syntactic categories, mothers, grandmothers (infinitive VP) Head words (wanted vs to vs eat) Syntactic path: <SBAR,S,VP,S,VP> Presence of daughters (NP under S) origin?

Evaluation on new metric: gold-standard input trees

Evaluation on new metric: parsed input trees

English-language evaluation
Deep dependency error reduction of 76% over baseline gold-standard context-free dependencies, and 42% over Johnson 2002 (95.5%/98.1%/91.9%) Performance particularly good on individual subtasks -- over 50% error reduction (95.2%) over Johnson 2002 on identification of non-local shared dependency sites Relative error reduction degraded across the board on parsed trees -- performance on parsed trees doesn’t match Johnson 2002 (could be due to overfitting on gold-standard dataset)

Cross-linguistic comparison: dislocated dependencies only
Compare performance of classifier on NEGRA and similar-sized subset of WSJ (~350K words) NEGRA has no annotations of null complementizers or sharing dependencies, so evaluate only on dislocations and exclude relativization WSJ context-free parsing is far more accurate than NEGRA parsing, so to even the field, use unlexicalized, untuned PCFGs for parsing

Algorithm performance: cross-linguistic comparison

Major error types -- German
Major ambiguity between local and non-local dependency for clause-final VP and S nodes hold this in reserve for questions? “The RMV will not begin to be formed until much later.”

Major error types -- German
Scrambling -- clause-initial NPs don’t always belong in S “The researcher Otto Schwabe documented the history of the excavation.”

Statistical NLP Winter 2009

Similar presentations

Presentation on theme: "Statistical NLP Winter 2009"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Statistical NLP Winter 2009

Similar presentations

Presentation on theme: "Statistical NLP Winter 2009"— Presentation transcript:

Similar presentations

About project

Feedback