Prepositional Phrase Attachment

Prepositional Phrase Attachment To what previous verb or noun phrase does a prepositional phrase (PP) attach? The woman saw a man in the park: with a poodle / with a telescope / on Tuesday / on his bicycle. Each of these PPs (and in the park itself) could attach to more than one of the preceding phrases.

A Simplified Version Assume ambiguity only between preceding base NP and preceding base VP: The woman had seen the man with the telescope. Q: Does the PP attach to the NP or the VP? Assumption: Consider only NP/VP head and the preposition

Simple Formulation Determine attachment based on a log-likelihood ratio: LLR(v, n, p) = log P(p | v) - log P(p | n). If LLR > 0, attach to the verb; if LLR < 0, attach to the noun.
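A minimal sketch of this decision rule, assuming the two conditional probabilities have already been estimated elsewhere; the numbers below are invented for illustration only:

```python
from math import log

def llr_attachment(p_prep_given_verb, p_prep_given_noun):
    """Log-likelihood ratio for PP attachment: positive favours the verb,
    negative favours the noun."""
    return log(p_prep_given_verb) - log(p_prep_given_noun)

# Hypothetical estimates, e.g. P(with | saw) and P(with | man)
score = llr_attachment(p_prep_given_verb=0.02, p_prep_given_noun=0.005)
print("attach to", "verb" if score > 0 else "noun", "LLR = %.2f" % score)
```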

Issues Multiple attachment: –Attachment lines cannot cross Proximity: –Preference for attaching to closer structures, all else being equal Chrysler will end its troubled venture with Maserati. P(with | end) = P(with | venture) = 0.107 !!!

Hindle & Rooth (1993) Consider just sentences with a transitive verb and PP, i.e., of the form: ... bVP bNP PP ... Q: Where does the first PP attach (NP or VP)? Indicator variables (0 or 1): VA_p: Is there a PP headed by p after v attached to v? NA_p: Is there a PP headed by p after n attached to n? NB: Both variables can be 1 in a sentence

Attachment Probabilities P(attach(p) = n | v, n) = P(NA_p = 1 | n) –Verb attachment is irrelevant; if it attaches to the noun it cannot attach to the verb P(attach(p) = v | v, n) = P(VA_p = 1, NA_p = 0 | v, n) = P(VA_p = 1 | v) P(NA_p = 0 | n) –Noun attachment is relevant, since the noun ‘shadows’ the verb (by the proximity principle)
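A short sketch of this decomposition, assuming the two indicator probabilities P(VA_p = 1 | v) and P(NA_p = 1 | n) are already estimated; the values for (end, venture, with) are invented:

```python
from math import log

def attachment_probs(p_va, p_na):
    """Hindle & Rooth-style decomposition.
    p_va = P(VA_p = 1 | v), p_na = P(NA_p = 1 | n)."""
    p_noun = p_na                   # P(attach(p) = n | v, n)
    p_verb = p_va * (1.0 - p_na)    # P(attach(p) = v | v, n): noun must not take the PP
    return p_verb, p_noun

# Hypothetical values for (v = end, n = venture, p = with)
p_verb, p_noun = attachment_probs(p_va=0.11, p_na=0.11)
print("P(verb) = %.4f  P(noun) = %.4f  LLR = %.2f"
      % (p_verb, p_noun, log(p_verb) - log(p_noun)))
```

Note how the shadowing factor (1 - P(NA_p = 1 | n)) gives the noun a slight edge even when the two raw probabilities are equal, which is exactly the proximity behaviour the simple model lacked.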

Estimating Parameters MLE: P(VA_p = 1 | v) = C(v, p) / C(v) P(NA_p = 1 | n) = C(n, p) / C(n) Using an unlabeled corpus: –Bootstrap from unambiguous cases: The road from Chicago to New York is long. She went from Albany towards Buffalo.

Unsupervised Training 1. Build an initial model using only unambiguous attachments 2. Apply the initial model and assign attachments where |LLR| is above a threshold 3. Split the remaining ambiguous cases as 0.5 counts for each possibility Use of EM as a more principled method?
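A sketch of the three steps, assuming the corpus has already been reduced to (verb, noun, preposition) triples with gold = 'V' or 'N' for the unambiguous cases and None otherwise; the threshold and smoothing scheme here are placeholder choices, not the original paper's:

```python
from collections import defaultdict
from math import log

def train_bootstrap(examples, threshold=2.0, smoothing=0.5):
    """examples: list of (verb, noun, prep, gold) tuples, where gold is 'V',
    'N', or None for ambiguous cases.  Returns the model's count tables."""
    c_vp, c_v = defaultdict(float), defaultdict(float)
    c_np, c_n = defaultdict(float), defaultdict(float)

    def add(v, n, p, attach, w=1.0):
        # Count the heads that were present; credit the PP to the chosen site.
        c_v[v] += w
        c_n[n] += w
        if attach == 'V':
            c_vp[(v, p)] += w
        else:
            c_np[(n, p)] += w

    def llr(v, n, p):
        p_v = (c_vp[(v, p)] + smoothing) / (c_v[v] + 1.0)
        p_n = (c_np[(n, p)] + smoothing) / (c_n[n] + 1.0)
        return log(p_v) - log(p_n)

    # Step 1: build the initial model from unambiguous attachments only.
    for v, n, p, gold in examples:
        if gold is not None:
            add(v, n, p, gold)

    # Step 2: assign ambiguous cases whose |LLR| clears the threshold.
    leftover = []
    for v, n, p, gold in examples:
        if gold is None:
            score = llr(v, n, p)
            if abs(score) >= threshold:
                add(v, n, p, 'V' if score > 0 else 'N')
            else:
                leftover.append((v, n, p))

    # Step 3: split the remaining cases as 0.5 counts for each possibility.
    for v, n, p in leftover:
        add(v, n, p, 'V', 0.5)
        add(v, n, p, 'N', 0.5)

    return c_vp, c_v, c_np, c_n
```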

Limitations Semantic issues: I examined the man with a stethoscope. I examined the man with a broken leg. Other contextual features: Superlative adjectives (biggest) indicate NP More complex sentences: The board approved its acquisition by BigCo of Milwaukee for $32 a share at its meeting on Tuesday.

Memory-Based Formulation Each example has four components: V N1 P N2, e.g. examine man with stethoscope, Class = V Similarity based on information-gain weighting for matching components Need a ‘semantic’ similarity measure for words: stethoscope ~ thermometer, kidney ~ leg
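A toy sketch of the memory-based classifier with information-gain-weighted overlap; the weights and stored examples are invented, and the word-level similarity measure discussed on the next slides is deliberately left out here:

```python
def overlap_similarity(x, y, weights):
    """Weighted overlap: add the information-gain weight of every field
    (V, N1, P, N2) on which the two examples agree."""
    return sum(w for a, b, w in zip(x, y, weights) if a == b)

def mbl_classify(query, memory, weights):
    """memory: list of ((v, n1, p, n2), label); return the label of the
    most similar stored example."""
    best_example, best_label = max(
        memory, key=lambda ex: overlap_similarity(query, ex[0], weights))
    return best_label

# Hypothetical information-gain weights; P is typically the most informative field.
weights = (0.2, 0.3, 1.0, 0.4)
memory = [(("examine", "man", "with", "stethoscope"), "V"),
          (("examine", "man", "near", "leg"), "N")]
print(mbl_classify(("examine", "man", "with", "thermometer"), memory, weights))  # -> V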

MVDM Word Similarity Idea: Words are similar to the extent that they predict similar class distributions Data sparseness is a serious problem, though! Extend the idea to a task-independent similarity metric...
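A minimal sketch of the MVDM idea: the distance between two words is how much their observed class distributions differ. The distributions below are made up; in practice they would be estimated from labelled PP-attachment data:

```python
def mvdm_distance(word1, word2, class_dist):
    """MVDM: sum of absolute differences between the class distributions
    of the two words.  class_dist maps word -> {class_label: P(class | word)}."""
    classes = set(class_dist[word1]) | set(class_dist[word2])
    return sum(abs(class_dist[word1].get(c, 0.0) - class_dist[word2].get(c, 0.0))
               for c in classes)

# Hypothetical class distributions
class_dist = {
    "stethoscope": {"V": 0.90, "N": 0.10},
    "thermometer": {"V": 0.85, "N": 0.15},
    "leg":         {"V": 0.10, "N": 0.90},
}
print(mvdm_distance("stethoscope", "thermometer", class_dist))  # small: similar
print(mvdm_distance("stethoscope", "leg", class_dist))          # large: dissimilar
```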

Lexical Space Represent the ‘semantics’ of a word by the frequencies of words which coöccur with it, instead of relative frequencies of classes Each word has 4 vectors of frequencies for words 2 before, 1 before, 1 after, and 2 after Examples: IN: for (0.05), since (0.10), at (0.11), after (0.11), under (0.11) GROUP: network (0.08), farm (0.11), measure (0.11), package (0.11), chain (0.11), club (0.11), bill (0.11) JAPAN: china (0.16), france (0.16), britain (0.19), canada (0.19), mexico (0.19), india (0.19), australia (0.19), korea (0.22)
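The slide does not name the exact distance function, so the sketch below uses cosine distance over the concatenated position vectors as one plausible instantiation; the co-occurrence counts are invented:

```python
from math import sqrt

def cosine_distance(u, v):
    """1 - cosine similarity between two co-occurrence count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

# Hypothetical counts: each word's four position vectors (-2, -1, +1, +2),
# concatenated into a single vector for simplicity.
vectors = {
    "japan":  [12, 3, 0, 7, 1, 0, 5, 2],
    "china":  [11, 4, 0, 6, 2, 0, 4, 3],
    "poodle": [0, 1, 9, 0, 0, 8, 1, 0],
}
print(cosine_distance(vectors["japan"], vectors["china"]))   # small: similar words
print(cosine_distance(vectors["japan"], vectors["poodle"]))  # large: dissimilar words
```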

Results Baseline comparisons: –Humans (4-tuple): 88.2% –Humans (full sentence): 93.2% –Noun always: 59.0% –Most likely for prep: 72.2% Without Info Gain: 83.7% With Info Gain: 84.1%

Using Many Features Use many features of an example together Consider interaction between features during learning Each example represented as a feature vector: x = (f_1, f_2, ..., f_n)

Geometric Interpretation (figure contrasting kNN with linear-separator learning)

Linear Separators The linear separator model is a vector of weights: w = (w_1, w_2, ..., w_n) Binary classification: Is wᵀx > 0? –‘Positive’ and ‘Negative’ classes A threshold other than 0 is possible by adding a dummy element of “1” to all vectors; the threshold is just the weight for that element
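A small sketch of the decision rule and the dummy-feature trick; the weight values are arbitrary:

```python
def predict(w, x):
    """Binary linear classifier: classify as positive iff w.x > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) > 0

def add_bias(x):
    """Append a constant dummy feature 1, so a nonzero threshold can be
    absorbed into the weight learned for this extra position."""
    return list(x) + [1.0]

w = [0.5, -1.2, 0.3, -0.4]                     # last weight plays the role of the threshold
print(predict(w, add_bias([1.0, 0.2, 0.0])))   # -> False
```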

Error-Based Learning 1. Initialize w to be all 1’s 2. Cycle x through the examples repeatedly (random order): If wᵀx > 0 but x is really negative, then decrease w’s elements If wᵀx < 0 but x is really positive, then increase w’s elements
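A perceptron-style additive version of this update rule as a sketch; the learning rate and number of epochs are my own placeholder choices:

```python
import random

def train_additive(examples, n_features, epochs=10, lr=0.1, seed=0):
    """Error-driven additive updates: change w only on mistakes, in the
    direction that would have corrected the mistake."""
    rng = random.Random(seed)
    w = [1.0] * n_features            # initialization from the slide: all 1's
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)             # cycle through examples in random order
        for x, y in data:             # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y > 0 and score <= 0:      # false negative: increase the weights
                w = [wi + lr * xi for wi, xi in zip(w, x)]
            elif y < 0 and score > 0:     # false positive: decrease the weights
                w = [wi - lr * xi for wi, xi in zip(w, x)]
    return w
```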

Winnow 1.Initialize w to be all 1’s 2.Cycle v through examples repeatedly (random order): b) If w T x > 0 but x is really negative, then: a) If w T x < 0 but x is really positive, then

Issues No negative weights possible! –Balanced Winnow: Formulate the weights as the difference of 2 weight vectors: w = w⁺ − w⁻ Learn each vector separately: w⁺ regularly, and w⁻ with polarity reversed Multiple classes: –Learn one weight vector for each class (learning X vs. not-X) –Choose the highest-valued result for the example
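A sketch of the Balanced Winnow idea, keeping two nonnegative vectors and classifying with their difference; the α and θ values are again assumptions:

```python
def train_balanced_winnow(examples, n_features, alpha=2.0, theta=1.0, epochs=10):
    """Balanced Winnow sketch: classify with w = w_pos - w_neg, updating the
    two nonnegative vectors with opposite polarity."""
    w_pos = [1.0] * n_features
    w_neg = [1.0] * n_features
    for _ in range(epochs):
        for x, y in examples:                  # x is binary, y is +1 or -1
            score = sum((p - q) * xi for p, q, xi in zip(w_pos, w_neg, x))
            if y > 0 and score <= theta:       # promote w_pos, demote w_neg
                w_pos = [p * alpha if xi else p for p, xi in zip(w_pos, x)]
                w_neg = [q / alpha if xi else q for q, xi in zip(w_neg, x)]
            elif y < 0 and score > theta:      # demote w_pos, promote w_neg
                w_pos = [p / alpha if xi else p for p, xi in zip(w_pos, x)]
                w_neg = [q * alpha if xi else q for q, xi in zip(w_neg, x)]
    return w_pos, w_neg
```

For multiple classes, one such classifier is trained per class (X vs. not-X) and the highest-scoring class is chosen.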

PP Attachment Features Words in each position Subsets of the above, e.g.: Word classes at various levels of generality: stethoscope → medical instrument → instrument → device → instrumentation → artifact → object → physical thing –Derived from WordNet, a hand-made lexicon 15 basic features plus word-class features
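A sketch of how such a feature set might be built. The slide does not list the 15 basic features, so the word combinations below are purely illustrative, and the hypernym chains are hard-coded stand-ins for what would come from WordNet:

```python
# Hypothetical hypernym chains; in the actual system these come from WordNet.
HYPERNYMS = {
    "stethoscope": ["medical instrument", "instrument", "device",
                    "instrumentation", "artifact", "object", "physical thing"],
    "man": ["adult", "person", "organism", "physical thing"],
}

def pp_features(v, n1, p, n2):
    """Binary features: the four words, some (assumed) word combinations,
    and word-class features at every level of generality."""
    feats = {"v=" + v, "n1=" + n1, "p=" + p, "n2=" + n2,
             "v+p=%s+%s" % (v, p), "n1+p=%s+%s" % (n1, p),
             "p+n2=%s+%s" % (p, n2)}
    for word, slot in ((n1, "n1"), (n2, "n2")):
        for cls in HYPERNYMS.get(word, []):
            feats.add("%s_class=%s" % (slot, cls))
    return feats

print(sorted(pp_features("examine", "man", "with", "stethoscope")))
```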

Results Comparison of methods without the preposition of (Base vs. Word) and including of (Winnow vs. MBL vs. Backoff vs. Transform); the numeric accuracies were given in tables on the original slide.