Presentation transcript:

Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
Supported by NSF DBI and a gift from Genentech

Motivation
Huge datasets trump sophisticated algorithms.
"Scaling to Very Very Large Corpora for Natural Language Disambiguation" (Banko & Brill, ACL 2001)
Task: spelling correction
Raw text as "training data"
Log-linear improvement even up to a billion words
⇒ Getting more data is better than fine-tuning algorithms.
How to generalize this to other problems?

Web as a Baseline
(Lapata & Keller, 04; 05) applied simple n-gram models to:
machine translation candidate selection
article generation
noun compound interpretation
noun compound bracketing
adjective ordering
spelling correction
countability detection
prepositional phrase attachment
All unsupervised.
Findings: web n-grams sometimes rival the best supervised approaches: on some of these tasks they were significantly better than the best supervised algorithm, on others not significantly different from it.
⇒ Web n-grams should be used as a baseline.

Our Contribution
The potential of these ideas is not yet fully realized.
We introduce new features:
paraphrases
surface features
Applied to structural ambiguity problems (all unsupervised):
noun compound bracketing: state-of-the-art results (Nakov & Hearst, 2005)
PP attachment: this work
NP coordination: this work
Data sparseness: we need statistics for every possible word and for word combinations.

Task 1: Prepositional Phrase Attachment

PP attachment
(a) Peter spent millions of dollars. (noun: the PP combines with the NP to form another NP)
(b) Peter spent time with his family. (verb: the PP is an indirect object of the verb)
Quadruple: (v, n1, p, n2)
(a) (spent, millions, of, dollars)
(b) (spent, time, with, family)
Human performance:
quadruple only: 88%
whole sentence: 93%

Related Work
Supervised:
(Brill & Resnik, 94): transformation-based learning, WordNet classes, P=82%
(Ratnaparkhi et al., 94): maximum entropy, word classes (mutual information), P=81.6%
(Collins & Brooks, 95): back-off, P=84.5%
(Stetina & Nagao, 97): decision trees, WordNet, P=88.1%
(Toutanova et al., 04): morphology, syntax, WordNet, P=87.5%
(Olteanu & Moldovan, 05): in context, parser, FrameNet, Web, SVM, P=92.85%
Unsupervised:
(Hindle & Rooth, 93): partially parsed corpus, lexical associations over subsets of (v,n1,p), P=80%, R=80%
(Ratnaparkhi, 98): POS-tagged corpus, unambiguous cases for (v,n1,p) and (n1,p,n2), classifier: P=81.9%
(Pantel & Lin, 00): collocation database, dependency parser, large corpus (125M words), P=84.3%
Unsupervised state of the art: (Pantel & Lin, 00), evaluated on the Ratnaparkhi dataset.

Related Work: Web
Unsupervised:
(Volk, 00): AltaVista, NEAR operator, German; compares Pr(p|n1) vs. Pr(p|v); P=75%, R=58%
(Volk, 01): AltaVista, NEAR operator, German, inflected queries; Pr(p,n2|n1) vs. Pr(p,n2|v); P=75%, R=85%
(Calvo & Gelbukh, 03): exact phrases, Spanish; P=91.97%, R=89.5%
(Lapata & Keller, 05): Web n-grams, English, Ratnaparkhi dataset; P in the low 70s
Supervised:
(Olteanu & Moldovan, 05): English, in context, parser, FrameNet, Web counts, SVM; P=92.85%

PP-attachment: Our Approach
Unsupervised: (v,n1,p,n2) quadruples, Ratnaparkhi test set
Google and MSN Search, exact-phrase queries
Inflections generated with WordNet 2.0
Determiners added where appropriate
Models:
n-gram association models
Web-derived surface features
paraphrases

Probabilities: Estimation
Using page hits as a proxy for n-gram counts:
Pr(w1|w2) = #(w1,w2) / #(w2)
#(w2): word frequency; query for "w2"
#(w1,w2): bigram frequency; query for "w1 w2"
Pr(w1,w2|w3) = #(w1,w2,w3) / #(w3)
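To make the estimation concrete, here is a minimal Python sketch. page_hits() is a hypothetical stand-in for a search-engine API (the paper used Google and MSN Search; neither interface is shown here), and since the slide's generic formula leaves word order open, this sketch puts the conditioning word first in the phrase, which is the order the attachment models on the next slide need.

def page_hits(phrase):
    """Number of pages matching the exact phrase (hypothetical stub)."""
    raise NotImplementedError("plug in a search-engine API here")

def cond_pr(w, given):
    """Pr(w|given) ~ #(given, w) / #(given), both from exact-phrase queries."""
    joint = page_hits(f'"{given} {w}"')
    marginal = page_hits(f'"{given}"')
    return joint / marginal if marginal else 0.0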

N-gram models
(i) Pr(p|n1) vs. Pr(p|v)
(ii) Pr(p,n2|n1) vs. Pr(p,n2|v)
I eat/v spaghetti/n1 with/p a fork/n2.
I eat/v spaghetti/n1 with/p sauce/n2.
Variants: Pr or # (frequency); smoothing as in (Hindle & Rooth, 93); back-off from (ii) to (i).
N-grams are unreliable if n1 or n2 is a pronoun.
MSN Search: no rounding of n-gram estimates.
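A sketch of how models (i) and (ii) can be combined with back-off, reusing cond_pr() from the previous sketch; the tie-based back-off trigger is an illustrative simplification, not necessarily the paper's exact condition.

def pp_attach(v, n1, p, n2):
    """Decide noun vs. verb attachment for (v, n1, p, n2)."""
    # Model (ii): Pr(p,n2|n1) vs. Pr(p,n2|v).
    noun = cond_pr(f"{p} {n2}", n1)
    verb = cond_pr(f"{p} {n2}", v)
    if noun == verb:  # back off to model (i): Pr(p|n1) vs. Pr(p|v)
        noun, verb = cond_pr(p, n1), cond_pr(p, v)
    return "noun" if noun >= verb else "verb"

# e.g. pp_attach("eat", "spaghetti", "with", "sauce") should come out "noun".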

Web-derived Surface Features
Example features (numbers are precision, recall):
open the door / with a key → verb (100.00%, 0.13%)
open the door (with a key) → verb (73.58%, 2.44%)
open the door - with a key → verb (68.18%, 2.03%)
open the door, with a key → verb (58.44%, 7.09%)
eat spaghetti with sauce → noun (100.00%, 0.14%)
eat ? spaghetti with sauce → noun (83.33%, 0.55%)
eat, spaghetti with sauce → noun (65.77%, 5.11%)
eat : spaghetti with sauce → noun (64.71%, 1.57%)
The feature counts are summed per class and the sums compared. Summing achieves high precision, low recall.
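Search engines largely ignore punctuation inside quoted queries, so in practice features like these have to be counted in the returned text snippets rather than via raw hit counts. A rough sketch, with snippets() as another hypothetical API stub and only an illustrative subset of the markers:

def snippets(query):
    """Text summaries of the top search results (hypothetical stub)."""
    raise NotImplementedError("plug in a search-engine API here")

def surface_vote(v, n1, p, n2):
    """Count attachment-indicating punctuation variants in snippets."""
    text = " ".join(snippets(f"{v} {n1} {p} {n2}"))
    # Punctuation separating n1 from the PP suggests verb attachment,
    verb_markers = [f"{n1}, {p} {n2}", f"{n1} - {p} {n2}", f"{n1} ({p} {n2}"]
    # while punctuation separating v from "n1 p n2" suggests noun attachment.
    noun_markers = [f"{v}, {n1} {p} {n2}", f"{v}: {n1} {p} {n2}"]
    verb_score = sum(text.count(m) for m in verb_markers)
    noun_score = sum(text.count(m) for m in noun_markers)
    return "noun" if noun_score >= verb_score else "verb"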

Paraphrases
v n1 p n2 →
(1) v n2 n1 (noun)
(2) v p n2 n1 (verb)
(3) p n2 * v n1 (verb)
(4) n1 p n2 v (noun)
(5) v PRONOUN p n2 (verb)
(6) BE n1 p n2 (noun)
(A query-generation sketch follows pattern (6) below.)

Paraphrases: pattern (1)
(1) v n1 p n2 → v n2 n1 (noun)
Can we turn "n1 p n2" into a noun compound "n2 n1"?
meet/v demands/n1 from/p customers/n2 → meet/v the customer/n2 demands/n1
Problem: ditransitive verbs like give:
gave/v an apple/n1 to/p him/n2 → gave/v him/n2 an apple/n1
Solution:
no determiner before n1
a determiner before n2 is required
the preposition cannot be "to"

Paraphrases: pattern (2)
(2) v n1 p n2 → v p n2 n1 (verb)
If "p n2" is an indirect object of v, then it can be switched with the direct object n1.
had/v a program/n1 in/p place/n2 → had/v in/p place/n2 a program/n1
A determiner before n1 is required to prevent "n2 n1" from forming a noun compound.

Paraphrases: pattern (3)
(3) v n1 p n2 → p n2 * v n1 (verb)
"*" is a wildcard position (up to three intervening words are allowed).
Looks for cases where the PP has moved in front of the verb, e.g.
I gave/v an apple/n1 to/p him/n2 → to/p him/n2 I gave/v an apple/n1

Paraphrases: pattern (4)
(4) v n1 p n2 → n1 p n2 v (noun)
Looks for cases where "n1 p n2" has moved in front of v:
shaken/v confidence/n1 in/p markets/n2 → confidence/n1 in/p markets/n2 shaken/v

Paraphrases: pattern (5)
(5) v n1 p n2 → v PRONOUN p n2 (verb)
"n1 is a pronoun ⇒ verb attachment" (Hindle & Rooth, 93).
Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.
put/v a client/n1 at/p odds/n2 → put/v him at/p odds/n2

Paraphrases: pattern (6)
(6) v n1 p n2 → BE n1 p n2 (noun)
BE is typically used with a noun attachment.
Pattern (6) substitutes v with a form of "to be" (is or are), e.g.
eat/v spaghetti/n1 with/p sauce/n2 → is spaghetti/n1 with/p sauce/n2
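A brief sketch of how a few of these patterns turn into exact-phrase queries, reusing the hypothetical page_hits(); determiner insertion, WordNet inflection, and the ditransitive checks from pattern (1) are omitted, and patterns (2)-(4) would be added the same way:

def paraphrase_votes(v, n1, p, n2):
    """Page-hit votes from paraphrase patterns (1), (5), and (6)."""
    noun = (page_hits(f'"{v} {n2} {n1}"')        # (1) v n2 n1
            + page_hits(f'"is {n1} {p} {n2}"')    # (6) BE n1 p n2
            + page_hits(f'"are {n1} {p} {n2}"'))
    verb = (page_hits(f'"{v} him {p} {n2}"')      # (5) v PRONOUN p n2
            + page_hits(f'"{v} her {p} {n2}"'))
    return {"noun": noun, "verb": verb}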

Evaluation
Ratnaparkhi dataset: 3097 test examples, e.g.
prepare dinner for family → V
shipped crabs from province → V
n1 or n2 is a bare determiner: 149 examples (a problem for unsupervised methods), e.g.
left chairmanship of the → N
is the of kind → N
acquire securities for an → N
Special symbols (%, /, &, etc.): 230 examples (a problem for Web queries), e.g.
buy % for 10 → V
beat S&P-down from % → V
is 43%-owned by firm → N

Results
[The results table from this slide did not survive in the transcript.]
Models in bold (in the table) are combined in a majority vote.
Our approach is simpler than, but not significantly different from, the 84.3% of (Pantel & Lin, 00).
Some figures are for prepositions other than "of" only (of → noun attachment).
Smoothing is not needed on the Web.

Task 2: Coordination

Coordination & Problems
(Modified) real sentence: The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.
Problems:
boundaries: words, constituents, clauses, etc.
interactions with PPs: [health and [quality of life]] vs. [[health and quality] of life]
"or" meaning "and": chronic diseases or disabilities
ellipsis

NC coordination: ellipsis
Ellipsis: "car and truck production" means "car production and truck production".
No ellipsis: "president and chief executive".
All-way coordination: "Securities and Exchange Commission".

NC Coordination: ellipsis
Quadruple: (n1, c, n2, h)
Penn Treebank annotations:
ellipsis: (NP car/NN and/CC truck/NN production/NN)
no ellipsis: (NP (NP president/NN) and/CC (NP chief/NN executive/NN))
all-way: can be annotated either way
This is a problem a parser must deal with: Collins' parser always predicts ellipsis, but other parsers (e.g. Charniak's) try to solve it.

Related Work
(Resnik, 99): similarity of form and meaning, conceptual association, decision tree; P=80%, R=100%
(Rus et al., 02): deterministic, rule-based bracketing in context; P=87.42%, R=71.05%
(Chantree et al., 05): distributional similarities from the BNC, Sketch Engine (frequencies, object/modifier relations, etc.); P=80.3%, R=53.8%
(Goldberg, 99): different problem, (n1,p,n2,c,n3); adapts the Ratnaparkhi (99) algorithm; P=72%, R=100%

N-gram models
Quadruple: (n1, c, n2, h)
(i) #(n1,h) vs. #(n2,h)
(ii) #(n1,h) vs. #(n1,c,n2)
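A sketch of model (i) with the hypothetical page_hits() from the PP-attachment section; note the decision direction (a stronger n1-h bigram taken as evidence for ellipsis) is an assumption of this sketch, not a detail preserved in the slide.

def coord_ellipsis(n1, c, n2, h):
    """Model (i): compare #(n1, h) vs. #(n2, h)."""
    # A frequent "n1 h" bigram (e.g. "car production") suggests that n1
    # also modifies the head h, i.e. ellipsis.
    if page_hits(f'"{n1} {h}"') > page_hits(f'"{n2} {h}"'):
        return "ellipsis"
    return "no ellipsis"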

Surface Features
[The feature table from this slide did not survive in the transcript; as for PP attachment, the individual surface-feature counts are summed per class and the sums compared.]

Paraphrases
n1 c n2 h →
(1) n2 c n1 h (ellipsis)
(2) n2 h c n1 (NO ellipsis)
(3) n1 h c n2 h (ellipsis)
(4) n2 h c n1 h (ellipsis)
(A query sketch follows pattern (4) below.)

Paraphrases: Pattern (1)
(1) n1 c n2 h → n2 c n1 h (ellipsis)
Switch the places of n1 and n2:
bar/n1 and/c pie/n2 graph/h → pie/n2 and/c bar/n1 graph/h

Paraphrases: Pattern (2)
(2) n1 c n2 h → n2 h c n1 (NO ellipsis)
Switch the places of n1 and "n2 h":
president/n1 and/c chief/n2 executive/h → chief/n2 executive/h and/c president/n1

Paraphrases: Pattern (3)
(3) n1 c n2 h → n1 h c n2 h (ellipsis)
Insert the elided head h:
bar/n1 and/c pie/n2 graph/h → bar/n1 graph/h and/c pie/n2 graph/h

Paraphrases: Pattern (4)
(4) n1 c n2 h → n2 h c n1 h (ellipsis)
Insert the elided head h and also switch n1 and n2:
bar/n1 and/c pie/n2 graph/h → pie/n2 graph/h and/c bar/n1 graph/h
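These four patterns can be instantiated as exact-phrase queries in the same way as the PP-attachment paraphrases; a compact sketch with the hypothetical page_hits():

def coord_paraphrase_votes(n1, c, n2, h):
    """Page-hit votes from coordination paraphrase patterns (1)-(4)."""
    ellipsis = (page_hits(f'"{n2} {c} {n1} {h}"')        # (1) n2 c n1 h
                + page_hits(f'"{n1} {h} {c} {n2} {h}"')   # (3) n1 h c n2 h
                + page_hits(f'"{n2} {h} {c} {n1} {h}"'))  # (4) n2 h c n1 h
    no_ellipsis = page_hits(f'"{n2} {h} {c} {n1}"')       # (2) n2 h c n1
    return {"ellipsis": ellipsis, "no ellipsis": no_ellipsis}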

(Rus et al., 02) Heuristics
Heuristic 1: n1 = n2 ⇒ no ellipsis: milk/n1 and/c milk/n2 products/h
Heuristic 4: both n1 and n2 are modified by an adjective ⇒ no ellipsis
Heuristic 5: only n1 is modified by an adjective ⇒ ellipsis
Heuristic 6: only n2 is modified by an adjective ⇒ no ellipsis
We use a determiner in our variants of these heuristics.

Number Agreement
Introduced by Resnik (93):
(a) n1 and n2 agree in number, but n1 and h do not ⇒ ellipsis;
(b) n1 and n2 do not agree, but n1 and h do ⇒ no ellipsis;
(c) otherwise leave undecided.
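A minimal sketch of this heuristic; is_plural() is a naive stand-in (a real system would use a morphological analyzer or WordNet).

def is_plural(noun):
    """Naive plural test; morphology or WordNet would be used in practice."""
    return noun.endswith("s") and not noun.endswith("ss")

def number_agreement(n1, n2, h):
    """Resnik's (93) number-agreement heuristic for NC coordination."""
    if is_plural(n1) == is_plural(n2) and is_plural(n1) != is_plural(h):
        return "ellipsis"       # (a) n1, n2 agree; n1, h do not
    if is_plural(n1) != is_plural(n2) and is_plural(n1) == is_plural(h):
        return "no ellipsis"    # (b) n1, n2 disagree; n1, h agree
    return "undecided"          # (c)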

Results
428 examples from the Penn Treebank.
[The results table from this slide did not survive in the transcript.]
Models in bold (in the table) are combined in a majority vote.
Results are comparable to other researchers' (but there is no standard dataset).
Model (ii) performs badly: it compares a bigram to a trigram.

Conclusions & Future Work
Tapping the potential of very large corpora for unsupervised algorithms.
Going beyond n-grams:
surface features
paraphrases
Results are competitive with the best unsupervised approaches and can rival those of supervised algorithms.
Future work:
other NLP tasks
better evidence combination
There should be even more exciting features on the Web!

The End Thank you!