Using the WWW to resolve PP attachment ambiguities in Dutch Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U.Leuven, Belgium

Introduction
Finding the correct attachment site for PPs is one of the problems when parsing natural languages.
Volk (2000; 2001) presented an approach for German that uses cooccurrence frequencies obtained from the WWW.

Introduction (2)
We present a replication of Volk's approach, applied to Dutch.
We present a number of changes made to the initial formula and their effect on the results.

Cooccurrence values
On the one hand, the cooccurrence strength between nouns and prepositions is measured.
On the other hand, the cooccurrence strength between verbs and prepositions is measured.
The competing values of N+P vs. V+P are used to decide whether to attach the PP to the noun or to the verb.
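As a rough illustration, the competition between the two cooccurrence values can be sketched as follows. This is a reconstruction under assumptions, not the authors' code: hits() is a hypothetical stand-in for the number of documents a search engine reports for a query, and normalising the pair count by the frequency of the head alone is assumed by analogy with the triple formula given in Experiment 4.

```python
# Illustrative sketch only; hits() is a hypothetical stand-in for the number
# of WWW documents a search engine reports for a query.

def cooc(head, prep, hits):
    """Cooccurrence strength of a head word (noun or verb) with a preposition:
    pair frequency normalised by the frequency of the head alone (assumed)."""
    pair = hits(f"{head} NEAR {prep}")
    alone = hits(head)
    return pair / alone if alone else None

def attach(noun, verb, prep, hits):
    """Attach the PP to whichever head cooccurs more strongly with the preposition."""
    n_score = cooc(noun, prep, hits)
    v_score = cooc(verb, prep, hits)
    if n_score is None or v_score is None:
        return None                      # handled by thresholds (Experiment 1)
    return "noun" if n_score >= v_score else "verb"
```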

Experiment 1
Method
– AltaVista search engine
– noun NEAR preposition vs. verb NEAR preposition
– restricted to Dutch documents
– lemmata are used for lookup
– minimal cooccurrence threshold

Experiment 1
Evaluation
– 500 PPs immediately following a noun, or a pronoun functioning as a noun, were selected.
– For each PP it was manually decided whether it attaches to the verb or to the noun.

Experiment 1
Algorithm (sketched below)
– if cooc(N+P) and cooc(V+P) are both available, the higher value decides
– if one of them is not available (2% of the test cases), the other value is compared to a threshold
– if both are unavailable, no decision can be made
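A minimal sketch of this decision procedure; the concrete threshold value is an assumption for illustration, not the one used in the experiments.

```python
# Minimal sketch of the Experiment 1 decision; the threshold value is an
# assumption for illustration.

def decide_exp1(cooc_np, cooc_vp, threshold=0.01):
    """cooc_np / cooc_vp are the N+P and V+P cooccurrence values,
    or None when no counts could be obtained."""
    if cooc_np is not None and cooc_vp is not None:
        return "noun" if cooc_np >= cooc_vp else "verb"   # higher value decides
    if cooc_np is not None:
        return "noun" if cooc_np >= threshold else "verb"
    if cooc_vp is not None:
        return "verb" if cooc_vp >= threshold else "noun"
    return None   # both unavailable: no decision
```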

Experiment 1
Results
– at 100% coverage: 58.4% correct attachment
– maximum accuracy of 59%, at a coverage of 98%
Conclusion
– better than pure guessing (50%)
– much lower than Volk's results for German
– defaulting to noun attachment gives 68%

Experiment 2
Method
– full forms instead of lemmata
Results
– we want to compare with Volk at a rate of 75% correct attachments (see the sketch below)
– setting the threshold so that we reach 75% correct attachment gives a coverage of 21.6%
Conclusion
– results are much better than with lemmata, but still low
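The 75% figure is reached by tuning the threshold and measuring the resulting accuracy/coverage trade-off. A generic sketch of that measurement; the data layout and names are assumptions.

```python
# Generic accuracy/coverage measurement; the data layout is an assumption.
# cases: list of (features, gold_label); decide_fn returns 'noun', 'verb',
# or None when it abstains.

def accuracy_coverage(cases, decide_fn):
    decided = correct = 0
    for features, gold in cases:
        label = decide_fn(features)
        if label is None:
            continue                     # undecided case: hurts coverage only
        decided += 1
        correct += int(label == gold)
    accuracy = correct / decided if decided else 0.0
    coverage = decided / len(cases)
    return accuracy, coverage
```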

Experiment 3
Method
– full forms
– minimal distance threshold (sketched below)
Results
– 75% correct attachment at a coverage of 27%
Conclusion
– still much lower than Volk (58%), but improving
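A sketch of the minimal-distance criterion, under the assumption that the distance is the absolute difference between the two competing cooccurrence values; both this interpretation and the threshold are assumptions.

```python
# Sketch of the minimal-distance criterion; treating the distance as the
# absolute difference between the competing values is an assumption.

def decide_min_distance(np_score, vp_score, min_dist):
    if np_score is None or vp_score is None:
        return None
    if abs(np_score - vp_score) < min_dist:
        return None                      # too close to call: abstain
    return "noun" if np_score >= vp_score else "verb"
```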

Experiment 4
Method
– we include the head noun of the PP in the queries (see the sketch below)
– cooc(X, P, N2) = freq(X, P, N2) / freq(X)
– no thresholds
– defaulting to noun attachment if the cooccurrence values do not exist
Results
– overall accuracy of 68% at 100% coverage
Conclusions
– results are only as accurate as defaulting to noun attachment
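A sketch of the triple cooccurrence value; hits() is again a hypothetical stand-in for search-engine document counts, and the exact query form is an assumption.

```python
# Sketch of cooc(X, P, N2) = freq(X, P, N2) / freq(X); the query form with
# two NEAR operators is an assumption for illustration.

def cooc_triple(head, prep, noun2, hits):
    """head: candidate attachment site (N1 or V); noun2: head noun of the PP."""
    triple = hits(f"{head} NEAR {prep} NEAR {noun2}")
    alone = hits(head)
    return triple / alone if alone else None
```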

Experiment 5
Method
– minimal cooccurrence threshold when the triple cooccurrence is unavailable for one of the heads
– when both are unavailable: no decision
Results
– setting the threshold so as to reach an accuracy of 75% is impossible

Experiment 6
Method
– full forms + lemmata
Results
– maximum accuracy is 68.77%
Conclusions
– under the conditions just described, Volk reports a coverage of 63% at an accuracy of 75%
– we get only 27% coverage at the same accuracy

Experiment 7
Method
– combining doubles and triples into one algorithm (see the sketch below)
– minimal distance and two different thresholds
– when the minimal distance of the triples is below its threshold, the minimal distance of the doubles is used
Results
– coverage of 48.8% at an accuracy of 75%
– coverage of 50% at an accuracy of 74.4%
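A sketch of the combined back-off, reusing the minimal-distance decision sketched under Experiment 3; the order of the back-off follows the slide, while the two threshold values and the distance definition remain assumptions.

```python
# Sketch of the combined algorithm: decide on triples when their distance
# clears its threshold, otherwise fall back to the doubles.
# Reuses decide_min_distance from the Experiment 3 sketch.

def decide_combined(np_triple, vp_triple, np_double, vp_double,
                    t_triple, t_double):
    label = decide_min_distance(np_triple, vp_triple, t_triple)
    if label is not None:
        return label
    return decide_min_distance(np_double, vp_double, t_double)
```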

Experiment 8
Method
– accuracy with preprocessed triples
– test cases where N1 is not a real noun are removed from the test set (492 cases remaining)
– unlexicalized compounds are reduced to their heads (see the sketch below), e.g. krijtstreepjeskostuum => kostuum
Results
– coverage of 60.4% at an accuracy of 75%
– coverage of 50% at an accuracy of 76.8%
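One possible way to reduce an unlexicalized compound to its head is a longest-known-suffix heuristic (Dutch compounds are right-headed); the lexicon lookup and the minimum length are assumptions, not the authors' procedure.

```python
# Rough heuristic sketch: reduce an out-of-lexicon Dutch compound to its
# longest known suffix; the lexicon and minimum length are assumptions.

def compound_head(word, lexicon, min_len=3):
    if word in lexicon:
        return word
    for i in range(1, len(word) - min_len + 1):
        suffix = word[i:]                # suffixes from longest to shortest
        if suffix in lexicon:
            return suffix                # e.g. krijtstreepjeskostuum -> kostuum
    return word                          # no known head found: keep as is

# compound_head("krijtstreepjeskostuum", {"kostuum"}) returns "kostuum"
```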

Experiment 8
Results
– combining the two minimal-distance algorithms (for doubles and triples) gives a large rise in coverage for the same accuracy
– preprocessing the nouns and leaving out the pronouns gives a second large rise in coverage for the same accuracy
– after defaulting the remaining cases to noun attachment, we end up with an accuracy of 70.33%

General conclusions
Using the WWW helps to get a more accurate estimate of PP attachment.
Differences between our results and the German results:
– the number of decidable cases is higher for German, since the number of WWW documents is higher for German
– querying cooccurrence frequencies with WWW search engines through the NEAR operator only allows very rough queries

Future improvements
Using cooccurrence frequencies from a controlled corpus might improve results:
– more exact queries are possible than with AltaVista
– less noise in the corpus

References
Volk, M. (2000). Scaling up. Using the WWW to resolve PP attachment ambiguities. In Proceedings of KONVENS, Ilmenau.
Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In Proceedings of Corpus Linguistics, Lancaster.