Using the WWW to resolve PP attachment ambiguities in Dutch
Vincent Vandeghinste
Centrum voor Computerlinguïstiek, K.U.Leuven, Belgium

Introduction
– Finding the correct attachment site for PPs is one of the problems in parsing natural language.
– Volk (2000; 2001) presented an approach for German that uses cooccurrence frequencies on the WWW.

Introduction (2)
– We present a replication of Volk's approach, applied to Dutch.
– We present a number of changes made to the initial formula and their effect on the results.

Cooccurrence values
– On the one hand, the cooccurrence strength between nouns and prepositions is measured.
– On the other hand, the cooccurrence strength between verbs and prepositions is measured.
– The competing values of N+P vs. V+P are used to decide whether to attach the PP to the noun or to the verb.

Experiment 1: Method
– AltaVista search engine
– noun NEAR preposition vs. verb NEAR preposition
– restricted to Dutch documents
– lemmata are used for lookup
– minimal cooccurrence threshold

Experiment 1: Evaluation
– 500 PPs were selected, each immediately following a noun or a pronoun that functions as a noun.
– Each PP was manually labelled as attached either to the verb or to the noun.

Experiment 1: Algorithm
– if cooc(N+P) and cooc(V+P) are both available, the higher value decides
– if one is not available (2% of test cases), the other value is compared to a threshold
– if both are unavailable, no decision can be made
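The three cases above can be sketched as a small function. This is only an illustration: the function name, the use of None for an unavailable value, the tie-break in favour of the noun, and the choice to leave a case undecided when the single available value falls below the threshold are assumptions, not details given on the slides.

```python
from typing import Optional


def decide_attachment(cooc_np: Optional[float],
                      cooc_vp: Optional[float],
                      threshold: float) -> Optional[str]:
    """Return "N" (noun attachment), "V" (verb attachment) or None (no decision)."""
    if cooc_np is not None and cooc_vp is not None:
        # Both values available: the higher one decides (tie goes to N, an assumption).
        return "N" if cooc_np >= cooc_vp else "V"
    if cooc_np is not None:
        # Only the noun value is available: compare it to the threshold.
        return "N" if cooc_np >= threshold else None
    if cooc_vp is not None:
        # Only the verb value is available: compare it to the threshold.
        return "V" if cooc_vp >= threshold else None
    # Neither value is available: no decision can be made.
    return None
```

With a threshold of 0, every case that has at least one value gets a decision; raising the threshold trades coverage for accuracy.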

Experiment 1: Results
– at 100% coverage: 58.4% correct attachments
– maximum accuracy: 59%, at 98% coverage
Conclusion
– better than pure guessing (50%)
– much lower than Volk's results for German
– always defaulting to noun attachment gives 68%

Experiment 2: Method
– full forms, not lemmata
Results
– results are compared at a fixed rate of 75% correct attachments
– with the threshold set to reach 75% correct attachments, coverage is 21.6%
Conclusion
– results are much better than with lemmata, but still low
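The slides report coverage and accuracy without defining them. A minimal sketch, assuming coverage is the share of test cases for which a decision is made and accuracy is the share of correct decisions among the decided cases (the function name and the None-for-undecided convention are hypothetical):

```python
from typing import List, Optional, Tuple


def coverage_and_accuracy(decisions: List[Optional[str]],
                          gold: List[str]) -> Tuple[float, float]:
    """Compute (coverage, accuracy) for a list of decisions against gold labels."""
    # Keep only the cases where a decision was made.
    decided = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    coverage = len(decided) / len(gold) if gold else 0.0
    # Accuracy is measured on the decided cases only.
    correct = sum(1 for d, g in decided if d == g)
    accuracy = correct / len(decided) if decided else 0.0
    return coverage, accuracy
```

Comparing systems "at 75% correct attachments" then amounts to raising the threshold until accuracy reaches 0.75 and reading off the coverage at that point.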

Experiment 3: Method
– full forms
– minimal distance threshold
Results
– at 75% correct attachments, coverage is 27%
Conclusion
– still a lot lower than Volk (58%), but improving
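The slides do not spell out what the minimal distance is. One plausible reading, shown here purely as an assumption, is that a decision is made only when the two cooccurrence values differ by at least a given amount:

```python
from typing import Optional


def decide_with_min_distance(cooc_np: Optional[float],
                             cooc_vp: Optional[float],
                             min_distance: float) -> Optional[str]:
    """Decide only when the two values are far enough apart (assumed reading)."""
    if cooc_np is None or cooc_vp is None:
        # This sketch leaves cases with a missing value undecided.
        return None
    if abs(cooc_np - cooc_vp) < min_distance:
        # The values are too close to support a reliable decision.
        return None
    return "N" if cooc_np >= cooc_vp else "V"
```

For example, with min_distance set to 0.2, the pair (0.5, 0.1) yields "N", while (0.12, 0.10) stays undecided.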

Experiment 4: Method
– the head noun of the PP (N2) is included in the queries
– cooc(X,P,N2) = freq(X,P,N2) / freq(X)
– without thresholds
– defaulting to N-attachment if the cooc values don't exist
Results
– general accuracy = 68% at coverage = 100%
Conclusion
– results are as accurate as defaulting to N-attachment
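A sketch of the triple formula and the default rule, with X standing for either the noun or the verb. Treating the value as nonexistent when freq(X) is zero, and breaking ties in favour of the noun, are assumptions; the slides do not state either.

```python
from typing import Optional


def cooc_triple(freq_x_p_n2: int, freq_x: int) -> Optional[float]:
    """cooc(X,P,N2) = freq(X,P,N2) / freq(X); None when freq(X) is zero (assumed)."""
    if freq_x == 0:
        return None
    return freq_x_p_n2 / freq_x


def decide_triple(cooc_n: Optional[float], cooc_v: Optional[float]) -> str:
    """Higher value decides; default to N-attachment when a value is missing."""
    if cooc_n is None or cooc_v is None:
        return "N"
    return "N" if cooc_n >= cooc_v else "V"
```

Because every case receives a decision, coverage is 100% by construction.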

Experiment 5: Method
– a minimal cooc threshold is applied when the triple cooc is unavailable for one of the two candidates
– when both are unavailable: no decision
Results
– no threshold setting reaches an accuracy of 75%

Experiment 6: Method
– full forms + lemmata
Results
– maximum accuracy is 68.77%
Conclusions
– under the conditions just described, Volk gets good results: a coverage of 63% at an accuracy of 75%
– we get only 27% coverage at the same accuracy

Experiment 7: Method
– doubles and triples are combined into one algorithm
– minimal distance, with 2 different thresholds
– when the minimal distance for the triples is below its threshold, the minimal distance of the doubles is used
Results
– coverage of 48.8% at an accuracy of 75%
– coverage of 50% at an accuracy of 74.4%
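The back-off from triples to doubles might look as follows. The sketch assumes that "minimal distance" means the absolute difference between the noun and verb values, and that a case stays undecided when neither level is decisive; all names and parameters are hypothetical.

```python
from typing import Optional


def decide_combined(triple_n: Optional[float], triple_v: Optional[float],
                    double_n: Optional[float], double_v: Optional[float],
                    triple_min_dist: float,
                    double_min_dist: float) -> Optional[str]:
    """Try the triple values first, then fall back to the double values."""
    levels = (
        (triple_n, triple_v, triple_min_dist),  # cooc(X,P,N2) values
        (double_n, double_v, double_min_dist),  # cooc(X+P) values
    )
    for cooc_n, cooc_v, min_dist in levels:
        if cooc_n is None or cooc_v is None:
            continue  # values missing at this level, try the next one
        if abs(cooc_n - cooc_v) >= min_dist:
            return "N" if cooc_n >= cooc_v else "V"
    # Neither level is decisive: leave the case undecided.
    return None
```

Keeping a separate threshold per level lets the sparser triple counts be tuned independently of the double counts.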

Experiment 8: Method
– accuracy with preprocessed triples:
– test cases where N1 is not a real noun are removed from the test set (492 cases remaining)
– unlexicalized compounds are reduced to the heads of the compounds, e.g. krijtstreepjeskostuum (pinstripe suit) => kostuum (suit)
Results
– coverage of 60.4% at an accuracy of 75%
– coverage of 50% at an accuracy of 76.8%
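The slides do not say how compounds were reduced to their heads. Since the head of a Dutch compound is its rightmost element, one simple heuristic, given here only as an assumption, is to look for the longest proper suffix that appears in a lexicon of known nouns (the lexicon and the function name are hypothetical):

```python
from typing import Set


def reduce_to_head(noun: str, lexicon: Set[str], min_len: int = 3) -> str:
    """Return the longest known suffix of an unlexicalized compound."""
    word = noun.lower()
    if word in lexicon:
        # Lexicalized word: keep it as it is.
        return word
    # Try suffixes from longest to shortest, down to min_len characters.
    for start in range(1, len(word) - min_len + 1):
        suffix = word[start:]
        if suffix in lexicon:
            return suffix
    # No known head found: fall back to the original word.
    return word


# Toy lexicon for illustration only.
lexicon = {"kostuum", "krijt", "streep"}
head = reduce_to_head("krijtstreepjeskostuum", lexicon)
```

Here head is "kostuum", matching the example on the slide.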

Experiment 8: Results
– combining the two minimal distance algorithms (for doubles and triples) gives a big rise in coverage for the same accuracy
– preprocessing of nouns and leaving out pronouns gives a second big rise in coverage for the same accuracy
– after defaulting the remaining cases to N-attachment we end up with an accuracy of 70.33%

General Conclusions
– using the WWW helps to get a more accurate estimate of PP attachment
– difference between our results and the German results: the number of decidable cases is higher for German, since the number of WWW documents is higher for German
– querying cooccurrence frequencies with WWW search engines using the NEAR operator allows only very rough queries

Future improvements
Using cooccurrence frequencies from a controlled corpus might improve results:
– more exact queries are possible than with AltaVista
– less noise in the corpus

References
Volk, M. (2000). Scaling up. Using the WWW to resolve PP-attachment ambiguities. In Proceedings of Konvens, Ilmenau.
Volk, M. (2001). Exploiting the WWW as a corpus to resolve PP-attachment ambiguities. In Proceedings of Corpus Linguistics, Lancaster.
