A Light-weight Approach to Coreference Resolution for Named Entities in Text Marin Dimitrov Ontotext Lab, Sirma AI Kalina Bontcheva, Hamish Cunningham,


1 A Light-weight Approach to Coreference Resolution for Named Entities in Text
Marin Dimitrov (Ontotext Lab, Sirma AI)
Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Horacio Saggion (Department of Computer Science, University of Sheffield)

2 Overview
- A "knowledge-poor" approach to pronominal anaphora resolution: inexpensive and fast, yet useful for practical tasks
- Goal: resolution of pronominal anaphora when the antecedent is a named entity (person, organisation, location, etc.)
- Our approach relies on part-of-speech information and named entity recognition
- No syntactic parsing, focus identification or deep semantic knowledge is used

3 Corpus Analysis (1)
Corpus data:
- bnews: ASR-transcribed broadcast news, approx. 60 000 words
- npaper: OCR-transcribed newspaper articles, approx. 61 000 words
- nwire: newswire, approx. 66 000 words
Pronouns included in the analysis:
- personal: I, me, you, he, she, it, we, they, etc.
- possessive adjectives: my, your, her, his, its, etc.
- possessive pronouns: mine, yours, hers, his, its, etc.
- reflexive pronouns: myself, yourself, herself, himself, itself, etc.

4 Corpus Analysis (2)
Total pronouns:
- Avg. 4.2% (highest in broadcast news at 5.6%; avg. 3.5% otherwise)
- The average is three times higher than previously reported in (Barbu & Mitkov 2001), because they used technical manuals
Pronouns by type:
- Similar to previously reported results
- The most frequent pronouns change depending on the corpus type: bnews differs from npaper and nwire, with I and you being much more important there
Pleonastic it:
- Lower frequency compared to other studies, due to the different domains
- On average 3.2% of all pronouns are pleonastic occurrences of it, which corresponds to 17.5% of all it pronouns
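A minimal sketch of how a pronoun share like the ones above could be measured. It assumes the percentages are shares of word tokens, which the slide does not state explicitly. The tokeniser, the function name `pronoun_share` and the completion of the "etc." lists are choices made for this sketch, not the authors' procedure.

```python
import re

# Pronoun list built from the four categories on the Corpus Analysis (1) slide.
# The slide ends each list with "etc."; the completion here is an assumption.
PRONOUNS = {
    "i", "me", "you", "he", "him", "she", "her", "it", "we", "us", "they", "them",
    "my", "your", "his", "its", "our", "their",
    "mine", "yours", "hers", "ours", "theirs",
    "myself", "yourself", "himself", "herself", "itself",
    "ourselves", "yourselves", "themselves",
}


def pronoun_share(text: str) -> float:
    """Return the fraction of word tokens in `text` that are pronouns."""
    # Crude tokeniser (an assumption): lower-case runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in PRONOUNS for token in tokens) / len(tokens)
```

A real count would also need part-of-speech information to separate, for example, possessive "her" from object "her"; this sketch only counts surface forms.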

5 Coreference Module Design
Modular design, so new parts can be added easily. Current modules:
- Quoted text module: identifies the quoted text segments to be used in the resolution of I, me, etc.
- Pleonastic It module: identifies the pleonastic occurrences of it in the text
- Pronoun Coreference Resolution module
Freely available as part of GATE from http://gate.ac.uk

6 Pleonastic It Identification
- Pattern-based, using patterns from (Lappin & Leass, 1994) extended with new ones derived from our corpus and with synonyms/antonyms from WordNet
- However, 41.3% of the pleonastic occurrences in the corpus are still not matched by any pattern
- Not all patterns are detected correctly by the module, so on average only 38% of all pleonastic it occurrences are identified correctly
- Hence there is scope for further improvement here, which in turn will improve the performance on the resolution of it, its, etc.
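The pattern-based idea can be illustrated with a few regular expressions in the style of the Lappin & Leass patterns ("It is <modal adjective> that ...", "It is <cognitive verb>-ed that ...", "It seems/appears ..."). This is a sketch only: the word lists are tiny and hypothetical, and the function name `is_pleonastic` is not taken from the GATE module.

```python
import re

# Hypothetical, deliberately small word lists. The module described on the
# slide uses larger lists extended with corpus-derived patterns and WordNet.
MODAL_ADJECTIVES = ["necessary", "possible", "likely", "important", "certain", "good"]
COGNITIVE_VERBS = ["believed", "thought", "assumed", "expected", "known", "recommended"]

MODAL = "|".join(MODAL_ADJECTIVES)
COG = "|".join(COGNITIVE_VERBS)

PLEONASTIC_PATTERNS = [
    # "It is <modal adjective> that/to ..."
    re.compile(rf"\bit\s+(?:is|was)\s+(?:{MODAL})\s+(?:that|to)\b", re.IGNORECASE),
    # "It is <cognitive verb>-ed that ..."
    re.compile(rf"\bit\s+(?:is|was)\s+(?:{COG})\s+that\b", re.IGNORECASE),
    # "It seems/appears/means/follows (that) ..."
    re.compile(r"\bit\s+(?:seems|appears|means|follows)\b", re.IGNORECASE),
]


def is_pleonastic(sentence: str) -> bool:
    """Return True if any pattern matches an occurrence of "it" in the sentence."""
    return any(pattern.search(sentence) for pattern in PLEONASTIC_PATTERNS)
```

Surface patterns like these miss any construction that is not listed, which is consistent with the share of unmatched occurrences reported above.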

7 Resolution of he, she, etc.
1. Inspect the context of the anaphor for candidate antecedents. Each Person entity is considered a candidate.
2. For each candidate, perform a gender compatibility check.
3. Evaluate each candidate against the best candidate so far:
- If both candidates are anaphoric for the pronoun (they precede it), choose the one that appears closer.
- The same holds when both candidates are cataphoric relative to the pronoun (they follow it).
- If one is anaphoric and the other is cataphoric, choose the former, even if the latter appears closer to the pronoun.
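The three steps above can be sketched as follows. The `Entity` record, the character-offset representation and the decision to treat entities of unknown gender as compatible are assumptions of this sketch. The caller is assumed to pass in only the entities from the context window, whose size the slide does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    text: str
    kind: str    # e.g. "Person", "Organization", "Location"
    gender: str  # "male", "female" or "unknown"
    offset: int  # character offset of the entity in the text


PRONOUN_GENDER = {
    "he": "male", "him": "male", "his": "male", "himself": "male",
    "she": "female", "her": "female", "hers": "female", "herself": "female",
}


def better(new: Entity, best: Entity, pronoun_offset: int) -> bool:
    """Step 3: decide whether `new` should replace the best candidate so far."""
    new_anaphoric = new.offset < pronoun_offset
    best_anaphoric = best.offset < pronoun_offset
    if new_anaphoric != best_anaphoric:
        # One anaphoric, one cataphoric: the anaphoric one wins regardless of distance.
        return new_anaphoric
    # Both on the same side of the pronoun: the closer one wins.
    return abs(new.offset - pronoun_offset) < abs(best.offset - pronoun_offset)


def resolve_third_person(pronoun: str, pronoun_offset: int,
                         context: List[Entity]) -> Optional[Entity]:
    gender = PRONOUN_GENDER.get(pronoun.lower())
    if gender is None:
        return None  # not a pronoun handled by this rule set
    best = None
    for entity in context:
        # Step 1: only Person entities are candidates.
        if entity.kind != "Person":
            continue
        # Step 2: gender check (unknown gender accepted; an assumption).
        if entity.gender not in ("unknown", gender):
            continue
        if best is None or better(entity, best, pronoun_offset):
            best = entity
    return best
```

With "John Smith" at offset 20 and "Bob Brown" at offset 55, a "he" at offset 50 resolves to "John Smith": the preceding candidate is preferred even though the following one is closer.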

8 Resolution of it, its, etc.
- Resolution is harder because there are fewer constraints, e.g. no gender
- The proportion of nominal antecedents is higher (33%), so a nominal anaphora resolution module is needed to improve performance here
- In 52% of the cases the most recent named entity of type Organization or Location was the correct antecedent
- In 15% of the cases the most recent named entity was not the right antecedent; in half of these cases this is due to appositions (which we will handle in the future)
- No need to consider cataphoric named entities as potential antecedents
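A sketch of the recency heuristic suggested by these figures: pick the most recent preceding Organization or Location entity and never look ahead. The `Entity` tuple, the `pleonastic` flag and the function name are hypothetical; the slides do not give the module's actual interface.

```python
from typing import List, NamedTuple, Optional


class Entity(NamedTuple):
    text: str
    kind: str    # e.g. "Person", "Organization", "Location"
    offset: int  # character offset of the entity in the text


def resolve_it(pronoun_offset: int, entities: List[Entity],
               pleonastic: bool = False) -> Optional[Entity]:
    """Return the most recent preceding Organization/Location entity, if any."""
    if pleonastic:
        return None  # pleonastic "it" has no antecedent
    preceding = [
        e for e in entities
        if e.kind in ("Organization", "Location") and e.offset < pronoun_offset
    ]
    # Cataphoric (following) entities are never considered.
    return max(preceding, key=lambda e: e.offset, default=None)
```

Nominal antecedents and appositions are outside this sketch, matching the limitations listed above.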

9 Resolution of I, me, etc.
- Contrary to the other pronouns, the antecedents here are mainly cataphoric
- Resolved only if they occur in a quoted speech segment
- In 52% of all occurrences the antecedent is the closest named entity in the text following the quoted segment
- In 29% of all cases the antecedent is a named entity in the previous sentence
- In 3% of the cases the antecedent is in the same sentence, but before the quote
- From the remaining 16% (not covered currently), in 13% of the cases the antecedent is a nominal
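One way to turn these observations into a fallback chain is sketched below. The slide reports frequencies, not the order in which the module tries the rules. Trying them in descending order of frequency, limiting the "following" search to the sentence that contains the quote, and picking the entity nearest the quote within each rule are all assumptions of this sketch, as are the `Quote` and `Entity` records.

```python
from typing import List, NamedTuple, Optional


class Entity(NamedTuple):
    text: str
    offset: int  # character offset of the entity in the text


class Quote(NamedTuple):
    start: int                # offset where the quoted segment begins
    end: int                  # offset just past the quoted segment
    sentence_start: int       # start of the sentence containing the quote
    sentence_end: int         # end of the sentence containing the quote
    prev_sentence_start: int  # start of the preceding sentence


def resolve_first_person(pronoun_offset: int, quote: Quote,
                         entities: List[Entity]) -> Optional[Entity]:
    # Pronouns outside quoted speech are left unresolved.
    if not (quote.start <= pronoun_offset < quote.end):
        return None
    # Rule 1: closest entity following the quoted segment.
    following = [e for e in entities if quote.end <= e.offset < quote.sentence_end]
    if following:
        return min(following, key=lambda e: e.offset)
    # Rule 2: an entity in the previous sentence (the one nearest the quote).
    previous = [e for e in entities
                if quote.prev_sentence_start <= e.offset < quote.sentence_start]
    if previous:
        return max(previous, key=lambda e: e.offset)
    # Rule 3: an entity in the same sentence, before the quote.
    before = [e for e in entities if quote.sentence_start <= e.offset < quote.start]
    if before:
        return max(before, key=lambda e: e.offset)
    return None
```

For a text like `Smith arrived late. "I am sorry," said Jones.`, the pronoun inside the quote resolves to "Jones" by rule 1; if no entity follows the quote, the sketch falls back to "Smith" in the previous sentence.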

10 Evaluation
- The evaluation corpus was 5% of the entire corpus, with 4.5% of the pronouns
- No pronouns were excluded, so unhandled ones (like we and you) degrade the recall, while nominal antecedents degrade the precision
- Overall: 66% precision and 46% recall, comparable to other knowledge-poor approaches
Precision/recall per pronoun type:
- he, she, her, etc.: 79.3% precision / 77.2% recall
- it, its, etc.: 43.5% precision / 51.7% recall
- I, me, myself, etc.: 77.8% precision / 62.2% recall
Precision/recall are degraded partly by errors in the named entity recogniser: we get approx. 10% improvement when using human-marked named entities
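The effect of unhandled pronouns and nominal antecedents on the two measures can be seen in a small sketch. It uses the standard definitions (precision = correct / resolved by the system, recall = correct / annotated in the key). The data is an invented toy example, not taken from the ACE corpus, and the official scoring procedure may differ.

```python
from typing import Dict, Tuple


def precision_recall(key: Dict[str, str], response: Dict[str, str]) -> Tuple[float, float]:
    """Score pronoun -> antecedent links produced by a system against a gold key."""
    correct = sum(1 for pronoun, antecedent in response.items()
                  if key.get(pronoun) == antecedent)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall


# Invented toy data: pronoun mentions are identified as "<pronoun>@<offset>".
KEY = {
    "he@12": "John Smith",
    "it@40": "the company",   # nominal antecedent, not a named entity
    "she@77": "Mary Jones",
    "we@90": "the team",      # pronoun type not handled by the system
}
RESPONSE = {
    "he@12": "John Smith",
    "it@40": "Acme",          # wrong: the system can only propose named entities
    "she@77": "Mary Jones",
}

precision, recall = precision_recall(KEY, RESPONSE)
```

In this toy case the wrong link for "it" lowers precision to 2/3, while the unresolved "we" lowers recall to 2/4.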

11 Conclusion
- We demonstrated that a very lightweight approach is useful in practical tasks like entity detection and tracking
- Further improvements can be achieved by resolving the nominals and detecting apposition
- Since the module is freely available, it can be used as a baseline against which other approaches can be compared
- Unfortunately the ACE corpus used here cannot be made available, as it is a closed evaluation. Nor can we disclose how our approach ranked compared to other participating systems.

