Pastra et al., LREC 2002
How feasible is the reuse of grammars for Named Entity Recognition?
Katerina Pastra, Diana Maynard, Oana Hamza, Hamish Cunningham and Yorick Wilks
Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.

The paradox
NER results: close to human performance
Reuse of NER resources: minimal
We will focus on:
  Traditional rule-based NER systems
  NER in text
  Reuse of grammars for NER
  Manual adaptation of grammars

What is it that hinders grammar reuse?
1) Grammar Formalism
2) Application Domain
3) Natural Language
The use of Flexible System Architectures guarantees reusability of resources
>>> But is this a “sine qua non” solution? Does the lack of such architectures render reusability simply “not feasible”?

Grammar Formalism (1)
>> Current practice: no standardised formalism
>> Traditional pattern-matching languages: inappropriate for NER
>> Norm: use of attribute-value (AV) notations, which allow reference to token attributes from multiple analysis levels
Translating formalisms: a time-effective solution?
Time gained vs. information lost: is there a trade-off?

Grammar Formalism (2)
The need: NER for SOCIS (not the main task; limited time)
The problem: existing grammar in another formalism
>> NEA – JAPE similarities: declarative, context-sensitive, non-deterministic pattern matching (PM)…
>> NEA – JAPE differences:
  Bottom-up (BU) rule invocation – FST cascades
  Appelt control mechanism – Appelt, First, Brill
  Rules augmented with PROLOG – JAVA
  Wildcards, “don’t care” sequences: not common
  Iterations, (!=): different mechanisms

Grammar Formalism (3)
The experiment: from the NEA notation to JAPE
NEA notation: A => B\C/D
JAPE: (B) (C):label (D) --> :label.EntityType = {attr}
  One formalism's LHS is the other's RHS
  The same things are handled in different ways
  Differences in the modules run before NER affect the rules
STILL: original set developed in 2 months – SOCIS set in 1 week
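
As a rough illustration of the mapping on this slide, the sketch below rewrites a rule of the shape A => B\C/D into a JAPE-style rule string. It assumes (the slides do not spell this out) that A is the entity type, B and D are the left and right context, and C is the span to annotate. The function name, the rule name and the `rule` attribute are hypothetical, and the symbols stand in for the annotation patterns that real NEA and JAPE rules would contain.

```python
import re

# Assumed reading of the NEA shape "A => B\C/D":
#   A = entity type, B = left context, C = entity span, D = right context.
NEA_RULE = re.compile(
    r"^\s*(?P<a>[^=]+?)\s*=>\s*(?P<b>[^\\]*)\\(?P<c>[^/]+)/(?P<d>.*)$"
)


def nea_to_jape(rule: str, name: str = "Rule1", label: str = "label") -> str:
    """Rewrite one toy NEA-style rule into a JAPE-style rule string."""
    m = NEA_RULE.match(rule)
    if m is None:
        raise ValueError(f"not in the assumed 'A => B\\C/D' shape: {rule!r}")
    a, b, c, d = (m.group(k).strip() for k in "abcd")

    # JAPE left-hand side: optional left context, labelled span, optional right context.
    lhs = []
    if b:
        lhs.append(f"({b})")
    lhs.append(f"({c}):{label}")
    if d:
        lhs.append(f"({d})")

    # JAPE right-hand side: the entity type moves from the NEA LHS to here.
    rhs = f':{label}.{a} = {{rule = "{name}"}}'
    return f"Rule: {name}\n{' '.join(lhs)}\n-->\n{rhs}"


print(nea_to_jape(r"Person => Title \ FirstName Surname / Verb"))
```

The entity type starts on the left of the NEA rule and ends on the right of the JAPE rule, which is the LHS/RHS swap the slide points to. A string rewrite like this captures only the surface shape; differences in control mechanisms and in the upstream modules still need manual attention.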

Application Domain (1)
Is there a core set of grammar rules that is always domain-independent?
General-purpose NER grammars:
  Developed to serve grammar reuse, but themselves originated from specific applications
  They separate specific from general information
MUSE: automatic switching of resources according to text features
HaSIE: company reports on health and safety issues

Application Domain (2)
From newswire text on Biotechnology to … Crime Scene Police Reports
The experiment:
  The gazetteers were enriched with police- and crime-related information
  All original domain-specific rules were deleted
  Original results with no modifications to the grammar: close to 90%
  Only 1 change to the core set, plus the addition of rules
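
The adaptation steps on this slide can be pictured as operations on a rule set in which each rule is tagged as core or domain-specific. The sketch below is only an illustration under that assumption: the `Rule` class, the helper functions, the rule names and the gazetteer entries are all hypothetical and are not taken from MUSE, HaSIE or the SOCIS grammar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    domain: str  # "core" for domain-independent rules, otherwise a domain tag


def adapt_grammar(rules, new_rules, source_domain):
    """Drop the rules tied to the source domain, keep the rest, append new ones."""
    kept = [r for r in rules if r.domain != source_domain]
    return kept + list(new_rules)


def enrich_gazetteer(gazetteer, extra):
    """Return a copy of the gazetteer with extra entries merged per list name."""
    merged = {name: set(entries) for name, entries in gazetteer.items()}
    for list_name, entries in extra.items():
        merged.setdefault(list_name, set()).update(entries)
    return merged


# Hypothetical rule names and gazetteer entries, for illustration only.
original = [
    Rule("PersonTitle", "core"),
    Rule("DateNumeric", "core"),
    Rule("GeneName", "biotech"),
]
crime_rules = [Rule("ExhibitIdentifier", "crime")]

grammar = adapt_grammar(original, crime_rules, source_domain="biotech")
gaz = enrich_gazetteer({"job_title": {"professor"}}, {"job_title": {"constable"}})

print([r.name for r in grammar])
print(sorted(gaz["job_title"]))
```

Keeping the domain tag on each rule is what makes the deletion step mechanical; the manual work then concentrates on the few new rules and the gazetteer entries.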

Natural Language (1)
NER grammar in language (A) + linguistic knowledge of NEs in (B) = NER grammar for (B)?
Parameters to consider:
  The relation of A and B (closely related or not) determines the extent of reuse
  Nature of NEs (formation, syntagmatic relations):
    unpredictable behaviour and structure
    finite set

Natural Language (2)
Romanian NEs (compared to English):
  Rich inflection
  Flexible word order
  Different word order (e.g. modifier follows noun)
The experiment: run the NER grammar for English on Romanian text
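
To make the word-order point concrete, the sketch below runs the same naive matcher with an English-style pattern and a reordered pattern over pre-tagged tokens. The tags `Mod` and `OrgKey`, the matcher and the example phrases are assumptions chosen for illustration; they are not the rules or the tag set used in the experiment.

```python
def find_matches(tagged_tokens, pattern):
    """Return the token spans whose tag sequence equals `pattern` (naive scan)."""
    tags = [tag for _, tag in tagged_tokens]
    n = len(pattern)
    return [
        " ".join(word for word, _ in tagged_tokens[i:i + n])
        for i in range(len(tags) - n + 1)
        if tags[i:i + n] == pattern
    ]


# Hypothetical tags: "Mod" = modifier, "OrgKey" = organisation key word.
ENGLISH_ORG = ["Mod", "OrgKey"]   # modifier precedes the noun
ROMANIAN_ORG = ["OrgKey", "Mod"]  # modifier follows the noun

english = [("the", "Det"), ("National", "Mod"), ("Bank", "OrgKey")]
romanian = [("Banca", "OrgKey"), ("Națională", "Mod")]

print(find_matches(english, ENGLISH_ORG))    # English rule on English text
print(find_matches(romanian, ENGLISH_ORG))   # English rule on Romanian text
print(find_matches(romanian, ROMANIAN_ORG))  # reordered rule on Romanian text
```

An unadapted pattern simply fails to fire on the reordered phrase, which lowers recall without necessarily harming precision.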

Natural Language (3)
1st experiment: Romanian gazetteer + English grammar
>> Overall results: P = 0.82, R = 0.67
Low recall even for entity types recognised with high precision (e.g. Organisation: P = 0.75, R = 0.39)
2nd experiment: Romanian gazetteer + adapted grammar
>> Overall results: P = 0.95, R = 0.94
Corpus: 1MB of Romanian newspaper texts
Manual marking of NEs – Romanian NER (3 weeks)
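
For readers less familiar with the P and R figures quoted here, the sketch below shows how precision and recall are conventionally computed from a manually annotated key and a system's output. It assumes exact-match scoring over (start, end, type) spans, which the slides do not confirm as the scheme used, and the spans themselves are invented.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented spans for illustration: (start offset, end offset, entity type).
gold = {
    (0, 5, "Person"),
    (10, 18, "Organisation"),
    (25, 33, "Location"),
    (40, 44, "Date"),
}
predicted = {
    (0, 5, "Person"),
    (25, 33, "Location"),
    (50, 55, "Money"),
}

p, r = precision_recall(gold, predicted)
print(f"P = {p:.2f}, R = {r:.2f}")
```

A grammar that misses many entities but rarely mislabels the ones it finds shows the pattern noted for Organisation in the first experiment: precision stays comparatively high while recall drops.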

Natural Language (3)
Entity Type  Precision  Recall
Address  0.81
Date
Location
Money
Organisation
Percent  1  0.82
Person
Identifier
Overall

Entity Type  Precision  Recall
Address
Date
Location
Money
Organisation
Percent  1  0.99
Person
Identifier
Overall

Conclusions
Reuse of existing NER grammars is time-effective and should be attempted even when the formalisms, applications and languages involved are different.
Further issues to be addressed:
  Reuse of NER grammars for spoken NEs
  Reuse in statistical/ML NER approaches
  Automating grammar reuse