ML-based approaches to Named Entity Recognition for German newspaper texts
ESSLLI 02 – Workshop on ML Approaches for CL
Marc Rössler, University of Duisburg
Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments
  - Markov Models
  - Maximum Entropy modeling
- Conclusions and further work
Named Entity Recognition
- Named Entity Recognition deals with the detection and categorization of proper names
- The definition of the categories is application specific and could include e.g. names of products, drugs, rock bands, bibliographic references, ...
- NE Recognition is a key part of Information Extraction but is also important for Summarization, Information Filtering, Machine Translation, …
Named Entity Recognition
- Frequently used categories are ORGANIZATION, PERSON, LOCATION and sometimes temporal and numeric expressions
- These are the categories used
  - at the Message Understanding Conferences (MUC), where systems for NE Recognition were evaluated
  - at CoNLL 2002, where a shared task on language-independent Named Entity Recognition will be performed
- Accuracy is measured with Recall, Precision and F-measure
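A minimal sketch of how Precision, Recall and F-measure can be computed for NE Recognition. It assumes exact-match scoring, i.e. an NE counts as correct only if its span and category both agree with the gold annotation; the scoring scheme behind the figures in this talk is not stated. The function name and the (start, end, category) span format are illustrative.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Score predicted NEs against gold NEs.

    gold, predicted: sets of (start, end, category) tuples.
    Assumption: only exact matches of span and category count as correct.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Toy data: one ORG span has a wrong boundary, one LOC is spurious.
gold = {(1, 3, "PER"), (7, 9, "ORG"), (12, 13, "LOC")}
predicted = {(1, 3, "PER"), (7, 8, "ORG"), (12, 13, "LOC"), (20, 21, "LOC")}
p, r, f = precision_recall_f(gold, predicted)
```

With beta = 1 the F-measure is the harmonic mean of Precision and Recall.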
Problems in Named Entity Recognition
- The classification of NEs is often difficult or vague:
  - New York Times, AOL: ORGANIZATION or PUBLICATION?
- The same word form can belong to different categories:
  - Brazil (won the World Cup): ORGANIZATION or LOCATION?
  - Philip Morris: PERSON or ORGANIZATION?
  - Kölner (Sparkasse): LOCATION or ORGANIZATION?
  - Bahn: ORGANIZATION or regular noun?
Evidence to recognize and categorize NEs
According to McDonald (1996) there are two kinds of evidence to detect and categorize NEs:
- Internal evidence is taken from within the NE, e.g.
  - Ltd., Association, Consulting for ORGANIZATION
  - Capitalization
  - Peter, Smith, known person names
- External evidence is provided by the context of the NE, e.g.
  - Mr., CEO, director for PERSON
  - produces, shares for ORGANIZATION
On the impact of lists on NE Recognition
- Mikheev et al. (1999) report experiments on the impact of lists containing NEs and show that their system achieves an accuracy of 0.85 for ORGANIZATION and 0.9 for PERSON even without any lists
Problems of lists
- The lists and gazetteers have to be very, very large and still won't cover all occurring NEs, especially for ORGANIZATION
- A pure list lookup cannot deal with the mentioned ambiguities
ML approaches to NE Recognition
- Internal and external evidence can be modeled both with rules and with ML approaches
- Approaches using learning algorithms like
  - HMM (Bikel et al. 1999, Zhou and Su 2002)
  - Maximum Entropy modeling (Borthwick et al. 1998, Mikheev et al. 1998)
  are among the best performing systems and achieve an overall accuracy (F-measure) of more than 90% for English
NE Recognition as a classifier task
Common approach:
- To which of the following categories does the token belong?
  token (PERSON | LOCATION | ORGANIZATION | NIL)
- Does it start or continue a sequence of NE tokens belonging together?
  IOB notation: B begin, I continue, O not belonging to a category

  that    O       (no NE)
  Peter   PER_B   (person name starts)
  Smith   PER_I   (person name continues)
  cried   O       (no NE)
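The IOB notation above can be sketched as a pair of conversion functions between NE spans and per-token tags. The tag spelling (PER_B, PER_I, O) follows the example on this slide; the function names and the (start, end, category) span format with an exclusive end index are assumptions made for illustration.

```python
def spans_to_iob(tokens, spans):
    """Turn (start, end, category) spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, category in spans:
        tags[start] = f"{category}_B"
        for i in range(start + 1, end):
            tags[i] = f"{category}_I"
    return tags


def iob_to_spans(tags):
    """Recover (start, end, category) spans from a list of IOB tags."""
    spans, start, category = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        cat, sep, pos = tag.rpartition("_")
        if sep == "_" and pos == "I" and cat == category:
            continue  # the current NE continues
        if start is not None:
            spans.append((start, i, category))
            start, category = None, None
        if sep == "_" and pos in ("B", "I"):  # a stray I tag opens a new NE
            start, category = i, cat
    return spans


tokens = ["that", "Peter", "Smith", "cried"]
tags = spans_to_iob(tokens, [(1, 3, "PER")])
```

Framed this way, the recognizer only has to pick one tag per token, and the spans are read off the tag sequence afterwards.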
NE Recognition in German
- Not just NEs but also regular nouns are capitalized
- Additionally, adjectives derived from geographical names are capitalized only when they end in „-er“ (Schweizer vs. deutsche)
- This means that capitalization is a less valuable cue than it is in other languages
Approaches to NE Recognition in German
- Systems for NE Recognition in German are rare
- Volk and Clematide (2001)
  - rule-based system with large lists
  - GERTWOL, a commercial morphological analyzer, providing about proper names
  - accuracy of 0.78 for ORGANIZATION, 0.88 for PERSON and 0.85 for LOCATION
- Neumann and Piskorski (2000)
  - NE Recognition as part of an information extraction system based on finite state machines and several knowledge sources
  - reported accuracy of 0.89, but it is not clear whether this describes the detection of NEs in contrast to regular nouns or the classification of NEs
Basic idea of the approach
- Word forms are too specific, and the model will never know every word form
  - Therefore use part of speech tags
- By using only part of speech tags, the evidence provided by some word forms is lost
  - Therefore use a mixture of word forms and part of speech tags
- Normalize word forms without losing too much evidence
Providing evidence for the statistical model: POS tagging
- Part of speech tagging is done with TnT (Brants 1998) with the pre-defined model for German
- Tagging accuracy depends strongly on whether a word was seen before or not (97.7% for known vs. 86.6% for unknown words)
- The chosen research corpus, the Computer Zeitung, contained an average of 18% unknown words
- TnT tries to distinguish regular nouns from NEs, but this is error-prone for German
- The tagset (STTS) is slightly extended
Providing evidence for the statistical model: Words
- How to identify the words that are introduced with their word form and not with their POS tag?
- These words can be obtained automatically from an annotated corpus by extracting words that often occur near or within NEs
- The resulting list is reduced with a frequency threshold combined with a TF*IDF threshold, which filters out noise and stop words
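A rough sketch of such a word selection. The exact TF*IDF formula, the context window and the threshold values used for the experiments are not given here, so everything below is an assumption: TF is taken as a word's share of all near-NE occurrences, IDF as the log of the inverse article frequency, and "near" as one token to the left or right. The function name is hypothetical.

```python
import math
from collections import Counter


def trigger_words(documents, min_freq, min_tfidf, window=1):
    """Select words that often occur within or next to NEs.

    documents: list of articles, each a list of (word, tag) pairs,
               where the tag "O" marks a token outside any NE.
    Assumed scoring: tf = share of all near-NE occurrences,
                     idf = log(number of articles / articles with the word).
    """
    near_ne = Counter()   # occurrences within `window` tokens of an NE token
    doc_freq = Counter()  # number of articles that contain the word
    for doc in documents:
        doc_freq.update({word for word, _ in doc})
        for i, (word, _) in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            if any(tag != "O" for _, tag in doc[lo:hi]):
                near_ne[word] += 1
    total = sum(near_ne.values())
    selected = set()
    for word, freq in near_ne.items():
        tf = freq / total
        idf = math.log(len(documents) / doc_freq[word])
        if freq > min_freq and tf * idf > min_tfidf:
            selected.add(word)
    return selected


# Toy corpus of three "articles"; the thresholds are chosen for this toy data only.
docs = [
    [("Die", "O"), ("Concept", "ORG"), ("GmbH", "ORG"), ("und", "O"), ("BMW", "ORG")],
    [("Die", "O"), ("Meyer", "ORG"), ("GmbH", "ORG"), ("und", "O"), ("Siemens", "ORG")],
    [("Die", "O"), ("Frau", "O"), ("und", "O"), ("der", "O"), ("Hof", "O")],
]
selected = trigger_words(docs, min_freq=1, min_tfidf=0.05)
```

In the toy data, "Die" and "und" occur next to NEs just as often as "GmbH", but they appear in every article, so their IDF is zero and the TF*IDF threshold removes them as stop words.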
Example: Mixture of word forms and part of speech tags

  Mixture:   ART   ADJAx         NN        GmbH   KONJ   NEx   entwickeln   ADJA   …
  Text:      Die   Wiesbadener   Concept   GmbH   und    BMW   entwickeln   neue   …
  POS tags:  ART   ADJA          NN               KONJ   NE    VVFIN        ADJA   …
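The mixture in the example can be sketched as a simple substitution: a token keeps its word form if it is on the list of selected words, otherwise it is replaced by its (extended) POS tag. The function name and the contents of the keep list are illustrative, and the tag "NN" for "GmbH" is an assumption, since the example does not show it.

```python
def mixture(tagged_tokens, keep_words):
    """Replace each token by its POS tag unless its word form is kept.

    tagged_tokens: list of (word, pos) pairs.
    keep_words:    set of word forms that carry evidence for NEs.
    """
    return [word if word in keep_words else pos for word, pos in tagged_tokens]


sentence = [
    ("Die", "ART"), ("Wiesbadener", "ADJAx"), ("Concept", "NN"),
    ("GmbH", "NN"),  # assumed tag, not shown in the example
    ("und", "KONJ"), ("BMW", "NEx"), ("entwickeln", "VVFIN"), ("neue", "ADJA"),
]
keep = {"GmbH", "entwickeln"}  # hypothetical keep list
mixed = mixture(sentence, keep)
```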
Additional heuristic: learn-apply-forget (Volk and Clematide 2001)
- All recognized NEs within one text unit (article) are stored
- The text unit is read again and the “learned” names are applied (including a check for genitive endings in “-s”)
- After that the system “forgets” the learned NEs
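A simplified sketch of the learn-apply-forget idea. It assumes single-token names and plain category tags without B/I suffixes, which is a simplification of whatever the actual system does; the function name is hypothetical.

```python
def learn_apply_forget(tokens, tags):
    """Apply NEs recognized in one article to its untagged tokens.

    tokens, tags: parallel lists for ONE article; "O" means no NE.
    Simplification: names are single tokens, tags are bare categories.
    """
    # Learn: remember every token that was recognized as an NE.
    learned = {}
    for token, tag in zip(tokens, tags):
        if tag != "O":
            learned.setdefault(token, tag)

    # Apply: read the article again and tag further occurrences,
    # including genitive forms ending in "-s".
    result = list(tags)
    for i, token in enumerate(tokens):
        if result[i] != "O":
            continue
        if token in learned:
            result[i] = learned[token]
        elif token.endswith("s") and token[:-1] in learned:
            result[i] = learned[token[:-1]]

    # Forget: `learned` is local, so nothing carries over to the next article.
    return result


tokens = ["Herr", "Rössler", "sagte", ".", "Rösslers", "Vortrag", "und", "Rössler"]
tags = ["O", "PER", "O", "O", "O", "O", "O", "O"]
new_tags = learn_apply_forget(tokens, tags)
```

Keeping the learned names local to one article is what limits the damage of a wrong first recognition: the error cannot spread to other texts.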
The research corpus
- As a research corpus the Computer Zeitung, a weekly computer magazine, was selected
- In about four issues all ORGANIZATION, PERSON and LOCATION NEs were manually annotated by one person, without a second annotation to check the results
- The corpus contains about words

  Category       NEs    Tokens
  ORGANIZATION   2769   4249
  PERSON          824   1404
  LOCATION       1450   1506
Experiments with ML approaches
Experiments with two different techniques for statistical modeling were conducted:
- TnT (Brants 1998)
  - trigram tagger that can be trained on any tagset
  - second order Markov Models
- OpenNLP.Maxent package (Baldridge et al.)
  - open source, available at
  - Maximum Entropy modeling
For all experiments a tenfold cross-validation was performed
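A minimal sketch of a tenfold cross-validation split. How the folds were actually formed (by article, by sentence, shuffled or not) is not stated, so the round-robin assignment below is an assumption, and the function name is illustrative.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; every item is in the test part exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


units = list(range(25))  # stand-ins for annotated text units
splits = list(k_fold_splits(units))
```

Each model is trained on nine folds and evaluated on the remaining one; the reported figures would then be aggregated over the ten runs.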
Experiments with the Markov Models
Basic question: Is an approach based on a mixture of POS tags and word forms feasible for NE Recognition in German?
Three experiments were conducted:
- Only the POS tags as the tagger's input
- The proposed mixture of POS tags and word forms, but avoiding words that are part of a particular NE (avoid “IBM”, but keep “Ltd.”)
- The proposed mixture of POS tags and word forms without restrictions
Second order Markov Models
Example of the tagger's input to generate emission probabilities of tokens and uni-, bi- and trigram probabilities of the tags:

  Text          Token        Tag
  Die           ART          O
  Wiesbadener   ADJAx        LOC_I
  Concept       NN           ORG_I
  GmbH          GmbH         ORG_B
  und           KONJ         O
  BMW           NEx          ORG_I
  entwickeln    entwickeln
  neue          ADJA         O
  …             …            …
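A small sketch of the counts behind a second order Markov Model: trigram transition probabilities of the tags and emission probabilities of the tokens, both as plain relative frequencies. This is not TnT itself; a real tagger also needs smoothing and a treatment of unknown tokens, which are left out. Function names and the toy training data are illustrative.

```python
from collections import Counter


def train(sequences):
    """Count tags, tag contexts, tag trigrams and (tag, token) emissions.

    sequences: list of sentences, each a list of (token, tag) pairs.
    """
    tag_count, context, trigram, emission = Counter(), Counter(), Counter(), Counter()
    for sent in sequences:
        tags = ["<s>", "<s>"] + [tag for _, tag in sent]  # two start symbols
        for token, tag in sent:
            emission[(tag, token)] += 1
            tag_count[tag] += 1
        for i in range(2, len(tags)):
            context[(tags[i - 2], tags[i - 1])] += 1
            trigram[(tags[i - 2], tags[i - 1], tags[i])] += 1
    return tag_count, context, trigram, emission


def p_transition(model, t1, t2, t3):
    """Unsmoothed estimate of P(t3 | t1, t2)."""
    _, context, trigram, _ = model
    seen = context[(t1, t2)]
    return trigram[(t1, t2, t3)] / seen if seen else 0.0


def p_emission(model, tag, token):
    """Unsmoothed estimate of P(token | tag)."""
    tag_count, _, _, emission = model
    return emission[(tag, token)] / tag_count[tag] if tag_count[tag] else 0.0


# Toy training data in the mixture representation (tokens are POS tags or word forms).
data = [
    [("ART", "O"), ("NN", "ORG_B"), ("GmbH", "ORG_I"), ("VVFIN", "O")],
    [("ART", "O"), ("NEx", "ORG_B"), ("VVFIN", "O")],
]
model = train(data)
```

The trigram context is the structural limit mentioned later in the talk: nothing outside the two preceding tags can influence the decision.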
Results with Markov Models

                     ORG            PER            LOC            Avg. F-measure
  POS tags           P: 36, R: 35   P: 42, R: 69   P: 42, R: 5    32
  Mixture, no NE     P: 54, R: 56   P: 55, R: 89   P: 69, R: 32   55
  Mixture, with NE   P: 85, R: 62   P: 71, R: 78   P: 86, R: 77   72

- The figures are very low in general, but the experiments show that the approach provides evidence for the detection and categorization of some NEs
- However, the approach has to be enhanced
Enhancement of the approach
- The possibilities to enhance the second order Markov Model approach are limited, since the context is restricted to a trigram window
- Maximum Entropy offers more modeling options, since the number of features is, at least theoretically, not limited
- First step: achieving comparable results with Maximum Entropy modeling by using an input similar to the one the Markov Model was trained on
Maximum Entropy: N-gram experiments
- To achieve an F-measure comparable to the Markov Model, a trigram window of POS tags and the previous outcome was necessary:
  prev_POS current_POS next_POS prev_OUTCOME
- Precision was usually higher and recall lower
- Best results were achieved with a 4-gram window of POS tags and one previous outcome:
  prprev_POS prev_POS current_POS next_POS prev_OUTCOME
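The 4-gram window plus previous outcome can be sketched as a feature extractor that produces one context per token. The feature names follow the template on this slide; the "name=value" string encoding, the padding symbol and the function name are assumptions, not the format of any particular Maxent package.

```python
def context_features(symbols, outcomes, i, pad="<pad>"):
    """Build the features for position i.

    symbols:  POS tags (or kept word forms) of the sentence.
    outcomes: outcomes already assigned to the positions before i.
    """
    def at(j):
        return symbols[j] if 0 <= j < len(symbols) else pad

    prev_outcome = outcomes[i - 1] if i > 0 else pad
    return [
        f"prprev_POS={at(i - 2)}",
        f"prev_POS={at(i - 1)}",
        f"current_POS={at(i)}",
        f"next_POS={at(i + 1)}",
        f"prev_OUTCOME={prev_outcome}",
    ]


symbols = ["ART", "ADJAx", "NN", "GmbH", "KONJ"]
outcomes = ["O", "LOC_I", "ORG_I"]  # decisions made so far, left to right
features = context_features(symbols, outcomes, 3)
```

Unlike the Markov Model, the window includes the following symbol (next_POS), and further features can be appended to the list without changing the model type.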
Enhancements of the approach
- All verbs except modal and auxiliary verbs are introduced with their word form after a very rudimentary stemming
- The Maxent model is built with Generalized Iterative Scaling, 100 iterations and a cutoff of one
- To identify the words that are introduced with their word form, the following thresholds were set:

                     Frequency   TF*IDF
  surrounding NEs:   > 3         > 0.24
  ORGANIZATION:      > 6         --
  PERSON:            > 4         --
  LOCATION:          > 2         --
Results of the first enhancements
MaxEnt setup:
- 4-gram window, previous outcome
- about 900 words and all verbs with a unique symbol
- 100 iterations, cutoff 1
- learn-apply-forget filter

                ORG            PER            LOC            Avg. F-measure
  Best Markov   P: 85, R: 62   P: 71, R: 78   P: 86, R: 77   72
  MaxEnt        P: 78, R: 64   P: 87, R: 80   P: 90, R: …    …

- Overall accuracy rose, but recall is still not satisfying
Further enhancement: the usage of lists
Three lists were collected from the internet:
- A list of person names, containing about 1000 German and 2500 English first names
- A list of location names, containing about 1300 entries
- A list of companies, containing about 3800 tokens
No manual checking of the entries was done.
Bringing the lists into the model
- If a token is part of a list, this is introduced as a feature instead of its POS tag:
  prprev_POS prev_name cur_POS next_POS prev_OUTCOME
  (name means: is part of the name list)
- At first, when experimenting with the name list, nearly no effect was seen
- This was due to the fact that there were entries like „Mark“ or „Juli“ that occurred very often in the text, but not as NEs. Therefore the feature of „being part of the name list“ was weighted very low
- Such entries were identified by a frequency threshold and introduced with a unique symbol
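A sketch of how list membership could replace the POS tag, with frequent list entries split off under their own symbol. The symbol "name" follows the template on this slide; the symbol "frequent_name", the threshold value and the function name are hypothetical, and the frequency is counted over the given tokens only.

```python
from collections import Counter


def list_symbols(tagged_tokens, name_list, max_freq):
    """Replace the POS tag of list members by a list symbol.

    tagged_tokens: list of (word, pos) pairs of the corpus.
    name_list:     set of entries from the name list.
    max_freq:      list entries occurring more often than this get a
                   separate symbol (hypothetical threshold).
    """
    freq = Counter(word for word, _ in tagged_tokens)
    symbols = []
    for word, pos in tagged_tokens:
        if word not in name_list:
            symbols.append(pos)
        elif freq[word] > max_freq:
            symbols.append("frequent_name")  # e.g. "Mark", "Juli"
        else:
            symbols.append("name")
    return symbols


corpus = [("Mark", "NN")] * 3 + [("Peter", "NE"), ("sagte", "VVFIN")]
names = {"Mark", "Peter"}
symbols = list_symbols(corpus, names, max_freq=2)
```

Separating the frequent entries keeps ambiguous words like "Mark" from diluting the weight of the list feature for the remaining entries.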
Results of the usage of the lists

                     ORG            PER            LOC            Avg. F-measure
  MaxEnt, no lists   P: 78, R: 64   P: 87, R: 80   P: 90, R: 68   7…
  MaxEnt, lists      P: 79, R: 66   P: 93, R: 86   P: 91, R: 72   80
  Volk et al.        P: 76, R: 81   P: 92, R: 86   P: 81, R: 91   84

- The lists show a limited effect
- Figures are still lower than those of the rule-based system
- Recall has to be optimized for all categories
- The most difficult category is ORGANIZATION
Conclusions and further work
- The proposed approach seems feasible for German
- The approach still offers possibilities for optimization:
  - using more specific lists, especially for ORGANIZATION; extracting such lists by matching “sure-fire rules” against a larger, unannotated corpus of the domain
  - including morphological knowledge
  - leaving the word level and working on chunks (phrases)
THE END Marc Rössler University of Duisburg