ML-based approaches to Named Entity Recognition for German newspaper texts. ESSLLI 02 – Workshop on ML Approaches for CL. Marc Rössler, University of Duisburg.



Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments: Markov Models, Maximum Entropy modeling
- Conclusions and further work

Named Entity Recognition
Named Entity Recognition deals with the detection and categorization of proper names. The definition of the categories is application-specific and could include, e.g., names of products, drugs, rock bands, bibliographic references, etc. NE Recognition is a key part of Information Extraction, but is also important for Summarization, Information Filtering, Machine Translation, …

Named Entity Recognition
Frequently used categories are ORGANIZATION, PERSON and LOCATION, and sometimes temporal and numeric expressions. These are the categories used at the Message Understanding Conferences (MUC), where systems for NE Recognition were evaluated, and at CoNLL 2002, where a shared task on language-independent Named Entity Recognition will be performed. Accuracy is measured with Recall, Precision and F-Measure.
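The three measures mentioned above can be sketched as follows; the function name and the exact-match counting convention are illustrative assumptions, not taken from the talk.

```python
def precision_recall_f(n_correct, n_predicted, n_gold):
    """Standard NE evaluation: Precision, Recall and balanced F-measure.
    n_correct: NEs the system got right; n_predicted: NEs it emitted;
    n_gold: NEs in the manual annotation."""
    p = n_correct / n_predicted if n_predicted else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```

With 80 correct NEs out of 100 predicted against 100 gold NEs, all three measures come out at 0.8.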

Problems in Named Entity Recognition
The classification of NEs is often difficult or vague: New York Times, AOL – ORGANIZATION or PUBLICATION?
The same word form can belong to different categories:
Brazil (won the World Cup) – ORGANIZATION or LOCATION?
Philip Morris – PERSON or ORGANIZATION?
Kölner (Sparkasse) – LOCATION or ORGANIZATION?
Bahn – ORGANIZATION or regular noun?

Evidence to recognize and categorize NEs
According to McDonald (1996), there are two kinds of evidence to detect and categorize NEs:
Internal evidence is taken from within the NE, e.g. Ltd., Association, Consulting for ORGANIZATION; capitalization; Peter, Smith, known person names.
External evidence is provided by the context of the NEs, e.g. Mr., CEO, director for PERSON; produces, shares for ORGANIZATION.

On the impact of lists on NE Recognition
Mikheev et al. (1999) report experiments on the impact of lists containing NEs and show that their system achieves an accuracy of 0.85 for ORGANIZATION and 0.9 for PERSON even without any lists.
Problems of lists: the lists and gazetteers have to be very large and still won't cover all occurring NEs, especially for ORGANIZATION. A pure list lookup cannot deal with the mentioned ambiguities.

Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments: Markov Models, Maximum Entropy modeling
- Conclusions and further work

ML-Approaches to NE Recognition
Internal and external evidence can be modeled both with rules and with ML approaches. Approaches using learning algorithms like HMMs (Bikel et al. 1999; Zhou and Su 2002) and Maximum Entropy modeling (Borthwick et al. 1998; Mikheev et al. 1998) are among the best performing systems and achieve an overall accuracy (F-measure) of more than 90% for English.

NE Recognition as a classifier task
Common approach: To which of the following categories does the token belong? token → (PERSON | LOCATION | ORGANIZATION | NIL). Does it start or continue a sequence of NE tokens belonging together? IOB notation: B begin, I continue, O not belonging to a category.
that → O (no NE)
Peter → PER_B (person name starts)
Smith → PER_I (person name continues)
cried → O (no NE)
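Producing this IOB encoding from span annotations can be sketched in a few lines; the function and the (start, end, label) data layout are assumptions for illustration.

```python
def to_iob(tokens, entities):
    """entities: list of (start, end, label) token spans, end exclusive.
    Produces the tags from the slide, e.g. PER_B, PER_I, O."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = label + "_B"          # first token of the NE
        for i in range(start + 1, end):
            tags[i] = label + "_I"          # continuation tokens
    return tags
```

Applied to the slide's example sentence, the span (1, 3, "PER") over "Peter Smith" yields exactly the tag sequence shown above.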

Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments: Markov Models, Maximum Entropy modeling
- Conclusions and further work

NE Recognition in German
Not just NEs but also regular nouns are capitalized. Additionally, adjectives derived from geographical names are only capitalized when they end in „-er“ (Schweizer vs. deutsche). This means that capitalization is not as valuable as it is for other languages.

Approaches to NE Recognition in German
Systems for NE Recognition in German are rare.
Volk and Clematide (2001): rule-based system with large lists; GERTWOL, a commercial morphological analyzer, providing about proper names; accuracy of 0.78 for ORGANIZATION, 0.88 for PERSON and 0.85 for LOCATION.
Neumann and Piskorski (2000): NE Recognition as part of an information extraction system, based on finite state machines and several knowledge sources; reported accuracy of 0.89, but it is not clear whether this describes the detection of NEs in contrast to regular nouns, or the classification of NEs.

Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments with Markov Models
- Experiments with Maximum Entropy modeling
- Conclusions and further work

Basic idea of the approach
Word forms are too specific, and the model will never know every word form. Therefore, use part-of-speech tags. By only using part-of-speech tags, however, the evidence provided by some word forms is lost. Therefore, use a mixture of word forms and part-of-speech tags, normalizing word forms without losing too much evidence.

Providing evidence for the statistical model: POS tagging
Part-of-speech tagging is done with TnT (Brants 1998) with the pre-defined model for German. Tagging accuracy is strongly dependent on whether a word was seen before or not (86.6% vs. 97.7%). The chosen research corpus, the Computer Zeitung, contained an average of 18% unknown words. TnT tries to distinguish regular nouns from NEs, but this is error-prone for German. The tagset (STTS) is slightly extended.

Providing evidence for the statistical model: words
How to identify words that are introduced with their word form and not with their POS tags? These words can be gained automatically from an annotated corpus by extracting words that occur often near or within NEs. To reduce the resulting list, a frequency threshold is combined with a TF*IDF threshold to reduce noise and stop words.
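The selection step just described can be sketched as follows. The two thresholds (frequency > 3, TF*IDF > 0.24 for words surrounding NEs) appear later in the talk; the TF*IDF formula itself is an assumption, since the slides do not spell one out.

```python
import math
from collections import Counter

def select_trigger_words(documents, near_ne_counts, min_freq=3, min_tfidf=0.24):
    """documents: list of token lists; near_ne_counts: how often each word
    occurred near or within an NE in the annotated corpus."""
    n_docs = len(documents)
    df = Counter()                          # document frequency per word
    for doc in documents:
        for w in set(doc):
            df[w] += 1
    total = sum(near_ne_counts.values())
    selected = []
    for w, freq in near_ne_counts.items():
        if freq <= min_freq:
            continue                        # frequency threshold
        tf = freq / total
        idf = math.log(n_docs / df[w]) if df[w] else 0.0
        if tf * idf > min_tfidf:            # TF*IDF threshold vs. stop words
            selected.append(w)
    return selected
```

A word like "GmbH" that clusters around organization names survives both thresholds, while a function word like "und" appears in every document, gets an IDF of zero, and is filtered out.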

Example: mixture of word forms and part-of-speech tags
Word forms:  Die  Wiesbadener  Concept  GmbH  und   BMW  entwickeln  neue  …
POS tags:    ART  ADJA         NN       NE    KONJ  NE   VVFIN       ADJA  …
Mixture:     ART  ADJAx        NN       GmbH  KONJ  NEx  entwickeln  ADJA  …
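The example above amounts to a simple per-token replacement rule: keep the word form for selected trigger words and for full verbs, fall back to the (extended) POS tag everywhere else. The sketch below assumes the extended tags (ADJAx, NEx) have already been assigned; function names and details are illustrative assumptions.

```python
def to_mixture(tokens, pos_tags, trigger_words):
    """Replace each token by its POS tag, except for trigger words (e.g.
    'GmbH') and full verbs (tags starting with 'VV'), whose word forms
    carry NE evidence."""
    mixture = []
    for token, tag in zip(tokens, pos_tags):
        if token in trigger_words or tag.startswith("VV"):
            mixture.append(token)    # keep the word form
        else:
            mixture.append(tag)      # normalize to the POS tag
    return mixture
```

Run on the slide's sentence with "GmbH" as the only trigger word, it reproduces the mixture row of the example.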

Additional heuristic: learn-apply-forget (Volk and Clematide 2001)
All recognized NEs within one text unit (article) are stored. The text unit is read again, and the "learned" names are applied (including a check for genitive endings on "-s"). After that, the system "forgets" the learned NEs.
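The two-pass heuristic can be sketched like this; the exact matching rules are assumptions, only the stored-names idea and the genitive "-s" check come from the talk.

```python
def learn_apply_forget(tokens, first_pass_tags):
    """Second pass over one article: NEs recognized in the first pass are
    stored, re-applied to still-untagged occurrences (including a check
    for a German genitive '-s'), then forgotten when the article ends."""
    learned = {tok: tag for tok, tag in zip(tokens, first_pass_tags)
               if tag != "O"}
    tags = list(first_pass_tags)
    for i, tok in enumerate(tokens):
        if tags[i] != "O":
            continue
        if tok in learned:
            tags[i] = learned[tok]              # same surface form again
        elif tok.endswith("s") and tok[:-1] in learned:
            tags[i] = learned[tok[:-1]]         # genitive: "IBMs" -> "IBM"
    return tags  # 'learned' goes out of scope here: the system forgets
```

Because the dictionary is rebuilt per article, a name learned in one text cannot leak into the next one.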

The research corpus
As a research corpus, the Computer Zeitung, a weekly computer magazine, was selected. In about four issues, all ORGANIZATION, PERSON and LOCATION entities were manually annotated by one person, without any other annotation to check the results. The corpus contains about words.
ORGANIZATION: 2769 NEs, 4249 tokens
PERSON: 824 NEs, 1404 tokens
LOCATION: 1450 NEs, 1506 tokens

Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments: Markov Models, Maximum Entropy modeling
- Conclusions and further work

Experiments with ML approaches
Experiments with two different techniques for statistical modeling were conducted:
TnT (Brants 1998): a trigram tagger that can be trained on any tagset; second-order Markov Models.
OpenNLP Maxent package (Baldridge et al.): open source, available at ; Maximum Entropy modeling.
For all experiments, a tenfold cross-validation was performed.
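The tenfold cross-validation setup can be sketched as follows; this is a generic illustration of the evaluation protocol, not the authors' actual splitting code.

```python
def tenfold_splits(articles, k=10):
    """Yield (train, test) partitions over the annotated articles; each
    fold serves once as the test set while the rest is training data."""
    folds = [articles[i::k] for i in range(k)]      # round-robin folds
    for i in range(k):
        test = folds[i]
        train = [a for j, fold in enumerate(folds) if j != i for a in fold]
        yield train, test
```

Averaging the P/R/F figures over the ten runs gives the numbers reported in the result tables below, each NE appearing in exactly one test fold.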

Experiments with the Markov Models
Basic question: Is an approach based on a mixture of POS tags and word forms feasible for NE Recognition in German? Three experiments were conducted:
1. Only the POS tags as the tagger's input
2. The proposed mixture of POS tags and word forms, but avoiding words that are part of a particular NE (avoid "IBM", but keep "Ltd.")
3. The proposed mixture of POS tags and word forms without restrictions

Second-order Markov Models
Example of the tagger's input, used to generate emission probabilities of tokens and uni-, bi- and trigram probabilities of the tags:
Text         Token       Tag
Die          ART         O
Wiesbadener  ADJAx       LOC_I
Concept      NN          ORG_I
GmbH         GmbH        ORG_B
und          KONJ        O
BMW          NEx         ORG_I
entwickeln   entwickeln  O
neue         ADJA        O
…            …           …
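Collecting the statistics behind such a second-order Markov Model can be sketched as follows. TnT's actual smoothing and implementation are not shown in the talk; the function and names below are assumptions.

```python
from collections import Counter

def collect_counts(sequences):
    """sequences: lists of (token, tag) pairs, one per text unit.
    Returns emission counts plus uni-/bi-/trigram tag counts, the raw
    statistics a second-order Markov tagger estimates its
    probabilities from."""
    emit, uni, bi, tri = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        tags = ["<s>", "<s>"] + [tag for _, tag in seq]  # boundary padding
        for token, tag in seq:
            emit[(tag, token)] += 1
        for i in range(2, len(tags)):
            uni[tags[i]] += 1
            bi[(tags[i - 1], tags[i])] += 1
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
    return emit, uni, bi, tri
```

The trigram counts make the trigram-window limitation discussed later concrete: no feature can look further back than two tags.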

Results with Markov Models
POS tags only     | ORG: P 36, R 35 | PER: P 42, R 69 | LOC: P 42, R 5  | Avg. F: 32
Mixture, no NE    | ORG: P 54, R 56 | PER: P 55, R 89 | LOC: P 69, R 32 | Avg. F: 55
Mixture, with NE  | ORG: P 85, R 62 | PER: P 71, R 78 | LOC: P 86, R 77 | Avg. F: 72
The figures are very low in general, but the experiments show that the approach provides evidence for the detection and categorization of some NEs. However, the approach has to be enhanced.

Enhancement of the approach
The possibilities to enhance the second-order Markov Model approach are limited, since the context is restricted to a trigram window. Maximum Entropy offers more options for modeling, since the number of features is, at least theoretically, not limited. First step: achieving comparable results with Maximum Entropy modeling by using an input similar to the one the Markov Model was trained on.

Maximum Entropy: n-gram experiments
To reach an F-measure comparable to the Markov Model, a trigram window of POS tags and the previous outcome was necessary: prev_POS current_POS next_POS prev_OUTCOME. Precision was usually higher and recall lower. Best results were achieved with a 4-gram window of POS tags and one previous outcome: prprev_POS prev_POS current_POS next_POS prev_OUTCOME.
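The best-performing feature set can be sketched as a small extraction function; the padding symbols and feature-name format are assumptions, only the window shape comes from the talk.

```python
def maxent_features(pos_tags, i, prev_outcome):
    """4-gram window of POS tags plus one previous outcome, the
    best-performing configuration reported in the talk."""
    padded = ["<s>", "<s>"] + pos_tags + ["</s>"]   # sentence boundaries
    j = i + 2                                        # position in padded list
    return [
        "prprev_POS=" + padded[j - 2],
        "prev_POS=" + padded[j - 1],
        "current_POS=" + padded[j],
        "next_POS=" + padded[j + 1],
        "prev_OUTCOME=" + prev_outcome,
    ]
```

At decoding time the previous outcome is the tag just predicted, which is how the model links decisions across adjacent tokens.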

Enhancements of the approach
All verbs except modal and auxiliary verbs are introduced with their word form, after a very rudimentary stemming. The Maxent model is built with Generalized Iterative Scaling; 100 iterations and a cutoff of one. To identify the words that are introduced with their word form, the following thresholds were set:
                  Frequency  TF*IDF
surrounding NEs:  > 3        > 0.24
ORGANIZATION:     > 6        --
PERSON:           > 4        --
LOCATION:         > 2        --

Results of the first enhancements
Overall accuracy rose, but recall is still not satisfying.
Best Markov  | ORG: P 85, R 62 | PER: P 71, R 78 | LOC: P 86, R 77 | Avg. F: 72
MaxEnt       | ORG: P 78, R 64 | PER: P 87, R 80 | LOC: P 90, R 68 | Avg. F: 77
Settings: 4-gram window, previous outcome; about 900 words and all verbs with a unique symbol; 100 iterations, cutoff 1; learn-apply-forget filter.

Further enhancement: the usage of lists
Three lists were collected from the internet: a list of person names, containing about 1000 German and 2500 English first names; a list of location names, containing about 1300 entries; a list of companies, containing about 3800 tokens. No manual checking of the entries was done.

Bringing the lists into the model
If a token is part of a list, this is introduced as a feature instead of its POS tag: prprev_POS prev_name cur_POS next_POS prev_OUTCOME, where prev_name means: the previous token is part of the name list.
At first, when experimenting with the name list, nearly no effect was seen. This was due to the fact that there were entries like „Mark“ or „Juli“ that occurred very often in the text, but not as NEs; therefore, the feature of „being part of the name list“ was weighted very low. Such entries were identified by a frequency threshold and introduced with a unique symbol.
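The per-token decision just described can be sketched as follows; the symbol names are assumptions, only the three-way distinction (ambiguous entry, list member, plain POS tag) comes from the talk.

```python
def token_representation(token, pos_tag, name_list, ambiguous_entries):
    """Choose how a token enters the feature set: list membership replaces
    the POS tag, except for frequent ambiguous list entries like 'Mark'
    or 'Juli', which get a unique symbol the model can weight separately."""
    if token in ambiguous_entries:
        return "AMBIG_LIST_ENTRY"     # frequent in the text, rarely an NE
    if token in name_list:
        return "name"                 # 'is part of the name list'
    return pos_tag                    # default: plain POS tag
```

Separating the ambiguous entries keeps them from dragging down the weight of the otherwise useful list-membership feature.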

Results of the usage of the lists
The lists show a limited effect. Figures are still lower than those of the rule-based system. Recall has to be optimized for all categories. The most difficult category is ORGANIZATION.
MaxEnt, no lists  | ORG: P 78, R 64 | PER: P 87, R 80 | LOC: P 90, R 68 | Avg. F: 77
MaxEnt, lists     | ORG: P 79, R 66 | PER: P 93, R 86 | LOC: P 91, R 72 | Avg. F: 80
Volk et al.       | ORG: P 76, R 81 | PER: P 92, R 86 | LOC: P 81, R 91 | Avg. F: 84

Outline
- Named Entity (NE) Recognition
- ML-based approaches to NE Recognition
- NE Recognition in German
- Basic idea of the approach
- Experiments: Markov Models, Maximum Entropy modeling
- Conclusions and further work

Conclusions and further work
The proposed approach seems feasible for German. It still offers possibilities for optimization: using more specific lists, especially for ORGANIZATION; extracting such lists by matching "sure-fire rules" against a larger, non-annotated corpus of the domain; including morphological knowledge; leaving the word level and working on chunks (phrases).

THE END – Marc Rössler, University of Duisburg