Named Entity Recognition based on three different machine learning techniques
Zornitsa Kozareva
JRC Workshop, September 27, 2005
Research Group on Language Processing and Information Systems (GPLSI)

Outline
- Named Entity Recognition
  - task definition
  - applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
  - for NE detection
  - for NE classification
- NERUA at GeoCLEF
- Conclusions and future work

Named Entity Recognition – task definition
- Identification of proper names in text, using the BIO scheme
  - B: starts an entity
  - I: continues the entity
  - O: word outside any entity
- Classification into a predefined set of categories
  - Person names
  - Organizations (companies, governmental organizations, etc.)
  - Locations (cities, countries, etc.)
  - Miscellaneous (movie titles, sport events, etc.)

Example: Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O

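
To make the BIO scheme concrete, here is a minimal Python sketch that groups BIO-tagged tokens back into entities. The function name `decode_bio` and the handling of stray I- tags are assumptions made for illustration; the slides do not describe how NERUA decodes its tags.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, category) pairs.

    Hypothetical helper for illustration only.
    """
    entities = []
    current_words: List[str] = []
    current_type = None

    def close_entity():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B always starts a new entity, closing any open one.
            close_entity()
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # I continues the entity that is currently open.
            current_words.append(token)
        else:
            # O (or an I tag with no matching open entity) ends the entity.
            close_entity()
            current_words, current_type = [], None
    close_entity()
    return entities


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

On the example sentence from the slide this yields the three entities Adam Smith (PER), IBM (ORG) and London (LOC).
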
Named Entity Recognition – applications
- Information Extraction
- Question Answering
- Document classification
- Automatic indexing of books
- Increasing the accuracy of Internet search results (e.g. the location Clinton, South Carolina vs. President Clinton)

Machine learning approach
- Given: the NER task and a tagged corpus
- Select classification methods
  - Memory-based learning
  - Maximum Entropy
  - Hidden Markov Models
- Construct a set of characteristics (features)
  - for the detection phase
  - for the classification phase

[Figure: NERUA architecture. Components shown: input text; Detection (HMM, TiMBL); Classification (HMM, TiMBL, MxE); Voting; output NER text]
Source: "NERUA: sistema de detección y clasificación de entidades utilizando aprendizaje automático" (NERUA: an entity detection and classification system using machine learning), Ferrández et al.

Classification method 1: Memory-based learning (k-nearest neighbours)
- Toolkit: TiMBL package
- Time performance: quick training phase, slow during testing
- Features: various types of features; irrelevant features impede performance

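
The time profile of memory-based learning can be illustrated with a small Python sketch: "training" only stores the examples, while classifying a new instance compares it against every stored one. This is a plain k-nearest-neighbour classifier with an unweighted overlap distance, written for illustration; it does not call TiMBL, and the names `overlap_distance`, `knn_classify` and the toy instances are assumptions.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory, query, k=1):
    """Label a query with the majority class of its k nearest stored instances.

    Every stored instance is compared with the query, which is why testing
    is slow. Each feature counts equally, so an irrelevant feature distorts
    the distance as much as a useful one.
    """
    ranked = sorted(memory, key=lambda item: overlap_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# "Training" is just storing (features, tag) pairs.
# Toy features (hypothetical): (lowercased word, capitalization, previous word)
memory = [
    (("smith", "cap", "adam"), "I"),
    (("works", "low", "smith"), "O"),
    (("ibm", "cap", "for"), "B"),
    (("london", "cap", ","), "B"),
]

query = ("madrid", "cap", "for")
predicted = knn_classify(memory, query, k=1)
print(predicted)
```

Here the nearest stored instance is the one for "ibm" (it differs from the query in the word only), so the query receives the tag B.
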
Classification method 2: Maximum Entropy
- Toolkit: MaxEnt
- Time performance: slow training phase, slow testing phase
- Feature management: string features, missing values

Classification method 3: Hidden Markov Models
- Toolkit: ICOPOST
- Time performance: quick training phase, quick testing phase
- Feature management: cannot handle as many features as the other two methods; needs corpus or label transformation

Classifier combination
- Majority voting: give each classifier one vote

CL 1   CL 2   CL 3   Vote
PER    ORG    PER    PER
ORG    PER    ORG    ORG
LOC    LOC    MISC   LOC
…

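
Majority voting with one vote per classifier can be sketched in a few lines of Python. The function name `majority_vote` is hypothetical, and the tie-breaking rule (the label proposed first wins) is an assumption, since the slides do not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label proposed by most classifiers.

    Assumption: on a tie, the label that appears first in `predictions`
    wins (Counter.most_common keeps first-seen order for equal counts).
    """
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label


# One row per entity: the outputs of three classifiers (illustrative values).
rows = [
    ("PER", "ORG", "PER"),
    ("ORG", "PER", "ORG"),
    ("LOC", "LOC", "MISC"),
]
votes = [majority_vote(row) for row in rows]
print(votes)
```

With three classifiers and four categories a three-way disagreement is possible, so any real system needs some explicit tie-breaking policy; the one above is only one option.
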
Features for NE detection
- Contextual
  - anchor word (i.e. the word to be classified)
  - words in a [-3,…,+3] window
- Orthographic
  - capitalization at position 0 and in the [-3,…,+3] window
  - whole anchor word in capitals (e.g. IBM)
  - position of the anchor word in the sentence
- Substring extraction
  - 2- and 3-letter substrings from the left and right side of the anchor word
- Gazetteer list
  - word at position 0, +1, +2 or +3 seen in the list
- Trigger word list
  - word at position 0 or in the [-3,…,+3] window seen in the list

Source: "Using Language Resource Independent Detection for Spanish NER", Kozareva et al., RANLP'05

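
The feature groups above can be sketched as a single extraction function in Python. This is an illustration under assumptions, not the NERUA implementation: the feature names, the padding symbol `PAD`, the lowercased list lookup and the toy gazetteer and trigger lists are all hypothetical.

```python
PAD = "_"  # hypothetical filler for positions outside the sentence


def detection_features(tokens, i, gazetteer, triggers, window=3):
    """Build a feature dictionary for the anchor word tokens[i]."""

    def at(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else PAD

    anchor = tokens[i]
    feats = {
        "position": i,                 # position of the anchor in the sentence
        "all_caps": anchor.isupper(),  # whole anchor word in capitals
    }
    # Contextual, orthographic and trigger features over the [-3, +3] window
    # (offset 0 is the anchor word itself).
    for off in range(-window, window + 1):
        word = at(off)
        feats[f"w[{off}]"] = word
        feats[f"cap[{off}]"] = word[:1].isupper()
        feats[f"trigger[{off}]"] = word.lower() in triggers
    # Gazetteer lookup at positions 0, +1, +2, +3.
    for off in range(0, window + 1):
        feats[f"gaz[{off}]"] = at(off).lower() in gazetteer
    # 2- and 3-letter substrings from the left and right side of the anchor.
    for n in (2, 3):
        feats[f"prefix{n}"] = anchor[:n]
        feats[f"suffix{n}"] = anchor[-n:]
    return feats


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
gazetteer = {"london"}  # toy list
triggers = {"works"}    # toy list
feats = detection_features(tokens, 4, gazetteer, triggers)
print(feats["w[0]"], feats["all_caps"], feats["gaz[2]"], feats["trigger[-2]"])
```

For the anchor word IBM, the sketch marks the word as fully capitalized, finds London in the gazetteer two positions to the right, and finds the toy trigger two positions to the left.
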
Results for NE detection
[Table: Spanish; columns B, I, BIO; rows TMB-ALL, TMB-CO, TMB-COS, HMM, Voting 1,2,; values not preserved]
[Table: Data size; columns Train, Test; rows Sp tokens, Sp entities, Pt tokens, Pt entities; values not preserved]
[Table: Portuguese; columns B, I, BIO; rows TMB-CO, TMB-COS, HMM, Voting; values not preserved]

Features for NE classification
- Contextual
  - whole entity
  - first word of the entity
  - second word of the entity, if present
  - words around the entity in a [-3,…,+3] window
- Orthographic
  - position of the anchor word in the sentence
  - capital, lowercase or other symbol
- Gazetteer list
  - part of the entity in the list
  - whole entity in the list
  - whole entity not in any of these lists
- Trigger lists
  - anchor word
  - words in a [-1,+1] window

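
Unlike detection, classification works on a whole detected entity rather than on a single word. The Python sketch below covers only the contextual and gazetteer groups listed above; the function name, feature names, padding symbol and toy gazetteers are assumptions made for illustration.

```python
PAD = "_"  # hypothetical filler for positions outside the sentence


def classification_features(tokens, start, end, gazetteers, window=3):
    """Build features for the detected entity tokens[start:end]."""
    entity_words = tokens[start:end]
    entity = " ".join(entity_words)
    feats = {
        "entity": entity,
        "first": entity_words[0],
        "second": entity_words[1] if len(entity_words) > 1 else PAD,
        "position": start,
    }
    # Words around the entity, up to `window` positions on each side.
    for off in range(1, window + 1):
        left, right = start - off, end - 1 + off
        feats[f"w[-{off}]"] = tokens[left] if left >= 0 else PAD
        feats[f"w[+{off}]"] = tokens[right] if right < len(tokens) else PAD
    # Gazetteer features: whole entity or part of it found in each list.
    in_any_list = False
    for name, entries in gazetteers.items():
        whole = entity.lower() in entries
        feats[f"gaz_whole[{name}]"] = whole
        feats[f"gaz_part[{name}]"] = any(w.lower() in entries for w in entity_words)
        in_any_list = in_any_list or whole
    feats["gaz_none"] = not in_any_list  # whole entity in none of the lists
    return feats


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
gazetteers = {"PER": {"adam"}, "LOC": {"london"}}  # toy lists
feats = classification_features(tokens, 0, 2, gazetteers)
print(feats["entity"], feats["gaz_part[PER]"], feats["gaz_none"])
```

For the entity "Adam Smith" only the first name is in the toy person list, so the partial-match feature fires while the whole-entity feature does not.
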
Results for NE classification
[Table: F-score for Spanish classification; columns LOC, MISC, ORG, PER; rows MxE, TMB, MxE, TMB, HMM, Voting 1,2,; values not preserved]

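
For readers unfamiliar with the metric, the following Python sketch computes an entity-level F-score from precision and recall. It assumes the balanced F1 measure with exact matching of entity span and category, which is common in NER evaluation; the slides say only "F-score", so the exact variant used here is an assumption, as are the function name and the toy data.

```python
def f_score(gold, predicted, beta=1.0):
    """Entity-level F-score; an entity counts only if span and category match."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Entities as (start, end, category); toy data for illustration.
gold = {(0, 2, "PER"), (4, 5, "ORG"), (6, 7, "LOC")}
predicted = {(0, 2, "PER"), (4, 5, "LOC"), (6, 7, "LOC")}
score = f_score(gold, predicted)
print(round(score, 4))
```

In this toy case two of the three predicted entities are correct, so precision, recall and F1 are all 2/3.
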
NERUA at GeoCLEF

Language          Run            Result
English           IRn+NERUA      34.95
English           IRn+Dramneri   29.77
Spanish-English   IRn+NERUA      26.06
Spanish-English   IRn+Dramneri   23.65

- For English, the feature sets constructed for Spanish were used directly
- NERUA outperformed the rule-based system Dramneri, although both consulted the same gazetteer and trigger word lists
- NERUA took more processing time

Source: "University of Alicante at GeoCLEF 2005", Ferrández et al., CLEF'05

Conclusions and future work
- We found a language resource independent feature set for NE detection
  - 92.96% of Spanish entities
  - 78.86% of Portuguese entities
- Classifier combination improved NE classification
- Good coverage over the PER, LOC and ORG classes is maintained
- Machine learning systems may outperform rule-based systems; however, they need more processing time and hand-labeled resources, which are not available for all languages

Future work
- Find discriminative features for the MISC class
- Tackle NER by relying on unlabeled data
- Divide the four categories into more detailed ones
- Adapt the system to other languages
- Study ways of automatic gazetteer construction

Thank you for your attention!
Questions?