Published by Lee Lewis. Modified over 8 years ago.
1
A CRF-BASED NAMED ENTITY RECOGNITION SYSTEM FOR TURKISH
Information Extraction 10-707 Project
Reyyan Yeniterzi
2
Introduction
- Named Entity Recognition (NER) aims to locate and classify the named entities in text.
- State-of-the-art NER systems are available for several languages.
- Only a limited amount of work has been conducted for Turkish.
- We present the first CRF-based NER system for Turkish.
3
Turkish
- Turkish is a morphologically complex language with very productive inflectional and derivational processes.
- Many local and non-local syntactic structures in English translate to Turkish words with complex morphological structures.
Example: tatlandırabileceksek, "if we are going to be able to make [something] acquire flavor"
  tat     flavor
  +lan    acquire
  +dır    to make
  +abil   to be able
  +ecek   are going
  +se     if
  +k      we
4
Related Work
- Cucerzan and Yarowsky, 1999
  - a language-independent EM-style bootstrapping algorithm
  - uses word-internal and contextual information of entities
- Tür et al., 2003
  - a statistical approach (HMM)
  - data sparseness issues due to the agglutinative structure of Turkish
  - uses the morphological form of the word instead of the surface form
- Kucuk and Yazici, 2009
  - the first rule-based NER system for Turkish
  - information sources such as dictionaries, lists of well-known entities, and context patterns
5
Approach
- Conditional Random Fields (CRF)
  - CRF++, an open-source CRF sequence labeling toolkit
- Lexical model
  - uses only the word tokens in their surface form
  - may encounter data sparseness problems
- Morphological forms of the words
- Contextual evidence around the named entities
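To make the lexical model concrete, here is a minimal sketch of how surface-form tokens could be written in the column format that CRF++ reads (one token per line, whitespace-separated columns, the label in the last column, a blank line between sentences), together with a small feature template. The label names, the window size in the template, and the helper name `to_crfpp_rows` are illustrative assumptions, not the configuration used in the project.

```python
# Hypothetical CRF++ feature template for a lexical (surface-form only) model.
# The window of one token on each side is an assumption made for illustration.
LEXICAL_TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# Bigram
B
"""


def to_crfpp_rows(sentences):
    """Render labeled sentences in CRF++ column format.

    `sentences` is a list of sentences; each sentence is a list of
    (token, label) pairs. Column 0 holds the surface form and the last
    column holds the label. Sentences are separated by a blank line.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # an empty entry becomes the blank separator line
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up example sentence; the labels are assumed names, not the project's.
    example = [[("Ankara", "LOCATION"), ("güzel", "O")]]
    print(to_crfpp_rows(example), end="")
```

Adding morphological features later would amount to adding further columns before the label and referring to them in the template as `%x[row,col]` with a non-zero column index.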
6
Data Set - I
- the newspaper articles data set
  - the train set used in (Tür et al., 2003)
  - the test set is not available
- split the data in two for evaluation purposes
  - 90% for training
  - 10% for testing
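The slide does not say how the 90/10 split was drawn (at random, by article, or in sequence). As a sketch only, assuming a simple in-order split at the sentence level, it could look like the following; the function name `split_sentences` and the `train_percent` parameter are hypothetical.

```python
def split_sentences(sentences, train_percent=90):
    """Split a list of sentences into (train, test) parts, keeping order.

    Integer arithmetic is used for the cut point to avoid floating-point
    rounding surprises. The in-order split is an assumption; the original
    project may have split the data differently.
    """
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]


if __name__ == "__main__":
    train, test = split_sentences([f"sentence {i}" for i in range(10)])
    print(len(train), len(test))
```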
7
Data Set - II
Three types of named entities: Organization, Person, Location

        # words   # person   # organization   # location
Train   445,498   21,701     14,510           12,138
Test     47,344    2,400      1,595            1,402
8
Data Set - III
- named entities are marked with the ENAMEX tag, a type of SGML tag
- the entity type is given in the TYPE attribute
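As an illustration of the markup described on this slide, the sketch below turns ENAMEX-tagged text into (token, label) pairs, with tokens outside any tag labeled `O`. The sample sentence is made up. The double-quoted attribute style, whitespace tokenization, and the function name `enamex_to_labels` are assumptions, since the slides do not show a sample of the data.

```python
import re

# Matches one ENAMEX span and captures its TYPE value and inner text.
# Assumes the attribute is written as TYPE="..." with double quotes.
ENAMEX_RE = re.compile(r'<ENAMEX\s+TYPE="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)


def enamex_to_labels(text):
    """Return a list of (token, label) pairs for ENAMEX-tagged text.

    Tokens inside a tag get the tag's TYPE value as their label;
    all other tokens get 'O'. Tokens are split on whitespace.
    """
    pairs = []
    pos = 0
    for match in ENAMEX_RE.finditer(text):
        # Text between the previous entity and this one is outside any entity.
        pairs.extend((tok, "O") for tok in text[pos:match.start()].split())
        entity_type = match.group(1)
        pairs.extend((tok, entity_type) for tok in match.group(2).split())
        pos = match.end()
    # Trailing text after the last entity.
    pairs.extend((tok, "O") for tok in text[pos:].split())
    return pairs


if __name__ == "__main__":
    sample = ('<ENAMEX TYPE="PERSON">Ahmet Yılmaz</ENAMEX> dün '
              '<ENAMEX TYPE="LOCATION">Ankara</ENAMEX> şehrine gitti')
    for token, label in enamex_to_labels(sample):
        print(token, label)
```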
9
Experiments
Lexical Model

               Precision   Recall   F-Measure
Person         0.96        0.73     0.83
Organization   0.95        0.73     0.83
Location       0.96        0.81     0.88
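The slide does not define the F-Measure. Assuming it is the balanced F1 score, the harmonic mean of precision and recall, the reported values can be recomputed from the precision and recall figures for the three entity types:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as given on the slide.
REPORTED = {
    "Person": (0.96, 0.73, 0.83),
    "Organization": (0.95, 0.73, 0.83),
    "Location": (0.96, 0.81, 0.88),
}

if __name__ == "__main__":
    for entity_type, (p, r, f) in REPORTED.items():
        print(entity_type, round(f_measure(p, r), 2), f)
```

Under this assumption the recomputed values agree with the slide to two decimal places. The high precision and lower recall are consistent with the data sparseness concern raised for the lexical model.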
10
Ongoing and Future Work
- building the morphological features
  - the morphological analysis of the words is done
  - currently working on disambiguating the analyses
  - will use the POS tags and lemmas of the words
- building the contextual features
- performing error analyses