A CRF-BASED NAMED ENTITY RECOGNITION SYSTEM FOR TURKISH Information Extraction 10-707 Project Reyyan Yeniterzi.


1 A CRF-BASED NAMED ENTITY RECOGNITION SYSTEM FOR TURKISH Information Extraction 10-707 Project Reyyan Yeniterzi

2 Introduction
- Named Entity Recognition (NER)
  - aims to locate and classify named entities
  - state-of-the-art NER systems are available for several languages
  - only a limited amount of work has been done for Turkish
- We present the first CRF-based NER system for Turkish

3 Turkish
- Turkish is a morphologically complex language with very productive inflectional and derivational processes.
- Many local and non-local syntactic structures in English translate to Turkish words with complex morphological structures.

Example: tatlandırabileceksek = tat +lan +dır +abil +ecek +se +k
"if we are going to be able to make [something] acquire flavor"

Morpheme   Gloss
tat        flavor
+lan       acquire
+dır       to make
+abil      to be able
+ecek      are going
+se        if
+k         we

4 Related Work
- Cucerzan and Yarowsky, 1999
  - a language-independent EM-style bootstrapping algorithm
  - uses word-internal and contextual information about entities
- Tür et al., 2003
  - a statistical approach (HMM)
  - data sparseness issues due to the agglutinative structure of Turkish
  - uses the morphological form of the word instead of the surface form
- Kucuk and Yazici, 2009
  - the first rule-based NER system for Turkish
  - information sources such as dictionaries, lists of well-known entities, and context patterns

5 Approach
- Conditional Random Fields (CRF)
  - CRF++, an open source CRF sequence labeling toolkit
- Lexical model
  - uses only the word tokens in their surface form
  - may encounter data sparseness problems
- Morphological forms of the words
- Contextual evidence around the named entities
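To make the lexical model concrete, here is a minimal sketch of how surface-form tokens could be written in the column format CRF++ reads (one token per line, label in the last column, a blank line between sentences), together with a unigram feature template over a window of surface forms. The slides do not give the label scheme, the window size, or the template, so the BIO labels, the two-token window, and the sample sentence are all assumptions made for illustration.

```python
def to_crfpp_rows(sentences):
    """Render labelled sentences in CRF++ column format.

    Each sentence is a list of (surface_token, label) pairs.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"


# Hypothetical template: surface forms in a +/-2 token window, plus
# the "B" macro, which adds label-bigram features.
LEXICAL_TEMPLATE = "\n".join([
    "U00:%x[-2,0]",
    "U01:%x[-1,0]",
    "U02:%x[0,0]",
    "U03:%x[1,0]",
    "U04:%x[2,0]",
    "B",
]) + "\n"

# Invented example sentence with assumed BIO labels.
sentences = [[("Ahmet", "B-PERSON"), ("Ankara'ya", "B-LOCATION"), ("gitti", "O")]]
training_text = to_crfpp_rows(sentences)
print(training_text)
```

With the two strings saved to files (the file names here are placeholders), training and tagging would typically be run as `crf_learn template train.txt model` and `crf_test -m model test.txt`.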

6 Data Set - I
- the newspaper articles data set
  - the training set used in (Tür et al., 2003)
  - the test set is not available
- split the data in two for evaluation purposes
  - 90% for training
  - 10% for testing
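A 90/10 split like the one above can be sketched as follows. The slides do not say whether the split was made over articles, sentences, or words, or whether the data was shuffled first, so this sketch assumes a simple ordered cut over a list of units; `split_train_test` is a hypothetical helper name.

```python
def split_train_test(items, train_fraction=0.9):
    """Split a list into a leading training part and a trailing test part."""
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Stand-in data: ten placeholder "sentences".
units = [f"sentence-{i}" for i in range(10)]
train, test = split_train_test(units)
print(len(train), len(test))
```

An ordered cut keeps the split reproducible without a random seed; shuffling before the cut is an equally plausible choice if the articles are ordered by date or topic.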

7 Data Set - II
- Three types of named entities
  - Organization
  - Person
  - Location

         # words   # person   # organization   # location
Train    445,498   21,701     14,510           12,138
Test     47,344    2,400      1,595            1,402

8 Data Set - III
- named entities are marked with the ENAMEX tag
  - a type of SGML tag
  - the TYPE attribute gives the entity type
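The ENAMEX markup can be turned into per-token labels for a sequence labeler along these lines. This is a sketch under assumptions: the exact tag syntax in the data set is not shown in the slides, so the MUC-style form `<ENAMEX TYPE="PERSON">...</ENAMEX>`, whitespace tokenization, the BIO label scheme, and the sample sentence are all illustrative.

```python
import re

# Assumed MUC-style markup: <ENAMEX TYPE="PERSON">...</ENAMEX>
ENAMEX = re.compile(r'<ENAMEX TYPE="([A-Z]+)">(.*?)</ENAMEX>')


def enamex_to_bio(text):
    """Convert ENAMEX-annotated text into (token, BIO label) pairs."""
    pairs = []
    pos = 0
    for match in ENAMEX.finditer(text):
        # Tokens between the previous entity and this one are outside any entity.
        for token in text[pos:match.start()].split():
            pairs.append((token, "O"))
        entity_type = match.group(1)
        for i, token in enumerate(match.group(2).split()):
            prefix = "B-" if i == 0 else "I-"
            pairs.append((token, prefix + entity_type))
        pos = match.end()
    # Remaining tokens after the last entity.
    for token in text[pos:].split():
        pairs.append((token, "O"))
    return pairs


# Invented example sentence.
sample = ('<ENAMEX TYPE="PERSON">Ahmet Yılmaz</ENAMEX> dün '
          '<ENAMEX TYPE="LOCATION">Ankara</ENAMEX> şehrine gitti .')
labelled = enamex_to_bio(sample)
print(labelled)
```

Real Turkish text often attaches case suffixes to a name after an apostrophe outside the tag, which whitespace tokenization does not handle; a fuller version would need a tokenization rule for that case.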

9 Experiments
- Lexical Model

               Precision   Recall   F-Measure
Person         0.96        0.73     0.83
Organization   0.95        0.73     0.83
Location       0.96        0.81     0.88
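The F-Measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state the formula, so the sketch below assumes that definition and recomputes the column from the reported precision and recall values.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the lexical model.
reported = {
    "Person": (0.96, 0.73),
    "Organization": (0.95, 0.73),
    "Location": (0.96, 0.81),
}

scores = {name: round(f_measure(p, r), 2) for name, (p, r) in reported.items()}
print(scores)
```

In all three rows precision is well above recall, so recall is what holds the F-measure down for the lexical model.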

10 Ongoing and Future Work
- building the morphological features
  - morphological analysis of the words is done
  - currently working on disambiguating the analyses
  - will use the POS tags and lemmas of the words
- building the contextual features
- performing error analysis

