Named Entities in Czech Texts and Their Processing
Magda Ševčíková, Zdeněk Žabokrtský
ÚFAL MFF UK
Outline of the talk
– The term 'named entities'
– Named entities in Czech
– Named entity classification
– Data annotation
– Quantitative characteristics of the data
– Experiments in automatic named entity recognition
– Future work
The term 'named entities'
– the English term 'named entities' (NE) covers words and word sequences that do not have a common lexical meaning:
  – proper nouns, e.g., person names, names of institutions, products, towns
  – numeric expressions whose meaning is something other than a quantity, e.g., telephone numbers, page numbers
– NE processing is of crucial importance for NLP
  – question answering, information extraction, machine translation
– the NE task originated at the MUC conference in 1995
Named entities in Czech
– 'pojmenované entity' is the direct Czech equivalent of 'named entities'
– up to now, the NE task has not been addressed for Czech
– now: within the project 1ET (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů, 'Integration of language resources for information extraction from natural texts')
– some examples from Czech:
  – jeho hlava (his head) vs. pan Hlava (Mr. Hlava), k jeho hlavě (to his head) vs. k panu Hlavovi (to Mr. Hlava)
  – 289 stran (289 pages) vs. na straně 289 (on page 289)
Named entity classification
– NE-types, NE-super-types, NE-containers; special tags
– 1st version, used in the 1st round of annotation (focused on proper nouns):
  – 42 NE-types: pf, ps, ...
  – 7 NE-super-types: a, g, i, m, o, p, t
  – 4 NE-containers: A, C, P, T
– 2nd version, used in the 2nd round of annotation (extended to numeric expressions):
  – 62 NE-types: pf, ps, ..., na, np, ...
  – 10 NE-super-types: a, c, g, i, m, n, o, p, q, t
  – 4 NE-containers: A, C, P, T
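A minimal Python sketch of how this two-level tagset can be represented, assuming the convention suggested by pf, ps (the first letter of each two-letter NE-type names its one-letter NE-super-type); the full list of 62 NE-types is not reproduced here.

    # Sketch of the 2nd-version tagset, assuming that a two-letter NE-type
    # (pf, ps, na, np, ...) belongs to the NE-super-type given by its first
    # letter; the complete list of 62 NE-types is omitted.
    SUPER_TYPES = set("acgimnopqt")   # 10 NE-super-types (2nd version)
    CONTAINERS = set("ACPT")          # 4 NE-containers

    def super_type(ne_type: str) -> str:
        """Return the NE-super-type of a two-letter NE-type."""
        assert len(ne_type) == 2 and ne_type[0] in SUPER_TYPES
        return ne_type[0]

    print(super_type("pf"))  # 'p' (person names)
    print(super_type("na"))  # 'n' (numeric type, 2nd version only)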
Named entity classification: Types of person names
Named entity classification: NE-containers
Named entity classification: Special tags
Data annotation
– NE-type, NE-container; special tags; spam; NE-instance
– 2 rounds of annotation
– 1st round:
  – 2,000 sentences from the SYN2000 corpus
  – randomly selected from the 5,364,071 sentences matching the query ([word=".*[a-z0-9]"] [word=".*[A-Z].*"])
  – 2 parallel annotations, followed by a 3rd 'unifying' annotation
  – defective sentences eliminated, another 100 sentences annotated
  – → 2,010 sentences = training and test data
– 2nd round:
  – 2,000 sentences from the SYN2005 corpus
  – randomly selected from the 1,356,321 sentences matching the query [word=".*[0-9].*"]
  – 1 annotation, not yet revised
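As a rough illustration, the two selection queries can be approximated in Python as follows; the originals are corpus queries run over SYN2000/SYN2005, and the whitespace tokenization used here is an assumption of this sketch.

    import re

    def matches_round1(tokens):
        # 1st round: a token ending in a lowercase letter or digit, followed
        # by a token containing an uppercase letter (roughly: a capitalized
        # word that is not sentence-initial).
        return any(re.search(r"[a-z0-9]$", a) and re.search(r"[A-Z]", b)
                   for a, b in zip(tokens, tokens[1:]))

    def matches_round2(tokens):
        # 2nd round: the sentence contains a token with at least one digit.
        return any(re.search(r"[0-9]", tok) for tok in tokens)

    print(matches_round1("k panu Hlavovi".split()))  # True
    print(matches_round2("na straně 289".split()))   # True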
Data annotation: Example of annotated text
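The annotated example from the slide is not reproduced here; the following sketch only illustrates what a sentence annotated with this scheme might look like and how the innermost (NE-type, string) pairs can be pulled out, assuming an in-line bracket notation with two-letter NE-types and one-letter uppercase NE-containers.

    import re

    # Hypothetical annotated sentence: pf = first name, ps = surname,
    # P = NE-container for a complex person name (bracket notation assumed).
    annotated = "Dopis poslal <P<pf Jan> <ps Novák>> už na jaře."

    # Extract the innermost (NE-type, text) pairs; containers are ignored
    # in this minimal sketch.
    pairs = re.findall(r"<([a-z]{2}) ([^<>]+)>", annotated)
    print(pairs)  # [('pf', 'Jan'), ('ps', 'Novák')]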
Quantitative characteristics of the data
– 2,010 sentences in total:
  – 51,921 tokens
  – 11,644 NE-instances
– train : dtest : etest ~ 8 : 1 : 1
– in the train data:
  – 1,608 sentences
  – 41,710 tokens
  – 6,109 NE-instances
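A minimal sketch of the approximate 8:1:1 division into train, dtest and etest data; the random shuffling and the fixed seed are assumptions of this illustration, not necessarily how the split was actually made.

    import random

    def split_8_1_1(sentences, seed=42):
        # Shuffle and cut the sentence list into roughly 80/10/10 % portions.
        sents = list(sentences)
        random.Random(seed).shuffle(sents)
        n_train = round(0.8 * len(sents))
        n_dtest = round(0.1 * len(sents))
        train = sents[:n_train]
        dtest = sents[n_train:n_train + n_dtest]
        etest = sents[n_train + n_dtest:]
        return train, dtest, etest

    train, dtest, etest = split_8_1_1(range(2010))
    print(len(train), len(dtest), len(etest))  # 1608 201 201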
Quantitative characteristics of the data: Tags of all NE-instances in the train data
Experiments in automatic NE recognition
Future work