Named Entities in Czech Texts and Their Processing Magda Ševčíková Zdeněk Žabokrtský ÚFAL MFF UK.

1 Named Entities in Czech Texts and Their Processing Magda Ševčíková Zdeněk Žabokrtský {sevcikova,zabokrtsky}@ufal.mff.cuni.cz ÚFAL MFF UK

2 Kvilda, 25. 1. 2007 {sevcikova,zabokrtsky}@ufal.mff.cuni.cz
Outline of the talk
The term ‘named entities’
Named entities in Czech
Named entity classification
Data annotation
Quantitative characteristics of the data
Experiments in automatic named entity recognition
Future work

3 The term ‘named entities’
English term ‘named entities’ (NE)
words and word sequences that do not have a common lexical meaning:
–proper nouns, e.g., person names, names of institutions, products, towns
–numeric expressions whose meaning is other than that of quantity, e.g., telephone number, page number
NE processing is of crucial importance for NLP
–question answering, information extraction, machine translation
NE task ‘born’ at the MUC conference in 1995

4 Named entities in Czech
‘pojmenované entity’ – direct equivalent of ‘named entities’
up to now, the NE task has not been solved for Czech
now: within the project 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů; Integration of language resources for the purpose of information extraction from natural texts)
some examples from Czech
–jeho hlava (his head) vs. pan Hlava (Mr. Hlava), k jeho hlavě (to his head) vs. k panu Hlavovi (to Mr. Hlava)
–289 stran (289 pages) vs. na straně 289 (on page 289)

5 Named entity classification
NE-type, NE-super-type, NE-container; special tags
1st version for the 1st round of annotation (focused on proper nouns):
–42 NE-types: pf, ps,...
–7 NE-super-types: a, g, i, m, o, p, t
–4 NE-containers: A, C, P, T
2nd version for the 2nd round of annotation (extended to numeric expressions):
–62 NE-types: pf, ps,... na, np,...
–10 NE-super-types: a, c, g, i, m, n, o, p, q, t
–4 NE-containers: A, C, P, T
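The tag examples on this slide suggest that a two-letter NE-type shares its first letter with its super-type (pf and ps under p; na and np under n). A minimal sketch of that reading follows. It assumes the convention holds for every NE-type, which the slide does not state outright, and the function name `describe_tag` is hypothetical. Special tags are not modelled.

```python
# Super-types and containers as listed for the 2nd version of the classification.
SUPER_TYPES = set("acgimnopqt")
CONTAINERS = set("ACPT")


def describe_tag(tag: str) -> tuple[str, str]:
    """Return (kind, super-type or container letter) for a tag.

    Assumption: a two-letter lowercase NE-type belongs to the
    super-type named by its first letter (e.g. pf -> p).
    """
    if tag in CONTAINERS:
        return ("container", tag)
    if tag in SUPER_TYPES:
        return ("super-type", tag)
    if len(tag) == 2 and tag.islower() and tag[0] in SUPER_TYPES:
        return ("type", tag[0])
    raise ValueError(f"unrecognised tag: {tag!r}")
```

For example, `describe_tag("pf")` groups the tag under super-type p, while `describe_tag("P")` is reported as a container.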

6 Named entity classification: Types of person names

7 Named entity classification: NE-containers

8 Named entity classification: Special tags

9 Data annotation
NE-type, NE-container; special tags; spam; NE-instance
2 rounds of annotation
1st round
–2,000 sentences from the SYN2000 corpus
–randomly selected from 5,364,071 sentences found, query: ([word=".*[a-z0-9]"] [word=".*[A-Z].*"])
–2 parallel annotations, 3rd ‘unifying’ annotation
–defective sentences eliminated, another 100 sentences annotated
–result: 2,010 sentences = train and test data
2nd round
–2,000 sentences from the SYN2005 corpus
–randomly selected from 1,356,321 sentences found, query: [word=".*[0-9].*"]
–1 annotation, not yet revised
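The 1st-round query can be read as: a token ending in a lowercase letter or digit, directly followed by a token containing a capital letter, that is, a capital in a position that is not sentence-initial. The sketch below imitates that selection in Python. It assumes sentences are already tokenised into lists of strings and that the corpus query matches each pattern against the whole token; the function names are hypothetical, and the ASCII character classes are kept exactly as written in the query, so accented capitals such as Š are not covered.

```python
import random
import re

# Patterns copied from the 1st-round query; fullmatch mimics whole-token matching.
PREV_TOKEN = re.compile(r".*[a-z0-9]")
CAP_TOKEN = re.compile(r".*[A-Z].*")


def matches_query(tokens: list[str]) -> bool:
    """True if a token ending in [a-z0-9] is directly followed by a token containing [A-Z]."""
    return any(
        PREV_TOKEN.fullmatch(a) and CAP_TOKEN.fullmatch(b)
        for a, b in zip(tokens, tokens[1:])
    )


def sample_candidates(sentences, k, seed=0):
    """Randomly pick k sentences among those matching the query (seed is illustrative)."""
    hits = [s for s in sentences if matches_query(s)]
    return random.Random(seed).sample(hits, k)
```

With the examples from the talk, `["k", "panu", "Hlavovi"]` matches, while `["Jeho", "hlava"]` does not, because its only capital is sentence-initial.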

10 Data annotation: Example of annotated text

11 Quantitative characteristics of the data
2,010 sentences
–51,921 tokens
–11,644 NE-instances
train:dtest:etest ~ 8:1:1
in the train data
–1,608 sentences
–41,710 tokens
–6,109 NE-instances
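The sentence counts are consistent with the stated ratio: 8/10 of 2,010 sentences is 1,608, which leaves 402 sentences for dtest and etest together. A minimal sketch of such a split follows; it assumes contiguous slices and equal dtest and etest halves, neither of which the slides specify, and the function name `split_8_1_1` is hypothetical.

```python
def split_8_1_1(items):
    """Split a sequence into train, dtest and etest parts in a ratio of about 8:1:1.

    Assumption: contiguous slices; the slides do not say how sentences were assigned.
    """
    n = len(items)
    n_train = n * 8 // 10           # 8 parts out of 10
    n_dtest = (n - n_train) // 2    # half of the remainder
    train = items[:n_train]
    dtest = items[n_train:n_train + n_dtest]
    etest = items[n_train + n_dtest:]
    return train, dtest, etest
```

Applied to 2,010 items, this yields 1,608 for train and 201 each for dtest and etest.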

12 Quantitative characteristics of the data: Tags of all NE-instances in the train data

13 Quantitative characteristics of the data: Tags of all NE-instances in the train data

14 Experiments in automatic NE recognition

15 Future work

