Eurovoc does not yet exist for your language? The Hungarian experience. Tamás Váradi


1 Eurovoc does not yet exist for your language? The Hungarian experience. Tamás Váradi varadi@nytud.hu

2 JRC Ispra, 17/09/2004, EUROVOC Indexing Workshop
Overview of the project
- Objectives
- Partners
- Resources
- Methods
- Results
- Conclusions

3 Project objectives
- Hungarian EUROVOC version
  – only a draft version planned at first
  – an authoritative full-scale system
- Automatic indexing of documents
  – using the technology developed at JRC
  – prototype system for one domain

4 Partners
- Project consortium:
  – HAS RIL (coordinator)
  – MorphoLogic Kft. (partner)
- Collaborators:
  – JRC, Ispra
  – Hungarian Parliament
  – Ministry of Justice

5 Resources
- NLP toolset (RIL)
- Digital dictionaries, software technology (MorphoLogic)
- Indexing technology (JRC Ispra)
- Terminology database, translation, supervision expertise (Justice Ministry)
- Coordination, funding of Hungarian EUROVOC (Hungarian Parliament)

6 EUROVOC translation
- Done by the Translation Coordination Unit of the Ministry of Justice
- Team coordinating the massive effort of preparing the Hungarian translation of the Acquis Communautaire
- Maintaining an online Terminological Database

7 Terminological Database

8 Translation process
- English, French, German and Spanish EUROVOC versions in XML files
- Automatic lookup in the Terminological Database (approx. 20% coverage)
- Notepad2, an XML-aware editor, used
- Micro-thesauri translated first, corresponding descriptors second
- Pool of experts consulted when needed

9 Indexing strategies
- Corpus: Hungarian translation of the Acquis Communautaire
- Two approaches
  1. To translate English associate terms (possible short-cut?)
  2. To reconstruct the generation of associate terms by running the JRC technology on the Hungarian data

10 Translation of associate terms
- Hypothesis:
  – the relation between an English associate term and a EUROVOC descriptor is language independent
  – hence the Hungarian equivalent of an English term will also serve as an appropriate associate term in Hungarian texts

11 Online dictionary lookup
- MorphoLogic online English-Hungarian dictionaries applied
- 24.7% direct match
- Example: suspension of payments (EN), Zahlungseinstellung (DE), cessation de paiement (FR), suspensión de pagos (ES), kifizetések felfüggesztése (HU)
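The 24.7% figure is presumably the share of English associate terms found verbatim in the dictionary. A minimal sketch of how such a direct-match rate could be measured, assuming a plain Python dict as a hypothetical stand-in for the MorphoLogic dictionaries (the slides do not describe their real interface); the function name and the toy term list are illustrative only:

```python
def direct_match_rate(terms, dictionary):
    """Return (translations found, share of terms with a verbatim dictionary entry)."""
    translations = {}
    for term in terms:
        key = term.strip().lower()  # normalise case and whitespace before lookup
        if key in dictionary:
            translations[term] = dictionary[key]
    rate = len(translations) / len(terms) if terms else 0.0
    return translations, rate


# Toy dictionary with a single entry taken from the slide (hypothetical stand-in).
toy_dictionary = {"suspension of payments": "kifizetések felfüggesztése"}
toy_terms = ["suspension of payments", "sales promotion",
             "toxic substance", "stop word"]

found, rate = direct_match_rate(toy_terms, toy_dictionary)
print(found, f"{rate:.1%}")
```

Terms without a direct match would then go to manual translation or to a fuzzier lookup, which this sketch does not attempt.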

12 Manual check of automatic assignments
- Equivalence cannot be judged on its own merits:
  – the Hungarian equivalent must be the one occurring in the texts
  – the Hungarian terms must be looked up in the translation corpus as well
  – a parallel corpus aligned at least on the document level must be compiled

13 Manual check
- sales promotion (EN), Absatzförderung (DE), promotion commerciale (FR), promoción comercial (ES), eladásösztönzés (HU)
- Even frequency lists are useful:
  Reklám           149
  Promóció          60
  Eladásösztönzés    1
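The frequency check on this slide can be sketched as a simple ranking of candidate translations by how often they occur in the Hungarian corpus. This is only an illustration under assumptions: the counts are the three figures quoted on the slide, and the function name is hypothetical, not part of the JRC or MorphoLogic tooling.

```python
from collections import Counter


def rank_candidates(candidates, corpus_counts):
    """Order candidate translations by corpus frequency, most frequent first."""
    return sorted(candidates,
                  key=lambda term: corpus_counts[term.lower()],
                  reverse=True)


# Corpus frequencies quoted on the slide for "sales promotion".
corpus_counts = Counter({"reklám": 149, "promóció": 60, "eladásösztönzés": 1})

ranking = rank_candidates(["eladásösztönzés", "promóció", "reklám"], corpus_counts)
print(ranking)
```

The dictionary equivalent (eladásösztönzés) comes last, which is the slide's point: the term an indexer needs is the one the translators actually used.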

14 Manual check
- toxic substance (EN), Giftstoff (DE), substance toxique (FR), sustancia tóxica (ES), toxikus anyagok / mérgező anyagok (HU)
- Even frequency lists are useful: the two Hungarian candidates are equally frequent

15 Generation of Hungarian associate-lists
Tasks
1. Compile corpus of Hungarian translation of the Acquis Communautaire
2. Tag and lemmatize words
3. Compile list of stop words
4. Run automatic indexing tools (JRC)

16 Hungarian Acquis Communautaire corpus
Files            8,308
HUN tokens  21,899,924
EN tokens   20,394,088

17 English stop-word list
- English stop-word list: 1720 items
  – function words
  – "EUspeak": objective, arrangements, committee
  – some strange multiword strings: necessary_to_comply_with_this_directive, forward_this_resolution_to_the_commission

18 Hungarian stop-word list
1. translated the English items
2. checked their occurrence in HU CELEX
3. generated unigram, bigram and trigram frequency lists from the HU CELEX corpus
4. checked the first 3000 items on each list and added them to the stop-word list if needed
5. double-checked infrequent items on the English translation list and replaced the translation with synonyms
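Steps 3 and 4 amount to building n-gram frequency lists and reviewing the top of each. A minimal sketch, assuming the corpus is already tokenised into a flat list of lower-cased tokens; the function names and the toy token list are hypothetical, and the manual review of the candidates is not modelled:

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def top_candidates(tokens, limit=3000):
    """Return the `limit` most frequent unigrams, bigrams and trigrams."""
    return {n: [gram for gram, _ in ngram_counts(tokens, n).most_common(limit)]
            for n in (1, 2, 3)}


# Toy token stream standing in for the HU CELEX corpus.
sample_tokens = "a b a b c".split()
candidates = top_candidates(sample_tokens, limit=1)
print(candidates[2])
```

On the real corpus the lists would be inspected by hand, as the slide describes, before any item is added to the stop-word list.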

19 Hungarian stop-word list
Single-word entries  1265
Multi-word entries    752
Total                2017

20 Automatic indexing run 1
7971 texts (total length of 65,702,474 characters) divided into 3 sets:
1. 202 texts: optimisation (evaluation set)
2. 179 texts: final evaluation (test set)
3. 7590 texts: the training set
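The slide gives only the sizes of the three sets, not how the texts were assigned to them. One common way to obtain such a split is a seeded random shuffle, sketched below; the random assignment, the seed and the function name are assumptions, not the procedure the project is known to have used.

```python
import random


def split_corpus(doc_ids, n_optimisation=202, n_test=179, seed=0):
    """Shuffle document ids reproducibly and cut them into three disjoint sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded, so the split can be reproduced
    optimisation = ids[:n_optimisation]
    test = ids[n_optimisation:n_optimisation + n_test]
    training = ids[n_optimisation + n_test:]
    return optimisation, test, training


# 7971 placeholder ids standing in for the corpus files.
optimisation, test, training = split_corpus(range(7971))
print(len(optimisation), len(test), len(training))
```

Keeping the test set separate from the optimisation set means parameters tuned on the latter are evaluated on texts the system has not been fitted to.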

21 Precision/recall in terms of number of Eurovoc descriptors
Rank  Precision  Recall  Prec RT  Rec RT  F1-measure
1     80.000     16.286  82.857   17.238  27.0627090127329
2     67.143     25.143  77.143   28.571  36.5857540472011
3     63.810     32.857  75.714   39.238  43.3778884210744
4     59.048     38.286  70.476   43.714  46.4526625434072
5     57.762     44.095  70.048   50.190  50.0115925267777
6     55.571     47.524  68.333   53.143  51.23344883845
7     52.170     48.476  65.408   54.095  50.255209745047
8     49.976     49.905  62.857   55.524  49.9404747649703
9     48.587     51.810  62.143   57.905  50.1467667360579
10    46.619     52.286  60.143   58.381  49.2901477983924
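The F1-measure column is consistent with the usual harmonic mean of the Precision and Recall columns, F1 = 2PR / (P + R). A short sketch that recomputes it from the rounded values on slide 21 and picks the rank with the highest score; because the inputs are rounded to three decimals, the recomputed figures can differ from the slide's in the later digits:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (rank, precision, recall) in percent, as given on slide 21.
rows = [
    (1, 80.000, 16.286), (2, 67.143, 25.143), (3, 63.810, 32.857),
    (4, 59.048, 38.286), (5, 57.762, 44.095), (6, 55.571, 47.524),
    (7, 52.170, 48.476), (8, 49.976, 49.905), (9, 48.587, 51.810),
    (10, 46.619, 52.286),
]

scores = {rank: f1(p, r) for rank, p, r in rows}
best_rank = max(scores, key=scores.get)
print(best_rank, round(scores[best_rank], 2))
```

Precision falls and recall rises as more descriptors are accepted per document, so F1 peaks at an intermediate rank.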

22 Evaluation in terms of rank

23 Precision/Recall graph

24 Conclusions
- First run already yields results comparable to other languages
- Scope for fine-tuning/filtering process
- Interesting to compare results gained from the two approaches

