JRC-Ispra, 17.09.04, Slide 1 Next Steps / Technical Details Bruno Pouliquen & Ralf Steinberger Addressing the Language Barrier Problem in the Enlarged.

Presentation transcript:

Slide 1: Next Steps / Technical Details
Bruno Pouliquen & Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004

Slide 2: Eurovoc indexing – Extend language coverage
Analysis: Danish, Dutch, English, Finnish, French, German, (Greek), Italian, Portuguese, Spanish, Swedish, (Lithuanian), (Bulgarian), (Hungarian)
Extended coverage: Czech, Croatian, Latvian, Lithuanian, Polish, Slovak
Soon also: Albanian, Romanian, Russian, Slovene
Display: Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish

Slide 3: Incentive for collaboration
Mutual benefit
– We can provide tools and results to you (to non-commercial Member State organisations)
– JRC will be able to Eurovoc-index documents for news analysis, etc.
No payments by the JRC are foreseen
How to go ahead? / What to do next?
→ We need Eurovoc-indexed texts in your languages (or translations of Eurovoc-indexed texts!) (Acquis Communautaire)

Slide 4: Format to provide training texts to the JRC
Ideally:
Plain text (not MS-Word, RTF, PDF, etc.)
UTF-8 character encoding
With CELEX code
With Eurovoc descriptor code (mentioning Eurovoc version)
XML format, structured
Linguistically pre-processed and structured:
– lemmatised
– annexes / signatures separate
– title separate
– stop word lists
MANY texts:
– 80,000 English texts were enough to train ca. [...] descriptors (out of 6000)!
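
A training record that meets the requirements listed on this slide might look like the sketch below. The slides do not reproduce the JRC schema in this transcript, so every element and attribute name here (`corpus`, `document`, `celex`, `descriptors`, `eurovoc_version`, `descriptor`, `title`, `body`, `annex`) is an assumption made for illustration, and the CELEX and descriptor codes in the usage note are invented placeholders.

```python
import xml.etree.ElementTree as ET


def build_training_document(celex, eurovoc_version, descriptor_codes,
                            title, body, annex=""):
    """Build one training document as an XML element.

    Hypothetical schema: the element and attribute names are assumptions,
    not the format actually requested by the JRC.
    """
    doc = ET.Element("document", {"celex": celex})
    # Record the Eurovoc version next to the manually assigned descriptor codes.
    descriptors = ET.SubElement(
        doc, "descriptors", {"eurovoc_version": eurovoc_version})
    for code in descriptor_codes:
        ET.SubElement(descriptors, "descriptor", {"code": code})
    # Title, body and annex are kept separate, as the slide asks.
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    if annex:
        ET.SubElement(doc, "annex").text = annex
    return doc


def serialise_corpus(documents):
    """Wrap the documents in one root element and return UTF-8 encoded bytes."""
    root = ET.Element("corpus")
    root.extend(documents)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

For example, `serialise_corpus([build_training_document("EXAMPLE-0001", "4", ["0001", "0002"], "Example title", "Example text.")])` returns one UTF-8 XML byte string that can be written to a single large file.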

Slide 5: Descriptor distribution in Spanish EP/EC texts

Slide 6: Descriptor distribution in Spanish EP/EC texts

Slide 7: Descriptor distribution in Spanish Congress texts

Slide 8: Descriptor distribution in Hungarian texts

Slide 9: Procedure
You provide us with
– A big XML file containing the documents
– A stop word list
We will give back to you
– A subset of documents (evaluation set)
  Same format
  Additional information on automatic Eurovoc descriptors assigned
– Some statistics on descriptor usage frequency, etc.
– An online browser interface to see the assignment results
– A validation interface

Slide 10 (diagram): Your corpus is split into a training set (95%) and an evaluation set (5%)
Training set → pre-processing → training → descriptor profiles
Evaluation set → pre-processing → Eurovoc descriptor assignment → export
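
The 95% / 5% division shown in this diagram can be sketched as a random split. The slides do not say how the JRC selects the evaluation documents, so the shuffling, the fixed seed and the function name `split_corpus` are assumptions for illustration only.

```python
import random


def split_corpus(documents, evaluation_share=0.05, seed=0):
    """Split documents into (training_set, evaluation_set).

    Assumption: documents are chosen at random; the slides only give the
    95% / 5% proportions, not the selection method.
    """
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    if not shuffled:
        return [], []
    # Hold out at least one document for evaluation.
    n_eval = max(1, round(len(shuffled) * evaluation_share))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Only the training set would be used to build the descriptor profiles; the held-out evaluation set is what comes back with the automatically assigned descriptors.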

Slide 11: XML format

Slide 12

Slide 13

Slide 14: Results of descriptor assignment - interface

Slide 15: Results of descriptor assignment - XML
PRESIDENCY OF THE EC COUNCIL EUROPEAN UNION PRESIDENT SOCIAL POLICY PRINCIPLE OF SUBSIDIARITY...

Slide 16: Results of descriptor assignment - validation
Numeric feedback?

Slide 17: Arranging the collaboration of scientific partners
The JRC will be able to provide the tool and indexing results.
The JRC does not have specific funds to pay for this work.
Possibilities for collaboration between parliament and scientists
– informal collaboration without payment
– formal collaboration (contract, payment)
– apply for a project with national or EU funding (example: Hungary)
– M.Sc. theses (e.g. Lithuanian), internships (e.g. Estonian), …
– …
We would like to have lemmatisers for the new languages.
If necessary, we can train the system without linguistic pre-processing.

Slide 18: Pre-processing of the texts (by scientists?)
Linguistic pre-processing, needed for each language:
– General and corpus-specific list of stop words (several thousand!)
– For highly inflected languages: some lemmatiser or stemmer
– Multi-word term mark-up for disambiguation purposes?
Further text processing
– Some document structuring to separate title, text, footer and annex
– Conversion to XML
– Conversion to UTF-8
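
Two of the pre-processing steps on this slide, conversion to UTF-8 and stop word filtering, are simple enough to sketch. The function names and the regular-expression tokeniser are assumptions and not the JRC's tooling; lemmatisation and multi-word term mark-up are language-specific and are left out.

```python
import re

# Assumption: a token is any run of Unicode word characters.
TOKEN = re.compile(r"\w+")


def to_utf8(raw_bytes, source_encoding):
    """Re-encode a document from a known legacy encoding to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def remove_stop_words(text, stop_words):
    """Lower-case and tokenise the text, then drop tokens in the stop word list."""
    stop = {word.lower() for word in stop_words}
    tokens = (match.group(0).lower() for match in TOKEN.finditer(text))
    return [token for token in tokens if token not in stop]
```

A real stop word list would hold several thousand general and corpus-specific entries per language, as the slide notes; the source encoding has to be known or detected separately before `to_utf8` is called.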

Slide 19: Dealing with different versions of Eurovoc
Problem has not yet been solved: request for your input
English training material was indexed with versions 3.1 and 4
Challenge: new descriptors need new training material → delay
Re-training required

Slide 20: Dealing with different versions of Eurovoc (2)
Case 1: New descriptor → Search old and new documents for related documents for re-training
Case 2: New name for old descriptor → Replace the descriptor name: OLD_NAME → NEW_NAME
Case 3: New place in hierarchy → No problem
Case 4: Disappearing descriptor → Will no longer be assigned

Slide 21: Dealing with different versions of Eurovoc (2)
Case 5: Several descriptors are conflated → No problem
  OLD_NAME_1, OLD_NAME_2, OLD_NAME_3 → NEW_NAME
Case 6: A descriptor is split into two or more → Re-training required (see Case 1)
  OLD_NAME → NEW_NAME_1, NEW_NAME_2, NEW_NAME_3
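
The six cases could be applied to the descriptors already assigned to a training document roughly as follows. This is a sketch under assumptions: the function name and the four lookup tables are hypothetical, and the slides themselves state that the versioning problem has not yet been solved.

```python
def migrate_descriptors(assigned, renamed, removed, merged, split):
    """Map descriptors from an old Eurovoc version to a new one.

    Returns (migrated, retrain): the descriptors that carry over, and the
    new descriptors that need fresh training material.
    Case 1 (brand-new descriptor) cannot be derived from old assignments,
    and Case 3 (new place in the hierarchy) needs no action.
    """
    migrated, retrain = [], []
    for name in assigned:
        if name in removed:
            # Case 4: a disappearing descriptor is no longer assigned.
            continue
        if name in split:
            # Case 6: no automatic mapping exists; flag the successors.
            for successor in split[name]:
                if successor not in retrain:
                    retrain.append(successor)
            continue
        name = renamed.get(name, name)  # Case 2: OLD_NAME -> NEW_NAME
        name = merged.get(name, name)   # Case 5: conflated descriptors
        if name not in migrated:
            migrated.append(name)
    return migrated, retrain
```

Documents whose descriptor was split still have to be reviewed, or searched for related documents, before the successor descriptors can be trained, which is why they are only flagged here.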

Slide 22: Dealing with different versions of Eurovoc (3)
Changes between Eurovoc versions should not only be described in free text.
They should be formalised in a machine-readable way (e.g. in XML, in table format, …).
This should be done centrally for the thesaurus (i.e. for all thesaurus languages), rather than separately for each language!
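
A machine-readable change description of the kind this slide asks for might look like the XML below, read here with Python's standard library. No such official format is given in the slides: the element names (`eurovoc_changes`, `rename`, `remove`, `merge`, `split`), the attribute names and all descriptor names are invented for illustration; only the version numbers 3.1 and 4 appear in the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical change log; the schema and the descriptor names are invented.
CHANGE_LOG = """\
<eurovoc_changes from_version="3.1" to_version="4">
  <rename old="OLD_NAME" new="NEW_NAME"/>
  <remove old="OLD_NAME_X"/>
  <merge new="NEW_NAME_M">
    <old>OLD_NAME_1</old>
    <old>OLD_NAME_2</old>
  </merge>
  <split old="OLD_NAME_S">
    <new>NEW_NAME_1</new>
    <new>NEW_NAME_2</new>
  </split>
</eurovoc_changes>
"""


def parse_change_log(xml_text):
    """Read the change log into lookup tables keyed by old descriptor name."""
    root = ET.fromstring(xml_text)
    changes = {"renamed": {}, "removed": set(), "merged": {}, "split": {}}
    for element in root:
        if element.tag == "rename":
            changes["renamed"][element.get("old")] = element.get("new")
        elif element.tag == "remove":
            changes["removed"].add(element.get("old"))
        elif element.tag == "merge":
            for old in element.findall("old"):
                changes["merged"][old.text] = element.get("new")
        elif element.tag == "split":
            changes["split"][element.get("old")] = [
                new.text for new in element.findall("new")]
    return changes
```

Because the entries refer to descriptors rather than to their names in one language, a single centrally maintained file keyed by descriptor code could serve every thesaurus language.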

Slide 23: Appeal to Eurovoc community / EP / OPOCE
Make Eurovoc available to the wider public in machine-readable form
Formalise the version differences (e.g. XML)
Make Eurovoc-indexed texts available to the scientific community
– Controlled by licences, if necessary
– E.g. via the Evaluations and Language resources Distribution Agency (ELDA): “ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.”
– Wealth of ‘parallel texts’ to train multilingual text analysis applications
  Machine Translation
  Multilingual Named Entity Recognition
  Multilingual classification
  Multi-document summarisation
  …
  Automatic indexing
→ The benefit is yours!

Slide 24