Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Bašić Faculty of Electrical Engineering.

Presentation transcript:

Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Bašić Faculty of Electrical Engineering and Computing, University of Zagreb bojana.dalbelo@fer.hr Marko Tadić Faculty of Humanities and Social Sciences, University of Zagreb marko.tadic@ffzg.hr Marie-Francine Moens Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven marie-francine.moens@law.kuleuven.ac.be Leuven, 2007-05-22

Talk overview: document indexing and computer aided document indexing; project AIDE; CADIS workstation: features; project CADIAL; eCADIS workstation: additional features; machine learning techniques; future developments; conclusions

Computer Aided Document Indexing: attachment of descriptors from a controlled thesaurus to a document. Descriptors = labels representing the content of a document. Necessary for document retrieval in many document collections: parliamentary documentation, legislation, technical documentation, … Usually done manually: tedious, error-prone, slow (max. 30-40 documents/day). Could computers be of any help in this process? Yes, if we build a Computer Aided Document Indexing System (CADIS).

Project AIDE in Croatia. Idea for a project: September 2004. Interdisciplinary collaboration of 3 institutions: Croatian Information Documentation Referral Agency (HIDRA); Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb; Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb.

AIDE – collaborating institutions. HIDRA: collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia; coordinator Maja Cvitaš, M.A. ZEMRIS: research in the field of artificial intelligence, neural networks, machine learning, data and text mining; coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc. ZZL: computational linguistic research and building language technologies for Croatian; coordinator prof. Marko Tadić.

AIDE – project objective: development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus.

AIDE – how? AIDE = Automatic Indexing of Documents with Eurovoc. Automatic indexing, how? A program which “learns to index” documents. Conference at the Joint Research Centre of the EC (JRC), Ispra, Italy, 2004-09: at least 10,000 manually indexed documents; 3-5 descriptors per document; 10-15 documents per descriptor; indexed documents stored in XML format (Steinberger 2003). Compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors. Situation with Croatian documentation in 2004-09: only a few hundred documents were indexed; manual indexing: painfully slow. How could we speed up the manual indexing?
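The corpus figures on this slide (descriptors per document, documents per descriptor) can be checked on any set of manually indexed documents. The following is a minimal sketch, not part of the original talk; the function names, the in-memory dictionary format, and the threshold are illustrative assumptions.

```python
from collections import Counter

def corpus_stats(indexed_docs):
    """indexed_docs: dict mapping a document id to its list of descriptors
    (hypothetical in-memory format, not the project's XML format)."""
    n_docs = len(indexed_docs)
    docs_per_descriptor = Counter()
    total_labels = 0
    for descriptors in indexed_docs.values():
        unique = set(descriptors)          # count each descriptor once per document
        total_labels += len(unique)
        docs_per_descriptor.update(unique)
    avg_per_doc = total_labels / n_docs if n_docs else 0.0
    return n_docs, avg_per_doc, docs_per_descriptor

def sparse_descriptors(docs_per_descriptor, min_docs=10):
    """Descriptors with fewer than min_docs example documents
    (min_docs=10 is an assumed threshold taken from the 10-15 range above)."""
    return sorted(d for d, n in docs_per_descriptor.items() if n < min_docs)
```

Descriptors returned by sparse_descriptors would be the ones for which a learned indexer has too few examples.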

AIDE – activities: investigate and develop algorithms in the field of computational linguistics/language technologies; include that knowledge into the Computer Aided Document Indexing System (CADIS); demonstration of CADIS in the European Parliament (2006-03-10).

CADIS: two parallel windows (Eurovoc browser window, document window)

Document Window


CADIS features: enhanced user interface; list of descriptors literally appearing in the document

CADIS features: descriptors and non-descriptors marked in the document
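How thesaurus terms might be located and marked in a document can be sketched as follows. This is an illustration only, not the CADIS implementation: the mini-thesaurus and function names are hypothetical, and exact string matching is assumed, whereas an inflected language such as Croatian would need lemmatization before matching.

```python
import re

# Hypothetical mini-thesaurus: term -> (kind, preferred descriptor).
# A non-descriptor points to the descriptor that should be used instead.
THESAURUS = {
    "fishing industry": ("descriptor", "fishing industry"),
    "fishery": ("non-descriptor", "fishing industry"),
    "environmental protection": ("descriptor", "environmental protection"),
}

def find_thesaurus_terms(text, thesaurus):
    """Return (start, end, term, kind, descriptor) for every literal match."""
    hits = []
    for term, (kind, descriptor) in thesaurus.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), term, kind, descriptor))
    return sorted(hits)

def literal_descriptors(text, thesaurus):
    """Descriptors reached through any literally appearing term."""
    return sorted({hit[4] for hit in find_thesaurus_terms(text, thesaurus)})
```

The character offsets in each hit are what a document window would need in order to highlight the term.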

CADIS features: lists of n-grams

CADIS features: integration of corpus analysis; greyed n-grams are statistically relevant in the corpus, i.e. collocations
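The slides do not say which statistic decides that an n-gram is "statistically relevant". As a minimal sketch, assuming pointwise mutual information (PMI) over bigrams as the association measure (an assumption, not the documented CADIS method), collocation candidates could be scored like this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_pmi(corpus_tokens, min_count=2):
    """Score bigrams seen at least min_count times by PMI (log base 2).
    High scores mean the two words co-occur more often than chance."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(ngrams(corpus_tokens, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

The min_count cut-off is a common guard, since PMI tends to overrate word pairs that occur only once.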

CADIS features: manual marking of significant n-grams; an important step towards further refinement of automatic indexing

Eurovoc browser window

AIDE – activities: investigate and develop algorithms in the field of computational linguistics/language technologies; include that knowledge into the Computer Aided Document Indexing System (CADIS); demonstration of CADIS in the European Parliament (2006-03-10); ca. 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006; joint project proposal with Katholieke Universiteit Leuven for the CADIAL project.

CADIAL project: Computer Aided Document Indexing for Accessing Legislation, a joint Flemish-Croatian project. Department International Flanders, grant no. KRO/009/06. Partners: Katholieke Universiteit Leuven (prof. Marie-Francine Moens); University of Zagreb, Hidra (prof. Bojana Dalbelo Bašić). Started: 2007-03. Duration: 2 years. Web: www.cadial.org. The goal: a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia. A new version of CADIS (eCADIS) is one of the modules in this project, planned as a web-based service.

CADIAL project 2: used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian; used the 20,000 manually indexed documents from the Acquis to train the system for automatic indexing of documents in English; included that training data into the next version: eCADIS (β-version).

eCADIS () features Automatic suggestion of relevant descriptors i.e. automatic indexing application of machine learning techniques Leuven, 2007-05-22

eCADIS () features Compare it to manually attached indexes… Leuven, 2007-05-22

eCADIS () features Manual marking of inappropriate suggestions another step in further refinment of automatic indexing Leuven, 2007-05-22

eCADIS () on document in English Leuven, 2007-05-22

eCADIS () on document in English Automatic suggestion of relevant descriptors i.e. automatic indexing Leuven, 2007-05-22

eCADIS () on document in English Compare it to manually attached indexes… Leuven, 2007-05-22

Training the classifiers. Already existing classifiers: profile classifier (Steinberger 2003); k-nearest neighbours; binary classifiers: SVM, Logistic Regression, Rocchio, Bayes, … Classifiers used for the preliminary training: ca. 3500 independent binary classifiers; need to be further evaluated; Logistic Regression used for 10,000 documents in Croatian; SVM used for 20,000 documents in English. Features: tokens, lemmas, stems, character n-grams. Various feature selection methods and their combinations: χ², IG, MI, …
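With one independent binary classifier per descriptor, feature selection can be done separately for each descriptor by treating its documents as positives and all others as negatives. The following is a minimal sketch of chi-squared scoring for a single descriptor; the data layout and function names are illustrative assumptions, not the project's code.

```python
from collections import Counter

def chi_squared(n11, n10, n01, n00):
    """Chi-squared for a 2x2 table of document counts:
    n11 = feature and descriptor, n10 = feature only,
    n01 = descriptor only,        n00 = neither."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # feature or descriptor is constant: no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def select_features(docs, descriptor, k):
    """docs: list of (set_of_features, set_of_descriptors) pairs.
    Returns the k features with the highest chi-squared score
    for the binary problem 'has this descriptor or not'."""
    n_docs = len(docs)
    n_pos = sum(1 for _, labels in docs if descriptor in labels)
    df_all, df_pos = Counter(), Counter()
    for feats, labels in docs:
        df_all.update(feats)
        if descriptor in labels:
            df_pos.update(feats)
    scored = []
    for feat, df in df_all.items():
        n11 = df_pos[feat]
        n10 = df - n11
        n01 = n_pos - n11
        n00 = n_docs - df - n01
        scored.append((chi_squared(n11, n10, n01, n00), feat))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken alphabetically
    return [feat for _, feat in scored[:k]]
```

Chi-squared is symmetric, so a feature that reliably signals the absence of a descriptor scores as high as one that signals its presence; both carry information for a binary classifier.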

Further development of eCADIS. Training with new features and feature selection methods: collocations, word n-grams, chunks. New measures for evaluation of results, sensitive to thesaurus hierarchy. Web interface for eCADIS, for inclusion into the CADIAL system. eCADIS for other languages: now only Croatian and English (β-version) covered; usable for other languages as it is, but less efficient without the linguistic module: no list of lemmas, only types; poor statistics for n-grams. Cooperation with language technology experts in different languages for development of linguistic modules.
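The slides do not define the hierarchy-sensitive evaluation measures. One known option, shown here purely as an assumed illustration, is hierarchical precision and recall: both the suggested and the manually assigned descriptor sets are expanded with their broader terms before being compared, so that a near miss in the thesaurus earns partial credit. The broader-term mapping and descriptor names below are hypothetical.

```python
def with_ancestors(descriptors, broader):
    """Expand a set of descriptors with all of their broader terms.
    broader: dict mapping a term to a list of its broader terms."""
    expanded = set()
    stack = list(descriptors)
    while stack:
        term = stack.pop()
        if term in expanded:
            continue
        expanded.add(term)
        stack.extend(broader.get(term, ()))
    return expanded

def hierarchical_prf(suggested, gold, broader):
    """Precision, recall and F1 computed on ancestor-expanded sets."""
    s = with_ancestors(suggested, broader)
    g = with_ancestors(gold, broader)
    overlap = len(s & g)
    precision = overlap / len(s) if s else 0.0
    recall = overlap / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under flat evaluation, suggesting a sibling of the correct descriptor scores zero; with the expansion above it shares the common broader terms and scores above zero.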

Further development of eCADIS … eCADIS for other languages: training the automatic indexing system for other languages enables automatic suggestions of relevant descriptors in new, unseen documents. Analysis of manual markings: descriptors, word n-grams, suggestions. Promote the use of eCADIS in other countries beyond the scope of the CADIAL project, e.g. Belgium (Flanders): linguistic module for Dutch and French needed (computational linguistics expertise); training data from the Acquis can be used to make an automatic indexing system for Dutch and French (machine learning expertise).

Conclusion. CADIAL: a joint Flemish-Croatian project sponsored by the Flemish government. Better public access to Croatian official documentation: faster and improved document indexing; automatic content metadata generation (Semantic Web); easier document retrieval and exploration of legislation; multilingual access via the standardized EU thesaurus Eurovoc. A test case for the usage of such a system in Flanders. Web information on the CADIAL project and eCADIS: www.cadial.org. Contact: bojana.dalbelo@fer.hr, marie-france.moens@law.kuleuven.ac.be
