Leuven, 2007-05-22 Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Bašić Faculty.

Presentation transcript:

Computer Aided Document Indexing System for Accessing Legislation
A Joint Venture of Flanders and Croatia
Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb
Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven

Talk overview
- document indexing and computer aided document indexing
- project AIDE
- CADIS workstation: features
- project CADIAL
- eCADIS workstation: additional features
- machine learning techniques
- future developments
- conclusions

Computer Aided Document Indexing
- document indexing
  - attachment of descriptors from a controlled thesaurus to a document
  - descriptors = labels representing the content of a document
- necessary for document retrieval in many document collections
  - parliamentary documentation
  - legislation
  - technical documentation
  - …
- usually done manually
  - tedious, error-prone, slow (max documents/day)
- could computers be of any help in this process?
  - if we build a Computer Aided Document Indexing System (CADIS)

Project AIDE in Croatia
- idea for a project
  - September 2004
- interdisciplinary collaboration of 3 institutions
  - Croatian Information Documentation Referral Agency (HIDRA)
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
  - Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

AIDE – collaborating institutions
- HIDRA
  - collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia
  - coordinator: Maja Cvitaš, M.A.
- ZEMRIS
  - research in the field of artificial intelligence, neural networks, machine learning, data and text mining
  - coordinators: prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.
- ZZL
  - computational linguistic research and building language technologies for Croatian
  - coordinator: prof. Marko Tadić

AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

AIDE – how?
- AIDE = Automatic Indexing of Documents with Eurovoc
- automatic indexing, how?
  - a program which “learns to index” documents
  - conference at the Joint Research Centre of the EC (JRC), Ispra, Italy,
    - at least 10,000 manually indexed documents
    - 3-5 descriptors per document
    - documents per descriptor
    - indexed documents stored in XML format
    - Steinberger (2003)
  - compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors
- situation with Croatian documentation in
  - only a few hundred documents were indexed
  - manual indexing: painfully slow
  - how could we speed up the manual indexing?

AIDE – activities
- investigate and develop algorithms in the field of computational linguistics/language technologies
- incorporate that knowledge into the Computer Aided Document Indexing System (CADIS)
- demonstration of CADIS in the European Parliament ( )

CADIS: two parallel windows
- Document window
- Eurovoc browser window

Document Window

CADIS features
- Enhanced user interface
  - list of descriptors literally appearing in the document

CADIS features
- Descriptors and non-descriptors marked in the document
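The marking of descriptors and non-descriptors in a document can be illustrated with a small sketch. The slides do not show how CADIS does this internally, so everything here is an assumption: the toy thesaurus entries, the `find_terms` helper and the plain case-insensitive string matching. For an inflected language such as Croatian, literal matching like this would miss inflected forms, which is presumably where lemmas and a linguistic module come in.

```python
import re

# Hypothetical mini-thesaurus (illustrative entries only):
# descriptor -> non-descriptors (alternative terms that point to it).
THESAURUS = {
    "fishing industry": ["fishery"],
    "environmental protection": ["nature conservation"],
}

def find_terms(text, thesaurus):
    """Return (start, end, matched text, descriptor, kind) tuples sorted by position."""
    hits = []
    for descriptor, non_descriptors in thesaurus.items():
        candidates = [(descriptor, "descriptor")]
        candidates += [(nd, "non-descriptor") for nd in non_descriptors]
        for term, kind in candidates:
            # Whole-word, case-insensitive literal match.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.start(), m.end(), m.group(0), descriptor, kind))
    return sorted(hits)

text = "The act regulates the fishing industry and nature conservation."
hits = find_terms(text, THESAURUS)
for start, end, matched, descriptor, kind in hits:
    print(f"{start}-{end}: '{matched}' -> {descriptor} ({kind})")
```

A non-descriptor hit is reported together with the descriptor it points to, so an indexer sees the preferred term even when the document uses a synonym.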

CADIS features
- Lists of n-grams

CADIS features
- Integration of corpus analysis
  - greyed n-grams are statistically relevant in the corpus, i.e. collocations
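As a rough illustration of how n-grams can be ranked as collocation candidates, the sketch below lists bigrams and scores them with pointwise mutual information (PMI). The slides do not say which association measure CADIS uses; PMI, the function names and the toy corpus are assumptions made only for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_pmi(corpus_tokens):
    """PMI for every bigram: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(ngrams(corpus_tokens, 2))
    n_uni = len(corpus_tokens)
    n_bi = max(len(corpus_tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus, already tokenized by whitespace.
corpus = ("official gazette publishes the law . the official gazette "
          "is public . the law is public").split()
scores = bigram_pmi(corpus)
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(" ".join(bigram), round(score, 2))
```

PMI is known to favour rare pairs, so a real system would normally add a minimum frequency threshold or use a different measure.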

CADIS features
- Manual marking of significant n-grams
  - important step towards further refinement of automatic indexing

Eurovoc browser window

AIDE – activities
- investigate and develop algorithms in the field of computational linguistics/language technologies
- incorporate that knowledge into the Computer Aided Document Indexing System (CADIS)
- demonstration of CADIS in the European Parliament ( )
- ca. 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006
- joint project proposal with Katholieke Universiteit Leuven for the CADIAL project

CADIAL project
- Computer Aided Document Indexing for Accessing Legislation
- a joint Flemish-Croatian project
  - Department International Flanders, grant no. KRO/009/06
- partners:
  - Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
  - University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)
- started:
- duration: 2 years
- web:
- the goal: a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
- the new version of CADIS (eCADIS) is one of the modules in this project
  - planned as a web-based service

CADIAL project 2
- used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian
- used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English
- included that training data into the next version: eCADIS (  -version)

eCADIS (  ) features
- Automatic suggestion of relevant descriptors, i.e. automatic indexing
  - application of machine learning techniques
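One common way to suggest several descriptors per document is to train one independent binary classifier per descriptor and return every descriptor whose classifier fires. The sketch below shows that idea with a simple nearest-centroid rule in the spirit of Rocchio. It is not the project's configuration: the bag-of-words features, the toy training data, the descriptor names and the decision rule are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (assumed feature representation)."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Average term-count vector of a list of documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = max(len(vectors), 1)
    return {term: count / n for term, count in total.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(docs):
    """docs: list of (text, set of descriptors). One binary model per descriptor."""
    labels = set().union(*(d for _, d in docs))
    vectors = [(vectorize(t), d) for t, d in docs]
    models = {}
    for label in labels:
        pos = [v for v, d in vectors if label in d]
        neg = [v for v, d in vectors if label not in d]
        models[label] = (centroid(pos), centroid(neg))
    return models

def suggest(models, text):
    """Descriptors whose positive centroid is closer than the negative one."""
    v = vectorize(text)
    scored = []
    for label, (pos, neg) in models.items():
        margin = cosine(v, pos) - cosine(v, neg)
        if margin > 0:
            scored.append((margin, label))
    return [label for margin, label in sorted(scored, reverse=True)]

# Hypothetical training documents and descriptors.
docs = [
    ("fishing quota for sea fishing vessels", {"fisheries"}),
    ("customs tariff on imported goods", {"customs"}),
    ("import of fish and customs duties on fishing vessels", {"fisheries", "customs"}),
    ("tariff schedule for imported goods", {"customs"}),
]
models = train(docs)
print(suggest(models, "new fishing quota for vessels"))
```

Because each descriptor has its own model, a document can receive zero, one or many suggestions, and a single descriptor can be retrained without touching the others.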

eCADIS (  ) features
- Compare it to manually attached indexes…
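Comparing suggested descriptors with the manually attached ones is usually quantified with precision, recall and F1 over the two descriptor sets. The slides do not state which measure was used, so the helper below and its example descriptor names are only an illustrative assumption.

```python
def compare(suggested, manual):
    """Precision, recall and F1 of suggested descriptors against manual ones."""
    suggested, manual = set(suggested), set(manual)
    correct = suggested & manual
    precision = len(correct) / len(suggested) if suggested else 0.0
    recall = len(correct) / len(manual) if manual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical descriptor sets for one document.
suggested = {"fisheries", "customs", "tariff policy", "import"}
manual = {"fisheries", "customs", "sea fishing"}
precision, recall, f1 = compare(suggested, manual)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```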

eCADIS (  ) features
- Manual marking of inappropriate suggestions
  - another step in further refinement of automatic indexing

eCADIS (  ) on a document in English

eCADIS (  ) on a document in English
- Automatic suggestion of relevant descriptors, i.e. automatic indexing

eCADIS (  ) on a document in English
- Compare it to manually attached indexes…

Training the classifiers
- already existing classifiers
  - profile classifier (Steinberger 2003)
  - K-nearest neighbours
- binary classifiers
  - SVM, Logistic Regression, Rocchio, Bayes, …
- classifiers used for the preliminary training
  - ca. 3500 independent binary classifiers
  - need to be further evaluated
  - Logistic Regression used for 10,000 documents in Croatian
  - SVM used for 20,000 documents in English
- features
  - tokens, lemmas, stems, character n-grams
  - various feature selection methods and their combinations: χ², IG, MI, …
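Of the feature selection methods listed, χ² is easy to show in a few lines: for one descriptor and one term it measures how far the observed document counts deviate from what independence would predict. The sketch below uses the standard 2×2 contingency formula. The toy documents, descriptor names and function names are assumptions, and the project's actual feature pipeline is not described in the slides.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/descriptor contingency table.

    n11: docs with term and descriptor    n10: term, no descriptor
    n01: descriptor, no term              n00: neither
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def score_terms(docs, label):
    """docs: list of (set of terms, set of descriptors). Returns term -> chi-square."""
    vocab = set().union(*(terms for terms, _ in docs))
    scores = {}
    for term in vocab:
        n11 = sum(1 for t, d in docs if term in t and label in d)
        n10 = sum(1 for t, d in docs if term in t and label not in d)
        n01 = sum(1 for t, d in docs if term not in t and label in d)
        n00 = len(docs) - n11 - n10 - n01
        scores[term] = chi_square(n11, n10, n01, n00)
    return scores

# Hypothetical documents, reduced to term sets, with their descriptors.
docs = [
    ({"fishing", "quota"}, {"fisheries"}),
    ({"fishing", "vessel"}, {"fisheries"}),
    ({"tariff", "quota"}, {"customs"}),
    ({"tariff", "goods"}, {"customs"}),
]
scores = score_terms(docs, "fisheries")
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(term, score)
```

Terms that occur equally often with and without the descriptor (here "quota") score zero and would be dropped first when keeping only the top-scoring features for that descriptor's binary classifier.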

Further development of eCADIS
- training with new features and feature selection methods
  - collocations, word n-grams, chunks
- new measures for evaluation of results
  - sensitive to thesaurus hierarchy
- web interface for eCADIS for inclusion into the CADIAL system
- eCADIS for other languages
  - now only Croatian and English (  -version) covered
  - usable for other languages as it is, but less efficient without the linguistic module
    - no list of lemmas, but types
    - poor statistics for n-grams
  - cooperation with language technology experts in different languages for development of linguistic modules
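An evaluation measure that is sensitive to the thesaurus hierarchy can give partial credit when a suggested descriptor is a broader or narrower term of a manually assigned one. One known approach, sketched below, expands both descriptor sets with all their broader terms before computing precision and recall. The slides do not specify which measure was planned, so this choice, the single-parent simplification and the toy hierarchy are assumptions.

```python
# Hypothetical broader-term links: descriptor -> its broader descriptor
# (None for a top term). A single parent per descriptor is a simplification.
BROADER = {
    "sea fishing": "fishing industry",
    "fishing industry": "fisheries",
    "fisheries": None,
    "customs tariff": "tariff policy",
    "tariff policy": None,
}

def with_ancestors(labels, broader):
    """Expand a descriptor set with all broader terms up to the top."""
    expanded = set()
    for label in labels:
        while label is not None and label not in expanded:
            expanded.add(label)
            label = broader.get(label)
    return expanded

def hierarchical_prf(suggested, manual, broader):
    """Precision, recall and F1 computed on ancestor-expanded sets."""
    s = with_ancestors(suggested, broader)
    m = with_ancestors(manual, broader)
    overlap = len(s & m)
    precision = overlap / len(s) if s else 0.0
    recall = overlap / len(m) if m else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# The system suggests a broader term than the one the indexer chose.
p, r, f = hierarchical_prf({"fishing industry"}, {"sea fishing"}, BROADER)
print(round(p, 2), round(r, 2), round(f, 2))
```

With flat set comparison this suggestion would count as a complete miss; with the expanded sets it is rewarded for landing in the right branch of the thesaurus.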

Further development of eCADIS
- … eCADIS for other languages
  - training the automatic indexing system for other languages
    - enables automatic suggestions of relevant descriptors in new, unseen documents
- analysis of manual markings
  - descriptors, word n-grams, suggestions
- promote the use of eCADIS in other countries beyond the scope of the CADIAL project
  - e.g. Belgium (Flanders)
    - linguistic module for Dutch and French needed: computational linguistics expertise
    - training data from Acquis can be used to make an automatic indexing system for Dutch and French: machine learning expertise

Conclusion
- CADIAL
  - a joint Flemish-Croatian project sponsored by the Flemish government
  - better public access to Croatian official documentation
  - faster and improved document indexing
  - automatic content metadata generation (Semantic Web)
  - easier document retrieval and exploration of legislation
  - multilingual access via the standardized EU thesaurus Eurovoc
  - a test case for the usage of such a system in Flanders
- Web information on the CADIAL project and eCADIS
- contact:
