Published by Hugh Duane Franklin. Modified over 9 years ago.
1
Leuven, 2007-05-22
Computer Aided Document Indexing System for Accessing Legislation
A Joint Venture of Flanders and Croatia
  Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo@fer.hr
  Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic@ffzg.hr
  Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven, marie-francine.moens@law.kuleuven.ac.be
2
Talk overview
  document indexing and computer aided document indexing
  project AIDE
  CADIS workstation: features
  project CADIAL
  eCADIS workstation: additional features
  machine learning techniques
  future developments
  conclusions
3
Computer Aided Document Indexing
  document indexing
    attachment of descriptors from a controlled thesaurus to a document
    descriptors = labels representing the content of a document
    necessary for document retrieval in many document collections
      parliamentary documentation
      legislation
      technical documentation
      …
    usually done manually
      tedious, error-prone, slow (max. 30-40 documents/day)
  could computers be of any help in this process?
    if we build a Computer Aided Document Indexing System (CADIS)
4
Project AIDE in Croatia
  idea for a project: September 2004
  interdisciplinary collaboration of 3 institutions
    Croatian Information Documentation Referral Agency (HIDRA)
    Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
    Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
5
AIDE – collaborating institutions
  HIDRA
    collecting, processing, providing public access to and promotion of the official documentation of the Republic of Croatia
    coordinator: Maja Cvitaš, M.A.
  ZEMRIS
    research in the field of artificial intelligence, neural networks, machine learning, data and text mining
    coordinators: prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.
  ZZL
    computational linguistic research and building language technologies for Croatian
    coordinator: prof. Marko Tadić
6
AIDE – project objective
  Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
7
AIDE – how?
  AIDE = Automatic Indexing of Documents with Eurovoc
  automatic indexing, how?
    program which “learns to index” documents
    conference in the Joint Research Centre of the EC (JRC), Ispra, Italy, 2004-09
      at least 10,000 manually indexed documents
      3-5 descriptors per document
      10-15 documents per descriptor
      indexed documents stored in XML format
      Steinberger (2003)
  compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors
  situation with Croatian documentation in 2004-09
    only a few hundred documents had been indexed
    manual indexing: painfully slow
    how could we speed up the manual indexing?
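The corpus figures on this slide (descriptors per document, documents per descriptor, XML storage) can be checked mechanically. The sketch below is only an illustration: the slides do not show the XML layout, so the `corpus`, `document` and `descriptor` element names, the `id` attribute and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout; the real AIDE/HIDRA XML schema is not given in the slides.
SAMPLE_XML = """
<corpus>
  <document id="doc-1">
    <descriptor>fisheries policy</descriptor>
    <descriptor>fishing quota</descriptor>
    <descriptor>Adriatic Sea</descriptor>
  </document>
  <document id="doc-2">
    <descriptor>customs tariff</descriptor>
    <descriptor>fisheries policy</descriptor>
  </document>
</corpus>
"""

def corpus_statistics(xml_text):
    """Count descriptors per document and documents per descriptor."""
    root = ET.fromstring(xml_text.strip())
    per_document = {}
    per_descriptor = Counter()
    for doc in root.iter("document"):
        # A set, so a descriptor repeated inside one document counts once.
        labels = {d.text.strip() for d in doc.iter("descriptor") if d.text}
        per_document[doc.get("id")] = len(labels)
        per_descriptor.update(labels)
    return per_document, per_descriptor

per_document, per_descriptor = corpus_statistics(SAMPLE_XML)

# Documents falling outside the 3-5 descriptors range quoted on the slide.
outside_range = [doc_id for doc_id, n in per_document.items() if not 3 <= n <= 5]
```

On a real corpus the same two counters would show how far the collection is from the 10-15 documents per descriptor mentioned above.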
8
AIDE – activities
  investigate and develop algorithms in the field of computational linguistics/language technologies
  include that knowledge into the Computer Aided Document Indexing System (CADIS)
  demonstration of CADIS in the European Parliament (2006-03-10)
9
CADIS: two parallel windows
  Document window
  Eurovoc browser window
10
Document Window
12
CADIS features
  Enhanced user interface
    list of descriptors literally appearing in document
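A list of descriptors that literally appear in a document can be produced by plain string matching. This is a minimal sketch, not the CADIS implementation: the function name and sample data are made up, and it matches surface forms only, so inflected forms (frequent in Croatian) would need the lemmatisation that the linguistic module provides.

```python
import re

def find_literal_descriptors(text, descriptors):
    """Return {descriptor: count} for descriptors whose wording occurs verbatim."""
    # Lowercase and collapse punctuation/whitespace so matching is token based.
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    hits = {}
    for descriptor in descriptors:
        needle = " ".join(re.findall(r"\w+", descriptor.lower()))
        if not needle:
            continue
        # Lookarounds keep "policy" from matching inside "policyholder".
        pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
        count = len(re.findall(pattern, normalized))
        if count:
            hits[descriptor] = count
    return hits

text = ("The common agricultural policy affects fisheries policy "
        "and agricultural policy reform.")
thesaurus = ["agricultural policy", "fisheries policy", "customs union"]
hits = find_literal_descriptors(text, thesaurus)
```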
13
CADIS features
  Descriptors and non-descriptors marked in document
14
CADIS features
  Lists of n-grams
15
CADIS features
  Integration of corpus analysis
    greyed n-grams are statistically relevant in the corpus, i.e. collocations
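The slide does not say which statistic marks an n-gram as a collocation. As one common possibility, the sketch below scores word bigrams by pointwise mutual information (PMI); the choice of PMI, the `min_count` threshold and the toy corpus are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def bigram_pmi(documents, min_count=2):
    """Score bigrams seen at least min_count times by PMI (log base 2)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        tokens = re.findall(r"\w+", doc.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs never cross documents
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

corpus = [
    "official gazette of the republic",
    "the official gazette is public",
    "the republic is small",
]
scores = bigram_pmi(corpus)
```

Here "official gazette" scores higher than "the republic" because "the" also occurs freely outside that pair.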
16
CADIS features
  Manual marking of significant n-grams
    important step towards further refinement of automatic indexing
17
Eurovoc browser window
18
AIDE – activities
  investigate and develop algorithms in the field of computational linguistics/language technologies
  include that knowledge into the Computer Aided Document Indexing System (CADIS)
  demonstration of CADIS in the European Parliament (2006-03-10)
  ca 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006
  joint project proposal with Katholieke Universiteit Leuven for the CADIAL project
19
CADIAL project
  Computer Aided Document Indexing for Accessing Legislation
  a joint Flemish-Croatian project
    Department International Flanders, grant no. KRO/009/06
  partners:
    Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
    University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)
  started: 2007-03
  duration: 2 years
  web: www.cadial.org
  the goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
  new version of CADIS (eCADIS) is one of the modules in this project
    planned as a web-based service
20
CADIAL project 2
  used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian
  used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English
  included that training data into the next version: eCADIS ( -version)
21
eCADIS ( ) features
  Automatic suggestion of relevant descriptors, i.e. automatic indexing
    application of machine learning techniques
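One simple way to turn manually indexed documents into descriptor suggestions is to train one independent binary classifier per descriptor. The sketch below uses a centroid (Rocchio-style) classifier because it fits in a few lines; it is not the eCADIS code, and the function names, the decision threshold of zero and the training data are illustrative assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def vectorize(text):
    """Length-normalised bag-of-words vector as a {token: weight} dict."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def centroid(vectors):
    """Average of sparse vectors; empty dict when there are none."""
    total = defaultdict(float)
    for vec in vectors:
        for token, weight in vec.items():
            total[token] += weight
    return {t: w / len(vectors) for t, w in total.items()} if vectors else {}

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def train(documents):
    """documents: list of (text, set_of_descriptors). One model per descriptor."""
    vectors = [(vectorize(text), labels) for text, labels in documents]
    all_labels = set().union(*(labels for _, labels in vectors))
    models = {}
    for label in all_labels:
        positive = centroid([v for v, ls in vectors if label in ls])
        negative = centroid([v for v, ls in vectors if label not in ls])
        models[label] = (positive, negative)
    return models

def suggest(models, text):
    """Descriptors whose positive centroid is closer than the negative one."""
    vec = vectorize(text)
    scores = {label: dot(vec, pos) - dot(vec, neg)
              for label, (pos, neg) in models.items()}
    accepted = [label for label, score in scores.items() if score > 0]
    return sorted(accepted, key=lambda label: -scores[label])

TRAINING = [
    ("fishing quotas for the adriatic sea", {"fisheries"}),
    ("fishing vessels and fishing ports", {"fisheries"}),
    ("customs tariff on imported goods", {"customs"}),
    ("customs duties and import tariff", {"customs"}),
]
models = train(TRAINING)
```

Because each descriptor is decided independently, a document can receive several suggestions or none.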
22
eCADIS ( ) features
  Compare it to manually attached indexes…
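Comparing suggested descriptors with the manually attached ones can be summarised with set overlap, precision and recall. The function and the descriptor names below are illustrative; the slides show the comparison only as a screenshot.

```python
def compare_indexing(suggested, manual):
    """Overlap between automatically suggested and manually attached descriptors."""
    suggested, manual = set(suggested), set(manual)
    agreed = suggested & manual
    precision = len(agreed) / len(suggested) if suggested else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "agreed": sorted(agreed),
        "only_suggested": sorted(suggested - manual),  # candidates to reject
        "only_manual": sorted(manual - suggested),     # missed by the system
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

report = compare_indexing(
    suggested={"fisheries", "customs", "trade policy"},
    manual={"fisheries", "trade policy", "EU law", "tariff"},
)
```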
23
eCADIS ( ) features
  Manual marking of inappropriate suggestions
    another step in further refinement of automatic indexing
24
eCADIS ( ) on document in English
25
eCADIS ( ) on document in English
  Automatic suggestion of relevant descriptors, i.e. automatic indexing
26
eCADIS ( ) on document in English
  Compare it to manually attached indexes…
27
Training the classifiers
  already existing classifiers
    profile classifier (Steinberger 2003)
    K-nearest neighbours
    binary classifiers
      SVM, Logistic Regression, Rocchio, Bayes, …
  classifiers used for the preliminary training
    ca 3500 independent binary classifiers
    need to be further evaluated
    Logistic Regression used for 10,000 documents in Croatian
    SVM used for 20,000 documents in English
  features
    tokens, lemmas, stems, character n-grams
    various feature selection methods and their combinations: χ², IG, MI…
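Of the feature selection methods named on this slide, the chi-squared statistic is easy to show for a single binary classifier: each term is scored against one descriptor from a 2x2 document-count table. This is a sketch with token features and toy data; how eCADIS combines the methods is not described in the slides.

```python
import re

def chi_squared_scores(documents, label):
    """Chi-squared score of every term with respect to one descriptor.

    documents: list of (text, set_of_descriptors).
    """
    n = len(documents)
    doc_terms = [(set(re.findall(r"\w+", text.lower())), label in labels)
                 for text, labels in documents]
    n_pos = sum(1 for _, is_pos in doc_terms if is_pos)
    vocabulary = set().union(*(terms for terms, _ in doc_terms))
    scores = {}
    for term in vocabulary:
        a = sum(1 for terms, is_pos in doc_terms if term in terms and is_pos)
        b = sum(1 for terms, is_pos in doc_terms if term in terms and not is_pos)
        c = n_pos - a          # descriptor present, term absent
        d = n - n_pos - b      # neither descriptor nor term
        denominator = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denominator if denominator else 0.0
    return scores

TRAINING = [
    ("fishing quotas for the adriatic sea", {"fisheries"}),
    ("fishing vessels and fishing ports", {"fisheries"}),
    ("customs tariff on imported goods", {"customs"}),
    ("customs duties and import tariff", {"customs"}),
]
scores = chi_squared_scores(TRAINING, "fisheries")
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

Note that the statistic is symmetric: "customs" scores as high as "fishing" for the fisheries classifier, because a term that reliably signals absence of the descriptor is also informative.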
28
Further development of eCADIS
  training with new features and feature selection methods
    collocations, word n-grams, chunks
  new measures for evaluation of results
    sensitive to thesaurus hierarchy
  web-interface for eCADIS
    for inclusion into the CADIAL system
  eCADIS for other languages
    now only Croatian and English ( -version) covered
    usable for other languages as it is, but without the linguistic module
      less efficient
      no list of lemmas, but types
      poor statistics for n-grams
    cooperation with language technology experts in different languages for development of linguistic modules
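The slide does not define the hierarchy-sensitive evaluation measures. As one hypothetical example of the idea, the sketch below gives full credit for an exact descriptor match and partial credit when the suggested descriptor is the direct broader or narrower term of a manual one. The single-parent `parent_of` mapping, the 0.5 weight and the descriptor names are all assumptions for illustration.

```python
def hierarchical_precision_recall(predicted, gold, parent_of, partial=0.5):
    """Precision and recall with partial credit for parent/child near misses.

    parent_of maps a descriptor to its broader term (simplified to one parent).
    """
    def related(a, b):
        return parent_of.get(a) == b or parent_of.get(b) == a

    def credit(item, others):
        if item in others:
            return 1.0
        if any(related(item, other) for other in others):
            return partial
        return 0.0

    predicted, gold = set(predicted), set(gold)
    precision = (sum(credit(p, gold) for p in predicted) / len(predicted)
                 if predicted else 0.0)
    recall = (sum(credit(g, predicted) for g in gold) / len(gold)
              if gold else 0.0)
    return precision, recall

parent_of = {
    "sea fishing": "fisheries",
    "fishing fleet": "fisheries",
    "customs tariff": "customs",
}
result = hierarchical_precision_recall(
    predicted={"fisheries", "customs tariff"},
    gold={"sea fishing", "customs tariff"},
    parent_of=parent_of,
)
```

With plain set-based scoring the same prediction would get 0.5 for both measures; here the near miss "fisheries" for "sea fishing" raises both to 0.75.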
29
Further development of eCADIS …
  eCADIS for other languages
    training the automatic indexing system for other languages
      enables automatic suggestions of relevant descriptors in new, unseen documents
  analysis of manual markings
    descriptors, word n-grams, suggestions
  promote the use of eCADIS in other countries
    beyond the scope of the CADIAL project
    e.g. Belgium (Flanders)
      linguistic module for Dutch and French needed
        computational linguistics expertise
      training data from Acquis can be used to make an automatic indexing system for Dutch and French
        machine learning expertise
30
Conclusion
  CADIAL
    a joint Flemish-Croatian project sponsored by the Flemish government
    better public access to Croatian official documentation
    faster and improved document indexing
    automatic content metadata generation (Semantic Web)
    easier document retrieval and exploration of legislation
    multilingual access via standardized EU thesaurus Eurovoc
    a test-case for the usage of such a system in Flanders
  Web information on CADIAL project and eCADIS
    www.cadial.org
  contact:
    bojana.dalbelo@fer.hr
    marie-france.moens@law.kuleuven.ac.be