1 Computer Aided Document Indexing System for Accessing Legislation: A Joint Venture of Flanders and Croatia
Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb
Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven
Leuven,

2 Talk overview
document indexing and computer aided document indexing
project AIDE
CADIS workstation: features
project CADIAL
eCADIS workstation: additional features
machine learning techniques
future developments
conclusions

3 Computer Aided Document Indexing
attachment of descriptors from a controlled thesaurus to a document
descriptors = labels representing the content of a document
necessary for document retrieval in many document collections: parliamentary documentation, legislation, technical documentation
usually done manually: tedious, error-prone, slow (max documents/day)
could computers be of any help in this process?
if we build a Computer Aided Document Indexing System (CADIS)

4 Project AIDE in Croatia
idea for a project: September 2004
interdisciplinary collaboration of 3 institutions:
Croatian Information Documentation Referral Agency (HIDRA)
Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

5 AIDE – collaborating institutions
HIDRA: collecting, processing, providing public access to, and promoting the official documentation of the Republic of Croatia; coordinator Maja Cvitaš, M.A.
ZEMRIS: research in the field of artificial intelligence, neural networks, machine learning, data and text mining; coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.
ZZL: computational linguistic research and building language technologies for Croatian; coordinator prof. Marko Tadić

6 AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

7 AIDE – how?
AIDE = Automatic Indexing of Documents with Eurovoc
automatic indexing, how? a program which “learns to index” documents
conference at the Joint Research Centre of the EC (JRC), Ispra, Italy
at least 10,000 manually indexed documents; 3-5 descriptors per document; 10-15 documents per descriptor; indexed documents stored in XML format; Steinberger (2003)
compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors
situation with Croatian documentation: there were only a few hundred documents indexed
manual indexing: painfully slow
how could we speed up the manual indexing?

8 AIDE – activities
investigate and develop algorithms in the field of computational linguistics/language technologies
include that knowledge into the Computer Aided Document Indexing System (CADIS)
demonstration of CADIS in the European Parliament ( )

9 CADIS: two parallel windows
Eurovoc browser window
Document window

10 Document Window

12 CADIS features
Enhanced user interface
list of descriptors literally appearing in the document
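
As a rough illustration of how a list of literally appearing descriptors could be produced, the sketch below matches descriptor labels verbatim against the document text. It is not the CADIS implementation: the function name `find_literal_descriptors` and the sample labels and text are assumptions made for this example, and plain string matching will miss inflected forms, which matters for a language like Croatian.

```python
import re


def find_literal_descriptors(text, descriptors):
    """Return the descriptor labels that occur verbatim in text.

    Matching is case-insensitive and respects word boundaries, so the
    label "tax" does not match inside "taxonomy".
    """
    lowered = text.lower()
    found = []
    for label in descriptors:
        # Lookarounds act as word boundaries that also work for multi-word labels.
        pattern = r"(?<!\w)" + re.escape(label.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append(label)
    return found


# Hypothetical labels and text, not taken from Eurovoc or from the slides.
sample_descriptors = ["fishing industry", "tax", "customs duties"]
sample_text = "The act regulates customs duties and the taxonomy of the fishing industry."
matches = find_literal_descriptors(sample_text, sample_descriptors)
print(matches)
```

Matching on lemmas instead of surface forms would be the natural next step where a linguistic module is available.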

13 CADIS features
Descriptors and non-descriptors marked in document

14 CADIS features
Lists of n-grams
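
A minimal sketch of how n-gram lists for a document might be built. The whitespace tokenisation and the names `word_ngrams` and `ngram_counts` are assumptions for illustration only; the slides do not say how CADIS tokenises text or which n-gram lengths it lists.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count every word n-gram of length 1..max_n in text."""
    tokens = text.lower().split()  # naive tokenisation, assumed for the sketch
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts


counts = ngram_counts("the official documentation of the official gazette", max_n=2)
# Most frequent n-grams first, as a user-facing list might show them.
for ngram, freq in counts.most_common(3):
    print(" ".join(ngram), freq)
```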

15 CADIS features
Integration of corpus analysis
greyed n-grams are statistically relevant in the corpus, i.e. collocations
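
The slides do not name the statistic used to decide which n-grams are relevant in the corpus. One common association measure is pointwise mutual information (PMI); the sketch below scores bigrams with it, purely as an assumed stand-in. The function name `bigram_pmi` and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score bigrams by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ). Bigrams seen fewer than
    min_count times are skipped, since PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


corpus_tokens = "european union law and european union policy and other law".split()
pmi_scores = bigram_pmi(corpus_tokens)
print(pmi_scores)
```

A positive score means the two words co-occur more often than chance would predict, which is the intuition behind treating the pair as a collocation.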

16 CADIS features
Manual marking of significant n-grams
important step towards further refinement of automatic indexing

17 Eurovoc browser window

18 AIDE – activities
investigate and develop algorithms in the field of computational linguistics/language technologies
include that knowledge into the Computer Aided Document Indexing System (CADIS)
demonstration of CADIS in the European Parliament ( )
ca 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006
joint project proposal with Katholieke Universiteit Leuven for the CADIAL project

19 CADIAL project
Computer Aided Document Indexing for Accessing Legislation
a joint Flemish-Croatian project
Department International Flanders, grant no. KRO/009/06
partners: Katholieke Universiteit Leuven (prof. Marie-Francine Moens); University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)
started:
duration: 2 years
web:
the goal: a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
the new version of CADIS (eCADIS) is one of the modules in this project
planned as a web-based service

20 CADIAL project 2
used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian
used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English
included that training data into the next version: eCADIS (-version)

21 eCADIS () features
Automatic suggestion of relevant descriptors, i.e. automatic indexing
application of machine learning techniques

22 eCADIS () features
Compare it to manually attached indexes…

23 eCADIS () features
Manual marking of inappropriate suggestions
another step in further refinement of automatic indexing

24 eCADIS () on document in English

25 eCADIS () on document in English
Automatic suggestion of relevant descriptors, i.e. automatic indexing

26 eCADIS () on document in English
Compare it to manually attached indexes…

27 Training the classifiers
already existing classifiers: profile classifier (Steinberger 2003); K-nearest neighbours; binary classifiers (SVM, Logistic Regression, Rocchio, Bayes, …)
classifiers used for the preliminary training: ca 3500 independent binary classifiers
need to be further evaluated
Logistic Regression used for 10,000 documents in Croatian
SVM used for 20,000 documents in English
features: tokens, lemmas, stems, character n-grams
various feature selection methods and their combinations: χ², ig, mi…
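
To make the feature selection step concrete, the sketch below computes the chi-square statistic between a term and a single descriptor, which is how a feature ranking for one of the independent binary classifiers could be obtained. The function names, the document representation (a set of terms plus a set of descriptors) and the toy data are assumptions for illustration, not the project's actual pipeline.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/descriptor contingency table.

    n11: documents containing the term and carrying the descriptor
    n10: documents containing the term, without the descriptor
    n01: documents without the term, carrying the descriptor
    n00: documents with neither
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # term or descriptor is constant across the collection
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_terms(docs, descriptor):
    """Rank terms by chi-square for one descriptor, highest score first.

    docs is a list of (set_of_terms, set_of_descriptors) pairs.
    """
    vocab = set()
    for terms, _ in docs:
        vocab |= terms
    scores = {}
    for term in vocab:
        n11 = n10 = n01 = n00 = 0
        for terms, labels in docs:
            has_term = term in terms
            has_label = descriptor in labels
            if has_term and has_label:
                n11 += 1
            elif has_term:
                n10 += 1
            elif has_label:
                n01 += 1
            else:
                n00 += 1
        scores[term] = chi_square(n11, n10, n01, n00)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy collection with made-up terms and descriptor names.
toy_docs = [
    ({"fish", "quota", "law"}, {"fishing"}),
    ({"fish", "vessel", "law"}, {"fishing"}),
    ({"tax", "income", "law"}, {"taxation"}),
    ({"tax", "rate", "law"}, {"taxation"}),
]
ranking = rank_terms(toy_docs, "fishing")
print(ranking[:3])
```

Note that chi-square is symmetric: a term that reliably signals the absence of a descriptor ("tax" in the toy data) scores as high as one that signals its presence, and a term occurring in every document ("law") scores zero.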

28 Further development of eCADIS
training with new features and feature selection methods: collocations, word n-grams, chunks
new measures for evaluation of results: sensitive to thesaurus hierarchy
web-interface for eCADIS: for inclusion into the CADIAL system
eCADIS for other languages
now only Croatian and English (-version) covered
usable for other languages as it is, but without the linguistic module less efficient: no list of lemmas, but types; poor statistics for n-grams
cooperation with language technology experts in different languages for development of linguistic modules
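
The slides do not define the planned hierarchy-sensitive measure. One possible reading is a precision score that gives partial credit when a suggested descriptor is the direct parent or child of a manually assigned one. The sketch below shows that idea only; the function name, the `parent` mapping, the 0.5 weight and the toy descriptor names are all assumptions.

```python
def hierarchy_aware_precision(suggested, gold, parent, partial=0.5):
    """Precision with partial credit for near misses in the thesaurus.

    suggested: list of descriptors proposed by the system
    gold:      set of manually assigned descriptors
    parent:    dict mapping a descriptor to its broader term
    partial:   credit for a suggestion that is the direct parent or
               child of a gold descriptor (assumed value)
    """
    if not suggested:
        return 0.0
    credit = 0.0
    for d in suggested:
        if d in gold:
            credit += 1.0
        elif parent.get(d) in gold or any(parent.get(g) == d for g in gold):
            credit += partial
    return credit / len(suggested)


# Toy hierarchy, not actual Eurovoc entries.
toy_parent = {"sea fishing": "fishing", "fishing": "fisheries"}
score = hierarchy_aware_precision(["sea fishing", "taxation"], {"fishing"}, toy_parent)
print(score)
```

Under plain precision both suggestions in the example would count as wrong; here the narrower term "sea fishing" earns half credit because its broader term was assigned manually.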

29 Further development of eCADIS
… eCADIS for other languages
training the automatic indexing system for other languages enables automatic suggestions of relevant descriptors in new, unseen documents
analysis of manual markings: descriptors, word n-grams, suggestions
promote the use of eCADIS in other countries beyond the scope of the CADIAL project, e.g. Belgium (Flanders)
linguistic module for Dutch and French needed: computational linguistics expertise
training data from Acquis can be used to make an automatic indexing system for Dutch and French: machine learning expertise

30 Conclusion
CADIAL: a joint Flemish-Croatian project sponsored by the Flemish government
better public access to Croatian official documentation
faster and improved document indexing
automatic content metadata generation (Semantic Web)
easier document retrieval and exploration of legislation
multilingual access via the standardized EU thesaurus Eurovoc
a test-case for the usage of such a system in Flanders
Web information on the CADIAL project and eCADIS
contact:

