Bruxelles, Computer Aided Document Indexing System (CADIS) with Eurovoc
  Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb
  Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb
Project AIDE
  idea for a project: September 2004, conference at JRC, Ispra
  interdisciplinary collaboration of 3 institutions:
    Croatian Information Documentation Referral Agency (HIDRA)
    Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
    Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
AIDE – collaborating institutions
  HIDRA
    collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia
    coordinator: Maja Cvitaš, M.A.
  ZEMRIS
    research in artificial intelligence, neural networks, machine learning, data and text mining
    coordinators: prof. Bojana Dalbelo Bašić and Jan Šnajder
  ZZL
    computational linguistics research and building language technologies for Croatian
    coordinator: prof. Marko Tadić
AIDE – project objective
  Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
AIDE – how?
  automatic indexing, how?
    a program which “learns to index”
    Joint Research Centre of the EC (JRC), Ispra, Italy
      at least 10,000 manually indexed documents
      3-5 descriptors per document
      documents per descriptor
      indexed documents stored in XML format
      Steinberger (2003)
  compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors
  situation with Croatian documentation:
    only a few hundred documents were indexed
    manual indexing: painfully slow
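A minimal sketch of how the corpus figures on this slide (descriptors per document, documents per descriptor) could be checked on an XML corpus of manually indexed documents. The slide does not give the XML schema, so the element names `corpus`, `document`, `descriptor` and `text`, the sample descriptor strings, and the function `corpus_stats` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical corpus layout; the real schema is not given in the slides.
SAMPLE = """<corpus>
  <document id="d1">
    <descriptor>fisheries policy</descriptor>
    <descriptor>customs tariff</descriptor>
    <descriptor>import</descriptor>
    <text>placeholder text of the first document</text>
  </document>
  <document id="d2">
    <descriptor>import</descriptor>
    <descriptor>export</descriptor>
    <descriptor>customs tariff</descriptor>
    <text>placeholder text of the second document</text>
  </document>
</corpus>"""


def corpus_stats(xml_string):
    """Return (descriptors per document, number of documents per descriptor)."""
    root = ET.fromstring(xml_string)
    per_doc = {}
    docs_per_descriptor = Counter()
    for doc in root.iter("document"):
        labels = [d.text.strip() for d in doc.findall("descriptor") if d.text]
        per_doc[doc.get("id")] = len(labels)
        # Count each descriptor once per document, even if it is listed twice.
        docs_per_descriptor.update(set(labels))
    return per_doc, docs_per_descriptor


per_doc, docs_per_descriptor = corpus_stats(SAMPLE)
```

Statistics of this kind show whether a corpus has enough documents, and enough examples per descriptor, before any learning is attempted.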
AIDE – how?
  how could we speed up the manual indexing?
  plan:
    develop a workstation for computer aided document indexing
    conduct research and development of algorithms in computational linguistics / language technologies
    build that knowledge into the workstation and turn it into the Computer Aided Document Indexing System (CADIS)
CADIS: two windows
  Document window
  Eurovoc browser window
Document window
CADIS features
  Enhanced user interface
    list of descriptors appearing in the document
CADIS features
  Descriptors and non-descriptors marked in the document
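A rough sketch of what marking thesaurus terms in a document could look like. It uses plain case-insensitive string matching, which is an assumption: the slides mention a linguistic module, and an inflected language such as Croatian would need lemmatisation that this sketch does not attempt. The function name `mark_terms`, the `[D:...]` / `[ND:...]` tags, and the sample term lists are hypothetical.

```python
import re


def mark_terms(text, descriptors, non_descriptors):
    """Wrap descriptor matches as [D:...] and non-descriptor matches as [ND:...]."""
    kinds = {t.lower(): "D" for t in descriptors}
    kinds.update({t.lower(): "ND" for t in non_descriptors})
    if not kinds:
        return text
    # Try longer terms first so multi-word terms win over their parts.
    terms = sorted(kinds, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f"[{kinds[m.group(0).lower()]}:{m.group(0)}]", text
    )


marked = mark_terms(
    "The import of fish is regulated by the customs tariff.",
    descriptors=["import", "customs tariff"],   # illustrative term lists,
    non_descriptors=["fish"],                   # not taken from Eurovoc
)
```

In a real thesaurus each non-descriptor points to its preferred descriptor, so a marked non-descriptor could also be used to suggest that descriptor to the indexer.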
CADIS features
  Lists of n-grams
CADIS features
  Integration of corpus analysis
    greyed n-grams are statistically relevant in the corpus
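A small sketch of listing a document's n-grams and ranking them against a corpus. The slides do not say which statistic CADIS uses to decide that an n-gram is relevant; the TF-IDF-style score below is a stand-in chosen for illustration, and the names `ngrams` and `relevant_ngrams` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def relevant_ngrams(doc, corpus, n=2, top=5):
    """Rank the document's n-grams by frequency times a smoothed IDF weight."""
    tf = Counter(ngrams(doc, n))
    df = Counter()
    for other in corpus:
        # Document frequency: count each n-gram once per corpus document.
        df.update(set(ngrams(other, n)))
    total = len(corpus)
    scores = {
        g: count * math.log((1 + total) / (1 + df[g]))
        for g, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


corpus = [
    "the law on customs tariff",
    "the law on fisheries",
    "the law on import",
]
doc = "the law on customs tariff amends the customs tariff"
ranked = relevant_ngrams(doc, corpus, n=2, top=10)
```

With this weighting, n-grams that occur in every corpus document (such as "the law" here) score zero, while n-grams frequent in the document but rare in the corpus rise to the top.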
CADIS features
  Manual marking of significant n-grams: an important step towards automatic indexing
Eurovoc browser window
Further development
  CADIS for other languages?
    already available for Croatian and English
    usable for other languages without the linguistic module
    cooperation needed with the respective language technology experts to develop a linguistic module for other languages
    partners for EU project proposals
  the next step: AIDE
    research on machine learning and text mining
    use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc
    establish a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
Conclusion
  CADIS is unique in Europe
  Web info at:
    HIDRA:
    ZEMRIS: textmining.zemris.fer.hr
  for download contact: