CADIAL search engine at INEX
Jure Mijić (1), Marie-Francine Moens (2), Bojana Dalbelo Bašić (1)
(1) Faculty of Electrical Engineering and Computing, jure.mijic@fer.hr, bojana.dalbelo@fer.hr
(2) Department of Computer Science, Katholieke Universiteit Leuven, sien.moens@cs.kuleuven.be
INEX 2008, Schloss Dagstuhl Conference Center, Wadern, Germany, 2008-12-16
ITI2008 Cavtat 2008-06-25

Presentation overview
  What is the CADIAL project?
  System overview
  Ranking model
  Ad hoc results
  Conclusion
  Future work

What is the CADIAL project?
  Bilateral project between the Government of Flanders and the Ministry of Science, Education and Sports of the Republic of Croatia
  Aims of the CADIAL project:
    Provide access to a collection of Croatian legislative documents
    Enable the use of the Eurovoc thesaurus, an EU standard thesaurus for document indexing and retrieval

System overview
  Built with expandability in mind
    Supports multiple information retrieval models
    Supports morphological normalization modules
  An indexer tool is used for document indexing
    Input documents are in XML format
    Output is an index database (a base structure for every search engine model)
    Index database is upgraded with additional data required by the model (various statistical information)

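The indexing step described above (XML input, an index database as output, plus statistics needed by the ranking model) can be sketched as follows. This is a minimal illustration and not the CADIAL indexer: the function `index_document`, the in-memory dictionaries standing in for the index database, and the choice of statistics (element length and depth) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def index_document(doc_id, xml_text, index, stats):
    """Add every element of one XML document to a toy inverted index.

    index: term -> {(doc_id, element_path): term frequency}
    stats: per-element length and depth (hypothetical model statistics)
    """
    root = ET.fromstring(xml_text)

    def walk(elem, path, depth):
        # Text of the element including the text of all its descendants.
        tokens = " ".join(elem.itertext()).lower().split()
        key = (doc_id, path)
        stats["length"][key] = len(tokens)
        stats["depth"][key] = depth
        for term, tf in Counter(tokens).items():
            index[term][key] = tf
        # Number same-named siblings so that every element path is unique.
        seen = Counter()
        for child in elem:
            seen[child.tag] += 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]", depth + 1)

    walk(root, f"/{root.tag}[1]", 0)


index = defaultdict(dict)
stats = {"length": {}, "depth": {}}
index_document(
    "d1",
    "<article><title>Search engine</title><sec><p>XML search</p></sec></article>",
    index,
    stats,
)
for key, tf in sorted(index["search"].items()):
    print(key, tf)
```

Indexing every element with its full subtree text keeps the sketch short at the cost of storing each token once per ancestor; a real index database would likely store this more compactly.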
Ranking model
  Language model
    Element priors based on element location and depth
    Smoothing on document and collection level
  Additional features
    Support for CAS queries
    Support for +/- keyword operators
    Simple overlapping element removal
    Stemming

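One simple way to remove overlapping elements from a ranked list is a greedy pass that keeps an element only if no ancestor or descendant from the same document has already been kept. The sketch below assumes that strategy; the slides do not say which rule the system uses, and the function names and the path notation (`/article[1]/sec[1]`) are hypothetical.

```python
def is_ancestor(a, b):
    """True if element path a is a proper ancestor of element path b."""
    return b.startswith(a + "/")


def remove_overlap(ranked):
    """Greedy overlap removal.

    ranked: list of (doc_id, element_path, score), best result first.
    Returns the list without elements that overlap a better-ranked one.
    """
    kept = []
    for doc_id, path, score in ranked:
        overlaps = any(
            d == doc_id
            and (p == path or is_ancestor(p, path) or is_ancestor(path, p))
            for d, p, _ in kept
        )
        if not overlaps:
            kept.append((doc_id, path, score))
    return kept


ranked = [
    ("d1", "/article[1]/sec[1]/p[2]", 0.9),
    ("d1", "/article[1]/sec[1]", 0.8),    # ancestor of the first result
    ("d1", "/article[1]/sec[10]", 0.7),   # sibling section, no overlap
    ("d2", "/article[1]/sec[1]", 0.6),    # different document
]
for result in remove_overlap(ranked):
    print(result)
```

Comparing against `a + "/"` rather than `a` alone prevents `sec[1]` from being mistaken for an ancestor of `sec[10]`.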
Ad hoc results
  Our runs:
    Three CO runs
      One returning only documents
      Two returning elements
    Three CAS runs with various smoothing factors

  No.  Run                  iP[0.00]  iP[0.01]  iP[0.05]  iP[0.10]  MAiP
  1    co-document-lc6      0.6389    0.5949    0.5051    0.4699    0.2551
  2    cas-element-ld5-lc4  0.6684    0.5530    0.4048    0.3248    0.1440
  3    co-element-ld2-lc5   0.6907    0.5417    0.4007    0.2920    0.0994
  4    co-element-ld2-lc1   0.6718    0.5241    0.3922    0.2963    0.0929
  5    cas-element-ld2-lc5  0.6494    0.5203    0.3569    0.2593    0.1134
  6    cas-element-ld1-lc6  0.6642    0.5063    0.3652    0.2610    0.1133

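For readers unfamiliar with the column headers: iP[x] is interpolated precision at recall level x, and MAiP is the mean of iP over 101 recall levels (0.00, 0.01, ..., 1.00). The sketch below computes both for a single topic, assuming precision and recall are measured in characters of relevant text, as in the INEX focused retrieval measures. The input format and function names are hypothetical, every result is assumed to retrieve at least one character, and the reported figures are averages over all topics, which this sketch does not do.

```python
def interpolated_precision(results, total_relevant, x):
    """iP[x]: highest precision at any rank whose recall is at least x.

    results: ranked list of (retrieved_chars, relevant_chars) pairs
    total_relevant: number of relevant characters for the topic
    """
    best = 0.0
    retrieved = relevant = 0
    for size, rel in results:
        retrieved += size
        relevant += rel
        precision = relevant / retrieved
        recall = relevant / total_relevant
        if recall >= x:
            best = max(best, precision)
    return best


def maip(results, total_relevant):
    """Mean of iP[x] over the 101 recall levels 0.00, 0.01, ..., 1.00."""
    levels = [i / 100 for i in range(101)]
    total = sum(interpolated_precision(results, total_relevant, x) for x in levels)
    return total / len(levels)


# Toy topic: 400 relevant characters exist, three results are returned.
results = [(100, 100), (100, 0), (200, 100)]
print(interpolated_precision(results, 400, 0.00))
print(interpolated_precision(results, 400, 0.50))
print(maip(results, 400))
```

Because MAiP averages over all recall levels, a run that returns whole documents can score well on it even when its early precision is lower than that of an element run.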
Conclusion
  Retrieving whole documents performed better than element retrieval at higher levels of recall
  CAS queries performed slightly better than CO queries
  Higher smoothing at the document level contributed to better performance

Future work
  Other smoothing techniques
  Pseudo relevance feedback
  Incorporating link evidence
  Information extraction methods

The End
  Thank you

Language model
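A language model with smoothing at the document and collection level can be sketched as a linear interpolation of three unigram models (element, document, collection), multiplied by an element prior. This is one common formulation and an assumption here, not necessarily the exact formula used in the system. The function `score_element`, the weights `lambda_doc` and `lambda_coll`, and the `prior` argument are hypothetical names chosen for the example.

```python
import math
from collections import Counter


def score_element(query, elem_tokens, doc_tokens, coll_counts, coll_len,
                  lambda_doc=0.2, lambda_coll=0.5, prior=1.0):
    """Log-score of one element for a query under a smoothed unigram model.

    P(t) = l_elem * P(t|element) + l_doc * P(t|document) + l_coll * P(t|collection)
    with l_elem = 1 - l_doc - l_coll (assumed interpolation scheme).
    prior stands in for an element prior, e.g. one based on location and depth.
    """
    elem_counts, doc_counts = Counter(elem_tokens), Counter(doc_tokens)
    lambda_elem = 1.0 - lambda_doc - lambda_coll
    score = math.log(prior)
    for term in query:
        p_elem = elem_counts[term] / len(elem_tokens) if elem_tokens else 0.0
        p_doc = doc_counts[term] / len(doc_tokens) if doc_tokens else 0.0
        p_coll = coll_counts.get(term, 0) / coll_len
        p = lambda_elem * p_elem + lambda_doc * p_doc + lambda_coll * p_coll
        if p == 0.0:
            return float("-inf")  # term does not occur anywhere in the collection
        score += math.log(p)
    return score


doc = "xml search engine for legal documents".split()
elem_a = "xml search engine".split()
elem_b = "legal documents".split()
coll_counts = Counter(doc + "legal text retrieval".split())
coll_len = sum(coll_counts.values())

query = ["xml", "search"]
score_a = score_element(query, elem_a, doc, coll_counts, coll_len)
score_b = score_element(query, elem_b, doc, coll_counts, coll_len)
print(score_a > score_b)
```

Thanks to the document and collection terms, `elem_b` still receives a finite score even though it contains none of the query terms; raising `lambda_doc` pulls the scores of elements from the same document closer together.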