Similarity Measures for Query Expansion in TopX; Caroline Gherbaoui; Universität des Saarlandes, Naturwissenschaftlich-Technische Fak. I, Fachrichtung Informatik; Max-Planck-Institut für Informatik, AG 5 - Datenbanken und Informationssysteme, Prof. Dr. Gerhard Weikum
Overview: background knowledge; similarity measures for query expansion; evaluation of the computed similarity values; changes in TopX; conclusion
Background: top-k query processing (returns the k most relevant results); query expansion (extends the terms of the source query); word sense disambiguation (identifies the correct meaning of a term); ontology (a set of terms with their meanings and semantic relations)
Word Sense Disambiguation: query „java, coffee“; candidate senses of „java“: „island“, „coffee“, „programming language“, …
Query Expansion: „COFFEE“ is expanded with „drink, espresso“
TopX: top-k retrieval engine; text and XML data; word sense disambiguation; query expansion; ontology
TopX – WordNet Ontology: lexicon for the English language; hierarchical relations; one relation, one direction; ~160,000 words; ~120,000 synsets; ~210,000 relations
TopX – YAGO Ontology: built from Wikipedia and WordNet; hierarchical and non-hierarchical relations; one relation, two directions; ~2,100,000 words; ~2,200,000 concepts; ~6,000,000 relations
Similarity Measures: Dice similarity (the measure already used in TopX); NAGA similarity (the measure applied for YAGO); best WordNet similarity (the measure with the best results among the WordNet measures)
Dice Similarity Measure: measures the intersection of two regions
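As an illustration of the intersection idea, here is a minimal sketch of the Dice coefficient in its standard set form, 2|A ∩ B| / (|A| + |B|). It assumes that the two "regions" are sets of document identifiers in which each term occurs; the slide does not define the regions, so this reading, the function name dice_similarity, and the sample data are hypothetical.

```python
def dice_similarity(a, b):
    """Dice coefficient of two collections: 2 * |A ∩ B| / (|A| + |B|).

    Assumption: each argument is the set of document ids containing a term.
    """
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty regions share nothing; treat the similarity as 0.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical document ids in which each term occurs.
docs_java = {1, 2, 3, 4}
docs_coffee = {3, 4, 5, 6}

# Two shared documents out of 4 + 4 entries: 2 * 2 / 8 = 0.5
score = dice_similarity(docs_java, docs_coffee)
```

The value ranges from 0 (disjoint regions) to 1 (identical regions), which makes it directly usable as a weight for expansion terms.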
NAGA Similarity Measure: combines the confidence and the informativeness of a relation
Best WordNet Similarity Measure: product of the transfer function of the path length and the transfer function of the concept depth
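The product of two transfer functions can be sketched as follows. The slide does not give the concrete functions, so this sketch assumes one formulation known from the WordNet similarity literature: exponential decay in the path length and a hyperbolic tangent of the concept depth. The function names and the parameter values alpha and beta are illustrative assumptions, not values taken from the presentation.

```python
import math


def path_transfer(path_length, alpha=0.2):
    """Assumed transfer function of the path length: decays from 1 towards 0."""
    return math.exp(-alpha * path_length)


def depth_transfer(depth, beta=0.6):
    """Assumed transfer function of the concept depth: grows from 0 towards 1."""
    return math.tanh(beta * depth)


def wordnet_similarity(path_length, depth, alpha=0.2, beta=0.6):
    """Product of the two transfer functions.

    path_length: number of edges on the shortest path between two concepts.
    depth: depth of their deepest common ancestor in the hierarchy.
    """
    return path_transfer(path_length, alpha) * depth_transfer(depth, beta)


# Concepts that are close together and deep in the hierarchy score higher
# than concepts that are far apart or only share a very general ancestor.
close_and_specific = wordnet_similarity(path_length=2, depth=8)
far_and_general = wordnet_similarity(path_length=10, depth=1)
```

Under these assumptions the score stays in [0, 1): a short path raises it, and a deeper (more specific) common ancestor raises it as well.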
Evaluation
Dice measure: also applicable to the YAGO ontology; NAGA measure: applicable if the forward direction is omitted; best WordNet measure: not applicable due to the density of YAGO
Changes for TopX: tuning of some procedures (Dijkstra algorithm, word sense disambiguation, query expansion); extension of the configuration file
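To show how a Dijkstra-style search can drive query expansion over an ontology graph, here is a minimal sketch. It assumes that every edge carries a similarity in (0, 1] and that the similarity of a path is the product of its edge similarities; the slide states neither, and this is not the TopX implementation. The graph, the function name best_path_similarities, and the threshold min_sim are hypothetical.

```python
import heapq


def best_path_similarities(graph, source, min_sim=0.0):
    """Best (maximum-product) path similarity from source to each reachable node.

    graph: dict mapping a node to a list of (neighbor, edge_similarity) pairs,
           with edge_similarity in (0, 1].
    min_sim: candidates whose path similarity falls below this are pruned.
    """
    best = {source: 1.0}
    # heapq is a min-heap, so similarities are stored negated.
    heap = [(-1.0, source)]
    while heap:
        neg_sim, node = heapq.heappop(heap)
        sim = -neg_sim
        if sim < best.get(node, 0.0):
            continue  # stale entry: a better path was already found
        for neighbor, edge_sim in graph.get(node, []):
            candidate = sim * edge_sim
            if candidate >= min_sim and candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return best


# Hypothetical ontology fragment around the term "coffee".
ontology = {
    "coffee": [("drink", 0.8), ("espresso", 0.9)],
    "espresso": [("drink", 0.95)],
    "drink": [("beverage", 0.5)],
}

# Expansion candidates for "coffee", keeping only terms with similarity >= 0.5.
expansions = best_path_similarities(ontology, "coffee", min_sim=0.5)
```

Because all edge similarities are at most 1, a path can only lose similarity as it grows, which is the property Dijkstra's algorithm needs; the threshold keeps the search from wandering far into a dense graph.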
Conclusion: larger knowledge base (more flexibility, increased complexity); a further measure for the similarity computation (NAGA similarity)
Questions?