A software tool for thesauri management, browsing and supporting advanced searches
J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria
Advanced Information Systems Laboratory, Department of Computer Science and Systems Engineering
GI-Days, Münster, June 2003
Contents
- Introduction
- Architecture of THManager application
- Basic capabilities
- Enhanced capabilities
- Conclusions
Introduction to thesauri
"A thesaurus is a set of terms that describe the vocabulary of a controlled indexing language, formally organized so that the a priori relationships between concepts (for example synonymous terms, broader terms, narrower terms and related terms) are made explicit" [ISO 2788]
Used to improve the precision and recall of information retrieval in digital libraries:
- provide a uniform and consistent vocabulary for indexing metadata ("description of the data holdings")
- supply users with a suitable vocabulary for retrieval
- expand user queries by automatically adding new terms to the query
Introduction to thesauri
- A thesaurus management tool becomes a vital component in the development of any kind of digital library
- One of the main objectives of Spatial Data Infrastructures (SDIs) is to provide discovery, evaluation and access to spatial data for a community of users
- An SDI can be considered a digital library specialised in geographic information resources
- A thesaurus management tool will therefore also be a vital component in the development of SDIs
Architecture of THManager application
[Layered architecture diagram; labels recovered from the figure]
- Level 3. Application
- Level 2. GUI: Thesaurus.gui (generic GUI components for thesauri visualization)
- Level 1. Model: Thesaurus.model
- Level 0. Database: 100% SQL (basic), Oracle IntermediaText (enhanced)
- Functions: thesaurus management, import/export, keywords expansion
- Data: thesaurus, keywords, WordNet files, metadata records
- Other components: Lexicon, WordNet, Polisemy, polysemy extraction, branch disambiguation, ThesaurusMngmt, ThManager
- Components are marked as basic or enhanced
Basic capabilities
- Editing of thesauri according to ISO standards
  - Broader terms (BT), narrower terms (NT)
  - Related terms (RT), preferred terms (PT)
  - Scope notes (SN), synonyms (SYN, USE)
  - Language translations (TR)
- Visualization of thesauri: hierarchical, alphabetical
- Search of terms
- Multilingual access support
  - Browsing according to the language selected by users
- Import/export
  - Proprietary text file formats
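The relationships listed on this slide can be pictured with a small in-memory model. This is only a minimal sketch: the class and method names (`Term`, `Thesaurus`, `add_narrower`, `hierarchical`) are hypothetical and are not taken from the Thesaurus.model package, and the BT/NT graph is assumed to be free of cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry, keyed by its preferred term (hypothetical model)."""
    preferred: str
    broader: set = field(default_factory=set)          # BT
    narrower: set = field(default_factory=set)         # NT
    related: set = field(default_factory=set)          # RT
    synonyms: set = field(default_factory=set)         # SYN / USE
    scope_note: str = ""                               # SN
    translations: dict = field(default_factory=dict)   # TR: language code -> label


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def term(self, name):
        """Return the entry for name, creating it on first use."""
        if name not in self.terms:
            self.terms[name] = Term(name)
        return self.terms[name]

    def add_narrower(self, broader, narrower):
        """Record a BT/NT pair, keeping both directions consistent."""
        self.term(broader).narrower.add(narrower)
        self.term(narrower).broader.add(broader)

    def roots(self):
        """Terms without any broader term."""
        return sorted(n for n, t in self.terms.items() if not t.broader)

    def hierarchical(self, lang=None):
        """Indented listing; labels use the translation for lang when present."""
        lines = []

        def walk(name, depth):
            label = self.terms[name].translations.get(lang, name)
            lines.append("  " * depth + label)
            for child in sorted(self.terms[name].narrower):
                walk(child, depth + 1)

        for root in self.roots():
            walk(root, 0)
        return lines

    def alphabetical(self):
        """Flat alphabetical listing of preferred terms."""
        return sorted(self.terms)
```

Storing both directions of each BT/NT pair keeps hierarchical browsing cheap, at the cost of having to update two sets on every edit.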
Browsing / Editing
Import/export formats
- Formats
  - Dot-based notation: succession of narrower terms + additional relationships (SYN, TR, ...)
  - Hierarchical numbering of terms
- The tool should adopt more standardized formats: RDFS/XML, ...
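The slide does not show the actual file syntax, so the following is only a guess at what a dot-based import could look like. It assumes a hypothetical layout in which the number of leading dots gives the depth of a term below its broader term; the function name `parse_dot_notation` is likewise an assumption.

```python
def parse_dot_notation(text):
    """Return (broader, narrower) pairs from a dot-indented listing.

    Assumed layout (hypothetical, not the tool's documented format):
        accident
        .environmental accident
        ..oil slick
    """
    pairs = []
    stack = []  # stack[d] holds the most recent term seen at depth d
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        depth = len(line) - len(line.lstrip("."))
        term = line[depth:].strip()
        if depth > len(stack):
            raise ValueError("depth jumps by more than one level at: " + line)
        del stack[depth:]          # drop siblings and deeper terms
        if stack:
            pairs.append((stack[-1], term))
        stack.append(term)
    return pairs
```

Additional relationships such as SYN or TR would need their own markers, which this sketch leaves out because the slide does not describe them.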
Enhanced capabilities
- Thesauri are intended for the homogeneous classification of resources
  - They are used to fill metadata keywords
- However, there is still heterogeneity in metadata keywords
  - Metadata creators use different thesauri in different application domains
- If metadata catalogs provide access to the general public
  - Queries may not contain the same terms as the keywords in metadata records
- A possible solution to bridge the semantic gap
  - Disambiguation of thesauri (and queries) in relation to the concepts of an upper-level ontology
Enhanced capabilities
- Additional tools around semantic disambiguation
  - Browsing WordNet as another thesaurus
  - Searching polysemic senses in WordNet
  - Thesauri disambiguation
  - Automatic expansion of keywords
[Diagram labels: Thesaurus 1, Thesaurus 2, ..., Thesaurus N; Controlled list 1, Controlled list 2, ..., Controlled list N; other knowledge representation models; WordNet]
Browsing WordNet
- WordNet is structured as a hierarchy of synsets
- A synset is a set of synonyms representing a particular concept (sense)
- WordNet libraries and files are accessed through JNI
Searching polysemic senses in WordNet
- Functionality provided by the Polisemy package
- Compound terms are partitioned if no synset is found
- If adjectives are found, the associated nouns are also searched to reduce the number of not-found words
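One possible reading of the lookup rules above, sketched against a toy lexicon rather than real WordNet files. The function `lookup_senses`, the `lexicon` dictionary (word to list of sense identifiers) and the `related_nouns` table (adjective to associated noun) are all hypothetical stand-ins, not the Polisemy package's interface.

```python
def lookup_senses(term, lexicon, related_nouns=None):
    """Return {word: senses} for a thesaurus term.

    The whole term is tried first. Only when it has no synset is the
    compound split into words; for a word listed in related_nouns, the
    associated noun is searched as well.
    """
    related_nouns = related_nouns or {}
    term = term.lower()
    if term in lexicon:
        return {term: lexicon[term]}

    found = {}
    for word in term.split():
        candidates = [word]
        if word in related_nouns:              # adjective with a known noun
            candidates.append(related_nouns[word])
        for candidate in candidates:
            if candidate in lexicon:
                found[candidate] = lexicon[candidate]
    return found


# Toy data, invented for illustration only.
lexicon = {
    "accident": ["accident#1", "accident#2"],
    "nucleus": ["nucleus#1"],
}
related_nouns = {"nuclear": "nucleus"}
senses = lookup_senses("nuclear accident", lexicon, related_nouns)
```

Here "nuclear accident" has no entry of its own, so it is split; "nuclear" is missing from the toy lexicon, but its associated noun "nucleus" still contributes a sense.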
Thesauri disambiguation
- Unsupervised disambiguation method
- The senses of every thesaurus term are searched in WordNet
- The hierarchical structure of the thesaurus is used as the word context for a voting algorithm to find the closest sense
- Thesauri are partitioned into branches (trees formed by BT/NT terms whose root has no BT)
[Example branch from the figure: accident source, environmental accident, major accident, traffic accident, work accident, technological accident, shipping accident, nuclear accident, core meltdown, oil slick, accident, explosion, leakage; administration...]
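The branch partitioning described above can be sketched as a traversal from every term that has no broader term. The function name and the input shape (a dictionary from each term to its narrower terms) are assumptions, and the small hierarchy in the usage lines is invented for illustration; the slide's figure does not survive well enough to recover the real one.

```python
def partition_into_branches(narrower):
    """Map each root term (no BT) to the set of terms in its branch.

    narrower: dict mapping a term to a list of its NT terms.
    """
    all_terms = set(narrower)
    has_broader = set()
    for children in narrower.values():
        all_terms.update(children)
        has_broader.update(children)

    branches = {}
    for root in sorted(all_terms - has_broader):
        seen = set()
        pending = [root]
        while pending:                      # iterative depth-first walk
            term = pending.pop()
            if term in seen:
                continue
            seen.add(term)
            pending.extend(narrower.get(term, ()))
        branches[root] = seen
    return branches


# Hypothetical hierarchy using terms that appear on the slide.
example = {
    "accident": ["environmental accident", "nuclear accident"],
    "nuclear accident": ["core meltdown"],
    "administration": [],
}
branches = partition_into_branches(example)
```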
Thesauri disambiguation II
- Voting algorithm to obtain the disambiguated synset of a term "a"
- Every synset "s" associated with the rest of the terms in the branch votes (proximity weight) for the synsets of term "a"
- Main weight: number of subsumers in the WordNet hierarchy
  - Matches in the WordNet hierarchy of ancestors
- Discounting factors:
  - Synset depth
  - Branch distance
  - Polysemy of the term associated with synset "s"
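The slide names the ingredients of the vote but not the formula, so the sketch below makes one up to show how the pieces could fit together. It assumes each vote equals the number of shared subsumers divided by the candidate's depth, the branch distance and the polysemy of the voting term. That formula, the function names and the toy hypernym table are all assumptions, not the authors' method.

```python
def ancestors(synset, hypernym):
    """Return the synset followed by all its subsumers up to the root."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain


def disambiguate(candidates, context, hypernym):
    """Pick the candidate synset of term "a" with the highest total vote.

    candidates: synsets of term "a".
    context: list of (synsets_of_other_term, branch_distance) pairs,
             with branch_distance >= 1.
    hypernym: dict mapping a synset to its parent (toy single-parent tree).
    """
    if not candidates:
        return None, {}
    scores = {c: 0.0 for c in candidates}
    for senses, distance in context:
        polysemy = len(senses)                 # more senses, weaker votes
        for s in senses:
            s_ancestors = set(ancestors(s, hypernym))
            for c in candidates:
                c_ancestors = ancestors(c, hypernym)
                shared = len(s_ancestors.intersection(c_ancestors))
                depth = len(c_ancestors)       # assumed depth discount
                scores[c] += shared / (depth * distance * polysemy)
    best = max(scores, key=scores.get)
    return best, scores


# Toy hierarchy, invented for illustration only.
hypernym = {
    "accident.mishap": "event",
    "accident.chance": "phenomenon",
    "collision": "accident.mishap",
    "explosion": "event",
    "event": "entity",
    "phenomenon": "entity",
}
best, scores = disambiguate(
    ["accident.mishap", "accident.chance"],
    [(["collision"], 1), (["explosion"], 1)],
    hypernym,
)
```

In this toy setup the "mishap" sense wins because the context synsets share more ancestors with it than with the "chance" sense.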
Thesauri disambiguation III
- Annotation of disambiguated synsets
Automatic expansion of keywords with new disambiguated thesauri
- Comparison between the initial collection of synsets and the synsets of a new term
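The comparison mentioned above might look like the following, assuming every thesaurus has already been disambiguated into a term-to-synset mapping. The function name, the mapping shape and the sample identifiers are hypothetical.

```python
def expand_keywords(keywords, source_synsets, new_thesaurus_synsets):
    """Add terms from a new thesaurus whose synset matches an existing keyword.

    keywords: keywords already present in a metadata record.
    source_synsets: term -> synset for the thesaurus the keywords came from.
    new_thesaurus_synsets: term -> synset for the newly disambiguated thesaurus.
    """
    # Initial collection of synsets covered by the record's keywords.
    initial = {source_synsets[k] for k in keywords if k in source_synsets}
    added = sorted(
        term
        for term, synset in new_thesaurus_synsets.items()
        if synset in initial and term not in keywords
    )
    return list(keywords) + added


# Toy mappings, invented for illustration only.
source = {"oil slick": "oil_slick#1", "leakage": "leak#2"}
new_thesaurus = {"oil spill": "oil_slick#1", "flood": "flood#1", "leak": "leak#2"}
expanded = expand_keywords(["oil slick", "leakage"], source, new_thesaurus)
```

Matching on synsets rather than strings is what lets terms from different thesauri meet even when they share no wording.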
Expansion of keywords II
Conclusions & future lines
- ThManager is a flexible tool to manage thesauri
  - It provides enhanced functionality for the improvement of classifications
- This tool can be easily integrated into other tools
  - It is used by a metadata editing tool (also presented here) to select the appropriate term for the distinct metadata fields
- Future lines:
  - Creation of a thesaurus Web Service providing some of the functionality offered by this tool: thesaurus browsing, WordNet polysemy extraction, keywords expansion, ...
  - Concept-based retrieval
    - Exploit the semantic disambiguation of thesauri to test different information retrieval strategies for geographic data catalogs
    - It is possible to index metadata records according to a unified system: the disambiguated WordNet synsets