EcoTerm IV NBII/EioNet Demo of Federated KOS Search Mike Frame Vienna, Austria April 2007
Discussion Topics…
- Project Background
- NBII Thesaurus
- GEMET Thesaurus
- Prototype Client
- Sample Query Results
  - Including no, 1, or both thesauri
- Overall Findings
Biocomplexity Thesaurus
EIONET GEMET Thesaurus
NBII/EIONET Thesaurus Web-service 1
- Background – collaboration through Ecoinformatics TWG
- Primary Goal – access distributed multi-lingual thesauri
- Results – SKOS web-service & client
Latest Client & Service capabilities
- Access to both NBII and GEMET
- Single language capability
- Results are provided by source
- All documentation is completed
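To illustrate "access to both NBII and GEMET" with "results provided by source", here is a minimal sketch of a federated lookup. It does not call the real SKOS web-service; the function names (`make_lookup`, `federated_search`) and the in-memory tables are assumptions made for illustration. The NBII terms are the broader terms listed on the "Initial Challenges Identified" slide, and the GEMET terms are taken from the GEMET control file slide.

```python
from typing import Callable, Dict, List

Lookup = Callable[[str], List[str]]


def make_lookup(table: Dict[str, List[str]]) -> Lookup:
    """Wrap an in-memory table as a lookup function (stand-in for a web-service call)."""
    def lookup(term: str) -> List[str]:
        return list(table.get(term.strip().lower(), []))
    return lookup


# Illustrative stand-ins for the two thesaurus services (not real endpoints).
SOURCES: Dict[str, Lookup] = {
    "NBII": make_lookup({
        "endangered species": ["Species", "Special status species", "Taxa"],
    }),
    "GEMET": make_lookup({
        "endangered species": [
            "category of endangered species",
            "endangered animal species",
            "endangered plant species",
        ],
    }),
}


def federated_search(term: str, sources: Dict[str, Lookup]) -> Dict[str, List[str]]:
    """Query every source and keep the results grouped by source name."""
    results: Dict[str, List[str]] = {}
    for name, lookup in sources.items():
        try:
            results[name] = lookup(term)
        except Exception:
            # A failing source yields no terms rather than failing the whole query.
            results[name] = []
    return results


if __name__ == "__main__":
    for source, terms in federated_search("Endangered species", SOURCES).items():
        print(source, "->", terms)
```

Keeping the results keyed by source, rather than merging them, leaves it to the client to show where each related term came from.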
Demo Client
Initial Challenges Identified
- Thesaurus scope, intent, purpose, and coverage are different
- NBII = sub-discipline of environment
  - Endangered species
  - Broader Terms: Species, Special status species, Taxa
- EIONET = broad environment
  - Broader Terms: environmental protection
Current State
- Users
  - Most aren’t aware of the underlying vocabulary
  - Vocabularies are often unique to an organization and used more for “categorization” than retrieval
- Goal
  - Include all vocabularies and let the search engine handle results
Demonstration Search Retrieval
- Created demonstration datasets
- NBII Cataloged Resources
  - ~30,000 web-sites, publications, images, maps, etc.
  - XML structured data – controlled subject
- NBII FGDC Metadata
  - ~22,000 resources on research studies elements
  - Semi-structured with no controlled vocabulary
NBII Catalog Records
- Based on the Dublin Core + 18 elements, of which 10 are mandatory
- In place since 2002
- Used by distributed content managers
NBII Metadata CH
Process
- Added thesaurus capabilities to the development search engine for:
  - NBII Thesaurus
  - EIONET GEMET Thesaurus
- Used BT, RT, NT relationships & weighting
- Performed sample queries within the test repositories for:
  - No thesaurus
  - GEMET-only aided searching
  - NBII-only aided searching
  - GEMET+NBII aided searching (X)
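The thesaurus-aided searching described above amounts to query expansion: the user's term is kept at full weight and related terms are added at reduced weight. The sketch below shows one way this could work. The per-relationship weights and the `expand_query` function are assumptions for illustration, not the configuration of the actual search engine; the broader terms are the NBII ones from the "Initial Challenges Identified" slide.

```python
from typing import Dict, List, Tuple

# Hypothetical weight per relationship type (not taken from the slides).
RELATION_WEIGHTS: Dict[str, float] = {"BT": 0.2, "RT": 0.5, "NT": 0.8}

# Tiny illustrative thesaurus: term -> list of (related term, relationship).
THESAURUS: Dict[str, List[Tuple[str, str]]] = {
    "endangered species": [
        ("species", "BT"),
        ("special status species", "BT"),
        ("taxa", "BT"),
    ],
}


def expand_query(
    term: str,
    thesaurus: Dict[str, List[Tuple[str, str]]] = THESAURUS,
) -> List[Tuple[str, float]]:
    """Return the original term at weight 1.0 plus its weighted related terms."""
    key = term.strip().lower()
    expanded: List[Tuple[str, float]] = [(key, 1.0)]
    for related, relation in thesaurus.get(key, []):
        expanded.append((related, RELATION_WEIGHTS[relation]))
    return expanded


if __name__ == "__main__":
    for expanded_term, weight in expand_query("Endangered Species"):
        print(f"{expanded_term!r} weight={weight}")
```

A term with no thesaurus entry passes through unchanged, which corresponds to the "No thesaurus" case in the sample queries.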
Test Repository 1 NBII Resource Catalog (Dublin Core)
No Thesauri – “invasive species”
NBII Thesaurus – “invasive species”
GEMET Thesaurus – “invasive species”
No Thesauri – “Endangered Species”
NBII Thesaurus – “endangered species”
GEMET Only – “endangered species”
No Thesaurus – “rare species”
NBII Thesaurus – “rare species”
GEMET Thesaurus – “rare species”
GEMET Thesaurus – “rare species” (expanded degrees of relevance)
No Thesauri – “protected species”
NBII Thesaurus – “protected species”
GEMET Thesaurus – “protected species”
Results – NBII Catalog Resources
term | None | NBII | GEMET
“invasive species”
“endangered species”
“rare species”
“rare species” (expanded)
“protected species”
Results – NBII Resource Catalog
Test Repository 2 NBII FGDC Metadata
Sample Queries – No vocabularies Metadata CH “invasive species”
Sample Queries – NBII only Metadata CH “invasive species”
Sample Queries – GEMET only Metadata CH “invasive species”
Sample Queries – No vocabularies Metadata CH “endangered species”
Sample Queries – NBII only Metadata CH “endangered species”
Sample Queries – GEMET only Metadata CH “endangered species”
No Thesauri – Metadata CH “rare species”
NBII Thesaurus – Metadata CH “rare species”
GEMET Thesaurus – Metadata CH “rare species”
Sample Queries – No vocabularies Metadata CH “protected species”
Sample Queries – NBII only Metadata CH “protected species”
Sample Queries – GEMET only Metadata CH “protected species”
Results – FGDC Metadata
term | None | NBII | GEMET
“invasive species”
“endangered species”
“rare species”
“protected species”
Results – NBII Resource Catalog
Overall Results
- General Findings
  - The assumption that a thesaurus improves the “number” of results is valid
  - The degree varies by term and mappings
  - Since users search from a number of perspectives, backgrounds, and levels of expertise, multiple thesauri do improve the number of results
Overall Results
- Using only GEMET Terminology
  - Terms not included in the NBII thesaurus that were in GEMET improved search results
  - GEMET's strength of broad coverage aided searches
- In general for the Metadata repository
  - Results varied somewhat, but often gave the same top 10 results
Overall Results
- General Findings
  - The “No thesaurus” tests produced poorer #1 results
  - Thesaurus-aided searches reordered the result list more for the structured set than for the unstructured set (Metadata)
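Observations such as "often the same top 10 results" and "poorer #1 results" can be made measurable by comparing the ranked lists from two runs. The sketch below is one simple way to do that; the function names and the document IDs are hypothetical and are not results from the test repositories.

```python
from typing import Sequence


def top_k_overlap(run_a: Sequence[str], run_b: Sequence[str], k: int = 10) -> float:
    """Fraction of the top-k results that two ranked lists share (order ignored)."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are trivially identical
    return len(top_a & top_b) / max(len(top_a), len(top_b))


def same_first_hit(run_a: Sequence[str], run_b: Sequence[str]) -> bool:
    """True when both runs return the same #1 result."""
    return bool(run_a) and bool(run_b) and run_a[0] == run_b[0]


if __name__ == "__main__":
    # Hypothetical ranked document IDs for one query under two configurations.
    no_thesaurus = ["d1", "d2", "d3", "d4"]
    nbii_aided = ["d2", "d1", "d3", "d5"]
    print(top_k_overlap(no_thesaurus, nbii_aided, k=4))
    print(same_first_hit(no_thesaurus, nbii_aided))
```

The overlap score ignores ordering, so it pairs naturally with the #1-hit check, which captures only ordering at the top of the list.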
Issues
- “Integrating” multi-scope and multi-purpose thesauri presents challenges:
  - Can’t turn the effort into a thesaurus project
  - Degrees of relevance of terms is an issue
  - Concept matching or different intent
  - Differing classification (RT vs. NT) across thesauri
  - Differing “weighting” algorithms
Further Study Options
1.) Take multiple thesauri “as is”
2.) Do some “attempted” concept matching, e.g. “endangered animal species” – “endangered animal”
3.) If no match is present, add term and relationship as is
4.) Obtain terms from XMDR
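Options 2 and 3 can be sketched as a naive string-level matcher: compare word overlap between a term and the terms already in the target thesaurus, reuse the best match above a threshold, and otherwise add the term as is. The Jaccard measure, the 0.6 threshold, and the function names are assumptions chosen for illustration; real concept matching would need more than word overlap.

```python
from typing import List, Optional, Set


def tokens(term: str) -> Set[str]:
    """Lower-cased word set of a term."""
    return set(term.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two terms (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(term: str, candidates: List[str], threshold: float = 0.6) -> Optional[str]:
    """Return the most similar candidate at or above the threshold, else None."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = jaccard(term, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


def merge_term(term: str, vocabulary: List[str]) -> str:
    """Option 2: reuse a matching concept; option 3: otherwise add the term as is."""
    match = best_match(term, vocabulary)
    if match is not None:
        return match
    vocabulary.append(term)
    return term


if __name__ == "__main__":
    vocab = ["endangered animal", "rare species"]
    print(merge_term("endangered animal species", vocab))  # matched to an existing term
    print(merge_term("invasive plants", vocab))            # no match, so it is added
    print(vocab)
```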
Further Study Options – cont.
- Follow-up with additional repositories
- Repeat with other query terms
- Re-look at weighting algorithms
- Do queries with subset of terms
- Repeat with completely integrated thesaurus, as compared to:
  - Repeat queries with machine integration
- Complete by June
Questions, Comments,
GEMET Control file
endangered species,category of endangered species[.2],endangered animal species[0.8],endangered plant species[0.8]
protected species,category of endangered species[0.2],endangered species [0.2]
rare species,category of endangered species[0.2],extinct species[0.2],vanished species[0.2]
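The control file above pairs each query term with expansion terms and a bracketed weight. A minimal sketch of how such a file could be read is shown below, assuming one query term per line, comma-separated entries, and weights written either as `[.2]` or `[0.2]`. The parser and its names (`parse_control_file`, `ENTRY`) are hypothetical and are not the search engine's own loader.

```python
import re
from typing import Dict, List, Tuple

CONTROL_FILE = """\
endangered species,category of endangered species[.2],endangered animal species[0.8],endangered plant species[0.8]
protected species,category of endangered species[0.2],endangered species [0.2]
rare species,category of endangered species[0.2],extinct species[0.2],vanished species[0.2]
"""

# An entry is "<term>[<weight>]", with optional whitespace before the bracket.
ENTRY = re.compile(r"^\s*(?P<term>.*?)\s*\[(?P<weight>\d*\.?\d+)\]\s*$")


def parse_control_file(text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Map each query term to its list of (expansion term, weight) pairs."""
    table: Dict[str, List[Tuple[str, float]]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        head, *entries = line.split(",")
        expansions: List[Tuple[str, float]] = []
        for entry in entries:
            match = ENTRY.match(entry)
            if match is None:
                raise ValueError(f"Malformed entry: {entry!r}")
            expansions.append((match.group("term"), float(match.group("weight"))))
        table[head.strip()] = expansions
    return table


if __name__ == "__main__":
    for term, expansions in parse_control_file(CONTROL_FILE).items():
        print(term, "->", expansions)
```

Splitting on commas assumes that no term contains a comma, which holds for the three lines shown but may not hold for a full thesaurus export.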