Evaluating Ontology-Mapping Tools: Requirements and Experience
Natalya F. Noy and Mark A. Musen
Stanford Medical Informatics, Stanford University
Types of Ontology Tools
There is not just one class of ontology tools:
- Development tools: Protégé-2000, OntoEdit, OilEd, WebODE, Ontolingua
- Mapping tools: PROMPT, ONION, OBSERVER, Chimaera, FCA-Merge, GLUE
Evaluation Parameters for Ontology-Development Tools
- Interoperability with other tools
  - Ability to import ontologies from other languages
  - Ability to export ontologies to other languages
- Expressiveness of the knowledge model
- Scalability
- Extensibility
- Availability and capabilities of inference services
- Usability of the tools
Evaluation Parameters for Ontology-Mapping Tools
We can try to reuse the evaluation parameters for development tools, but:
- Development tools perform similar tasks on similar inputs and produce similar outputs
- Mapping tools differ in their tasks, inputs, and outputs
Development Tools
- Task: create an ontology
- Input: domain knowledge, ontologies to reuse, requirements
- Output: a domain ontology
Mapping Tools: Tasks
- Merging, C = Merge(A, B): iPROMPT, Chimaera
- Mapping, Map(A, B): Anchor-PROMPT, GLUE, FCA-Merge
- Building an articulation ontology between A and B: ONION
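One way to see why these tasks call for different evaluation setups is to write them down as signatures. The following is an illustrative sketch only; the Ontology type and the function names are hypothetical placeholders, not the actual API of any of the tools above.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder type for illustration only.
@dataclass
class Ontology:
    name: str
    terms: set = field(default_factory=set)

def merge(a: Ontology, b: Ontology) -> Ontology:
    """C = Merge(A, B): a single merged ontology (iPROMPT, Chimaera)."""
    raise NotImplementedError

def map_ontologies(a: Ontology, b: Ontology) -> set[tuple[str, str]]:
    """Map(A, B): pairs of related terms from A and B (Anchor-PROMPT, GLUE, FCA-Merge)."""
    raise NotImplementedError

def articulate(a: Ontology, b: Ontology) -> Ontology:
    """An articulation ontology linking A and B (ONION)."""
    raise NotImplementedError
```

The point of the sketch is simply that the three tasks return different kinds of results, so a single evaluation harness cannot score them all the same way.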
Mapping Tools: Inputs
- iPROMPT: classes, slots and facets
- Chimaera: classes, slots and facets
- GLUE: instance data
- FCA-Merge: shared instances
- OBSERVER: DL definitions
Mapping Tools: Outputs and User Interaction
- GUI for interactive merging: iPROMPT, Chimaera
- Lists of pairs of related terms: Anchor-PROMPT, GLUE, FCA-Merge
- List of articulation rules: ONION
Can We Compare Mapping Tools?
- Yes, we can! We can compare tools in the same group.
- But how do we define a group?
Architectural Comparison Criteria
- Input requirements
  - Ontology elements: used for analysis, required for analysis
  - Modeling paradigm: frame-based, description logic
- Level of user interaction
  - Batch mode or interactive
  - User feedback: required? used?
Architectural Criteria (cont'd)
- Type of output
  - Set of rules
  - Ontology of mappings
  - List of suggestions
  - Set of pairs of related terms
- Content of output
  - Matching classes
  - Matching instances
  - Matching slots
From a Large Pool to Small Groups
- Architectural criteria divide the space of mapping tools into groups
- A performance criterion then compares tools within a single group (see the sketch below)
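A minimal sketch of this grouping step, assuming each architectural criterion can be encoded as a simple categorical value. The ToolProfile fields and the grouping function are illustrative choices, not definitions taken from the paper.

```python
from dataclasses import dataclass, astuple
from collections import defaultdict

@dataclass(frozen=True)
class ToolProfile:
    """Illustrative encoding of the architectural criteria for one mapping tool."""
    inputs: frozenset      # e.g. frozenset({"classes", "slots"}) or frozenset({"instance data"})
    paradigm: str          # "frame-based" or "description logic"
    interaction: str       # "interactive" or "batch"
    output: str            # "merged ontology", "term pairs", "articulation rules", ...

def group_by_architecture(profiles: dict) -> dict:
    """Tools with identical architectural profiles fall into the same group;
    performance comparisons are then made only within a group."""
    groups = defaultdict(list)
    for tool, profile in profiles.items():
        groups[astuple(profile)].append(tool)
    return dict(groups)
```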
Resources Required for Comparison Experiments
- Source ontologies
  - Pairs of ontologies covering similar domains
  - Ontologies of different size, complexity, and level of overlap
- "Gold standard" results
  - Human-generated correspondences between terms
  - Pairs of terms, rules, explicit mappings
Resources Required (cont'd)
- Metrics for comparing performance (precision and recall are sketched after this list)
  - Precision (how many of the tool's suggestions are correct)
  - Recall (how many of the correct matches the tool found)
  - Distance between ontologies
    - Use of inference techniques
    - Analysis of taxonomic relationships (à la OntoClean)
- Experiment controls
  - Design
  - Protocol
  - Suggestions that the tool produced
  - Operations that the user performed
  - Suggestions that the user followed
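A minimal sketch of the precision and recall computation against a gold standard, assuming both the tool's suggestions and the human-generated correspondences are represented as sets of term pairs; this representation is an assumption for illustration, not a format prescribed by any of the tools.

```python
def precision_recall(suggested: set, gold: set) -> tuple:
    """precision: how many of the tool's suggestions are correct;
    recall: how many of the correct matches the tool found."""
    correct = suggested & gold
    precision = len(correct) / len(suggested) if suggested else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Toy example with hypothetical (term_in_A, term_in_B) pairs:
gold = {("Person", "Individual"), ("Dept", "Department"), ("Paper", "Publication")}
suggested = {("Person", "Individual"), ("Dept", "Department"), ("Paper", "Project")}
p, r = precision_recall(suggested, gold)   # p = 2/3, r = 2/3
```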
Where Will the Resources Come From?
- Ideally, from researchers who do not belong to any of the evaluated projects
- Realistically, as a side product of stand-alone evaluation experiments
Evaluation Experiment: iPROMPT
- iPROMPT is
  - a plug-in to Protégé-2000
  - an interactive ontology-merging tool
- iPROMPT uses for analysis
  - the class hierarchy
  - slots and facet values
- iPROMPT matches
  - classes
  - slots
  - instances
Evaluation Experiment
- 4 users merged the same 2 source ontologies
- We measured
  - the acceptability of iPROMPT's suggestions
  - the differences between the resulting ontologies
Sources
Input: two ontologies from the DAML ontology library
- CMU ontology
  - Employees of an academic organization
  - Publications
  - Relationships among research groups
- UMD ontology
  - Individuals
  - CS departments
  - Activities
Experimental Design
- Users' expertise
  - Familiar with Protégé-2000
  - Not familiar with PROMPT
- Experiment materials
  - The iPROMPT software
  - A detailed tutorial
  - A tutorial example
  - Evaluation files
- Users performed the experiment on their own, with no questions or interaction with the developers
Experiment Results
- Quality of iPROMPT suggestions
  - Recall: 96.9%
  - Precision: 88.6%
- Resulting ontologies
  - Difference measure: the fraction of frames that have a different name and type (see the sketch below)
  - The resulting ontologies differ by ~30%
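The difference measure reported above can be read as a symmetric-difference ratio over frames. A minimal sketch, assuming each resulting ontology is reduced to a set of (frame name, frame type) pairs; this representation is an assumption for illustration, not the exact procedure used in the experiment.

```python
def ontology_difference(frames_a: set, frames_b: set) -> float:
    """frames_a, frames_b: sets of (frame_name, frame_type) pairs such as
    ("Publication", "class") or ("author", "slot"), one set per merged ontology.
    Returns the fraction of all frames that appear in only one of the two results."""
    union = frames_a | frames_b
    if not union:
        return 0.0
    return len(frames_a ^ frames_b) / len(union)   # symmetric difference over union
```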
Limitations of the Experiment
- Only 4 participants
- Variability in Protégé expertise
- Recall and precision figures are not very meaningful without a comparison to other tools
- We need better distance metrics
Research Questions
- Which pragmatic criteria are most helpful in finding the best tool for a task?
- How do we develop a "gold standard" merged ontology? Does such an ontology exist?
- How do we define a good distance metric to compare results to the gold standard?
- Can we reuse tools and metrics developed for evaluating ontologies themselves?