Empirical Studies of Knowledge Acquisition
- or -
Natasha and Mark do time at Leavenworth
Natasha Fridman Noy and Mark A. Musen
Stanford University, Stanford, California, USA
Overview
Protégé-2000 version 1.0
DARPA's HPKB program
Empirical evaluation of Protégé-2000
Where do we go from here?
Generations of Protégé systems at SMI
PROTÉGÉ: LISP-Machine system for rapid knowledge acquisition for clinical-trial specifications
PROTÉGÉ-II: NeXTSTEP system that allowed independent ontology editing and selection of alternative problem-solving methods
Protégé/Win: finally, a Protégé system for the masses ...
Protégé/Java (a.k.a. Protégé-2000): the subject of this talk ...
Protégé-2000
Represents the latest in a series of interactive tools for knowledge-system development
Facilitates construction of knowledge bases in a principled fashion from reusable components
Allows a variety of "plug-ins" to facilitate customization in various dimensions
Still needs a better name ...
Knowledge-base development with Protégé-2000
Build a domain ontology (a conceptual model of the application area)
Custom-tailor a GUI for acquisition of content knowledge
Elicit content knowledge from application specialists
Map the domain ontology to appropriate problem solvers for automation of particular tasks
Building knowledge bases: the Protégé methodology
Domain ontology to provide the domain of discourse
Knowledge-acquisition tool for entry of detailed content
Protégé-2000 ontology-editing tab
Add constraints on classes and attributes
The developer can see the knowledge organization clearly
Easy to edit attributes and facets
Classification problems become viewable
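As a rough illustration of the frame-style model behind these slides, here is a minimal Java sketch of classes, slots, and facets such as value types and cardinality constraints. All class and field names are hypothetical; this is not the Protégé-2000 API, only the flavor of the knowledge organization being edited.

    import java.util.*;

    // Hypothetical slot definition with two common facets.
    class Slot {
        String name;
        Class<?> valueType;                  // facet: allowed value type
        int minCardinality, maxCardinality;  // facet: cardinality constraints
        Slot(String name, Class<?> valueType, int min, int max) {
            this.name = name; this.valueType = valueType;
            this.minCardinality = min; this.maxCardinality = max;
        }
    }

    // Hypothetical class ("frame") with a superclass and attached slots.
    class Frame {
        String name;
        Frame superclass;  // single inheritance, for brevity
        Map<String, Slot> slots = new HashMap<>();
        Frame(String name, Frame superclass) { this.name = name; this.superclass = superclass; }
        void addSlot(Slot s) { slots.put(s.name, s); }
    }

    public class OntologyDemo {
        public static void main(String[] args) {
            Frame unit = new Frame("MilitaryUnit", null);
            unit.addSlot(new Slot("name", String.class, 1, 1));  // exactly one name
            unit.addSlot(new Slot("subunits", Frame.class, 0, Integer.MAX_VALUE));
            Frame brigade = new Frame("MechanizedInfantryBrigade", unit);  // inherits slots
            System.out.println(brigade.name + " is-a " + brigade.superclass.name);
        }
    }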
Generation of usable domain-specific KA tools
The Protégé-2000 system takes a domain ontology as input and generates a graphical KA tool in real time
Developers can:
  Tweak the KA tool's appearance by using direct-manipulation layout-editing facilities
  Add custom user-interface widgets when the complexity of the domain warrants more specialized visual metaphors
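A hedged sketch of the idea behind automatic form generation: walk the slots of a class and pick a default widget from each slot's facets. The widget names and the selection rules below are illustrative assumptions, not Protégé-2000's actual algorithm.

    import java.util.*;

    public class FormGenerator {
        // Minimal stand-in for a slot definition: name, value type, multi-valued?
        static class SlotDef {
            final String name, valueType; final boolean multiple;
            SlotDef(String name, String valueType, boolean multiple) {
                this.name = name; this.valueType = valueType; this.multiple = multiple;
            }
        }

        // Pick a default widget from the slot's facets; the generated form can
        // then be tweaked by hand in the layout editor, as the slide describes.
        static String defaultWidget(SlotDef s) {
            switch (s.valueType) {
                case "Boolean":  return "CheckBoxWidget";
                case "Integer":  return "IntegerFieldWidget";
                case "Instance": return s.multiple ? "InstanceListWidget" : "InstanceFieldWidget";
                default:         return s.multiple ? "StringListWidget" : "TextFieldWidget";
            }
        }

        public static void main(String[] args) {
            List<SlotDef> unitSlots = Arrays.asList(
                new SlotDef("name", "String", false),
                new SlotDef("subunits", "Instance", true));
            for (SlotDef s : unitSlots)
                System.out.println(s.name + " -> " + defaultWidget(s));
        }
    }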
A great case for customized widgets: monitoring nuclear power plants
Some advances in Protégé-2000
Much improved:
  editing of ontologies
  creation and customization of knowledge-acquisition tools
  adaptation of the system to new requirements
But still no automated support for mapping of knowledge bases to problem-solving methods—yet!
No more shuffling among different development tools!
Protégé-2000 adopts the OKBC knowledge model
Protégé-2000 knowledge bases are OKBC-compliant
Protégé-2000 is not OKBC-generic: there are some OKBC knowledge bases that Protégé-2000 cannot handle
It's very close, though!
The differences are required to ease KA (for example, instances are instances of exactly one class)
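One way to picture the "exactly one class" restriction: in the sketch below an instance records a single direct type, fixed at creation, so the multi-class membership that OKBC permits simply cannot be expressed. Names are illustrative assumptions, not the Protégé-2000 API.

    import java.util.Objects;

    public class SingleTypeDemo {
        static class Instance {
            final String name;
            final String directType;  // exactly one class, fixed at creation
            Instance(String name, String directType) {
                this.name = name;
                this.directType = Objects.requireNonNull(directType);
            }
        }

        public static void main(String[] args) {
            Instance i = new Instance("3rd-Artillery-Battalion", "ArtilleryBattalion");
            System.out.println(i.name + " : " + i.directType);
            // An OKBC individual that belongs to two unrelated classes has no
            // direct counterpart in this model; hence "not OKBC-generic".
        }
    }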
The race to develop plug-ins
GUI widgets for tables, diagrams, animation
File I/O plug-ins for interoperability with databases and other knowledge-based systems
Tab plug-ins for embedded applications
Swapping components
Each of Protégé-2000's major components can be swapped out and replaced with a different one:
  Knowledge model
  Storage
  User interface
Protégé-2000 plug-ins
Will revolutionize development of KA tools
Allow nearly every aspect of the system to be modified in a well-defined manner
Allow multiple groups each to develop special-purpose plug-ins for their own purposes
Will lead to libraries of plug-ins that allow KA systems to be adapted in radical ways
Are already being developed by a widely distributed user community!
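To make the plug-in idea concrete, here is a hypothetical sketch of what a tab-plug-in contract could look like: the host application discovers implementations and hands each one the shared knowledge base to render. The interface and method names are assumptions for illustration, not the actual Protégé-2000 plug-in API.

    public class PluginDemo {
        // Hypothetical contract every tab plug-in would implement.
        interface TabPlugin {
            String getLabel();                      // tab title shown in the UI
            void initialize(Object knowledgeBase);  // called when the tab is opened
        }

        // A domain-specific tab, in the spirit of the HPKB Tab described later.
        static class HpkbTab implements TabPlugin {
            public String getLabel() { return "HPKB Tab"; }
            public void initialize(Object knowledgeBase) {
                // build battlespace-specific forms on top of the shared KB here
            }
        }

        public static void main(String[] args) {
            TabPlugin tab = new HpkbTab();
            System.out.println("Loading tab: " + tab.getLabel());
            tab.initialize(new Object());  // stand-in for the real knowledge base
        }
    }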
But how do we know we're making progress?
Most KA systems are never evaluated
There are no well-established evaluation approaches
There are no benchmarks for comparison
Most KA-tool users do not want to participate in evaluation experiments: they have their own work to do, and evaluation is time-consuming
Sisyphus experiments
Have been organized by the KA community
Have involved shared tasks: office assignment, elevator configuration, rock and mineral classification
Have done a better job of allowing comparison of knowledge-system architectures than of KA techniques
What is needed
Empirical studies of subject-matter experts entering "real" knowledge
Metrics for assessing the quality of entered knowledge, the quantity of entered knowledge, and the usability of KA tools
Environments where subject-matter experts can allocate the necessary time for these kinds of studies
We found a captive audience in Kansas ...
What the rest of the talk is about
The High-Performance Knowledge Bases program
Empirical evaluation of knowledge-based systems: why and how?
How we designed, conducted, and evaluated a usability experiment
Extensions to Protégé
Experiment results
High-Performance Knowledge Bases (HPKB) program
Enable developers to construct large knowledge bases
Reuse the knowledge in multiple applications, with diverse problem-solving methods, in rapidly changing environments
Foster collaboration among multiple teams of technology developers and integrators
Two challenge problems
Crisis-management challenge problem: managing and understanding information before confrontation; building systems to help warning analysts and policy makers
Battlespace challenge problem: analyzing courses of action for conformance with principles of warfare, resource allocation, feasibility, and so on
Why does SMI care about HPKB?
Research challenges common to both:
  collaboration and knowledge sharing
  management of large knowledge bases
  knowledge-base development by subject-matter experts (SMEs) who are not experts in knowledge engineering
  empirical evaluation of the tools and knowledge bases
Tools developed for HPKB are also applied in medical domains
Evaluating artificial-intelligence systems “Studying AI systems is not very different from studying moderately intelligent animals such as rats” — Paul R. Cohen, “Empirical Methods for Artificial Intelligence”
Designing an experiment
Formulate a hypothesis: what are we testing?
Determine what exactly affects performance: remove various factors from the system and compare results
Create conditions for a controlled experiment: script the sessions and design the tasks carefully
Knowledge-acquisition experiment Evaluate how subject-matter experts (in this case, military experts) can use Protégé to develop and maintain knowledge bases
The problem
Knowledge is not static:
  The world changes
  What we know about the world changes
Large-scale changes in military doctrine From presentation by COL Mike Smith (http://192.111.52.19/jadd/fm1005/)
Domain experts need to interact with knowledge bases
Understand the knowledge base: know what it contains (and what it doesn't)
Perform quality control: remove or change outdated knowledge
Acquire new knowledge: extend the knowledge base to cover new areas of expertise
Specific goals for the experiment
Hypothesis 1: subject-matter experts can use Protégé-2000 effectively for knowledge acquisition
Hypothesis 2: highly custom-tailored tools for the specific domain improve knowledge-acquisition rate and quality
Domain: opposing-force unit organization
Source: the Opposing Force (OPFOR) Battle Book (force structure for the opposing force)
Why this domain?
  The OPFOR information is used by intelligence analysts in planning battles
  The OPFOR information is changing and needs to be verified and updated by intelligence analysts
Information represented in the knowledge base
Protégé-2000
HPKB tab
Purpose of the experiment: compare Protégé-2000 and the HPKB Tab
Protégé-2000: a general-purpose tool for knowledge-base design and maintenance; allows automatic generation of forms for browsing and entering knowledge
HPKB Tab: a battlespace-analysis-specific addition to Protégé that collects unit-related information
Experiment methodology: an ablation experiment
Experiment time line
Days 1-2 compare Protégé-2000 to the HPKB Tab; Day 3 tests retention of skills.
Day 1: Group 1 uses Protégé-2000, Group 2 uses the HPKB Tab (morning: training session; afternoon: experiment 1)
Day 2: Group 1 uses the HPKB Tab, Group 2 uses Protégé-2000 (morning: training session; afternoon: experiment 2)
Day 3: Group 1 uses the HPKB Tab, Group 2 uses Protégé-2000 (afternoon: experiment 3)
Tasks
Task design:
  Seven tasks each day, from easy to more difficult
  Each task starts on a new version of the ontology
  The sets of tasks for all three days are similar
Tasks included:
  Verifying what is in the knowledge base
  Correcting wrong information
  Making information more specific
  Creating new classes of units
Example of a task (task 4)
Verify that all artillery subunits of Mechanized Infantry Brigade (IFV)(DIV) have their organization chart specified. You need to verify that each artillery unit mentioned in the chart for Mechanized Infantry Brigade (IFV)(DIV) has its own chart defined. All subunits of other types are now fully specified and you do not need to verify this fact; study only the artillery subunits. For each artillery unit that does not have the chart defined, or does not have it checked (that is, it may not be fully specified), create or complete the chart.
Preparing for evaluation
For each task, define a set of evaluation criteria in advance:
  What constitutes a correct answer?
  What to do if there is more than one answer?
  What do we measure?
Logging capability: keep logs of all steps for each user
Quality is still hard to measure; some of the analysis had to be done manually
Usability questionnaires
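A minimal sketch of what the per-user step logging might look like, assuming a simple timestamped, tab-separated format where every UI action is appended to a per-subject file; the experiment's actual logging mechanism and format may differ.

    import java.io.*;
    import java.time.Instant;

    public class StepLogger {
        private final PrintWriter out;

        // One append-only log file per subject, flushed on every entry.
        StepLogger(String user) throws IOException {
            out = new PrintWriter(new FileWriter(user + ".log", true), true);
        }

        // Record one step: timestamp, action, and the frame or slot it touched.
        void log(String action, String target) {
            out.printf("%s\t%s\t%s%n", Instant.now(), action, target);
        }

        public static void main(String[] args) throws IOException {
            StepLogger log = new StepLogger("subject-07");  // hypothetical subject ID
            log.log("CREATE_CLASS", "ArtilleryBattalion");
            log.log("SET_SLOT", "ArtilleryBattalion.subunits");
        }
    }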
Evaluation criteria
Knowledge-acquisition rate
Ability to find errors
Quality of knowledge entry
Subjective opinion
Evaluating quality of knowledge entry
How many errors SMEs found in the knowledge base
How many wrong steps SMEs took (vs. correct steps)
How many terms SMEs correctly added to the knowledge base
Whether SMEs noticed their errors themselves and were able to recover
How long it took a user to recover from an error
Knowledge-acquisition rate (Days 1-3) [chart]
The HPKB Tab outperforms Protégé-2000 by 43%
KA rate improves substantially with learning
Knowledge-base verification: finding errors
The knowledge base contained a small number of errors for each task; the subjects had to find all the errors
93% of errors were found
On average, the subjects using the HPKB Tab performed 26% better than the subjects using Protégé-2000
Quality of knowledge entry: wrong steps versus correct steps
Removing the "hangover effect"
Wrong steps: 1%
Task 6: enter a large amount of data
Error recovery rate
Average number of steps to recover from an error: 3.5
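One plausible way to derive "steps to recover" from the step logs, assuming wrong and corrective steps were marked during the manual analysis pass mentioned earlier; the markers and the counting rule below are assumptions for illustration, not the experiment's actual coding scheme.

    import java.util.Arrays;
    import java.util.List;

    public class RecoveryAnalysis {
        // Count the steps from each action marked wrong up to and including
        // the step that fixes it, then average over all errors.
        static double meanRecoverySteps(List<String> steps) {
            int total = 0, errors = 0, since = -1;
            for (String s : steps) {
                if (s.startsWith("WRONG:")) {
                    since = 0;               // start counting at the erroneous step
                } else if (since >= 0 && s.startsWith("FIXED:")) {
                    total += since + 1;      // count the corrective step itself
                    errors++;
                    since = -1;
                } else if (since >= 0) {
                    since++;                 // an intermediate step before recovery
                }
            }
            return errors == 0 ? 0 : (double) total / errors;
        }

        public static void main(String[] args) {
            List<String> log = Arrays.asList(
                "SET_SLOT", "WRONG:DELETE_CLASS", "BROWSE", "BROWSE", "FIXED:RECREATE_CLASS");
            System.out.println(meanRecoverySteps(log));  // prints 3.0
        }
    }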
Creating new classes
14 new classes to create
Observations:
  All the classes were placed in the correct places
  On the first two days, subjects created additional categories to hold groups of similar classes
  Subjects explored (and changed) the hierarchy on their own
Retention of skills experiment: knowledge-acquisition rate
Retention-of-skills experiment: results
Number of errors found: increased to 81% with Protégé; was 72% with the HPKB Tab
Correctness: 93% of the steps were correct
User satisfaction
Testing the hypothesis: Protégé-2000 versus the HPKB Tab
The KA rate is 43% higher with the HPKB Tab
On the first day, the quality of knowledge entry is significantly better with the HPKB Tab
Summary of results
Very small amount of training; no help at all on day 3
Knowledge-acquisition rate improves substantially with learning
Subjects found up to 93% of errors
Very low error rate: 6% (almost 1% with the HPKB Tab if you discount the hangover effect)
One week later: it still works ...
Lessons learned
Preparation, preparation, preparation
Do not expect anything:
  What you think is going to be hard is actually easy
  What you think is easy turns out to be hard
A dry run is very important:
  Test the tasks
  Test the software
  Test the metrics-collection mechanism
Lessons learned (2)
Do not underestimate the human factor: you need to break the ice
Design a valid experiment ("Our system does 5 apples per millennium"):
  Carefully designed tasks
  Scripts for the training sessions
Lessons learned (3) Leavenworth is not as bad as you would expect Or is it?