Brian A. Carlsen
Apelon, Inc.
Tools For Classification Integration
Networked Knowledge Organization Systems/Services Workshop
June 28, 2001
2 Presentation Outline
- State of the UMLS Metathesaurus
- Life-cycle of a Source
- Tools and Processes
- Challenges
- Further Approaches
3 State of the UMLS Metathesaurus
- Concept orientation, concept persistence
- Growth to over 800,000 concepts and over 60 vocabulary families
- Over 1000 users worldwide
- Uses of the Metathesaurus
  - Natural Language Processing
  - Knowledge Representation
  - Patient Record Systems
  - Linking Patient Data to Knowledge Sources
  - Automated Indexing/Retrieval
4 Concept and Name Counts By Release Year
5 English Word, String Counts by Release Year
6 Outline
- State of the UMLS Metathesaurus
- Life-cycle of a Source
- Tools and Processes
- Challenges
- Further Approaches
7 Life-cycle of a Source: Inversion
- Source arrives in “machine readable” format*
  - * Many formats are used, including PDF, Clipper dump files, WordPerfect files, unit-record formats, and relational flat files.
- Source undergoes “inversion”
  - Requires a human
  - Input is this machine-readable file
  - Process is source-specific
  - Output is a common relational flat-file format used internally
8 Life-cycle of a Source: Insertion
- A “Recipe” is created
- Test insertion to validate recipe
- Insertion and matching
  - Load common format into database
  - Match to existing content algorithmically
  - Use string normalization
  - Determine SAFE vs. UNSAFE matches
  - Prepare data for editing
- Process is fully undoable
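The matching step can be illustrated with a minimal sketch. The slides do not define the normalization rules or the SAFE/UNSAFE criterion, so everything here is an assumption: `normalize` is a crude stand-in for real string normalization (lowercase, drop punctuation, sort words), and a match is treated as SAFE only when the normalized string maps to exactly one existing concept. The concept ids and names are hypothetical.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Crude normalization (assumed rules): lowercase, drop punctuation, sort words."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(words))


def build_index(existing):
    """Map each normalized string to the set of concept ids that carry it."""
    index = defaultdict(set)
    for concept_id, name in existing:
        index[normalize(name)].add(concept_id)
    return index


def classify_match(term, index):
    """Return (status, concept_ids) for an incoming source term.

    Assumed criterion: one candidate concept is SAFE, several are UNSAFE
    (a human should review), none means the term is NEW content.
    """
    candidates = set(index.get(normalize(term), set()))
    if not candidates:
        return "NEW", candidates
    if len(candidates) == 1:
        return "SAFE", candidates
    return "UNSAFE", candidates


# Hypothetical existing content: "Cold" appears in two different concepts.
existing = [
    ("C1", "Neoplasm, Malignant"),
    ("C2", "Cold"),
    ("C3", "cold"),
]
index = build_index(existing)

print(classify_match("malignant NEOPLASM", index))
print(classify_match("Cold", index))
print(classify_match("Fever", index))
```

Under these assumptions only the UNSAFE results would be queued for human review, which is the division of labor the editing stage relies on.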
9 Life-cycle of a Source: Editing
- Predicate-based partitioning
- Workflow management
  - Review ALL content for new sources
  - Review UNSAFE content for updates
- Human Review
- QA Driven Editing
  - Source-specific QA
  - Feedback QA
  - Conservation of Mass QA
10 Life-cycle of a Source: Release
- Synchronize editing changes
  - State-based model
- Release data in desired format
  - Full release/partial release
- Transform base release
  - “MetamorphoSys”
  - Remove unlicensed data
  - Create “Content Views”
11 Outline
- State of the UMLS Metathesaurus
- Life-cycle of a Source
- Tools and Processes
- Challenges
- Further Approaches
12 Tools and Processes: Overview
- Humans vs. Computers
  - Humans are good at making content decisions
  - Computers are good at automating tasks
- Tools vs. Processes
  - Tools enable computers to automate tasks
  - Processes keep humans productive
13 Tools and Processes: Pre-Editing
- No common data representation
  - Source-by-source conversion to common format
  - Perl, Unix tools
- What would a common format need?
  - Represent terms and attributes
  - Represent within-source relationships
  - Represent hierarchies
  - Represent external-source relationships
  - Represent classifications (e.g. Concept)
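One way to picture the requirements listed for a common format is as a small set of record types. This is only a sketch: the class names, field names, and the sample sources `SRC_A` and `SRC_B` are hypothetical and do not describe the internal format actually used.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str
    source: str                       # vocabulary the term came from
    text: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relationship:
    from_id: str
    to_id: str
    rel_type: str                     # e.g. "isa" for hierarchies


@dataclass
class Classification:
    class_id: str
    kind: str                         # e.g. "concept"
    member_term_ids: list = field(default_factory=list)


@dataclass
class SourceLoad:
    terms: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    classifications: list = field(default_factory=list)

    def children(self, parent_id, rel_type="isa"):
        """Hierarchy lookup: terms related to parent_id by rel_type."""
        return [r.from_id for r in self.relationships
                if r.rel_type == rel_type and r.to_id == parent_id]

    def external_relationships(self):
        """Relationships whose two endpoints come from different sources."""
        source_of = {t.term_id: t.source for t in self.terms}
        return [r for r in self.relationships
                if r.from_id in source_of and r.to_id in source_of
                and source_of[r.from_id] != source_of[r.to_id]]


load = SourceLoad(
    terms=[
        Term("T1", "SRC_A", "Neoplasm", {"code": "N1"}),
        Term("T2", "SRC_A", "Malignant Neoplasm"),
        Term("T3", "SRC_B", "Neoplasms"),
    ],
    relationships=[
        Relationship("T2", "T1", "isa"),        # within-source, hierarchical
        Relationship("T1", "T3", "maps_to"),    # external-source
    ],
    classifications=[Classification("CX", "concept", ["T1", "T3"])],
)

print(load.children("T1"))
print(load.external_relationships())
```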
14 Tools and Processes: Editing
- Workflow Management
  - Report Generation
- State Model vs. Action Model
  - Actions represented as new states vs. single state + actions as data
- Human Editing
  - Interface enabling “high level cognitive editing”
  - LVG: String Normalization
- Automated Editing
  - Safe vs. Unsafe, Integrities
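The contrast between the two models can be sketched with a single editing action, merging one concept into another. Both classes are hypothetical illustrations, not the editing system itself: `StateModel` records every edit as a complete new state, while `ActionModel` keeps one current state and stores each action as data that is sufficient to undo it.

```python
import copy


class StateModel:
    """Every edit is recorded as a complete new state (snapshot)."""

    def __init__(self, initial):
        self.states = [copy.deepcopy(initial)]

    @property
    def current(self):
        return self.states[-1]

    def merge(self, src, dst):
        new = copy.deepcopy(self.current)
        new[dst] = new[dst] | new.pop(src)
        self.states.append(new)

    def undo(self):
        if len(self.states) > 1:
            self.states.pop()


class ActionModel:
    """One current state, plus a log of actions kept as data."""

    def __init__(self, initial):
        self.current = copy.deepcopy(initial)
        self.log = []

    def merge(self, src, dst):
        moved = self.current.pop(src)
        # Record enough information to reverse the action later.
        self.log.append(("merge", src, dst, moved, set(self.current[dst])))
        self.current[dst] = self.current[dst] | moved

    def undo(self):
        if not self.log:
            return
        _, src, dst, moved, dst_before = self.log.pop()
        self.current[dst] = dst_before
        self.current[src] = moved


# Hypothetical content: concept id -> set of names.
initial = {"C1": {"Cold"}, "C2": {"Common Cold"}}

state_model = StateModel(initial)
action_model = ActionModel(initial)
for model in (state_model, action_model):
    model.merge("C2", "C1")

print(state_model.current == action_model.current)
```

The trade-off shown here is storage versus bookkeeping: snapshots make undo trivial but copy the whole state, while an action log stays small but each action type needs its own inverse.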
15 Tools and Processes: Release
- License Agreements
- Content Views
  - e.g. Indexing View
  - Filter by Semantic Type
  - Filter by Language
- Alternative Release Formats
- Updates
- MetamorphoSys
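A content view can be thought of as a filter applied to the base release. The sketch below assumes a flat list of name records; the record fields, the source names `SRC_A` to `SRC_C`, and the sample rows are hypothetical, and this is not how MetamorphoSys is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    concept_id: str
    text: str
    language: str
    semantic_type: str
    source: str


def content_view(names, languages=None, semantic_types=None, licensed_sources=None):
    """Keep names that pass every supplied filter (None means no filter)."""
    def keep(name):
        return (
            (languages is None or name.language in languages)
            and (semantic_types is None or name.semantic_type in semantic_types)
            and (licensed_sources is None or name.source in licensed_sources)
        )
    return [n for n in names if keep(n)]


# Hypothetical base release rows.
base_release = [
    Name("C1", "Common Cold", "ENG", "Disease or Syndrome", "SRC_A"),
    Name("C1", "Resfriado común", "SPA", "Disease or Syndrome", "SRC_B"),
    Name("C2", "Pancreas", "ENG", "Body Part", "SRC_A"),
    Name("C2", "Pancreatic gland", "ENG", "Body Part", "SRC_C"),
]

# Filter by language and semantic type.
english_diseases = content_view(
    base_release, languages={"ENG"}, semantic_types={"Disease or Syndrome"}
)
# Remove data from sources the user has not licensed.
licensed_only = content_view(base_release, licensed_sources={"SRC_A", "SRC_B"})

print([n.text for n in english_diseases])
print([n.text for n in licensed_only])
```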
16 Outline
- State of the UMLS Metathesaurus
- Life-cycle of a Source
- Tools and Processes
- Challenges
- Further Approaches
17 Challenges: Ambiguity
- Ambiguous strings
  - e.g. “Cold”
  - Solution: disambiguating strings, preferred names with “face validity”, integrity checks when merging
- Not fully specified strings
  - e.g. “Head of Pancreas” within “Malignant Neoplasm of Pancreas”
  - Solution: fully specified preferred name
18 Challenges: What is a Classification?
- A classification is any grouping of terms with consistent semantics.
- Thesauri typically group terms by meaning into concepts (synonymy).
- Alternatives
  - Neighborhoods (e.g. Descriptors in MeSH)
    - Near-synonymy
  - No classification (identity or term classification)
    - Lexical
- Connecting relationships/attributes to classifiers
19 Challenges: Precedence
- Concepts (or other classifications) generally have a preferred name
- A thesaurus will have terms from different sources competing for precedence
- Source precedence should be a user-level choice
- Preferred name should not be used as a proxy for concept-ness
- Every level of classification should have a preferred term
- Preferred name exists primarily for “face validity”
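Treating source precedence as a user-level choice means the preferred name is computed from the user's ranking rather than stored with the concept. A minimal sketch follows, in which the function name, the tie-breaking rule (alphabetical), and the sources `SRC_A` to `SRC_C` are all hypothetical.

```python
def preferred_name(names, precedence):
    """Pick a concept's preferred name under a user's source ranking.

    names: list of (source, text) pairs belonging to one concept.
    precedence: list of source names, highest precedence first.
    Sources missing from the ranking sort last; ties break alphabetically.
    """
    if not names:
        raise ValueError("a concept needs at least one name")
    rank = {source: i for i, source in enumerate(precedence)}
    best = min(names, key=lambda n: (rank.get(n[0], len(rank)), n[1]))
    return best[1]


# One concept, three competing names from hypothetical sources.
concept_names = [
    ("SRC_B", "Cold"),
    ("SRC_A", "Common Cold"),
    ("SRC_C", "Acute nasopharyngitis"),
]

# Two users with different rankings see different preferred names.
print(preferred_name(concept_names, ["SRC_A", "SRC_B"]))
print(preferred_name(concept_names, ["SRC_C"]))
```

The concept's membership is identical for both users; only the display name differs, which is why the preferred name should not be used as a proxy for the concept itself.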
20 Challenges: Update Model
- Constituent sources of a thesaurus will be updated
- Editing cycle
  - Updated sources will require editing
  - Typically overlap is > 90%
  - Overlap can safely replace the old version’s content
  - Safe replacements should not be edited
  - Ideally, source providers would indicate replacement; otherwise it must be computed
- Release
  - Release changes
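When a source provider does not indicate replacements, the overlap between versions has to be computed. The sketch below assumes each version is a mapping from a source code to a term string and that two entries are "the same" when the code matches and the strings agree ignoring case. That identity rule, the codes, and the terms are hypothetical.

```python
def diff_versions(old, new):
    """Compare two versions of a source.

    old, new: dicts mapping a source code to its term string.
    Returns (unchanged, added, retired, overlap), where overlap is the
    fraction of the new version already present in the old one.
    """
    old_keys = {(code, text.casefold()) for code, text in old.items()}
    new_keys = {(code, text.casefold()) for code, text in new.items()}
    unchanged = old_keys & new_keys      # safe replacements, no editing needed
    added = new_keys - old_keys          # needs insertion and review
    retired = old_keys - new_keys        # no longer asserted by the source
    overlap = len(unchanged) / len(new_keys) if new_keys else 0.0
    return unchanged, added, retired, overlap


# Hypothetical old and new versions of one source.
old_version = {
    "X01": "Cholera",
    "X02": "Typhoid fever",
    "X03": "Shigellosis",
    "X04": "Amebiasis",
}
new_version = {
    "X01": "Cholera",
    "X02": "Typhoid Fever",   # case change only, treated as unchanged
    "X03": "Shigellosis",
    "X04": "Amoebiasis",      # spelling changed, treated as new content
}

unchanged, added, retired, overlap = diff_versions(old_version, new_version)
print(f"overlap: {overlap:.0%}")
print("needs editing:", sorted(added))
```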
21 Outline
- State of the UMLS Metathesaurus
- Life-cycle of a Source
- Tools and Processes
- Challenges
- Further Approaches
22 Further Approaches: Description Logic
- What is it?
  - Concepts (or other classifications) are axioms
  - Relationships (roles) are theorems
  - The transitive closure of the roles across the concepts is computed to ensure no violations
  - e.g. A isa B, B isa C, C isa A (!violation)
- When is it useful?
  - In formalized, static domains like Anatomy
- When is it not useful?
  - When performance matters more than formalism
  - In dynamic, loosely coupled domains like Genomics
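The violation in the example (A isa B, B isa C, C isa A) can be found by computing the transitive closure of the isa role and checking whether any concept ends up as its own ancestor. This is a minimal sketch of that one check, not a description logic classifier; the function names are hypothetical, and the simple fixed-point loop favors clarity over speed.

```python
def transitive_closure(edges):
    """edges: iterable of (child, parent) isa pairs.

    Returns a dict mapping each concept to the set of all its ancestors.
    """
    closure = {}
    for child, parent in edges:
        closure.setdefault(child, set()).add(parent)
        closure.setdefault(parent, set())
    changed = True
    while changed:                       # repeat until nothing new is added
        changed = False
        for ancestors in closure.values():
            inherited = set()
            for ancestor in ancestors:
                inherited |= closure[ancestor]
            if not inherited <= ancestors:
                ancestors |= inherited
                changed = True
    return closure


def isa_violations(edges):
    """Concepts that are their own ancestor, i.e. that sit on an isa cycle."""
    closure = transitive_closure(edges)
    return sorted(c for c, ancestors in closure.items() if c in ancestors)


print(isa_violations([("A", "B"), ("B", "C"), ("C", "A")]))
print(isa_violations([("A", "B"), ("B", "C")]))
```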
23 Further Approaches: Standards
- XML
- Standardized Terminology/Ontology Representation
  - XML is the most likely candidate
  - Ideally would support
    - Links to external sources
    - Relationships between different levels of classification
    - Update model
    - Description Logic
    - Metadata
- Standardized Thesaurus Representation
- XML Repository
- Standard Object Representations
24 Conclusion: Lessons Learned
- Use the Web
- Use current technology
- Use Description Logic where appropriate
- Make editing intuitive
- Automate tasks
  - “A well-understood, reproducible, automated process that succeeds 95% of the time is a vast improvement over a poorly-understood, labor-intensive process that is believed to succeed 100% of the time.”
  - Review UNSAFE automated tasks.
  - Stop automating when marginal utility falls below a threshold.