Battling Scylla and Charybdis: The Search for Redundancy and Ambiguity in the 2001 UMLS Metathesaurus James J. Cimino Department of Medical Informatics.

Presentation transcript:

Battling Scylla and Charybdis: The Search for Redundancy and Ambiguity in the 2001 UMLS Metathesaurus James J. Cimino Department of Medical Informatics Columbia University

2001 Metathesaurus
  99 sources (92 in 2000)
  1,734,707 strings (1,598,176 in 2000)
  797,360 concepts (730,155 in 2000)

Lumping vs. Splitting
  [Figure: the strings Cold (infection), Cold (temperature), COLD (COPD) and COLD (temperature), shown once lumped together (Ambiguity!) and once split apart (Redundancy!)]

Three Auditing Methods
  Ambiguity through multiple semantic types
  Redundancy through semantic string matching
  Inconsistency in parent-child semantic types

Previous Results: 1995
  Possible ambiguity: 1,817
  Possible redundancy: 5,031
  Actually redundant: 3,274
  Parent-Child problems: 544
  * Cimino JJ. Auditing the Unified Medical Language System with semantic methods. Journal of the American Medical Informatics Association; 1998;5: *

Tools and Rules
  Simple Metathesaurus data model
  Normalized word index
  “Mutually exclusive semantic types”
  “Mutual concept subsumption”

Simple Metathesaurus Data Model L : S : “COLD ” C : Chronic Obstructive Airway Disease L : S : “Chronic Obstructive Airway Disease” L : S : “Chronic Obstructive Lung Disease” Semantic type: T04: Disease or Syndrome S : “COLD”

Simple Metathesaurus Data Model S : “COLD ” C : Chronic Obstructive Airway Disease S : “Chronic Obstructive Airway Disease” S : “Chronic Obstructive Lung Disease” Semantic type: T04: Disease or Syndrome S : “COLD”

Simple Metathesaurus Data Model “COLD ” C : Chronic Obstructive Airway Disease “Chronic Obstructive Airway Disease” “Chronic Obstructive Lung Disease” Semantic type: T04: Disease or Syndrome “COLD”

Simple Metathesaurus Data Model C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome COLD Chronic Obstructive Airway Disease Chronic Obstructive Lung Disease COLD C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a)

UMLS Semantic Types
  Physical Object
    Organism
      Animal
        Invertebrate
      Plant
        Alga
    Substance
      Food

Mutually Inclusive Semantic Types Physical Object Organism Animal Invertebrate Plant Alga Substance Food

Mutually Exclusive Semantic Types Physical Object Organism Animal Invertebrate Plant Alga Substance Food

Rules for Multiple Semantic Types
  3. Concepts can have two Substance types, except:
     a) Element, Ion or Isotope and Chemicals Viewed Structurally
     b) Inorganic Chemical and Organic Chemicals
  5. Concepts can have two Conceptual Entity types, except:
     Molecular Sequence and Geographic Area
     Molecular Sequence and Body Location or Region
     Geographic Area and Body Location or Region
  7. Concepts can have two Event types, except:
     Diagnostic Procedure and Laboratory Procedure
  8. Concepts can have two types that are ancestors/descendants

Detection of Ambiguity by Mutually Exclusive Semantic Types
  If a concept has multiple semantic types
  And if any pair of the types are mutually exclusive
  Then the concept may have multiple meanings (ambiguity)
  Or the semantic type assignment is incorrect
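The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the method's actual implementation: `EXCLUSIVE_PAIRS` is a hypothetical, hand-picked excerpt of mutually exclusive type pairs (taken from the ambiguity examples in this talk), and the function and variable names are invented for the example.

```python
from itertools import combinations

# Hypothetical excerpt of mutually exclusive semantic type pairs.
# The full rule set from the talk is NOT reproduced here.
EXCLUSIVE_PAIRS = {
    frozenset({"Alga", "Invertebrate"}),
    frozenset({"Plant", "Disease or Syndrome"}),
    frozenset({"Invertebrate", "Disease or Syndrome"}),
}

def exclusive_type_pairs(semantic_types):
    """Return every pair of a concept's types that is listed as mutually exclusive."""
    return [
        pair
        for pair in combinations(sorted(semantic_types), 2)
        if frozenset(pair) in EXCLUSIVE_PAIRS
    ]

# Toy concepts: name -> set of assigned semantic types.
concepts = {
    "Euglena gracilis": {"Alga", "Invertebrate"},
    "Lice Infestations": {"Invertebrate", "Disease or Syndrome"},
    "Common Cold": {"Disease or Syndrome"},
}

# A concept is flagged when at least one pair of its types is mutually exclusive.
flagged = {}
for name, types in concepts.items():
    pairs = exclusive_type_pairs(types)
    if pairs:
        flagged[name] = pairs
```

A flagged concept is only a candidate: either it carries more than one meaning or one of its type assignments is wrong, so each hit still needs manual review.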

Ambiguity Examples
  C : Euglena gracilis (Alga and Invertebrate)
  C : Fourth lumbar vertebra (Body Part, Organ, or Organ Component and Disease or Syndrome)
  C : Toxicodendron (Plant and Disease or Syndrome)
  C : Crown-Rump Length (Organism Attribute and Diagnostic Procedure)
  C : Cell Movement (Cell Function and Biomedical Occupation or Discipline)
  C : Lice Infestations (Invertebrate and Disease or Syndrome)
  C : Chronically Ill (Disease or Syndrome and Patient or Disabled Group)

Normalized Word Index
  UMLS Normalized Word Index
    – e.g., “lungs” → “lung”
    – 293,004 words
  Keyword synonyms
    – e.g., “lung” → “pulmonary”
    – 9,650 mappings
  Translated strings
  Built word index

Word Normalization C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome COLD Chronic Obstructive Airway Disease Chronic Obstructive Lung Disease COLD C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome Parent-Child (is-a)

Word Normalization Parent-Child (is-a) C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome cold 3 chronic obstructive airway disease chronic obstructive lung disease cold

Word Normalization Parent-Child (is-a) C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome cold 3 chronic obstructive airway disease chronic obstructive pulmonary disease cold

Word Normalization Parent-Child (is-a) C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome cold 3 chronic obstructive airway disorder chronic obstructive pulmonary disorder cold

Word Normalization Parent-Child (is-a) C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome cold three chronic obstructive airway disorder chronic obstructive pulmonary disorder cold

Word Index airway chronic cold disorder obstructive pulmonary three Parent-Child (is-a) C : Respiratory Tract Diseases Semantic type: T04: Disease or Syndrome C : Chronic Obstructive Airway Disease Semantic type: T04: Disease or Syndrome cold three chronic obstructive airway disorder chronic obstructive pulmonary disorder cold
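The normalization steps shown across the preceding slides (lowercasing, word normalization, synonym substitution, then building a word list) can be sketched as follows. The two lookup tables are tiny hypothetical stand-ins for the UMLS normalized word index and the keyword synonym mappings, and treating "3" → "three" as a synonym entry is an assumption made only to reproduce the slide's result.

```python
import re

# Hypothetical toy tables; the real resources held 293,004 words and 9,650 mappings.
NORMALIZED = {"lungs": "lung", "diseases": "disease"}
SYNONYMS = {"lung": "pulmonary", "disease": "disorder", "3": "three"}

def normalize(string):
    """Lowercase a string, split it into words, and map each word to its keyword."""
    words = re.findall(r"[a-z0-9]+", string.lower())
    words = [NORMALIZED.get(w, w) for w in words]
    return [SYNONYMS.get(w, w) for w in words]

def word_index(strings):
    """Sorted list of the distinct normalized words across a concept's strings."""
    return sorted({w for s in strings for w in normalize(s)})

strings = [
    "COLD 3",
    "Chronic Obstructive Airway Disease",
    "Chronic Obstructive Lung Disease",
    "COLD",
]
index = word_index(strings)
```

With these toy tables, `index` matches the word list on the slide: airway, chronic, cold, disorder, obstructive, pulmonary, three.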

Mutual String Subsumption
  1) If Concept A has String A1
     And all words in A1 are in Concept B’s word list
     Then B subsumes A1
  2) If B subsumes any string in A
     And A subsumes any string in B
     Then A and B are mutually subsumptive
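The two rules above translate almost directly into set operations. The sketch below assumes each concept is represented as a list of already-normalized strings; the function names are invented for illustration.

```python
def word_list(concept_strings):
    """Union of all words across a concept's normalized strings."""
    return {w for s in concept_strings for w in s.split()}

def subsumes(b_strings, a_string):
    """Rule 1: B subsumes string A1 if every word of A1 is in B's word list."""
    return set(a_string.split()) <= word_list(b_strings)

def mutually_subsumptive(a_strings, b_strings):
    """Rule 2: B subsumes some string of A, and A subsumes some string of B."""
    return (
        any(subsumes(b_strings, s) for s in a_strings)
        and any(subsumes(a_strings, s) for s in b_strings)
    )

# Normalized strings as they appear in the talk's running example.
common_cold = ["common cold", "cold two", "cold"]
cold_temperature = ["cold temperature", "cold one", "cold"]
# Hypothetical normalized string for a concept that shares no string with the others.
respiratory = ["respiratory tract disorder"]

shared_cold = mutually_subsumptive(common_cold, cold_temperature)
unrelated = mutually_subsumptive(common_cold, respiratory)
```

Here `shared_cold` is true only because both concepts carry the bare string "cold", which is exactly why mutual subsumption alone is not enough to call two concepts redundant.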

Mutual String Subsumption common cold cold two cold C : Common Cold T04: Disease or Syndrome cold common two C : cold temperature cold temperature cold one cold T070: Natural Phenomenon or Process cold one temperature C : Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold three cold T04: Disease or Syndrome airway chronic cold disorder obstructive pulmonary three

Mutual String Subsumption C : cold temperature cold temperature cold one cold T070: Natural Phenomenon or Process common cold cold two cold C : Common Cold T04: Disease or Syndrome C : Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold three cold T04: Disease or Syndrome cold common two cold one temperature airway chronic cold disorder obstructive pulmonary three

Detection of Redundancy by String Subsumption
  If A and B are mutually subsumptive
  And semantic types of A and B are mutually inclusive
  Then A and B may be redundant
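A hedged sketch of this rule, applied to the three "cold" concepts from the running example. As a simplifying assumption, `types_inclusive` treats two semantic types as mutually inclusive only when they are identical; the talk's actual rules for multiple semantic types are richer. All names are illustrative.

```python
from itertools import combinations

def words(strings):
    """Union of all words across a concept's normalized strings."""
    return {w for s in strings for w in s.split()}

def mutually_subsumptive(a, b):
    """Each concept's word list fully covers at least one string of the other."""
    a_words, b_words = words(a), words(b)
    return (
        any(set(s.split()) <= b_words for s in a)
        and any(set(s.split()) <= a_words for s in b)
    )

def types_inclusive(type_a, type_b):
    # Simplifying assumption: only identical types count as mutually inclusive.
    return type_a == type_b

# name -> (semantic type, normalized strings)
concepts = {
    "cold temperature": (
        "Natural Phenomenon or Process",
        ["cold temperature", "cold one", "cold"],
    ),
    "Common Cold": (
        "Disease or Syndrome",
        ["common cold", "cold two", "cold"],
    ),
    "Chronic Obstructive Airway Disease": (
        "Disease or Syndrome",
        [
            "chronic obstructive airway disorder",
            "chronic obstructive pulmonary disorder",
            "cold three",
            "cold",
        ],
    ),
}

# Compare every pair of concepts and keep those that satisfy both conditions.
candidates = [
    (a, b)
    for (a, (type_a, strings_a)), (b, (type_b, strings_b)) in combinations(concepts.items(), 2)
    if types_inclusive(type_a, type_b) and mutually_subsumptive(strings_a, strings_b)
]
```

The semantic type filter drops "cold temperature", but the remaining pair is still only a possible redundancy that a reviewer has to confirm or reject.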

Detection of Redundancy by String Subsumption C : cold temperature cold temperature cold one cold T070: Natural Phenomenon or Process common cold cold two cold C : Common Cold T04: Disease or Syndrome C : Chronic Obstructive Airway Disease chronic obstructive airway disorder chronic obstructive pulmonary disorder cold three cold T04: Disease or Syndrome cold common two cold one temperature airway chronic cold disorder obstructive pulmonary three

Redundancy Examples
  C : NPS-R-467 (Organic Chemical)
  C : NPS R-467 (Organic Chemical)

  C : des-Arg(10)-(Leu(9))kallidin (Amino Acid, Peptide or Protein)
  C : kallidin, des-Arg(10)-(Leu(9))-) (Amino Acid, Peptide or Protein)

  C : Congenital diverticulum of esophagus (Congenital Abnormality)
  C : Congenital esophageal pouch (Congenital Abnormality)

Redundancy False Positives
  Incorrect synonymy (MeSH translations): C : Dolphins has synonyms “ORCA” (Span.) and “FALSA BALEIA ASSASSINA” (Port.), so it is mutually subsumptive with C : Whale, False Killer, which has synonym “FALSA ORCA” (Span.)
  Partial names as synonyms: C : Central Diabetes Insipidus has “Diabetes Insipidus” as a synonym, so it is mutually subsumptive with C : Diabetes Insipidus

Detecting Semantic Type Problems through Parent-Child Relations
  If Concept A is Parent of Concept B
  And Concept A has semantic type X
  And Concept B has semantic type Y
  And if X and Y are different
  And X is not an ancestor of Y (in Semantic Net)
  Then one (or both) semantic types are wrong
  Or the parent-child relation is wrong

Detecting Semantic Type Problems through Parent-Child Relations
  Parent: Cartilaginous Fish (vertebrate)
  Children:
    Shark (vertebrate): OK
    Dogfish (fish): OK
    Stingray (animal): Nonspecific Semantic Type
    Skate (manufactured object): Wrong Type or Wrong Concept
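The parent-child rule and the cartilaginous fish example can be sketched as below. `SEMANTIC_PARENT` is a small hand-written excerpt of the Semantic Network hierarchy, included only to make the example runnable, and the function names are hypothetical.

```python
# Toy excerpt of the Semantic Network: child type -> parent type.
SEMANTIC_PARENT = {
    "Fish": "Vertebrate",
    "Vertebrate": "Animal",
    "Animal": "Organism",
    "Organism": "Physical Object",
    "Manufactured Object": "Physical Object",
}

def is_ancestor(x, y):
    """True if type x lies on the path from type y up to the root."""
    node = SEMANTIC_PARENT.get(y)
    while node is not None:
        if node == x:
            return True
        node = SEMANTIC_PARENT.get(node)
    return False

def parent_child_problem(parent_type, child_type):
    """Flag a relation when the types differ and the parent's type is not an ancestor."""
    return parent_type != child_type and not is_ancestor(parent_type, child_type)

parent_type = "Vertebrate"  # semantic type of Cartilaginous Fish on the slide
children = {
    "Shark": "Vertebrate",
    "Dogfish": "Fish",
    "Stingray": "Animal",
    "Skate": "Manufactured Object",
}

flagged = [name for name, t in children.items() if parent_child_problem(parent_type, t)]
```

The rule flags Stingray and Skate alike; telling a merely nonspecific type (Animal) from a wrong type or wrong concept (Manufactured Object) takes a further check or a human reviewer.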

Parent-Child Examples
  C : Elbow has type Body Location or Region, which is in the Conceptual Entity hierarchy
  Is parent of:
  C : Right elbow has type Body Part, Organ, or Organ Component, which is in the Physical Object hierarchy

Results: 1995 vs. 2001
                           1995        2001
  Possible ambiguity       1,817       8,082
  Possible redundancy      5,031       38,140
  Actually redundant       3,274       not done
  Parent-Child problems    544         2,868
  Number of concepts       222,927     797,359 (3.6x)
  Parent-Child relations   100,586     607,043 (6.0x)

Results: 1995 vs. 2001
                           1995              2001
  Possible ambiguity       1,817 (0.82%)     8,082 (1.01%)
  Possible redundancy      5,031 (2.26%)     38,140 (4.78%)
  Actually redundant       3,274 (1.47%)     not done
  Parent-Child problems    544 (0.54%)       2,868 (0.47%)
  Number of concepts       222,927           797,359 (3.6x)
  Parent-Child relations   100,586           607,043 (6.0x)
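The percentages and growth factors on this slide can be rechecked with simple arithmetic. The sketch below assumes, as the figures suggest, that ambiguity and redundancy counts are divided by the number of concepts and parent-child problems by the number of parent-child relations; the variable names are illustrative.

```python
# Denominators from the slide.
concepts = {"1995": 222_927, "2001": 797_359}
relations = {"1995": 100_586, "2001": 607_043}

def pct(count, total):
    """Percentage rounded to two decimals, as shown on the slide."""
    return round(100 * count / total, 2)

ambiguity = {year: pct(n, concepts[year]) for year, n in {"1995": 1_817, "2001": 8_082}.items()}
redundancy = {year: pct(n, concepts[year]) for year, n in {"1995": 5_031, "2001": 38_140}.items()}
parent_child = {year: pct(n, relations[year]) for year, n in {"1995": 544, "2001": 2_868}.items()}

# Growth factors between the two releases, rounded to one decimal.
concept_growth = round(concepts["2001"] / concepts["1995"], 1)
relation_growth = round(relations["2001"] / relations["1995"], 1)
```

Read this way, the possible-redundancy rate roughly doubled (2.26% to 4.78%) while the parent-child problem rate fell slightly, even though the number of relations grew about sixfold.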

Discussion: Ambiguity Detection
  Small number (1.01%) is a good sign
  Allows focusing manual review
  Semantic type definitions need to be clarified
  Semantic type assignment rules need to be clarified

Discussion: Redundancy Detection
  Specificity is worse, without improved sensitivity
  Normalized string index is part of the reason
  “Incomplete” names are a bigger part of the reason
  Manual review will be relatively inefficient
  Incorrect mappings detected, especially foreign language

Discussion: Parent-Child Relations
  Mostly detects errors in semantic type assignment
  Strict hierarchy in Semantic Net causes problems

Conclusions
  Specific “answers” not possible
    – Domain expertise needed for assessment of chemical names
    – Assessments are necessarily subjective
    – NLM gets to make the rules
    – NLM hasn’t finished making the rules
  Methods provide focus for manual review
  Methods highlight where clearer definitions are needed
  The results show the UMLS is doing well at a difficult task

Acknowledgments NLM: Bill Hole, Alexa McCray and Betsy Humphreys Home: Rachel and Rebecca Cimino