Quality Assurance of NCI Thesaurus by Mining Structural-Lexical Patterns. Terminology Quality Evaluation (S60). Rashmie Abeysinghe; joint work with Michael A. Brooks, Jeffery Talbert, and Licong Cui. University of Kentucky.
Disclosure Licong Cui is part of the startup called Synamtics Inc. AMIA 2017 | amia.org
Outline: NCI Thesaurus; Terminology Quality Assurance; Non-lattice Subgraphs; Structural-Lexical Features (Containment, Union, Intersection, Union-Intersection, Inference-Union, Inference-Contradiction); Results; Evaluation; Conclusion and Future Directions.
NCI Thesaurus (NCIt): National Cancer Institute (NCI) Thesaurus. First published in 2000. Contains over 118,000 concepts, hierarchically organized in 19 domains (Abnormal Cell; Anatomic Structure, System, or Substance; Biological Process; Disease, Disorder or Finding; Molecular Abnormality; etc.). Maintained by a multidisciplinary team of editors; 900 concepts are added each month. Covers terminology for clinical care, translational and basic research, public information, and administrative activities.
Terminology Quality Assurance (TQA): An essential part of the terminology management lifecycle. Manual review is labor-intensive and time-consuming. Automating TQA is an active area of research. [Figure callout: Missing Relation!]
Non-lattice Subgraphs: Being a lattice is a desirable property for a well-formed terminology*. Lattice: a DAG such that any two nodes have a unique maximal common descendant as well as a unique minimal common ancestor. [Figure: a non-lattice subgraph, with its upper bounds (U) and lower bounds (L).] *Zhang GQ, Bodenreider O. Large-scale, exhaustive lattice-based structural auditing of SNOMED CT. AMIA Annual Symposium Proc. 2010;922-26.
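The descendant half of the lattice condition can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes the hierarchy is available as a dictionary mapping each concept to its direct children (a hypothetical representation), and the function names are illustrative. It reports every pair of concepts that has more than one maximal common descendant, together with those descendants (the lower bounds of the pair).

```python
from itertools import combinations


def descendants_or_self(children, node):
    """Return node plus all of its descendants in a DAG given as {parent: [children]}."""
    seen = {node}
    stack = [node]
    while stack:
        cur = stack.pop()
        for child in children.get(cur, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def maximal_common_descendants(children, a, b):
    """Common descendants of a and b that are not below another common descendant."""
    common = descendants_or_self(children, a) & descendants_or_self(children, b)
    return {
        c for c in common
        if not any(c in descendants_or_self(children, d) for d in common if d != c)
    }


def non_lattice_pairs(children):
    """Map each pair with more than one maximal common descendant to those descendants."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    result = {}
    for a, b in combinations(sorted(nodes), 2):
        lower = maximal_common_descendants(children, a, b)
        if len(lower) > 1:
            result[(a, b)] = lower
    return result


# Toy hierarchy (hypothetical): A and B share two incomparable children, C and D.
toy = {"A": ["C", "D"], "B": ["C", "D"]}
violations = non_lattice_pairs(toy)
```

Descendants are taken reflexively so that a concept and one of its own ancestors are never flagged. The sketch recomputes descendant sets for clarity; a terminology with over 100,000 concepts would need caching or a more scalable method.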
Structural-Lexical Features: Considering the label of a concept as a set of words in lower case. Containment*: $U_i \subset U_j$ or $L_i \subset L_j$. Union*: $U_i \cup U_j = L_k$. Intersection*: $L_i \cap L_j = U_k$. Union-Intersection*: $U_i \cup U_j = L_s \cap L_t$. Inference-Union: $U_s \cup (L_i \cap L_j) = L_k$. Inference-Contradiction. *Cui L, Zhu W, Tao S, Case JT, Bodenreider O, Zhang GQ. Mining non-lattice subgraphs for detecting missing hierarchical relations and concepts in SNOMED CT. JAMIA. 2017 Jul 1;24(4):788-798.
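The set formulas above translate directly into set operations on word sets. The sketch below is illustrative only and rests on stated assumptions: the U and L sets are taken to be the word sets of the upper-bound and lower-bound labels of one non-lattice subgraph, labels are split on whitespace, and the function names and example labels are hypothetical. Inference-Contradiction is omitted because the slides give no set formula for it.

```python
from itertools import combinations


def word_set(label):
    """Treat a concept label as a set of lower-case words."""
    return frozenset(label.lower().split())


def lexical_patterns(upper_labels, lower_labels):
    """Return the names of the set-based patterns that hold for one subgraph."""
    U = [word_set(s) for s in upper_labels]
    L = [word_set(s) for s in lower_labels]
    found = set()

    # Containment: U_i is a proper subset of U_j, or L_i is a proper subset of L_j.
    for group in (U, L):
        if any(a < b or b < a for a, b in combinations(group, 2)):
            found.add("containment")

    upper_unions = [a | b for a, b in combinations(U, 2)]
    lower_intersections = [a & b for a, b in combinations(L, 2)]

    # Union: U_i | U_j == L_k.
    if any(u in L for u in upper_unions):
        found.add("union")
    # Intersection: L_i & L_j == U_k.
    if any(i in U for i in lower_intersections):
        found.add("intersection")
    # Union-Intersection: U_i | U_j == L_s & L_t.
    if any(u in lower_intersections for u in upper_unions):
        found.add("union-intersection")
    # Inference-Union: U_s | (L_i & L_j) == L_k.
    if any((s | i) in L for s in U for i in lower_intersections):
        found.add("inference-union")
    return found


# Hypothetical labels, chosen only to exercise the checks.
example = lexical_patterns(
    ["Liver Carcinoma", "Adult Carcinoma"],
    ["Adult Liver Carcinoma", "Localized Liver Carcinoma"],
)
```

A subgraph can match several patterns at once, which is why the function returns a set of names rather than a single label.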
Containment: $U_i \subset U_j$ or $L_i \subset L_j$. [Figure: non-lattice subgraph in which $L_j \subset L_i$.]
Containment: $U_i \subset U_j$ or $L_i \subset L_j$. [Figure: suggested fix.]
Union: $U_i \cup U_j = L_k$. [Figure: non-lattice subgraph in which $U_i \cup U_j$ = {malignant, testicular, non-seminomatous, germ, cell, tumor} = $L_k$.]
Union: $U_i \cup U_j = L_k$. [Figure: suggested fix.]
Intersection: $L_i \cap L_j = U_k$. [Figure: non-lattice subgraph in which $L_i \cap L_j$ = {splenic, lymphoblastic, lymphoma} = $U_k$.]
Intersection: $L_i \cap L_j = U_k$. [Figure: suggested fix.]
Union-Intersection: $U_i \cup U_j = L_s \cap L_t$. [Figure: non-lattice subgraph in which $U_i \cup U_j$ = $L_s \cap L_t$ = {localized, adult liver, carcinoma}.]
Union-Intersection: $U_i \cup U_j = L_s \cap L_t$. [Figure: suggested fix.]
Inference-Union: $U_s \cup (L_i \cap L_j) = L_k$. [Figure: non-lattice subgraph in which $L_i \cap L_j$ = {gallbladder, papillary} and $U_s \cup (L_i \cap L_j)$ = {gallbladder, papillary, neoplasm} = $L_i$.]
Inference-Union: $U_s \cup (L_i \cap L_j) = L_k$. [Figure: suggested fix.]
Inference-Contradiction. [Figure: non-lattice subgraph; annotation: anaplastic : neoplastic large.]
Inference-Contradiction. [Figure: suggested fix.]
Five Patterns! Union, Union-Intersection, Inference-Union, Inference-Contradiction, Containment
Results: In total, 8,143 non-lattice subgraphs were identified. 809 of those exhibited lexical patterns: 678 with a single pattern and 131 with multiple patterns.
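As a quick check on these counts, the proportions they imply can be worked out directly. The percentages below are derived here from the reported numbers and do not appear on the slide.

```latex
\[
\frac{809}{8143} \approx 9.9\%, \qquad
\frac{678}{809} \approx 83.8\%, \qquad
\frac{131}{809} \approx 16.2\%
\]
```

That is, roughly one in ten non-lattice subgraphs exhibited a lexical pattern, and most of those matched exactly one pattern.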
Evaluation
Evaluation: Single-pattern non-lattice subgraphs: 44%. Multiple-pattern non-lattice subgraphs: 88%. Overall: 66%.
Conclusion: We investigated a hybrid approach to identifying potential errors in NCIt. Remediations were automatically suggested. The approach is an effective way to detect and correct errors, and it is applicable to other biomedical terminologies.
Future Work: Investigate larger non-lattice subgraphs for evaluation. Use concept synonyms to complement concept labels. Find new patterns to uncover more errors.
Acknowledgement: This work was supported by the National Institutes of Health National Center for Advancing Translational Sciences through grant UL1TR001998 and by the National Science Foundation through grant IIS-1657306. I would like to thank Dr. Licong Cui for the guidance.
Thank you! Email me at: rashmie.abeysinghe@uky.edu