Semantic Interpretation of Medical Text
  Barbara Rosario, SIMS
  Steve Tu, UC Berkeley
  Advisor: Marti Hearst, SIMS
Semantic Interpretation of Medical Text
  More accurate representation of the content of the input text
  Enhance text with information (concepts, relationships) drawn from a medical knowledge source
  Determine the semantic meaning of words (and larger constructs) and the relationships between them
Combine Statistical and Symbolic Methods
  Use of knowledge bases, semantic hierarchies, medical knowledge, rules
  Use of statistical methods and machine learning techniques
Statistical methods
  Disambiguation
  Detection of semantic patterns
  Classification of semantically related constructs
  Degrees (weights, probabilities)
First Experiment: Noun Compounds and MeSH
  Interpretation of noun compounds is crucially semantic
  Noun compounds extracted from a collection of titles and abstracts of medical journals found in Medline
  MeSH (Medical Subject Headings) concepts for the labels
Input: Medline Text File
  -> Preprocessing
  -> Tagger
  -> Noun Compound Extraction
  -> Semantic Labeling (MeSH)
  -> Output: Semantically Labelled Noun Compounds
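The noun compound extraction step in the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tagger emits (word, tag) pairs with Penn Treebank noun tags, and the function name `extract_noun_compounds` and the example sentence are hypothetical.

```python
# Assumed tagset: Penn Treebank noun tags. The slides do not name the tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_noun_compounds(tagged, min_len=2):
    """Return maximal runs of consecutive nouns with at least min_len words."""
    compounds = []
    run = []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            run.append(word)
        else:
            if len(run) >= min_len:
                compounds.append(run)
            run = []
    # Flush a run that reaches the end of the sentence.
    if len(run) >= min_len:
        compounds.append(run)
    return compounds


# Hypothetical tagged sentence built around a compound from the slides.
tagged_sentence = [
    ("The", "DT"), ("rat", "NN"), ("liver", "NN"), ("mitochondria", "NNS"),
    ("were", "VBD"), ("isolated", "VBN"),
]
compounds = extract_noun_compounds(tagged_sentence)
print(compounds)
```

Single nouns are dropped because a compound needs at least two nouns; raising `min_len` to 3 would keep only the three-word compounds used in the examples.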
MeSH Tree Structures (main)
  1. Anatomy [A]
  2. Organisms [B]
  3. Diseases [C]
  4. Chemicals and Drugs [D]
  5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
  6. Psychiatry and Psychology [F]
  7. Biological Sciences [G]
  8. Physical Sciences [H]
  9. Anthropology, Education, Sociology and Social Phenomena [I]
  10. Technology and Food and Beverages [J]
  11. Humanities [K]
  12. Information Science [L]
  13. Persons [M]
  14. Health Care [N]
  15. Geographic Locations [Z]
MeSH Tree Structures (node A expanded)
  1. Anatomy [A]
       Body Regions [A01] +
       Musculoskeletal System [A02] +
       Digestive System [A03] +
       Respiratory System [A04] +
       Urogenital System [A05] +
       Endocrine System [A06] +
       Cardiovascular System [A07] +
       Nervous System [A08] +
       Sense Organs [A09] +
       Tissues [A10] +
       Cells [A11] +
       Fluids and Secretions [A12] +
       Animal Structures [A13] +
       Stomatognathic System [A14] +
       Hemic and Immune Systems [A15] +
       Embryonic Structures [A16] +
  Body Regions [A01]
       Abdomen [A01.047]
            Groin [A ]
            Inguinal Canal [A ]
            Peritoneum [A ] +
            Retroperitoneal Space [A ]
            Umbilicus [A ]
       Axilla [A01.133]
       Back [A01.176] +
       Breast [A01.236] +
       Buttocks [A01.258]
       Extremities [A01.378] +
       Head [A01.456] +
       Neck [A01.598]
       Pelvis [A01.673] +
       Perineum [A01.719]
       Skin [A01.835] +
       Thorax [A01.911] +
       Viscera [A01.960]
Mapping Nouns to MeSH Concepts
  Ex: migraine headache recurrence
    migraine     C  C  C
    headache     C  C  C
    recurrence   C
More Noun Compounds
  migraine headache recurrence    C    C    C
  blood plasma perfusion          A    A    E
  migraine headache pain          C    C    G
  brain stem neurons              A    E    A
  rat liver mitochondria          B    A    A
  plasma arginine vasopressin     A    D    D
  rat thyroid cells               B    A    A11
  growth hormone secretion        G    D    A
  blood urea nitrogen             A    D    D
  breast cancer cells             A    C04  A11
  cancer cell lines               C04  A11  G
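The labeling shown on this slide amounts to a lookup from each noun to a MeSH tree label. Below is a minimal sketch of that lookup, assuming a toy lexicon filled only with entries read off the slide; the names `MESH_LABELS` and `label_compound` are hypothetical, and a real system would query MeSH itself and handle nouns with several tree positions.

```python
# Toy lexicon: labels copied from three compounds on the slide.
MESH_LABELS = {
    "rat": "B", "liver": "A", "mitochondria": "A",
    "blood": "A", "plasma": "A", "perfusion": "E",
    "breast": "A", "cancer": "C04", "cells": "A11",
}


def label_compound(compound):
    """Pair each noun with its MeSH label, or None if the noun is unknown."""
    return [(noun, MESH_LABELS.get(noun.lower())) for noun in compound.split()]


labeled = label_compound("rat liver mitochondria")
print(labeled)
```

Returning `None` for unknown nouns keeps non-medical words visible instead of silently dropping them.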
Attachment and Semantic Interpretation
  Attachment classification
    “acute migraine treatment” [[N N] N] (LA)
    “intra-nasal migraine treatment” [N [N N]] (RA)
  To bootstrap semantic interpretation
  Decision tree (Quinlan )
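The two attachment classes correspond to two bracketings of a three-noun compound. The sketch below turns a compound and its LA/RA label into a nested tuple; the function name `bracket` is hypothetical, and the representation as tuples is an assumption made for illustration.

```python
def bracket(compound, attachment):
    """Bracket a three-word compound as [[N N] N] (LA) or [N [N N]] (RA)."""
    n1, n2, n3 = compound.split()
    if attachment == "LA":
        return ((n1, n2), n3)
    if attachment == "RA":
        return (n1, (n2, n3))
    raise ValueError(f"unknown attachment label: {attachment!r}")


left = bracket("acute migraine treatment", "LA")
right = bracket("intra-nasal migraine treatment", "RA")
print(left)
print(right)
```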
Levels of Descriptions
  migraine headache recurrence (LA)   C C C
  Feature vector
    Only Tree   C, C, C
    Level 1     C, 10, C, 23, C, 23
    Level 2     C, , C, , C,
    Level 3     C, , C, , C,
    Level 4     C, , C, , C,
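One way to read the levels of description is as truncations of each noun's MeSH tree number. The sketch below assumes that a noun contributes its tree letter alone at the "Only Tree" level, and its tree letter plus the tree number cut at depth n at Level n; how the deeper levels were actually encoded is an assumption. The names `describe` and `feature_vector` are hypothetical, and the example tree numbers are taken from the Body Regions slide (Abdomen, Breast, Thorax) purely to show the mechanics.

```python
def describe(tree_number, level):
    """Describe one MeSH tree number at the given level (0 = tree letter only)."""
    letter, digits = tree_number[0], tree_number[1:]
    if level == 0:
        return [letter]
    parts = digits.split(".")
    # Keep at most `level` dot-separated components of the tree number.
    return [letter, ".".join(parts[:level])]


def feature_vector(tree_numbers, level):
    """Concatenate the per-noun descriptions into one flat feature vector."""
    features = []
    for tree_number in tree_numbers:
        features.extend(describe(tree_number, level))
    return features


example = ["A01.047", "A01.236", "A01.911"]
print(feature_vector(example, 0))
print(feature_vector(example, 1))
print(feature_vector(example, 2))
```

Coarser levels generalize over more compounds, while deeper levels separate nouns that share a top-level tree.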
Decision Tree Classification
  Columns: training before pruning, training after pruning, testing before pruning, testing after pruning
    Only Tree   15.8%  16.4%  17.3%
    Level 1     %  11.8%  15.4%
    Level 2     7.9%  8.6%  21.2%  17.3%
    Level 3     7.9%  10.5%  26.9%  17.3%
    Level 4     8.6%  9.9%  25.0%  19.2%
Expressiveness of Decision Trees
first noun tree = B: ra (33.0/3.7)
first noun tree = E: ra (2.0/1.6)
first noun tree = F: la (0.0)
first noun tree = G: la (4.0/0.3)
first noun tree = A:
|   second noun tree = B: la (0.0)
|   second noun tree = D: la (4.0/0.3)
|   second noun tree = E: la (10.0/0.4)
|   second noun tree = F: la (0.0)
|   second noun tree = G: la (6.0/1.6)
|   second noun tree = A:
|   |   first tree position <= 4 : ra (7.0/1.6)
|   |   first tree position > 4 : la (36.0/5.8)
|   second noun tree = C:
|   |   third noun tree = A: ra (9.0/0.3)
|   |   third noun tree = B: la (0.0)
|   |   third noun tree = D: la (1.0/0.3)
|   |   third noun tree = E: la (5.0/0.3)
|   |   third noun tree = F: la (0.0)
|   |   third noun tree = G: ra (2.0/1.6)
|   |   third noun tree = C:
|   |   |   third tree position <= 21 : ra (5.0/2.6)
|   |   |   third tree position > 21 : la (5.0/0.3)
first noun tree = C: …..
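The tree fragment above can be read as a set of nested rules. The sketch below hand-codes only the branches shown on the slide as a Python function. It is an illustration of how the rules are read, not the learned model: the function name `classify` and its parameter names are hypothetical, "tree position" is assumed to be the number following the tree letter, and any branch the slide elides returns `None`.

```python
def classify(first_tree, second_tree=None, third_tree=None,
             first_pos=None, third_pos=None):
    """Return 'la' or 'ra' for the branches shown on the slide, else None."""
    top = {"B": "ra", "E": "ra", "F": "la", "G": "la"}
    if first_tree in top:
        return top[first_tree]
    if first_tree == "A":
        second = {"B": "la", "D": "la", "E": "la", "F": "la", "G": "la"}
        if second_tree in second:
            return second[second_tree]
        if second_tree == "A" and first_pos is not None:
            return "ra" if first_pos <= 4 else "la"
        if second_tree == "C":
            third = {"A": "ra", "B": "la", "D": "la",
                     "E": "la", "F": "la", "G": "ra"}
            if third_tree in third:
                return third[third_tree]
            if third_tree == "C" and third_pos is not None:
                return "ra" if third_pos <= 21 else "la"
    # Branch not shown on the slide (for example, first noun tree = C).
    return None


print(classify("A", "C", "A"))  # path followed by A-C-A compounds
print(classify("A", "C", "E"))  # path followed by A-C-E compounds
```

Each root-to-leaf path is a conjunction of tests on the nouns' MeSH trees, which is what makes the paths usable as cluster descriptions.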
Semantic Interpretation Use decision tree paths for the detection of clusters of noun compounds with the same semantic interpretation
Ex: ACA:
  breast cancer cells       A  C04  A11  ra
  bladder cancer cells      A  C04  A11  ra
  colon carcinoma cells     A  C    A11  ra
  prostate tumor cells      A  C04  A11  ra
  prostate cancer tissue    A  C04  A10  ra
  lung cancer cells         A  C04  A11  ra
  colon cancer cells        A  C04  A11  ra
  brain tumor tissue        A  C04  A10  ra
  colon cancer tissues      A  C04  A10  ra
  bladder tumor cells       A  C04  A11  ra
  Interpretation: noun3 exhibits noun2 in noun1
Ex: ACE:
  muscle disease diagnosis        A    C    E01  la
  breast cancer prognosis         A    C04  E    la
  breast cancer treatment         A    C04  E02  la
  hip fracture treatment          A    C    E02  la
  cell cancer treatment           A11  C04  E02  la
  brain tumor treatment           A    C04  E02  la
  colon adenocarcinoma xenograft  A    C    E
  colon carcinoma xenograft       A    C    E
  colon carcinoma xenografts      A    C    E
  neck cancer xenografts          A    C04  E
  Interpretation:
    1: noun3 diagnoses noun2 in noun1
    2: noun3 treats noun2 in noun1
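The ACA and ACE clusters can be approximated by grouping compounds on the sequence of their MeSH tree letters together with the attachment label. This is a simplified stand-in for following decision tree paths, shown here only to make the idea concrete; the names `pattern` and `cluster` are hypothetical, and the data rows are copied from the two example slides.

```python
from collections import defaultdict


def pattern(labels):
    """Reduce labels such as ['A', 'C04', 'A11'] to the letter pattern 'ACA'."""
    return "".join(label[0] for label in labels)


def cluster(compounds):
    """Group compound strings by (letter pattern, attachment)."""
    clusters = defaultdict(list)
    for text, labels, attachment in compounds:
        clusters[(pattern(labels), attachment)].append(text)
    return dict(clusters)


examples = [
    ("breast cancer cells", ["A", "C04", "A11"], "ra"),
    ("prostate cancer tissue", ["A", "C04", "A10"], "ra"),
    ("breast cancer treatment", ["A", "C04", "E02"], "la"),
    ("hip fracture treatment", ["A", "C", "E02"], "la"),
]
clusters = cluster(examples)
for key, members in clusters.items():
    print(key, members)
```

A cluster found this way still needs a human-assigned interpretation, and, as the ACE example shows, one cluster may carry more than one.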
From MeSH to UMLS
  Unified Medical Language System, a project at the U.S. National Library of Medicine
  3 UMLS Knowledge Sources
    Metathesaurus
    Semantic Network
    SPECIALIST lexicon and programs
Metathesaurus
  Most extensive of the UMLS sources
  730,000 concepts representing more than 1,500,000 strings in over 60 vocabularies and classifications
  Organized by concept or meaning. In essence, its purpose is to link alternative names and views of the same concept together and to identify useful relationships between different concepts.
  Relationships in the Metathesaurus come from the sources themselves or are created by the Metathesaurus editors.
Semantic Network
  Consistent categorization of all concepts represented in the UMLS Metathesaurus and the important relationships between them.
  Every concept has been assigned a semantic type.
  The semantic types (134) are the nodes in the Network, and the relationships between them are the links (54).
  High level semantic structure
"Biologic Function" Hierarchy
Noun Compounds, again
  Very preliminary studies…
  Can we use the information in the Semantic Network for the semantic interpretation of noun compounds?
  Are semantic types and relationships good descriptors?
  Are they useful for disambiguation and classification?
Mapping of Noun Compounds
NC: peptide CRF receptor antagonists
  C |C |C |C |
  Amino Acid, Peptide, or Protein|Hormone|Receptor|Pharmacologic Substance|
  A |A |A |A |
  rel_12.1 (Amino Acid, Peptide, or Protein, Hormone) = interacts_with: A R3.1.5 A
  rel_13.1 (Amino Acid, Peptide, or Protein, Receptor) = interacts_with: A R3.1.5 A
  rel_14.1 (Amino Acid, Peptide, or Protein, Pharmacologic Substance) = interacts_with: A R3.1.5 A
  rel_23.1 (Hormone, Receptor) = interacts_with: A R3.1.5 A
  rel_24.1 (Hormone, Pharmacologic Substance) = interacts_with: A R3.1.5 A
  rel_34.1 (Receptor, Pharmacologic Substance) = interacts_with: A R3.1.5 A
Mapping of Noun Compounds
NC: day hospital treatment
  C |C |C ,C |
  Temporal Concept|Health Care Related Organization|Functional Concept;Therapeutic or Preventive Procedure|
  A2.1.1|A2.7.1|A2.1.4;B |
  rel_12.1 (Temporal Concept, Health Care Related Organization) = NOT found in SemNet
  rel_13.1 (Temporal Concept, Functional Concept) = NOT found in SemNet
  rel_13.2 (Temporal Concept, Therapeutic or Preventive Procedure) = NOT found in SemNet
  rel_23.1 (Health Care Related Organization, Functional Concept) = NOT found in SemNet
  rel_23.2 (Health Care Related Organization, Therapeutic or Preventive Procedure) = location_of: R2.1
Mapping of Noun Compounds
NC: brain serotonin metabolism
  C |C |C ,C |
  Body Part, Organ, or Organ Component|Neuroreactive Substance or Biogenic Amine|Organism Function;Functional Concept|
  A |A |B ;A2.1.4|
  rel_12.1 (Body Part, Organ, or Organ Component, Neuroreactive Substance or Biogenic Amine) = produces R3.2.1
  rel_13.1 (Body Part, Organ, or Organ Component, Organism Function) = location_of R2.1
  rel_13.2 (Body Part, Organ, or Organ Component, Functional Concept) = NOT found in SemNet
  rel_23.1 (Neuroreactive Substance or Biogenic Amine, Organism Function) = disrupts R3.1.3
  rel_23.2 (Neuroreactive Substance or Biogenic Amine, Functional Concept) = NOT found in SemNet
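The rel_ij.k lines above enumerate, for every pair of nouns i < j, each combination k of their candidate semantic types and look it up in the Semantic Network. A minimal sketch of that enumeration follows, assuming a tiny in-memory relation table that holds only the three relations reported on the "brain serotonin metabolism" slide; `RELATIONS` and `pairwise_relations` are hypothetical names, and a real system would query the UMLS Semantic Network files.

```python
from itertools import product

# Toy relation table: only the relations shown for "brain serotonin metabolism".
RELATIONS = {
    ("Body Part, Organ, or Organ Component",
     "Neuroreactive Substance or Biogenic Amine"): "produces",
    ("Body Part, Organ, or Organ Component",
     "Organism Function"): "location_of",
    ("Neuroreactive Substance or Biogenic Amine",
     "Organism Function"): "disrupts",
}


def pairwise_relations(types_per_noun):
    """Map 'rel_ij.k' to the relation between type pair k of nouns i and j.

    The value is None when the pair has no relation in the table.
    """
    results = {}
    n = len(types_per_noun)
    for i in range(n):
        for j in range(i + 1, n):
            pairs = product(types_per_noun[i], types_per_noun[j])
            for k, (t1, t2) in enumerate(pairs, start=1):
                results[f"rel_{i + 1}{j + 1}.{k}"] = RELATIONS.get((t1, t2))
    return results


brain_serotonin_metabolism = [
    ["Body Part, Organ, or Organ Component"],
    ["Neuroreactive Substance or Biogenic Amine"],
    ["Organism Function", "Functional Concept"],
]
rels = pairwise_relations(brain_serotonin_metabolism)
for name, relation in rels.items():
    print(name, relation if relation else "NOT found in SemNet")
```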
Mapping Words - Semantic Types, Semantic Relationships
  Semantic types correctly assigned (on 246 nc, 738 nouns): 59%
  Semantic types disambiguated by the relationships
    Doesn’t disambiguate: 42.7%
    Disambiguates incorrectly: 17.3%
    Disambiguates correctly: 40%
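One plausible reading of "disambiguated by the relationships" is that an ambiguous noun keeps the single candidate type that takes part in a relation with a type of another noun in the compound. The slides do not spell out the procedure, so the sketch below is an assumption; `RELATIONS` and `disambiguate` are hypothetical names, and the relation table holds only the one relation found for "day hospital treatment".

```python
# Toy relation table: the only relation found for "day hospital treatment".
RELATIONS = {
    ("Health Care Related Organization",
     "Therapeutic or Preventive Procedure"): "location_of",
}


def disambiguate(types_per_noun, target):
    """Return the one candidate type of noun `target` supported by a relation.

    Returns None when no candidate, or more than one, is supported.
    """
    supported = []
    for candidate in types_per_noun[target]:
        for other, other_types in enumerate(types_per_noun):
            if other == target:
                continue
            if any((t, candidate) in RELATIONS or (candidate, t) in RELATIONS
                   for t in other_types):
                supported.append(candidate)
                break  # one supporting relation is enough for this candidate
    return supported[0] if len(supported) == 1 else None


day_hospital_treatment = [
    ["Temporal Concept"],
    ["Health Care Related Organization"],
    ["Functional Concept", "Therapeutic or Preventive Procedure"],
]
chosen = disambiguate(day_hospital_treatment, 2)
print(chosen)
```

Under this reading, "doesn't disambiguate" corresponds to the `None` result, and the other two outcomes depend on whether the supported type is the right one.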
(Some of) Future Work
  Explore the UMLS sources in more depth
  What forms the best basis for automatic semantic interpretation of noun phrases?
    Semantic types?
    Metathesaurus concepts? (and which parts of them)
    Just MeSH concepts?
  Machine learning algorithms to help choose a good representation of medical terms
Future Work
  Machine learning algorithms for classification
  Can we generalize patterns found for noun compounds to other syntactic structures, and how?
  How can we best formally represent semantics?
  How can we combine symbolic rules with statistical methods?
  How can we deal with non-medical words? Can the system help us disambiguate them?
  Should we use other ontologies (e.g., WordNet)?