1 Literature-Based Knowledge Discovery using Natural Language Processing. Dimitar Hristovski, PhD (1); Carol Friedman, PhD (2); Thomas C Rindflesch, PhD (3); Borut Peterlin, MD PhD (4). (1) Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia; (2) Department of Biomedical Informatics, Columbia University, New York; (3) National Library of Medicine, Bethesda, Maryland; (4) Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia
2 Part 1: Co-occurrence based LBD
3 Motivation Overspecialization Information overload Large databases Need and opportunity for computer supported knowledge discovery
4 Literature-based Discovery (LBD) A method for automatically generating hypotheses (discoveries) from literature Hypotheses have form: Concept1 –Relation– Concept2 Example: Fish oil –Treats– Raynaud’s disease
5 Background Swanson’s LBD paradigm: Concept X (Disease), e.g. Raynaud’s; Concepts Y (Pathological or Cell Function, …), e.g. Blood viscosity; Concepts Z (Drugs, …), e.g. Fish oil; New Relation? e.g. Treats
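The X, Y, Z paradigm above can be sketched as a two-step lookup over co-occurrence data. This is a minimal illustration, not the BITOLA implementation: the function name `open_discovery`, the dictionary layout, and the toy co-occurrence data (including the "Platelet aggregation" and "Aspirin" entries) are assumptions made for the example.

```python
from collections import defaultdict

def open_discovery(x, cooccurs):
    """Return candidate Z concepts linked to x only through intermediate Y concepts.

    cooccurs maps each concept to the set of concepts it co-occurs with
    in the literature (a toy stand-in for a real literature index).
    """
    direct = cooccurs.get(x, set())
    candidates = defaultdict(set)  # z -> set of linking y concepts
    for y in direct:
        for z in cooccurs.get(y, set()):
            # Keep z only if it is new: not x itself and not already linked to x.
            if z != x and z not in direct:
                candidates[z].add(y)
    return dict(candidates)

# Toy data, invented for illustration only.
toy = {
    "Raynaud's disease": {"Blood viscosity", "Platelet aggregation"},
    "Blood viscosity": {"Raynaud's disease", "Fish oil"},
    "Platelet aggregation": {"Raynaud's disease", "Fish oil", "Aspirin"},
    "Fish oil": {"Blood viscosity", "Platelet aggregation"},
    "Aspirin": {"Platelet aggregation"},
}
result = open_discovery("Raynaud's disease", toy)
```

Here `result` maps each candidate Z to the Y concepts that connect it to X, which is the raw material a user would review when judging a proposed relation.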
6 Biomedical Discovery Support System (BITOLA) Goal: –discover potentially new relations (knowledge) between biomedical concepts –to be used as research idea generator and/or as –an alternative way to search Medline System user (researcher or intermediary): –interactively guides the discovery process –evaluates the proposed relations
7 Extending and Enhancing Literature Based Discovery Goal: –Make literature based discovery more suitable for disease candidate gene discovery –Decrease the number of candidate relations Method: –Integrate background knowledge: Chromosomal location of diseases and genes Gene expression location Disease manifestation location
8 System Overview Knowledge Base Concepts Association Rules Background Knowledge (Chromosomal Locations, …) Discovery Algorithm User Interface Databases (Medline, LocusLink, HUGO, OMIM, …) Knowledge Extraction
9 Terminology Problems during Knowledge Extraction Gene names Gene symbols MeSH and genetic diseases
10 Detected Gene Symbols by Frequency type| II| III| component| CT| AT| ATP| IV| CD4|99657 p53|89357 MR|88682 SD|85889 GH|84797 LPS| |67272 E2| |63521 AMP|61862 TNF|59343 RA|58818 CD8|57324 O2|56847 ACTH|54933 CO2|53171 PKC|51057 EGF|50483 T3|49632 MS|46813 A2|44896 ER|43212 upstream|41820 PRL|41599
11 Gene Symbol Disambiguation Find MEDLINE docs in which we can expect to find gene symbols Example of false positive: –Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390 –breast basic conserved 1 (BBC1) gene vs. the BBC1 television station featuring the new drama series Life Support
12 Binary Association Rules X → Y (confidence, support) If X Then Y (confidence, support) Confidence = % of docs containing Y within the X docs Support = number (or %) of docs containing both X and Y The relation between X and Y is not known. Examples: –Multiple Sclerosis → Optic Neuritis (2.02, 117) –Multiple Sclerosis → Interferon-beta (5.17, 300)
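The two measures defined above can be computed directly from per-document concept sets. The sketch below assumes each document is represented as a Python set of concept names; the function name `association_rule` and the five toy documents are illustrative, not taken from the system.

```python
def association_rule(x, y, docs):
    """Compute (confidence, support) for the rule x -> y.

    docs is an iterable of sets, each holding the concepts of one document.
    Support is the number of documents containing both x and y;
    confidence is that number as a percentage of the documents containing x.
    """
    x_docs = [d for d in docs if x in d]
    support = sum(1 for d in x_docs if y in d)
    confidence = 100.0 * support / len(x_docs) if x_docs else 0.0
    return confidence, support

# Toy corpus, invented for illustration only.
docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
    {"Optic Neuritis"},
]
conf, supp = association_rule("Multiple Sclerosis", "Optic Neuritis", docs)
```

The examples on the slide are consistent with this reading: 117 is about 2.02% of roughly 5,800 documents, and 300 is about 5.17% of the same total.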
13 Discovery Algorithm Concept X (Disease) Concepts Y (Pathological or Cell Function, …) Concepts Z (Genes) Chromosomal Region Chromosomal Location Candidate Gene? Match Manifestation Location Expression Location Match
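The chromosomal match step in the diagram above can be sketched as a filter over the related genes Z. This is a simplified illustration under stated assumptions: the function name `candidate_genes` is hypothetical, matching is done by string prefix rather than a proper cytogenetic comparison, and `GENE_A` and `GENE_B` are placeholder genes with made-up locations.

```python
def candidate_genes(disease_region, related_genes, gene_location):
    """Keep genes whose chromosomal location falls inside the disease region.

    Matching is by string prefix (region "Xq" contains "Xq28"), which is a
    simplification assumed for this sketch, not a full cytogenetic comparison.
    """
    return sorted(
        g for g in related_genes
        if gene_location.get(g, "").startswith(disease_region)
    )

# L1CAM and FLNA lie in Xq28; GENE_A and GENE_B are placeholders.
gene_location = {"L1CAM": "Xq28", "FLNA": "Xq28", "GENE_A": "7p13", "GENE_B": "Xp21"}
related = {"L1CAM", "FLNA", "GENE_A", "GENE_B"}
hits = candidate_genes("Xq", related, gene_location)
```

The same shape of filter could be applied to the second match in the diagram, comparing the manifestation location of the disease with the expression location of each gene.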
14 Ranking Concepts Z [Diagram: starting concept X linked to intermediate concepts Y1, Y2, Y3, …, Yi, Yj, which are linked to candidate concepts Z1, Z2, Z3, …, Zk, …, Zn]
15 Problem Size Full Medline analyzed (ca. 15,000,000 records) 87,000,000 association rules between 180,000 biomedical concepts
16 Bilateral Perisylvian Polymicrogyria - BPP (OMIM: ) Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution Clinical characteristics: –Mental retardation –Epilepsy –Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process) X-linked dominant inheritance
17 18 gene candidates 15 gene candidates Tissue specific expression 2 gene candidates: L1CAM and FLNA relation between semantic types Cell Movement and Gene or gene products Sublocalisation in the Xq genes in Xq28
18 User Interface “cgi-bin” version
19 Automatically search for supporting Medline Citations
20 Part 1: Summary and Conclusions Discovery support system (BITOLA) presented The system can be used as: –Research idea generator, or –Alternative method of searching Medline Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery
21 System Availability URL:
22 Part 2: Exploring Semantic Relations for LBD
23 Current LBD Systems Co-occurrence based Concepts –Title/Abstract Words/Phrases –MeSH –UMLS –Genes... UMLS Semantic types used for filtering Semantic relations between concepts NOT used
24 Drawbacks of Current LBD Not all co-occurrences represent a relation Users have to read many Medline citations when reviewing candidate relations Many spurious (false-positive) relations and hypotheses produced No explanation of proposed hypotheses
25 Enhancing the LBD paradigm Use semantic relations obtained from –two NLP systems (BioMedLee and SemRep) to augment –co-occurrence based LBD system (BITOLA)
26 Methods
27 Discovery Patterns Discovery pattern: Set of conditions to be satisfied for the generation of new hypotheses Conditions are combinations of semantic relations between concepts Maybe_Treats pattern in this research – has two forms: – Maybe_Treats1 – Maybe_Treats2
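28 Maybe_Treats Discovery Pattern [Diagram] Maybe_Treats1: Disease X is associated with Change1 in Substance Y1 (or Body meas., Body funct.); Drug Z1 (or substance) is associated with Opposite_Change1 in Y1; hence Z1 Maybe_Treats1 X. Maybe_Treats2: Disease X and Disease X2 are associated with the Same Change2 in Substance Y2 (or Body meas., Body funct.); Drug Z2 (or substance) Treats X2; hence Z2 Maybe_Treats2 X.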
29 Maybe_Treats1 and Maybe_Treats2 Goal: Propose potentially new treatments Can work in concert: –Propose different treatments ( complementary ) –Propose same treatments using different discovery reasoning ( reinforcing )
30 Multiple Usages of Maybe_Treats Given Disease X as input: –find new treatments Z Given Drug Z as input: –find diseases X that can be treated Given Disease X and Drug Z as input: –test whether Z can be used to treat X
31 Semantic Relations Used Associated_with_change and Treats used to extract known facts from the literature Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts
32 Associated_with_change One concept associated with a change in another concept, for example: Associated_with(Raynaud’s, Blood viscosity, increase): –“Local increase of blood viscosity during cold-induced Raynaud's phenomenon.” –“Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, …” BioMedLee (Friedman et al) used to extract Associated_with_change
33 Treats Used to extract drugs known to treat a disease Major purpose in our approach: –Eliminate drugs already known to be used to treat a disease –Find existing treatments for similar diseases TREATS(Amantadine,Huntington): –“…treatment of Huntington’s disease with amantadine…” Treats extracted by SemRep (Rindflesch et al)
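The two Maybe_Treats forms can be sketched over plain tuples of extracted facts. This is only an illustration of the pattern logic under stated assumptions: the function names, the tuple layouts, and the separation of disease-side and drug-side changes into two lists are choices made for the example, not the output formats of BioMedLee or SemRep. The facts are the ones quoted elsewhere in these slides.

```python
OPPOSITE = {"increase": "decrease", "decrease": "increase"}

def maybe_treats1(disease, disease_changes, drug_changes, treats):
    """Drug Z maybe treats X if X changes Y one way and Z changes Y the opposite way."""
    known = {drug for drug, d in treats if d == disease}  # already used for X
    hypotheses = set()
    for d, substance, direction in disease_changes:
        if d != disease:
            continue
        for drug, s, drug_direction in drug_changes:
            if s == substance and drug_direction == OPPOSITE.get(direction) and drug not in known:
                hypotheses.add((drug, substance))
    return hypotheses

def maybe_treats2(disease, disease_changes, treats):
    """Drug Z maybe treats X if Z treats a disease X2 that shares a change with X."""
    known = {drug for drug, d in treats if d == disease}
    mine = {(s, direction) for d, s, direction in disease_changes if d == disease}
    hypotheses = set()
    for other, s, direction in disease_changes:
        if other != disease and (s, direction) in mine:
            for drug, treated in treats:
                if treated == other and drug not in known:
                    hypotheses.add((drug, other, s))
    return hypotheses

# Facts as (agent, substance, direction) and (drug, disease) tuples.
disease_changes = [
    ("Huntington", "Substance P", "decrease"),
    ("Huntington", "Insulin", "decrease"),
    ("Diabetes mellitus", "Insulin", "decrease"),
]
drug_changes = [("Capsaicin", "Substance P", "increase")]
treats = [
    ("Amantadine", "Huntington"),
    ("Insulin regulation therapy", "Diabetes mellitus"),
]

h1 = maybe_treats1("Huntington", disease_changes, drug_changes, treats)
h2 = maybe_treats2("Huntington", disease_changes, treats)
```

Both functions consult `treats` first so that drugs already known to treat the input disease are eliminated, and `maybe_treats2` reuses the same relation to find existing treatments for similar diseases.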
34 Results
35 Huntington Disease Inherited neurodegenerative disorder All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep 35 interesting concepts associated with change selected, and the corresponding citations ( ) processed
36 Insulin for Huntington Disease Assoc_with(Huntington,Insulin,decrease): –“Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, …” Insulin also decreased in diabetes mellitus Therapies used to regulate insulin in diabetes might be used for Huntington
37 Capsaicin for Huntington Assoc_with(Huntington,Substance P,decrease): –“In Huntington's disease brains decreased Substance P staining was found in …” Assoc_with(Capsaicin,Substance P,increase): –“Capsaicin also attenuated the increase in Substance P content in sciatic nerve, …” Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.
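38 Huntington Results - Summary [Diagram] Maybe_Treats1: Huntington (Disease X) is associated with a Decrease in Substance P (Substance Y1); Capsaicin (Drug Z1) is associated with an Increase in Substance P. Maybe_Treats2: Huntington (Disease X) and Diabetes M (Disease X2) are both associated with a Decrease in Insulin (Substance Y2); Insulin regulation ther. (Z2) Treats Diabetes M.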
39 Example: Parkinson disease as starting concept. Shown below are some related concepts that change in association with Parkinson disease
40 Potential Treatments for Parkinson (e.g. gabapentin)
41 Showing Supporting Sentences with highlighted concepts and relations
42 Gabapentin for Parkinson Assoc_with(Parkinson,gamma-aminobutyric acid(GABA),decrease): –“…studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function…” Assoc_with(Gabapentin,GABA,increase): –“Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA.” Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and Gabapentin increases GABA.
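An explanation like the one above can be assembled mechanically from the two extracted relations behind a Maybe_Treats1 hypothesis. The sketch below is a hypothetical template function (`explain` and its parameter names are assumptions); it only shows how the relation arguments map onto the explanation sentence.

```python
def explain(drug, disease, substance, disease_change, drug_change):
    """Build a one-sentence justification for a Maybe_Treats1 hypothesis."""
    past = {"increase": "increased", "decrease": "decreased"}
    verb = {"increase": "increases", "decrease": "decreases"}
    return (
        f"{drug} maybe treats {disease} because {substance} is "
        f"{past[disease_change]} in {disease} and {drug} {verb[drug_change]} {substance}."
    )

sentence = explain("Gabapentin", "Parkinson", "GABA", "decrease", "increase")
```

Pairing such a sentence with the supporting citations lets the user check each link of the reasoning rather than reading every co-occurring citation.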
43 Part 2: Conclusions A new method to improve LBD presented Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD Easier for the user to evaluate a smaller number of hypotheses Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson Raynaud’s–Fish oil discovery replicated
44 The future of Literature-based Discovery Development of specific discovery patterns based on semantic relations and further integrated with co-occurrence-based LBD
45 Link, References and some propaganda Hristovski D, Peterlin B, Mitchell JA and Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform Vol. 74(2–4), pp. 289–298. Selected for Yearbook of Medical Informatics 2006. Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In Proc AMIA 2006 Symp; p Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In Proc AMIA 2007 Symp; p “Distinguished Paper Award AMIA2007” Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. To appear as a chapter in the first LBD book in 2008