1 Literature-Based Knowledge Discovery using Natural Language Processing Dimitar Hristovski, 1 PhD, Carol Friedman, 2 PhD, Thomas C Rindflesch, 3 PhD,

Presentation transcript:

1 Literature-Based Knowledge Discovery using Natural Language Processing Dimitar Hristovski, 1 PhD, Carol Friedman, 2 PhD, Thomas C Rindflesch, 3 PhD, Borut Peterlin, 4 MD PhD 1 Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia 2 Department of Biomedical Informatics, Columbia University, New York 3 National Library of Medicine, Bethesda, Maryland 4 Division of medical genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia

2 Part 1: Co-occurrence based LBD

3 Motivation Overspecialization Information overload Large databases Need and opportunity for computer supported knowledge discovery

4 Literature-based Discovery (LBD) A method for automatically generating hypotheses (discoveries) from literature Hypotheses have form: Concept1 –Relation– Concept2 Example: Fish oil –Treats– Raynaud’s disease

5 Background Swanson’s LBD paradigm: Concept X (Disease) e.g. Raynaud’s Concepts Y (Pathological or Cell Function, …) e.g. Blood viscosity Concepts Z (Drugs, …) e.g. Fish oil New Relation? e.g. Treats
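The X, Y, Z paradigm above can be sketched in a few lines of code. This is a minimal illustration, not the BITOLA implementation: the function name `open_discovery`, the `cooccurs` dictionary, and the toy data are assumptions made for the example. It proposes a concept Z when Z is linked to X through at least one intermediate Y but is not linked to X directly.

```python
from collections import defaultdict


def open_discovery(x, cooccurs):
    """Return {Z: set of linking Ys} for concepts reachable from x only via some Y.

    `cooccurs` is a hypothetical mapping from a concept to the set of concepts
    it is already related to in the literature.
    """
    direct = cooccurs.get(x, set())
    candidates = defaultdict(set)
    for y in direct:
        for z in cooccurs.get(y, set()):
            # Skip the start concept and anything already directly related to it.
            if z != x and z not in direct:
                candidates[z].add(y)
    return dict(candidates)


# Toy data mirroring the slide's example (not real MEDLINE statistics).
cooccurs = {
    "Raynaud's disease": {"Blood viscosity"},
    "Blood viscosity": {"Raynaud's disease", "Fish oil"},
    "Fish oil": {"Blood viscosity"},
}

hypotheses = open_discovery("Raynaud's disease", cooccurs)
print(hypotheses)
```

Each candidate Z is returned together with the Y concepts that connect it to X, so a reviewer can see why the hypothesis was generated.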

6 Biomedical Discovery Support System (BITOLA) Goal: –discover potentially new relations (knowledge) between biomedical concepts –to be used as research idea generator and/or as –an alternative way to search Medline System user (researcher or intermediary): –interactively guides the discovery process –evaluates the proposed relations

7 Extending and Enhancing Literature Based Discovery Goal: –Make literature based discovery more suitable for disease candidate gene discovery –Decrease the number of candidate relations Method: –Integrate background knowledge: Chromosomal location of diseases and genes Gene expression location Disease manifestation location

8 System Overview Knowledge Base Concepts Association Rules Background Knowledge (Chromosomal Locations, …) Discovery Algorithm User Interface Databases (Medline, LocusLink, HUGO, OMIM, …) Knowledge Extraction

9 Terminology Problems during Knowledge Extraction Gene names Gene symbols MeSH and genetic diseases

10 Detected Gene Symbols by Frequency type| II| III| component| CT| AT| ATP| IV| CD4|99657 p53|89357 MR|88682 SD|85889 GH|84797 LPS| |67272 E2| |63521 AMP|61862 TNF|59343 RA|58818 CD8|57324 O2|56847 ACTH|54933 CO2|53171 PKC|51057 EGF|50483 T3|49632 MS|46813 A2|44896 ER|43212 upstream|41820 PRL|41599

11 Gene Symbol Disambiguation Find MEDLINE docs in which we can expect to find gene symbols Example of false positive: –Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390 –breast basic conserved 1 (BBC1) gene vs. the BBC1 television station featuring the new drama series Life Support

12 Binary Association Rules X → Y (confidence, support) If X Then Y (confidence, support) Confidence = % of docs containing Y within the X docs Support = number (or %) of docs containing both X and Y The relation between X and Y is not known. Examples: –Multiple Sclerosis → Optic Neuritis (2.02, 117) –Multiple Sclerosis → Interferon-beta (5.17, 300)
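The two measures defined above can be computed directly from document-level concept sets. The sketch below assumes each document is represented as a Python set of concept names; the function name `association_rule` and the four toy documents are illustrative only.

```python
def association_rule(x, y, docs):
    """Return (confidence, support) for the rule x -> y.

    support    = number of documents containing both x and y
    confidence = percentage of the x-documents that also contain y
    """
    x_docs = [d for d in docs if x in d]
    support = sum(1 for d in x_docs if y in d)
    confidence = 100.0 * support / len(x_docs) if x_docs else 0.0
    return confidence, support


# Four toy documents, each a set of indexed concepts (not real MEDLINE data).
docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
]

confidence, support = association_rule("Multiple Sclerosis", "Interferon-beta", docs)
print(confidence, support)
```

As a consistency check on the slide's own figures, dividing support by confidence gives the number of X documents: 117 / 0.0202 and 300 / 0.0517 both come to roughly 5,800 Multiple Sclerosis documents.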

13 Discovery Algorithm Concept X (Disease) Concepts Y (Pathological or Cell Function, …) Concepts Z (Genes) Chromosomal Region Chromosomal Location Candidate Gene? Match Manifestation Location Expression Location Match
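The chromosomal matching step can be illustrated as a simple filter over the Z genes. This sketch rests on a strong simplifying assumption: locations are cytogenetic band strings, and a gene "matches" when its band string starts with the disease's region string. The function name `candidate_genes` and the gene `GENE_A` are hypothetical; the Xq28 locations of L1CAM and FLNA follow the BPP example later in the talk.

```python
def candidate_genes(disease_region, z_genes, gene_location):
    """Keep only the Z genes whose chromosomal location lies in the disease region.

    Assumption: a prefix test on band strings (e.g. "Xq28" starts with "Xq")
    stands in for a real chromosomal-interval comparison.
    """
    return sorted(
        gene
        for gene in z_genes
        if gene_location.get(gene, "").startswith(disease_region)
    )


# Toy location table; GENE_A is a made-up gene on another chromosome.
gene_location = {"L1CAM": "Xq28", "FLNA": "Xq28", "GENE_A": "7p12"}
z_genes = {"L1CAM", "FLNA", "GENE_A", "GENE_WITHOUT_LOCATION"}

matches = candidate_genes("Xq", z_genes, gene_location)
print(matches)
```

The same filtering idea would apply to the second match in the diagram (disease manifestation location against gene expression location), using a tissue table instead of a band table.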

14 Ranking Concepts Z [Diagram: X; Y1, Y2, Y3, Yi, Yj, …; Z1, Z2, Z3, Zk, Zn]
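The slide does not state which measure is used to rank the Z concepts, so the following is only one plausible sketch. It assumes that each rule carries a support count and scores a Z by summing, over every linking Y, the product of support(X, Y) and support(Y, Z). The function name `rank_z` and the toy rules are hypothetical.

```python
def rank_z(x, rules):
    """Rank Z concepts reachable from x through intermediate Y concepts.

    `rules` maps (antecedent, consequent) to a support count.
    Assumed score: sum over linking Ys of support(x, Y) * support(Y, Z).
    """
    # Concepts directly associated with x, with their support.
    ys = {b: s for (a, b), s in rules.items() if a == x}
    scores = {}
    for (a, b), s in rules.items():
        # Follow rules that start at a Y and end at a concept not yet linked to x.
        if a in ys and b != x and b not in ys:
            scores[b] = scores.get(b, 0) + ys[a] * s
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


rules = {
    ("X", "Y1"): 10,
    ("X", "Y2"): 5,
    ("Y1", "Z1"): 3,
    ("Y2", "Z1"): 2,
    ("Y2", "Z2"): 4,
}

ranked = rank_z("X", rules)
print(ranked)
```

A Z concept connected through several well-supported Y concepts rises to the top, which matches the intuition of the diagram; other measures, such as the number of linking Ys or confidence-based scores, would fit the same structure.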

15 Problem Size Full Medline analyzed (approx. 15,000,000 records) 87,000,000 association rules between 180,000 biomedical concepts

16 Bilateral Perisylvian Polymicrogyria - BPP (OMIM: ) Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution Clinical characteristics: –Mental retardation –Epilepsy –Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process) X-linked dominant inheritance

17 18 gene candidates 15 gene candidates Tissue specific expression 2 gene candidates: L1CAM and FLNA relation between semantic types Cell Movement and Gene or gene products Sublocalisation in the Xq genes in Xq28

18 User Interface “cgi-bin” version

19 Automatically search for supporting Medline Citations

20 Part 1: Summary and Conclusions Discovery support system (BITOLA) presented The system can be used as: –Research idea generator, or –Alternative method of searching Medline Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery

21 System Availability URL:

22 Part 2: Exploring Semantic Relations for LBD

23 Current LBD Systems Co-occurrence based Concepts –Title/Abstract Words/Phrases –MeSH –UMLS –Genes... UMLS Semantic types used for filtering Semantic relations between concepts NOT used

24 Drawbacks of Current LBD Not all co-occurrences represent a relation Users have to read many Medline citations when reviewing candidate relations Many spurious (false-positive) relations and hypotheses produced No explanation of proposed hypotheses

25 Enhancing the LBD paradigm Use semantic relations obtained from –two NLP systems (BioMedLee and SemRep) to augment –co-occurrence based LBD system (BITOLA)

26 Methods

27 Discovery Patterns Discovery pattern: Set of conditions to be satisfied for the generation of new hypotheses Conditions are combinations of semantic relations between concepts Maybe_Treats pattern in this research – has two forms: – Maybe_Treats1 – Maybe_Treats2

28 Maybe_Treats Discovery Pattern Disease X Maybe_Treats2 Change1 Change2 Treats Substance Y1 (or Body meas., Body funct.) Substance Y2 (or Body meas., Body funct.) Drug Z1 (or substance) Disease X2 Drug Z2 (or substance) Opposite_Change1 Same Change2 Maybe_Treats1
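Both forms of the pattern can be written as small rule functions over extracted facts. This is a sketch under stated assumptions, not the authors' implementation: facts are plain tuples, the function names `maybe_treats1` and `maybe_treats2` simply mirror the pattern names, and the toy facts are taken from the Huntington examples in this talk.

```python
OPPOSITE = {"increase": "decrease", "decrease": "increase"}


def maybe_treats1(disease, disease_changes, drug_changes, treats):
    """Drug Z1 maybe treats X if Z1 changes Y1 in the direction opposite to X."""
    hypotheses = set()
    for d, substance, direction in disease_changes:
        if d != disease:
            continue
        for drug, target, drug_direction in drug_changes:
            if (target == substance
                    and drug_direction == OPPOSITE[direction]
                    and (drug, disease) not in treats):  # drop known treatments
                hypotheses.add((drug, substance))
    return hypotheses


def maybe_treats2(disease, disease_changes, treats):
    """Drug Z2 maybe treats X if it treats a disease X2 sharing the same change in Y2."""
    hypotheses = set()
    for d, substance, direction in disease_changes:
        if d != disease:
            continue
        for other, other_substance, other_direction in disease_changes:
            if other == disease or (other_substance, other_direction) != (substance, direction):
                continue
            for drug, treated in treats:
                if treated == other and (drug, disease) not in treats:
                    hypotheses.add((drug, other, substance))
    return hypotheses


# Toy facts in the shape (concept, substance, direction) and (drug, disease).
disease_changes = {
    ("Huntington", "Substance P", "decrease"),
    ("Huntington", "Insulin", "decrease"),
    ("Diabetes mellitus", "Insulin", "decrease"),
}
drug_changes = {("Capsaicin", "Substance P", "increase")}
treats = {
    ("Insulin regulation therapy", "Diabetes mellitus"),
    ("Amantadine", "Huntington"),
}

mt1 = maybe_treats1("Huntington", disease_changes, drug_changes, treats)
mt2 = maybe_treats2("Huntington", disease_changes, treats)
print(mt1)
print(mt2)
```

Each hypothesis carries the substance (and, for the second form, the similar disease) that justifies it, which is what allows the system to explain a proposed treatment rather than only list it.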

29 Maybe_Treats1 and Maybe_Treats2 Goal: Propose potentially new treatments Can work in concert: –Propose different treatments ( complementary ) –Propose same treatments using different discovery reasoning ( reinforcing )

30 Multiple Usages of Maybe_Treats Given Disease X as input: –find new treatments Z Given Drug Z as input: –find diseases X that can be treated Given Disease X and Drug Z as input: –test whether Z can be used to treat X

31 Semantic Relations Used Associated_with_change and Treats used to extract known facts from the literature Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts

32 Associated_with_change One concept associated with a change in another concept, for example: Associated_with(Raynaud’s, Blood viscosity, increase): –“Local increase of blood viscosity during cold-induced Raynaud's phenomenon.” –“Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, …” BioMedLee (Friedman et al) used to extract Associated_with_change

33 Treats Used to extract drugs known to treat a disease Major purpose in our approach: –Eliminate drugs already known to be used to treat a disease –Find existing treatments for similar diseases TREATS(Amantadine,Huntington): –“…treatment of Huntington’s disease with amantadine…” Treats extracted by SemRep (Rindflesch et al)

34 Results

35 Huntington Disease Inherited neurodegenerative disorder All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep 35 interesting concepts associated with change selected and corresponding citations ( ) processed

36 Insulin for Huntington Disease Assoc_with(Huntington,Insulin,decrease): –“Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, …” Insulin also decreased in diabetes mellitus Therapies used to regulate insulin in diabetes might be used for Huntington

37 Capsaicin for Huntington Assoc_with(Huntington,Substance P,decrease): –“In Huntington's disease brains decreased Substance P staining was found in …” Assoc_with(Capsaicin,Substance P,increase): –“Capsaicin also attenuated the increase in Substance P content in sciatic nerve, …” Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.

38 Huntington Results - Summary Huntington (Disease X) Maybe_Treats2 Decrease Treats Substance P (Substance Y1) Insulin (Substance Y2) Capsaicin (Drug Z1) Diabetes M (Disease X2) Insulin regulation ther. (Z2) Increase Decrease Maybe_Treats1

39 Example: Parkinson disease as starting concept. Shown below are some related concepts that change in association with Parkinson

40 Potential Treatments for Parkinson (e.g. gabapentin)

41 Showing Supporting Sentences with highlighted concepts and relations

42 Gabapentin for Parkinson Assoc_with(Parkinson,gamma-aminobutyric acid(GABA),decrease): –“…studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function…” Assoc_with(Gabapentin,GABA,increase): –“Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA.” Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and gabapentin increases GABA.

43 Part 2: Conclusions A new method to improve LBD presented Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD Easier for the user to evaluate smaller number of hypotheses Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson Raynaud’s—Fish oil discovery replicated

44 The future of Literature-based Discovery Development of specific discovery patterns based on semantic relations and further integrated with co-occurrence-based LBD

45 Link, References and some propaganda Hristovski D, Peterlin B, Mitchell JA and Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform Vol. 74(2–4), pp. 289–298. (Selected for Yearbook of Medical Informatics 2006) Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In Proc AMIA 2006 Symp; p Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In Proc AMIA 2007 Symp; p (“Distinguished Paper Award AMIA2007”) Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. (To appear as a chapter in the first LBD book in 2008)