1
Biomedical Text Mining: Inferring Hidden Relationships from Biological Literature
2
Biomedical Text Mining (BTM)
Why biomedicine? Consider just MEDLINE: more than 20,000,000 references, with 40,000 added per month. Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) are constantly created. Such an information overload is impossible to manage.
3
From Text to Knowledge: tackling the data deluge through text mining
[Diagram labels] Unstructured text (implicit knowledge); information retrieval; information extraction; knowledge discovery; semantic metadata; structured content (explicit knowledge); advanced information retrieval.
4
Information Deluge: Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of the information. Linking text to databases and ontologies. Curators struggle to process the scientific literature. Discovery of facts and events is crucial for gaining insights in the biosciences: hence the need for text mining.
5
Aims of Biomedical Text Mining
Text mining: discover and extract unstructured knowledge hidden in text (Hearst, 1999). Text mining helps construct hypotheses from associations derived from text: protein-protein interactions; gene-phenotype associations; functional relationships among genes.
6
Impact of biomedical text mining
Extraction of named entities (genes, proteins, metabolites, etc.). Discovery of concepts allows semantic annotation of documents, which improves information access by going beyond index terms and enables semantic querying. Construction of concept networks from text allows clustering and classification of documents, and visualization of concept maps.
7
Impact of BTM: Extraction of relationships (events and facts) for knowledge discovery. Information extraction enables more sophisticated annotation of texts (event annotation), going beyond named entities to facts and events. This enables even more advanced semantic querying: querying of the representations, not of the strings. Instances of facts and events are represented conceptually (because they are based on ontologies). There are two ways of doing this: a) treat the representations of facts and events as a bag and run data mining over it; b) integrate the representations in a knowledge base (our knowledge reaper in Bootstrep).
8
Literature Based Discovery (LBD)
Swanson's experiments (1986) influenced conceptual biology: rapid 'mining' of candidate hypotheses from the literature. Examples: migraine and magnesium deficiency (Swanson, 1988); indomethacin and Alzheimer's disease (Swanson and Smalheiser 1994); Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus 2004); thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber M, Rein et al. 2003).
9
Literature Based Discovery (LBD)
Conceptual Biology? [Figure: concept network with nodes PKC1, Insulin, CATS, SOS2 and Alzheimer; labels: Drug repositioning, Swanson's ABC model]
10
Literature-based discovery (LBD)? --- the very idea.
It means deriving, from the public record of science, new solutions to scientific problems. The possibility arises, for example, when two articles considered together for the first time suggest new information of scientific interest not apparent from either article alone.
11
Venn Diagram -- ABC Model
[Venn diagram: sets A, B and C, with intersections AB and BC] AB: articles about an AB relationship. BC: articles about a BC relationship. AB and BC are complementary but disjoint: they can reveal an implicit relationship between A and C in the absence of any explicit relation.
12
An ABC example based on title words in Medline
Magnesium-deficient rat as a model of epilepsy. Lab Animal Sci 28:680-5, 1978
The relation of migraine and epilepsy. Brain 92: , 1969
[Venn diagram: A magnesium, B epilepsy, C migraine; record counts shown on the slide: 1018, 1710, 88204, 26923]
An unintended link
The title on the left connects magnesium with epilepsy, and the title on the right connects epilepsy with migraine. The two together suggest that there may be a connection between magnesium and migraine. But how were these titles chosen? We assume a problem-oriented framework: in this case, the problem is to find information about the cause or cure of migraine headache. Further than that, assume for now that the titles were brought together fortuitously. Later I will provide a better rationale.
The two titles taken together are "interesting", for they at least hint at a solution to the problem posed above. To explore further, search Medline for information about a magnesium-migraine connection (as of 1988). Our Medline searcher found virtually nothing! The information structure can be represented by a Venn diagram, showing that the migraine circle does not intersect the magnesium circle; that is, the two sets of articles are disjoint. (There actually are a few intersecting articles, neglected here for the purpose of presenting the main line of argument.) Each literature has thousands of articles.
Further searching did turn up a dozen or so additional articles on magnesium AND epilepsy and on epilepsy AND migraine, represented in the yellow intersections and marked by the two arrows. Clearly neither author intended to create a link to the other. Thus we can think of epilepsy as an unintended link between magnesium and migraine. Are there any other unintended links between magnesium and migraine? Yes.
Venn diagram: sets of Medline records; A and C are disjoint.
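The ABC reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the co-occurrence pairs are toy data (the "calcium" pair is hypothetical), and `abc_candidates` is a name introduced here, not part of any system described in the slides.

```python
from collections import defaultdict

def abc_candidates(cooccurrences, start):
    """Return {C: set of linking B terms} for every term C that shares a
    B term with `start` but never co-occurs with `start` directly."""
    neighbours = defaultdict(set)
    for a, b in cooccurrences:
        neighbours[a].add(b)
        neighbours[b].add(a)
    direct = set(neighbours[start])      # B terms: seen together with A
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            if c != start and c not in direct:
                candidates[c].add(b)     # A and C are linked only through B
    return dict(candidates)

# Toy title-word co-occurrences (illustrative only).
pairs = [
    ("magnesium", "epilepsy"),   # AB literature
    ("epilepsy", "migraine"),    # BC literature
    ("magnesium", "calcium"),    # hypothetical extra pair with no C term
]
links = abc_candidates(pairs, "magnesium")
print(links)  # {'migraine': {'epilepsy'}}
```

Here epilepsy plays the role of the unintended link B between magnesium (A) and migraine (C).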
13
Research problems: information model; automation; how to discover novelty.
Information model: biological information, multi-level. Automation: gigantic amount of data; Swanson's ABC model is semi-automatic. How to discover novelty: find novel information. Novel hypothesis generation example: A1 (Fish Oil), B1 (Blood Viscosity), C1 (Raynaud Disease); relation labels: Reduce, Aggregate.
14
Information Model: category-based interaction model
Interactor node: connects the whole relation; represents the action by a verb. Interactor type: Induce, Increase, Contribute, Reduce, Reduction, Resistant.
15
Information Model: each node is represented by mapping its semantic type to the corresponding UMLS top category.
16
Methods: Data Flow of BioDiscovery
[Diagram components] MEDLINE / PubMed abstracts; Sentence Splitter; Sentence Parser; Entity Extractor; Relation Extractor; Graph Builder; Similar Entity Detector; Visualizer; UMLS; extracted entities/relations; UPK.
17
Sentence Parsing Phase
Methods: Data Flow of BioDiscovery. [Diagram labels] Split sentences; Tagger; Parsed tree. Sentence Parser: the input is a split sentence; the output is a sentence tree produced by the Link Grammar Parser.
18
Sentence Parsing - Example
Original Sentence: After the DF1 cells had been cultured for 9 d, the ALV p27 antigen in the supernatants of the two sets was detected by ELISA
19
Sentence Parsing - Example
20
Entity Extractor: an NER technique is used to detect entities
LingPipe NER and the GENIA corpus are used to detect entities. The accuracy of entity extraction by LingPipe is low. Validation of the entity type of extracted entities: by looking up the UMLS Semantic Network. Assignment of the category tag for each entity: by utilizing UMLS top categories such as Anatomical Structure, Substance, and Phenomenon or Process.
21
Relation Extractor: selection of the key connector term (i.e., verb)
This is a difficult decision when complex sentences contain many verbs. Link Grammar link types such as V and MV are used to determine the key connector. Entities that appear before the key connector are set as Interactor entities. Entities that appear after the key connector are set as Interactee entities.
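The interactor/interactee split can be illustrated with a small Python sketch. It assumes the key connector has already been chosen (the Link Grammar step is not reproduced here), and `split_by_connector` and the token list are hypothetical names introduced for illustration only.

```python
def split_by_connector(tokens, entities, connector):
    """Partition recognised entities into interactors (before the key
    connector) and interactees (after it), keeping sentence order."""
    pivot = tokens.index(connector)  # position of the key connector verb
    interactors = [t for i, t in enumerate(tokens) if t in entities and i < pivot]
    interactees = [t for i, t in enumerate(tokens) if t in entities and i > pivot]
    return interactors, interactees

# Pre-tokenised example sentence: "Wogonin increases apoptosis in HCT-116 cells".
tokens = ["wogonin", "increases", "apoptosis", "in", "HCT-116_cells"]
entities = {"wogonin", "apoptosis", "HCT-116_cells"}  # assumed NER output
interactors, interactees = split_by_connector(tokens, entities, "increases")
print(interactors)  # ['wogonin']
print(interactees)  # ['apoptosis', 'HCT-116_cells']
```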
22
Interaction Graph Builder
The maximum connected graph that can be built by our interaction model has a bow-tie shape. Each node represents an entity. Edges between entities are determined by proximity in the sentence. The first two nodes to be connected are the interactor entity and the interactee entity located closest to the connector. Entities that belong to the same category are connected to each other.
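A minimal sketch of two of these rules follows, assuming the interactor and interactee lists are already ordered by sentence position. The function name, the example sentence ("Fisetin and wogonin increase apoptosis in HCT-116 cells") and the category tags are hypothetical, and the remaining proximity-based edges are not modelled.

```python
from itertools import combinations

def build_interaction_graph(interactors, interactees, category):
    """Return a set of undirected edges (frozensets of two entity names).

    Rule 1: connect the interactor and interactee closest to the connector,
            i.e. the last interactor and the first interactee.
    Rule 2: connect entities that share a category tag.
    """
    edges = set()
    if interactors and interactees:
        edges.add(frozenset((interactors[-1], interactees[0])))
    for u, v in combinations(interactors + interactees, 2):
        if category[u] == category[v]:
            edges.add(frozenset((u, v)))
    return edges

# Hypothetical category tags for the example entities.
category = {
    "fisetin": "Substance",
    "wogonin": "Substance",
    "apoptosis": "Process",
    "HCT-116_cells": "Body Part",
}
edges = build_interaction_graph(
    ["fisetin", "wogonin"], ["apoptosis", "HCT-116_cells"], category
)
print(sorted(tuple(sorted(e)) for e in edges))
```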
23
Similar Entity Detector
Methods: Similarity Measure* (Data Flow of MKEM). Similarity measures: MetaMap type, structural, atomic count, semantic similarity. Scale: 0 = not similar, 1 = similar, 0.5 = substructure. Ranking scores. UPK Inference: the input is the extracted entities/relations; the output is UPK. [Diagram components] MEDLINE / PubMed abstracts; Sentence Selector; Entity Extractor; Relation Extractor; Information Element Recognizer; Graph Builder; Similar Entity Detector; Similarity Measure; Visualizer; UMLS; UPK. * See Appendix B for description
24
Similarity Measures: Semantic Type, Structural Similarity, Atomic Count
Semantic Type: UMLS Semantic Type. Structural Similarity: calculated using the SMSD (Small Molecule Subgraph Detector) system. Atomic Count: taken from the chemDB database; it enumerates the constituent atoms of the chemical of interest. Semantic Similarity: relative importance-based graph similarity. Topological Similarity (not implemented yet): graph topology-based similarity.
25
Semantic Similarity: build the dependency tree of a sentence
Create semantic distributional models (based on feature vectors) by tensor Singular Value Decomposition (SVD). The edge statistics form a 3-dimensional tensor with the shape Head-Relation-Dependency. Dependency edges are also added in the reverse direction. Term weights are calculated by Pointwise Mutual Information (PMI).
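As an illustration of the PMI weighting step only (the tensor SVD is not reproduced), the sketch below computes PMI(w, f) = log2(P(w, f) / (P(w) P(f))) over (word, feature) observations, where a feature is a (relation, head) pair taken from a dependency edge. The observations are toy data and `pmi_weights` is a hypothetical helper name.

```python
import math
from collections import Counter

def pmi_weights(observations):
    """Pointwise mutual information for each observed (word, feature) pair."""
    joint = Counter(observations)
    total = sum(joint.values())
    word_counts = Counter(w for w, _ in observations)
    feature_counts = Counter(f for _, f in observations)
    return {
        (w, f): math.log2(n * total / (word_counts[w] * feature_counts[f]))
        for (w, f), n in joint.items()
    }

# Toy dependency edges as (dependent word, (relation, head)); illustrative only.
observations = [
    ("apoptosis", ("obj", "increase")),
    ("wogonin", ("subj", "increase")),
    ("fisetin", ("subj", "increase")),
    ("wogonin", ("subj", "inhibit")),
]
weights = pmi_weights(observations)
for pair, value in sorted(weights.items()):
    print(pair, round(value, 3))
```

In this toy sample a pair that occurs as often as chance predicts, such as wogonin with (subj, increase), gets weight 0, while a pair seen only together, such as apoptosis with (obj, increase), gets the highest weight.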
26
Tensor Example
27
Tensors are useful for 3 or more modes
28
Tensor SVD Decomposition
29
2D Analog of Tensor SVD Decomposition
30
Methods: UPK Inference Example
Wogonin: Increase, Apoptosis; Disease: N/A; Malignant T-Cells. Fisetin: Increase, Apoptosis; Disease: N/A; HCT-116 Cells.
31
Structural Similarity
Methods: UPK Inference Example. Similarity measures for Wogonin vs. Fisetin: UMLS Semantic Type: Organic Chemical (similarity 1); Structural Similarity: 0.75; Atomic Count: C16H12O5 (Wogonin), C15H10O6 (Fisetin); Semantic Similarity: 0.265. Relations: Wogonin: Increase, Apoptosis; Disease: N/A; Malignant T-Cells. Fisetin: Increase, Apoptosis; Disease: N/A; HCT-116 Cells.
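To make the atomic-count entry concrete, the sketch below enumerates the constituent atoms of the two formulas shown on this slide. It is an assumption-laden illustration: `atom_counts` is a hypothetical helper that handles only flat formulas without parentheses, and the slides do not state how the counts are turned into a similarity score, so none is computed.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms per element in a flat molecular formula such as 'C16H12O5'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

wogonin = atom_counts("C16H12O5")
fisetin = atom_counts("C15H10O6")
print(dict(wogonin), sum(wogonin.values()))  # {'C': 16, 'H': 12, 'O': 5} 33
print(dict(fisetin), sum(fisetin.values()))  # {'C': 15, 'H': 10, 'O': 6} 31
```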
32
Methods: UPK Inference Example
Wogonin, Fisetin: Increase, Apoptosis; Disease: N/A; Malignant T-Cells, HCT-116 Cells.
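One way to read this example is that a relation known for one substance is proposed for a substance judged similar to it. The sketch below encodes that reading; it is an interpretation rather than the system's confirmed inference rule, and `infer_relations` and the tuple layout are hypothetical.

```python
def infer_relations(known, similar_pairs):
    """Propose new (substance, effect, process, target) relations by swapping
    a substance for one judged similar to it."""
    inferred = set()
    for a, b in similar_pairs:
        for substance, effect, process, target in known:
            if substance == a:
                swapped = b
            elif substance == b:
                swapped = a
            else:
                continue
            candidate = (swapped, effect, process, target)
            if candidate not in known:   # keep only relations not yet extracted
                inferred.add(candidate)
    return inferred

# Relations extracted in the example above.
known = {
    ("Wogonin", "Increase", "Apoptosis", "Malignant T-Cells"),
    ("Fisetin", "Increase", "Apoptosis", "HCT-116 Cells"),
}
inferred = infer_relations(known, [("Wogonin", "Fisetin")])
for relation in sorted(inferred):
    print(relation)
```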
33
Results & Discussion: input data and extraction result
Input data: 500 PubMed abstracts related to 'apoptosis'. Extraction result (entity type: number of extracted entities): Substances: 410; Processes: 357; Diseases: 44; Body Parts: 82.
34
Results & Discussion: Semantic Similarity with Wogonin
Similarity between wogonin_NN1 and docetaxel_NN1 :
Similarity between wogonin_NN1 and serotonin_NN1 :
Similarity between wogonin_NN1 and amisulpride_NN1 : 0.0
Similarity between wogonin_NN1 and ranolazine_NN1 : 0.0
Similarity between wogonin_NN1 and genistein_NN1 :
Similarity between wogonin_NN1 and brivaracetam_NN1 : 0.0
Similarity between wogonin_NN1 and carisbamate_NN1 : 0.0
Similarity between wogonin_NN1 and riboflavin_NN1 : 0.0
Similarity between wogonin_NN1 and fisetin_NN1 :
Similarity between wogonin_NN1 and daidzein_NN1 : 0.0
Similarity between wogonin_NN1 and caffeine_NN1 : 0.0
Similarity between wogonin_NN1 and enzyme_NN1 : E-4
Similarity between wogonin_NN1 and topiramate_NN1 : 0.0
Similarity between wogonin_NN1 and melatonin_NN1 : 0.084
Similarity between wogonin_NN1 and nimodipine_NN1 : 0.086
35
Results & Discussion: PageRank Score
Substance Name Semantic Type PageRank Similarity NAG-1 Gene or Genome apoptosis Cell Function Flou-3 AM Pharmacologic Substance wogonin Organic Chemical Docetaxel Jarisch-Herxheimer reaction Functional Concept apoptotic cells Cell Genistein p53 docetaxel+SN Organic Compound adverse reactions Finding atRA HCT-116 Cell Line HCT-116 cells SN-38 mesenchyme Embryonic Structure
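For readers unfamiliar with the score in this table, the following is a generic power-iteration PageRank in plain Python. It is a textbook sketch, not the presenters' implementation: the damping factor, the iteration count and the toy star-shaped graph around apoptosis are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each node to a list of nodes it links to.
    Every linked node must also appear as a key. Returns {node: score}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank

# Toy undirected interaction graph, stored as links in both directions.
graph = {
    "apoptosis": ["wogonin", "fisetin", "genistein"],
    "wogonin": ["apoptosis"],
    "fisetin": ["apoptosis"],
    "genistein": ["apoptosis"],
}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # apoptosis
```

The hub node that every substance connects to receives the highest score, which is the intuition behind using PageRank as a relative-importance measure on an interaction graph.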
36
Results & Discussion: Semantic Similarity for Magnesium and Migraine
A1 (Magnesium): semantic type Element, Ion, or Isotope. B1 (Epilepsy): semantic type Disease or Syndrome. C1 (Migraine): semantic type Disease or Syndrome. Relation labels: "Positive impact on", "Is related to".
Magnesium – Epilepsy: 0.033
Magnesium – Malaria: 0.011
Magnesium – Sarcoidosis: 0.015
Magnesium – Diabetes: 0.017
Magnesium – Asthma: 0.021
Magnesium – Hyperoxaluria: 0.026
Magnesium – Hepatitis: 0.018
Epilepsy – Migraine: 0.158
Epilepsy – Malaria: 0.004
Epilepsy – Sarcoidosis: 0.041
Epilepsy – Diabetes: 0.049
Epilepsy – Asthma: 0.058
Epilepsy – Hyperoxaluria: 0.002
Epilepsy – Hepatitis:
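The scores listed above can be ranked directly to check that the A-B-C chain recovers the known example. The short Python sketch below uses only the values given on this slide; the Epilepsy – Hepatitis pair is omitted because its value is missing, and the variable names are introduced here for illustration.

```python
# Semantic similarity scores as listed on the slide.
magnesium_scores = {
    "Epilepsy": 0.033, "Malaria": 0.011, "Sarcoidosis": 0.015,
    "Diabetes": 0.017, "Asthma": 0.021, "Hyperoxaluria": 0.026,
    "Hepatitis": 0.018,
}
epilepsy_scores = {
    "Migraine": 0.158, "Malaria": 0.004, "Sarcoidosis": 0.041,
    "Diabetes": 0.049, "Asthma": 0.058, "Hyperoxaluria": 0.002,
    # "Hepatitis" is left out: its score is missing in the source.
}

best_b = max(magnesium_scores, key=magnesium_scores.get)  # strongest B for A = Magnesium
best_c = max(epilepsy_scores, key=epilepsy_scores.get)    # strongest C for B = Epilepsy
print(best_b, best_c)  # Epilepsy Migraine
```

Among the listed candidates, Epilepsy is the highest-scoring partner of Magnesium and Migraine the highest-scoring partner of Epilepsy, matching the Magnesium, Epilepsy, Migraine chain.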
37
Results & Discussion: sample of new relationships and supporting papers
Wogonin increases apoptosis in HCT-116 cells. Supporting paper: "Reactive oxygen species up-regulate p53 and Puma; a possible mechanism for apoptosis during combined treatment with TRAIL and wogonin", Dae-Hee Lee et al.
Genistein can induce apoptosis in HCT-116 cells. Supporting paper: "Genistein, a Dietary Isoflavone, down-regulates the MDM2 Oncogene at Both Transcriptional and Posttranslational Levels", Mao Li et al.
Table columns: Substance, Effect Type, Process, Disease, Body Part. First row: Wogonin, Increase, Apoptosis, N/A, HCT-116 Cells. Further rows: Fisetin: Malignant T Cells; Docetaxel: mRNA expression of IL-1; Genistein: Tumor Cells.
38
Summary & Future Work: this is an ongoing project.
We proposed a new system that extracts relationships from biomedical text and infers new information. The system was applied to the entity relations identified by our information model. Future work: other techniques for NER; anaphoric relationship extraction; further enhancement of the Link Grammar lexicon; rule generalization to provide better coverage.
39
Demo: retrieve stored entities and relations
Download a PubMed record and extract entities and relations
40
Conclusion: We proposed context vectors to infer unknown relationships based on biologically meaningful terms. We constructed a multi-level entity dictionary to recognize multi-level entities in the literature. We used our context vectors to discover putative drug-disease relationships. We evaluated the results against drug-disease relations curated from the literature (PharmGKB, CTD). On 77,711 Alzheimer's disease papers, we found that our context-vector-based hybrid approach has better precision than the previous frequency-based ABC model.
41
Thank you! Questions?
42
Appendix: Future Study: Different Approach to Context Terms
Based on interaction words (verb terms), define possible direct interactions among entities, and assume that interactions among the remaining entities are context. [Diagram] Sentence 1: I-verb, I-Ent1, I-Ent2, C-Ent. Sentence 2: C-Ent, I-Ent1, I-verb, I-Ent2, C-Ent, C-Ent. Sentence 3: C-Ent, C-Ent, I-Ent1, I-verb, I-Ent2, C-Ent.