Literature Based Discovery Dimitar Hristovski Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

Let me introduce myself …
Research and Development
BS – Biomedicina Slovenica database
Research Evaluation Decision Support System
Medical Information Systems
– Surgical clinics
– Genetic laboratory
– Biochemical laboratory
Web User Behaviour Analysis
Data warehousing and OLAP

Motivation
Overspecialization
Information overload
Large databases
For many diseases the chromosomal region is known, but not the exact gene

Background
Literature-based discovery (Swanson): Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes)
New relation between X and Z?

Biomedical Discovery Support System (BITOLA)
Goal:
– discover potentially new relations (knowledge) between biomedical concepts
– to be used as a research idea generator and/or as an alternative way to search Medline
System user (researcher or intermediary):
– interactively guides the discovery process
– evaluates the proposed relations

Extending and Enhancing Literature Based Discovery
Goal:
– Make literature based discovery more suitable for disease candidate gene discovery
– Decrease the number of candidate relations
Method:
– Integrate background knowledge: chromosomal location of diseases and genes, gene expression location, disease manifestation location

Usage Scenarios
– For a disease with known chromosomal location, find a candidate gene
– For a gene, find a disease that might be influenced
– For a disease and gene found to be related by linkage study, find the mechanism of the relation (intermediate concepts should help)

System Overview (diagram)
Databases (Medline, LocusLink, HUGO, OMIM, …)
Knowledge Extraction
Knowledge Base: Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …)
Discovery Algorithm
User Interface

Databases
Medline: source of known relationships between biomedical concepts
Set of concepts:
– MeSH (Medical Subject Headings): controlled dictionary and thesaurus used for indexing and searching the Medline database
– HUGO: official gene symbols, names and aliases
– LocusLink: gene symbols, aliases and chromosomal locations
– OMIM: genetic diseases
UMLS (Unified Medical Language System)
Entrez: used to search PubMed, GenBank, ...
UniGene: gene expression

Knowledge Extraction
– Build master set of concepts (MeSH terms and gene symbols)
– Extract occurrence of concepts from each Medline record (MeSH terms from the MH field, gene symbols from Title and Abstract)
– Association rule mining (concept co-occurrence)
– Chromosomal location extraction (from LocusLink and HUGO)
– Load into knowledge base
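The co-occurrence step above can be illustrated with a minimal Python sketch. It assumes each Medline record has already been reduced to a list of detected concepts; the function name `count_cooccurrences` and the three toy records are hypothetical and are not taken from BITOLA itself.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(records):
    """Count records per concept and records per unordered concept pair.

    `records` is an iterable of concept lists, one list per Medline record
    (an assumed, simplified input format).
    """
    concept_freq = Counter()
    pair_freq = Counter()
    for concepts in records:
        # Deduplicate so a concept counts once per record; sort so each
        # pair always appears in the same (alphabetical) order.
        unique = sorted(set(concepts))
        concept_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return concept_freq, pair_freq


# Toy records, invented for illustration only.
records = [
    ["Multiple Sclerosis", "Optic Neuritis"],
    ["Multiple Sclerosis", "Interferon-beta"],
    ["Multiple Sclerosis", "Optic Neuritis", "Interferon-beta"],
]
concept_freq, pair_freq = count_cooccurrences(records)
print(pair_freq[("Multiple Sclerosis", "Optic Neuritis")])
```

Storing each pair once in sorted order keeps the pair table half the size of a directed table; directed rules can then be derived from it in both directions.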

Terminology Problems during Knowledge Extraction
– Gene names
– Gene symbols
– MeSH and genetic diseases

Detected Gene Symbols by Frequency
Symbol|Frequency
type|
II|
III|
component|
CT|
AT|
ATP|
IV|
CD4|99657
p53|89357
MR|88682
SD|85889
GH|84797
LPS|
|67272
E2|
|63521
AMP|61862
TNF|59343
RA|58818
CD8|57324
O2|56847
ACTH|54933
CO2|53171
PKC|51057
EGF|50483
T3|49632
MS|46813
A2|44896
ER|43212
upstream|41820
PRL|41599

Gene Symbol Disambiguation
Find MEDLINE docs in which we can expect to find gene symbols
JD indexing (Susanne Humphrey) as a possible solution:
– Identifies the semantic context of docs
– If the semantic context is not genetic, then the gene symbol is probably a false positive
Example of a false positive:
– Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
– breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring the new drama series Life Support

JD Indexing
JDs are 127 Journal Descriptors (e.g., JDs for the journal Hum Mol Genet: Cytogenetics; Genetics, Medical)
Training set docs (435,000) inherit JDs from their journals
The training set provides co-occurrence data between inherited JDs and:
– indexing terms assigned to docs directly
– words in docs
Docs whose indexing terms/words often occur with genetics JDs in the training set are assumed to have a genetics context
Extended to indexing by 134 UMLS semantic types (e.g. Gene or Genome, Gene Function, …)
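The idea of inheriting JDs from journals and scoring a document by the company its words keep can be sketched as follows. This is a simplified stand-in under stated assumptions, not Humphrey's actual JD indexing method: the averaging score, the 0.5 cut-off, the non-genetics descriptor name and the toy training documents are all hypothetical.

```python
from collections import Counter, defaultdict

# The two genetics JDs named on the slide.
GENETICS_JDS = {"Cytogenetics", "Genetics, Medical"}


def train(training_docs):
    """Build word -> JD co-occurrence counts.

    `training_docs` is an iterable of (words, jds) pairs, where `jds` are
    the descriptors the document inherits from its journal.
    """
    word_jd = defaultdict(Counter)
    for words, jds in training_docs:
        for word in set(words):
            word_jd[word].update(jds)
    return word_jd


def genetics_score(words, word_jd):
    """Average, over known words, of the share of genetics JDs (assumed score)."""
    shares = []
    for word in set(words):
        counts = word_jd.get(word)
        if not counts:
            continue  # word never seen in training
        total = sum(counts.values())
        genetic = sum(n for jd, n in counts.items() if jd in GENETICS_JDS)
        shares.append(genetic / total)
    return sum(shares) / len(shares) if shares else 0.0


# Toy training set, invented for illustration only.
word_jd = train([
    (["bbc1", "gene", "breast", "conserved"], ["Genetics, Medical"]),
    (["bbc1", "television", "drama", "ethics"], ["Hypothetical Non-Genetics JD"]),
    (["gene", "mutation", "chromosome"], ["Cytogenetics", "Genetics, Medical"]),
])
gene_context = genetics_score(["bbc1", "gene", "breast"], word_jd)
tv_context = genetics_score(["bbc1", "television", "drama"], word_jd)
print(gene_context > 0.5, tv_context > 0.5)
```

With a cut-off of 0.5 (an arbitrary choice here), the symbol BBC1 would be kept in the first document and treated as a probable false positive in the second.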

Binary Association Rules
X → Y (confidence, support)
If X Then Y (confidence, support)
Confidence = % of docs containing Y within the X docs
Support = number (or %) of docs containing both X and Y
The nature of the relation between X and Y is not known.
Examples:
– Multiple Sclerosis → Optic Neuritis (2.02, 117)
– Multiple Sclerosis → Interferon-beta (5.17, 300)
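The two measures can be computed directly from the definitions above. The sketch below assumes an index from each concept to the set of document IDs that mention it; the function name `association_rule` and the toy index are hypothetical.

```python
def association_rule(x, y, docs_with):
    """Return (confidence, support) for the rule x -> y.

    `docs_with` maps each concept to the set of document IDs mentioning it.
    Confidence is the percentage of x-documents that also contain y;
    support is the number of documents containing both.
    """
    x_docs = docs_with.get(x, set())
    support = len(x_docs & docs_with.get(y, set()))
    confidence = 100.0 * support / len(x_docs) if x_docs else 0.0
    return confidence, support


# Toy index, invented for illustration only.
docs_with = {
    "X": {1, 2, 3, 4},
    "Y": {4, 5},
}
print(association_rule("X", "Y", docs_with))  # (25.0, 1)
print(association_rule("Y", "X", docs_with))  # (50.0, 1)
```

The rule is directional: support is the same both ways, but confidence depends on which concept is the antecedent.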

Discovery Algorithm (diagram)
Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes)
Match: Chromosomal Region (of X) with Chromosomal Location (of Z)
Match: Manifestation Location (of X) with Expression Location (of Z)
Candidate Gene?

Discovery Algorithm
Let X be the starting concept of interest.
Find all Y for which X → Y.
Find all Z for which Y → Z.
Eliminate those Z for which X → Z already exists.
Eliminate those Z that do not match the chromosomal region of X.
Eliminate those Z that do not match the expression location of X.
The remaining Z are candidates for a new relation between X and Z.
In general: X → Y1 → … → Yn → Z, but not X → Z
Example:
– X = disease
– Y = (patho)physiology of X
– Z = (de)regulators of Y (drugs, proteins, genes)
New relation example: Z is a candidate gene for disease X
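The steps above can be sketched in Python. This is a minimal illustration, not BITOLA's implementation: the data structures, the prefix test for chromosomal matching (for example "Xq28" lies in region "Xq"), the tissue-overlap test for expression matching, and all concept and gene names in the toy data are assumptions.

```python
def discover_candidates(x, rules, region, gene_location, tissues, gene_expression):
    """Return {Z: set of linking Y concepts} for candidate genes of x.

    rules            concept -> set of concepts it is associated with
    region           chromosomal region of x, e.g. "Xq"
    gene_location    gene -> chromosomal location, e.g. "Xq28"
    tissues          tissues where x manifests
    gene_expression  gene -> set of tissues where the gene is expressed
    """
    ys = rules.get(x, set())
    candidates = {}
    for y in ys:
        for z in rules.get(y, set()):
            if z == x or z in ys:
                continue  # X -> Z already exists
            if not gene_location.get(z, "").startswith(region):
                continue  # not a gene, or outside the chromosomal region of X
            if not (gene_expression.get(z, set()) & tissues):
                continue  # not expressed where X manifests
            candidates.setdefault(z, set()).add(y)
    return candidates


# Toy knowledge base, invented for illustration only.
rules = {
    "Disease D": {"Cell Movement", "GENE_D"},
    "Cell Movement": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "GENE_D": {"Disease D"},
}
gene_location = {"GENE_A": "Xq28", "GENE_B": "7p", "GENE_C": "Xq28", "GENE_D": "Xq28"}
gene_expression = {
    "GENE_A": {"brain"},
    "GENE_B": {"brain"},
    "GENE_C": {"liver"},
    "GENE_D": {"brain"},
}
result = discover_candidates(
    "Disease D", rules, "Xq", gene_location, {"brain"}, gene_expression
)
print(result)  # {'GENE_A': {'Cell Movement'}}
```

Keeping the linking Y concepts with each candidate lets the user inspect why a gene was proposed, which fits the interactive evaluation role described for the system user.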

Ranking Concepts Z (diagram): X is linked to intermediate concepts Y1, Y2, Y3, …, Yi, Yj, which are in turn linked to Z1, Z2, Z3, …, Zk, Zn

Results: Concepts in Medline
Full Medline (end of 2001) analyzed (11,226,520 records)
Looking for 19,781 MeSH terms and 22,252 human genes (14,659 from HUGO and 7,593 from LocusLink)
24,613 alias gene symbols added
Gene symbols found in 2,689,958 Medline records
The most frequent symbols are ambiguous (CT, MR, CO2, …) or format errors

Results: Co-occurring Concepts in Medline
29,851,448 distinct pairs of co-occurring concepts:
– In 7,106,099 pairs at least one gene symbol appeared
– In 679,159 pairs both concepts are gene symbols
Total co-occurrence frequency: 798,366,684
59,702,986 association rules calculated and stored

Bilateral Perisylvian Polymicrogyria - BPP (OMIM: )
Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution
Clinical characteristics:
– Mental retardation
– Epilepsy
– Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)
X-linked dominant inheritance

BPP - pathogenesis
It is considered a disorder of neuronal migration (unlayered type) or a consequence of intrauterine ischemia (layered type)

Finding Candidate Genes for Polymicrogyria, bilateral perisylvian

Candidate gene filtering (diagram labels):
– Sublocalisation in the Xq
– genes in Xq28
– 18 gene candidates
– Tissue specific expression
– 15 gene candidates
– relation between semantic types Cell Movement and Gene or gene products
– 2 gene candidates: L1CAM and FLNA

User Interface “cgi-bin” version

Automatically search for supporting Medline Citations

Cleft Palate – Predicting Candidate Genes

Summary and Conclusions
We extend and enhance an existing discovery support system (BITOLA)
The system can be used as:
– Research idea generator, or
– Alternative method of searching Medline
Genetic knowledge about the chromosomal locations of diseases and genes is included to make BITOLA more suitable for disease candidate gene discovery

Further Work
– Increase the number of concepts
– Gene symbol disambiguation
– Semantic relations extraction
– System evaluation
– Improve the Web version of the system

System Availability URL:

Related work: SemGen
Tom Rindflesch et al.
Extract semantic predications on the genetic basis of disease
“Deletions of INK4 occur in malignant tumors”
– INK4|ASSOCIATED_WITH|Malignant Tumors
Evaluation and visualization of SemGen output
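A predication in the pipe-delimited form shown above can be read into a small structure. This sketch assumes exactly three fields, as in the slide's example; the `Predication` type and `parse_predication` function are hypothetical names and do not describe SemGen's real output handling.

```python
from typing import NamedTuple


class Predication(NamedTuple):
    """Subject, predicate, object triple (assumed three-field layout)."""
    subject: str
    predicate: str
    obj: str


def parse_predication(line):
    """Parse 'SUBJECT|PREDICATE|OBJECT' into a Predication."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return Predication(*parts)


example = parse_predication("INK4|ASSOCIATED_WITH|Malignant Tumors")
print(example.subject, example.predicate, example.obj)
```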

Semantic Structures
ETIOLOGY_OF
– CAUSE: cause, determine, result in, control, underlie, transmit, responsible
– PREDISPOSE: predispose, lead to, promote, susceptibility, risk
– ASSOCIATED_WITH: associate, involve, link, implicate, influence, related

Statistical Evaluation
Association rule base divided into 2 segments: older ( ) and newer ( )
The system predicts new relations based on the older segment
Predictions are compared with actual new relations in the newer segment

Summary Statistical Evaluation Results

Statistical Evaluation Results
With no association rule constraints:
– predicts almost all new relations, but proposes too many candidate relations
With constraints:
– predicts new relations 6.9 times better than random predictions
– the tighter the constraints, the better the (correct / all predictions) ratio (6.5%)
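The comparison of predictions against the newer segment can be sketched as below. The slides do not say how the random baseline was computed, so the baseline here (the share of all possible pairs that turn out to be new relations) is an assumption, as are the function name and the toy data.

```python
def evaluate(predicted, actual_new, n_possible):
    """Compare predicted relations with relations that actually appeared later.

    predicted, actual_new  sets of (X, Z) pairs
    n_possible             number of candidate pairs a random guesser could pick
    Returns (precision, recall, lift over the assumed random baseline).
    """
    correct = predicted & actual_new
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual_new) if actual_new else 0.0
    random_precision = len(actual_new) / n_possible
    lift = precision / random_precision if random_precision else float("inf")
    return precision, recall, lift


# Toy segments, invented for illustration only.
predicted = {("X", "G1"), ("X", "G2"), ("X", "G3"), ("X", "G4")}
actual_new = {("X", "G1"), ("X", "G9")}
precision, recall, lift = evaluate(predicted, actual_new, n_possible=16)
print(precision, recall, lift)  # 0.25 0.5 2.0
```

Precision here corresponds to the (correct / all predictions) ratio on the slide, and lift to the "times better than random" figure, under the baseline assumption stated above.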