Using Domain Ontologies to Improve Information Retrieval in Scientific Publications Kincho H. Law, Siddharth Taduri, Gloria T. Lau Engineering Informatics.

Presentation transcript:

Using Domain Ontologies to Improve Information Retrieval in Scientific Publications Kincho H. Law, Siddharth Taduri, Gloria T. Lau Engineering Informatics Lab at Stanford University

Motivation
PMID: Regional variability in the incidence of end-stage renal disease: an epidemiological approach.
…. Regional variability in the incidence of end-stage renal disease (ESRD) in Austria is reported. Our aim was …. low rates in the state of Tyrol. …. ESRD incidence data were obtained from …. …. Between 1995 and 1999, 4811 new cases of ESRD were recorded; the state of Tyrol (T) …. incidence of ESRD patients with type 2 diabetes mellitus …. the difference in the overall ESRD incidence …. prevalence of DM, a highly significant correlation was found between ESRD incidence and DM. …. variability in the ESRD incidence in Austria is explained mainly by regional differences in DM-2. Data from similar studies …. allocation for ESRD …. ….
Synonyms for ESRD: End Stage Kidney Disease … Renal Disease, End Stage …. Renal Failure, End Stage …. Kidney Disease, Chronic Renal Failure, Chronic End-Stage Kidney Disease ESRD Renal Disease, End-Stage Renal Failure, End-Stage Chronic Kidney Failure Chronic Renal Failure
05/01/2012 Engineering Informatics Lab at Stanford University 2

Data Set and Knowledge
TREC 2007 Genomics Data Set
– Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine
– Metadata available through MEDLINE
– Tasks involve passage, document, and feature retrieval
– Methodologies are evaluated on their response to 36 topics (‘queries’)
– The topics are categorized based on 13 entity types (Proteins, Genes, etc.)
Domain Knowledge
– Over 250 biomedical ontologies from BioPortal

XML Representation of Scientific Publications in PubMed
[XML excerpt; markup not preserved] …. … …. The Journal of clinical endocrinology and metabolism J. Clin. Endocrinol. Metab. About the use … of an ACTH 1-39 …. ….

Domain Knowledge Integration
(1) Annotating documents prior to indexing
– Response time is fast
– Not flexible: the entire index has to be updated if a new ontology needs to be added
– Indexes can grow very large
(2) Query expansion
– Response time is slower
– Very flexible: ontologies can be chosen dynamically
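The trade-off between the two integration options can be illustrated with a minimal sketch. It assumes single-word terms and whitespace tokenization; the documents, the `SYNONYMS` mapping, and both function names are hypothetical and are not taken from the authors' system.

```python
from collections import defaultdict


def build_annotated_index(docs, synonym_to_concept):
    """Option (1): map each token to its concept label before indexing."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[synonym_to_concept.get(token, token)].add(doc_id)
    return index


def search_with_expansion(docs, expanded_terms):
    """Option (2): leave documents untouched and match any expanded term."""
    wanted = {term.lower() for term in expanded_terms}
    return {doc_id for doc_id, text in docs.items()
            if wanted & set(text.lower().split())}


# Hypothetical toy corpus and synonym mapping
DOCS = {
    1: "neoplasm growth in zebrafish",
    2: "tumor suppressor genes",
    3: "zebrafish development",
}
SYNONYMS = {"neoplasm": "tumor", "cancer": "tumor"}

index = build_annotated_index(DOCS, SYNONYMS)
print(sorted(index["tumor"]))
print(sorted(search_with_expansion(DOCS, ["tumor", "neoplasm"])))
```

In this sketch the annotated index answers with a single lookup, but changing `SYNONYMS` means rebuilding it; the expansion route rescans the text on every query yet accepts any term list at query time.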

Query Expansion
The pre-processed query is automatically expanded using BioPortal’s API
[Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …}
[Figure: MeSH hierarchy under Tumor, with child concepts Leukemia, Melanoma, Adenocarcinoma, Nerve Sheath Neo; synonyms of Tumor: Cancer, Neoplasm, …; synonyms of Leukemia: Leucocythaemias, Leucocythemia]
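A minimal sketch of this kind of expansion, assuming the ontology has already been fetched into a local dictionary. `TOY_ONTOLOGY` and `expand_term` are hypothetical names, the entries are illustrative rather than real MeSH records, and no BioPortal call is made here.

```python
# Illustrative stand-in for an ontology; not real MeSH data.
TOY_ONTOLOGY = {
    "Tumor": {"synonyms": ["Neoplasm", "Cancer"],
              "children": ["Leukemia", "Melanoma"]},
    "Leukemia": {"synonyms": ["Leucocythemia"], "children": []},
    "Melanoma": {"synonyms": [], "children": []},
}


def expand_term(term, ontology):
    """Return the term, its synonyms, and its direct children with their synonyms."""
    concept = ontology.get(term)
    if concept is None:
        return [term]  # unknown terms pass through unexpanded
    expanded = [term, *concept["synonyms"]]
    for child in concept["children"]:
        expanded.append(child)
        expanded.extend(ontology.get(child, {}).get("synonyms", []))
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(expanded))


print(expand_term("Tumor", TOY_ONTOLOGY))
```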

Choosing Domain Knowledge
The use of synonymy results in inconsistent performance (2007 TREC genomics track)
Common reasons include:
– Relevant terms may not be classified as expected
– Some relevant terms may not be classified in a particular ontology
– Incomplete information (such as synonyms)
Selection of the appropriate domain ontology is important

Enriching Existing Ontologies
Existing ontologies can be enriched to complete some missing information
Multiple ontologies can be used to provide different classifications
[Figure labels: MeSH, NCI, Ontology, NDF]
Concept: Pamidronate
Synonyms from NDF: APD, Amidronate, ...
Synonyms from MeSH: pamidronate calcium, pamidronate monosodium, aredia
Synonyms from NCI: Pamidronic acid, pamidronate disodium, …
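One way to picture synonym-based enrichment is a case-insensitive union of the synonym lists, keeping track of which source contributed each entry. This is only a sketch: `enrich_synonyms` and the `SOURCES` layout are assumptions, and the synonym strings are the ones listed on the slide.

```python
def enrich_synonyms(concept, sources):
    """Union the synonym lists for `concept` across sources, case-insensitively."""
    merged = {}
    for source_name, synonyms_by_concept in sources.items():
        for synonym in synonyms_by_concept.get(concept, []):
            key = synonym.casefold()
            # keep the first spelling seen and every source that lists it
            entry = merged.setdefault(key, {"label": synonym, "sources": []})
            entry["sources"].append(source_name)
    return merged


SOURCES = {
    "NDF": {"Pamidronate": ["APD", "Amidronate"]},
    "MeSH": {"Pamidronate": ["pamidronate calcium",
                             "pamidronate monosodium", "aredia"]},
    "NCI": {"Pamidronate": ["Pamidronic acid", "pamidronate disodium"]},
}

enriched = enrich_synonyms("Pamidronate", SOURCES)
print(sorted(entry["label"] for entry in enriched.values()))
```

Recording the contributing sources makes it possible to drop one ontology's synonyms later without rebuilding the merged list from scratch.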

Evaluations
Document MAP by configuration [values not preserved]:
– Baseline
– With Query Expansion (Suggested Sources)
– Using Enriched Ontologies
– Multiple Query Expansions per query
Summary of Document MAP scores in 2007 TREC genomics track: Max, Min, Mean, Median [values not preserved]
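For readers unfamiliar with the metric, the sketch below computes mean average precision (MAP) using the textbook definition. It assumes each ranking contains no duplicate document IDs; the function names and the toy rankings are hypothetical, and the official TREC evaluation scripts may differ in detail.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # relevant documents that were never retrieved contribute zero
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """`runs` holds one (ranked_doc_ids, relevant_doc_ids) pair per topic."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two topics
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = 1/2
]
print(round(mean_average_precision(runs), 4))
```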

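Queries
Topic 205: What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?
  Suggested sources for terms (TREC): Wikipedia
  Selected domain knowledge (our methodology): Symptom Ontology
Topic 206: What [TOXICITIES] are associated with zoledronic acid?
  Suggested sources for terms (TREC): Wikipedia + Aaron
  Selected domain knowledge (our methodology): NCI Thesaurus
Topic 207: What [TOXICITIES] are associated with etidronate?
  Suggested sources for terms (TREC): Wikipedia + Aaron
  Selected domain knowledge (our methodology): NCI Thesaurus
Topic 211: What [ANTIBODIES] have been used to detect protein PSD-95?
  Source listed: MeSH [only one value survives for this row]
Topic 229: What [SIGNS OR SYMPTOMS] are caused by human parvovirus infection?
  Suggested sources for terms (TREC): Wikipedia
  Selected domain knowledge (our methodology): Symptom Ontology
Topic 231: What [TUMOR TYPES] are found in zebrafish?
  Suggested sources for terms (TREC): Aaron
  Selected domain knowledge (our methodology): MeSH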

Baseline
Queries are used without modification, e.g.,
– “What [ANTIBODIES] have been used to detect protein PSD-95?”
– “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?”
Document MAP: [value not preserved]

Query Expansion
Original Query: What [TUMOR TYPES] are found in zebrafish?
Queries are formulated in ‘AND’ clauses:
“[Tumor][MeSH] AND zebrafish” => (Tumor, Neoplasm, Carcinoma, Leukemia …) AND zebrafish
Document MAP: [value not preserved]
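A sketch of how such an expanded query might be rendered and checked against a document. It assumes the comma-separated alternatives inside each parenthesized group are combined with OR (the slide does not state this), that every group is non-empty, and that naive case-insensitive substring matching is acceptable for illustration. Both function names are hypothetical.

```python
def build_boolean_query(term_groups):
    """OR the alternatives inside each group, then AND the groups together."""
    clauses = []
    for group in term_groups:
        # quote multi-word terms so they read as phrases
        quoted = [f'"{term}"' if " " in term else term for term in group]
        if len(quoted) > 1:
            clauses.append("(" + " OR ".join(quoted) + ")")
        else:
            clauses.append(quoted[0])
    return " AND ".join(clauses)


def matches(text, term_groups):
    """True if the text contains at least one term from every group."""
    lowered = text.lower()
    return all(any(term.lower() in lowered for term in group)
               for group in term_groups)


groups = [["Tumor", "Neoplasm", "Carcinoma", "Leukemia"], ["zebrafish"]]
print(build_boolean_query(groups))
print(matches("Leukemia models in zebrafish", groups))
```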

Multiple Query Expansion Terms
Expansion can be performed on multiple terms in the query
Example: Coronary Artery Disease => {Coronary heart disease, coronary disease, CAD, …}
[Tumor][MeSH] AND [zebrafish][MeSH] => (tumor, neoplasm, …) AND (zebrafish, danio rerio, …)
Document MAP: [value not preserved]

Enriched Ontology – Current Status
Marginal improvement over basic enhanced models
Document MAP: [value not preserved] (marginal improvement from 0.347)
Issues:
– Framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing in the ontology are still not included
– Relevant terms that are classified differently are never included in the search

IR Tool
Expert knowledge is valuable
Developed a search tool that automatically integrates with knowledge sources and searches documents
We extend MINOE, a co-occurrence-based visualization tool originally designed for exploring marine ecosystems
Users can browse (or search) documents through ontologies and visualize interactions between concepts

Snapshots of the Tool

I. Enter Query Terms
II. Domain Knowledge Integration
III. Shows Expanded Query, and other filters that are added to the search

TREC Topic 220
Query: What [PROTEINS] are involved in the activation or recognition mechanism for PmrD?
Domain Knowledge: MeSH
Depth of Hierarchical Expansion to Child Nodes: Level 1, Level 2, Level 3
Document MAP: [values not preserved]
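Expanding to a chosen depth of child nodes can be sketched as a breadth-first walk that stops at a depth limit. `descendants_to_depth` is a hypothetical helper and `TOY_CHILDREN` is an invented hierarchy used only for illustration; it is not the MeSH tree.

```python
from collections import deque


def descendants_to_depth(root, children, max_depth):
    """Breadth-first walk returning {concept: depth} for nodes within max_depth of root."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not descend past the requested level
        for child in children.get(node, []):
            if child not in depths:  # guards against concepts with several parents
                depths[child] = depths[node] + 1
                queue.append(child)
    return depths


# Invented hierarchy, for illustration only
TOY_CHILDREN = {
    "Proteins": ["Membrane Proteins", "Enzymes"],
    "Enzymes": ["Kinases"],
    "Kinases": ["Protein Kinases"],
}

for level in (1, 2, 3):
    print(level, sorted(descendants_to_depth("Proteins", TOY_CHILDREN, level)))
```

Each extra level enlarges the expanded term set, which is the quantity varied across the Level 1 to Level 3 runs on this slide.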

Changed

MeSH Descriptors

(>1500 Documents) Shows Association Between Concepts

CHILD CONCEPTS
Stronger Association: ~270 Documents
Weaker Association: ~57 Documents
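Association strength expressed as a document count can be sketched as follows, assuming each document has already been reduced to the set of concepts it mentions. The slides do not describe MINOE's actual measure, so this is only one plausible reading; `cooccurrence_counts` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_concepts):
    """Count, for each unordered concept pair, the documents that mention both."""
    counts = Counter()
    for concepts in doc_concepts:
        # sorting gives every pair one canonical (a, b) ordering
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts


# Hypothetical concept annotations, one set per document
DOC_CONCEPTS = [
    {"Tumor", "Zebrafish"},
    {"Tumor", "Zebrafish", "Melanoma"},
    {"Melanoma", "Mouse"},
]

counts = cooccurrence_counts(DOC_CONCEPTS)
print(counts[("Tumor", "Zebrafish")])
```

Pairs with higher counts would correspond to the "stronger association" links in the visualization, under this reading.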

Retrieving Information Across Multiple Diverse Information Sources
Patent System
– Issued Patents and Applications
– Court Cases
– File Wrappers
– Technical Publications
– Regulations and Laws
Technology Firms’ Concerns
– Can I get patent protection for my innovation?
– Do I build or do I buy related technologies?
– What are my competitors doing? How strong are their patents?
– Am I perhaps infringing on someone else’s patents? If so, are those patents valid? Have they been enforced in court? Has their validity been challenged in court?

PATENT
United States Patent 5,955,422, September 21, 1999
Production of erythropoietin
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") …
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2,
COURT CASE
314 F.3d 1313 (2003)
AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.
… Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), …alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.
FILE WRAPPER
U.S. Patent 5,955,422
… Claims are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R) … In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of …
Publication Database
REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P. …
BIOPORTAL: DOMAIN KNOWLEDGE
Cross-Referencing between Information Sources
Solution: Patent System Ontology

Patent System Ontology
I. Facilitate information integration across multiple diverse information sources
   This requires a standardized representation (a formal semantic model): the Patent System Ontology
II. Integrate domain semantics into existing information retrieval and text mining methodologies to improve retrieval of information

Patent System Ontology Information Retrieval Framework

Future Work
Using multiple enriched ontologies may provide the necessary terms
MeSH Descriptors are provided for every publication during indexing and can potentially improve results
Implement Okapi model for scoring documents
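Assuming "Okapi model" refers to the Okapi BM25 ranking function, a minimal scoring sketch could look like the following. The parameter defaults `k1=1.2` and `b=0.75` are conventional choices rather than values from the slides, the IDF uses the non-negative "+1" variant, documents are assumed to be pre-tokenized and the collection non-empty, and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            # length normalization: longer-than-average documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized documents
DOCS = [
    ["tumor", "zebrafish"],
    ["tumor", "mouse", "model"],
    ["zebrafish", "development"],
]
print(bm25_scores(["tumor", "zebrafish"], DOCS))
```

With query expansion, the expanded synonyms would simply be added to `query_terms`, so documents using any variant of a concept can contribute to the score.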

Thank You

Backup Slides

Motivation
Scientific literature is an important source of information
Retrieving relevant information from scientific publications is challenging
Domain terminology is used inconsistently in scientific publications
Increasing amounts of information amplify the problem
Improved methodologies based on semantics are required

Background
Text REtrieval Conference (TREC), organized by NIST, has showcased many successful methods
The Genomics track focused on full-text scientific publications from 49 prominent journals
Methodologies involved:
– Use of synonymy from ontologies
– Language-based models
– Query expansion and annotations
– Okapi scoring model

Goals
Understand how domain ontologies can be leveraged
Understand which domain ontologies can be leveraged
Develop a knowledge-based approach to integrate domain knowledge with the search mechanism

Query Expansion
TREC queries are first manually pre-processed
“What [TUMOR TYPES] are found in zebrafish?” => “[Tumor][MeSH] AND zebrafish”
[Tumor] indicates the term that has to be expanded
[MeSH] indicates the ontology that should be used
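The pre-processed notation is regular enough to parse mechanically. The sketch below assumes clauses are separated by the literal keyword AND and that a tagged clause has exactly the form `[term][ontology]`; `parse_preprocessed_query` is a hypothetical helper, not part of the authors' tool.

```python
import re

# Matches a clause of the form [term][ontology]
TAGGED_TERM = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def parse_preprocessed_query(query):
    """Split on AND and return (term, ontology) pairs; ontology is None for plain terms."""
    clauses = []
    for clause in re.split(r"\s+AND\s+", query.strip()):
        clause = clause.strip()
        match = TAGGED_TERM.fullmatch(clause)
        if match:
            clauses.append((match.group(1), match.group(2)))
        else:
            clauses.append((clause, None))  # plain term, left unexpanded
    return clauses


print(parse_preprocessed_query("[Tumor][MeSH] AND zebrafish"))
```

Each `(term, ontology)` pair tells the expansion step which term to expand and which ontology to consult; pairs with `None` are used as written.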

Summary
Search methodologies must be based on semantics in order to tackle terminology inconsistency
Domain ontologies provide these semantics
Domain ontologies need to be modified (or enriched) in order to fulfill information needs
User interaction is important

BioPortal
BioPortal is an integrated resource for biomedical ontologies
Currently indexes over 300 ontologies, including Medical Subject Headings and Gene Ontology
Provides a comprehensive web service, abstracting the formats and APIs of all underlying ontologies