Linked Life Data for annotation of Medline Semantic data-integration and search in the life science domain Vassil Momtchev (Ontotext)

Presentation transcript:

Outline
- Life science and health care vertical: an opportunity for semantic technology
- How RDF technology will help the end user
- Linked Life Data: a platform for semantic data integration
- LifeSKIM: smart textual analysis backed by an ontology
The way to semantic Service Oriented Architecture

Innovation or Stagnation: What’s the Diagnosis?
- Investment & progress in basic biomedical science has far surpassed investment and progress in the medical product development process
- The development process, the critical path to patients, is becoming a serious bottleneck to delivery of new products
- We are using the evaluation tools and infrastructure of the last century to develop this century’s advances
From FDA presentation on Critical Path for Science Board by Janet Woodcock, 2004/04/26

Andy Law’s First and Second Laws
- “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”
- “The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

Take Your Best Guess

The Problems
- The data is supported by different organizations
- The information is highly distributed and redundant
- There are tons of flat file formats with special semantics
- The knowledge is locked in vast data silos
- There are many isolated communities that cannot reach cross-domain understanding
Massive data integration and interpretation problem!

Drug Development Process
- Stages: TI, HI, LO, ECD, PoC, DfL, Reg, LCM
- Phases: Discovery; Early Clinical Dev.; Development
- Study types: Preclinical studies; Clinical studies
- Abbreviations: TI = Target Identification; HI = Hit Identification; LO = Lead Optimisation; ECD = Early Clinical Dev.; PoC = Proof of Concept; DfL = Development for Launch; Reg = Registration and Launch; LCM = Life Cycle Management

The Questions in Early Clinical Development
The "translation" of basic research into real therapies for real patients: Translational Medicine
Understand the drug in context of:
- the disease
  - The chemistry/pharmacology process
  - How to measure?
  - What causes the disease?
  - How does the disease evolve?
- the patient
  - What different phenotypes exist?
  - Are there different genetic profiles?

The Challenge
- Develop a compound and the knowledge to prove its target population
- Analyze the vast amounts of existing information
- A successful project lasts for 7 to 15 years

The Health Care and Life Science Industry Needs
- Support incremental extension of the knowledge base with highly heterogeneous data sets
- Allow straightforward updates of the information
- Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
- Analyze unstructured information
The need for powerful heterogeneous knowledge stores

Which Technology to Choose?

Possible Solutions
Classical data-integration with:
- Data warehouses
- Federation middleware frameworks
- Database middleware technology
Not really...
- Mapping works efficiently only on a small scale
- Different design paradigms can be a real challenge
- Direct mapping usually does not work
- No standard way to integrate textual information
We are using the evaluation tools and infrastructure of the last century to develop this century’s advances

Semantic Data Integration Benefits
- To overcome differences in semantic and syntactic representation
- To handle inconsistency problems related to incomplete data or different versions
- To unlock the data stored in silos and solve the container-reference dichotomy: data once stored and connected is hard to rearrange and connect in new ways
How could semantic web technology help end users?

What is Semantic Web?
Enrich the existing web
- Recipe: annotate, classify, index
- Meta-data from: automatically producing mark-up (named-entity recognition, concept extraction, tagging, etc.)
- Enable personalisation, search, browse...
Semantic Web as Web of Data
- Recipe: expose data on the web, use RDF, integrate
- Meta-data from: expressing DB schema semantics in machine interpretable ways
- Enable integration and unexpected reuse
Source: Frank van Harmelen RDF presentation

W3C Stack
- XML: surface syntax, no semantics
- XML Schema: describes structure of XML documents
- RDF: data model for “relations” between “things”
- RDF Schema: RDF Vocabulary Definition Language
The picture is a bit outdated today

So Why Not Just Use XML?
(Two XML example documents; only their values survive: 01, Sweden, Stockholm, 01)
No agreement on:
- Structure: is country an object? a class? an attribute? a relation? something else? What does nesting mean?
- Vocabulary: is country the same as nation?
Are the above XML documents the same? Do they convey the same information? Is that information machine-accessible?
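The structure question can be made concrete with a small sketch. The two documents below are hypothetical: the element and attribute names are assumptions chosen only to show how the same three values (Sweden, Stockholm, 01) can be nested in incompatible ways.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these tag and attribute names are illustrative assumptions.
DOC_A = '<country name="Sweden"><capital areaCode="01">Stockholm</capital></country>'
DOC_B = ('<nation><name>Sweden</name>'
         '<capital><name>Stockholm</name><code>01</code></capital></nation>')


def shape(element):
    """Return the nested (tag, attribute names, child shapes) structure."""
    return (
        element.tag,
        tuple(sorted(element.attrib)),
        tuple(shape(child) for child in element),
    )


def values(element):
    """Collect every attribute value and non-empty text, ignoring structure."""
    found = set(element.attrib.values())
    if element.text and element.text.strip():
        found.add(element.text.strip())
    for child in element:
        found |= values(child)
    return found


root_a = ET.fromstring(DOC_A)
root_b = ET.fromstring(DOC_B)

# Same values, different structure: a program cannot tell that the two
# documents say the same thing without extra agreement on their meaning.
print(values(root_a) == values(root_b))
print(shape(root_a) == shape(root_b))
```

Comparing the bare values only shows that the same strings occur in both documents. It does not say which string is the country and which is the capital, and that is the gap a shared data model is meant to close.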

What is RDF?
RDF
- stands for Resource Description Framework
- is a W3C Recommendation
RDF is a data model
- for representing meta-data (data about data)
- for describing the semantics of information in a machine-accessible way
What can you use it for?
- intelligent information brokering
- meaning-based computing
- agent communication

What does RDF look like?
Subject | Predicate | Object
urn:country:Sweden | hasName | “Sweden”
urn:country:Sweden | hasCapital | urn:city:Stockholm
urn:city:Stockholm | hasName | “Stockholm”
urn:city:Stockholm | hasAreaCode | “01”
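As a minimal sketch, the four statements on this slide can be held as plain (subject, predicate, object) tuples and queried by pattern. The `match` helper is a hypothetical name introduced here for illustration. It is not part of any RDF library, and literals are stored as bare strings for simplicity.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# The four statements from the slide; literal values are kept as plain strings.
TRIPLES: List[Triple] = [
    ("urn:country:Sweden", "hasName", "Sweden"),
    ("urn:country:Sweden", "hasCapital", "urn:city:Stockholm"),
    ("urn:city:Stockholm", "hasName", "Stockholm"),
    ("urn:city:Stockholm", "hasAreaCode", "01"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return every triple that agrees with the given terms (None = wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Follow the graph: country -> capital -> name of the capital.
capital = match(TRIPLES, subject="urn:country:Sweden", predicate="hasCapital")[0][2]
capital_name = match(TRIPLES, subject=capital, predicate="hasName")[0][2]
print(capital_name)
```

Because every statement has the same three-part shape, new facts can be appended without changing any schema. This is the flexibility the talk attributes to the graph model.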

RDF Schema and further interpretation
(Graph: the Sweden/Stockholm statements from the previous slide, extended with the nodes urn:concept:Country, urn:concept:Capital and urn:concept:Nation and with the edges ofType, ofType and sameAs)
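A rough sketch of the kind of interpretation this slide points at is given below. The exact edges of the original graph are not recoverable, so the three statements used here (Sweden ofType Country, Stockholm ofType Capital, Country sameAs Nation) are assumptions. The `infer_types` helper is likewise hypothetical and applies only one simple rule in a single step.

```python
# Assumed edges: the slide's graph layout is lost, so these are illustrative only.
TRIPLES = {
    ("urn:country:Sweden", "ofType", "urn:concept:Country"),
    ("urn:city:Stockholm", "ofType", "urn:concept:Capital"),
    ("urn:concept:Country", "sameAs", "urn:concept:Nation"),
}


def infer_types(triples):
    """Derive implicit ofType statements from sameAs links between concepts.

    Rule: if X ofType C and C sameAs D (in either direction), then X ofType D.
    Only one application of the rule is made, not a full closure.
    """
    equivalents = {}
    for s, p, o in triples:
        if p == "sameAs":
            equivalents.setdefault(s, set()).add(o)
            equivalents.setdefault(o, set()).add(s)

    inferred = set()
    for s, p, o in triples:
        if p == "ofType":
            for other in equivalents.get(o, ()):
                candidate = (s, "ofType", other)
                if candidate not in triples:
                    inferred.add(candidate)
    return inferred


for statement in sorted(infer_types(TRIPLES)):
    print(statement)
```

Under these assumptions the only new statement is that Sweden is also of type Nation. This answers the earlier vocabulary question ("is country the same as nation?") with data rather than by convention.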

RDF for Life Sciences
(Graph of example resources: urn:intact:1007, urn:intact:Interaction, urn:uniprot:P, urn:uniprot:Q, urn:uniprot:Protein, urn:uniprot:FBgn, urn:biogrid:15904, urn:biogrid:FBgn, urn:biogrid:Interaction, urn:pubmed:15904; edges: rdf:type, rdf:seeAlso, interactsWith, hasParticipant, sameAs)
- Resolve the syntactic differences in the identifiers
- Use relationships to derive new implicit knowledge
These are only example resource names
ESTC Sept, 2008

Entrez Databases

Linked Life Data
Linked Life Data is a platform to:
- Operate with heterogeneous data sets
- Allow semantic data integration
- Provide tools for knowledge access and management
Compliant with W3C standards and recommendations
Developed in collaboration with AstraZeneca in the LarKC project

Our Objectives
- Integrate the linked information using the RDF data model
  - Integrated data sources to cover the path: gene - proteins - pathways - targets - disease - drugs - patient
- Reason over the integrated dataset
  - Remove redundancy / generate new links
  - Derive new implicit knowledge (e.g., “caspase activation via cytochrome c” is a special form of “apoptosis regulation”)
- Do it on a very large scale!

Data Sources
Type of data sources | Database name
Gene and gene annotations | Entrez-Gene
Protein sequences | Uniprot
Protein cross references | iProClass
Gene and gene product annotations | GeneOntology
Organisms | NCBI Taxonomy
Molecular interaction and pathways | BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc
Example questions:
- Give me all human genes located on the X chromosome.
- List all protein identifiers encoded by gene IL2.
- Give me all human proteins associated with the endoplasmic reticulum.
- List all articles where the protein Interleukin-2 is mentioned.
- List all cross references to the protein Interleukin-2.
- Give all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse).
- List all subcategories of primates.
- Give me all interactions of cell division protein kinase.
Sometimes we need to ask far more complex questions efficiently: give me all proteins that interact in the nucleus, are annotated with repressor, and have at least one participant encoded by a gene that is annotated with a specific term and located on chromosome X. Filter the results for Mammalia organisms!
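A combined question like the last one amounts to a join over several statement patterns that share variables. The sketch below shows the idea in plain Python. The data, the predicate names (`locatedOn`, `encodes`, `locatedIn`, `annotatedWith`) and the `solve` helper are all hypothetical and do not reflect the actual Linked Life Data schema or query interface.

```python
# Hypothetical toy data; identifiers and predicates are illustrative assumptions.
TRIPLES = [
    ("gene:A", "locatedOn", "chromosome:X"),
    ("gene:A", "encodes", "protein:P1"),
    ("gene:B", "locatedOn", "chromosome:7"),
    ("gene:B", "encodes", "protein:P2"),
    ("protein:P1", "locatedIn", "nucleus"),
    ("protein:P2", "locatedIn", "nucleus"),
    ("protein:P1", "annotatedWith", "repressor"),
]


def solve(patterns, triples, binding=None):
    """Return all variable bindings that satisfy every pattern.

    Terms starting with '?' are variables; a variable must take the same
    value everywhere it appears, which is what joins the patterns together.
    """
    binding = binding or {}
    if not patterns:
        return [binding]
    first, rest = patterns[0], patterns[1:]
    results = []
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            results.extend(solve(rest, triples, candidate))
    return results


# "Repressor proteins in the nucleus encoded by a gene on chromosome X."
QUERY = [
    ("?gene", "locatedOn", "chromosome:X"),
    ("?gene", "encodes", "?protein"),
    ("?protein", "locatedIn", "nucleus"),
    ("?protein", "annotatedWith", "repressor"),
]

results = solve(QUERY, TRIPLES)
print(results)
```

Each pattern on its own corresponds to one of the simple questions above. Sharing `?gene` and `?protein` across patterns is what lets a single query span several source databases once they are integrated into one graph.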

The Approach
(Flowchart: Identify Data Source; decision “RDF format” with YES / NO branches; Generated RDF; Consolidate Data; Define Semantics)

Challenges to Overcome
- Syntactic: the way the different entities are serialized. W3C standard serialization formats for data exchange
- Structure: the way the different entities are represented. The graph model used by RDF gives maximum flexibility
- Semantic: the way the different entities are interpreted. Support custom R-entailment rules to derive meaning

Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms

Linked Life Data Overview
Platform to automate the process:
- Infrastructure to store data and perform inference
- Transform the structured data sources to RDF
- Provide web interface and SPARQL endpoint to access the data
Currently operates over a semantic repository
Linked Life Data statistics:
- gene, proteins, pathways, targets, disease, drugs, patient
- Number of statements: 1,159,857,602
- Number of explicit statements: 403,361,589
- Number of entities: 128,948,564
Publicly available at:

Linked Life Data
Semantic integration of biological databases

LifeSKIM: Quick Facts
The LifeSKIM application provides scalable support for:
- Querying and navigation of knowledge generated from structured (biological databases) and unstructured (biomedical documents) sources
- Semantic indexing and retrieval of documents using an ontology
- Ontology population and learning of new types of entities from text
- Efficient reasoning against the extracted and structured information, e.g., “type I programmed cell death” is “Apoptosis of neutrophils” and “biological process”
- Co-occurrence and ranking of entities

Semantic Annotation Example

How Does LifeSKIM Search Better?
Classical IR cannot match:
- interleukin 6 with HGF or HSF or BSF2 or IL-6 or IFNB2. Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
- apoptosis of neutrophils with “programmed cell death”. The GeneOntology thesaurus adds the above list of terms as part of the apoptosis of neutrophils term.
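The alias example can be illustrated with a minimal lookup sketch. It is not LifeSKIM's actual pipeline: the `build_index` and `annotate` helpers are hypothetical names, and case-insensitive whole-word matching is an assumption made here for simplicity. Only the GeneID and the alias list come from the slide.

```python
import re

# From the slide: Entrez-Gene GeneID 3569 (interleukin 6) and its aliases.
GENE_ALIASES = {
    "3569": ["interleukin 6", "HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
}


def build_index(aliases):
    """Map every lower-cased surface form to its gene identifier."""
    return {
        name.lower(): gene_id
        for gene_id, names in aliases.items()
        for name in names
    }


def annotate(text, index):
    """Return sorted (surface form, gene id) pairs found in the text.

    Assumption: a mention counts only when it is not glued to other
    word characters, and matching ignores case.
    """
    lowered = text.lower()
    found = []
    for surface, gene_id in index.items():
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append((surface, gene_id))
    return sorted(found)


index = build_index(GENE_ALIASES)
mentions = annotate("Elevated IL-6 and BSF2 levels were observed.", index)
print(mentions)
```

A query for "interleukin 6" can then be answered by searching for the identifier 3569 instead of the literal string, so that documents mentioning only IL-6 or BSF2 are also retrieved.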

A Complex IE Pipeline is Required

Current Entity Categories
- Gene names (Entrez-Gene)
- Gene and gene product annotations (Gene Ontology)
- Organisms (NCBI Taxonomy)
- Diseases (SNOMED from UMLS)
- Drug compounds (DrugBank)
The classes Ambiguous gene, Cell Line, DNA and RNA are automatically learned from text

Results of the Semantic Annotation Process
Type
- Genes: 12,416
- Organism: 10,617
- Diseases: 9,256
- Drugs: 2,029
- Neoplastic process: 1,667
- Biological process: 1,604
- Pathological functions: 1,342
- Mental/behaviour dysfunction: 749
- Molecular function: 624
- Cellular component: 205
- DNAs (newly recognized): 156,426
- Cell lines (newly recognized): 89,217
- Cell types (newly recognized): 85,199
- RNAs (newly recognized): 6,001
1,204,063 Medline abstracts are annotated
10,884,032 semantic annotations are created
Saved links to 40,510 existing entities

LifeSKIM
Semantic annotation of biomedical documents