Sept, 2008 Life Science Knowledge Collider Vassil Momtchev (Ontotext)
Presentation Outline
  Life Sciences Domain Integration Problems
  Pathway and Interaction Knowledge Base
  Linked Life Data
  LifeSKIM Application to Showcase the Platform
ESTC, Sept 2008
Andy Law’s First Law “The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”
The problem!
  The data is maintained by different organizations
  The information is highly distributed and redundant
  There are tons of flat file formats with special semantics
  The knowledge is locked in vast data silos
  There are many isolated communities that cannot reach a cross-domain understanding
Andy Law’s Second Law “The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”
Take Your Best Guess
PIKB Overview
  PIKB stands for Pathway and Interaction Knowledge Base
  Interactions in the cell unveil the molecular mechanisms
    – Which molecular function or biological process is affected after the administration of a given drug?
    – What is the involvement of chemical compounds in a specific biological process or disease?
  The work is developed in the context of LarKC and is refined with AstraZeneca researchers
  The use case of “Semantic Integration for Early Clinical and Drug Development” will be assessed with clinical data from AstraZeneca
LarKC Project
  “Web Scale and Style Reasoning”
  [Figure: logic, IR and the Semantic Web positioned by precision (soundness) and recall (completeness)]
  Giving up 100% correctness: trading quality for size
    – often completeness is not needed
    – sometimes even soundness is not needed
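The slide pairs precision with soundness and recall with completeness. A minimal sketch of that trade-off, assuming a reasoner's answers and the full set of correct answers are both available as plain sets (the function name and the sample answers are illustrative, not part of LarKC):

```python
def precision_recall(returned, correct):
    """Return (precision, recall) for a set of returned answers.

    precision = share of returned answers that are correct (soundness)
    recall    = share of correct answers that were returned (completeness)
    """
    returned, correct = set(returned), set(correct)
    hits = returned & correct
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(correct) if correct else 1.0
    return precision, recall


# Hypothetical answer sets, for illustration only.
correct_answers = {"a", "b", "c", "d"}

# A sound but incomplete reasoner: everything it returns is right,
# but it misses one answer.
sound_incomplete = precision_recall({"a", "b", "c"}, correct_answers)

# An approximate reasoner that also returns one wrong answer.
approximate = precision_recall({"a", "b", "c", "x"}, correct_answers)

print(sound_incomplete)  # (1.0, 0.75)
print(approximate)       # (0.75, 0.75)
```

Giving up completeness lowers only the second number; giving up soundness lowers the first as well.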
PIKB Objectives
  Easily integrate pathway and interaction data from different sources
  Allow straightforward updates of the information
  Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
  Scale up to billions of statements
PIKB Data Sources
  Type of data source | Database name
  Gene and gene annotations | Entrez-Gene
  Protein sequences | Uniprot
  Protein cross references | iProClass
  Gene and gene product annotations | GeneOntology
  Organisms | NCBI Taxonomy
  Molecular interaction and pathways | BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc

  Give me all human genes located on the X chromosome.
  List all protein identifiers encoded by gene IL2.
  Give me all human proteins associated with the endoplasmic reticulum.
  List all articles where the protein Interleukin-2 is mentioned.
  List all cross references to the protein Interleukin-2.
  Give me all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse).
  List all subcategories of primates.
  Give me all interactions of cell division protein kinase.
  Sometimes we need to ask far more complex questions efficiently:
    – Give me all proteins that interact in the nucleus, are annotated with repressor, and have at least one participant that is encoded by a gene annotated with a specific term and located on chromosome X. Filter the results for Mammalia organisms!
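Questions like these become pattern matches once the data is held as subject-predicate-object statements. The sketch below is only an illustration in plain Python: every identifier and predicate name is a hypothetical placeholder, and a real deployment would pose such queries to the repository rather than scan a list.

```python
# Hypothetical statements; none of these identifiers come from a real database.
TRIPLES = [
    ("urn:example:gene:IL2", "encodes", "urn:example:protein:P1"),
    ("urn:example:gene:geneA", "encodes", "urn:example:protein:P2"),
    ("urn:example:gene:geneA", "locatedOn", "urn:example:chromosome:X"),
    ("urn:example:gene:geneB", "encodes", "urn:example:protein:P3"),
    ("urn:example:gene:geneB", "locatedOn", "urn:example:chromosome:7"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "List all protein identifiers encoded by gene IL2"
il2_proteins = [o for _, _, o in match(TRIPLES, s="urn:example:gene:IL2", p="encodes")]

# A two-step question: proteins encoded by genes located on chromosome X.
genes_on_x = [s for s, _, _ in match(TRIPLES, p="locatedOn", o="urn:example:chromosome:X")]
proteins_from_x = [
    o for gene in genes_on_x for _, _, o in match(TRIPLES, s=gene, p="encodes")
]

print(il2_proteins)     # ['urn:example:protein:P1']
print(proteins_from_x)  # ['urn:example:protein:P2']
```

The second query joins two patterns on the shared gene, which is the shape the more complex question on the slide takes with more conditions.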
Possible Solutions
  Classical data integration with:
    – data warehouses
    – federation middleware frameworks
    – database middleware technology
  Not really...
    – Mapping works efficiently only on a small scale
    – Different design paradigms can be a real challenge
    – Direct mapping usually does not work
    – No standard way to integrate textual information
Our Approach
  Convert all data sources to an RDF representation (if not already distributed as RDF)
  Collide the data in a scalable semantic repository
  Apply light-weight reasoning to specify formal interpretations of the data (e.g., remove redundancy)
  Derive new implicit knowledge
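The first step, converting a flat-file record to RDF, can be sketched as follows. The tab-separated layout (gene ID, symbol, taxon ID) and every `urn:example:` name are assumptions made for illustration; they are not the custom RDF schema used in PIKB. Only the `rdf:type` IRI is the standard one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def record_to_triples(line):
    """Turn one hypothetical 'gene_id<TAB>symbol<TAB>taxon_id' record into triples.

    Each triple is (subject, predicate, object, object_is_literal).
    """
    gene_id, symbol, taxon_id = line.rstrip("\n").split("\t")
    subject = f"urn:example:entrez-gene:{gene_id}"
    return [
        (subject, RDF_TYPE, "urn:example:Gene", False),
        (subject, "urn:example:symbol", symbol, True),
        (subject, "urn:example:organism", f"urn:example:taxonomy:{taxon_id}", False),
    ]


def to_ntriples(triple):
    """Serialise one triple as an N-Triples line."""
    s, p, o, is_literal = triple
    if is_literal:
        # Escape backslashes and double quotes inside the literal.
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{o}>"
    return f"<{s}> <{p}> {obj} ."


# GeneID 3569 is interleukin 6; 9606 is the NCBI Taxonomy ID for human.
lines = [to_ntriples(t) for t in record_to_triples("3569\tIL6\t9606")]
print("\n".join(lines))
```

Once every source is expressed this way, the statements can be loaded into one repository regardless of the original file format.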
Try to Visualise it
  [Figure: RDF graph. Nodes: urn:intact:1007, urn:intact:Interaction, urn:biogrid:15904, urn:biogrid:Interaction, urn:biogrid:FBgn, urn:biogrid:FBgn, urn:uniprot:FBgn, urn:uniprot:FBgn, urn:uniprot:P, urn:uniprot:Q, urn:uniprot:Protein, urn:pubmed:15904. Edges: rdf:type, rdf:seeAlso, hasParticipant, sameAs, interactsWith]
  Resolve the syntactic differences in the identifiers
  Use relationships to derive new implicit knowledge
  These are only example resource names
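The two captions on this slide, resolving identifier differences and deriving implicit knowledge, can be illustrated with a small sketch. It is not how the semantic repository implements reasoning; it merges `sameAs` groups with a union-find structure and then derives `interactsWith` between participants of the same interaction. All resource names are hypothetical.

```python
from itertools import combinations


def canonicaliser(same_as_pairs):
    """Build a function mapping each resource to one representative of its sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller name as the representative.
            parent[max(ra, rb)] = min(ra, rb)
    return find


def derive_interactions(has_participant, same_as_pairs):
    """Derive interactsWith statements between participants of the same interaction."""
    find = canonicaliser(same_as_pairs)
    participants = {}
    for interaction, protein in has_participant:
        participants.setdefault(find(interaction), set()).add(find(protein))
    derived = set()
    for members in participants.values():
        for a, b in combinations(sorted(members), 2):
            derived.add((a, "interactsWith", b))
    return derived


# Hypothetical example: two databases describe the same interaction and protein.
same_as = [
    ("urn:example:intact:I1", "urn:example:biogrid:I9"),
    ("urn:example:uniprot:A", "urn:example:biogrid:A"),
]
has_participant = [
    ("urn:example:intact:I1", "urn:example:uniprot:A"),
    ("urn:example:biogrid:I9", "urn:example:biogrid:A"),
    ("urn:example:biogrid:I9", "urn:example:uniprot:B"),
]

derived = derive_interactions(has_participant, same_as)
print(derived)
```

Because the two spellings of protein A collapse into one resource, the result holds a single `interactsWith` statement rather than duplicates from each source.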
Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms
Linked Life Data Overview
  Platform to automate the process:
    – Infrastructure to store the data and perform inference
    – Transform the structured data sources to RDF
    – Provide a web interface to access the data
  Currently operates over the OWLIM semantic repository
  LinkedLifeData - PIKB statistics:
    – Number of statements: 1,159,857,602
    – Number of explicit statements: 403,361,589
    – Number of entities: 128,948,564
  Publicly available at:
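The statistics above allow a quick back-of-the-envelope reading. The short calculation below assumes that every statement not counted as explicit was produced by inference; the slide does not state this directly.

```python
total_statements = 1_159_857_602
explicit_statements = 403_361_589
entities = 128_948_564

# Assumption: total = explicit + inferred.
inferred_statements = total_statements - explicit_statements
expansion_ratio = total_statements / explicit_statements
statements_per_entity = total_statements / entities

print(f"inferred statements:   {inferred_statements:,}")  # 756,496,013
print(f"expansion ratio:       {expansion_ratio:.2f}")
print(f"statements per entity: {statements_per_entity:.2f}")
```

Under that assumption, inference roughly triples the number of statements the repository has to hold.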
LifeSKIM Application
  A platform offering software infrastructure for:
    – automatic semantic annotation of text
    – ontology population
  Store the extracted facts and reason on top of them
  Semantic indexing and retrieval of content
  Query and navigation involving structured knowledge
  Based on Information Extraction (i.e., text-mining) technology
How Does LifeSKIM Search Better?
  LifeSKIM can match the query:
    “Documents about interleukin 6 (interferon, beta 2) where it is connected to apoptosis of neutrophils.”
  with a document containing:
    “… the same effect was not observed for IFNB2, IL-8 and TNF-alpha …”
    “… is induced neutrophil programmed cell death by apoptosis …”
How Does LifeSKIM Search Better?
  Classical IR could not match:
    – interleukin 6 with HGF; HSF; BSF2; IL-6; IFNB2
      Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
    – apoptosis of neutrophils with neutrophil apoptosis; programmed cell death of neutrophils by apoptosis; programmed cell death, neutrophils; neutrophil programmed cell death by apoptosis
      The GeneOntology thesaurus lists the above terms as part of the apoptosis of neutrophils term.
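The matching described on these two slides can be approximated by expanding each query concept with its known aliases before searching. The sketch below uses only the aliases and text fragments quoted on the slides; the function names and the regular-expression matching are illustrative assumptions and do not describe LifeSKIM's actual annotation pipeline.

```python
import re

# Aliases and synonyms as listed on the slide.
CONCEPT_TERMS = {
    "interleukin 6": ["interleukin 6", "HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
    "apoptosis of neutrophils": [
        "apoptosis of neutrophils",
        "neutrophil apoptosis",
        "programmed cell death of neutrophils by apoptosis",
        "programmed cell death, neutrophils",
        "neutrophil programmed cell death by apoptosis",
    ],
}


def mentions(concept, text):
    """True if the text contains the concept name or any of its aliases as whole words."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        for term in CONCEPT_TERMS[concept]
    )


def matches_query(concepts, text):
    """True if every queried concept is mentioned in the text."""
    return all(mentions(c, text) for c in concepts)


document = (
    "... the same effect was not observed for IFNB2, IL-8 and TNF-alpha ... "
    "is induced neutrophil programmed cell death by apoptosis ..."
)
query = ["interleukin 6", "apoptosis of neutrophils"]

# Plain keyword search looks for the literal query strings only.
keyword_hit = all(c in document.lower() for c in query)
semantic_hit = matches_query(query, document)

print(keyword_hit)   # False
print(semantic_hit)  # True
```

The whole-word match matters here: `IL-8` in the document must not be mistaken for the alias `IL-6`.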
Semantic Annotation Example
Thanks
  AstraZeneca: Bosse Andersson, Elisabet Söderhielm, Kaushal Desai
  Ontotext: Deyan Peychev, Georgi Georgiev, OWLIM team, KIM team
  The development of PIKB and Linked Life Data is partially funded by FP LarKC