Sept, 2008 Life Science Knowledge Collider Vassil Momtchev (Ontotext)

Presentation Outline
Life Sciences Domain Integration Problems
Pathway and Interaction Knowledge Base
Linked Life Data
LifeSKIM Application to Showcase the Platform
Sept, 2008 ESTC

Andy Law’s First Law
“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”

The problem!
The data is supported by different organizations
The information is highly distributed and redundant
There are tons of flat file formats with special semantics
The knowledge is locked in vast data silos
There are many isolated communities that cannot reach cross-domain understanding

Andy Law’s Second Law
“The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

Take Your Best Guess

PIKB Overview
PIKB stands for Pathway and Interaction Knowledge Base
Interactions in the cell unveil the molecular mechanisms
– Which molecular function or biological process is affected after the administration of a given drug?
– What is the involvement of chemical compounds in a specific biological process or disease?
The work is developed in the context of LarKC and refined with AstraZeneca researchers
The use case of “Semantic Integration for Early Clinical and Drug Development” will be assessed with clinical data from AstraZeneca

LarKC Project
“Web Scale and Style Reasoning”
[Figure: axes precision (soundness) and recall (completeness); labels: logic, IR, Semantic Web]
Giving up 100% correctness: trading quality for size
– often completeness is not needed
– sometimes even soundness is not needed
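
The trade-off named on this slide can be made concrete with the standard definitions of precision and recall. The sketch below is only an illustration: the answer sets are made up, and the function name `precision_recall` is hypothetical, not part of LarKC.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned answers.

    precision = fraction of returned answers that are correct (soundness)
    recall    = fraction of correct answers that were returned (completeness)
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Made-up example: four correct answers exist in total.
correct_answers = {"a", "b", "c", "d"}

# A sound and complete reasoner returns exactly the correct answers.
exact = precision_recall({"a", "b", "c", "d"}, correct_answers)

# An approximate reasoner misses two answers and returns one wrong one.
approximate = precision_recall({"a", "b", "x"}, correct_answers)

print(exact, approximate)
```

The approximate result is neither fully sound nor fully complete, which is the kind of answer the slide suggests may be acceptable at web scale.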

PIKB Objectives
Easily integrate pathway and interaction data from different sources
Allow straightforward updates of the information
Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
Scale up to billions of statements

PIKB Data Sources
Type of data source | Database name
Gene and gene annotations | Entrez-Gene
Protein sequences | Uniprot
Protein cross references | iProClass
Gene and gene product annotations | GeneOntology
Organisms | NCBI Taxonomy
Molecular interaction and pathways | BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc

Example questions:
– Give me all human genes located on the X chromosome
– List all protein identifiers encoded by gene IL2
– Give me all human proteins associated with the endoplasmic reticulum
– List all articles where the protein Interleukin-2 is mentioned
– List all cross references to the protein Interleukin-2
– Give all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse)
– List all subcategories of primates
– Give me all interactions of cell division protein kinase

Sometimes we need to ask far more complex questions efficiently:
Give me all proteins that interact in the nucleus, are annotated with repressor, and have at least one participant encoded by a gene that is annotated with a specific term and located on chromosome X. Filter the results for Mammalia organisms!
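
A simplified version of the cross-source question above can be sketched over a toy set of triples. This is a minimal illustration, assuming a plain Python set stands in for the repository; every identifier and predicate name below is made up and none comes from the actual PIKB data.

```python
# Toy triples (subject, predicate, object); all names are hypothetical.
TRIPLES = {
    # "gene database" facts
    ("gene:G1", "locatedOn", "chromosome:X"),
    ("gene:G1", "encodes", "protein:P1"),
    ("gene:G2", "locatedOn", "chromosome:7"),
    ("gene:G2", "encodes", "protein:P2"),
    # "interaction database" facts
    ("interaction:I1", "hasParticipant", "protein:P1"),
    ("interaction:I1", "hasParticipant", "protein:P2"),
    ("interaction:I1", "occursIn", "location:nucleus"),
    ("interaction:I2", "hasParticipant", "protein:P2"),
    ("interaction:I2", "occursIn", "location:cytoplasm"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known triple."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def nuclear_interactions_with_x_linked_participant():
    """Interactions in the nucleus with a participant encoded on chromosome X."""
    x_genes = subjects("locatedOn", "chromosome:X")
    x_proteins = {p for g in x_genes for p in objects(g, "encodes")}
    nuclear = subjects("occursIn", "location:nucleus")
    return {i for i in nuclear if objects(i, "hasParticipant") & x_proteins}


result = nuclear_interactions_with_x_linked_participant()
print(result)
```

The point of the sketch is that the answer needs facts from two different sources joined on shared identifiers, which is only possible once both sources sit in one graph.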

Possible Solutions
Classical data integration with:
– data warehouses
– federation middleware frameworks
– database middleware technology
Not really...
– Mapping works efficiently only on a small scale
– Different design paradigms can be a real challenge
– Direct mapping usually does not work
– No standard way to integrate textual information

Our Approach
Convert all data sources to an RDF representation (if not already distributed as RDF)
Collide the data into a scalable semantic repository
Apply light-weight reasoning to specify formal interpretations of the data (e.g., remove redundancy)
Derive new implicit knowledge
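
The first step, converting a flat file to RDF-style statements, might look roughly like the sketch below. It assumes a tab-separated input with made-up column names and records, and a hypothetical `urn:example:` namespace; it does not reproduce any schema used in PIKB.

```python
import csv
import io

# Made-up tab-separated flat file; real sources each use their own format.
FLAT_FILE = (
    "gene_id\tsymbol\tchromosome\n"
    "g1\tGENE_A\tX\n"
    "g2\tGENE_B\t7\n"
)


def rows_to_triples(text, base="urn:example:gene:"):
    """Turn each row into (subject, predicate, object) triples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    triples = set()
    for row in reader:
        subject = base + row["gene_id"]
        triples.add((subject, "rdf:type", "urn:example:Gene"))
        triples.add((subject, "urn:example:symbol", row["symbol"]))
        triples.add((subject, "urn:example:chromosome", row["chromosome"]))
    return triples


triples = rows_to_triples(FLAT_FILE)
for triple in sorted(triples):
    print(triple)
```

Once every source is expressed as triples, loading them into one repository is a set union, and the reasoning steps can operate on the combined graph.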

Try to Visualise it
[Figure: RDF graph. Edge labels: rdf:type, rdf:seeAlso, sameAs, hasParticipant, interactsWith. Node labels: urn:intact:1007, urn:intact:Interaction, urn:uniprot:P104172, urn:uniprot:Q709356, urn:uniprot:Protein, urn:uniprot:FBgn0068575, urn:uniprot:FBgn00134235, urn:biogrid:15904, urn:biogrid:Interaction, urn:biogrid:FBgn00134235, urn:biogrid:FBgn0068575, urn:pubmed:15904]
Resolve the syntactic differences in the identifiers
Use relationships to derive new implicit knowledge
These are only example resource names
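
The two ideas on this slide, resolving identifiers through sameAs and deriving interactsWith from shared participation, can be sketched in a few lines. The resource names are taken from the figure labels, but which nodes the edges connect is an assumption made for illustration, and the rule is a simplification, not the rule set actually used in the repository.

```python
from itertools import combinations

# Explicit statements; the edge layout is assumed, not read from the figure.
explicit = {
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn0068575"),
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn00134235"),
    ("urn:biogrid:FBgn0068575", "sameAs", "urn:uniprot:FBgn0068575"),
    ("urn:biogrid:FBgn00134235", "sameAs", "urn:uniprot:FBgn00134235"),
}


def infer_interactions(triples):
    """Derive interactsWith between participants of the same interaction."""
    # Step 1: map each identifier to its sameAs target (one hop only).
    alias = {s: o for s, p, o in triples if p == "sameAs"}

    # Step 2: group canonical participants by interaction.
    participants = {}
    for s, p, o in triples:
        if p == "hasParticipant":
            participants.setdefault(s, set()).add(alias.get(o, o))

    # Step 3: every pair of participants interacts (symmetric relation).
    inferred = set()
    for members in participants.values():
        for a, b in combinations(sorted(members), 2):
            inferred.add((a, "interactsWith", b))
            inferred.add((b, "interactsWith", a))
    return inferred


inferred = infer_interactions(explicit)
for triple in sorted(inferred):
    print(triple)
```

None of the inferred statements appears in the input; they follow from the explicit ones, which is what "implicit knowledge" refers to here.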

Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms

Linked Life Data Overview
Platform to automate the process:
– Infrastructure to store the data and run inference
– Transform the structured data sources to RDF
– Provide a web interface to access the data
Currently operates over the OWLIM semantic repository
LinkedLifeData - PIKB statistics:
– Number of statements: 1,159,857,602
– Number of explicit statements: 403,361,589
– Number of entities: 128,948,564
Publicly available at: http://www.linkedlifedata.com
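
A quick calculation puts the statistics above in proportion. It assumes that the gap between the total and the explicit statement counts consists of statements added by inference; the slide does not say so explicitly.

```python
# Figures from the slide.
total_statements = 1_159_857_602
explicit_statements = 403_361_589

# Assumption: everything that is not explicit was derived by reasoning.
inferred_statements = total_statements - explicit_statements
expansion_ratio = total_statements / explicit_statements

print(f"inferred statements: {inferred_statements:,}")
print(f"total / explicit:    {expansion_ratio:.2f}")
```

Under that assumption, roughly 756 million statements are implicit, so the repository holds close to three statements for every one that was loaded.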

LifeSKIM Application
A platform offering software infrastructure for:
– automatic semantic annotation of text
– ontology population
Store the extracted facts and reason on top of them
Semantic indexing and retrieval of content
Query and navigation involving structured knowledge
Based on Information Extraction (i.e., text-mining) technology

How Does LifeSKIM Search Better?
LifeSKIM can match a query:
Documents about interleukin 6 (interferon, beta 2) where it is connected to apoptosis of neutrophils.
With a document containing:
…. the same effect was not observed for IFNB2, IL-8 and TNF-alpha ……..
…. is induced neutrophil programmed cell death by apoptosis ……

How Does LifeSKIM Search Better?
Classical IR could not match:
interleukin 6 with HGF; HSF; BSF2; IL-6; IFNB2
Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
apoptosis of neutrophils with neutrophil apoptosis; programmed cell death of neutrophils by apoptosis; programmed cell death, neutrophils; neutrophil programmed cell death by apoptosis
The GeneOntology thesaurus includes the above list of terms as part of the “apoptosis of neutrophils” term.
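
The difference between keyword matching and alias-aware matching can be sketched as follows. The alias lists are copied from the slide; the function `mentions` and the naive substring test are illustrative assumptions and do not describe how LifeSKIM is implemented.

```python
# Alias lists copied from the slide; a real system would load them from
# Entrez-Gene and the GeneOntology thesaurus rather than hard-code them.
SYNONYMS = {
    "interleukin 6": ["interleukin 6", "HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
    "apoptosis of neutrophils": [
        "apoptosis of neutrophils",
        "neutrophil apoptosis",
        "programmed cell death of neutrophils by apoptosis",
        "programmed cell death, neutrophils",
        "neutrophil programmed cell death by apoptosis",
    ],
}


def mentions(concept, text):
    """True if any alias of `concept` occurs in `text` (naive substring test)."""
    lowered = text.lower()
    return any(alias.lower() in lowered for alias in SYNONYMS[concept])


# Document fragments from the previous slide, joined for the example.
document = (
    "the same effect was not observed for IFNB2, IL-8 and TNF-alpha ... "
    "is induced neutrophil programmed cell death by apoptosis"
)

# Plain keyword search looks for the literal query string only.
keyword_hit = "interleukin 6" in document.lower()

# Alias-aware search accepts any known name for each query concept.
semantic_hit = all(mentions(concept, document) for concept in SYNONYMS)

print(keyword_hit, semantic_hit)
```

Substring matching is used only to keep the sketch short; short aliases such as HGF would need proper tokenization and disambiguation in practice.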

Semantic Annotation Example

Thanks
AstraZeneca: Bosse Andersson, Elisabet Söderhielm, Kaushal Desai
Ontotext: Deyan Peychev, Georgi Georgiev, OWLIM team, KIM team
The development of PIKB and Linked Life Data is partially funded by FP7 215535 LarKC
