Sept, 2008 Life Science Knowledge Collider Vassil Momtchev (Ontotext)

Presentation transcript:


Presentation Outline
– Life Sciences Domain Integration Problems
– Pathway and Interaction Knowledge Base
– Linked Life Data
– LifeSKIM Application to Showcase the Platform
ESTC, Sept 2008

Andy Law’s First Law
“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”

The problem!
– The data is supported by different organizations
– The information is highly distributed and redundant
– There are tons of flat file formats with special semantics
– The knowledge is locked in vast data silos
– There are many isolated communities which cannot reach cross-domain understanding

Andy Law’s Second Law
“The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”

Take Your Best Guess

PIKB Overview
PIKB stands for Pathway and Interaction Knowledge Base
Interactions in the cell unveil the molecular mechanisms
– Which molecular function or biological process is affected after the administration of a given drug?
– What is the involvement of chemical compounds in a specific biological process or disease?
The work is developed in the context of LarKC and is refined with AstraZeneca researchers
The use case of “Semantic Integration for Early Clinical and Drug Development” will be assessed with clinical data from AstraZeneca

LarKC Project
“Web Scale and Style Reasoning”
[Diagram: logic, IR and the Semantic Web placed on two axes, precision (soundness) and recall (completeness)]
Giving up 100% correctness: trading quality for size
– often completeness is not needed
– sometimes even soundness is not needed
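The slide pairs precision with soundness and recall with completeness. A minimal sketch of that correspondence, using the standard precision and recall definitions over answer sets, might look as follows. The answer sets and the function name are invented for illustration and are not taken from LarKC.

```python
def precision_recall(returned, correct):
    """Return (precision, recall) for a set of returned answers."""
    returned, correct = set(returned), set(correct)
    true_positives = len(returned & correct)
    # Precision: share of returned answers that are right (soundness).
    precision = true_positives / len(returned) if returned else 1.0
    # Recall: share of right answers that were returned (completeness).
    recall = true_positives / len(correct) if correct else 1.0
    return precision, recall


# Hypothetical answer sets, invented for illustration only.
correct = {"a", "b", "c", "d"}
sound_but_incomplete = {"a", "b"}                 # nothing wrong, two answers missing
complete_but_unsound = {"a", "b", "c", "d", "x"}  # nothing missing, one wrong answer

print(precision_recall(sound_but_incomplete, correct))
print(precision_recall(complete_but_unsound, correct))
```

A reasoner that gives up completeness lowers recall while keeping precision at 1.0. One that gives up soundness lowers precision instead.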

PIKB Objectives
– Easily integrate pathway and interaction data from different sources
– Allow straightforward updates of the information
– Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
– Scale up to billions of statements

PIKB Data Sources
Type of data source | Database name
Gene and gene annotations | Entrez-Gene
Protein sequences | Uniprot
Protein cross references | iProClass
Gene and gene product annotations | GeneOntology
Organisms | NCBI Taxonomy
Molecular interaction and pathways | BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc

Example questions:
– Give me all human genes which are located on the X chromosome.
– List all protein identifiers encoded by gene IL2.
– Give me all human proteins associated with the endoplasmic reticulum.
– List all articles where the protein Interleukin-2 is mentioned.
– List all cross references to the protein Interleukin-2.
– Give me all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse).
– List all subcategories of primates.
– Give me all interactions of cell division protein kinase.

Sometimes we need to ask far more complex questions efficiently:
Give me all proteins which interact in the nucleus, are annotated with repressor, and have at least one participant that is encoded by a gene annotated with a specific term and located on chromosome X. Filter the results for Mammalia organisms!
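To make the combined question concrete, here is a minimal sketch that answers a simplified version of it over a handful of in-memory triples. Every identifier and predicate name is hypothetical, and the question is reduced to four conditions. It is not the PIKB schema or its query language.

```python
# Toy triples; every identifier and predicate name is made up for illustration.
TRIPLES = {
    ("protein:P1", "interactsIn", "nucleus"),
    ("protein:P1", "annotatedWith", "repressor"),
    ("protein:P1", "encodedBy", "gene:G1"),
    ("protein:P1", "organismClass", "Mammalia"),
    ("gene:G1", "locatedOn", "chromosome:X"),
    ("protein:P2", "interactsIn", "nucleus"),
    ("protein:P2", "annotatedWith", "repressor"),
    ("protein:P2", "encodedBy", "gene:G2"),
    ("protein:P2", "organismClass", "Mammalia"),
    ("gene:G2", "locatedOn", "chromosome:7"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known triple."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def combined_query():
    """Nuclear repressor proteins of Mammalia encoded by a gene on chromosome X."""
    candidates = (
        subjects("interactsIn", "nucleus")
        & subjects("annotatedWith", "repressor")
        & subjects("organismClass", "Mammalia")
    )
    return sorted(
        protein
        for protein in candidates
        if any(
            "chromosome:X" in objects(gene, "locatedOn")
            for gene in objects(protein, "encodedBy")
        )
    )


print(combined_query())
```

Each condition draws on a different source in the table above (interactions, annotations, genes, organisms), which is why the question is hard to answer while the data stays in separate silos.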

Possible Solutions
Classical data-integration with:
– data warehouses
– federation middleware frameworks
– database middleware technology
Not really...
– Mapping works efficiently only on a small scale
– Different design paradigms can be a real challenge
– Direct mapping usually does not work
– No standard way to integrate textual information

Our Approach
– Convert all data sources to an RDF representation (if not already distributed as RDF)
– Collide the data into a scalable semantic repository
– Apply light-weight reasoning to specify formal interpretations of the data (e.g., remove redundancy)
– Derive new implicit knowledge
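The first step, converting a flat record to RDF, can be sketched as below. This is only an illustration: the `urn:example:` prefixes, the field names and the `record_to_triples` helper are assumptions made for this sketch and not the converters used for PIKB.

```python
def record_to_triples(record, subject_prefix="urn:example:gene:"):
    """Turn one flat record (a dict) into N-Triples lines.

    The record's "id" field becomes the subject; every other field
    becomes one triple with a plain string literal as its object.
    """
    subject = f"<{subject_prefix}{record['id']}>"
    triples = []
    for field, value in record.items():
        if field == "id":
            continue
        predicate = f"<urn:example:schema:{field}>"
        # Escape backslashes and double quotes so the literal stays valid.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{escaped}" .')
    return triples


# Hypothetical flat-file row, reusing the GeneID and alias quoted in the slides.
row = {"id": "3569", "alias": "IL-6"}
for line in record_to_triples(row):
    print(line)
```

Once every source is expressed as triples, loading them into one repository amounts to taking the union of the statement sets. This is what makes the later reasoning steps possible.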

Try to Visualise It
[Diagram: an RDF graph connecting example resources such as urn:intact:1007, urn:biogrid:15904, urn:pubmed:15904, urn:uniprot:Protein, urn:biogrid:Interaction and urn:intact:Interaction through rdf:type, rdfs:seeAlso, hasParticipant, interactsWith and sameAs edges]
– Resolve the syntactic differences in the identifiers
– Use relationships to derive new implicit knowledge
These are only example resource names
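A minimal sketch of the two steps named on this slide: merge identifiers connected by sameAs, then derive interactsWith statements between participants of the same interaction. The derivation rule is one reading of the diagram, not a documented PIKB rule, and all `urn:example:` names are hypothetical.

```python
from itertools import combinations


def canonical_ids(same_as_pairs):
    """Map every identifier to one representative per sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smallest identifier as representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {x: find(x) for x in list(parent)}


def derive_interactions(has_participant, same_as_pairs):
    """Derive (a, 'interactsWith', b) for co-participants of one interaction."""
    canon = canonical_ids(same_as_pairs)
    participants = {}
    for interaction, member in has_participant:
        participants.setdefault(interaction, set()).add(canon.get(member, member))
    derived = set()
    for members in participants.values():
        for a, b in combinations(sorted(members), 2):
            derived.add((a, "interactsWith", b))
    return derived


# Hypothetical data: two sources describe the same protein under different names.
same_as = [("urn:example:biogrid:A", "urn:example:uniprot:A")]
has_participant = [
    ("urn:example:interaction:1", "urn:example:biogrid:A"),
    ("urn:example:interaction:1", "urn:example:uniprot:B"),
    ("urn:example:interaction:2", "urn:example:uniprot:A"),
    ("urn:example:interaction:2", "urn:example:uniprot:C"),
]
for triple in sorted(derive_interactions(has_participant, same_as)):
    print(triple)
```

After the merge, both derived statements use the same representative for protein A, so its interaction partners from the two sources end up attached to a single node.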

Database | Dataset | Schema | Description
Uniprot | Curated entries | Original by the provider | Protein sequences and annotations
Entrez-Gene | Complete | Custom RDF schema | Genes and annotation
iProClass | Complete | Custom RDF schema | Protein cross-references
Gene Ontology | Complete | Schema by the provider | Gene and gene product annotation thesaurus
BioGRID | Complete | BioPAX 2.0 (custom generated) | Protein interactions extracted from the literature
NCI - Pathway Interaction Database | Complete | BioPAX 2.0 (original by the provider) | Human pathway interaction database
The Cancer Cell Map | Complete | BioPAX 2.0 (original by the provider) | Cancer pathways database
Reactome | Complete | BioPAX 2.0 (original by the provider) | Human pathways and interactions
BioCarta | Complete | BioPAX 2.0 (original by the provider) | Pathway database
KEGG | Complete | BioPAX 1.0 (original by the provider) | Molecular Interaction
BioCyc | Complete | BioPAX 1.0 (original by the provider) | Pathway database
NCBI Taxonomy | Complete | Custom RDF schema | Organisms

Linked Life Data Overview
Platform to automate the process:
– Infrastructure to store the data and draw inferences
– Transform the structured data sources to RDF
– Provide a web interface to access the data
Currently operates over the OWLIM semantic repository
LinkedLifeData - PIKB statistics:
– Number of statements: 1,159,857,602
– Number of explicit statements: 403,361,589
– Number of entities: 128,948,564
Publicly available at:

LifeSKIM Application
A platform offering software infrastructure for:
– automatic semantic annotation of text
– ontology population
Store the extracted facts and reason on top of them
Semantic indexing and retrieval of content
Query and navigation involving structured knowledge
Based on Information Extraction (i.e. text-mining) technology

How Does LifeSKIM Search Better?
LifeSKIM can match the query:
Documents about interleukin 6 (interferon, beta 2) where it is connected to apoptosis of neutrophils.
with a document containing:
… the same effect was not observed for IFNB2, IL-8 and TNF-alpha …
… is induced neutrophil programmed cell death by apoptosis …

How Does LifeSKIM Search Better?
Classical IR could not match:
– interleukin 6 with HGF; HSF; BSF2; IL-6; IFNB2
Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
– apoptosis of neutrophils with neutrophil apoptosis; programmed cell death of neutrophils by apoptosis; programmed cell death, neutrophils; neutrophil programmed cell death by apoptosis
The GeneOntology thesaurus lists the above terms as synonyms of the term apoptosis of neutrophils.
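The matching idea can be sketched as a dictionary lookup: map every label or alias to a concept, annotate both query and document with concepts, and compare the concepts rather than the words. This is a simplified illustration, not the LifeSKIM implementation. The synonym lists are the ones quoted on the slide, while the concept keys and function names are chosen for this sketch.

```python
import re

# Synonym lists as quoted on the slide; the concept keys are labels for this sketch.
THESAURUS = {
    "EntrezGene:3569": ["interleukin 6", "HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
    "GO:apoptosis of neutrophils": [
        "apoptosis of neutrophils",
        "neutrophil apoptosis",
        "programmed cell death of neutrophils by apoptosis",
        "programmed cell death, neutrophils",
        "neutrophil programmed cell death by apoptosis",
    ],
}


def annotate(text):
    """Return the concept keys whose label or alias occurs in the text."""
    found = set()
    for concept, labels in THESAURUS.items():
        for label in labels:
            # Match whole tokens only, ignoring case.
            pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept)
                break
    return found


def matches(query, document):
    """True if every concept found in the query is also found in the document."""
    wanted = annotate(query)
    return bool(wanted) and wanted <= annotate(document)


query = "interleukin 6 (interferon, beta 2) connected to apoptosis of neutrophils"
document = (
    "the same effect was not observed for IFNB2, IL-8 and TNF-alpha ... "
    "is induced neutrophil programmed cell death by apoptosis"
)
print(matches(query, document))             # concept-level match
print("interleukin 6" in document.lower())  # plain keyword lookup
```

The keyword lookup fails because the document never spells out "interleukin 6". The concept-level comparison succeeds because IFNB2 and the longer apoptosis phrase resolve to the same two concepts as the query.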

Semantic Annotation Example

Thanks
AstraZeneca: Bosse Andersson, Elisabet Söderhielm, Kaushal Desai
Ontotext: Deyan Peychev, Georgi Georgiev, OWLIM team, KIM team
The development of PIKB and Linked Life Data is partially funded by FP LarKC