Building linked-data, large-scale chemistry platform: challenges, lessons and solutions Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor,

Presentation transcript:

Building linked-data, large-scale chemistry platform: challenges, lessons and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett
Royal Society of Chemistry
ACS Spring 2016, San Diego, CA, March 13th 2016

ChemSpider – 2007 - 2011
OpenPHACTS – 2011 - 2014
Chemistry Data Platform – 2014 - …

45 million chemicals and growing
Data sourced from >500 different sources
Crowdsourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
A structure-centric hub for web searching

ChemSpider

Chemical vendors and datasources

ChemSpider

Properties - experimental

Literature and patents references

Classification

Spectra

Multimedia

Tagging

ChemSpider - Summary
Simple, flattish data model
InChI as a primary identifier
Linked by synonyms
Linked by “ExtId”
Standard searches (identity, substructure, similarity)
Very little semantics
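The flat model summarised above can be pictured with a minimal sketch. It is not the ChemSpider schema: the class names, the index layout and the `ExampleVendor` / `EV-0001` external ID are hypothetical, and the InChIKey is the one shown on the Gleevec slide later in this deck. The sketch assumes one record per structure identifier, with synonyms and external IDs acting as plain lookup keys and no further semantics.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One flat record; the InChIKey acts as the primary identifier."""
    inchikey: str
    synonyms: set = field(default_factory=set)
    ext_ids: dict = field(default_factory=dict)  # data source name -> external ID


class FlatCompoundStore:
    """Hypothetical in-memory store with three lookup indexes."""

    def __init__(self):
        self._by_key = {}       # InChIKey -> record
        self._by_synonym = {}   # lower-cased synonym -> set of InChIKeys
        self._by_ext_id = {}    # (source, external ID) -> InChIKey

    def add(self, record):
        self._by_key[record.inchikey] = record
        for name in record.synonyms:
            self._by_synonym.setdefault(name.lower(), set()).add(record.inchikey)
        for source, ext_id in record.ext_ids.items():
            self._by_ext_id[(source, ext_id)] = record.inchikey

    def by_inchikey(self, inchikey):
        return self._by_key.get(inchikey)

    def by_synonym(self, name):
        # A synonym may point at several records, so a list is returned.
        keys = self._by_synonym.get(name.lower(), set())
        return [self._by_key[k] for k in sorted(keys)]

    def by_ext_id(self, source, ext_id):
        key = self._by_ext_id.get((source, ext_id))
        return self._by_key.get(key) if key is not None else None


store = FlatCompoundStore()
store.add(CompoundRecord(
    inchikey="YLMAHDNUQAMNNX-UHFFFAOYSA-N",
    synonyms={"Gleevec", "Imatinib mesylate"},
    ext_ids={"ExampleVendor": "EV-0001"},  # hypothetical source and ID
))
```

Every link in such a model is a key lookup; nothing records how two records relate to each other, which is the gap the later platforms address.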

OpenPHACTS: 2011-2014
Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point

OpenPHACTS
Open PHACTS Practical Semantics
GlaxoSmithKline – Coordinator; Universität Wien – Managing entity; Technical University of Denmark; University of Hamburg, Center for Bioinformatics; BioSolveIT GmbH; Consorci Mar Parc de Salut de Barcelona; Leiden University Medical Centre; Royal Society of Chemistry; Vrije Universiteit Amsterdam; Novartis; Merck Serono; H. Lundbeck A/S; Eli Lilly; Netherlands Bioinformatics Centre; Swiss Institute of Bioinformatics; ConnectedDiscovery; EMBL-European Bioinformatics Institute; Janssen; Esteve; Almirall; OpenLink; Scibite; The Open PHACTS Foundation; Spanish National Cancer Research Centre; University of Manchester; Maastricht University; Aqnowledge; University of Santiago de Compostela; Rheinische Friedrich-Wilhelms-Universität Bonn; AstraZeneca; Pfizer
info@openphactsfoundation.org @Open_PHACTS

Why is it so hard to…
IP? What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors?
Remember this: some of these questions are easier to answer than others

Open PHACTS was developed to support the key questions of drug discovery
Business questions have been at the heart of Open PHACTS and have driven the development of the platform
Example questions:
“Let me compare MW, logP and PSA for known oxidoreductase inhibitors”
“What is the selectivity profile of known p38 inhibitors?”
“Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM”
Data sources: ChEMBL, DrugBank, Gene Ontology, Wikipathways, GeneGo, ChEBI, UniProt, UMLS, neXtProt, GVKBio, ConceptWiki, ChemSpider, DisGeNet, TrialTrove, TR Integrity, ChEMBL Target Class, ENZYME, FDA adverse events, SureChEMBL
Notes: Mx/psa, how calculated who did it? Mash up. With your data too, - top layer join together but need them all commercial

Big Data Integration
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 5 billion triples – 14 datasets & growing
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
@gray_alasdair

OpenPHACTS Discovery Platform
Diagram labels: RDF; Nanopub; Db; VoID; Data Cache (Virtuoso Triple Store); Semantic Workflow Engine; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Identity Resolution Service; Chemistry Registration; Normalisation & Q/C; Identifier Management; Indexing; Core Platform; Public Content; Commercial; Public Ontologies; User Annotations; Apps
Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
21 October 2014 Scientific Lenses – A. J. G. Gray
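The last three steps of the integration approach (API call translated to SPARQL, query kept in terms of the original data, URIs expanded through the IMS) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `example.org` URIs, the `IMS_MAPPINGS` table and both function names are hypothetical, and the real platform's query templates are not shown in the slides. The only external fact relied on is the SPARQL 1.1 `VALUES` clause, used to enumerate equivalent URIs.

```python
# Hypothetical equivalence table standing in for the identity mapping service (IMS).
IMS_MAPPINGS = {
    "http://example.org/chemspider/CS4532": [
        "http://example.org/chembl/compound/X1",
        "http://example.org/drugbank/DB-X1",
    ],
}


def expand_uri(uri, mappings=IMS_MAPPINGS):
    """Return the input URI followed by every URI mapped as equivalent to it."""
    return [uri] + [u for u in mappings.get(uri, []) if u != uri]


def build_sparql(uri, mappings=IMS_MAPPINGS):
    """Translate a 'describe this compound' API call into one SPARQL query.

    The triple pattern stays in the vocabulary of the source datasets; only
    the set of subject URIs is widened.
    """
    values = " ".join(f"<{u}>" for u in expand_uri(uri, mappings))
    return (
        "SELECT ?compound ?property ?value WHERE {\n"
        f"  VALUES ?compound {{ {values} }}\n"
        "  ?compound ?property ?value .\n"
        "}"
    )


query = build_sparql("http://example.org/chemspider/CS4532")
print(query)
```

Because the expansion happens at query time, each dataset can stay in its original model inside the cache, which matches the "data kept in original model" point above.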

Example drug: Gleevec
Gleevec®: Imatinib Mesylate
Cancer drug for leukemia
Lookup in three popular public chemical databases: ChemSpider, Drugbank, PubChem
Different results
Diagram labels: YLMAHDNUQAMNNX-UHFFFAOYSA-N; Mesylate

Structure Lens
“I need to compute an analysis, give me details of the active compound in Gleevec.”
“Interested in physicochemical properties of Gleevec”
Lens scale: Strict to Relaxed; Analysing to Browsing
skos:exactMatch (InChI)

Lens Effects: Ibuprofen
Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms. Both chiral forms are equally active. Typically, the user will wish to retrieve info for any stereoisomer.
CHEMBL427526 CHEMBL521 CHEMBL175

Default Lens

Stereoisomer Lens
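One way to read the default and stereoisomer lenses is as a choice of which mapping relationships count as "the same compound" for a given query. The sketch below follows that reading and is only an assumption: the compound labels, the `MAPPINGS` triples and the contents of each lens are hypothetical, not the Open PHACTS lens definitions. A lens is modelled as a set of allowed relationship types, and equivalents are found by a breadth-first walk over the mappings that the lens permits.

```python
from collections import deque

# Hypothetical (subject, relationship, object) mappings between records.
MAPPINGS = [
    ("ibuprofen-R", "has_stereoundefined_parent", "ibuprofen-racemate"),
    ("ibuprofen-S", "has_stereoundefined_parent", "ibuprofen-racemate"),
    ("ibuprofen-racemate", "exactMatch", "other-db:ibuprofen"),
]

# Hypothetical lenses: each one names the relationships treated as equivalence.
LENSES = {
    "default": {"exactMatch"},
    "stereoisomer": {"exactMatch", "has_stereoundefined_parent", "is_stereoisomer_of"},
}


def equivalents(start, lens_name, mappings=MAPPINGS, lenses=LENSES):
    """Return every identifier reachable from `start` under the chosen lens."""
    allowed = lenses[lens_name]
    # Undirected adjacency, restricted to the relationships the lens allows.
    neighbours = {}
    for subject, relationship, obj in mappings:
        if relationship in allowed:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


strict = equivalents("ibuprofen-S", "default")
relaxed = equivalents("ibuprofen-S", "stereoisomer")
```

With the toy data, the strict lens keeps the query on the single stereoisomer, while the relaxed lens also pulls in the racemate, the other stereoisomer and the cross-database match, so any count aggregated over the result set grows accordingly.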

Mapping Generation
Validate structure: source data is messy!
Identify common problems: charge imbalance, stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
Diagram labels: ops:OPS437281; ops:OPS380297; ✔ is_stereoisomer_of [ci:CHEMINF_000461]; has_stereoundefined_parent [ci:CHEMINF_000456]
Other relationships: has part, is tautomer of, uncharged counterpart, isotope, …

OpenPHACTS UI http://explorer.openphacts.org/

Explorer Screenshot

Explorer Screenshot
Pharmacology count: 2370 → 3044

OpenPHACTS - Summary
Principal difference – inter-domain links
More complex, but still structure-centric data model
Ontological relationships introduced
Chemical Lenses – new type of search

Chemistry Data Platform – 2014 - …

Dimensions and complexity of science
What about science and chemistry in particular?

RSC Archive – since 1841

Digitally Enabling RSC Archive

ChemSpider Synthetic Pages
Compounds
Reaction
Analytical Data
Text and References

RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…

Compounds domain

Data quality issue and CVSP
Robochemistry
Proliferation of errors in public and private databases: ChemSpider, PubChem, DrugBank, KEGG, ChEBI/ChEMBL
Automated quality control system
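To make "automated quality control" concrete, here is a minimal rule-based sketch covering two problems named earlier in the deck, charge imbalance and stereochemistry. It is not CVSP's rule set or API: the record layout, the rule functions and the severity labels are hypothetical, and a real system would work on parsed structure files rather than hand-built dictionaries.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    severity: str
    message: str


def check_charge_balance(record):
    """Flag records whose component formal charges do not sum to zero."""
    net = sum(charge for _, charge in record["components"])
    if net != 0:
        return [Issue("error", f"net charge is {net:+d}, expected 0")]
    return []


def check_stereo(record):
    """Flag stereocentres that carry no assigned configuration."""
    undefined = sorted(
        atom for atom, label in record["stereocentres"].items() if label is None
    )
    if undefined:
        return [Issue("warning", f"undefined stereocentres at atoms {undefined}")]
    return []


RULES = [check_charge_balance, check_stereo]


def validate(record, rules=RULES):
    """Run every rule and collect the issues in rule order."""
    issues = []
    for rule in rules:
        issues.extend(rule(record))
    return issues


# Hypothetical records: a balanced salt and a record with two problems.
clean = {"components": [("Na", 1), ("acetate", -1)], "stereocentres": {}}
messy = {"components": [("Na", 1), ("acetate", 0)], "stereocentres": {3: None, 7: "R"}}

clean_issues = validate(clean)
messy_issues = validate(messy)
```

Keeping each check as an independent function returning a list of issues makes it simple to add or reorder rules, which is one plausible design for a validation pipeline of this kind.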

Chemistry Validation and Standardization Platform

Reactions domain
Information typically associated with reactions

Analytical data domain

Crystallography domain

Chemistry Data Platform - Summary
Simplified models within each domain
Each domain is described by its own model with embedded semantics
No proper domain-specific identifiers
Extensive quality control – CVSP (DOI 10.1186/s13321-015-0072-8)

There is no way back

Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16