Building linked-data, large-scale chemistry platform: challenges, lessons and solutions. Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett (Royal Society of Chemistry). ACS Spring 2016, San Diego, CA, March 13th 2016.
ChemSpider: 2007-2011; OpenPHACTS: 2011-2014; Chemistry Data Platform: 2014-…
45 million chemicals and growing; data sourced from >500 different sources; crowdsourced curation and annotation; ongoing deposition of data from our journals and our collaborators; a structure-centric hub for web-searching.
ChemSpider
Chemical vendors and datasources
ChemSpider
Properties - experimental
Literature and patents references
Classification
Spectra
Multimedia
Tagging
ChemSpider - Summary: simple, flattish data model; InChI as the primary identifier; records linked by synonyms and by “ExtId”; standard searches (identity, substructure, similarity); very little semantics.
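The flat model summarised above can be illustrated with a minimal sketch. It is not the ChemSpider schema: the class names, field names, the vendor name and the external ID are hypothetical. It shows only one record per structure, keyed by InChI, reachable by synonym and carrying per-source external IDs.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One flat record per structure (all field names are illustrative)."""
    inchi: str                                   # primary identifier
    synonyms: set = field(default_factory=set)   # names that link to the record
    ext_ids: dict = field(default_factory=dict)  # data source name -> external ID


class FlatCompoundStore:
    """In-memory stand-in for a flat, structure-centric compound table."""

    def __init__(self):
        self._by_inchi = {}
        self._by_synonym = {}

    def add(self, record):
        self._by_inchi[record.inchi] = record
        for name in record.synonyms:
            # Synonyms are matched case-insensitively in this sketch.
            self._by_synonym.setdefault(name.lower(), set()).add(record.inchi)

    def find_by_inchi(self, inchi):
        """Identity search: exact match on the primary identifier."""
        return self._by_inchi.get(inchi)

    def find_by_synonym(self, name):
        """Return every record that lists the given name as a synonym."""
        keys = self._by_synonym.get(name.lower(), set())
        return [self._by_inchi[k] for k in sorted(keys)]


store = FlatCompoundStore()
store.add(CompoundRecord(
    inchi="InChI=1S/H2O/h1H2",                # water
    synonyms={"water", "oxidane"},
    ext_ids={"ExampleVendor": "EV-0001"},     # hypothetical source and ID
))
print([r.inchi for r in store.find_by_synonym("Water")])
```

Substructure and similarity searches need a cheminformatics toolkit and are deliberately left out of the sketch.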
OpenPHACTS: 2011-2014. Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point.
Open PHACTS: Practical Semantics. Partners: GlaxoSmithKline (Coordinator); Universität Wien (Managing entity); Technical University of Denmark; University of Hamburg, Center for Bioinformatics; BioSolveIT GmbH; Consorci Mar Parc de Salut de Barcelona; Leiden University Medical Centre; Royal Society of Chemistry; Vrije Universiteit Amsterdam; Novartis; Merck Serono; H. Lundbeck A/S; Eli Lilly; Netherlands Bioinformatics Centre; Swiss Institute of Bioinformatics; ConnectedDiscovery; EMBL-European Bioinformatics Institute; Janssen; Esteve; Almirall; OpenLink; Scibite; The Open PHACTS Foundation; Spanish National Cancer Research Centre; University of Manchester; Maastricht University; Aqnowledge; University of Santiago de Compostela; Rheinische Friedrich-Wilhelms-Universität Bonn; AstraZeneca; Pfizer. Contact: info@openphactsfoundation.org, @Open_PHACTS
Why is it so hard to…. What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? IP? Competitors? Remember this: some of these questions are easier to answer than others.
Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform. Example questions: “Let me compare MW, logP and PSA for known oxidoreductase inhibitors”; “What is the selectivity profile of known p38 inhibitors?”; “Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM”. Data sources: ChEMBL, DrugBank, Gene Ontology, Wikipathways, GeneGo, ChEBI, UniProt, UMLS, neXtProt, GVKBio, ConceptWiki, ChemSpider, DisGeNet, TrialTrove, TR Integrity, ChEMBL Target Class, ENZYME, FDA adverse events, SureChEMBL. Notes: Mx/psa, how calculated, who did it? Mash up, with your data too; top layer join together but need them all commercial.
Big Data Integration (@gray_alasdair): data provided by many publishers; originally in many formats (relational, SD files and RDF); worked closely with publishers; data licensing was a major issue; over 5 billion triples, 14 datasets and growing; hosted on beefy hardware, data in memory (aim); extensive memcaching; pose complex queries to extract data.
OpenPHACTS Discovery Platform. Diagram labels: Apps; Linked Data API (RDF/XML, TTL, JSON); Semantic Workflow Engine; Domain Specific Services; Identity Resolution Service; Identifier Management; Chemistry Registration, Normalisation & Q/C; Indexing; Core Platform; Data Cache (Virtuoso Triple Store); RDF, Nanopub, Db, VoID; Public Content, Commercial, Public Ontologies, User Annotations. Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”. Data flow: import data into cache; API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data; queries expanded by IMS to cover URIs of original datasets. 21 October 2014 Scientific Lenses – A. J. G. Gray
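The last two steps of the integration approach (API call translated to a SPARQL query, query expanded to cover the URIs of the original datasets) can be sketched as follows. This is an illustration under stated assumptions, not the Open PHACTS implementation: the mapping table, the example.org URIs and both function names are hypothetical, and a plain dictionary stands in for the IMS.

```python
# Hypothetical equivalence table standing in for the IMS.
# All URIs below are made up for illustration.
EQUIVALENT_URIS = {
    "http://example.org/compound/1": [
        "http://example.org/sourceA/compound/abc",
        "http://example.org/sourceB/molecule/42",
    ],
}


def expand_uri(uri):
    """Return the input URI plus every URI the mapping table links to it."""
    return [uri] + EQUIVALENT_URIS.get(uri, [])


def build_query(uri):
    """Translate a 'describe this compound' call into a SPARQL query.

    The query stays expressed in terms of the original datasets: the
    expansion only widens the set of subject URIs through a VALUES clause.
    """
    values = " ".join(f"<{u}>" for u in expand_uri(uri))
    return (
        "SELECT ?compound ?property ?value WHERE {\n"
        f"  VALUES ?compound {{ {values} }}\n"
        "  ?compound ?property ?value .\n"
        "}"
    )


print(build_query("http://example.org/compound/1"))
```

Keeping the expansion outside the query template means each dataset can stay in its original model; only the list of subject URIs changes per call.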
Gleevec®: Imatinib Mesylate (YLMAHDNUQAMNNX-UHFFFAOYSA-N). Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases (ChemSpider, Drugbank, PubChem): different results.
Structure Lens. Strict (analysing): “I need to compute an analysis, give me details of the active compound in Gleevec.” Relaxed (browsing): “Interested in physicochemical properties of Gleevec.” Link type shown: skos:exactMatch (InChI).
Lens Effects: Ibuprofen. Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms. Both chiral forms are equally active. Typically, the user will wish to retrieve info for any stereoisomer. ChEMBL entries: CHEMBL427526, CHEMBL521, CHEMBL175.
Default Lens (ibuprofen example)
Stereoisomer Lens (ibuprofen example)
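One way to read the lens idea is as a filter over typed links: a lens names the relationship types that count as "the same compound" for a given task, and the set of equivalent records is whatever can be reached through those links. The sketch below works under that reading only. The deck does not say which ChEMBL ID denotes which form of ibuprofen, so the links between the three IDs are assumed for illustration, "DB-IBUPROFEN" is a hypothetical cross-database ID, and the lens definitions are not the Open PHACTS ones.

```python
# Each link is (source ID, relationship type, target ID).
# The links are assumed for illustration; see the note above.
LINKS = [
    ("CHEMBL521", "exactMatch", "DB-IBUPROFEN"),          # hypothetical ID
    ("CHEMBL521", "is_stereoisomer_of", "CHEMBL175"),
    ("CHEMBL521", "is_stereoisomer_of", "CHEMBL427526"),
]

# A lens is the set of relationship types it is willing to follow.
LENSES = {
    "default": {"exactMatch"},
    "stereoisomer": {"exactMatch", "is_stereoisomer_of"},
}


def equivalents(start, lens):
    """Collect every ID reachable from `start` through links the lens allows.

    Links are followed in both directions, and reachability is transitive.
    """
    allowed = LENSES[lens]
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for src, rel, dst in LINKS:
            if rel not in allowed:
                continue
            for a, b in ((src, dst), (dst, src)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


print(sorted(equivalents("CHEMBL521", "default")))
print(sorted(equivalents("CHEMBL521", "stereoisomer")))
```

Under the narrower lens only the exact match is returned; under the wider lens all three ChEMBL records are pooled, which would change any count aggregated over them.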
Mapping Generation. Validate structure: source data is messy! Identify common problems: charge imbalance, stereochemistry. Compute physicochemical properties. Identify related properties based on structure. 17 relationship types, including is_stereoisomer_of [ci:CHEMINF_000461] and has_stereoundefined_parent [ci:CHEMINF_000456]; other relationships: has part, is tautomer of, uncharged counterpart, isotope, … Example records: ops:OPS437281, ops:OPS380297.
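A rough sketch of how candidate has_stereoundefined_parent links might be proposed from identifiers alone. This is a heuristic chosen for illustration, not the method behind the slide: it relies on the standard InChIKey layout, where the first 14-letter block encodes the connectivity skeleton and the second block hashes the remaining layers, with "UHFFFAOY" being the hash seen when those layers are empty. Because the second block also covers isotopes and other layers, the output is a list of candidates to verify, not confirmed relationships. The compound IDs and InChIKeys below are made up.

```python
from collections import defaultdict

# Second-block hash of a standard InChIKey with no stereo or isotopic layers.
NO_EXTRA_LAYERS = "UHFFFAOY"


def candidate_parent_links(keys_by_id):
    """Propose (child, relationship, parent) triples from InChIKeys.

    Records sharing a first block are grouped; within a group, records
    whose second block is the empty-layer hash are treated as possible
    stereo-undefined parents of the others.
    """
    groups = defaultdict(list)
    for compound_id, key in keys_by_id.items():
        first_block, second_block, _protonation = key.split("-")
        groups[first_block].append((compound_id, second_block[:8]))

    links = []
    for members in groups.values():
        parents = [cid for cid, h in members if h == NO_EXTRA_LAYERS]
        children = [cid for cid, h in members if h != NO_EXTRA_LAYERS]
        for child in children:
            for parent in parents:
                links.append((child, "has_stereoundefined_parent", parent))
    return sorted(links)


# Made-up identifiers and keys, shaped like standard InChIKeys.
example = {
    "cmpd:1": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "cmpd:2": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "cmpd:3": "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "cmpd:4": "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
}
for link in candidate_parent_links(example):
    print(link)
```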
OpenPHACTS UI http://explorer.openphacts.org/
Explorer Screenshot
Explorer Screenshot. Pharmacology count: 2370, 3044
OpenPHACTS - Summary: principal difference is inter-domain links; more complex, but still structure-centric data model; ontological relationships introduced; Chemical Lenses, a new type of search.
Chemistry Data Platform – 2014 - …
Dimensions and complexity of science. What about science and chemistry in particular?
RSC Archive – since 1841
Digitally Enabling RSC Archive
ChemSpider Synthetic Pages: Compounds, Reaction, Analytical Data, Text and References
RSC Databases: RSC Compounds, RSC Reactions, RSC Spectra, RSC Crystals, RSC Polymers, RSC Materials, RSC Assays, RSC Algorithms, RSC Models, …and on…
Compounds domain
Data quality issue and CVSP. Robochemistry: proliferation of errors in public and private databases (ChemSpider, PubChem, DrugBank, KEGG, ChEBI/ChEMBL). Automated quality control system.
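An automated quality-control pass can be pictured as a list of small rules run over every record, each returning a message when it finds a problem. The sketch below shows that pattern only and is not the CVSP rule format. The record layout (component_charges, undefined_stereocentres) is hypothetical and assumes those values were already computed by a cheminformatics toolkit. The two rules mirror the common problems named on the Mapping Generation slide: charge imbalance and stereochemistry.

```python
def check_charge_balance(record):
    """Flag records whose component charges do not sum to zero."""
    total = sum(record["component_charges"])
    if total != 0:
        return f"charge imbalance: net charge {total:+d}"
    return None


def check_stereo(record):
    """Flag records that still contain undefined stereocentres."""
    count = record["undefined_stereocentres"]
    if count > 0:
        return f"{count} undefined stereocentre(s)"
    return None


# Rules run in order; adding a check means appending a function here.
RULES = [check_charge_balance, check_stereo]


def validate(record):
    """Run every rule and return the list of issues found (empty if clean)."""
    issues = []
    for rule in RULES:
        message = rule(record)
        if message is not None:
            issues.append(message)
    return issues


# Hypothetical records: a balanced salt and a lone cation with open stereo.
clean = {"component_charges": [1, -1], "undefined_stereocentres": 0}
messy = {"component_charges": [1], "undefined_stereocentres": 2}
print(validate(clean))
print(validate(messy))
```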
Chemistry Validation and Standardization Platform
Reactions domain: information typically associated with reactions
Analytical data domain
Crystallography domain
Chemistry Data Platform - Summary: simplified models within each domain; each domain is described by its own model with embedded semantics; no proper domain-specific identifiers; extensive quality control with CVSP (DOI 10.1186/s13321-015-0072-8).
There is no way back
Thank you. Email: tkachenkov@rsc.org Slides: http://www.slideshare.net/valerytkachenko16