Published by Leo Tyler. Modified over 6 years ago.
1
Building linked-data, large-scale chemistry platform: challenges, lessons and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett
Royal Society of Chemistry
ACS Spring 2016, San Diego, CA, March 13th 2016
2
ChemSpider – OpenPHACTS – Chemistry Data Platform – …
3
45 million chemicals and growing
Data sourced from >500 different sources
Crowdsourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
A structure-centric hub for web-searching
4
ChemSpider
5
Chemical vendors and datasources
6
ChemSpider
7
Properties - experimental
8
Literature and patents references
9
Classification
10
Spectra
11
Multimedia
12
Tagging
13
ChemSpider - Summary
Simple, flattish data model
InChI as a primary identifier
Linked by synonyms
Linked by “ExtId”
Standard searches (identity, substructure, similarity)
Very little semantics
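The identity and similarity searches listed in the summary can be sketched as follows. This is a minimal illustration, not ChemSpider's implementation: it assumes records are held in a dictionary keyed by an InChI string and that each record carries a fingerprint represented as a set of integer bit positions. The record keys, fingerprints and the function names `identity_search` and `similarity_search` are hypothetical. Substructure search is omitted because it needs a graph-matching toolkit.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Assumption: a fingerprint is the set of "on" bit positions.
# Real fingerprints would come from a cheminformatics toolkit.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) coefficient: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def identity_search(index: Dict[str, Fingerprint], inchi: str) -> Optional[Fingerprint]:
    """Identity search: exact lookup on the primary identifier."""
    return index.get(inchi)


def similarity_search(
    index: Dict[str, Fingerprint], query: Fingerprint, threshold: float
) -> List[Tuple[str, float]]:
    """Return (identifier, score) pairs at or above the threshold, best first."""
    scored = [(key, tanimoto(fp, query)) for key, fp in index.items()]
    hits = [(key, score) for key, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical index: the keys stand in for InChI strings.
INDEX: Dict[str, Fingerprint] = {
    "InChI-A": frozenset({1, 2, 3, 4}),
    "InChI-B": frozenset({1, 2, 3}),
    "InChI-C": frozenset({7, 8}),
}
```

Calling `similarity_search(INDEX, frozenset({1, 2, 3}), 0.7)` ranks the exact match first and drops the unrelated record. With a flat model like this, every search reduces to a lookup or a scan over one table, which matches the "very little semantics" remark.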
14
OpenPHACTS: 2011-2014
Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point
15
OpenPHACTS
Open PHACTS Practical Semantics
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
Scibite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer
@Open_PHACTS
16
Why is it so hard to…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known pathways?
Working on now?
Connections to disease?
Expressed in right cell type?
IP?
Competitors?
Remember this: some of these questions are easier to answer than others.
17
Open PHACTS was developed to support the key questions of drug discovery
Business questions have been at the heart of Open PHACTS and have driven the development of the platform
“Let me compare MW, logP and PSA for known oxidoreductase inhibitors”
“What is the selectivity profile of known p38 inhibitors?”
“Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM”
Data sources: ChEMBL, DrugBank, Gene Ontology, Wikipathways, GeneGo, ChEBI, UniProt, UMLS, neXtProt, GVKBio, ConceptWiki, ChemSpider, DisGeNet, TrialTrove, TR Integrity, ChEMBL Target Class, ENZYME, FDA adverse events, SureChEMBL
Notes: Mx/psa, how calculated, who did it? Mash up. With your data too: top layer join together but need them all commercial
18
Big Data Integration
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 5 billion triples – 14 datasets & growing
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
@gray_alasdair
19
OpenPHACTS Discovery Platform
Architecture diagram labels: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Core Platform; Semantic Workflow Engine; Identity Resolution Service; Identifier Management; Indexing; Chemistry Registration Normalisation & Q/C; Data Cache (Virtuoso Triple Store); RDF; Nanopub; Db; VoID; Public Content; Commercial; Public Ontologies; User Annotations; P12374; EC2.43.4; CS4532; “Adenosine receptor 2a”
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
21 October 2014 Scientific Lenses – A. J. G. Gray
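The last three steps of the integration approach (an API call becomes a SPARQL query, and the query is expanded to cover the URIs used by the original datasets) can be sketched like this. It is an illustration under assumptions, not the platform's code: the equivalence table stands in for the IMS, all URIs are invented `example.org` placeholders, and `expand_uri` and `build_query` are hypothetical names.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the IMS: for each URI, the URIs that other
# datasets use for the same concept. All URIs here are invented.
EQUIVALENTS: Dict[str, Set[str]] = {
    "http://example.org/hub/compound/1": {
        "http://example.org/datasetA/molecule/42",
        "http://example.org/datasetB/drug/7",
    },
}


def expand_uri(uri: str, equivalents: Dict[str, Set[str]]) -> List[str]:
    """Return the input URI plus every URI mapped to it, in a stable order."""
    return sorted({uri} | equivalents.get(uri, set()))


def build_query(uri: str, equivalents: Dict[str, Set[str]]) -> str:
    """Translate an API-style lookup into a SPARQL query over all equivalent URIs.

    Each dataset keeps its own model, so the query binds ?compound to every
    known URI instead of rewriting the data to a single scheme.
    """
    values = " ".join(f"<{u}>" for u in expand_uri(uri, equivalents))
    return (
        "SELECT ?compound ?p ?o WHERE {\n"
        f"  VALUES ?compound {{ {values} }}\n"
        "  ?compound ?p ?o .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_query("http://example.org/hub/compound/1", EQUIVALENTS))
```

Keeping the expansion in the query layer means the cached data never has to be rewritten when a new mapping is added; only the equivalence table changes.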
20
Gleevec®: Imatinib Mesylate
Figure labels: YLMAHDNUQAMNNX-UHFFFAOYSA-N, Mesylate, ChemSpider, DrugBank, PubChem
Example drug: Gleevec
Cancer drug for leukemia
Lookup in three popular public chemical databases
Different results
21
Structure Lens
“I need to compute an analysis, give me details of the active compound in Gleevec.”
Strict / Relaxed
Analysing / Browsing
Interested in physicochemical properties of Gleevec
skos:exactMatch (InChI)
22
Lens Effects: Ibuprofen
Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms.
Both chiral forms are equally active.
Typically, the user will wish to retrieve info for any stereoisomer.
CHEMBL427526, CHEMBL521, CHEMBL175
23
Default Lens
24
Stereoisomer Lens
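The difference between the default lens and the stereoisomer lens can be sketched as a choice of which mapping predicates are followed when collecting equivalent records. This is an illustration under assumptions: the record identifiers are invented placeholders, the assignment of predicates to each lens is a guess made for the example, and `equivalents` is a hypothetical function. Only the predicate names (`skos:exactMatch`, `ci:CHEMINF_000461`, `ci:CHEMINF_000456`) appear in the slides.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Mapping = Tuple[str, str, str]  # (source record, predicate, target record)

# Invented records for a racemate and its two enantiomers in two datasets.
MAPPINGS: List[Mapping] = [
    ("db1:ibuprofen", "skos:exactMatch", "db2:ibuprofen"),
    ("db1:R-ibuprofen", "ci:CHEMINF_000456", "db1:ibuprofen"),  # has_stereoundefined_parent
    ("db1:S-ibuprofen", "ci:CHEMINF_000456", "db1:ibuprofen"),
    ("db1:R-ibuprofen", "ci:CHEMINF_000461", "db1:S-ibuprofen"),  # is_stereoisomer_of
]

# Assumption: a lens is simply the set of predicates treated as "same thing".
LENSES: Dict[str, Set[str]] = {
    "default": {"skos:exactMatch"},
    "stereoisomer": {"skos:exactMatch", "ci:CHEMINF_000456", "ci:CHEMINF_000461"},
}


def equivalents(uri: str, lens: str, mappings: List[Mapping] = MAPPINGS) -> Set[str]:
    """Collect every record reachable from uri through predicates the lens allows."""
    allowed = LENSES[lens]
    neighbours: Dict[str, Set[str]] = {}
    for source, predicate, target in mappings:
        if predicate in allowed:
            # Mappings are treated as symmetric for the purpose of retrieval.
            neighbours.setdefault(source, set()).add(target)
            neighbours.setdefault(target, set()).add(source)
    seen = {uri}
    queue = deque([uri])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Under the default lens, `equivalents("db1:ibuprofen", "default")` returns only the two exact matches. Under the stereoisomer lens the same call also returns both enantiomer records, which is the behaviour a user wants when any stereoisomer is acceptable.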
25
Mapping Generation
Validate structure: source data is messy!
Identify common problems:
Charge imbalance
Stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
is_stereoisomer_of [ci:CHEMINF_000461]
has_stereoundefined_parent [ci:CHEMINF_000456]
Other relationships: has part, is tautomer of, uncharged counterpart, isotope, …
Example identifiers: ops:OPS437281, ops:OPS380297
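The two "common problems" named on this slide, charge imbalance and stereochemistry, can be sketched as simple rule checks. This is a minimal illustration and not the validation rules used by the platform: it assumes a record has already been reduced to a list of formal charges per component and to counts of total and defined stereocentres. The `Record` class and `validate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical, pre-parsed view of a deposited structure."""
    component_charges: List[int]   # formal charge of each component (e.g. ion, counter-ion)
    stereocentres_total: int       # stereocentres found in the structure
    stereocentres_defined: int     # stereocentres with an explicit configuration


def validate(record: Record) -> List[str]:
    """Return a list of human-readable issues; an empty list means no rule fired."""
    issues: List[str] = []

    # Rule 1: the components of a salt or mixture should sum to zero charge.
    net = sum(record.component_charges)
    if net != 0:
        issues.append(f"charge imbalance: net charge {net:+d}")

    # Rule 2: flag structures whose stereochemistry is only partly specified.
    undefined = record.stereocentres_total - record.stereocentres_defined
    if undefined > 0:
        issues.append(
            f"stereochemistry: {undefined} of {record.stereocentres_total} stereocentres undefined"
        )
    return issues
```

For example, `validate(Record([1, -1], 1, 1))` returns an empty list, while a lone cation with one undefined stereocentre produces two issues. Returning issues as data instead of raising errors lets a pipeline record them against the source entry and continue.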
26
OpenPHACTS UI http://explorer.openphacts.org/
27
Explorer Screenshot
28
Explorer Screenshot
Pharmacology count: 2370, 3044
29
OpenPHACTS - Summary
Principal difference – inter-domain links
More complex, but still structure-centric data model
Ontological relationships introduced
Chemical Lenses – new type of search
30
Chemistry Data Platform – 2014 - …
31
Dimensions and complexity of science
What about science and chemistry in particular?
33
RSC Archive – since 1841
34
Digitally Enabling RSC Archive
35
ChemSpider Synthetic Pages
Compounds Reaction Analytical Data Text and References
36
RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…
37
Compounds domain
39
Data quality issue and CVSP
Robochemistry
Proliferation of errors in public and private databases: ChemSpider, PubChem, DrugBank, KEGG, ChEBI/ChEMBL
Automated quality control system
40
Chemistry Validation and Standardization Platform
41
Chemistry Validation and Standardization Platform
42
Reactions domain
Information typically associated with reactions
44
Analytical data domain
45
Crystallography domain
46
Chemistry Data Platform - Summary
Simplified models within each domain
Each domain is described with its own models with embedded semantics
No proper domain-specific identifiers
Extensive quality control – CVSP (DOI /s )
47
There is no way back
48
Thank you Slides: 48