1 Building linked-data, large-scale chemistry platform: challenges, lessons and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett
Royal Society of Chemistry
ACS Spring 2016, San Diego, CA
March 13th 2016

2 ChemSpider – OpenPHACTS – Chemistry Data Platform – …

3 45 million chemicals and growing
Data sourced from >500 different sources
Crowdsourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
A structure-centric hub for web-searching

4 ChemSpider

5 Chemical vendors and datasources

6 ChemSpider

7 Properties - experimental

8 Literature and patent references

9 Classification

10 Spectra

11 Multimedia

12 Tagging

13 ChemSpider - Summary
Simple, flattish data model
InChI as a primary identifier
Linked by synonyms
Linked by “ExtId”
Standard searches (identity, substructure, similarity)
Very little semantics

14 OpenPHACTS: 2011-2014
Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point

15 OpenPHACTS
Open PHACTS Practical Semantics
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
Scibite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer
@Open_PHACTS

16 Why is it so hard to….
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
Remember this, some of these questions are easier to answer than others

17 Open PHACTS was developed to support the key questions of drug discovery
Business questions have been at the heart of Open PHACTS and have driven the development of the platform
“Let me compare MW, logP and PSA for known oxidoreductase inhibitors”
“What is the selectivity profile of known p38 inhibitors?”
“Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM”
Data sources: ChEMBL, DrugBank, Gene Ontology, Wikipathways, GeneGo, ChEBI, UniProt, UMLS, neXtProt, GVKBio, ConceptWiki, ChemSpider, DisGeNet, TrialTrove, TR Integrity, ChEMBL Target Class, ENZYME, FDA adverse events, SureChEMBL
Mx/psa, how calculated who did it? Mash up. With your data too, - top layer join together but need them all commercial

18 Open PHACTS was developed to support the key questions of drug discovery
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 5 billion triples – 14 datasets & growing
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
@gray_alasdair Big Data Integration

19 OpenPHACTS Discovery Platform
Diagram labels: RDF; Nanopub; Db; VoID; Data Cache (Virtuoso Triple Store); Semantic Workflow Engine; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Identity Resolution Service; Chemistry Registration Normalisation & Q/C; Identifier Management; Indexing; Core Platform; Public Content; Commercial; Public Ontologies; User Annotations; Apps
Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
21 October 2014 Scientific Lenses – A. J. G. Gray
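
The last two steps of the integration approach (an API call translated to a SPARQL query, then expanded to cover the URIs of the original datasets) can be illustrated with a minimal Python sketch. This is not the Open PHACTS implementation: the mapping table, the example.org URIs and the function names `expand_uri` and `build_query` are all assumptions made for illustration, and the only standard feature relied on is the SPARQL 1.1 `VALUES` clause.

```python
from typing import Dict, List, Set

# Hypothetical mapping table standing in for an identity mapping service.
# Every URI below is invented for illustration.
MAPPINGS: Dict[str, Set[str]] = {
    "http://example.org/hub/compound/42": {
        "http://example.org/datasetA/compound/1001",
        "http://example.org/datasetB/molecule/xyz",
    },
}


def expand_uri(uri: str, mappings: Dict[str, Set[str]]) -> List[str]:
    """Return the URI plus every URI mapped to it, in a stable order."""
    return sorted({uri} | mappings.get(uri, set()))


def build_query(uri: str, mappings: Dict[str, Set[str]]) -> str:
    """Build a SPARQL query that matches any of the expanded URIs."""
    values = " ".join(f"<{u}>" for u in expand_uri(uri, mappings))
    return (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  VALUES ?s {{ {values} }}\n"
        "  ?s ?p ?o .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_query("http://example.org/hub/compound/42", MAPPINGS))
```

Because the query is expanded rather than the data rewritten, each dataset can stay in its original model, which matches the "data kept in original model" point on the slide.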

20 Gleevec®: Imatinib Mesylate
Example drug: Gleevec
Cancer drug for leukemia
Lookup in three popular public chemical databases: ChemSpider, DrugBank, PubChem
Different results
Diagram labels: YLMAHDNUQAMNNX-UHFFFAOYSA-N; Mesylate
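
The identifier on this slide is a standard InChIKey. As a small illustration of why strict, key-based matching can give different results across databases, the following Python sketch splits a key into its blocks. The layout assumed here is the standard one: a 14-letter first block, then 8 letters plus a standard/non-standard flag and a version letter, then a single protonation letter. The names `InChIKeyParts` and `parse_inchikey` are hypothetical.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + flag (S or N) + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    first_block: str    # hash of the main layers (formula, connectivity, hydrogens)
    second_block: str   # hash of the remaining layers (for example stereochemistry)
    is_standard: bool   # True when the flag character is "S"
    version: str        # InChI version letter ("A" for version 1)
    protonation: str    # protonation indicator ("N" for neutral)


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, raising ValueError if malformed."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    first, second, flag, version, protonation = match.groups()
    return InChIKeyParts(first, second, flag == "S", version, protonation)


if __name__ == "__main__":
    print(parse_inchikey("YLMAHDNUQAMNNX-UHFFFAOYSA-N"))
```

A salt and its parent compound generally produce different keys, so a lookup by exact key finds one form but not the other unless the two records are linked explicitly.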

21 Structure Lens
I need to compute an analysis, give me details of the active compound in Gleevec.
Strict / Relaxed
Analysing / Browsing
Interested in physicochemical properties of Gleevec
skos:exactMatch (InChI)

22 Lens Effects: Ibuprofen
Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms. Both chiral forms are equally active. Typically, the user will wish to retrieve info for any stereoisomer.
CHEMBL427526, CHEMBL521, CHEMBL175

23 Default Lens

24 Stereoisomer Lens

25 Mapping Generation
Validate structure: Source data is messy!
Identify common problems: Charge imbalance, Stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
Diagram labels: ops:OPS437281; ops:OPS380297; is_stereoisomer_of [ci:CHEMINF_000461]; has_stereoundefined_parent [ci:CHEMINF_000456]
Other relationships: has part, is tautomer of, uncharged counterpart, isotope
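
One way to picture how a lens uses these typed relationships is to treat a lens as a set of relationship types that are allowed to count as "the same compound", and then collect every record reachable through allowed links. The Python sketch below does this under stated assumptions: the `ex:` identifiers, the two lens definitions and the function `equivalents` are invented for illustration and are not taken from the Open PHACTS code; only the relationship names come from the slides.

```python
from collections import defaultdict, deque
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Link = Tuple[str, str, str]  # (subject, relationship, object)

# Hypothetical records and links; the ex: identifiers are invented.
LINKS = [
    ("ex:racemate", "skos:exactMatch", "ex:racemate_db2"),
    ("ex:S_form", "has_stereoundefined_parent", "ex:racemate"),
    ("ex:R_form", "is_stereoisomer_of", "ex:S_form"),
    ("ex:salt", "has part", "ex:S_form"),
]

# Hypothetical lens definitions: which relationship types each lens follows.
LENSES: Dict[str, FrozenSet[str]] = {
    "default": frozenset({"skos:exactMatch"}),
    "stereoisomer": frozenset(
        {"skos:exactMatch", "is_stereoisomer_of", "has_stereoundefined_parent"}
    ),
}


def equivalents(start: str, links: Iterable[Link], lens: FrozenSet[str]) -> Set[str]:
    """Return every record reachable from start via links the lens allows."""
    neighbours = defaultdict(set)
    for subject, relationship, obj in links:
        if relationship in lens:
            # Allowed links are followed in both directions.
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


if __name__ == "__main__":
    for name, lens in LENSES.items():
        print(name, sorted(equivalents("ex:racemate", LINKS, lens)))
```

In this toy data the default lens returns only the exact matches of the racemate, while the stereoisomer lens also pulls in both chiral forms, which is the behaviour the ibuprofen slides describe.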

26 OpenPHACTS UI http://explorer.openphacts.org/

27 Explorer Screenshot

28 Explorer Screenshot
Pharmacology count: 2370 → 3044

29 OpenPHACTS - Summary
Principal difference – inter-domain links
More complex, but still structure-centric data model
Ontological relationships introduced
Chemical Lenses – new type of search

30 Chemistry Data Platform – 2014 - …

31 Dimensions and complexity of science
What about science and chemistry in particular?

32

33 RSC Archive – since 1841

34 Digitally Enabling RSC Archive

35 ChemSpider Synthetic Pages
Compounds
Reaction
Analytical Data
Text and References

36 RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…

37 Compounds domain

38

39 Data quality issue and CVSP
Robochemistry
Proliferation of errors in public and private databases: ChemSpider, PubChem, DrugBank, KEGG, ChEBI/ChEMBL
Automated quality control system
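
As a rough illustration of what one rule in an automated quality control system could look like, the Python sketch below flags a charge imbalance, one of the common problems named on the Mapping Generation slide. It is a toy example under stated assumptions and not the CVSP rule set: the `Atom` record, the `validate` function and the message wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Atom:
    element: str
    charge: int = 0  # formal charge on this atom


def net_charge(atoms: List[Atom]) -> int:
    """Sum the formal charges over every atom in the record."""
    return sum(atom.charge for atom in atoms)


def validate(atoms: List[Atom]) -> List[str]:
    """Return a list of issue messages; an empty list means no issue was found."""
    issues: List[str] = []
    if not atoms:
        issues.append("empty structure")
        return issues
    total = net_charge(atoms)
    if total != 0:
        issues.append(f"charge imbalance: net charge {total:+d}")
    return issues


if __name__ == "__main__":
    balanced = [Atom("Na", +1), Atom("Cl", -1)]
    unbalanced = [Atom("Na", +1)]
    print(validate(balanced))
    print(validate(unbalanced))
```

A real system would run many such rules per record and attach a severity to each message; the single rule here only shows the shape of the check.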

40 Chemistry Validation and Standardization Platform

41 Chemistry Validation and Standardization Platform

42 Reactions domain
Information typically associated with reactions

43

44 Analytical data domain

45 Crystallography domain

46 Chemistry Data Platform - Summary
Simplified models within each domain
Each domain is described with its own models with embedded semantics
No proper domain-specific identifiers
Extensive quality control – CVSP (DOI /s )

47 There is no way back

48 Thank you Slides: 48

