Published by Catherine Mathews. Modified over 6 years ago.
1
Dealing with the complex challenge of managing diverse chemistry data online
Antony Williams, Valery Tkachenko, Alexey Pshenichnov and Ken Karapetyan
ACS San Francisco, August 2014
3
CAS Counter http://www.cas.org/content/counter
4
About Me…as a Chemist
I’ve performed a few dozen chemical syntheses
I’ve run thousands of analytical spectra
I’ve generated thousands of NMR assignments
I’ve probably published <5% of all work
Most of it has been lost
But things can be different today…
But it still needs to be associated with me…
5
Think about chemistry a mo’
If we imagine that permission exists… (i.e. forget IP, chemical and pharma companies etc…think students…)
How many syntheses are performed
How many spectra are run
How many properties are measured
How many compounds are made
How many, how much, how big??.....
Let’s go manage it all!!
7
Consider a shift to Openness
8
Times have changed…
Open Access funder mandates…
9
Publishers are responding
10
The world of Open Data is here
11
Open Data are everywhere
Are Openness and Social Sharing changing the world?
The cultural experiments in Open Data and exchange are almost daily
Mobile platforms enhance participation
And then what of Chemistry Data???
12
An Experiment - ChemSpider
ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data
Successful experiment in terms of building a central hub for integrated web search
More people are “users” than “contributors”
Yet basic feedback and game-play helps
13
An Experiment - CSSP
14
An EPSRC Call
“…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”
15
National Chemical Database Service
16
We set a vision…
Manage “all” of the chemistry data associated with chemical substances – PUBLISHED and UNPUBLISHED
Based on user-selected licensing, the data to be downloadable, reusable, interactive
Build a platform that enables the scientist
Data storage, validation, standardization and curation
Collaborative data sharing
Provide a data platform that can enable and enhance publishing of scientific papers
17
Data Repository
Registration of chemical compounds
Deposition of chemical syntheses
Addition of analytical data
Integration to electronic notebooks
Rewards and recognition for data sharing
Document processing
Hosting of data as private, embargoed or public
18
Development of Data Repository
Data repository should not just be a data dump – should not be a “big disk”
Searchable, integrated, segregated repository of data types
Data access including private, shared, embargoed and public
Delivery of derived models from data
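The access levels listed on this slide could be modelled roughly as follows. This is a minimal sketch, not the repository's actual design: the names `Access`, `Deposition` and `can_view`, and the rule that an embargoed record opens to everyone once its embargo date passes, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional, Set


class Access(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    EMBARGOED = "embargoed"
    PUBLIC = "public"


@dataclass
class Deposition:
    """Hypothetical record of one deposited data set."""
    owner: str
    access: Access
    shared_with: Set[str] = field(default_factory=set)
    embargo_until: Optional[date] = None


def can_view(dep: Deposition, user: str, today: date) -> bool:
    """Return True if `user` may see the deposition on `today`."""
    if user == dep.owner or dep.access is Access.PUBLIC:
        return True
    if dep.access is Access.SHARED:
        return user in dep.shared_with
    if dep.access is Access.EMBARGOED:
        # Assumed rule: the record becomes visible once the embargo date is reached.
        return dep.embargo_until is not None and today >= dep.embargo_until
    return False  # PRIVATE: owner only
```

Keeping the check in one function means search, download and model-building code can all ask the same question before touching a record.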
19
New Repository Architecture doi: 10.1007/s10822-014-9784-5
20
New Repository Architecture
21
Input data pipeline
22
Compounds
23
Reactions
24
Analytical data
25
Crystallography data
26
For Deposition of Data
Quality of data at source
ensuring chemicals are correct – VALIDATION
reactions map and balance as appropriate – VALIDATION and STANDARDIZATION
file format handling for analytical data types – binary file formats are proprietary – STANDARDIZATION
valid interpretation of data – VALIDATION and ANNOTATION
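As an illustration of the "reactions balance" check, the sketch below compares atom counts on the two sides of a reaction. It assumes flat molecular formulas such as `C2H6O` (no brackets, charges or isotopes) and is not the validation platform's own code; `atom_counts`, `side_counts` and `is_balanced` are hypothetical names.

```python
import re
from collections import Counter
from typing import List, Tuple

Side = List[Tuple[int, str]]  # (stoichiometric coefficient, formula)


def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H6O'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Counter = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def side_counts(side: Side) -> Counter:
    """Total atom counts for one side of a reaction."""
    total: Counter = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(reactants: Side, products: Side) -> bool:
    """True when both sides contain the same number of each element."""
    return side_counts(reactants) == side_counts(products)
```

For example, `is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")])` checks methane combustion. Atom mapping, which the slide also mentions, would need structure-level data and is not covered here.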
27
Input data pipeline
28
Depositions Gateway User Interface
29
Deposition of Data
30
Validate and Standardize
31
CVSP Filtering
32
CVSP Filtering of DrugBank
33
ChEMBL (1.3 million records)
11,020 records with 4 bonds and zero charge, e.g. CHEMBL or CHEMBL501973
271 records with hypervalent oxygen (e.g. , CHEMBL ), carbon (e.g ), boron, chlorine, iodine or phosphine
6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
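A valence rule of the kind that flags hypervalent atoms might look like the sketch below. The limits table is illustrative and covers only three of the elements named on the slide. It is not the CVSP rule set, and `Atom`, `MAX_VALENCE` and `valence_issues` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Atom(NamedTuple):
    element: str
    charge: int
    bond_order_sum: int  # sum of bond orders, implicit hydrogens included


# Illustrative limits keyed by (element, formal charge); a real rule set
# would cover many more elements and charge states.
MAX_VALENCE: Dict[Tuple[str, int], int] = {
    ("B", 0): 3, ("B", -1): 4,
    ("C", 0): 4,
    ("O", 0): 2, ("O", 1): 3,
}


def valence_issues(record_id: str, atoms: List[Atom]) -> List[str]:
    """Return one message per atom whose bond count exceeds its limit."""
    issues = []
    for index, atom in enumerate(atoms):
        limit = MAX_VALENCE.get((atom.element, atom.charge))
        if limit is not None and atom.bond_order_sum > limit:
            issues.append(
                f"{record_id}: atom {index} ({atom.element}, "
                f"charge {atom.charge:+d}) has {atom.bond_order_sum} bonds, "
                f"limit {limit}"
            )
    return issues
```

Atoms with no entry in the table are skipped rather than flagged, so an incomplete table produces missed issues, not false alarms.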
35
Depositions User Interface
36
The challenges of analytical data
Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AnIML)
ChemSpider already hosts thousands of JCAMP spectra
Support of “assigned spectra” in place
Data validation approaches understood
There are a myriad of analytical data types…
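To show why a text-based standard format is easier to handle than a proprietary binary one, here is a minimal sketch that reads the header of a JCAMP-DX file. JCAMP-DX stores metadata as `##LABEL=value` records and uses `$$` for comments. The function name `read_jcamp_header` is hypothetical, and the sketch skips the label normalization and data-table decoding a full reader would need.

```python
from typing import Dict

# Labels that start the data section or end the block.
_STOP_LABELS = ("XYDATA", "XYPOINTS", "PEAK TABLE", "END")


def read_jcamp_header(text: str) -> Dict[str, str]:
    """Collect '##LABEL=value' records that appear before the data table."""
    header: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].strip()  # drop '$$' comments
        if not line.startswith("##"):
            continue
        label, sep, value = line[2:].partition("=")
        if not sep:
            continue
        label = label.strip().upper()
        if label in _STOP_LABELS:
            break
        header[label] = value.strip()
    return header
```

A deposition pipeline could use a header read like this to check that required fields such as `TITLE` and `DATA TYPE` are present before accepting a spectrum.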
37
ChemSpider ID H1 NMR
38
ChemSpider ID C13 NMR
39
ChemSpider ID HHCOSY
40
ChemSpider ID HSQC
41
ChemSpider ID HMBC
42
Managing Assignments?
43
Depositions User Interface
44
Depositions from ELNs
Development work integrating chemistry into the Southampton LabTrove notebook
Stoichiometry table development
Analytical data integration
“ChemTrove” rolled out to a small test group in January
48
Document deposition/processing
49
Experimental data checker
50
User Interface Approach
53
User Interface Approach
55
Display Widgets
56
Work in Progress
57
Work in Progress
58
User Interface Approach
60
A Compounds Repository Interface
61
A Reactions/Document Interface
63
The PharmaSea Website
64
The Open PHACTS community ecosystem
65
Open Source Drug Discovery India
66
What can drive participation?
What can drive scientists to participate and contribute?
Ensuring provenance of their data for reuse
Mandates from funding agencies
Improved systems to ease contribution
Additional contributions to science
Improved publishing processes
Recognition for contributions
67
Scientists are Increasingly Quantified…
68
AltMetrics as Scientist Impact
69
AltMetrics
71
Detailed Usage Statistics
72
Rewards and Recognition
The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.
74
AltMetrics Feeds
For our data repository ensure contribution of data will feed out to the AltMetrics platforms
Every data point, every data download, use and reuse will be associated with the scientist
Data will be DOI’ed (presently under review)
Services provided will allow for AltMetrics use
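Associating every download, use and reuse with the contributing scientist amounts to tallying usage events per contributor. The sketch below is one possible shape for that tally. The `UsageEvent` fields, the event kinds and the use of an ORCID-style identifier as the key are assumptions, not a description of the actual service.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple


class UsageEvent(NamedTuple):
    contributor: str  # assumed to be an ORCID-style identifier
    dataset: str      # hypothetical identifier of the deposited data set
    kind: str         # e.g. "view", "download" or "reuse"


def tally(events: Iterable[UsageEvent]) -> Dict[str, Counter]:
    """Count events of each kind per contributor."""
    totals: Dict[str, Counter] = defaultdict(Counter)
    for event in events:
        totals[event.contributor][event.kind] += 1
    return dict(totals)
```

Per-contributor counts in this form could then be exported to whichever metrics platform is in use.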
75
What do we have in place?
We are testing an early form of the data repository on our data – ChemSpider and our archive of publications
Working with collaborators to define needs
Testing and enhancing deposition systems
Chemical validation & standardization platform
Analytical data handling formats
And lots in development…
76
The Challenges Ahead
Chemistry is NOT just nicely defined structures!
Materials, minerals, attached to beads, polymers, ambiguous materials
Domain-specific measurements
File format standards are limited in application
Encouraging scientists to free up their data
AltMetrics, open data mandates, systems
The data explosion continues
77
But it’s not easy of course
Not everything we would like around data handling is there for sure
Many systems, tools, platforms are already available but we don’t know about them, or even if we did, contributing is “more work”
“What’s in it for me?”, “It’s my data”, “It’s too much work”, “What credit do I get?”
78
And yes…we know…
79
Thank you
ORCID:
Personal Blog:
SLIDES: