A Chemistry Data Repository to Serve Them All Antony Williams.


Chemistry for the Community
The Royal Society of Chemistry as a provider of chemistry for the community:
- As a charity
- As a scientific publisher
- As a host of commercial databases
- As a partner in grant-based projects
- As the host of ChemSpider
- And now in development: the RSC Data Repository for Chemistry

- ~30 million chemicals and growing
- Data sourced from >500 different sources
- Crowdsourced curation and annotation
- Ongoing deposition of data from our journals and our collaborators
- Structure-centric hub for web searching
- …and a really big dictionary!!!

ChemSpider

Experimental/Predicted Properties

Literature references

Patent references

Books

Vendors and data sources

Aspirin on ChemSpider

Many Names, One Structure

What is the Structure of Vitamin K?

MeSH: A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione).

What is the Structure of Vitamin K?

The ultimate “dictionary”
Search all forms of structure IDs:
- Systematic name(s)
- Trivial name(s)
- SMILES
- InChI strings
- InChIKeys
- Database IDs
- Registry numbers
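As a rough illustration of what searching "all forms of structure IDs" involves, the sketch below guesses the type of a query string before it would be looked up. It is not ChemSpider's implementation: the function names and the returned labels are assumptions made for this example. It relies only on the published InChIKey layout (14, 10 and 1 uppercase letters separated by hyphens) and the CAS Registry Number check-digit rule. Names and SMILES are lumped together because telling them apart reliably needs a chemistry toolkit.

```python
import re

# InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas: str) -> bool:
    """Check the CAS check digit: weight the other digits 1, 2, 3... from the right."""
    m = CAS_RE.match(cas)
    if not m:
        return False
    body = m.group(1) + m.group(2)
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(m.group(3))


def classify_identifier(query: str) -> str:
    """Return a hypothetical label for the kind of identifier in `query`."""
    q = query.strip()
    if q.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(q):
        return "inchikey"
    if CAS_RE.match(q):
        return "cas" if cas_checksum_ok(q) else "invalid-cas"
    # Systematic names, trivial names and SMILES are not separated here.
    return "name-or-smiles"


if __name__ == "__main__":
    for q in ["50-78-2", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "aspirin"]:
        print(q, "->", classify_identifier(q))
```

Routing a query by type first lets each identifier family use its own exact-match index, with free-text name search as the fallback.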

Linking Names to Structures

Semantic Mark-up of Articles

Crowdsourced “Annotations”
Users can add:
- Descriptions, syntheses and commentaries
- Links to PubMed articles
- Links to articles via DOIs
- Spectral data
- Crystallographic Information Files
- Photos
- MP3 files
- Videos

Crowdsourced Enhancement
- The community can clean and enhance the database by providing feedback and direct curation
- Tens of thousands of edits made

ChemSpider
- ChemSpider allowed the community to participate in linking the internet of chemistry and in crowdsourcing of data
- Successful experiment in terms of building a central hub for integrated web search
- More people are “users” than “contributors”
- Yet basic feedback and game-play help

ChemSpider Spectra

An Adventure into the World of Small but significant contribution..

ChemSpider SyntheticPages

Micropublishing with Peer Review (a chemical synthesis blog?)

Multi-Step Synthesis

Interactive Data

Publications: summary of work
- Scientific publications are a summary of work
- Is all work reported?
- How much science is lost to pruning?
- What of value sits in notebooks and is lost?
- Publications offering access to “real data”?
- How much data is lost?
- How many compounds never reported?
- How many syntheses fail or succeed?
- How many characterization measurements?

Deposition of Research Data
- If we manage compounds, syntheses and analytical data…
- If we have security and provenance of data…
- If we deliver user interfaces to satisfy the various use cases…
- Then we have delivered electronic lab notebooks for chemistry laboratories.
- ELNs are research data repositories

What are we building?
We are building the “RSC Data Repository”:
- Containers for compounds, reactions, analytical data, tabular data
- Algorithms for data validation and standardization
- Flexible indexing and search technologies
- A platform for modeling data and hosting existing models and predictive algorithms

Deposition of Data

Compounds

Reactions

Analytical data

Crystallography data

Deposition of Data
Developing systems that provide feedback to users regarding data quality:
- Validate/standardize chemical compounds
- Check for balanced reactions
- Check spectral data (EXAMPLE)
Future work:
- Properties: compare experimental to predicted
- Automated structure verification: NMR
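One of the checks listed here, testing whether a deposited reaction is balanced, can be sketched with nothing more than atom counting. The code below is a minimal illustration under stated assumptions, not the repository's validator: the function names are hypothetical, formulas must be simple (no brackets, charges or isotopes), and every species is assumed to have a stoichiometric coefficient of 1.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "Cl2" or "H".
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C2H6O'."""
    counts = Counter()
    for element, number in ELEMENT_RE.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(formulas):
    """Sum atom counts over all species on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total += atom_counts(formula)
    return total


def imbalance(reactants, products):
    """Return {element: products minus reactants} for every unbalanced element."""
    left, right = side_counts(reactants), side_counts(products)
    return {
        el: right[el] - left[el]
        for el in set(left) | set(right)
        if left[el] != right[el]
    }


if __name__ == "__main__":
    # Illustrative chlorination of an alcohol with thionyl chloride.
    print(imbalance(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2", "HCl"]))  # balanced
    print(imbalance(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2"]))  # HCl missing
```

Returning the per-element difference rather than a plain yes/no gives the depositor the kind of feedback the slide describes: the report shows which atoms are unaccounted for.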

Can we get historical data?
- Text and data can be mined
- Spectra can be extracted and converted
- SO MUCH Open Source Code available

Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser. The reaction mixture was heated at reflux with stirring, for a period of about one-half hour. After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.
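A very small taste of what mining such a paragraph involves: the sketch below pulls out "reagent (amount unit)" pairs with a regular expression. It is an assumption-laden toy, not the text-mining pipeline referred to in the talk. The pattern, the unit list and the function name are invented for this example, and real chemical names (with digits, brackets and primes) need a proper chemical entity recognizer.

```python
import re

# A run of letters, spaces and hyphens, followed by "(number unit)".
QUANTITY_RE = re.compile(
    r"([A-Za-z][A-Za-z \-]*?)\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|mg|g|mol)\s*\)"
)


def extract_quantities(text):
    """Return (reagent, amount, unit) tuples found in `text`."""
    results = []
    for name, value, unit in QUANTITY_RE.findall(text):
        # Drop a leading conjunction picked up from phrases like "and benzene".
        name = re.sub(r"^(?:and|or)\s+", "", name.strip())
        results.append((name, float(value), unit))
    return results


if __name__ == "__main__":
    excerpt = (
        "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged "
        "into a glass reaction vessel"
    )
    for reagent, amount, unit in extract_quantities(excerpt):
        print(reagent, amount, unit)
```

The pattern tolerates the padded form "( 5 ml )" as well as "(5 ml)", so it works on text straight out of an extraction step.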

Text spectra? 13C NMR (CDCl3, 100 MHz): δ = (CH3), (CH, benzylic methane), (CH, benzylic methane), (CH2), (CH2), , , , , , , , , , (ArCH), 99.42, , , , , , , , (ArC)

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
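A text spectrum like the 1H NMR listing above can be turned back into structured data with a small parser. The sketch below is one possible approach, assuming the common "shift (multiplicity, nH, J = x Hz, assignment)" layout. The `Peak` class, its field names and the regular expressions are illustrative choices for this example, not an established exchange format.

```python
import re
from dataclasses import dataclass
from typing import Optional

SPECTRUM = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

# Shift (optionally a range), then "(multiplicity, nH".
PEAK_RE = re.compile(r"(\d+\.\d+)(?:[–-](\d+\.\d+))?\s*\(([a-z]+),\s*(\d+)H")
# Coupling constant, allowing labels such as "Jb".
J_RE = re.compile(r"J\w*\s*=\s*(\d+(?:\.\d+)?)\s*Hz")


@dataclass
class Peak:
    shift_ppm: float
    shift_end_ppm: Optional[float]  # upper end when a range is reported
    multiplicity: str
    n_h: int
    j_hz: Optional[float]  # first coupling constant, if any


def parse_1h_nmr(text):
    """Parse a text 1H NMR listing into a list of Peak records."""
    matches = list(PEAK_RE.finditer(text))
    peaks = []
    for i, m in enumerate(matches):
        # Look for a J value only within this peak's own description.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        j = J_RE.search(text, m.end(), end)
        peaks.append(
            Peak(
                shift_ppm=float(m.group(1)),
                shift_end_ppm=float(m.group(2)) if m.group(2) else None,
                multiplicity=m.group(3),
                n_h=int(m.group(4)),
                j_hz=float(j.group(1)) if j else None,
            )
        )
    return peaks


if __name__ == "__main__":
    for peak in parse_1h_nmr(SPECTRUM):
        print(peak)
```

Once peaks are structured, simple consistency checks become possible, for example comparing the summed proton count with the hydrogen count of the proposed structure.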

Turn “Figures” Into Data

Make it interactive

SO MANY reactions!

Extracting our Archive
What could we get from our archive?
- Find chemical names and generate structures
- Find chemical images and generate structures
- Find reactions
- Find data (MP, BP, LogP) and deposit
- Find figures and database them
- Find spectra (and link to structures)

Models published from data

Text-mining Data to compare

Support grant-based services
- Multiple European consortium-based grants
- PharmaSea (FP7 funded)
- Open PHACTS (IMI funded)
- UK National Chemical Database Service (http://cds.rsc.org): developing data repository for lab data, integrating Electronic Lab Notebooks
- Open Drug Discovery projects

- 3-year Innovative Medicines Initiative project
- Integrating chemistry and biology data using semantic web technologies
- Open code, open data, open standards
- Academics, Pharmas, Publishers…
- To put medicines in the pipeline…

The Open PHACTS community ecosystem

Open Source Drug Discovery India

UK Chemical Database Service
The National Chemical Database Service is for UK academics

Vision for the Service PART 1
Provide access to databases and services of interest to the academic community to serve their needs. Access to services to include:
- Crystallography data: organic and inorganic materials
- Thermophysical data
- Reactions data, including retrosynthetic analysis
- Prediction technologies: name generation, physicochemical parameters, NMR prediction

Vision for the Service PART 2
- Response to the call for proposals included our vision for a 21st-century data repository
- At a time of Open Access, Open Data and funding agency requirements to make data public: build a data repository
- Funding is split between licensing content and services (VAST MAJORITY) and some funding for research and development

An Initial “Vague” Vision Set
- Manage “all” of the chemistry data associated with chemical substances
- Data to be downloadable, reusable, interactive
- Build a platform that enables the scientist
- Data storage, validation, standardization and curation
- Collaborative data sharing
- Provide a data platform that can enable and enhance publishing of scientific papers

10 years ago…
- There was no iPad, no iPhone, no Android
- Facebook had only just been released

ChemSpider is 7 years old…
- Tens of thousands of users per day
- Over 30 million chemical compounds
- Recognized as one of the world's primary sources of chemistry data
- But 7 years is a long time in software…
- The data repository is the new foundation; ChemSpider is an interface and brand

Powered by RSC Data
- What will it be like when we are hosting chemistry data that doesn’t get published? Or hosting all data UNTIL it gets published?
- What will it be like when computer models are being rebuilt every time there is a new dataset, validating and flagging the data?
- What will it be like when publications are not only peer-reviewed but also computer-reviewed?

The Future
Diagram labels: Internet Data; Commercial Software; Pre-competitive Data; Open Science; Open Data; Publishers; Educators; Open Databases; Chemical Vendors; Small organic molecules; Undefined materials; Organometallics; Nanomaterials; Polymers; Minerals; Particle bound; Links to Biologicals

A Global Chemistry Network
- The Global Chemistry Network is much bigger than just data: scientific networking, micro/publishing, integration hub.
- The data repository as a handler for data; GCN as a submission interface, a profile handler, a rewards and recognition platform, etc.
- Data repository architecture designed to deliver the underpinning data containers, visualization widgets, etc.

Thank you