Chemical Database Projects Delivered by RSC eScience at the FDA Meeting “Development of a Freely Distributable Data System for the Registration of Substances”

Presentation transcript:

Chemical Database Projects Delivered by RSC eScience at the FDA Meeting “Development of a Freely Distributable Data System for the Registration of Substances” Antony Williams

RSC eScience  What was once just ChemSpider is much more…  ChemSpider Reactions  Chemicals Validation and Standardization Platform  Learn Chemistry Wiki  National Chemical Database Service  Open PHACTS  PharmaSea  Global Chemistry Hub

We are known for ChemSpider…  The Free Chemical Database  A central hub for chemists to source information  >28 million unique chemical records  Aggregated from >400 data sources  Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions  A central hub for chemists to deposit & curate data

We Want to Answer Questions  Questions a chemist might ask…  What is the melting point of n-heptanol?  What is the chemical structure of Xanax?  Chemically, what is phenolphthalein?  What are the stereocenters of cholesterol?  Where can I find publications about xylene?  What are the different trade names for Ketoconazole?  What is the NMR spectrum of Aspirin?  What are the safety handling issues for Thymol Blue?

I want to know about “Vincristine”

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources

Vincristine: Articles

How did we build it?  We deal in Molfiles or SDF files – with coordinates  Deposit anything that has an InChI – we support what InChI can handle, good and bad  Standardization based on “InChI standardization”  InChIs aggregate (certain) tautomers

The InChI Identifier

Downsides of InChI  Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”  InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
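The "first version in becomes THE structure" behaviour can be pictured with a minimal sketch. This is not ChemSpider's actual code: the `Registry` class, its field names, and the placeholder molfile strings are assumptions made for illustration. Only the idea of keying records on the InChI string comes from the slide.

```python
from dataclasses import dataclass, field

# Standard InChI for ethanol, used here only as a sample key.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


@dataclass
class Registry:
    """Hypothetical registry that deduplicates depositions on their InChI."""
    by_inchi: dict = field(default_factory=dict)   # InChI -> first molfile seen
    sources: dict = field(default_factory=dict)    # InChI -> list of depositors

    def deposit(self, inchi, molfile, source):
        """Return True if this InChI is new, False if it merged into a record."""
        is_new = inchi not in self.by_inchi
        if is_new:
            # The first depiction deposited becomes the one kept for display.
            self.by_inchi[inchi] = molfile
        # Later depositions only add their source to the existing record.
        self.sources.setdefault(inchi, []).append(source)
        return is_new


registry = Registry()
registry.deposit(ETHANOL_INCHI, "molfile-from-source-1", "source-1")
registry.deposit(ETHANOL_INCHI, "molfile-from-source-2", "source-2")
```

In this sketch the second deposition never replaces the stored molfile, so a poor first depiction persists until a curator changes it.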

Side Effects of InChI Usage

SMILES by comparison…

Side Effects of InChI Usage

Searches: The INTERNET

Search by InChI

ChemSpider Google Search

How did we build it?  We deal in Molfiles or SDF files – with coordinates  Deposit anything that has an InChI – we support what InChI can handle, good and bad  Standardization based on “InChI standardization”  InChIs aggregate (certain) tautomers  We deal with “various forms” of data

Crowdsourced “Annotations”  Users can add  Descriptions/Syntheses/Commentaries  Links to PubMed articles  Links to articles via DOIs  Add spectral data  Add Crystallographic Information Files  Add photos  Add MP3 files  Add Videos

ChemSpider : Spectra Linked

ChemSpider ID H1 NMR

ChemSpider ID HHCOSY

How did we build it?  We deal in Molfiles or SDF files – with coordinates  Deposit anything that has an InChI – we support what InChI can handle, good and bad  Standardization based on “InChI standardization”  InChIs aggregate (certain) tautomers  We deal with “various forms” of data  We are challenged with the complexities of chemical names

Antony Williams vs Identifiers Passport ID Dad, Tony, others SSN Green Card License 5 addresses ChemSpiderman (blog, Twitter account, Facebook, Friendfeed) OpenID ….

Aspirin names and synonyms Text searches depend on correct association >300 suggested identifiers for Aspirin just on PubChem Disambiguation dictionaries are necessary, not just for authors!
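A disambiguation dictionary of this kind can be sketched as a map from normalized names to a single structure key. The `SynonymDictionary` class and its normalization rule (case folding plus whitespace collapsing) are assumptions for illustration, not the curated dictionaries described in the talk. The key used is the standard InChIKey for aspirin.

```python
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # standard InChIKey for aspirin


def normalize(name):
    """Fold case and collapse whitespace so trivial variants match."""
    return " ".join(name.casefold().split())


class SynonymDictionary:
    """Hypothetical name-to-structure dictionary: many names, one key."""

    def __init__(self):
        self._name_to_key = {}

    def add(self, name, key):
        normalized = normalize(name)
        existing = self._name_to_key.get(normalized)
        if existing is not None and existing != key:
            # One name pointing at two structures needs a curator's decision.
            raise ValueError(f"{name!r} is already linked to {existing}")
        self._name_to_key[normalized] = key

    def lookup(self, name):
        """Return the structure key for a name, or None if it is unknown."""
        return self._name_to_key.get(normalize(name))


names = SynonymDictionary()
for synonym in ("Aspirin", "Acetylsalicylic acid", "2-Acetoxybenzoic acid"):
    names.add(synonym, ASPIRIN_KEY)
```

Rejecting a name that already points at a different key is one possible policy; a real system might instead queue the conflict for review.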

The Final Search Strategy

All Those Names, One Structure

Curated Dictionaries Matter

Crowdsourcing ChemSpider  ChemSpider is crowdsourced  Community deposition, annotation and curation  Anyone can “Leave Feedback”  Registered users can add data

“Curate” Identifiers

Success Depends on Dictionaries

Vincristine: Identifiers and Properties

Vincristine: Patents Linked by Name

Validated Names for Searching…

And yes... there are challenges

Licensing Data is Tough…

Data Licensing, Open Data  The use of CAS data in third party Data Mining Tools is permitted as long as CAS Records are downloaded via STN ® AnaVist™. All of these new "freedoms" are aimed at further enabling the dissemination of scientific information and the advancement of scientific research.  CAS does not permit the building of Databases that have wide and general availability and no longer fulfill the purpose of individual or team research that CAS permits but instead serve as a substitute for the use of CAS Databases.

A Comment on Quality  For >28 million chemical compounds there are some errors:  “Incorrect” structure representations  Mismatched name-structure relationships  Experimental properties (the values, the units)  Real vs. virtual compounds – text-mining and conversion  We have deprecated a LOT of data…

Identifier Dictionaries  Reciprocal curation processes…share curation with each other.  If a database has a compound already then use InChIKeys to match “suggested” validation against the compound.  A series of “added” and “removed” synonyms against InChIKeys for matching.
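One way to picture the exchange of "added" and "removed" synonyms is a feed of small records applied to a local dictionary keyed by InChIKey. The feed layout, the function name, and the placeholder keys (`KEY-A`, `KEY-B`) below are assumptions for illustration. The talk does not specify a format.

```python
def apply_curation_feed(local, feed):
    """Apply (inchikey, action, name) records to a local synonym dictionary.

    local maps an InChIKey to a set of synonyms. Records for compounds the
    local database does not hold are returned instead of being applied.
    """
    skipped = []
    for key, action, name in feed:
        if key not in local:
            skipped.append((key, action, name))
            continue
        if action == "added":
            local[key].add(name)
        elif action == "removed":
            local[key].discard(name)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


# Placeholder keys stand in for real InChIKeys.
local_synonyms = {"KEY-A": {"aspirin", "asprin"}}
partner_feed = [
    ("KEY-A", "removed", "asprin"),
    ("KEY-A", "added", "acetylsalicylic acid"),
    ("KEY-B", "added", "a compound we do not hold"),
]
not_applied = apply_curation_feed(local_synonyms, partner_feed)
```

Treating each record as a suggestion to be reviewed, rather than applying it directly as this sketch does, would be closer to the "suggested" validation the slide mentions.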

Federated Data Curation Sharing Who wants to work with us?

Structure Validation using feed  Look for approved synonyms  Compare feed InChIKey with database InChIKey  If different, flag for inspection
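The three steps above can be sketched as a single pass over the feed. The data shapes (a name-to-InChIKey dictionary and a list of name and key pairs) and the placeholder keys are assumptions for illustration only.

```python
def flag_mismatches(database, feed):
    """Return feed entries whose InChIKey disagrees with the local database.

    database maps a lower-cased approved synonym to the local InChIKey.
    feed is an iterable of (approved_synonym, feed_inchikey) pairs.
    """
    flagged = []
    for name, feed_key in feed:
        local_key = database.get(name.casefold())
        if local_key is None:
            continue  # the name is not held locally, so nothing to compare
        if local_key != feed_key:
            # Same approved name, different structure: needs inspection.
            flagged.append((name, local_key, feed_key))
    return flagged


# Placeholder keys stand in for real InChIKeys.
local_database = {"aspirin": "KEY-A", "caffeine": "KEY-C"}
curation_feed = [("Aspirin", "KEY-A"), ("Caffeine", "KEY-X"), ("Thymol", "KEY-T")]
to_inspect = flag_mismatches(local_database, curation_feed)
```

Here only the caffeine entry is flagged; the sketch makes no attempt to decide which side holds the correct structure.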

Many Problems Can be Solved…  Clean up databases – structure validation, structure standardization  Warn about  Valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…)  Standardize  Agree community rules to “Standardize”
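Two of the checks listed, valency and charge balance, can be illustrated with a small rule-based sketch. The valence table and the atom representation below are simplified assumptions and are not the actual rule set; a real validator would read the connection table from a molfile and apply many more rules.

```python
# Simplified, hypothetical table of usual valences for neutral atoms.
COMMON_VALENCES = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "S": {2, 4, 6}}


def validate(atoms):
    """Return warnings for a list of (element, bond_order_sum, formal_charge)."""
    warnings = []
    for index, (element, valence, charge) in enumerate(atoms):
        allowed = COMMON_VALENCES.get(element)
        # Only neutral atoms of tabulated elements are checked in this sketch.
        if charge == 0 and allowed is not None and valence not in allowed:
            warnings.append(f"atom {index}: unusual valence {valence} for {element}")
    net_charge = sum(charge for _, _, charge in atoms)
    if net_charge != 0:
        warnings.append(f"charge imbalance: net charge {net_charge:+d}")
    return warnings


water = [("O", 2, 0), ("H", 1, 0), ("H", 1, 0)]
odd_oxygen = [("O", 3, 0), ("H", 1, 0), ("H", 1, 0), ("H", 1, 0)]
```

Calling `validate(water)` yields no warnings, while `validate(odd_oxygen)` reports the neutral oxygen with three bonds.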

Structure Validation

Structure Validation - Fixed

What needs to happen?  If we could validate  Catch errors in databases (and clean)  Proactively catch errors in publications/patents  Reduce junk in the ether – improve QUALITY!  If we standardized  Interlinking should improve

CVSP: result of processing

NCATS Dataset

DrugBank dataset (6516 records)  Marked as Errors (arbitrary)  2 records with query bonds  3 records with invalid atoms (asterisk in polymers)  Unusual valence: ~70 (oxygen 3, sulfur 3 and 5, Mg 4, B 5, etc.)  Warnings  InChI not matching structure (100+)  SMILES not matching structure (100+)

 DrugBank ID: DB00755  InChI=1S/C20H28O2/c1-15( (2)14-19(21)22) (3) (18,4)5/h6,8-9,11-12,14H,7,10,13H2,1- 5H3,(H,21,22)/b9-6+,12-11+,15-8+,  DrugBank ID: DB00614

Connecting Chemistry across the web  So much of what is seen on ChemSpider is retrieved in real time using services

Connecting Chemistry across the web

Online Predictions

Web Services Open Up Collaboration  Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup  Many academic sites integrating directly – metabonomics, name lookup, semantic markup  Mobile app integration  Commercial structure drawing packages

Web Services

ChemSpider Everywhere: Spectral Game

ChemSpider Everywhere Crowdsourced Curation of Spectra

Web Services Integrate INTERNAL Projects  Integration between ChemSpider and…  Our publishing platform for structure display  ChemSpider SyntheticPages  LearnChemistry Wiki  National Chemical Database Service  And….a growing list….

What ChemSpider Does Not Handle  Polymers  Markush structures  Organometallics  Many Inorganics  Materials  Reactions…but….

ChemSpider Reactions

DERA  Digitally Enabling the RSC Archive…back to 1841  Extracting data and making available via appropriate platform  Chemicals  Reactions  Analytical Data  Figures

Chemical Database Service

Data for life sciences What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?

OpenPHACTS  Open PHACTS is an Innovative Medicines Initiative (IMI) – 3-year project  To reduce the barriers to drug discovery in industry, academia and for small businesses  To build an open platform, integrating chemistry and biology data from public domain resources  Semantic web platform  Open Standards, Open Data and Open Source

 Crowdsourcing across drug discovery  Open PHACTS : partnership between European Community and European Pharma Companies  22 partners, 8 pharmaceutical companies, 3 biotechs working together for 3 years  Freely accessible for knowledge discovery and verification.  Data on chemistry and biology  Pharmacological profiles  Proprietary and public data sources.

PharmaSea FP7 Initiative. PharmaSea: increasing value and flow in the marine biodiscovery pipeline ( ) Improve the quality, volume and value of active agents discovered in the marine environment and increase the speed at which they can be delivered RSC: Providing dereplication via ChemSpider, analytical data algorithms, integration with computer-assisted structure elucidation algorithms

Conclusions  RSC eScience supporting increasing number of grant-based projects  ChemSpider grows daily – community depositions and data from RSC Content with a focus on expanding data while improving quality  ChemSpider is an integration platform for MANY projects through web services  CVSP processing is available to use and provide feedback – will be available as a service also  We believe in curation sharing - who wants to collaborate?

Thank you Twitter: ChemConnector Blog: Personal Blog: SLIDES: