ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources (and lessons from President Bush) Antony Williams 5th Meeting on.

Presentation transcript:

ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources (and lessons from President Bush) Antony Williams 5th Meeting on U.S. Government Chemical Databases and Open Chemistry August 2011

I want to know about “Vincristine”

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources

Vincristine: Patents

Vincristine: Articles

Vincristine: RSC Databases

Searches: The INTERNET

Validated Names for Searching…

And InChIs…

ChemSpider
• The Free Chemical Database
• A central hub for chemists to source information
• >26 million unique chemical records
• Aggregated from >400 data sources
• Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions
• A central hub for chemists to deposit & curate data

Essential aspects of ChemSpider
• ChemSpider is a BIG database…and growing
• Our focus has increasingly become QUALITY over quantity
• Data curation and validation is our strength – crowdsourcing is contributing, more is required
• Validated data has enabled linking of the internet

There are NO errors in ChemSpider

“All That Glisters is Not Gold” What is the structure of Discodermolide?

How to distinguish…who’s wrong?

Neither is wrong

Data Curation…long torturous task
• Data curation – JUST structure-name validation is a long, torturous, iterative task.
• How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra?

Hand on my heart….

Hand on my heart  No offence meant by what follows! We ALL have quality issues!

PHYSPROP Database
The freely downloadable database underlying the EPI Suite prediction software.
Very basic filters suggest data quality issues.
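To make "very basic filters" concrete, here is a minimal Python sketch of the kind of sanity check that can surface suspect property values. The record layout, field names and thresholds are all assumptions made for illustration; they are not the PHYSPROP schema and not necessarily the filters used for this talk.

```python
from dataclasses import dataclass
from typing import List, Optional

ABSOLUTE_ZERO_C = -273.15


@dataclass
class PropertyRecord:
    """Hypothetical record layout, not the real PHYSPROP schema."""
    name: str
    melting_point_c: Optional[float] = None
    boiling_point_c: Optional[float] = None
    log_p: Optional[float] = None


def basic_quality_flags(rec: PropertyRecord) -> List[str]:
    """Return human-readable flags for values that deserve a second look."""
    flags = []
    temperatures = (
        ("melting point", rec.melting_point_c),
        ("boiling point", rec.boiling_point_c),
    )
    for label, value in temperatures:
        # No physical temperature can be below absolute zero.
        if value is not None and value < ABSOLUTE_ZERO_C:
            flags.append(f"{label} below absolute zero")
    # A melting point above the boiling point usually signals swapped
    # or mistyped values, so flag it for review.
    if (
        rec.melting_point_c is not None
        and rec.boiling_point_c is not None
        and rec.melting_point_c > rec.boiling_point_c
    ):
        flags.append("melting point above boiling point")
    # The logP window is an assumed, deliberately generous range.
    if rec.log_p is not None and not -10.0 <= rec.log_p <= 15.0:
        flags.append("logP outside assumed plausible range")
    return flags
```

A flag here only means "a curator should look at this record"; deciding which value is actually wrong still needs a person.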

The Stereochemistry challenge chemicals with “missed” stereo

NIST Webbook

NLM’s DailyMed

PubChem

Linking

Patents

WYSIWYG compounds

Data Curation…long torturous task
• Data curation – JUST structure-name validation is a long, torturous, iterative task.
• How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra?
• The crowd in crowdsourcing is…generally small
• Which of the large databases are doing careful curation? How can we share the workload? Hmm…

Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.

Drug Name | Generic Name | ChEBI | ChemSpider | CAS Com. Chem | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia
Spiriva | Tiotropium Bromide | No Hits   4/0
Depakote | Valproate semisodium | No Structure
Basen | Voglibose | No Hits  2/1
Symbicort | 1) Budesonide |  8/1
Symbicort | 2) Formoterol | WRONG  No Hits  6/1 
Vytorin | 1) Ezetimibe | No Hits
Vytorin | 2) Simvastatin |  2/1 
Taxol | Paclitaxel |  44/1
Thalidomid | Thalidomide | No Hits
Zocor | Simvastatin |  2/1
Crestor | Rosuvastatin | No Hits  2/1

Who does the Curation?

ChemSpider can “do it” for us
• ChemSpider has built a curation interface used by the community and ourselves for curating.
• All curation activities are available for review, online immediately, iteratively checked.
• Curators have different abilities based on their profile: there are only a few “Master Curators”.
• Can we “share” the curation workload?

Proof of Concept Data Curation Sharing

Identifier Dictionaries
• Reciprocal curation processes…share curation with each other.
• If a database already has a compound, use InChIKeys to match “suggested” validation against the compound.
• A series of “added” and “removed” synonyms against InChIKeys for matching.
• Who will participate?
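The reciprocal curation idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SynonymSuggestion` record, the `match_suggestions` and `accept` helpers, and the in-memory dictionary standing in for a database are all hypothetical, and the InChIKeys used with it would come from the real databases, not from this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class SynonymSuggestion:
    """One shared curation record (hypothetical exchange format)."""
    inchikey: str
    synonym: str
    action: str  # "added" or "removed"


def match_suggestions(
    local_names: Dict[str, Set[str]],
    suggestions: List[SynonymSuggestion],
) -> List[SynonymSuggestion]:
    """Keep only suggestions that concern compounds we already hold
    and that would actually change our synonym list."""
    relevant = []
    for s in suggestions:
        names = local_names.get(s.inchikey)
        if names is None:
            continue  # compound not in the local database
        if s.action == "added" and s.synonym not in names:
            relevant.append(s)
        elif s.action == "removed" and s.synonym in names:
            relevant.append(s)
    return relevant


def accept(local_names: Dict[str, Set[str]], s: SynonymSuggestion) -> None:
    """Apply one suggestion after a local curator has approved it."""
    names = local_names[s.inchikey]
    if s.action == "added":
        names.add(s.synonym)
    elif s.action == "removed":
        names.discard(s.synonym)
    else:
        raise ValueError(f"unknown action: {s.action!r}")
```

Matching and accepting are kept separate on purpose: the incoming records stay "suggested" until someone at the receiving database approves them, which mirrors the wording on the slide.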

Proof of Concept Data Curation Sharing

Lessons Learned : Big vs Good!

15 compounds called Yohimbine
54 skeletons for Yohimbine

Aggregators suffer dilution…

User Understanding of Data
• Users searching “Yohimbine” expect to find it…not labeled versions of it, not ambiguous stereochemistries, not partial stereochemistries.
• Data “aggregation” into a meaningful form is a major challenge, e.g. assays for radiolabeled compounds linked to actual drugs.
• Data curation efforts such as ChEMBL are essential!
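One way to see this dilution in an aggregated collection is to group records by the first block of their InChIKey. In a standard InChIKey the first 14 letters are derived from the connectivity of the structure, while the second block covers stereochemistry and isotopic labelling, so labeled or partially defined stereo variants of one skeleton tend to share the first block. The sketch below assumes records are available as simple (record id, InChIKey) pairs; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    first, _, _ = inchikey.partition("-")
    if len(first) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return first


def group_by_skeleton(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each connectivity block to the record ids that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record_id, inchikey in records:
        groups[skeleton_block(inchikey)].append(record_id)
    return dict(groups)
```

Any group with more than one record is a candidate for the kind of aggregation decision described above: which record is "the" compound a user expects, and which are labeled or stereo variants that should be linked to it.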

SciMobileApps.com

SciDBs.com (Coming soon)

• Open PHACTS: partnership between European Community and EFPIA
• Freely accessible for knowledge discovery and verification.
• Data on small molecules
• Pharmacological profiles
• Pharmacokinetics
• ADMET data
• Biological targets and pathways
• Proprietary and public data sources.

Standardization and Quality
• Our initial approaches to standardization were imperfect. We are revisiting them to support Open PHACTS.
• We were highly dependent on InChI, with not enough standardization prior to InChI generation.
• InChI is excellent, though acknowledged to be imperfect. Way better than SMILES for linking the internet!

Conclusions
• ChemSpider is one of many important chemistry resources on the internet
• We have assumed an important role of curating and validating data – specifically name-structure dictionaries are of high importance but data validation is also key
• We are a part of the federation of internet databases serving chemistry. MORE collaboration can serve us all better…how?

A Plea to Gov’t DBs…
• Please improve gov’t DB communications
• Please buddy up and get closer together
• Get into deep conversations

Acknowledgments
• Our development team – headed by THAT man…
• Many in this room: InChI, PubChem, DSSTox, FDA, ChEBI/ChEMBL, SureChem, many more
• Curators – special gratitude to Barrie Walker!
• Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Thank you Twitter: ChemConnector Personal Blog: SLIDES: