Experiences in Hosting Big Chemistry Data Collections for the Community Antony Williams July 30th 2014, NIST.


1 Experiences in Hosting Big Chemistry Data Collections for the Community
Antony Williams, July 30th 2014, NIST

2 Overview of Our Activities
The Royal Society of Chemistry as a provider of chemistry for the community:
As a charity
As a scientific publisher
As a host of commercial databases
As a partner in grant-based projects
As the host of ChemSpider
And now in development: the RSC Data Repository for Chemistry

3 ~30 million chemicals and growing
Data sourced from >500 different sources
Crowdsourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
Structure-centric hub for web searching
…and a really big dictionary!!!

4 ChemSpider

5 ChemSpider

6 ChemSpider

7 Experimental/Predicted Properties

8 Literature references

9 Patent references

10 RSC Books

11 Google Books

12 Vendors and data sources

13 Crowdsourced “Annotations”
Users can add descriptions, syntheses and commentaries
Links to PubMed articles
Links to articles via DOIs
Add spectral data
Add Crystallographic Information Files
Add photos
Add MP3 files
Add videos

14 APIs

15 APIs

16 WebBook and ChemSpider

17 WebBook and ChemSpider

18 WebBook and ChemSpider

19 WebBook and ChemSpider

20 WebBook and ChemSpider

21 JavaScript viewer: NMR, MS, IR

22 Aspirin on ChemSpider

23 Many Names, One Structure

24 What is the Structure of Vitamin K?

25 MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione).

26 What is the Structure of Vitamin K?

27 The ultimate “dictionary”
Search all forms of structure IDs:
Systematic name(s)
Trivial name(s)
SMILES
InChI strings
InChIKeys
Database IDs
Registry Number
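A search box that accepts every kind of identifier first has to guess what kind of string it was given. The following is a minimal sketch of that step, assuming simple pattern checks; the function names and the fallback label are hypothetical and this is not ChemSpider's actual search logic. It recognises InChI strings by their prefix, InChIKeys by their 14-10-1 letter layout, and Registry Numbers by their check digit.

```python
import re

# Patterns for a few identifier types named on the slide. These are
# illustrative assumptions, not ChemSpider's actual search logic.
INCHI_RE = re.compile(r"^InChI=1S?/\S+$")
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the final check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(text: str) -> str:
    """Return a coarse label for a query string (hypothetical helper)."""
    text = text.strip()
    if INCHI_RE.match(text):
        return "InChI"
    if INCHIKEY_RE.match(text):
        return "InChIKey"
    if is_valid_cas(text):
        return "Registry Number"
    # Names and SMILES cannot be told apart reliably by pattern alone,
    # so everything else is left for a name or structure lookup.
    return "name or SMILES"


if __name__ == "__main__":
    for query in ("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "50-78-2", "aspirin"):
        print(query, "->", classify_identifier(query))
```

The three sample queries are all identifiers for aspirin, which is the point of a structure-centric dictionary: each one should resolve to the same record.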

28 Linking Names to Structures

29 Semantic Mark-up of Articles

30 Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011

31 Data quality is a known issue

32 Standardize
Use the SRS as a guidance document for standardization
Adjust as necessary to our needs

33 Nitro groups

34 Salt and Ionic Bonds

35 Ammonium salts

36 CVSP Filtering and Flagging

37 Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011

38 Substructure | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
Gonane | 34 | 5 | 8 | 21
Gon-4-ene | 55 | 12 | 3 | 33 | 7
Gon-1,4-diene | 60 | 17 | 10 | 23
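The hit counts on this slide are easier to compare as fractions. The short sketch below works that out, assuming the first two numbers after each substructure name are the "# of Hits" and "# of Correct Hits" columns; the helper name is hypothetical.

```python
# (substructure, number of hits, number of correct hits) as listed on the slide
ROWS = [
    ("Gonane", 34, 5),
    ("Gon-4-ene", 55, 12),
    ("Gon-1,4-diene", 60, 17),
]


def correct_fraction(hits: int, correct: int) -> float:
    """Fraction of returned structures that were correct."""
    return correct / hits


if __name__ == "__main__":
    for name, hits, correct in ROWS:
        share = correct_fraction(hits, correct)
        print(f"{name}: {correct}/{hits} correct ({share:.1%})")
    # Gonane: 5/34 correct (14.7%)
    # Gon-4-ene: 12/55 correct (21.8%)
    # Gon-1,4-diene: 17/60 correct (28.3%)
```

On that reading, fewer than a third of the hits for any of the three substructures were correct.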

39 Crowdsourced Enhancement
The community can clean and enhance the database by providing feedback and direct curation
Tens of thousands of edits made

40 Data Quality is Work
Cholesterol
Taxol

41 Maybe we can help?
Is there an interest in data checking the WebBook or other NIST data sources?

42 Publications: summary of work
Scientific publications are a summary of work
Is all work reported?
How much science is lost to pruning?
What of value sits in notebooks and is lost?
Publications offering access to “real data”?
How much data is lost?
How many compounds never reported?
How many syntheses fail or succeed?
How many characterization measurements?

43 What are we building?
We are building the “RSC Data Repository”
Containers for compounds, reactions, analytical data, tabular data
Algorithms for data validation and standardization
Flexible indexing and search technologies
A platform for modeling data and hosting existing models and predictive algorithms

44 Deposition of Data

45 Compounds

46 Reactions

47 Analytical data

48 Crystallography data

49 Can we get historical data?
Text and data can be mined
Spectra can be extracted and converted
SO MUCH Open Source Code available

50 Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue

51 Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue

52 Text spectra? 13C NMR (CDCl3, 100 MHz): δ = (CH3), (CH, benzylic methine), (CH, benzylic methine), (CH2), (CH2), , , , , , , , , , (ArCH), 99.42, , , , , , , , (ArC)

53 1H NMR (CDCl3, 400 MHz)
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
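A text spectrum like this one is regular enough to be turned back into data with a pattern match. The sketch below is one possible approach, not the method used in the project: the `Peak` type, the `parse_peaks` helper, and the regular expression are all assumptions made for illustration. It pulls out the chemical shift (or shift range), the multiplicity, and the hydrogen count for each peak, and ignores coupling constants and assignments.

```python
import re
from typing import List, NamedTuple, Optional


class Peak(NamedTuple):
    shift_ppm: float                 # chemical shift, or the start of a range
    shift_end_ppm: Optional[float]   # end of the range when one is reported
    multiplicity: str                # e.g. "m", "d", "dd"
    n_hydrogens: int                 # integration


# A shift must be followed directly by "(multiplicity, nH", which keeps
# coupling constants such as "J = 4.8 Hz" from being read as peaks.
PEAK_RE = re.compile(
    r"(\d+\.\d+)"                    # chemical shift
    r"(?:\s*[–-]\s*(\d+\.\d+))?"     # optional range end (en dash or hyphen)
    r"\s*\(\s*([a-z]+)\s*,"          # multiplicity
    r"\s*(\d+)H"                     # number of hydrogens
)

SPECTRUM = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)


def parse_peaks(text: str) -> List[Peak]:
    """Extract peaks from a text-format 1H NMR listing."""
    peaks = []
    for m in PEAK_RE.finditer(text):
        end = float(m.group(2)) if m.group(2) else None
        peaks.append(Peak(float(m.group(1)), end, m.group(3), int(m.group(4))))
    return peaks


if __name__ == "__main__":
    peaks = parse_peaks(SPECTRUM)
    for peak in peaks:
        print(peak)
    print("total H:", sum(p.n_hydrogens for p in peaks))
```

Summing the integrations gives a quick sanity check against the hydrogen count of the proposed structure, which is one way extracted spectra can be validated before deposition.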

54 Turn “Figures” Into Data

55 Make it interactive

56 SO MANY reactions!

57 Extracting our Archive
What could we get from our archive?
Find chemical names and generate structures
Find chemical images and generate structures
Find reactions
Find data (MP, BP, LogP) and deposit
Find figures and database them
Find spectra (and link to structures)
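Finding property data such as melting points in article text can start with a simple pattern. The sketch below is illustrative only: the regular expression, the `extract_melting_points` helper, and the sample sentences are assumptions, not part of the RSC pipeline, and real articles use many more phrasings than the three covered here.

```python
import re
from typing import List, Tuple

# Matches "mp", "m.p." or "melting point", an optional ":" or "=",
# a value or a range, and a Celsius unit.
MP_RE = re.compile(
    r"\b(?:mp|m\.p\.|melting point)\s*[:=]?\s*"
    r"(\d+(?:\.\d+)?)"                      # single value or start of range
    r"(?:\s*[–-]\s*(\d+(?:\.\d+)?))?"       # optional end of range
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text: str) -> List[Tuple[float, float]]:
    """Return (low, high) melting points in °C; low == high for single values."""
    results = []
    for m in MP_RE.finditer(text):
        low = float(m.group(1))
        high = float(m.group(2)) if m.group(2) else low
        results.append((low, high))
    return results


if __name__ == "__main__":
    # Hypothetical sentences written for this example.
    samples = [
        "White solid, mp 135-136 °C.",
        "Colourless needles (m.p. 98.5 °C).",
        "The mixture was stirred at 25 °C for 2 h.",
    ]
    for sentence in samples:
        print(sentence, "->", extract_melting_points(sentence))
```

Requiring the "mp" keyword before the number is what keeps reaction temperatures, as in the third sample, from being deposited as melting points.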

58 Models published from data

59 Text-mining Data to compare

60 How is DERA going?
We have text-mined all 21st century articles… >100k articles from
Marked up with XML and published onto the HTML forms of the articles
Required multiple iterations based on dictionaries, markup, text mining iterations
New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
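Dictionary-based markup of the kind described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DERA implementation: the `<compound>` element, the `id` attribute, the identifiers `cmpd-1` and `cmpd-2`, and the `mark_up` helper are all hypothetical.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape, quoteattr

# Hypothetical dictionary mapping lower-case names to record identifiers.
DICTIONARY: Dict[str, str] = {
    "thionyl chloride": "cmpd-1",
    "benzene": "cmpd-2",
}


def mark_up(text: str, dictionary: Dict[str, str]) -> str:
    """Wrap every dictionary name found in text in a <compound> element."""
    # Longest names first, so a longer name is preferred over a shorter
    # dictionary entry that is a prefix of it.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for m in pattern.finditer(text):
        # Escape the untouched text so the result stays well-formed XML.
        parts.append(escape(text[last:m.start()]))
        ident = dictionary[m.group(1).lower()]
        parts.append(
            f"<compound id={quoteattr(ident)}>{escape(m.group(1))}</compound>"
        )
        last = m.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    sentence = "thionyl chloride ( 5 ml ) and benzene ( 50 ml )"
    print(mark_up(sentence, DICTIONARY))
```

A lookup like this only covers known names, which is why the slide mentions repeated iterations on the dictionaries themselves.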

61 Work in Progress

62 Work in Progress

63 Work in Progress

64 Work in Progress

65 Is It Easy?
Curated dictionaries for known names: Dictionary (ontologies), Dictionary (chemistry), RSC ontologies (methods, reactions)
Unknown names: automated name to structure conversion (OPSIN, ACD N2S)
Production processes: XML ready for publication, text-mining, marked-up XML
Chemical structures: SD file, CDX integration (coming soon)

66 Acknowledgments
Regarding InChI – Steve Stein, Steve Heller, Dmitrii Tchekhovskoi, Igor Pletnev

67 Thank you

