ORCID ID:0000-0002-2668-4821 Chemical Information in the Big Data Era: Data Quality, Data Integration and Building a Profile for Yourself as an Online.


1 Chemical Information in the Big Data Era: Data Quality, Data Integration and Building a Profile for Yourself as an Online Scientist
Antony Williams
ORCID ID: 0000-0002-2668-4821

2 My background… From 1985 to the present day
PhD’ed in the UK
Canadian Government lab as postdoc
Academia as NMR Facility Manager
Fortune 500 Company as Technology Leader
Start-up – product manager and CSO
Consultant – chemistry informatics industry
Entrepreneur – Created “ChemSpider”
Publisher – Royal Society of Chemistry
EPA-NCCT as cheminformatics expert

3 Of interest to faculty?

4 CASE Systems – Natural Products

5 Maybe you know this???

6 Computational Analysis at NCCT
ToxCast can help investigate particular endpoints for a chemical – an abundance of relevant data to model.

7 Public Access and Systems

8 My Hopes for Today
Encourage you in the “era of participation”
Provide an overview of tools available
Share some stories, statistics and strategies
Encourage you to “share for the sake of science”
OUTCOMES
You will claim an ORCID
You will take responsibility for your online profile
You will invest >1 hour per week

9 I would tell a chemistry joke…
But all of the good ones…

10 An ambitious idea….
Let’s map together all online chemistry data and build systems to integrate it
Heck, let’s integrate chemistry and biology data and add in disease data too if we can
Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative
Let’s make it all available on the web…for free

11

12 What about this….
We’re going to map the world
We’re going to take photos of as many places as we can and link them together
We’ll let people annotate and curate the map
Then let’s make it available free on the web
We’ll make it available for decision making
Put it on Mobile Devices, give it away…

13 Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

14 ~35 million chemicals and growing
Data sourced from >500 different sources
Crowd-sourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
Structure-centric hub for web-searching
…and a really big dictionary!!!

15 ChemSpider

16 ChemSpider

17 Experimental/Predicted Properties

18 Literature references

19 Patents references

20 RSC Books

21 Google Books

22 Organic Chemistry is hard…

23 …it has alkynes of trouble

24 Flavors of Chemistry

25 Molfiles (figure: example V2000 molfile; the atom block lists C, C, O, C, O, C, C, N, O, N and the record ends with M END)

26 Molfiles
Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
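
To make the molfile points concrete, here is a minimal Python sketch of reading the counts line, atom block and bond block of a V2000 molfile. The ethanol record is hand-written for illustration and is not taken from the slides; the function name is hypothetical, and real files carry many more fields (charges, stereo flags, property lines) that this sketch ignores.

```python
# Hand-written, illustrative V2000 record for ethanol (heavy atoms only).
ETHANOL_MOLFILE = "\n".join([
    "ethanol",                  # header line 1: molecule name
    "  hand-written example",   # header line 2: program / timestamp info
    "",                         # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",  # counts line: 3 atoms, 2 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",             # bond: atom 1 - atom 2, single
    "  2  3  1  0",             # bond: atom 2 - atom 3, single
    "M  END",
])


def parse_molfile_v2000(text):
    """Return (atoms, bonds) from a simple V2000 molfile.

    atoms: list of (symbol, x, y, z); bonds: list of (from, to, order).
    """
    lines = text.split("\n")
    counts = lines[3]
    # The counts line is fixed-width: 3 characters each for atoms and bonds.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


atoms, bonds = parse_molfile_v2000(ETHANOL_MOLFILE)
print([a[0] for a in atoms], bonds)
```

The coordinates are layout only: two drawing packages can write different X,Y values (or 3D coordinates) for the same connectivity, which is one reason molfiles for the same compound can differ.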

27 Stereo

28 Tautomeric forms

29 Chemists are good…

30 The InChI Identifier

31 InChI
SINGLE code base managed by IUPAC – integrated into drawing packages. No variability, unlike SMILES
InChI Strings can be reversed to structures – same problem as with SMILES – no layout
Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
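
As a small illustration of how an InChI string is built from layers, the Python sketch below splits a standard InChI on "/" and labels each layer by its one-letter prefix. The function name is hypothetical and the two example strings (ethanol and L-alanine) are supplied here for illustration; this only tokenizes a string and is no substitute for the IUPAC InChI software.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into (label, value) pairs, in order.

    The version and formula carry no prefix letter; other layers start with
    a lowercase letter such as c (connections), h (hydrogens), t/m/s (stereo).
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len("InChI="):].split("/")
    result = [("version", version)]
    for layer in layers:
        if layer and layer[0].islower():
            result.append((layer[0], layer[1:]))
        else:
            result.append(("formula", layer))
    return result


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

for label, value in split_inchi_layers(L_ALANINE):
    print(label, value)
```

The stereo layers (t, m, s) appear only when stereochemistry is specified, so two records that differ only in stereo share everything up to the h layer.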

32 Multiple Layers

33 Tautomers

34 Stereo

35 InChIStrings Hash to InChIKeys
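
An InChIKey is a fixed-length, 27-character hash of the InChI string, which is what makes it convenient as a web search term. The Python sketch below only checks and splits that layout (14-character connectivity block, 8-character block for the remaining layers, a standard/non-standard flag, a version letter, and a protonation character); it does not compute the hash. The function name is hypothetical, and the ethanol key is supplied here as an example.

```python
import re

# 14 letters, hyphen, 8 letters + S/N flag + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key):
    """Split an InChIKey into its blocks; raise ValueError on a bad layout."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, other_layers, flag, version, protonation = match.groups()
    return {
        "connectivity": connectivity,   # hash of formula + connection layers
        "other_layers": other_layers,   # hash of stereo, isotope, etc.
        "standard": flag == "S",        # S = standard InChI, N = non-standard
        "version": version,
        "protonation": protonation,     # N = neutral
    }


print(parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))  # ethanol
```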

36 Structure search the web

37 Exact Search

38 Skeleton Search
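
One way to think about the difference between an exact search and a skeleton search, assuming the skeleton search matches only the first (connectivity) block of the InChIKey, is sketched below in Python. The records and the function name are illustrative; the slides do not show how ChemSpider implements the search.

```python
from collections import defaultdict


def group_by_skeleton(records):
    """Group (name, inchikey) records by the 14-character connectivity block."""
    groups = defaultdict(list)
    for name, key in records:
        groups[key.split("-")[0]].append(name)
    return dict(groups)


# Illustrative records: stereoisomers share the first block of the key.
RECORDS = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHSZGISA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
]

groups = group_by_skeleton(RECORDS)
print(groups["QNAYBMKLOCPYGJ"])
```

An exact search would compare the full key, so L- and D-alanine would be distinct hits; grouping on the first block treats them as the same skeleton.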

39 Data Quality/Standardization
MANY structures online that are meant to represent a particular compound are MISREPRESENTED.
Commonly you will have better success finding information by name searches than by structure – with many caveats of course…
Validating chemical structure representations is laborious work – and it’s shocking to review data…

40 Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011

41 Data quality is a known issue

42 Data quality is a known issue

43 Patent data in public databases

44 Patent data in public databases

45 You just can’t trust atoms!

46 Depiction vs Accurate Representation

47 Depiction vs Accurate Representation

48 What is the Structure of Vitamin K1?

49 Data Quality Issues and $$$$

50 Many Names, One Structure

51 But big and often noisy

52 Text Mining on IUPAC Names
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue

53 Text Mining on IUPAC Names

54 Name to Structure Conversion

55 Name to Structure Conversion

56 What could we get? US A1, The melting point and both NMR spectra are associated with the compound. Other physical quantities e.g. volumes, pressures etc. are also detected

57 PhysChem first: Melting Points
Melting/sublimation/decomposition points extracted for 287,635 distinct compounds from USPTO patent applications/grants
Sanity checks used to flag dubious values – probably 130-4°C
Non-melting outcomes recorded e.g. mp °C. (subl.)
What models could be built?
Mostly melting points (as opposed to sublimation/decomposition).
Dubious values are usually mistakes in the original document, e.g. in this case probably a missing hyphen.
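
A rough Python sketch of this kind of extraction and sanity check follows. The regular expression, the 500 °C plausibility threshold and the sample sentences are all assumptions made for illustration; the pipeline actually used on the USPTO text is not described in the slides.

```python
import re

# Assumed pattern: "mp" / "m.p." / "melting point", a value, optional range, °C.
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)[:\s]*"
    r"(\d+(?:\.\d+)?)"
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"
    r"\s*°\s*C",
    re.IGNORECASE,
)

MAX_PLAUSIBLE_MP_C = 500.0  # hypothetical cut-off for flagging dubious values


def extract_melting_points(text):
    """Return a list of (low, high, dubious) tuples in °C."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low_s = match.group(1)
        high_s = match.group(2) or low_s
        low, high = float(low_s), float(high_s)
        # Expand shorthand ranges such as "130-4" to 130-134.
        if (high < low and low_s.isdigit() and high_s.isdigit()
                and len(high_s) < len(low_s)):
            high = float(low_s[:-len(high_s)] + high_s)
        dubious = high < low or high > MAX_PLAUSIBLE_MP_C
        results.append((low, high, dubious))
    return results


SAMPLE = "Compound A: mp 130-4°C. Compound B: mp 1304°C. Compound C: m.p. 98-100 °C."
for row in extract_melting_points(SAMPLE):
    print(row)
```

In the made-up sample, "mp 1304°C" is flagged as dubious, which is the kind of value a missing hyphen in "130-4°C" would produce.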

58 Modeling “BIG data”
Melting point models developed with ca. 300k compounds
Required 34 GB memory and about 400 MB disk space (zipped)
Matrix with 2×10^11 entries (300k molecules × 700k descriptors)
>12k core-hours (>600 CPU-days) for parameter optimization
Parallelized on >600 cores with up to 24 cores per task
Consensus model as average of individual models
Accuracy of consensus model is ~33.6 °C for drug-like region compounds
Models publicly available at
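
The arithmetic behind the matrix size, and the idea of a consensus model as a plain average, can be sketched in a few lines of Python. The per-model predictions are made-up numbers, and the comparison with a dense float32 matrix is an inference added here, not a statement from the slides.

```python
from statistics import mean

N_MOLECULES = 300_000
N_DESCRIPTORS = 700_000

# 300k x 700k = 2.1e11 matrix entries, i.e. the ~2×10^11 quoted on the slide.
n_entries = N_MOLECULES * N_DESCRIPTORS

# If every entry were stored as a 4-byte float, the matrix alone would need
# about 840 GB, so the quoted 34 GB suggests a sparse representation
# (an inference, not something the slide states).
dense_float32_gb = n_entries * 4 / 1e9


def consensus_prediction(model_predictions):
    """Consensus as the unweighted average of the individual model outputs."""
    return mean(model_predictions)


# Hypothetical melting point predictions (°C) from three individual models.
print(n_entries, dense_float32_gb, consensus_prediction([120.0, 130.0, 125.0]))
```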

59 A Recent Talk http://www.slideshare.net/AntonyWilliams/

60 ESI – Text Spectra

61 ChemSpider ID H1 NMR

62 We want to find text spectra?
We can find and index text spectra:
13C NMR (CDCl3, 100 MHz): δ = (CH3), (CH, benzylic methane), (CH, benzylic methane), (CH2), (CH2), , , , , , , , , , (ArCH), 99.42, , , , , , , , (ArC)
What would be better are spectral figures – and include assignments where possible!

63 Text spectrum example
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
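
A text spectrum like this one is regular enough to index with a simple pattern. The Python sketch below pulls out the shift (or shift range), multiplicity and proton count for each peak; the regular expression and function name are assumptions for illustration, and real patent text is far messier than this single example.

```python
import re

# shift or shift range, then "(multiplicity, nH" at the start of the bracket
PEAK_RE = re.compile(
    r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?\s*\(\s*([a-z]+)\s*,\s*(\d+)H"
)

SPECTRUM = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)


def parse_text_spectrum(text):
    """Return one dict per peak: shift_low, shift_high, multiplicity, n_h."""
    peaks = []
    for low, high, mult, n_h in PEAK_RE.findall(text):
        peaks.append({
            "shift_low": float(low),
            "shift_high": float(high) if high else float(low),
            "multiplicity": mult,
            "n_h": int(n_h),
        })
    return peaks


peaks = parse_text_spectrum(SPECTRUM)
print(len(peaks), sum(p["n_h"] for p in peaks))
```

Coupling constants such as "J = 4.8 Hz" are skipped because they are not followed by an opening bracket; summing the proton counts gives a quick consistency check against the molecular formula.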

64 2,316,005 distinct NMR spectra in 2001-2015 USPTO
Nucleus    Count
H          (not given)
C          173970
Unknown    107439
F          22158
P          16333
B          980
Si         715
Pt         275
N          170
V          101
Unknown spectra are almost always hydrogen. As carbon shifts are so different from hydrogen shifts, a very crude check could partition the unknowns into proton and carbon NMR.
Small numbers of other obscure spectra were also found (but also false positives due to really bizarre “OCR” errors of hydrogen or the like, i.e. 1 in a million errors :-p)
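
The "very crude check" could look something like the Python sketch below, which labels a spectrum as carbon if any shift lies above a proton-like range. The 15 ppm cut-off and the function name are assumptions made here; apart from 99.42, the carbon shifts in the usage example are invented.

```python
def guess_nucleus(shifts_ppm, proton_max_ppm=15.0):
    """Crudely guess 1H vs 13C from the largest chemical shift.

    proton_max_ppm is an assumed cut-off: proton shifts rarely exceed it,
    while most carbon spectra contain peaks well above it.
    """
    if not shifts_ppm:
        return "unknown"
    return "13C" if max(shifts_ppm) > proton_max_ppm else "1H"


# Shifts from the 1H text spectrum shown earlier in the deck.
print(guess_nucleus([2.57, 4.24, 4.35, 4.47, 4.57, 6.95, 7.94]))
# 99.42 appears in the deck's 13C example; the other values are invented.
print(guess_nucleus([21.3, 99.42, 128.5]))
```

A carbon spectrum of a purely aliphatic compound with all peaks below the cut-off would be misclassified, so this is only a first-pass partition.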

65 ESI Data also contains figures

66 “Where is the real data please?”
FIGURE

67 Data added to ChemSpider

68 Visibility Means Discoverability
Q: Does a Social Profile as a scientist matter?
You are visible when you share your skills, experience and research activities by:
Establishing a public profile
Getting on the record
Collaborative Science
Demonstrating a skill set
Measured using “alternative metrics”
Contributing to the public peer review process
There are many ways to become “visible”

69 Scientists measured by Impact

70 How to Measure Impact

71 Your Research Outputs?
Research datasets
Scientific software
Publications – peer-reviewed and many others
Posters and presentations at conferences
Electronic theses and dissertations
Performances in film and audio
Lectures, online classes and teaching activities
What else???
The possibilities to share are endless

72 Open Researcher & Contributor ID
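
An ORCID iD is 16 characters long and ends in a check character computed with the ISO 7064 MOD 11-2 algorithm, so a mistyped iD can usually be caught before it is used. The Python sketch below is a minimal validator with hypothetical function names; the iD in the usage line is the one given for this presentation.

```python
def orcid_check_digit(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, digits and check character of a hyphenated iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-2668-4821"))
```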

73 Here’s why they are useful…

74 Wonderful Profile…

75 CONTRIBUTE to the community
Share your expertise in the new world of open
Share your figures, share your data
Contribute to Wikis – Wikipedia and others
Participate in Open Notebook Science
Build tools and platforms to support chemists
Curate, use and comment on data
Get engaged on blogs and discussions

76 Oxidation by Sodium Hydride?

77 The Blogosphere Analyzes…

78 The Blogosphere Analyzes…

79 The new world of micropublishing

80 ChemSpider SyntheticPages

81 Micropublishing with Peer Review (a chemical synthesis blog?)

82 Multi-Step Synthesis

83 Interactive Data

84 You should be LinkedIn
LinkedIn for “professionals”
Expose work history, skills, your professional interests, your memberships – your profile WILL be watched!
Who you are linked to says a lot about who you are.
Get Linked to people in your domain.
Professional relationships rather than just friendships.
FaceBook-it for friends

85 LinkedIn http://www.linkedin.com/in/AntonyWilliams

86 My Career Captured…

87 And “Endorsements”

88 Highlight “Projects”

89 Manage Articles Here Too.

90 …and presentations

91 My Google Scholar Profile http://scholar.google.com/citations

92 “I don’t have any publications”
This is YOUR choice!
Conference abstracts
You produce reports, presentations and posters during your studies – share them!

93 Slideshare – Highly Accessed

94 Slideshare – EXPANDED Audience

95 Fast Network Communication

96 Slideshare – NOT Just Slides

97 ResearchGate https://www.researchgate.net/profile/Antony_Williams

98 ResearchGate

99 ResearchGate

100 I have a set of statistics & profiles
My Blog: Twitter: ORCID: Amazon Author Page: Follow Link to Author Page My Klout: LinkedIn: SlideShare: Google Scholar Citations Profile: Antony Williams Citations Wikipedia :

101 The Power of Social Media

102 I recommend…
Register for an ORCID ID – then use it
Develop your LinkedIn profile
Publish to Slideshare
Track Google Scholar Citations (for now)
Choose: ResearchGate or Academia.edu
Set up an About.ME page to link everything
Participate in building your profile

103 Thank you ORCID: Personal Blog: SLIDES: 103

