Published by Phillip Dennis. Modified over 6 years ago.
1
ORCID ID:
Chemical Information in the Big Data Era: Data Quality, Data Integration and Building a Profile for Yourself as an Online Scientist
Antony Williams
2
My background…
From 1985 to the present day
PhD in the UK
Canadian Government lab as postdoc
Academia as NMR Facility Manager
Fortune 500 Company as Technology Leader
Start-up – product manager and CSO
Consultant – chemistry informatics industry
Entrepreneur – Created “ChemSpider”
Publisher – Royal Society of Chemistry
EPA-NCCT as cheminformatics expert
3
Of interest to faculty?
4
CASE Systems – Natural Products
5
Maybe you know this???
6
Computational Analysis at NCCT
ToxCast can help investigate particular endpoints for a chemical, with an abundance of relevant data to model.
7
Public Access and Systems
8
My Hopes for Today
Encourage you in the “era of participation”
Provide an overview of tools available
Share some stories, statistics and strategies
Encourage you to “share for the sake of science”
OUTCOMES
You will claim an ORCID
You will take responsibility for your online profile
You will invest >1 hour per week
9
I would tell a chemistry joke…
But all of the good ones…
10
An ambitious idea….
Let’s map together all online chemistry data and build systems to integrate it
Heck, let’s integrate chemistry and biology data and add in disease data too if we can
Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative
Let’s make it all available on the web…for free
12
What about this….
We’re going to map the world
We’re going to take photos of as many places as we can and link them together
We’ll let people annotate and curate the map
Then let’s make it available free on the web
We’ll make it available for decision making
Put it on Mobile Devices, give it away…
13
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
14
~35 million chemicals and growing
Data sourced from >500 different sources
Crowd sourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
Structure centric hub for web-searching
…and a really big dictionary!!!
15
ChemSpider
16
ChemSpider
17
Experimental/Predicted Properties
18
Literature references
19
Patents references
20
RSC Books
21
Google Books
22
Organic Chemistry is hard…
23
…it has alkynes of trouble
24
Flavors of Chemistry
25
Molfiles
[Example V2000 molfile: atom block listing C, C, O, C, O, C, C, N, O, N, terminated by M END]
26
Molfiles
Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
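To make the molfile layout concrete, here is a minimal Python sketch that reads the atom and bond blocks of a V2000 molfile using the fixed-width columns of the format. The embedded ethanol molfile and the function name `parse_v2000` are written for this illustration and are not taken from the slides. Real files from different drawing packages may need more tolerant handling.

```python
from typing import List, Tuple

# Hypothetical hand-written V2000 molfile for ethanol (heavy atoms only).
EXAMPLE_MOLFILE = "\n".join([
    "ethanol",                 # header line 1: molecule name
    "  hand-written example",  # header line 2: program / timestamp
    "",                        # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",  # counts line: 3 atoms, 2 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",   # bond: atom 1 - atom 2, single
    "  2  3  1  0  0  0  0",   # bond: atom 2 - atom 3, single
    "M  END",
])

Atom = Tuple[str, float, float, float]  # (symbol, x, y, z)
Bond = Tuple[int, int, int]             # (atom index, atom index, bond order)


def parse_v2000(molfile: str) -> Tuple[List[Atom], List[Bond]]:
    """Read atoms and bonds from a V2000 molfile using fixed-width columns."""
    lines = molfile.split("\n")
    counts = lines[3]
    if "V2000" not in counts:
        raise ValueError("not a V2000 molfile")
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms: List[Atom] = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the symbol follows a space.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))

    bonds: List[Bond] = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


atoms, bonds = parse_v2000(EXAMPLE_MOLFILE)
```

Because the Z column is zero throughout, this example carries only a 2D layout, which matches the "most commonly carry X,Y coordinates" point above.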
27
Stereo
28
Tautomeric forms
29
Chemists are good…
30
The InChI Identifier
31
InChI
SINGLE code base managed by IUPAC – integrated into drawing packages. No variability, unlike SMILES
InChI Strings can be reversed to structures – same problem as with SMILES – no layout
Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
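As a rough illustration of how an InChI string is organised, the sketch below splits one into its slash-separated layers. The function name `split_inchi` and the layer labels are choices made for this example. It assumes the common case where the formula is the first layer after the version, which does not hold for every InChI. The ethanol string is the standard InChI for ethanol.

```python
from typing import List, Tuple

# Single-letter layer prefixes mapped to descriptive labels (labels are this sketch's own).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo parity inversion",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
    "r": "reconnected",
}


def split_inchi(inchi: str) -> List[Tuple[str, str]]:
    """Split an InChI string into (layer name, value) pairs, in order."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("string does not start with 'InChI='")
    parts = inchi[len(prefix):].split("/")
    layers = [("version", parts[0])]
    if len(parts) > 1:
        # Assumption: the second field is the molecular formula (no prefix letter).
        layers.append(("formula", parts[1]))
    for part in parts[2:]:
        layers.append((LAYER_NAMES.get(part[0], part[0]), part[1:]))
    return layers


ethanol_layers = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

A list of pairs is used rather than a dictionary because some prefixes can appear more than once in a single string.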
32
Multiple Layers
33
Tautomers
34
Stereo
35
InChIStrings Hash to InChIKeys
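The InChIKey is a fixed-length, 27-character hash of the InChI string. The sketch below does not compute the hash. It only checks the layout and splits a key into its blocks, based on the published key format as understood here: a 14-letter first block derived from the connectivity, an 8-letter second block for the remaining layers, a standard/non-standard flag, a version letter, and a protonation letter. The names `parse_inchikey` and `same_skeleton` are hypothetical. The alanine keys are quoted from memory of public database records and should be treated as illustrative.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton_block: str  # 14 letters, hashed from formula and connectivity
    detail_block: str    # 8 letters, hashed from the remaining layers
    flag: str            # 'S' = standard InChI, 'N' = non-standard
    version: str         # InChI version letter
    protonation: str     # 'N' = neutral


_KEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> InChIKeyParts:
    """Check the 14-10-1 layout of an InChIKey and split it into blocks."""
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the first block, i.e. the same connectivity hash."""
    return parse_inchikey(key_a).skeleton_block == parse_inchikey(key_b).skeleton_block


ethanol = parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
# Illustrative pair: alanine with and without defined stereochemistry.
alanine_match = same_skeleton(
    "QNAYBMKLOCPYGJ-REOHFPOOSA-N",
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
)
```

Comparing only the first block is one plausible way to implement a skeleton-style search, while comparing the full key corresponds to an exact search.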
36
Structure search the web
37
Exact Search
38
Skeleton Search
39
Data Quality/Standardization
MANY structures online that are meant to represent a particular chemical are MISREPRESENTED.
Commonly you will have better success finding information by name searches than structure – with many caveats of course…
Validating chemical structure representations is laborious work – and it’s shocking to review data…
40
Data Quality Issues Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
41
Data quality is a known issue
42
Data quality is a known issue
43
Patent data in public databases
44
Patent data in public databases
45
You just can’t trust atoms!
46
Depiction vs Accurate Representation
47
Depiction vs Accurate Representation
48
What is the Structure of Vitamin K1?
49
Data Quality Issues and $$$$
50
Many Names, One Structure
51
But big and often noisy
52
Text Mining on IUPAC Names
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
53
Text Mining on IUPAC Names
54
Name to Structure Conversion
55
Name to Structure Conversion
56
What could we get?
US A1: the melting point and both NMR spectra are associated with the compound.
Other physical quantities, e.g. volumes, pressures etc., are also detected.
57
PhysChem first: Melting Points
Melting/sublimation/decomposition points extracted for 287,635 distinct compounds from USPTO patent applications/grants
Sanity checks used to flag dubious values – probably 130-4°C
Non-melting outcomes recorded e.g. mp °C. (subl.)
What models could be built?
Mostly melting points (as opposed to sublimation/decomposition).
Dubious values usually mistakes in the original document e.g. in this case probably a missing hyphen.
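The extraction and sanity-check idea can be sketched in a few lines of Python. This is not the pipeline used for the patent data. The regular expression, the 500 °C plausibility threshold, the sample sentence and the function names are all assumptions made for illustration. It expands shorthand ranges such as 130-4 to 130-134, records non-melting outcomes such as (subl.), and flags implausibly high values of the kind a missing hyphen would produce.

```python
import re
from typing import Dict, List

MP_PATTERN = re.compile(
    r"\b(?:m\.p\.|mp|melting point)[:\s]*"        # label
    r"(?P<low>\d+(?:\.\d+)?)"                      # first temperature
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"      # optional range end
    r"\s*°\s*C\.?"                                 # unit
    r"(?:\s*\((?P<note>subl|dec)[^)]*\))?",        # optional outcome note
    re.IGNORECASE,
)

MAX_PLAUSIBLE_C = 500.0  # assumed threshold for flagging dubious values


def expand_range_end(low: str, high: str) -> float:
    """Expand shorthand such as '130-4' so that the range end becomes 134."""
    if "." not in low and "." not in high and len(high) < len(low):
        return float(low[:len(low) - len(high)] + high)
    return float(high)


def extract_melting_points(text: str) -> List[Dict[str, object]]:
    """Find melting point statements and flag values that look dubious."""
    records = []
    for m in MP_PATTERN.finditer(text):
        low = float(m.group("low"))
        high = expand_range_end(m.group("low"), m.group("high")) if m.group("high") else low
        records.append({
            "low_c": low,
            "high_c": high,
            "outcome": (m.group("note") or "melt").lower(),
            "dubious": high < low or high > MAX_PLAUSIBLE_C,
        })
    return records


# Hypothetical patent-style text written for this example.
SAMPLE = "White solid, mp 130-4 °C. Second crop: mp 1304 °C. Third compound: m.p. 210 °C. (subl.)"
records = extract_melting_points(SAMPLE)
```

A flagged record is kept rather than discarded, so a curator can decide whether the original document simply dropped a hyphen.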
58
Modeling “BIG data”
Melting point models developed with ca. 300k compounds
Required 34 GB memory and about 400 MB disk space (zipped)
Matrix with 2×10^11 entries (300k molecules × 700k descriptors)
>12k core-hours (>600 CPU-days) for parameter optimization
Parallelized on >600 cores with up to 24 cores per task
Consensus model as average of individual models
Accuracy of consensus model is ~33.6 °C for drug-like region compounds
Models publicly available at
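The matrix size quoted above follows directly from the two dimensions, and a consensus prediction is simply the mean of the individual model outputs. The short sketch below checks that arithmetic. The 4-bytes-per-value figure and the three example predictions are assumptions for illustration, not values from the study. A dense matrix of this size would need far more than the 34 GB reported, which suggests, though the slides do not say so, that a sparse representation was used.

```python
from statistics import mean
from typing import Sequence

N_MOLECULES = 300_000
N_DESCRIPTORS = 700_000


def matrix_entries(n_rows: int, n_cols: int) -> int:
    """Number of cells in a descriptor matrix."""
    return n_rows * n_cols


def dense_size_gb(entries: int, bytes_per_value: int = 4) -> float:
    """Memory for a dense matrix in decimal GB (4-byte floats assumed)."""
    return entries * bytes_per_value / 1e9


def consensus_prediction(predictions_c: Sequence[float]) -> float:
    """Consensus model output: the average of the individual model predictions."""
    return mean(predictions_c)


entries = matrix_entries(N_MOLECULES, N_DESCRIPTORS)  # 2.1e11, i.e. about 2×10^11
dense_gb = dense_size_gb(entries)

# Hypothetical melting point predictions (°C) from three individual models.
consensus = consensus_prediction([150.0, 162.0, 156.0])
```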
59
A Recent Talk http://www.slideshare.net/AntonyWilliams/
60
ESI – Text Spectra
61
ChemSpider ID H1 NMR
62
We want to find text spectra?
We can find and index text spectra:
13C NMR (CDCl3, 100 MHz): δ = (CH3), (CH, benzylic methane), (CH, benzylic methane), (CH2), (CH2), , , , , , , , , , (ArCH), 99.42, , , , , , , , (ArC)
What would be better are spectral figures – and include assignments where possible!
63
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
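A text spectrum like the one above is regular enough to index with a pair of regular expressions. The following is a minimal sketch, not the indexing code behind the slides. The patterns and the names `Peak`, `parse_header` and `parse_peaks` are assumptions for this example, and they only cover the notation seen in this one spectrum.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Peak(NamedTuple):
    shift_ppm: float
    shift_high_ppm: Optional[float]  # set when a range such as 7.18–7.94 is given
    multiplicity: str
    n_hydrogens: int


HEADER_RE = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR \((?P<solvent>[^,]+),\s*(?P<mhz>\d+) MHz\)"
)
# A peak is a shift (or shift range) followed by "(multiplicity, nH".
PEAK_RE = re.compile(
    r"(?P<lo>\d+\.\d+)(?:\s*[–-]\s*(?P<hi>\d+\.\d+))?"
    r"\s*\((?P<mult>[a-z]+),\s*(?P<n>\d+)H"
)

SPECTRUM = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)


def parse_header(text: str) -> Dict[str, str]:
    """Extract nucleus, solvent and spectrometer frequency from the header."""
    match = HEADER_RE.search(text)
    if match is None:
        raise ValueError("no NMR header found")
    return match.groupdict()


def parse_peaks(text: str) -> List[Peak]:
    """Extract shift, multiplicity and hydrogen count for each reported peak."""
    peaks = []
    for m in PEAK_RE.finditer(text):
        high = float(m.group("hi")) if m.group("hi") else None
        peaks.append(Peak(float(m.group("lo")), high, m.group("mult"), int(m.group("n"))))
    return peaks


header = parse_header(SPECTRUM)
peaks = parse_peaks(SPECTRUM)
total_hydrogens = sum(p.n_hydrogens for p in peaks)
```

Coupling constants such as 4.8 Hz are skipped because they are not followed by an opening parenthesis. Summing the hydrogen counts gives a quick consistency check against the molecular formula.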
64
2,316,005 distinct spectra in 2001-2015 USPTO
NMR Spectra
2,316,005 distinct spectra in USPTO

Nucleus    Count
H
C          173970
Unknown    107439
F          22158
P          16333
B          980
Si         715
Pt         275
N          170
V          101

Unknown spectra are almost always hydrogen. As carbon shifts are so different to hydrogen a very crude check could partition the unknowns into proton and carbon NMR.
Small numbers of other obscure spectra also found (but also false positives due to really bizarre “OCR” errors of hydrogen or the like, i.e. 1 in a million errors :-p)
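The "very crude check" mentioned above could look something like the following sketch. The 15 ppm cut-off and the function name `guess_nucleus` are assumptions made here, chosen because proton shifts rarely exceed that value while carbon spectra routinely do. Apart from 99.42, which appears in the carbon spectrum quoted earlier, the carbon shift values are made up.

```python
from typing import Sequence

PROTON_MAX_PPM = 15.0  # assumed cut-off: shifts above this are unlikely to be 1H


def guess_nucleus(shifts_ppm: Sequence[float]) -> str:
    """Crudely label an unlabelled spectrum as 1H or 13C from its largest shift."""
    if not shifts_ppm:
        return "unknown"
    return "13C" if max(shifts_ppm) > PROTON_MAX_PPM else "1H"


proton_like = guess_nucleus([2.57, 4.24, 6.95, 7.94])
carbon_like = guess_nucleus([21.0, 99.42, 128.0])  # 21.0 and 128.0 are hypothetical
```

Spectra of other nuclei would be mislabelled by this rule, so it only makes sense for the pool the slide describes, where unknowns are almost always hydrogen.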
65
ESI Data also contains figures
66
“Where is the real data please?”
FIGURE
67
Data added to ChemSpider
68
Visibility Means Discoverability
Q: Does a Social Profile as a scientist matter?
You are visible, when you share your skills, experience and research activities by:
Establishing a public profile
Getting on the record
Collaborative Science
Demonstrating a skill set
Measured using “alternative metrics”
Contributing to the public peer review process
There are many ways to become “visible”
69
Scientists measured by Impact
70
How to Measure Impact
71
Your Research Outputs?
Research datasets
Scientific software
Publications – peer-reviewed and many others
Posters and presentations at conferences
Electronic theses and dissertations
Performances in film and audio
Lectures, online classes and teaching activities
What else???
The possibilities to share are endless
72
Open Researcher & Contributor ID
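An ORCID iD is a 16-character identifier whose final character is a check digit computed with the ISO 7064 MOD 11-2 scheme, so a mistyped iD can often be caught before it is used. The sketch below shows that calculation. The function names are this example's own, and the iD used is the sample record from ORCID's public documentation, not the presenter's.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length and check digit of an iD such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


sample_ok = is_valid_orcid("0000-0002-1825-0097")   # documentation sample iD
sample_bad = is_valid_orcid("0000-0002-1825-0098")  # last digit altered
```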
73
Here’s why they are useful…
74
Wonderful Profile…
75
CONTRIBUTE to the community
Share your expertise in the new world of open
Share your Figures, share your data
Contribute to Wikis – Wikipedia and others
Participate in Open Notebook Science
Build tools and platforms to support chemists
Curate, use and comment on data
Get engaged on blogs and discussions
76
Oxidation by Sodium Hydride?
77
The Blogosphere Analyzes…
78
The Blogosphere Analyzes…
79
The new world of micropublishing
80
ChemSpider SyntheticPages
81
Micropublishing with Peer Review (a chemical synthesis blog?)
82
Multi-Step Synthesis
83
Interactive Data
84
You should be LinkedIn
LinkedIn for “professionals”
Expose work history, skills, your professional interests, your memberships – your profile WILL be watched!
Who you are linked to says a lot about who you are.
Get Linked to people in your domain.
Professional relationships rather than just friendships.
Facebook is for friends
85
LinkedIn http://www.linkedin.com/in/AntonyWilliams
86
My Career Captured…
87
And “Endorsements”
88
Highlight “Projects”
89
Manage Articles Here Too.
90
…and presentations
91
My Google Scholar Profile http://scholar.google.com/citations
92
“I don’t have any publications”
This is YOUR choice!
Conference Abstracts…
You produce reports, presentations and posters during your studies – share them!
93
Slideshare – Highly Accessed
94
Slideshare – EXPANDED Audience
95
Fast Network Communication
96
Slideshare – NOT Just Slides
97
ResearchGate https://www.researchgate.net/profile/Antony_Williams
98
ResearchGate
99
ResearchGate
100
I have a set of statistics & profiles
My Blog:
Twitter:
ORCID:
Amazon Author Page: Follow Link to Author Page
My Klout:
LinkedIn:
SlideShare:
Google Scholar Citations Profile: Antony Williams Citations
Wikipedia:
101
The Power of Social Media
102
I recommend…
Register for an ORCID ID – then use it
Develop your LinkedIn profile
Publish to Slideshare
Track Google Scholar Citations (for now)
Choose: ResearchGate or Academia.edu
Set up an About.ME page to link everything
Participate in building your profile
103
Thank you
ORCID:
Personal Blog:
SLIDES: