1
Capturing Chemistry in XML/CML
J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*
* Unilever Centre for Molecular Informatics, University of Cambridge
ACS March 2004
2
The Agony Of Publication - Loss
The World
3
The Agony Of Publication - Loss
Diagram labels: The World; Sad; The Scientist; The Lab; Journals; Web Pages
4
The Vision-1
Human-readable: mp 65-66 C
Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
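As a minimal sketch of the human-readable to machine-readable step, the following Python builds the slide's <scalar> element from a melting range. The function name is hypothetical; the attribute names and values are taken from the slide.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_c, max_c):
    """Serialise a melting range (degrees Celsius) as a <scalar> element."""
    scalar = ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",   # dictionary reference shown on the slide
            "units": "units:c",     # units reference shown on the slide
            "minValue": format(min_c, "g"),
            "maxValue": format(max_c, "g"),
        },
    )
    return ET.tostring(scalar, encoding="unicode")


# "mp 65-66 C" from the slide
print(melting_range_to_scalar(65, 66))
```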
5
The Vision-2
Chemists can carry on doing what they want
But also:
Reuse chemistry
Archive data
Ensure validity of data
Create new sources of data / molecules
6
Our Approach
Let chemists use familiar programs … and document templates
Focus on Journal Articles, Theses, CompChem
Create data for knowledge-based discovery
Let computers do the work
Evolution…
7
Machine Parsing of Chemistry
Structured (CompChem)
Semi-Structured (Articles)
Unstructured (Discussion)
Structured documents and data in XML
MACHINE PARSING ?
8
How?
Article (semi-structured): Abstract; Discussion; Experimental
Add Structure
Parse with Regular Expressions
Legacy to CML converters
9
Regular Expressions
Melting point: two possible syntaxes
m.p. > 23.5 °C
mp 23.5 – 25 °C
[Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C
[Mm]: capital or lowercase 'm'
\.?: maybe '.'
p: lowercase 'p'
\p{Punct}?: any punctuation
\s?: maybe whitespace
\d*: 0 or more digits
°?: maybe degrees sign
C: capital 'C'
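The idea can be sketched in Python as follows. This is an adaptation, not the authors' pattern: Python's standard re module does not support \p{Punct}, so an explicit punctuation class is swapped in, and the two syntaxes are written as an explicit alternation. The function name and the returned dictionary layout are assumptions made for illustration.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

MP_PATTERN = re.compile(
    rf"""
    [Mm]\.?p[.:;,]?          # 'mp', 'm.p.', 'Mp:' and similar
    \s+
    (?:
        (?P<low>{NUMBER})\s?[-\u2013]\s?(?P<high>{NUMBER})   # range syntax
      | >\s?(?P<above>{NUMBER})                              # lower-bound syntax
    )
    \s?°?\s?C                # optional degree sign, then capital C
    """,
    re.VERBOSE,
)


def parse_melting_point(text):
    """Return {'min': ..., 'max': ...} for the first melting point found, else None."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    if match.group("low") is not None:
        return {"min": float(match.group("low")), "max": float(match.group("high"))}
    # "m.p. > 23.5" gives only a lower bound
    return {"min": float(match.group("above")), "max": None}


for sample in ("m.p. > 23.5 °C", "mp 23.5 \u2013 25 °C"):
    print(sample, "->", parse_melting_point(sample))
```

Named groups keep the range and lower-bound cases apart, so the result can be mapped directly onto minValue and maxValue attributes.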
10
CML - XML For Chemistry
Based on W3C XML Schemas
300+ components
Customisable
Extensible through dictionaries
Openly available software
J. Chem. Inf. Comp. Sci., 2003, 43, 757
11
The CML Family
Controlled XML Namespaces:
CMLCore: compounds and properties
CMLReact: reactions
CMLSpect: spectra
* CMLComp: compChem
CMLCryst: crystallography and condensed matter
Interoperates with HTML, MathML, SVG, * AniML +, * ThermoML $, etc.
+ spectra: ANSI/JCAMP
$ thermochemistry: NIST
J. Chem. Inf. Comp. Sci., 2003, 43, 757
12
Case Studies
Parsing output from 750,000 MOPAC jobs
High-throughput parsing of journals
13
CompChem Logs
Labels: Coordinates; Molecular Formula; Calculation Type; Point Group; Dipole; Total Energy
14
Loss From CompChem
Labels: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential
16
CompChem Output Parsing Data
Diagram labels: Coordinates; Energy Levels; Vibrations; Coordinates; Energy Level; Vibration; CML File; CMLCore; CMLComp; CMLSpect; Input/jobControl; General Parsers
17
Display Process 1
Diagram labels: CompChem Log; Xindice; CML; XSLT
18
Display Process 2
Diagram labels: CML File; CMLCore; CMLComp; CMLSpect; compChem Output; 3D structure, electronic properties; Coordinates; Energy Levels; Vibrations; Input/jobControl; XSLT; Display; Normal modes; 2D structure, thermodynamic properties
19
Parsing Data
Dictionary Entry: The point group of a molecule... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed.
D [debye]: CGS units for electric dipole
ParentSI: c.m
Multiplier: 3.335641E-30
20
Dictionaries
<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
Linked to CML schema
Accesses CCML namespace
Units dictionary: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"
id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
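A units dictionary entry of this kind is enough to convert a value to its parent SI unit. The sketch below assumes the conversion is value × multiplierToSI + constantToSI, which the slides do not state explicitly. The Unit class and its field names are hypothetical; the celsius and debye numbers come from the dictionary entries shown in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One units-dictionary entry (field names are illustrative)."""
    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value):
        # Assumed rule: SI value = value * multiplierToSI + constantToSI
        return value * self.multiplier_to_si + self.constant_to_si


CELSIUS = Unit("celsius", "k", 1.0, 273.15)
DEBYE = Unit("debye", "c.m", 3.335641e-30)

print(CELSIUS.to_si(65))   # lower end of the melting range, in kelvin
print(DEBYE.to_si(1.0))    # one debye in coulomb metres
```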
21
OSCAR
Open Source Chemistry Analysis Routines
Sponsored by the Royal Society of Chemistry (Cambridge)
Mounted on http://www.rsc.org/
22
Article Structure
Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
27
Article Structure
Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
Experimental: Synthesis; Set up; Analysis; Compound Name
28
Information Checked / Extracted
Chemical name
Yield
Boiling / Melting point
Carbon NMR
Hydrogen NMR
Infrared spectrometry
Mass spectrometry
Elemental Analysis
Optical Rotation
Refractive Index
Rf value
Ultraviolet spectrometry
Nature (colour, state, modifiers, description, etc.)
29
OSCAR Parsing Data
Labels: H NMR; Nature; HRMS
30
OSCAR Parsing Data
31
OSCAR Data Found
Results from one paper
32
OSCAR Error Checking
Labels: Serious Error; Warning Type 1; Warning Type 2
33
OSCAR Error Checking
~30 errors / warnings searched for
This article has:
4 errors
2 warnings (type 1)
30 warnings (type 2)
Elemental analysis incorrect: calculations are for a different molecular formula
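The elemental-analysis error mentioned on this slide can be illustrated with a small consistency check: recompute the percentage composition from the stated molecular formula and compare it with the "calculated" values reported in the text. This is a sketch of the general idea, not OSCAR's implementation. The function names, the 0.05 tolerance, the rounded atomic masses, and the restriction to C, H, N, O formulae without brackets are all assumptions.

```python
import re

# Rounded standard atomic masses; only a few elements, for illustration
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def percent_composition(formula):
    """Mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_calculated(formula, reported_calc, tolerance=0.05):
    """Return elements whose reported 'calc.' value disagrees with the formula."""
    expected = percent_composition(formula)
    return {
        el: (value, round(expected.get(el, 0.0), 2))
        for el, value in reported_calc.items()
        if abs(expected.get(el, 0.0) - value) > tolerance
    }


# Consistent report for benzene, then values that belong to C6H12 instead
print(check_calculated("C6H6", {"C": 92.26, "H": 7.74}))
print(check_calculated("C6H6", {"C": 85.63, "H": 14.37}))
```

An empty result means the reported values agree with the formula; a non-empty result lists each mismatched element with its reported and recomputed percentages.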
34
OSCAR Data Presentation
35
OSCAR Speed
A typical paper contains ca. 20 compounds
JOC (Feb 2004) contains ~600 compounds: OSCAR could extract and tabulate in under 5 minutes
OBC (Feb 2004) contains ~300 compounds: OSCAR could extract and tabulate in under 3 minutes
High throughput, high precision
36
OSCAR Accuracy
92 % of data correctly identified
3 % incorrect author entry
5 % missed
False positives: 3 %
Test set: 437 items, ~10,000 data fields, working with current regular expressions
37
XML-CML Databases
Diagram labels: CML; Journals; Theses; CompChem; Xindice
XMLDb can support > 250,000 molecules
Millisecond retrieval on INChI, properties
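Retrieval on properties can be pictured as an XPath-style query over stored CML. The sketch below is an in-memory stand-in using Python's ElementTree, not the Xindice API. The <collection> and <molecule> wrappers, the ids, and the second melting range are hypothetical; only the <scalar> attributes follow the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-molecule collection; the scalar attributes follow the slides
CML_DOCS = """
<collection>
  <molecule id="m1">
    <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
  </molecule>
  <molecule id="m2">
    <scalar dictRef="ccml:mp" units="units:c" minValue="120" maxValue="122" />
  </molecule>
</collection>
"""


def molecules_melting_below(root, limit_c):
    """Return ids of molecules whose melting range ends below limit_c."""
    hits = []
    for molecule in root.findall("molecule"):
        # ElementTree supports the [@attribute='value'] XPath predicate
        scalar = molecule.find("scalar[@dictRef='ccml:mp']")
        if scalar is not None and float(scalar.get("maxValue")) < limit_c:
            hits.append(molecule.get("id"))
    return hits


root = ET.fromstring(CML_DOCS)
print(molecules_melting_below(root, 100))
```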
38
Capturing Molecules
Encourage chemists to:
Autogenerate IUPAC INChI universal identifier
Embed MDLMol or ChemDraw files in MS Word
Autoconvert to CML connection table
Next phase: Parse chemical names into CML using modern NLP +
Learning-machine rather than rule-based
+ Natural Language Processing
39
NLP & Parsing Names
KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
40
Thank You
Unilever
RSC
Jonathan Goodman
Sam Adams
Fraser Norton
Chris Waudby
Yong Zhang