Capturing Chemistry in XML/CML J. A. Townsend *, S. E. Adams *, J. M. Goodman *, P. Murray-Rust *, C. A. Waudby * Capturing Chemistry in XML/CML ACS March.

1 Capturing Chemistry in XML/CML J. A. Townsend *, S. E. Adams *, J. M. Goodman *, P. Murray-Rust *, C. A. Waudby * Capturing Chemistry in XML/CML ACS March 2004 * Unilever Centre for Molecular Informatics, University of Cambridge

2 The Agony Of Publication - Loss The World

3 The Agony Of Publication - Loss The World Sad The Scientist The Lab Journals Web Pages

4 The Vision-1 Human-readable: mp 65-66 °C Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
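
The step from the human-readable string to the machine-readable element can be sketched in a few lines of Python. This is only an illustration, not the authors' converter: the function name `melting_range_to_scalar` is hypothetical, and the attribute values are copied from the slide.

```python
import re
import xml.etree.ElementTree as ET


def melting_range_to_scalar(text):
    """Turn a string such as 'mp 65-66 °C' into a CML-style <scalar> element.

    Hypothetical helper: the element and attribute names follow the slide,
    everything else is an assumption made for illustration.
    """
    # Two numbers separated by a hyphen or an en dash.
    match = re.search(r"(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"no melting range found in {text!r}")
    low, high = match.groups()
    return ET.Element("scalar", {
        "dictRef": "ccml:mp",
        "units": "units:c",
        "minValue": low,
        "maxValue": high,
    })


element = melting_range_to_scalar("mp 65-66 °C")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

The values are kept as strings so that the original precision of the reported range is not altered.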

5 The Vision-2 Chemists can carry on doing what they want, but also: reuse chemistry; archive data; ensure validity of data; create new sources of data / molecules

6 Our Approach Let chemists use familiar programs … and document templates; focus on journal articles, theses, CompChem; create data for knowledge-based discovery; let computers do the work. Evolution…

7 Machine Parsing of Chemistry Structured (CompChem), Semi-Structured (Articles), Unstructured (Discussion): MACHINE PARSING? Structured documents and data in XML

8 How? Article (semi-structured): Abstract, Discussion, Experimental. Add structure; parse with regular expressions; legacy to CML converters

9 Regular Expressions Melting point, two possible syntaxes: "m.p. > 23.5 °C" and "mp 23.5 – 25 °C". Pattern: [Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C Key: [Mm] capital or lowercase 'm'; \.? maybe '.'; p lowercase 'p'; \p{Punct}? any punctuation; \s? maybe whitespace; \d* 0 or more digits; °? maybe degrees sign; C capital 'C'
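
A minimal Python sketch of the same idea follows. It is not the pattern from the slide: Python's standard `re` module does not support `\p{Punct}`, so a small explicit punctuation class is used instead, and the range part is made optional so that one pattern covers both example syntaxes. The names `MP_PATTERN` and `parse_melting_point` are hypothetical.

```python
import re

# Simplified melting-point pattern (an assumption, not the OSCAR pattern).
MP_PATTERN = re.compile(
    r"""
    [Mm]\.?\s?p\.?                               # 'mp', 'm.p.', 'Mp', ...
    [:;,]?\s*                                    # optional punctuation, whitespace
    (?P<gt>>)?\s*                                # optional 'greater than' sign
    (?P<low>\d+(?:\.\d+)?)                       # first temperature
    (?:\s*[-\u2013]\s*(?P<high>\d+(?:\.\d+)?))?  # optional upper end of a range
    \s*°?\s*C                                    # optional degree sign, capital 'C'
    """,
    re.VERBOSE,
)


def parse_melting_point(text):
    """Return the melting point found in text as a dict, or None if absent."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    low = float(match.group("low"))
    # A single value is treated as a range of zero width.
    high = float(match.group("high")) if match.group("high") else low
    return {"min": low, "max": high, "greater_than": match.group("gt") is not None}


print(parse_melting_point("m.p. > 23.5 °C"))
print(parse_melting_point("mp 23.5 \u2013 25 °C"))
```

Named groups keep the extracted values separate from the matching logic, which makes it easier to map them onto attributes such as minValue and maxValue later.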

10 CML - XML For Chemistry Based on W3C XML Schemas; 300+ components; customisable; extensible through dictionaries; openly available software. J. Chem. Inf. Comp. Sci., 2003, 43, 757

11 The CML Family Controlled XML namespaces: CMLCore – compounds and properties; CMLReact – reactions; CMLSpect – spectra; *CMLComp – compChem; CMLCryst – crystallography and condensed matter. Interoperates with HTML, MathML, SVG, *AniML (spectra: ANSI/JCAMP), *ThermoML (thermochemistry: NIST), etc. J. Chem. Inf. Comp. Sci., 2003, 43, 757

12 Case Studies Parsing output from 750,000 MOPAC jobs; high-throughput parsing of journals

13 CompChem Logs Coordinates, Molecular Formula, Calculation Type, Point Group, Dipole, Total Energy

14 Loss From CompChem Coordinates, Molecular Formula, Calculation Type, Dipole, Total Energy, Ionisation Potential

16 CompChem Output Parsing Data Coordinates Energy Levels Vibrations Coordinates Energy Level Vibration CML File CMLCore CMLComp CMLSpect Input/jobControl General Parsers

17 Display Process 1 CompChem Log Xindice CML XSLT

18 Display Process 2 CML File CMLCore CMLComp CMLSpect compChem Output 3D structure, electronic properties Coordinates Energy Levels Vibrations Input/jobControl XSLT Display Normal modes 2D structure, thermodynamic properties

19 Parsing Data Dictionary Entry: The point group of a molecule... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed. D [debye] ParentSI: c.m Multiplier: 3.335641E-30 CGS units for electric dipole

20 Dictionaries <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" /> Linked to CML schema. Accesses CCML namespace. Units dictionary: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"; id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
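
A units dictionary entry of this kind carries enough information to convert a value to its parent SI unit. The sketch below assumes the conversion is value × multiplierToSI + constantToSI, which the slides do not state explicitly. The class `UnitEntry` is hypothetical, the Celsius numbers come from this slide, and the debye multiplier comes from slide 19 (the id "debye" is assumed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitEntry:
    """One units-dictionary entry (hypothetical structure for illustration)."""
    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value):
        # Assumed rule: scale first, then add the constant offset.
        return value * self.multiplier_to_si + self.constant_to_si


# Values as given on the slides; the debye id is an assumption.
CELSIUS = UnitEntry("celsius", "k", 1.0, 273.15)
DEBYE = UnitEntry("debye", "c.m", 3.335641e-30)

melting_range_kelvin = (CELSIUS.to_si(65.0), CELSIUS.to_si(66.0))
dipole_si = DEBYE.to_si(1.85)  # 1.85 D is an arbitrary example value

print(melting_range_kelvin)
print(dipole_si)
```

Keeping the multiplier and the constant as separate fields lets one rule serve both purely multiplicative units (debye) and offset units (Celsius).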

21 OSCAR Open Source Chemistry Analysis Routines Sponsored by the Royal Society of Chemistry (Cambridge) Mounted on http://www.rsc.org/

22 Article Structure Article: Front Matter, Abstract, Introduction, Results, Discussion, Experimental, References

27 Article Structure Article: Front Matter, Abstract, Introduction, Results, Discussion, Experimental, References. Experimental: Compound Name, Synthesis, Set up, Analysis

28 Information Checked / Extracted Chemical name; Yield; Boiling / Melting point; Carbon NMR; Hydrogen NMR; Infra Red spectrometry; Mass spectrometry; Elemental Analysis; Optical Rotation; Refractive Index; Rf value; Ultra Violet spectrometry; Nature (colour, state, modifiers, description, etc.)

29 OSCAR Parsing Data H NMR Nature HRMS

30 OSCAR Parsing Data

31 OSCAR Data Found Results from one paper

32 OSCAR Error Checking Serious Error; Warning Type 1; Warning Type 2

33 OSCAR Error Checking ~30 errors / warnings searched for. This article has: 4 errors, 2 warnings (type 1), 30 warnings (type 2). Elemental analysis incorrect – calculations are for a different molecular formula
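
One way such an elemental-analysis check could work is to recompute the percent composition from the stated molecular formula and compare it with the "calculated" values printed in the paper. The sketch below is an assumption about the approach, not OSCAR's code: the function names, the 0.4 percentage-point tolerance, the short atomic-mass table and the glucose example data are all illustrative.

```python
import re

# Standard atomic masses for a few common elements (illustrative subset).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def percent_composition(formula):
    """Mass percentage of each element in a simple formula such as 'C6H12O6'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    total = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {s: 100.0 * ATOMIC_MASS[s] * n / total for s, n in counts.items()}


def check_elemental_analysis(formula, reported_calc, tolerance=0.4):
    """List (element, reported, expected) where the reported value is off.

    An empty list means the reported 'calculated' values agree with the
    formula within the tolerance (an assumed threshold).
    """
    expected = percent_composition(formula)
    return [
        (element, value, round(expected.get(element, 0.0), 2))
        for element, value in reported_calc.items()
        if abs(expected.get(element, 0.0) - value) > tolerance
    ]


# Made-up example data: glucose with consistent and inconsistent values.
consistent = check_elemental_analysis("C6H12O6", {"C": 40.00, "H": 6.71})
inconsistent = check_elemental_analysis("C6H12O6", {"C": 62.04, "H": 10.41})
print(consistent)
print(inconsistent)
```

A non-empty result would correspond to the kind of error reported on the slide, where the calculated values belong to a different molecular formula.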

34 OSCAR Data Presentation

35 OSCAR Speed A typical paper contains ca. 20 compounds. JOC (Feb 2004) contains ~600 compounds: OSCAR could extract and tabulate in under 5 minutes. OBC (Feb 2004) contains ~300 compounds: OSCAR could extract and tabulate in under 3 minutes. High throughput, high precision

36 OSCAR Accuracy 92 % of data correctly identified; 3 % incorrect author entry; 5 % missed. False positives: 3 %. 437 items, ~10,000 data fields in test set, working with current regular expressions

37 XML-CML Databases CML Journals Theses CompChem XMLDb can support > 250,000 molecules Millisecond retrieval on INChI, properties Xindice

38 Capturing Molecules Encourage chemists to embed MDLMol or ChemDraw files in MS Word; autoconvert to CML connection table; autogenerate IUPAC INChI universal identifier. Next phase: parse chemical names into CML using modern NLP (Natural Language Processing), learning-machine rather than rule-based

39 NLP & Parsing Names KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride

40 Thank You Unilever, RSC, Jonathan Goodman, Sam Adams, Fraser Norton, Chris Waudby, Yong Zhang

