Standards for Digital Data Representation 1) The IUPAC/NIST Chemical Identifier 2) IUPAC Terminology NSF Workshop Constructing a Kinetics Database NIST, April 19-20, 2004


1 Standards for Digital Data Representation
1) The IUPAC/NIST Chemical Identifier
2) IUPAC Terminology
NSF Workshop: Constructing a Kinetics Database
NIST, April 19-20, 2004

2 The News
Bad News:
– There are more problems than you thought
Good News:
– NIST/IUPAC are trying to solve them for you

3 Data Tags STM – Scientific, Technical, Medical ‘Publication’ thermokinetics spectroscopy synthesis Chemistry

4 Data Tags IUPAC/NIST Chemical Identity – INChI Interdisciplinary Terms – Gold & Green STM – Scientific, Technical, Medical ‘Publication’ Chemistry

5

6 A Digital ‘Name’ for a Chemical Entity
Convert chemical structure to a digital ‘signature’
To allow computers to:
– Organize chemical data
– Disseminate data (queries)
– Manage quality control

7 Current Representations are Inadequate
Drawing
– For humans only
CAS registry number
– Arbitrary value (hard to find and confirm)
– CAS indexer may not match specialist
– Expensive, imprecise, incomplete, no hierarchy
Connection table
– One compound, many representations
– Embedded ambiguities
‘Canonical’ connection table
– No open standard

8 Reactive Intermediates
Ions, radicals, excited states
– In principle, no problem
Equilibrated species
– Must specify variability precisely
Weakly bound complexes
– OK if orientation is omitted
Transition states
– Maybe not necessary in data compilation

9 ChemWeb, 3/2002

10

11 Nature, May 23, 2002

12 Requirements
Different compounds have different identifiers
– All distinguishing structural information is included
[Figure labels: INChI - 1, INChI - 2]

13 Requirements
One compound has only one identifier
– Include only necessary information
[Figure label: Same INChI]

14 Two Problems
Chemicals
– Fast isomerization (esp. H-atoms)
– Unconventional connectivity
Chemists
– Differing conventions: depends on discipline, education and convenience
– Imprecision/uncertainty

15 3 Steps to INChI
Chemistry
– ‘Normalize’ input structure: implement chemical rules
Math
– ‘Canonicalize’ (label the atoms): equivalent atoms get the same label
Format
– ‘Serialize’ labeled structure: output as character string (‘name’)
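The three steps on this slide can be sketched as a toy pipeline. This is not the INChI algorithm: the input format (a list of element symbols plus `(atom, atom, order)` bonds), the `toy/...` output string, and every function name are assumptions made for illustration, and the brute-force canonicalization only works for very small molecules.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Toy chemical rules: ignore bond order, fold H atoms into counts."""
    heavy = [i for i, el in enumerate(atoms) if el != "H"]
    new_index = {old: new for new, old in enumerate(heavy)}
    elements = [atoms[i] for i in heavy]
    h_count = [0] * len(heavy)
    edges = set()
    for a, b, _order in bonds:  # bond order is deliberately discarded
        if a in new_index and b in new_index:
            edges.add((new_index[a], new_index[b]))
        elif a in new_index:
            h_count[new_index[a]] += 1
        elif b in new_index:
            h_count[new_index[b]] += 1
    return elements, h_count, edges

def canonicalize(elements, h_count, edges):
    """Brute force: try every atom ordering, keep the smallest description."""
    best = None
    for order in permutations(range(len(elements))):
        position = {old: new for new, old in enumerate(order)}
        atom_part = tuple((elements[old], h_count[old]) for old in order)
        bond_part = tuple(sorted(
            tuple(sorted((position[a], position[b]))) for a, b in edges))
        candidate = (atom_part, bond_part)
        if best is None or candidate < best:
            best = candidate
    return best

def serialize(canonical):
    """Write the labeled structure as a character string (hypothetical format)."""
    atom_part, bond_part = canonical
    atoms_text = ",".join(f"{el}H{h}" for el, h in atom_part)
    bonds_text = ",".join(f"{a}-{b}" for a, b in bond_part)
    return f"toy/{atoms_text}/{bonds_text}"

def identifier(atoms, bonds):
    return serialize(canonicalize(*normalize(atoms, bonds)))

# Ethanol entered with two different atom numberings, and its isomer dimethyl ether.
ethanol_a = (["C", "C", "O", "H", "H", "H", "H", "H", "H"],
             [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
              (1, 6, 1), (1, 7, 1), (2, 8, 1)])
ethanol_b = (["O", "H", "C", "H", "H", "C", "H", "H", "H"],
             [(0, 1, 1), (0, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1),
              (5, 6, 1), (5, 7, 1), (5, 8, 1)])
dimethyl_ether = (["C", "O", "C", "H", "H", "H", "H", "H", "H"],
                  [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
                   (2, 6, 1), (2, 7, 1), (2, 8, 1)])

print(identifier(*ethanol_a))
print(identifier(*ethanol_b))
print(identifier(*dimethyl_ether))
```

Both ethanol inputs give the same string and dimethyl ether gives a different one, which is the pair of requirements stated on slides 12 and 13.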

16 Normalize
Simplify
Divide structure into ‘layers’
– Each layer ‘refines’ structure
Ignore ‘electron density’
– Ignore bond type and electron location
Stereochemistry
– sp² and sp³ only
– Free rotation around single bonds

17 Chemical Substances “Layers”
formula, connectivity, stereo, isotope
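To make the layer idea concrete, the sketch below splits an identifier string into the layers named on this slide. It assumes the string format of the later released InChI version 1 (segments separated by `/`, each tagged with a one-letter prefix), which may differ from the 2004 beta format shown in the talk. The function name and the grouping table are illustrative, and the sketch does not handle the repeated prefixes that follow a fixed-H (`/f`) segment.

```python
# One-letter segment prefixes, grouped under the layer names used on the slide.
# Assumption: prefixes as used by released InChI v1 strings.
SLIDE_LAYER = {
    "c": "connectivity", "h": "connectivity",
    "b": "stereo", "t": "stereo", "m": "stereo", "s": "stereo",
    "i": "isotope",
}

def split_layers(identifier):
    """Split 'InChI=<version>/<formula>/<segments...>' into named layers."""
    head, sep, body = identifier.partition("=")
    if head != "InChI" or not sep:
        raise ValueError("not an InChI string")
    version, formula, *segments = body.split("/")
    layers = {"version": version, "formula": formula}
    for segment in segments:
        if not segment:
            continue
        name = SLIDE_LAYER.get(segment[0], "other")
        layers.setdefault(name, {})[segment[0]] = segment[1:]
    return layers

# Ethanol: formula layer plus two connectivity segments (skeleton and H-atoms).
print(split_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Because each segment only refines what came before, two strings can be compared layer by layer, for example by ignoring the stereo or isotope entries.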

18 4 Connectivity ‘Sublayers’
Disconnect H-atoms and metals
– Create skeleton
Reconnect fixed H-atoms
– Represent multiple species
Reconnect mobile H-atoms
– A single species
Reconnect metal-non-metal bonds
– Represent bonds to metals
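A minimal sketch of the first sublayer step, assuming a toy input of element symbols and atom-index pairs: dropping every bond that touches an H-atom or a metal leaves the skeleton. The `METALS` set is an illustrative subset and `skeleton` is a hypothetical helper, not part of any INChI software.

```python
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}  # illustrative subset only

def skeleton(atoms, bonds):
    """Keep only bonds whose two ends are neither H nor a listed metal."""
    def kept(i):
        return atoms[i] != "H" and atoms[i] not in METALS
    return {tuple(sorted((a, b))) for a, b in bonds if kept(a) and kept(b)}

# Sodium acetate drawn two ways (hydrogens omitted for brevity).
atoms = ["C", "C", "O", "O", "Na"]
covalent_drawing = [(0, 1), (1, 2), (1, 3), (3, 4)]  # explicit O-Na bond
ionic_drawing = [(0, 1), (1, 2), (1, 3)]             # Na drawn as a separate ion

print(skeleton(atoms, covalent_drawing) == skeleton(atoms, ionic_drawing))
```

The two drawings differ only in how the bond to the metal is shown, so they share one skeleton; the later sublayers would then add the H-atoms and metal bonds back as refinements.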

19 Ignore Electron Density
Not required for compound identification
– Represent ‘excited states’
Simplify representations
– Delocalization, aromaticity, zwitterions, coordination …
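As a small illustration of why ignoring bond type simplifies matters, the sketch below takes the two Kekulé drawings of a benzene ring, which differ only in where the double bonds are placed, and reduces both to the same set of connections. The bond-list format and the name `strip_electrons` are assumptions for this example.

```python
def strip_electrons(bonds):
    """Discard bond orders; keep only which atoms are connected."""
    return frozenset(frozenset((a, b)) for a, b, _order in bonds)

# Six ring carbons numbered 0..5; the third value is the drawn bond order.
kekule_a = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1)]
kekule_b = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 0, 2)]

print(set(kekule_a) == set(kekule_b))                          # drawings differ
print(strip_electrons(kekule_a) == strip_electrons(kekule_b))  # connections agree
```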

20 Münchnones Simplify - Ignore Electrons

21 Mobile H-atom (Tautomer) Sublayer H-migration between 1,3 heteroatoms
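A rough sketch of the 1,3 pattern named here: it lists pairs of heteroatoms that share a neighbor, where the first atom of the pair carries at least one H that could in principle migrate to the second. This checks only that geometric condition. The actual INChI tautomer rules are more detailed, and the input format and function name are assumptions.

```python
def mobile_h_candidates(elements, h_count, edges):
    """Return (donor, acceptor) heteroatom pairs in a 1,3 relationship."""
    neighbours = {i: set() for i in range(len(elements))}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    hetero = [i for i, el in enumerate(elements) if el not in ("C", "H")]
    pairs = []
    for donor in hetero:
        if h_count[donor] == 0:
            continue  # nothing to migrate from this atom
        for acceptor in hetero:
            # 1,3 relationship: donor and acceptor share a common neighbor
            if acceptor != donor and neighbours[donor] & neighbours[acceptor]:
                pairs.append((donor, acceptor))
    return pairs

# Formamide skeleton O=CH-NH2: atoms O(0), C(1), N(2) with attached H counts.
formamide = (["O", "C", "N"], [0, 1, 2], [(0, 1), (1, 2)])
print(mobile_h_candidates(*formamide))
```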

22 Nitrobenzene

23 MSG tautomeric

24 MSG fixed

25 Ferrocene

26 Auxiliary Output
Confirmation
– Label stereogenic atoms
– Identify equivalent atoms
Warnings/Errors
– Unusual valences
– Unrecognized input
‘Reversibility’
– Coordinates
– Bond/charge location

27 Testing - OK

28 Beta Testing

29 Performance: Most Challenging NCI-NIH Structure
50 ms on a 2 GHz PC

30 INChI FAQs
How can you represent chemistry without electrons?
– Chemistry is not represented, just identity
– Whole-molecule properties may be added (state, phase, ...)
Do big molecules have big INChIs?
– Yes, just like systematic names
How to handle other tautomer types, substructures, ...?
– Other software
Is INChI reversible?
– Partly: it contains only the data needed for ‘naming’
– Auxiliary fields can carry structure depiction information
Is INChI extensible?
– New layers can add refinement

31 Started Oct. 2002

32

33

34

35

36 http://www.nicmila.org/Gold/Output / Miloslav Nic, Jiri Jirat, Czech Republic

37 Converted - XML

38

39

40

41 My Point of View
A forest of data dictionaries is growing
– Horizontally and vertically
We need to consider forest management
Some day all reusable data will be tagged

