1 Chemical Structure Representation and Search Systems Lecture 1. Oct 28, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software.

1 1 Chemical Structure Representation and Search Systems Lecture 1. Oct 28, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software & Consultancy Services Sheffield, UK

2 2 Purpose of my 7 lectures How do you store chemical structures on computer? What can you do with them there? How do the computer systems used in chemical informatics work? Data Structures + Algorithms

3 3 Lecture topics
  Oct 28  Introduction to structure representation; Introduction to graph theory [video link]
  Oct 30  Problems of structure representation [video link]
  Nov 4   More graph theory; Structure analysis and processing [video link]
  Nov 11  Structure searching I [video link]
  Nov 13  Structure searching II [video link]
  Nov 18  Chemical similarity [Indianapolis]
  Nov 20  Cluster analysis etc. [Bloomington]

4 4 John Barnard B.Sc. in Biochemistry (Birmingham, UK) M.Sc. and Ph.D in Information Studies (Sheffield, UK) Has run chemical informatics software development and consultancy business since 1985 Barnard Chemical Information (BCI) Ltd http://www.bci.gb.com Adjunct Professor of Informatics at Indiana University

5 5 Lecture 1: Topics to be Covered Structure representations and computers structure diagrams nomenclature line notations connection tables Introduction to Graph Theory

6 6 Representing a chemical structure
  How much information do you want to include?
    atoms present
    connections between atoms
      bond types
    stereochemical configuration
    charges
    isotopes
    3D coordinates for atoms
  C8H9NO3

8 8 Representing a chemical structure
  How much information do you want to include?
    atoms present
    connections between atoms
      bond types (aromatic ring identification)
    stereochemical configuration
    charges
    isotopes
    3D coordinates for atoms

13 13 2D structure diagram
  the chemist's natural language
  used by most computer systems for display
  shows topology, optionally stereochemistry
  several commonly used computer programs allow input/editing of structure diagrams
    ISIS/Draw (MDL) http://www.mdl.com/downloads/downloadable/index.jsp
    ChemDraw (CambridgeSoft) http://www.cambridgesoft.com/products/
    GRINS/JavaGRINS (Daylight) http://www.daylight.com/products/javatools.html
    MarvinSketch http://www.chemaxon.com/marvin/

14 14 2D structure diagram provides 2D pictorial representation of chemical structure display on screen cut/paste/embed in Word document etc. inter-convert with other forms for further processing database searching structure analysis property prediction database analysis

15 15 Chemical Nomenclature name that can be used to identify a substance potentially important for legislation represents chemical structure as text string which can (sometimes) be pronounced trivial names usually short and easy to pronounce do not usually give much information about structure systematic names usually long and difficult to pronounce usually describe structure in considerable detail

16 16 Trivial and Systematic Names
  Trivial name: tyrosine
  Systematic names:
    β-(p-hydroxyphenyl)alanine
    α-amino-p-hydroxyhydrocinnamic acid

17 17 Systematic Names
  several systems under continual revision and extension
    IUPAC
    Chemical Abstracts (lecture from Dr Davis on Sep 9)
    some special systems designed by individuals
  not usually designed for computer processing
  programs exist both to read (translate) and to generate systematic names from computer formats
    http://www.beilstein.com/products/autonom/anm2000.shtml
    http://www.acdlabs.com/products/name_lab/
  have arguably outlived their usefulness
    IUPAC IChI (IUPAC Chemical Identifier) project

18 18 Registry Numbers unique identifiers for compounds or substances catalogue number most chemical databases have them Chemical Abstracts Beilstein private compound registries in pharmaceutical companies usually just idiot numbers no chemical information may have hierarchical structure parent compound stereoisomer salt batch need to decide what is a separate compound

19 19 Line Notations represent structures as compact linear string of alphanumeric symbols easily handled by computer compact storage easily transmitted over a network allow rapid manual coding/decoding by trained users much faster for input than using a structure drawing program

20 20 Line Notations: SMILES Simplified Molecular Input Line Entry System developed by Dave Weininger (Daylight) OC(=O)C(N)CC1=CC=C(O)C=C1

21 21 Simplified SMILES encoding rules
  atoms are shown by atomic symbols: B, C, N, O, F, P, S, Cl, Br, I
  hydrogen atoms are assumed to fill spare valencies
  adjacent atoms are connected by single bonds
  double bonds are shown as '=', triple bonds as '#'
  branching is indicated by parentheses
  ring closures are shown by pairs of matching digits
  Full rules: http://www.daylight.com/smiles/smiles-intro.html
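The simplified rules above are enough to turn the tyrosine string from the previous slide into a list of atoms and bonds. The following is only a sketch under those rules: the function name `parse_simple_smiles` is made up for this illustration, and aromatic (lower-case) atoms, charges, brackets and stereochemistry are deliberately not handled.

```python
# Bond symbols from the simplified rules; an absent symbol means a single bond.
BOND_ORDER = {"=": 2, "#": 3}

def parse_simple_smiles(smiles):
    """Parse the simplified SMILES subset into (atoms, bonds).

    atoms is a list of element symbols; bonds is a list of
    (atom_index, atom_index, bond_order) with 0-based indices.
    Hypothetical helper: it covers only the rules listed on the slide.
    """
    atoms, bonds = [], []
    prev = None          # atom that the next atom will be bonded to
    order = 1            # bond order waiting to be used
    branch_stack = []    # atoms at which a '(' branch was opened
    open_rings = {}      # ring-closure digit -> atom that opened it
    i = 0
    while i < len(smiles):
        two, one = smiles[i:i + 2], smiles[i]
        if two in ("Cl", "Br") or one in "BCNOFPSI":
            symbol = two if two in ("Cl", "Br") else one
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev, order = len(atoms) - 1, 1
            i += len(symbol)
            continue
        if one in BOND_ORDER:
            order = BOND_ORDER[one]
        elif one == "(":
            branch_stack.append(prev)
        elif one == ")":
            prev = branch_stack.pop()
        elif one.isdigit():
            if one in open_rings:            # second digit of a pair closes the ring
                bonds.append((open_rings.pop(one), prev, order))
            else:                            # first digit of a pair opens it
                open_rings[one] = prev
            order = 1
        else:
            raise ValueError(f"unsupported SMILES character {one!r}")
        i += 1
    return atoms, bonds

atoms, bonds = parse_simple_smiles("OC(=O)C(N)CC1=CC=C(O)C=C1")
print(len(atoms), len(bonds))  # 13 13
```

Hydrogens never appear in the output; they are implied by the spare valencies of each atom, as the second rule states.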

22 22 Other line notations
  ROSDAL (Beilstein): Representation Of Structure Diagram Arranged Linearly
    1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
  Sybyl Line Notation (Tripos)
    OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
  Wiswesser Line Notation (WLN) (obsolete)
    QVYZ1R DQ

23 23 Connection Tables (CTs)
  main form of structure representation in computer systems
  list atoms and bonds (and other data) as a table
  many different formats
    internal CTs (in memory)
      algorithmic processing
    external CTs (disk files)
      archival storage
      data exchange between programs

24 24 Redundant Connection Table
  Atom  Element  H count  Neighbours (atom number:bond order)
   1    O        1        2:1
   2    C        0        1:1  3:2  4:1
   3    O        0        2:2
   4    C        1        2:1  5:1  6:1
   5    N        2        4:1
   6    C        2        4:1  7:1
   7    C        0        6:1  8:2  12:1
   8    C        1        7:2  9:1
   9    C        1        8:1  10:2
  10    C        0        9:2  11:1  13:1
  11    C        1        10:1  12:2
  12    C        1        11:2  7:1
  13    O        1        10:1

25 25 Internal Connection Table usually redundant every bond shown twice, once for each atom implemented as array of records record for each atom might store atomic type hydrogen count formal charge 2D display co-ordinates bonds to neighbouring atoms etc.
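One way to picture the "array of records" is sketched below, using the tyrosine table from the previous slide as data. The record layout and the names `AtomRecord`, `build_table`, `is_consistent` and `unique_bonds` are assumptions made for this illustration, not the layout of any particular system.

```python
from dataclasses import dataclass, field

@dataclass
class AtomRecord:
    """One record per atom (hypothetical layout for illustration)."""
    symbol: str                   # atomic type
    h_count: int                  # implicit hydrogen count
    charge: int = 0               # formal charge
    xy: tuple = (0.0, 0.0)        # 2D display coordinates (placeholder values)
    neighbours: dict = field(default_factory=dict)  # neighbour atom number -> bond order

# (element, H count, [(neighbour, bond order), ...]) for atoms 1..13 of the slide's table.
TYROSINE = [
    ("O", 1, [(2, 1)]),
    ("C", 0, [(1, 1), (3, 2), (4, 1)]),
    ("O", 0, [(2, 2)]),
    ("C", 1, [(2, 1), (5, 1), (6, 1)]),
    ("N", 2, [(4, 1)]),
    ("C", 2, [(4, 1), (7, 1)]),
    ("C", 0, [(6, 1), (8, 2), (12, 1)]),
    ("C", 1, [(7, 2), (9, 1)]),
    ("C", 1, [(8, 1), (10, 2)]),
    ("C", 0, [(9, 2), (11, 1), (13, 1)]),
    ("C", 1, [(10, 1), (12, 2)]),
    ("C", 1, [(11, 2), (7, 1)]),
    ("O", 1, [(10, 1)]),
]

def build_table(rows):
    """Build a dict of atom number -> AtomRecord, numbering atoms from 1."""
    return {n: AtomRecord(sym, h, neighbours=dict(nbrs))
            for n, (sym, h, nbrs) in enumerate(rows, start=1)}

def is_consistent(table):
    """In a redundant table every bond must be listed from both of its atoms."""
    return all(table[b].neighbours.get(a) == order
               for a, rec in table.items()
               for b, order in rec.neighbours.items())

def unique_bonds(table):
    """Collapse the redundant table to one (atom, atom, order) entry per bond."""
    return sorted((a, b, order)
                  for a, rec in table.items()
                  for b, order in rec.neighbours.items() if a < b)

table = build_table(TYROSINE)
print(is_consistent(table), len(unique_bonds(table)))  # True 13
```

The redundancy costs storage (26 neighbour entries for 13 bonds) but lets an algorithm walk from any atom to its neighbours without scanning a separate bond list.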

26 26 MDL Connection Table proprietary file format developed by MDL http://www.mdl.com/downloads/public/ctfile/ctfile.jsp de facto standard for exchange of datasets several different flavours and versions Molfile (single molecule) SDfile (set of molecules and data) RGfile (Markush structure) Rxnfile (single reaction) RDfile (set of reactions with data) separates atoms and bonds into separate blocks

27 27 New MDL File Formats Since this lecture was delivered on Oct 28, 2003 MDL have published details of a new file format called XDfile XML-based data format for transferring structure/reaction information with associated data built around existing MDL connection table formats can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page) can incorporate SMILES strings Details available in MDL documentation at: http://www.mdl.com/downloads/public/ctfile/ctfile.jsp

28 28 MDL Connection Table Header Block
  data on molecule name and file origin
  counts of atoms and bonds etc.
    Tyrosine
      -ISIS-  08220120432D
     13 13  0  0  0  0  0  0  0  0999 V2000

29 29 MDL Connection Table Atoms block
  one line per atom
  specifies X,Y,Z-coords, atom symbol, isotope, charge, stereo code etc.
        0.2459   -1.4736    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.5815   -1.4724    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.9944   -2.1872    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.5810   -2.9037    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.2495   -2.9008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.6586   -2.1854    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        1.4836   -2.1830    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
       -1.9042   -2.1792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -3.1027   -2.1870    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
       -3.1359   -1.1516    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
       -3.9070   -2.1847    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -4.4070   -2.6845    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
       -4.4989   -1.5618    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0

30 30 MDL Connection Table Bonds Block
  one line per bond (each bond shown once)
  specifies row numbers for atoms, and codes for bond type, bond stereochemistry etc.
      1  2  2  0  0  0  0
      6  7  1  0  0  0  0
      3  4  2  0  0  0  0
      3  8  1  0  0  0  0
      4  5  1  0  0  0  0
      9 10  1  0  0  0  0
      2  3  1  0  0  0  0
      9 11  1  0  0  0  0
      5  6  2  0  0  0  0
     11 12  1  0  0  0  0
      6  1  1  0  0  0  0
     11 13  2  0  0  0  0
      8  9  1  0  0  0  0
    M  END
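The three blocks can be read back with a few lines of code. This is a minimal sketch, assuming the fixed-width V2000 column layout (three-character counts and atom numbers, ten-character coordinates) and a three-line header whose third (comment) line is blank; the function name `parse_molfile` is hypothetical and only the fields discussed on the slides are extracted. The bond lines are in the order they appear in this transcript, which may not be the order of the original file.

```python
MOLFILE = """\
Tyrosine
  -ISIS-  08220120432D

 13 13  0  0  0  0  0  0  0  0999 V2000
    0.2459   -1.4736    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5815   -1.4724    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9944   -2.1872    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5810   -2.9037    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2495   -2.9008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6586   -2.1854    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4836   -2.1830    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9042   -2.1792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1027   -2.1870    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -3.1359   -1.1516    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9070   -2.1847    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4070   -2.6845    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4989   -1.5618    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  6  7  1  0  0  0  0
  3  4  2  0  0  0  0
  3  8  1  0  0  0  0
  4  5  1  0  0  0  0
  9 10  1  0  0  0  0
  2  3  1  0  0  0  0
  9 11  1  0  0  0  0
  5  6  2  0  0  0  0
 11 12  1  0  0  0  0
  6  1  1  0  0  0  0
 11 13  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a V2000 Molfile string.

    atoms: list of (symbol, x, y, z); bonds: list of (atom, atom, bond type),
    with atoms numbered from 1 as in the file. Other fields are ignored.
    """
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3]                       # counts line follows the 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = (float(line[k:k + 10]) for k in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))  # Tyrosine 13 13
```

Slicing by column rather than splitting on whitespace matters once a molecule has 100 or more atoms, because adjacent three-character fields then run together with no space between them.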

31 31 Standard Connection Table Formats different vendors have proprietary CT formats many attempts to establish agreed standard formats no real general success different user communities have failed to coordinate efforts some standards exist in restricted areas SMILES and MDL CT formats widely used most popular programs read/write several different formats

32 32 Standard Connection Table Formats Standard Molecular Data (SMD) format never gained wide acceptance Protein Data Bank (PDB) format Crystallographic Information File (CIF/mmCIF) Molecular Information File (MIF) developed from SMD and compatible with CIF Chemical Exchange Format (CXF) Chemical Abstracts Service

33 33 Standard Connection Table Formats Chemical Markup Language (CML) uses principles of the eXtensible Markup Language (XML) protocol for data exchange using the Internet http://www.xml-cml.org Chemical EXchange (CEX) exchange protocol for TCP/IP networks developed collaboratively by several organizations http://www.cgl.ucsf.edu/cex Chemical MIME incorporates several popular formats into protocols for exchange of molecular structures as e-mail attachments http://www.ch.ic.ac.uk/chemime/

34 34 IUPAC Chemical Identifier (IChI) Project being undertaken by International Union of Pure and Applied Chemistry Intended to provide unique identifier for compounds, but with chemical intelligence based on connection table canonicalised (see lecture 3 on November 4) compacted to short alphanumerical string http://www.iupac.org/projects/2000/2000-025-1-800.html see also Dr Nicklaus's lecture on Oct 16

35 35 Topological Graph Theory branch of mathematics particularly useful in chemical informatics and in computer science generally study of graphs which consist of a set of nodes a set of edges joining pairs of nodes

36 36 Properties of graphs graphs are only about connectivity spatial position of nodes is irrelevant length of edges is irrelevant crossing edges are irrelevant
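A small illustration of "only connectivity matters": two drawings of the same four-membered ring, one laid out as a square and one with its edges crossing, reduce to the same graph once the coordinates are discarded. The dictionary layout and the name `graph_of` are assumptions made for this sketch.

```python
def graph_of(drawing):
    """Drop the coordinates and keep only the nodes and the undirected edges."""
    nodes = frozenset(drawing["coords"])
    edges = frozenset(frozenset(edge) for edge in drawing["edges"])
    return nodes, edges

# Same four nodes and four edges, drawn as a square ...
square = {
    "coords": {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)},
    "edges": [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
}
# ... and drawn with c and d swapped, so that edges b-c and d-a cross.
crossed = {
    "coords": {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (1, 1)},
    "edges": [("b", "a"), ("c", "b"), ("d", "c"), ("a", "d")],
}

print(graph_of(square) == graph_of(crossed))  # True
```

Storing each edge as an unordered pair also makes the direction in which an edge was drawn irrelevant.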

37 37 Properties of Graphs nodes and edges can be coloured to distinguish them

38 38 Structure Diagrams as Graphs
  2D structure diagrams are very like topological graphs
    atoms correspond to nodes
    bonds correspond to edges
  terminal hydrogen atoms are not normally shown as separate nodes (implicit hydrogens)
    reduces number of nodes by ~50%
    hydrogen count information is used to colour the neighbouring heavy atom
  separate nodes are sometimes used for special hydrogens
    deuterium, tritium
    hydrogen bonded to more than one other atom
    hydrogens attached to stereocentres
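The implicit hydrogen counts can be recovered from the heavy-atom graph alone. The sketch below uses the tyrosine bonds from the redundant connection table on slide 24 and assumes neutral atoms with their usual valences (C 4, N 3, O 2); that valence table and the name `implicit_h_counts` are assumptions for this illustration, and charged or hypervalent atoms would need more care.

```python
# Assumed default valences for neutral atoms; not a general valence model.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}

# Heavy atoms 1..13 and bonds (atom, atom, order) of tyrosine, as in slide 24.
SYMBOLS = ["O", "C", "O", "C", "N", "C", "C", "C", "C", "C", "C", "C", "O"]
BONDS = [
    (1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1), (7, 8, 2),
    (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2), (12, 7, 1), (10, 13, 1),
]

def implicit_h_counts(symbols, bonds):
    """Hydrogens fill whatever valence the heavy-atom bonds leave unused."""
    used = [0] * len(symbols)
    for a, b, order in bonds:
        used[a - 1] += order
        used[b - 1] += order
    return [DEFAULT_VALENCE[s] - u for s, u in zip(symbols, used)]

h_counts = implicit_h_counts(SYMBOLS, BONDS)
print(h_counts)       # [1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1, 1, 1]
print(sum(h_counts))  # 11
```

For tyrosine this hides 11 hydrogens behind 13 heavy-atom nodes, so the graph has 13 nodes instead of 24, in line with the roughly 50% reduction quoted above.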

39 39 Advantages of using graphs
  mathematical theory is well understood
  graphs can be easily represented in computers
  many useful algorithms are known
  identical graphs ⇒ identical molecules
  different graphs ⇒ different molecules

40 40 Disadvantages of using graphs
  analogy between chemical structures and graphs is not perfect
    identical graphs do not always mean identical molecules
    different graphs do not always mean different molecules
  realities of chemical structures cause problems
    aromaticity
    tautomerism
    multi-centre bonds
    macromolecules
    stereochemistry
    coordination compounds
    inorganic compounds
    polymers
    incompletely-defined substances
  many graph algorithms are inherently slow
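Both problems can be seen in one small example. The two Kekulé drawings of o-xylene (1,2-dimethylbenzene, chosen here purely as an illustration) describe the same compound, yet as graphs with bond orders on the edges they are not identical. The check below is a deliberately naive brute-force isomorphism test that tries every atom permutation, so its cost grows factorially with the number of atoms; the function names are hypothetical and real systems use far better algorithms.

```python
from itertools import permutations

def isomorphic(symbols_a, bonds_a, symbols_b, bonds_b):
    """Brute-force test: is there a relabelling of A's atoms that gives B?

    Bonds are (atom, atom, order) with 0-based indices. Worst case tries n! maps.
    """
    n = len(symbols_a)
    if n != len(symbols_b) or len(bonds_a) != len(bonds_b):
        return False
    target = {frozenset((i, j)): order for i, j, order in bonds_b}
    for perm in permutations(range(n)):
        if any(symbols_a[i] != symbols_b[perm[i]] for i in range(n)):
            continue  # node colours (elements) must match
        if all(target.get(frozenset((perm[i], perm[j]))) == order
               for i, j, order in bonds_a):
            return True  # every bond maps onto a bond of the same order
    return False

def strip_orders(bonds):
    """Treat every bond as the same kind of edge."""
    return [(i, j, 1) for i, j, _ in bonds]

# Ring atoms 0..5, methyl carbons 6 (on atom 0) and 7 (on atom 1).
SYMBOLS = ["C"] * 8
KEKULE_A = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1),
            (0, 6, 1), (1, 7, 1)]
KEKULE_B = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 0, 2),
            (0, 6, 1), (1, 7, 1)]

print(isomorphic(SYMBOLS, KEKULE_A, SYMBOLS, KEKULE_B))  # False
print(isomorphic(SYMBOLS, strip_orders(KEKULE_A), SYMBOLS, strip_orders(KEKULE_B)))  # True
```

The first result is `False` because the bond between the two methyl-bearing ring carbons is double in one drawing and single in the other, even though a chemist would regard both as the same molecule.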

41 41 Lecture 1: Conclusions There are lots of ways of storing a chemical structure in a computer including different amounts of information Most important ones are line notations (e.g. SMILES) connection tables (e.g. MDL Molfile) nomenclature Structure diagrams used for input/output Chemical structures can be regarded as topological graphs

42 42 Lecture 2: Topics to be Covered Special problems of structure representation aromaticity and tautomerism multi-centre bonds stereochemistry and coordination compounds inorganic compounds macromolecules and polymers incompletely-defined substances Markush structures

43 43 Further reading
  A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003
  J. Gasteiger and T. Engel, Chemoinformatics: a Textbook, Wiley-VCH, 2003
  J. Gasteiger (ed.), Handbook of Chemoinformatics: From Data to Knowledge, Wiley-VCH, 2003
    Vol 1, Chapter II (Representation of chemical compounds)

