

Computer Structure Codes (after lectures by Dr. J.M. Barnard)
How do you store chemical structures on a computer?
What can you do with them there?
How do the computer systems used in chemical informatics work?

Representing a chemical structure
How much information do you want to include?
  – atoms present
  – connections between atoms
    – bond types (aromatic ring identification)
  – stereochemical configuration
  – charges
  – isotopes
  – 3D coordinates for atoms


2D structure diagram
chemists’ “natural language”
used by most computer systems for display
shows topology, optionally stereochemistry
several commonly used computer programs allow input/editing of structure diagrams
  – ISIS/Draw (MDL)
  – ChemDraw (CambridgeSoft)
  – GRINS/JavaGRINS (Daylight)

2D structure diagram
provides 2D pictorial representation of chemical structure
  – display on screen
  – cut/paste/embed in Word document etc.
inter-convert with other forms for further processing
  – database searching
  – structure analysis
  – property prediction
  – database analysis

Registry Numbers
unique identifiers for compounds or substances
  – catalog number
most chemical databases have them
  – Chemical Abstracts
  – Beilstein
  – private compound registries in pharmaceutical companies
usually just “idiot numbers”
  – no chemical information
may have hierarchical structure
  – parent compound → stereoisomer → salt → batch
need to decide what is a separate compound

Line Notations
represent structures as compact linear string of alphanumeric symbols
easily handled by computer
  – compact storage
  – easily transmitted over a network
allow rapid manual coding/decoding by trained users
  – much faster for input than using a structure drawing program

Line Notations: SMILES
Simplified Molecular Input Line Entry System
developed by Dave Weininger (Daylight)
example (tyrosine): OC(=O)C(N)CC1=CC=C(O)C=C1
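
To make the notation concrete, here is a minimal sketch of how the SMILES string on this slide can be turned into a list of atoms and bonds. It is not Daylight's parser: the function name `parse_smiles` is invented for illustration, and it assumes only plain element symbols, the bond symbols `-`, `=`, `#`, parenthesised branches and single-digit ring closures. Implicit hydrogens, charges, aromatic bonds and stereo symbols are ignored.

```python
def parse_smiles(smiles):
    """Return (atoms, bonds) for a simple SMILES string.

    atoms is a list of element symbols; bonds is a list of
    (atom index, atom index, bond order) tuples.
    """
    atoms, bonds = [], []
    prev = None            # index of the atom the next atom attaches to
    stack = []             # attachment points saved at each '('
    ring_open = {}         # ring-closure digit -> (atom index, bond order)
    order = 1              # bond order pending for the next bond
    orders = {"-": 1, "=": 2, "#": 3}
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in orders:
            order = orders[ch]
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                # second occurrence of the digit closes the ring
                start, start_order = ring_open.pop(ch)
                bonds.append((start, prev, max(order, start_order)))
            else:
                ring_open[ch] = (prev, order)
            order = 1
        elif ch.isalpha():
            symbol = ch
            if smiles[i:i + 2] in ("Cl", "Br"):
                symbol = smiles[i:i + 2]
                i += 1
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev = len(atoms) - 1
            order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    return atoms, bonds


atoms, bonds = parse_smiles("OC(=O)C(N)CC1=CC=C(O)C=C1")
print(len(atoms), "heavy atoms,", len(bonds), "bonds")
```

The ring-closure digit is the only place where a bond joins two atoms that are not adjacent in the string, which is why the parser has to remember where each digit was first seen.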

Other line notations
ROSDAL (Beilstein)
  – Representation Of Structure Diagram Arranged Linearly
  – 1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
Sybyl Line Notation (Tripos)
Wiswesser Line Notation (WLN) (obsolete)
  – QVYZ1R DQ

Connection Tables (CTs)
main form of structure representation in computer systems
  – list atoms and bonds (and other data) as a table
many different formats
  – “internal” CTs (in memory)
    – algorithmic processing
  – “external” CTs (disk files)
    – archival storage
    – data exchange between programs

Internal Connection Table
usually “redundant”
  – every bond shown twice, once for each atom
implemented as array of records
record for each atom might store
  – atomic type
  – hydrogen count
  – formal charge
  – 2D display co-ordinates
  – bonds to neighboring atoms
  – etc.
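
A minimal sketch of such an array of records, assuming a Python dataclass per atom. The names `AtomRecord` and `build_table` are illustrative and do not come from any particular system. Each bond is entered in the neighbour list of both of its atoms, which is what makes the table “redundant”.

```python
from dataclasses import dataclass, field


@dataclass
class AtomRecord:
    symbol: str                      # atomic type
    h_count: int = 0                 # implicit hydrogen count
    charge: int = 0                  # formal charge
    xy: tuple = (0.0, 0.0)           # 2D display co-ordinates
    neighbours: list = field(default_factory=list)  # (atom index, bond order)


def build_table(symbols, bonds):
    """Build a redundant connection table from symbols and (a, b, order) bonds."""
    table = [AtomRecord(s) for s in symbols]
    for a, b, order in bonds:
        table[a].neighbours.append((b, order))   # bond stored for atom a ...
        table[b].neighbours.append((a, order))   # ... and again for atom b
    return table


# Ethanol, heavy atoms only: C-C-O
table = build_table(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
table[0].h_count, table[1].h_count, table[2].h_count = 3, 2, 1
for index, atom in enumerate(table):
    print(index, atom.symbol, atom.h_count, atom.neighbours)
```

Storing each bond twice costs memory but lets an algorithm walk from any atom to its neighbours without searching a separate bond list.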

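“Redundant” Connection Table
[Table: redundant connection table for a 13-atom structure with atoms O, C, O, C, N, C, C, C, C, C, C, C, O; the bond columns are not recoverable from the transcript.]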

MDL Connection Table
proprietary file format developed by MDL
  – de facto standard for exchange of datasets
several different flavours and versions
  – Molfile (single molecule)
  – SDfile (set of molecules and data)
  – RGfile (Markush structure)
  – Rxnfile (single reaction)
  – RDfile (set of reactions with data)
separates atoms and bonds into separate blocks
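
The separation into an atom block and a bond block can be sketched as a small writer and reader. This assumes the fixed-column V2000 Molfile layout (three header lines, a counts line, then one line per atom and one per bond); the column positions used here should be checked against the format specification before relying on them. The function names `write_molfile` and `read_molfile` are illustrative, and only symbols, 2D coordinates and bond orders are handled.

```python
def write_molfile(name, atoms, bonds):
    """atoms: list of (symbol, x, y); bonds: list of (from, to, order), 1-based."""
    lines = [name, "", ""]                       # three header lines
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y in atoms:                   # atom block
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3s}"
                     + " 0" + "  0" * 11)
    for a, b, order in bonds:                    # bond block
        lines.append(f"{a:3d}{b:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines)


def read_molfile(text):
    """Read back the atom and bond blocks written above."""
    lines = text.split("\n")
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        atoms.append((line[31:34].strip(), float(line[0:10]), float(line[10:20])))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


ethanol_atoms = [("C", 0.0, 0.0), ("C", 1.299, 0.75), ("O", 2.598, 0.0)]
ethanol_bonds = [(1, 2, 1), (2, 3, 1)]
molfile_text = write_molfile("ethanol", ethanol_atoms, ethanol_bonds)
print(molfile_text)
```

Unlike the internal table, each bond appears only once here, and atoms are referred to by their 1-based position in the atom block.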

Standard Connection Table Formats
different vendors have proprietary CT formats
many attempts to establish agreed “standard” formats
  – no real general success
  – different user communities have failed to coordinate efforts
  – some standards exist in restricted areas
SMILES and MDL CT formats widely used
most popular programs read/write several different formats

Standard Connection Table Formats
Standard Molecular Data (SMD) format
  – never gained wide acceptance
Protein Data Bank (PDB) format
Crystallographic Information File (CIF)
Molecular Information File (MIF)
  – developed from SMD and compatible with CIF
Chemical Exchange Format (CXF)
  – Chemical Abstracts Service
Chemical Markup Language (CML)
  – for data exchange using the Internet
INChI (IUPAC/NIST Chemical Identifier)

Conclusions
There are lots of ways of storing a chemical structure in a computer
  – including different amounts of information
Most important ones are
  – line notations (e.g. SMILES)
  – connection tables (e.g. MDL Molfile)
  – nomenclature
Structure diagrams used for input/output

Topological Graph Theory
branch of mathematics
  – particularly useful in chemical informatics and in computer science generally
study of “graphs” which consist of
  – a set of “nodes”
  – a set of “edges” joining pairs of nodes

Properties of graphs
graphs are only about connectivity
  – spatial position of nodes is irrelevant
  – length of edges is irrelevant
  – crossing edges are irrelevant

Structure Diagrams as Graphs
2D structure diagrams very like topological graphs
  – atoms → nodes
  – bonds → edges
terminal hydrogen atoms are not normally shown as separate nodes (“implicit” H)
  – reduces number of nodes by ~50%
  – “hydrogen count” information used to colour neighbouring “heavy” atom
  – separate nodes sometimes used for “special” hydrogens
    – deuterium, tritium
    – hydrogen bonded to more than one other atom
    – hydrogens attached to stereocentres

Advantages of using graphs
mathematical theory is well understood
graphs can be easily represented in computers
  – many useful algorithms are known
identical graphs → identical molecules
different graphs → different molecules

Disadvantages of graphs
analogy between chemical structures and graphs is not perfect
  – identical graphs do not always mean identical molecules
  – different graphs do not always mean different molecules
realities of chemical structures cause problems
  – aromaticity
  – stereochemistry
  – tautomerism
  – coordination compounds
  – multi-centre bonds
  – inorganic compounds
  – macromolecules
  – polymers
  – incompletely-defined substances
many graph algorithms are inherently slow

Aromaticity
electronic property of certain ring systems, giving enhanced chemical stability
bonds in aromatic rings have properties that are distinct from single and double bonds
generally accepted definition is Hückel rule
  – 4n+2 pi-electrons (n is a small non-negative integer)
there are borderline cases
aromaticity causes problems for computer representation
  – different systems deal with it in different ways
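
The electron-count part of the Hückel rule is simple arithmetic, sketched below. The function name `satisfies_huckel` is illustrative. The check covers only the 4n+2 count, not planarity or full conjugation of the ring, so it cannot settle the borderline cases.

```python
def satisfies_huckel(pi_electrons):
    """True if pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


for name, count in [("benzene", 6), ("cyclobutadiene", 4),
                    ("cyclooctatetraene", 8), ("naphthalene", 10)]:
    print(name, count, satisfies_huckel(count))
```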

Aromaticity problems
using single and double bonds can give different topological graphs for the same compound
one solution is to use an aromatic bond type
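
A small sketch of the problem, using o-xylene as an assumed example (it is not named on the slide). The two Kekulé forms give different bond tables, and replacing the ring bonds with a single aromatic type makes them identical. The helper names are illustrative, and the ring is simply declared aromatic rather than perceived.

```python
def o_xylene(first_ring_order):
    """Bond table for o-xylene: ring atoms 0-5, methyl carbons 6 and 7."""
    other = 3 - first_ring_order              # swaps 1 and 2
    bonds = {}
    for i in range(6):
        order = first_ring_order if i % 2 == 0 else other
        bonds[frozenset((i, (i + 1) % 6))] = order
    bonds[frozenset((0, 6))] = 1              # methyl on ring atom 0
    bonds[frozenset((1, 7))] = 1              # methyl on ring atom 1
    return bonds


def use_aromatic_bonds(bonds, ring_atoms=range(6)):
    """Relabel every bond lying inside the ring as 'aromatic'."""
    ring = set(ring_atoms)
    return {b: ("aromatic" if b <= ring else o) for b, o in bonds.items()}


kekule_a = o_xylene(2)   # double bond between the two substituted carbons
kekule_b = o_xylene(1)   # single bond between them
print(kekule_a == kekule_b)                                          # False
print(use_aromatic_bonds(kekule_a) == use_aromatic_bonds(kekule_b))  # True
```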

Alternating bonds and aromaticity
Chemical Abstracts Registry System uses a “normalised” bond type for all rings with alternating single and double bonds
  – this includes some systems that are not aromatic
  – and omits some that are

Representing aromaticity
some systems represent aromaticity as an atom property
  – SMILES allows use of lower-case atomic symbols for aromatic atoms (adjacent aromatic atoms are assumed to be joined by aromatic bonds)
problem: aromaticity is really a ring property

Tautomerism
dynamic equilibrium between positional isomers (labile H)
are they different compounds?
  – answer depends on what you want to do with them
can use normalised bonds to represent them by a single graph
  – gets mixed up with ring alternating bonds
  – some tautomers may be aromatic, when others are not

Tautomerism
tautomerism is a matter of degree
tautomers can be defined in different ways
  – HQ–X=R ⇌ Q=X–RH
  – only certain elements can be Q, X or R
keto-enol tautomers are not recognised by Chemical Abstracts
mono-unsaturated carbon chains are not distinguished by Daylight

Structure conventions
sometimes called “business rules”
  – some chemical groups can be shown in different but equally valid ways
  – conventions are needed to determine which is preferred
  – software may be needed to convert to preferred form

Stereochemistry
different compounds with identical connectivity
same topology, different topography
examples: S-tyrosine, R-tyrosine

Stereochemistry
configuration is often unknown
  – or partially known (relative stereochemistry)
  – or you may have a mixture of stereoisomers
    – in which one isomer may occur in enantiomeric excess
many different descriptors used by chemists
  – wedge (up) and hatched (down) bonds in structure diagrams
  – Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z)
  – text-based descriptors (stereoparent, or optical rotation)

Stereochemistry: up/down bonds
can be used as additional “colours” for graph edges
  – many connection table formats have special codes for up and down bonds
  – need to know which end of bond is which
useful for re-generating diagrams for display
can be used to calculate other stereo descriptors

Up/down bond problems
different patterns of up/down bonds can show the same stereoisomer
  – different graphs, same molecule
some patterns of up and down bonds actually convey no useful information about configuration

Stereochemistry: CIP designators
R.S. Cahn, C. Ingold, and V. Prelog,
  – Angewandte Chemie Intl. Ed. in English 1966, 5,
one-letter designator for stereocenters
  – based on rules assigning priorities to the groups around them
  – tetrahedral carbons (R, S)
  – double bonds (E, Z)
additional colors for graph nodes or edges
  – useful for distinguishing stereoisomers when absolute configuration is known
  – less useful for matching parts of structures (substructure search) as priority rules can cause designator to change when remote part of structure is changed

Double bond stereo in SMILES
/ and \ used as “directional” single bonds
  – only meaningful when used on both atoms of a double bond
  – several ways of showing same configuration
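
A minimal sketch of why several strings show the same configuration. It assumes the simplest pattern only, one substituent written before the double bond and one after it (for example `F/C=C/F`). In that pattern, matching direction symbols mean the substituents are on opposite sides and differing symbols mean they are on the same side. The function name is illustrative, and anything beyond this pattern is out of scope.

```python
def double_bond_geometry(smiles):
    """Classify a SMILES of the form A?C=C?B, where each ? is / or \\."""
    marks = [ch for ch in smiles if ch in "/\\"]
    if len(marks) != 2:
        raise ValueError("expected exactly two directional bonds")
    first, second = marks
    return "trans" if first == second else "cis"


for s in ["F/C=C/F", "F\\C=C\\F", "F/C=C\\F", "F\\C=C/F"]:
    print(s, double_bond_geometry(s))
```

The first two strings describe the same trans isomer and the last two the same cis isomer, so a comparison of raw strings would wrongly treat them as four different compounds.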

Other complications
Organometallic and co-ordination compounds
  – complex stereochemistry
  – special bond types may be needed (dative bonds etc.)
  – ambiguity over covalent/ionic character of bonds
    – “business rules” usually needed
Inorganic compounds
  – topological representation often not possible
  – composition may not involve integral ratios between elements

Macromolecules
in principle can represent all atoms, as for small molecules
some systems use “shortcuts” or “superatoms” for subunits (e.g. amino acids)

Macromolecules
Each shortcut is defined with appropriate attachment points
ordinary atoms can be mixed with shortcuts
system can expand shortcuts when needed

Polymers
special problems are presented because properties of polymer can be affected by polymerisation conditions
  – average number of subunits
  – extent of cross-linking
  – ratio between different subunits
  – random / block sequences of subunits
  – etc.
Two main approaches
  – monomer representation
  – structural repeating unit (SRU) representation

Incompletely-defined substances
unknown stereochemistry
unknown attachment position
unknown repetition

Markush (“Generic”) structures
  – structures with R-groups
  – shorthand for describing sets of structures with common features

Markush structures
  – also called “generic” structures
  – very important in chemical patents
    – inventor claims whole class of related compounds
  – can be used to describe combinatorial libraries
  – can be used as queries in database searches
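
The idea of a Markush structure as shorthand for a set of structures can be sketched by enumerating R-group choices into a SMILES template. The template, the R-group lists and the function name `enumerate_markush` are invented for illustration. Real Markush representations also cover variable attachment positions and open-ended groups, which simple enumeration cannot handle.

```python
from itertools import product


def enumerate_markush(template, r_groups):
    """Yield one SMILES per combination of R-group choices."""
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


# A para-disubstituted benzene core with two variable positions.
library = list(enumerate_markush(
    "{R1}c1ccc({R2})cc1",
    {"R1": ["C", "CC"], "R2": ["O", "N", "Cl"]},
))
print(len(library), library)
```

Two choices for R1 and three for R2 give six specific structures. The size of the set grows multiplicatively with each R-group, which is why the generic form is stored instead of the enumeration.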

Canonicalization
a given chemical structure (or graph) can have many valid and unambiguous representations
  – different order of rows in connection table
  – different order of atoms in SMILES
for comparison purposes it would be useful to have a single unique or “canonical” representation
process of converting input representation to canonical form is called “canonicalization” or “canonization”
  – process of applying “rules” (i.e. an algorithm)

Canonicalization
an obvious approach:
  – generate all possible valid SMILES
  – choose the one that comes first alphabetically
this would be effective but very slow, and there is a danger of missing one
  – principle was used for canonicalizing Wiswesser Line Notation
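
The brute-force principle can be sketched on a connection table instead of SMILES strings (a substitution made here to keep the example short): try every renumbering of the atoms and keep the representation that sorts first. The function name `canonical_form` is illustrative. The cost grows as n! with the number of atoms, which is the slowness the slide refers to.

```python
from itertools import permutations


def canonical_form(symbols, bonds):
    """Smallest (symbols, bonds) representation over all atom renumberings."""
    n = len(symbols)
    best = None
    for perm in permutations(range(n)):        # perm[old index] = new index
        new_symbols = [None] * n
        for old, new in enumerate(perm):
            new_symbols[new] = symbols[old]
        new_bonds = sorted(
            tuple(sorted((perm[a], perm[b]))) + (order,)
            for a, b, order in bonds
        )
        candidate = (tuple(new_symbols), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


chain = [(0, 1, 1), (1, 2, 1)]
ethanol_1 = canonical_form(["C", "C", "O"], chain)   # numbered C-C-O
ethanol_2 = canonical_form(["O", "C", "C"], chain)   # numbered O-C-C
ether = canonical_form(["C", "O", "C"], chain)       # dimethyl ether, C-O-C
print(ethanol_1 == ethanol_2, ethanol_1 == ether)
```

Two numberings of ethanol collapse to one canonical form, while dimethyl ether, which has the same atoms but different connectivity, does not.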

Canonicalization
most methods in use today involve renumbering the atoms in some unique and reproducible way
  – can be used to number rows in connection table
  – can determine order of atoms in SMILES
normally involve a node labelling technique called “relaxation”
  – example is Morgan’s algorithm (1965)

Symmetry perception
if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn’t matter which is chosen next
Morgan’s algorithm is thus also useful for identifying symmetry in molecules

Morgan’s algorithm
Works by taking more of the graph into account at each iteration
  – essence of “relaxation” technique is iteratively updating a value by looking at its immediate neighbours
It is not infallible
  – graphs (“isospectral” graphs) are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent
There are many variations on it
  – and several theoretical papers analysing it mathematically
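
A minimal sketch of the relaxation step, covering only the labelling phase of a Morgan-style algorithm. Each node starts with its number of neighbours, each label is then replaced by the sum of its neighbours' labels, and iteration stops when the number of distinct labels no longer increases. The later phase that turns labels into an atom numbering, with tie-breaking on atom and bond types, is omitted. The function name `morgan_labels` is illustrative.

```python
def morgan_labels(adjacency):
    """Extended-connectivity labels; adjacency maps node -> list of neighbours."""
    labels = {node: len(nbrs) for node, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # relaxation: update every label from its immediate neighbours
        new = {node: sum(labels[nb] for nb in nbrs)
               for node, nbrs in adjacency.items()}
        new_classes = len(set(new.values()))
        if new_classes <= n_classes:
            return labels                  # no further discrimination gained
        labels, n_classes = new, new_classes


# Carbon skeleton of pentane: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(morgan_labels(pentane))
```

Atoms 0 and 4 end with the same label, as do atoms 1 and 3, which reflects the mirror symmetry of the chain. Equal labels suggest symmetry but, as the slide notes, do not prove it.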

Ring perception
How many rings are there in these structures and which ones are they?
rings are important features of chemical structures
  – nomenclature generation
  – aromaticity perception
  – synthetic significance
  – fragment descriptor generation

Rings and ring systems
A ring system is a subgraph in which every edge is part of a cycle

Which rings to perceive?
Usually the smallest set of smallest rings (SSSR)
  – two 6-membered rather than one 6- and one 10-membered
  – two 5-membered rather than one 5- and one 6-membered
But there may be more than one SSSR
  – C-S-C-C-C-C
  – C-C-C-C-O-C
  – C-S-C-C-O-C
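
The number of rings in an SSSR can be computed without finding the rings themselves. It equals bonds minus atoms plus the number of connected components. The sketch below assumes heavy-atom graphs given as bond lists and uses a small union-find to count components. The function name `ring_count` is illustrative, and choosing which rings make up the set is a separate, harder step not shown here.

```python
def ring_count(n_atoms, bonds):
    """Number of rings in an SSSR: bonds - atoms + connected components."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    components = n_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - n_atoms + components


# Naphthalene skeleton: a 10-atom perimeter plus the fusion bond 0-5.
naphthalene = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]
butane = [(0, 1), (1, 2), (2, 3)]
print(ring_count(10, naphthalene), ring_count(4, butane))
```

For naphthalene the count is two, matching the “two 6-membered rather than one 6- and one 10-membered” choice on the slide.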

Substructure Fragments
Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl
this can be done by tracing appropriate paths in the graph
subgraphs may overlap

Fragment codes
  – many early chemical information systems were based on identifying fragments of this sort
    – originally the fragments were identified manually and represented on punched cards
  – special fragment codes (dictionaries of fragments) were devised for different systems
    – some of these are still in use, though with automated encoding of structures
    – particularly important are the systems for “Markush” structures in patents (e.g. Derwent WPI code)

Fingerprints
the fragments present in a structure can be represented as a sequence of 0s and 1s
  – 0 means fragment is not present in structure
  – 1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)
for chemical structures often called structure “fingerprints”

Fingerprints
fingerprints are typically bits long
where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
  – sometimes several related fragments will “set” the same bit
disadvantage is that if structure contains few fragments from the dictionary, few or no bits are set
  – can be avoided if “generalised” fragments are used (involving e.g. “any atom”, “any ring bond” types)
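
A minimal sketch of a dictionary-based fingerprint with a 1:1 mapping between fragment and bit position. The four-entry dictionary `FRAGMENT_BITS` is a toy assumption built from the fragments named earlier, and the fragment lists are supplied by hand rather than perceived from a structure graph.

```python
FRAGMENT_BITS = {"OH": 0, "NH2": 1, "COOH": 2, "phenyl": 3}   # toy dictionary


def fingerprint(fragments):
    """Set one bit per dictionary fragment present, however often it occurs."""
    bits = 0
    for frag in fragments:
        if frag in FRAGMENT_BITS:
            bits |= 1 << FRAGMENT_BITS[frag]
    return bits


def as_bitstring(bits, length=len(FRAGMENT_BITS)):
    """Render the fingerprint with bit position 0 printed first."""
    return format(bits, f"0{length}b")[::-1]


tyrosine = ["OH", "NH2", "COOH", "phenyl", "OH"]   # OH occurs twice
ethanol = ["OH"]
chloromethane = ["Cl"]                             # not in the dictionary
for name, frags in [("tyrosine", tyrosine), ("ethanol", ethanol),
                    ("chloromethane", chloromethane)]:
    print(name, as_bitstring(fingerprint(frags)))
```

The last structure sets no bits at all, which is the disadvantage of a fixed dictionary noted on the slide.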

2D structure depiction
if structures are stored without 2D display coordinates, we need to generate them
  – e.g. SMILES
“depiction” algorithms are used for this
identify and lay out ring systems first
  – complications over orientation of some systems
  – Chemical Abstracts stores “standard depictions” of all ring systems it has encountered
then add side chains, avoiding collisions
  – many features can be added to improve appearance

3D structure depiction
much more complicated than 2D
need to store standard bond lengths and angles
need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
need to rotate single bonds to avoid “bumps”
sophisticated “conformation generation” programs identify low-energy conformers
  – very useful for identifying molecules with the correct shape to fit into biological receptor sites

Nomenclature generation
most systematic nomenclature is based on ring systems
  – need to identify/prioritise ring systems first
  – identify standard numbering for system
    – frequently need to store this
  – add side chains and substituents with appropriate locants