1 Chemical Structure Representation and Search Systems Lecture 3. Nov 4, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software.

Presentation transcript:

1 Chemical Structure Representation and Search Systems Lecture 3. Nov 4, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software & Consultancy Services Sheffield, UK

2 Lecture 3: Topics to be Covered
- More Graph Theory
- Structure Analysis and Processing
  - canonicalisation and symmetry perception
  - ring perception
  - functional group identification
  - structure fingerprints and fragments
  - structure depiction
  - principles of structure searching

3 Graph Terminology
- degree of a node: number of edges meeting at it
- leaf node: a node of degree 1
- path: connected sequence of edges between two nodes

4 Graph Terminology
- cycle: path which returns to its starting node
- tree: graph with no cycles
- subgraph: graph containing a subset of the nodes and edges of another graph

5 Graph Terminology
- spanning tree: a tree subgraph that contains all the nodes (but not necessarily all the edges) of a graph

6 Graph Terminology
- connected graph: graph in which there is a path between every pair of nodes
- fully-connected graph: graph in which there is an edge between every pair of nodes (all nodes have degree n-1, where n is the number of nodes)

7 Graph Terminology
- disconnected graph: graph in which some pairs of nodes have no path between them
- component: subgraph in which all pairs of nodes are linked by a path, but no node has a path to a node in another component

8 Graph Terminology
- forest: graph consisting of two or more components, each of which is a tree
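
The terms on the preceding slides map directly onto a small data structure. The following is a minimal sketch, assuming a graph is stored as a Python dictionary from each node to the set of its neighbours; the function names and the example graph are illustrative and do not come from the lecture.

```python
from collections import deque


def degree(graph, node):
    """Degree of a node: the number of edges meeting at it."""
    return len(graph[node])


def leaf_nodes(graph):
    """Leaf nodes: all nodes of degree 1."""
    return sorted(n for n in graph if len(graph[n]) == 1)


def components(graph):
    """Split a graph into its connected components (breadth-first)."""
    seen = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(sorted(members))
    return result


def edge_count(graph):
    """Each undirected edge appears in two neighbour sets."""
    return sum(len(nbs) for nbs in graph.values()) // 2


def is_acyclic(graph):
    """True if every component is a tree, i.e. the graph has no cycles."""
    return edge_count(graph) == len(graph) - len(components(graph))


def is_tree(graph):
    """A connected graph with no cycles."""
    return len(components(graph)) == 1 and is_acyclic(graph)


# Hypothetical example: a three-node chain plus a separate two-node fragment.
example = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
```

Here `example` is a disconnected graph with two components, both of which are trees, so it is a forest in the sense of slide 8 but not a tree.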

9 Canonicalisation
- a given chemical structure (or graph) can have many valid and unambiguous representations
  - different order of rows in connection table
  - different order of atoms in SMILES
- for comparison purposes it would be useful to have a single unique or “canonical” representation
- process of converting input representation to canonical form is called “canonicalisation” or “canonisation”
  - process of applying “rules” (i.e. an algorithm)

10 Canonicalisation
- an obvious approach:
  - generate all possible valid SMILES
  - choose the one that comes first alphabetically
- this would be effective but very slow, and there is a danger of missing one
- this principle was used for canonicalising Wiswesser Line Notation
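
The "generate everything and keep the first" idea can be sketched in a few lines. Note the substitution: instead of enumerating SMILES strings, this sketch enumerates every renumbering of a small connection table and keeps the one that sorts first. The function name and the data layout (a list of atom symbols plus a list of bonded index pairs) are assumptions made for illustration.

```python
from itertools import permutations


def brute_force_canonical(atoms, bonds):
    """Try every atom numbering and keep the smallest connection table.

    atoms: list of element symbols, indexed by atom number
    bonds: list of (i, j) index pairs
    The cost grows as n! in the number of atoms, so this is only
    usable for very small graphs.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new number given to atom `old`
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# Two input numberings of ethanol (C-C-O) and one of dimethyl ether (C-O-C)
ethanol_a = brute_force_canonical(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = brute_force_canonical(["O", "C", "C"], [(0, 1), (1, 2)])
ether = brute_force_canonical(["C", "O", "C"], [(0, 1), (1, 2)])
```

Both ethanol inputs give the same result, while the ether gives a different one. The factorial cost is the reason practical methods renumber the atoms directly instead.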

11 Canonicalisation
- most methods in use today involve renumbering the atoms in some unique and reproducible way
  - can be used to number rows in connection table
  - can determine order of atoms in SMILES
- normally involve a node labelling technique called “relaxation”
  - example is Morgan’s algorithm (1965)

12 Morgan’s algorithm
1. Label each node with its degree
2. Count number of different values

13 Morgan’s algorithm
3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values

18 Morgan’s algorithm
3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values
5. Repeat from step 3 until there is no increase in the number of different values
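
Steps 1 to 5 can be sketched as a short loop. This is a minimal illustration, assuming the graph is a dictionary from node to neighbour set and assuming that the labels kept are those from the last iteration that increased the number of distinct values; the function name is hypothetical.

```python
def morgan_labels(graph):
    """Iterate Morgan-style labels until the distinct-value count stops rising."""
    # Step 1: label each node with its degree
    labels = {node: len(nbs) for node, nbs in graph.items()}
    # Step 2: count the number of different values
    n_distinct = len(set(labels.values()))
    while True:
        # Step 3: new label = sum of the current labels of the neighbours
        new_labels = {
            node: sum(labels[nb] for nb in graph[node]) for node in graph
        }
        # Step 4: count the number of different values again
        new_distinct = len(set(new_labels.values()))
        # Step 5: stop when there is no increase (assumption: keep old labels)
        if new_distinct <= n_distinct:
            return labels
        labels, n_distinct = new_labels, new_distinct


# Hypothetical example: an unbranched chain of five atoms, 1-2-3-4-5
pentane = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
pentane_labels = morgan_labels(pentane)
```

For the five-atom chain the degrees (1, 2, 2, 2, 1) give two distinct values, the first recalculation (2, 3, 4, 3, 2) gives three, and the second (3, 6, 6, 6, 3) drops back to two, so the loop stops and keeps (2, 3, 4, 3, 2). The two end atoms share a label, as do atoms 2 and 4, which reflects the symmetry of the chain.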

19 Morgan’s algorithm
- most nodes now have different labels
- choose node with highest label as node 1
- number its neighbours in order of label values

21 Morgan’s algorithm
- move to node 2
- number its remaining neighbours in order of label values
  - because label values are tied, choose the one with the higher bond order (green) first
- move to node 3

22 Morgan’s algorithm
- continue till all nodes are numbered
- we now have a numbering for the rows of the connection table
- “breadth-first” trace
  - nodes are dealt with in a “queue” (first in, first out)
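
The breadth-first numbering can be sketched with a queue. Several details here are assumptions rather than statements from the slides: neighbours are taken in decreasing label order, ties are broken first by higher bond order and then by the original node identifier (which only matters for symmetrically equivalent atoms), and the graph is assumed to be connected. The labels are passed in, for example from a Morgan-style relaxation.

```python
from collections import deque


def bfs_numbering(graph, labels, bond_order=None):
    """Number nodes breadth-first, starting from the highest-labelled node.

    graph:      dict mapping node -> set of neighbour nodes
    labels:     dict mapping node -> integer label
    bond_order: optional dict mapping frozenset({a, b}) -> order (default 1)
    """
    bond_order = bond_order or {}

    def order(a, b):
        return bond_order.get(frozenset((a, b)), 1)

    # Node 1 is the node with the highest label
    start = min(graph, key=lambda n: (-labels[n], n))
    numbering = {start: 1}
    queue = deque([start])  # first in, first out
    while queue:
        node = queue.popleft()
        unnumbered = [nb for nb in graph[node] if nb not in numbering]
        # Highest label first; ties go to the higher bond order
        unnumbered.sort(key=lambda nb: (-labels[nb], -order(node, nb), nb))
        for nb in unnumbered:
            numbering[nb] = len(numbering) + 1
            queue.append(nb)
    return numbering


# Hypothetical example: five-atom chain with labels 2, 3, 4, 3, 2
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
chain_labels = {1: 2, 2: 3, 3: 4, 4: 3, 5: 2}
chain_numbering = bfs_numbering(chain, chain_labels)
```

In the chain example the middle atom becomes node 1, its two equivalent neighbours become nodes 2 and 3, and the end atoms become nodes 4 and 5. Replacing the `deque` with a list used as a stack would give a depth-first trace instead.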

24 Morgan’s algorithm
- “depth-first” trace is also possible
  - nodes are dealt with in a “stack” (last in, first out)
- more suitable for assigning atom numbers in SMILES, where we want consecutive numbers to form a path
  - OC(=O)C(N)CC1C=CC(O)=CC=1

25 Symmetry perception
- if ties between label values cannot be resolved on the basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn’t matter which is chosen next
- Morgan’s algorithm is thus also useful for identifying symmetry in molecules

26 Morgan’s algorithm
- Provides canonical numbering for the nodes in a graph that doesn’t depend on any original numbering
- Works by taking more of the graph into account at each iteration
  - essence of “relaxation” technique is iteratively updating a value by looking at its immediate neighbours
- It is not infallible
  - some graphs are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent
- There are many variations on it and several theoretical papers analysing it mathematically
  - O. Ivanciuc, “Canonical numbering and constitutional symmetry”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, Wiley, 2003

27 Canonicalisation
- Algorithms are applied to graphs, not chemical structures
- Issues such as aromaticity, tautomerism and stereochemistry need to be addressed before canonical numbering of the graph
  - Daylight’s canonicalisation algorithm for SMILES perceives aromatic rings (using its own definition of aromaticity) as first step

28 Ring perception
- How many rings are there in these structures and which ones are they?
- rings are important features of chemical structures
  - nomenclature generation
  - aromaticity perception
  - synthetic significance
  - fragment descriptor generation

29 Rings and ring systems
- A ring system is a subgraph in which every edge is part of a cycle

30 Ring perception
- Euler Relationship: nodes + rings = edges + components
  - where rings is the number of edges that must be removed from the graph to turn it into a tree
  - rings is also called the Frèrejacque number or nullity
  - this is the minimum possible number of rings; it may be useful to identify others
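
Rearranged, the Euler relationship gives rings = edges - nodes + components, which is easy to compute. The sketch below assumes a node-to-neighbour-set dictionary; the helper `graph_from_edges` and the example skeletons are illustrative only.

```python
def graph_from_edges(edges):
    """Build a node -> neighbour-set dictionary from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def ring_count(graph):
    """rings = edges - nodes + components (rearranged Euler relationship)."""
    n_nodes = len(graph)
    n_edges = sum(len(nbs) for nbs in graph.values()) // 2
    # Count components with a simple depth-first search
    seen = set()
    n_components = 0
    for start in graph:
        if start in seen:
            continue
        n_components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return n_edges - n_nodes + n_components


# Naphthalene-like skeleton: a 10-membered perimeter plus one fusion bond
perimeter = [(i, (i + 1) % 10) for i in range(10)]
naphthalene = graph_from_edges(perimeter + [(0, 5)])
```

The naphthalene-like skeleton has 10 nodes, 11 edges and 1 component, so the count is 2 even though three cycles (two 6-membered and the 10-membered envelope) can be traced in it.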

31 Which rings to perceive?
- Usually the smallest set of smallest rings (SSSR)
  - two 6-membered rather than one 6- and one 10-membered
  - two 5-membered rather than one 5- and one 6-membered
- But there may be more than one SSSR
  - C-S-C-C-C-C, C-C-C-C-O-C, C-S-C-C-O-C: three different 6-membered rings

32 Which rings to perceive?
- Sometimes a large envelope ring may be aromatic, when smaller rings are not
- Ring perception is a complex area where there are no right answers
  - there is a lot of literature on the subject

33 Ring perception by spanning tree
- start at an arbitrary node
- “grow a spanning tree”
  - add neighbours of current node to a queue
    - provided they are not already in it
  - move to the next node in the queue
  - repeat until queue is empty
- those edges from original graph not in the spanning tree are ring closures
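
The spanning-tree procedure can be sketched as follows, assuming a node-to-neighbour-set dictionary. The function name is hypothetical, and the sketch only examines the component that contains the start node.

```python
from collections import deque


def ring_closures(graph, start):
    """Grow a breadth-first spanning tree and return the leftover edges."""
    in_tree = {start}
    tree_edges = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in sorted(graph[node]):
            # Only add neighbours that are not already queued or visited
            if nb not in in_tree:
                in_tree.add(nb)
                tree_edges.add(frozenset((node, nb)))
                queue.append(nb)
    # All edges in the component reached from `start`
    all_edges = {frozenset((a, b)) for a in in_tree for b in graph[a]}
    # Edges of the original graph not in the spanning tree are ring closures
    return sorted(tuple(sorted(edge)) for edge in all_edges - tree_edges)


# Naphthalene-like skeleton: 10-membered perimeter plus fusion bond 0-5
skeleton = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
skeleton[0].add(5)
skeleton[5].add(0)
closures = ring_closures(skeleton, 0)
```

The number of ring-closure edges equals the ring count from the Euler relationship. Which particular edges are reported depends on the start node and on the order in which neighbours are visited.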

34 Substructure Fragments
- Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
  - –OH, –NH2, –COOH, phenyl
- this can be done by tracing appropriate paths in the graph
- subgraphs may overlap

35 Substructure Fragments
- More systematic subgraphs can also be identified (easier to do algorithmically)
  - paths of connected atoms
  - every atom and its immediate neighbours
  - rings
- Subgraphs can overlap (it’s difficult to show pictures with atoms in several colours at once!)

36 Substructure fragments
- fragments provide “index terms” for a chemical structure
  - analogous to keywords in a text document
- they can be used in searching for structures
  - retrieved structures must contain the same fragments as the query
- “ambiguous” representations
  - many different structures can have the same fragments, connected together in different ways
- fragments to be used may be a closed list
  - controlled “vocabulary” (dictionary) of structural features
- or an open-ended list (like free text searching)
  - e.g. all unbranched paths of up to 6 atoms

37 Fragment codes
- many early chemical information systems were based on identifying fragments of this sort
  - originally the fragments were identified manually
  - and represented on punched cards
- special fragment codes (dictionaries of fragments) were devised for different systems
  - some of these are still in use, though with automated encoding of structures
  - particularly important are the systems for “Markush” structures in patents (e.g. Derwent WPI code)

38 Fingerprints
- the fragments present in a structure can be represented as a sequence of 0s and 1s
  - 0 means fragment is not present in structure
  - 1 means fragment is present in structure (perhaps multiple times)
- each 0 or 1 can be represented as a single bit in the computer, so the whole sequence is a “bitstring”
- for chemical structures these are often called structure “fingerprints”

39 Fingerprints
- fingerprints are typically … bits long
- where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
  - sometimes several related fragments will “set” the same bit
- disadvantage is that if structure contains no fragments from the dictionary, no bits are set
  - can be avoided if “generalised” fragments are used (involving e.g. “any atom”, “any ring bond” types)

40 Fingerprints
- if fragment set is open-ended, the fragment description (e.g. C-C-N-C-C-O) can be “hashed” to a number in a fixed range (e.g. 1 to 1024), and this is the bit number to be set
- disadvantages:
  - different and unrelated fragments may “collide” at the same bit position
  - difficult to work back from bit position to fragment
- this usually causes only slight degradation in search performance (false hits), but can be more of a problem in other applications of fingerprints
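
A hashed path fingerprint can be sketched as follows. This is an illustration under stated assumptions, not the scheme used by any particular vendor: fragments are unbranched atom paths of up to `max_atoms` atoms, bond types are ignored, `zlib.crc32` stands in for the hash function, bit positions run from 0 to `n_bits - 1` rather than 1 to 1024, and the fingerprint is held in a Python integer. All function names are hypothetical.

```python
import zlib


def path_fragments(atoms, graph, max_atoms=6):
    """Collect all unbranched atom paths of up to max_atoms atoms.

    atoms: dict mapping node -> element symbol
    graph: dict mapping node -> set of neighbour nodes
    """
    fragments = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        forward = "-".join(symbols)
        backward = "-".join(reversed(symbols))
        # Store one direction only so C-O and O-C are the same fragment
        fragments.add(min(forward, backward))
        if len(path) == max_atoms:
            return
        for nb in graph[path[-1]]:
            if nb not in path:
                extend(path + [nb])

    for start in graph:
        extend([start])
    return fragments


def hashed_fingerprint(fragments, n_bits=1024):
    """Hash each fragment description to a bit position and set that bit."""
    fingerprint = 0
    for fragment in fragments:
        bit = zlib.crc32(fragment.encode("ascii")) % n_bits
        fingerprint |= 1 << bit
    return fingerprint


def passes_screen(query_fp, target_fp):
    """A target can only match if every bit set in the query is set in it."""
    return (query_fp & target_fp) == query_fp


# Hypothetical example: ethanol (C-C-O) as target, a C-O pattern as query
ethanol_frags = path_fragments({0: "C", 1: "C", 2: "O"},
                               {0: {1}, 1: {0, 2}, 2: {1}})
query_frags = path_fragments({0: "C", 1: "O"}, {0: {1}, 1: {0}})
ethanol_fp = hashed_fingerprint(ethanol_frags)
query_fp = hashed_fingerprint(query_frags)
```

Because every fragment of the query also occurs in ethanol, `passes_screen(query_fp, ethanol_fp)` is true. A collision between unrelated fragments can also let a non-matching structure through the screen, which is the source of the false hits mentioned above.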

41 Fingerprints
- Hashed fingerprints
  - typically used in software from Daylight Chemical Information Systems Inc.
- Dictionary fingerprints
  - Chemical Abstracts Service
  - MDL Information Systems Inc
    - ISIS or MACCS keys (166 and 960 bits)
  - Barnard Chemical Information Ltd
    - customised dictionaries

42 2D structure depiction
- if structures are stored without 2D display coordinates (e.g. as SMILES), we need to generate them
- “depiction” algorithms are used for this
- identify and lay out ring systems first
  - complications over orientation of some systems
  - Chemical Abstracts stores “standard depictions” of all ring systems it has encountered
- then add side chains, avoiding collisions
  - many features can be added to improve appearance

43 3D structure depiction
- much more complicated than 2D
- need to store standard bond lengths and angles
- need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
- need to rotate single bonds to avoid “bumps”
- sophisticated “conformation generation” programs identify low-energy conformers
  - very useful for identifying molecules with the correct shape to fit into biological receptor sites
- J. Sadowski, “3D structure generation”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, Wiley, 2003

44 Nomenclature generation
- most systematic nomenclature is based on ring systems
  - need to identify/prioritise ring systems first
  - identify standard numbering for system
    - frequently need to store this
  - add side chains and substituents with appropriate locants
- J. L. Wisniewski, “Chemical nomenclature and structure representation: algorithmic generation and conversion”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, Wiley, 2003

45 Conclusions from Lecture 3
- there are several important jargon terms used in graph theory, which crop up in chemical informatics
- canonicalisation provides a unique numbering for the atoms in a molecule
  - Morgan algorithm can be used to achieve it
- it’s not always obvious how many rings there are, or which ones they are
- fingerprints represent the presence or absence of substructure fragments in a molecule
  - they are ambiguous representations of structure

46 Topic for Lecture 4: Structure searching
- two main varieties of search
  - full structure search
    - query is a complete molecule
    - is this molecule in the database (or tautomers, stereoisomers etc. of it)?
  - substructure search
    - query is a pattern of atoms and bonds
    - does this pattern occur as a substructure (subgraph) of any of the molecules in my database?