Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search Milorad Tosic, Ph.D. Rutgers, The State University of New Jersey.


Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search
Milorad Tosic, Ph.D.
Rutgers, The State University of New Jersey
Department of Chemistry

Databases of Chemical Structures: Similarity Searching Features
Size of the database
–A couple of hundred thousand structures
Nature of the structure data
–Purified, consistent data
–Raw, inconsistent data
Search type
–Structure search
–Substructure search [DOW96], [BAR93]
–Substructure similarity search [HAG92], [GWW98], [ART92]
–Superstructure search (structures contained in the target structure)
Type of similarity (from less general to more general)
–Graph isomorphism
–Subgraph isomorphism
–Maximal common subgraph

Substructure similarity search
Screening search
–Based on substructural features that are typically small fragment substructures
–Processes many thousands of structures per second
–Precedes the detailed and time-consuming atom-by-atom search
Atom-by-atom search (MCS, Maximal Common Substructure search)
–The MCS of a pair of structures is the largest substructure that is present in both structures.
–The MCS is interpreted as a similarity measure between two structures that corresponds favorably to an “intuitive” notion of chemical similarity.
–The MCS is our primary concern because of its importance for search quality and its exponential computational complexity.
[DOW96], [BAR93], [HAG92], [GWW98], [ART92]
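The screening stage described above can be sketched roughly as follows. This is only an illustration, not the presentation's implementation: the fragment strings, the Tanimoto coefficient used as the screening score, the 0.5 threshold, and the names `tanimoto` and `screen` are all assumptions.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Shared fragments divided by all distinct fragments (assumed score)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def screen(target: FrozenSet[str],
           database: Dict[str, FrozenSet[str]],
           threshold: float = 0.5) -> List[str]:
    """Return the names of structures that survive the cheap screening step.

    Only these survivors would be handed to the expensive atom-by-atom
    (MCS) comparison.
    """
    return [name for name, frags in database.items()
            if tanimoto(target, frags) >= threshold]


# Hypothetical fragment sets, for illustration only.
target = frozenset({"C-C", "C=O", "C-O"})
database = {
    "acid": frozenset({"C-C", "C=O", "C-O", "O-H"}),
    "amine": frozenset({"C-C", "C-N", "N-H"}),
}
candidates = screen(target, database)
print(candidates)
```

Here "acid" shares three of four distinct fragments with the target and passes, while "amine" shares one of five and is screened out.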

MCS - Maximal Common Substructure search
NP-complete problem
–Subgraph isomorphism is proven to be an NP-complete problem, and it is a special case of MCS, so MCS is at least as hard (its decision version is also NP-complete)
–All known exact algorithms have (at least) exponential worst-case computational complexity
Average run time can be reduced by:
–Using a faster computer
–Using various heuristics
–Carrying out some computation in a pre-processing phase
[XUJ96] [BAR93]

Our strategy for MCS search
Back-tracking
–Back-tracking is used as a common underlying algorithm for problems with exponential complexity
Distributed objects
–Distributed computing is explored to increase processing speed
–Persistent objects are essential for the robustness of the search engine
Topology-based comparison criteria
–Topology-based features of chemical structures are attractive for efficient description of structures
–Topological queries and indexing in collections of distributed objects are considered a promising approach in similar applications
–Our heuristics for reducing average search time, and for postponing the computational explosion to structures that are as large as possible, are based on substructure-by-substructure instead of atom-by-atom search
[XUJ96], [EST98], [WAN98], [PSV99]
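The back-tracking idea above can be illustrated with a small sketch. It is not the presentation's algorithm: it searches for a maximum common induced subgraph (not necessarily connected) of two tiny graphs, and the adjacency-dictionary format, the `compatible` callback, and all names are assumptions made for the example.

```python
from typing import Callable, Dict, Set

Graph = Dict[str, Set[str]]  # vertex -> set of adjacent vertices


def mcs_backtrack(g1: Graph, g2: Graph,
                  compatible: Callable[[str, str], bool]) -> Dict[str, str]:
    """Return a largest vertex mapping g1 -> g2 that preserves adjacency."""
    nodes1 = sorted(g1)
    nodes2 = sorted(g2)
    best: Dict[str, str] = {}

    def consistent(u: str, v: str, mapping: Dict[str, str]) -> bool:
        # u-a must be an edge in g1 exactly when v-b is an edge in g2.
        return all((a in g1[u]) == (b in g2[v]) for a, b in mapping.items())

    def extend(i: int, mapping: Dict[str, str], used: Set[str]) -> None:
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        # Bound: stop if even mapping every remaining vertex cannot win.
        if len(mapping) + (len(nodes1) - i) <= len(best):
            return
        u = nodes1[i]
        for v in nodes2:
            if v not in used and compatible(u, v) and consistent(u, v, mapping):
                mapping[u] = v
                used.add(v)
                extend(i + 1, mapping, used)
                del mapping[u]          # back-track
                used.discard(v)
        extend(i + 1, mapping, used)    # also try leaving u unmapped

    extend(0, {}, set())
    return best


# Hypothetical structures: a C-C-O chain and a C-O fragment.
g1 = {"c1": {"c2"}, "c2": {"c1", "o3"}, "o3": {"c2"}}
g2 = {"x": {"y"}, "y": {"x"}}
labels1 = {"c1": "C", "c2": "C", "o3": "O"}
labels2 = {"x": "C", "y": "O"}
result = mcs_backtrack(g1, g2, lambda u, v: labels1[u] == labels2[v])
print(result)
```

The `compatible` callback is the place where a stricter comparison criterion, such as a topology-based one, would prune candidate pairs before the search recurses.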

Experimental results - question
Compare search time with and without topology-based criteria, for the same set of target structures and the same set of database structures.
The topology criterion used is based on the loop number: an atom X matches an atom Y iff they have the same atom type and the number of loops that X belongs to is not greater than the number of loops that Y belongs to.
To examine how atom types influence the search process, the same set of target structures is applied both including and excluding hydrogens.
Is there any search speed-up due to the introduction of topology-based comparison criteria?
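The loop-number criterion stated above can be written as a small predicate. This is a sketch under assumptions: the `Atom` record, its field names, and the idea that loop counts are precomputed are illustrative choices, not details given in the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str      # atom type, e.g. "C" or "N"
    loop_count: int   # number of loops the atom belongs to (assumed precomputed)


def atoms_match_plain(x: Atom, y: Atom) -> bool:
    """Baseline criterion: atom types must be equal."""
    return x.element == y.element


def atoms_match_topological(x: Atom, y: Atom) -> bool:
    """Loop-number criterion: same type, and X is in no more loops than Y."""
    return x.element == y.element and x.loop_count <= y.loop_count


chain_carbon = Atom("C", 0)
ring_carbon = Atom("C", 1)
ring_nitrogen = Atom("N", 1)

print(atoms_match_topological(chain_carbon, ring_carbon))  # X in fewer loops
print(atoms_match_topological(ring_carbon, chain_carbon))  # X in more loops
```

Note that the criterion is asymmetric: a chain atom may match a ring atom of the same type, but not the other way round, so it rejects pairs that the plain type comparison would accept.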

Search with Hydrogens excluded

Search with Hydrogens included

Experimental results - answer
Is there any search speed-up due to the introduction of topology-based comparison criteria? - YES
–A search speed-up is evident when topology-based criteria are applied.
–Oscillations in search time indicate further potential for improving speed.
–Exponential complexity remains (both curves show the same growth trend), but introducing topology-based criteria shifts the point of run-time explosion toward much more complex structures.
–The relative improvement is higher when structures without hydrogens are considered.
–If such a conclusion can be made for specific atom types, then much better results can be expected for specific substructure types.

Experimental results - question
Do topology-based comparison criteria improve the substructure similarity measure?
Compare the sets of resulting structures obtained by searching with and without topology-based criteria, for the same set of target structures and the same set of database structures.
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria?

Target structure

Two of the resulting structures
The structure is eliminated

Experimental results - answer
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria? - YES
–The number of resulting structures decreases.
–The probability that the expected structures are found in the set of resulting structures increases.

Serializable hyper-graph
–Different characteristic substructures are represented in a uniform way
–Efficient implementation of topology-based comparison criteria
–Pointer-based data structure with no extra delay due to serialization
–Persistent storage of such objects is straightforward
–Easy to adapt to any distributed-object technology

Hyper-graph: definitions
Definition: A hyper-graph HG is an ordered two-tuple HG = (C, E), where C is the set of hyper-graphs that are containers of HG, and E is the set of hyper-graphs that are elements of HG: C = { c | c > HG }, E = { e | e < HG }.
Definition: An undirected hyper-graph HG is an ordered two-tuple HG = ((C, E), I), where (C, E) is a hyper-graph and I is the set of undirected hyper-graphs that are neighbors of HG. We say that HG is in an undirected connection relation with its neighbors.
Definition: The undirected connection relation is an equivalence relation.

Hyper-graph: definitions (cont'd)
Definition: A directed hyper-graph HG is an ordered three-tuple HG = ((C, E), I, O), where (C, E) is a hyper-graph, I is the set of directed hyper-graphs that are input neighbors of HG, and O is the set of directed hyper-graphs that are output neighbors of HG. We say that HG is in a directed connection relation with its neighbors.
Definition: The directed connection relation is an order relation.
Note: We use the undirected hyper-graph in MCS.
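The undirected definition above can be mirrored by a small pointer-based class. This is a sketch only: the class name, attribute names, and helper methods are assumptions, and the presentation's own class hierarchy is not reproduced here.

```python
class HyperGraph:
    """Undirected hyper-graph node HG = ((C, E), I), linked by object references."""

    def __init__(self, ident: str, kind: str) -> None:
        self.ident = ident
        self.kind = kind                 # e.g. "GRAPH", "VERTEX", "EDGE"
        self.containers: set = set()     # C: hyper-graphs that contain this one
        self.elements: set = set()       # E: hyper-graphs contained in this one
        self.neighbors: set = set()      # I: hyper-graphs connected to this one

    def add_element(self, child: "HyperGraph") -> None:
        """Record containment in both directions (E of self, C of child)."""
        self.elements.add(child)
        child.containers.add(self)

    def connect(self, other: "HyperGraph") -> None:
        """Record the undirected connection symmetrically."""
        self.neighbors.add(other)
        other.neighbors.add(self)


# A two-vertex graph with one edge, built from the same kind of objects.
g = HyperGraph("G1", "GRAPH")
v1 = HyperGraph("v1", "VERTEX")
v2 = HyperGraph("v2", "VERTEX")
e12 = HyperGraph("e12", "EDGE")
for part in (v1, v2, e12):
    g.add_element(part)
e12.connect(v1)
e12.connect(v2)

print(sorted(n.ident for n in e12.neighbors))
```

Graphs, vertices, and edges are all instances of one class here, which is one way to read the claim that different substructures are represented uniformly. A directed variant would keep separate input and output neighbor sets instead of the single `neighbors` set.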

Hyper-graph: example
[Figure: graph G1 with vertices v1, v2, v3, v4, v5, v6, v7, v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]
v1: id = v1; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12};
v2: id = v2; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12, e23, e24};
G1: id = G1; type = GRAPH; Container = {}; Elements = {v1, …, v8, e12, e23, …, e68}; InElements = {};
...
e12: id = e12; type = EDGE; Container = {G1}; Elements = {}; InElements = {v1, v2};
e23: id = e23; type = EDGE; Container = {G1}; Elements = {}; InElements = {v2, v3};
...
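The records above can be rebuilt as id-keyed dictionaries and checked for internal consistency. This sketch assumes that each edge name encodes its two endpoints (e12 joins v1 and v2, as the listed records for e12 and e23 suggest); the dictionary layout and the function `is_consistent` are illustrative, not part of the presentation.

```python
edge_names = ["e12", "e23", "e24", "e35", "e45", "e46", "e57", "e67", "e68"]
vertex_names = [f"v{i}" for i in range(1, 9)]

records = {}
for v in vertex_names:
    records[v] = {"type": "VERTEX", "Container": {"G1"},
                  "Elements": set(), "InElements": set()}
for e in edge_names:
    # Assumption: the two digits of an edge name are its endpoint numbers.
    ends = {f"v{e[1]}", f"v{e[2]}"}
    records[e] = {"type": "EDGE", "Container": {"G1"},
                  "Elements": set(), "InElements": set(ends)}
    for v in ends:
        records[v]["InElements"].add(e)
records["G1"] = {"type": "GRAPH", "Container": set(),
                 "Elements": set(vertex_names) | set(edge_names),
                 "InElements": set()}


def is_consistent(recs: dict) -> bool:
    """Check that Container/Elements and InElements are mutually inverse."""
    for name, rec in recs.items():
        if any(name not in recs[n]["InElements"] for n in rec["InElements"]):
            return False
        if any(name not in recs[c]["Elements"] for c in rec["Container"]):
            return False
        if any(name not in recs[x]["Container"] for x in rec["Elements"]):
            return False
    return True


print(sorted(records["v2"]["InElements"]))
print(is_consistent(records))
```

Because every reference is stored as an id rather than an object pointer, records of this shape can be written to persistent storage directly.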

Hyper-graph: example (cont'd)
After simple-loop reduction
[Figure: the reduced graph G2, with groups g1, g2, g3, g4 joined by edges e1, e2, e3. Label clusters in the figure: (v1, v2, e12); (v2, v3, v4, v5, e23, e24, e35, e45); (v4, v5, v6, v7, e45, e46, e57, e67); (v6, v8, e68)]
G2: id = G2; type = GRAPH; Container = {}; Elements = {g1, g2, g3, g4, e1, e2, e3, e4}; InElements = {};
g1: id = g1; type = GRAPH; Container = {G2}; Elements = {v1, v2, e12}; InElements = {e1};
g2: id = g2; type = LOOP; Container = {G2}; Elements = {v2, v3, v4, v5, e23, e24, e35, e45}; InElements = {e1, e2};
e1: id = e1; type = EDGE; Container = {G2}; Elements = {v2}; InElements = {g1, g2};
e2: id = e2; type = EDGE; Container = {G2}; Elements = {v4, v5, e45}; InElements = {g2, g3};
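In the reduced example above, each connecting edge holds exactly the elements shared by the two groups it joins (e1 holds v2, e2 holds v4, v5, e45). The sketch below recomputes those connecting edges from the groups. It does not detect loops itself; the contents of g3 and g4 are assumed from the figure labels, and the function name and edge numbering are illustrative.

```python
from itertools import combinations

groups = {
    "g1": {"v1", "v2", "e12"},
    "g2": {"v2", "v3", "v4", "v5", "e23", "e24", "e35", "e45"},
    # g3 and g4 are not listed as records in the slide; assumed from the figure.
    "g3": {"v4", "v5", "v6", "v7", "e45", "e46", "e57", "e67"},
    "g4": {"v6", "v8", "e68"},
}


def reduced_edges(grps: dict) -> dict:
    """Create one connecting edge per pair of groups that share elements."""
    edges = {}
    for a, b in combinations(sorted(grps), 2):
        shared = grps[a] & grps[b]
        if shared:
            edges[f"e{len(edges) + 1}"] = {"Elements": shared,
                                           "InElements": {a, b}}
    return edges


edges = reduced_edges(groups)
for name, rec in edges.items():
    print(name, sorted(rec["Elements"]), sorted(rec["InElements"]))
```

With these groups the sketch yields three connecting edges, and the first two agree with the e1 and e2 records listed in the slide.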

Hyper-graph: class hierarchy

Conclusions
–Experimental analysis confirmed the fact, pointed out in the literature, that topological information about a chemical structure (information about loops, in these experiments) can improve substructure similarity searching.
–Because MCS is an NP-complete problem, the efficiency of the applied computing model is very important.
–Distributed objects are currently the most promising computational approach. Hence, they should be applied to substructure similarity search in chemical structure databases.
–The proposed hyper-graph model can efficiently represent both the topology and the behavioral characteristics of a chemical structure, in a hierarchical way.
–Due to the efficient serialization method, the object representation of the hyper-graph can be incorporated into any distributed technology (e.g. CORBA) without decreasing execution efficiency.

References
[DOW96] Downs, G.M., and Willett, P. (1995), Similarity searching in databases of chemical structures. Rev. Comput. Chem., 7.
[GWW98] Gillet, V.J., Wild, D.J., Willett, P., and Bradshaw, J. (1998), Similarity and dissimilarity methods for processing chemical structure databases. The Computer Journal, 41, No. 8.
[HAG92] Hagadone, T.R. (1992), Molecule substructure similarity searching: Efficient retrieval in two-dimensional structure databases. J. Chem. Inf. Comput. Sci., 32.
[WAN98] Wang, T., and Zhou, J. (1998), 3DFS: A new 3D flexible searching system for use in drug design. J. Chem. Inf. Comput. Sci., 38.
[XUJ96] Xu, J. (1996), GMA: A generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications. J. Chem. Inf. Comput. Sci., 36.
[PSV99] Papadimitriou, C.H., Suciu, D., and Vianu, V. (1999), Topological queries in spatial databases. Journal of Comput. and Sys. Sci., 58.
[ART92] Artymiuk, J., et al. (1992), Similarity searching of three-dimensional molecules and macromolecules. J. Chem. Inf. Comput. Sci., 32.
[BAR93] Barnard, J.M. (1993), Substructure searching methods: Old and New. J. Chem. Inf. Comput. Sci., 33.
[EST98] Estrada, E. (1998), Spectral moments of the edge adjacency matrix in molecular graphs. J. Chem. Inf. Comput. Sci., 38.