Mug01, 6-9 March 2001, Santa Fe, New Mexico, USA. SMILES Multigram Compression Roger Sayle 1 and Jack Delany 2 1 Metaphorics LLC, Santa Fe, New Mexico.

Presentation transcript:

SMILES Multigram Compression
Roger Sayle (1) and Jack Delany (2)
(1) Metaphorics LLC, Santa Fe, New Mexico
(2) Daylight CIS, Santa Fe, New Mexico

Introduction
One of the major benefits of line notations such as SMILES over traditional connection tables is their compact representation. For NCI95, the SMILES average 33 bytes per molecule, whereas an MDL .mol file, for example, is over 1400 bytes. This advantage has enabled Daylight’s software to hold even the largest chemical databases in memory since the early 1980s, and to access and search this data much faster than disk-based systems.

Multigram Encoding
- Only 70 different characters can occur in a valid SMILES string.
- Allowing for a (null) terminator character, that leaves 256 - 70 - 1 = 185 byte values that cannot normally occur in a SMILES.
- Multigram compression uses these unused values to represent commonly occurring SMILES substrings.
- Compression occurs because the entire substring (or multigram) is encoded as a single byte.

Advantages of Multigrams
- Conceptually very simple to implement.
- Extremely fast data decompression.
- Each SMILES decompresses independently.
- Domain-specific ‘a priori’ statistical model.
- Guaranteed worst-case performance.
- Uncompressed data is treated identically.
- Efficient compression implementation.
- Processing of the compressed form is possible.

Examples of Multigrams
- Canonical SMILES: [nH]c1ccccc1[N+](=O)[O-] S(=O)(=O)c1ccc(cc1)[N+] Cl[n+](C) [O+]C=CC(=O)N
- Isomeric SMILES
- Reaction SMILES: [cH:[CH:1 [CH2:[c:[O:

Multigram Decompression
- Decompression of multigram-encoded SMILES is almost trivial:

    extern char *MultiGram[256];

    /* inp must be read as unsigned bytes so that values above 127
       index the table correctly. */
    dst = outp;
    for( i=0; inp[i]; i++ ) {
        src = MultiGram[inp[i]];        /* expansion of this byte */
        while( *src ) *dst++ = *src++;  /* copy it to the output */
    }
    *dst = '\0';                        /* terminate the output string */
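The loop on this slide leaves the declarations and the table contents implicit. The following is a minimal self-contained sketch of the same idea, not the authors' implementation. The names `init_table` and `smi_expand`, the output-capacity check, and the assignment of byte values 0x80 and 0x81 to two sample substrings are all assumptions made for illustration.

```c
#include <stddef.h>

/* Toy expansion table. Byte values 0x80 and 0x81 are HYPOTHETICAL
   assignments chosen for this sketch; every other byte expands to itself. */
static char self_text[256][2];
static const char *multigram[256];

static void init_table(void)
{
    for (int i = 0; i < 256; i++) {
        self_text[i][0] = (char)i;
        self_text[i][1] = '\0';
        multigram[i] = self_text[i];
    }
    multigram[0x80] = "c1ccccc1";
    multigram[0x81] = "[N+](=O)[O-]";
}

/* Expand the NUL-terminated compressed string inp into outp, which has
   room for cap bytes including the terminator. Returns the expanded
   length, or (size_t)-1 if outp is too small. */
static size_t smi_expand(const unsigned char *inp, char *outp, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; inp[i] != 0; i++) {
        for (const char *src = multigram[inp[i]]; *src != '\0'; src++) {
            if (n + 1 >= cap)
                return (size_t)-1;
            outp[n++] = *src;
        }
    }
    if (cap == 0)
        return (size_t)-1;
    outp[n] = '\0';
    return n;
}
```

With this toy table, the two-byte input {0x80, 0x81} expands to the 20-character string c1ccccc1[N+](=O)[O-]. Because each byte expands without reference to its neighbours, every SMILES can be decompressed independently of the rest of the database.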

Multigram Compression
- Efficient compression is trickier.
- Consider a simple alphabet of only “A” and “B”, with the set of multigrams “A”, “B”, “AB” and “BAA”.
- Encode the string “ABAA”.
- The greedy solution uses 3 bytes: “AB”, “A” and “A”.
- An optimal solution uses only 2 bytes: “A” and “BAA”.
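The greedy result on this slide can be reproduced with a short sketch. It assumes that "greedy" means taking the longest multigram that matches at the current position, which is consistent with the slide's answer but is not stated there; the function name `greedy_code_count` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Count the codes used by greedy longest-match encoding of s against a
   dictionary of n multigrams. Returns (size_t)-1 if some position
   matches no multigram. */
static size_t greedy_code_count(const char *s, const char *const *grams, size_t n)
{
    size_t len = strlen(s), pos = 0, codes = 0;
    while (pos < len) {
        size_t best = 0;
        for (size_t g = 0; g < n; g++) {
            size_t glen = strlen(grams[g]);
            if (glen > best && glen <= len - pos &&
                strncmp(s + pos, grams[g], glen) == 0)
                best = glen;   /* longest match found so far */
        }
        if (best == 0)
            return (size_t)-1;
        pos += best;
        codes++;
    }
    return codes;
}
```

For the multigrams “A”, “B”, “AB” and “BAA”, this returns 3 for “ABAA”: the longest match at the start is “AB”, which leaves two single “A” codes and rules out the shorter “A” + “BAA” encoding.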

Dynamic Programming
- The computer science solution to such 1D tiling problems is a two-pass algorithm called “Dynamic Programming”.
- For each prefix, the optimal encoded length is one more than the optimal length of the sub-prefix that remains after removing a multigram matching the end of the prefix, minimized over all such multigrams.
- To encode the string “ABAA”:

    encode(“A”)    = 1
    encode(“AB”)   = min(encode(“A”)+1, 1) = 1
    encode(“ABA”)  = encode(“AB”)+1 = 2
    encode(“ABAA”) = min(encode(“ABA”)+1, encode(“A”)+1) = 2
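The recurrence on this slide can be written out as a small sketch. This version scans the whole multigram list at every position for clarity, so it is not the efficient implementation the talk refers to. The function name `optimal_encode`, the `pick` array and the choice to return multigram indices are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Optimal tiling of s by the n multigrams in grams.
   Pass 1 (forward): cost[i] = fewest codes covering the first i characters.
   Pass 2 (backward): follow the recorded choices to recover the tiling.
   On success returns the code count and writes the chosen multigram
   indices, left to right, to out (room for strlen(s) entries needed).
   Returns (size_t)-1 if s cannot be tiled or memory runs out. */
static size_t optimal_encode(const char *s, const char *const *grams,
                             size_t n, size_t *out)
{
    const size_t NONE = (size_t)-1;
    size_t len = strlen(s);
    size_t *cost = malloc((len + 1) * sizeof *cost);
    size_t *pick = malloc((len + 1) * sizeof *pick);
    size_t result = NONE;

    if (cost != NULL && pick != NULL) {
        cost[0] = 0;
        pick[0] = NONE;
        for (size_t i = 1; i <= len; i++) {
            cost[i] = NONE;
            pick[i] = NONE;
            for (size_t g = 0; g < n; g++) {
                size_t glen = strlen(grams[g]);
                if (glen == 0 || glen > i || cost[i - glen] == NONE)
                    continue;
                if (memcmp(s + i - glen, grams[g], glen) != 0)
                    continue;   /* multigram does not end this prefix */
                if (cost[i] == NONE || cost[i - glen] + 1 < cost[i]) {
                    cost[i] = cost[i - glen] + 1;
                    pick[i] = g;
                }
            }
        }
        if (cost[len] != NONE) {
            result = cost[len];
            size_t k = result;
            for (size_t i = len; i > 0; i -= strlen(grams[pick[i]]))
                out[--k] = pick[i];
        }
    }
    free(cost);
    free(pick);
    return result;
}
```

With the multigrams “A”, “B”, “AB” and “BAA” (indices 0 to 3), encoding “ABAA” returns 2 and writes the indices 0 and 3, that is “A” followed by “BAA”, matching the worked example.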

Trie Construction

FSM Construction

Multigram Training Sets
- train.smi: SMILES from WDI, NCI, ACD, SYNTH, totaling bytes. [48.5 bytes/mol]
- train.ism: isomeric SMILES from WDI and ACD, totaling bytes. [93.1 bytes/mol]
- train.rism: reaction SMILES from SYNTH, totaling bytes. [287.2 bytes/mol]

Training Set Performance
- train.smi
    smizip          580818/ %
    smizip (renum)  546094/ %
    gzip            / %
- train.ism
    smizip          654737/ %
    smizip (renum)  610254/ %
    gzip            / %
- train.rism
    smizip          / %
    smizip (renum)  / %
    gzip            / %

Multigram Cross-Validation
Smi+ism is a combination of the 155 best absolute SMILES multigrams and the 30 best isomeric SMILES multigrams.

General Results
- Chemical Database Results
    ACD002 ( SMILES) / %
    NCI00 ( SMILES) / %
    WDI011 (28298 ISOMERS) / %
    SYNTH97 ( ISORXNS) / %
- Oracle Cartridge Results
    - No measurable effect on index creation/insertion time.
    - Cartridge index data is 20% smaller for NCI00.
    - Fingertest, Tanimoto and Tversky are 5-15% faster.
    - Contains and Matches (with triage) are 0-1% slower.
