Mug01, 6-9 March 2001, Santa Fe, New Mexico, USA. SMILES Multigram Compression. Roger Sayle (1) and Jack Delany (2). (1) Metaphorics LLC, Santa Fe, New Mexico.


1 Mug01, 6-9 March 2001, Santa Fe, New Mexico, USA.
SMILES Multigram Compression
Roger Sayle (1) and Jack Delany (2)
(1) Metaphorics LLC, Santa Fe, New Mexico
(2) Daylight CIS, Santa Fe, New Mexico

2 Introduction
One of the major benefits of line notations, such as SMILES, over traditional connection tables is their compact representation. For NCI95, the SMILES average 33 bytes per molecule, whereas an MDL .mol file, for example, is over 1400 bytes. This advantage has enabled Daylight’s software to store even the largest chemical databases in memory since the early 1980s, and to access and search this data much faster than disk-based systems.

3 Multigram Encoding
- Only 70 different characters can occur in a valid SMILES string:
  #%()*+-./0123456789:=>@ABCDEFGHIKLMNOPRSTUVWXYZ[\]abcdefghiklmnoprstuy
- Allowing for a (null) terminator character, that leaves 185 of the 256 possible byte values that cannot normally occur in a SMILES.
- Multigram compression uses these unused values to represent commonly occurring SMILES substrings.
- Compression occurs because the entire substring (or multigram) is encoded as a single byte.
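The count of 185 free byte values can be checked mechanically. The following is a minimal sketch, not code from the talk: the names smiles_alphabet and count_free_codes are made up for illustration, and the alphabet string is copied from the slide above.

```c
#include <string.h>

/* The 70 characters the slide lists as valid in a SMILES string. */
static const char smiles_alphabet[] =
    "#%()*+-./0123456789:=>@"
    "ABCDEFGHIKLMNOPRSTUVWXYZ[\\]"
    "abcdefghiklmnoprstuy";

/* Count the byte values 1..255 that do not appear in the alphabet.
   Value 0 is skipped because it is reserved for the string terminator. */
int count_free_codes(void)
{
    int used[256] = {0};
    int free_codes = 0;

    for (const char *p = smiles_alphabet; *p; p++)
        used[(unsigned char)*p] = 1;
    for (int c = 1; c < 256; c++)
        if (!used[c])
            free_codes++;
    return free_codes;
}
```

Calling count_free_codes() returns 256 - 70 - 1 = 185, the number of byte values available for multigram codes.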

4 Advantages of Multigrams
- Conceptually very simple to implement.
- Extremely fast data decompression.
- Each SMILES decompresses independently.
- Domain-specific ‘a priori’ statistical model.
- Guaranteed worst case performance.
- Uncompressed data is treated identically.
- Efficient compression implementation.
- Processing of compressed form possible.

5 Examples of Multigrams
- Canonical SMILES
  [nH]c1ccccc1[N+](=O)[O-] S(=O)(=O)c1ccc(cc1)[N+] Cl[n+](C) [O+]C=CC(=O)N
- Isomeric SMILES
  [C@@H][C@H][C@@] [C@]/C=C//C=C\
- Reaction SMILES
  [cH:[CH:1 [CH2:[c:[O:

6 Multigram Decompression
- Decompression of multigram-encoded SMILES is almost trivial:

    extern char *MultiGram[256];   /* expansion string for each byte value */

    dst = outp;
    for( i=0; inp[i]; i++ ) {
        /* index with an unsigned value so codes above 127 are not negative */
        src = MultiGram[(unsigned char)inp[i]];
        while( *src ) *dst++ = *src++;
    }
    *dst = '\0';                   /* terminate the decompressed SMILES */
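To make the loop above concrete, here is a small self-contained sketch built around a toy table. Everything named here (init_toy_table, toy_decompress, and the assignment of codes 0x80 and 0x81) is an assumption for illustration only; the real table used by the authors is not given in the slides.

```c
#include <stddef.h>

/* Hypothetical toy table: each byte expands to itself unless it has been
   assigned a multigram. */
static char identity[256][2];
static const char *table[256];

void init_toy_table(void)
{
    for (int c = 0; c < 256; c++) {
        identity[c][0] = (char)c;
        identity[c][1] = '\0';
        table[c] = identity[c];
    }
    table[0x80] = "c1ccccc1";      /* assumed code assignment */
    table[0x81] = "[N+](=O)[O-]";  /* assumed code assignment */
}

/* Expand the NUL-terminated compressed string inp into outp and return the
   number of characters written, excluding the terminator. The caller must
   supply a buffer large enough for the expanded SMILES. */
size_t toy_decompress(const unsigned char *inp, char *outp)
{
    char *dst = outp;

    for (size_t i = 0; inp[i]; i++) {
        const char *src = table[inp[i]];
        while (*src)
            *dst++ = *src++;
    }
    *dst = '\0';
    return (size_t)(dst - outp);
}
```

With this toy table, the two-byte input {0x80, 0x81} expands to the 20-character SMILES c1ccccc1[N+](=O)[O-]. Ordinary SMILES characters pass through unchanged, which is one way to read the slide's remark that uncompressed data is treated identically.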

7 Multigram Compression
- Efficient compression is trickier.
  Take a simple alphabet of only “A” and “B”, with the set of multigrams “A”, “B”, “AB” and “BAA”, and encode the string “ABAA”.
  The greedy solution uses 3 bytes: “AB”, “A” and “A”.
  An optimal solution uses only 2 bytes: “A” and “BAA”.
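The greedy result can be reproduced with a few lines of C. This sketch assumes that "greedy" means taking the longest matching multigram at each position, which the slide does not spell out; the names grams and greedy_code_count are illustrative.

```c
#include <string.h>

/* The toy multigram set from the slide. */
static const char *grams[] = { "A", "B", "AB", "BAA" };
enum { NGRAMS = 4 };

/* Greedy encoding: at each position consume the longest multigram that
   matches. Returns the number of codes used, or -1 if no multigram matches
   at some position. */
int greedy_code_count(const char *s)
{
    int count = 0;

    while (*s) {
        size_t best = 0;
        for (int g = 0; g < NGRAMS; g++) {
            size_t len = strlen(grams[g]);
            if (len > best && strncmp(s, grams[g], len) == 0)
                best = len;
        }
        if (best == 0)
            return -1;
        s += best;
        count++;
    }
    return count;
}
```

For "ABAA" this picks "AB", then "A", then "A", so greedy_code_count("ABAA") is 3, one more than the optimal 2.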

8 Dynamic Programming
- The computer science solution to such 1D tiling problems is a two-pass algorithm called “Dynamic Programming”.
- For each prefix, the optimal encoded length is found by considering every multigram that is a suffix of that prefix and taking the smallest value of one plus the optimal length of the sub-prefix that precedes it.
  To encode the string “ABAA”:
    encode(“A”) = 1
    encode(“AB”) = min(encode(“A”)+1, 1) = 1
    encode(“ABA”) = encode(“AB”)+1 = 2
    encode(“ABAA”) = min(encode(“ABA”)+1, encode(“A”)+1) = 2
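The recurrence above can be sketched directly in C. This is an illustrative version only: it tests each multigram by brute force rather than using the trie or FSM from the following slides, it computes only the forward pass (the optimal code count, not the tiling itself), and the names multigrams, optimal_code_count and MAX_INPUT are assumptions.

```c
#include <limits.h>
#include <string.h>

/* The toy multigram set from the previous slide. */
static const char *multigrams[] = { "A", "B", "AB", "BAA" };
enum { N_MULTIGRAMS = 4, MAX_INPUT = 255 };

/* Forward pass of the dynamic program: best[end] is the minimum number of
   codes needed to encode the first `end` characters of s. Returns the
   optimum for the whole string, or -1 if s cannot be tiled or is too long. */
int optimal_code_count(const char *s)
{
    size_t n = strlen(s);
    int best[MAX_INPUT + 1];

    if (n > MAX_INPUT)
        return -1;
    best[0] = 0;
    for (size_t end = 1; end <= n; end++) {
        best[end] = INT_MAX;
        for (int g = 0; g < N_MULTIGRAMS; g++) {
            size_t len = strlen(multigrams[g]);
            /* The multigram must be a suffix of the current prefix, and
               the sub-prefix before it must itself be encodable. */
            if (len <= end && best[end - len] != INT_MAX &&
                strncmp(s + end - len, multigrams[g], len) == 0 &&
                best[end - len] + 1 < best[end])
                best[end] = best[end - len] + 1;
        }
    }
    return best[n] == INT_MAX ? -1 : best[n];
}
```

The intermediate values match the worked example: 1 for "A", 1 for "AB", 2 for "ABA" and 2 for "ABAA". A second, backward pass that remembers which multigram gave each minimum would recover the actual tiling.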

9 Trie Construction

10 FSM Construction

11 Multigram Training Sets
- train.smi: 36311 SMILES from WDI, NCI, ACD and SYNTH, totaling 1762189 bytes. [48.5 bytes/mol]
- train.ism: 27451 isomeric SMILES from WDI and ACD, totaling 2555727 bytes. [93.1 bytes/mol]
- train.rism: 17159 reaction SMILES from SYNTH, totaling 4927586 bytes. [287.2 bytes/mol]

12 Training Set Performance
(compressed bytes / original bytes, and compressed size as a percentage of the original)
- Train.smi
    smizip            580818/1762189   33.0%
    smizip (renum)    546094/1762189   31.0%
    gzip -9           465891/1762189   26.4%
- Train.ism
    smizip            654737/2540347   25.8%
    smizip (renum)    610254/2540347   24.0%
    gzip -9           514941/2540347   20.3%
- Train.rism
    smizip           1425113/4918376   29.0%
    smizip (renum)   1397922/4918376   28.4%
    gzip -9          1071673/4918376   21.8%

13 Multigram Cross-Validation
Smi+ism is a combination of the 155 best absolute SMILES multigrams and the 30 best isomeric SMILES multigrams.

14 General Results
- Chemical Database Results
    ACD002  (254865 SMILES)     3719633/10277254   36.2%
    NCI00   (162148 SMILES)     2389223/6132973    38.9%
    WDI011  (28298 ISOMERS)      893902/3064043    29.2%
    SYNTH97 (102934 ISORXNS)    8592831/29397074   29.3%
- Oracle Cartridge Results
  - No measurable effect on index creation/insertion time.
  - Cartridge index data is 20% smaller for NCI00.
  - Fingertest, Tanimoto and Tversky are 5-15% faster.
  - Contains and Matches (with triage) are 0-1% slower.

15 Bibliography
- A. Aho and M. Corasick, "Efficient String Matching: An Aid to Bibliographic Search", Communications of the ACM, Vol. 18, pp. 333-340, 1975.
- Wai-Hong Leung and Steven S. Skiena, "Inducing Codes from Examples", Proceedings of the 1991 Data Compression Conference (DCC91), Eds. James A. Storer and John H. Reif, Snowbird, Utah, Extended Abstract, pp. 267-276, April 1991.
- R.A. Wagner, "Common Phrases and Minimum-Space Text Storage", Communications of the ACM, Vol. 16, pp. 148-152, 1973.

