Published by Erica Waters. Modified over 9 years ago.
1
NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space
ICCS9
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
2
Chemistry Space Analysis
- How many small molecules are there currently?
- Since the early 2000s, the number of databases containing small molecules has increased enormously (e.g. PubChem, ChemSpider, ChEMBL, DrugBank). What is the overlap?
- There are many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms).
- The number of chemical structure identifiers keeps growing (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …).
3
Chemical Identifier Resolver
[Diagram: a chemical structure linked to the identifiers and representations handled by the resolver]
NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, Chemical Formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image
4
Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
- Works as a resolver for different chemical structure identifiers.
- Allows one to convert a given structure identifier into another representation or structure identifier.
- First beta release: July 2009
- Current release (beta 4): April 2011
5
Chemical Identifier Resolver (NCI/CADD Web Resources)
The resolver is usable through a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
- Example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8
- MIME type: text/plain
- XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
- If a request is not resolvable, the service answers with an HTTP 404 status.
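The URL pattern above can be exercised with a few lines of Python. This is a minimal sketch, not an official client: the helper names `resolver_url` and `fetch` are made up for illustration, and percent-encoding the identifier is an assumption about how names with spaces or SMILES characters should be passed.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build BASE_URL/identifier/representation, optionally followed by /xml."""
    # Assumption: the identifier is percent-encoded so it is safe as a URL path segment.
    parts = [BASE_URL, quote(identifier, safe=""), representation]
    if xml:
        parts.append("xml")
    return "/".join(parts)


def fetch(identifier: str, representation: str):
    """Return the text/plain answer, or None if the request is not resolvable (HTTP 404)."""
    try:
        with urlopen(resolver_url(identifier, representation), timeout=30) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is passed on to the caller
```

Calling `fetch("Tamiflu", "cas")` needs network access and a reachable service; `resolver_url` alone can be used to inspect the request that would be sent.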
6
Chemical Identifier Resolver (NCI/CADD Public Web Resources)
http://cactus.nci.nih.gov/chemical/structure
Accepted as "identifier":
- chemical names
- IUPAC names (by OPSIN)
- CAS numbers
- SMILES strings
- IUPAC InChI/InChIKeys
- NCI/CADD Identifiers
- CACTVS HASHISY
- NSC number
- PubChem SID
- ChemSpider ID
- ChemNavigator SID
- FDA UNII
Available as "representation":
- /smiles
- /names, /iupac_name
- /cas
- /inchi, /stdinchi
- /inchikey, /stdinchikey
- /ficts, /ficus, /uuuuu
- /image
- /file, /sdf
- /mw, /monoisotopic_mass
- /formula
- /twirl, /3d
- /urls
- /chemspider_id
- /pubchem_sid
- /chemnavigator_sid
7
Chemical Identifier Resolver (NCI/CADD Web Resources)
Processing of a request:
- The HTTP request carries an identifier and a requested representation.
- The type of the identifier is detected.
- If the identifier is a full structure representation (e.g. SMILES, InChI), the requested structure representation is calculated (CACTVS).
- If the identifier is a hashed structure representation (e.g. InChIKey), a trivial name etc. (e.g. CAS number, chemical name), the structure is obtained by a database lookup in the NCI/CADD Chemical Structure Database (CSDB).
- The HTTP response returns the result (e.g. InChI, GIF image) with the corresponding MIME type.
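The detection step can be illustrated with a small classifier. This is a sketch under stated assumptions, not the resolver's actual logic: the function names are hypothetical, only three identifier types are recognised (InChI by its prefix, InChIKey by its 14-10-1 letter layout, CAS Registry Numbers by format and check digit), and SMILES detection is left out because it needs a real parser.

```python
import re

INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas_number(text: str) -> bool:
    """Check the CAS format and its check digit (weighted digit sum modulo 10)."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(weight * int(digit) for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify(identifier: str):
    """Return (identifier type, route), where route is 'compute' or 'lookup'."""
    text = identifier.strip()
    if text.startswith("InChI="):
        return ("InChI", "compute")        # full structure representation
    if INCHIKEY_RE.match(text):
        return ("InChIKey", "lookup")      # hashed structure representation
    if is_cas_number(text):
        return ("CAS number", "lookup")
    return ("name or other", "lookup")     # fallback, e.g. trivial names
```

For example, `classify("204255-11-8")` is routed to a lookup as a CAS number, while an InChI string is routed to calculation.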
8
Chemical Identifier Resolver: Chemical Structure Database (CSDB)
Sources:
- ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~300 international chemistry suppliers
- PubChem (~38%): database including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider, …
- Commercial sources / others (~6%): Asinex, Comgenex, eMolecules, ChEMBL, …
Currently:
- ~150 chemical structure databases
- ~120 million structure records
- ~81.6 million unique structures by NCI/CADD FICuS Identifier
- ~84 million unique structures by Std. InChIKey
9
NCI/CADD Structure Identifiers: FICTS, FICuS, uuuuu
10
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
- The identifiers are based on hashcodes calculated by the chemoinformatics toolkit CACTVS.
- CACTVS hashcodes:
  - represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
  - have a high sensitivity to structural features of a compound
  - change if the connectivity changes
- Example: [structure drawing] 9850FD9F9E2B4E25
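A 16-digit hexadecimal string and a 64-bit unsigned integer are two views of the same value. The sketch below only shows that conversion for the example hashcode; it does not compute a CACTVS hashcode, and the function names are illustrative.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def parse_hashcode(text: str) -> int:
    """Convert a 16-digit hexadecimal hashcode string into an unsigned 64-bit integer."""
    if len(text) != 16 or not set(text) <= HEX_DIGITS:
        raise ValueError("expected exactly 16 hexadecimal digits")
    return int(text, 16)


def format_hashcode(value: int) -> str:
    """Convert an unsigned 64-bit integer back into 16 upper-case hexadecimal digits."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit into 64 bits")
    return f"{value:016X}"


value = parse_hashcode("9850FD9F9E2B4E25")
print(value < 2**64, format_hashcode(value))
```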
11
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
- Original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB, database)
- Structure normalization: calculation of a set of parent structures (FICTS, FICuS, uuuuu) with different sensitivity to chemical features
- Hashcode calculation (E_HASHISY) on each parent structure yields the NCI/CADD Identifier
- Result: representation of chemical structures on different levels
12
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Adjustable levels of sensitivity:
- Fragments: sensitive, or un-sensitive (keep only the largest organic fragment)
- Isotopes: sensitive, or un-sensitive (ignore isotope labels)
- Charges: sensitive, or un-sensitive (uncharge)
- Tautomers: sensitive, or un-sensitive (find the canonical tautomer)
- Stereochemistry: sensitive, or un-sensitive (discard stereo information)
[Figure: a pair of example structures for each of the five features]
13
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
[Figure: example structure pairs for Fragments, Isotopes, Charges, Tautomers and Stereochemistry, each with a sensitive / un-sensitive setting]
14
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
FICTS: sensitive to Fragments (F), Isotopes (I), Charges (C), Tautomers (T) and Stereochemistry (S)
- The two structures of every example pair are treated as different (≠).
- FICTS is a representation of the exact drawing.
15
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
FICuS: sensitive to Fragments (F), Isotopes (I), Charges (C) and Stereochemistry (S), un-sensitive (u) to tautomers
- The two structures of the tautomer example pair are treated as equal (=); the other example pairs remain different (≠).
- FICuS comes closest to how a chemist perceives a compound.
16
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
uuuuu: un-sensitive (u) to Fragments, Isotopes, Charges, Tautomers and Stereochemistry
- The two structures of every example pair are treated as equal (=).
- uuuuu links closely related forms of the same compound.
17
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
[Table: sensitive / not sensitive settings of FICTS, FICuS and uuuuu for Fragments, Isotopes, Charges, Stereo and Tautomers]
Example identifiers for one structure drawing (a sodium salt):
- 4A122D094098B50D-FICTS-01-1D
- 0E26B623DF7FAD30-FICuS-01-70
- 9850FD9F9E2B4E25-uuuuu-01-27
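The identifier names themselves encode the sensitivity settings: an upper-case letter (F, I, C, T, S) marks a feature as sensitive, and a lower-case "u" marks it as un-sensitive. The sketch below decodes that naming scheme. The function names are hypothetical, and the two trailing fields of a full identifier (such as "01" and "70") are returned uninterpreted because the slides do not explain them.

```python
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")
LETTERS = "FICTS"


def sensitivity(label: str) -> dict:
    """Map a five-letter label such as 'FICuS' to feature -> sensitive (True/False)."""
    if len(label) != 5:
        raise ValueError("label must have five letters")
    result = {}
    for feature, expected, letter in zip(FEATURES, LETTERS, label):
        if letter not in (expected, "u"):
            raise ValueError(f"unexpected letter {letter!r} for {feature}")
        result[feature] = letter != "u"
    return result


def split_identifier(text: str):
    """Split 'HASH-LABEL-xx-yy' into (hashcode, label, remaining fields)."""
    hashcode, label, *rest = text.split("-")
    return hashcode, label, rest


hashcode, label, rest = split_identifier("0E26B623DF7FAD30-FICuS-01-70")
print(hashcode, sensitivity(label), rest)
```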
18
[Figure: nine structure drawings related to histidine, labelled: charged form, tautomer, isotope, salt, stereoisomers, "errors"]
19
FICTS
[Figure: the nine histidine-related drawings (charged form, tautomer, isotope, salt, stereoisomers, "errors"), each labelled with its FICTS identifier]
Identifiers shown: A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS
20
FICuS
[Figure: the nine histidine-related drawings (charged form, tautomer, isotope, salt, stereoisomers, "errors"), each labelled with its FICuS identifier]
Identifiers shown: A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS
21
uuuuu
[Figure: the nine histidine-related drawings (charged form, tautomer, isotope, stereoisomers, salt, "errors"), labelled with uuuuu identifiers]
Identifiers shown: 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu
22
Std. InChIKey
[Figure: the nine histidine-related drawings (charged form, tautomer, isotope, stereoisomers, salt, "errors"), each labelled with its Std. InChIKey]
InChIKeys shown: HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M
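A Standard InChIKey has three hyphen-separated blocks. The first 14-letter block hashes the basic connectivity layers of the InChI, so drawings that share it have the same skeleton and differ only in later layers such as stereochemistry or isotopes. The sketch below groups the keys listed on this slide by that first block; the function name is illustrative.

```python
from collections import defaultdict

KEYS = [
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
    "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
    "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
]


def group_by_first_block(keys):
    """Group distinct InChIKeys by their first (connectivity) block."""
    groups = defaultdict(set)
    for key in keys:
        first_block = key.split("-")[0]
        groups[first_block].add(key)
    return dict(groups)


for block, members in group_by_first_block(KEYS).items():
    print(block, len(members), "distinct key(s)")
```

On this list, four distinct keys share the first block HNDVDQJCIGZPNO, and one key (UHPNKBYGGMJTIM…) stands apart.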
23
NCI/CADD Chemical Structure Database: Structure Normalization
- 119.8 million original structure records in CSDB
- 83.1 million FICTS parent structures
- 81.6 million FICuS parent structures
- 76.2 million uuuuu parent structures
24
NCI/CADD Chemical Structure Database: Structure Normalization
- 119.8 million original structure records in CSDB
- 83.1 million FICTS parent structures
- 81.6 million FICuS parent structures (tautomer-invariant)
- 76.2 million uuuuu parent structures (tautomer-invariant)
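The effect of each normalization level is easier to compare as a share of the original record count. The short calculation below uses only the four figures from this slide; the dictionary and function names are illustrative.

```python
# Counts in millions, as given on the slide.
COUNTS = {
    "original records": 119.8,
    "FICTS": 83.1,
    "FICuS": 81.6,
    "uuuuu": 76.2,
}


def share_of_records(counts):
    """Express every count as a percentage of the original structure records."""
    total = counts["original records"]
    return {name: round(100.0 * value / total, 1) for name, value in counts.items()}


for name, percent in share_of_records(COUNTS).items():
    print(f"{name}: {percent}% of original records")
```

The FICTS, FICuS and uuuuu parent structures come out at roughly 69%, 68% and 64% of the original records.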
25
Tautomer Analysis
How much "chemical space" is "just generated" by drawing tautomers?
26
NCI/CADD Chemical Structure Database: Tautomer Analysis
CACTVS generates all formal tautomers for a given organic compound (prototropic tautomerism):
- A rule set of 21 transforms is encoded as (CACTVS-extended) SMIRKS.
- The rule set is systematically applied to the original structure and to all tautomers that have been generated in previous steps.
- Tautomer generation is limited to 1000 SMIRKS transform operations per structure.
- All tautomers are ranked by a scoring function.
- The highest-ranked tautomer is defined as the canonical tautomer.
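The procedure above (apply every rule to every form found so far, stop at an operation limit, rank, take the best) can be sketched generically. This is not CACTVS code: structures are stood in for by plain strings, the toy `flip` transforms replace the 21 SMIRKS rules, counting one rule application to one form as one operation is an assumption, and the scoring function is invented for the example.

```python
from collections import deque


def enumerate_forms(start, transforms, max_ops=1000):
    """Breadth-first closure of `start` under `transforms`, capped at max_ops applications.

    Returns (set of forms, exhaustive flag).
    """
    seen = {start}
    queue = deque([start])
    ops = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if ops >= max_ops:
                return seen, False  # limit reached, enumeration is incomplete
            ops += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_form(forms, score):
    """Pick the highest-scoring form; ties are broken by string order for determinism."""
    return max(forms, key=lambda form: (score(form), form))


def flip(position):
    """Toy transform: toggle one site between 'k' (keto) and 'e' (enol)."""
    def transform(form):
        swapped = "e" if form[position] == "k" else "k"
        return [form[:position] + swapped + form[position + 1:]]
    return transform


forms, exhaustive = enumerate_forms("kk", [flip(0), flip(1)])
print(sorted(forms), exhaustive, canonical_form(forms, lambda f: f.count("k")))
```

Returning the `exhaustive` flag alongside the forms makes it explicit when the operation limit cut the enumeration short.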
27
NCI/CADD Chemical Structure Database: Tautomer Analysis
21 SMIRKS transform rules:
- rule 1: 1.3 (thio)keto/(thio)enol
- rule 2: 1.5 (thio)keto/(thio)enol
- rule 3: simple (aliphatic) imine
- rule 4: special imine
- rule 5: 1.3 aromatic heteroatom H shift
- rule 6: 1.3 heteroatom H shift
- rule 7: 1.5 (aromatic) heteroatom H shift (1)
- rule 8: 1.5 aromatic heteroatom H shift (2)
- rule 9: 1.7 (aromatic) heteroatom H shift
- rule 10: 1.9 (aromatic) heteroatom H shift
- rule 11: 1.11 (aromatic) heteroatom H shift
- rule 12: furanones
- rule 13: keten/ynol exchange
- rule 14: ionic nitro/aci-nitro
- rule 15: pentavalent nitro/aci-nitro
- rule 16: oxim/nitroso
- rule 17: oxim/nitroso via phenol
- rule 18: cyanic/iso-cyanic acids
- rule 19: formamidinesulfinic acids
- rule 20: isocyanides
- rule 21: phosphonic acids
28
NCI/CADD Chemical Structure Database: Tautomer Analysis
rule 1: 1.3 (thio)keto/(thio)enol
[O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][#6;z{1-2}:2]=[C,cz{0-1}R{0-1}:3]
[Figure: 1.3 keto/enol example with atoms numbered 1 to 4]
rule 6: 1.3 heteroatom H shift
[N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]
[Figure: 1.3 heteroatom H shift example with atoms numbered 1 to 4]
29
NCI/CADD Chemical Structure Database: Tautomer Analysis
Structure counts are based on the 2009 version of CSDB (103.9 million structure records).
- 72.0 million FICTS parent structures: 8.6% of them change their tautomeric form during FICuS normalization.
- 70.6 million FICuS parent structures:
  - 98.5% have a one-to-one relationship to a single FICTS parent structure
  - 1.5% have a one-to-many relationship to several FICTS parent structures ("conflict")
30
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram of the tautomeric overlap within each individual database release (%); x axis: overlap from 0.0 to 2.0%; y axis: frequency (number of database releases, 0 to 90)]
Average: ~0.3% of original structure records
31
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram of the tautomeric overlap within each individual database release (%); x axis: overlap from 0.0 to 2.0%; y axis: frequency (number of database releases, 0 to 90); bars annotated with database names]
Average: ~0.3% of original structure records
Databases named in the figure: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider
32
NCI/CADD Chemical Structure Database: Tautomer Analysis
Occurrence of "tautomerism-critical" molecules within each individual database release (%)
[Figure: histogram; x axis: percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict (0.5 to 24.5%); y axis: frequency (number of database releases, 0 to 30)]
Average: ~9.5% of FICuS parent structures
33
Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [structure drawing]
- HPMBP is used in liquid membranes (selective removal of metal ions).
- Selectivity and efficiency depend on the tautomeric form of HPMBP.
- The tautomeric form depends on the solvent and the concentration of HPMBP.
He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947.
34
Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
- CACTVS generates 7 tautomers.
- 5 of them have a potential stereo center on atoms or bonds (R/S or E/Z).
[Figure: the seven tautomer drawings, with the canonical tautomer by CACTVS marked]
35
Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 of the tautomers have CAS Registry Numbers assigned:
- 4551-69-1: 859 references
- 33064-14-1: 49 references
- 127117-31-1: 3 references
[Figure: the seven tautomer drawings with stereo labels (no stereo, (Z), R/S, E/Z)]
36
Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Occurrences in databases indexed in CSDB, as labelled on the tautomer drawings:
- 6 databases
- 16 databases (no stereo)
- 3 databases (R)
- 2 databases (S)
- 12 databases
- 1 database (no stereo)
[Figure: the seven tautomer drawings with R/S and E/Z labels]
37
Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Occurrences in databases, with the database names shown next to the tautomer drawings:
- 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
- 1 database (no stereo): ChemDB
- 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
- 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
- 2 databases (S): ChemSpider, ZINC
- 3 databases (R): ChemSpider, ECOTOX, ZINC
[Figure: the seven tautomer drawings with R/S and E/Z labels]
38
NCI/CADD Chemical Structure Database: Tautomer Analysis
Starting from the set of 70.6 million FICuS parent structures, we systematically generated all tautomers based on the 21 SMIRKS rule set available in CACTVS.
Questions:
- How many tautomers are generated?
- How often is each rule applied (type of tautomerism)?
- How many tautomers are there per structure?
Result:
- 680 million tautomers were generated.
- For 1.7% of the FICuS parent structures the enumeration was not exhaustive.
39
NCI/CADD Chemical Structure Database: Tautomer Analysis
Usage of SMIRKS rules (1/2), given as tautomer rule: count of generated tautomers (%):
- rule 1: 1.3 (thio)keto/(thio)enol: 173,002,712 (25.4)
- rule 2: 1.5 (thio)keto/(thio)enol: 11,541,452 (1.7)
- rule 3: simple (aliphatic) imine: 35,917,415 (5.3)
- rule 4: special imine: 4,306,155 (0.6)
- rule 5: 1.3 aromatic heteroatom H shift: 25,678,446 (3.8)
- rule 6: 1.3 heteroatom H shift: 250,453,882 (36.8)
- rule 7: 1.5 (aromatic) heteroatom H shift (1): 27,542,770 (4.0)
- rule 8: 1.5 aromatic heteroatom H shift (2): 26,819 (<0.1)
- rule 9: 1.7 (aromatic) heteroatom H shift: 57,242,472 (8.4)
- rule 10: 1.9 (aromatic) heteroatom H shift: 5,061,731 (0.7)
- rule 11: 1.11 (aromatic) heteroatom H shift: 1,374,235 (0.2)
- rule 12: furanones: 17,860,604 (2.6)
40
NCI/CADD Chemical Structure Database: Tautomer Analysis
Usage of SMIRKS rules (2/2), given as tautomer rule: count of generated tautomers (%):
- rule 13: keten/ynol exchange: 57,989 (<0.1)
- rule 14: ionic nitro/aci-nitro: 428,266 (<0.1)
- rule 15: pentavalent nitro/aci-nitro: 129 (<0.1)
- rule 16: oxim/nitroso: 505,695 (<0.1)
- rule 17: oxim/nitroso via phenol: 131,502 (<0.1)
- rule 18: cyanic/iso-cyanic acids: 181 (<0.1)
- rule 19: formamidinesulfinic acids: 1,392 (<0.1)
- rule 20: isocyanides: 229 (<0.1)
- rule 21: phosphonic acids: 54,926 (<0.1)
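The percentages in the rule-usage listing are consistent with each count being divided by the total of about 680 million generated tautomers. The check below recomputes four of them; taking the rounded 680 million as the denominator is an assumption, since the slides do not state the exact total.

```python
TOTAL_TAUTOMERS = 680_000_000  # rounded total, assumed to be the denominator

# Counts of generated tautomers for a few rules.
RULE_COUNTS = {
    "rule 1: 1.3 (thio)keto/(thio)enol": 173_002_712,
    "rule 3: simple (aliphatic) imine": 35_917_415,
    "rule 6: 1.3 heteroatom H shift": 250_453_882,
    "rule 9: 1.7 (aromatic) heteroatom H shift": 57_242_472,
}


def share_percent(count, total=TOTAL_TAUTOMERS):
    """Share of all generated tautomers attributed to one rule, in percent."""
    return 100.0 * count / total


for rule, count in RULE_COUNTS.items():
    print(f"{rule}: {share_percent(count):.1f}%")
```

Rules 1 and 6 together account for more than 60% of the generated tautomers.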
41
NCI/CADD Chemical Structure Database: Tautomer Analysis
Number of tautomers per structure, given as FICuS structures with N tautomers: count (%):
- no tautomers: 9,756,186 (13.8)
- one tautomer: 10,721,845 (15.2)
- 2-10 tautomers: 33,532,284 (47.5)
- 11-25 tautomers: 10,870,312 (15.4)
- 25-50 tautomers: 2,622,587 (3.7)
- 51-100 tautomers: 1,136,066 (1.6)
- 101-200 tautomers: 565,199 (0.8)
- 201-300 tautomers: 104,875 (<0.1)
- 301-400 tautomers: 35,144 (<0.1)
- 401-500 tautomers: 17,241 (<0.1)
- 501-600 tautomers: 4,323 (<0.1)
- 601-700 tautomers: 1,400 (<0.1)
- 701-800 tautomers: 362 (<0.1)
- 801-832 tautomers: 3 (<0.1)
42
NCI/CADD Chemical Structure Database: Tautomer Analysis
Number of tautomers per structure, given as FICuS structures with N tautomers: count (%):
- no tautomers: 9,756,186 (13.8)
- one tautomer: 10,721,845 (15.2)
- 2-10 tautomers: 33,532,284 (47.5)
- 11-25 tautomers: 10,870,312 (15.4)
- 25-50 tautomers: 2,622,587 (3.7)
- 51-100 tautomers: 1,136,066 (1.6)
- 101-200 tautomers: 565,199 (0.8)
- 201-300 tautomers: 104,875 (0.1)
- 301-400 tautomers: 35,144 (<0.1)
- 401-500 tautomers: 17,241 (<0.1)
- 501-600 tautomers: 4,323 (<0.1)
- 601-700 tautomers: 1,400 (<0.1)
- 701-800 tautomers: 362 (<0.1)
- 801-832 tautomers: 3 (<0.1)
Many of these are minor tautomeric forms (but you find them in databases).
43
Tautomer Analysis: Tanimoto Similarities of Tautomers
Canonical tautomer vs. generated tautomers (680 million tautomer set), using the PubChem/CACTVS E_SCREEN bitvector (881 bits).
Tanimoto index range: count (%):
- >0.9-1.0: 310,725,465 (45.6)
- >0.8-0.9: 214,747,976 (31.5)
- >0.7-0.8: 111,954,384 (16.4)
- >0.6-0.7: 36,448,651 (5.3)
- >0.5-0.6: 6,304,436 (0.9)
- >0.4-0.5: 369,331 (<0.1)
- >0.3-0.4: 6,580 (<0.1)
- >0.2-0.3: 6 (<0.1)
- >0.0-0.2: 0 (0.0)
~23% are below 0.8 Tanimoto similarity (although they are the same molecule).
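For two bit vectors, the Tanimoto index is the number of bits set in both divided by the number of bits set in either. The sketch below computes it for fingerprints held as Python integers and then recomputes the "~23% below 0.8" figure from the bin counts on this slide. Producing the 881-bit E_SCREEN fingerprint itself is outside the sketch, and returning 1.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto index of two fingerprints stored as integers (one bit per feature)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return bin(a & b).count("1") / union


# Bin counts from the slide, keyed by the upper bound of each Tanimoto range.
BIN_COUNTS = {
    1.0: 310_725_465, 0.9: 214_747_976, 0.8: 111_954_384,
    0.7: 36_448_651, 0.6: 6_304_436, 0.5: 369_331,
    0.4: 6_580, 0.3: 6, 0.2: 0,
}


def percent_at_or_below(limit: float) -> float:
    """Share of tautomers whose bin lies entirely at or below `limit`, in percent."""
    total = sum(BIN_COUNTS.values())
    below = sum(count for upper, count in BIN_COUNTS.items() if upper <= limit)
    return 100.0 * below / total


print(tanimoto(0b1110, 0b0111))          # 2 common bits out of 4 set in either
print(round(percent_at_or_below(0.8), 1))
```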
44
Scaffold Analysis
45
NCI/CADD Chemical Structure Database: Scaffold Analysis
Scaffold types:
- molecular scaffold tree (level 1, level 2, …): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
- archetype scaffold and simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
[Figure: an example structure with its scaffold tree levels, archetype scaffold and simple scaffold]
46
NCI/CADD Chemical Structure Database: Scaffold Analysis
Input: CSDB uuuuu compound set (76.2 million)
Scaffold types shown: molecular scaffold tree (level 1, level 2, …), archetype scaffold, simple scaffold
Scaffold counts shown: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds
[Figure: example structure with its scaffolds]
47
NCI/CADD Chemical Structure Database: Scaffold Analysis
Molecular scaffold tree of the CSDB uuuuu compound set (76.2 million): 8.1 million scaffolds
[Figure: number of unique scaffolds per hierarchy level; x axis: hierarchy level (1 to 10); left axis: number of unique scaffolds (in millions, 0 to 8.0); right axis: number of unique structures (in millions, 0 to 80.0)]
48
NCI/CADD Chemical Structure Database: Scaffold Analysis
[Figure: three example scaffolds with substituent positions (R 1 to R 10) annotated by counts]
Occurrence of the three example scaffolds:
- 2,281 uuuuu parent structures; 5334 structure records in 64 databases
- 2,726 uuuuu parent structures; 6007 structure records in 66 databases
- 744,469 uuuuu parent structures; 1,069,046 structure records in 66 databases
49
Atom Neighborhoods
50
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
[Figure: example structure with its MNA descriptors]
MNA level 1: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2: C(C(CC-H)C(CC-C)-H(C)), C(C(CC-H)C(CN-H)-H(C)), C(C(CC-H)C(CN-H)-C(C-O-O)), C(C(CC-H)N(CC)-H(C)), C(C(CC-C)N(CC)-H(C)), N(C(CN-H)C(CN-H)), -H(C(CC-H)), -H(C(CN-H)), -H(-O(-H-C)), -C(C(CC-C)-O(-H-C)-O(-C)), -O(-H(-O)-C(C-O-O)), -O(-C(C-O-O))
51
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Input: CSDB uuuuu compound set (76.2 million)
Unique MNAs:
- level 1: 13,426
- level 2: 918,516
Relationships between MNAs and structures:
- 1.3 billion relationships (~17 per uuuuu parent structure)
- 2.3 billion relationships (~30 per uuuuu parent structure)
52
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Input: CSDB uuuuu compound set (76.2 million)
Unique MNAs:
- level 1: 13,426
- level 2: 918,516
Relationships between MNAs and structures:
- 1.3 billion relationships (~17 per uuuuu parent structure)
- 2.3 billion relationships (~30 per uuuuu parent structure)
424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider.
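The "per uuuuu parent structure" averages follow from dividing each relationship total by the 76.2 million structures of the compound set. The short check below does that arithmetic with the rounded figures from the slides; the function name is illustrative.

```python
UUUUU_STRUCTURES = 76.2e6  # size of the CSDB uuuuu compound set


def relationships_per_structure(total_relationships: float) -> float:
    """Average number of MNA relationships per uuuuu parent structure."""
    return total_relationships / UUUUU_STRUCTURES


for total in (1.3e9, 2.3e9):
    print(f"{total:.1e} relationships -> ~{relationships_per_structure(total):.0f} per structure")
```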
53
Chemical Structure Web Services: Indexing Chemical Space
[Diagram: the NCI/CADD web service (Chemical Identifier Resolver), reached over http, built on the NCI/CADD Chemical Structure Database (CSDB), CACTVS, external web services and other software packages, e.g. OPSIN]
54
Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
55
Acknowledgments
- ChemNavigator: Scott Hutton, Tad Hurst
- CADD Group, CBL, NCI: Igor Filippov
- University of Cambridge: Daniel Lowe, Peter Murray-Rust
- Noel O'Boyle (University College Cork, Ireland)
- Richard Apodaca (Metamolecular)
- Hans-Juergen Himmler
Thanks to all database providers!
Our web site: http://cactus.nci.nih.gov
56
Acknowledgments - Software
CACTVS, Python Web Framework, ChemWriter, Python SQL Library, Javascript library, Peter Ertl (Novartis)