NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space (ICCS9). Markus Sitzmann, Wolf-Dietrich Ihlenfeldt, and Marc C. Nicklaus.

Presentation transcript:

Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D Lahntal, Germany

Chemistry Space Analysis
How many small molecules are there currently?
Since the early 2000s, the number of databases containing small molecules has increased enormously (e.g. PubChem, ChemSpider, ChEMBL, DrugBank). What is the overlap?
There are many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms).
The number of chemical structure identifiers is growing (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …).

Chemical Identifier Resolver (NCI/CADD)
[Diagram: the resolver links a chemical structure to the following identifiers and representations: NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, chemical formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image]

Chemical Identifier Resolver (NCI/CADD Web Resources)
Works as a resolver for different chemical structure identifiers.
Allows one to convert a given structure identifier into another representation or structure identifier.
First beta release: July 2009
Current release (beta 4): April 2011

Chemical Identifier Resolver (NCI/CADD Web Resources)
The resolver is usable through a simple URL API.
Example: (URL not preserved in this transcript); MIME type: text/plain
XML format: (URL not preserved in this transcript)
If a request is not resolvable: HTTP 404 status message
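Because the example URLs did not survive in the transcript, the following is only a minimal sketch of how such a URL API can be called. The base URL, the identifier/representation path layout, and the trailing /xml segment are assumptions based on the publicly known service, not values taken from the slides; the function names are hypothetical.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the public resolver; it is not given in the transcript.
BASE_URL = "https://cactus.nci.nih.gov/chemical/structure"


def build_resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Compose <base>/<identifier>/<representation>[/xml], percent-encoding the identifier."""
    parts = [BASE_URL, quote(identifier, safe=""), representation.strip("/")]
    if xml:
        parts.append("xml")  # assumed suffix for the XML response format
    return "/".join(parts)


def resolve(identifier: str, representation: str) -> Optional[str]:
    """Return the text/plain response body, or None if the request is not resolvable (HTTP 404)."""
    try:
        with urlopen(build_resolver_url(identifier, representation)) as response:
            return response.read().decode("utf-8")
    except HTTPError as error:
        if error.code == 404:
            return None
        raise


if __name__ == "__main__":
    # Only prints the URL; call resolve() to actually contact the service.
    print(build_resolver_url("aspirin", "smiles"))
```

Percent-encoding the identifier matters for SMILES strings, which often contain characters such as "=" or "#" that are not safe inside a URL path.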

Chemical Identifier Resolver (NCI/CADD Public Web Resources)
Accepted "identifier" types: chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, FDA UNII
Available "representation" methods: /smiles; /names, /iupac_name; /cas; /inchi, /stdinchi; /inchikey, /stdinchikey; /ficts, /ficus, /uuuuu; /image; /file, /sdf; /mw, /monoisotopic_mass; /formula; /twirl, /3d; /urls; /chemspider_id; /pubchem_sid; /chemnavigator_sid

Chemical Identifier Resolver (NCI/CADD Web Resources)
[Flow diagram: HTTP request (identifier, representation) -> resolver -> HTTP response (MIME type)]
Step 1: detection of the identifier type.
Step 2, case A: the identifier is a full structure representation (e.g. SMILES, InChI). The requested structure representation (e.g. InChI, GIF image) is calculated by CACTVS.
Step 2, case B: the identifier is a hashed structure representation (e.g. InChIKey), a trivial name, etc. (e.g. CAS number, chemical name). The structure is obtained by a database lookup in the NCI/CADD Chemical Structure Database (CSDB).
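The dispatch between "calculate" and "database lookup" can be illustrated with a small classifier. This is a sketch under stated assumptions, not the resolver's actual detection logic: the regular expressions, the function names, and the returned labels are all hypothetical, and SMILES or chemical names are deliberately left unclassified because telling them apart needs a real parser.

```python
import re
from typing import Optional, Tuple

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^\d{2,7}-\d{2}-\d$")


def cas_checksum_ok(cas: str) -> bool:
    """Validate the CAS check digit: weighted digit sum (weights 1, 2, ... from the right) mod 10."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def classify(identifier: str) -> Tuple[str, Optional[str]]:
    """Return (identifier type, route), where route is 'calculate' or 'lookup'."""
    if identifier.startswith("InChI="):
        return "inchi", "calculate"      # full structure representation
    if INCHIKEY_RE.match(identifier):
        return "inchikey", "lookup"      # hashed representation, needs a database
    if CAS_RE.match(identifier) and cas_checksum_ok(identifier):
        return "cas", "lookup"           # registry number, needs a database
    return "unclassified", None          # e.g. SMILES or a name; not decided here
```

For example, `classify("HNDVDQJCIGZPNO-UHFFFAOYSA-N")` is routed to a lookup, while an InChI string is routed to direct calculation.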

Chemical Structure Database (CSDB) of the Chemical Identifier Resolver
ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~300 international chemistry suppliers
PubChem (~38%): database including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider, …
Commercial sources / others (~6%): Asinex, Comgenex, eMolecules, ChEMBL, …
Currently: ~150 chemical structure databases; ~120 million structure records; ~81.6 million unique structures by NCI/CADD FICuS Identifier; ~84 million unique structures by Std. InChIKey

NCI/CADD Structure Identifiers: FICTS, FICuS, uuuuu

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
The identifiers are based on hashcodes calculated by the chemoinformatics toolkit CACTVS.
CACTVS hashcodes:
represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
have high sensitivity to structural features of a compound
change if connectivity changes
[Structure drawing with hashcode 9850FD9F9E2B4E25]
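The "16-digit hexadecimal number (64-bit unsigned)" format can be made concrete with a short sketch. It only shows how such a string maps to and from an integer; it does not compute a CACTVS hashcode, and the function names are hypothetical.

```python
import re

HASHCODE_RE = re.compile(r"^[0-9A-Fa-f]{16}$")


def parse_hashcode(text: str) -> int:
    """Convert a 16-digit hexadecimal hashcode string into an unsigned 64-bit integer."""
    if not HASHCODE_RE.match(text):
        raise ValueError(f"not a 16-digit hexadecimal hashcode: {text!r}")
    return int(text, 16)


def format_hashcode(value: int) -> str:
    """Render an unsigned 64-bit integer as a zero-padded, upper-case, 16-digit hex string."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value does not fit into 64 unsigned bits")
    return f"{value:016X}"


if __name__ == "__main__":
    value = parse_hashcode("9850FD9F9E2B4E25")
    print(value, format_hashcode(value))
```

The example hashcode from the slide has its top bit set, so it only fits an unsigned 64-bit type; storing it in a signed 64-bit database column would require a conversion.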

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
[Workflow diagram: original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB, database) -> structure normalization -> parent structure -> hashcode calculation (E_HASHISY) -> NCI/CADD Identifier (FICTS, FICuS, uuuuu)]
Calculation of a set of parent structures with different sensitivity to chemical features.
Representation of chemical structures on different levels.

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Adjustable levels of sensitivity:
Fragments: sensitive, or keep only the largest organic fragment
Isotopes: sensitive, or ignore isotope labels
Charges: sensitive, or uncharge
Tautomers: sensitive, or find the canonical tautomer
Stereochemistry: sensitive, or discard stereo information
[Structure drawings illustrating the sensitive and un-sensitive treatment of each feature]

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
[Figure: example structure pairs for fragments, isotopes, charges, tautomers, and stereochemistry, each shown for the sensitive and the un-sensitive setting]

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
FICTS: sensitive to Fragments, Isotopes, Charges, Tautomers, and Stereochemistry.
Representation of the exact drawing.
[Figure: all example structure pairs are marked as different (≠)]

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
FICuS: sensitive to Fragments, Isotopes, Charges, and Stereochemistry; un-sensitive (u) to tautomers.
Comes closest to how a chemist perceives a compound.
[Figure: the tautomer pair is marked as equal (=); the other example structure pairs are marked as different (≠)]

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
uuuuu: un-sensitive to fragments, isotopes, charges, tautomers, and stereochemistry.
Groups closely related forms of the same compound.
[Figure: all example structure pairs are marked as equal (=)]

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Sensitivity by identifier (features: fragments, isotopes, charges, stereo, tautomers):
FICTS: sensitive to all five features
FICuS: sensitive to fragments, isotopes, charges, and stereo; not sensitive to tautomers
uuuuu: not sensitive to any of the five features
Example identifiers shown for a drawing of a sodium salt: 4A122D094098B50D-FICTS-01-1D, 0E26B623DF7FAD30-FICuS, FD9F9E2B4E25-uuuuu
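The naming scheme encodes the sensitivity directly in the five letters: an upper-case letter means "sensitive" and a "u" means "un-sensitive". The sketch below decodes that convention; the function names are hypothetical, and the two trailing fields of a full identifier string (such as "01" and "1D") are returned uninterpreted because the slides do not explain them.

```python
from typing import Dict, Tuple

# Feature order follows the letters F-I-C-T-S.
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")
LETTERS = "FICTS"


def sensitivity(identifier_type: str) -> Dict[str, bool]:
    """Map a type name such as 'FICuS' to {feature: is_sensitive}."""
    if len(identifier_type) != len(LETTERS):
        raise ValueError(f"expected 5 letters, got {identifier_type!r}")
    flags = {}
    for letter, expected, feature in zip(identifier_type, LETTERS, FEATURES):
        if letter == expected:
            flags[feature] = True       # upper-case letter: sensitive
        elif letter == "u":
            flags[feature] = False      # 'u': un-sensitive
        else:
            raise ValueError(f"unexpected letter {letter!r} in {identifier_type!r}")
    return flags


def split_identifier(identifier: str) -> Tuple[str, str, Tuple[str, ...]]:
    """Split 'HASH-TYPE[-...]' into (hashcode, type name, remaining fields)."""
    hashcode, type_name, *rest = identifier.split("-")
    return hashcode, type_name, tuple(rest)


if __name__ == "__main__":
    print(sensitivity("FICuS"))
    print(split_identifier("4A122D094098B50D-FICTS-01-1D"))
```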

[Figure: structure drawings of histidine and related forms: charged form, tautomer, isotope-labeled form, salt, stereoisomers, and drawings containing "errors"]

FICTS
[Figure: the histidine forms (charged form, tautomer, isotope, salt, stereoisomers, "errors") labeled with their FICTS identifiers. Labels as they appear in the transcript: A3DAE DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS]

FICuS
[Figure: the histidine forms (charged form, tautomer, isotope, salt, stereoisomers, "errors") labeled with their FICuS identifiers. Labels as they appear in the transcript: A3DAE DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS]

uuuuu
[Figure: the histidine forms (charged form, tautomer, isotope, stereoisomers, salt, "errors") labeled with their uuuuu identifiers. Labels as they appear in the transcript: 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu]

Std. InChIKey
[Figure: the histidine forms (charged form, tautomer, isotope, stereoisomers, salt, "errors") labeled with their Standard InChIKeys. Labels as they appear in the transcript: HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M]

NCI/CADD Chemical Structure Database: Structure Normalization
[Diagram: original structure records are normalized to FICTS, FICuS, and uuuuu parent structures]
Original structure records in CSDB (the count, in millions, is missing from this transcript)
83.1 million FICTS parent structures
81.6 million FICuS parent structures
76.2 million uuuuu parent structures

NCI/CADD Chemical Structure Database: Structure Normalization
[Diagram: original structure records are normalized to FICTS, FICuS, and uuuuu parent structures; a "tautomer-invariant" label is added next to FICuS and uuuuu]
Original structure records in CSDB (the count, in millions, is missing from this transcript)
83.1 million FICTS parent structures
81.6 million FICuS parent structures
76.2 million uuuuu parent structures

Tautomer Analysis
How much "chemical space" is "just generated" by drawing tautomers?

NCI/CADD Chemical Structure Database: Tautomer Analysis
CACTVS generates all formal tautomers for a given organic compound (prototropic tautomerism).
The rule set consists of 21 transforms encoded as (CACTVS-extended) SMIRKS.
The rule set is applied systematically to the original structure and to all tautomers generated in previous steps.
Tautomer generation is limited to 1000 SMIRKS transform operations per structure.
All tautomers are ranked by a scoring function.
The highest-ranked tautomer is defined as the canonical tautomer.
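The enumeration scheme described on this slide (apply every rule to every structure found so far, stop at an operation cap, then rank) can be sketched generically. The code below is not CACTVS and does no chemistry: the "structures" are toy strings, `flip_one` is a hypothetical stand-in for a SMIRKS transform, and the scoring function is arbitrary.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple

Transform = Callable[[str], Iterable[str]]


def enumerate_closure(start: str, transforms: List[Transform],
                      max_ops: int = 1000) -> Tuple[Set[str], bool]:
    """Apply all transforms to the start state and to every newly found state.

    Returns (states, exhaustive); exhaustive is False if the operation cap stopped the search.
    """
    seen = {start}
    queue = deque([start])
    ops = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if ops >= max_ops:
                return seen, False          # cap reached before the closure was complete
            ops += 1                        # one transform applied to one structure
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical(states: Iterable[str], score: Callable[[str], float]) -> str:
    """Pick the highest-scoring state; ties are broken by the state string itself."""
    return max(states, key=lambda state: (score(state), state))


def flip_one(state: str) -> List[str]:
    """Toy transform: toggle one 'K' (keto-like) or 'E' (enol-like) site at a time."""
    swap = {"K": "E", "E": "K"}
    return [state[:i] + swap[c] + state[i + 1:] for i, c in enumerate(state)]


if __name__ == "__main__":
    states, exhaustive = enumerate_closure("KK", [flip_one])
    print(sorted(states), exhaustive, canonical(states, lambda s: s.count("K")))
```

Returning the `exhaustive` flag alongside the result makes it possible to report how many structures hit the cap, which is the kind of statistic the later analysis slides refer to.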

NCI/CADD Chemical Structure Database: Tautomer Analysis
21 SMIRKS transform rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids

NCI/CADD Chemical Structure Database: Tautomer Analysis
rule 1: 1.3 (thio)keto/(thio)enol
[O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][#6;z{1-2}:2]=[C,cz{0-1}R{0-1}:3]
rule 6: 1.3 heteroatom H shift
[N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]
[Structure drawings with numbered atoms illustrating the 1.3 keto/enol transform and the 1.3 heteroatom H shift]

NCI/CADD Chemical Structure Database: Tautomer Analysis
Structure counts are based on the 2009 version of CSDB (103.9 million structure records).
72.0 million FICTS parent structures; 70.6 million FICuS parent structures.
FICTS parent structures: 8.6% change their tautomeric form during FICuS normalization.
FICuS parent structures: 98.5% have a one-to-one relationship to a single FICTS parent structure; 1.5% have a one-to-many relationship to several FICTS parent structures ("conflict").
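A "conflict" in this sense is simply a tautomer-invariant key that groups more than one tautomer-sensitive key. The following sketch shows that grouping on made-up labels; the identifiers "T1", "U1", and so on are hypothetical placeholders, not real FICTS or FICuS values, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_conflicts(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Given (FICTS, FICuS) pairs, return the FICuS keys that map to several FICTS keys."""
    groups = defaultdict(set)
    for ficts, ficus in pairs:
        groups[ficus].add(ficts)
    return {ficus: sorted(members) for ficus, members in groups.items() if len(members) > 1}


def conflict_fraction(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of distinct FICuS keys that have a one-to-many relationship."""
    pairs = list(pairs)
    distinct_ficus = {ficus for _, ficus in pairs}
    return len(find_conflicts(pairs)) / len(distinct_ficus)


if __name__ == "__main__":
    # Hypothetical data: T1 and T2 are two drawn tautomers of the same compound U1.
    records = [("T1", "U1"), ("T2", "U1"), ("T3", "U2")]
    print(find_conflicts(records), conflict_fraction(records))
```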

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%) on the x-axis; frequency (number of database releases) on the y-axis]
Average: ~0.3% of original structure records

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%) on the x-axis; frequency (number of database releases) on the y-axis; bars annotated with database names]
Average: ~0.3% of original structure records
Database names shown on the chart: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: occurrence of "tautomerism-critical" molecules within each individual database release (%) on the x-axis; frequency (number of database releases) on the y-axis]
Measured as the percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict.
Average: ~9.5% of FICuS parent structures

Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [structure drawing]
HPMBP is used in liquid membranes (selective removal of metal ions).
Selectivity and efficiency depend on the tautomeric form of HPMBP.
The tautomeric form depends on the solvent and the concentration of HPMBP.
He, D.; Li Z.; Ma M.; Huang J.; Yang Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10),

Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
CACTVS generates 7 tautomers.
5 have potential stereo centers on atoms or bonds (R/S or E/Z).
[Structure drawings of the 7 tautomers, with the canonical tautomer by CACTVS highlighted]

Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 tautomers have CAS Registry Numbers assigned.
[Structure drawings of the 7 tautomers; the three forms with CAS Registry Numbers are annotated with reference counts (one count is not preserved in the transcript; the others are 49 references and 3 references) and with the stereo labels "(no stereo)" and "(Z)"]

Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Occurrences in databases indexed in CSDB:
[Structure drawings of the 7 tautomers annotated with: 6 databases; 16 databases (no stereo); 3 databases (R); 2 databases (S); 12 databases; 1 database (no stereo)]

Example for a Tautomer "Conflict"
HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Occurrences in databases (the assignment of each list to a tautomer drawing is not preserved; lists are matched to the counts by their length):
12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
1 database (no stereo): ChemDB
16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
2 databases (S): ChemSpider, ZINC
3 databases (R): ChemSpider, ECOTOX, ZINC

NCI/CADD Chemical Structure Database: Tautomer Analysis
Questions: How many tautomers are generated? How often is each rule applied (type of tautomerism)? How many tautomers are there per structure?
Starting from the set of 70.6 million FICuS parent structures, all tautomers were generated systematically with the 21-rule SMIRKS set available in CACTVS.
680 million tautomers were generated.
For 1.7% of the FICuS parent structures the enumeration was not exhaustive.

NCI/CADD Chemical Structure Database: Tautomer Analysis
Usage of SMIRKS rules (1/2). Columns: tautomer rule; count of generated tautomers; %. An ellipsis (…) marks digits lost in the transcript.
rule 1: 1.3 (thio)keto/(thio)enol; …,002,712; …
rule 2: 1.5 (thio)keto/(thio)enol; 11,541,452; 1.7
rule 3: simple (aliphatic) imine; 35,917,415; 5.3
rule 4: special imine; 4,306,155; 0.6
rule 5: 1.3 aromatic heteroatom H shift; 25,678,446; 3.8
rule 6: 1.3 heteroatom H shift; …,453,882; …
rule 7: 1.5 (aromatic) heteroatom H shift (1); 27,542,770; 4.0
rule 8: 1.5 aromatic heteroatom H shift (2); 26,819; <0.1
rule 9: 1.7 (aromatic) heteroatom H shift; 57,242,472; 8.4
rule 10: 1.9 (aromatic) heteroatom H shift; 5,061,731; 0.7
rule 11: 1.11 (aromatic) heteroatom H shift; 1,374,235; 0.2
rule 12: furanones; …,860,604; …

NCI/CADD Chemical Structure Database: Tautomer Analysis
Usage of SMIRKS rules (2/2). Columns: tautomer rule; count of generated tautomers; %. An ellipsis (…) marks digits lost in the transcript.
rule 13: ketene/ynol exchange; 57,989; <0.1
rule 14: ionic nitro/aci-nitro; 428,266; <0.1
rule 15: pentavalent nitro/aci-nitro; 129; <0.1
rule 16: oxime/nitroso; 505,695; <0.1
rule 17: oxime/nitroso via phenol; 131,502; <0.1
rule 18: cyanic/iso-cyanic acids; 181; <0.1
rule 19: formamidinesulfinic acids; …; <…
rule 20: isocyanides; 229; <0.1
rule 21: phosphonic acids; 54,926; <0.1

NCI/CADD Chemical Structure Database: Tautomer Analysis
Number of tautomers per structure. Columns: FICuS structures with …; count; %.
no tautomers; 9,756,186; 13.8
one tautomer; …,721,845; …
[The remaining rows, for structures with more tautomers (up to 832), lost their bin boundaries and most digits in the transcript. The surviving fragments indicate shares of 3.7% and 1.6% for two intermediate bins and <0.1% for each of the highest bins.]

NCI/CADD Chemical Structure Database: Tautomer Analysis
Number of tautomers per structure. Columns: FICuS structures with …; count; %.
no tautomers; 9,756,186; 13.8
one tautomer; …,721,845; …
[The remaining rows, for structures with more tautomers (up to 832), lost their bin boundaries and most digits in the transcript.]
Many of the generated tautomers are minor tautomeric forms (but you find them in databases).

Tautomer Analysis: Tanimoto Similarities of Tautomers
Canonical tautomer vs. generated tautomers (680 million tautomer set).
Fingerprint: PubChem/CACTVS E_SCREEN bitvector (881 bits).
[Table: Tanimoto index range; count; %. The range boundaries and most values are not recoverable from the transcript.]
~23% of the tautomers are below 0.8 Tanimoto similarity to their canonical tautomer (although they are the same molecule).
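For reference, the Tanimoto index of two bitvectors A and B is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on Python integers used as bitvectors; it does not generate E_SCREEN fingerprints, the example bit patterns are arbitrary, and returning 1.0 for two empty vectors is a convention chosen here.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto index of two bitvectors stored as non-negative integers."""
    if a < 0 or b < 0:
        raise ValueError("bitvectors must be non-negative integers")
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0                      # convention: two empty fingerprints are identical
    return bin(a & b).count("1") / union


if __name__ == "__main__":
    # Two arbitrary 4-bit patterns sharing 2 of 4 set bits.
    print(tanimoto(0b1110, 0b0111))
```

A threshold such as 0.8 is then just a comparison on this value, which is why two drawings of the same molecule can fall outside a similarity search when their fingerprints differ in enough bits.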

Scaffold Analysis

NCI/CADD Chemical Structure Database: Scaffold Analysis
Scaffold types: molecular scaffold tree, archetype scaffold, simple scaffold.
References: Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47; Bemis et al. J. Med. Chem. 1996, 39.
[Figure: an example structure with its simple scaffold, its archetype scaffold, and its molecular scaffold tree (level 2, level 1)]

NCI/CADD Chemical Structure Database: Scaffold Analysis
Input: CSDB uuuuu compound set (76.2 million).
Scaffold types shown: molecular scaffold tree, archetype scaffold, simple scaffold.
Scaffold counts shown: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds (the assignment of counts to scaffold types is not preserved in the transcript).
[Figure: example scaffolds, including scaffold tree levels 2 and 1]

NCI/CADD Chemical Structure Database: Scaffold Analysis
Input: CSDB uuuuu compound set (76.2 million); molecular scaffold tree with 8.1 million scaffolds.
[Chart: number of unique scaffolds per hierarchy level. Axes: hierarchy level; number of unique scaffolds (in millions); number of unique structures (in millions). Levels 1 and 2 are illustrated by example scaffolds.]

NCI/CADD Chemical Structure Database: Scaffold Analysis
[Figure: three example scaffolds drawn with substituent positions R1 to R10, each annotated with its coverage in CSDB. Annotations as they appear in the transcript: …,281 uuuuu parent structures, 5334 structure records in 64 databases; 2,726 uuuuu parent structures, 6007 structure records in 66 databases; 744,469 uuuuu parent structures, 1,069,046 structure records in 66 databases]

Atom Neighborhoods

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4),
[Figure: an example structure with its MNA descriptors]
MNA level 1: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2: C(C(CC-H)C(CC-C)-H(C)), C(C(CC-H)C(CN-H)-H(C)), C(C(CC-H)C(CN-H)-C(C-O-O)), C(C(CC-H)N(CC)-H(C)), C(C(CC-C)N(CC)-H(C)), N(C(CN-H)C(CN-H)), -H(C(CC-H)), -H(C(CN-H)), -H(-O(-H-C)), -C(C(CC-C)-O(-H-C)-O(-C)), -O(-H(-O)-C(C-O-O)), -O(-C(C-O-O))
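The recursive idea behind such descriptors (an atom label followed by the descriptors of its neighbors from the previous level) can be sketched on a plain graph. This is a simplified illustration, not the published MNA algorithm: it omits the "-" marker seen in the slide's descriptors, uses plain lexicographic sorting of neighbors, and always writes parentheses, so its output notation differs from the slide. The atom and bond encoding is a hypothetical choice for this example.

```python
from typing import Dict, Iterable, Tuple


def neighborhood_descriptors(atoms: Dict[int, str],
                             bonds: Iterable[Tuple[int, int]],
                             level: int) -> Dict[int, str]:
    """Return one descriptor string per atom.

    Level 0 is the element symbol; level n is the symbol followed by the sorted
    level n-1 descriptors of all directly bonded atoms, in parentheses.
    """
    neighbors = {index: [] for index in atoms}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    descriptors = dict(atoms)
    for _ in range(level):
        descriptors = {
            index: atoms[index] + "(" + "".join(sorted(descriptors[n] for n in neighbors[index])) + ")"
            for index in atoms
        }
    return descriptors


if __name__ == "__main__":
    # Methanol with explicit hydrogens: C0 bonded to O1, H2, H3, H4; O1 bonded to H5.
    atoms = {0: "C", 1: "O", 2: "H", 3: "H", 4: "H", 5: "H"}
    bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
    print(neighborhood_descriptors(atoms, bonds, 1))
    print(neighborhood_descriptors(atoms, bonds, 2))
```

Collecting the set of distinct descriptor strings over a whole database gives the kind of "unique MNA" inventory that the following slides count.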

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Input: CSDB uuuuu compound set (76.2 million).
[Table: unique MNAs and structure relationships for level 1 and level 2. The unique MNA counts are garbled in the transcript ("13, ,"). Relationship figures shown: "… billion relationships" and "1.3 billion relationships"; "~17 per uuuuu parent structure" and "~30 per uuuuu parent structure".]

NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Input: CSDB uuuuu compound set (76.2 million).
424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider.
[Table: unique MNAs and structure relationships for level 1 and level 2. The unique MNA counts are garbled in the transcript ("13, ,"). Relationship figures shown: "… billion relationships" and "1.3 billion relationships"; "~17 per uuuuu parent structure" and "~30 per uuuuu parent structure".]

Chemical Structure Web Services: Indexing Chemical Space
[Architecture diagram: the Chemical Identifier Resolver is an NCI/CADD web service reached over HTTP; it builds on CACTVS, the NCI/CADD Chemical Structure Database (CSDB), external web services, and other software packages (e.g. OPSIN)]

Chemical Identifier Resolver (NCI/CADD Web Resources)

Acknowledgments
ChemNavigator: Scott Hutton, Tad Hurst
CADD Group, CBL, NCI: Igor Filippov
University of Cambridge: Daniel Lowe, Peter Murray-Rust
Noel O'Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
Thanks to all database providers!
Our web site:

Acknowledgments - Software
CACTVS; Python Web Framework; ChemWriter; Python SQL Library; Javascript library; Peter Ertl (Novartis)
