Published by Cynthia Lindsey. Modified over 9 years ago
1
Chemical Databases, Identifiers, Tool Kits and Web Services October 16, 2003 Marc C. Nicklaus, CADD Group, Lab. of Medicinal Chemistry, CCR, NCI, NIH; mn1@helix.nih.gov Thanks also to the other members of the CADD Group: Rajeshri G. Karki, Megan L. Peach, Karen M. Green, Guangyu Sun, Igor Filippov
2
Acknowledgements Wolf-Dietrich Ihlenfeldt (formerly Computer Chemistry Center [CCC], Erlangen, Germany) Frank Oellien (formerly CCC) Bruno Bienfait (formerly CCC and LMC) Johannes Voigt (formerly LMC)
3
Reasons to Deal with (Large) Chemical Databases (of Small Molecules) Inventory Source for drug design, (virtual) screening Repository for associated information (assay results, physicochemical data, environmentally important properties etc.) Chemoinformatics Link to other services, databases Source for individual structures (e.g. for comp. chem.)
4
Questions What databases are out there? What databases can we share? What tools can we share? What can we offer the public? [How] Can we standardize? How can we determine: have we, has anyone, seen this structure before? What’s next (XML/CML… )?
5
The NCI Database & Enhanced NCI Database Browser
6
The NCI Database Approximately half a million compounds Collected since 1955 by the National Cancer Institute, NIH Tested in anti-cancer screens; since 80’s also in AIDS screens Managed by NCI’s Developmental Therapeutics Program (DTP); see http://dtp.nci.nih.gov Publicly available: currently 260,071 cpds. (“Open NCI Database”) Cancer screening data (60 cell lines) available for ca. 43,000 compounds AIDS screening data available for ca. 44,000 compounds Samples available from DTP for ~60% of compounds
7
Enhanced CACTVS Browser of the Open NCI Database
Web-based interface for searching data from the Open NCI Database by numerous criteria, including 2D and 3D structural searches
Augmented by many additional data:
- derived: e.g. number of rotatable bonds
- calculated/predicted: e.g. log P; biological activities
- systematically determined: e.g. IUPAC names
- cross-evaluated: e.g. commercial availability
Boolean searches possible
Requirements: just a Web browser; several plug-ins are optional
Based on chemical information toolkit CACTVS (Wolf-Dietrich Ihlenfeldt, see http://www2.chemie.uni-erlangen.de/software/cactvs/)
8
How To Get There…. URLs: http://cactus.nci.nih.gov/ncidb2 (U.S. mirror) http://www2.chemie.uni-erlangen.de/ncidb2 (European mirror)
9
http://cactus.nci.nih.gov
10
Enhanced NCI Database Browser: Query Form
11
Combined Search: PASS Antiangiogenesis Prediction & Name (Fragment) Exclusion
12
Search Result: Hitlist
13
Search Result: Image Gallery
14
Search Result: Detail View (top)
15
Search Result: Detail View (continued)
17
Update of Enhanced NCI Database Browser Most up-to-date DTP data sets Completion/curation of data where possible Many more additional calculated data New search capabilities New underlying database format
18
Update of Enhanced NCI Database Browser
Most up-to-date DTP data sets
260,071 compounds (new and old releases of Open NCI Database merged, new entry prevails)
42,577 have cancer screen results: >300 properties
43,905 have AIDS screening results: 3 (EC50: 3,143; IC50: 39,352)
115,324 have animal model assay data (NEW): ~10 (string)
139,735 are plated: 1
45,224 have name(s) (from DTP records): 0...>20
127,361 have CAS RN: 1
3,576 have experimental log P: 1
19
Update of Enhanced NCI Database Browser Completion/curation of data Add CAS numbers: cross-check with other DBs Correct structures and/or names: mostly manually, only occasionally possible, often dangerous Calculate averages and SD for cell-line assay data (No evaluation/curation planned for animal model assay data)
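The averaging step mentioned above could look roughly like the following sketch. It assumes repeated measurements per cell line are already grouped in a dictionary; the cell-line names and values are invented for illustration, and the use of the sample standard deviation is an assumption, since the slide does not say which definition is used.

```python
from statistics import mean, stdev

def summarize(values):
    """Return (average, standard deviation) for repeated assay values.

    Uses the sample standard deviation (an assumption); a single
    measurement gets an SD of 0.0 because no spread can be estimated.
    """
    average = mean(values)
    spread = stdev(values) if len(values) > 1 else 0.0
    return average, spread

# Hypothetical repeated measurements for two cell lines (made-up numbers).
measurements = {
    "cell_line_A": [-5.0, -6.0, -7.0],
    "cell_line_B": [-4.5],
}

summary = {name: summarize(vals) for name, vals in measurements.items()}
for name, (average, spread) in summary.items():
    print(f"{name}: mean={average:.2f} sd={spread:.2f}")
```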
20
Download Page http://cactus.nci.nih.gov/ncidb2/download.html
260,071 2D or 3D structures with USMILES, IChI, IUPAC Name, CAS RN and additional “canonical properties” [in beta test stage: http://cactus.nci.nih.gov/ncidb2/download_NEW_09-03.html ]
250,251 2D structures in SDF format
250,251 structures in USMILES format
249,081 2D structures plus cancer and AIDS data (where present) in SDF format
32,557 2D structures with cancer test data as of August 1999, in SDF format
42,689 2D structures with AIDS test data as of October 1999, in SDF format
23,031 2D structures in SDF format for which both cancer and AIDS data are available
Updated structure and combined structure+data files under preparation
21
Future Plans and Wishes… Apply/add IChI names Create Search&Display GUIs for more DBs Cross-complete databases (CAS RNs etc.) XML/CML-ize databases Move toward one virtual chemical database Link different database spaces together (small molecules, macromolecules, toxicity data, comp.chem. data, spectral data, patent data…)
22
Other Databases
23
Public U.S. Government Chemical Databases
NCI Open (02/03): 260,017
NIST WebBook (04/03): 31,167
NLM ChemIDplus (04/03): 160,590
EPA GCES Database (03/02): 66,347
Combined USGovtPubDB: ~515,000
More will hopefully follow from: NIAID, EPA, FDA…
24
Databases
U.S. Government Databases
NCI Open (09/03): 260,071 (public)
NIST MS Library (2002): 147,194
NIST WebBook (04/03): 31,167 (public)
NLM ChemIDplus (04/03): 160,590 (public)
EPA GCES Database (03/02): 66,347 (public)
Commercial/Other Databases
ACD (1.1999): 221,668
ACX (1999): 137,003
Ambinter (08/02): 487,397
Asinex Gold (09/03): 202,237
Asinex Platinum (09/03): 117,518
Beilstein Natural Products (02/02): 124,701
ChemBridge (02/02): 100,000
ChemDiverse IDC (09/03): 122,684
ChemDiverse CombiLab (09/03): 201,139
ChemNavigator (08/03): 6,954,906
ChemStar (05/03): 72,407
Petrenko (PN07): 115,554
Ryan (10/02): 294,437
Sigma-Aldrich Rare Chem. (08/02): 124,728
CSD (5.24, 11/02, organic only): 103,860
25
Canonical Version Hydrogens added 3D coordinates calculated Some deficiencies fixed (nitro groups…) Canonical property fields added Original fields (mostly) left untouched Not yet done, but possible: completed data, e.g. CAS RN, name etc.
26
SD Files: NOT Standardized!
Fields: ID, Chem. Name, Rec#, CAS RN, Formula, MDL Name (first line)
Database: Field Names
NCI Open (02/03): NSC - - - - NSC (id)
NIST MS Library (2002): ID - - - - -
NIST WebBook (04/03): WEBBOOK.ID NAME (n) CAS.NUMBER FORMULA (name) [, ID: (id)]
NLM ChemIDplus (04/03): ID molname - - -
EPA GCES Database (03/02): (**came as list of SMILES strings**)
ACD (1.1999): NAME (1st rec.) (n) - - (name)
ACX (1999): CsNum - (n) CAS Formula “MDL Molfile”
Ambinter:
File 0: ID - IDNUMBER - - -
File 1: id - IDNUMBER - - -
File 2: idnumber - ID - - (id)
File 3: Code - ID - - -
Asinex Gold (06/03): IDNUMBER - - - - -
Asinex Platinum (06/03): IDNUMBER - - - - -
Beilstein Natural Products (02/02): BRN CN ID RN - -
ChemBridge (02/02): ID - - - - -
ChemDiverse IDC (04/03): IDNUMBER - - - - -
ChemDiverse CombiLab (04/03): IDNUMBER - - - - -
ChemNavigator (06/03): CNC_ID - - - - (id)
ChemStar (05/03): IDNUMBER - - - - -
Petrenko (PN07): id - - - - -
Ryan (10/02): code name ID - - -
Sigma-Aldrich Rare Chem. (08/02): CAT_NO - - - - -
World Drug Index (3.1999): [RNEXTREG] MOLNAME(1) - CAS MF (name)
(1) Possibly, but not consistently, repeated in fields and/or
27
Canonical Fields
Field/Property Name: Possible values (or explanation)
Properties dependent on structure:
E_UNIQUE_ID: to be concatenated from Origin + Version/Date + internal ID; example: NLM_04-03_08359076
E_CAS: CAS Reg. No. (if available; otherwise 999-99-9)
E_NAME: Chemical Name (if available)
E_STEREO_SPECIFIED: no_stereocenter | unknown | partial | full (for most DBs, only choices 1 & 4 relevant)
E_COMPOUND_TYPE: unspecified (i.e. typically “normal organic”) | metal_complex (if pre-processed as such) | ... (maybe others?)
E_THREED_SOURCE: is_2D | experimental | CORINA_x.y | 3D_unsuccessful
E_ICHI: IChI chemical identifier
E_SMILES: CACTVS-calculated Unique SMILES code (according to 1989 definition published by Daylight, Inc.)
E_FORMULA: Molecular formula calculated by CACTVS
E_HASHY: Non-stereospecific hash code
E_HASHSY: Stereospecific hash code
E_TAUTO_HASH: Non-stereospecific tautomer-invariant hash code
E_STEREO_TAUTO_HASH: Stereospecific tautomer-invariant hash code
E_MAXFRAG_HASHY: Non-stereospecific hash code of largest fragment
E_MAXFRAG_HASHSY: Stereospecific hash code of largest fragment
E_MAXFRAG_HASHTY: Non-stereospecific tautomer-invariant hash code of largest fragment
E_MAXFRAG_HASHSTY: Stereospecific tautomer-invariant hash code of largest fragment
E_MULTIFRAG: single_fragment | multi_fragment
Properties constant for whole database:
E_ORIGIN: Detailed origin of data file; e.g. “DTP February 2003”
E_DB_VERSION: Explicit DB version number, if available
E_DB_YEAR: Year data file was generated
E_DB_TYPE: government public (e.g. NCI Open DB) | government non-public (e.g. NIST WebBook) | commercial free (e.g. Sig.-Al. RCL) | commercial licensed (e.g. CSD, WDI, ACD...)
E_IS_PUBLIC: yes | no | unknown
E_SAMPLES_AVAIL: yes | on-demand | no | unknown (for whole DB!)
E_SOURCE: Source of database: Company, Agency etc.; e.g. “NCI”
E_CONTEXT: Nature, or context, of compound: e.g. environmentally relevant cpd., general chemical, drug…
E_SUPPLIER_TYPE: broker | manufacturer | repository | N/A
E_CACTVS_VERSION: x.yyy (version of the CACTVS toolkit used to process the database)
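As a small illustration of two of the conventions above, the sketch below builds an E_UNIQUE_ID by concatenating origin, version/date and internal ID, and fills E_CAS with the 999-99-9 placeholder when no CAS number is known. The function names and the dictionary layout are hypothetical; only the field names, the underscore-joined example and the placeholder value come from the slide.

```python
CAS_PLACEHOLDER = "999-99-9"  # value the slide gives for "no CAS RN available"

def make_unique_id(origin, version_or_date, internal_id):
    """Concatenate Origin + Version/Date + internal ID with underscores."""
    return f"{origin}_{version_or_date}_{internal_id}"

def canonical_record(origin, version_or_date, internal_id, cas=None, name=None):
    """Build a minimal dictionary of canonical fields (hypothetical helper)."""
    record = {
        "E_UNIQUE_ID": make_unique_id(origin, version_or_date, internal_id),
        "E_CAS": cas if cas else CAS_PLACEHOLDER,
    }
    if name:
        # E_NAME is only present when a chemical name is available.
        record["E_NAME"] = name
    return record

record = canonical_record("NLM", "04-03", "08359076")
print(record["E_UNIQUE_ID"], record["E_CAS"])
```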
28
Can’t we all get along with each other?
29
Identifiers
30
CACTVS Hash Codes
31
Overlap Analysis via Hash Codes CACTVS hash codes Tautomer-invariant hash codes (NEW!) Stereospecific vs. non-stereospecific, tautomer-sensitive vs. tautomer-invariant, entire ensemble vs. max. fragment only, -- all eight combinations calculated and added to SD file. Index file created for rapid comparison:
32
Hash Codes
33
Index File
009F2C8EFECD9E45 009F2C8EFECD9E45 000140-95-4 EPA_GCES_03-02_000140-95-4 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 025155-29-7 EPA_GCES_03-02_025155-29-7 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 140-95-4 NIST_WebBook_04-03_C140954 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 25155-29-7 NIST_WebBook_04-03_C25155297 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 ChemStar_05-03_CHS_0025004 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 NIST_MS-Lib_02_140954 C3H8N2O3
0106B6B3BC4B9B7C 0106B6B3BC4B9B7C AsinexGold_06-03_BAS_0605175 C15H12N2O2
0106B6B3BC4B9B7C 0106B6B3BC4B9B7C ChemStar_05-03_CHS_0816990 C15H12N2O2
0106B6B3BC4B9B7C 0106B6B3BC4B9B7C NIST_MS-Lib_02_304480737 C15H12N2O2
0106B6B3BC4B9B7C 0106B6B3BC4B9B7C PN07_10-02_PN07_025090 C15H12N2O2
056832FA004D9ED0 056832FA004D9ED0 ?? 010118-90-8 EPA_GCES_03-02_010118-90-8 C23H27N3O7
056832FA004D9ED0 056832FA004D9ED0 ?? NIST_MS-Lib_02_10118908 C23H27N3O7
SMILES code appended to each line (not shown here)
Very rapid searches, overlap analyses, counts etc. with just Unix pipes.
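The slide does this with Unix pipes; the same kind of overlap analysis can be sketched in Python. The sketch assumes whitespace-separated index lines whose first column is the hash code used as the key and whose unique ID follows the Origin_Version/Date_ID pattern. The regular expression and the function names are assumptions made for illustration, not part of the original tooling.

```python
import re
from collections import defaultdict

# Assumed shape of E_UNIQUE_ID: <origin>_<MM-YY or YY>_<internal id>
ID_PATTERN = re.compile(r"^(.+?)_(\d{2}(?:-\d{2})?)_(.+)$")

def parse_index_line(line):
    """Return (hash_code, origin, unique_id) for one index-file line."""
    tokens = line.split()
    hash_code = tokens[0]
    for token in tokens[2:]:  # skip the two hash columns
        match = ID_PATTERN.match(token)
        if match:
            return hash_code, match.group(1), token
    raise ValueError(f"no unique ID found in line: {line!r}")

def databases_by_hash(lines):
    """Map each hash code to the set of databases that contain it."""
    groups = defaultdict(set)
    for line in lines:
        if line.strip():
            hash_code, origin, _ = parse_index_line(line)
            groups[hash_code].add(origin)
    return groups

sample = """\
009F2C8EFECD9E45 009F2C8EFECD9E45 000140-95-4 EPA_GCES_03-02_000140-95-4 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 140-95-4 NIST_WebBook_04-03_C140954 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 ChemStar_05-03_CHS_0025004 C3H8N2O3
009F2C8EFECD9E45 009F2C8EFECD9E45 NIST_MS-Lib_02_140954 C3H8N2O3
056832FA004D9ED0 056832FA004D9ED0 ?? 010118-90-8 EPA_GCES_03-02_010118-90-8 C23H27N3O7
056832FA004D9ED0 056832FA004D9ED0 ?? NIST_MS-Lib_02_10118908 C23H27N3O7
""".splitlines()

groups = databases_by_hash(sample)
for hash_code, origins in sorted(groups.items()):
    print(hash_code, len(origins), sorted(origins))
```

Because the file is sorted by hash code, structures shared between databases sit on adjacent lines, which is what makes the pipe-based approach on the slide fast.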
34
IChI
35
The IUPAC Chemical Identifier Steve Stein, Steve Heller, Dmitrii Tchekhovskoi National Institute of Standards and Technology Gaithersburg, MD, USA U.S. Government Chemical Databases Frederick, MD July 22, 2003 with permission of Steve Stein…
36
Too Many Identifiers Structure diagrams –various conventions –contain ‘too much’ information Connection Tables –MolFiles, Smiles, ROSDAL,.. Pronounceable names –IUPAC, CAS, trivial Index Numbers –EINECS, FEMA, DOT, RTECS, CAS, Beilstein, USP, RTECS, EEC, RCRA, NCI, UN, USAF
37
What kind of Identifier is needed? Derived from structure by algorithm Accepts common drawing conventions Exactly one Identifier per structure Comprehensive Openly available
38
Requirements Different compounds have different identifiers –All distinguishing structural information is included [Figure: structure drawings labelled IChI - 1 and IChI - 2]
39
Requirements One compound has only one identifier –Include only necessary information [Figure: structure drawings labelled “Same IChI”]
40
IChI First Version Discrete, bonded compounds –Include ‘dot disconnected’ compounds Stereochemistry –sp3 - tetrahedral –Z/E - double bond Tautomers
41
3 Steps to IChI ‘Normalize’ Input Structure –Defined input structure required –Remove conventions with chemical rules –Divide into ‘layers’ ‘Canonicalize’ (label the atoms) –Equivalent atoms get the same label ‘Serialize’ the Labeled Structure –A unique series of bytes
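The canonicalize and serialize steps can be illustrated with a toy example. This is not the IChI algorithm: it is a brute-force sketch that tries every atom numbering of a very small graph and keeps the lexicographically smallest description, which is only practical for a handful of atoms. It ignores hydrogens, bond orders, charges and stereochemistry, and the function name and output format are made up for illustration.

```python
from itertools import permutations

def canonical_string(elements, bonds):
    """Return one string per graph, independent of the input atom order.

    elements: list of element symbols, one per atom
    bonds:    list of (i, j) pairs of 0-based atom indices
    """
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the new 0-based label given to input atom i.
        relabeled = [None] * n
        for old, new in enumerate(perm):
            relabeled[new] = elements[old]
        edges = tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds))
        candidate = (tuple(relabeled), edges)
        if best is None or candidate < best:
            best = candidate
    labels, edges = best
    # "Serialize": element list followed by 1-based connections.
    return "".join(labels) + " " + ",".join(f"{a + 1}-{b + 1}" for a, b in edges)

# Heavy-atom skeleton of ethanol drawn in two different atom orders.
ethanol_a = canonical_string(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_string(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether has the same formula but different connectivity.
ether = canonical_string(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a, ethanol_b, ether)
```

The two ethanol inputs give the same string and the ether gives a different one, which mirrors the two requirements stated earlier: one identifier per compound, different identifiers for different compounds.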
42
Simplifications Ignore ‘Electron Density’ –Double/Triple/Coordination bonds –Odd-electrons/Charges Free Rotation Around Single Bonds Separate structure information into ‘layers’
44
Output Format Example: Benzene Represent atoms as sequence number in formula C6H6 = C C C C C C H H H H H H tags 1 2 3 4 5 6 7 8 9 10 11 12 Basic Layer: C6H6 1-2-7 2-3-8 3-4-9 4-5-10 5-6-11 7-12
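The numbering convention above, where each atom is referred to by its position in the expanded formula, can be sketched as follows. The helper name is hypothetical, and the parser only handles simple formulas made of element symbols followed by optional counts.

```python
import re

def number_atoms(formula):
    """Expand a simple formula into {sequence number: element symbol}."""
    atoms = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means a single atom of that element.
        atoms.extend([symbol] * (int(count) if count else 1))
    return {index: symbol for index, symbol in enumerate(atoms, start=1)}

benzene = number_atoms("C6H6")
print(benzene)
```

For benzene this assigns numbers 1 to 6 to the carbons and 7 to 12 to the hydrogens, matching the tags row on the slide.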
45
C5H8NO4.Na 6H2-5H(4(9)10)2H2-1H2-3(7)8,(H-,7,8,9,10); -1;+1 ; 5?;
46
Disconnected
C5H5.C5H5.Fe 1H-2H-4H-5H-3H-1;1H-2H-4H-5H-3H-1; -1;-1;+2
Connected
C10H10Fe 1H-2H-4H-5H-3H(1)11(1,2,4,5)6H-7H(11)9H(11)10H(11)8H(6)11
47
Future Extensions Other Stereo Forms –Non-atom centered –Conformations –Hydrogen Bonding Polymers/Macromolecules Compound Classes –Markush structures
48
Tools
49
CACTVS Pipeline Pilot MDL Software Daylight Software Oracle Others…
50
Free Tools
CACTVS: Scripts; GUI-based tools; Web services; free for academic and public use. Unix, Linux.
GUI Generator: Generate Web interface automatically from SD file and data. CACTVS-based. Free for non-profit use. Under development.
51
http://www2.ccc.uni-erlangen.de/software/cactvs/tools.html
52
CACTVS Documentation The most complete documentation for CACTVS as of now resides on the LMC Intranet server. Please contact me (mn1@helix.nih.gov) for info.
53
Web Services
54
Additional Links and Information
Other Public Chemical Data: A list of URLs that point to other public chemical (or chemistry-related) information. Currently limited to U.S. Government web sites. These sites may contain search capabilities and/or other public datasets: http://cactus.nci.nih.gov/ncidb2/govt_dbs.html
Chemistry Search Services on the Web: A table of searchable small molecule databases available on the web, listing URLs and an (incomplete) survey of the services' features and capabilities. Contains (U.S.) Government, academic, and commercial web sites: http://cactus.nci.nih.gov/ncidb2/chem_www.html
55
Online SMILES Translator http://cactus.nci.nih.gov/services/translate/ Generate Unique SMILES; Translate Between Formats: SD, PDB, MOL, SMILES
56
“GIF” Creator http://cactus.nci.nih.gov/services/gifcreator/ Create High-Quality 2D Drawings of Your Chemical Structures: Many input formats supported: SD, MOL, PDB, SMILES, Sybyl…
57
NCI Screening Data 3D Miner http://cactus.nci.nih.gov/services/3DMiner/ Visualize & Mine 60-Cell Line Cancer Data for 41,000 NCI Compounds
59
GUI Generator CACTVS-based GUI that takes a structure file (SD or other format) and associated data files (TXT, Excel, etc.) as input and generates a Web service from it. Its output is a CACTVS database file (.cbase file) and the CACTVS script – generating JavaScript pages – that allows web-based searches and display of results in/from this database. Will be freely distributed for non-profit use. Since script=executable, also freely adaptable. Currently: alpha version.
67
Pipeline Pilot
SciTegic, Inc.
GUI-based
Icon Pipeline paradigm
Predefined components (really: scripts); you can modify them, or write your own
Drag&drop to form pipelines
Ensemble of pipelines is the program (called ‘Protocol’), can be saved, shared…
Now Win. 2000/XP only; fall ‘03: Linux
Very fast: up to 20,000 cpds/sec
Expensive
69
3D Pharmacophore Searches Up to 25 conformers pre-calculated by the program Catalyst (MSI) are stored for each compound. Searches are possible by distance constraints and other query features. Most ISIS features are implemented, such as exclusion spheres, centroids, points on lines …. There are two ways to define a query: the query file is prepared in an external program, such as Catalyst, ISIS/Draw etc., and submitted in .mol format; or the JME Editor available within the service is used. Example: 3-point pharmacophore used in a previous study on HIV-1 integrase inhibitor discovery, J. Med. Chem. 1997, 40(6), 920-929. Distance constraints: 9.053 ± 0.4 Å, 8.711 ± 0.4 Å, 2.548 ± 0.3 Å
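A distance-constraint check of the kind described above can be sketched in a few lines. The three distances and tolerances are the ones quoted for the example pharmacophore; which pair of points each distance belongs to is not stated in the transcript, so the A/B/C assignment below is an assumption, as are the function name and the test coordinates. A real search would also loop over the stored conformers and over candidate atom assignments.

```python
from math import dist  # Python 3.8+

# (point, point) -> (target distance in Å, tolerance in Å); pair labels are assumed.
CONSTRAINTS = {
    ("A", "B"): (9.053, 0.4),
    ("A", "C"): (8.711, 0.4),
    ("B", "C"): (2.548, 0.3),
}

def matches_pharmacophore(points, constraints=CONSTRAINTS):
    """Return True if every pairwise distance lies within its tolerance.

    points: dict mapping point label to an (x, y, z) coordinate tuple
    """
    for (first, second), (target, tolerance) in constraints.items():
        if abs(dist(points[first], points[second]) - target) > tolerance:
            return False
    return True

# Hypothetical coordinates of three pharmacophore points in one conformer.
hit = {"A": (0.0, 0.0, 0.0), "B": (9.053, 0.0, 0.0), "C": (8.359, 2.452, 0.0)}
miss = {"A": (0.0, 0.0, 0.0), "B": (9.053, 0.0, 0.0), "C": (4.0, 0.0, 0.0)}

print(matches_pharmacophore(hit), matches_pharmacophore(miss))
```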
70
3D Pharmacophore Search -- Result