Published by Magdalene Haynes. Modified over 9 years ago.
1
Accessing U.S. Government Chemical Structure Databases with the CACTVS Toolkit
Wolf-D. Ihlenfeldt
Xemistry GmbH, Lahntal, Germany
wdi@xemistry.com
2
The US Gov Chemical Structure Data Information Pool
PubChem
  Depositor structures (SID)
  Unique structures (CID)
  Assays (AID)
  Links to the rest of NCBI Entrez
3
The US Government Chemical Structure Data Pool
NIST Web Book
  Spectra, physical properties
ChemIDPlus
  Physical properties, biomedical links
NCI Chemical Identifier Resolver
  From name and IDs to structure
4
Other sources
ChemSpider (UK)
EINECS (EU)
KEGG (JP)
EMolecules (US, commercial)
ChEBI (UK)
ChEMBL (UK)
Drugbank (CA)
PDB (US, academic)
CommonChemistry (US, commercial)
Wikipedia (World)
5
How to Work with the Data?
Web interface for humans
  Hard to work with software
Many DBs provide external links
  Prone to breaking, becoming outdated
Data available as batch download
  Massive, difficult to manage
Lack of formal interface documentation or programmatic access
  PubChem, Entrez, NCI Resolver are the good guys
6
The CACTVS Toolkit
Generic chemistry toolkit
Manages objects like structures, reactions, tables
Extensible collection of properties, methods and I/O modules
Implicit automatic method chaining
Scripting language interface for RAD
Ships with access properties and modules for all these databases
Comprehensive solution for multi-DB projects
7
Basic Tasks
Name/Identifier resolution
  NCI Resolver -> REST interface
  KEGG -> text query
cactvs> ens create "vioxx"
ens0
cactvs> dataset create [list "+morphine +methyl"]
dataset0
cactvs> dataset ens dataset0
ens1 ens2
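As a rough illustration of the "NCI Resolver -> REST interface" bullet, the sketch below builds a resolver request URL outside of Cactvs. It assumes the resolver's public URL scheme of base path, identifier, and output representation; the function name and the default representation are choices made for this example and are not taken from the slides. It only constructs the URL and does not contact the service.

```python
from urllib.parse import quote

# Base path of the NCI Chemical Identifier Resolver (assumed public URL scheme).
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "smiles") -> str:
    """Build a resolver URL for a name, CAS number, SMILES or other identifier.

    `representation` names the requested output, e.g. "smiles" or "stdinchi".
    """
    # Percent-encode every reserved character, including '/' and '#',
    # both of which can occur in SMILES strings.
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"


if __name__ == "__main__":
    print(resolver_url("vioxx"))
    print(resolver_url("71-43-2", "stdinchi"))
```

Fetching the resulting URL with any HTTP client would return the requested representation as plain text, which is roughly the step that `ens create "vioxx"` hides from the user.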
8
Basic Tasks: Get Database ID
Text structure query (SMILES, InChI)
NCBI PUG Web service
cactvs> ens get ens0 E_SIDSET
9792 207247 535364 5146347 7847634 7980536 8146414 8153131 10486532 11341940 11362123 11362973 11364757 11365535…
cactvs> ens get ens0 E_CHEMIDPLUS_ID
0162011907
9
Basic Tasks: Download Objects
PubChem: from CID, SID, AID
PDB, CHEMBL, KEGG: from codes
Resolver: from name, identifiers
cactvs> ens create 1
ens0
cactvs> ens create CHEMBL277500
ens1
10
Basic Tasks: Download Objects
PubChem I/O via native ASN.1
cactvs> table create 198
table3
cactvs> table get table3 colnames
SID SID_Source Version Date Outcome Score schedule endpoint vehicle dose tcprcnt toxicity
cactvs> table get table3 T_NCBI_ASSAY_DESCRIPTION(description)
{The antitumor activity of compounds was measured in mice bearing transplantable tumors. Survival or tumor size were measured and the…
11
Basic Tasks: I/O of ID Files
Read files with CIDs, SIDs, CASNOs…
cactvs> set fh [molfile open test.cas]
molfile0
cactvs> molfile loop $fh eh { puts [ens get $eh E_CID] }
436534
321512
234
32532….
12
Implicit Property Lookup
Yes, it's controlled, with metadata and origin tracing
cactvs> ens create benzene
ens0
cactvs> ens get ens0 E_CAS
71-43-2
cactvs> ens get ens0 E_UVSPECTRUM
1 {INSTITUTE OF ENERGY PROBLEMS OF CHEMICAL PHYSICS, RAS} {INEP CP RAS, NIST OSRD Collection (C) 2007 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.} 0 n.i.g. {} {{$NIST SQUIB} 1951ROM/VOD930-932 {$NIST SOURCE} TSGMTE {$REF AUTHOR} {Romand, J.; Vodar, B.} {$REF TITLE} {Spectres d'absorption du benzene a l'etat vapeur et a l'etat condense dans l'ultraviolet lointain} {$REF JOURNAL} {Compt. Rend.} {$REF VOLUME} 233 {$REF PAGE} 930-932 {$REF DATE} 1951} {} {RAS UV No. 118} 0.0 {} {} 0.0 162.418 206.9805 1.0 1.0 {Wavelength (nm)} {Logarithm epsilon} 317 {} {3.7038 3.7101 3.7161 3.722 3.722….
13
More Property Lookups
cactvs> ens show ens0 E_NIST_WEBBOOK_ID
C71432
cactvs> ens get ens0 E_AIDSET
330 421 426 427 433 434 435 445 530 540 541 542 543 544 545 546 584 585...
cactvs> ens get ens0 E_NAMESET
BENZENE 71-43-2 NCGC00090744-02 UN1114 {Benzen [Polish]} {Benzene + aniline combo} 270709_ALDRICH 311855_SIGMA {Benzene (including benzene from gasoline)} 676985_ALDRICH {Benzene [UN1114] [Flammable liquid]} 154628_SIAL {Benzene, labeled with carbon-14 and tritium}…
14
More Property Lookups
cactvs> ens get ens0 E_MESH_TERMS
{68001554 {Benzene Cyclohexatriene Benzol Benzole} http://www.ncbi.nlm.nih.gov/sites/entrez?Db=mesh&Cmd=ShowDetailView&TermToSearch=68001554 {68001554 Benzene {68006841 {Hydrocarbons, Aromatic} {68006844 {Hydrocarbons, Cyclic} {68006838 Hydrocarbons {68009930 {Organic Chemicals} {1000068 {Chemicals and Drugs Category} {1000048 {All MeSH Categories}}}}}}}}} {68009930 {{Organic Chemicals} {Chemicals, Organic}} http://www.ncbi.nlm.nih.gov/sites/entrez?Db=mesh&Cmd=ShowDetailView&TermToSearch=68009930...
15
Construction of Display URLs
cactvs> ens get ens0 E_PUBCHEM_URL
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=241
cactvs> ens get ens0 E_CHEMIDPLUS_URL
http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=DBMaint&actionHandle=default&nextPage=jsp/chemidheavy/ResultScreen.jsp&ROW_NUM=0&TXTSUPERLISTID=0000071432
cactvs> ens metadata ens0 E_CHEMIDPLUS_URL info
{JSESSIONID=257C452AAB26D395DCC4AC2652F05C99; Path=/chemidplus}
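The identifiers on these slides appear to follow simple patterns derived from the CAS number: the ChemIDplus ID looks like the CAS digits zero-padded to ten characters (71-43-2 gives 0000071432, and 0162011907 on the earlier slide matches 162011-90-7), and the NIST WebBook ID looks like "C" followed by the CAS digits (C71432). The sketch below encodes those inferred patterns together with the PubChem summary URL form shown above. The patterns are read off the examples, not taken from any database documentation, the function names are invented for illustration, and the URLs date from the presentation and may no longer resolve.

```python
def cas_digits(cas: str) -> str:
    """Strip the hyphens from a CAS registry number such as '71-43-2'."""
    digits = cas.replace("-", "")
    if not digits.isdigit():
        raise ValueError(f"not a CAS registry number: {cas!r}")
    return digits


def chemidplus_id(cas: str) -> str:
    # Pattern inferred from the slides: CAS digits zero-padded to 10 characters.
    return cas_digits(cas).zfill(10)


def webbook_id(cas: str) -> str:
    # Pattern inferred from the slides: 'C' followed by the CAS digits.
    return "C" + cas_digits(cas)


def pubchem_summary_url(cid: int) -> str:
    # URL form exactly as printed on the slide (circa 2010).
    return f"http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid={cid}"


if __name__ == "__main__":
    print(chemidplus_id("71-43-2"))
    print(webbook_id("71-43-2"))
    print(pubchem_summary_url(241))
```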
16
Ugliness under the Hood
Absence of a clean programmatic interface hurts!
set mdata [encode -url [molfile string $eh]]
set pdata "indexes=&DT_ROWS_PER_PAGE2=1&objectHandle=Search&actionHandle=searchChemIdLite&nextPage=jsp%2Fchemidheavy%2FChemidDataview.jsp&DT_ROWS_PER_PAGE=1&responseHandle=JSP&QV10=&QO10=Text+Search&QF11=Locator&QV11=&QO11=in&STRING_TO_FILE=$mdata&QF1=Name&QO1=%3D&QV1=&QV8=&QF8=ToxTestType&QO5=between&QV5=&QF5=ToxResult&QV6=&QF6=ToxSpecies&QV7=&QF7=ToxRoute&QV9=&QF9=ToxEffect&ChemType=1001&QF3=ChemType&QV3=&QF2=ChemProp&QO2=between&QV2=&ChemDataSourceType=0&QF4=ChemDataSourceType&QV4=&LocatorExpr1=&LocatorOper=AND&LocatorExpr2=&chemical_viewer=marvin&StructureSimilarPctg=80&QF10=StructureEqual&structurePref=marvin&QO12=between&QV12=&QF12=MolWeight&x=22&y=5"
set data [post -contenttype application/x-www-form-urlencoded -raw http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?chemidheavy $pdata #auto status]
if {![regexp {chemid=([0-9]+)} $data dummy id] && ![regexp {javascript:loadChemicalIndex[^0-9]*([0-9]+)} $data dummy id]} {
    error "no ChemIDplus record"
}
ens set $eh E_CHEMIDPLUS_ID $id
ens metadata $eh E_CHEMIDPLUS_ID info $status(cookies)
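The scraping step at the end of that script can be isolated as a small function. The sketch below ports the two regular expressions from the slide to Python and tries them in the same order; the sample page fragments in the usage section are made up for illustration and are not real ChemIDplus output.

```python
import re

# The two patterns from the slide, tried in the same order.
CHEMID_PATTERNS = (
    re.compile(r"chemid=([0-9]+)"),
    re.compile(r"javascript:loadChemicalIndex[^0-9]*([0-9]+)"),
)


def extract_chemidplus_id(page: str) -> str:
    """Return the first ChemIDplus ID found in a result page, or raise."""
    for pattern in CHEMID_PATTERNS:
        match = pattern.search(page)
        if match:
            return match.group(1)
    raise LookupError("no ChemIDplus record")


if __name__ == "__main__":
    # Hypothetical fragments, invented for this example.
    print(extract_chemidplus_id('<a href="ProxyServlet?chemid=0000071432">'))
    print(extract_chemidplus_id("javascript:loadChemicalIndex('0162011907')"))
```

Any change to the page layout silently breaks both patterns, which is the fragility the slide complains about.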
17
Power by Design
In contrast, PubChem has a well-defined set of interfaces – PUG, EUtils, cookie-free download URLs
No simulated Web form posting
No HTML page scraping
Support for more than just ID access
18
The PubChem Virtual File Project
Improved access to PubChem database
  Indistinguishable from a local, read-only structure file in the Cactvs scripting environment
Input functions
  Transparently read structures and assay tables with all their data from PubChem, by decoding native binary ASN.1
Query functions
  Convenient development and conservation of queries exceeding the capabilities of Web interfaces and PUG, maintaining standard Cactvs query and retrieval syntax
19
Transforming the PubChem Database into a Virtual File
Cactvs toolkit uses file record as primary key
PubChem uses CID (AID, SID) as primary key
Establish mapping via record/CID map
  Precomputed as a 20M-bit bitmap
  Set bit indicates active CID
  Automatic download from Xemistry if needed, local caching, up-to-date check via Entrez query
  Checked and potentially updated every 30 mins on Xemistry server
  Data size 800K compressed, download <10s
  Download of full active CID set from Entrez ~10-25 mins
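The record/CID map can be pictured as a rank/select structure over a bitmap: record n of the virtual file is the n-th set bit, and the bit position is the CID. The sketch below is one way to implement such a map and makes no claim about the actual Cactvs data layout. The bit order within a byte (least significant bit first), the 1-based record numbering, and the class and method names are all assumptions made for this example. At 20 million bits the raw bitmap is 2.5 MB, which is consistent with the quoted 800K after compression.

```python
from bisect import bisect_left
from itertools import accumulate


class CidBitmap:
    """Maps 1-based record numbers to active CIDs and back (illustrative only)."""

    def __init__(self, bits: bytes) -> None:
        self._bits = bytes(bits)
        # Running total of set bits up to and including each byte.
        self._rank = list(accumulate(bin(b).count("1") for b in self._bits))

    @classmethod
    def from_cids(cls, cids, max_cid: int) -> "CidBitmap":
        buf = bytearray(max_cid // 8 + 1)
        for cid in cids:
            buf[cid >> 3] |= 1 << (cid & 7)  # assumed LSB-first bit order
        return cls(bytes(buf))

    def is_active(self, cid: int) -> bool:
        byte = cid >> 3
        return 0 <= cid and byte < len(self._bits) and bool(self._bits[byte] >> (cid & 7) & 1)

    def record_count(self) -> int:
        return self._rank[-1] if self._rank else 0

    def record_for_cid(self, cid: int) -> int:
        """Rank: how many active CIDs are <= cid."""
        if not self.is_active(cid):
            raise KeyError(cid)
        byte = cid >> 3
        before = self._rank[byte - 1] if byte else 0
        mask = (1 << ((cid & 7) + 1)) - 1  # bits up to and including this CID
        return before + bin(self._bits[byte] & mask).count("1")

    def cid_for_record(self, record: int) -> int:
        """Select: position of the record-th set bit."""
        if not 1 <= record <= self.record_count():
            raise IndexError(record)
        byte = bisect_left(self._rank, record)  # first byte whose total reaches record
        remaining = record - (self._rank[byte - 1] if byte else 0)
        for bit in range(8):
            if self._bits[byte] >> bit & 1:
                remaining -= 1
                if remaining == 0:
                    return byte * 8 + bit
        raise AssertionError("inconsistent rank table")


if __name__ == "__main__":
    cmap = CidBitmap.from_cids([1, 2, 5, 9], max_cid=10)
    print(cmap.record_count(), cmap.cid_for_record(3), cmap.record_for_cid(9))
```

With a map like this, a request such as "set the file position to record 999999" only needs a lookup in the cached bitmap before the single-record download is issued.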
20
PubChem Virtual File I/O
Code sample:
filex load pubchem 19
molfile open
molfile0
molfile count molfile0
19450023
molfile read molfile0
ens0
ens props ens0
…E_INCHI E_IUPAC_NAME E_NCBI_COMPOUND_ID E_EXACT_MASS E_TPSA E_SMILES E_SMILES/2….
ens get ens0 E_CID
1
molfile read molfile0
ens1
molfile set molfile0 record 999999
Contact Entrez e-utils, get database status, get CID Bitmap from Xemistry
Single-record ASN.1 download via display page
21
Simple PubChem Queries
Code sample:
set fh [molfile open ]
set cidlist [molfile scan $fh "structure >= $smarts" \
    {proplist E_CID}]
Operations behind the scenes:
Set-up of PUG record
Post PUG, monitor return status
Cache CID result data
Direct access to result set, no structure download
22
Intermediate PubChem Queries
Code sample:
set fh [molfile open ]
set elist [molfile scan $fh \
    "or {structure = $smiles1} {structure = $smiles2} \
    {structure = $smiles3}" enslist]
Operations behind the scenes:
Create and post PUG records, get history keys
Perform server-side e-utils result merge via history keys
Retrieve CID set
Download structures as ASN.1 blobs via CID
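The slides do not show the request Cactvs issues for the server-side merge. As a rough sketch of what such a merge can look like, the code below builds an Entrez `esearch` URL that combines earlier result sets by their history query keys, using the E-utilities convention of writing a stored result as `#n` in the search term. The function name, the `pccompound` default, and the placeholder WebEnv value are assumptions for illustration; the code only constructs the URL and sends nothing.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def merge_history_url(webenv: str, query_keys, operator: str = "OR",
                      db: str = "pccompound") -> str:
    """Build an esearch URL that combines stored result sets on the server."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    keys = list(query_keys)
    if not keys:
        raise ValueError("at least one query key is required")
    # Earlier results on the history server are referenced as #1, #2, ...
    term = f" {operator} ".join(f"#{key}" for key in keys)
    params = {"db": db, "term": term, "WebEnv": webenv, "usehistory": "y"}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # WEBENV_PLACEHOLDER stands in for the session string returned by Entrez.
    print(merge_history_url("WEBENV_PLACEHOLDER", [1, 2, 3]))
```

Because the merge runs on the Entrez history server, only the final CID set has to travel to the client, which fits the "Retrieve CID set" step listed on the slide.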
23
Power PubChem Queries
Code sample:
set th [molfile scan \
    "and {structure >= c1cncc1} {E_PUBCHEM_AID_COUNT(active) > 25}" \
    {tablecollection image E_CID E_NAME E_SMILES E_PUBCHEM_AID_COUNT(active) E_PUBCHEM_AID_COUNT(inactive) E_ACTIVE_AIDSET} \
    {} {maxhits 10}]
table write $th active_pyrroles_in_pubchem.xls
24
Graphical Tools for the Masses
Draw or read structure
Compute database ID property
  Display data
Compute lookup properties
  Display data
Compute access URL property
  Load page into HTML widget
25
… in Stand-alone Tools
26
… and in Web Applications