Accessing U.S. Government Chemical Structure Databases with the CACTVS Toolkit
Wolf-D. Ihlenfeldt, Xemistry GmbH, Lahntal, Germany


1 Accessing U.S. Government Chemical Structure Databases with the CACTVS Toolkit
Wolf-D. Ihlenfeldt
Xemistry GmbH, Lahntal, Germany
wdi@xemistry.com

2 The US Gov Chemical Structure Data Information Pool
- PubChem
  - Depositor structures (SID)
  - Unique structures (CID)
  - Assays (AID)
  - Links to the rest of NCBI Entrez

3 The US Government Chemical Structure Data Pool
- NIST Web Book
  - Spectra, physical properties
- ChemIDPlus
  - Physical properties, biomedical links
- NCI Chemical Identifier Resolver
  - From names and IDs to structure

4 Other Sources
- ChemSpider (UK)
- EINECS (EU)
- KEGG (JP)
- EMolecules (US, commercial)
- ChEBI (UK)
- ChEMBL (UK)
- Drugbank (CA)
- PDB (US, academic)
- CommonChemistry (US, commercial)
- Wikipedia (World)

5 How to Work with the Data?
- Web interfaces are built for humans
  - Hard to drive from software
- Many DBs provide external links
  - Prone to breaking and becoming outdated
- Data available as batch download
  - Massive, difficult to manage
- Lack of formal interface documentation or programmatic access
  - PubChem, Entrez and the NCI Resolver are the good guys

6 The CACTVS Toolkit
- Generic chemistry toolkit
- Manages objects like structures, reactions, tables
- Extensible collection of properties, methods and I/O modules
- Implicit automatic method chaining
- Scripting language interface for rapid application development (RAD)
- Ships with access properties and modules for all these databases
- Comprehensive solution for multi-DB projects

7 Basic Tasks
- Name/Identifier resolution
  - NCI Resolver -> REST interface
  - KEGG -> text query

    cactvs>ens create "vioxx"
    ens0
    cactvs>dataset create [list '+morphine +methyl']
    dataset0
    cactvs>dataset ens dataset0
    ens1 ens2
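The REST lookup behind the Resolver bullet can be sketched outside Cactvs. This is a minimal illustration in Python, assuming the publicly documented Resolver URL pattern (base URL, identifier, output representation); the function name `resolver_url` is hypothetical and the slide does not show how the toolkit itself builds the request.

```python
from urllib.parse import quote

# Assumption: public URL pattern of the NCI/CADD Chemical Identifier Resolver.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "smiles") -> str:
    """Build a Resolver URL for a name or ID (hypothetical helper).

    The identifier is percent-encoded as a single path segment, so names
    containing spaces or slashes do not break the URL.
    """
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"


if __name__ == "__main__":
    # Only builds and prints the URLs; no network access is performed.
    print(resolver_url("vioxx"))
    print(resolver_url("71-43-2", "stdinchikey"))
```

Fetching the printed URL with any HTTP client would return the requested representation as plain text, if the service resolves the identifier.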

8 Basic Tasks: Get Database ID
- Text structure query (SMILES, InChI)
- NCBI PUG Web service

    cactvs>ens get ens0 E_SIDSET
    9792 207247 535364 5146347 7847634 7980536 8146414 8153131 10486532 11341940 11362123 11362973 11364757 11365535…
    cactvs>ens get ens0 E_CHEMIDPLUS_ID
    0162011907

9 Basic Tasks: Download Objects
- PubChem: from CID, SID, AID
- PDB, CHEMBL, KEGG: from codes
- Resolver: from name, identifiers

    cactvs>ens create 1
    ens0
    cactvs>ens create CHEMBL277500
    ens1

10 Basic Tasks: Download Objects
- PubChem I/O via native ASN.1

    cactvs>table create 198
    table3
    cactvs>table get table3 colnames
    SID SID_Source Version Date Outcome Score schedule endpoint vehicle dose tcprcnt toxicity
    cactvs>table get table3 T_NCBI_ASSAY_DESCRIPTION(description)
    {The antitumor activity of compounds was measured in mice bearing transplantable tumors. Survival or tumor size were measured and the…

11 Basic Tasks: I/O of ID Files
- Read files with CIDs, SIDs, CASNOs…

    cactvs>set fh [molfile open test.cas]
    molfile0
    cactvs>molfile loop $fh eh { puts [ens get $eh E_CID] }
    436534
    321512
    234
    32532
    ….
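Before a file of CAS numbers is sent to a remote lookup, the entries can be screened locally with the standard CAS check-digit rule. The Python sketch below is independent of Cactvs and is not taken from the slides; the helper names `is_valid_cas` and `read_cas_lines` are hypothetical.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Return True if text is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(text.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left; the sum mod 10
    # must equal the check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


def read_cas_lines(lines):
    """Yield the valid CAS numbers from an iterable of text lines."""
    for line in lines:
        candidate = line.strip()
        if candidate and is_valid_cas(candidate):
            yield candidate


if __name__ == "__main__":
    sample = ["71-43-2\n", "\n", "not-a-cas\n", "71-43-3\n", "162011-90-7\n"]
    print(list(read_cas_lines(sample)))
```

In practice the function would be fed an open file such as the `test.cas` file from the slide; malformed lines are silently skipped in this sketch, which may or may not be the desired policy.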

12 Implicit Property Lookup
- Yes, it's controlled, with metadata and origin tracing

    cactvs>ens create benzene
    ens0
    cactvs>ens get ens0 E_CAS
    71-43-2
    cactvs>ens get ens0 E_UVSPECTRUM
    1 {INSTITUTE OF ENERGY PROBLEMS OF CHEMICAL PHYSICS, RAS} {INEP CP RAS, NIST OSRD Collection (C) 2007 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.} 0 n.i.g. {} {{$NIST SQUIB} 1951ROM/VOD930-932 {$NIST SOURCE} TSGMTE {$REF AUTHOR} {Romand, J.; Vodar, B.} {$REF TITLE} {Spectres d'absorption du benzene a l'etat vapeur et a l'etat condense dans l'ultraviolet lointain} {$REF JOURNAL} {Compt. Rend.} {$REF VOLUME} 233 {$REF PAGE} 930-932 {$REF DATE} 1951} {} {RAS UV No. 118} 0.0 {} {} 0.0 162.418 206.9805 1.0 1.0 {Wavelength (nm)} {Logarithm epsilon} 317 {} {3.7038 3.7101 3.7161 3.722 3.722….

13 More Property Lookups

    cactvs>ens show ens0 E_NIST_WEBBOOK_ID
    C71432
    cactvs>ens get ens0 E_AIDSET
    330 421 426 427 433 434 435 445 530 540 541 542 543 544 545 546 584 585...
    cactvs>ens get ens0 E_NAMESET
    BENZENE 71-43-2 NCGC00090744-02 UN1114 {Benzen [Polish]} {Benzene + aniline combo} 270709_ALDRICH 311855_SIGMA {Benzene (including benzene from gasoline)} 676985_ALDRICH {Benzene [UN1114] [Flammable liquid]} 154628_SIAL {Benzene, labeled with carbon-14 and tritium}…

14 More Property Lookups

    cactvs>ens get ens0 E_MESH_TERMS
    {68001554 {Benzene Cyclohexatriene Benzol Benzole} http://www.ncbi.nlm.nih.gov/sites/entrez?Db=mesh&Cmd=ShowDetailView&TermToSearch=68001554 {68001554 Benzene {68006841 {Hydrocarbons, Aromatic}{68006844 {Hydrocarbons, Cyclic} {68006838 Hydrocarbons {68009930 {Organic Chemicals} {1000068 {Chemicals and Drugs Category} {1000048 {All MeSH Categories}}}}}}}}} {68009930 {{Organic Chemicals} {Chemicals, Organic}} http://www.ncbi.nlm.nih.gov/sites/entrez?Db=mesh&Cmd=ShowDetailView&TermToSearch=68009930...

15 Construction of Display URLs

    cactvs>ens get ens0 E_PUBCHEM_URL
    http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=241
    cactvs>ens get ens0 E_CHEMIDPLUS_URL
    http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=DBMaint&actionHandle=default&nextPage=jsp/chemidheavy/ResultScreen.jsp&ROW_NUM=0&TXTSUPERLISTID=0000071432
    cactvs>ens metadata ens0 E_CHEMIDPLUS_URL info
    {JSESSIONID=257C452AAB26D395DCC4AC2652F05C99; Path=/chemidplus}
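Display URLs of this kind are plain string templates over a database ID. The Python sketch below rebuilds the PubChem summary URL shown on the slide from a CID. It also derives a ten-digit ChemIDplus-style ID from a CAS number by dropping the hyphens and zero-padding; that rule is an assumption inferred from the two IDs on the slides (0000071432 for benzene, 0162011907 for the Vioxx lookup), not a documented mapping. Both function names are hypothetical, and the URLs reflect the services as shown in the talk and may have changed since.

```python
from urllib.parse import urlencode

# URL as shown on the slide; the service layout may have changed since.
PUBCHEM_SUMMARY = "http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi"


def pubchem_summary_url(cid: int) -> str:
    """Build the PubChem compound summary URL for a CID (hypothetical helper)."""
    if cid <= 0:
        raise ValueError("CID must be a positive integer")
    return f"{PUBCHEM_SUMMARY}?{urlencode({'cid': cid})}"


def cas_to_chemidplus_id(cas: str) -> str:
    """Assumed rule: strip hyphens from the CAS number and left-pad to 10 digits."""
    digits = cas.replace("-", "")
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a CAS number: {cas!r}")
    return digits.zfill(10)


if __name__ == "__main__":
    print(pubchem_summary_url(241))
    print(cas_to_chemidplus_id("71-43-2"))
```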

16 Ugliness under the Hood
- Absence of a clean programmatic interface hurts!

    set mdata [encode -url [molfile string $eh]]
    set pdata "indexes=&DT_ROWS_PER_PAGE2=1&objectHandle=Search&actionHandle=searchChemIdLite&nextPage=jsp%2Fchemidheavy%2FChemidDataview.jsp&DT_ROWS_PER_PAGE=1&responseHandle=JSP&QV10=&QO10=Text+Search&QF11=Locator&QV11=&QO11=in&STRING_TO_FILE=$mdata&QF1=Name&QO1=%3D&QV1=&QV8=&QF8=ToxTestType&QO5=between&QV5=&QF5=ToxResult&QV6=&QF6=ToxSpecies&QV7=&QF7=ToxRoute&QV9=&QF9=ToxEffect&ChemType=1001&QF3=ChemType&QV3=&QF2=ChemProp&QO2=between&QV2=&ChemDataSourceType=0&QF4=ChemDataSourceType&QV4=&LocatorExpr1=&LocatorOper=AND&LocatorExpr2=&chemical_viewer=marvin&StructureSimilarPctg=80&QF10=StructureEqual&structurePref=marvin&QO12=between&QV12=&QF12=MolWeight&x=22&y=5"
    set data [post -contenttype application/x-www-form-urlencoded -raw http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?chemidheavy $pdata #auto status]
    if {![regexp {chemid=([0-9]+)} $data dummy id] && ![regexp {javascript:loadChemicalIndex[^0-9]*([0-9]+)} $data dummy id]} {
        error "no ChemIDplus record"
    }
    ens set $eh E_CHEMIDPLUS_ID $id
    ens metadata $eh E_CHEMIDPLUS_ID info $status(cookies)
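The scraping step at the end of that script is the fragile part: two unanchored regular expressions are tried in turn against the returned HTML. A minimal Python restatement of just that fallback logic follows, reusing the two patterns from the slide. The function name is hypothetical, no HTTP request is made, and the HTML fragments in the demo are invented for illustration rather than real ChemIDplus responses.

```python
import re

# The two patterns from the slide, tried in order.
ID_PATTERNS = (
    re.compile(r"chemid=([0-9]+)"),
    re.compile(r"javascript:loadChemicalIndex[^0-9]*([0-9]+)"),
)


def extract_chemidplus_id(html: str) -> str:
    """Return the first ID found by either pattern, else raise LookupError."""
    for pattern in ID_PATTERNS:
        match = pattern.search(html)  # unanchored search, like Tcl's regexp
        if match:
            return match.group(1)
    raise LookupError("no ChemIDplus record")


if __name__ == "__main__":
    # Invented page fragments, for illustration only.
    print(extract_chemidplus_id('<a href="view?chemid=0000071432">benzene</a>'))
    print(extract_chemidplus_id("<a href=\"javascript:loadChemicalIndex('0162011907')\">"))
```

Any change to the page markup silently breaks both patterns, which is the point the slide makes about missing programmatic interfaces.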

17 Power by Design
- In contrast, PubChem has a well-defined set of interfaces: PUG, EUtils, cookie-free download URLs
- No simulated Web form posting
- No HTML page scraping
- Support for more than just ID access

18 The PubChem Virtual File Project
- Improved access to PubChem
  - Database is indistinguishable from a local, read-only structure file in the Cactvs scripting environment
- Input functions
  - Transparently read structures and assay tables with all their data from PubChem, by decoding native binary ASN.1
- Query functions
  - Convenient development and conservation of queries exceeding the capabilities of Web interfaces and PUG, maintaining standard Cactvs query and retrieval syntax

19 Transforming the PubChem Database into a Virtual File
- Cactvs toolkit uses file record as primary key
- PubChem uses CID (AID, SID) as primary key
- Establish mapping via record/CID map
- Precomputed as a 20M-bit bitmap
- Set bit indicates active CID
- Automatic download from Xemistry if needed, local caching, up-to-date check via Entrez query
- Checked and potentially updated every 30 mins on Xemistry server
- Data size 800K compressed, download <10s
- Download of full active CID set from Entrez ~10-25 mins
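The record/CID mapping can be illustrated with a small Python sketch. It assumes that bit n-1 of the bitmap stands for CID n, least significant bit first within each byte, and that file record k corresponds to the k-th set bit; the slides do not specify the actual bit layout, and the class `CidBitmap` is hypothetical. At 20M bits the raw bitmap is 2.5 MB before compression.

```python
from bisect import bisect_left


class CidBitmap:
    """Maps 1-based file record numbers to active CIDs and back (sketch)."""

    def __init__(self, bits: bytes):
        self._bits = bits
        # Sorted active CIDs; list index i holds the CID of record i + 1.
        # A production version would use rank/select tables instead of
        # materialising millions of integers.
        self._active = [
            cid for cid in range(1, len(bits) * 8 + 1) if self._is_set(cid)
        ]

    @classmethod
    def from_cids(cls, cids, max_cid: int) -> "CidBitmap":
        """Build a bitmap covering CIDs 1..max_cid with the given CIDs set."""
        bits = bytearray((max_cid + 7) // 8)
        for cid in cids:
            index = cid - 1
            bits[index // 8] |= 1 << (index % 8)
        return cls(bytes(bits))

    def _is_set(self, cid: int) -> bool:
        index = cid - 1
        return bool((self._bits[index // 8] >> (index % 8)) & 1)

    def __len__(self) -> int:
        return len(self._active)  # number of records in the virtual file

    def record_to_cid(self, record: int) -> int:
        if not 1 <= record <= len(self._active):
            raise IndexError("record out of range")
        return self._active[record - 1]

    def cid_to_record(self, cid: int):
        i = bisect_left(self._active, cid)
        if i < len(self._active) and self._active[i] == cid:
            return i + 1
        return None  # CID is not active


if __name__ == "__main__":
    bitmap = CidBitmap.from_cids([1, 2, 4, 9], max_cid=16)
    print(len(bitmap), bitmap.record_to_cid(3), bitmap.cid_to_record(9))
```

Because inactive CIDs leave gaps, the record number of a compound is generally smaller than its CID, which is why the map is needed at all.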

20 PubChem Virtual File I/O
Code sample:

    filex load pubchem
    19
    molfile open
    molfile0
    molfile count molfile0
    19450023
    molfile read molfile0
    ens0
    ens props ens0
    …E_INCHI E_IUPAC_NAME E_NCBI_COMPOUND_ID E_EXACT_MASS E_TPSA E_SMILES E_SMILES/2….
    ens get ens0 E_CID
    1
    molfile read molfile0
    ens1
    molfile set molfile0 record 999999

- Contact Entrez e-utils, get database status, get CID bitmap from Xemistry
- Single-record ASN.1 download via display page

21 Simple PubChem Queries
Code sample:

    set fh [molfile open ]
    set cidlist [molfile scan $fh "structure >= $smarts" \
        {proplist E_CID}]

Operations behind the scenes:
- Set-up of PUG record
- Post PUG, monitor return status
- Cache CID result data
- Direct access to result set, no structure download

22 Intermediate PubChem Queries
Code sample:

    set fh [molfile open ]
    set elist [molfile scan $fh \
        "or {structure = $smiles1} {structure = $smiles2}\
        {structure = $smiles3}" enslist]

Operations behind the scenes:
- Create and post PUG records, get history keys
- Perform server-side e-utils result merge via history keys
- Retrieve CID set
- Download structures as ASN.1 blobs via CID
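The server-side merge step can be sketched with the Entrez history mechanism, in which earlier result sets are referenced as #1, #2, ... inside a new search term. The Python below only builds the ESearch request URL. It assumes the three structure queries have already been stored under one WebEnv session; the helper name and the placeholder WebEnv token are hypothetical, and whether Cactvs issues exactly this request is not stated on the slide.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def merge_history_keys(query_keys, webenv, operator="OR", db="pccompound"):
    """Build an ESearch URL that combines stored result sets on the server."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    if not query_keys:
        raise ValueError("at least one query key is required")
    # History references look like "#1 OR #2 OR #3".
    term = f" {operator} ".join(f"#{key}" for key in query_keys)
    params = {"db": db, "term": term, "WebEnv": webenv, "usehistory": "y"}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # "WEBENV_TOKEN" is a placeholder, not a real session string.
    print(merge_history_keys([1, 2, 3], "WEBENV_TOKEN"))
```

The merged result set gets its own history key, so only the final CID list has to cross the network before the ASN.1 downloads start.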

23 Power PubChem Queries
Code sample:

    set th [molfile scan \
        "and {structure >= c1cncc1} {E_PUBCHEM_AID_COUNT(active) > 25}" \
        {tablecollection image E_CID E_NAME E_SMILES
         E_PUBCHEM_AID_COUNT(active) E_PUBCHEM_AID_COUNT(inactive)
         E_ACTIVE_AIDSET} \
        {} {maxhits 10}]
    table write $th active_pyrroles_in_pubchem.xls

24 Graphical Tools for the Masses
- Draw or read structure
- Compute database ID property
  - Display data
- Compute lookup properties
  - Display data
- Compute access URL property
  - Load page into HTML widget

25 … in Stand-alone Tools

26 … and in Web Applications

