Directly Upload Data From An ELN Into PubChem Ben Shoemaker Ben.Shoemaker@nih.gov U.S. National Center for Biotechnology Information: NCBI / NLM / NIH
Maximize the impact of your research … with little effort PubChem is a global resource for open chemistry Data sources in PubChem found by internet search Open Access mandates satisfied with PubChem Data formats and web interfaces can impede upload Programmatic access to data uploads facilitates ELN integration
PubChem Mission PubChem is an open archive and a public resource with the primary aim to provide information on the biological activity of chemical substances
Unique chemical structure content of PubChem
PubChem is an archive
Substance: submitted; SID accession; Substance records keep provenance clear
Compound: derived; CID accession; helps to group Substance records; a Substance without structure is not in Compound
BioAssay: submitted; AID accession
Why does a user come to PubChem? Search result from Google/Yahoo/Baidu Purchase decision for molecule ‘X’ Publications about molecule/concept Patents/Biological activities What is known about the molecule? Physical properties Pharmacology Biological activity Safety information Spectroscopy Toxicity Pathways Etc. Launching pad for associations to related databases Image credit: http://blogs.egu.eu/network/palaeoblog/2012/10/31/why-bother-communicating/
Chemical information is everywhere now PubChem is helping to improve accessibility to chemical information
PubChem growth Sustained growth over 12 years of: Contributors Chemical substance descriptions Biological testing data Usage Top-10 chemistry website (#5?) ~1.5M monthly unique users at peak Heavy programmatic usage ~5% of unique IPs per month (~70K) Serve millions of web hits per day 2M-12M on average (0.5M interactive)
Benefit from PubChem by uploading data Minimal startup time Flexible interface Spreadsheet data accepted via file or web interface
Upload chemicals Draw or load structures Enter annotations & synonyms Link back to your site
Upload screening results Spreadsheet load File or web Include all test results E.g. an article table
Upload screening results Add annotations Specify targets Database links
Annotate with controlled vocabularies Include ontology terms such as from BAO, GO, MESH
How can data loading be improved? This works well, but… Issues: Web interfaces and file formats must be learned Open access data requirements add yet another step to a lengthy and time-sensitive publishing process FTP uploads can be automated, but they require custom scripts that are hard to justify for a single use
How can data loading be improved? Ideas: Ideally, a single “Make Public!” button would be added to existing end-user software This ‘publish’ button would require a standard implementation to make it simple to add Electronic Lab Notebooks (ELNs) would be good candidates for such functionality Great, so how do we do that?
Build on public data: Programmatic access Outside websites create novel platforms for increased exposure REST: Easy, predictable access for research analysis http://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/activity/EC50/aids/JSON
PubChem Upload REST Extend programmatic access to “pushing” data Open suite of operations for loading data Create standard syntax to simplify interface Use secure login and key to restrict access
PubChem Upload REST
The URL path
https://pubchem.ncbi.nlm.nih.gov/rest/upload/<domain specification>/<operation specification>/[?<operation_options>]
Domain: <domain specification> = substance | assay | account
Operation: login, upload, set_record, get_record, pending, list_records, commit, export_file, get_sidlist, list_archived, get_viewcode, set_viewcode, delete_viewcode
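The URL pattern above can be assembled mechanically. The following is a minimal sketch, not an official client: the helper name `upload_url` and its argument layout are hypothetical, and the only facts taken from the slides are the base URL, the three domains, and the example paths shown later in the deck.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/upload"
DOMAINS = {"substance", "assay", "account"}  # domains listed on the slide


def upload_url(domain, operation, *path_parts, **options):
    """Assemble <base>/<domain>/<operation>[/<part>...][?<options>].

    Hypothetical helper: extra path parts cover cases such as the
    input format ("SDF", "CSV") that follows the upload operation.
    """
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    path = "/".join([BASE, domain, operation, *path_parts])
    return f"{path}?{urlencode(options)}" if options else path


print(upload_url("substance", "pending"))
print(upload_url("substance", "upload", "SDF", process=1))
print(upload_url("substance", "commit", upload_id=40637))
```

The helper validates only the domain; the operation name is passed through unchecked, since the slide does not say which operations are valid in which domain.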
PubChem Upload REST Example
Let’s say that you have structure and annotation information for three chemicals, including:
Unique identifiers
CAS registry numbers
Common names
SMILES
Tag list found on help page: https://pubchem.ncbi.nlm.nih.gov/upload
SDF, CSV and Excel accepted
PUBCHEM_EXT_DATASOURCE_REGID | PUBCHEM_SUBSTANCE_SYNONYM | PUBCHEM_SUBSTANCE_SYNONYM | PUBCHEM_EXT_DATASOURCE_SMILES
my_sub1 | 50-99-7 | D-Glucose, anhydrous | C(C1C(C(C(C(O1)O)O)O)O)O
my_sub2 | | | CCOC1=CC=CC=C1NC(=O)C2=CC3=CC=CC=C3C=C2O
my_sub3 | | | C1=CC=CC=C1
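An ELN would typically generate this table from its own records rather than by hand. The sketch below writes the three example substances to CSV text with Python's standard `csv` module. Repeating the `PUBCHEM_SUBSTANCE_SYNONYM` tag once per synonym column is an assumption taken from the URL-encoded example later in this deck; the function name `to_csv` is hypothetical.

```python
import csv
import io

HEADER = [
    "PUBCHEM_EXT_DATASOURCE_REGID",
    "PUBCHEM_SUBSTANCE_SYNONYM",   # CAS registry number
    "PUBCHEM_SUBSTANCE_SYNONYM",   # common name
    "PUBCHEM_EXT_DATASOURCE_SMILES",
]
ROWS = [
    ["my_sub1", "50-99-7", "D-Glucose, anhydrous", "C(C1C(C(C(C(O1)O)O)O)O)O"],
    ["my_sub2", "", "", "CCOC1=CC=CC=C1NC(=O)C2=CC3=CC=CC=C3C=C2O"],
    ["my_sub3", "", "", "C1=CC=CC=C1"],
]


def to_csv(header, rows):
    """Serialize a header and rows to CSV text, quoting fields that contain commas."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_csv(HEADER, ROWS)
print(csv_text)
```

Using `csv.writer` rather than joining strings matters here because the name "D-Glucose, anhydrous" contains a comma and has to be quoted.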
PubChem Upload REST Example
Authenticate: Provide user credentials
Security key returned for subsequent operations
unix> curl -c cookie1.txt "https://pubchem.ncbi.nlm.nih.gov/rest/upload/account/login?login=MyLogin&password=test-password"
{ "Response": { "ResponseCode": "Pass", "UserId": "999" } }
Base | Domain | Operation | Arguments
pubchem../rest/upload | account | login | login,password
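The same login step could be done from inside an application. This is a rough standard-library sketch that mirrors the curl call: it sends the credentials in the query string and saves the returned cookie to a file. The function names `login` and `parse_login` are hypothetical, and the response handling assumes only the JSON shape shown on the slide.

```python
import json
import urllib.request
from http.cookiejar import MozillaCookieJar
from urllib.parse import urlencode

LOGIN_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/upload/account/login"


def parse_login(body):
    """Return the UserId from a login response, or raise if it did not pass."""
    response = json.loads(body)["Response"]
    if response.get("ResponseCode") != "Pass":
        raise RuntimeError(f"login failed: {response}")
    return response["UserId"]


def login(user, password, cookie_file="cookie1.txt"):
    """Log in, save the session cookie to disk, and return the UserId.

    Not called here because it needs network access and real credentials.
    """
    jar = MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    query = urlencode({"login": user, "password": password})
    with opener.open(f"{LOGIN_URL}?{query}") as reply:
        user_id = parse_login(reply.read().decode("utf-8"))
    jar.save(ignore_discard=True)  # keep session cookies, like curl -c
    return user_id


# Offline check against the response shown on the slide.
print(parse_login('{ "Response": { "ResponseCode": "Pass", "UserId": "999" } }'))
```

`MozillaCookieJar` writes the Netscape cookie file format, so the saved file should be usable with `curl -b` as in the later examples.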
PubChem Upload REST Example
Upload From File
unix> curl -b cookie1.txt -F "data=@test1.sdf" "https://pubchem.ncbi.nlm.nih.gov/rest/upload/substance/upload/SDF?process=1"
Base | Domain | Operation | Arguments | Input
pubchem../rest/upload | substance | upload | process | SDF
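The `-F "data=@test1.sdf"` option makes curl send the file as a `multipart/form-data` POST with a form field named `data`. For software that cannot shell out to curl, the sketch below builds an equivalent request body with the standard library only. Both helper names are hypothetical, and the `application/octet-stream` part type is an assumption about what the server accepts.

```python
import urllib.request
import uuid


def encode_multipart(field, filename, content, boundary=None):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def upload_request(url, filename, content):
    """Build (but do not send) a POST request carrying the file in field 'data'."""
    body, content_type = encode_multipart("data", filename, content)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )


request = upload_request(
    "https://pubchem.ncbi.nlm.nih.gov/rest/upload/substance/upload/SDF?process=1",
    "test1.sdf",
    b"(SDF file contents go here)",
)
print(request.get_method(), request.full_url)
```

To actually send the request, it would be opened through an opener that carries the login cookie, which plays the role of `-b cookie1.txt` in the curl call.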
PubChem Upload REST Example
Upload from a URL-encoded string
unix> curl --cookie "deposit_ses_key=8F565CD7-46E0-4939-9CB5-B3449C5B70A5" -d "data=PUBCHEM_EXT_DATASOURCE_REGID%2CPUBCHEM_SUBSTANCE_SYNONYM%2CPUBCHEM_SUBSTANCE_SYNONYM%2CPUBCHEM_EXT_DATASOURCE_SMILES%0Amy_sub1%2C50-99-7%2C%22D-Glucose%2C%20anhydrous%22%2CC%28C1C%28C%28C%28C%28O1%29O%29O%29O%29O%29O%0Amy_sub2%2C%2C%2CCCOC1%3DCC%3DCC%3DC1NC%28%3DO%29C2%3DCC3%3DCC%3DCC%3DC3C%3DC2O%0Amy_sub3%2C%2Cbenzene%2CC1%3DCC%3DCC%3DC1%0A" "https://pubchem.ncbi.nlm.nih.gov/rest/upload/substance/upload/CSV?process=1"
Base | Domain | Operation | Arguments | Input
pubchem../rest/upload | substance | upload | process | CSV
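The long `data=` argument is ordinary CSV text that has been percent-encoded. A short sketch of producing it in Python follows. The CSV text is decoded from the curl example above, and `quote_via=quote` is chosen so that spaces become `%20`, as in the example, rather than `+`.

```python
from urllib.parse import quote, unquote, urlencode

# CSV text as decoded from the curl example on the slide.
csv_text = (
    "PUBCHEM_EXT_DATASOURCE_REGID,PUBCHEM_SUBSTANCE_SYNONYM,"
    "PUBCHEM_SUBSTANCE_SYNONYM,PUBCHEM_EXT_DATASOURCE_SMILES\n"
    'my_sub1,50-99-7,"D-Glucose, anhydrous",C(C1C(C(C(C(O1)O)O)O)O)O\n'
    "my_sub2,,,CCOC1=CC=CC=C1NC(=O)C2=CC3=CC=CC=C3C=C2O\n"
    "my_sub3,,benzene,C1=CC=CC=C1\n"
)

# Percent-encode every reserved character (commas, quotes, parentheses,
# equals signs, newlines) and prefix the form field name "data".
form_body = urlencode({"data": csv_text}, quote_via=quote)
print(form_body)

# Decoding the payload gives back the original CSV text.
assert unquote(form_body[len("data="):]) == csv_text
```

The resulting string is what would be passed to `curl -d`, or sent as the body of a POST with content type `application/x-www-form-urlencoded`.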
PubChem Upload REST Example
Check the status of your pending submissions
unix> curl -b cookie1.txt "https://pubchem.ncbi.nlm.nih.gov/rest/upload/substance/pending"
{"Response": {"ResponseCode": "Pass","PendingSubmissions": [
{"UploadId": "40637","Date": "2016/02/08 16:25","Status": "V1","DataSet": "form-data.sdf","Records": "3"},
{"UploadId": "40638","Date": "2016/02/08 17:06","Status": "V1","DataSet": "form-data.sdf","Records": "3"}]}}
Base | Domain | Operation | Arguments
pubchem../rest/upload | substance | pending |
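A client would usually read the upload IDs out of this response before deciding what to commit. The sketch below parses the sample response from the slide. The function name `pending_submissions` is hypothetical, and nothing is assumed about the response beyond the fields shown; in particular, the meaning of the `V1` status is not interpreted.

```python
import json

# Sample response copied from the slide.
SAMPLE = """
{"Response": {"ResponseCode": "Pass", "PendingSubmissions": [
  {"UploadId": "40637", "Date": "2016/02/08 16:25", "Status": "V1",
   "DataSet": "form-data.sdf", "Records": "3"},
  {"UploadId": "40638", "Date": "2016/02/08 17:06", "Status": "V1",
   "DataSet": "form-data.sdf", "Records": "3"}]}}
"""


def pending_submissions(body):
    """Return the list of pending submission records from a 'pending' response."""
    response = json.loads(body)["Response"]
    if response.get("ResponseCode") != "Pass":
        raise RuntimeError(f"request failed: {response}")
    return response.get("PendingSubmissions", [])


for submission in pending_submissions(SAMPLE):
    print(submission["UploadId"], submission["Status"], submission["Records"])
```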
PubChem Upload REST Example
Commit your submission into the public PubChem database
unix> curl -b cookie1.txt "https://pubchem.ncbi.nlm.nih.gov/rest/upload/substance/commit?upload_id=40637"
{"Response": {"ResponseCode": "Pass","OperationStatus": [{"UploadId": "40637","CommitStatus": "Pass"}]}}
Base | Domain | Operation | Arguments
pubchem../rest/upload | substance | commit | upload_id
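Because the commit response reports a status per upload, a "Make Public!" button would want to confirm that the specific upload went through before telling the user it is published. A minimal sketch, assuming only the response shape shown on the slide; the function name `committed_ids` is hypothetical.

```python
import json

# Sample response copied from the slide.
SAMPLE = (
    '{"Response": {"ResponseCode": "Pass","OperationStatus": '
    '[{"UploadId": "40637","CommitStatus": "Pass"}]}}'
)


def committed_ids(body):
    """Return the upload IDs whose CommitStatus is 'Pass'."""
    response = json.loads(body)["Response"]
    if response.get("ResponseCode") != "Pass":
        raise RuntimeError(f"commit request failed: {response}")
    return [
        status["UploadId"]
        for status in response.get("OperationStatus", [])
        if status.get("CommitStatus") == "Pass"
    ]


done = committed_ids(SAMPLE)
print("committed:", done)
if "40637" not in done:
    raise SystemExit("upload 40637 was not committed")
```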
Maximize the impact of your research … with little effort PubChem is a global resource for open chemistry 12 years of growth Top 10 chemistry website Data sources in PubChem found by internet search Uploading is easy for small and large submissions Programmatic access to data uploads facilitates ELN integration Leverage PubChem’s impact on chemistry
Acknowledgments: The PubChem Team Evan Bolton Jie Chen Tiejun Cheng Gang Fu Renata Geer Asta Gindulyte Lianyi Han Jane He Steve Bryant (PI) Siqian He Sunghwan Kim Paul Thiessen Jiyao Wang Yanli Wang Bo Yu Leonid Zaslavsky Jian Zhang All research supported by the Intramural Research Program of the NIH, National Library of Medicine. Ben.Shoemaker@nih.gov Special thanks: NCBI Help Desk and past PubChem group members.