PubChem—Substance, Compound, BioAssay Part 1: Essentials Principles of May 24, 2007.


What is PubChem?
• A public repository of electronic representations of small molecules and associated bioactivity assay data
• A component of the NIH Molecular Libraries Roadmap
• Part of the NCBI Entrez search and linking system
• A system of four components:
  – PubChem Substance
  – PubChem Compound
  – PubChem BioAssay
  – PubChem Structure Search


The Molecular Libraries Roadmap: An Integrated Initiative
[Diagram components: Chemical Diversity; Technology Development; Screening Instrumentation; Assay Development; Predictive ADMET; Compound Repository (MLSMR); Informatics; Cheminformatics Research Centers; Molecular Libraries Screening Centers Network (MLSCN)]

The National Center for Biotechnology Information (Bethesda)
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

What does NCBI do?
• Accepts submissions of primary data.
• Develops tools to analyze these data.
• Uses these tools to create derivative databases based on the primary data.
• Provides free search, linking, and retrieval of data, mainly through the Entrez system.

• BLAST: Sequence
• VAST: Protein Structure
• Entrez: Text
• PubChem Structure Search: Small Molecule Structure

Data Analysis Tools: Differential display of data via structure clustering, structure-activity heat maps and customizable result retrieval tables.

Types of Databases
Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
Examples: GenBank, SNP, GEO, PubChem Substance and BioAssay
Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
Examples: RefSeq, RefSNP, GDS, PubChem Compound

PubChem Databases
PubChem BioAssay
• Composed of experimental data with Background, Protocols and Results for bioactivity screens of chemical substances described in PubChem Substance.
• Submitters add “Hard” links to PubChem Substance records and outside sources.
PubChem Substance
• Composed of Substances which may be of known or unknown composition and may contain a discrete compound or mixtures of compounds.
• Submitters add “Hard” links to PubChem BioAssay records and outside sources.
PubChem Compound
• Composed of discrete compounds with known chemical structure.
• Summary reports about the known chemical compounds described in PubChem Substance.
• Addition of automated “Soft” links which can be replicated on PubChem Substance & BioAssay records.
Primary Databases (Substance, BioAssay): information is provided, updated and “owned” by Submitters.
Derivative Database (Compound): information is derived from the primary data, updated and “owned” by NCBI.

How does data get into PubChem?

Top PubChem Depositors
DiscoveryGate; ZINC; ChemDB; Thomson Pharma; ChemBridge; ChemBank; ChemIDplus; Asinex; DTP/NCI; Specs
DTP/NCI: 173
NIH Chemical Genomics Center: 60
Structural Genomics Consortium - Oxford: 43
Scripps Research Institute: 37
University of Pittsburgh MLSC: 33
Southern Research MLSC: 29
San Diego Center for Chemical Genomics: 22
BindingDB: 20
Penn Center for Molecular Discovery: 19
Emory MLSC; Vanderbilt MLSC
current depositors; 22 current depositors

PC Substance Record
[Labeled parts of the record: Substance ID; Compound ID; Link to depositor; Synonyms supplied by the depositor; Identical substances]

Redundancy in PC Substance: 13 completely identical records for (-)epinephrine!
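The redundancy on this slide is a many-to-one relationship: several depositor records (SIDs) can describe one and the same structure. The following is a minimal sketch of that grouping, assuming each substance already carries some standardized structure key. The SIDs, depositor names and key strings are made up for illustration and are not real PubChem identifiers, and `group_by_structure` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical substance records: (sid, depositor, standardized_structure_key).
# A key of None models a substance with no usable standardized structure.
substances = [
    (101, "DepositorA", "KEY-EPINEPHRINE-MINUS"),
    (102, "DepositorB", "KEY-EPINEPHRINE-MINUS"),
    (103, "DepositorC", "KEY-EPINEPHRINE-MINUS"),
    (104, "DepositorA", "KEY-OTHER"),
    (105, "DepositorD", None),
]


def group_by_structure(records):
    """Map each standardized structure key to the list of SIDs that share it."""
    groups = defaultdict(list)
    for sid, _depositor, key in records:
        if key is not None:  # records without a key are left ungrouped
            groups[key].append(sid)
    return dict(groups)


compounds = group_by_structure(substances)
for key, sids in compounds.items():
    print(key, "<-", sids)
```

Each key in `compounds` plays the role of one compound record that summarizes all of the substance records listed under it.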

Non-uniformity in PC Substance

The Bizarre in PC Substance: Grapefruit extract; Chamomile tea; Blood hydrolysate

PubChem Compound
What we do: Standardize Structures
Verify Chemical Data
◦ Atom description (label, element)
◦ Functional group clean-up
◦ Atom valence verification to prevent nonsense structures
“Normalize” and “Standardize”
◦ Valence-bond canonicalization (for tautomer invariance)
◦ Aromaticity detection and self-consistency
◦ Stereochemistry detection
◦ Explicit hydrogen assignment
Structural Representations
◦ 2D coordinate generation
◦ Images created
Structures that fail to standardize…
◦ Have no records in PC Compound
◦ Cannot be searched by structure
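To make the "atom valence verification" step concrete, here is a toy check that flags atoms whose summed bond orders exceed an allowed valence. This is only a sketch under stated assumptions: the valence table is a small illustrative subset for neutral atoms, charges and hypervalent elements are ignored, and `valence_errors` is a hypothetical helper, not PubChem's actual standardization code.

```python
# Assumed maximum valences for a few neutral elements (illustrative subset).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def valence_errors(atoms, bonds):
    """Return indices of atoms whose summed bond order exceeds the allowed valence.

    atoms: list of element symbols.
    bonds: list of (i, j, order) tuples indexing into atoms.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [idx for idx, el in enumerate(atoms) if used[idx] > MAX_VALENCE[el]]


# Formaldehyde (C=O plus two explicit hydrogens): no atom is over its valence.
ok = valence_errors(["C", "O", "H", "H"], [(0, 1, 2), (0, 2, 1), (0, 3, 1)])

# A carbon with five single bonds to hydrogen: the carbon (index 0) is flagged.
bad = valence_errors(["C", "H", "H", "H", "H", "H"],
                     [(0, k, 1) for k in range(1, 6)])
```

A structure with a non-empty error list would be one of the "nonsense structures" the slide refers to.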

[Figure: Compound and Substance records for structures with known stereochemistry, unknown stereochemistry, and unknown E/Z isomers]

Stereoisomers in PC Compound
[Figure: No stereochemical assignment; (+)epinephrine; (-)epinephrine]
No stereochemistry is a stereochemical assignment in PubChem!

PubChem Compound
Calculate Properties and Links
Nomenclature
◦ IUPAC
◦ SMILES & SMARTS
◦ InChI
Structural Information
◦ Calculate & store “Fingerprints”
◦ Calculate & link to similar structures (90% level)
Physical Properties
◦ Molecular Formula
◦ Molecular Weight
◦ Number of H-bond donor/acceptor sites
◦ XLogP value
◦ Lipinski value (bioavailability)
◦ Number of rotatable bonds
Links to NCBI Database Records
◦ Structures (MMDB records)
◦ Protein sequences (from Structure links)
◦ Genes (from Protein links)
Links to MeSH Terms through IUPAC name
MeSH is NLM’s controlled vocabulary used for indexing articles for MEDLINE/PubMed.
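The slide mentions stored fingerprints and links to similar structures at the "90% level" without naming the similarity measure. As an illustration only, the sketch below assumes the Tanimoto coefficient, a common choice for comparing fingerprints, and represents each fingerprint as the set of its "on" bit positions. The fingerprints and compound labels are invented.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def similar_pairs(fingerprints, threshold=0.90):
    """Return (name1, name2, score) for every pair scoring at or above threshold."""
    pairs = []
    for (n1, f1), (n2, f2) in combinations(fingerprints.items(), 2):
        score = tanimoto(f1, f2)
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs


# Invented fingerprints given as sets of on-bit positions.
fps = {
    "cmpd_A": set(range(1, 11)),           # bits 1..10
    "cmpd_B": set(range(1, 12)),           # bits 1..11, shares 10 of 11 with A
    "cmpd_C": set(range(1, 10)) | {20},    # shares 9 bits with A, 11 in the union
}
neighbors = similar_pairs(fps)
```

With the 0.90 threshold only the A/B pair qualifies (10/11), while A/C falls below it (9/11).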

PC Compound Record

MeSH Links

Calculated Properties; Links for downloading or viewing the full record

Handling Mixtures
Example: Asmatane mist
• CID for the mixture
• Each standardized component has its own CID
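One way to picture the mixture/component relationship is with SMILES, where disconnected components of a structure are separated by a dot. The sketch below splits such a string into its unique parts. It is an illustration, not PubChem's procedure: the example SMILES is an arbitrary ethanol/water mixture rather than the product named on the slide, and plain string comparison only merges components written identically, since no canonicalization is done.

```python
def mixture_components(smiles):
    """Split a dot-disconnected SMILES into unique components, preserving order."""
    seen = []
    for part in smiles.split("."):
        if part and part not in seen:
            seen.append(part)
    return seen


# Arbitrary example: ethanol with two waters written as one mixture string.
mixture = "CCO.O.O"
components = mixture_components(mixture)
print(mixture, "->", components)
```

In the slide's terms, the whole string would correspond to the mixture's record and each entry in `components` to a component record of its own.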


PC BioAssay Record

BioAssay Protocol
• Description of the BioAssay methods
• Listing of the data fields provided in the BioAssay

PubChem integration in Entrez
[Diagram] Data types: Protein Sequences; Literature; 3D Structures; Small Molecule Structures; Bioactivity Assay Results
[Diagram] Similarity relationships: VAST Structure Similarity; Term Frequency Statistics; Chemical Structure Similarity; Activity Profile Similarity

What is Entrez?
• System of 31 linked databases
• Text search engine
• Tool for finding biologically linked data
• Data retrieval engine
• Virtual workspace for manipulating large datasets
• Free public access

The Entrez Databases

Text Queries in Entrez
term1[limit] OP term2[limit] OP …
where
  limit = Entrez indexing field (organism, author, …)
  OP = Boolean operator = AND, OR, NOT
Complex queries: ((A[limit1] OR B[limit2]) AND C[limit3]) NOT D[limit4]
Ranges: 1:200[MW]
Phrases in quotes: "malic acid"[synonym]
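The query grammar on this slide is simple enough to assemble programmatically. The helpers below are a small sketch of that idea; `field`, `value_range` and `combine` are hypothetical names introduced here, not part of any NCBI library, and they only build strings.

```python
def field(term, limit):
    """Format one term with its Entrez indexing field, quoting multi-word phrases."""
    if " " in term and not term.startswith('"'):
        term = f'"{term}"'
    return f"{term}[{limit}]"


def value_range(low, high, limit):
    """Format a numeric range such as 1:200[MW]."""
    return f"{low}:{high}[{limit}]"


def combine(op, *parts):
    """Join sub-queries with one Boolean operator and wrap them in parentheses."""
    if op not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {op}")
    return "(" + f" {op} ".join(parts) + ")"


# Build: (1:200[MW] AND "malic acid"[synonym])
query = combine("AND",
                value_range(1, 200, "MW"),
                field("malic acid", "synonym"))
print(query)
```

Nested calls to `combine` give the "complex queries" form, for example `combine("NOT", combine("AND", a, b), c)`.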

Sample Entrez Queries
epinephrine[synonym]
  Find records that have synonyms containing “epinephrine”
epinephrine[CompleteSynonym]
  Find records that have a synonym that is exactly “epinephrine”
ca[Element] AND chemidplus[SourceName]
  Find records deposited by ChemIDplus that contain calcium
300:500[MW] AND "pcsubstance structure"[Filter]
  Find substances with molecular weights of 300 to 500 that are ligands in 3D protein structures
"lipinski"[Filter] AND "antineoplastic agents"[PharmAction]
  Find antineoplastic agents that obey the Lipinski rule of 5
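Queries like these can also be sent to Entrez from a script through the ESearch E-utility. The sketch below only constructs the request URL and does not contact NCBI. The base URL and the `db`, `term`, `retmax` and `usehistory` parameters are standard E-utilities parameters; the field and filter names come from the slide and may have changed since 2007, and `esearch_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20, use_history=False):
    """Build an ESearch URL for the given Entrez database and query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        params["usehistory"] = "y"  # ask Entrez to keep the result set server-side
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


term = '"lipinski"[Filter] AND "antineoplastic agents"[PharmAction]'
url = esearch_url("pccompound", term)
print(url)

# Round-trip check: the encoded URL decodes back to the original query.
decoded = parse_qs(urlparse(url).query)
```

`urlencode` takes care of escaping quotes, brackets and spaces, which is the part that is easy to get wrong when pasting queries into URLs by hand.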

Entrez Limits

Details: Entrez query equivalent to your selected Limits

Preview/Index: CompleteSynonym epinephrine

Entrez History
• Entrez History keeps track of all of your searches.
• Your history is deleted only after 8 hours of inactivity.
• You can purposefully store searches for later use.
• You can “concatenate” searches with ANDs, ORs, NOTs by using the query keys:
  #2 NOT 1:1000[mw]
  #6 AND #4

Downloading Reports

Downloading Bulk Data: output is a temporary FTP file

Linking in Entrez
Follow links to related data: Links
Hard Links: curated links based on biology
• nucleotide → taxonomy (based on organism identifier)
• protein → domain relatives (based on domain assignment)
• domains → pubmed (based on supporting literature)
• pcsubstance → structures/mmdb (based on source information)
Soft Links: pre-computed analyses
• nucleotide → related sequences (BLAST neighbors)
• protein → conserved domains (CDD/RPS-BLAST search)
• pccompound → pccompound (structure-based neighboring)

PubChem Links

Linking in Bulk: will return the corresponding Compounds for all of these Substances
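The same bulk Substance-to-Compound linking can be requested from a script with the ELink E-utility. The sketch below builds the request URL only and makes no network call. `dbfrom`, `db` and `id` are standard ELink parameters and `pcsubstance`/`pccompound` are the Entrez database names used on these slides; the SIDs are placeholders and `elink_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(sids, dbfrom="pcsubstance", db="pccompound"):
    """Build an ELink URL asking for records in `db` linked to the given IDs."""
    params = {
        "dbfrom": dbfrom,
        "db": db,
        "id": ",".join(str(sid) for sid in sids),  # comma-separated ID list
    }
    return f"{ELINK}?{urlencode(params)}"


# Placeholder SIDs, not chosen to mean anything in particular.
link_url = elink_url([1, 2, 3])
print(link_url)
```

Fetching this URL would return the links as XML; for very large ID lists the IDs are usually uploaded first with EPost rather than placed in the URL.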

The PubChem FTP Site

Programming Tools
NCBI Toolbox: In-house source code useful for incorporating NCBI-like functionality into your own programs.
• Three main parts: Data Model, Data Encoding and Programming Libraries.
• Examples: BLAST, Cn3D, Sequin, data format conversion scripts
E-Utilities: Guidelines for Entrez “URL calls” used to access data.
• Designed for use in scripts.
• Examples: ESearch, EPost, ESummary, EFetch and ELink
• Caution: Overuse may result in blocked IPs!
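Because overuse can get an IP blocked, scripts that call the E-utilities normally space out their requests. The class below is a generic throttling sketch, not an NCBI-provided tool: `Throttle` is a hypothetical helper, and the 0.5 second interval in the usage comment is a placeholder, so check NCBI's current usage guidelines for the actual limit. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use in a script (interval is a placeholder value):
#   throttle = Throttle(0.5)
#   for url in urls:
#       throttle.wait()
#       ...fetch url...
```

Calling `wait()` before every request keeps the request rate below one per `min_interval` seconds regardless of how fast the surrounding loop runs.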

PubChem Help

PubChem: Bird’s Eye View
[Diagram: Depositors; PubChem BioAssays; PubChem Compound; PubChem Substance; Chemical Structure Similarity]