NCBI Literature Databases: PubMed. Session #1: April 28, 2005; Session #2: April 29, 2005. Ho Chi Minh City, Vietnam
The National Institutes of Health Bethesda, MD
The National Center for Biotechnology Information - Created as a part of NLM in 1988 - Establish public databases - Perform research in computational biology - Develop software tools for sequence analysis - Disseminate biomedical information
Number of Users and Hits Per Day (chart covering 1997 to 2003; annotated: Christmas & New Year’s Days). NCBI currently averages 15,000,000 to 50,000,000 hits per day. These figures refer to NCBI as a whole; the bulk of the hits and users are for PubMed.
PubMed Hits for March 2005. PubMed averages 10,000,000 to 13,000,000 hits per day; Saturdays and Sundays see about 6 million hits/day. Bar chart shows data from last month. Line graphs show historical data for comparison: 2 months ago (thick solid line); 3 months ago (solid line); 4 months ago (dashed line); 5 months ago (dotted line).
Countries of Origin
Literature Databases
A part of the NCBI Bookshelf - Part 1. The Databases - Part 2. Data Flow and Processing - Part 3. Querying and Linking the Data - Part 4. User Support
OMIM - A catalogue of genes involved with human disease processes - Detailed clinical and reference information - Curated and maintained by Johns Hopkins - Links to PubMed and sequence databases
PubMed URL: http://www.ncbi.nlm.nih.gov/ http://www.pubmed.gov/
How to Query a Particular Database
term1 term2
(term1[tag delimiter] op term2[tag delimiter] op …)
op = AND, OR, NOT
Boolean operators must be in ALL CAPS. PubMed is the exception: because so many users do not follow this convention there, it accepts both lower-case and upper-case Boolean operators.
tag delimiter = Entrez indexing field
Examples of tag delimiters: Text Word, Journal, MeSH Terms, Author
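The query pattern above can be sketched in a few lines of Python. This is only an illustration of the string layout: the names `build_query` and `VALID_OPS` are made up for this sketch and are not part of any NCBI tool, and the sketch applies the strict all-caps rule rather than PubMed's relaxed one.

```python
# Illustrative helper (hypothetical name): joins tagged terms with one operator.
VALID_OPS = {"AND", "OR", "NOT"}


def build_query(tagged_terms, op="AND"):
    """Join (term, field) pairs into one fielded, Entrez-style query string."""
    if op not in VALID_OPS:
        # Mirrors the all-caps convention; lower-case operators are rejected here.
        raise ValueError(f"operator must be one of {sorted(VALID_OPS)}, got {op!r}")
    parts = [f"{term}[{field}]" for term, field in tagged_terms]
    return "(" + f" {op} ".join(parts) + ")"


if __name__ == "__main__":
    print(build_query([("Brauninger a", "Author"), ("c-src kinase", "Text Word")]))
```

Rejecting lower-case operators keeps the sketch portable across Entrez databases, since only PubMed is said to tolerate them.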
Sample PubMed Query
Brauninger a c-src kinase
Fields: Text Word, Journal, MeSH Terms, Author
Brauninger a (author name) & c-src kinase (name of protein)
Click on “Details” to show the actual Entrez query structure.
Order of word identification:
1. Organism classification (Cancer = crab)
2. Journal name (jmb = Journal of Molecular Biology)
3. Keyword list (MeSH terms)
4. Author name (“last name” “first initial”)
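The order of word identification can be pictured as a lookup cascade: each vocabulary is tried in turn and the first match wins. The sketch below is a toy model under stated assumptions, not the actual Entrez implementation. The vocabularies are tiny hypothetical stand-ins, and the fallback to "All Fields" when nothing matches is an assumption that the slide does not state.

```python
# Toy vocabularies (hypothetical contents), listed in the slide's stated order.
LOOKUPS = [
    ("Organism", {"cancer"}),         # slide example: Cancer = crab
    ("Journal", {"jmb", "cancer"}),   # "cancer" also placed here to show priority
    ("MeSH Terms", {"neoplasms"}),    # illustrative entry only
    ("Author", {"brauninger a"}),     # last name + first initial
]


def classify(phrase):
    """Return the first field whose vocabulary contains the phrase."""
    key = phrase.strip().lower()
    for field, vocabulary in LOOKUPS:
        if key in vocabulary:
            return field
    return "All Fields"  # assumed fallback, not taken from the slide


if __name__ == "__main__":
    for phrase in ["Cancer", "jmb", "Brauninger a", "c-src kinase"]:
        print(phrase, "->", classify(phrase))
```

Because the organism list is checked first, "Cancer" is tagged as an organism even though it also appears in the toy journal list, which is the behavior the slide warns about.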
Using Fields to Find Records: Affiliation, All Fields, Author, EC/RN Number, Entrez Date, Filter, Grant Number, Issue, Journal, Language, MeSH Date, MeSH Major Topic, MeSH Subheading, MeSH Terms, Pagination, Pharmacological Action, Publication Date, Publication Type, Secondary Source ID, Substance Name, Text Word, Title, Title/Abstract, Volume
Using Fields to Find Records
Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume
Most useful search field, [Organism]: human[orgn] …or… bacteria[orgn]
Useful search terms in the [Properties] field:
srcdb: “source database” ( srcdb_genbank[prop] )
gbdiv: “GenBank division” ( gbdiv_est[prop] )
biomol: “biomolecular type” ( biomol_mrna[prop] )
Using Field Limits
Search                                          Results
#1: thyroid peroxidase                          340
#2: thyroid peroxidase AND human[orgn]          291
#3: thyroid peroxidase[title] AND human[orgn]   166
#4: #3 AND srcdb_refseq[prop]                   5
#5: #3 AND srcdb_ddbj/embl/genbank[prop]        161
#6: #5 AND gbdiv_est[prop]                      20
#7: #5 AND gbdiv_pri[prop]                      141
#8: #7 AND biomol_genomic[prop]                 25
#9: #7 AND biomol_mrna[prop]                    116
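Each numbered search builds on an earlier one through its `#N` reference. The sketch below expands those references into a single full query and checks that the result counts on the slide add up. The function name `expand` and the added parentheses are illustrative choices for this sketch, not a description of how Entrez stores its search history.

```python
import re

# Search history and result counts copied from the slide.
HISTORY = {
    1: "thyroid peroxidase",
    2: "thyroid peroxidase AND human[orgn]",
    3: "thyroid peroxidase[title] AND human[orgn]",
    4: "#3 AND srcdb_refseq[prop]",
    5: "#3 AND srcdb_ddbj/embl/genbank[prop]",
    6: "#5 AND gbdiv_est[prop]",
    7: "#5 AND gbdiv_pri[prop]",
    8: "#7 AND biomol_genomic[prop]",
    9: "#7 AND biomol_mrna[prop]",
}
COUNTS = {1: 340, 2: 291, 3: 166, 4: 5, 5: 161, 6: 20, 7: 141, 8: 25, 9: 116}


def expand(number, history=HISTORY):
    """Replace every #N reference with the parenthesized text of search N."""
    def substitute(match):
        return "(" + expand(int(match.group(1)), history) + ")"
    return re.sub(r"#(\d+)", substitute, history[number])


if __name__ == "__main__":
    print(expand(9))
    # Each pair of limits splits its parent set, so the counts should add up.
    print(COUNTS[4] + COUNTS[5] == COUNTS[3])
    print(COUNTS[6] + COUNTS[7] == COUNTS[5])
    print(COUNTS[8] + COUNTS[9] == COUNTS[7])
```

The arithmetic check shows why the limits are useful: RefSeq plus DDBJ/EMBL/GenBank accounts for all 166 records of search #3, and each later pair partitions its parent in the same way.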
Complex searches you can do with Preview/Index
Terms used (and indexed) in Entrez fields can be searched to gain useful information!
How many rat UniGene clusters contain at least one mRNA?
1. Select the UniGene database.
2. Find all the rat records: rat [organism]
3. Find those that have ≥ 1 mRNA (“not 0”), combined with NOT.
Complex Queries with Preview/Index NOT 0 [mRNA Count]
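The "not 0" trick works because NOT is a set difference: take every rat cluster, then remove the clusters indexed with an mRNA count of 0. The sketch below imitates that logic on a handful of made-up records. The cluster IDs and counts are hypothetical and only illustrate the set arithmetic.

```python
# The combined query as it would be typed (terms taken from the slides).
QUERY = "rat[organism] NOT 0[mRNA Count]"

# Hypothetical UniGene-like records, for illustration only.
CLUSTERS = [
    {"id": "A", "organism": "rat", "mrna_count": 0},
    {"id": "B", "organism": "rat", "mrna_count": 3},
    {"id": "C", "organism": "human", "mrna_count": 5},
    {"id": "D", "organism": "rat", "mrna_count": 1},
]


def rat_clusters_with_mrna(records):
    """Apply 'rat NOT 0 mRNAs' as a set difference over record IDs."""
    rat = {r["id"] for r in records if r["organism"] == "rat"}
    zero_mrna = {r["id"] for r in records if r["mrna_count"] == 0}
    return sorted(rat - zero_mrna)


if __name__ == "__main__":
    print(QUERY)
    print(rat_clusters_with_mrna(CLUSTERS))
```

Excluding the single value 0 is simpler than listing every positive count, which is why the slide phrases "at least one mRNA" as "not 0".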
Batch Downloads
Batch Downloads: FASTA and GI list
The (ever expanding) Entrez System
Databases: PubMed, PubMed Central, Books, Journals, NLM Catalog, OMIM, Nucleotide, Protein, Genome, Gene, HomoloGene, UniGene, UniSTS, SNP, PopSet, Structure, 3D Domains, CDD/CDART, Taxonomy, GEO/GDS, PubChem, GenSat, GenomeProjects, Cancer Chromosomes
Category labels in the diagram: Literature; Organism; Expression; Sequence; Protein Sequence/Structure; Genome Sequence/Structure; Compounds; BioAssays; Substances
Based on keyword searching (MeSH terms, author names, gene names, accession or GI numbers, or just recognized patterns in the records). 15 databases are included….
Other Advanced Queries
Nucleotide: Non-genomic sequences from the PLN division of GenBank
    gbdiv_pln [properties] NOT biomol_genomic [properties]
Protein: RefSeq sequences with molecular weights of 80 to 100 kDa
    srcdb_refseq [properties] AND 080000:100000 [Molecular Weight]
SNP: True SNPs that are uniquely mapped on the mouse genome
    Snp [SNP Class] AND 1 [Map Weight] AND mouse [organism]
UniSTS: Markers on the Genethon map of human chromosome 12
    Genethon [Map Name] AND human [organism] AND 12 [chromosome]
Structure: Structures of bacterial kinases with resolutions below 2 Å
    Bacteria [organism] AND kinase AND 000.00:002.00 [resolution]
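The numeric range terms above are zero-padded to a fixed width (`080000:100000`, `000.00:002.00`). The sketch below formats such ranges and wraps a query in an E-utilities ESearch URL. The ESearch endpoint and its `db`, `term` and `retmax` parameters are not mentioned in the slides; they are the present-day programmatic interface and are used here as an assumption, and the helper names are made up. Whether these 2005 field names are still indexed has not been checked, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Present-day ESearch endpoint (an assumption; the slides only show the web form).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mw_range(low_da, high_da):
    """Molecular weight range in daltons, zero-padded to six digits."""
    return f"{low_da:06d}:{high_da:06d} [Molecular Weight]"


def resolution_range(low, high):
    """Resolution range in angstroms, zero-padded as 000.00."""
    return f"{low:06.2f}:{high:06.2f} [resolution]"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for one database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    protein_term = "srcdb_refseq [properties] AND " + mw_range(80000, 100000)
    structure_term = "Bacteria [organism] AND kinase AND " + resolution_range(0.0, 2.0)
    print(esearch_url("protein", protein_term))
    print(esearch_url("structure", structure_term))
```

`urlencode` percent-encodes the brackets and spaces, so the query text can be passed exactly as it appears on the slide.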
“Global Entrez Query”
NM_000249: PubMed Books
Books Link
OMIM: Human Disease Genes Conserved Domain
Search Engines OAIster - http://www.oaister.org/
Sources for Full Text
Directory of Open Access Journals - http://www.doaj.org/
Health Internetwork Institutional Archives e.g. http://archives.eprints.org/
BioLine International - http://www.bioline.org.br/