NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources
Types of Databases
  Primary Databases
    – Original submissions by experimentalists
    – Content controlled by the submitter
    – Examples: GenBank, SNP, GEO
  Derivative Databases
    – Built from primary data
    – Content controlled by third party (NCBI)
    – Examples: RefSeq, TPA, RefSNP, UniGene, GEO DataSets, NCBI Protein, Structure, Conserved Domain
Accessing the Data: Entrez all[filter]
International Sequence Database Collaboration
  [Diagram: GenBank (NCBI, NIH; accessed through Entrez), EMBL (EBI; accessed through SRS), and DDBJ (NIG, CIB; accessed through getentry). Each database receives submissions and updates.]
GenBank: NCBI’s Primary Sequence Database
  ftp://ftp.ncbi.nih.gov/genbank/
  ftp://genbank.sdsc.edu/pub
  ftp://bio-mirror.net/biomirror/genbank
  Release 142 June
    ,532,003 Records
    40,325,321,348 Nucleotides
    >140,000 Species
    153 Gigabytes
    634 files
  – full release every two months
  – incremental and cumulative updates daily
  – available only through internet
  – release notes: gbrel.txt
A GenBank Record
  LOCUS       NM_ bp mRNA linear PRI 07-APR-2003
  DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA.
  ACCESSION   NM_
  VERSION     NM_ GI:
  KEYWORDS    .
GenBank Record: Feature Table
  GenPept identifiers:
    /protein_id=“ NP_ ”
    /db_xref=“GI:
GenBank Record, cont'd
Sequence Revision History
NM_ Sequence Revision History: choose records
Display and Save Options
FASTA format (NCBI)
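A FASTA record is a header line beginning with `>` followed by one or more lines of sequence. The sketch below reads that layout in Python. The helper name `parse_fasta` and the two records in `EXAMPLE` are invented for illustration and are not real NCBI entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical input: identifiers and sequences are made up.
EXAMPLE = """>example_1 hypothetical mRNA
ATGGCC
TTAG
>example_2 hypothetical protein
MKV
"""

if __name__ == "__main__":
    for header, seq in parse_fasta(EXAMPLE):
        print(header, len(seq))
```

Sequence lines are joined without separators, so a record wrapped over several lines comes back as one string.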
Abstract Syntax Notation: ASN.1
  [Diagram: ASN.1 linked to the FASTA Nucleotide, FASTA Protein, GenPept, and GenBank formats]
Bulk Divisions
  Expressed Sequence Tag
    – 1st-pass single-read cDNA
  Genome Survey Sequence
    – 1st-pass single-read gDNA
  High Throughput Genomic
    – incomplete sequences of genomic clones
  Sequence Tagged Site
    – PCR-based mapping reagents
  Batch submissions ( and ftp)
  Inaccurate
  Poorly characterized
NCBI’s Derivative Sequence Databases
Primary vs. Derivative Databases
  GenBank (primary)
    – Sources: sequencing centers, labs
    – Divisions: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
    – Updated ONLY by submitters
  RefSeq, UniGene, UniSTS (derivative)
    – Sources: curators, algorithms
    – RefSeq: LocusLink and Genomes Pipelines
    – RefSeq: Annotation Pipeline
    – Updated continually by NCBI
Why Make Reference Sequences?
  Entrez Protein query: topoisomerase II alpha[title] AND human[organism]
  [Figure legend: = AAC77388; splice variant; Δ = 5 aa; = P11388; RefSeq protein]
RefSeq Benefits
  – non-redundant, best representative
  – updates to reflect current sequence data and biology
  – distinct, stable accession series
  – genomes, transcripts, proteins
Reference Sequence: RefSeq
  Accession   Sequence Type
  NM_         mRNA
  NP_         protein, from NM_
  NR_         non-coding RNA
  XM_         predicted mRNA
  XP_         predicted protein
  XR_         predicted non-coding RNA
  ZP_         predicted from NZ_
  NC_         genomic, e.g., chromosomes
  NG_         genomic, incomplete region
  NT_         genomic, BAC assembly
  NW_         genomic, WGS assembly
  NZ_ABCD     genomic, WGS collection
  Key: blue = curated
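The accession prefixes above lend themselves to a simple lookup. The Python sketch below maps a prefix to the sequence type listed on the slide. The function name `refseq_type` and the accession `NM_000000.1` are hypothetical placeholders, not real records.

```python
# Prefix -> sequence type, copied from the RefSeq accession table.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein, from NM_",
    "NR_": "non-coding RNA",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
    "XR_": "predicted non-coding RNA",
    "ZP_": "predicted from NZ_",
    "NC_": "genomic, e.g., chromosomes",
    "NG_": "genomic, incomplete region",
    "NT_": "genomic, BAC assembly",
    "NW_": "genomic, WGS assembly",
    "NZ_": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq accession, or None if the
    first three characters are not a prefix from the table."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    # Made-up accession used only to exercise the lookup.
    print(refseq_type("NM_000000.1"))
```

Accessions without one of these prefixes, such as primary GenBank accessions, return `None`.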
RefSeq Status Codes
  REVIEWED: by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information including additional publications and features.
  VALIDATED: in an initial review to provide the preferred sequence standard; not yet subjected to final review, at which time additional functional information may be provided.
  PROVISIONAL: the record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
  PREDICTED: may represent an ab initio prediction or may be partially supported by other transcript data; the protein is predicted.
  INFERRED: by genome sequence analysis.
  MODEL: provided via automated processing and not subjected to individual review or revision between builds.
Third Party Annotation (TPA) Database
  – Annotations of existing GenBank sequences
  – Allows for community annotation of genomes
  – Direct submissions
    – BankIt
    – Sequin
Other Databases at the NCBI
  dbSNP: nucleotide polymorphisms
  GEO (Gene Expression Omnibus): microarray and other expression data
  GEO DataSets: curated reports of GEO data; collections of biologically and mathematically comparable GEO Samples
  Structure: imported structures (PDB); Cn3D viewer; NCBI curation
  CDD (conserved domain database): protein families (COGs and KOGs); single domains (PFAM, SMART, CD)
NCBI’s SNP Database
  – Primary and derivative (RefSNP)
  – Single nucleotide polymorphisms
  – Repeat polymorphisms
  – Insertion-deletion polymorphisms
  – 24 species
  – Over 11 million refSNPs (rsXXXXXXX)
RefSNP
  – Non-redundant
  – Computational analysis: BLAST hits to genome, mRNA, protein
Using Entrez
  An integrated database search and retrieval system
Entrez: Database Integration
  [Diagram: linked databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy. Link labels: Word weight, BLAST, VAST, Phylogeny.]
Home Page: Global Entrez Portal hfe
Global Entrez Search: HFE
Entrez Nucleotide: HFE
  218 records
  Not HFE [Title]
Smarter Query
  hfe[title] AND human[orgn]
  39 records
  Curated HFE splice variants (11 total)
hfe[title] AND human[orgn] (cont'd)
  Primary data
Finding Primary Sequences
  Entrez Nucleotide
    99+% GenBank (primary data)
      – srcdb ddbj/embl/genbank[properties] = 39,849,856 records
    <1% RefSeq (curated data)
      – srcdb refseq[properties] = 304,945 records
  Useful search terms in [Properties]:
    – srcdb: source database (e.g., srcdb genbank[prop])
    – gbdiv: GenBank division (e.g., gbdiv est[prop])
    – biomol: biomolecule type (e.g., biomol mrna[prop])
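The [Properties] terms follow a regular "key value[prop]" pattern, so they are easy to assemble programmatically. The Python sketch below only builds query strings and does not contact Entrez. The helper names `prop` and `restrict` are hypothetical and are not part of any NCBI tool.

```python
def prop(key, value):
    """Format one [Properties] term, e.g. prop('srcdb', 'refseq')."""
    return f"{key} {value}[prop]"


def restrict(query, *terms):
    """AND a base query with zero or more property terms.

    The base query is parenthesized so that any OR inside it keeps
    its grouping when further terms are appended.
    """
    return " AND ".join([f"({query})"] + list(terms))


if __name__ == "__main__":
    base = "hfe[title] AND human[orgn]"
    print(restrict(base, prop("srcdb", "refseq")))
    print(restrict(base, prop("gbdiv", "est"), prop("biomol", "mrna")))
```

The resulting strings can be pasted into the Entrez Nucleotide search box.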
Database Queries
  #1  hfe                                     116
  #2  hfe[title] AND human[orgn]              42
  #3  #2 AND srcdb refseq[prop]               11
  #4  #2 AND srcdb ddbj/embl/genbank[prop]    31
  #5  #2 AND gbdiv pri[prop]                  29
  #6  #2 AND gbdiv est[prop]                  2
  Primate division: gbdiv pri[prop]
  EST division: gbdiv est[prop]
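Each `#N` in these queries refers back to an earlier numbered search. As an illustration of what such a reference stands for, the Python sketch below expands `#N` references into the full query text. The `expand` helper and the `HISTORY` dictionary are hypothetical; they only imitate the numbering on the slide and do not reproduce how Entrez stores search history.

```python
import re


def expand(query, history):
    """Replace each #N reference with the parenthesized text of query N.

    References are expanded recursively, so the history must not
    contain a query that refers to itself.
    """
    def substitute(match):
        number = int(match.group(1))
        return "(" + expand(history[number], history) + ")"

    return re.sub(r"#(\d+)", substitute, query)


# Numbered searches taken from the slide.
HISTORY = {
    1: "hfe",
    2: "hfe[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[prop]",
}

if __name__ == "__main__":
    print(expand("#2 AND gbdiv est[prop]", HISTORY))
    print(expand("#3", HISTORY))
```

Parenthesizing each substituted query keeps its Boolean grouping intact when it is combined with further terms.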
Molecule Queries
  #1  hfe                            116
  #2  hfe[title] AND human[orgn]     42
  #3  #2 AND biomol mrna[prop]       29
  #4  #2 AND biomol genomic[prop]    13
  Genomic DNA: biomol genomic[prop]
  cDNA: biomol mrna[prop]
More Queries…
  RefSeq status, variants: reviewed RefSeqs with transcript variants
    srcdb refseq reviewed[prop] AND has transcript variants[prop]
  Gene symbol: human hemochromatosis (HFE)
    hfe[sym] AND human[organism]
  Disease and Gene Ontology: membrane proteins linked to cancer
    integral to plasma membrane[gene ontology] AND cancer[dis]
  Chromosome, Links: genes on human chromosome 2 with OMIM links
    2[chromosome] AND gene omim[filter] AND human[organism]
  Protein name: topoisomerase genes from Archaea
    topoisomerase[gene/protein name] AND archaea[organism]
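Queries like these can also be sent to Entrez from a script. The slides do not cover programmatic access, so the following is an assumption: it uses NCBI's E-utilities `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters. The Python sketch only builds the request URL and makes no network call. The helper name `esearch_url` is hypothetical, and `gene` as the database for the `[sym]` query is a guess.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL of the E-utilities search service (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; urlencode escapes brackets and spaces."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    url = esearch_url("gene", "hfe[sym] AND human[organism]")
    print(url)
    # Round-trip check: decoding the query string recovers the term.
    print(parse_qs(urlsplit(url).query)["term"][0])
```

Fetching the URL is left out on purpose; consult NCBI's usage guidelines before sending automated requests.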
Other Entrez Databases
  UniSTS: markers on the Genethon map of human chromosome 12
    Genethon[Map Name] AND human[organism] AND 12[chromosome]
  UniGene: rat clusters that have at least one mRNA
    rat[organism] NOT 0[mrna count]
  Structure: structures of bacterial kinases with resolutions below 2 Å
    bacteria[organism] AND kinase AND :002.00[resolution]
  SNP: uniquely mapped microsatellites on human chr2
    microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
Search by Sequence
Related Sequences Most similar Least similar
Search by Sequence: protein
BLink (BLAST Link)
BLink Output
BLink → Multiple sequence alignment