NCBI FieldGuide National Center for Biotechnology Information A Field Guide to GenBank and NCBI’s Molecular Biology Resources August 30, 2005 University of Colorado Health Sciences Center
NCBI FieldGuide Topics: About NCBI; GenBank overview; Primary vs derivative databases; The Reference Sequence (RefSeq) project; Entrez databases; Genome resources; Bookshelf; -break-; Entrez text searching; BLAST sequence searching; VAST structure searching; An integrated example
NCBI FieldGuide The National Institutes of Health Bethesda, MD
NCBI FieldGuide The National Center for Biotechnology Information: accepts submissions of primary data; develops tools to analyze these data; creates derivative databases based on the primary data; provides free search, link, and retrieval of these data, primarily through the Entrez system
NCBI FieldGuide NCBI WWW Users per Day
NCBI FieldGuide Number of Users Per Day Christmas & New Year
NCBI FieldGuide Homepage - accessing the data all[filter]
NCBI FieldGuide all[filter] 1/11/2005 3/15/2005 8/15/2005
NCBI FieldGuide Entrez Nucleotide
  Source                      # records
  Primary data
    GenBank / DDBJ / EMBL     57.3 million (97.4 %)
  Derivative data
    RefSeq                    1.47 million (2.5 %)
    RefSeq reviewed           60,000
    PDB (structures)          5,973
  “Total”                     59 million
NCBI FieldGuide GenBank: NCBI’s Primary Sequence Database ftp://ftp.ncbi.nih.gov/genbank/ ftp://genbank.sdsc.edu/pub ftp://bio-mirror.net/biomirror/genbank Release 149 August x 10^6 Records 52 x 10^9 Nucleotides 195 Gigabytes 816 files; full release every two months; incremental and cumulative updates daily; available only through internet; release notes: gbrel.txt Over 100 billion bases!
NCBI FieldGuide What is GenBank? Nucleotide-only sequence database; archival in nature. GenBank data: direct submissions (traditional records); batch submissions (EST, GSS, STS); ftp accounts (genome data). Three collaborating databases: GenBank; DNA Data Bank of Japan (DDBJ); European Molecular Biology Laboratory (EMBL) Database
NCBI FieldGuide GenBank Divisions
  “Organismal” divisions: organized by taxonomy (sort of); direct submissions (Sequin/Bankit); accurate (~1 error per 10,000 bp); well characterized
    PRI (28)   Primate
    ROD (15)   Rodent
    PLN (13)   Plant and Fungal
    BCT (11)   Bacterial/Archaeal
    INV (7)    Invertebrate
    VRT (7)    Other Vertebrate
    VRL (4)    Viral
    MAM (2)    Mammalian
    PHG (1)    Phage
    SYN (1)    Synthetic
    UNA (1)    Unannotated
  “Functional” divisions: organized by sequence type; batch submissions (ftp/ ); inaccurate; poorly characterized
    EST (377)  Expressed Sequence Tag
    GSS (138)  Genome Survey Sequence
    HTG (63)   High Throughput Genomic
    PAT (17)   Patent
    STS (9)    Sequence Tagged Site
    CON (1)    Contigs, virtual
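The division codes above lend themselves to a simple lookup table. The following is a minimal sketch in Python, assuming only the codes and names listed on the slide; the names `ORGANISMAL`, `FUNCTIONAL`, and `describe_division` are hypothetical and are not part of any NCBI tool.

```python
# Division codes and names as listed on the "GenBank Divisions" slide.
ORGANISMAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "INV": "Invertebrate",
    "VRT": "Other Vertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
FUNCTIONAL = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent",
    "STS": "Sequence Tagged Site", "CON": "Contigs, virtual",
}


def describe_division(code):
    """Return (group, name) for a three-letter division code (hypothetical helper)."""
    key = code.strip().upper()
    if key in ORGANISMAL:
        return ("organismal", ORGANISMAL[key])
    if key in FUNCTIONAL:
        return ("functional", FUNCTIONAL[key])
    raise KeyError(f"unknown division code: {code!r}")


if __name__ == "__main__":
    for code in ("PRI", "est", "HTG"):
        print(code, describe_division(code))
```

Keeping the two groups in separate dictionaries mirrors the slide's distinction between taxonomy-based and sequence-type-based divisions.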
NCBI FieldGuide GenBank Functional (Bulk) Divisions EST: Expressed Sequence Tag, 1st pass single read cDNA; GSS: Genome Survey Sequence, 1st pass single read gDNA; HTG: High Throughput Genomic, incomplete sequences of genomic clones; STS: Sequence Tagged Site, PCR-based mapping reagents; Whole Genome Shotgun
NCBI FieldGuide EST Division: Expressed Sequence Tags RNA gene products nucleus 30,000 genes ,000 unique cDNA clones in library - isolate unique clones - sequence once from each end make cDNA library 5’ 3’ >IMAGE: ', mRNA sequence NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC >IMAGE: ' mRNA sequence GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
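Single-pass EST reads like the two above typically contain ambiguous base calls (`N`). The sketch below shows one way to read FASTA-formatted text and count them. It is illustrative only: the function names are hypothetical, the record header `example_read` is made up because the slide's headers are truncated, and the sequence is the first two lines of the first read above.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ambiguity_stats(seq):
    """Count ambiguous 'N' base calls in a nucleotide sequence."""
    n_count = seq.count("N")
    return {
        "length": len(seq),
        "n_count": n_count,
        "n_fraction": n_count / len(seq) if seq else 0.0,
    }


# Hypothetical header; sequence lines copied from the first read on the slide.
EXAMPLE = """>example_read
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
"""

if __name__ == "__main__":
    for header, seq in parse_fasta(EXAMPLE):
        print(header, ambiguity_stats(seq))
```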
NCBI FieldGuide GSS, WGS, HTG shred Whole BAC insert (or genome) isolate clones sequence GSS division or trace archive Draft sequence (HTG division) assembly whole genome shotgun assemblies (traditional division)
NCBI FieldGuide HTG Example: Honeybee Draft Sequences Unfinished sequences of BACs Gaps and unordered pieces Finished sequences (Phase 3) move to traditional GenBank division
LOCUS AC bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION AC
VERSION AC GI:
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.
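The KEYWORDS line of an HTG record carries the sequencing phase, which tells whether a record is still a draft. A small sketch of pulling that out of the flat-file text follows. The helper names are hypothetical, and the only input assumed is a KEYWORDS line shaped like the one in the honeybee example.

```python
import re


def htgs_phase(keywords_line):
    """Return the HTGS phase number from a KEYWORDS line, or None if absent."""
    match = re.search(r"HTGS_PHASE(\d)", keywords_line)
    return int(match.group(1)) if match else None


def is_finished(keywords_line):
    """Per the slide, finished sequences are Phase 3."""
    return htgs_phase(keywords_line) == 3


if __name__ == "__main__":
    line = "KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT."
    print(htgs_phase(line), is_finished(line))
```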
NCBI FieldGuide Whole Genome Shotgun Projects 351 projects Bacteria (251) Environmental sequences (6) Archaea (6) Eukaryotes (88), including: Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human Pufferfish (2) Honeybee, Anopheles, Fruit Flies (3), Silkworm Nematode (2) Yeasts (8), Aspergillus (2) Rice (2)
NCBI FieldGuide Whole Genome Shotgun (WGS) Projects wgs master[properties]
NCBI FieldGuide Derivative Databases GenBank (updated ONLY by submitters: labs and sequencing centers): EST, STS, HTG, GSS, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL. Updated by NCBI: UniGene; UniSTS; RefSeq (Entrez Gene and annotation pipelines)
NCBI FieldGuide Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]
NCBI FieldGuide human[organism] AND lipase[title] AND endothelial[title] 3927 bp 4150 bp 3927 bp 2323 bp 261 bp
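The queries on these slides combine field-tagged terms with Boolean AND. The sketch below builds the same query string programmatically and wraps it in a search URL. Note the assumptions: the slides only show the Entrez web interface, so the use of NCBI's E-utilities `esearch.fcgi` endpoint, the `nucleotide` database name, and the helper names `fielded`, `and_query`, and `esearch_url` are illustrative choices, not something the presentation describes. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Attach an Entrez field tag, e.g. fielded('human', 'organism')."""
    return f"{term}[{field}]"


def and_query(*clauses):
    """Join clauses with Boolean AND, as in the slide's example query."""
    return " AND ".join(clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = and_query(
    fielded("human", "organism"),
    fielded("lipase", "title"),
    fielded("endothelial", "title"),
)

if __name__ == "__main__":
    print(query)
    print(esearch_url("nucleotide", query))
```

The same helpers cover the other queries shown earlier, such as `all[filter]` or `wgs master[properties]`.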
NCBI FieldGuide RefSeq Benefits genomes transcripts proteins non-redundant; best representative updates to reflect current sequence data and biology distinct, stable accession series
NCBI FieldGuide Reference Sequence: RefSeq
  Accession   Sequence type
  NM_         mRNA
  NP_         protein, from NM_
  NR_         non-coding RNA
  XM_         predicted mRNA
  XP_         predicted protein
  XR_         predicted non-coding RNA
  ZP_         predicted from NZ_
  NC_         genomic, e.g., chromosomes
  NG_         genomic, incomplete region
  NT_         genomic, BAC assembly
  NW_         genomic, WGS assembly
  NZ_ABCD     genomic, WGS collection
  (blue = curated)
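Because the RefSeq accession prefix encodes the sequence type, a record can be classified from its accession alone. The following is a minimal sketch using only the prefixes in the table above. The function name and the example accessions are hypothetical placeholders, not real records.

```python
# Prefix-to-type mapping taken from the RefSeq accession table on the slide.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein, from NM_",
    "NR_": "non-coding RNA",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
    "XR_": "predicted non-coding RNA",
    "ZP_": "predicted from NZ_",
    "NC_": "genomic, e.g., chromosomes",
    "NG_": "genomic, incomplete region",
    "NT_": "genomic, BAC assembly",
    "NW_": "genomic, WGS assembly",
    "NZ_": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq accession, or None if not RefSeq-style."""
    acc = accession.strip().upper()
    return REFSEQ_PREFIXES.get(acc[:3])


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    for acc in ("NM_000000", "XP_000000", "NZ_ABCD00000000", "AC000000"):
        print(acc, "->", refseq_type(acc))
```

An accession without an underscore-delimited two-letter prefix, such as a GenBank accession, returns `None`, which reflects the "distinct, stable accession series" that separates RefSeq from primary records.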
NCBI FieldGuide Genomic DNA (NC, NT, NW) Model mRNA (XM) (XR) Curated mRNA (NM) (NR) Model protein (XP) Annotation Process Curated Protein (NP) Scanning.... GenBank Sequences RefSeq
NCBI FieldGuide Creating NM_ Records NM’s must have cDNA support Genome annotation Longest mRNA transcript variant 1 transcript variant 2 transcript variant 3
NCBI FieldGuide Where is RefSeq?
NCBI FieldGuide The Entrez System: Nucleotide; PubMed; Protein; Taxonomy; Structure; Domains; 3D Domains; Journals; PMC; OMIM; Books; PopSet; SNP; UniGene; UniSTS; Genome; Gene; GEO; MeSH; Cancer Chromosomes; Homologene; PubChem; GENSAT
NCBI FieldGuide A Few Entrez Databases UniGene Clusters of ESTs, mRNAs dbSNP Single Nucleotide Polymorphisms GEO Gene Expression Omnibus microarray and other expression data CDD Conserved Domain Database protein families (COGs and KOGs) single domains (PFAM, SMART, CD)
NCBI FieldGuide Gene-oriented clusters of expressed sequences Automatic clustering using MegaBlast Each cluster represents a unique gene Informed by genome hits Information on tissue types and map locations Useful for gene discovery and selection of mapping reagents UniGene unique gene
NCBI FieldGuide A Cluster of ESTs query 5’ EST hits 3’ EST hits
NCBI FieldGuide UniGene Collections
NCBI FieldGuide Example UniGene Cluster
NCBI FieldGuide Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)
NCBI FieldGuide UniGene Cluster Hs SELECTED PROTEIN SIMILARITIES
NCBI FieldGuide UniGene Cluster Hs GENE EXPRESSION
NCBI FieldGuide UniGene Cluster Hs.95351: expression
NCBI FieldGuide UniGene Cluster Hs.95351: seqs
NCBI FieldGuide Download sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
NCBI FieldGuide Entrez GEO
NCBI FieldGuide NCBI’s SNP Database Primary and derivative (RefSNP) Single nucleotide polymorphisms Repeat polymorphisms Insertion-deletion polymorphisms Over 19 million refSNPs (rsXXXXXXX) ( August, 2005)
NCBI FieldGuide Searching dbSNP
NCBI FieldGuide RefSNP
NCBI FieldGuide RefSNP
NCBI FieldGuide RefSNP
NCBI FieldGuide RefSNP Search Mouse SNP between strains
NCBI FieldGuide RefSNP MapView GeneView SeqView OMIM No 3D
NCBI FieldGuide RefSNP
NCBI FieldGuide Entrez GEO
NCBI FieldGuide Entrez GEO and Entrez GEO Datasets GPL: Platform descriptions (submitted by manufacturer*); GSM (GEO SaMple): raw/processed spot intensities from a single slide/chip, experimental conditions (submitted by experimentalists); GSE (GEO SEries): grouping of slide/chip data, “a single experiment”, set of related samples (submitted by experimentalists); GDS: grouping of experiments (curated by NCBI)
NCBI FieldGuide What’s a DataSet? Supplied by submitter: Platform (GPL), array definition; Sample (GSM), hyb. measurements; Series (GSE), related Samples. Assembled by GEO staff: DataSet (GDS), a collection of experimentally-related samples processed using the same platform. Samples within DataSets are organized into subgroups based on experimental variables. DataSets form the basis of GEO’s query, analysis and data display tools.
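The four GEO record types can be told apart by their accession prefix. The sketch below classifies an accession and reports who supplies that record type, using only the roles stated on the slides. The function name and the example accession numbers are hypothetical.

```python
import re

# Record types and their sources, as described on the GEO slides.
GEO_TYPES = {
    "GPL": ("Platform", "supplied by submitter"),
    "GSM": ("Sample", "supplied by submitter"),
    "GSE": ("Series", "supplied by submitter"),
    "GDS": ("DataSet", "assembled by GEO staff"),
}

_GEO_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_geo(accession):
    """Return (record type, source, number) for a GEO accession, else None."""
    match = _GEO_ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    prefix, number = match.groups()
    kind, source = GEO_TYPES[prefix]
    return (kind, source, int(number))


if __name__ == "__main__":
    # Placeholder accession numbers for illustration only.
    for acc in ("GPL1", "GSM2", "GSE3", "GDS4", "NM_000000"):
        print(acc, "->", classify_geo(acc))
```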
NCBI FieldGuide Gene Expression Omnibus (GEO) Dataset browser
NCBI FieldGuide GEO Dataset Browser
NCBI FieldGuide GEO Dataset Report
NCBI FieldGuide GEO Profiles … of 12625
NCBI FieldGuide Entrez CDD
NCBI FieldGuide Conserved Domain Database Multiple sequence alignments Position-specific scoring matrices (PSSM) Sources SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
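To make the link between a multiple sequence alignment and a position-specific scoring matrix concrete, here is a deliberately simplified sketch. It is not how CDD builds its PSSMs: the uniform background frequencies, the add-one pseudocount, the function names, and the three-sequence toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log-odds scores per alignment column.

    Simplifications (assumptions): uniform background, add-one pseudocounts,
    and gap or unknown characters are ignored.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    matrix = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return matrix


def score(matrix, seq):
    """Sum the column scores for a gap-free sequence of the alignment's length."""
    return sum(matrix[i][aa] for i, aa in enumerate(seq))


if __name__ == "__main__":
    toy_alignment = ["MTC", "MSC", "MTC"]  # hypothetical toy alignment
    pssm = build_pssm(toy_alignment)
    print(score(pssm, "MTC"), score(pssm, "AAA"))
```

A sequence that matches the conserved columns scores higher than an unrelated one, which is the property domain searches rely on.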
NCBI FieldGuide CDD >gi| |gb|AAS | ATP7A [Solenodon paradoxus] IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP LLTSPEILRE
NCBI FieldGuide CDD CD Pfam COG Click on a colored bar to align your sequence to the CD
NCBI FieldGuide Conserved Domain Database: cd , HMA
NCBI FieldGuide CDD
NCBI FieldGuide CDART: Conserved Domain Architecture Retrieval Tool
NCBI FieldGuide cdd Linking from Entrez Protein
NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology
NCBI FieldGuide Genomic Biology
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Genome Projects: microb
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Gen Biol: Gen Resources
NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology
NCBI FieldGuide Entrez Gene A single query interface to … Sequences - RefSeqs - GenBank - Homologene Maps – MapViewer Entrez links Linkouts More organisms, ~ 3000 Entrez integration
NCBI FieldGuide Global Entrez: NADH2
NCBI FieldGuide Entrez Gene: NADH2
NCBI FieldGuide Gene Record for Pongo NADH2 Homo sapiens Not found with “nadh2”
NCBI FieldGuide A Record With More Data: Human HFE
NCBI FieldGuide Human HFE: Transcripts Transcripts with experimental evidence
NCBI FieldGuide Gene Table
NCBI FieldGuide Introns/Exons: Gene Table links to sequence
NCBI FieldGuide Human HFE: Links
NCBI FieldGuide Genotype
NCBI FieldGuide Genotype
NCBI FieldGuide Human HFE: Links
NCBI FieldGuide GeneView in dbSNP
NCBI FieldGuide SNP in Structure
NCBI FieldGuide SNP in Structure
NCBI FieldGuide SNP in Structure H41 S43 C260
NCBI FieldGuide Another Variation Source: OMIM
NCBI FieldGuide Variants in OMIM
NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology
NCBI FieldGuide The New Homologene Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes. No longer UniGene based Protein similarities first Guided by taxonomic tree Includes orthologs and paralogs
NCBI FieldGuide The New Homologene Homologene Build 43.1 (8/23/05) Species Number of genes input grouped groups
NCBI FieldGuide RAG1 → Homologene
NCBI FieldGuide RAG1 → Homologene RAG1
NCBI FieldGuide RAG1 RING-finger
NCBI FieldGuide RAG1 → Homologene RAG1
NCBI FieldGuide RAG1 Sugar_tr
NCBI FieldGuide Homologene: alignment scores
NCBI FieldGuide BLASTP bl2seq
NCBI FieldGuide Genome Resources LocusLink Gene database UniGene Trace Archive Map Viewer Homologene
NCBI FieldGuide List View
NCBI FieldGuide Human MapViewer adar
NCBI FieldGuide MapViewer: Human ADAR
NCBI FieldGuide MV Hs ADAR 3’ UTR 5’ UTR
NCBI FieldGuide Maps & Options --Sequence maps-- Ab initio Assembly Repeats BES_Clone Clone NCI_Clone Contig Component CpG island dbSNP haplotype Fosmid GenBank_DNA Gene Phenotype SAGE_Tag STS TCAG_RNA Transcript (RNA) Hs_UniGene Hs_EST --Cytogenetic maps-- Ideogram FISH Clone Gene_Cytogenetic Mitelman Breakpoint Morbid/Disease --Genetic Maps-- deCODE Genethon Marshfield --RH maps-- GeneMap99-G3 GeneMap99-GB4 NCBI RH Stanford-G3 TNG Whitehead-RH Whitehead-YAC Mm_UniGene Mm_EST Rn_UniGene Rn_EST Ssc_UniGene Ssc_EST Bt_UniGene Bt_EST Gga_UniGene Gga_EST Variation Maps & Options = SNP
NCBI FieldGuide MapViewer UniGene Component Repeats Gene
NCBI FieldGuide Gene Phenotype Variation
NCBI FieldGuide Maps & Options
NCBI FieldGuide Genome Resources LocusLink Gene database UniGene Trace Archive Map Viewer Homologene
NCBI FieldGuide Trace Archive Page
NCBI FieldGuide Macaca Mulatta Traces
NCBI FieldGuide Trace Archive BLAST Page Access to sequences NOT in GenBank
NCBI FieldGuide Literature Links
NCBI FieldGuide BOOKS Database
NCBI FieldGuide BOOKS Database: hyperlinked
NCBI FieldGuide BOOKS Database
NCBI FieldGuide BOOKS Database
NCBI FieldGuide BOOKS Database
NCBI FieldGuide Genes & Dis
NCBI FieldGuide Genes & Dis
NCBI FieldGuide For More Information…
NCBI FieldGuide Intermission