A Field Guide to GenBank and NCBI Molecular Biology Resources


1 A Field Guide to GenBank and NCBI Molecular Biology Resources
Slightly modified from:
  Peter Cooper, ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/
  Eric Sayers, ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/

2 NCBI Resources
About NCBI
NCBI Sequence Databases
  Primary Database: GenBank
  Derivative Databases: RefSeq
Entrez Databases and Text Searching
BLAST Services
Genomic Resources

3 The National Center for Biotechnology Information (NCBI)
Created as a part of NLM in 1988
  Establish public databases
  Perform research in computational biology
  Develop software tools for sequence analysis
  Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Human genome (2001)

4 NCBI Home Page http://www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages

5 About NCBI

6 Some NCBI Statistics….

7 Users per day Christmas Day

8 Molecular Databases
Primary Databases
  Original submissions by experimentalists
  Database staff organize but don’t add additional information
  Example: GenBank
Derivative Databases
  Human curated: compilation and correction of data
    Example: SWISS-PROT, NCBI RefSeq mRNA
  Computationally derived
    Example: UniGene
  Combinations
    Example: NCBI Genome Assembly

9 What is GenBank? NCBI’s Primary Sequence Database
Nucleotide-only sequence database
GenBank Data
  Direct submissions: individual records (BankIt, Sequin)
  Batch submissions via (EST, GSS, STS)
  ftp accounts established for sequencing centers
Data shared amongst three collaborating databases:
  GenBank
  DNA Database of Japan (DDBJ)
  European Molecular Biology Laboratory Database (EMBL)

10 The International Nucleotide Sequence Database Collaboration
[Diagram: GenBank, EMBL, and DDBJ each receive submissions and updates and exchange them with one another. Labels: NIH, NCBI, GenBank, Entrez, Sequin, BankIt, ftp; EMBL, EBI, SRS; DDBJ, CIB, NIG, getentry.]

11 GenBank: NCBI’s Primary Sequence Database
Full release every two months
Incremental and cumulative updates daily
Available only through internet: ftp://ftp.ncbi.nih.gov/genbank/
Release 133, December 2002
  22,318,883 Records
  28,507,990,166 Nucleotides
  110,000+ Species
  >90 Gigabytes of data

12 Entrez Nucleotide
23,464,770 records
  GenBank 71%
  DDBJ 19%
  EMBL 9%
  RefSeq 1%

13 Primary vs. Derivative Databases
[Diagram: sequence fragments from Sequencing Centers and Labs flow into GenBank; derivative resources are built from them. Labels: Sequencing Centers, Labs, GenBank, Curators, RefSeq, Algorithms, UniGene, Genome Assembly.]

14 Traditional GenBank Divisions
Direct Submissions (Sequin and BankIt)
Accurate
Well characterized
  BCT Bacterial and Archaeal
  INV Invertebrate
  MAM Mammalian (excluding ROD and PRI)
  PHG Phage
  PLN Plant and Fungal
  PRI Primate
  ROD Rodent
  SYN Synthetic (cloning vectors)
  VRL Viral
  VRT Other Vertebrate

15 A Traditional GenBank Record
[Annotated record; labeled fields: Locus Field, Molecule Type, Modification Date, GenBank Division, Definition Line, Accession Number, Version, GI (GenInfo), Keywords, Taxonomy]
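The fields labeled on this slide that sit on the first line of a flat-file record (locus name, molecule type, division, modification date) can be pulled apart with a small parser. The following is a minimal sketch, not NCBI's own parser: it assumes a whitespace-separated LOCUS line, treats the topology token as optional, and uses a made-up sample line (`EXAMPLE01` and its values are hypothetical, not a real record).

```python
from typing import NamedTuple, Optional


class LocusLine(NamedTuple):
    name: str
    length_bp: int
    molecule_type: str
    topology: Optional[str]
    division: str
    modification_date: str


def parse_locus_line(line: str) -> LocusLine:
    """Split a whitespace-separated LOCUS line into its labeled parts."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"not a recognizable LOCUS line: {line!r}")
    rest = tokens[4:]
    if len(rest) == 4:
        # molecule type, topology (e.g. linear/circular), division, date
        molecule_type, topology, division, date = rest
    elif len(rest) == 3:
        # assumption: some lines carry no topology token
        molecule_type, division, date = rest
        topology = None
    else:
        raise ValueError(f"unexpected number of fields: {line!r}")
    return LocusLine(tokens[1], int(tokens[2]), molecule_type,
                     topology, division, date)


# Hypothetical sample line, for illustration only.
SAMPLE = "LOCUS       EXAMPLE01    1200 bp    mRNA    linear   PLN 01-JAN-2000"

if __name__ == "__main__":
    print(parse_locus_line(SAMPLE))
```

Splitting on whitespace keeps the sketch short; a production parser would more likely rely on an established library than on hand-written token handling.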

16 A Traditional GenBank Record

17 Bulk Sequence Divisions of GenBank
Batch Submissions ( and ftp)
Inaccurate
Poorly characterized
  EST Expressed Sequence Tag
  STS Sequence Tagged Site
  GSS Genome Survey Sequence
  HTG High Throughput Genomic
  HTC High Throughput cDNA

18 Organization of GenBank
23,087,196 records
  11 Traditional Divisions: Traditional 8%
  1 Patent Division: PAT 4%
  5 Bulk Divisions: EST 67%, GSS 19%, STS, HTG, HTC 2%

19 What is UniGene? A gene-oriented view of sequence entries
MegaBlast-based automated sequence clustering
Nonredundant set of gene-oriented clusters
  Each cluster represents a unique gene
  Provides information on tissue-specific expression and map locations
  Includes well-characterized genes and novel ESTs
Useful for gene discovery and selection of mapping reagents

20 Organisms Represented in UniGene

21 Genome Sequencing
[Flow diagram; labels: Whole BAC insert (or genome), shredding, cloning, isolating, sequencing, GSS division or trace archive, assembly, Draft Sequence (HTG division)]

22 Working Draft Sequence
gaps

23 HTG Division: High Throughput Genome
[Diagram: a record shown at phase 1, phase 2, and phase 3, each labeled "Acc = AC"; division labels: HTG, ROD]

24 HTG Division: High Throughput Genome

25 NCBI’s Third Party Annotation (TPA) Database
New: NCBI now accepts the submission of new annotations of existing GenBank sequences.
Facilitates the annotation of genomes by experts.

26 A Sample TPA record

27 RefSeq: NCBI’s Derivative Sequence Database
Curated transcripts and proteins
  reviewed: human, mouse, rat, fruit fly, zebrafish, Arabidopsis
Human model transcripts and proteins
Assembled Genomic Regions (contigs)
  draft human genome
  mouse genome
Chromosome records
  Microbial
  viral
  organelle

28 The RefSeq Accession Numbers
mRNAs and Proteins
  NM_ Curated mRNA
  NP_ Curated Protein
  NR_ Curated non-coding RNA
  XM_ Predicted Transcript (human, mouse)
  XP_ Predicted Protein (human, mouse)
  XR_ Predicted non-coding RNA
Gene Records
  NG_ Reference Genomic Sequence (human)
Assemblies
  NT_ Contig (Mouse and Human)
  NW_ Supercontig (Mouse)
  NC_ Chromosome (Microbial, Viral, Arabidopsis)
  NR_ Interim Identifier for Microbial Chromosomes
human, mouse, rat, fruit fly, zebrafish, Arabidopsis
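Because the record type is encoded in the first three characters of a RefSeq accession, it can be read off with a simple lookup. The sketch below is illustrative only: the table copies the prefixes listed on this slide, the function name is hypothetical, and the accession `NM_000001.1` is a made-up example. The slide lists NR_ twice; the sketch models only the "Curated non-coding RNA" meaning.

```python
from typing import Optional

# Prefix table taken from the slide above. NR_ also appears there as an
# interim identifier for microbial chromosomes; that second meaning is
# deliberately not modeled in this sketch.
REFSEQ_PREFIXES = {
    "NM_": "Curated mRNA",
    "NP_": "Curated Protein",
    "NR_": "Curated non-coding RNA",
    "XM_": "Predicted Transcript",
    "XP_": "Predicted Protein",
    "XR_": "Predicted non-coding RNA",
    "NG_": "Reference Genomic Sequence",
    "NT_": "Contig",
    "NW_": "Supercontig",
    "NC_": "Chromosome",
}


def describe_refseq_accession(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, else None."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


if __name__ == "__main__":
    # "NM_000001.1" is a hypothetical accession used only for illustration.
    for acc in ("NM_000001.1", "U18238.1"):
        print(acc, "->", describe_refseq_accession(acc))
```

An accession without one of these prefixes, such as the GenBank accession U18238.1 used elsewhere in this deck, returns `None`.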

29 Curated RefSeq Records: NM_, NP_

30 Entrez: Linking and Neighboring

31 The Entrez Databases

32 The (ever) Expanding Entrez System
[Diagram: Entrez at the center, linked to Journals, UniGene, Books, SNP, PubMed, PubMed Central, UniSTS, Nucleotide, PopSet, Protein, ProbeSet, Genome, Structure, Taxonomy, CDD, 3D Domains, OMIM]

33 Entrez Nucleotides
Query: glucose 6 phosphate dehydrogenase

34 Document Summaries: glucose 6 phosphate dehydrogenase[All Fields] = 748 hits

35 Entrez Nucleotides: Limits
Query: glucose 6 phosphate dehydrogenase
Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

36 Entrez Nucleotides: Preview/Index

37 Adding Terms: Preview/Index
Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length . . .
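Adding terms through Preview/Index amounts to building a query of field-qualified terms, in the `term[Field]` form seen in the document summary count for `glucose 6 phosphate dehydrogenase[All Fields]`. The sketch below shows one way to assemble such a query as a string. Several parts are assumptions not taken from the slides: the helper names are hypothetical, `Viridiplantae` is only a guess at how the search was narrowed to plants, and the E-utilities `esearch.fcgi` endpoint with `db=nucleotide` is used as a programmatic counterpart to the web form. The code only builds the URL; it makes no network request.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text: str, field: str) -> str:
    """Qualify a search term with an Entrez field name, e.g. term[Organism]."""
    return f"{text}[{field}]"


def combine_and(*terms: str) -> str:
    """Join field-qualified terms with the Boolean AND operator."""
    return " AND ".join(terms)


def esearch_url(db: str, term: str) -> str:
    """Build (but do not fetch) an ESearch URL for the given query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    query = combine_and(
        field_term("glucose 6 phosphate dehydrogenase", "All Fields"),
        field_term("Viridiplantae", "Organism"),  # hypothetical plant limit
    )
    print(query)
    print(esearch_url("nucleotide", query))
```

Keeping query construction separate from URL encoding means the same query string can be pasted directly into the Entrez search box.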

38 Plant G6PD mRNAs

39 Display: Formats, Links, and Neighbors
Formats: Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list
Links and neighbors: LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links, UniSTS Links

40 FASTA definition line
>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA

Parts of the definition line >gi|603218|gb|U18238.1|MSU18238
  gi number: 603218
  Database identifier: gb
  Accession number: U18238.1
  Locus name: MSU18238
Database identifiers
  gb GenBank
  emb EMBL
  dbj DDBJ
  sp SWISS-PROT
  pdb Protein Databank
  pir PIR
  prf PRF
  ref RefSeq
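The pipe-delimited layout of the definition line lends itself to a short parsing sketch. The function name below is hypothetical, the database tag table is copied from this slide, and the code assumes exactly the five-field `gi|number|db|accession|locus` shape shown here; definition lines from other databases may be laid out differently and are rejected rather than guessed at.

```python
# Database identifier tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Split a '>gi|...|db|accession|locus description' line into its parts."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Everything up to the first space is the identifier block.
    header, _, description = defline[1:].partition(" ")
    fields = header.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"unsupported identifier layout: {header!r}")
    _, gi, db_tag, accession, locus = fields
    return {
        "gi": int(gi),
        "db_tag": db_tag,
        "database": DB_TAGS.get(db_tag, "unknown"),
        "accession": accession,
        "locus": locus,
        "description": description,
    }


if __name__ == "__main__":
    line = (">gi|603218|gb|U18238.1|MSU18238 "
            "Medicago sativa glucose-6-phosphate dehyd")
    print(parse_defline(line))
```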

41 Entrez Genome

42 Organism Pages

43 The Map Viewer: a common platform for integrated display

44 The Map Viewer

45 Entrez PubMed

46 Online Books

47 Entrez Specialized Databases
Taxonomy
  Searchable taxonomic tree having nodes for all species with records in an Entrez database
OMIM
  Online Mendelian Inheritance in Man: a database of genetically linked human diseases
ProbeSet
  Expression data (GEO) and microarray datasets

48 Entrez Taxonomy

49 Entrez OMIM

50 Entrez ProbeSet

51 Trace Archive

52 Entrez Structure

53 Structure Summary
[Screenshot labels: Cn3D viewer, Related Structures, Conserved Domains]

54 Cn3D: Displaying Structures

55 Structural Alignment
