NCBI FieldGuide National Center for Biotechnology Information A Field Guide to GenBank and NCBI’s Molecular Biology Resources August 30, 2005 University.


1 NCBI FieldGuide National Center for Biotechnology Information A Field Guide to GenBank and NCBI’s Molecular Biology Resources August 30, 2005 University of Colorado Health Sciences Center

2 NCBI FieldGuide Topics
  • About NCBI
  • GenBank overview
  • Primary vs derivative databases
  • The Reference Sequence (RefSeq) project
  • Entrez databases
  • Genome resources
  • Bookshelf
  -break-
  • Entrez text searching
  • BLAST sequence searching
  • VAST structure searching
  • An integrated example

3 NCBI FieldGuide The National Institutes of Health Bethesda, MD

4 NCBI FieldGuide The National Center for Biotechnology Information
  • Accepts submissions of primary data
  • Develops tools to analyze these data
  • Creates derivative databases based on the primary data
  • Provides free search, link, and retrieval of these data, primarily through the Entrez system

5 NCBI FieldGuide NCBI WWW Users per Day

6 NCBI FieldGuide Number of Users Per Day [Figure: x-axis 1997 to 2003; annotation: Christmas & New Year]

7 NCBI FieldGuide Homepage - accessing the data all[filter]

8 NCBI FieldGuide all[filter] 1/11/2005 3/15/2005 8/15/2005

9 NCBI FieldGuide Entrez Nucleotide
  Primary Data
    • GenBank / DDBJ / EMBL: 57.3 million (97.4 %)
  Derivative Data
    • RefSeq: 1.47 million (2.5 %)
    • RefSeq reviewed: 60,000
    • PDB (structures): 5,973
  “Total”: 59 million
  GenBank # records

10 NCBI FieldGuide GenBank: NCBI’s Primary Sequence Database
  ftp://ftp.ncbi.nih.gov/genbank/
  ftp://genbank.sdsc.edu/pub
  ftp://bio-mirror.net/biomirror/genbank
  Release 149, August 2005
    • 47 x 10^6 Records
    • 52 x 10^9 Nucleotides
    • 195 Gigabytes
    • 816 files
  • full release every two months
  • incremental and cumulative updates daily
  • available only through internet
  • release notes: gbrel.txt
  Over 100 billion bases!

11 NCBI FieldGuide What is GenBank?
  • Nucleotide only sequence database
  • Archival in nature
  • GenBank Data
      • Direct submissions (traditional records)
      • Batch submissions (EST, GSS, STS)
      • ftp accounts (genome data)
  • Three collaborating databases
      • GenBank
      • DNA Database of Japan (DDBJ)
      • European Molecular Biology Laboratory (EMBL) Database

12 NCBI FieldGuide GenBank Divisions
  “Organismal”
    PRI (28) Primate
    ROD (15) Rodent
    PLN (13) Plant and Fungal
    BCT (11) Bacterial/Archaeal
    INV (7) Invertebrate
    VRT (7) Other Vertebrate
    VRL (4) Viral
    MAM (2) Mammalian
    PHG (1) Phage
    SYN (1) Synthetic
    UNA (1) Unannotated
    • Organized by taxonomy (sort of)
    • Direct submissions (Sequin/BankIt)
    • Accurate (~1 error per 10,000 bp)
    • Well characterized
  “Functional”
    EST (377) Expressed Sequence Tag
    GSS (138) Genome Survey Sequence
    HTG (63) High Throughput Genomic
    PAT (17) Patent
    STS (9) Sequence Tagged Site
    CON (1) Contigs, virtual
    • Organized by sequence type
    • Batch submissions (ftp/email)
    • Inaccurate
    • Poorly characterized
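The split between the two kinds of division can be written down as a small lookup table. This is a minimal Python sketch: the codes and names are copied from the slide, while the function name `division_group` is a hypothetical helper introduced only for illustration.

```python
# Division codes and names as listed on the slide.
ORGANISMAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "INV": "Invertebrate",
    "VRT": "Other Vertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
FUNCTIONAL = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent",
    "STS": "Sequence Tagged Site", "CON": "Contigs, virtual",
}


def division_group(code):
    """Return (group, name) for a three-letter GenBank division code.

    Hypothetical helper: 'organismal' divisions are organized by taxonomy,
    'functional' divisions by sequence type.
    """
    code = code.strip().upper()
    if code in ORGANISMAL:
        return "organismal", ORGANISMAL[code]
    if code in FUNCTIONAL:
        return "functional", FUNCTIONAL[code]
    raise KeyError(f"unknown division code: {code!r}")


if __name__ == "__main__":
    for code in ("PRI", "htg", "EST"):
        print(code, division_group(code))
```

A record's division code appears in its LOCUS line, so a lookup like this gives a quick hint about how well characterized the record is likely to be.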

13 NCBI FieldGuide GenBank Functional (Bulk) Divisions
  [Diagram labels: GenBank; EST; STS; GSS; HTG; Whole Genome Shotgun]
  • Expressed Sequence Tag
      • 1st pass single read cDNA
  • Genome Survey Sequence
      • 1st pass single read gDNA
  • High Throughput Genomic
      • incomplete sequences of genomic clones
  • Sequence Tagged Site
      • PCR-based mapping reagents

14 NCBI FieldGuide EST Division: Expressed Sequence Tags
  [Diagram labels: nucleus; 30,000 genes; RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; 5’; 3’]
  - isolate unique clones
  - sequence once from each end
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
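Because ESTs are single-pass reads, they usually contain uncalled bases, written as N. The following Python sketch parses FASTA text and counts the N characters in each read. The names `parse_fasta` and `count_ambiguous` are hypothetical, and `SAMPLE` holds only the first sequence line of each read on the slide, to keep the example short.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_ambiguous(seq):
    """Count uncalled bases (N) in a nucleotide sequence."""
    return seq.upper().count("N")


# Shortened sample: first sequence line of each read shown on the slide.
SAMPLE = """\
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
"""

if __name__ == "__main__":
    for header, seq in parse_fasta(SAMPLE):
        print(header, len(seq), count_ambiguous(seq))
```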

15 NCBI FieldGuide GSS, WGS, HTG [Diagram labels: shred; Whole BAC insert (or genome); isolate clones; sequence; GSS division or trace archive; Draft sequence (HTG division); assembly; whole genome shotgun assemblies (traditional division)]

16 NCBI FieldGuide HTG Example: Honeybee Draft Sequences
  • Unfinished sequences of BACs
  • Gaps and unordered pieces
  • Finished sequences (Phase 3) move to traditional GenBank division
  LOCUS       AC141845 147720 bp DNA linear HTG 19-MAR-2004
  DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
  ACCESSION   AC141845
  VERSION     AC141845.1 GI:29124029
  KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
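The LOCUS line of the record above packs several fields into one line: name, length, molecule type, topology, division, and date. Here is a minimal Python sketch that splits it apart. It assumes the eight-token layout shown on this slide; real flat-file records can deviate from that, so this is illustrative rather than a general GenBank parser. `Locus` and `parse_locus` are hypothetical names.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule: str
    topology: str
    division: str
    date: str


def parse_locus(line):
    """Parse a LOCUS line laid out like the one on the slide.

    Assumes exactly eight whitespace-separated tokens:
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    """
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, topology, division, date = fields
    return Locus(name, int(length), molecule, topology, division, date)


if __name__ == "__main__":
    locus = parse_locus("LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004")
    # An HTG division code marks an unfinished, high-throughput genomic record.
    print(locus.name, locus.length_bp, locus.division)
```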

17 NCBI FieldGuide Whole Genome Shotgun Projects
  • 351 projects
      • Bacteria (251)
      • Environmental sequences (6)
      • Archaea (6)
      • Eukaryotes (88), including:
          • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
          • Pufferfish (2)
          • Honeybee, Anopheles, Fruit Flies (3), Silkworm
          • Nematode (2)
          • Yeasts (8), Aspergillus (2)
          • Rice (2)

18 NCBI FieldGuide Whole Genome Shotgun (WGS) Projects wgs master[properties]

19 NCBI FieldGuide Derivative Databases
  [Diagram labels]
  GenBank (Updated ONLY by submitters): Sequencing Centers; Labs; divisions EST, STS, HTG, GSS, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
  Updated by NCBI: UniGene; UniSTS; RefSeq (Entrez Gene and annotation pipelines)

20 NCBI FieldGuide Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]

21 NCBI FieldGuide Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]

22 NCBI FieldGuide human[organism] AND lipase[title] AND endothelial[title]
  [Screenshot labels: 3927 bp; 4150 bp; 3927 bp; 2323 bp; 261 bp]
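The fielded queries on these slides all follow one pattern: `value[field]` clauses joined with AND. The Python sketch below builds such a term and wraps it in a search URL. The E-utilities `esearch.fcgi` endpoint and the `nuccore` database name are not part of the presentation and are assumptions here. `entrez_term` and `esearch_url` are hypothetical helper names. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities search endpoint (not mentioned on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(*clauses):
    """Join (value, field) pairs into an Entrez query such as
    'human[organism] AND lipase[title]'."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)


def esearch_url(db, term):
    """Build a search URL for the given database and query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    term = entrez_term(
        ("human", "organism"), ("lipase", "title"), ("endothelial", "title")
    )
    print(term)
    # Assumption: 'nuccore' as the database name for Entrez Nucleotide.
    print(esearch_url("nuccore", term))
```

Each added clause narrows the result set, which is the point these slides make: the broad lipase query returns many redundant records, and the narrower one still returns several records of differing lengths.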

23 NCBI FieldGuide RefSeq Benefits
  • genomes, transcripts, proteins
  • non-redundant; best representative
  • updates to reflect current sequence data and biology
  • distinct, stable accession series

24 NCBI FieldGuide Reference Sequence: RefSeq
  Accession         Sequence Type
  NM_123456789      mRNA
  NP_123456789      protein, from NM_
  NR_123456         non-coding RNA
  XM_123456         predicted mRNA
  XP_123456         predicted protein
  XR_123456         predicted non-coding RNA
  ZP_12345678       predicted from NZ_
  NC_123456         genomic, e.g., chromosomes
  NG_123455         genomic, incomplete region
  NT_123456         genomic, BAC assembly
  NW_123456         genomic, WGS assembly
  NZ_ABCD12345678   genomic, WGS collection
  blue=curated
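Since the two-letter prefix encodes the sequence type, a RefSeq accession can be classified without fetching the record. This Python sketch uses the prefixes and descriptions listed on the slide. The function name `refseq_type` is hypothetical, and the table reflects only what the slide shows.

```python
# Prefix -> sequence type, as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type implied by a RefSeq accession prefix."""
    prefix, sep, _ = accession.strip().partition("_")
    prefix = prefix.upper()
    if not sep or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


if __name__ == "__main__":
    for acc in ("NM_123456789", "XP_123456", "NZ_ABCD12345678"):
        print(acc, "->", refseq_type(acc))
```

The underscore is what distinguishes the RefSeq accession series from GenBank accessions such as AC141845, which the function rejects.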

25 NCBI FieldGuide Genomic DNA (NC, NT, NW) Model mRNA (XM) (XR) Curated mRNA (NM) (NR) Model protein (XP) Annotation Process Curated Protein (NP) Scanning.... Genbank Sequences RefSeq

26 NCBI FieldGuide Creating NM_ Records NM’s must have cDNA support Genome annotation Longest mRNA transcript variant 1 transcript variant 2 transcript variant 3

27 NCBI FieldGuide Where is RefSeq?

28 NCBI FieldGuide The Entrez System
  [Diagram: Entrez at the center, linked to] Nucleotide, PubMed, Protein, Taxonomy, Structure, Domains, 3D Domains, Journals, PMC, OMIM, Books, PopSet, SNP, UniGene, UniSTS, Genome, Gene, GEO, MeSH, Cancer Chromosomes, Homologene, PubChem, GENSAT

29 NCBI FieldGuide A Few Entrez Databases
  • UniGene: Clusters of ESTs, mRNAs
  • dbSNP: Single Nucleotide Polymorphisms
  • GEO: Gene Expression Omnibus; microarray and other expression data
  • CDD: Conserved Domain Database
      protein families (COGs and KOGs)
      single domains (PFAM, SMART, CD)

30 NCBI FieldGuide Gene-oriented clusters of expressed sequences
  • Automatic clustering using MegaBlast
  • Each cluster represents a unique gene
  • Informed by genome hits
  • Information on tissue types and map locations
  • Useful for gene discovery and selection of mapping reagents
  UniGene: unique gene

31 NCBI FieldGuide A Cluster of ESTs query 5’ EST hits 3’ EST hits

32 NCBI FieldGuide UniGene Collections

33 NCBI FieldGuide Example UniGene Cluster

34 NCBI FieldGuide Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)

35 NCBI FieldGuide UniGene Cluster Hs.95351 SELECTED PROTEIN SIMILARITIES

36 NCBI FieldGuide UniGene Cluster Hs.95351 GENE EXPRESSION

37 NCBI FieldGuide UniGene Cluster Hs.95351: expression

38 NCBI FieldGuide UniGene Cluster Hs.95351: seqs

39 NCBI FieldGuide Download sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

40 NCBI FieldGuide Entrez GEO

41 NCBI FieldGuide NCBI’s SNP Database
  • Primary and derivative (RefSNP)
  • Single nucleotide polymorphisms
  • Repeat polymorphisms
  • Insertion-deletion polymorphisms
  • Over 19 million refSNPs (rsXXXXXXX) (August 2005)

42 NCBI FieldGuide Searching dbSNP

43 NCBI FieldGuide RefSNP

44 NCBI FieldGuide RefSNP

45 NCBI FieldGuide RefSNP

46 NCBI FieldGuide RefSNP Search Mouse SNP between strains

47 NCBI FieldGuide RefSNP MapView GeneView SeqView OMIM No 3D

48 NCBI FieldGuide RefSNP

49 NCBI FieldGuide Entrez GEO

50 NCBI FieldGuide
  GPL  Platform descriptions
  GSM  Raw/processed spot intensities from a single slide/chip
  GSE  Grouping of slide/chip data, “a single experiment”
  GDS  Grouping of experiments
  [Source labels: Curated by NCBI; Submitted by Experimentalists; Submitted by Manufacturer*]
  [Database labels: Entrez GEO; Entrez GEO Datasets]
  GEO SaMple: experimental conditions
  GEO SEries: set of related samples

51 NCBI FieldGuide What’s a DataSet?
  Supplied by submitter:
    • Platform (GPL): array definition
    • Sample (GSM): hyb. measurements
    • Series (GSE): related Samples
  Assembled by GEO staff:
    • DataSet (GDS): A collection of experimentally-related samples processed using the same platform. Samples within DataSets are organized into subgroups based on experimental variables. DataSets form the basis of GEO’s query, analysis and data display tools.
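The four GEO accession prefixes identify both the record type and its source. The Python sketch below classifies an accession by prefix, using the Platform/Sample/Series/DataSet breakdown from this slide. `geo_record` is a hypothetical helper, and the accession numbers in the usage lines are made up for illustration.

```python
import re

# Prefix -> record type, as on the slide.
GEO_TYPES = {"GPL": "Platform", "GSM": "Sample", "GSE": "Series", "GDS": "DataSet"}
# Platform, Sample and Series records are supplied by submitters;
# DataSets are assembled by GEO staff.
SUBMITTER_SUPPLIED = {"GPL", "GSM", "GSE"}

_ACCESSION = re.compile(r"(GPL|GSM|GSE|GDS)(\d+)")


def geo_record(accession):
    """Classify a GEO accession by its three-letter prefix."""
    match = _ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO-style accession: {accession!r}")
    prefix, number = match.groups()
    source = "submitter" if prefix in SUBMITTER_SUPPLIED else "GEO staff"
    return {"type": GEO_TYPES[prefix], "number": int(number), "source": source}


if __name__ == "__main__":
    # Illustrative accession numbers, not taken from the presentation.
    for acc in ("GPL96", "GSM1000", "GSE100", "GDS5"):
        print(acc, geo_record(acc))
```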

52 NCBI FieldGuide Gene Expression Omnibus (GEO) Dataset browser

53 NCBI FieldGuide GEO Dataset Browser

54 NCBI FieldGuide GEO Dataset Report

55 NCBI FieldGuide GEO Profiles … of 12625

56 NCBI FieldGuide Entrez CDD

57 NCBI FieldGuide Conserved Domain Database
  • Multiple sequence alignments
  • Position-specific scoring matrices (PSSM)
  • Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)

58 NCBI FieldGuide CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE

59 NCBI FieldGuide CDD CD Pfam COG Click on a colored bar to align your sequence to the CD

60 NCBI FieldGuide Conserved Domain Database: cd00371.1, HMA

61 NCBI FieldGuide CDD

62 NCBI FieldGuide CDART: Conserved Domain Architecture Retrieval Tool

63 NCBI FieldGuide cdd Linking from Entrez Protein

64 NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology

65 NCBI FieldGuide Genomic Biology

66 NCBI FieldGuide Gen Biol: Gen Resources

67 NCBI FieldGuide Gen Biol: Gen Resources

68 NCBI FieldGuide Gen Biol: Gen Resources

69 NCBI FieldGuide Genome Projects: microb

70 NCBI FieldGuide Gen Biol: Gen Resources

71 NCBI FieldGuide Gen Biol: Gen Resources

72 NCBI FieldGuide Gen Biol: Gen Resources

73 NCBI FieldGuide Gen Biol: Gen Resources

74 NCBI FieldGuide Gen Biol: Gen Resources

75 NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology

76 NCBI FieldGuide Entrez Gene
  A single query interface to …
    Sequences
      - RefSeqs
      - GenBank
      - Homologene
    Maps - MapViewer
    Entrez links
    Linkouts
  More organisms, ~3000
  Entrez integration

77 NCBI FieldGuide Global Entrez: NADH2

78 NCBI FieldGuide Entrez Gene: NADH2

79 NCBI FieldGuide Gene Record for Pongo NADH2 Homo sapiens Not found with “nadh2”

80 NCBI FieldGuide A Record With More Data: Human HFE

81 NCBI FieldGuide Human HFE: Transcripts Transcripts with experimental evidence

82 NCBI FieldGuide Gene Table

83 NCBI FieldGuide Introns/Exons: Gene Table links to sequence

84 NCBI FieldGuide Human HFE: Links

85 NCBI FieldGuide Genotype

86 NCBI FieldGuide Genotype

87 NCBI FieldGuide Human HFE: Links

88 NCBI FieldGuide GeneView in dbSNP

89 NCBI FieldGuide SNP in Structure

90 NCBI FieldGuide SNP in Structure

91 NCBI FieldGuide SNP in Structure H41 S43 C260

92 NCBI FieldGuide Another Variation Source: OMIM

93 NCBI FieldGuide Variants in OMIM

94 NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology

95 NCBI FieldGuide The New Homologene
  Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.
  • No longer UniGene based
  • Protein similarities first
  • Guided by taxonomic tree
  • Includes orthologs and paralogs

96 NCBI FieldGuide The New Homologene Homologene Build 43.1 (8/23/05) Species Number of genes input grouped groups

97 NCBI FieldGuide RAG1 → Homologene

98 NCBI FieldGuide RAG1 → Homologene RAG1

99 NCBI FieldGuide RAG1 RING-finger

100 NCBI FieldGuide RAG1 → Homologene RAG1

101 NCBI FieldGuide RAG1 Sugar_tr

102 NCBI FieldGuide Homologene: alignment scores

103 NCBI FieldGuide BLASTP bl2seq

104 NCBI FieldGuide Genome Resources LocusLink Gene database UniGene Trace Archive Map Viewer Homologene

105 NCBI FieldGuide List View

106 NCBI FieldGuide Human MapViewer adar

107 NCBI FieldGuide MapViewer: Human ADAR

108 NCBI FieldGuide MV Hs ADAR 3’ UTR 5’ UTR

109 NCBI FieldGuide Maps & Options
  --Sequence maps--
    Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST
  --Cytogenetic maps--
    Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
  --Genetic Maps--
    deCODE, Genethon, Marshfield
  --RH maps--
    GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
  Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST
  Variation Maps & Options = SNP

110 NCBI FieldGuide MapViewer UniGene Component Repeats Gene

111 NCBI FieldGuide Gene Phenotype Variation

112 NCBI FieldGuide Maps & Options

113 NCBI FieldGuide Genome Resources LocusLink Gene database UniGene Trace Archive Map Viewer Homologene

114 NCBI FieldGuide Trace Archive Page

115 NCBI FieldGuide Macaca mulatta Traces

116 NCBI FieldGuide

117 Trace Archive BLAST Page Access to sequences NOT in GenBank

118 NCBI FieldGuide Literature Links

119 NCBI FieldGuide BOOKS Database

120 NCBI FieldGuide BOOKS Database: hyperlinked

121 NCBI FieldGuide BOOKS Database

122 NCBI FieldGuide BOOKS Database

123 NCBI FieldGuide BOOKS Database

124 NCBI FieldGuide Genes & Dis

125 NCBI FieldGuide Genes & Dis

126 NCBI FieldGuide For More Information…

127 NCBI FieldGuide Intermission

