Published by Eileen Goodwin. Modified over 9 years ago.
1
NCBI FieldGuide
National Center for Biotechnology Information
A Field Guide to GenBank and NCBI’s Molecular Biology Resources
August 30, 2005
University of Colorado Health Sciences Center
2
NCBI FieldGuide Topics
  About NCBI
  GenBank overview
  Primary vs derivative databases
  The Reference Sequence (RefSeq) project
  Entrez databases
  Genome resources
  Bookshelf
  -break-
  Entrez text searching
  BLAST sequence searching
  VAST structure searching
  An integrated example
3
NCBI FieldGuide The National Institutes of Health Bethesda, MD
4
NCBI FieldGuide The National Center for Biotechnology Information
  Accepts submissions of primary data
  Develops tools to analyze these data
  Creates derivative databases based on the primary data
  Provides free search, link, and retrieval of these data, primarily through the Entrez system
5
NCBI FieldGuide NCBI WWW Users per Day
6
NCBI FieldGuide Number of Users Per Day
[Chart: users per day by year, 1997 to 2003; annotation: "Christmas & New Year"]
7
NCBI FieldGuide Homepage - accessing the data all[filter]
8
NCBI FieldGuide all[filter] 1/11/2005 3/15/2005 8/15/2005
9
NCBI FieldGuide Entrez Nucleotide: # records
  Primary data
    GenBank / DDBJ / EMBL    57.3 million (97.4%)
  Derivative data
    RefSeq                   1.47 million (2.5%)
    RefSeq reviewed          60,000
    PDB (structures)         5,973
  “Total”                    59 million
10
NCBI FieldGuide GenBank: NCBI’s Primary Sequence Database
  Release 149, August 2005
    47 x 10^6 records
    52 x 10^9 nucleotides
    195 gigabytes
    816 files
  Over 100 billion bases!
  Full release every two months
  Incremental and cumulative updates daily
  Available only through the internet
  Release notes: gbrel.txt
  FTP sites
    ftp://ftp.ncbi.nih.gov/genbank/
    ftp://genbank.sdsc.edu/pub
    ftp://bio-mirror.net/biomirror/genbank
11
NCBI FieldGuide What is GenBank?
  Nucleotide-only sequence database
  Archival in nature
  GenBank data
    Direct submissions (traditional records)
    Batch submissions (EST, GSS, STS)
    ftp accounts (genome data)
  Three collaborating databases
    GenBank
    DNA Database of Japan (DDBJ)
    European Molecular Biology Laboratory (EMBL) Database
12
NCBI FieldGuide GenBank Divisions
  “Organismal” divisions
    PRI (28)   Primate
    ROD (15)   Rodent
    PLN (13)   Plant and Fungal
    BCT (11)   Bacterial/Archaeal
    INV (7)    Invertebrate
    VRT (7)    Other Vertebrate
    VRL (4)    Viral
    MAM (2)    Mammalian
    PHG (1)    Phage
    SYN (1)    Synthetic
    UNA (1)    Unannotated
    Organized by taxonomy (sort of)
    Direct submissions (Sequin/Bankit)
    Accurate (~1 error per 10,000 bp)
    Well characterized
  “Functional” divisions
    EST (377)  Expressed Sequence Tag
    GSS (138)  Genome Survey Sequence
    HTG (63)   High Throughput Genomic
    PAT (17)   Patent
    STS (9)    Sequence Tagged Site
    CON (1)    Contigs, virtual
    Organized by sequence type
    Batch submissions (ftp/email)
    Inaccurate
    Poorly characterized
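The division codes above lend themselves to a simple lookup. The following is a minimal sketch, assuming only the codes and names listed on this slide; the function name `describe_division` is hypothetical and not part of any NCBI tool. The three-letter code is the one that appears in a record's LOCUS line (for example `HTG` in the honeybee record later in this deck).

```python
# Division codes and names as listed on the slide; the split into
# "organismal" and "functional" follows the slide's two groups.
ORGANISMAL = {
    "PRI": "Primate",
    "ROD": "Rodent",
    "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal",
    "INV": "Invertebrate",
    "VRT": "Other Vertebrate",
    "VRL": "Viral",
    "MAM": "Mammalian",
    "PHG": "Phage",
    "SYN": "Synthetic",
    "UNA": "Unannotated",
}
FUNCTIONAL = {
    "EST": "Expressed Sequence Tag",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "PAT": "Patent",
    "STS": "Sequence Tagged Site",
    "CON": "Contigs, virtual",
}


def describe_division(code):
    """Return (category, name) for a three-letter division code.

    Hypothetical helper for illustration only.
    """
    code = code.upper()
    if code in ORGANISMAL:
        return ("organismal", ORGANISMAL[code])
    if code in FUNCTIONAL:
        return ("functional", FUNCTIONAL[code])
    raise KeyError(f"unknown GenBank division: {code}")
```

For instance, `describe_division("htg")` gives the category and full name of the HTG division, which is a quick hint about how well characterized a record is likely to be.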
13
NCBI FieldGuide GenBank Functional (Bulk) Divisions
  EST: Expressed Sequence Tag
    1st pass single read cDNA
  GSS: Genome Survey Sequence
    1st pass single read gDNA
  HTG: High Throughput Genomic
    incomplete sequences of genomic clones
  STS: Sequence Tagged Site
    PCR-based mapping reagents
  Whole Genome Shotgun
14
NCBI FieldGuide EST Division: Expressed Sequence Tags
[Diagram: nucleus, 30,000 genes → RNA gene products → make cDNA library (80-100,000 unique cDNA clones in library) → isolate unique clones → sequence once from each end (5’ and 3’)]
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
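Because ESTs are single-pass reads, they typically contain ambiguous base calls (N), as the two reads above do. A minimal sketch of reading FASTA-formatted text and measuring the fraction of N calls is shown here. The helper names `parse_fasta` and `n_fraction` and the shortened example reads are hypothetical; they are not the full sequences from the slide.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq):
    """Fraction of positions called as N (ambiguous base)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


# Hypothetical, shortened reads for illustration only.
example = """\
>clone_X 3' read
NNTCAAGTTT
TATGANGTAA
>clone_X 5' read
GACAGCATTC
"""
reads = list(parse_fasta(example))
```

A simple quality screen could then keep only reads whose `n_fraction` falls below a chosen threshold; the threshold itself is a judgment call and is not specified in the slides.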
15
NCBI FieldGuide GSS, WGS, HTG
[Diagram: whole BAC insert (or genome) → shred → isolate clones → sequence → GSS division or trace archive → assembly → draft sequence (HTG division) or whole genome shotgun assemblies (traditional division)]
16
NCBI FieldGuide HTG Example: Honeybee Draft Sequences
  Unfinished sequences of BACs
  Gaps and unordered pieces
  Finished sequences (Phase 3) move to traditional GenBank division
LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION AC141845
VERSION AC141845.1 GI:29124029
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.
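The header fields of the record above can be pulled apart with a few lines of code. This is a minimal sketch, assuming the usual flat-file layout in which a keyword starts in the first column and continuation lines are indented. The exact column spacing and line wrapping of the example record are assumptions (the slide text is flattened), and `parse_genbank_header` is a hypothetical helper, not an NCBI tool.

```python
def parse_genbank_header(text):
    """Parse a few header fields from a GenBank flat-file record.

    Indented lines are treated as continuations of the previous field.
    Only the fields shown on the slide are handled.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] != " ":
            key, _, value = line.partition(" ")
            current = key
            fields[current] = value.strip()
        elif current is not None:
            fields[current] += " " + line.strip()
    locus = fields["LOCUS"].split()
    keywords = fields["KEYWORDS"].rstrip(".").split(";")
    return {
        "name": locus[0],
        "length_bp": int(locus[1]),
        "division": locus[-2],   # three-letter division code
        "date": locus[-1],
        "definition": fields["DEFINITION"].rstrip("."),
        "accession": fields["ACCESSION"],
        "version": fields["VERSION"].split()[0],
        "keywords": [k.strip() for k in keywords],
    }


# Record text from the slide; spacing and wrapping are assumed.
record = """\
LOCUS       AC141845   147720 bp    DNA     linear   HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14
            unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
"""
info = parse_genbank_header(record)
is_draft = "HTGS_DRAFT" in info["keywords"]
```

Checking the keywords for `HTGS_PHASE1` or `HTGS_DRAFT` is one way to tell an unfinished HTG record from a finished (Phase 3) one that has moved to a traditional division.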
17
NCBI FieldGuide Whole Genome Shotgun Projects
351 projects
  Bacteria (251)
  Environmental sequences (6)
  Archaea (6)
  Eukaryotes (88), including:
    Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
    Pufferfish (2)
    Honeybee, Anopheles, Fruit Flies (3), Silkworm
    Nematode (2)
    Yeasts (8), Aspergillus (2)
    Rice (2)
18
NCBI FieldGuide Whole Genome Shotgun (WGS) Projects wgs master[properties]
19
NCBI FieldGuide Derivative Databases
[Diagram]
  GenBank: updated ONLY by submitters
    Sources: Sequencing Centers, Labs
    Divisions: EST, STS, HTG, GSS, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
  Derivative databases: updated by NCBI
    UniGene
    UniSTS
    RefSeq: Entrez Gene and annotation pipelines
20
NCBI FieldGuide Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]
21
NCBI FieldGuide Why Make Reference Sequences? Entrez Nucleotide query: human[organism] AND lipase[title]
22
NCBI FieldGuide human[organism] AND lipase[title] AND endothelial[title]
  Record lengths shown: 3927 bp, 4150 bp, 3927 bp, 2323 bp, 261 bp
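The query syntax used on these slides is a set of `value[field]` terms joined by Boolean operators. A minimal sketch of assembling such a query is shown below. It also builds a URL for NCBI's E-utilities ESearch endpoint, which these slides do not cover. The helper names and the choice of `nucleotide` as the database name are assumptions, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode


def entrez_term(*clauses):
    """Join (value, field) pairs into an Entrez query using AND."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)


def esearch_url(db, term):
    """Build (but do not fetch) an E-utilities ESearch URL."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": db, "term": term})


term = entrez_term(
    ("human", "organism"),
    ("lipase", "title"),
    ("endothelial", "title"),
)
url = esearch_url("nucleotide", term)  # database name is an assumption
```

Adding a title term narrows the result set, as the slides show, but several overlapping records for the same gene can remain. That redundancy is the motivation for RefSeq.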
23
NCBI FieldGuide RefSeq Benefits
  genomes, transcripts, proteins
  non-redundant; best representative
  updates to reflect current sequence data and biology
  distinct, stable accession series
24
NCBI FieldGuide Reference Sequence: RefSeq
  Accession          Sequence Type
  NM_123456789       mRNA
  NP_123456789       protein, from NM_
  NR_123456          non-coding RNA
  XM_123456          predicted mRNA
  XP_123456          predicted protein
  XR_123456          predicted non-coding RNA
  ZP_12345678        predicted from NZ_
  NC_123456          genomic, e.g., chromosomes
  NG_123455          genomic, incomplete region
  NT_123456          genomic, BAC assembly
  NW_123456          genomic, WGS assembly
  NZ_ABCD12345678    genomic, WGS collection
  blue=curated
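Since the two-letter prefix encodes the sequence type, a RefSeq accession can be classified without looking up the record. This is a minimal sketch based only on the prefixes in the table above. `classify_refseq` is a hypothetical helper, and the "predicted" flag simply mirrors the wording of the slide.

```python
# Prefix descriptions copied from the slide's table.
REFSEQ_TYPES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def classify_refseq(accession):
    """Return (prefix, description, is_predicted) for a RefSeq accession.

    Raises ValueError for anything that lacks a known PREFIX_ part.
    """
    prefix, sep, rest = accession.strip().upper().partition("_")
    if not sep or not rest or prefix not in REFSEQ_TYPES:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    description = REFSEQ_TYPES[prefix]
    return prefix, description, description.startswith("predicted")
```

A GenBank accession such as `AC141845` has no underscore-delimited prefix and is rejected, which is a quick way to separate primary records from RefSeq records in a mixed list.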
25
NCBI FieldGuide Annotation Process
[Diagram labels: GenBank sequences; Scanning....; RefSeq]
  Genomic DNA (NC, NT, NW)
  Model mRNA (XM), (XR)
  Model protein (XP)
  Curated mRNA (NM), (NR)
  Curated protein (NP)
26
NCBI FieldGuide Creating NM_ Records
  NM_ records must have cDNA support
  Genome annotation
  Longest mRNA
  transcript variant 1
  transcript variant 2
  transcript variant 3
27
NCBI FieldGuide Where is RefSeq?
28
NCBI FieldGuide The Entrez System
  Entrez databases: Nucleotide, PubMed, Protein, Taxonomy, Structure, Domains, 3D Domains, Journals, PMC, OMIM, Books, PopSet, SNP, UniGene, UniSTS, Genome, Gene, GEO, MeSH, Cancer Chromosomes, Homologene, PubChem, GENSAT
29
NCBI FieldGuide A Few Entrez Databases
  UniGene: clusters of ESTs, mRNAs
  dbSNP: Single Nucleotide Polymorphisms
  GEO: Gene Expression Omnibus
    microarray and other expression data
  CDD: Conserved Domain Database
    protein families (COGs and KOGs)
    single domains (PFAM, SMART, CD)
30
NCBI FieldGuide UniGene (unique gene)
  Gene-oriented clusters of expressed sequences
  Automatic clustering using MegaBlast
  Each cluster represents a unique gene
  Informed by genome hits
  Information on tissue types and map locations
  Useful for gene discovery and selection of mapping reagents
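The idea of grouping expressed sequences into gene-oriented clusters can be illustrated with single-linkage clustering over pairwise hits. This is a simplified sketch, not the UniGene pipeline. It assumes the significant pairwise hits (for example from MegaBlast) have already been computed, and it ignores the genome-hit information the slide mentions. The sequence names are hypothetical.

```python
def cluster_by_hits(sequences, hits):
    """Single-linkage clustering with union-find.

    Any two sequences connected by a chain of hits end up in the
    same cluster. `hits` is an iterable of (seq_a, seq_b) pairs.
    """
    parent = {s: s for s in sequences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in sequences:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical sequence names and precomputed hits.
seqs = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
clusters = cluster_by_hits(seqs, hits)
```

Here `est1` and `est2` never hit each other directly, yet they share a cluster because both hit `mrna1`. This is how 5’ and 3’ ESTs from the same transcript can be brought together.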
31
NCBI FieldGuide A Cluster of ESTs query 5’ EST hits 3’ EST hits
32
NCBI FieldGuide UniGene Collections
33
NCBI FieldGuide Example UniGene Cluster
34
NCBI FieldGuide Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)
35
NCBI FieldGuide UniGene Cluster Hs.95351 SELECTED PROTEIN SIMILARITIES
36
NCBI FieldGuide UniGene Cluster Hs.95351 GENE EXPRESSION
37
NCBI FieldGuide UniGene Cluster Hs.95351: expression
38
NCBI FieldGuide UniGene Cluster Hs.95351: seqs
39
NCBI FieldGuide Download sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
40
NCBI FieldGuide Entrez GEO
41
NCBI FieldGuide NCBI’s SNP Database
  Primary and derivative (RefSNP)
  Single nucleotide polymorphisms
  Repeat polymorphisms
  Insertion-deletion polymorphisms
  Over 19 million refSNPs (rsXXXXXXX) (August 2005)
42
NCBI FieldGuide Searching dbSNP
43
NCBI FieldGuide RefSNP
44
NCBI FieldGuide RefSNP
45
NCBI FieldGuide RefSNP
46
NCBI FieldGuide RefSNP Search Mouse SNP between strains
47
NCBI FieldGuide RefSNP MapView GeneView SeqView OMIM No 3D
48
NCBI FieldGuide RefSNP
49
NCBI FieldGuide Entrez GEO
50
NCBI FieldGuide Entrez GEO and Entrez GEO Datasets
  Prefix  Content                                                    Source
  GPL     Platform descriptions                                      Submitted by manufacturer*
  GSM     Raw/processed spot intensities from a single slide/chip   Submitted by experimentalists
  GSE     Grouping of slide/chip data, “a single experiment”         Submitted by experimentalists
  GDS     Grouping of experiments                                    Curated by NCBI
  GEO SaMple (GSM): experimental conditions
  GEO SEries (GSE): set of related samples
51
NCBI FieldGuide What’s a DataSet?
  Supplied by submitter
    Platform (GPL): array definition
    Sample (GSM): hyb. measurements
    Series (GSE): related Samples
  Assembled by GEO staff
    DataSet (GDS): a collection of experimentally related samples processed using the same platform.
    Samples within DataSets are organized into subgroups based on experimental variables.
    DataSets form the basis of GEO’s query, analysis and data display tools.
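The two rules stated for a DataSet (one platform, samples split into subgroups by experimental variable) can be sketched as a small data structure. This is only an illustration under those two stated rules. The `Sample` class, the `subgroups` function, the accessions, and the variable names are all hypothetical and do not reflect GEO's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One GSM-like record: measurements from a single hybridization."""
    accession: str
    platform: str                       # GPL-like accession of the array
    variables: dict = field(default_factory=dict)


def subgroups(samples, variable):
    """Group sample accessions by the value of one experimental variable.

    Enforces the slide's rule that a DataSet uses a single platform.
    """
    platforms = {s.platform for s in samples}
    if len(platforms) > 1:
        raise ValueError("samples in a DataSet must share one platform")
    groups = defaultdict(list)
    for s in samples:
        groups[s.variables.get(variable, "unspecified")].append(s.accession)
    return dict(groups)


# Hypothetical accessions and variable values.
samples = [
    Sample("GSM_A", "GPL_X", {"agent": "control"}),
    Sample("GSM_B", "GPL_X", {"agent": "control"}),
    Sample("GSM_C", "GPL_X", {"agent": "treated"}),
]
groups = subgroups(samples, "agent")
```

Subgroups of this kind are what make comparisons such as "treated versus control" possible within one DataSet.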
52
NCBI FieldGuide Gene Expression Omnibus (GEO) Dataset browser
53
NCBI FieldGuide GEO Dataset Browser
54
NCBI FieldGuide GEO Dataset Report
55
NCBI FieldGuide GEO Profiles … of 12625
56
NCBI FieldGuide Entrez CDD
57
NCBI FieldGuide Conserved Domain Database
  Multiple sequence alignments
  Position-specific scoring matrices (PSSM)
  Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
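A position-specific scoring matrix is derived from the columns of a multiple sequence alignment. The sketch below shows the basic idea in its simplest form: log-odds scores against a uniform background with a fixed pseudocount. These simplifications are assumptions made for illustration and are not how CDD builds its matrices. The toy alignment is short; its first two rows are motifs that occur in the ATP7A sequence shown later in the deck, and the other two are hypothetical variants.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log2-odds scores per alignment column, uniform background.

    Gap characters are skipped when counting. Simplified sketch only.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(
            row[col] for row in alignment if row[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum of per-position scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy alignment; the last two rows are hypothetical variants.
alignment = ["GMHCNS", "GMTCNS", "GMTCAS", "GLTCNS"]
pssm = build_pssm(alignment)
```

A sequence that matches the conserved columns, such as `GMTCNS`, scores higher than an unrelated one. That is the property a PSSM-based search relies on to detect a domain in a query protein.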
58
NCBI FieldGuide CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP
LLTSPEILRE
59
NCBI FieldGuide CDD CD Pfam COG Click on a colored bar to align your sequence to the CD
60
NCBI FieldGuide Conserved Domain Database: cd00371.1, HMA
61
NCBI FieldGuide CDD
62
NCBI FieldGuide CDART: Conserved Domain Architecture Retrieval Tool
63
NCBI FieldGuide cdd Linking from Entrez Protein
64
NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology
65
NCBI FieldGuide Genomic Biology
66
NCBI FieldGuide Gen Biol: Gen Resources
67
NCBI FieldGuide Gen Biol: Gen Resources
68
NCBI FieldGuide Gen Biol: Gen Resources
69
NCBI FieldGuide Genome Projects: microb
70
NCBI FieldGuide Gen Biol: Gen Resources
71
NCBI FieldGuide Gen Biol: Gen Resources
72
NCBI FieldGuide Gen Biol: Gen Resources
73
NCBI FieldGuide Gen Biol: Gen Resources
74
NCBI FieldGuide Gen Biol: Gen Resources
75
NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology
76
NCBI FieldGuide Entrez Gene
  A single query interface to …
    Sequences: RefSeqs, GenBank, Homologene
    Maps: MapViewer
    Entrez links
    Linkouts
  More organisms, ~3000
  Entrez integration
77
NCBI FieldGuide Global Entrez: NADH2
78
NCBI FieldGuide Entrez Gene: NADH2
79
NCBI FieldGuide Gene Record for Pongo NADH2 Homo sapiens Not found with “nadh2”
80
NCBI FieldGuide A Record With More Data: Human HFE
81
NCBI FieldGuide Human HFE: Transcripts Transcripts with experimental evidence
82
NCBI FieldGuide Gene Table
83
NCBI FieldGuide Introns/Exons: Gene Table links to sequence
84
NCBI FieldGuide Human HFE: Links
85
NCBI FieldGuide Genotype
86
NCBI FieldGuide Genotype
87
NCBI FieldGuide Human HFE: Links
88
NCBI FieldGuide GeneView in dbSNP
89
NCBI FieldGuide SNP in Structure
90
NCBI FieldGuide SNP in Structure
91
NCBI FieldGuide SNP in Structure H41 S43 C260
92
NCBI FieldGuide Another Variation Source: OMIM
93
NCBI FieldGuide Variants in OMIM
94
NCBI FieldGuide Genome Resources Gene database Trace Archive Map Viewer Homologene Genomic Biology
95
NCBI FieldGuide The New Homologene
  Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.
  No longer UniGene based
  Protein similarities first
  Guided by taxonomic tree
  Includes orthologs and paralogs
96
NCBI FieldGuide The New Homologene Homologene Build 43.1 (8/23/05) Species Number of genes input grouped groups
97
NCBI FieldGuide RAG1 → Homologene
98
NCBI FieldGuide RAG1 → Homologene RAG1
99
NCBI FieldGuide RAG1 RING-finger
100
NCBI FieldGuide RAG1 → Homologene RAG1
101
NCBI FieldGuide RAG1 Sugar_tr
102
NCBI FieldGuide Homologene: alignment scores
103
NCBI FieldGuide BLASTP bl2seq
104
NCBI FieldGuide Genome Resources LocusLink Gene database UniGene Trace Archive Map Viewer Homologene
105
NCBI FieldGuide List View
106
NCBI FieldGuide Human MapViewer adar
107
NCBI FieldGuide MapViewer: Human ADAR
108
NCBI FieldGuide MV Hs ADAR 3’ UTR 5’ UTR
109
NCBI FieldGuide Maps & Options
  Sequence maps: Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST
  Cytogenetic maps: Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
  Genetic maps: deCODE, Genethon, Marshfield
  RH maps: GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
  Also listed: Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
  Variation = SNP
110
NCBI FieldGuide MapViewer UniGene Component Repeats Gene
111
NCBI FieldGuide Gene Phenotype Variation
112
NCBI FieldGuide Maps & Options
113
NCBI FieldGuide Genome Resources LocusLink Gene database UniGene Trace Archive Map Viewer Homologene
114
NCBI FieldGuide Trace Archive Page
115
NCBI FieldGuide Macaca mulatta Traces
116
NCBI FieldGuide
117
Trace Archive BLAST Page Access to sequences NOT in GenBank
118
NCBI FieldGuide Literature Links
119
NCBI FieldGuide BOOKS Database
120
NCBI FieldGuide BOOKS Database: hyperlinked
121
NCBI FieldGuide BOOKS Database
122
NCBI FieldGuide BOOKS Database
123
NCBI FieldGuide BOOKS Database
124
NCBI FieldGuide Genes & Dis
125
NCBI FieldGuide Genes & Dis
126
NCBI FieldGuide For More Information…
127
NCBI FieldGuide Intermission