Published by Clement Short. Modified over 9 years ago.
1
NCBI
- Created as a part of NLM in 1988
- Establish public databases
- Research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information
- Tools: BLAST (1990), Entrez (1992)
- GenBank (1992)
- Free MEDLINE (PubMed, 1997)
- Human genome (2001)
2
NCBI Home Page: www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages.
5
Entrez: An Integrated Database Search and Retrieval System
6
The (ever) Expanding Entrez System
Databases linked through Entrez: Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure
7
Literature Databases
- PubMed
- Books
- PubMed Central
- Journals
- Online Mendelian Inheritance in Man (OMIM)
8
Molecular Sequence Databases
Sequence Databases
- Nucleotide (GenBank)
- Taxonomy
- PopSet
- Protein
Marker Databases
- Single Nucleotide Polymorphisms (SNPs, dbSNP)
- Sequence Tagged Sites (STSs, dbSTS)
- Expressed Sequence Tags (ESTs, dbEST)
- UniGene
9
Molecular Databases
Primary Databases
- Original submissions by experimentalists
- Database staff organize but don’t add additional information
- Example: GenBank
Derivative Databases
- Human curated: compilation and correction of data. Example: SWISS-PROT, NCBI RefSeq mRNA
- Computationally derived. Example: UniGene
- Combinations. Example: NCBI Genome Assembly
10
[Figure: sequence fragments submitted by Labs are collected in GenBank; Curators and Algorithms derive RefSeq, UniGene, and Genome Assembly from them]
11
The International Nucleotide Sequence Database Collaboration
[Diagram: three partner databases exchanging data]
- GenBank: NCBI, NIH; retrieval through Entrez
- DDBJ: CIB, NIG; retrieval through Get Entry
- EMBL: EBI, EMBL; retrieval through SRS
12
Entrez Nucleotide
GenBank   71%
DDBJ      19%
EMBL       9%
RefSeq     1%
PDB        0.01%
13
What is GenBank?
NCBI’s Primary Sequence Database
- Nucleotide only sequence database
- Archival in nature
GenBank Data
- Direct submissions: individual records (BankIt, Sequin)
- Batch submissions via email (EST, GSS, STS)
- ftp accounts established for sequencing centers
- Data shared amongst three collaborating databases: GenBank, DNA Database of Japan (DDBJ), and European Molecular Biology Laboratory Database (EMBL)
14
The Old Way From Fran Lewitter, Whitehead Institute
15
GenBank: NCBI’s Primary Sequence Database
- Full release every two months
- Incremental and cumulative updates daily
- Available only through the internet:
  ftp://ftp.ncbi.nih.gov/genbank/
  ftp://genbank.sdsc.edu/pub
  ftp://bio-mirror.net/biomirror/genbank/
Release 136 (June 2003): 121 gigabytes of data
- Records: 25,592,865 (18,197,119 in June 2002)
- Nucleotides: 32,528,249,295 (22,616,937,182 in June 2002)
- Species: 110,000+
16
GenBank Divisions
Traditional Divisions
BCT   Bacterial/Archaeal
INV   Invertebrate
MAM   Mammalian (excluding ROD/PRI)
PHG   Phage
PLN   Plant/Fungal
PRI   Primate
ROD   Rodent
SYN   Synthetic (cloning vectors)
VRL   Viral
VRT   Other Vertebrate
Bulk Sequence Divisions
EST   Expressed Sequence Tag
STS   Sequence Tagged Site
GSS   Genome Survey Sequence
HTGS  High Throughput Genomic Sequence
HTC   High Throughput cDNA
17
A Traditional GenBank Record
[Annotated record; callouts: Locus Field, Molecule Type, GenBank Division, Modification Date, Definition Line, Taxonomy, GI (GenInfo), Keywords, Submission Field]
18
[Annotated record; callouts: Feature Table, GenPept Record, Genomic DNA Sequence]
19
Bulk Sequence Divisions
- EST: Expressed Sequence Tag
- STS: Sequence Tagged Site
- HTGS: High Throughput Genomic Sequence
- Batch submission, e-mail, or ftp
- Inaccurate
- Poorly characterized
20
EST Division: Expressed Sequence Tags
[Diagram: nucleus (30,000 genes) -> RNA gene products -> make cDNA library (80-100,000 unique cDNA clones in library) -> isolate unique clones -> sequence once from each end (5’ and 3’)]
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
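The two reads above are in FASTA format, and the N characters mark bases the sequencer could not call, which is typical of single-pass EST data. As a minimal sketch (the function names parse_fasta and ambiguous_fraction are illustrative, and the sample holds only the first line of each read from the slide), the records could be loaded and checked for ambiguity like this:

```python
# Illustrative sketch: parse FASTA text and measure how many bases are
# uncalled (N). Function names are hypothetical, not from any NCBI tool.

SAMPLE = """>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
"""


def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    chunks = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is not None:
            # Sequence lines belong to the most recent header.
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def ambiguous_fraction(seq):
    """Fraction of positions that are N (uncalled bases)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


if __name__ == "__main__":
    for name, seq in parse_fasta(SAMPLE).items():
        print(name, len(seq), round(ambiguous_fraction(seq), 3))
```

For real work an established parser (for example Biopython's SeqIO module) would be a more robust choice than this hand-rolled loop.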
21
What is UniGene?
- A gene-oriented view of sequence entries
- MegaBlast-based automated sequence clustering
- Nonredundant set of gene-oriented clusters
- Each cluster represents a unique gene
- Provides information on tissue-specific expression and map locations
- Includes well-characterized genes and novel ESTs
- Useful for gene discovery and selection of mapping reagents
22
EST hits to Homo sapiens muscle creatine kinase mRNA
[Figure: 5’ EST hits and 3’ EST hits aligned against the query sequence (muscle creatine kinase mRNA)]
23
UniGene Entry for H. sapiens Muscle Creatine Kinase
24
STS Division: Sequence Tagged Sites
- Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
- PCR with STS primers gives one product per genome
- Basis of Radiation Hybrid Mapping
- UniGene
- Genome Assembly
- Related resource: Electronic PCR
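The idea that STS primers give one product per genome can be checked in silico, which is the principle behind electronic PCR. The sketch below is a toy version under strong assumptions: exact primer matches only, a single strand orientation, and made-up sequences. It is not NCBI's e-PCR program, and the function names are hypothetical.

```python
# Toy in-silico PCR: find where a primer pair would amplify a template.
# Assumes exact matches and searches the given strand only.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_products(template, fwd_primer, rev_primer, max_size=1000):
    """Return (start, size) for each predicted product on the template.

    The forward primer must match the template directly; the reverse
    primer anneals to the opposite strand, so its reverse complement is
    searched for downstream of the forward primer.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_rc = reverse_complement(rev_primer.upper())
    products = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_rc, start + len(fwd))
        while end != -1:
            size = end + len(rev_rc) - start
            if size <= max_size:
                products.append((start, size))
            end = template.find(rev_rc, end + 1)
        start = template.find(fwd, start + 1)
    return products


if __name__ == "__main__":
    # Made-up template and primers, for illustration only.
    template = "CCATGCCATTTTTTTTGACTTCAA"
    hits = find_products(template, "ATGCCA", "GAAGTC")
    print(hits)
    print("behaves like an STS" if len(hits) == 1 else "not unique")
```

A marker behaves like an STS in this sketch when exactly one product is found in the whole genome sequence searched.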
25
UniSTS: Database of Mapped Markers
27
HTG Division: High Throughput Genome
Same accession number, different versions (records of 40,000 to > 50,000 bp)
- phase 1 (HTG): Acc = AC109609.1; unfinished, may be unordered, with gaps
- phase 2 (HTG): Acc = AC109609.6; unfinished, oriented, ordered, may have gaps
- phase 3 (ROD): Acc = AC109609.10; finished, no gaps
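Because the accession stays fixed while the version number grows, picking the newest record means comparing the version as an integer, not as text (as text, ".10" would sort before ".6"). A minimal sketch, with hypothetical helper names:

```python
# Illustrative helpers for accession.version identifiers such as
# AC109609.10. The names split_accession and latest_version are
# hypothetical.


def split_accession(acc_ver):
    """Split 'AC109609.10' into ('AC109609', 10)."""
    accession, dot, version = acc_ver.strip().partition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"no numeric version in {acc_ver!r}")
    return accession, int(version)


def latest_version(identifiers):
    """Return the identifier with the highest numeric version."""
    return max(identifiers, key=lambda item: split_accession(item)[1])


if __name__ == "__main__":
    versions = ["AC109609.1", "AC109609.10", "AC109609.6"]
    print(latest_version(versions))
    print(sorted(versions))  # plain text sorting puts .10 before .6
```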
28
HTG Division: High Throughput Genome
29
RefSeq: NCBI’s Derivative Sequence Database
- Curated transcripts and proteins (reviewed): human, mouse, rat, fruit fly, zebrafish, arabidopsis
- Human model transcripts and proteins
- Assembled genomic regions (contigs): draft human genome, mouse genome
- Chromosome records: microbial, viral, organelle
30
Reference Sequences (slide labels: Curated, Automated)
Chromosome:     NC_000000
Gene:           NG_000000
mRNA:           NM_000000
RNA:            NR_000000
protein:        NP_000000
Model mRNA:     XM_000000
Model RNA:      XR_000000
Model protein:  XP_000000
Contig:         NT_000000, NW_000000
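The prefix of a RefSeq accession tells you what kind of record it is. As a small sketch, the table from this slide can be turned into a lookup (the dictionary covers only the prefixes listed here, and the function name is illustrative):

```python
# Lookup of the RefSeq accession prefixes listed on the slide.
# Only these prefixes are covered; describe_refseq is a hypothetical name.

REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "NG_": "gene",
    "NM_": "mRNA",
    "NR_": "RNA",
    "NP_": "protein",
    "XM_": "model mRNA",
    "XR_": "model RNA",
    "XP_": "model protein",
    "NT_": "contig",
    "NW_": "contig",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq accession, or None if the
    prefix is not one of those listed above."""
    return REFSEQ_PREFIXES.get(accession.strip().upper()[:3])


if __name__ == "__main__":
    for acc in ["NC_002695", "XM_000000", "AC109609.1"]:
        print(acc, "->", describe_refseq(acc))
```

An accession with no underscore prefix, such as AC109609.1, is not a RefSeq record, so the lookup returns None for it.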
31
RefSeq Chromosomes: NC_
LOCUS       NC_002695  5498450 bp  DNA  circular  BCT  02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1 GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1 (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying
            the verotoxin 2 genes of the enterohemorrhagic Escherichia coli
            O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
   PUBMED   10734605
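The header fields of a flat-file record like this one can be pulled out with simple line-based parsing. The sketch below assumes the LOCUS line has the layout shown on this slide (name, length, "bp", molecule type, topology, division, date); other records may omit the topology, so it reads the division and date from the end of the line. The function name is hypothetical.

```python
# Minimal sketch: extract a few header fields from a GenBank-style
# flat-file record. Assumes the layout shown on the slide.

RECORD = """LOCUS       NC_002695  5498450 bp  DNA  circular  BCT  02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1 GI:15829254
"""


def parse_header(text):
    """Return a dict with locus, length, division, date, definition,
    accession, version and GI, where present."""
    fields = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        keyword = tokens[0]
        if keyword == "LOCUS":
            fields["locus"] = tokens[1]
            fields["length_bp"] = int(tokens[2])
            # Division and date are the last two tokens on the line.
            fields["division"] = tokens[-2]
            fields["date"] = tokens[-1]
        elif keyword == "DEFINITION":
            fields["definition"] = " ".join(tokens[1:])
        elif keyword == "ACCESSION":
            fields["accession"] = tokens[1]
        elif keyword == "VERSION":
            for token in tokens[1:]:
                if token.startswith("GI:"):
                    fields["gi"] = int(token[3:])
                else:
                    fields["version"] = token
    return fields


if __name__ == "__main__":
    for key, value in parse_header(RECORD).items():
        print(f"{key}: {value}")
```

Multi-line fields such as the author list are not handled here; a full parser (for example the GenBank reader in Biopython's SeqIO module) deals with continuation lines and the feature table.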
32
RefSeq Contig: NT_, NW_
33
Curated RefSeq Records: NM_, NP_
34
Alignment Generated Transcripts: XM_,XP_
35
REFSEQ:Summary
48
BLAST: a starting point for most bioinformatics-related problems…
49
BLAST
50
One BLAST, many flavors
51
BLAST databases
52
Example: BLASTing protein sequence
53
BLAST output
54
BLAST output formatting
55
BLAST output
56
BLAST output low complexity filter
57
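BLAST
- The scores we get from BLAST have an underlying statistical distribution.
- E-value: the number of alignments with a particular score, or a better score, that are expected to occur by chance in a search of the same size (query length times database length). The lower the E-value, the less likely the hit is a chance match.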
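The E-value follows from Karlin-Altschul statistics, which the slide does not spell out: for a raw score S, query length m, and database length n, E = K * m * n * exp(-lambda * S), and the bit score is S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2^(-S'). The sketch below evaluates both. The lambda and K values are assumptions chosen for illustration (they depend on the scoring matrix and gap penalties), as are the lengths and the raw score.

```python
# Sketch of the Karlin-Altschul E-value calculation.
# LAMBDA and K below are illustrative assumptions; real values depend on
# the scoring system and are reported in the BLAST output.
import math

LAMBDA = 0.267  # assumed scale parameter
K = 0.041       # assumed search-space parameter


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw alignment score to bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against 100 million residues.
    m, n = 300, 100_000_000
    for raw in (40, 80, 120):
        print(raw, round(bit_score(raw), 1), f"{e_value(raw, m, n):.3g}")
```

Two consequences are easy to see in the formula: the E-value drops exponentially as the score rises, and the same score yields a larger E-value against a larger database.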
58
BLAST