Created as a part of NLM in 1988 Establish public databases Research in computational biology Develop software tools for sequence analysis Disseminate.


1 NCBI
  Created as a part of NLM in 1988
  - Establish public databases
  - Research in computational biology
  - Develop software tools for sequence analysis
  - Disseminate biomedical information
  Tools: BLAST (1990), Entrez (1992), GenBank (1992), Free MEDLINE (PubMed, 1997), Human genome (2001)

2 NCBI Home Page: www.ncbi.nlm.nih.gov
  To learn more, visit the “Site Map” and “About NCBI” web pages.

5 Entrez: An Integrated Database Search and Retrieval System

6 The (ever) Expanding Entrez System
  Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure

7 Literature Databases
  - PubMed
  - Books
  - PubMed Central
  - Journals
  - Online Mendelian Inheritance in Man (OMIM)

8 Molecular Sequence Databases
  Sequence Databases
  - Nucleotide (GenBank)
  - Taxonomy
  - PopSet
  - Protein
  Marker Databases
  - Single Nucleotide Polymorphisms (SNPs, dbSNP)
  - Sequence Tagged Sites (STSs, dbSTS)
  - Expressed Sequence Tags (ESTs, dbEST)
  - UniGene

9 Molecular Databases
  Primary Databases
  - Original submissions by experimentalists
  - Database staff organize but don’t add additional information
  - Example: GenBank
  Derivative Databases
  - Human-curated compilation and correction of data. Example: SWISS-PROT, NCBI RefSeq mRNA
  - Computationally derived. Example: UniGene
  - Combinations. Example: NCBI Genome Assembly

10 [Figure: diagram of example sequence fragments, labeled Labs, GenBank, Curators, Algorithms, RefSeq, UniGene, Genome Assembly]

11 The International Nucleotide Sequence Database Collaboration
  - GenBank: NCBI (NIH), searched through Entrez
  - DDBJ: CIB (NIG), searched through Get Entry
  - EMBL: EBI (EMBL), searched through SRS

12 Entrez Nucleotide: GenBank 71%, DDBJ 19%, EMBL 9%, RefSeq 1%, PDB 0.01%

13 What is GenBank? NCBI’s Primary Sequence Database
  - Nucleotide-only sequence database
  - Archival in nature
  GenBank Data
  - Direct submissions of individual records (BankIt, Sequin)
  - Batch submissions via email (EST, GSS, STS)
  - ftp accounts established for sequencing centers
  - Data shared amongst three collaborating databases: GenBank, DNA Database of Japan (DDBJ), and European Molecular Biology Laboratory Database (EMBL)

14 The Old Way From Fran Lewitter, Whitehead Institute

15 GenBank: NCBI’s Primary Sequence Database
  Release 136, June 2003
  - 25,592,865 records (18,197,119 in June 2002)
  - 32,528,249,295 nucleotides (22,616,937,182 in June 2002)
  - 110,000+ species
  - 121 gigabytes of data
  Distribution
  - Full release every two months
  - Incremental and cumulative updates daily
  - Available only through the internet:
    ftp://ftp.ncbi.nih.gov/genbank/
    ftp://genbank.sdsc.edu/pub
    ftp://bio-mirror.net/biomirror/genbank/
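The year-over-year growth implied by the Release 136 figures can be worked out directly. The short Python sketch below uses only the four numbers quoted on this slide; the function name `percent_growth` is a hypothetical helper introduced for illustration.

```python
# Figures quoted on the slide: GenBank Release 136 (June 2003) vs. June 2002.
records_2003 = 25_592_865
records_2002 = 18_197_119
nucleotides_2003 = 32_528_249_295
nucleotides_2002 = 22_616_937_182


def percent_growth(old: int, new: int) -> float:
    """Return the percentage increase going from old to new."""
    return (new - old) / old * 100.0


record_growth = percent_growth(records_2002, records_2003)
nucleotide_growth = percent_growth(nucleotides_2002, nucleotides_2003)

print(f"Records grew by {record_growth:.1f}%")
print(f"Nucleotides grew by {nucleotide_growth:.1f}%")
```

Both quantities grew by roughly 40 percent in one year, with nucleotides growing slightly faster than records.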

16 GenBank Divisions
  Traditional Divisions
    BCT   Bacterial/Archaeal
    INV   Invertebrate
    MAM   Mammalian (excluding ROD/PRI)
    PHG   Phage
    PLN   Plant/Fungal
    PRI   Primate
    ROD   Rodent
    SYN   Synthetic (cloning vectors)
    VRL   Viral
    VRT   Other Vertebrate
  Bulk Sequence Divisions
    EST   Expressed Sequence Tag
    STS   Sequence Tagged Site
    GSS   Genome Survey Sequence
    HTGS  High Throughput Genomic Sequence
    HTC   High Throughput cDNA

17 A Traditional GenBank Record
  Annotated fields: Locus Field, Molecule Type, GenBank Division, Modification Date, Definition Line, Taxonomy, GI (GenInfo), Keywords, Submission Field

18 Feature Table, GenPept Record, Genomic DNA Sequence

19 Bulk Sequence Divisions
  - EST: Expressed Sequence Tag
  - STS: Sequence Tagged Site
  - HTGS: High Throughput Genomic Sequence
  Batch submission, e-mail, or ftp
  Inaccurate, poorly characterized

20 EST Division: Expressed Sequence Tags
  - nucleus: 30,000 genes
  - RNA gene products
  - make cDNA library: 80-100,000 unique cDNA clones in library
  - isolate unique clones
  - sequence once from each end (5’ and 3’)

  >IMAGE:275615 3', mRNA sequence
  NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
  ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
  ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
  CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
  GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
  CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

  >IMAGE:275615 5' mRNA sequence
  GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
  TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
  AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
  TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
  CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
  ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
  TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
  TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
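The two EST reads on this slide are in FASTA format, and the N characters mark bases the sequencer could not call, which is typical of single-pass reads. The Python sketch below shows one way to parse FASTA text and measure the fraction of ambiguous bases. The function names `parse_fasta` and `n_fraction` and the short sample reads are hypothetical, made up for illustration rather than taken from the slide.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a mapping of header line -> sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line    # join wrapped sequence lines
    return records


def n_fraction(sequence: str) -> float:
    """Return the fraction of bases that are the ambiguity code N."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Hypothetical miniature reads, not the real IMAGE:275615 sequences.
sample = """>read_3prime
NNTCAAGTTT
ATGANGTAAA
>read_5prime
GACAGCATTC
"""

reads = parse_fasta(sample)
for name, seq in reads.items():
    print(name, len(seq), f"{n_fraction(seq):.2f}")
```

A read with a high N fraction is a reasonable candidate for trimming or exclusion before clustering.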

21 What is UniGene?
  - A gene-oriented view of sequence entries
  - MegaBlast-based automated sequence clustering
  - Nonredundant set of gene-oriented clusters
  - Each cluster represents a unique gene
  - Provides information on tissue-specific expression and map locations
  - Includes well-characterized genes and novel ESTs
  - Useful for gene discovery and selection of mapping reagents
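The idea of grouping sequences into nonredundant, gene-oriented clusters can be illustrated with single-linkage clustering over pairwise similarity hits. The Python sketch below is not UniGene's actual pipeline: it assumes the pairwise hits (for example, from a MegaBlast run) have already been computed and filtered, and the function name `cluster_by_hits` and the sequence identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_by_hits(ids: Iterable[str],
                    hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group sequence ids into clusters: two ids share a cluster when
    they are connected by a chain of pairwise hits (union-find)."""
    ids = list(ids)
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Walk to the cluster root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a    # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers: three ESTs hit one mRNA, two others hit each other.
sequence_ids = ["est1", "est2", "est3", "est4", "mrna1"]
pairwise_hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
gene_clusters = cluster_by_hits(sequence_ids, pairwise_hits)
print(gene_clusters)
```

Single linkage is the simplest choice and can wrongly merge clusters through chimeric or repeat-containing sequences, so a real pipeline would need stricter hit filtering than this sketch assumes.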

22 EST hits to Homo sapiens muscle creatine kinase mRNA
  [Figure labels: Query Sequence (muscle creatine kinase mRNA), 5’ EST Hits, 3’ EST Hits]

23 UniGene Entry for H. sapiens Muscle Creatine Kinase

24 STS Division: Sequence Tagged Sites
  - Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
  - PCR with STS primers gives one product per genome
  - Basis of Radiation Hybrid Mapping
  - UniGene, Genome Assembly
  - Related resource: Electronic PCR
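The statement that PCR with STS primers gives one product per genome can be checked in silico, which is the idea behind Electronic PCR. The Python sketch below is a simplified, exact-match version under stated assumptions: it searches only one strand, allows no mismatches or gaps, and the names `reverse_complement` and `find_products` and the toy template are hypothetical. The real Electronic PCR tool is more tolerant than this.

```python
from typing import List, Tuple

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_products(template: str, forward: str, reverse: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of every product that the
    primer pair would amplify from the given strand by exact matching."""
    template = template.upper()
    forward = forward.upper()
    # The reverse primer anneals to the opposite strand, so on this strand
    # we look for its reverse complement downstream of the forward primer.
    reverse_site = reverse_complement(reverse)
    products = []
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        while end != -1:
            products.append((start, end + len(reverse_site)))
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return products


# Toy template and primers, invented for illustration.
toy_template = "TTTACGTACGGGGGGCCATTGAAA"
toy_products = find_products(toy_template, forward="ACGTAC", reverse="CAATGG")
for begin, stop in toy_products:
    print(f"product at {begin}-{stop}, size {stop - begin} bp")
```

A primer pair behaves like a usable STS marker in this sketch when `find_products` returns exactly one span for the genome sequence.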

25 UniSTS: Database of Mapped Markers

27 HTG Division: High Throughput Genome
  Same accession number, different versions (40,000 to > 50,000 bp)
  - phase 1 (HTG): Acc = AC109609.1; unfinished, may be unordered, with gaps
  - phase 2 (HTG): Acc = AC109609.6; unfinished, oriented, ordered, may have gaps
  - phase 3 (ROD): Acc = AC109609.10; finished, no gaps
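Because an HTG record keeps its accession while the version number increases, comparing versions requires splitting the identifier and treating the version as an integer. The Python sketch below uses the three identifiers from this slide; the helper name `split_accession` is hypothetical, and it assumes every identifier carries a `.version` suffix.

```python
from typing import List, Tuple


def split_accession(accession_version: str) -> Tuple[str, int]:
    """Split 'AC109609.6' into ('AC109609', 6).
    Raises ValueError if the version suffix is missing or not a number."""
    accession, _, version = accession_version.partition(".")
    return accession, int(version)


# The three versions of the same record shown on the slide.
versions: List[str] = ["AC109609.1", "AC109609.6", "AC109609.10"]

# Compare versions numerically, not as text.
latest = max(versions, key=lambda v: split_accession(v)[1])
print("latest version:", latest)

# Plain string sorting would rank '.6' after '.10', which is wrong here.
print("last by string sort:", sorted(versions)[-1])
```

The numeric comparison picks `AC109609.10`, while a plain string sort would misleadingly put `AC109609.6` last.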

28 HTG Division: High Throughput Genome

29 RefSeq: NCBI’s Derivative Sequence Database
  - Curated transcripts and proteins: reviewed; human, mouse, rat, fruit fly, zebrafish, arabidopsis
  - Human model transcripts and proteins
  - Assembled Genomic Regions (contigs): draft human genome, mouse genome
  - Chromosome records: microbial, viral, organelle

30 Reference Sequences
  Curated
  - Chromosome: NC_000000
  - mRNA: NM_000000
  - protein: NP_000000
  - RNA: NR_000000
  - Gene: NG_000000
  Automated
  - Model mRNA: XM_000000
  - Model protein: XP_000000
  - Model RNA: XR_000000
  - Contig: NT_000000, NW_000000
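The accession prefixes on this slide make it possible to tell what kind of RefSeq record an identifier refers to without opening the record. The Python sketch below encodes only the prefixes listed here; the table name `REFSEQ_PREFIXES` and the function `classify_refseq` are hypothetical, and RefSeq defines further prefixes that this slide does not cover.

```python
# Prefix -> record type, as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "NG_": "gene",
    "NM_": "mRNA",
    "NP_": "protein",
    "NR_": "RNA",
    "NT_": "contig",
    "NW_": "contig",
    "XM_": "model mRNA",
    "XP_": "model protein",
    "XR_": "model RNA",
}


def classify_refseq(accession: str) -> str:
    """Return the record type for a RefSeq accession, or 'unknown'
    when the prefix is not one of those listed on the slide."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


for example in ["NC_002695", "XM_000000", "AC109609"]:
    print(example, "->", classify_refseq(example))
```

An identifier such as `AC109609` has no underscore prefix, so it falls through to `unknown`, which is a quick way to separate GenBank accessions from RefSeq ones.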

31 RefSeq Chromosomes: NC_
  LOCUS       NC_002695  5498450 bp  DNA  circular  BCT  02-OCT-2001
  DEFINITION  Escherichia coli O157:H7, complete genome.
  ACCESSION   NC_002695
  VERSION     NC_002695.1  GI:15829254
  KEYWORDS    .
  SOURCE      Escherichia coli O157:H7.
    ORGANISM  Escherichia coli O157:H7
              Bacteria; Proteobacteria; gamma subdivision;
              Enterobacteriaceae; Escherichia.
  REFERENCE   1 (sites)
    AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
              Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H.,
              Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T.,
              Honda,T., Sasakawa,C. and Shinagawa,H.
    TITLE     Complete nucleotide sequence of the prophage VT2-Sakai
              carrying the verotoxin 2 genes of the enterohemorrhagic
              Escherichia coli O157:H7 derived from the Sakai outbreak
    JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
    MEDLINE   20198780
    PUBMED    10734605
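The LOCUS line of this record packs several fields into one whitespace-separated line: name, length, molecule type, topology, division, and modification date. The Python sketch below splits the LOCUS line shown on this slide into those fields. The function name `parse_locus` is hypothetical, and the sketch assumes exactly this eight-token layout; real flat files vary (for example, the topology may be absent), so an established parser is a better choice in practice.

```python
from typing import Dict, Union


def parse_locus(line: str) -> Dict[str, Union[str, int]]:
    """Parse a LOCUS line with the eight-token layout
    'LOCUS name length bp molecule topology division date'."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "name": fields[1],
        "length": int(fields[2]),     # sequence length in base pairs
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],        # GenBank division code, e.g. BCT
        "date": fields[7],            # kept as text, e.g. 02-OCT-2001
    }


locus_line = "LOCUS NC_002695 5498450 bp DNA circular BCT 02-OCT-2001"
locus = parse_locus(locus_line)
for key, value in locus.items():
    print(f"{key:9s} {value}")
```

The division code `BCT` matches the Bacterial division in the GenBank division list, consistent with an E. coli chromosome.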

32 RefSeq Contig: NT_, NW_

33 Curated RefSeq Records: NM_, NP_

34 Alignment Generated Transcripts: XM_,XP_

35 REFSEQ:Summary

48 BLAST: a starting point for most bioinformatics-related problems…

49 BLAST

50 One BLAST, many flavors

51 BLAST databases

52 Example: BLASTing protein sequence

53 BLAST output

54 BLAST output formatting

55 BLAST output

56 BLAST output: low complexity filter

57 BLAST
  - Scores we get from BLAST have an underlying distribution.
  - E-value: the number of alignments with a particular score, or a better score, that are expected to occur by chance when comparing two random sequences.
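The E-value definition on this slide is usually made concrete with the Karlin-Altschul formula E = K · m · n · e^(−λS), where S is the raw alignment score, m is the query length, and n is the total database length. The Python sketch below computes it, together with the equivalent bit score. The values of λ and K depend on the scoring matrix and gap penalties, so the numbers used here are illustrative assumptions, and the function names are hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least raw_score:
    E = K * m * n * exp(-lam * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameter values only; real values depend on the scoring system.
LAMBDA = 0.267
K = 0.041

query_length = 250            # m: residues in the query
database_length = 1_000_000   # n: total residues in the database

for score in (40, 60, 80, 100):
    e = e_value(score, query_length, database_length, LAMBDA, K)
    bits = bit_score(score, LAMBDA, K)
    print(f"S={score:3d}  bits={bits:6.1f}  E={e:.3g}")
```

Two properties follow directly from the formula: the E-value falls exponentially as the score rises, and it scales linearly with database size, so the same alignment is less significant against a larger database.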

58 BLAST

