NCBI
Created as a part of NLM in 1988
  Establish public databases
  Research in computational biology
  Develop software tools for sequence analysis
  Disseminate biomedical information
  Tools: BLAST (1990), Entrez (1992)
  GenBank (1992)
  Free MEDLINE (PubMed, 1997)
  Human genome (2001)
NCBI Home Page
  To learn more, visit the “Site Map” and “About NCBI” web pages.
Entrez: An Integrated Database Search and Retrieval System
Entrez
The (ever) Expanding Entrez System
  Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure
Literature Databases
  PubMed
  Books
  PubMed Central
  Journals
  On-Line Mendelian Inheritance in Man (OMIM)
Molecular Sequence Databases
  Sequence Databases
    Nucleotide (GenBank)
    Taxonomy
    PopSet
    Protein
  Marker Databases
    Single Nucleotide Polymorphisms (SNPs, dbSNP)
    Sequence Tagged Sites (STSs, dbSTS)
    Expressed Sequence Tags (ESTs, dbEST)
    UniGene
Molecular Databases
  Primary Databases
    Original submissions by experimentalists
    Database staff organize but don’t add additional information
    Example: GenBank
  Derivative Databases
    Human curated: compilation and correction of data
      Example: SWISS-PROT, NCBI RefSeq mRNA
    Computationally derived
      Example: UniGene
    Combinations
      Example: NCBI Genome Assembly
[Figure: primary and derivative databases. Labs submit sequence fragments to GenBank; UniGene (Algorithms), RefSeq (Curators), and Genome Assembly are derived from them.]
The International Nucleotide Sequence Database Collaboration
  Host   Center   Database   Retrieval system
  NIH    NCBI     GenBank    Entrez
  NIG    CIB      DDBJ       Get Entry
  EMBL   EBI      EMBL       SRS
Entrez Nucleotide
  GenBank  71%
  DDBJ     19%
  EMBL     9%
  RefSeq   1%
  PDB      0.01%
What is GenBank?
  NCBI’s Primary Sequence Database
    Nucleotide-only sequence database
    Archival in nature
  GenBank Data
    Direct submissions: individual records (BankIt, Sequin)
    Batch submissions (EST, GSS, STS)
    ftp accounts established for sequencing centers
    Data shared among three collaborating databases:
      GenBank
      DNA Data Bank of Japan (DDBJ)
      European Molecular Biology Laboratory Database (EMBL)
The Old Way From Fran Lewitter, Whitehead Institute
GenBank: NCBI’s Primary Sequence Database
  Full release every two months
  Incremental and cumulative updates daily
  Available only through the internet
    ftp://ftp.ncbi.nih.gov/genbank/
    ftp://genbank.sdsc.edu/pub
    ftp://bio-mirror.net/biomirror/genbank/
  Release 136, June: 121 gigabytes of data
    Records: …,592,865 (18,197,119 in June 2002)
    Nucleotides: 32,528,249,295 (22,616,937,182 in June 2002)
    Species: 110,000+
GenBank Divisions
  Traditional Divisions
    BCT   Bacterial/Archaeal
    INV   Invertebrate
    MAM   Mammalian (ex. ROD/PRI)
    PHG   Phage
    PLN   Plant/Fungal
    PRI   Primate
    ROD   Rodent
    SYN   Synthetic (cloning vectors)
    VRL   Viral
    VRT   Other Vertebrate
  Bulk Sequence Divisions
    EST   Expressed Sequence Tag
    STS   Sequence Tagged Site
    GSS   Genome Survey Sequence
    HTGS  High Throughput Genomic Sequence
    HTC   High Throughput cDNA
A Traditional GenBank Record
  Locus Field
  Molecule Type
  GenBank Division
  Modification Date
  Definition Line
  Taxonomy
  GI (GenInfo)
  Keywords
  Submission Field
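The division codes above lend themselves to a simple lookup table. The following is a minimal sketch; the function name `describe_division` is hypothetical, and the codes and descriptions are copied from the slide rather than from any official list.

```python
# Division codes and names as listed on the slide (not an official registry).
TRADITIONAL = {
    "BCT": "Bacterial/Archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (ex. ROD/PRI)",
    "PHG": "Phage",
    "PLN": "Plant/Fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
}
BULK = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTGS": "High Throughput Genomic Sequence",
    "HTC": "High Throughput cDNA",
}


def describe_division(code):
    """Return (kind, name) for a division code; hypothetical helper."""
    code = code.upper()
    if code in TRADITIONAL:
        return ("traditional", TRADITIONAL[code])
    if code in BULK:
        return ("bulk", BULK[code])
    raise KeyError("unknown division code: " + code)


print(describe_division("bct"))
print(describe_division("EST"))
```

Separating the two dictionaries keeps the traditional/bulk distinction explicit, which matters because bulk-division records are described later as less well characterized.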
Feature Table GenPept Record Genomic DNA Sequence
Bulk Sequence Divisions
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  HTGS  High Throughput Genomic Sequence
  Batch Submission, , or ftp
  Inaccurate
  Poorly Characterized
EST Division: Expressed Sequence Tags
  nucleus: 30,000 genes
  RNA gene products
  make cDNA library: …,000 unique cDNA clones in library
  isolate unique clones
  sequence once from each end (5’ and 3’)

>IMAGE: ', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE: ' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
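The EST reads above contain N characters where the base could not be called, which is typical of single-pass sequencing. As a rough illustration, the sketch below parses FASTA-formatted text and reports the fraction of ambiguous bases per record. The function names and the two short example records are hypothetical, not taken from the slide.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def ambiguous_fraction(seq):
    """Fraction of positions reported as N (uncalled base)."""
    return seq.count("N") / len(seq) if seq else 0.0


# Hypothetical records, shortened for illustration.
example = """>hypothetical_est_1
NNTCAAGTTTTATGATTTATTTAACTTGTGG
>hypothetical_est_2
GACAGCATTCGGGCCGAGATG
"""

for name, seq in read_fasta(example):
    print(name, len(seq), round(ambiguous_fraction(seq), 3))
```

A filter on this fraction is one simple way to set aside low-quality reads before clustering or alignment.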
What is UniGene?
  A gene-oriented view of sequence entries
  MegaBlast-based automated sequence clustering
  Nonredundant set of gene-oriented clusters
  Each cluster represents a unique gene
  Provides information on tissue-specific expression and map locations
  Includes well-characterized genes and novel ESTs
  Useful for gene discovery and selection of mapping reagents
EST hits to Homo sapiens muscle creatine kinase mRNA
  Query Sequence (muscle creatine kinase mRNA)
  5’ EST Hits
  3’ EST Hits
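To make the clustering idea concrete, here is a toy sketch of single-linkage clustering with a union-find structure. It assumes that a list of pairwise hits has already been produced by an alignment tool such as MegaBLAST; the identifiers and the hit list are hypothetical, and the real UniGene build involves considerably more than this.

```python
def cluster(ids, hits):
    """Group sequence ids into clusters connected by pairwise hits."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    # Sort clusters by their sorted members for a stable result.
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical identifiers and hits.
ids = ["mRNA_A", "est1", "est2", "est3", "est4"]
hits = [("mRNA_A", "est1"), ("est1", "est2"), ("est3", "est4")]
for members in cluster(ids, hits):
    print(sorted(members))
```

Each resulting set plays the role of one gene-oriented cluster: est2 never hit the mRNA directly, but joins its cluster through est1.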
UniGene Entry for H. sapiens Muscle Creatine Kinase
STS Division: Sequence Tagged Sites
  Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
  PCR with STS primers gives one product per genome
  Basis of Radiation Hybrid Mapping
  UniGene, Genome Assembly
  Related resource: Electronic PCR
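The idea behind electronic PCR can be sketched in a few lines: look for the forward primer and the reverse complement of the reverse primer on the same strand, and report the implied product size. This is a simplified illustration with exact matching on one strand only; the function names, the size limit, and the example sequences are hypothetical, and NCBI's Electronic PCR tool is more elaborate.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def epcr(template, forward, reverse, max_size=1000):
    """Return (start, size) for each product found by exact matching."""
    template = template.upper()
    forward = forward.upper()
    # The reverse primer anneals to the opposite strand, so on the
    # template strand we search for its reverse complement.
    rev_site = reverse_complement(reverse.upper())
    products = []
    start = template.find(forward)
    while start != -1:
        end = template.find(rev_site, start + len(forward))
        while end != -1:
            size = end + len(rev_site) - start
            if size <= max_size:
                products.append((start, size))
            end = template.find(rev_site, end + 1)
        start = template.find(forward, start + 1)
    return products


# Hypothetical template and primers.
template = "TTTT" + "ACGTAC" + "GGGGGGGG" + "TTCAGA" + "CCCC"
print(epcr(template, "ACGTAC", "TCTGAA"))
```

A marker that behaves as a good STS would yield exactly one product of the expected size when run against a genome.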
UniSTS: Database of Mapped Markers
HTG Division: High Throughput Genome
  40,000 to > 50,000 bp
  Same accession numbers, different versions
  phase 1 (HTG, Acc = AC): unfinished, may be unordered, with gaps
  phase 2 (HTG, Acc = AC): unfinished, oriented, ordered, may have gaps
  phase 3 (ROD, Acc = AC): finished, no gaps
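The point that a record keeps its accession number while its version changes can be illustrated with the accession.version form of identifier. The sketch below is a minimal example; the helper names and the accession `AC000000` are hypothetical.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


def same_record(a, b):
    """True if two identifiers share an accession, whatever the version."""
    return split_accession(a)[0] == split_accession(b)[0]


# Hypothetical identifiers for an early and a later version of one record.
early, later = "AC000000.1", "AC000000.3"
print(split_accession(early), split_accession(later))
print(same_record(early, later))
```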
HTG Division: High Throughput Genome
RefSeq: NCBI’s Derivative Sequence Database
  Curated transcripts and proteins
    reviewed
    human, mouse, rat, fruit fly, zebrafish, arabidopsis
  Human model transcripts and proteins
  Assembled Genomic Regions (contigs)
    draft human genome
    mouse genome
  Chromosome records
    Microbial
    viral
    organelle
Reference Sequences
  Curated
    Chromosome:     NC_
    mRNA:           NM_
    protein:        NP_
    RNA:            NR_
    Gene:           NG_
  Automated
    Model mRNA:     XM_
    Model protein:  XP_
    Model RNA:      XR_
    Contig:         NT_, NW_
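Because the prefix encodes the record type, an accession can be classified by inspecting the characters before the underscore. The sketch below assumes the curated/automated grouping as read from the slide; the function name and the example accessions are hypothetical.

```python
# Prefix -> (record type, origin), following the grouping on the slide.
REFSEQ_PREFIXES = {
    "NC": ("chromosome", "curated"),
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "NR": ("RNA", "curated"),
    "NG": ("gene", "curated"),
    "XM": ("model mRNA", "automated"),
    "XP": ("model protein", "automated"),
    "XR": ("model RNA", "automated"),
    "NT": ("contig", "automated"),
    "NW": ("contig", "automated"),
}


def classify_refseq(accession):
    """Return (record type, origin) for a RefSeq-style accession."""
    prefix, sep, _ = accession.partition("_")
    if not sep or prefix.upper() not in REFSEQ_PREFIXES:
        raise ValueError("not a recognized RefSeq prefix: " + accession)
    return REFSEQ_PREFIXES[prefix.upper()]


# Hypothetical accession numbers.
print(classify_refseq("NM_000000"))
print(classify_refseq("XP_000000"))
```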
RefSeq Chromosomes: NC_
LOCUS       NC_ bp DNA circular BCT 02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_
VERSION     NC_ GI:
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1 (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying
            the verotoxin 2 genes of the enterohemorrhagic Escherichia coli
            O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), (1999)
  MEDLINE
  PUBMED
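The header of a flat-file record like the one above is keyword-driven: a keyword starts each field, and lines indented past the keyword column continue the previous field. A minimal parsing sketch follows, assuming continuation lines are indented by at least 12 spaces. The function name, the accession `NC_000000`, and the sequence length in the sample are hypothetical.

```python
def parse_header(text):
    """Return a list of (keyword, value) pairs from a flat-file header."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent >= 12 and entries:
            # Deeply indented line: continuation of the previous field.
            key, value = entries[-1]
            entries[-1] = (key, (value + " " + line.strip()).strip())
        else:
            parts = line.split(None, 1)
            value = parts[1].strip() if len(parts) > 1 else ""
            entries.append((parts[0], value))
    return entries


PAD = " " * 12
# Hypothetical accession and length; other text follows the slide.
SAMPLE = "\n".join([
    "LOCUS       NC_000000  1000 bp  DNA  circular  BCT  02-OCT-2001",
    "DEFINITION  Escherichia coli O157:H7, complete genome.",
    "SOURCE      Escherichia coli O157:H7.",
    "  ORGANISM  Escherichia coli O157:H7",
    PAD + "Bacteria; Proteobacteria; gamma subdivision;",
    PAD + "Enterobacteriaceae; Escherichia.",
])

for keyword, value in parse_header(SAMPLE):
    print(keyword, "->", value)
```

Returning a list of pairs rather than a dictionary preserves repeated keywords, such as the several REFERENCE blocks a full record may carry.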
RefSeq Contig: NT_, NW_
Curated RefSeq Records: NM_, NP_
Alignment Generated Transcripts: XM_,XP_
REFSEQ:Summary
BLAST: a starting point for most bioinformatics-related problems…
BLAST
One BLAST, many flavors
BLAST databases
Example: BLASTing protein sequence
BLAST output
BLAST output formatting
BLAST output
BLAST output low complexity filter
BLAST
  Scores we get from BLAST have an underlying distribution.
  E-value: the number of alignments with a given score or better that are expected to occur by chance when comparing two random sequences
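The E-value is commonly related to the score through the Karlin-Altschul relation E = m × n × 2^(−S′), where S′ is the bit score, m the query length, and n the total database length. The sketch below applies that relation directly. It is an approximation under the assumption of raw lengths; BLAST applies effective-length corrections, so reported values can differ. The function name is hypothetical.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score and the search-space size."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves the expected number of chance hits,
# and doubling the database doubles it.
for score in (20, 30, 40):
    print(score, expected_hits(score, query_len=300, db_len=1_000_000))
```

This makes the practical reading explicit: the same score is less significant in a larger database, which is why E-values rather than raw scores are compared across searches.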
BLAST