Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
Rs          dbSNP (single nucleotide polymorphism)
N           An expressed sequence tag (1 of 170)
NM_006744   RefSeq DNA sequence (from a transcript)
NP_007635   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record
Molecule types shown: protein, DNA, RNA
NCBI's RefSeq project: accessions for genomic, mRNA, and protein sequences
Accession   Molecule   Method            Note
AC_         Genomic    Mixed             Alternate complete genomic
AP_         Protein    Mixed             Protein products; alternate
NC_         Genomic    Mixed             Complete genomic molecules
NG_         Genomic    Mixed             Incomplete genomic regions
NM_         mRNA       Mixed             Transcript products; mRNA
NM_         mRNA       Mixed             Transcript products; 9-digit
NP_         Protein    Mixed             Protein products
NP_         Protein    Curation          Protein products; 9-digit
NR_         RNA        Mixed             Non-coding transcripts
NT_         Genomic    Automated         Genomic assemblies
NW_         Genomic    Automated         Genomic assemblies
NZ_ABCD     Genomic    Automated         Whole genome shotgun data
XM_         mRNA       Automated         Transcript products
XP_         Protein    Automated         Protein products
XR_         RNA        Automated         Transcript products
YP_         Protein    Auto. & Curated   Protein products
ZP_         Protein    Automated         Protein products
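The RefSeq prefix table can be turned into a small lookup. The following is only a sketch: the names REFSEQ_PREFIXES and classify_refseq are made up for illustration, the table is transcribed from the slide rather than from an NCBI source, and where the slide lists a prefix twice (NM_, NP_) only the first row is kept.

```python
# Prefix -> (molecule, method), transcribed from the RefSeq slide.
# Assumption: for prefixes listed twice on the slide (NM_, NP_),
# only the first row is kept here.
REFSEQ_PREFIXES = {
    "AC_": ("Genomic", "Mixed"),
    "AP_": ("Protein", "Mixed"),
    "NC_": ("Genomic", "Mixed"),
    "NG_": ("Genomic", "Mixed"),
    "NM_": ("mRNA", "Mixed"),
    "NP_": ("Protein", "Mixed"),
    "NR_": ("RNA", "Mixed"),
    "NT_": ("Genomic", "Automated"),
    "NW_": ("Genomic", "Automated"),
    "NZ_": ("Genomic", "Automated"),
    "XM_": ("mRNA", "Automated"),
    "XP_": ("Protein", "Automated"),
    "XR_": ("RNA", "Automated"),
    "YP_": ("Protein", "Auto. & Curated"),
    "ZP_": ("Protein", "Automated"),
}


def classify_refseq(accession):
    """Return (molecule, method) for a RefSeq-style accession.

    Returns None when the first three characters are not a prefix
    from the table (for example, a GenBank accession such as X02775).
    """
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(classify_refseq("NM_006744"))
print(classify_refseq("NP_007635"))
print(classify_refseq("X02775"))
```

Using the RBP4 examples from the earlier slide, NM_006744 is reported as an mRNA record and NP_007635 as a protein record, while the GenBank accession X02775 falls outside the RefSeq table.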
Six ways to access DNA and protein sequences
1) Entrez Gene with RefSeq database (NCBI)
2) UniGene
3) Nucleotide or Protein databases (NCBI)
4) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
5) ExPASy Sequence Retrieval System (separate from NCBI)
6) UCSC Genome Browser
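Besides the web pages, the NCBI Nucleotide and Protein databases can also be queried programmatically through NCBI's E-utilities. The sketch below only builds an efetch URL and does not contact the server. The helper name efetch_fasta_url is hypothetical, and the accessions are the RBP4 examples from the earlier slide.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nuccore"):
    """Build an efetch URL that requests one record in FASTA format.

    db="nuccore" targets the Nucleotide database; db="protein"
    targets the Protein database.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_fasta_url("NM_006744"))
print(efetch_fasta_url("NP_007635", db="protein"))
```

Pasting either printed URL into a browser should return the record as plain text, assuming the service is reachable.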
What is an EST?
An Expressed Sequence Tag (EST) is "a short strand of DNA that is part of a cDNA molecule and can act as an identifier of a gene." In essence, an EST is the result of a single-pass DNA sequencing reaction on a particular cDNA.
UniGene: unique genes via ESTs
UniGene (at NCBI) clusters contain many ESTs, which are DNA sequences (typically about 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. Because UniGene data come from many cDNA libraries, looking up a gene in UniGene gives information on its abundance (how many ESTs were found) and its regional distribution (which libraries they came from). Pages 20-21
Cluster sizes in UniGene: This is a gene with 1 EST associated; the cluster size is 1.
Cluster sizes in UniGene: This is a gene (or 1 cluster) with 10 ESTs associated; the cluster size is 10. Note: HTC = high-throughput cDNAs.
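Cluster size is simply the number of ESTs assigned to a cluster. A minimal sketch of that count, using invented EST and cluster identifiers (real assignments would come from UniGene itself) and a hypothetical helper named cluster_sizes:

```python
from collections import Counter


def cluster_sizes(assignments):
    """Count how many ESTs fall in each cluster.

    assignments is an iterable of (est_id, cluster_id) pairs.
    """
    return Counter(cluster_id for _est_id, cluster_id in assignments)


# Invented data mirroring the two slides: one cluster with a single
# EST and one cluster with ten ESTs.
assignments = [("est_0", "cluster_A")]
assignments += [("est_%d" % i, "cluster_B") for i in range(1, 11)]

sizes = cluster_sizes(assignments)
print(sizes["cluster_A"])  # cluster size 1
print(sizes["cluster_B"])  # cluster size 10
```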
FASTA format
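A FASTA record consists of a header line beginning with ">" followed by one or more lines of sequence. The sketch below reads such text into (header, sequence) pairs. The function name parse_fasta and the two short records are made up for illustration and are not real database entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined into one string per record.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
example = """>seq1 made-up DNA record
ATGGCC
TTAA
>seq2 made-up protein record
MKWV
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```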
Orthologous genes for various model species can be easily identified using this site (curated database)