Sequence Databases – 20 June 2008


1 Sequence Databases – 20 June 2008
Learning objectives:
  Be able to describe how information is stored in GenBank.
  Be able to read a GenBank flat file.
  Be able to search GenBank for information.
  Be able to explain the content difference between a header, features and sequence.
  Be able to say what distinguishes a primary database from a secondary database.
  Be able to use and talk about the RefSeq and dbEST databases as they fit into the objectives above.
  Be able to access and navigate the ENTREZ platform for biological data analysis.

2 BIOSEQs – entry common to all sequence databases
BIOSEQ = biological sequence
  Central element in the NCBI database model.
  Found in both the nucleotide and protein databases.
  Comprises the sequence of a single continuous molecule of nucleic acid or protein.
An entry must have:
  At least one sequence identifier (Seq-id)
  Information on the physical type of molecule (DNA, RNA, or protein)
  Descriptors, which describe the entire Bioseq
  Annotations, which provide information regarding specific locations within the Bioseq

3 What is GenBank?
The NIH genetic sequence database, an annotated collection of all publicly available NUCLEIC ACID sequences.
  Each record represents a single contiguous stretch of DNA or RNA.
  DNA stretches may have more than one coding region (i.e., more than one gene).
  RNA sequences are presented with T, not U.
  Records are generated from direct submissions to the DNA sequence databases from the investigators (authors).
  GenBank is part of the International Nucleotide Sequence Database Collaboration.

4 GenBank now holds over 85 billion base pairs. The number of sequences is approaching 83 million.

5 General Comments on GBFF (GenBank Flat File)
Three sections:
  1) Header: information about the whole record.
  2) Features: description of annotations, each represented by a key.
  3) Nucleotide sequence: the record ends with // on its last line.
A nucleic acid (DNA or RNA (cDNA)) sequence translated to an amino acid sequence is a "feature".
Example: GenBank flat file for MyoD1.
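The three-section layout can be sketched in Python. This is a minimal illustration, not an official parser: the function name `split_flat_file` is hypothetical, and it assumes the features table starts at a line beginning with FEATURES and the sequence block starts at BASE COUNT or ORIGIN, as in the sample record on the later slides.

```python
def split_flat_file(record: str) -> dict:
    """Split one GenBank flat-file record into header, features, and sequence lines."""
    header, features, sequence = [], [], []
    current = header  # everything before FEATURES belongs to the header
    for line in record.splitlines():
        if line.startswith("//"):  # record terminator on the last line
            break
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = sequence
        current.append(line)
    return {"header": header, "features": features, "sequence": sequence}


# Abbreviated example record (assumed layout, shortened for illustration).
example = """LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
ORIGIN
        1 gatcctccat atacaacggt
//"""

sections = split_flat_file(example)
for name, lines in sections.items():
    print(name, len(lines))
```

Keeping the lines grouped rather than parsed leaves each section free to be handled by its own, more specific routine.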

6 Feature Keys
Purpose:
  1) Indicates the biological nature of the sequence
  2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence

7 Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".

8 Feature Keys-Terminology (Cont.)
Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
(For MyoD1: accession number X61655)
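A location string like the ones on these two slides can be read mechanically. The sketch below is an assumption-laden illustration: `parse_location` and `location_length` are hypothetical names, and only simple `start..end` spans, `join(...)`, `complement(...)`, and the `<`/`>` partial markers are handled, not the full location syntax.

```python
import re


def parse_location(location: str) -> dict:
    """Parse a simple feature location into 1-based inclusive spans."""
    loc = location.replace(" ", "")
    complement = loc.startswith("complement(") and loc.endswith(")")
    if complement:
        loc = loc[len("complement("):-1]
    if loc.startswith("join(") and loc.endswith(")"):
        loc = loc[len("join("):-1]
    spans = []
    partial = False
    for part in loc.split(","):
        match = re.fullmatch(r"([<>]?)(\d+)\.\.([<>]?)(\d+)", part)
        if match is None:
            raise ValueError(f"unsupported location: {location!r}")
        # A '<' or '>' marks an end that lies outside the sequenced region.
        partial = partial or bool(match.group(1) or match.group(3))
        spans.append((int(match.group(2)), int(match.group(4))))
    return {"spans": spans, "complement": complement, "partial": partial}


def location_length(parsed: dict) -> int:
    """Total number of bases covered by all spans."""
    return sum(end - start + 1 for start, end in parsed["spans"])


simple = parse_location("23..400")
joined = parse_location("join(544..589,688..1032)")
print(simple["spans"], location_length(simple))
print(joined["spans"], location_length(joined))
```

Both ends of a span are inclusive, so 23..400 covers 400 - 23 + 1 = 378 bases, and the joined location covers 46 + 345 = 391 bases.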

9 Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Annotations on the slide:
  LOCUS line: locus name (SCU49845); GenBank division (PLN: plant, fungal and algal); modification date (21-JUN-1999)
  DEFINITION: "cds" refers to the coding region
  ACCESSION: unique identifier (never changes)
  VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence)
  GI: GeneInfo identifier (changes whenever there is a change)
  KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary); not used in newer records
  SOURCE: common name for organism
  ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database
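As a rough illustration of reading these header lines, the sketch below pulls the fields out of the LOCUS and VERSION lines shown on this slide. The function names are hypothetical, and the LOCUS field order (name, length, "bp", molecule type, division, date) is assumed from this one older-style record; other records may carry extra fields, so a real parser should not rely on fixed positions.

```python
def parse_locus_line(line: str) -> dict:
    """Read an old-style LOCUS line: LOCUS name length bp molecule division date."""
    fields = line.split()
    return {
        "name": fields[1],
        "length": int(fields[2]),  # fields[3] is the literal "bp"
        "molecule": fields[4],
        "division": fields[5],
        "date": fields[6],
    }


def parse_version_line(line: str) -> dict:
    """Read a VERSION line of the form: VERSION accession.version GI:number."""
    fields = line.split()
    accession, version = fields[1].split(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus_line("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
version = parse_version_line("VERSION U49845.1 GI:1293613")
print(locus)
print(version)
```

Splitting the accession from its version number makes the distinction on the slide concrete: the accession stays fixed while the version number increments when the sequence changes.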

10 Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA

Annotations on the slide:
  MEDLINE: Medline UID
  REFERENCE 3: submitter of sequence (always the last reference)

11 Record from GenBank (cont.2)
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

There are three parts to the feature key: a keyword (indicates functional group), a location (instruction for finding the feature), and a qualifier (auxiliary information about a feature).

Annotations on the slide:
  Keys: source, CDS
  Location: 1..5028, <1..206
  Qualifiers: the /name=value lines; the part after "=" is the value
  <1..206: the 5' end of the coding sequence begins upstream of the first nucleotide of the sequence; the 3' end is complete. Note: only a partial sequence.
  Quoted values: descriptive free text must be in quotations
  /codon_start: start of open reading frame
  /db_xref: database cross-refs
  /protein_id: protein sequence ID #
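The qualifier lines of the partial CDS on this slide can be read into a dictionary, which also allows a quick consistency check between the location, /codon_start, and /translation. This is a sketch under stated assumptions: `parse_qualifiers` is a hypothetical helper, the regular expression covers only the `/name="text"`, `/name=value`, and bare `/name` forms seen on these slides, and the stop codon is assumed to be counted inside the CDS location.

```python
import re


def parse_qualifiers(text: str) -> dict:
    """Collect /name="text", /name=value, and bare /name qualifiers."""
    qualifiers = {}
    pattern = r'/(\w+)(?:=(?:"([^"]*)"|(\S+)))?'
    for match in re.finditer(pattern, text):
        name, quoted, bare = match.groups()
        if quoted is not None:
            value = quoted
        elif bare is not None:
            value = bare
        else:
            value = True  # flag-style qualifier such as /partial
        qualifiers[name] = value
    return qualifiers


cds_text = (
    '/codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" '
    '/db_xref="GI:1293614" '
    '/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA '
    'AEVLLRVDNIIRARPRTANRQHM"'
)
quals = parse_qualifiers(cds_text)

# Remove the line-wrap whitespace inside the translation.
protein = "".join(quals["translation"].split())

# Location <1..206 with codon_start=3: translated bases run from 3 to 206.
codon_start = int(quals["codon_start"])
coding_bases = 206 - codon_start + 1
codons = coding_bases // 3
print(len(protein), coding_bases, codons)
```

The 204 translated bases give 68 codons, which is consistent with the 67 residues listed plus one stop codon at the complete 3' end.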

12 Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN..."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ..."

Annotations on the slide:
  "...": cutoff (the translations are truncated here)
  complement(3300..4037): another location
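A `complement(...)` location means the feature is read from the strand opposite to the one printed in the record, so the bases in that range must be reverse-complemented before translation. The sketch below illustrates this on a short made-up sequence; `reverse_complement`, `extract_feature`, and the toy sequence are all hypothetical and are not taken from the record.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(sequence: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def extract_feature(sequence: str, start: int, end: int, complement: bool = False) -> str:
    """Extract bases start..end (1-based, inclusive), optionally from the other strand."""
    fragment = sequence[start - 1:end]
    return reverse_complement(fragment) if complement else fragment


toy = "aaatgcccc"  # made-up sequence for illustration only
forward = extract_feature(toy, 3, 6)
reverse = extract_feature(toy, 3, 6, complement=True)
print(forward, reverse)
```

For the record on this slide, the same call with start 3300, end 4037, and `complement=True` on the full 5028-base sequence would yield the REV7 coding strand.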

13 Record from GenBank (cont.4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
...
//
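The sequence block can be tied back to the features table with a small worked example. The sketch below reads the two ORIGIN lines shown on the slide, translates them starting at base 3 (the /codon_start of the first CDS in this record), and checks the BASE COUNT against the 5028 bp stated on the LOCUS line. The helper names are hypothetical, and the standard genetic code is assumed.

```python
# Standard genetic code, with bases ordered t, c, a, g at each codon position.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def read_origin(lines) -> str:
    """Join the sequence blocks of ORIGIN lines, dropping the position numbers."""
    blocks = []
    for line in lines:
        parts = line.split()
        if parts and parts[0].isdigit():
            blocks.extend(parts[1:])
    return "".join(blocks)


def translate(sequence: str, codon_start: int = 1) -> str:
    """Translate complete codons, beginning at the 1-based position codon_start."""
    sequence = sequence.lower()
    residues = []
    for i in range(codon_start - 1, len(sequence) - 2, 3):
        residues.append(CODON_TABLE.get(sequence[i:i + 3], "X"))
    return "".join(residues)


origin_lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]
dna = read_origin(origin_lines)
protein_start = translate(dna, codon_start=3)
print(len(dna), protein_start)

# The four base counts should add up to the record length on the LOCUS line.
base_count_total = 1510 + 1074 + 835 + 1609
print(base_count_total)
```

The 39 residues obtained from these 120 bases match the beginning of the /translation qualifier of the TCP1-beta CDS, and the base counts sum to 5028.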

14 Primary databases vs. Secondary databases
A primary database comprises information submitted directly by the experimenter. It is called an archival database.
A secondary database comprises information derived from a primary database. It is a curated database.

15 NCBI site map
To notice on the map:
  General organization
  Where the following fit:
    RefSeq (nucleotide, protein)
    dbEST
    Others of interest to you
NCBI site map: http://www.ncbi.nlm.nih.gov/Sitemap/index.html

16 Types of primary databases carrying biological information
  GenBank/EMBL/DDBJ
  PDB: three-dimensional structure coordinates of biological molecules
  PROSITE: database of protein domain/function relationships. http://www.expasy.org/prosite/

17 Secondary Databases
[Diagram labels: DNA, RNA (cDNA), protein]
DNA databases derived from GenBank containing data for a single gene:
  Non-redundant (nr)
  dbGSS (genome survey sequences)
  dbHTGS (high throughput)
  dbSTS (sequence tagged site)
  LocusLink
  RefSeq
RNA (cDNA) databases derived from GenBank containing data for a single gene:
  dbEST (expressed sequence tag)
  UniGene
  LocusLink
  RefSeq
Protein databases derived from GenBank containing data for a single gene:
  Non-redundant (nr)
  Swissprot
  PIR (Int'l. protein sequence)
  LocusLink
  RefSeq

18 Types of secondary databases carrying biological information
RefSeq: comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15608248

19 Types of secondary databases carrying biological information
Some nucleotide secondary databases:
  dbEST: sequence data and other information on "single-pass" cDNA sequences, or "Expressed Sequence Tags", from a number of organisms
  Genome databases: there are over 20 genome databases that can be searched
  EPD: eukaryotic promoter database. http://www.epd.isb-sib.ch/
  NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
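The merging rule for the non-redundant set can be pictured with a small sketch. This is an illustration of the idea only, not how NCBI builds nr: the function name `merge_identical` and the identifiers are made up, and "100% sequence identity" is taken here to mean exactly the same sequence string.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group identifiers whose sequences are exactly identical into one entry."""
    merged = defaultdict(list)
    for identifier, sequence in entries:
        merged[sequence.upper()].append(identifier)
    return dict(merged)


# Made-up identifiers and sequences for illustration only.
entries = [
    ("genbank_entry_1", "ATGGCC"),
    ("embl_entry_1", "atggcc"),
    ("ddbj_entry_1", "ATGGCA"),
]
non_redundant = merge_identical(entries)
for sequence, identifiers in non_redundant.items():
    print(sequence, identifiers)
```

Each distinct sequence appears once, with every source identifier that carried it kept alongside.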

20 Types of secondary databases carrying biological information
Some protein secondary databases:
  ProDom http://protein.toulouse.inra.fr/prodom/current/html/home.php
  PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/
  BLOCKS http://bioinformatics.weizmann.ac.il/blocks/

21

22 References for understanding the NCBI sequence database model
Here is the website for NCBI developer tools:
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML

23 [Diagram: DNA → RNA → PROTEIN. Labels: "RNA processing", "Mature mRNA", "RNA, but NOT mRNA".]

