1
Biological databases
Nicky Mulder: nicola.mulder@uct.ac.za
2
What is a database?
"An organized body of related information" (www.cogsci.princeton.edu/cgi-bin/webwn)
A data collection that is:
–Structured (computer readable)
–Searchable
–Updatable
–Cross-linked
–Publicly available
3
Biological databases
Make data available to the public
So much data is available that it needs ordering
Turn data into computer-readable form
Provide the ability to retrieve data from various sources
Can be primary (archival) or secondary (curated) databases
The most commonly used are sequence databases
4
Biological systems
Taxonomic data
Literature
Protein folding and 3D structure
Small molecules
Pathways and networks
Protein families and domains
Whole genome data
Sequence data
6
Biological systems
Taxonomic data
Literature
Protein folding and 3D structure
Small molecules
Pathways and networks
Protein families and domains
Whole genome data
Sequence data
Ontologies (GO)
7
Sequence databases
Used for retrieving a known gene/protein sequence
Useful for finding information on a gene/protein
Can find out how many genes are available for a given organism
Can compare your sequence to the others in the database
Can submit your sequence to store with the rest
Main databases: nucleotide and protein sequence DBs
8
Requirements for a good sequence database
It must be complete, with minimal redundancy
It must contain as much up-to-date information (annotation) as possible on each sequence
All the information items must be retrievable by computer programs in a consistent manner
It must be highly interoperable with other databases
9
Nucleotide sequence databases
EMBL, DDBJ, GenBank
Data are submitted by the sequence owner
The submitter must provide certain information, and the CDS if applicable
No additional annotation is added
Entries are never merged, so there is some redundancy
Slide callouts: Promoter, Exons, CDS (coding sequence)
10
Example EMBL entry 1: general info
ID   AB083336 standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB083336.1
DT   06-JAN-2005 (Rel. 82, Created)
DT   06-JAN-2005 (Rel. 82, Last updated, Version 1)
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Suina; Suidae; Sus.
RN   [1]
RP   1-6116
RA   Hirano K., Shintani Y., Hirano M., Kanaide H.;
RT   ;
RL   Submitted (08-APR-2002) to the EMBL/GenBank/DDBJ databases.
RL   Katsuya Hirano, Graduate School of Medical Sciences, Kyushu University,
RL   Division of Molecular Cardiology, Research Institute of Angiocardiology;
RL   3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka, 812-8582, Japan
RL   (E-mail:khirano@molcar.med.kyushu-u.ac.jp, Tel:81-92-642-5550,
RL   Fax:81-92-642-5552)
RN   [2]
RA   Shintani Y., Hirano K., Hirano M., Nishimura J., Nakano H., Kanaide H.;
RT   "Cloning and Charaterization of full sequence of porcine p27Kip1 gene and
RT   expression of splice isoform p27Kip1R";
RL   Unpublished.
Slide callouts: Accession number, Description of gene, References
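Because every line of the entry starts with a two-letter line code (ID, AC, DE, OS, ...), the "general info" part can be read by a program quite easily. The following is only a minimal sketch, assuming the data start at the sixth character of each line; the function name group_by_line_code is made up for this illustration and does not come from any EMBL tool.

```python
from collections import defaultdict

# A few lines copied from the example entry (line code, then the data).
# "XX" lines are empty spacers between blocks.
ENTRY = """\
ID   AB083336 standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
"""


def group_by_line_code(text):
    """Collect the data of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "XX" or not code.strip():
            continue  # spacer or empty line
        fields[code].append(data)
    # Continuation lines (e.g. the two DE lines) are joined with a space.
    return {code: " ".join(parts) for code, parts in fields.items()}


fields = group_by_line_code(ENTRY)
accession = fields["AC"].rstrip(";")
print(accession)
print(fields["DE"])
```

Joining repeated lines with a space suits DE and OC lines; reference blocks (RN, RA, RT, RL) repeat per reference and would need to be grouped separately.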
11
Example EMBL entry 2: features on the sequence – CDS
FH   Key             Location/Qualifiers
FT   source          1..6116
FT                   /db_xref="taxon:9823"
FT                   /mol_type="genomic DNA"
FT                   /organism="Sus scrofa"
FT                   /cell_type="liver"
FT                   /clone_lib="lambda Fix II porcine genomic DNA"
FT   exon            784..1714
FT                   /evidence=NOT_EXPERIMENTAL
FT                   /note="The residue 2591 corresponds to the transcription
FT                   initiation site determined in human gene"
FT   CDS             join(1240..1714,2261..2271,5104..5160)
FT                   /codon_start=1
FT                   /gene="p27Kip1"
FT                   /product="p27Kip1R"
FT                   /protein_id="BAD83612.1"
FT                   /translation="MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL
FT                   EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ
FT                   EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP
FT                   SVSLKIGMYQLNYSSVW"
Slide callouts: Feature type and location, Feature name and information, Corresponding protein sequence
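The CDS location join(1240..1714,2261..2271,5104..5160) says that the coding sequence is spliced together from three segments. As a rough sketch (the helper parse_join is hypothetical and handles only simple ranges, not complement strands or partial locations), the segments can be extracted and their total length checked against the /translation qualifier:

```python
import re


def parse_join(location):
    """Return (start, end) pairs from a simple location string such as
    'join(1240..1714,2261..2271,5104..5160)' or '784..1714'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


# Location and translation copied from the example entry.
segments = parse_join("join(1240..1714,2261..2271,5104..5160)")
translation = (
    "MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL"
    "EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ"
    "EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP"
    "SVSLKIGMYQLNYSSVW"
)

# Positions are 1-based and inclusive, hence the +1.
cds_length = sum(end - start + 1 for start, end in segments)
codons = cds_length // 3

print(segments)
print(cds_length, codons, len(translation))
```

The three segments add up to 543 nucleotides, i.e. 181 codons, which fits the 180 residues in the translation plus a stop codon.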
12
Example EMBL entry 3: features on the sequence – introns and exons
FT   intron          1715..2260
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            2261..2390
FT                   /number=2
FT   intron          2391..4494
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            4495..5824
FT                   /note="ending at a putative poly A site following a polyA
FT                   signal"
FT                   /number=3
FT   polyA_signal    5802..5807
XX
SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;
     gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt        60
     cttgttttta ttgagggaga gcttgggttc agaatacatt acaaatgcag catctattcc       120
     agtctactta tagaaagacg tcctcctggg cttcccccct aagccccctg cctcccctag       180
     aacagcacag acttctaggt taagggtgag ctaaccactg ctcaccccca gctaaggcac       240
     ccaggctcag gggctccccg cctcccccgc tgagcgagcg gtgggggccc ccccgggaga       300
     gagcccagct gggggccgag cgcccagcgg cgagcccagc tgcccgcccc tacccgctcg       360
     gcgagcgagg ggaaaataag atcgccctcg gcgaggagag ggaggtcggg gctccggagc       420
Slide callout: DNA sequence
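The SQ line summarises the sequence (total length and base composition), and each sequence line ends with a running position count. A small sketch of how a program might read both, assuming exactly the layout shown in the example entry; parse_sq_header is an illustrative name, not part of any EMBL software:

```python
import re

# Header and first sequence line copied from the example entry.
SQ_LINE = "SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;"
SEQ_LINE = "gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt 60"


def parse_sq_header(line):
    """Return the total length and the per-base counts from an SQ line."""
    total = int(re.search(r"Sequence (\d+) BP;", line).group(1))
    counts = {
        base: int(number)
        for number, base in re.findall(r"(\d+) (A|C|G|T|other);", line)
    }
    return total, counts


total, counts = parse_sq_header(SQ_LINE)
# The base counts should add up to the stated sequence length.
composition_ok = sum(counts.values()) == total

# Drop the trailing position number and the spaces between blocks of ten.
bases = "".join(SEQ_LINE.split()[:-1])

print(total, counts, composition_ok)
print(len(bases))
```

Checking that the counts sum to the stated length is a cheap sanity test that an entry was not truncated or garbled when it was copied.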
13
Summary of information in EMBL entries
Describes the sequence type, e.g. genomic DNA, RNA, EST
Provides the taxonomy of the organism from which the sequence came
Provides information on submitters and references
Describes features on the sequence that are important for function, replication, recombination, structure, etc.
Shows whether the DNA encodes a protein (CDS) and provides the protein sequence
Provides the actual nucleotide sequence
14
Protein sequences
DNA -> RNA -> Protein
Protein cleavage
Protein modification (diagram labels: S, S, Ac)
Transported to organelle or membrane
Folded into secondary or tertiary structure
Performs a specific function
All this info needs to be captured in a database
15
Protein sequence databases
UniProt:
–Swiss-Prot: manually curated; distinguishes between experimental and computationally derived annotation
–TrEMBL: automatic translation of EMBL; no manual curation, some automatic annotation
GenPept: GenBank translations
RefSeq: non-redundant sequences for certain organisms
IPI (International Protein Index): combination of many protein sequence databases
16
Example of a Swiss-Prot entry 1
Slide callouts: General information, References
17
Example of a Swiss-Prot entry 2
Slide callouts: Functional information, Cross-references
18
Example of a Swiss-Prot entry 3
Slide callouts: Keywords, Features, Sequence
19
Swiss-Prot annotation is mainly found in:
Description (DE) lines
–Protein name/function
Comment (CC) lines
–e.g. function, subcellular location, pathway, cofactor, disease, etc.
Feature table (FT)
–Features on the sequence, e.g. domain, active site, modifications, variations, etc.
Keyword (KW) lines
–Set of a few hundred controlled vocabulary terms
20
Other parts of UniProt
UniParc: archive of all sequences
UniProt: Swiss-Prot + TrEMBL
UniProt NREF100 (sequences with 100% identity merged)
UniProt NREF90 (sequences with 90% identity merged)
UniProt NREF50 (sequences with 50% identity merged)
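The idea behind the 100% level can be illustrated with a toy example: entries whose sequences are identical are collapsed into one cluster. This is only a sketch of the concept, not UniProt's actual procedure; the identifiers and sequences below are invented, and the 90% and 50% levels need a real sequence-similarity clustering step that is not shown here.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (identifier, sequence) pairs so that entries with exactly
    the same sequence end up in one cluster."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        clusters[sequence.upper()].append(identifier)
    return dict(clusters)


# Hypothetical entries: the first and third have the same sequence.
records = [
    ("entry_1", "MSNVRVSNGS"),
    ("entry_2", "MKTAYIAKQR"),
    ("entry_3", "msnvrvsngs"),
]

clusters = merge_identical(records)
for sequence, members in clusters.items():
    print(sequence, members)
```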
21
Submitting sequences to EMBL or UniProt
WEB-IN: web-based tool for submitting DNA sequences to the EMBL database
Protein sequences are submitted only when the peptides have been directly sequenced; submit these through SPIN
22
Sequence formats
Not MS Word, but text!
Most include an ID/name/annotation of some sort
FASTA, e.g.
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
Others are specific to programs, e.g. GCG, abi, clustal, etc.
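FASTA is simple enough to read with a few lines of code: a line starting with ">" holds the ID and comment, and the lines after it hold the sequence until the next ">". A minimal sketch using the example record from this slide (read_fasta is an illustrative name; for real work an established library would be a safer choice):

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """\
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
"""

records = read_fasta(EXAMPLE)
header, sequence = records[0]
identifier = header.split()[0]  # first word of the header line
print(identifier, len(sequence))
```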
23
Literature database: PubMed/Medline
Source of medical and scientific literature
PubMed has articles published after 1965
Can search by many different means, e.g. author, title, date, journal, etc., or by keywords for each
Can save queries and results
Can usually retrieve abstracts and full papers
PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP], etc.
24
Search fields in PubMed
Title Words                 [TI]
Title/Abstract Words        [TIAB]
Text Words                  [TW]
Substance Name              [NM]
Subset                      [SB]
Secondary Source ID         [SI]
Subheadings                 [SH]
Publication Type            [PT]
Publication Date            [DP]
Personal Name as Subject    [PS]
Page Number                 [PG]
MeSH Terms                  [MH]
MeSH Major Topic            [MAJR]
MeSH Date                   [MHDA]
Language                    [LA]
Journal Title               [TA]
Issue                       [IP]
Filter                      [FILTER]
Entrez Date                 [EDAT]
EC/RN Number                [RN]
Author Name                 [AU]
All Fields                  [ALL]
Affiliation                 [AD]
Unique Identifiers          [UID]
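A field tag is simply appended to the search text in square brackets, and tagged terms can be combined with the Boolean operators AND, OR and NOT. The sketch below only builds the query string to paste into the PubMed search box; the helper names and the example search values are invented for illustration.

```python
def tagged(text, tag):
    """Attach a PubMed field tag to a search term, e.g. 'Mulder N[AU]'."""
    return f"{text}[{tag}]"


def combine(terms, operator="AND"):
    """Join tagged terms with a Boolean operator (AND, OR or NOT)."""
    return f" {operator} ".join(terms)


# Illustrative search: an author, a title word and a publication year.
query = combine([
    tagged("Mulder N", "AU"),
    tagged("database", "TI"),
    tagged("2005", "DP"),
])
print(query)  # Mulder N[AU] AND database[TI] AND 2005[DP]
```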
25
Taxonomy databases
The most used is NCBI's taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
Provides entries for all known organisms
Provides the taxonomic lineage and translation table for each organism
Provides the sequence entries for each organism
The UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt
26
Example taxonomy entry
27
Where to find the databases
Table of addresses for major databases and tools
Nucleic Acids Research Database issue: January each year
Nucleic Acids Research Software issue: new
Amos's list of tools: http://www.expasy.ch/alinks.html