Published by Phoebe Bridges. Modified over 9 years ago.
1
Introduction to databases
Tuomas Hätinen
2
Topics
File formats
Databases
-Primary structure: UniProt
-Tertiary structure: PDB
Database integration system
-Sequence retrieval system (e.g. SRS, hands-on session)
3
File formats
4
Fasta
FASTA format is very common.
Can be constructed by hand when in a hurry
Straightforward way of storing multiple sequences: just concatenate FASTA files
Contents:
-Line 1: > followed by all identifiers and descriptors
-Remaining lines: sequence

>1NJR:A 32.1 KDA PROTEIN IN ADH3-RCA1 INTERGENIC REGION
XTGSLNRHSLLNGVKKXRIILCDTNEVVTNLWQESIPHAYIQNDKYLCIHHGHLQSLXDS
XRKGDAIHHGHSYAIVSPGNSYGYLGGGFDKALYNYFGGKPFETWFRNQLGGRYHTVGSA
TVVDLQRCLEEKTIECRDGIRYIIHVPTVVAPSAPIFNPQNPLKTGFEPVFNAXWNALXH
SPKDIDGLIIPGLCTGYAGVPPIISCKSXAFALRLYXAGDHISKELKNVLIXYYLQYPFE
PFFPESCKIECQKLGIDIEXLKSFNVEKDAIELLIPRRILTLDL

Example of a FASTA sequence for PDB 1njr. Note that X stands for ’any’ amino acid.
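The layout described on this slide (one ">" header line, then sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch, not a full FASTA reader: the function name parse_fasta is hypothetical, only the first two sequence lines of the 1NJR example are used, and it assumes that headers are unique within a file.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]  # line 1 of a record: identifiers and descriptors
            records[header] = ""
        elif header is not None:
            records[header] += line  # remaining lines: sequence
    return records


# First two sequence lines of the 1NJR example from the slide.
example = """>1NJR:A 32.1 KDA PROTEIN IN ADH3-RCA1 INTERGENIC REGION
XTGSLNRHSLLNGVKKXRIILCDTNEVVTNLWQESIPHAYIQNDKYLCIHHGHLQSLXDS
XRKGDAIHHGHSYAIVSPGNSYGYLGGGFDKALYNYFGGKPFETWFRNQLGGRYHTVGSA
"""
records = parse_fasta(example.splitlines())

# Concatenating FASTA text simply yields more records.
# ">second" and its sequence are made up for illustration.
combined = parse_fasta((example + ">second\nACDE\n").splitlines())
```

Because each record starts with its own ">" line, concatenated files need no extra separator, which is why the parser above handles the combined text without any changes.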
5
SwissPROT, EMBL, TrEMBL, UniProt format
Each line begins with a two-letter identifier
UniProt format closely resembles EMBL format, except that it provides considerably more information about physical and biochemical properties
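Since every line starts with a two-letter identifier, an entry can be split into fields by looking at the first two characters of each line. The sketch below assumes the usual flat-file conventions (codes such as ID, AC, DE and SQ, sequence lines that begin with blanks, and "//" closing an entry). The function name group_by_line_code and the toy entry are hypothetical and made up for illustration; the entry is not a real database record.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_line_code(entry_text: str) -> Dict[str, List[str]]:
    """Group the lines of one flat-file entry by their two-letter code."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # '//' terminates an entry
        code, data = line[:2], line[2:].strip()
        if code.strip():  # sequence lines start with blanks; skip them here
            grouped[code].append(data)
    return dict(grouped)


# Hypothetical, heavily shortened entry, made up for illustration only.
toy_entry = """ID   TOY_EXAMPLE   Reviewed;   10 AA.
AC   X00000;
DE   Hypothetical toy protein.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""
fields = group_by_line_code(toy_entry)
```

A real entry has many more line types, and several of them span multiple lines, which is why each code maps to a list rather than a single string.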
6
SwissPROT format
Example of SwissProt entry. Line types are fully explained in: http://au.expasy.org/sprot/userman.html#linetypes
8
Databases
9
Key concepts
Experimental database
-Contains experimental measurements
-E.g. EMBL, PDB
Derived database
-Derived from experimental databases
-E.g. UniProtKB
Database stability
Accession numbers
Non-redundancy
Annotation
10
Nucleic sequence databases – experimental data
[Diagram: three linked databases, each receiving submissions and updates]
-EMBL: EBI (EMBL), Europe
-GenBank: NCBI (NIH), USA
-DDBJ: CIB (NIG), Japan
11
Raw protein sequence databases
[Diagram: the DNA sequence DBs (GenBank, DDBJ, EMBL) are translated ("Trans") into protein sequence DBs (GenPept, TrEMBL); SwissPROT, TrEMBL and PIR-PSD lead to UniProt. Other labels in the figure: EBI, EMBL, NCBI, NIH, Sub/Up, Entrez, SRS]
12
UniProt
Universal Protein Resource
Protein sequence database
UniProt Consortium
-European Bioinformatics Institute
-Swiss Institute of Bioinformatics
-PIR, Georgetown University
Mission
-Maintain a high-quality, stable, comprehensive, fully classified and annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces
13
Organization of UniProt databases
UniProt Archive (UniParc)
-All available protein sequences
UniProt Knowledgebase (UniProtKB)
-Annotated protein sequences
UniProt Reference Clusters (UniRef)
-Reduced redundancy for faster searching
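To give a feel for what "reduced redundancy" means, the sketch below collapses entries whose sequences are exactly identical into one cluster. This is only a toy illustration of the idea, not the procedure used to build UniRef. The function name cluster_identical and all identifiers and sequences are made up.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_identical(records: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the identifiers that share it."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for identifier, sequence in records.items():
        # Compare case-insensitively so 'mkt' and 'MKT' count as identical.
        clusters[sequence.upper()].append(identifier)
    return dict(clusters)


# Made-up identifiers and sequences, for illustration only.
archive = {"entry_a": "MKTAYIAK", "entry_b": "mktayiak", "entry_c": "MKTAYIAR"}
clusters = cluster_identical(archive)
```

Here three archive entries shrink to two clusters, so a search has fewer sequences to compare against while every original identifier stays reachable through its cluster.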
14
Database size comparison
15
UniProtKB
Annotated entries
UniParc => UniProtKB
UniProt/TrEMBL
-Automated annotation
UniProt/SwissProt
-Manual annotation
16
SWISSPROT
Started as part of a PhD thesis; first version released in 1986.
Now a collaboration between the Swiss Institute of Bioinformatics and EBI.
Rich source of protein sequence data
A well-annotated source of sequences
Largely non-redundant
Updated daily; cross-referenced with more than 30 different databases.
Let us view a sample entry
17
TrEMBL
1996: TrEMBL (Translation of EMBL) released
Computer-annotated entries derived from the translation of all coding sequences in the EMBL database, except those already in Swiss-Prot
Complement to Swiss-Prot
Sequences are incorporated into Swiss-Prot by annotators
19
Errors in databases
Be aware of errors in the databases:
sequence errors:
-genome projects’ error rate is 1/10,000 nts;
-ESTs’ error rate is 1/100 nts.
annotation errors:
-Automated computer programs do not always give correct annotations.
-SwissProt is a protein database curated and annotated manually by biologists.
-It is the most reliable database, but is not up-to-date
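The two error rates above become easier to compare when applied to a concrete sequence length. The sketch below takes the rates from the slide. The 1,000 nt length is a hypothetical example, and the probability calculation assumes that errors occur independently at each position, which is a simplification.

```python
GENOME_ERROR_RATE = 1 / 10_000  # errors per nucleotide, from the slide
EST_ERROR_RATE = 1 / 100        # errors per nucleotide, from the slide


def expected_errors(length_nt: int, rate: float) -> float:
    """Expected number of erroneous nucleotides in a sequence."""
    return length_nt * rate


def prob_at_least_one_error(length_nt: int, rate: float) -> float:
    """Chance that a sequence contains one or more errors.

    Assumes errors occur independently at each position (a simplification).
    """
    return 1.0 - (1.0 - rate) ** length_nt


length = 1_000  # hypothetical sequence length in nucleotides
for name, rate in [("genome", GENOME_ERROR_RATE), ("EST", EST_ERROR_RATE)]:
    print(name, expected_errors(length, rate), prob_at_least_one_error(length, rate))
```

For 1,000 nucleotides this gives about 0.1 expected errors at the genome-project rate and about 10 at the EST rate, so an EST of that length almost certainly contains at least one wrong base.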