Introduction to databases Tuomas Hätinen. Topics File Formats Databases -Primary structure: UniProt -Tertiary structure: PDB Database integration system.


1 Introduction to databases Tuomas Hätinen

2 Topics
File formats
Databases
-Primary structure: UniProt
-Tertiary structure: PDB
Database integration system
-Sequence retrieval system (e.g. SRS, hands-on session)

3 File formats

4 Fasta
FASTA format is very common. It can be constructed by hand when in a hurry.
It is a straightforward way of storing multiple sequences: just concatenate FASTA files.
Contents:
Line 1: ">" followed by all identifiers and descriptors
Remaining lines: sequence
Example of a FASTA sequence for PDB entry 1njr. Note that X stands for ’any’ amino acid.
>1NJR:A 32.1 KDA PROTEIN IN ADH3-RCA1 INTERGENIC REGION
XTGSLNRHSLLNGVKKXRIILCDTNEVVTNLWQESIPHAYIQNDKYLCIHHGHLQSLXDS
XRKGDAIHHGHSYAIVSPGNSYGYLGGGFDKALYNYFGGKPFETWFRNQLGGRYHTVGSA
TVVDLQRCLEEKTIECRDGIRYIIHVPTVVAPSAPIFNPQNPLKTGFEPVFNAXWNALXH
SPKDIDGLIIPGLCTGYAGVPPIISCKSXAFALRLYXAGDHISKELKNVLIXYYLQYPFE
PFFPESCKIECQKLGIDIEXLKSFNVEKDAIELLIPRRILTLDL
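The header-line/sequence-lines layout described on this slide can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `parse_fasta` is hypothetical, and it assumes every record starts with a ">" line and that all following lines up to the next ">" belong to that record's sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    Assumption: each record starts with a '>' line; the remaining lines,
    up to the next '>' line, are pieces of that record's sequence.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Close the last record in the input.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Because records are delimited only by their ">" lines, concatenating two FASTA files and passing the result to `parse_fasta` simply yields the records of both files in order.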

5 SwissPROT, EMBL, TrEMBL, UniProt format
Each line begins with a two-letter identifier.
The UniProt format closely resembles the EMBL format, except that considerably more information about physical and biochemical properties is provided.
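Since each line starts with a two-letter identifier, an entry can be split into fields by looking at the first two characters of every line. The sketch below illustrates the idea under stated assumptions: the function name `group_by_line_code` and the key "seq" are hypothetical, the entry is assumed to end with a "//" line, sequence data lines are assumed to start with blanks instead of a code, and the toy entry is made up for illustration and is not a real database record.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code.

    Assumptions: '//' terminates the entry, and lines whose first two
    characters are blank carry sequence data (stored under the key 'seq').
    """
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # end of entry
        if not line.strip():
            continue  # skip empty lines
        code = line[:2].strip() or "seq"
        fields[code].append(line[2:].strip())
    return dict(fields)


# Toy entry, invented for illustration only (not a real record).
TOY_ENTRY = "\n".join([
    "ID   TOY_EXAMPLE   Reviewed;   12 AA.",
    "AC   X00000;",
    "DE   Made-up protein for illustration.",
    "SQ   SEQUENCE   12 AA;",
    "     MKTAYIAKQR QI",
    "//",
])

fields = group_by_line_code(TOY_ENTRY)
# Join the sequence lines and drop the spacing between residue blocks.
sequence = "".join(fields["seq"]).replace(" ", "")
```

Grouping by line code keeps the sketch independent of which line types an entry actually contains; the meaning of each code is given in the user manual referenced on the following slides.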

6 SwissPROT format Example of SwissProt entry. Line types are fully explained in: http://au.expasy.org/sprot/userman.html#linetypes

7 SwissPROT format Example of SwissProt entry. Line types are fully explained in: http://au.expasy.org/sprot/userman.html#linetypes

8 Databases

9 Key concepts
Experimental database
-Contains experimental measurements
-E.g. EMBL, PDB
Derived database
-Derived from experimental databases
-E.g. UniProtKB
Database stability
Accession numbers
Non-redundancy
Annotation

10 Nucleic sequence databases – experimental data
[Diagram: three nucleotide sequence databases, each receiving submissions and updates: EMBL (EBI, Europe), GenBank (NCBI, NIH, USA) and DDBJ (CIB, NIG, Japan).]

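11 Raw Protein sequence databases
[Diagram: DNA sequence databases (EMBL at EBI, GenBank at NCBI/NIH, DDBJ) and the protein sequence databases derived from them by translation (GenPept, TrEMBL), together with SwissPROT, PIR-PSD and UniProt. Other labels in the diagram: Entrez, SRS, Sub/Up.]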

12 UniProt
Universal Protein Resource
Protein sequence database
UniProt Consortium
-European Bioinformatics Institute
-Swiss Institute of Bioinformatics
-PIR, Georgetown University
Mission
-Maintain a high-quality, stable, comprehensive, fully classified and annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces

13 Organization of UniProt databases
UniProt Archive (UniParc)
-All available protein sequences
UniProt Knowledgebase (UniProtKB)
-Annotated protein sequences
UniProt Reference Clusters (UniRef)
-Reduced redundancy for faster searching

14 Database size comparison

15 UniProtKB
Annotated entries
UniParc => UniProtKB
UniProt/TrEMBL: automated annotation
UniProt/SwissProt: manual annotation

16 SWISSPROT
Started as part of a PhD thesis; the first version was released in 1986.
Now a collaboration between the Swiss Institute of Bioinformatics and EBI.
Rich source of protein sequence data
A well-annotated source for sequences
Largely non-redundant
Updated daily and cross-referenced with more than 30 different databases.
Let us view a sample entry.

17 TrEMBL
1996: TrEMBL (Translation of EMBL) released
Computer-annotated entries derived from the translation of all coding sequences in the EMBL database, except those already in Swiss-Prot
A complement to Swiss-Prot
Sequences are included in Swiss-Prot by annotators

18 Errors in databases
Be aware of errors in the databases:
Sequence errors:
-Genome projects’ error rate is 1/10,000 nts
-ESTs’ error rate is 1/100 nts
Annotation errors:
-Programs do not always give correct annotations
-SwissProt is a protein database curated and annotated manually by biologists
-Manual curation doe

19 Errors in databases
Be aware of errors in the databases:
Sequence errors:
-Genome projects’ error rate is 1/10,000 nts
-ESTs’ error rate is 1/100 nts
Annotation errors:
-Automated computer programs do not always give correct annotations
-SwissProt is a protein database curated and annotated manually by biologists
-It is the most reliable database, but it is not up to date
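To get a feeling for what these error rates mean for a single sequence, the per-nucleotide rates quoted above can be turned into an expected number of errors and a probability of at least one error. This is a rough sketch only: the function names are hypothetical, and it assumes that errors occur independently at every position with the same rate, which real sequencing data need not satisfy.

```python
# Per-nucleotide error rates quoted on the slide.
GENOME_RATE = 1 / 10_000  # genome projects
EST_RATE = 1 / 100        # ESTs


def expected_errors(length_nt, error_rate):
    """Expected number of erroneous nucleotides in a sequence of given length."""
    return length_nt * error_rate


def prob_at_least_one_error(length_nt, error_rate):
    """Probability that the sequence contains at least one error.

    Assumption: errors are independent and equally likely at every position.
    """
    return 1.0 - (1.0 - error_rate) ** length_nt


# Example: a 500 nt sequence from each kind of source.
est_expected = expected_errors(500, EST_RATE)
genome_expected = expected_errors(500, GENOME_RATE)
est_prob = prob_at_least_one_error(500, EST_RATE)
genome_prob = prob_at_least_one_error(500, GENOME_RATE)
```

Under these assumptions a 500 nt EST is expected to carry about five errors and is almost certain to contain at least one, whereas a 500 nt stretch from a genome project is expected to carry about 0.05 errors.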

