Published by Sylvia Merritt. Modified over 9 years ago.

EMBL Outstation — The European Bioinformatics Institute
Removing redundancy in SWISS-PROT and TrEMBL

SWISS-PROT
• is a curated protein sequence data bank established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987
• currently contains 75,000 protein sequence entries

Essential criteria for a sequence data bank
• it must be complete with minimal redundancy
• it must contain as much up-to-date information as possible on each sequence
• all the information items must be retrievable by computer programs in a consistent manner
• it should be integrated (cross-referenced) with other sequence-related data banks

The Bottleneck: Annotation

Annotation consists of the description of:
• Function(s) of the protein
• Post-translational modification(s)
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Disease(s) associated with deficiency(ies) in the protein
• Sequence conflicts, variants, etc.

TrEMBL
• is a computer-annotated supplement to SWISS-PROT
• consists of entries in SWISS-PROT format
• contains translations of the coding sequences (CDS) in the EMBL Nucleotide Sequence Database that are not in SWISS-PROT
• the translation tools used are based on the program trembl, written by Thure Etzold at the EMBL in Heidelberg
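
To illustrate what "translation of CDS" means, here is a minimal sketch that translates a coding sequence into a protein sequence. It is not the trembl program: the function name `translate_cds`, the use of the standard genetic code only, and the handling of unknown codons as `X` are all assumptions made for this example.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
# Assumption: the real trembl tool also handles other genetic codes and
# feature-table details that this sketch ignores.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds: str) -> str:
    """Translate a CDS codon by codon, stopping at the first stop codon.

    Codons containing anything other than T, C, A, G become 'X'.
    """
    seq = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate_cds("ATGGCCTAA")` reads the codons ATG, GCC and TAA and yields the two-residue protein `MA`.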

TrEMBLNEW
• Weekly update of TrEMBL which contains protein coding sequences derived from EMBLNEW
• TrEMBLNEW entries are moved into TrEMBL during the quarterly release building procedure

The Production of TrEMBL
• Translation and entry creation
• Sorting the entries
• Automated post-processing of the SP-TrEMBL entries

Automated post-processing of TrEMBL entries
• Redundancy removal: currently affects >10% of the entries
• Improvements to annotation: currently affects >20% of the entries

Removing Redundancy
• Causes of redundancy and the detection of redundancy
• Removing redundancy

Causes of redundancy
• Different literature and sequence reports for the same protein
• Subfragments of longer sequences
• Mutations, polymorphisms, variations and conflicts of a sequence are often given as separate entries in EMBL

Redundancy detection
• The Cyclic Redundancy Check (CRC32) calculates a nearly unique and very compact checksum for each sequence
• The Boyer-Moore algorithm for fast string searching
• An algorithm that finds strings with errors (Landau-Vishkin)
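
The checksum step can be sketched as follows. This is an illustration only: it uses Python's `zlib.crc32`, which may not match the exact CRC32 variant used in production, and the entry IDs and the function names `crc32_of` and `group_identical` are hypothetical. Because the checksum is only "nearly unique", the sketch confirms each checksum match with a full string comparison.

```python
import zlib
from collections import defaultdict


def crc32_of(sequence: str) -> int:
    """Compact checksum of a protein sequence (case-insensitive)."""
    return zlib.crc32(sequence.upper().encode("ascii"))


def group_identical(entries: dict[str, str]) -> list[list[str]]:
    """Group entry IDs whose sequences are identical.

    `entries` maps a (hypothetical) entry ID to its sequence.
    Only groups with at least two members are returned.
    """
    # First pass: bucket by checksum, which is cheap to compute and compare.
    buckets = defaultdict(list)
    for entry_id, seq in entries.items():
        buckets[crc32_of(seq)].append(entry_id)

    # Second pass: a checksum can collide, so confirm with the full sequence.
    groups = []
    for ids in buckets.values():
        by_sequence = defaultdict(list)
        for entry_id in ids:
            by_sequence[entries[entry_id].upper()].append(entry_id)
        groups.extend(g for g in by_sequence.values() if len(g) > 1)
    return groups
```

With entries `{"A": "MKT", "B": "MKT", "C": "MKV"}`, only A and B end up in a group, since C differs in its last residue.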

Removing Redundancy
• Identical full-length proteins are merged into one entry
• Identical fragment proteins and subfragments of longer sequences from the same organism are merged
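
A minimal sketch of the subfragment rule, assuming a simplified entry record: the `Entry` class, its fields and the function `find_subfragments` are hypothetical, and Python's built-in substring search stands in for the string-matching algorithms named on the detection slide. It covers only exact subfragments that are strictly shorter than the containing sequence, not matches with errors.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical, simplified record; real entries carry far more annotation.
    entry_id: str
    organism: str
    sequence: str
    is_fragment: bool = False


def find_subfragments(entries: list[Entry]) -> list[tuple[str, str]]:
    """Return (fragment_id, container_id) pairs that are merge candidates.

    A fragment qualifies when a longer sequence from the same organism
    contains it exactly.
    """
    pairs = []
    for frag in entries:
        if not frag.is_fragment:
            continue
        for other in entries:
            if other is frag or other.organism != frag.organism:
                continue
            longer = len(other.sequence) > len(frag.sequence)
            if longer and frag.sequence in other.sequence:
                pairs.append((frag.entry_id, other.entry_id))
                break  # one containing entry is enough to flag the fragment
    return pairs
```

The same-organism condition matters: an identical peptide from a different species is left alone, because it documents a distinct protein.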

Removing Redundancy
• The ‘MERGE’ procedure
  - match CRC32
      match TrEMBLNEW vs TrEMBLNEW (automatic merge)
      match TrEMBLNEW vs TrEMBL (automatic merge)
      match TrEMBLNEW vs SWISS-PROT (manual merge)
  - Subfragment assembly (LASSAP)
      match TrEMBLNEW vs TrEMBLNEW (automatic merge and manual check)
      match TrEMBLNEW vs TrEMBL (automatic merge and manual check)
      match TrEMBLNEW vs SWISS-PROT (manual merge)
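
The routing in the 'MERGE' procedure can be written down as a small lookup table. This is a sketch of the decision rule as listed on the slide, not of the production pipeline; the names `MERGE_POLICY` and `merge_action` and the match-type labels `"crc32"` and `"subfragment"` are assumptions for illustration.

```python
# (match type, database the TrEMBLNEW entry matched against) -> handling.
# The six combinations are taken from the slide; the labels are hypothetical.
MERGE_POLICY = {
    ("crc32", "TrEMBLNEW"): "automatic merge",
    ("crc32", "TrEMBL"): "automatic merge",
    ("crc32", "SWISS-PROT"): "manual merge",
    ("subfragment", "TrEMBLNEW"): "automatic merge and manual check",
    ("subfragment", "TrEMBL"): "automatic merge and manual check",
    ("subfragment", "SWISS-PROT"): "manual merge",
}


def merge_action(match_type: str, target_db: str) -> str:
    """Look up how a match of a TrEMBLNEW entry should be handled."""
    try:
        return MERGE_POLICY[(match_type, target_db)]
    except KeyError:
        raise ValueError(
            f"no merge rule for {match_type!r} match against {target_db!r}"
        ) from None
```

The pattern in the table is that anything touching the curated SWISS-PROT entries is merged by hand, while matches within the computer-annotated data are merged automatically, with a manual check for the less certain subfragment matches.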

PID Check
[Flow diagram. Labels: EMBLNEW; trembl; SP + TREMBL PIDs (Work Release); Day 1, Day 2, ... Day n; TREMBLNEW; Week 1, Week 2, ... Week n; TREMBLNEW Updates; Replace PIDs in SP+TREMBL; SP; TREMBL; Merge. Phases: Between releases; Building Release.]

Results
EMBL Nucleotide Sequence Database (rel 55) has 326,000 CDS
SWISS-PROT (rel 36) has 74,019 entries
TrEMBL (rel 7) has 193,860 entries
• 110,000 CDS were already in 74,000 SWISS-PROT entries
• 207,000 CDS were in 194,000 TrEMBL entries
• 9,000 CDS are currently being processed because of the redundancy procedures
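
As a quick check, the three CDS counts above add up to the 326,000 CDS in EMBL release 55. The sketch below redoes that arithmetic and computes each share; the dictionary keys and the helper name `cds_shares` are made up for this example, and the counts are the rounded figures from the slide.

```python
# Rounded CDS counts as given on the slide.
TOTAL_CDS = 326_000
CDS_COUNTS = {
    "already in SWISS-PROT": 110_000,
    "in TrEMBL": 207_000,
    "still being processed": 9_000,
}


def cds_shares() -> dict[str, float]:
    """Percentage of all CDS that falls into each category."""
    return {name: 100 * count / TOTAL_CDS for name, count in CDS_COUNTS.items()}


# The three categories account for every CDS in the release.
assert sum(CDS_COUNTS.values()) == TOTAL_CDS

for name, share in cds_shares().items():
    print(f"{name}: {share:.1f}%")
```

The CDS counts exceed the entry counts (110,000 CDS in 74,000 SWISS-PROT entries) because merged entries collect several CDS for the same protein.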

Results
• Results of redundancy removal within TrEMBL 7 production
  - 743 were already in SWISS-PROT
  - 3,380 were merged due to CRC32 matches
  - 4,736 were removed by subfragment matches
• In total, 8,859 entries were removed

Credits
SWISS-PROT at EBI
• Rolf Apweiler
• Sergio Contrino
• Wolfgang Fleischmann
• Henning Hermjakob
• Viv Junker
• Fiona Lang
• Claire O'Donovan
• Michele Magrane
• Maria Jesus Martin
• Nicoletta Mitaritonna
• Steffen Moeller
• Youla Karavidopoulou
• Gill Fraser
• Evguenia Kriventseva
Collaborators
• Amos Bairoch
• Eric Glemet
• Jean-Jacques Codani