Sandra Orchard EMBL-EBI


1 Sandra Orchard EMBL-EBI
This presentation introduces the protein database UniProt. I’ll start with some background on UniProt and what we’re trying to achieve. Then I’ll go through the various sections of a UniProtKB/Swiss-Prot entry to show you what kind of biological information we capture and present. I’ll explain our automatic annotation systems, which copy data from well-studied proteins to those that haven’t been studied but are predicted to have a similar function. [Drop this for patent talks - I’ll talk a little about proteomics: what a proteome is and how to access/download a proteome from UniProt.] Then, to finish off, I’ll go through the functionality of the uniprot.org website.

2 Background of UniProt
Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database.
Prior to 2002 there were three separate protein databases located around the globe, each with its own data and rules for annotation. In 2002 these databases merged to provide a unified source with common annotation standards and datasets.

3 We Aim To Provide…
A high quality protein sequence database: a non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs. Sequence archiving is essential.
Easy protein identification: stable identifiers and consistent nomenclature/controlled vocabularies.
Thorough protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources.

4

5 The Two Sides of UniProtKB
UniProtKB/TrEMBL: 1 entry per nucleotide submission. Redundant, automatically annotated - unreviewed.
UniProtKB/Swiss-Prot: 1 entry per protein. Non-redundant, high-quality manual annotation - reviewed.
A very important aspect of UniProtKB is that there are two sides to the database: entries that have been looked at by a curator and those that haven’t. Entries that haven’t been manually reviewed by a curator reside in TrEMBL. These are automatic translations of sequences from ENA, the European Nucleotide Archive (formerly known as EMBL). The European Nucleotide Archive is structured such that each sequence is owned by the person who submitted it, so sequences cannot be merged. As a result, when sequences are translated and incorporated into UniProt there can be redundancy, by which I mean more than one UniProt entry per protein. TrEMBL entries are also unreviewed, so they have no annotation added from the literature, only minimal annotation from computational algorithms. By contrast, once a curator has reviewed an entry and added information from all relevant literature, the entry moves to the Swiss-Prot side of the UniProt database. This side is non-redundant, as all entries for a protein are merged. Indicators of which part of UniProt an entry belongs to include the colour of the stars and the ID....

6 Curation of a UniProt/SwissProt entry
Sequence, UniProt/TrEMBL, References, Sequence variants, Literature, Annotations, Nomenclature, Ontologies, Sequence features, UniProt/SwissProt.
We’ll look at this in more detail after the break, but briefly: UniProt is the gold-standard resource for information on proteins. Every entry initially receives automatic annotation, so it’s not just a bare sequence, and a team of curators undertakes manual curation using the literature and sequence analysis. We also use in-house bioinformatics tools for protein classification and domain prediction. Data from other databases are imported and cross-referenced.
UniProt comprises three different databases, but I haven’t shown all three here for the sake of simplicity. UniProtKB is the central database of protein sequences with accurate, consistent and rich sequence and functional annotation. It comprises the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section. The UniProt Archive (UniParc) is an archive of all the protein sequences in the public domain, and the UniRef databases are a series of three databases that group sequences at 100%, 90% and 50% identity into single records to speed up searching without losing information. UniProtKB contains more than 29 million cross-references to over 100 other data resources; a few key ones are shown here.

7 Searching UniProt – Simple Search
Text-based searching. Logical operators: ‘&’ (and), ‘|’ (or).

8 Searching UniProt – Advanced Search

9 Searching UniProt – Search Results
Each linked to the UniProt entry

10 Searching UniProt – Search Results

11 Searching UniProt – Search Results

12 Searching UniProt – Blast Search

13 As on slide. Just go through data types:
> Entry name – if the entry has been reviewed, the first part should represent the gene name. The second part denotes the organism; usually it consists of the first 3 letters of the genus name and the first 2 letters of the species name, so for the fruit fly Drosophila melanogaster all entries will be *****_DROME. Exceptions are HUMAN, MOUSE and RAT.
> Accession – a Swiss-Prot entry can have more than one accession (for example after entries have been merged).
> Entry history – (pretty self-explanatory) the complete history takes you to a page where any two versions from each UniProt release can be selected and compared.
> Entry status – whether it’s Reviewed (UniProtKB/Swiss-Prot) or Unreviewed (UniProtKB/TrEMBL).
> Annotation project – usually species-specific.
> Disclaimer – only required for human entries, in case users try to self-medicate based on the information we provide.
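
As a rough illustration of the entry-name convention just described, here is a minimal Python sketch. The function names are hypothetical and are not part of any UniProt tool; the mapping of HUMAN, MOUSE and RAT to scientific names is the standard one, and real UniProt mnemonics may have further exceptions beyond these three.

```python
# Sketch of the entry-name rule from the slide notes:
# <protein part>_<organism code>, where the organism code is usually
# 3 letters of the genus + 2 letters of the species.
# Function names here are hypothetical, for illustration only.

SPECIAL_CODES = {
    "Homo sapiens": "HUMAN",
    "Mus musculus": "MOUSE",
    "Rattus norvegicus": "RAT",
}


def organism_code(scientific_name: str) -> str:
    """Return the usual organism code for a 'Genus species' name."""
    if scientific_name in SPECIAL_CODES:
        return SPECIAL_CODES[scientific_name]
    genus, species = scientific_name.split()[:2]
    return (genus[:3] + species[:2]).upper()


def split_entry_name(entry_name: str) -> tuple[str, str]:
    """Split an entry name into its protein part and organism part."""
    protein_part, _, organism_part = entry_name.partition("_")
    return protein_part, organism_part
```

For example, `organism_code("Drosophila melanogaster")` gives the DROME suffix mentioned above, and `split_entry_name` separates the two halves of any entry name at the underscore.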

14 Protein names – there’s always a recommended name, and if the protein is an enzyme the Enzyme Commission (EC) number is given; this is a numerical classification scheme for enzymes based on the chemical reactions they catalyze. Alternative names found in the literature are also added to help users find the entry from a search.
Gene names – usually try to represent the protein name, but this is not always possible.
Organism – with link to complete proteome.
Tax ID – numerical UniProt taxon number and link to NCBI taxonomy browser.
Taxonomic lineage.
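
To make the EC number idea concrete, here is a small Python sketch that splits an EC number into its four levels and looks up the top-level enzyme class. The function name is hypothetical; the seven class names are the standard Enzyme Commission top-level classes.

```python
# Sketch: an EC number has four dot-separated levels; the first one
# gives the broad enzyme class. parse_ec_number is a hypothetical helper.

EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec: str) -> dict:
    """Parse e.g. 'EC 2.7.11.1' into its class name and four levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    levels = text.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four levels, got {ec!r}")
    top = int(levels[0])
    if top not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class in {ec!r}")
    # Lower levels are kept as strings because partial numbers
    # such as '2.7.-.-' use '-' for unassigned levels.
    return {"class": EC_CLASSES[top], "levels": levels}
```

Keeping the lower levels as strings is a deliberate choice here, so that partially assigned numbers can still be parsed.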

15 Annotation comments
FUNCTION, PTM, SUBCELLULAR LOCATION, RNA EDITING, ALTERNATIVE PRODUCTS, MASS SPECTROMETRY, TISSUE SPECIFICITY, DOMAIN, DEVELOPMENTAL STAGE, POLYMORPHISM, INDUCTION, DISRUPTION PHENOTYPE, SIMILARITY, ALLERGEN, CATALYTIC ACTIVITY, DISEASE, COFACTOR, TOXIC DOSE, ENZYME REGULATION, BIOTECHNOLOGY, BIOPHYSICOCHEMICAL PROPERTIES, PHARMACEUTICAL, MISCELLANEOUS, PATHWAY, CAUTION, SUBUNIT, SEQUENCE CAUTION, INTERACTION, WEB RESOURCE
There’s a wide range of topics so that we can capture as much information as possible from the literature; not every topic is present in every entry. The green ones are very general, the blue ones are used a lot for enzymes and the pink ones are relevant to proteins involved in pathology/medicine.

16 Evidence tags to show source
Controlled vocabularies used whenever possible. It’s in this section that all the information obtained from the literature is summarized and organized into specific fields. Where possible we try to be consistent between proteins with the same function; this is aided by the use of controlled vocabularies. Evidence tags are listed at the end of the section to show the source of the information.

17 As far as I am aware UniProt is unique in annotating features to areas of amino acid sequence.
As shown above, we show regions of biological interest, such as those required for an interaction or involved in subcellular localization. We also show smaller sites, such as single amino acids that are modified post-translationally.

18 Automatic Annotation for UniProtKB/TrEMBL
Automatic annotation is a method of copying annotation to similar proteins, so those proteins that have not been studied gain some information.

19 UniProtKB/Swiss-Prot (manually annotated) and UniProtKB/TrEMBL (computationally annotated)
These graphs illustrate the need for automatic annotation. The red dot puts the Swiss-Prot figure in the context of the TrEMBL graph. The message here is that there are far more entries in TrEMBL than in Swiss-Prot, so we need a way to transfer annotation from Swiss-Prot entries to those in TrEMBL that haven’t been reviewed. Web pages for up-to-date stats: UniProt/TrEMBL - UniProt/SwissProt - Last updated 21/02/2012

20 InterPro

21

22 Automatic Annotation
UniProtKB employs two prediction systems, referred to as UniRule and SAAS. SAAS, the Statistical Automatic Annotation System, generates a new set of decision trees with every UniProtKB release using data mining. UniRule maintains a set of manually established and maintained annotation rules.
Automatic annotation is produced by two methods. One method, SAAS, uses a computer program to generate the rules. The other, UniRule, is a collection of rules created by scientists to propagate a specific set of data based on defined criteria. Both of these systems use Swiss-Prot and InterPro as training sets.
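
To give a feel for what a manually written annotation rule might look like, here is a toy Python sketch. It is only an illustration under assumptions: the data model, the class and function names, and the placeholder signature identifiers are all hypothetical, and the real UniRule format and conditions are richer than this.

```python
# Toy model of rule-based annotation: if an unreviewed entry carries
# the required signatures and belongs to the right taxon, copy the
# annotation onto it. All names and identifiers are hypothetical.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    required_signatures: frozenset  # e.g. InterPro-style signature IDs
    taxon: str                      # taxon that must appear in the lineage
    annotation: str                 # text to propagate when the rule fires


@dataclass
class Entry:
    accession: str
    signatures: set
    lineage: list
    annotations: list = field(default_factory=list)


def apply_rules(entry: Entry, rules: list) -> list:
    """Add the annotation of every rule whose conditions the entry meets."""
    for rule in rules:
        has_signatures = rule.required_signatures <= entry.signatures
        in_taxon = rule.taxon in entry.lineage
        if has_signatures and in_taxon:
            entry.annotations.append(rule.annotation)
    return entry.annotations
```

The point of the sketch is only that each rule pairs a condition (signatures plus taxonomy) with the annotation to propagate; SAAS would instead derive such conditions automatically as decision trees.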

23 Help / Feedback help@uniprot.org
Stuck? Just ask – active help and support team. Feedback – if you find something incorrect, outdated, missing etc., please tell us.

24 Introduction to Protein Signatures & InterPro
Introduction to InterPro

25 Protein Signatures
Protein signature = an amino acid sequence (not necessarily consecutive) associated with a protein characteristic.
Basically, introduce the concept of protein signatures.

26 What value are signatures?
Better at finding proteins with common function. Find more distant homologues than BLAST.
Classification of proteins: associate proteins that share function, domains, sequence or structure.
Annotation of protein sequences: define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites.

27 How are protein signatures made?
Steps: protein family/domain; multiple sequence alignment; build model; search; significant matches; refine; protein signature.
Sequences shown:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
SVPRSPVCGSDGVTYGTECDLK
HPPPGPVCGTDGLTYDNRCELR
E-values shown: 1e-49, 3e-42, 5e-39, 6e-10
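
The "build model, then search" step can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it builds a position-specific scoring matrix with pseudocounts rather than the HMMs or profiles the member databases actually use, and it assumes the four sequences from the slide form an ungapped alignment. The function names are hypothetical.

```python
# Sketch: build a simple position-specific scoring matrix (log-odds
# against a uniform background) from an ungapped alignment, then score
# a candidate sequence of the same length against it.
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The four sequences shown on the slide, assumed here to be aligned.
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]


def build_profile(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum the per-column log-odds scores for a same-length sequence."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile")
    return sum(column[aa] for column, aa in zip(profile, sequence))
```

Sequences that resemble the alignment score higher than unrelated ones; in a real pipeline the high scorers would be the significant matches that are fed back in to refine the model.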

28 Types of Protein signatures
(sequence based) Multiple protein alignment

29 Types of Protein signatures
(sequence based) Single motif methods: regular expression patterns
C - C - {P} - x(2) - C - [STDNEKPI] - C
C = must be this
{ } = cannot be..
x = any AA
( ) = number of AAs
[ ] = any of
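
The pattern notation above maps quite directly onto an ordinary regular expression. The following Python sketch converts the subset of the syntax shown on the slide (fixed residues, `x`, `{ }`, `[ ]` and repeat counts); the function name is hypothetical, and other parts of the PROSITE syntax, such as sequence-end anchors, are not handled.

```python
# Sketch: translate the pattern syntax shown on the slide into a
# Python regular expression. Only the elements explained on the slide
# are supported; prosite_to_regex is a hypothetical helper name.
import re


def prosite_to_regex(pattern: str) -> str:
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        if not element:
            raise ValueError(f"empty element in pattern {pattern!r}")
        # Split an element such as 'x(2)' into its core and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x = any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # { } = cannot be
        # '[...]' (any of) and single letters are already valid regex.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


KAZAL_LIKE = prosite_to_regex("C - C - {P} - x(2) - C - [STDNEKPI] - C")
```

The resulting string can be passed straight to `re.search` to scan a protein sequence for the motif.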

30 Types of Protein signatures
(sequence based) Single motif methods: regular expression patterns
Multiple motif methods (motifs 1, 2, 3): identity matrices, fingerprints

31 Types of Protein signatures
(sequence based) Single motif methods: regular expression patterns
Multiple motif methods: identity matrices, fingerprints
Full domain alignment methods: profiles (profile library); hidden Markov models (mathematical model of amino acid probability; states M1 M2 M3 M4, I1 I2 I3, D2 D3)

32 CONTRIBUTING MEMBER DATA BASES
Models built on either sequence or structural alignments. Each MDB has its own focus.
Model types: hidden Markov models, fingerprints, profiles, patterns, sequence clusters.
Focus: protein features (active sites…), prediction of conserved domains, structural domains, functional annotation of families/domains.
InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins, such as possible function and the potential location of functionally important sites and domains. Each member database creates signatures in a different way: some groups build them from manually created sequence alignments, some use automatic processes with some human input and correction, and others build their signatures entirely automatically. The signatures are represented using a variety of model types (HMMs, profiles, regular expressions, etc.).
The member databases all have their own particular niche or focus; at InterPro we aim to combine their individual strengths. To do this we integrate signatures from the member databases that represent the same protein family, domain or site into a single InterPro entry. We check the biological accuracy of the individual signatures and add concise information about the signatures and the types of proteins they match, including consistent names, descriptive abstracts (with links to original publications) and GO terms.

33 Database: basis; institution; built from; focus
Pfam: HMM; Sanger Institute; built from sequence alignment; focus: family & domain based on conserved sequence
Gene3D: UCL; built from structure alignment; focus: structural domain
Superfamily: Uni. of Bristol; focus: evolutionary domain relationships
SMART: EMBL Heidelberg; focus: functional domain annotation
TIGRFAM: J. Craig Venter Inst.; focus: microbial functional family classification
Panther: Uni. S. California; focus: family functional classification
PIRSF: PIR, Georgetown, Washington D.C.; focus: functional classification
PRINTS: fingerprints; Uni. of Manchester
PROSITE: patterns & profiles; SIB; focus: functional annotation
HAMAP: profiles; focus: microbial protein family classification
ProDom: sequence clustering; PRABI : Rhône-Alpes Bioinformatics Center; focus: conserved domain prediction

34 Foundations of InterPro
InterPro: integration of signatures, manual curation

35 InterPro Entry
Groups similar signatures together
Links related signatures
Adds extensive annotation
Linked to other databases
Structural information and viewers

36 Link related signatures - relationships
Parent - Child (subgroup of more closely related proteins). Applies to domains and families.
Parent: Protein kinase. Children: Serine kinase, Tyrosine kinase (no proteins in common).
Signatures shown come from PFAM, SMART and PROSITE, labelled with the figures (100), (75), (100) and (25).
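
One way to think about parent/child relationships is in terms of the sets of proteins each signature matches. The Python sketch below is only an illustration of that idea, using made-up protein identifiers and a hypothetical function name; InterPro relationships are assigned by curators, not by a simple set comparison like this.

```python
# Sketch: compare the sets of proteins matched by two signatures.
# A child matches a strict subset of its parent's proteins; siblings
# such as serine and tyrosine kinase share no proteins.
# relationship() and the protein IDs are hypothetical.


def relationship(matches_a: set, matches_b: set) -> str:
    """Describe how signature A relates to signature B by matched proteins."""
    if not (matches_a & matches_b):
        return "no proteins in common"
    if matches_a == matches_b:
        return "same protein set"
    if matches_a < matches_b:
        return "child of"
    if matches_b < matches_a:
        return "parent of"
    return "partial overlap"


protein_kinase = {"P1", "P2", "P3", "P4"}
serine_kinase = {"P1", "P2"}
tyrosine_kinase = {"P3", "P4"}
```

With these toy sets, the serine and tyrosine kinase signatures are both children of the protein kinase signature and have no proteins in common with each other.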

37 The InterPro entry types
Family: proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure.
Domain: biological units with defined boundaries.
Repeat: short sequences typically repeated within a protein.
Sites: PTM, active site, binding, conserved.

38 InterPro Search protein ID wwwdev.ebi.ac.uk/interpro

39 InterPro Search Results
Family; domains and sites; unintegrated signatures; structural data; link to PDBe

40 InterProScan – Searching New Sequence
Paste in unknown sequence; additional options. wwwdev.ebi.ac.uk/interpro

41 InterProScan New Search Results
Link to InterPro entry; links to signature databases. [Mention also summary table.]

42 Contact and help
We need your feedback: missing/additional references, reporting problems, requests. Help - questions answered.
NOTE: Worth mentioning that sometimes we cannot infer function, only homology to unknown proteins; this is the case for proteins of unknown function and domains of unknown function.

43 Thanks for your attention

