Published by Aron Oliver. Modified over 9 years ago.
1
Introduction to InterPro. Amaia Sangrador, InterPro curator (amaia@ebi.ac.uk). EBI is an Outstation of the European Molecular Biology Laboratory.
2
What is InterPro?
InterPro is a diagnostic resource: it uses signatures from several different databases (referred to as member databases) to predict information about proteins.
* Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
* Adds information about the signatures and the types of proteins they match
3
InterPro Consortium: a consortium of 11 major signature databases
4
Why do we need predictive annotation tools?
5
What is UniProt? A collaboration between EBI, SIB and PIR, based on the original work on PIR, Swiss-Prot and TrEMBL. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
6
UniProt databases (diagram):
* Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources
* UniParc: sequence archive (current and obsolete sequences)
* UniProtKB: protein knowledgebase
  * UniProtKB/Swiss-Prot: reviewed (high-quality manual annotation)
  * UniProtKB/TrEMBL: unreviewed (automatic annotation)
* UniRef: sequence clusters (UniRef100, UniRef90, UniRef50)
* UniMES: metagenomic and environmental sample sequences
7
Annotation using InterPro (diagram): manually annotated sequences in Swiss-Prot form groups of related proteins (same family or shared domains); protein signatures derived from these groups feed the InterPro automatic annotation pipeline, which annotates uncharacterised sequences in TrEMBL.
8
Protein family classification. Given a set of sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
9
Protein family classification: BLAST (pairwise comparisons)
10
Protein family classification: BLAST
11
Limitations with pairwise comparisons: BLAST alignment of 2 proteins (60S acidic ribosomal protein P0 from 2 species)
12
Limitations with Pairwise comparisons
13
Protein family classification: signature databases. Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously-characterised sequences. This is the approach taken by ‘signature’ databases.
14
Protein signatures: more sensitive homology searches. Each member database creates signatures using different methods and methodologies:
* manually created sequence alignments
* automatic processes with some human input and correction
* entirely automatic processes
15
What are protein signatures? (diagram) Multiple sequence alignment of a protein family/domain (e.g. ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK) → build model → search UniProt → significant match → mature model → protein analysis
16
Member databases (diagram).
* Methods: hidden Markov models, fingerprints, profiles, patterns, sequence clusters, structural domains
* Uses: functional annotation of families/domains, prediction of conserved domains, protein features (active sites…)
17
Diagnostic approaches (sequence-based):
* Single motif methods: regex patterns (PROSITE)
* Multiple motif methods: identity matrices (PRINTS)
* Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
18
Patterns (diagram): sequence alignment → define pattern (motif) → extract pattern sequences → build regular expression → pattern signature (PS00000), e.g. C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
19
Patterns
Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites.
Advantages:
* A match can be anchored to the extremity of a sequence, e.g. <M-R-[DE]-x(2,4)-[ALT]-{AM}
* Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
* Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines): C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks:
* Simple but less powerful
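The pattern notation on this slide maps closely onto ordinary regular expressions. The following is a minimal sketch of that translation, not PROSITE's own tooling: the function name `prosite_to_regex` is hypothetical, and only the elements that appear on the slides are handled (single residues, `x`, `[allowed]`, `{forbidden}`, `(n)` / `(n,m)` repeats, and the `<` / `>` anchors at the ends of the pattern).

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; it covers only the syntax
    shown on the slides.
    """
    body = pattern.strip().rstrip(".")
    anchored_start = body.startswith("<")   # match must begin at the N-terminus
    anchored_end = body.endswith(">")       # match must end at the C-terminus
    body = body.strip("<>")

    parts = []
    for element in body.split("-"):
        match = _ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)

    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


anchored = prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}")
conotoxin = prosite_to_regex("C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C")
print(anchored)
print(conotoxin)
```

Because of the `<` anchor, the first expression can only match at the start of a sequence; the same motif further along the chain is not reported.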
20
Prosite patterns. EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)
Regular expression: A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
Pattern/motif in sequence: AAVEEGILPGGG
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI
SANGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT
QKCELDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG
IKVCAIKAPGFGENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT
VILDGAGDKKGIEERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG
EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP
VYTIASNAGVEGAVIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS
VSSLLTTTEAVVVDLPKDESESGAAGAGMGGMGGMDY
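To see the PS00296 pattern locate its motif, the scan can be reproduced with a plain regular expression. This is only a sketch: the regex is a hand translation of the slide's pattern, and `excerpt` is a 60-residue stretch copied from the P29197 sequence above, so the reported positions are relative to that excerpt rather than to the full protein.

```python
import re

# Hand translation of A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]:
# {X} becomes a negated class, x becomes ".", G(2) becomes "G{2}".
CPN60_PATTERN = re.compile(r"A[AS][^L][DEQ]E[^A][^Q][^R].G{2}[GA]")

# Excerpt of the chaperonin sequence shown on the slide.
excerpt = "EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP"

# Collect (1-based start, end, matched residues) for every hit.
hits = [(m.start() + 1, m.end(), m.group()) for m in CPN60_PATTERN.finditer(excerpt)]

for start, end, residues in hits:
    print(f"{start}-{end}: {residues}")
```

A single short pattern like this gives a yes/no answer with no score, which is the "simple but less powerful" trade-off noted for patterns.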
21
Fingerprints (diagram): sequence alignment → define motifs (motif 1, motif 2, motif 3) → extract motif sequences → weight matrices → fingerprint signature (PR00000). The motifs must occur in the correct order and with the correct spacing.
22
The significance of motif context (order and interval):
* Identify small conserved regions in proteins
* Several motifs characterise a family
* Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours
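The idea that order and interval matter can be illustrated with a toy check. This is a simplified sketch and not how PRINTS itself scores fingerprints (PRINTS uses matrices derived from the aligned motifs): here each motif is a hypothetical regular expression, the sequence is invented, and the search is greedy, taking the leftmost occurrence of each motif after the previous one.

```python
import re
from typing import List, Optional


def match_fingerprint(sequence: str, motifs: List[str], max_gap: int) -> Optional[List[int]]:
    """Return 0-based motif start positions if every motif is found in order
    and within max_gap residues of the previous motif's end; otherwise None."""
    starts = []
    previous_end = 0
    for index, motif in enumerate(motifs):
        found = re.compile(motif).search(sequence, previous_end)
        if found is None:
            return None                      # motif missing or out of order
        if index > 0 and found.start() - previous_end > max_gap:
            return None                      # motif present but spaced too far away
        starts.append(found.start())
        previous_end = found.end()
    return starts


# Invented sequence and motifs, for illustration only.
toy_sequence = "MKTAYCWGGLLPHEDKKRSTNNGFW"
toy_motifs = ["C.G", "HED", "N.GF"]

print(match_fingerprint(toy_sequence, toy_motifs, max_gap=10))
print(match_fingerprint(toy_sequence, ["HED", "C.G", "N.GF"], max_gap=10))
```

The second call contains the same three motifs but asks for them in the wrong order, so no fingerprint match is reported even though each motif is individually present.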
23
PRINTS families are hierarchical: different motifs describe subfamilies.
* G protein-coupled receptors
  * rhodopsin-like
    * adenosine receptors
    * opsin receptors
    * dopamine receptors
    * somatostatin receptors
      * somatostatin receptor type 1
      * somatostatin receptor type 2
      * somatostatin receptor type 3
      * etc
    * histamine receptors
    * etc
  * secretin-like
  * cAMP receptors
  * metabotropic glutamate receptors
  * etc
24
Profiles & HMMs (diagram): sequence alignment → define coverage (entire domain or whole protein) → use entire alignment for domain or protein → build model (models insertions and deletions) → profile or HMM signature
25
Hidden Markov Models (HMM) Models insertions and deletions More flexible (can use partial alignments) Profiles Built using weight matrices More sophisticated algorithm
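As a rough illustration of what "built using weight matrices" means, the sketch below builds a log-odds position weight matrix from a small ungapped alignment and slides it along a sequence. Everything here is an assumption for illustration: the alignment and query are invented, the background is uniform over the 20 amino acids, and a pseudocount of 1 is used. Real profiles and HMMs additionally model insertions and deletions, which this sketch does not.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(aligned, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)            # uniform background (assumption)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for column in range(len(aligned[0])):
        counts = Counter(seq[column] for seq in aligned)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def best_window(sequence, pwm):
    """Score every window of the matrix width; return (best score, 0-based start)."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Invented ungapped alignment and query sequence.
toy_alignment = ["CPVCG", "SPVCG", "CPICG", "CPVCA"]
toy_pwm = build_pwm(toy_alignment)
score, start = best_window("MKTWSPVCGLDGK", toy_pwm)
print(f"best window starts at {start} with score {score:.2f}")
```

Unlike a regular-expression pattern, every window receives a graded score, so a near-miss such as a conservative substitution lowers the score instead of abolishing the match.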
26
PROSITE and HAMAP profiles: a functional annotation perspective
* PROSITE domains: high-quality, manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.
* HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.
27
HMM databases
Sequence-based:
* PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
* PANTHER: families/subfamilies model the divergence of specific functions
* TIGRFAM: microbial functional family classification
* PFAM: families & domains based on conserved sequence
* SMART: functional domain annotation
Structure-based:
* SUPERFAMILY: models correspond to SCOP domains
* GENE3D: models correspond to CATH domains
28
Why we created InterPro. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database:
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases
29
InterPro entry
31
The InterPro entry: types
* Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure
* Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
* Repeats: short sequences typically repeated within a protein
* Sites: PTM, active site, binding site, conserved site
32
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers. Focus on this slide: quality control, removes redundancy.
33
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers. Focus on this slide: hierarchical classification.
34
InterPro hierarchies: Families
Families can have parent/child relationships with other families. Parent/child relationships are based on:
* Comparison of protein hits: a child should be a subset of its parent, and siblings should not have matches in common
* Existing hierarchies in member databases
* Biological knowledge of curators
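The protein-hit comparison can be pictured as two set operations. The sketch below is an illustration only, not InterPro's curation tooling: the function name, entry names and protein accessions are all made up, and in practice curators also weigh member database hierarchies and biological knowledge.

```python
from itertools import combinations


def check_family_hierarchy(parent_hits, children_hits):
    """Report violations of the two hit-based rules:
    each child's hits are a subset of the parent's, and siblings share no hits."""
    problems = []
    for name, hits in children_hits.items():
        outside = hits - parent_hits
        if outside:
            problems.append(f"{name}: {len(outside)} hit(s) not matched by parent")
    for (name_a, hits_a), (name_b, hits_b) in combinations(children_hits.items(), 2):
        shared = hits_a & hits_b
        if shared:
            problems.append(f"{name_a} and {name_b}: {len(shared)} shared hit(s)")
    return problems


# Hypothetical protein hit sets (placeholder accessions).
parent = {"P1", "P2", "P3", "P4", "P5"}
good_children = {"child_A": {"P1", "P2"}, "child_B": {"P3", "P4"}}
bad_children = {"child_A": {"P1", "P2", "P9"}, "child_B": {"P2", "P3"}}

print(check_family_hierarchy(parent, good_children))
print(check_family_hierarchy(parent, bad_children))
```

An empty list means the proposed parent/child arrangement is consistent with the hit data; any message flags a relationship that needs a curator's attention.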
35
InterPro hierarchies: Domains. Domains can have parent/child relationships with other domains.
36
Domains and Families may be linked through Domain Organisation Hierarchy
37
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers.
38
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers. The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics.
39
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers. Linked resources: UniProt; KEGG...; Reactome...; IntAct...; UniProt taxonomy; PANDIT...; MEROPS...; Pfam clans...; Pubmed.
40
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers. Structural resources: PDB (3-D structures), SCOP (structural domains), CATH (structural domain classification).
41
Understanding signatures:
42
Non-overlapping signatures can be describing the same thing. It is not always possible to use signature overlap to determine how family signatures are related. E.g. high molecular weight glutenins: PF03157 (336 protein hits) and PR00210 (331 protein hits) are two very different signatures both describing the same thing!
43
Some signatures give us similar, but complementary information: PFAM shows the domain is composed of two types of repeated sequence motifs; SUPERFAMILY shows the potential domain boundaries. www.ebi.ac.uk/interpro
44
Discontinuous Signatures Require Interpretation: 1) Signature method 2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains. www.ebi.ac.uk/interpro
45
Discontinuous Signatures Require Interpretation: 1) Signature method, e.g. PRINTS (discrete motifs) 2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains. www.ebi.ac.uk/interpro
46
Discontinuous Signatures Require Interpretation: 1) Signature method 2) Duplicated domains, e.g. SSF (duplication consisting of 2 domains with same fold) 3) Repeated elements 4) Non-contiguous domains. www.ebi.ac.uk/interpro
47
Discontinuous Signatures Require Interpretation: 1) Signature method 2) Duplicated domains 3) Repeated elements, e.g. Kringle, WD40 4) Non-contiguous domains. www.ebi.ac.uk/interpro
48
Discontinuous Signatures Require Interpretation: 1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains: structural domains can consist of non-contiguous sequence. www.ebi.ac.uk/interpro
49
1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains. www.ebi.ac.uk/interpro
50
Searching InterPro:
51
WHEN TO USE INTERPRO
Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence. You can search InterPro if you require further information regarding your protein of interest and have:
* a protein sequence
* a UniProtKB protein identifier
* a Gene Ontology term
* a protein structure code
* a general search term (keyword or short phrase)
52
http://www.ebi.ac.uk/interpro/ Search tools include: Text Search; InterProScan (sequence search); BioMart (builds queries). Beta version: http://wwwdev.ebi.ac.uk/interpro/
53
InterPro Search (wwwdev.ebi.ac.uk/interpro). Search using: text, protein ID, InterPro ID, GO term. Example GO term: ID GO:0006915, name: apoptosis.
54
InterPro Search: search results for GO:0006915 (apoptosis)
55
InterPro Search (wwwdev.ebi.ac.uk/interpro): search by protein ID
56
InterPro Search Results (annotated screenshot): family; domains and sites; unintegrated signatures; structural data; link to PDBe.
57
Structural information. CATH and SCOP divide PDB structures into domains. Swiss-Model and ModBase can predict structure for regions not covered by PDB. Note that one domain is discontiguous.
58
Searching InterPro: InterProScan
59
InterProScan – Searching New Sequence (wwwdev.ebi.ac.uk/interpro): paste in unknown sequence; additional options.
60
InterProScan New Search Results: links to signature databases; link to InterPro entry.
61
Searching InterPro: BioMart
62
BioMart Search. BioMart allows more powerful and flexible queries:
* Large volumes of data can be queried efficiently
* The interface is shared with many other bioinformatics resources
* It allows federation with other databases: PRIDE (mass spectrometry-derived proteins and peptides) and REACTOME (biological pathways)
63
BioMart Search. 1) Choose Dataset: a. Choose InterPro BioMart
64
BioMart Search. 1) Choose Dataset: a. Choose InterPro BioMart; b. Choose InterPro entries or protein matches
65
BioMart Search. 2) Choose Filters: search specific entries, signatures or proteins
66
BioMart Search. 2) Choose Filters: e.g. filter by specific proteins
67
BioMart Search. 3) Choose Attributes: what results you want
68
BioMart Search. 4) Choose additional Dataset (optional): this is where you link results to PRIDE and Reactome
69
BioMart Search Results. Click to view results. Output formats: HTML = web-formatted table; CSV = comma-separated values; TSV = tab-separated values; XLS = Excel spreadsheet. User manual.
70
InterPro – the numbers. Our member databases all have their particular niche or focus... but InterPro is a combination of all their areas of expertise!
* InterPro 32.0: 21516 entries; 101175 signatures; covering 85.5% of UniProtKB
* Frequent releases, with both protein and method updates
* 45 000 unique visitors per month
* The database has grown almost 10-fold in ~11 years
71
Caveats
* InterPro is a predictive protein signature database.
* InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!
* Small changes with a large impact may not be well represented, for example inactive peptidases such as Q8N3Z0 and Q9W3H0.
* We need your feedback: missing/additional references, reporting problems, requests. See the EBI support page.
72
Acknowledgements. InterPro Team: Amaia Sangrador, David Lonsdale, Craig McAnulla, Matthew Fraser, Anthony Quinn, Maxim Scheremetjew, Phil Jones, Siew-Yit Yong, Alex Mitchell, Sebastien Pesseat, Prudence Mutowo, Sarah Hunter, Christopher Hunter