Introduction to InterPro. Amaia Sangrador, InterPro curator. EBI is an Outstation of the European Molecular Biology Laboratory.
What is InterPro? DIAGNOSTICS RESOURCE: InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins
* Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
* Adds information about the signatures and the types of proteins they match
InterPro Consortium Consortium of 11 major signature databases
Why do we need predictive annotation tools?
What is UniProt? The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Collaboration between EBI, SIB and PIR. Based on the original work on PIR, Swiss-Prot and TrEMBL. Funded by NIH, EC, EMBL and the Swiss Federal Government to be the highest quality, most thoroughly annotated protein sequence database.
UniParc (sequence archive): current and obsolete sequences. UniMES: metagenomic and environmental sample sequences. UniProtKB (protein knowledgebase): UniProtKB/Swiss-Prot (reviewed, high-quality manual annotation) and UniProtKB/TrEMBL (unreviewed, automatic annotation). UniRef (sequence clusters): UniRef100, UniRef90, UniRef50. Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources.
Annotation using InterPro [diagram: manually annotated sequence (Swiss-Prot) -> groups of related proteins (same family or share domains) -> protein signatures -> InterPro -> automatic annotation pipeline -> uncharacterised sequence (TrEMBL)]
Protein family classification. Given a set of sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
Protein family classification: BLAST (pairwise comparisons)
Protein family classification: BLAST
Limitations with pairwise comparisons: BLAST alignment of 2 proteins (60S acidic ribosomal protein P0) from 2 species
Limitations with Pairwise comparisons
Protein family classification: pattern databases. Alternatively, we can seek signatures that will allow us to infer relationships with previously-characterised sequences. This is the approach taken by ‘signature’ databases.
Protein signatures: more sensitive homology searches. Each member database creates signatures using different methods and methodologies: manually-created sequence alignments; automatic processes with some human input and correction; entirely automatic processes.
What are protein signatures? [diagram: multiple sequence alignment of a protein family/domain (e.g. ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK) -> build model -> search UniProt -> significant match -> mature model -> protein analysis]
Member databases. METHODS: Hidden Markov Models, Fingerprints, Profiles, Patterns, Sequence Clusters, Structural Domains. Functional annotation of families/domains; prediction of conserved domains; protein features (active sites…).
Diagnostic approaches (sequence-based). Single motif methods: regex patterns (PROSITE). Multiple motif methods: identity matrices (PRINTS). Full domain alignment methods: profiles (Profile Library), HMMs (Pfam).
Patterns [diagram: sequence alignment -> define pattern (motif) -> extract pattern sequences (xxxxxx) -> build regular expression, e.g. C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C -> pattern signature (PS 00000)]
Patterns. Advantages: patterns are mostly directed against functional residues (active sites, PTM, disulfide bridges, binding sites). A match can be anchored to the extremity of a sequence, e.g. <M-R-[DE]-x(2,4)-[ALT]-{AM}. Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies. Short motif handling: a pattern with very little variability and forbidden positions can produce significant matches, e.g. conotoxins, very short toxins with few conserved cysteines: C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C. Drawbacks: simple but less powerful.
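The pattern syntax shown on this slide maps closely onto ordinary regular expressions. The following is a minimal sketch in Python, assuming only the elements that appear in these slides: single residues, allowed sets `[..]`, forbidden sets `{..}`, the wildcard `x`, repeat counts `(n)` or `(n,m)`, and the `<`/`>` terminal anchors. `prosite_to_regex` is a hypothetical helper name, not part of any PROSITE tool, and other pattern syntax is not handled.

```python
import re

# One pattern element: a residue, an allowed set [..], a forbidden set {..}
# or the wildcard x, optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = ""
    # '<' anchors the match to the start of the sequence, '>' to its end.
    if pattern.startswith("<"):
        regex, pattern = "^", pattern[1:]
    anchored_end = pattern.endswith(">")
    if anchored_end:
        pattern = pattern[:-1]
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            regex += "."                          # any residue
        elif residue.startswith("{"):
            regex += "[^" + residue[1:-1] + "]"   # forbidden residues
        else:
            regex += residue                      # literal residue or [set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
    return regex + ("$" if anchored_end else "")


anchored = prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}")
conotoxin = prosite_to_regex("C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C")
print(anchored)
print(conotoxin)
```

The anchored pattern only matches at the first residue: `re.search(anchored, "MRDAAATK")` finds a hit, while the same text with an extra leading residue does not.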
Prosite patterns: pattern/motif in sequence, written as a regular expression
EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN): A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI SA NGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT QKCE LDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG IKVCAIKAPGF GENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT VILDGAGDKKGI EERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG EKKDRVTDALNATK AAVEEGILPGGG VALLYAARELEKLPTANFDQKIGVQIIQNALKTP VYTIASNAGVEGA VIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS VSSLLTTTEAVVVDLP KDESESGAAGAGMGGMGGMDY
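To see where the PS00296 pattern falls in this sequence, the pattern can be translated by hand into a regular expression and scanned over a fragment. This is a sketch only: the fragment is copied from the sequence above, assuming the spaces that flank AAVEEGILPGGG in the slide text are layout residue rather than part of the sequence, and the reported coordinates are relative to the fragment, not to the full protein.

```python
import re

# PS00296: A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA], translated by hand.
PS00296 = re.compile(r"A[AS][^L][DEQ]E[^A][^Q][^R].G{2}[GA]")

# Fragment of P29197 (CH60A_ARATH) taken from the sequence on the slide.
fragment = "DALNATKAAVEEGILPGGGVALLYAAR"

# Report each hit as (start, end, matched residues), 1-based and inclusive.
hits = [(m.start() + 1, m.end(), m.group()) for m in PS00296.finditer(fragment)]
print(hits)  # [(8, 19, 'AAVEEGILPGGG')]
```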
Fingerprints [diagram: sequence alignment -> define motifs (Motif 1, Motif 2, Motif 3) -> extract motif sequences (xxxxxx) -> weight matrices -> fingerprint signature (PR): motifs 1, 2, 3 in the correct order with the correct spacing]
The significance of motif context (order, interval). Fingerprints identify small conserved regions in proteins. Several motifs characterise a family. They offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours.
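The role of order and interval can be illustrated with a small check on motif positions. This is a simplified, all-or-nothing sketch in Python: `matches_fingerprint`, the positions and the allowed gaps are all hypothetical, and it does not reproduce how PRINTS actually scores partial or weak matches.

```python
def matches_fingerprint(motif_starts, intervals):
    """Return True if every motif is present, in order, with allowed spacing.

    motif_starts: start position of each motif hit, in fingerprint order
                  (None where the motif was not found).
    intervals:    (min_gap, max_gap) allowed between consecutive motif starts;
                  a positive min_gap also enforces the correct order.
    """
    if any(start is None for start in motif_starts):
        return False
    consecutive = zip(motif_starts, motif_starts[1:])
    for (previous, current), (min_gap, max_gap) in zip(consecutive, intervals):
        if not min_gap <= current - previous <= max_gap:
            return False
    return True


# Hypothetical three-motif fingerprint: motif 2 starts 10-40 residues after
# motif 1, and motif 3 starts 20-60 residues after motif 2.
intervals = [(10, 40), (20, 60)]

print(matches_fingerprint([15, 40, 90], intervals))    # correct order and spacing
print(matches_fingerprint([40, 15, 90], intervals))    # motifs 1 and 2 swapped
print(matches_fingerprint([15, 40, 150], intervals))   # motif 3 too far away
print(matches_fingerprint([15, 40, None], intervals))  # motif 3 missing
```

Only the first call returns True: a set of individually good motif hits is rejected when their context is wrong.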
PRINTS families are hierarchical: different motifs describe subfamilies. G protein-coupled receptors: rhodopsin-like, secretin-like, cAMP receptors, metabotropic glutamate receptors, etc. Rhodopsin-like: adenosine receptors, opsin receptors, dopamine receptors, somatostatin receptors, histamine receptors, etc. Somatostatin receptors: somatostatin receptor type 1, somatostatin receptor type 2, somatostatin receptor type 3, etc.
Profiles & HMMs [diagram: sequence alignment -> define coverage (entire domain or whole protein) -> use entire alignment for domain or protein -> build model (models insertions and deletions) -> profile or HMM signature]
Hidden Markov Models (HMM): models insertions and deletions; more flexible (can use partial alignments). Profiles: built using weight matrices; more sophisticated algorithm.
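The idea of a weight matrix built from an alignment can be sketched in a few lines of Python. The example assumes an ungapped toy alignment, a uniform background frequency and a flat pseudocount, all of which are illustrative choices; the function names are hypothetical, and real profiles and HMMs additionally model insertions and deletions, which this sketch does not.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(alignment, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background)."""
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(matrix, segment):
    """Sum the column scores for a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, segment))


def best_window(matrix, sequence):
    """Slide the matrix along the sequence; return (best score, 0-based start)."""
    width = len(matrix)
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical ungapped alignment of a short conserved region.
toy_alignment = ["PVCG", "PVCG", "PICG", "PVCA"]
matrix = build_weight_matrix(toy_alignment)

best_score, start = best_window(matrix, "AAPVCGAA")
print(round(best_score, 2), start)
```

Unlike a pattern, the matrix gives every window a graded score, so a near match such as PICG still scores well instead of being rejected outright.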
PROSITE and HAMAP profiles: a functional annotation perspective. PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination. HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.
HMM databases. Sequence-based: PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship; PANTHER: families/subfamilies model the divergence of specific functions; TIGRFAM: microbial functional family classification; PFAM: families & domains based on conserved sequence; SMART: functional domain annotation. Structure-based: SUPERFAMILY: models correspond to SCOP domains; GENE3D: models correspond to CATH domains.
Why we created InterPro. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis, ensuring that entries & their linked signatures point to related information on the same biological object
– to facilitate automatic functional annotation of uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases
InterPro entry
The InterPro entry: types. Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure. Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts. Repeats: short sequences typically repeated within a protein. Sites: PTM, Active Site, Binding Site, Conserved Site.
InterPro Entry: groups similar signatures together (quality control, removes redundancy); adds extensive annotation; links to other databases; structural information and viewers.
InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers; hierarchical classification.
InterPro hierarchies: Families. FAMILIES can have parent/child relationships with other Families. Parent/child relationships must be based in biology! Parent/child relationships are based on: comparison of protein hits (child should be a subset of parent; siblings should not have matches in common); existing hierarchies in member databases; biological knowledge of curators.
InterPro hierarchies: Domains. DOMAINS can have parent/child relationships with other domains.
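The protein-hit comparison can be expressed as plain set operations. A minimal sketch in Python, assuming each entry is represented by the set of protein accessions it matches; the accessions and function names below are hypothetical, and the other two criteria (member database hierarchies and curator knowledge) are not something code like this can capture.

```python
def is_subset_of_parent(parent_hits, child_hits):
    """A child entry should only match proteins its parent also matches."""
    return child_hits <= parent_hits


def siblings_are_disjoint(*sibling_hits):
    """Sibling entries should not have protein matches in common."""
    seen = set()
    for hits in sibling_hits:
        if seen & hits:
            return False
        seen |= hits
    return True


# Hypothetical protein matches for a parent family and two candidate children.
parent = {"PROT_A", "PROT_B", "PROT_C", "PROT_D"}
child_1 = {"PROT_A", "PROT_B"}
child_2 = {"PROT_C"}
outsider = {"PROT_B", "PROT_X"}

print(is_subset_of_parent(parent, child_1))     # child_1 fits under parent
print(is_subset_of_parent(parent, outsider))    # PROT_X is not a parent match
print(siblings_are_disjoint(child_1, child_2))  # no shared proteins
print(siblings_are_disjoint(child_1, outsider)) # both match PROT_B
```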
Domains and Families may be linked through Domain Organisation Hierarchy
InterPro Entry: links to other databases: PDB, PubMed, GO, SCOP, CATH, UniProt, ModBase, Swiss-Model, UniProt taxonomy, PANDIT…
InterPro Entry: structural information and viewers: PDB (3-D structures), SCOP (structural domains), CATH (structural domain classification).
Understanding signatures: why Biology is important
Non-overlapping signatures can describe the same thing. It is not always possible to use signature overlap to determine how family signatures are related: two very different signatures (PF protein hits, PR protein hits) can both describe the same thing, e.g. high molecular weight glutenins.
Some signatures give us similar, but complementary information: PFAM shows the domain is composed of two types of repeated sequence motifs; SUPERFAMILY shows the potential domain boundaries.
Discontinuous Signatures Require Interpretation: 1) Signature method 2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains
Discontinuous Signatures Require Interpretation. 1) Signature method: e.g. PRINTS, discrete motifs
Discontinuous Signatures Require Interpretation. 2) Duplicated domains: e.g. SSF, duplication consisting of 2 domains with same fold
Discontinuous Signatures Require Interpretation. 3) Repeated elements: e.g. Kringle, WD40
Discontinuous Signatures Require Interpretation. 4) Non-contiguous domains: structural domains can consist of non-contiguous sequence
1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains
Searching InterPro:
Protein families and domains are invaluable pointers that help biologists to find distantly related proteins and to predict their functions. Searching genomes and proteomes with ‘protein signatures’ is now an established way of layering functional information onto your sequence information. [Figure: Phylogenetic tree, domain structure, and multiple sequence alignments of the GCK and PAK subfamilies of STE20-type kinases from humans and C. elegans, Strange et al. 2006]
WHEN TO USE INTERPRO. Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence. You can search InterPro if you have a protein sequence, a UniProtKB protein identifier, a Gene Ontology term, a protein structure code or a general search term (keyword or short phrase), and require further information regarding your protein of interest.
Search tools include: Text Search; InterProScan (sequence search); BioMart (builds queries). Beta version:
InterPro Search wwwdev.ebi.ac.uk/interpro protein ID Sequence
InterPro Search Results: Family; Domains and sites; Unintegrated signatures; Structural data (link to PDBe and AstexViewer®)
Structural information: CATH and SCOP divide PDB structures into domains. Swiss-Model and ModBase can predict structure for regions not covered by PDB. Note that one domain is discontiguous.
InterPro Search wwwdev.ebi.ac.uk/interpro Search using: text, protein ID, InterPro ID, GO term (ID: GO: Name: apoptosis)
InterPro Search: search results for GO: (apoptosis)
InterProScan – Searching New Sequence wwwdev.ebi.ac.uk/interpro Paste in unknown sequence
InterProScan New Search Results: links to signature databases; link to InterPro entry
BioMart Search 1) Choose Dataset: a. Choose InterPro BioMart; b. Choose InterPro entries or protein matches
BioMart Search 2) Choose Filters: search specific entries, signatures or proteins, e.g. filter by specific proteins
BioMart Search 3) Choose Attributes: what results you want
BioMart Search 4) Choose additional Dataset (optional): this is where you link results to PRIDE and Reactome
BioMart Search Results: click to view results (see user manual). Output formats: HTML = web-formatted table; CSV = comma-separated values; TSV = tab-separated values; XLS = Excel spreadsheet.
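Once results are saved as TSV they can be processed with standard tools. A small sketch in Python, assuming a hypothetical export: the column names and entry identifiers below are placeholders made up for illustration, not real BioMart attribute names or real InterPro matches.

```python
import csv
import io
from collections import defaultdict

# Hypothetical TSV export. Column names and entry identifiers are placeholders;
# a real export has whatever attributes were chosen in step 3.
tsv_export = (
    "protein_accession\tentry_id\tentry_type\n"
    "PROT_A\tENTRY_1\tFamily\n"
    "PROT_A\tENTRY_2\tConserved site\n"
    "PROT_B\tENTRY_3\tDomain\n"
)

# Group entry identifiers by the protein they match.
entries_by_protein = defaultdict(list)
for row in csv.DictReader(io.StringIO(tsv_export), delimiter="\t"):
    entries_by_protein[row["protein_accession"]].append(row["entry_id"])

for protein, entries in entries_by_protein.items():
    print(protein, entries)
```

To read a saved file instead of the inline string, pass an opened file object to `csv.DictReader` in place of `io.StringIO(tsv_export)`.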
InterPro – the numbers. Our member databases all have their particular niche or focus... but InterPro is a combination of all their areas of expertise! InterPro 32.0: entries signatures covering 85.5% of UniProtKB. Frequent releases, both protein and method updates. unique visitors per month. The database has grown almost 10-fold in ~11 years.
Caveats. InterPro is a predictive protein signature database. InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry! Small changes with a large impact may not be well represented, for example inactive peptidases such as Q8N3Z0, Q9W3H0. We need your feedback: missing/additional references, reporting problems, requests.
Acknowledgements. InterPro Team: Amaia Sangrador, David Lonsdale, Craig McAnulla, Matthew Fraser, Anthony Quinn, Maxim Scheremetjew, Phil Jones, Siew-Yit Yong, Alex Mitchell, Sebastien Pesseat, Prudence Mutowo, Sarah Hunter, Christopher Hunter