1
Proteins to Proteomes The InterPro Database
2
Origins of InterPro
UniProt feeds InterPro:
  Swiss-Prot: 290K sequences, annotated
  TrEMBL: 5M sequences, raw data with automated annotation (???)
3
Curated Annotation in InterPro
Uncharacterised sequences from TrEMBL and annotated sequences from Swiss-Prot are matched by multiple signatures.
InterPro uses these signatures to group related proteins (same family or shared domains).
Annotation common to a group is fed back to the uncharacterised TrEMBL sequences.
4
Finding Conserved Signatures
Methods, ordered from simplest (limited) to most information:
  Pattern
  Fingerprint
  Sequence clustering
  HMM
5
Patterns
Pattern/motif in a sequence, written as a regular expression
Can define important sites:
  Enzyme catalytic site
  Prosthetic group attachment
  Metal ion binding site
  Cysteines for disulphide bonds
  Protein or molecule binding
EXAMPLE: Insulin
  B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
  A chain xxxxxCCxxxCxxxxxxxxCx
8
Patterns
Pattern/motif in a sequence, written as a regular expression
Can define important sites
EXAMPLE: PS00262 Insulin family signature
  B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
  A chain xxxxxCCxxxCxxxxxxxxCx
  Sequence (matched region set apart): MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N
  Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
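The pattern on this slide can be tried against the insulin sequence directly. The following is a minimal sketch, not the PROSITE scanning software: the helper name `prosite_to_regex` is hypothetical, and it assumes only the syntax elements that appear in this pattern (single residues, `x`, `[allowed]`, `{forbidden}` and repeat counts in parentheses).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper; handles only the elements used on the slide.
    """
    element_re = re.compile(
        r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"   # x, one residue, [allowed] or {forbidden}
        r"(?:\((\d+)(?:,(\d+))?\))?"          # optional repeat count: (n) or (n,m)
    )
    parts = []
    for element in pattern.split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # {P} means any residue except P
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

# Sequence and pattern as shown on the slide.
INSULIN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGP"
    "GAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)
PS00262 = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"

regex = prosite_to_regex(PS00262)
hit = re.search(regex, INSULIN)
print(regex)                                   # CC[^P].{2}C[STDNEKPI].{3}[LIVMFS].{3}C
print(hit.group() if hit else "no match")      # CCTSICSLYQLENYC
```

The matched region is the same A-chain stretch that the slide sets apart from the rest of the sequence.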
9
Patterns
Building a pattern signature:
1) Sequence alignment (insulin family motif)
2) Define pattern
3) Extract pattern sequences (xxxxxx)
4) Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
5) Pattern signature: PS00000
10
Fingerprints
Several motifs characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
  PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
13
Fingerprints
EXAMPLE: PR00107 Phosphocarrier HPr signature
  PTHP_ENTFA (conserved regions set apart): MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE
  Highlighted sites: His phosphorylation site, Ser phosphorylation site, conserved site
14
Fingerprints
EXAMPLE: PR00107 Phosphocarrier HPr signature
  PTHP_ENTFA (conserved regions set apart): MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE
3-motif fingerprint:
1) GIHARPATLLVQTASKF
2) KGKSVNLKSIMGVMSL
3) LGVGQGSDVTITVDGADE
15
Fingerprints
Building a fingerprint signature:
1) Sequence alignment
2) Define motifs (His phosphorylation site, conserved site, Ser phosphorylation site)
3) Extract motif sequences (xxxxxx)
4) Fingerprint signature PR00000: motifs 1, 2 and 3 in the correct order with the correct spacing
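The "correct order" requirement can be illustrated with the three PR00107 motifs and the PTHP_ENTFA sequence from the slides. This is a deliberately simplified sketch: it assumes exact substring matches, whereas a real fingerprint search scores each motif against residue frequencies, and it checks order only, not spacing. The function name `match_fingerprint` is hypothetical.

```python
from typing import List, Optional, Sequence

def match_fingerprint(sequence: str, motifs: Sequence[str]) -> Optional[List[int]]:
    """Return 0-based motif start positions if all motifs occur in the given order.

    Returns None when a motif is missing or would have to start before
    (or at) the start of the previous motif.
    """
    starts: List[int] = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None
        starts.append(pos)
        # The next motif must start further along; small overlaps remain possible.
        search_from = pos + 1
    return starts

# Sequence and motifs as shown on the slides.
PTHP_ENTFA = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADE"
    "AEGMAAIVETLQKEGLAE"
)
PR00107_MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

in_order = match_fingerprint(PTHP_ENTFA, PR00107_MOTIFS)
reversed_order = match_fingerprint(PTHP_ENTFA, PR00107_MOTIFS[::-1])
print(in_order)         # three increasing start positions
print(reversed_order)   # None: the motifs are present but in the wrong order
```

A spacing check could be added by also comparing the gaps between consecutive start positions with the gaps seen in the seed alignment.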
16
Sequence clustering
Automatic clustering of homologous domains
** Rarely covers entire domain (conserved core)
** Signature size can change with release
Workflow:
1) Known domain families
2) Recruit homologous domains (PSI-BLAST)
3) Automatic clustering (MKDOM2)
4) Align domain families (ProDomAlign)
17
Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
Outperform in sensitivity and specificity
More flexible (can use partial alignments)
18
Profile
Sequence alignment of Sequences 1 to 7 (alignment figure not reproduced)
Scoring matrix: residue frequency at each position in alignment
19
Sequence alignment of Sequences 1 to 7 (alignment figure not reproduced)
Phe, Tyr and Leu found at position 1 of alignment
Phe most conserved: highest match value
20
Sequence alignment of Sequences 1 to 7 (alignment figure not reproduced)
Tyr and Leu found at equal frequency at position 1
Tyr is closer to Phe than Leu is
Scores: F > Y > L
Probability method gauges scoring parameters
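The scoring matrix described on these slides starts from residue frequencies per alignment column. The sketch below computes those frequencies for a toy alignment. The seven sequences are invented for illustration (the slide's alignment is not available) and are chosen so that position 1 holds Phe five times and Tyr and Leu once each.

```python
from collections import Counter
from typing import Dict, List

def column_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Return, for each alignment column, the relative frequency of each residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):             # iterate over columns, not rows
        counts = Counter(column)
        profile.append({res: n / len(alignment) for res, n in counts.items()})
    return profile

# Hypothetical 7-sequence alignment, not taken from the presentation.
TOY_ALIGNMENT = ["FKLL", "FKAL", "FKVL", "FRLL", "FKIL", "YKLL", "LKVL"]

profile = column_frequencies(TOY_ALIGNMENT)
for residue, freq in sorted(profile[0].items(), key=lambda item: -item[1]):
    print(residue, round(freq, 2))             # F is most frequent; Y and L tie
```

Raw frequencies leave Tyr and Leu tied at position 1. Ranking them as F > Y > L, as the slide does, needs an extra ingredient such as residue similarity, which this sketch does not implement.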
21
Hidden Markov Models (HMM)
Sequence alignment modelled as a chain of states: Begin, M1, M2, M3, M4, End
M = match state
22
Hidden Markov Models (HMM)
Match states: Begin, M1, M2, M3, M4, End
Insert states: I2, I3
Delete states: D1, D2, D3, D4
M = match state, I = insert state, D = delete state
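To make the match/insert/delete layout concrete, the sketch below lists the allowed transitions for a model with four match states. It assumes a common textbook topology with an insert and a delete state at every position, which may not match the slide's drawing exactly (the slide labels only I2 and I3). Transition and emission probabilities are left out, and the function name is hypothetical.

```python
from typing import Dict, List

def profile_hmm_transitions(n_match: int) -> Dict[str, List[str]]:
    """Map each state of a simple profile HMM to the states it can move to."""
    transitions: Dict[str, List[str]] = {"Begin": ["M1", "D1"]}
    for k in range(1, n_match + 1):
        last = k == n_match
        next_match = "End" if last else f"M{k + 1}"
        next_delete = [] if last else [f"D{k + 1}"]
        # Match: continue to the next match, insert extra residues, or skip a position.
        transitions[f"M{k}"] = [next_match, f"I{k}"] + next_delete
        # Insert: emit more residues (self-loop) or return to the main line.
        transitions[f"I{k}"] = [f"I{k}", next_match]
        # Delete: silent state that skips a position; deletions can be chained.
        transitions[f"D{k}"] = [next_match] + next_delete
    return transitions

model = profile_hmm_transitions(4)
for state, targets in model.items():
    print(state, "->", ", ".join(targets))
```

Tools such as HMMER and SAM each define their own variant of this architecture, so the exact set of transitions should be taken from their documentation rather than from this sketch.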
23
SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Create 1 model for every protein in superfamily, then combine results
Few proteins in family have PDB structures
Proteins in superfamily may have low sequence identity
24
Specialisation of Databases
PRINTS: Describe sibling families
PROSITE: Identify binding and active sites
PRODOM: Describe conserved core of domains
PFAM: Wide coverage of domains & families
SMART: Signalling, extracellular & nuclear domains
TIGRFAM: Functional classification of families
PIRSF: Families conserved in domain composition
PANTHER: Functional classification of families
GENE3D: Structural-based domain classification
Superfam: Structural-based domain classification
25
Foundations of InterPro
InterPro rests on two things: integration of signatures and manual curation
26
InterPro Entry
Groups similar signatures together
Links related signatures
Adds extensive annotation
Linked to other databases
Structural information and viewers
27
Assigning Type
Family: Full-length signatures grouping related proteins
Domain: Biological units with defined boundaries
Repeat: Signature repeated as a series of short motifs
Site: Protein feature described by a Prosite pattern
Region: Any signature that doesn’t fit the above
28
Grouping Signatures Together
1) Same positions, same protein hits (100): PFAM and PROSITE go into one entry, IPR000001
2) Same positions, different protein hits, PFAM (100) and PROSITE (50): separate entries, IPR000001 and IPR000002
3) Different positions, same protein hits (100): separate entries, IPR000001 and IPR000002
4) Different positions (100): separate entries, IPR000001 and IPR000002
29
Link related signatures - relationships
1) Parent - Child (subgroup of more closely related proteins) *
  Parent: PFAM Protein kinase (100)
  Children: SMART Serine kinase (75) and PROSITE Tyrosine kinase (25)
  SMART and PROSITE have no proteins in common
* Applies to domains and families
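The parent-child relationship above comes down to set logic on the proteins each signature matches. The sketch below mirrors the slide's counts (100, 75, 25) with invented protein identifiers. It illustrates the subset idea only, not InterPro's curation procedure, and `is_child_of` is a hypothetical name.

```python
from typing import Set

def is_child_of(child_hits: Set[str], parent_hits: Set[str]) -> bool:
    """True when the child matches a non-empty proper subset of the parent's proteins."""
    return bool(child_hits) and child_hits < parent_hits

# Hypothetical protein identifiers; only the set sizes follow the slide.
proteins = [f"P{i:03d}" for i in range(100)]
protein_kinase = set(proteins)          # PFAM parent, 100 proteins
serine_kinase = set(proteins[:75])      # SMART child, 75 proteins
tyrosine_kinase = set(proteins[75:])    # PROSITE child, 25 proteins

print(is_child_of(serine_kinase, protein_kinase))       # True
print(is_child_of(tyrosine_kinase, protein_kinase))     # True
print(serine_kinase.isdisjoint(tyrosine_kinase))        # True: no proteins in common
print(is_child_of(protein_kinase, serine_kinase))       # False
```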
30
Link related signatures - relationships
2) Contains – Found in (describes domain composition)
  PFAM Receptor family contains the SMART N-terminal domain and the PROSITE C-terminal domain
  The SMART and PROSITE domains are found in the PFAM Receptor family
Both families and domains can contain domains
31
Link related signatures - relationships
2) Contains – Found in
Coverage: the containing signature must cover the entire (>90%) sequence of the contained signature
Diagram: a PFAM signature that fully covers a SMART signature (Contains / Found in), next to a PFAM and a SMART signature that only overlap
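The >90% coverage rule can be written as a small interval calculation. In this sketch, matches are assumed to be (start, end) pairs with 1-based inclusive coordinates. All positions are invented, and the function names and the default threshold argument are illustrative rather than part of any InterPro tool.

```python
from typing import Tuple

Match = Tuple[int, int]   # (start, end), 1-based and inclusive

def coverage(outer: Match, inner: Match) -> float:
    """Fraction of the inner match that lies within the outer match."""
    overlap = min(outer[1], inner[1]) - max(outer[0], inner[0]) + 1
    inner_length = inner[1] - inner[0] + 1
    return max(overlap, 0) / inner_length

def contains(outer: Match, inner: Match, threshold: float = 0.9) -> bool:
    """True when the outer match covers more than `threshold` of the inner match."""
    return coverage(outer, inner) > threshold

# Hypothetical match positions on one protein.
print(contains((10, 300), (50, 150)))             # True: inner match fully covered
print(round(coverage((10, 100), (50, 150)), 2))   # 0.5: the matches only overlap
print(contains((10, 100), (50, 150)))             # False
```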
32
Relationships – evolutionary context
Criteria              Signature   InterPro relationship
Structural family     GENE3D      Grandparent
Sequence families     PFAM        Parents
Functional families   TIGRFAM     Children
Unique to InterPro
34
Extensive Annotation
Annotation Fields in InterPro:
  Name and short name
  Entry type (family, domain, site)
  Relationships (links related signatures)
  GO mapping (large scale classification)
  Abstract
  Taxonomy (search/download using taxonomy; select species-specific protein sets)
  Examples
  Publications
35
Links to Other Databases
Annotation Fields in InterPro:
  Blocks (family alignments)
  IntEnz (enzymes)
  Prosite documents
  COME (bioinorganic motifs)
  CAZy (carbohydrate-active enzymes)
  IUPHAR (GPCR receptors)
  CluSTr (protein clusters)
  Pandit (phylogenetic trees of PFAMs)
  Merops (peptidases & inhibitors)
36
Structural information
Structures: PDB
Classification: CATH, SCOP
Homology Models: Swiss-Model, ModBase
37
Sequence-Structure Display
Signatures predictive of protein annotation
Structural data for specific proteins
AstexViewer® for structure
38
Structure Viewer
Manipulate structures
Navigate between structure and sequence
39
Other Features – splice variants
40
Other Features – domain architecture
Each ‘balloon’ represents a linked InterPro domain
Select data set of these proteins
41
Other Features – protein-protein interactions
Lists proteins in entry known to be involved in protein-protein interactions
IntAct database of interactions
42
Protein Sequence Coverage
InterPro signatures cover:
  95% of UniProt/Swiss-Prot proteins
  79% of UniProt/TrEMBL proteins
>4 million matches in InterPro
>50,000 signature methods
>16,000 InterPro entries
43
Searching InterPro
Search tools include:
  Text Search
  InterProScan (sequence search)
44
InterPro Text Search
Text search box: search using text, protein ID, InterPro ID or GO term
Search results: direct links to entry
45
InterProScan Search
Paste in sequence (protein/nucleotide)
Member database search engines
Use ftp site to run multiple sequences simultaneously
46
InterProScan Search Results
Single InterPro entry
Direct links to entry
Direct links to signature databases