1
Proteomics: Analyzing protein space
2
Protein families. Why proteins? Shift of interest from “Genomics” to “Proteomics”. Classification of proteins into groups/families: what is it good for? Explosion in biological sequence data => need to organize! Understanding the relations/hierarchy of groups is interesting in its own right, e.g. in evolutionary research. For applied research: –Annotation of new proteins: predicting their function, structure, cellular localization etc. –Looking for new folds
3
Sequence-based classification. By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc. InterPro synthesizes the data from Pfam, PROSITE, PRINTS, ProDom and SMART, and is considered the “best” domain-based classification available.
4
Other kinds of classification. Global classification: –Systers, Protomap, CLUSTr –MetaFam synthesizes global classification data. By structure similarity: SCOP etc. By function: Albumin, RetNet, TumorGenes etc.
5
ProtoNet: a long-term project at HUJI led by Michal & Nati Linial. Provides automatic global classification of the known proteins. Performs hierarchical clustering on a sequence-based metric space of proteins. Allows an external protein to be “placed” into the hierarchy. http://www.protonet.cs.huji.ac.il
6
Why clustering? We want to refine the notion of “similarity” compared to e.g. BLAST. Exploit transitivity to improve grouping. Can use a low threshold on similarity: - uses the vast information in low similarities - allowable because clustering filters noise
7
Why hierarchical? Vertical Perspective Horizontal Perspective
8
ProtoNet: Pre-Computation. All-against-all gapped BLAST using BLOSUM62 on the SwissProt release 40.28 database (114,033 proteins). BLAST identified ~2*10^7 relations between these proteins with relatively high sequence similarity: E-score of 100 or less. We don’t want to lose information => very permissive! But this is still far less than all ~6.5*10^9 possible pairs, which would be infeasible to handle.
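The pair counts on this slide can be checked with a little arithmetic. A minimal sketch, assuming the ~6.5*10^9 figure refers to the number of unordered protein pairs; the variable names are illustrative only:

```python
from math import comb

n_proteins = 114_033              # SwissProt release 40.28, as stated on the slide
all_pairs = comb(n_proteins, 2)   # assumption: unordered pairs, n * (n - 1) / 2
blast_relations = 2e7             # approximate count reported on the slide

# Share of all possible pairs that BLAST reports at the permissive threshold
fraction_kept = blast_relations / all_pairs

print(f"all pairs: {all_pairs:.2e}")
print(f"fraction with a BLAST relation: {fraction_kept:.4f}")
```

Even with a very permissive E-score threshold, only a small fraction of all pairs carries a recorded relation, which is what makes the pre-computation tractable.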
9
Clustering Method. First, each protein is considered a singleton cluster.
10
Clustering Method. Next, we iteratively merge pairs of clusters. At each step we choose to merge the ‘most similar’ pair of clusters.
13
Clustering Method As we progress the number of singletons drops
14
Clustering Method. The clustering process gradually generates a tree of clusters. Stop whenever we like.
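The merging loop described on the Clustering Method slides can be sketched as follows. This is a generic bottom-up (agglomerative) procedure, not ProtoNet's actual code: the function names, the toy distances and the use of an arithmetic mean as the merge score are all assumptions made for illustration. Lower values mean "more similar", as with E-scores.

```python
from itertools import combinations
from statistics import fmean

def pair_key(p, q):
    """Order-independent key for a pair of proteins (hypothetical helper)."""
    return frozenset((p, q))

def merge_score(a, b, dist):
    """Assumed score: arithmetic mean of all inter-cluster pair distances."""
    return fmean(dist[pair_key(p, q)] for p in a for q in b)

def agglomerate(items, dist, stop_at=1):
    """Start from singletons and merge the closest pair until `stop_at` clusters remain."""
    clusters = [frozenset([x]) for x in items]
    merges = []  # history of (left, right, score): the tree of clusters
    while len(clusters) > stop_at:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: merge_score(pair[0], pair[1], dist))
        merges.append((a, b, merge_score(a, b, dist)))
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
    return clusters, merges

# Toy distances between four hypothetical proteins
toy = {
    pair_key("A", "B"): 1.0, pair_key("C", "D"): 2.0,
    pair_key("A", "C"): 10.0, pair_key("A", "D"): 10.0,
    pair_key("B", "C"): 8.0, pair_key("B", "D"): 10.0,
}
clusters, merges = agglomerate("ABCD", toy, stop_at=2)
print(clusters)
```

The `stop_at` parameter corresponds to "stop whenever we like"; the recorded `merges` list is the tree built so far. This naive version rescans all cluster pairs at every step, which would not scale to the full database.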
15
How to merge? The potential merging score is calculated at each level for each pair of clusters that is a candidate for merging. At the bottom level it equals […]. At higher levels it is designed to reflect the similarity of the clusters: it depends on the inter-cluster similarities of pairs of proteins, one from each cluster. (Figure: two clusters labeled m and n.)
16
Potential Merging Score: Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean
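The three averaging rules can be compared on a small example. A minimal sketch using Python's `statistics` module; the score values are made up and stand in for the inter-cluster pair scores:

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Hypothetical inter-cluster pair scores (all must be positive)
scores = [1e-3, 1e-1, 10.0]

am = fmean(scores)            # dominated by the largest (weakest) score
gm = geometric_mean(scores)   # averages on a log scale
hm = harmonic_mean(scores)    # dominated by the smallest (strongest) score

print(f"arithmetic={am:.4g} geometric={gm:.4g} harmonic={hm:.4g}")
```

For positive values the arithmetic mean is never smaller than the geometric mean, which in turn is never smaller than the harmonic mean, so the three rules weigh a single strong similarity between two clusters very differently.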
17
Missing Data Treatment. For a very-low-similarity pair (outside the ~2*10^7 BLAST relations), its length is defined as […]. In practice, the merging process should stop when the weight of the “infinite” lengths in the score between new clusters becomes very large (the signal is lost).
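One way to monitor the weight of the "infinite" lengths is to measure what share of the inter-cluster pairs has no recorded BLAST relation. A minimal sketch under that assumption; the function name, the data layout and any stopping threshold built on it are hypothetical and are not taken from ProtoNet:

```python
def missing_fraction(cluster_a, cluster_b, known_pairs):
    """Fraction of inter-cluster protein pairs with no recorded relation."""
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    missing = sum(1 for p, q in pairs if frozenset((p, q)) not in known_pairs)
    return missing / len(pairs)

# Toy example: only two of the four inter-cluster pairs have a BLAST relation
known = {frozenset(("A", "C")), frozenset(("B", "D"))}
frac = missing_fraction({"A", "B"}, {"C", "D"}, known)
print(frac)
```

A fraction close to 1 means the score between the two clusters is driven almost entirely by default values rather than by measured similarities.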
18
Results: ProtoNet top 20. The 20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level.
19
Problem of result assessment: what is a “good” cluster? It contains all proteins in the family and no proteins outside the family. But what is a family? Does any keyword define a family? Is the cluster stable as the merging events occur (long lifetime)?
20
Problem of result assessment: what is a “good” tree? Should we trust the resulting forest? –Which clustering technique is better? Combined? –Bootstrap? Do the clusters correspond to meaningful families of proteins? –Validation against InterPro, SCOP etc. –But we have no wish merely to reconstruct them automatically! What is the right level/cut at which to look at the forest?
21
InterPro Validation. InterPro annotation allows systematic validation of the generated clustering. The ‘geometric’ method exhibits high cluster purity –Corresponds to a low false-positive (FP) rate
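Cluster purity against an external annotation can be computed in a few lines. A minimal sketch, assuming purity means the share of annotated cluster members that carry the cluster's most common annotation; this is one common definition and may differ from the measure used in the ProtoNet validation. The protein names and family labels are invented:

```python
from collections import Counter

def purity(cluster, annotation):
    """Return (dominant label, share of annotated members carrying it)."""
    labels = [annotation[p] for p in cluster if p in annotation]
    if not labels:
        return None  # no annotated members, purity undefined
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical annotation: protein -> family label
annotation = {"p1": "kinase", "p2": "kinase", "p3": "kinase", "p4": "globin"}
result = purity(["p1", "p2", "p3", "p4", "p5"], annotation)
print(result)
```

Members outside the dominant family count as false positives, so high purity corresponds to a low FP rate. Unannotated proteins (`p5` here) are simply ignored in this sketch.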
22
The Domain Problem. Many proteins are composed of several domains. The sequence similarity tools used are therefore local in nature: the score for comparing two sequences is the edit distance between their most similar subsequences. This creates a false similarity problem:
23
The Modular Nature of Proteins. Proteins shown: CSKP_HUMAN, DLG3_MOUSE, K6A1_MOUSE, MPP3_HUMAN. Domains shown: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase.
24
False Transitivity of Local Alignment. Proteins: CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN, K6A1_MOUSE. Pairwise E-scores shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42. We ran BLAST using default parameters: all these pairwise similarities have an E-score better than 1e-40. If we cluster these proteins assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN.
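The effect can be reproduced with a naive transitive grouping: proteins linked by any chain of significant hits end up in one cluster. A minimal sketch using union-find; which protein pairs pass the threshold is an assumption made for illustration, since the transcript does not say which E-score belongs to which pair:

```python
def connected_components(nodes, edges):
    """Group nodes linked by any chain of edges (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

proteins = ["K6A1_MOUSE", "CSKP_HUMAN", "DLG3_MOUSE", "MPP3_HUMAN"]
# Assumed significant pairs; note there is no direct K6A1_MOUSE / MPP3_HUMAN hit
significant = [
    ("K6A1_MOUSE", "CSKP_HUMAN"),
    ("CSKP_HUMAN", "DLG3_MOUSE"),
    ("CSKP_HUMAN", "MPP3_HUMAN"),
]

components = connected_components(proteins, significant)
same_cluster = any({"K6A1_MOUSE", "MPP3_HUMAN"} <= c for c in components)
directly_similar = frozenset(("K6A1_MOUSE", "MPP3_HUMAN")) in {
    frozenset(e) for e in significant
}
print(same_cluster, directly_similar)
```

In this toy setup the two proteins land in the same cluster although they share no direct hit; a multi-domain protein acts as the bridge between them.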
25
Alternative methods Different types of clustering –Non-binary –Goal-oriented => semi-guided –Graph theory insights Non-clustering ways of exploring the space of proteins Why BLAST E-score??? Enrichment of the metric using structure