
1 ©CMBI 2006 The WWW of clustering in Bioinformatics, or: how Homo sapiens thinks in pigeonholes ("hokjes")

2 ©CMBI 2006 Disclaimer I know nothing about the actual clustering techniques; for that you must ask Lutgarde, or Ron, or any of their colleagues. I will talk today about fields, mainly bioinformatics, where clustering is being used.

3 ©CMBI 2006 My daily clustering

4 ©CMBI 2006 Remember bioinformatics 1? The main reason for aligning sequences is that it allows us to transfer information. If many sequences are available, clustering can help us figure out from which of those sequences we can best transfer that information. Why cluster sequences?

5 ©CMBI 2006 My daily clustering Take, for example, these three sequences: 1 ASWTFGHK 2 GTWSFANR 3 ATWAFADR. You see immediately that 2 and 3 are close, while 1 is further away. So the tree will look roughly like: 3 2 1

6 ©CMBI 2006 Clustering sequences; start with distances Matrix of pair-wise distances between five sequences A–E (figure; the numeric entries did not survive transcription). D and E are the closest pair. Take them, and collapse the matrix by one row/column.
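The collapse step on this slide can be sketched in code. This is a minimal average-linkage (UPGMA-style) merge; the distance values below are invented for illustration, since the slide's own numbers did not survive transcription.

```python
# One step of hierarchical clustering: find the closest pair in the
# distance matrix, merge it, and collapse the matrix by one row/column.
# Merged-cluster distances are averages of the members' distances.

def collapse_closest(labels, d):
    """Merge the closest pair and return the reduced labels and matrix."""
    n = len(labels)
    # find the closest pair (i, j) with i < j
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: d[p[0]][p[1]])
    merged = labels[i] + labels[j]
    keep = [k for k in range(n) if k not in (i, j)]
    new_labels = [labels[k] for k in keep] + [merged]
    new_d = []
    for a in keep:
        row = [d[a][b] for b in keep]
        row.append((d[a][i] + d[a][j]) / 2)   # average distance to merged pair
        new_d.append(row)
    new_d.append([(d[i][k] + d[j][k]) / 2 for k in keep] + [0.0])
    return new_labels, new_d

labels = ["A", "B", "C", "D", "E"]
d = [[0, 2, 7, 9, 9],     # invented symmetric distance matrix
     [2, 0, 7, 9, 9],
     [7, 7, 0, 5, 5],
     [9, 9, 5, 0, 1],
     [9, 9, 5, 1, 0]]
```

Calling `collapse_closest(labels, d)` merges D and E (distance 1) into a cluster "DE"; repeating the call until one cluster remains yields the tree shown on the following slides.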

7 ©CMBI 2006 Clustering sequences D E A B

8 ©CMBI 2006 Clustering sequences D E C A B

9 ©CMBI 2006 Clustering sequences D E C A B So it really looks as if we have two clusters, AB and C,DE. But feel free to call it three clusters…

10 ©CMBI 2006 So, nice tree, but what did we actually do? 1) We determined a distance measure 2) We measured all pair-wise distances 3) We collapsed everything into ~1½ dimensions 4) We used an algorithm to visualize things 5) We decided on the number of clusters And that, ladies and gentlemen, is called clustering…

11 ©CMBI 2006 Back to my daily clustering 1 ASWTFGHK 2 GTWSFANR 3 ATWAFADR Actually I cheated a bit. 1 is closer to 3 than to 2 because of the A at position 1. How can we express this in the tree? For example: 3 2 1 3 2 1 I will call this tree-flipping

12 ©CMBI 2006 Can we generalize tree-flipping? To generalize tree flipping, sequences must be placed ‘distance-correct’ in 1 dimension: 2 3 1 And then connect them, as we did before: So, now most info sits in the horizontal dimension. Can we use the vertical dimension usefully?

13 ©CMBI 2006 The problem is actually bigger 1 ASWTFGHK 2 GTWSFANR 3 ATWAFADR d(i,j) is the distance between sequences i and j. d(1,2)=6; d(1,3)=5; d(2,3)=3. But what if a 4th sequence is added with d(1,4)=4, d(2,4)=5, d(3,4)=4? Where would that sequence sit? So a perfect representation would be: 1 3 2
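The pairwise distances quoted on this slide are plain Hamming distances (the number of differing positions between equal-length sequences), which can be checked directly:

```python
# Hamming distance: count positions where two equal-length sequences differ.

def hamming(a, b):
    assert len(a) == len(b), "sequences must be aligned to equal length"
    return sum(x != y for x, y in zip(a, b))

s1, s2, s3 = "ASWTFGHK", "GTWSFANR", "ATWAFADR"
print(hamming(s1, s2), hamming(s1, s3), hamming(s2, s3))  # 6 5 3, as on the slide
```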

14 ©CMBI 2006 Projection to visualize clusters Fuller projection (unfolded Dymaxion map); Gnomonic projection (correct distances); Political projection; Mercator projection. Source: Wikipedia

15 ©CMBI 2006 Back to sequences: ASASDFDFGHKMGHS 1 ASASDFDFRRRLRHS 2 ASASDFDFRRRLRIT 5 ASLPDFLPGHSIGHS 3 ASLPDFLPGHSIGIT 6 ASLPDFLPRRRVRIT 4 The more dimensions we retain, the less information we lose. The tree is now in 3D…

16 ©CMBI 2006 Projection to visualize clusters We want to reduce the dimensionality with minimal distortion of the pair-wise distances. One way is Eigenvector determination, or PCA.

17 ©CMBI 2006 PCA to the rescue Now we have made the data one-dimensional, while the second, vertical, dimension is noise. If we did this correctly, we kept as much data as possible.
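The PCA step on this slide can be sketched with NumPy: diagonalise the covariance matrix and project the points onto the eigenvector with the largest eigenvalue, so that the discarded second dimension carries mostly noise. The data points below are invented for illustration.

```python
import numpy as np

# Generate 2-D points that lie near a line, plus a little noise.
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, 50)
X = np.column_stack([t, 0.5 * t + 0.05 * rng.standard_normal(50)])

Xc = X - X.mean(axis=0)                  # centre the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
pc1 = eigvecs[:, -1]                     # principal axis (largest eigenvalue)
scores = Xc @ pc1                        # 1-D coordinates along that axis
print(eigvals[-1] / eigvals.sum())       # fraction of the variance kept
```

For data like this, the first eigenvector captures almost all of the variance, which is exactly the "kept as much data as possible" claim on the slide.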

18 ©CMBI 2006 One problem, though… Can we actually draw a straight line through the points? To me it looks as if the best line is not straight but bent; and if that’s true we have lost some kind of information! But that is a data-modelling / clustering problem, not a PCA problem…

19 ©CMBI 2006 Back to sequences: If we have N sequences, we can only draw their distance matrix exactly in an (N-1)-dimensional space. By the time it is a tree, how many dimensions, and how much information, have we lost? Perhaps we should cluster in a different way?

20 ©CMBI 2006 Cluster on critical residues? QWERTYAKDFGRGH AWTRTYAKDFGRPM SWTRTNMKDTHRKC QWGRTNMKDTHRVW Gray = conserved (no information for clustering); Red = variable; Green = correlated (no noise)

21 ©CMBI 2006 Cluster based on correlated residues

22 ©CMBI 2006 Conclusion about sequence clustering Important topics: 1.Distance measure (~ data selection) 2.Data-modelling / algorithm 3.Visualization 4.Dimensionality reduction

23 ©CMBI 2006 We don’t only cluster sequences… Other cluster procedures are found in e.g.: 1. Structure prediction, analysis, etc.; 2. Cell sorting; 3. HIV medication choice; 4. Cancer radiation and chemotherapy regimens; 5. Design of food taste product combinations; and many more, often very different, bio-related fields. The next few slides show some examples.

24 ©CMBI 2006 Other sciences that cluster: Determination of a structure from massive averaging of very low resolution data. This is often iterative.

25 ©CMBI 2006 Microarray data http://gap.stat.sinica.edu.tw/Bioinformatics/

26 ©CMBI 2006 Brain imaging One way of measuring is via hemoglobin. Many patients or test subjects must be averaged. Clustering determines who can be averaged. Distance measure? Cluster with picture deformation. Sources: Donders, Waisman, and MPG websites.

27 ©CMBI 2006 Molecular dynamics These trajectories actually are many superposed motions acting at the same time. Essential dynamics, an eigenvector analysis in distance space, separates those motions. Bert de Groot

28 ©CMBI 2006 Essential dynamics Daan van Aalten Motion along eigenvector 1

29 ©CMBI 2006 Summary: domain knowledge is needed These examples make clear that domain knowledge is needed to cluster biological data. Everybody can cluster this. But this?

30 ©CMBI 2006 Even informatics now knows: Michele Ceccarelli and Antonio Maratea (2006). Improving Fuzzy Clustering of Biological Data by Side Information*. International Journal of Approximate Reasoning. Abstract: Semi-supervised methods use a small amount of auxiliary information as a guide in the learning process in the presence of unlabeled data. When using a clustering algorithm, the auxiliary information has the form of Side Information, that is, a list of co-clustered points. Recent literature shows better performance of these methods with respect to totally unsupervised ones, even with a small amount of Side Information. This fact suggests that the use of semi-supervised methods may be useful especially in very difficult and noisy tasks where little a priori information is available, as is the case for data deriving from biological experiments. http://www.scoda.unisannio.it/Papers/Publications/JAR_2006 * “Side information” is informatics-speak for domain knowledge

31 ©CMBI 2006 Cluster techniques: K-means Simply speaking, k-means clustering is an algorithm to classify or group your objects, based on attributes/features, into K groups, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid. Thus the purpose of k-means clustering is to classify the data. But how many clusters? Elbow rule-of-thumb
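A small pure-Python sketch of k-means plus the elbow rule-of-thumb: run k-means for increasing K and look where the within-cluster sum of squares (SSE) stops dropping sharply. The 1-D data points below are invented for illustration; real data would be feature vectors.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialise with k data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, sse

def best_kmeans(points, k, restarts=10):
    # k-means only finds a local optimum, so keep the best of several restarts
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda r: r[1])

points = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
for k in (1, 2, 3, 4):
    print(k, round(best_kmeans(points, k)[1], 2))
```

On this data the SSE drops sharply up to K=3 (three well-separated groups) and only slowly after that; the bend in that curve is the "elbow" the slide refers to.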

32 ©CMBI 2006 -0.3 12.4 0.01 12.1 -7.3 5.16 ….. Cluster techniques: Neural nets For example: self organizing maps. Every data-point must be represented by a vector of length N. Every square in this map becomes a random vector of length N. For every data-point the in-product is calculated with every map-vector. The best fitting map- vector is averaged with 1/q times the data vector. The neighbours in the map are averaged with 1/p.q times the data vector. This is iterated N times with normalisations in between. 1/q 1/p.q

33 ©CMBI 2006 Support Vector Machines The idea is to split a dataset into two parts, for example binding ligands in green and non-binding ones in red. The parameters P and Q describe the ligands. The SVM finds the best separating line. Obviously this must be used in N-dimensional space with an (N-1)-dimensional hyperplane. I believe that a mathematical theorem exists that tells us this still works even if N is (much) bigger than the number of data points.
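A toy linear separator in the spirit of this slide: minimise the hinge loss by subgradient descent to find a line w·x + b = 0 splitting two classes in the P–Q plane. The points are invented; a full SVM additionally maximises the margin, which the regularisation term `lam` only sketches here.

```python
def train_svm(points, labels, lr=0.1, lam=0.01, epochs=200):
    """Hinge-loss subgradient descent for a 2-D linear separator."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:                      # wrong side or inside margin: push
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:                               # safely classified: only shrink w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

# green (binding, +1) vs red (non-binding, -1) in the P-Q parameter plane
points = [(2, 3), (3, 3), (2.5, 4), (7, 8), (8, 7), (7.5, 9)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_svm(points, labels)
predictions = [1 if w[0] * p + w[1] * q + b > 0 else -1 for p, q in points]
```

In N dimensions the same update rule applies with length-N weight vectors; the separating line becomes the (N-1)-dimensional hyperplane the slide mentions.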

34 ©CMBI 2006 Summary There are many ways to describe the data. There are many clustering techniques. There are many visualisation methods. What to use depends each time on the biological question. Domain knowledge is a prerequisite for success.

