K-Medoid May 5, 2019.

Presentation transcript:

Partitional Clustering: Partition n objects into k clusters. These techniques start with k clusters (partitions). The number of partitions (clusters) is decided in advance by the user.

k-medoid methods: The two best-known k-medoid methods are PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications).

PAM (Partitioning Around Medoids): The idea is to find a single partition of the data into k clusters. Each cluster has one most representative point, that is, the point most "centrally" located in the cluster with respect to some measure, e.g., distance. This leads us to the definition of a medoid.

Medoid definition: A medoid is an actual point in the dataset that is centrally located and is therefore representative of the cluster. [Figure: scatter plot with both axes running from 1 to 10.]

More precisely: Object $O_j$ belongs to the cluster represented by $O_m$ if $d(O_j, O_m) = \min_{O_e} d(O_j, O_e)$, where $O_j$ is a non-selected object, $O_m$ is a (selected) medoid, $d(O_1, O_2)$ denotes the dissimilarity or distance between objects $O_1$ and $O_2$, and $\min_{O_e}$ denotes the minimum over all medoids $O_e$.
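
The assignment rule can be sketched in a few lines of Python. This is only an illustration: the function name `assign_to_medoids`, the sample points, and the use of Euclidean distance for d are assumptions, not part of the slides.

```python
from math import dist  # Euclidean distance, used here as an assumed choice of d

def assign_to_medoids(points, medoid_indices, d=dist):
    """Map each non-selected object index j to the index m of its nearest medoid."""
    assignment = {}
    for j, point in enumerate(points):
        if j in medoid_indices:
            continue  # a medoid represents its own cluster
        # d(O_j, O_m) = min over all medoids O_e of d(O_j, O_e)
        assignment[j] = min(medoid_indices, key=lambda m: d(point, points[m]))
    return assignment

# Hypothetical data: two small groups, with objects 0 and 3 chosen as medoids
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
clusters = assign_to_medoids(points, medoid_indices=[0, 3])
```

Any dissimilarity function taking two objects can be passed as `d` in place of the Euclidean distance.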

PAM in general: To find the k medoids, PAM begins with an arbitrary selection of k objects. Then, in each step, a swap between a selected object $O_m$ and a non-selected object $O_p$ is made, as long as such a swap would improve the quality of the clustering.

A simple example of a swap: Suppose there are 2 medoids, A and B, and we replace A with a new medoid M. [Figure: medoids A and B before the swap, and A, B, M after it.]

For all the objects Y that are originally in the cluster represented by A, find the nearest medoid in light of the replacement.

There are 2 cases. Case 1: Y moves to the cluster represented by B, but not to the new one represented by M. Case 2: Y moves to the new cluster represented by M, and the cluster represented by B is not affected.

We also need to consider all the objects Z that are originally in B's cluster.

Two more cases. Case 3: Z stays with B. Case 4: Z moves to the new cluster represented by M.

Notation: $O_m$ is the current medoid that is to be replaced (e.g., A). $O_p$ is the new medoid that replaces $O_m$ (e.g., M). $O_j$ is any other non-medoid object that may or may not need to be moved (e.g., Y and Z). $O_{j,2}$ is the current medoid nearest to $O_j$ when A and M are excluded (e.g., B).

To formalize the effect of a swap between $O_m$ and $O_p$, PAM computes costs $C_{jmp}$ for all non-medoid objects $O_j$. $C_{jmp}$ is defined differently depending on which of the following cases $O_j$ falls into.

Case 1: $O_j$ currently belongs to the cluster represented by $O_m$, and $O_j$ is more similar to $O_{j,2}$ than to $O_p$, i.e., $d(O_j, O_p) \ge d(O_j, O_{j,2})$. Thus, $O_j$ would move to the cluster represented by $O_{j,2}$. The cost of the swap is: $C_{jmp} = d(O_j, O_{j,2}) - d(O_j, O_m)$.

Case 2: $O_j$ currently belongs to the cluster represented by $O_m$, and $O_j$ is less similar to $O_{j,2}$ than to $O_p$, i.e., $d(O_j, O_p) < d(O_j, O_{j,2})$. Thus, $O_j$ would move to the cluster represented by $O_p$. The cost of the swap is: $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$.

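Case 3: $O_j$ currently belongs to a cluster other than that of $O_m$, namely the one represented by $O_{j,2}$, and $O_j$ is more similar to $O_{j,2}$ than to $O_p$, i.e., $d(O_j, O_p) \ge d(O_j, O_{j,2})$. Then, even if $O_m$ is replaced by $O_p$, $O_j$ would stay in the cluster represented by $O_{j,2}$. The cost is: $C_{jmp} = 0$.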

Case 4: $O_j$ currently belongs to the cluster represented by $O_{j,2}$, and $O_j$ is less similar to $O_{j,2}$ than to $O_p$, i.e., $d(O_j, O_p) < d(O_j, O_{j,2})$. Then, replacing $O_m$ with $O_p$ would cause $O_j$ to jump from the cluster of $O_{j,2}$ to that of $O_p$. The cost of the swap is: $C_{jmp} = d(O_j, O_p) - d(O_j, O_{j,2})$.

Total Cost: Combining the four cases, the total cost of replacing $O_m$ with $O_p$ is given by: $TC_{mp} = \sum_j C_{jmp}$, where the sum runs over all non-medoid objects $O_j$.

Algorithm PAM
1. Arbitrarily choose k objects as the initial medoids.
2. Until no change, do:
   a. (Re)assign each object to the cluster of its nearest medoid.
   b. Randomly select a non-medoid object $O_p$ and compute the total cost $TC_{mp}$ of swapping medoid $O_m$ with $O_p$.
   c. If $TC_{mp} < 0$, swap $O_m$ with $O_p$ to form the new set of k medoids.
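
A compact sketch of the loop above follows. It makes two assumptions that differ from the slide: candidate swaps are scanned exhaustively rather than picked at random, so that "until no change" has a well-defined meaning, and the effect of a swap is measured as the direct change in the summed distance of every object to its nearest medoid. The names `pam` and `total_cost` are illustrative only.

```python
from itertools import product
from math import dist  # assumed choice of dissimilarity d

def total_cost(points, medoids, d=dist):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(d(x, points[m]) for m in medoids) for x in points)

def pam(points, k, d=dist):
    medoids = list(range(k))  # arbitrary initial choice: the first k objects
    cost = total_cost(points, medoids, d)
    improved = True
    while improved:           # "until no change"
        improved = False
        non_medoids = [p for p in range(len(points)) if p not in medoids]
        for i, p in product(range(k), non_medoids):
            candidate = medoids[:i] + [p] + medoids[i + 1:]
            candidate_cost = total_cost(points, candidate, d)
            if candidate_cost < cost:  # the swap lowers the total cost
                medoids, cost, improved = candidate, candidate_cost, True
                break  # restart the scan from the new medoid set
    return sorted(medoids), cost

# Hypothetical data: two groups of three collinear points
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, cost = pam(points, k=2)
```

Because every accepted swap strictly lowers the total cost and there are finitely many medoid sets, the loop terminates.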

PAM: Example (K = 2). [Figure: scatter plots, axes 1 to 10, illustrating the steps: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object $O_{random}$; compute the total cost of swapping; swap O and $O_{random}$ if the quality is improved; loop until no change.]

PAM Disadvantage: Experimental results show that PAM works satisfactorily for small data sets (e.g., 100 objects in 5 clusters), but it is not efficient in dealing with medium and large data sets. This is not too surprising if we perform a complexity analysis on PAM. There are altogether k(n - k) pairs ($O_m$, $O_p$). For each pair, computing $TC_{mp}$ requires the examination of (n - k) non-selected objects. Thus, the combined complexity is $O(k(n - k)^2)$, and this is the complexity of only one iteration. PAM therefore becomes too costly for large values of n and k. This analysis motivates the development of CLARA.
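
To give a feel for this growth, the count k(n - k) pairs times (n - k) objects can be evaluated directly. The function name and the second data-set size below are illustrative assumptions.

```python
def swap_evaluations(n, k):
    """Object examinations in one PAM iteration: k(n-k) pairs times (n-k) objects."""
    return k * (n - k) ** 2

small = swap_evaluations(100, 5)      # the small case mentioned above
large = swap_evaluations(10_000, 50)  # a hypothetical larger data set

print(small)  # 45125
print(large)  # 4950125000
```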

CLARA (Clustering LARge Applications): Designed to handle large data sets. The idea: instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM to the sample, and finds the medoids of the sample. The point is that, if the sample is drawn in a sufficiently random way, the medoids of the sample will approximate the medoids of the entire data set.

To come up with better approximations, CLARA draws multiple samples and gives the best clustering as the output. The quality of a clustering is measured by the average dissimilarity of all objects in the entire data set. Experiments show that samples of size 40 + 2k give satisfactory results.

Algorithm CLARA
1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2k objects randomly from the entire data set, and call Algorithm PAM to find k medoids of the sample.
3. For each object $O_j$ in the entire data set, determine which of the k medoids is the most similar to $O_j$.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step 2 as the best set of medoids obtained so far.
5. Return to Step 1 to start the next iteration.
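
The steps above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the names `clara`, `pam`, and `average_dissimilarity`, the fixed random seed, and the Euclidean distance are all hypothetical choices, the embedded PAM is a simplified greedy variant, and the objects are assumed to be distinct.

```python
import random
from math import dist  # assumed choice of dissimilarity d

def average_dissimilarity(points, medoids, d=dist):
    """Average distance from each object to its most similar medoid."""
    return sum(min(d(x, m) for m in medoids) for x in points) / len(points)

def pam(sample, k, d=dist):
    """Simplified greedy PAM on a small sample; returns k medoid objects."""
    medoids = list(sample[:k])  # arbitrary initial medoids
    best = average_dissimilarity(sample, medoids, d)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = average_dissimilarity(sample, trial, d)
                if cost < best:  # keep any swap that improves the clustering
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, seed=0, d=dist):
    rng = random.Random(seed)
    sample_size = min(len(points), 40 + 2 * k)
    best_medoids, min_cost = None, float("inf")
    for _ in range(n_samples):                            # Step 1
        sample = rng.sample(points, sample_size)          # Step 2: draw a sample
        medoids = pam(sample, k, d)                       # Step 2: PAM on the sample
        cost = average_dissimilarity(points, medoids, d)  # Steps 3-4: entire data set
        if cost < min_cost:                               # Step 4: keep the best set
            best_medoids, min_cost = medoids, cost
    return best_medoids, min_cost
```

Note that PAM only ever sees the sample, while the quality of each candidate set of medoids is judged on the entire data set.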

Biological Application

The Biological Problem: Some facts. Recent advances in experimental techniques and automation in molecular and structural biology have led to a rapid increase in the number of determined protein structures. The number of structures deposited in the Protein Data Bank (PDB) is now over 20,000 and is growing rapidly.

Over half of all the proteins of sequenced genomes have no inferable molecular function. Just as sequence similarity implies functional similarity, structural similarity also implies similarity in molecular function: if a hypothetical protein has a structure similar to one or more protein structures of known function, the structural similarity provides a powerful clue to the molecular function of the hypothetical protein. Measures of structural similarity between pairs of proteins, assessed computationally or visually, are also the foundation for classifying protein structures.

The Goal: The goal of the method is to find measures of structural similarity between proteins. We base our method on distances.

Some Biological Background…

Protein Structure: Amino acid: Polypeptide chain:

Structure of the α-helix: β-sheet:

The Method

The Method: We start with the distance matrix representation of protein structure. The distance matrix of a protein structure is a square matrix consisting of the distances between all pairs of atoms in the protein.

When there are $N_p$ residues in protein p, its distance matrix is the $N_p \times N_p$ matrix $D_p = \{d_p(i, j) : i, j = 1, \ldots, N_p\}$, where $d_p(i, j)$ is the distance (in Å) between residues i and j.
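
As a small illustration, a distance matrix can be built from one 3D coordinate per residue. The function name and the four coordinates below are hypothetical; which atom represents each residue is not specified here and is left as an assumption.

```python
from math import dist

def distance_matrix(coords):
    """Square matrix of pairwise distances (in Å) between residue coordinates."""
    return [[dist(a, b) for b in coords] for a in coords]

# Hypothetical coordinates for a four-residue fragment, one point per residue
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0), (0.0, 0.0, 12.0)]
D = distance_matrix(coords)
```

The resulting matrix is symmetric with zeros on the diagonal, and it does not change when the structure is rotated or translated.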

We sub-divide the distance matrix of each protein structure into many overlapping sub-matrices. Each of these overlapping m × m sub-matrices represents local features involving m residues by m residues in the protein.
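
One way to enumerate overlapping m × m sub-matrices is sketched below. The exact windowing scheme is not recoverable from the slide, so the stride of 1 and the use of every (row, column) offset are assumptions, as is the function name.

```python
def sub_matrices(D, m, stride=1):
    """Collect overlapping m x m blocks of a square matrix D, keyed by (row, column) offset."""
    n = len(D)
    starts = range(0, n - m + 1, stride)
    return {
        (i, j): [row[j:j + m] for row in D[i:i + m]]
        for i in starts
        for j in starts
    }

# Hypothetical 4 x 4 "distance matrix" with D[i][j] = |i - j|
D = [[abs(i - j) for j in range(4)] for i in range(4)]
blocks = sub_matrices(D, m=2)
```

With stride 1, an n × n matrix yields (n - m + 1)² blocks, which is why the collection grows quickly with protein length.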

We collect these sub-matrices over P proteins. From this collection, drawn from a large number of distance matrices, we extract a set of K medoid sub-matrices by medoid analysis (PAM).

Example: 100 Medoids. One hundred medoid sub-matrices obtained from partitioning around medoids (PAM) analysis of distance matrices of 100 sampled proteins.

Generation of the LFF Profile: Each of the protein's sub-matrices is labeled by the index of the nearest medoid sub-matrix. The resulting count vector summarizes the frequency distribution of local feature patterns in the protein. Any given protein structure can thus be represented by a profile, a vector of common length K containing the frequencies of occurrence of the medoid sub-matrices in the structure.
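
The labeling and counting step can be sketched as below. The function names and the tiny 2 × 2 examples are hypothetical, and comparing sub-matrices by the Euclidean distance between their flattened entries is an assumption about how "nearest" is measured.

```python
from math import dist

def flatten(matrix):
    """Turn an m x m matrix into a flat list of its entries."""
    return [value for row in matrix for value in row]

def lff_profile(sub_matrices, medoids):
    """Count, for each medoid sub-matrix, how many sub-matrices are nearest to it."""
    flat_medoids = [flatten(m) for m in medoids]
    profile = [0] * len(medoids)
    for s in sub_matrices:
        flat = flatten(s)
        nearest = min(range(len(medoids)), key=lambda k: dist(flat, flat_medoids[k]))
        profile[nearest] += 1  # label the sub-matrix by its nearest medoid
    return profile

# Hypothetical medoids (K = 2) and three sub-matrices from one protein
medoids = [[[0, 1], [1, 0]], [[0, 5], [5, 0]]]
subs = [[[0, 1.2], [1.2, 0]], [[0, 4], [4, 0]], [[0, 0.9], [0.9, 0]]]
profile = lff_profile(subs, medoids)
```

Every protein is mapped to a vector of the same length K, regardless of how many residues it has.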

We call this decoding process profiling of the protein structure. The final feature vector is the profile of protein p.

We normalize the frequency of local interaction pattern k in protein p by:

Normalization of the results: Because the abundance of local patterns varies considerably from one pattern to another, some normalization of the profile is necessary. For example, the "null" pattern is the most abundant of all, and, without normalization, such an abundant pattern would dominate when computing structural similarity or dissimilarity distances. This is not desirable because the frequency of the null pattern contains little structural information.

Normalize Vector Normalize vector X: . In our method:

Before…

After…

Calculation of Similarity/Dissimilarity Scores The profile of protein P: The collection of profiles, or the protein-by-pattern matrix:

As a measure of structural similarity between two proteins p and q with profiles $A_p$ and $A_q$ in $\mathbb{R}^K$, we use their cosine, $\cos(A_p, A_q) = \frac{A_p \cdot A_q}{\|A_p\| \, \|A_q\|}$. The cosine distance is defined as $1 - \cos(A_p, A_q)$ and is used to represent structural dissimilarity or structural distance. Note that the cosine distance ranges from 0 (closest) to 1 (farthest).
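
The cosine distance between two profiles can be computed as in the sketch below. The function name and the example vectors are illustrative assumptions, and both profiles are assumed to be non-zero.

```python
from math import sqrt

def cosine_distance(a, b):
    """1 - cos(a, b) for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical profiles: proportional vectors are closest, orthogonal ones farthest
same_direction = cosine_distance([1, 2, 3], [2, 4, 6])
orthogonal = cosine_distance([1, 0], [0, 1])
```

Because only the direction of the vectors matters, two proteins whose profiles differ by a constant factor are treated as structurally identical by this measure.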

Problem & Solution: The problem: the profile of protein P is a vector that belongs to $\mathbb{R}^K$. The solution: use SVD, which helps to reduce the number of dimensions.

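Singular Value Decomposition (SVD): The SVD of an m × n matrix A is defined as $A = U \Sigma V^T$, where U is an m × n matrix, $\Sigma$ is an n × n diagonal matrix of singular values, and V is an n × n square matrix. U and V are orthogonal, so that $U^T U = V^T V = I$. We can now approximate the original protein-by-pattern matrix by keeping only the k largest singular values: $A_k = U_k \Sigma_k V_k^T$.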

We compute the truncated SVD with k = 3 to obtain the approximation $A_3$ of the protein-by-pattern matrix, because the first three singular values are significantly greater than the rest. We can represent proteins and patterns in the same $\mathbb{R}^3$ space by their first three principal coordinates. The 1st corresponds to the length of the protein, the 2nd to the types of secondary structure elements, and the 3rd to parallelism and direction.

2D plot: Proteins belonging to the all-α and all-β classes are colored red and blue, respectively.

3D plot

Criticism: To find the k medoids, how many proteins should be used as the database? How should the value of k, i.e., the number of clusters, be chosen? To find the overlapping sub-matrices, how should the value of m be chosen? The method for finding structural similarity between proteins uses algorithm PAM, which is not efficient for large databases, so why not use the CLARA algorithm? Is the method able to recognize small proteins, or those with no distinct secondary structure elements in their topology?

The end…

Analogy to text analysis: A document is represented as a vector of word counts. Here, the protein structure plays the role of the document, and the different medoid sub-matrices play the role of the words.

The quality of clustering of local features is difficult to discern because of the domination of the null medoid and the low signal-to-noise ratio of the rest of the medoids (lower four plots). However, after normalization by the spread of the counts in each representative medoid, the similarity among LFF profiles within each family is evident.

k-medoid advantages: Very robust to the existence of outliers (i.e., data points that are very far away from the rest of the data points). Clusters found by k-medoid methods do not depend on the order in which the objects are examined. Experiments have shown that the k-medoid methods can handle very large data sets quite efficiently.

After converting each protein structure into a local feature frequency (LFF) profile, the fold similarity between a pair of proteins can be computed very easily as the Euclidean distance or cosine distance between the two corresponding LFF profile vectors.

The quality of a clustering: The quality of the chosen medoids is measured by the average dissimilarity, or distance, between an object and the medoid of its cluster.