A Geometric Clustering Algorithm and Its Applications to Structural Data Guido Muscioni, Lorenzo Semeria, Shuangxi Zhu 10/05/2017 @UIC
Outline Context introduction Datasets description The algorithm Results From: “A Geometric Clustering Algorithm and Its Applications to Structural Data”, Shutan Xu, Shuxue Zou, and Lincong Wang
context Incredible growth of complex structured data in molecular biology Nuclear Magnetic Resonance (NMR) spectroscopy Protein-ligand Docking Molecular Dynamics (MD) simulations
Nuclear Magnetic Resonance spectroscopy NMR machine at the École polytechnique fédérale de Lausanne Based on the magnetic properties of atoms Determines properties of: Atoms Molecules Produces a unique signature for each molecule
Protein-ligand docking Protein modelling technique Predict the structure of a protein Image from: https://www.intechopen.com/books/protein-engineering-technology-and-application/protein-protein-and-protein-ligand-docking
Molecular dynamics simulation Based on the forces between molecules Computing and updating the positions based on the interactions between molecules N-body simulation techniques
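To make the "compute interactions, then update positions" loop concrete, here is a minimal sketch of one N-body time step. The slides do not name an integrator, so velocity Verlet is assumed here as a common choice; `net_forces`, `verlet_step`, and the `pair_force` callback are hypothetical names introduced only for illustration.

```python
def net_forces(pos, pair_force):
    """Sum pairwise forces on every particle (O(n^2) over all pairs)."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Displacement vector from particle i to particle j.
            r = [pos[j][k] - pos[i][k] for k in range(3)]
            f = pair_force(r)  # force on particle i due to particle j
            for k in range(3):
                forces[i][k] += f[k]
                forces[j][k] -= f[k]  # equal and opposite force on j
    return forces


def verlet_step(pos, vel, mass, pair_force, dt):
    """Advance positions and velocities by one velocity-Verlet step."""
    n = len(pos)
    f_old = net_forces(pos, pair_force)
    new_pos = [
        [pos[i][k] + vel[i][k] * dt + 0.5 * f_old[i][k] / mass[i] * dt * dt
         for k in range(3)]
        for i in range(n)
    ]
    f_new = net_forces(new_pos, pair_force)
    new_vel = [
        [vel[i][k] + 0.5 * (f_old[i][k] + f_new[i][k]) / mass[i] * dt
         for k in range(3)]
        for i in range(n)
    ]
    return new_pos, new_vel


# Assumed toy interaction: a linear spring pulling the two particles together.
spring = lambda r: [1.0 * c for c in r]
p, v = verlet_step([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                   [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                   [1.0, 1.0], spring, 0.1)
```

A real MD package would use a physical force field and neighbor lists rather than the toy spring and the all-pairs loop shown here.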
dataset NMR dataset: Based on SiR5 Two sets of intermediates Computed on 101 residues Large amount of them Protein-ligand dataset: 22 sets of poses (retrieved by GOLD suite) 500 poses available for each initial protein-ligand complex
algorithm Cluster-based algorithm RMSD Similarity measure between two structures in the same created cluster.
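The similarity measure on this slide, read here as the root-mean-square deviation (RMSD) between two structures, can be sketched as follows. This is a minimal illustration that assumes both structures list the same atoms in the same order and are already superimposed; no alignment step is performed, and the function name `rmsd` is chosen only for this example.

```python
import math


def rmsd(a, b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed.
    """
    if len(a) != len(b) or not a:
        raise ValueError("structures must be non-empty and of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(a, b):
        # Squared Euclidean distance between corresponding atoms.
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(a))
```

A small RMSD means the two structures are geometrically close, which is what makes it usable as the pairwise distance inside a cluster.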
Steps Check for dmax Seed another cluster Reclustering based on d Create two more seeds Initializing two clusters Cluster the former dcc < dmax yes Cluster the latter no
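One possible reading of these steps is a divisive scheme: seed two clusters from the farthest pair, assign each element to the nearer seed, and keep splitting any cluster whose diameter is not below dmax. The sketch below follows that reading only; it is not the authors' implementation, it does not model the reseeding and reclustering boxes, and the names `diameter_pair` and `divisive_cluster` are hypothetical.

```python
from itertools import combinations


def diameter_pair(cluster, dist):
    """Return (largest pairwise distance, i, j) within a cluster of indices."""
    best = (0.0, cluster[0], cluster[0])
    for i, j in combinations(cluster, 2):
        d = dist(i, j)
        if d > best[0]:
            best = (d, i, j)
    return best


def divisive_cluster(n_items, dist, dmax):
    """Split items until every cluster has diameter strictly below dmax."""
    if dmax <= 0:
        raise ValueError("dmax must be positive")
    if n_items == 0:
        return []
    pending = [list(range(n_items))]
    done = []
    while pending:
        cluster = pending.pop()
        d, seed_a, seed_b = diameter_pair(cluster, dist)
        if d < dmax:
            done.append(cluster)  # the dmax check passes, keep the cluster
            continue
        # Use the farthest pair as the two seeds and assign by nearer seed.
        former = [k for k in cluster if dist(k, seed_a) <= dist(k, seed_b)]
        latter = [k for k in cluster if dist(k, seed_a) > dist(k, seed_b)]
        pending.extend([former, latter])
    return done


# Toy data: points on a line, with absolute difference as the distance.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
clusters = divisive_cluster(len(xs), lambda i, j: abs(xs[i] - xs[j]), 1.0)
```

Because each seed always lands in its own half, every split yields two strictly smaller clusters, so the loop terminates. The repeated farthest-pair search makes this sketch more expensive than a tuned implementation would be.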
results Metrics: Number of structures Ns VDW energy NOE violation
BASELINE Geometric clustering VS Complete link, Average link, K-medoid
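For readers unfamiliar with the baselines, the sketch below shows the quantities that distinguish them: the cluster-to-cluster distance used by complete link and by average link, and the medoid that K-medoid uses as a cluster representative. These are the standard textbook definitions, not code from the paper, and the function names are illustrative.

```python
def complete_link(c1, c2, dist):
    """Complete linkage: the largest distance between members of two clusters."""
    return max(dist(a, b) for a in c1 for b in c2)


def average_link(c1, c2, dist):
    """Average linkage: the mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def medoid(cluster, dist):
    """Medoid: the member with the smallest total distance to all members."""
    return min(cluster, key=lambda m: sum(dist(m, x) for x in cluster))


# Toy 1D example with absolute difference as the distance.
d1 = lambda a, b: abs(a - b)
far = complete_link([0.0, 1.0], [4.0, 6.0], d1)
avg = average_link([0.0, 1.0], [4.0, 6.0], d1)
center = medoid([0.0, 1.0, 5.0], d1)
```

Hierarchical methods repeatedly merge the two clusters with the smallest linkage distance, whereas K-medoid alternates between assigning points to the nearest medoid and recomputing medoids.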
Result-NMR Geometric clustering VS Complete link, Average link, K-medoid
Result-Protein-ligand Geometric clustering VS GOLD score Better identification of clusters Sometimes fails in recognizing the best cluster Both show problems; the level of confidence of the results is not high
Discussion Pros & What we liked: It’s simple (it is a “standard” iterative clustering algorithm). The results are better than other clustering algorithms for these problems. It not only implements a new algorithm but also comes up with a relatively new scoring function for best cluster selection. Cons & What we disliked: The algorithm is not clearly explained (elements are assigned to new clusters with no apparent criteria). Complexity is high (O(n^2 log n)). Prior knowledge is needed to get a good result. The datasets may not be big enough to be realistic. The datasets may not represent real data. The second part of the results section is not well described.
Questions