Clustering An overview of clustering algorithms Dènis de Keijzer GIA 2004

Overview Algorithms GRAVIclust AUTOCLUST AUTOCLUST+ 3D Boundary-based Clustering SNN

Gravity based spatial clustering GRAVIclust Initialisation Phase: calculate the initial cluster centres Optimisation Phase: improve the positions of the cluster centres to achieve a solution that minimizes the distance function

GRAVIclust: Initialisation Phase Input: set of points P

GRAVIclust: Initialisation Phase Input: set of points P matrix of distances between all pairs of points assumption: actual access-path distances exist in the GIS maps (very versatile: e.g. footpath map, road map, rail map)

GRAVIclust: Initialisation Phase Input: set of points P matrix of distances between all pairs of points # of required clusters k

GRAVIclust: Initialisation Phase Step 1: calculate the first initial centre: the point with the largest number of points within radius r; remove this centre & all points within radius r from further consideration Step 2: repeat Step 1 until k initial centres have been chosen Step 3: create initial clusters by assigning all points to the closest cluster centre
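A minimal sketch of the initialisation phase, assuming a precomputed distance matrix (which may hold access-path rather than Euclidean distances); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def graviclust_init(dist, k, r):
    """dist: (n, n) matrix of pairwise distances; returns (centres, labels)."""
    n = dist.shape[0]
    active = np.ones(n, dtype=bool)          # points still under consideration
    centres = []
    for _ in range(k):
        # Step 1: the active point covering the most active points within radius r
        counts = np.where(active, ((dist <= r) & active).sum(axis=1), -1)
        c = int(np.argmax(counts))
        centres.append(c)
        active &= dist[c] > r                # remove the centre and its neighbourhood
    # Step 3: assign every point to its closest initial centre
    labels = np.argmin(dist[:, centres], axis=1)
    return centres, labels
```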

GRAVIclust: radius calculation Radius r is calculated based on the area of the region considered for clustering Static radius: based on the assumption that all clusters are of the same size Dynamic radius: recalculated after each initial cluster centre is chosen

GRAVIclust: Static vs. Dynamic Static: reduced computation, the # of points within radius r has to be calculated only once; not suitable for problems where the points are separated by large empty areas Dynamic: increases computation time, but ensures the radius is adjusted as points are removed The two differ only when the distribution is non-uniform
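The slides do not give the radius formula; one plausible instantiation (purely an assumption) splits the region's area equally among k circular clusters:

```python
import numpy as np

def static_radius(total_area, k):
    # assumes all k clusters occupy roughly equal circular areas
    return np.sqrt(total_area / (np.pi * k))

def dynamic_radius(remaining_area, remaining_clusters):
    # recalculated after each initial cluster centre is chosen
    return np.sqrt(remaining_area / (np.pi * remaining_clusters))
```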

GRAVIclust: Optimisation Phase Step 1: for each cluster, calculate a new centre: the point closest to the cluster's centre of gravity Step 2: re-assign points to the new cluster centres Step 3: recalculate the distance function (never greater than the previous value) Step 4: repeat Steps 1 to 3 until the value of the distance function equals the previous one
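A sketch of the optimisation loop, continuing the invented names above; for simplicity the "point closest to the centre of gravity" is found with Euclidean distance, even though assignment itself uses the distance matrix:

```python
import numpy as np

def graviclust_optimise(points, dist, centres, labels):
    prev_cost = np.inf
    while True:
        for c in range(len(centres)):
            members = np.where(labels == c)[0]
            gravity = points[members].mean(axis=0)        # centre of gravity
            # Step 1: new centre = member point closest to the centre of gravity
            centres[c] = members[np.argmin(
                np.linalg.norm(points[members] - gravity, axis=1))]
        labels = np.argmin(dist[:, centres], axis=1)      # Step 2: re-assign
        cost = dist[np.arange(len(labels)), np.asarray(centres)[labels]].sum()
        if cost >= prev_cost:                             # Step 4: stop at a fixpoint
            return centres, labels
        prev_cost = cost                                  # Step 3: never increases
```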

GRAVIclust Deterministic Can handle obstacles Monotonic convergence of the distance function to a stable point

AUTOCLUST Definitions For each point p_i in the Delaunay diagram: LocalMean(p_i) = mean length of the edges incident to p_i LocalStDev(p_i) = standard deviation of the lengths of the edges incident to p_i

AUTOCLUST Definitions II MeanStDev(P) = average of LocalStDev(p_i) over all points p_i ∈ P Each edge incident to p_i is classified against the interval [LocalMean(p_i) − MeanStDev(P), LocalMean(p_i) + MeanStDev(P)]: shorter than the interval → ShortEdges(p_i), longer → LongEdges(p_i), within → OtherEdges(p_i)

AUTOCLUST Phase 1: finding boundaries Phase 2: restoring and re-attaching Phase 3: detecting second-order inconsistency

AUTOCLUST: Phase 1 Finding boundaries Calculate the Delaunay diagram For each point p_i, classify its incident edges into ShortEdges(p_i), LongEdges(p_i), OtherEdges(p_i) Remove ShortEdges(p_i) and LongEdges(p_i)
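A hedged sketch of Phase 1 under the interval classification from the definition slides (LocalMean(p_i) ± MeanStDev(P)); scipy's Delaunay is one way to get the diagram, and all helper names are invented:

```python
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def phase1_edges(points):
    tri = Delaunay(points)
    # collect the unique Delaunay edges
    edges = {tuple(sorted(map(int, pair)))
             for simplex in tri.simplices
             for pair in combinations(simplex, 2)}
    length = {e: float(np.linalg.norm(points[e[0]] - points[e[1]])) for e in edges}
    incident = {i: [] for i in range(len(points))}
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    local_mean = {i: np.mean([length[e] for e in es]) for i, es in incident.items()}
    mean_st_dev = np.mean([np.std([length[e] for e in es])
                           for es in incident.values()])
    # an edge is removed if it is short or long for either endpoint
    removed = set()
    for i, es in incident.items():
        lo, hi = local_mean[i] - mean_st_dev, local_mean[i] + mean_st_dev
        for e in es:
            if not (lo <= length[e] <= hi):
                removed.add(e)
    return edges - removed                   # graph remaining after Phase 1
```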

AUTOCLUST: Phase 2 Restoring and re-attaching For each point p_i where ShortEdges(p_i) ≠ ∅, determine a candidate connected component C for p_i: If there are two edges e_j = (p_i, p_j) and e_k = (p_i, p_k) in ShortEdges(p_i) with CC[p_j] ≠ CC[p_k], then compute, for each edge e = (p_i, p_j) ∈ ShortEdges(p_i), the size ||CC[p_j]||, and let M = max over e = (p_i, p_j) ∈ ShortEdges(p_i) of ||CC[p_j]|| Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, let C be the one with the shortest edge to p_i)

AUTOCLUST: Phase 2 Restoring and re-attaching For each point p_i where ShortEdges(p_i) ≠ ∅, determine a candidate connected component C for p_i: If … (as above) Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(p_i) connect p_i to

AUTOCLUST: Phase 2 Restoring and re-attaching For each point p_i where ShortEdges(p_i) ≠ ∅, determine a candidate connected component C for p_i If the edges in OtherEdges(p_i) connect to a connected component different from C, remove them; note that all edges in OtherEdges(p_i) are removed, and only in this case will p_i swap connected components Add all edges e ∈ ShortEdges(p_i) that connect to C
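A rough sketch of the restore step (the shortest-edge tie-break is omitted); `cc` maps each point to its connected-component label on the Phase-1 graph, and all names are illustrative:

```python
from collections import Counter

def phase2_restore(kept, short_edges, other_edges, cc):
    size = Counter(cc.values())                      # ||CC[label]||
    for i, shorts in short_edges.items():
        if not shorts:
            continue
        end = lambda e: e[1] if e[0] == i else e[0]  # far endpoint of an edge at p_i
        # candidate C: label of the largest component reached by a short edge
        C = max((cc[end(e)] for e in shorts), key=lambda lab: size[lab])
        # drop OtherEdges(p_i) that connect to a component different from C
        for e in list(other_edges.get(i, [])):
            if cc[end(e)] != C:
                kept.discard(e)
        # re-attach p_i to C via its short edges
        for e in shorts:
            if cc[end(e)] == C:
                kept.add(e)
    return kept
```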

AUTOCLUST: Phase 3 Detecting second-order inconsistency Compute the LocalMean for 2-neighbourhoods N_{2,G}(p_i) Remove all edges in N_{2,G}(p_i) that are long edges
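A sketch of the second-order test, assuming "long" again means exceeding the 2-neighbourhood mean by MeanStDev(P):

```python
import numpy as np
from collections import defaultdict

def phase3_prune(kept, points, mean_st_dev):
    adj = defaultdict(set)
    for a, b in kept:
        adj[a].add(b)
        adj[b].add(a)
    length = lambda e: np.linalg.norm(points[e[0]] - points[e[1]])
    removed = set()
    for i in list(adj):
        # N_{2,G}(p_i): edges among points within two hops of p_i
        hood = {i} | adj[i] | {k for j in adj[i] for k in adj[j]}
        n2 = {tuple(sorted((a, b))) for a in hood for b in adj[a] if b in hood}
        if not n2:
            continue
        local_mean_2 = np.mean([length(e) for e in n2])
        removed |= {e for e in n2 if length(e) > local_mean_2 + mean_st_dev}
    return kept - removed
```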

AUTOCLUST

No user-supplied arguments: eliminates expensive human exploration time for finding best-fit arguments Robust to noise, outliers, bridges and type of distribution Able to detect clusters with arbitrary shapes, different sizes and different densities Can handle multiple bridges O(n log n)

AUTOCLUST+ Construct the Delaunay diagram Calculate MeanStDev(P) For every edge e, remove e if it intersects an obstacle Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
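A sketch of the obstacle step only, using a standard proper-intersection test (collinear touching is ignored); names are illustrative:

```python
import numpy as np

def orient(u, v, w):
    # sign of the z-component of (v - u) x (w - u)
    return np.sign((v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0]))

def crosses(p, q, a, b):
    """True if segment pq properly intersects segment ab."""
    return (orient(p, q, a) != orient(p, q, b) and
            orient(a, b, p) != orient(a, b, q))

def remove_obstacle_edges(edges, points, obstacles):
    """obstacles: iterable of (a, b) segment endpoints as 2D arrays."""
    return {e for e in edges
            if not any(crosses(points[e[0]], points[e[1]], a, b)
                       for a, b in obstacles)}
```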

3D Boundary-based Clustering Benefits from 3D Clustering more accurate spatial analysis distinguish positive clusters: clusters in higher dimensions but not in lower dimensions

3D Boundary-based Clustering Benefits from 3D Clustering more accurate spatial analysis distinguish positive clusters: clusters in higher dimensions but not in lower dimensions negative clusters: clusters in lower dimensions but not in higher dimensions

3D Boundary-based Clustering Based on AUTOCLUST Uses Delaunay tetrahedrizations Definitions: e_j is a potential inter-cluster edge if: [criterion given as a formula on the original slide, lost in this transcript]

3D Boundary-based Clustering Phase I For all p_i ∈ P, classify each edge e_j incident to p_i into one of three groups: ShortEdges(p_i) when the length of e_j is less than the range in AI(p_i) LongEdges(p_i) when the length of e_j is greater than the range in AI(p_i) OtherEdges(p_i) when the length of e_j is within AI(p_i) For all p_i ∈ P, remove all edges in ShortEdges(p_i) and LongEdges(p_i)
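The Phase-1 edge classification carries over to 3D almost unchanged; a sketch assuming scipy's Delaunay, which produces tetrahedra for 3D input (the same AI(p_i) interval test as in the 2D sketch then classifies these edges):

```python
from itertools import combinations
from scipy.spatial import Delaunay

def delaunay_edges_3d(points):
    tri = Delaunay(points)                   # tetrahedra for 3D input
    return {tuple(sorted(map(int, pair)))
            for simplex in tri.simplices     # 4 vertices per tetrahedron
            for pair in combinations(simplex, 2)}
```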

3D Boundary-based Clustering Phase II Recuperate ShortEdges(p_i) incident to border points using connected-component analysis Phase III Remove exceptionally long edges in local regions

Shared Nearest Neighbour Clustering in higher dimensions Distances or similarities between points become more uniform, making clustering more difficult Also, similarity between points can be misleading, i.e. a point can be more similar to a point that "actually" belongs to a different cluster Solution: a shared nearest neighbour approach to similarity

SNN: An alternative definition of similarity Euclidean distance is the most common distance metric used; while useful in low dimensions, it doesn't work well in high dimensions [table of points P1–P4 over attributes A1–A10 illustrating this; the values were lost in this transcript]

SNN: An alternative definition of similarity Define the similarity of two points in terms of their shared nearest neighbours: the similarity of the points is "confirmed" by their common shared nearest neighbours
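A minimal sketch of this similarity, with invented names: count the overlap of two points' k-nearest-neighbour lists:

```python
import numpy as np

def snn_similarity(knn):
    """knn: (n, k) array of each point's k nearest-neighbour indices."""
    n = len(knn)
    sets = [set(row) for row in knn]
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = len(sets[i] & sets[j])
    return sim
```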

SNN: An alternative definition of density Use SNN similarity with the k-nearest-neighbour approach: if the k-nearest neighbour of a point, with respect to SNN similarity, is close, then we say that there is a high density at this point Since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and to the dimensionality of the space
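One common instantiation of SNN density (an assumption, not a formula from the slide): sum each point's SNN similarities over its k nearest neighbours, so the score reflects the local configuration rather than absolute distances:

```python
import numpy as np

def snn_density(sim, knn):
    """sim: (n, n) SNN similarity matrix; knn: (n, k) neighbour indices."""
    return np.array([sim[i, knn[i]].sum() for i in range(len(knn))])
```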

SNN: Algorithm Compute the similarity matrix corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix by keeping only the k most similar neighbours corresponds to keeping only the k strongest links of the similarity graph

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix … Construct the shared nearest neighbour graph from the sparsified similarity matrix

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix … Construct the shared … Find the SNN density of each point

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix … Construct the shared … Find the SNN density of each point Find the core points

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix … Construct the shared … Find the SNN density of each point Form clusters from the core points

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix … Construct the shared … Find the SNN density of each point Form clusters from the core points Discard all noise points

SNN: Algorithm Compute the similarity matrix Sparsify the similarity matrix … Construct the shared … Find the SNN density of each point Form clusters from the core points Discard all noise points Assign all non-noise, non-core points to clusters
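An end-to-end sketch of the steps listed above, built as a DBSCAN-style core-point rule over SNN similarity; k, eps and min_pts are illustrative parameters, and sklearn's NearestNeighbors is just one way to get the kNN lists:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_cluster(X, k=7, eps=3, min_pts=5):
    n = len(X)
    # steps 1-2: kNN lists give a sparsified similarity graph
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    knn = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop self
    sets = [set(row) for row in knn]
    # step 3: SNN graph, linking only mutual k-nearest neighbours
    links = {(i, j): len(sets[i] & sets[j])
             for i in range(n) for j in knn[i]
             if i < j and i in sets[j]}
    strong = {e for e, s in links.items() if s >= eps}
    # step 4: SNN density = number of strong links; step 5: core points
    density = np.zeros(n, dtype=int)
    for i, j in strong:
        density[i] += 1
        density[j] += 1
    core = density >= min_pts
    # step 6: form clusters by flood-filling core points over strong links
    adj = {i: [] for i in range(n)}
    for i, j in strong:
        adj[i].append(j)
        adj[j].append(i)
    labels = -np.ones(n, dtype=int)                        # -1 = noise
    cur = 0
    for s in np.where(core)[0]:
        if labels[s] != -1:
            continue
        labels[s], stack = cur, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if core[v] and labels[v] == -1:
                    labels[v] = cur
                    stack.append(v)
        cur += 1
    # steps 7-8: attach non-core points to a strongly linked core point;
    # whatever remains unlabelled is discarded as noise
    for i in np.where(~core)[0]:
        cores = [v for v in adj[i] if core[v]]
        if cores:
            labels[i] = labels[cores[0]]
    return labels
```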

Shared Nearest Neighbour Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers Handles data of high dimensionality and varying densities Automatically detects the # of clusters