Clustering: John Owen, Sarah Smith
What is clustering? Grouping together objects that are like one another and not like the objects in other clusters. Like sorting laundry…
What is Clustering? It has its roots in statistical analysis. In data mining, clustering is used to give a user a high-level view of what is going on in their database.
Clustering Approach: Algorithms can be complex. The general approach contains five steps: (1) pattern representation; (2) identifying a pattern proximity measure appropriate to the data domain; (3) grouping or clustering of the data; (4) data abstraction; (5) assessment of output.
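The first two steps above can be sketched in a few lines. This is a minimal illustration, assuming each object is represented as a numeric feature vector and that Euclidean distance is a suitable proximity measure for the domain; the sample values and the names `euclidean` and `proximity_matrix` are hypothetical and not taken from the slides.

```python
import math

def euclidean(a, b):
    """Proximity measure (assumed): straight-line distance between two patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proximity_matrix(patterns):
    """Pairwise distances between all patterns, as a list of rows."""
    return [[euclidean(p, q) for q in patterns] for p in patterns]

# Step 1, pattern representation: each object becomes a feature vector.
# These three 2-D vectors are made-up illustration data.
patterns = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Step 2, pattern proximity: smaller values mean "more alike".
matrix = proximity_matrix(patterns)
```

The grouping step then works from these distances; which proximity measure is appropriate depends on the data domain.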
Four Clustering Methods: Partitioning (k-means clustering), Hierarchical, Density-Based, Grid-Based.
Partitioning (k-means clustering): Classification of the data into k groups that meet two requirements: each group must contain at least one object, and each object must belong to exactly one group. The analyst decides how many clusters there should be, and the method then finds the best fit of points to clusters. The analyst must know the data to do this.
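As a rough sketch of the partitioning idea, the standard iterative k-means procedure (often called Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The version below is an illustration under assumptions, not a production implementation: it uses the first k points as starting centroids for reproducibility (real implementations usually pick random or k-means++ starts), and the sample points are made up.

```python
import math

def k_means(points, k, max_iter=100):
    """Return (assignments, centroids) for k clusters chosen by the analyst."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: every point joins exactly one group,
        # the one whose centroid is nearest.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the partition is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
assignments, centroids = k_means(points, k=2)
```

The result depends on the chosen k and on the starting centroids, which is one reason the analyst needs to know the data.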
Partitioning Example (Source: k-means clustering, http://www.togaware.com/datamining/survivor/K_Means.html)
Hierarchical Clustering: The analyst need not know the data. Designed primarily for creating micro-clusters in large databases. The hierarchical method is either agglomerative (bottom-up) or divisive (top-down).
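The agglomerative (bottom-up) variant can be sketched as follows: every point starts as its own cluster, and the two closest clusters are merged repeatedly. This sketch assumes single linkage (cluster distance is the distance between the closest pair of members) and Euclidean distance; both are illustrative choices, and the sample points and the `num_clusters` stopping rule are hypothetical.

```python
import math

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns (clusters, merges): clusters hold point indices, and merges
    records each merge as (left, right, distance), i.e. the hierarchy.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []

    def linkage(a, b):
        # Single linkage (assumed): distance of the closest pair across clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters among all pairs.
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], dist))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical 2-D data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = agglomerative(points, num_clusters=3)
```

Running the loop down to a single cluster yields the full merge history, which can be cut at any level afterwards. This brute-force version recomputes all pairwise linkages on every pass, so it only suits small inputs.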
Hierarchical Example (Source: http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy)
Density-Based Clustering: Defines clusters by the density of the data distribution. Does not require the user to identify the number of clusters before beginning the data analysis. Useful for dealing with outliers.
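One well-known density-based algorithm is DBSCAN; the slides do not name a specific algorithm, so using it here is an assumption. In this minimal sketch, a point with at least `min_pts` neighbors within radius `eps` (counting itself) is a core point, clusters grow outward from core points, and anything unreachable is labeled as noise. The parameter values and sample points are hypothetical.

```python
import math

NOISE = -1  # label for outliers that belong to no cluster

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already labeled
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is never supplied; it falls out of `eps` and `min_pts`, and the isolated point is reported as an outlier.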
Density-Based Examples (Source: http://klimt.iwr.uni-heidelberg.de/mip/research/hader_clust/)
Grid-Based Clustering: An adaptation of density-based clustering. Data points are placed in a data grid. Each grid cell is of equal size. Grid cells can be decomposed into smaller cells.
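A minimal sketch of the grid idea, under stated assumptions: 2-D points are binned into equal-size square cells, cells holding at least `min_count` points are treated as dense, and adjacent dense cells (including diagonals) are joined into one cluster. The function names, the adjacency rule, the threshold, and the sample points are all hypothetical illustration choices, and decomposition into smaller cells is not shown.

```python
from collections import defaultdict

def grid_cells(points, cell_size):
    """Place each 2-D point into the equal-size grid cell that contains it."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

def grid_clusters(points, cell_size, min_count):
    """Group adjacent dense cells; returns a list of clusters of cell coordinates."""
    cells = grid_cells(points, cell_size)
    dense = {c for c, members in cells.items() if len(members) >= min_count}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            cx, cy = stack.pop()
            component.append((cx, cy))
            # Visit the 8 surrounding cells and keep the dense, unvisited ones.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (cx + dx, cy + dy)
                    if n in dense and n not in seen:
                        seen.add(n)
                        stack.append(n)
        clusters.append(sorted(component))
    return clusters

# Hypothetical data: two dense regions and one sparse point.
points = [(0.2, 0.3), (0.8, 0.1), (1.1, 0.4), (1.7, 0.9),
          (7.5, 7.5), (7.9, 7.1), (4.0, 4.0)]
clusters = grid_clusters(points, cell_size=1.0, min_count=2)
```

Because the work is done per cell instead of per pair of points, the cost after binning depends on the number of occupied cells. A finer analysis could rerun the same functions with a smaller `cell_size` on the points of a single cell.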
Grid-Based Example
Business Uses of Clustering: Marketing; identifying customers/clients who are outliers; detection of credit card fraud; scientific inquiry; human genome.
Future of Clustering: AI; unsupervised learning from pattern recognition.