Data Mining--Clustering. “All human beings desire to know.” Aristotle, Metaphysics, I.1. Lecture 12. Prof. Sin-Min Lee
AprioriTid Algorithm: After the first pass, the database is not used at all for counting the support of candidate itemsets. The candidate itemsets are generated the same way as in the Apriori algorithm. A second set, C’, is generated in which each member holds the TID of a transaction and the candidate (potentially large) itemsets present in that transaction. This set is used to count the support of each candidate itemset. The advantage is that the number of entries in C’ may be smaller than the number of transactions in the database, especially in the later passes.
Apriori Algorithm: Candidate itemsets are generated using only the large itemsets of the previous pass, without considering the transactions in the database. The set of large itemsets from the previous pass is joined with itself to generate all itemsets whose size is larger by 1. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidates.
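The join and prune steps described above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function names `apriori_gen` and `apriori` are chosen here, the join is written as a plain set union (which yields the same candidates after pruning as a prefix-based join), and the minimum support of 2 is an assumption chosen so that the starred itemsets in the slide examples come out as large.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Join the large (k-1)-itemsets with themselves, then prune."""
    large_prev = set(large_prev)
    k = len(next(iter(large_prev))) + 1
    # Join step: keep unions of two large itemsets that have exactly k items.
    joined = {a | b for a in large_prev for b in large_prev if len(a | b) == k}
    # Prune step: delete any candidate with a (k-1)-subset that is not large.
    return {c for c in joined
            if all(frozenset(s) in large_prev for s in combinations(c, k - 1))}

def apriori(transactions, min_support):
    """Return one {itemset: support} dict per pass, for the large itemsets."""
    items = {i for t in transactions.values() for i in t}
    candidates = {frozenset([i]) for i in items}
    large_by_size = []
    while candidates:
        # Count support by scanning every transaction (this is plain Apriori).
        counts = {c: sum(c <= t for t in transactions.values())
                  for c in candidates}
        large = {c for c, n in counts.items() if n >= min_support}
        if not large:
            break
        large_by_size.append({c: counts[c] for c in large})
        candidates = apriori_gen(large)
    return large_by_size

# The four-transaction example database used in the slides.
database = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
result = apriori(database, min_support=2)  # min_support=2 is an assumption
for size, large in enumerate(result, start=1):
    print(size, sorted((sorted(c), n) for c, n in large.items()))
```

On this database the join of the large 2-itemsets produces {1 2 3}, {1 3 5} and {2 3 5}; pruning removes the first two because {1 2} and {1 5} are not large.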
Example
Database:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5
L1:
  Itemset   Support
  {1}       2
  {2}       3
  {3}       3
  {5}       3
C2:
  Itemset   Support
  {1 3}*    2
  {1 4}     1
  {3 4}     1
  {2 3}*    2
  {2 5}*    3
  {3 5}*    2
  {1 2}     1
  {1 5}     1
C3:
  Itemset   Support
  {1 3 4}   1
  {2 3 5}*  2
  {1 3 5}   1
Example
Database:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5
L1:
  Itemset   Support
  {1}       2
  {2}       3
  {3}       3
  {5}       3
C2:
  Itemset   TID
  {1 3}     100
  {1 4}
  {3 4}
  {2 3}     200
  {2 5}
  {3 5}
  {1 2}     300
  {1 5}     400
C3:
  Itemset   TID
  {1 3 4}   100
  {2 3 5}   200
  {1 3 5}   300
Example
Database:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5
L1:
  Itemset   Support
  {1}       2
  {2}       3
  {3}       3
  {5}       3
C2:
  Itemset   Support
  {1 2}     1
  {1 3}*    2
  {1 5}     1
  {2 3}*    2
  {2 5}*    3
  {3 5}*    2
C3:
  {1 2 3}, {1 3 5}, {2 3 5}
  Itemset   Support
  {2 3 5}*  2
Example
Database:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5
L1:
  Itemset   Support
  {1}       2
  {2}       3
  {3}       3
  {5}       3
C2:
  Itemset   Support
  {1 2}     1
  {1 3}*    2
  {1 5}     1
  {2 3}*    2
  {2 5}*    3
  {3 5}*    2
C’2:
  TID   Itemsets
  100   {1 3}
  200   {2 3}, {2 5}, {3 5}
  300   {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
  400   {2 5}
C3:
  Itemset   Support
  {2 3 5}*  2
C’3:
  TID   Itemsets
  200   {2 3 5}
  300   {2 3 5}
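The step from C’2 to C’3 in this example can be sketched as follows. This is an illustrative sketch under assumptions, not the published algorithm verbatim: the helper names `next_tid_sets` and `support_counts` are hypothetical, and a candidate is treated as present in a transaction when all of its (k-1)-subsets appear in that transaction's previous entry, which is a simplification of the two-subset check used in the original description.

```python
from itertools import combinations

def next_tid_sets(prev_tid_sets, candidates):
    """Build C'_k from C'_(k-1) without reading the raw transactions."""
    tid_sets = {}
    for tid, present in prev_tid_sets.items():
        found = set()
        for c in candidates:
            # c is in the transaction iff all its (k-1)-subsets were present.
            if all(frozenset(s) in present
                   for s in combinations(c, len(c) - 1)):
                found.add(c)
        if found:  # transactions with no candidates drop out of C'_k
            tid_sets[tid] = found
    return tid_sets

def support_counts(tid_sets):
    """Count the support of each candidate from the TID entries alone."""
    counts = {}
    for present in tid_sets.values():
        for c in present:
            counts[c] = counts.get(c, 0) + 1
    return counts

fs = frozenset
# C'2 as given in the example above.
c2_bar = {
    100: {fs({1, 3})},
    200: {fs({2, 3}), fs({2, 5}), fs({3, 5})},
    300: {fs({1, 2}), fs({1, 3}), fs({1, 5}),
          fs({2, 3}), fs({2, 5}), fs({3, 5})},
    400: {fs({2, 5})},
}
c3 = {fs({2, 3, 5})}          # candidate 3-itemsets
c3_bar = next_tid_sets(c2_bar, c3)
print(sorted(c3_bar))         # TIDs that still have an entry
print(support_counts(c3_bar))
```

Only TIDs 200 and 300 survive into C’3, which illustrates why C’ can become smaller than the database in later passes.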
No practicable methodology has been demonstrated for reliable prediction of large earthquakes on time scales of decades or less. Some scientists question whether such predictions will be possible even with much improved observations. Pessimism comes from repeated cycles in which public promises that reliable predictions are just around the corner are followed by equally public failures of specific prediction methodologies. Bad for science!
COMPLEX PLATE BOUNDARY ZONE IN SOUTHEAST ASIA: Northward motion of India deforms all of the region. Many small plates (microplates) and blocks. Molnar & Tapponier, 1977
Short-term prediction (forecast): Frequency and distribution pattern of foreshocks; deformation of the ground surface (tilting, elevation changes); emission of radon gas; seismic gap along faults; abnormal animal activities. [Image: Mission district, San Francisco Earthquake, 1906]
The powerful earthquake leveled Tangshan in an instant. Pictured: the ruins of downtown Tangshan after the quake.
Freeway Damage — 1994 CA Earthquake
Sand Boils after Loma Prieta Earthquake
California Earthquake Probabilities Map
Clustering: Group data into clusters so that objects are similar to one another within the same cluster and dissimilar to the objects in other clusters. Unsupervised learning: no predefined classes. [Figure labels: Cluster 1, Cluster 2, Outliers]
What Is A Good Clustering? High intra-class similarity and low inter-class similarity, depending on the similarity measure. The ability to discover some or all of the hidden patterns.
General Applications of Clustering: Pattern recognition. Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining. Image processing. Economic science (especially market research). WWW: document classification; cluster Weblog data to discover groups of similar access patterns.
Examples of Clustering Applications: Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Land use: identification of areas of similar land use in an earth observation database. Insurance: identifying groups of motor insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continental faults.
What Is Good Clustering? A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Data Structures in Clustering Data matrix (two modes) Dissimilarity matrix (one mode)
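The two structures can be related with a short Python sketch. It assumes the usual reading of the terms: a data matrix has one row per object and one column per variable (two modes), while a dissimilarity matrix is object-by-object (one mode). Euclidean distance and the function name `dissimilarity_matrix` are choices made here for illustration only.

```python
import math

def dissimilarity_matrix(data):
    """Turn an n-by-p data matrix into an n-by-n dissimilarity matrix."""
    n = len(data)
    # Entry [i][j] is the Euclidean distance between objects i and j.
    return [[math.dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Data matrix: 3 objects (rows) described by 2 variables (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = dissimilarity_matrix(data)
for row in d:
    print(row)
```

The result is symmetric with zeros on the diagonal, so only the lower triangle needs to be stored in practice.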
Measuring Similarity: Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric. There is a separate “quality” function that measures the “goodness” of a cluster. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. Weights should be associated with different variables based on applications and data semantics. It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Notion of a Cluster can be Ambiguous: How many clusters? [Figure panels: Two Clusters, Four Clusters, Six Clusters]
Hierarchical algorithms: Agglomerative: each object starts as its own cluster, and clusters are merged to form larger ones. Divisive: all objects start in one cluster, which is split up into smaller clusters.
Types of Clusters: Well-Separated. A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. [Figure: 3 well-separated clusters]
Types of Clusters: Center-Based. A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster. [Figure: 4 center-based clusters]
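The difference between the two kinds of center can be shown with a small Python sketch. The centroid follows the definition above; for the medoid, this sketch assumes one common formalization of “most representative”, namely the cluster member with the smallest total distance to all other members. The function names and the sample points are illustrative.

```python
import math

def centroid(points):
    """Average of all points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Member with the smallest total distance to the others (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# One far-away point pulls the centroid but not the medoid.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(cluster))  # need not coincide with any data point
print(medoid(cluster))    # always one of the data points
```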
Types of Clusters: Contiguity-Based. Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. [Figure: 8 contiguous clusters]
Types of Clusters: Density-Based. A cluster is a dense region of points, which is separated from other regions of high density by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present. [Figure: 6 density-based clusters]
Types of Clusters: Conceptual Clusters. Shared property or conceptual clusters: finds clusters that share some common property or represent a particular concept. [Figure: 2 overlapping circles]
Hierarchical Clustering [Figure panels: Traditional Hierarchical Clustering, Traditional Dendrogram, Non-traditional Hierarchical Clustering, Non-traditional Dendrogram]
Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
Starting Situation: Start with clusters of individual points and a proximity matrix. [Figure: points p1, p2, p3, p4, p5, ... and their proximity matrix]
Intermediate Situation: After some merging steps, we have some clusters. [Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
After Merging: The question is “How do we update the proximity matrix?” [Figure: proximity matrix over C1, C2 U C5, C3, C4, with the entries in the C2 U C5 row and column marked “?”]
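One possible answer to the update question, sketched in Python for the MIN and MAX rules only. The proximity values below are hypothetical, chosen so that C2 and C5 are the closest pair as in the slide; the function name `merge_closest` and the dictionary layout are assumptions made for this sketch. Proximity is treated as a distance, so “closest” means smallest value.

```python
def merge_closest(dist, linkage=min):
    """Merge the closest pair and rebuild the proximity entries.

    dist maps frozenset({a, b}) to the distance between clusters a and b.
    linkage=min gives the MIN (single link) update, linkage=max gives MAX.
    """
    pair = min(dist, key=dist.get)          # the two closest clusters
    a, b = sorted(pair)
    merged = f"{a}U{b}"
    others = {x for p in dist for x in p} - {a, b}
    # Entries that do not involve a or b are carried over unchanged.
    new_dist = {p: d for p, d in dist.items() if a not in p and b not in p}
    # Entries for the merged cluster are derived from the two old rows.
    for c in others:
        new_dist[frozenset({merged, c})] = linkage(
            dist[frozenset({a, c})], dist[frozenset({b, c})])
    return merged, new_dist

# Hypothetical distances between five clusters.
dist = {frozenset(k): v for k, v in {
    ("C1", "C2"): 4, ("C1", "C3"): 6, ("C1", "C4"): 7, ("C1", "C5"): 5,
    ("C2", "C3"): 3, ("C2", "C4"): 8, ("C2", "C5"): 1,
    ("C3", "C4"): 2, ("C3", "C5"): 9, ("C4", "C5"): 6,
}.items()}

merged, new_dist = merge_closest(dist, linkage=min)
print(merged)
for pair, d in sorted((sorted(p), d) for p, d in new_dist.items()):
    print(pair, d)
```

Group average would additionally need the cluster sizes, which this sketch does not track.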
How to Define Inter-Cluster Similarity: MIN; MAX; Group Average; Distance Between Centroids; other methods driven by an objective function. [Figure: points p1, p2, p3, p4, p5, ... and their proximity matrix]
[Figures: MIN, MAX, Group Average, and Distance Between Centroids, each illustrated on points p1 to p5 with the proximity matrix]
Cluster Similarity: MIN or Single Link. Similarity of two clusters is based on the two most similar (closest) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph. [Figure: points 1 to 5]
Hierarchical Clustering: MIN [Figure panels: Nested Clusters, Dendrogram, on points 1 to 6]
Cluster Similarity: MAX or Complete Linkage. Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters. [Figure: points 1 to 5]
Hierarchical Clustering: MAX [Figure panels: Nested Clusters, Dendrogram, on points 1 to 6]
Cluster Similarity: Group Average. Proximity of two clusters is the average of pairwise proximity between points in the two clusters. Need to use average connectivity for scalability since total proximity favors large clusters. [Figure: points 1 to 5]
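The three proximity definitions (MIN, MAX, group average) can be compared side by side in Python. This sketch measures proximity as Euclidean distance, so smaller values mean more similar; the function names and the two sample clusters are illustrative assumptions.

```python
import math
from itertools import product

def single_link(c1, c2):
    """MIN: distance between the two closest points of the two clusters."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    """MAX: distance between the two most distant points of the two clusters."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def group_average(c1, c2):
    """Average of all pairwise distances between the two clusters."""
    total = sum(math.dist(p, q) for p, q in product(c1, c2))
    return total / (len(c1) * len(c2))

c1 = [(0.0, 0.0), (0.0, 1.0)]
c2 = [(3.0, 0.0), (4.0, 0.0)]
print(single_link(c1, c2), group_average(c1, c2), complete_link(c1, c2))
```

Dividing by the number of pairs is what keeps the group average from growing simply because the clusters are large.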
Hierarchical Clustering: Group Average [Figure panels: Nested Clusters, Dendrogram, on points 1 to 6]
Hierarchical Clustering: Time and Space Requirements. O(N^2) space since it uses the proximity matrix, where N is the number of points. O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched. Complexity can be reduced to O(N^2 log(N)) time for some approaches.
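To make the cubic cost concrete, here is a deliberately naive agglomerative loop in Python. It is a sketch under assumptions rather than a reference implementation: it recomputes point distances at every step instead of keeping a proximity matrix, so it does not have the O(N^2) storage, but it still performs N - 1 merge steps that each examine O(N^2) point pairs. The function name `agglomerative` and the sample points are illustrative.

```python
import math

def agglomerative(points, linkage=min):
    """Naive agglomerative clustering; returns the list of merges in order.

    linkage=min is single link (MIN), linkage=max is complete link (MAX).
    """
    clusters = [[i] for i in range(len(points))]   # each point starts alone
    merges = []
    while len(clusters) > 1:                       # N - 1 merge steps
        best = None
        for i in range(len(clusters)):             # search all cluster pairs
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(points[p], points[q])
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]    # merge j into i
        del clusters[j]
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerative(points)
for left, right, d in merges:
    print(left, right, d)
```

The recorded merge order and distances are exactly what a dendrogram draws.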
Hierarchical Clustering: Problems and Limitations. Once a decision is made to combine two clusters, it cannot be undone. No objective function is directly minimized. Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different sized clusters and convex shapes; breaking large clusters.
MST: Divisive Hierarchical Clustering. Build the MST (Minimum Spanning Tree): Start with a tree that consists of any point. In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not. Add q to the tree and put an edge between p and q.
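The construction steps above can be sketched directly in Python. The function name `build_mst`, the choice of point 0 as the starting point, and Euclidean distance are assumptions of this sketch; the search for the closest pair is written naively for readability rather than speed.

```python
import math

def build_mst(points):
    """Grow a minimum spanning tree one point at a time; returns (p, q, distance) edges."""
    in_tree = {0}                      # start with a tree of one arbitrary point
    edges = []
    while len(in_tree) < len(points):
        # Closest pair (p, q) with p inside the current tree and q outside it.
        d, p, q = min((math.dist(points[p], points[q]), p, q)
                      for p in in_tree
                      for q in range(len(points)) if q not in in_tree)
        edges.append((p, q, d))        # put an edge between p and q
        in_tree.add(q)                 # add q to the tree
    return edges

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
mst = build_mst(points)
for p, q, d in mst:
    print(p, q, d)
```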
MST: Divisive Hierarchical Clustering. Use the MST to construct a hierarchy of clusters.
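The slide does not spell out how the hierarchy is derived from the tree. A common approach, assumed here, is to split clusters by removing the longest remaining MST edge, so that cutting k - 1 edges yields k clusters. The Python sketch below follows that assumption; the function name `mst_clusters` and the sample edge list (in the form (p, q, distance)) are hypothetical.

```python
def mst_clusters(n_points, mst_edges, n_clusters):
    """Cut the (n_clusters - 1) longest MST edges and return the components."""
    by_length = sorted(mst_edges, key=lambda e: e[2])
    keep = by_length[:len(mst_edges) - (n_clusters - 1)]

    parent = list(range(n_points))     # union-find over point indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q, _ in keep:               # points joined by a kept edge share a cluster
        parent[find(p)] = find(q)

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical MST over four points: two tight pairs joined by one long edge.
mst_edges = [(0, 1, 1.0), (1, 2, 4.0), (2, 3, 1.0)]
for k in range(1, 5):                  # successive levels of the hierarchy
    print(k, mst_clusters(4, mst_edges, k))
```

Reading the output from k = 1 upward gives the divisive hierarchy: one cluster, then the split across the longest edge, and so on down to single points.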