Clustering. CS 685: Special Topics in Data Mining. Jinze Liu.

Outline
- What is clustering?
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based clustering methods
- Outlier analysis

Hierarchical Clustering
Group data objects into a tree of clusters.
[Figure: the same five objects a-e clustered step by step (Step 0 through Step 4); agglomerative clustering (AGNES) merges bottom-up, while divisive clustering (DIANA) splits top-down in the reverse direction.]

AGNES (Agglomerative Nesting)
- Initially, each object is its own cluster.
- Clusters are merged step by step until all objects form a single cluster.
- Single-link approach: each cluster is represented by all of its objects, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.
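
To make the merge rule concrete, here is a minimal from-scratch sketch of single-link AGNES, assuming points given as tuples (the function names are mine, and the naive all-pairs search is far slower than real implementations):

```python
# Minimal single-link AGNES sketch: every point starts as its own
# cluster; the pair of clusters whose closest points are nearest is
# merged, until one cluster remains.
import math

def single_link(a, b):
    """Distance between the closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def agnes(points):
    clusters = [[p] for p in points]      # each object is a cluster
    merges = []                           # merge sequence = the dendrogram
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] += clusters.pop(j)    # merge cluster j into cluster i
    return merges

print(agnes([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
```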

Dendrogram
- Shows how clusters are merged hierarchically.
- Decomposes the data objects into a multi-level nested partitioning (a tree of clusters).
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
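
A minimal sketch using SciPy (assumed available) that builds a single-link dendrogram and cuts it at a chosen distance to obtain a flat clustering; the cut level 3.0 is an arbitrary illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
Z = linkage(points, method="single")               # single-link merge sequence
labels = fcluster(Z, t=3.0, criterion="distance")  # cut the dendrogram at 3.0
print(labels)                                      # cluster id per point
dendrogram(Z)                                      # draw the tree (needs matplotlib)
```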

DIANA (DIvisive ANAlysis)
- Initially, all objects are in one cluster.
- Clusters are split step by step until each cluster contains only one object.

Distance Measures
Let $m_i$ be the mean of cluster $C_i$ and $n_i$ the number of objects in $C_i$.
- Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} |p - q|$
- Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} |p - q|$
- Mean distance: $d_{mean}(C_i, C_j) = |m_i - m_j|$
- Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} |p - q|$
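
A minimal sketch of the four measures for clusters given as lists of point tuples (function names are mine):

```python
import math
from itertools import product

def d_min(Ci, Cj):   # minimum distance: closest cross-cluster pair
    return min(math.dist(p, q) for p, q in product(Ci, Cj))

def d_max(Ci, Cj):   # maximum distance: farthest cross-cluster pair
    return max(math.dist(p, q) for p, q in product(Ci, Cj))

def mean(C):         # cluster mean m
    return tuple(sum(xs) / len(C) for xs in zip(*C))

def d_mean(Ci, Cj):  # mean distance: distance between cluster means
    return math.dist(mean(Ci), mean(Cj))

def d_avg(Ci, Cj):   # average distance over all cross-cluster pairs
    return (sum(math.dist(p, q) for p, q in product(Ci, Cj))
            / (len(Ci) * len(Cj)))
```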

Challenges of Hierarchical Clustering Methods
- Hard to choose merge/split points, and merging/splitting is never undone, so these decisions are critical.
- Do not scale well: at least $O(n^2)$. What is the bottleneck when the data cannot fit in memory?
- Remedy: integrate hierarchical clustering with other techniques, e.g., BIRCH, CURE, CHAMELEON, ROCK.

BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies.
- CF (Clustering Feature) tree: a hierarchical data structure summarizing object information.
- Clustering the objects reduces to clustering the leaf nodes of the CF tree.

Clustering Feature Vector
Clustering feature: CF = (N, LS, SS), where
- N: number of data points
- LS: $\sum_{i=1}^{N} X_i$ (linear sum)
- SS: $\sum_{i=1}^{N} X_i^2$ (square sum)
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190)).
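
A minimal sketch that reproduces this slide's CF and shows the key property that CFs are additive, so merging two subclusters is component-wise addition:

```python
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]

N = len(points)
LS = tuple(sum(p[d] for p in points) for d in range(2))       # linear sum
SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))  # square sum
print((N, LS, SS))   # (5, (16, 30), (54, 190)), as on the slide

def merge_cf(cf1, cf2):
    """CF additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
    return (cf1[0] + cf2[0],
            tuple(a + b for a, b in zip(cf1[1], cf2[1])),
            tuple(a + b for a, b in zip(cf1[2], cf2[2])))
```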

CF-tree in BIRCH
- Clustering feature: summarizes the statistics of a subcluster (its 0th, 1st, and 2nd moments), registering the crucial measurements for computing clusters while using storage efficiently.
- A CF tree is a height-balanced tree storing the clustering features for a hierarchical clustering: each nonleaf node has children and stores the sums of the CFs of those children.

CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and nonleaf nodes hold CF entries (CF1, CF2, ...), each with a pointer to a child node; leaf nodes hold CF entries and are chained together by prev/next pointers.]

Parameters of a CF-tree
- Branching factor: the maximum number of children per node.
- Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes.

CF Tree Insertion
- Identify the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric.
- Modify the leaf: test whether the leaf can absorb the new point without violating the threshold; if there is no room, split the node.
- Modify the path: update the CF information up the path.
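
A simplified leaf-level sketch of these steps (helper names are mine; the real algorithm also propagates CF updates and splits along the path, and picks the two farthest entries as split seeds rather than sorting). Note that the diameter of a subcluster is computable from its CF alone:

```python
import math

def merge_cf(a, b):  # CF additivity, as on the Clustering Feature slide
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)

def diameter(cf):
    # Average pairwise distance of a subcluster, from its CF alone:
    # D = sqrt((2*N*sum(SS) - 2*|LS|^2) / (N*(N-1)))
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt(max(0.0, (2 * n * sum(ss) - 2 * sum(x * x for x in ls))
                              / (n * (n - 1))))

def insert_point(leaf, point, threshold, max_entries):
    """Absorb the point into the closest leaf entry if the threshold
    allows; otherwise add a new entry, splitting the leaf on overflow."""
    cf_pt = (1, point, tuple(x * x for x in point))
    if leaf:
        k = min(range(len(leaf)),
                key=lambda i: math.dist(centroid(leaf[i]), point))
        merged = merge_cf(leaf[k], cf_pt)
        if diameter(merged) <= threshold:   # absorb without violating T
            leaf[k] = merged
            return leaf, None
    leaf.append(cf_pt)                      # start a new subcluster entry
    if len(leaf) > max_entries:             # no room: split the leaf
        leaf.sort(key=centroid)             # simplistic split heuristic
        half = len(leaf) // 2
        return leaf[:half], leaf[half:]
    return leaf, None
```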

Example of the BIRCH Algorithm
[Figure: a new subcluster sc4 arrives; the tree's root points to leaf nodes LN1, LN2, LN3, which hold the subclusters sc1-sc8.]

Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, inserting sc4 overflows LN1, so LN1 is split.
[Figure: LN1 is split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, LN3.]

Merge Operation in BIRCH
If the branching factor of a non-leaf node cannot exceed 3 either, the root itself overflows; the root is split and the height of the CF tree increases by one.
[Figure: a new root with nonleaf children NLN1 and NLN2; NLN1 points to LN1' and LN1'', NLN2 points to LN2 and LN3.]

Merge Operation in BIRCH
Assume that the subclusters are numbered according to their order of formation.
[Figure: a root with leaf nodes LN1 and LN2 holding subclusters sc1-sc6.]

Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN2 is split.
[Figure: LN2 is split into LN2' and LN2''; the root now points to LN1, LN2', LN2''.]

Merge Operation in BIRCH
LN2' and LN1 will be merged, and the newly formed node will be split immediately.
[Figure: the merged node is split into LN3' and LN3''; the root now points to LN3', LN3'', LN2''.]

Birch Clustering Algorithm (1)
- Phase 1: scan all data and build an initial in-memory CF tree.
- Phase 2: condense the tree to a desirable size by building a smaller CF tree.
- Phase 3: global clustering.
- Phase 4: cluster refining; this phase is optional and requires more passes over the data to refine the results.
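
For comparison, a minimal sketch of the same pipeline using scikit-learn's Birch estimator (assuming scikit-learn is installed; the synthetic data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (3, 3), (0, 4)]])

# threshold bounds leaf subcluster diameter (Phase 1); n_clusters=3
# triggers a global clustering of the leaf entries (Phase 3).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
```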

Birch Clustering Algorithm (2)
[Figure: overview flowchart of the four BIRCH phases.]

Birch – Phase 1
- Start with an initial threshold and insert points into the tree.
- If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree before continuing with the remaining data points.
- A good initial threshold is important but hard to figure out.
- Outlier removal: outliers can be removed while rebuilding the tree.

Birch – Phase 2
- Optional. The global clustering algorithm used in Phase 3 has an input size range in which it performs well, so Phase 2 condenses the CF tree to a size suited to Phase 3.
- BIRCH then applies a (selected) clustering algorithm to cluster the leaf entries of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

Birch – Phase 3
- Problems remaining after Phase 1: the input order affects the results, and splitting is triggered by node size rather than by natural cluster structure.
- Phase 3: cluster all leaf entries by their CF values using an existing algorithm; the algorithm used here is agglomerative hierarchical clustering.

Birch – Phase 4
- Optional.
- Do additional passes over the dataset and reassign each data point to the closest centroid from Phase 3, recalculating the centroids and redistributing the items.
- Always converges, no matter how many times Phase 4 is repeated.
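
A minimal sketch of one such refinement pass, assuming the centroids come from Phase 3 (names are mine):

```python
import math

def refine(points, centroids):
    """Reassign each point to its closest centroid, then recompute
    the centroids; repeating this pass converges."""
    labels = [min(range(len(centroids)),
                  key=lambda k: math.dist(p, centroids[k]))
              for p in points]
    for k in range(len(centroids)):
        members = [p for p, l in zip(points, labels) if l == k]
        if members:   # keep a centroid unchanged if it lost all points
            centroids[k] = tuple(sum(xs) / len(members)
                                 for xs in zip(*members))
    return labels, centroids
```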

Conclusions (1)
- BIRCH performs faster than existing algorithms (CLARANS and k-means) on large datasets.
- Scans the whole data only once.
- Handles outliers better.
- Superior to other algorithms in stability and scalability.

Conclusions (2)
- Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user would consider a natural cluster.
- Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

BIRCH Clustering
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.

Pros & Cons of BIRCH
- Linear scalability: good clustering with a single scan, and quality can be further improved by a few additional scans.
- Can handle only numeric data.
- Sensitive to the order of the data records.

Drawbacks of Square-Error-Based Methods
- One representative per cluster: good only for convex-shaped clusters of similar size and density.
- The number of clusters k is a parameter: good only if k can be reasonably estimated.

Drawbacks of Distance-Based Methods
- Hard to find clusters with irregular shapes.
- Hard to specify the number of clusters.
- Heuristic: a cluster must be dense.

Directly Density-Reachable
Parameters:
- Eps: maximum radius of the neighborhood.
- MinPts: minimum number of points in an Eps-neighborhood of a point.
Definitions:
- $N_{Eps}(p) = \{q \mid dist(p, q) \le Eps\}$
- Core object p: $|N_{Eps}(p)| \ge MinPts$
- Point q is directly density-reachable from p iff $q \in N_{Eps}(p)$ and p is a core object.
[Figure: points p and q with MinPts = 3, Eps = 1 cm.]

Density-Based Clustering: Background (II)
- Density-reachable: $p_n$ is density-reachable from $p_1$ if there is a chain of points $p_1, p_2, \ldots, p_n$ such that each $p_{i+1}$ is directly density-reachable from $p_i$.
- Density-connected: points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.
[Figure: a chain from p1 to q illustrating density-reachability, and a point o from which both p and q are density-reachable.]

DBSCAN
- A cluster is a maximal set of density-connected points.
- Discovers clusters of arbitrary shape in spatial databases with noise.
[Figure: core, border, and outlier points with Eps = 1 cm and MinPts = 5.]

DBSCAN: the Algorithm
- Arbitrarily select a point p.
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed (a from-scratch sketch follows below).
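
A minimal from-scratch sketch of this loop (naive $O(n^2)$ neighborhood queries; production implementations use a spatial index such as an R*-tree):

```python
import math

NOISE, UNVISITED = -1, 0

def region_query(points, p, eps):
    """Indices of all points in the Eps-neighborhood of point p."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] != UNVISITED:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:       # border point or noise (for now)
            labels[p] = NOISE
            continue
        cluster_id += 1                    # p is a core point: new cluster
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:                       # grow a maximal set of
            q = seeds.pop()                # density-connected points
            if labels[q] == NOISE:         # noise reachable from a core
                labels[q] = cluster_id     # point becomes a border point
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:    # expand only from core points
                seeds.extend(q_neighbors)
    return labels
```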

Problems of DBSCAN
- Different clusters may have very different densities.
- Clusters may form hierarchies.