Clustering – Part II. COSC 526, Class 13. Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge. Ph: 865-576-7266. Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), and Tan, Steinbach, and Kumar.

2 Last Class
K-means clustering + hierarchical clustering
–Basic algorithms
–Modifications for streaming data
More today:
–Modifications for big data!
–DBSCAN algorithm
–Measures for validating clusters…

3 Project Schedule Updates…
No.  Milestone                        Due date
1    Selection of Topics              1/27/2015
2    Project Description & Approach   3/3/2015
3    Initial Project Report           3/31/2015
4    Project Demonstrations           4/16/2015 – 4/19/2015
5    Project Report                   4/23/2015
6    Posters                          4/23/2015
Reports 2 and 3 will be 1-2 pages at most; the final report will be 5-6 pages, typically in NIPS format. More to come on posters.

4 Assignment 1 updates
Dr. Sukumar will be here on Tue (Mar 3) to give you an overview of the Python tools.
We will keep the deadline for the assignment at Mar 10.
Assignment 2 will go out on Mar 10.

5 Density-based Spatial Clustering of Applications with Noise (DBSCAN)

6 Preliminaries
Density is defined as the number of points within a radius ε
–In the figure's example, density = 9
A core point has more than a specified number of points (minPts) within radius ε
–Core points lie in the interior of a cluster
A border point has fewer than minPts within ε but lies in the neighborhood of a core point
A noise point is any point that is neither a core point nor a border point
(Figure: example with minPts = 4, showing a core point, a border point, and a noise point within radius ε.)

7 DBSCAN Algorithm
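The deck shows the DBSCAN pseudocode as a figure; as a rough stand-in, here is a minimal, unoptimized Python sketch of the algorithm (function and variable names such as region_query are my own, not from the slides). It labels every point with a cluster id, or -1 for noise.

    import numpy as np

    def dbscan(X, eps, min_pts):
        """Minimal DBSCAN sketch: returns an array of cluster labels (-1 = noise)."""
        n = len(X)
        labels = np.full(n, -1)            # -1 means noise / not yet assigned
        visited = np.zeros(n, dtype=bool)
        cluster_id = 0

        def region_query(i):
            # Indices of all points within radius eps of point i (including i itself)
            return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

        for i in range(n):
            if visited[i]:
                continue
            visited[i] = True
            neighbors = list(region_query(i))
            if len(neighbors) < min_pts:
                continue                   # not a core point; stays noise unless claimed later
            labels[i] = cluster_id         # i is a core point: grow a new cluster from it
            queue = neighbors
            while queue:
                j = queue.pop()
                if labels[j] == -1:        # border or formerly-noise point joins the cluster
                    labels[j] = cluster_id
                if not visited[j]:
                    visited[j] = True
                    j_neighbors = region_query(j)
                    if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                        queue.extend(j_neighbors)
            cluster_id += 1
        return labels

In practice one would usually call sklearn.cluster.DBSCAN(eps=..., min_samples=...) rather than hand-rolling this loop.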

8 Illustration of DBSCAN: Assignment of Core, Border and Noise Points

9 DBSCAN: Finding Clusters

10 Advantages and Limitations
Resistant to noise
Can handle clusters of different sizes and shapes
Eps and MinPts depend on each other
–Can be difficult to specify
Clusters of different densities within the same dataset can be difficult to find

11 Advantages and Limitations (continued)
Limitations illustrated by the slide's figures: data with varying density, and high-dimensional data

12 How to determine Eps and MinPts
For points within a cluster, the k-th nearest neighbors are at roughly the same distance
Noise points are generally farther from their k-th nearest neighbor
Sort the distance of every point to its k-th nearest neighbor and plot it; the knee of the curve suggests a value for Eps
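One way to read this heuristic as code: compute each point's distance to its k-th nearest neighbor, sort those distances, and look for the knee. A minimal sketch assuming numpy and matplotlib are available (the brute-force distance computation is only meant for modest dataset sizes):

    import numpy as np
    import matplotlib.pyplot as plt

    def k_distance_plot(X, k):
        """Plot sorted k-th nearest-neighbor distances; the 'knee' suggests a value for Eps."""
        diffs = X[:, None, :] - X[None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)   # pairwise Euclidean distances
        dists.sort(axis=1)
        kth = dists[:, k]                        # column 0 is the distance to the point itself (0)
        plt.plot(np.sort(kth))
        plt.xlabel("points sorted by k-th nearest-neighbor distance")
        plt.ylabel(f"distance to {k}-th nearest neighbor")
        plt.show()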

13 Modifying clustering algorithms to work with large-datasets…

14 Last class: MapReduce approach for k-means…
In the map step:
–Read the cluster centers into memory from a SequenceFile
–For each input key/value pair, iterate over the cluster centers, measure the distances, and keep the nearest center (the one with the lowest distance to the vector)
–Write the cluster center with its vector to the filesystem
In the reduce step (we get the associated vectors for each center):
–Iterate over the value vectors and calculate the average vector (sum the vectors and divide each component by the number of vectors received). This is the new center; save it into a SequenceFile
–Check convergence between the cluster center stored in the key object and the new center; if they are not equal, increment an update counter
What do we do if we do not have sufficient memory, i.e., we cannot hold all of the data? Practical scenario: we cannot hold a million data points at a time, each of them high dimensional.
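A framework-free sketch of the map and reduce steps described above, with plain Python functions standing in for the actual Hadoop/SequenceFile plumbing (the function names and the tol convergence tolerance are my own, not from the slides):

    import numpy as np

    def kmeans_map(point, centers):
        """Map step: emit (index of the nearest center, point vector)."""
        dists = [np.linalg.norm(point - c) for c in centers]
        nearest = int(np.argmin(dists))
        return nearest, point

    def kmeans_reduce(center_idx, points, old_centers, tol=1e-6):
        """Reduce step: average the vectors assigned to one center and flag convergence."""
        new_center = np.mean(points, axis=0)      # sum the vectors, divide by the count
        converged = np.linalg.norm(new_center - old_centers[center_idx]) < tol
        return new_center, converged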

15 Bradley-Fayyad-Reina (BFR) Algorithm
BFR is a variant of k-means designed to work with very large, disk-resident data
Assumes clusters are normally distributed around a centroid in a Euclidean space
–Standard deviation in different dimensions may vary
–Clusters are axis-aligned ellipses
Efficient way to summarize clusters
–We want the memory required to be O(number of clusters) instead of O(data)

16 How does BFR work?
Rather than keeping the data, BFR maintains only summary information:
–Cluster summaries
–Outliers
–Points still to be clustered
This way we store only statistics whose size depends on the number of clusters, not on the number of points

17 Details of the BFR Algorithm
1. Initialize k clusters
2. Load a bag of points from disk
3. Assign new points to one of the k original clusters if they are within some distance threshold of the cluster
4. Cluster the remaining points, creating new clusters
5. Try to merge the new clusters from step 4 with any of the existing clusters
6. Repeat steps 2-5 until all points have been examined

18 Details, details, details of BFR
Points are read from disk one main-memory-full at a time
Most points from previous memory loads are summarized by simple statistics
From the initial load, we select the k centroids by one of:
–Taking k random points
–Taking a small random sample and clustering it optimally
–Taking a sample; picking a random point, then k-1 more points, each as far as possible from the previously selected points

19 Three classes of points…
We keep track of three sets of points:
Discard Set (DS):
–Points close enough to a centroid to be summarized
Compression Set (CS):
–Groups of points that are close together but not close to any existing centroid
–These points are summarized, but not assigned to a cluster
Retained Set (RS):
–Isolated points waiting to be assigned to a compression set

20 Using the Galaxies Picture to learn BFR
(Figure: a centroid surrounded by its discard set, with compressed sets of nearby points and retained points farther away.)
Discard set (DS): close enough to a centroid to be summarized
Compression set (CS): summarized, but not assigned to a cluster
Retained set (RS): isolated points

21 Summarizing the sets of points
For each cluster, the discard set (DS) is summarized by:
–The number of points, N
–The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension
–The vector SUMSQ, whose i-th component is the sum of squares of the coordinates in the i-th dimension

22 More details on summarizing the points
2d+1 values represent any cluster, where d is the number of dimensions
The average in each dimension (the centroid) is SUM_i / N, where SUM_i is the i-th component of SUM
The variance of a cluster's discard set in dimension i is SUMSQ_i / N - (SUM_i / N)^2
–The standard deviation is the square root of that
Note: dropping the "axis-aligned clusters" assumption would require storing a full covariance matrix to summarize the cluster. Instead of SUMSQ being a d-dimensional vector, it would be a d x d matrix, which is too big!
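The (N, SUM, SUMSQ) bookkeeping is easy to express in code. A minimal sketch under the slide's axis-aligned assumption (the class and method names are mine, not part of the BFR paper or the slides):

    import numpy as np

    class ClusterSummary:
        """BFR-style summary of a discard set: N, SUM, SUMSQ (2d + 1 numbers for d dimensions)."""
        def __init__(self, d):
            self.n = 0
            self.sum = np.zeros(d)
            self.sumsq = np.zeros(d)

        def add(self, x):
            x = np.asarray(x, dtype=float)
            self.n += 1
            self.sum += x
            self.sumsq += x * x

        def centroid(self):
            return self.sum / self.n

        def variance(self):
            # Per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2
            return self.sumsq / self.n - (self.sum / self.n) ** 2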

23 The "Memory Load" of Points and Clustering
Step 3: Find those points that are "sufficiently close" to a cluster centroid; add them to that cluster and to the DS
–These points are so close to the centroid that they can be summarized and then discarded
Step 4: Use any in-memory clustering algorithm to cluster the remaining points and the old RS

24 More on the memory load of points
DS: adjust the statistics of the clusters to account for the new points
–Add the Ns, SUMs, and SUMSQs
Consider merging compressed sets in the CS
If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

25 A few more details…
How do we decide whether a new point is "close enough" to a cluster to add it to that cluster (and then discard it)?
BFR suggests two ways:
–The Mahalanobis distance to the currently nearest centroid
–A high likelihood of the point belonging to the currently nearest centroid

26 Second way to define a "close" point
Normalized Euclidean distance from the centroid
For a point (x_1, x_2, …, x_d) and centroid (c_1, c_2, …, c_d):
–Normalize in each dimension: y_i = (x_i - c_i) / σ_i, where σ_i is the cluster's standard deviation in dimension i
–Take the sum of squares of the y_i
–Take the square root

27 More on Mahalanobis distance
If clusters are normally distributed in d dimensions, then after this transformation one standard deviation corresponds to a distance of sqrt(d)
–i.e., about 68% of the points lie within a Mahalanobis distance of sqrt(d) of the centroid
Accept a point for a cluster if its Mahalanobis distance is below some threshold, e.g., 2 standard deviations
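A small sketch of the acceptance test under the diagonal-covariance (axis-aligned) assumption. Interpreting the slide's "2 standard deviations" as a threshold of 2 * sqrt(d) after normalization is my reading, not something the slides state explicitly:

    import numpy as np

    def mahalanobis(x, centroid, std):
        """Diagonal-covariance Mahalanobis distance: sqrt(sum_i ((x_i - c_i) / sigma_i)^2)."""
        y = (np.asarray(x) - centroid) / std
        return np.sqrt(np.sum(y * y))

    def close_enough(x, centroid, std, n_std=2.0):
        """Accept x into the cluster if it is within n_std 'standard deviations',
        i.e. within n_std * sqrt(d), since a typical cluster point sits at distance ~sqrt(d)."""
        d = len(x)
        return mahalanobis(x, centroid, std) < n_std * np.sqrt(d)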

28 Should two CS clusters be combined?
Compute the variance of the combined subcluster
–N, SUM, and SUMSQ allow us to make that calculation quickly
Combine the two if the combined variance is below some threshold
Many alternatives: treat dimensions differently, consider density
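Because the summaries are additive, the combined variance can be computed directly from the two (N, SUM, SUMSQ) triples. A rough sketch of the merge test (the variance threshold max_var is a free parameter, not specified by the slides):

    import numpy as np

    def should_merge(n1, sum1, sumsq1, n2, sum2, sumsq2, max_var):
        """Combine two (N, SUM, SUMSQ) summaries and test the per-dimension variance."""
        n, s, sq = n1 + n2, sum1 + sum2, sumsq1 + sumsq2
        combined_var = sq / n - (s / n) ** 2
        return np.all(combined_var < max_var)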

29 Limitations of BFR…
Makes strong assumptions about the data:
–Clusters are normally distributed
–Does not work with non-linearly separable data
–Works only with axis-aligned clusters
Real-world datasets are rarely this well behaved

30 Clustering with Representatives (CURE)

31 CURE: an efficient clustering approach
A robust clustering approach that handles outliers better
Employs a hierarchical clustering approach:
–A middle ground between the centroid-based and all-points (MAX) extremes
Can handle different types of data

32 CURE Algorithm
CURE(points, k)
1. CURE is similar to a hierarchical clustering approach, but it uses a set of well-scattered sample points as the cluster representative rather than every point in the cluster.
2. First set a target sample size c; then select c well-scattered sample points from the cluster.
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.

33 CURE clustering procedure
4. These shrunken scattered points serve as the representatives of the cluster and are the points used in the minimum-distance (d_min) cluster-merging approach.
5. After each merge, c sample points are selected from the original representatives of the previous clusters to represent the new cluster.
6. Cluster merging stops when the target number of clusters, k, is reached.
(Figure: repeated nearest-cluster merges.)
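A hedged sketch of the per-cluster step described in the two slides above: pick c well-scattered points and shrink them toward the centroid by a fraction alpha. Greedy farthest-point selection is one common way to obtain well-scattered points; the slides do not pin down the exact selection procedure, so treat this as illustrative:

    import numpy as np

    def cure_representatives(points, c=10, alpha=0.2):
        """Select c well-scattered points from a cluster and shrink them toward the centroid."""
        centroid = points.mean(axis=0)
        # Greedy farthest-point selection: start from the point farthest from the centroid
        reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
        while len(reps) < min(c, len(points)):
            dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
            reps.append(points[np.argmax(dists)])   # farthest from all chosen reps so far
        reps = np.array(reps)
        return reps + alpha * (centroid - reps)     # shrink toward the centroid by fraction alpha

Merging then proceeds hierarchically, using the minimum distance between the representative sets of two clusters (d_min) as the merge criterion.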

34 Other tweaks in the CURE algorithm
Use random sampling:
–The data cannot all be stored in memory at once
–Similar in spirit to the core-set idea
Partitioning and two-pass clustering (reduces compute time):
–First, divide the n data points into p partitions, each containing n/p points
–Pre-cluster each partition until n/(pq) clusters remain in that partition, for some q > 1
–The clusters from this first pass are then used as input to a second clustering pass that forms the final clusters

35 Space and Time Complexity
Worst-case time complexity: O(n² log n)
Space complexity: O(n), using a k-d tree for insertion/updating

36 Example of how it works

37 How to validate clustering approaches?

38 Cluster validity
For supervised learning:
–We have a class label, so we can measure how good our training and testing errors are
–Metrics: accuracy, precision, recall
For clustering:
–How do we measure the "goodness" of the resulting clusters?

39 Clustering random data (overfitting) If you ask a clustering algorithm to find clusters, it will find some

40 Different aspects of validating clusters
Determining the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
External validation: comparing the results of a cluster analysis to externally known class labels (ground truth)
Internal validation: evaluating how well the results of a cluster analysis fit the data without reference to external information
Comparing clusterings to determine which is better
Determining the "correct" number of clusters

41 Measures of cluster validity
External index: measures the extent to which cluster labels match externally supplied class labels
–Entropy, purity, Rand index
Internal index: measures the goodness of a clustering structure without reference to external information
–Sum of squared error (SSE), silhouette coefficient
Relative index: compares two different clusterings or clusters
–Often an external or internal index is used for this purpose, e.g., SSE or entropy

42 Measuring cluster validity with correlation
Proximity matrix vs. incidence matrix:
–The incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, 0 otherwise
Compute the correlation between the two matrices:
–Only n(n-1)/2 entries need to be computed, since the matrices are symmetric
–High correlation (in magnitude) indicates that points in the same cluster are close to each other
Not well suited to density-based clusterings
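A minimal sketch of the correlation check, using the Euclidean distance matrix as the proximity matrix. With distances the desirable outcome is a strongly negative correlation, since same-cluster pairs should have small distances:

    import numpy as np

    def cluster_correlation(X, labels):
        """Correlation between the incidence matrix (1 = same cluster) and the distance matrix,
        computed over the n(n-1)/2 distinct pairs."""
        labels = np.asarray(labels)
        n = len(X)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        incidence = (labels[:, None] == labels[None, :]).astype(float)
        iu = np.triu_indices(n, k=1)              # the n(n-1)/2 off-diagonal pairs
        return np.corrcoef(dist[iu], incidence[iu])[0, 1]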

43 Another approach: use similarity matrix for cluster validation
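The figure for this slide is not reproduced in the transcript. One common reading of the approach (as in Tan, Steinbach, and Kumar) is to reorder the similarity or distance matrix by cluster label; well-separated clusters then show up as blocks along the diagonal. A small plotting sketch, assuming numpy and matplotlib:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_sorted_distance_matrix(X, labels):
        """Show the pairwise distance matrix with rows/columns grouped by cluster label."""
        order = np.argsort(np.asarray(labels))
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        plt.imshow(D[np.ix_(order, order)], cmap="viridis")
        plt.colorbar(label="distance")
        plt.title("Distance matrix sorted by cluster label")
        plt.show()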

44 Internal measures: SSE
SSE is also a good measure of how good a clustering is
–Lower SSE indicates a better (tighter) clustering
Can be used to estimate the number of clusters
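A short sketch of using SSE to pick the number of clusters with scikit-learn's KMeans, whose inertia_ attribute is exactly the within-cluster SSE; the "elbow" in the curve suggests a reasonable k:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    def elbow_plot(X, k_max=10):
        """Plot SSE (inertia) versus k; a pronounced elbow suggests a reasonable number of clusters."""
        sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in range(1, k_max + 1)]
        plt.plot(range(1, k_max + 1), sse, marker="o")
        plt.xlabel("number of clusters k")
        plt.ylabel("SSE (within-cluster sum of squares)")
        plt.show()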

45 More on clustering a little later…
We will discuss other forms of clustering in the following classes
Next class:
–Please bring your brief write-up on the two papers
–We will discuss frequent itemset mining and a few other aspects of clustering
–Then move on to dimensionality reduction