Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.

Input Parameters of DBSCAN
1. The neighborhood radius, r.
2. The minimum number of neighbors, k, required for a point to be a core point, the seed for cluster expansion.
The clustering quality depends heavily on the input parameters, especially r.
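The sensitivity to r can be seen directly in the core-point test. Below is a minimal sketch (the function names and the toy dataset are illustrative, not from the paper) showing how the same point flips between core and non-core when only r changes:

```python
import math

def neighbors(points, p, r):
    """Indices of the points within distance r of points[p], excluding p itself."""
    px, py = points[p]
    return [i for i, (x, y) in enumerate(points)
            if i != p and math.hypot(x - px, y - py) <= r]

def is_core(points, p, r, k):
    """A point is a core point (a seed for expansion) if it has at least k
    neighbors inside its radius-r neighborhood."""
    return len(neighbors(points, p, r)) >= k

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(is_core(pts, 0, 1.5, 3))   # True: three neighbors fall within r = 1.5
print(is_core(pts, 0, 1.0, 3))   # False: only two neighbors within r = 1.0
```

The same point, the same k, yet shrinking r from 1.5 to 1.0 demotes it from core to non-core, which is exactly why the choice of r dominates clustering quality.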

Selection of Parameters Matters:

Content
Introduction
Related Work
Parameter Reduction for Density-based Clustering
– Experiments and Observations
– Determination of the neighborhood radii
– Nonparametric Density-based Clustering
Performance Analysis
Conclusion

Introduction
Density-based clustering is widely used in spatial applications such as geographical information analysis, medical applications, sky observation data, and satellite image analysis.
In density-based clustering, clusters are dense areas of points in the data space, separated by areas of low density (noise).
One problem with density-based clustering is that users typically have minimal domain knowledge with which to determine the input parameters.

Our Approach to Solve the Problem
We explore an automatic approach to determine the minimum neighborhood radii based on the data distribution of the dataset.
The algorithm MINR is developed to determine the minimum neighborhood radii for clusters of different densities, based on many experiments and observations.
We combine MINR with the enhanced DBSCAN, e-DBSCAN, into a nonparametric density-based clustering algorithm (NPDBC).

Problem with Input Parameter
In density-based clustering, the main input is the minimum neighborhood radius.
A dataset may consist of clusters with the same density or with different densities; the figure shows some possible distributions of a dataset X.
When clusters have different densities, it is more difficult to determine the minimum neighborhood radii.

Related Work
– Clustering methods
– Attempts to reduce parameters
– Enhanced DBSCAN clustering

Clustering Methods
There are two main families of clustering methods:
– similarity-based partitioning methods
– density-based clustering methods
A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with partitioning methods are:
– k has to be predetermined;
– it is difficult to identify clusters of different sizes;
– they only find convex clusters.

Clustering Methods (Cont.)
Density-based clustering methods are used to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [1]. DBSCAN is very sensitive to its input parameters: the neighborhood radius (r) and the minimum number of neighbors (MinPts).
Another density-based algorithm, DenClue, uses a grid and is very efficient. It generalizes several other clustering approaches, but this generality comes at the cost of a large number of input parameters.

Attempts To Reduce Parameters
There have been many efforts to make the clustering process parameter-free, such as OPTICS [8], CHAMELEON [7] and TURN* [2].
OPTICS computes an augmented cluster ordering, which represents the density-based clustering structure of the data. This method is used for interactive cluster analysis.
CHAMELEON has been found to be very effective at clustering convex shapes. However, the algorithm cannot handle outliers and needs parameter tuning to work effectively.
TURN* is a brute-force approach. It first decreases the neighborhood radius until it is so small that every data point becomes noise. The radius is then doubled at each step and clustering is repeated until a "turn" is found. Even though it takes big steps, the computation time is not promising for large datasets with varied densities.

Enhanced DBSCAN (e-DBSCAN)
A point p is an internal point if it has at least k neighbors within its neighborhood; its neighborhood is called a core.
A point p is an external point if it has fewer than k neighbors within its neighborhood and it is located within a core.
In the figure, given k = 4, points 6 and 8 are internal points, while point 5 is an external point.

e-DBSCAN (Cont.)
A cluster C is a collection of cores whose centers are density-reachable from each other.
The boundary points of a cluster are the external points within the cluster.
Enhanced DBSCAN (e-DBSCAN) differs from the original DBSCAN in that the boundary points of each cluster are stored as a separate set. The boundary sets are used to merge clusters at a later stage.

Steps of e-DBSCAN
1. Pick an arbitrary point x. If it is not an internal point, label it as noise. Otherwise its neighborhood becomes a rudimentary cluster C; insert all neighbors of x into the seed store.
2. Retrieve the next point from the seed store. If it is an internal point, merge its neighborhood into cluster C and insert all its neighbors into the seed store; if it is an external point, insert it into the boundary set of C.
3. Go back to step 2 with the next seed until the seed store is empty.
4. Go back to step 1 with the next unclustered point in the dataset.
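The four steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the seed store is modeled as a queue, and external points reached during expansion go into the cluster's boundary set instead of spawning further seeds.

```python
import math
from collections import deque

def e_dbscan(points, r, k):
    """Sketch of e-DBSCAN: like DBSCAN, but the external (boundary) points of
    each cluster are kept in a separate set for the later merge stage."""
    def nbrs(p):
        px, py = points[p]
        return [i for i, (x, y) in enumerate(points)
                if i != p and math.hypot(x - px, y - py) <= r]

    labels = {}        # point index -> cluster id (> 0), or -1 for noise
    boundaries = {}    # cluster id -> boundary set of external points
    cid = 0
    for x in range(len(points)):
        if x in labels:
            continue
        n = nbrs(x)
        if len(n) < k:                 # step 1: not internal -> label as noise
            labels[x] = -1
            continue
        cid += 1                       # step 1: neighborhood is a rudimentary cluster C
        labels[x] = cid
        boundaries[cid] = set()
        seeds = deque(n)               # the seed store
        while seeds:                   # steps 2-3: drain the seed store
            p = seeds.popleft()
            if labels.get(p, -1) > 0:
                continue               # already belongs to a cluster
            pn = nbrs(p)
            labels[p] = cid
            if len(pn) >= k:           # internal: merge its neighborhood into C
                seeds.extend(q for q in pn if labels.get(q, -1) <= 0)
            else:                      # external: record it in C's boundary set
                boundaries[cid].add(p)
    return labels, boundaries          # step 4: the outer loop picks the next point

# Toy run: two tight squares and one isolated noise point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
       (10, 10), (10, 11), (11, 10), (11, 11)]
labels, boundaries = e_dbscan(pts, r=1.5, k=2)
# The two squares form clusters 1 and 2; (5, 5) is labeled noise.
```

Note that a point labeled noise in step 1 can later be relabeled as an external point of a cluster when it is reached from a core, matching DBSCAN's usual border-point behavior.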

Parameter Reduction
DBSCAN has two input parameters:
– the minimum number of neighbors, k;
– the neighborhood radius, r.
k is the size of the smallest cluster. DBSCAN sets k to 4 [1]; TURN* also treats it as a fixed value [2]. A clustering comparison between k = 4 and k = 7 is shown on the next slide.
Therefore, the only remaining input parameter is the minimum neighborhood radius, r. It depends on the data distribution of the dataset, and it should be different for clusters of different densities.

"k = 4" vs. "k = 7": (R4)^2 / 4 = (R7)^2 / 7
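The equivalence above can be read as keeping the implied local density constant: if a point has k neighbors within radius R_k, its density is roughly k / (pi * R_k^2), so (R4)^2/4 = (R7)^2/7 relates the radii that describe the same density for k = 4 and k = 7. A quick check under that reading (the helper name is illustrative):

```python
import math

def equivalent_radius(r_from, k_from, k_to):
    """Rescale a radius so the implied density k / (pi * r^2) is unchanged:
    r_from^2 / k_from = r_to^2 / k_to  =>  r_to = r_from * sqrt(k_to / k_from)."""
    return r_from * math.sqrt(k_to / k_from)

r4 = 2.0
r7 = equivalent_radius(r4, 4, 7)   # radius that sees 7 neighbors at equal density
```

So changing k merely rescales the appropriate radius by sqrt(k_to / k_from), which is consistent with treating k as a fixed value and tuning only r.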

Observation 1
Define R as the distance between each point x and its 4th nearest neighbor. The points are then sorted by R in ascending order.
DS1 is a dataset used by DBSCAN; its size is 200. DS2 is reproduced from a dataset used by CHAMELEON.
(a) DS1 (b) DS2 (c) R-x of DS1 (d) R-x of DS2
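The R-x curve of Observation 1 is cheap to compute directly. A brute-force sketch (quadratic in the number of points, fine for illustration; the value of k and the toy data are chosen arbitrarily, not taken from DS1 or DS2):

```python
import math

def r_x_graph(points, k=4):
    """For each point, R = the distance to its k-th nearest neighbor,
    with all R values returned sorted in ascending order."""
    rs = []
    for i, (px, py) in enumerate(points):
        dists = sorted(math.hypot(x - px, y - py)
                       for j, (x, y) in enumerate(points) if j != i)
        rs.append(dists[k - 1])
    return sorted(rs)

# A dense square and a sparse square: the dense points contribute the small
# R values at the front of the curve, the sparse points the large ones.
dense  = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
sparse = [(10, 10), (10, 14), (14, 10), (14, 14), (12, 12)]
curve = r_x_graph(dense + sparse, k=2)
```

Clusters of different densities appear as distinct plateaus on this curve, which is what the R-x graphs of DS1 and DS2 illustrate.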

Observation 2
Given a neighborhood radius r, we calculate the number of neighbors of each point within that radius, denoted as K.
Sorting the points by K in descending order gives the sorted K-x graph.
When the neighborhood radius is close to the maximum R, the K-x graph shows "knees" very clearly.

Observation 2 (Cont.)
To find the "knees" in the graph, we calculate the differentials of the graph, the ΔKs. The knees are close to the points with peak differentials.
The number of "knees" equals the number of cluster densities in the dataset. Intuitively, we infer that the point ranges divided by the "knees" belong to clusters of different densities, or to noise.
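One plausible way to mechanize this, assuming "peak differentials" simply means the largest |ΔK| gaps in the sorted curve (the slides do not spell out the exact knee criterion, so this helper is an assumption):

```python
def find_knees(sorted_k, num_knees):
    """Positions of the 'knees' in a descending sorted K-x curve, taken as the
    indices where the differential |delta K| is largest. Hypothetical helper;
    the paper's exact criterion is not given in the slides."""
    diffs = [abs(sorted_k[i] - sorted_k[i + 1]) for i in range(len(sorted_k) - 1)]
    peaks = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(peaks[:num_knees])

# Two cluster densities plus noise: two knees split the curve into three ranges.
k_curve = [10, 10, 9, 9, 3, 3, 3, 1]
knees = find_knees(k_curve, 2)
```

The number of knees requested here corresponds to the number of cluster densities expected in the dataset; the ranges between knees are the partitions used in Observation 3.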

Observation 3
Sort the dataset DS2 by K, and then partition the sorted dataset into three subsets separated by the two "knees."
The two "knees" are at positions 10000 and 15500. Therefore the three partitions are 0 – 10000, 10000 – 15500, and 15500 onward.

Determination of the Neighborhood Radii
Based on these experiments and observations, we develop an algorithm, MINR, that automatically determines the minimum neighborhood radii for mining clusters of different densities, based on the data distribution.

Nonparametric Density-based Clustering
Given a series of radii, start clustering with the enhanced DBSCAN algorithm, e-DBSCAN, using the smallest radius, r = r1. The densest cluster(s) are formed first.
Then set r = r2 and process only the points that are still unclustered. The next, sparser cluster(s) are formed.

Nonparametric Density-based Clustering (Cont.)
– Calculate a series of neighborhood radii for the different cluster densities using MINR.
– Iteratively cluster with e-DBSCAN using those radii.
– Merge any pair of clusters that share most of the boundary points of either cluster.
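The merge step is stated only as sharing "most of the boundary points of either cluster." One hedged interpretation, with an assumed majority threshold (the 0.5 is a choice made here, not given in the slides):

```python
def should_merge(boundary_a, boundary_b, threshold=0.5):
    """Merge-rule sketch: merge two clusters when their shared boundary points
    exceed a threshold fraction of either cluster's boundary set."""
    if not boundary_a or not boundary_b:
        return False
    shared = len(boundary_a & boundary_b)
    return (shared / len(boundary_a) > threshold or
            shared / len(boundary_b) > threshold)

print(should_merge({1, 2, 3, 4}, {3, 4, 5}))  # True: 2 of 3 boundary points shared
print(should_merge({1, 2}, {3, 4}))           # False: no shared boundary points
```

Checking against either cluster's boundary set (rather than both) matches the slide's wording "of either cluster," so a small cluster absorbed by a large one still triggers the merge.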

Performance Analysis
We compare our nonparametric density-based clustering algorithm (NPDBC) with TURN*, showing run-time comparisons on the dataset DS2 discussed above.
To make the data contain clusters of different densities, we artificially insert more data into some clusters to make them denser than the others. The resulting datasets have sizes from 10k to 200k points.

Performance Analysis (Cont.)
NPDBC is more efficient than TURN* on large datasets. The reason is that NPDBC computes its parameters once at the beginning of the clustering process, whereas TURN* tries different neighborhood radii until the first "turn" is found, even in the case of just two different densities.
We only compare NPDBC with TURN* on datasets with two different densities. As the number of distinct densities increases, NPDBC should outperform TURN* by an even wider margin.

Conclusion and Future Work
In this paper, we explore an automatic approach to determine this parameter based on the distribution of the dataset. The algorithm MINR is developed to determine the minimum neighborhood radii for clusters of different densities.
We developed a nonparametric clustering method (NPDBC) by combining MINR with the enhanced DBSCAN, e-DBSCAN. Experiments show that NPDBC is more efficient and scalable than TURN* for clusters of two different densities.
In future work, we will implement NPDBC using the vertical data structure P-tree, an efficient, data-mining-ready data representation.

Thanks!