DBSCAN & Its Implementation on Atlas Xin Zhou, Richard Luo Prof. Carlo Zaniolo Spring 2002.

Slides:

Advertisements

Similar presentations

Density-Based Clustering Math 3210 By Fatine Bourkadi.

Advertisements

1 Density-Based and other Clustering Methods Slides based on those by J. Han and Martin Pfeifle

Clustering Basic Concepts and Algorithms

Osmar Zaïane and Chi-Hoon Lee Database Laboratory Dept. of Computing Science University of Alberta Density-Based Clustering of Spatial Data when facing.

Presented by: GROUP 7 Gayathri Gandhamuneni & Yumeng Wang.

DBSCAN – Density-Based Spatial Clustering of Applications with Noise M.Ester, H.P.Kriegel, J.Sander and Xu. A density-based algorithm for discovering clusters.

Density-based Approaches

Cluster Analysis Part III. Learning Objectives Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary.

OPTICS: Ordering Points To Identify the Clustering Structure Mihael Ankerst, Markus M. Breunig, Hans- Peter Kriegel, Jörg Sander Presented by Chris Mueller.

2001/12/18CHAMELEON1 CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling Paper presentation in data mining class Presenter : 許明壽 ; 蘇建仲.

Clustering By: Avshalom Katz. We will be talking about… What is Clustering? Different Kinds of Clustering What is DBSCAN? Pseudocode Example of Clustering.

Qiang Yang Adapted from Tan et al. and Han et al.

Clustering Prof. Navneet Goyal BITS, Pilani

CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu.

Part II - Clustering© Prentice Hall1 Clustering Large DB Most clustering algorithms assume a large data structure which is memory resident. Most clustering.

Clustering Methods Professor: Dr. Mansouri

More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.

Chapter 3: Cluster Analysis

MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng,

1 Clustering Instructor: Qiang Yang Hong Kong University of Science and Technology Thanks: J.W. Han, I. Witten, E. Frank.

K-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory.

Cluster Analysis.

INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION Conceptualization of Place via Spatial Clustering and Co- occurrence Analysis.

An Introduction to Clustering

Instructor: Qiang Yang

SCAN: A Structural Clustering Algorithm for Networks

Cluster Analysis.

2015/7/21 Incremental Clustering for Mining in a Data Warehousing Environment Martin Ester Hans-Peter Kriegel J.Sander Michael Wimmer Xiaowei Xu Proceedings.

Clustering Part2 BIRCH Density-based Clustering --- DBSCAN and DENCLUE

By: Arthy Krishnamurthy & Jing Tun

1 CSE 980: Data Mining Lecture 17: Density-based and Other Clustering Algorithms.

Garrett Poppe, Liv Nguekap, Adrian Mirabel CSUDH, Computer Science Department.

Density-Based Clustering Algorithms

Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.

Topic9: Density-based Clustering

Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.

Clustering Algorithms for Numerical Data Sets. Contents 1.Data Clustering Introduction 2.Hierarchical Clustering Algorithms 3.Partitional Data Clustering.

Data Mining and Warehousing: Chapter 8

DBSCAN Data Mining algorithm Dr Veljko Milutinović Milan Micić

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.

Presented by Ho Wai Shing

Density-Based Clustering Methods. Clustering based on density (local cluster criterion), such as density-connected points Major features: –Discover clusters.

Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.

1 Core Techniques: Cluster Analysis Cluster: a number of things of the same kind being close together in a group (Longman dictionary of contemporary English.

Other Clustering Techniques

CLUSTERING DENSITY-BASED METHODS Elsayed Hemayed Data Mining Course.

Clustering By : Babu Ram Dawadi. 2 Clustering cluster is a collection of data objects, in which the objects similar to one another within the same cluster.

Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang.

1 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular.

1 Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Density-Based.

Data Mining Comp. Sc. and Inf. Mgmt. Asian Institute of Technology

DATA MINING Spatial Clustering

More on Clustering in COSC 4335

CSE 4705 Artificial Intelligence

Hierarchical Clustering: Time and Space requirements

CSE 5243 Intro. to Data Mining

©Jiawei Han and Micheline Kamber Department of Computer Science

CS 685: Special Topics in Data Mining Jinze Liu

Data Mining: Concepts and Techniques Clustering

Topic 3: Cluster Analysis

The University of Adelaide, School of Computer Science

CSE572, CBS598: Data Mining by H. Liu

CS 685: Special Topics in Data Mining Jinze Liu

CS 485G: Special Topics in Data Mining

GPX: Interactive Exploration of Time-series Microarray Data

CSE572, CBS572: Data Mining by H. Liu

CSE572, CBS572: Data Mining by H. Liu

CSE572: Data Mining by H. Liu

CS 685: Special Topics in Data Mining Jinze Liu

Presentation transcript:

DBSCAN & Its Implementation on Atlas Xin Zhou, Richard Luo Prof. Carlo Zaniolo Spring 2002

Outline Clustering Background Density-based Clustering DBSCAN Algorithm DBSCAN Implementation on ATLaS Performance Conclusion

Clustering Algorithms Partitioning Alg: Construct various partitions then evaluate them by some criterion (CLARANS, O(n) calls) Hierarchy Alg: Create a hierarchical decomposition of the set of data (or objects) using some criterion (merge & divisive, difficult to find termination condition) Density-based Alg: based on local connectivity and density functions

Density-Based Clustering Clustering based on density (local cluster criterion), such as density-connected points Each cluster has a considerable higher density of points than outside of the cluster

Density-Based Clustering Major features: Discover clusters of arbitrary shape Handle noise One scan Several interesting studies: DBSCAN: Ester, et al. (KDD ’ 96) GDBSCAN: Sander, et al. (KDD ’ 98) OPTICS: Ankerst, et al (SIGMOD ’ 99). DENCLUE: Hinneburg & D. Keim (KDD ’ 98) CLIQUE: Agrawal, et al. (SIGMOD ’ 98)

Density Concepts Two global parameters: Eps: Maximum radius of the neighbourhood MinPts: Minimum number of points in an Eps-neighbourhood of that point Core Object: object with at least MinPts objects within a radius ‘ Eps-neighborhood ’ Border Object: object that on the border of a cluster p q MinPts = 5 Eps = 1 cm

Density-Based Clustering: Background N Eps (p):{q belongs to D | dist(p,q) <= Eps} Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if 1) p belongs to N Eps (q) 2) |N Eps (q)| >= MinPts (core point condition) p q MinPts = 5 Eps = 1 cm

Density-Based Clustering: Background (II) Density-reachable: A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p 1, …, p n, p 1 = q, p n = p such that p i+1 is directly density-reachable from p i Density-connected A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o wrt. Eps and MinPts. p q p1p1 pq o

DBSCAN: Density Based Spatial Clustering of Applications with Noise Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density- connected points Discovers clusters of arbitrary shape in spatial databases with noise Core Border Outlier Eps = 1cm MinPts = 5

DBSCAN: The Algorithm (1) Arbitrary select a point p Retrieve all points density-reachable from p wrt Eps and MinPts. If p is a core point, a cluster is formed. If p is a border point, no points are density- reachable from p and DBSCAN visits the next point of the database. Continue the process until all of the points have been processed.

DBSCAN: The Algorithm (2)

DBSCAN: The Algorithm (3)

DBSCAN: The Algorithm (4)

Implementation with ATLaS (1) table SetOfPoints (x real, y real, ClId int) RTREE; /* meaning of ClId: -1: unclassified, 0: noise, 1,2,3...: cluster */ table nextId(ClusterId int); table seeds (sx real, sy real); insert into nextId values (1); select ExpandCluster(x, y, ClusterId, Eps, Minpts) from SetOfPoints, nextId where ClId= -1 ;

Implementation with ATLaS (2) aggregate ExpandCluster (x real, y real, ClusterId int, Eps real, MinPts int):Boolean { table seedssize (size int); initialize: iterate: { insert into seeds select regionQuery (x, y, Eps); insert into seedssize select count(*) from seeds; insert into return select False from seedssize where size<MinPts; update SetofPoints set ClId=ClusterId where exists (select * from seeds where sx=x and sy=y) and SQLCODE=0; update nextId as n set n.ClusterId=n.ClusterId+1 where SQLCODE=1; delete from seeds where sx=x and sy=y and SQLCODE=1; select changeClId (sx, sy, ClusterId, Eps, MinPts) from seeds and SQLCODE=1; }

Implementation with ATLaS (3) aggregate changeClId (sx real, sy real, ClusterId int, Eps real, MinPts int):Boolean { table result (rx real, ry real); table resultsize (size int); initialize: iterate: { insert into result select regionQuery(sx, sy, Eps); insert into resultsize select count(*) from result; insert into seeds select rx, ry from result where (select size from resultsize)>=Minpts and (select ClId from SetofPoints where x=rx and y=ry)=-1; update SetofPoints set ClId=ClusterId where SQLCODE=1 and (x,y) in (select rx,ry from result) and (ClId=-1 or ClId=0); delete from seeds where seeds.sx=sx and seeds.sy=sy; }

Implementation with ATLaS (4) aggregate regionQuery (qx real, qy real, Eps real):(real, real) { initialize: iterate: terminate: {Insert into return select x,y from SetOfPoints where distance(x, y, qx, qy) <=Eps; }

R*-Tree(1) R*-Tree: A spatial index Generalize the 1-dimensional B+Tree to d-dimensional data spaces

R*-tree(2) R*-Tree is a height-balanced data structure Search key is a collection of d- dimensional intervals Search key value is referred to as bounding boxes

R*-Tree(3) Query a bounding box B in R*-Tree: Test bounding box for each child of root if it overlaps B, search the child ’ s subtree If more than one child of root has a bounding box overlapping B, we must search all the corresponding subtrees Important difference between B+tree: search for single point can lead to several paths

DBSCAN Complexity Comparison Time ComplexityA single neighborhood query DBSCAN Without indexO(n)O(n 2 ) R*-treeO(log n)O(n*log n) The height of a R*-Tree is O(log n) in the worst case A query with a “ small ” region traverses only a limited number of paths in the R*-Tree For each point, at most one neighborhood query is needed

Heuristic for Eps and Minpts K-dist (p): distance from the kth nearest neighbour to p Sorting by k-dist (p) Minpts: k>4 no significant difference, but more computation, thus set k=4

Performance Evaluation compared with CLARANS (1) Accuracy CLARANS: DBSCAN:

Performance Evaluation compared with CLARANS (2) Efficiency SEQUOIA2000 benchmark data (Stonebraker et al. 1993)

Conclusion Density-based Algorithm DBSCAN is designed to discover clusters of arbitrary shape. R*-Tree spatial index reduce the time complexity from O(n 2 ) to O(n*log n). DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency using SEQUOIA 2000 benchmark. Implementation is done on ATLaS using User- Defined Aggregate and RTREE table

References Ester M., Kriegel H.-P., Sander J. and Xu X “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. Portland, OR, Raghu Ramakrishnan, Johannes Gehrke, “Database Management systems (Second Edition)”, McGraw-Hill Companies, Inc. Beckmann N., Kriegel H.-P., Schneider R, and Seeger B “The R*-tree: An Efficient and RobustAccess Method for Points and Rectangles”. Proc. ACM SIGMOD Int. Conf. on Management of Data.Atlantic City, NJ, Jain A.K., and Dubes R.C “Algorithms for Clustering Data”. New Jersey: Prentice Hall. Sander J., Ester M., Kriegel H.-P., Xu X.: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, in: Data Mining and Knowledge Discovery, an Int. Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 1998, pp Haixun Wang, Carlo Zaniolo: Database System Extensions for Decision Support: the AXL Approach. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000: 11-20

Thank you!