Density-Based Clustering Algorithms


Density-Based Clustering Algorithms Presented by: Iris Zhang 17 January 2003

Outline Clustering Density-based clustering DBSCAN DENCLUE Summary and future work

Clustering: Problem Description Given: a data set of N data items, each a d-dimensional feature vector. Task: determine a natural, useful partitioning of the data set into a number of clusters (k) and noise.

Major Types of Clustering Algorithms Partitioning: partition the database into k clusters, each represented by a representative object. Hierarchical: decompose the database into several levels of partitioning, represented by a dendrogram.

Other Kinds of Clustering Algorithms Density-based: based on connectivity and density functions. Grid-based: based on a multiple-level granularity structure. Model-based: a model is hypothesized for each cluster, and the idea is to find the best fit of the data to that model.

Density-Based Clustering A cluster is defined as a connected dense component which can grow in any direction that density leads. Density, connectivity and boundary Arbitrary shaped clusters and good scalability

Two Major Types of Density-Based Clustering Algorithms Connectivity based: DBSCAN, GDBSCAN, OPTICS and DBCLASD Density function based: DENCLUE

DBSCAN [Ester et al. 1996] Clusters are defined as density-connected sets (wrt. Eps, MinPts). Density and connectivity are measured by the local distribution of nearest neighbors. Targets low-dimensional spatial data.

DBSCAN Definition 1: Eps-neighborhood of a point p: NEps(p) = {q ∈ D | dist(p,q) ≤ Eps} Definition 2: Core point: a point q is a core point if |NEps(q)| ≥ MinPts
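Definitions 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and a brute-force scan; the function names and the toy points are hypothetical and not from the slides.

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with dist(points[p], points[q]) <= eps (p itself is included)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core_point(points, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
print(eps_neighborhood(points, 0, eps=1.0))          # [0, 1, 2]
print(is_core_point(points, 0, eps=1.0, min_pts=3))  # True
print(is_core_point(points, 3, eps=1.0, min_pts=3))  # False
```

Note that the neighborhood of p contains p itself, so MinPts counts the query point as well.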

DBSCAN Definition 3: Directly density-reachable A point p is directly density-reachable from a point q wrt. Eps, MinPts if 1) p ∈ NEps(q) and 2) |NEps(q)| ≥ MinPts (core point condition).

DBSCAN Definition 4: Density-reachable A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi Definition 5: Density-connected A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both, p and q are density-reachable from o wrt. Eps and MinPts.
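Density-reachability can be illustrated as a breadth-first search that only expands through core points. The sketch below assumes Euclidean distance and brute-force neighborhood queries; the helper names and the sample points are hypothetical.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Brute-force Eps-neighborhood of points[i], as a list of indices."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def density_reachable_from(points, q, eps, min_pts):
    """Indices of all points density-reachable from q (q itself included)."""
    reached = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # not a core point: no chain can continue through it
        for n in nbrs:
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached

# Five points on a line; indices 1 and 2 are core points for Eps=1, MinPts=3.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(density_reachable_from(points, 1, eps=1.0, min_pts=3))  # {0, 1, 2, 3}
print(density_reachable_from(points, 0, eps=1.0, min_pts=3))  # {0}
```

The example shows that density-reachability is not symmetric: the border point 0 is reachable from the core point 1, but not the other way round. Points 0 and 3 are nevertheless density-connected, because both are density-reachable from point 1.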

DBSCAN

DBSCAN Definition 6: Cluster Let D be a database of points. A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions: 1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C. (Maximality) 2) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts. (Connectivity)

DBSCAN Definition 7: Noise Let C_1, ..., C_k be the clusters of the database D wrt. parameters Eps_i and MinPts_i, i = 1, ..., k. Then we define the noise as the set of points in the database D not belonging to any cluster C_i, i.e. noise = {p ∈ D | ∀ i: p ∉ C_i}.

DBSCAN Lemma 1: Let p be a point in D and |NEps(p)| ≥ MinPts. Then the set O = {o | o ∈ D and o is density-reachable from p wrt. Eps and MinPts} is a cluster wrt. Eps and MinPts. Lemma 2: Let C be a cluster wrt. Eps and MinPts and let p be any point in C with |NEps(p)| ≥ MinPts. Then C equals the set O = {o | o is density-reachable from p wrt. Eps and MinPts}.

DBSCAN For each point, DBSCAN determines the Eps-neighborhood and checks whether it contains at least MinPts data points. DBSCAN uses index structures (such as the R*-tree) to determine the Eps-neighborhood.
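The procedure suggested by Lemmas 1 and 2 (pick an unvisited core point, collect everything density-reachable from it, repeat) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it replaces the R*-tree with a brute-force neighborhood scan, and the names `dbscan`, `region_query`, and `NOISE` are assumptions made for this example.

```python
import math

NOISE = -1
UNVISITED = None

def region_query(points, i, eps):
    """Brute-force Eps-neighborhood (an index structure would replace this)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue  # already assigned; do not expand again
            labels[j] = cluster_id
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

With the brute-force query each neighborhood lookup scans all points, so the sketch runs in quadratic time; a spatial index is what makes the neighborhood queries cheap on large databases.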

Arbitrary shape clusters found by DBSCAN

DENCLUE [Hinneburg & Keim 1998] Clusters are defined according to the point density function, which is the sum of the influence functions of the data points. It clusters well in data sets with large amounts of noise. It can deal with high-dimensional data sets. It is significantly faster than existing algorithms.

DENCLUE Influence Function: influence of a data point in its neighborhood. Density Function: sum of the influences of all data points.

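DENCLUE Definition 1: Influence Function The influence of a data point y at a point x in the data space is modeled by a function $f_B^y : F^d \to \mathbb{R}_0^+$, $f_B^y(x) = f_B(x, y)$, e.g. the Gaussian influence function: $f_{Gauss}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$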
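A Gaussian kernel is one common choice of influence function, and a square-wave function (1 inside radius σ, 0 outside) is another. The sketch below assumes Euclidean distance; the function names are chosen for illustration only.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x: exp(-d(x, y)^2 / (2 sigma^2))."""
    d = math.dist(x, y)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def square_wave_influence(x, y, sigma):
    """Influence is 1 within distance sigma of y and 0 elsewhere."""
    return 1.0 if math.dist(x, y) <= sigma else 0.0

print(gaussian_influence((0.0, 0.0), (0.0, 0.0), sigma=1.0))  # 1.0
print(gaussian_influence((0.0, 0.0), (3.0, 4.0), sigma=1.0))  # close to 0
print(square_wave_influence((0.0, 0.0), (3.0, 4.0), sigma=1.0))  # 0.0
```

The Gaussian variant is smooth, which is what makes the gradient-based hill climbing used later in the deck possible.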

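DENCLUE Definition 2: Density Function The density at a point x in the data space is defined as the sum of the influences of all data points $x_i$: $f_B^D(x) = \sum_{i=1}^{N} f_B^{x_i}(x)$, e.g. for the Gaussian influence function: $f_{Gauss}^D(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$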
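As a small illustration, the density at a location is just the sum of the per-point influences. The sketch assumes a Gaussian influence function with Euclidean distance; `gaussian_density` and the sample data are hypothetical.

```python
import math

def gaussian_density(x, data, sigma):
    """Sum of Gaussian influences of all data points at location x."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
               for xi in data)

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0)]
dense = gaussian_density((0.3, 0.3), data, sigma=1.0)
sparse = gaussian_density((4.0, 4.0), data, sigma=1.0)
print(dense > sparse)  # True: the location near three points is denser
```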

DENCLUE Example

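DENCLUE Definition 3: Gradient The gradient of a density function is defined as $\nabla f_B^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot f_B^{x_i}(x)$, e.g. for the Gaussian influence function: $\nabla f_{Gauss}^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$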
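For a Gaussian influence function, the gradient is a weighted sum of the vectors pointing from x to each data point, with each weight equal to that point's influence at x. A minimal sketch under that assumption (the function name is illustrative):

```python
import math

def gaussian_density_gradient(x, data, sigma):
    """Sum over data points of (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))."""
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (xi[k] - x[k]) * w
    return grad

# With a single data point at (2, 0), the gradient at the origin points
# along the positive x axis, towards that data point.
g = gaussian_density_gradient((0.0, 0.0), [(2.0, 0.0)], sigma=1.0)
print(g[0] > 0.0, g[1] == 0.0)  # True True
```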

DENCLUE Definition 4: Density Attractor A point $x^* \in F^d$ is called a density attractor for a given influence function iff $x^*$ is a local maximum of the density function. Example of Density Attractor

DENCLUE Definition 5: Density-attracted point A point $x \in F^d$ is density-attracted to a density attractor $x^*$ iff $\exists k \in \mathbb{N}: d(x^k, x^*) \le \epsilon$, where $x^i$ is a point on the path between x and its attractor $x^*$. Density-attracted points are determined by a gradient-based hill-climbing method.
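The hill-climbing step can be sketched as repeatedly moving a fixed distance δ along the normalized gradient and stopping as soon as the density would decrease. The sketch assumes a Gaussian influence function; `delta`, `max_steps`, and the function names are illustrative choices rather than values from the slides.

```python
import math

def density(x, data, sigma):
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
               for xi in data)

def gradient(x, data, sigma):
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (xi[k] - x[k]) * w
    return grad

def hill_climb(x, data, sigma, delta=0.05, max_steps=10_000):
    """Follow the normalized gradient in steps of length delta."""
    x = tuple(x)
    for _ in range(max_steps):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break  # flat density: nowhere to climb
        nxt = tuple(xk + delta * gk / norm for xk, gk in zip(x, g))
        if density(nxt, data, sigma) < density(x, data, sigma):
            break  # the next step would go downhill: x approximates x*
        x = nxt
    return x

data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
        (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5)]
a1 = hill_climb((2.0, 2.0), data, sigma=1.0)
a2 = hill_climb((9.0, 9.0), data, sigma=1.0)
print(a1, a2)  # approximately (0.25, 0.25) and (10.25, 10.25)
```

Because the step length is fixed, the returned point is only accurate to roughly δ; a smaller δ gives a better estimate of the attractor at the cost of more steps.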

DENCLUE Definition 6: Center-Defined Cluster A center-defined cluster with density attractor $x^*$ ($f_B^D(x^*) \ge \xi$) is the subset of the database which is density-attracted by $x^*$.
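A rough sketch of center-defined clustering: climb from every data point to its attractor, discard attractors whose density is below ξ, and group points whose attractors coincide. The Gaussian influence function, the fixed-step climb, and the rule that two attractor estimates within 2δ of each other count as the same attractor are all assumptions of this illustration.

```python
import math

def density(x, data, sigma):
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
               for xi in data)

def climb(x, data, sigma, delta, max_steps=10_000):
    """Fixed-step hill climbing along the normalized density gradient."""
    x = tuple(x)
    for _ in range(max_steps):
        g = [sum((xi[k] - x[k])
                 * math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
                 for xi in data)
             for k in range(len(x))]
        norm = math.hypot(*g)
        if norm == 0.0:
            break
        nxt = tuple(a + delta * b / norm for a, b in zip(x, g))
        if density(nxt, data, sigma) < density(x, data, sigma):
            break
        x = nxt
    return x

def center_defined_clusters(data, sigma, xi, delta=0.05):
    """Label each point by its significant attractor, or -1 for noise."""
    attractors, labels = [], []
    for p in data:
        a = climb(p, data, sigma, delta)
        if density(a, data, sigma) < xi:
            labels.append(-1)  # attractor not significant: treat as noise
            continue
        for idx, rep in enumerate(attractors):
            if math.dist(a, rep) <= 2 * delta:  # same attractor, up to step size
                labels.append(idx)
                break
        else:
            attractors.append(a)
            labels.append(len(attractors) - 1)
    return labels, attractors

data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
        (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5),
        (50.0, 50.0)]
labels, attractors = center_defined_clusters(data, sigma=1.0, xi=2.0)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) is its own attractor with density of about 1, which is below ξ = 2, so it is reported as noise.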

DENCLUE Definition 7: Arbitrary-shaped cluster An arbitrary-shaped cluster for the set of density attractors X is a subset $C \subseteq D$, where 1) $\forall x \in C\ \exists x^* \in X$: $f_B^D(x^*) \ge \xi$ and x is density-attracted to $x^*$, and 2) $\forall x_1^*, x_2^* \in X$: $\exists$ a path $P \subset F^d$ from $x_1^*$ to $x_2^*$ with $\forall p \in P: f_B^D(p) \ge \xi$

DENCLUE Noise-Invariance Assumption: Noise is uniformly distributed in the data space. Lemma: The density attractors do not change when the noise level increases. Idea of the proof: - partition the density function into signal and noise - the density function of the noise approximates a constant.
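The idea of the proof can be written out in one line of algebra. In the sketch below, $D_C$ denotes the clustered (signal) points and $N$ the noise points; these symbols and the constant $c$ are notation assumed for this illustration.

```latex
\begin{align*}
f^{D}(x) &= f^{D_C}(x) + f^{N}(x)
  && \text{(split the density into signal and noise)} \\
f^{N}(x) &\approx c
  && \text{(uniformly distributed noise)} \\
\nabla f^{D}(x) &= \nabla f^{D_C}(x) + \nabla f^{N}(x) \approx \nabla f^{D_C}(x)
  && \text{(a constant has zero gradient)}
\end{align*}
```

Since the gradient is approximately unchanged, the local maxima of $f^{D}$, and hence the density attractors, stay approximately where they were without the noise.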

DENCLUE Example of noise invariance

DENCLUE Parameter-σ: It describes the influence of a data point in the data space. It determines the number of clusters.

DENCLUE Parameter-σ: Choose σ such that number of density attractors is constant for the longest interval of σ.

DENCLUE Parameter ξ: It determines whether a density attractor is significant. This reduces the number of density attractors and thereby improves performance.

DENCLUE Experiment Polygonal CAD data (11-dimensional feature vectors) Comparison between DBSCAN and DENCLUE

DENCLUE

DENCLUE Molecular biology: determining the behavior of molecules in conformation space (19-dimensional dihedral angle space with a large amount of noise). Folded State Unfolded State Folded Conformation of the Peptide

Summary arbitrary shaped clusters good scalability explicit definition of noise noise invariance high dimensional clustering

Future work Using density-based clustering methods to deal with high-dimensional data sets

References
[EKS+ 96] M. Ester, H-P. Kriegel, J. Sander, X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996.
[HK 98] A. Hinneburg, D.A. Keim, An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998.
[XEK+ 98] X. Xu, M. Ester, H-P. Kriegel, J. Sander, A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases, Proc. 14th Int. Conf. on Data Engineering (ICDE'98), Orlando, FL, 1998, pp. 324-331.

References
J. Sander, M. Ester, H-P. Kriegel, X. Xu, Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications, Data Mining and Knowledge Discovery, an International Journal, Vol. 2, No. 2, Kluwer Academic Publishers, 1998, pp. 169-194.
M. Ankerst, M. Breunig, H-P. Kriegel, J. Sander, OPTICS: Ordering Points To Identify the Clustering Structure, Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, PA, 1999.
A. Hinneburg, D.A. Keim, Clustering Techniques for Large Data Sets: From the Past to the Future, Tutorial, Proc. Int. Conf. on Principles and Practice in Knowledge Discovery (PKDD'00), Lyon, France, 2000.

Q&A