Efficient Density-Based Clustering of Complex Objects Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle University of Munich Institute for Computer.


1 Efficient Density-Based Clustering of Complex Objects
Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle
University of Munich, Institute for Computer Science
Brighton, UK, November 01-04, 2004

2 Outline
Martin Pfeifle, University of Munich. ICDM 2004, Brighton, UK
- Density-Based Clustering
- Clustering of Complex Objects
- Experimental Evaluation

3 Outline
- Density-Based Clustering: Core Object · Density-Reachability · DBSCAN · OPTICS
- Clustering of Complex Objects
- Experimental Evaluation

4 Data Mining
- Larger and larger amounts of data are collected automatically
- Too large for humans to analyze manually
- Tools to assist analysis are necessary → KDD / Data Mining
Examples: Hubble Space Telescope, Telecommunication Data, Market-Basket Data

5 Clustering
Efficiently grouping the database into sub-groups (clusters) such that
- similarity within clusters is maximized
- similarity between clusters is minimized
Flat Clustering: one level of clusters, e.g. density-based clustering algorithm DBSCAN [KDD 96]
Hierarchical Clustering: nested clusters, e.g. density-based clustering algorithm OPTICS [SIGMOD 99]

6 Density-Based Clustering I
Parameters
- range ε and minimal weight MinPts
Definition: core object
- q is a core object if |rangeQuery(q, ε)| ≥ MinPts
Definition: directly density-reachable
- p is directly density-reachable from q if q is a core object and p ∈ rangeQuery(q, ε)
Definition: density-reachable
- density-reachable: transitive closure of "directly density-reachable"
[Figure: example objects p, q, o, r with their ε-ranges, MinPts = 5]
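The three definitions on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the talk: objects are 2-dimensional tuples, the distance is Euclidean, the range query is a plain scan, and all function names are made up for the example.

```python
from math import dist

def range_query(db, q, eps):
    """All objects of db within distance eps of q (q itself included)."""
    return [o for o in db if dist(o, q) <= eps]

def is_core_object(db, q, eps, min_pts):
    """q is a core object if its eps-range holds at least min_pts objects."""
    return len(range_query(db, q, eps)) >= min_pts

def directly_density_reachable(db, p, q, eps, min_pts):
    """p is directly density-reachable from q: q is core and p lies in its range."""
    return is_core_object(db, q, eps, min_pts) and dist(p, q) <= eps

def density_reachable(db, p, q, eps, min_pts):
    """Transitive closure: follow chains of core objects starting at q."""
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        if not is_core_object(db, cur, eps, min_pts):
            continue  # only core objects can extend a chain
        for o in range_query(db, cur, eps):
            if o == p:
                return True
            if o not in seen:
                seen.add(o)
                frontier.append(o)
    return False
```

Note that the relation is not symmetric: a border object can be reached from a core object, but not the other way round.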

7-9 Density-Based Clustering II
Core Idea of Hierarchical Cluster Ordering: Order the objects linearly such that objects of a cluster are adjacent in the ordering.
Definition: core-distance
Definition: reachability-distance
[Figure: core-distance(o) of an object o and reachability-distance(p, o) of objects p, with range ε and MinPts = 5]
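The slide only names the two definitions. A minimal sketch of how they are usually stated for OPTICS is given below; the 2-dimensional tuples, the Euclidean distance, the function names, and the convention that an object counts as its own neighbour are assumptions made for the example.

```python
from math import dist, inf

def core_distance(db, o, eps, min_pts):
    """Distance from o to its min_pts-th nearest neighbour inside the eps-range
    (o itself counts at distance 0); infinite if o is not a core object."""
    d = sorted(dist(o, x) for x in db if dist(o, x) <= eps)
    return d[min_pts - 1] if len(d) >= min_pts else inf

def reachability_distance(db, p, o, eps, min_pts):
    """Reachability of p with respect to o: never smaller than core_distance(o)."""
    return max(core_distance(db, o, eps, min_pts), dist(o, p))
```

For the objects (0,0), (1,0), (3,0), (7,0) with eps = 5 and min_pts = 3, the core-distance of (0,0) is 3, so the nearby object (1,0) also gets reachability-distance 3 rather than 1.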

10-17 OPTICS Algorithm
Example Database (2-dimensional, 16 points: A, I, B, J, K, L, R, M, P, N, C, F, D, E, G, H), ε = 44, MinPts = 3
[Figure: point set with the core-distance of the current object, and the reachability plot with the reach axis marked at 44 and ∞]

Slide  Objects in the plot so far  Seedlist (object, reachability)
10     (none)                      (empty)
11     A                           (B, 40) (I, 40)
12     A B                         (I, 40) (C, 40)
13     A B I                       (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
14     A B I J                     (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
15     A B I J L ...               (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
16-17  A B I J L M K N R P C D F G E H    (empty)
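The walk through the seedlist above can be condensed into a short sketch. It is a simplified version of OPTICS, not the authors' code: objects are assumed to be 2-dimensional tuples under the Euclidean distance, range queries are plain scans, and the seedlist is a binary heap with lazy deletion of outdated entries instead of an updatable priority queue.

```python
import heapq
from math import dist, inf

def optics(db, eps, min_pts):
    """Return the cluster ordering as a list of (object, reachability) pairs."""
    reach = {o: inf for o in db}
    processed, ordering = set(), []

    for start in db:
        if start in processed:
            continue
        seeds = [(inf, start)]  # seedlist, ordered by reachability
        while seeds:
            _, o = heapq.heappop(seeds)
            if o in processed:
                continue  # outdated entry, a smaller one was handled earlier
            processed.add(o)
            ordering.append((o, reach[o]))
            nbrs = [x for x in db if dist(o, x) <= eps]
            if len(nbrs) < min_pts:
                continue  # o is not a core object, nothing to expand
            core_dist = sorted(dist(o, x) for x in nbrs)[min_pts - 1]
            for p in nbrs:
                if p in processed:
                    continue
                new_reach = max(core_dist, dist(o, p))
                if new_reach < reach[p]:
                    reach[p] = new_reach
                    heapq.heappush(seeds, (new_reach, p))
    return ordering
```

Objects of one cluster end up adjacent in the returned ordering, and a jump to an infinite reachability marks the start of a new group.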

18 Outline
- Foundations of Density-Based Clustering: Core Object · Density-Reachability · DBSCAN · OPTICS
- Clustering of Complex Objects: Direct Integration of the Multi-Step Query Processing Paradigm
- Experimental Evaluation

19 Complex Objects
- complex objects
- complex models
- complex distance measure

20 Single-Step Clustering Approach
Density-based clustering algorithms, like DBSCAN and OPTICS, work directly on the exact information:
(1) Query Q(q, ε)
(2) Result R(q, ε)
Performance Problems
- For each database object q, we perform one range query.
- Expensive exact distance computation d_o(o, q) for each object o of the database, independent of the ε range.

21 Multi-Step Query Processing
Multi-Step Similarity Search
- Range Queries (Faloutsos et al. 94)
- k-Nearest Neighbor Queries (Korn et al. 96)
- Optimal k-Nearest Neighbor Queries (Seidl, Kriegel 98)
Filter Step (index-based) → candidates → Refinement Step (exact evaluation) → results
No False Drops? Lower-Bounding Property: filter distance ≤ object distance
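A filter-and-refine range query can be sketched as follows. The two distance functions are passed in as parameters and are purely illustrative; the only requirement, as on the slide, is that the filter distance never exceeds the exact object distance, so that no true result is dropped in the filter step.

```python
from math import dist

def multi_step_range_query(db, q, eps, filter_dist, exact_dist):
    """Filter step on the cheap distance, refinement step on the exact one.
    Returns the results and the number of candidates that had to be refined."""
    candidates = [o for o in db if filter_dist(o, q) <= eps]   # filter step
    results = [o for o in candidates if exact_dist(o, q) <= eps]  # refinement
    return results, len(candidates)

# Illustrative pair of distances (an assumption for this example):
# the difference in the first coordinate lower-bounds the Euclidean distance.
def first_coordinate_filter(o, q):
    return abs(o[0] - q[0])
```

With db = [(0,0), (1,5), (3,0), (9,9)], q = (0,0) and eps = 3, the filter keeps three candidates and the refinement step removes (1,5), which gives the same result as a full scan with the exact distance.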

22 Traditional Multi-Step Clustering Approach
Density-based clustering algorithms, like DBSCAN and OPTICS, call a range query processor (e.g. Faloutsos et al. 94):
(1) Query (q, ε)
(2, 3) Query Q(q, ε) using d_f on the filter information, yielding the candidates C(q, ε)
(4) Refinement step on the exact information: computation of d_o(o, q) for all o ∈ C(q, ε)
(5) Result (q, ε)
Performance Problems
- For each database object q, we perform one range query (1).
- The range query is first performed on the filter information (2, 3).
- One expensive exact distance computation d_o(o, q) is performed for each object o of the candidate set C(q, ε) (4).
- This refinement step is very expensive for non-selective filters or high ε values.

23 Integrated Multi-Step Clustering Approach
Direct integration of the multi-step query processing paradigm into the clustering algorithm, postponing expensive exact distance computations as long as possible.
Extended density-based clustering algorithms, like DBSCAN and OPTICS:
(1, 2) Query Q(q, ε) using d_f on the filter information, yielding the candidates C(q, ε)
(3) Computation of d_o(o, q) for the core-properties of q
(4) Postponed computations of d_o(o, q) for the reachability-properties of o
Proposed Solution
- For each database object q, we perform one range query on the filter information (1, 2).
- Only those exact distances d_o(o, q) are computed which are necessary to determine the core-properties of q (3).
- A beneficial heuristic for determining the reachability-properties is applied which saves on exact distance computations (4).

24 Integrated Multi-Step Clustering Approach: Determination of Core-Properties
- First, we carry out a range query on the filter for each query object Q.
- Second, we order the resulting candidate set in ascending order according to the filter distance.
- Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.
Example (MinPts = 3, ε = 75)
Sorted distance list (filter distances): d_f(K,Q)=10, d_f(Z,Q)=12, d_f(R,Q)=18, d_f(M,Q)=55, d_f(A,Q)=58, d_f(I,Q)=65
Exact distances computed: d_o(K,Q)=53, d_o(Z,Q)=69, d_o(R,Q)=49
Result: core-distance of Q = 53 (the slide also shows d_o(R,Q)=53)
[Figure: query object Q with range ε and the objects R, Z, K, M, A, I]
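The three steps can be sketched with the numbers from this slide. The function and parameter names are hypothetical, and the sketch assumes that the query object counts as its own nearest neighbour at distance 0; under that assumption the walk stops after three exact computations and yields the core-distance 53 shown on the slide.

```python
from bisect import insort
from math import inf

def core_properties(candidates, exact_dist, min_pts, eps):
    """candidates: (object, filter_distance) pairs from the filter range query.
    exact_dist(obj): expensive exact distance of obj to the query object.
    Returns (core_distance, {object: exact distance actually computed})."""
    ordered = sorted(candidates, key=lambda c: c[1])  # ascending filter distance
    exact_sorted = [0.0]  # assumption: the query object is its own neighbour
    refined = {}
    for obj, d_f in ordered:
        # Stop once the min_pts-th smallest exact distance cannot be beaten:
        # every remaining object is at least d_f away (lower-bounding property).
        if len(exact_sorted) >= min_pts and exact_sorted[min_pts - 1] <= d_f:
            break
        refined[obj] = exact_dist(obj)
        insort(exact_sorted, refined[obj])
    found = len(exact_sorted) >= min_pts and exact_sorted[min_pts - 1] <= eps
    return (exact_sorted[min_pts - 1] if found else inf), refined

# Values from the slide; exact distances exist only for K, Z and R,
# so a lookup for M, A or I would fail and reveal an unnecessary computation.
filter_d = [("K", 10), ("Z", 12), ("R", 18), ("M", 55), ("A", 58), ("I", 65)]
exact_d = {"K": 53, "Z": 69, "R": 49}
core, refined = core_properties(filter_d, exact_d.__getitem__, min_pts=3, eps=75)
```

The exact distances for M, A and I are never requested, because their filter distances already exceed the core-distance found after refining K, Z and R.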

25-31 Integrated Multi-Step Clustering Approach: Data Structure "List of Lists"
Additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
Extended Seedlist
- the first elements are ascendingly ordered
- each list of predecessor objects is ascendingly ordered
Entries before the insertion: d_f(R,B)=18, d_f(R,D)=34, d_f(K,B)=20, d_f(K,L)=30, d_f(K,G)=43, d_f(K,C)=55, d_o(M,C)=65
Result list of the current query object Q which has to be inserted into the extended seedlist: d_o(K,Q)=53, d_o(R,Q)=53, d_o(Z,Q)=69, d_f(M,Q)=55, d_f(A,Q)=58, d_f(I,Q)=65
Slides 26 to 31 insert the entries of the result list one by one (K, Z, R, M, A, I); the entry d_f(K,C)=55 is no longer shown once d_o(K,Q)=53 has been inserted.

32-34 Integrated Multi-Step Clustering Approach: Determination of Next Query Object
Data Structure "List of Lists": additional information about possible predecessor objects is stored in order to postpone exact distance calculations as long as possible.
Extended seedlist at the start: d_f(R,B)=18, d_f(R,D)=34, d_o(R,Q)=53; d_f(K,B)=20, d_f(K,L)=30, d_f(K,G)=43, d_o(K,Q)=53; d_f(M,Q)=55, d_o(M,C)=65; d_f(A,Q)=58; d_f(I,Q)=65; d_o(Z,Q)=69
Slide 32: the first entry d_f(R,B)=18 is refined to d_o(R,B)=44.
Slide 33: d_o(R,B)=44 is re-inserted into the list of R, behind d_f(R,D)=34.
Slide 34: the entry d_f(K,B)=20 is refined to d_o(K,B)=25.
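One possible reading of slides 25 to 34 is sketched below; it is an interpretation of the transcript, not the authors' implementation. The extended seedlist is modelled as a dictionary that maps each object to an ascending list of (value, is_exact, predecessor) entries, and exact_value is a hypothetical callback for the expensive computation. The first entry of the list with the smallest head is refined until that head is an exact value.

```python
def next_query_object(seedlist, exact_value):
    """seedlist: {object: ascending list of (value, is_exact, predecessor)}.
    exact_value(obj, pred): expensive exact value replacing a filter entry.
    Returns (object, value, predecessor, number of exact computations)."""
    computed = 0
    while seedlist:
        # The list whose first element is smallest holds the best lower bound.
        obj = min(seedlist, key=lambda o: seedlist[o][0][0])
        value, is_exact, pred = seedlist[obj][0]
        if is_exact:
            del seedlist[obj]  # obj becomes the next query object
            return obj, value, pred, computed
        # Head is only a filter value: refine it and re-insert it in order.
        refined = (exact_value(obj, pred), True, pred)
        computed += 1
        seedlist[obj] = sorted(seedlist[obj][1:] + [refined])
    return None

# Entries as shown on slide 32 (False = filter value, True = exact value).
seedlist = {
    "R": [(18, False, "B"), (34, False, "D"), (53, True, "Q")],
    "K": [(20, False, "B"), (30, False, "L"), (43, False, "G"), (53, True, "Q")],
    "M": [(55, False, "Q"), (65, True, "C")],
    "A": [(58, False, "Q")],
    "I": [(65, False, "Q")],
    "Z": [(69, True, "Q")],
}
exact = {("R", "B"): 44, ("K", "B"): 25}  # exact values from slides 32 and 34
chosen = next_query_object(seedlist, lambda o, p: exact[(o, p)])
```

With these values only two exact computations are needed: d_f(R,B)=18 becomes 44, d_f(K,B)=20 becomes 25, and K is selected because its exact head 25 is below every other head in the seedlist.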

35 Outline
- Foundations of Density-Based Clustering: Core Object · Density-Reachability · DBSCAN · OPTICS
- Clustering of Complex Objects: Direct Integration of the Multi-Step Query Processing Paradigm
- Experimental Evaluation

36 Experimental Evaluation: Test Data Sets
- High-dimensional feature vectors representing CAD objects [DASFAA 03]: not very selective filter used (Euclidean norm)
- Graphs representing images [DAWAK 03]: expensive exact distance function, selective filter used

37 Experimental Evaluation: DBSCAN
[Plots: runtime [sec.] vs. no. of objects, one for feature vectors and one for graphs]
- Even non-selective filters (feature vectors) help to accelerate DBSCAN by up to an order of magnitude when the new integrated multi-step query processing approach is used.
- The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small ε values are used.
- When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.

38 Experimental Evaluation: OPTICS
[Plots: runtime [sec.] vs. no. of objects, one for feature vectors and one for graphs]
- When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
- For high ε values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.

39 Conclusions
Summary: "Efficient Density-Based Clustering of Complex Objects"
- direct integration of the multi-step query processing paradigm into the clustering algorithm
- MinPts-nearest neighbor queries on the exact information
- postponing expensive exact distance computations as long as possible
Future Work
- integration of the multi-step query processing paradigm into other data mining algorithms

