1 Efficient and Effective Clustering Methods for Spatial Data Mining Raymond T. Ng, Jiawei Han Pavan Podila COSC 6341, Fall ‘04.

Presentation transcript:

1 Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng, Jiawei Han
Pavan Podila, COSC 6341, Fall '04

2 Overview
- Spatial Data Mining
- Clustering techniques
- CLARANS
- Spatial and Non-Spatial dominant CLARANS
- Observations
- Summary

4 Spatial Data Mining
- Identifying interesting relationships and characteristics that may exist implicitly in spatial databases
- Spatial databases differ from relational databases:
  - Spatial objects store both spatial and non-spatial attributes
  - Queries ("All Walmart stores within 10 miles of UH")
  - Spatial joins, which work on spatial indexes (R-tree)
  - Huge sizes (terabytes)
- GIS is a classic example

6 Partitioning Methods
- Given K, the number of partitions to create, a partitioning method constructs initial partitions. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.
- [Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)
- Selected algorithms:
  1. K-medoids
  2. PAM
  3. CLARA
  4. CLARANS
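
The quality measure on this slide can be sketched in a few lines. This is only an illustration: it assumes points are coordinate tuples and that dissimilarity is Euclidean distance, and the function name `average_dissimilarity` is made up for the example.

```python
import math

def average_dissimilarity(points, medoids):
    """Average distance from each point to its nearest medoid (lower is better)."""
    total = sum(min(math.dist(p, m) for m in medoids) for p in points)
    return total / len(points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]

# One medoid per natural group: every point is at distance 0 or 1.
print(average_dissimilarity(points, [(0, 0), (10, 10)]))  # 0.5

# Both medoids in the same group: the far group is badly represented.
print(average_dissimilarity(points, [(0, 0), (0, 1)]))
```

The algorithms listed on the slide differ mainly in how they search for the medoid set that minimizes this value.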

7 K-Medoids
- Partition-based clustering (K partitions)
- Effective, why?
  - Resistant to outliers
  - Does not depend on the order in which data points are examined
  - Cluster center is part of the dataset, unlike k-means, where the cluster center is gravity based
  - Experiments show that large data sets are handled efficiently
[Figure: K-means vs. K-medoids]

8 PAM (Partitioning Around Medoids)
- [Goal]: Find K representative objects of the data set. Each of the K objects is called a medoid, the most centrally located object within a cluster.

9 PAM (2)
- Start with K data points designated as medoids.
- Create a cluster around each medoid by assigning data points to the closest medoid: $O_j$ belongs to the cluster of $O_i$ if $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$, where $O_e$ ranges over the medoids.
- Iteratively replace $O_i$ with $O_h$ if the quality of clustering improves.
- A swapping cost, $C_{ijh}$, is associated with replacing a selected object $O_i$ with a non-selected object $O_h$.
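
A minimal sketch of the swap loop described on this slide, assuming Euclidean distance and taking the first K points as the initial medoids (the slide does not say how they are chosen). Instead of computing the per-object swapping cost, the sketch recomputes the total cost of each candidate medoid set, which is simpler but slower. The names `total_cost` and `pam` are made up for the example.

```python
import math

def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)

def pam(points, k):
    medoids = list(range(k))  # assumption: first k points are the initial medoids
    best = total_cost(points, medoids)
    while True:
        best_swap = None
        # Try replacing every selected object O_i with every non-selected O_h.
        for pos in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_cost(points, candidate)
                if cost < best:  # this swap improves the clustering
                    best, best_swap = cost, candidate
        if best_swap is None:  # no swap improves the cost: local minimum
            return sorted(medoids), best
        medoids = best_swap

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(pam(points, 2))  # ([0, 3], 4.0)
```

On this toy data the search moves one medoid into the second group in a single swap and then stops, because no further swap lowers the cost.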

10 PAM (3)
- $O(k(n-k)^2)$ for each iteration
- Good for small data sets (n=100, k=5)

11 CLARA (Clustering LARge Applications)
- Improvement over PAM
- Finds medoids in a sample from the dataset
- [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the dataset
- [Heuristics]: 5 samples of size 40+2k give satisfactory results
- Works well for large datasets (n=1000, k=10)
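
The sampling idea can be sketched as follows. This is an illustration under assumptions, not the original CLARA code: points are distinct coordinate tuples, dissimilarity is Euclidean distance, the helper `pam` is a simplified swap search, and each sample's medoids are scored against the full data set so that the best of the 5 samples is kept.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Simplified swap search on a small list of distinct points."""
    medoids = points[:k]
    best = total_cost(points, medoids)
    while True:
        best_swap = None
        for pos in range(k):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_cost(points, candidate)
                if cost < best:
                    best, best_swap = cost, candidate
        if best_swap is None:
            return medoids
        medoids = best_swap

def clara(points, k, num_samples=5, seed=0):
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)      # sample size heuristic from the slide
    best_medoids, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, size)
        medoids = pam(sample, k)             # cluster only the sample
        cost = total_cost(points, medoids)   # but score on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Two well-separated 10x10 grids of points.
grid = [(x, y) for x in range(10) for y in range(10)]
points = grid + [(x + 100, y + 100) for x, y in grid]
medoids, cost = clara(points, 2)
print(medoids, round(cost, 1))
```

The expensive swap search only ever sees 40+2k points, which is why CLARA scales to data sets where PAM does not.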

13 CLARANS (Clustering Large Applications based on RANdomized Search)
- A graph abstraction, $G_{n,k}$
- Each vertex is a collection of k medoids
- Two nodes $S_1$ and $S_2$ are neighbors if $|S_1 \cap S_2| = k - 1$
- Each node has k(n-k) neighbors
- Cost of each node is the total dissimilarity of objects to their medoids
- PAM searches the whole graph
- CLARA searches a subgraph
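
To make the graph abstraction concrete, the sketch below enumerates the neighbors of one node. It assumes objects are identified by the integers 0..n-1 and that a node is a `frozenset` of k such indices; the function name `neighbors` is made up for the example.

```python
def neighbors(node, n):
    """Yield every node that differs from `node` in exactly one medoid."""
    for out in node:              # medoid to drop
        for h in range(n):        # non-medoid to add
            if h not in node:
                yield (node - {out}) | {h}

node = frozenset({0, 3})          # k = 2 medoids out of n = 6 objects
nbrs = list(neighbors(node, n=6))
print(len(nbrs))                  # 8, which equals k * (n - k) = 2 * 4
```

Every neighbor shares exactly k - 1 medoids with the original node, so moving along an edge of the graph corresponds to one medoid swap.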

14 CLARANS (2)
- Experimental values:
  - numLocal = 2
  - maxNeighbors = max(1.25% of k(n-k), 250)
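
A rough sketch of how the two parameters drive the randomized search, assuming Euclidean distance and a seeded random generator for repeatability. The function names are made up for the example, and the search recomputes the full cost of each neighbor rather than an incremental swap cost.

```python
import math
import random

def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    rng = random.Random(seed)
    n = len(points)
    if max_neighbors is None:
        # Experimental setting from the slide: max(1.25% of k(n-k), 250)
        max_neighbors = max(250, int(0.0125 * k * (n - k)))
    best_node, best_cost = None, math.inf
    for _ in range(num_local):                    # number of local minima to find
        current = rng.sample(range(n), k)         # random starting node
        current_cost = total_cost(points, current)
        tried = 0
        while tried < max_neighbors:
            # Pick a random neighbor: swap one medoid for one non-medoid.
            pos = rng.randrange(k)
            h = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:pos] + [h] + current[pos + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:               # move and reset the counter
                current, current_cost, tried = neighbor, cost, 0
            else:
                tried += 1
        if current_cost < best_cost:              # keep the best local minimum
            best_node, best_cost = sorted(current), current_cost
    return best_node, best_cost

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(clarans(points, 2))
```

Unlike PAM, only a random subset of neighbors is examined at each node, and unlike CLARA, the search is never restricted to a fixed sample of the data.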

15 CLARANS (3)
- Outperforms PAM and CLARA in terms of running time and quality of clustering
- $O(n^2)$ for each iteration
[Figures: CLARANS vs. PAM; CLARANS vs. CLARA]

17 Generalization
- Useful to mine non-spatial attributes
- Process of merging tuples based on a concept hierarchy
- DBLearn takes an SQL query, a generalization hierarchy and a threshold
[Figure: initial relation and generalized relation for Sphere(color, diameter)]
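
The merging step can be illustrated with a toy version of the Sphere(color, diameter) relation. Everything here is hypothetical: the color hierarchy, the diameter cut-off and the function names are invented for the example, and the sketch climbs only one level of the hierarchy without DBLearn's threshold test.

```python
from collections import Counter

# Hypothetical concept hierarchy for the color attribute.
COLOR_HIERARCHY = {"crimson": "red", "scarlet": "red", "navy": "blue", "sky": "blue"}

def size_class(diameter):
    """Hypothetical generalization of a numeric diameter."""
    return "small" if diameter < 10 else "large"

def generalize(spheres):
    """Replace each value by its parent concept and merge identical tuples, keeping a count."""
    return Counter((COLOR_HIERARCHY.get(c, c), size_class(d)) for c, d in spheres)

spheres = [("crimson", 3), ("scarlet", 5), ("navy", 12), ("sky", 15), ("navy", 4)]
generalized = generalize(spheres)
for row, count in generalized.items():
    print(row, count)
```

Five initial tuples collapse into three generalized tuples, each carrying the number of original tuples it covers.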

18 Silhouette
- Silhouette of object $O_j$ determines how much $O_j$ belongs to its cluster
  - Between -1 and 1
  - 1 indicates a high degree of membership
- Silhouette width of a cluster: average silhouette of all objects in the cluster
- Silhouette coefficient: average silhouette width of the k clusters

Silhouette width    Interpretation
0.71 – 1            Strong cluster
0.51 – 0.7          Reasonable cluster
0.26 – 0.5          Weak or artificial cluster
≤ 0.25              No cluster found
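
The slide does not give the formula, so the sketch below uses the standard silhouette definition: for an object, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). It assumes Euclidean distance, at least two clusters, and a silhouette of 0 for single-object clusters.

```python
import math

def silhouette(i, points, labels):
    """Silhouette of object i, a value between -1 and 1."""
    def mean_dist(cluster):
        members = [j for j, c in enumerate(labels) if c == cluster and j != i]
        if not members:
            return None
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    a = mean_dist(labels[i])
    if a is None:                 # object is alone in its cluster
        return 0.0
    b = min(mean_dist(c) for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

def silhouette_coefficient(points, labels):
    """Average of the silhouette widths of the clusters."""
    widths = []
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        widths.append(sum(silhouette(i, points, labels) for i in members) / len(members))
    return sum(widths) / len(widths)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_coefficient(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1, 0, 1])
print(round(good, 2), round(bad, 2))
```

The labeling that follows the two natural groups scores well above 0.71, the range the table calls a strong cluster, while the interleaved labeling scores far lower.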

19 SD and NSD approach
- SD: Spatial Dominant
- NSD: Non-Spatial Dominant
- Clustering for spatial attributes, generalization for non-spatial attributes
- Dominance is decided by what is carried out first (clustering or generalization)
- Second phase works on tuples from the previous stage

20 SD(CLARANS)
- Finds non-spatial generalizations from spatial clustering
- Value for $K_{nat}$ is determined through heuristics using the silhouette coefficients
- Clustering phase can be treated as finding a spatial generalization hierarchy

21 NSD(CLARANS)
- Finds spatial clusters from non-spatial generalizations
- Clusters may overlap

23 Observations
- In all previous methods, quality of mining depends on the SQL query
- CLARANS assumes that the entire dataset is in memory, which is not always the case for large data sets
- Quality of results cannot be guaranteed when N is very large, due to the randomized search

24 Observations (2)
- Other clustering algorithms proposed for spatial data mining:
  - Hierarchical: BIRCH
  - Density based: DBSCAN, GDBSCAN, DBRS
  - Grid based: STING

25 Summary
- A seminal paper on the use of clustering for spatial data mining
- CLARANS is an effective clustering technique for large datasets
- SD(CLARANS)/NSD(CLARANS) are effective spatial data mining algorithms

26 References
Primary
- Efficient and Effective Clustering Methods for Spatial Data Mining (1994) - Raymond T. Ng, Jiawei Han
Secondary
- CLARANS: A Method for Clustering Objects for Spatial Data Mining - Raymond T. Ng, Jiawei Han
- Clustering for Mining in Large Spatial Databases - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
- An Introduction to Spatial Database Systems - Ralf Hartmut Güting