Discovering Interesting Regions in Spatial Data Sets using Supervised Clustering Christoph F. Eick, Banafsheh Vaezian, Dan Jiang, Jing Wang PKDD Conference,

Presentation transcript:

Discovering Interesting Regions in Spatial Data Sets using Supervised Clustering
Christoph F. Eick, Banafsheh Vaezian, Dan Jiang, Jing Wang
PKDD Conference, Berlin, Sept. 21, 2006
Department of Computer Science, University of Houston, Texas, USA
Organization
1. Motivation: Examples of Region Discovery
2. Region Discovery Framework
3. A Family of Clustering Algorithms for Region Discovery
4. Experimental Evaluation
5. Related Work
6. Generalizability of the Region Discovery Framework
7. Conclusion

Ch. Eick: Discovering Interesting Regions in Spatial Data Sets using Supervised Clustering, PKDD 2006
Motivation: Examples of Region Discovery
Application 1: Hot-spot Discovery [this paper]
Application 2: Regional Association Rule Mining [DEWY06]
  1. Find regions
  2. Mine regional association rules
Application 3: Find Interesting Regions with respect to a Continuous Variable
Application 4: Regional Co-location Mining
Application 5: Find "representative" regions (Sampling)
[Figure: output of the RD-Algorithm for wells in Texas. Green: safe well with respect to arsenic; red: unsafe well. Two panels, for parameter values 1.01 and 1.04.]

Region Discovery Framework
We assume we have spatial or spatio-temporal datasets that have the following structure:
(x, y, [z], [t]; <non-spatial attributes>)
e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable)
Clustering occurs in the (x, y, [z], [t])-space; regions are found in this space.
The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.

Region Discovery Framework Continued
The algorithms we currently investigate solve the following problem:
Given:
  A dataset O with a schema R
  A distance function d defined on instances of R
  A fitness function q(X) that evaluates a clustering X = {c_1, …, c_k} as follows:
    q(X) = Σ_{c ∈ X} reward(c) × size(c)^β, with β > 1
Objective: Find c_1, …, c_k ⊆ O such that:
1. c_i ∩ c_j = ∅ if i ≠ j
2. X = {c_1, …, c_k} maximizes q(X)
3. All clusters c_i ∈ X are contiguous (each pair of objects belonging to c_i has to be Delaunay-connected with respect to c_i and to d)
4. c_1 ∪ … ∪ c_k ⊆ O
5. c_1, …, c_k are frequently ranked based on the reward each cluster receives, and low-reward clusters are not reported
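The fitness function above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a clustering is a list of collections of object ids and that `reward` is a caller-supplied function. The names `fitness`, `is_disjoint`, and `reward` are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Sequence

Cluster = Sequence[Hashable]  # assumption: a cluster is a sequence of object ids


def fitness(clustering: Iterable[Cluster],
            reward: Callable[[Cluster], float],
            beta: float = 1.1) -> float:
    """q(X) = sum over clusters c of reward(c) * size(c)**beta, with beta > 1."""
    if beta <= 1:
        raise ValueError("beta must be greater than 1")
    return sum(reward(c) * len(c) ** beta for c in clustering)


def is_disjoint(clustering: Iterable[Cluster]) -> bool:
    """Check condition 1: no object belongs to more than one cluster."""
    seen = set()
    for cluster in clustering:
        for obj in cluster:
            if obj in seen:
                return False
            seen.add(obj)
    return True


# With beta > 1, one large cluster scores higher than two smaller ones
# that receive the same reward, so the exponent favors larger regions.
constant_reward = lambda c: 1.0
merged = fitness([[1, 2, 3, 4]], constant_reward, beta=1.1)
split = fitness([[1, 2], [3, 4]], constant_reward, beta=1.1)
```

Because size(c) is raised to a power greater than 1, `merged` exceeds `split` here, which is how β acts as a penalty on the number of clusters.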

Example of a Fitness Function for Hot Spot Discovery
Class of Interest: Unsafe_Well
Prior Probability: 20%
γ1 = 0.5, γ2 = 1.5; R+ = 1, R- = 1; β = 1.1, η = 1
Reward thresholds: prior × γ1 = 10%, prior × γ2 = 30%
               Cluster 1     Cluster 2      Cluster 3     Cluster 4       Cluster 5
|c|            50            200            200           350             200
P(c, Unsafe)   20/50 = 40%   40/200 = 20%   10/200 = 5%   30/350 = 8.6%   100/200 = 50%
Reward
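To make the example concrete, the sketch below computes a reward for each of the five clusters. The slide does not spell out the reward formula, so the code assumes a piecewise form: zero between prior × γ1 and prior × γ2, and outside that band the normalized distance to the nearer threshold, raised to η and scaled by R- or R+. The function name `hotspot_reward` and this exact formula are assumptions, not taken from the transcript.

```python
def hotspot_reward(p: float, prior: float, gamma1: float, gamma2: float,
                   r_plus: float, r_minus: float, eta: float) -> float:
    """Assumed piecewise reward for a cluster whose class-of-interest share is p."""
    low, high = prior * gamma1, prior * gamma2   # 10% and 30% in the example
    if p < low:                                  # unusually few unsafe wells
        return ((low - p) / low) ** eta * r_minus
    if p > high:                                 # unusually many unsafe wells
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0                                   # close to the prior: no reward


# (unsafe wells, cluster size) for clusters 1 to 5, as given in the example
clusters = [(20, 50), (40, 200), (10, 200), (30, 350), (100, 200)]
beta = 1.1

rewards = [hotspot_reward(unsafe / size, prior=0.2, gamma1=0.5, gamma2=1.5,
                          r_plus=1.0, r_minus=1.0, eta=1.0)
           for unsafe, size in clusters]

# q(X) = sum of reward(c) * size(c)**beta over all clusters
q = sum(r * size ** beta for r, (_, size) in zip(rewards, clusters))
```

Under this assumed formula, cluster 2 sits exactly at the prior and earns no reward, while cluster 3 (5% unsafe) is rewarded for being unusually safe.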

Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture "what domain experts find interesting in spatial datasets"
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward)
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6

A Family of Clustering Algorithms for Region Discovery
1. Supervised Partitioning Around Medoids (SPAM)
2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
3. Supervised Clustering using Evolutionary Computing (SCEC)
4. Agglomerative Hierarchical Supervised Clustering (SCAH)
5. Hierarchical Grid-based Supervised Clustering (SCHG)
6. Supervised Clustering using Multi-Resolution Grids (SCMRG)
7. Representative-based Clustering with Gabriel Graph Based Post-processing (SCEC+GGP / SRIDHCR+GGP)
8. Supervised Clustering using Density Estimation Techniques (SCDE)
Remark: For more details about SCEC, SPAM, and SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCAH, SCHG, and SCMRG.

SCAH (Agglomerative Hierarchical)
Inputs:
  A dataset O = {o_1, ..., o_n}
  A distance matrix D = {d(o_i, o_j) | o_i, o_j ∈ O}
Output: Clustering X = {c_1, …, c_k}
Algorithm:
1) Initialize:
   Create single-object clusters: c_i = {o_i}, 1 ≤ i ≤ n
   Compute merge candidates based on "nearest clusters"
2) DO FOREVER
   a) Find the pair (c_i, c_j) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X = {c_1, …, c_k}
   c) Delete the two clusters c_i and c_j from X and add the cluster c_i ∪ c_j to X
   d) Update inter-cluster distances incrementally
   e) Update merge candidates based on inter-cluster distances
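The loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the "nearest clusters" are found with single-link distance, and distances and candidates are recomputed in every round instead of being updated incrementally (steps d and e). The toy reward, which credits clusters dominated by label 1, and the names `scah`, `merge_candidates`, and `toy_reward` are hypothetical.

```python
import math


def q(clusters, reward, beta):
    """Fitness q(X) = sum of reward(c) * |c|**beta."""
    return sum(reward(c) * len(c) ** beta for c in clusters)


def single_link(a, b, points):
    """Assumed inter-cluster distance: closest pair of objects."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def merge_candidates(clusters, points):
    """Pairs (i, j) in which one cluster is the nearest neighbor of the other."""
    pairs = set()
    for i, a in enumerate(clusters):
        others = [j for j in range(len(clusters)) if j != i]
        if others:
            j = min(others, key=lambda k: single_link(a, clusters[k], points))
            pairs.add((min(i, j), max(i, j)))
    return pairs


def scah(points, reward, beta):
    clusters = [frozenset([i]) for i in range(len(points))]   # step 1
    while True:                                               # step 2
        base = q(clusters, reward, beta)
        best_gain, best = 0.0, None
        for i, j in merge_candidates(clusters, points):       # step 2a
            rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
            trial = rest + [clusters[i] | clusters[j]]
            gain = q(trial, reward, beta) - base
            if gain > best_gain:
                best_gain, best = gain, trial
        if best is None:                                      # step 2b
            return clusters
        clusters = best                                       # step 2c


# Toy data: three "unsafe" points close together, three "safe" points far away.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
labels = [1, 1, 1, 0, 0, 0]


def toy_reward(cluster):
    share = sum(labels[i] for i in cluster) / len(cluster)
    return max(0.0, (share - 0.5) / 0.5)   # reward only above the 50% prior


result = scah(points, toy_reward, beta=1.1)
```

On this toy input the three unsafe points end up in one cluster, and the search stops once no candidate merge improves q(X), which mirrors the termination rule in step 2b.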

SCHG (Hierarchical Grid-based)
Remark: Same as SCAH, but uses grid cells as initial clusters
Inputs:
  A dataset O = {o_1, ..., o_n}
  A grid structure G
Output: Clustering X = {c_1, …, c_k}
Algorithm:
1) Initialize:
   Create clusters, making each single non-empty grid cell a cluster
   Compute merge candidates (all pairs of neighboring grid cells)
2) DO FOREVER
   a) Find the pair (c_i, c_j) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X = {c_1, …, c_k}
   c) Delete the two clusters c_i and c_j from X and add the cluster c' = c_i ∪ c_j to X
   d) Update merge candidates: ∀c ∈ X: MC(c', c) ⇔ MC(c, c_i) ∨ MC(c, c_j)
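The grid initialization and the merge-candidate update in step d can be sketched as below. This is an illustration under assumptions the slide does not state: the grid is a uniform 2-D grid with square cells, and two cells are neighbors when they share an edge. The function names are hypothetical.

```python
from collections import defaultdict


def grid_clusters(points, cell_size):
    """Step 1: each non-empty grid cell becomes an initial cluster."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)
    return dict(cells)


def initial_candidates(cells):
    """Merge candidates: pairs of non-empty cells that share an edge (assumed)."""
    neighbors = {cell: set() for cell in cells}
    for (i, j) in cells:
        for other in ((i + 1, j), (i, j + 1)):
            if other in cells:
                neighbors[(i, j)].add(other)
                neighbors[other].add((i, j))
    return neighbors


def merge(neighbors, ci, cj, new):
    """Step d: the merged cluster is a candidate partner of every cluster
    that was a candidate partner of ci or of cj."""
    merged = (neighbors.pop(ci) | neighbors.pop(cj)) - {ci, cj}
    for c in merged:
        neighbors[c] -= {ci, cj}
        neighbors[c].add(new)
    neighbors[new] = merged
    return neighbors


points = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5), (0.5, 5.5)]
cells = grid_clusters(points, cell_size=1.0)
neighbors = initial_candidates(cells)
neighbors = merge(neighbors, (0, 0), (1, 0), new="c'")
```

Keeping the candidates as a neighbor map means the update after a merge only touches the clusters adjacent to the two merged ones, rather than rescanning the whole grid.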

Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy
1. If a cell receives a reward that is larger than the sum of the rewards of its ancestors: return that cell.
2. If a cell and its ancestor do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).

Experimental Evaluation
[Figures: Volcano dataset; Earthquake dataset]
    Dataset Name       # of objects   # of classes
1   B-Complex9         3,031          2
2   Volcano            1,533          2
3   Earthquake-1       3,161          3
4   Earthquake-10      31,614         3
5   Earthquake-100     …,148          3
6   Wyoming-Poverty    493,781        2

Experimental Results
[Results table: purity, quality, and number of clusters for SCAH, SCHG, and SCMRG on B-Complex9, Volcano, Earthquake-1, Earthquake-10, Earthquake-100, and Wyoming, under two parameter settings (β = 1.01, η = 6 and β = 3, η = 1). SCAH is marked DNF on Earthquake-10, Earthquake-100, and Wyoming for both settings.]

Experimental Evaluation
SCAH outperforms SCHG and SCMRG when the penalty for the number of clusters is very low (β = 1.01, η = 6). However, when SCAH runs out of pure clusters to merge, it has the tendency to terminate prematurely; therefore, it does quite poorly when the objective is to obtain large clusters (β = 3, η = 1).
SCHG outperforms SCMRG and SCAH for β = 3, η = 1.
SCMRG obtains better clusters than SCAH for the Volcano dataset for β = 1.01, η = 6, which can be attributed to the fact that SCMRG uses grid cells with different sizes.
Average wall-clock time for the smaller datasets, SCAH:SCMRG / SCAH:SCHG: 13:1 / 52:1
SCAH is not suitable to cope with dataset sizes of and more, mainly because of the large number of distance computations, large numbers of clusters, and merge steps needed.
The quality of the clustering obtained by SCMRG is strongly dependent on initial cluster sizes and on the look-ahead depth.

Problems with SCAH
No look ahead
Non-contiguous clusters [figure: rows of objects, XXX OOO OOO XXX]
Too restrictive definition of merge candidates

Related Work
In contrast to most work in spatial data mining, our work centers on creating regional knowledge and not global knowledge.
A lot of work in spatial data mining centers on partitioning a spatial dataset into "transactions" so that apriori-style algorithms can be used. We claim that our work can contribute to "finding such transactions" [DEWY06].
Our work is similar to work in supervised clustering and semi-supervised clustering in that it uses class labels in evaluating clusters.
Moreover, the goals of the algorithms presented in this paper are similar to those of hotspot discovery algorithms, a task that does not receive a lot of attention in spatial data mining, but receives more attention from scientists in earth sciences and related disciplines.

Generalizability
1. Find regions whose density/entropy/purity with respect to a class of interest is low/high → this talk
2. Find regions whose variance with respect to a continuous variable is low → contour maps
3. Find regions whose variance with respect to a continuous variable is high → …
4. Find regions whose distribution is similar to the distribution of the whole dataset → spatial sampling
5. Find regions in which the density of 2 or more classes is elevated → regional co-location mining

Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. Evidence concerning the usefulness of the framework for hot spot discovery problems has been presented.
3. As a by-product, some known and not so well known flaws of hierarchical clustering algorithms have been identified.
4. The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.

The Vision of the Presented Research
[Architecture figure: Region Discovery Engine. Components: Spatial Databases, Database Integration Tool, Data Set, Domain Expert, Measure of Interestingness Acquisition Tool, Fitness Function, Family of Clustering Algorithms, Ranked Set of Interesting Regions and their Properties, Visualization Tools, Region Discovery Display.]

Additional Transparencies Not Used for the PKDD 2006 Talk

Code SCMRG

Why should people use Region Discovery Engines (RDE)?
RDE: finds sub-regions with special characteristics in large spatial datasets and presents findings in an understandable form. This is important for:
Focused summarization
Finding interesting subsets in spatial datasets for further studies
Identifying regions with unexpected patterns; because they are unexpected, they deviate from global patterns; therefore, their regional characteristics are frequently important for domain experts
Without powerful region discovery algorithms, finding regional patterns tends to be haphazard, and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary
Exploratory data analysis for a mostly unknown dataset
Co-location statistics are frequently blurred when arbitrary region definitions are used. The true relationship between two co-occurring phenomena becomes invisible when averages are taken over regions in which a strong relationship is watered down by including objects that do not contribute to the relationship (example: high crime rates along the major rivers in Texas)
Data set reduction; focused sampling

Experimental Results: Volcano for β = 1.01, η = 6
[Figures: clustering results for SCAH, SCHG, and SCMRG]

Example Result SCMRG

Datasets Used
Obtained from the Geosciences Department at the University of Houston.
The Earthquake dataset contains all earthquake data worldwide collected by the United States Geological Survey (USGS) National Earthquake Information Center (NEIC).
The modified Earthquake dataset contains the longitude, latitude and a class variable that indicates the depth of the earthquake: 0 (shallow), 1 (medium) and 2 (deep).

Datasets Used
Wyoming datasets were created from U.S. Census 2000 data.
The Wyoming Modified Poverty Status in 1999 is a modified version of the original dataset, Wyoming Poverty Status.
The Wyoming Poverty Datasets were created using county statistics. For each county, random population coordinates were generated using the complete spatial randomness (CSR) functions in S-PLUS. Then, the background information was attached to each individual county based on the county's distribution for the class of interest. Finally, all counties were merged into a single dataset that describes the whole state.

Datasets Used
Obtained from the Geosciences Department at the University of Houston.
The Volcano dataset contains basic geographic and geologic information for volcanoes thought to be active in the last 10,000 years.
The original data include a unique volcano number, volcano name, location, latitude and longitude, summit elevation, volcano type, status and the time range of the last recorded eruption.
The subset of the volcano dataset used in this thesis contains longitude, latitude and a class variable that indicates if a volcano is non-violent (blue) or violent (red).

Another Example: Regional Co-location Mining
Task: Find co-location patterns for the following dataset.
[Figure: example dataset, with global co-location and regional co-location patterns marked]

A Co-Location Reward Framework
Task: Find regions in which the density of 2 or more classes is elevated.
One approach to measure class density elevation: In general, multipliers λ_C can be computed for every class C in a dataset, indicating how much the density of instances of class C is elevated in region r compared to their density in the whole space.
Example: Binary Co-Location Reward Framework
1. increase_C(r) = if λ_C(r) ≤ 1 then 0 else ((λ_C(r) – 1)/(1/(prior(C)-1)))^η
2. For two classes C1 and C2, the pairwise value for region r is increase_C1(r) × increase_C2(r)
3. reward(r) = maximum, over all pairs of classes C1 ≠ C2, of increase_C1(r) × increase_C2(r)
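The binary co-location reward can be sketched in Python as follows. The parenthesization of the normalizer on the slide is ambiguous, so this sketch makes an explicit assumption: it divides by (1/prior(C)) − 1, which keeps the increase between 0 and 1 when the multiplier is at most 1/prior(C). The per-class multipliers are taken as inputs, since the slide does not preserve how they are computed. The names `increase` and `colocation_reward` are hypothetical.

```python
from itertools import combinations
from typing import Dict


def increase(multiplier: float, prior: float, eta: float = 1.0) -> float:
    """Step 1: zero unless the class density is elevated (multiplier > 1).

    Assumption: the normalizer is (1 / prior) - 1.
    """
    if multiplier <= 1.0:
        return 0.0
    return ((multiplier - 1.0) / (1.0 / prior - 1.0)) ** eta


def colocation_reward(multipliers: Dict[str, float],
                      priors: Dict[str, float],
                      eta: float = 1.0) -> float:
    """Steps 2 and 3: best product of increases over all pairs of distinct classes."""
    inc = {c: increase(multipliers[c], priors[c], eta) for c in priors}
    return max((inc[a] * inc[b] for a, b in combinations(priors, 2)),
               default=0.0)


# Hypothetical region: classes A and B are both elevated, C is depleted.
priors = {"A": 0.2, "B": 0.25, "C": 0.55}
multipliers = {"A": 3.0, "B": 2.5, "C": 0.5}
reward = colocation_reward(multipliers, priors)
```

Because the reward is a product of two increases, a region earns a reward only if at least two classes are elevated at the same time; a single elevated class yields zero.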