Discovering Interesting Regions in Spatial Data Sets Christoph F. Eick for Data Mining Class 1.Motivation: Examples of Region Discovery 2.Region Discovery.


Discovering Interesting Regions in Spatial Data Sets Christoph F. Eick for Data Mining Class 1. Motivation: Examples of Region Discovery 2. Region Discovery Framework 3. A Fitness Function for Hotspot Discovery 4. Other Fitness Functions 5. A Family of Clustering Algorithms for Region Discovery 6. Summary

Ch. Eick: Introduction Region Discovery
1. Motivation: Examples of Region Discovery
Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with Respect to a Continuous Variable [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find "representative" regions (sampling)
Figure: Wells in Texas (green: safe well with respect to arsenic; red: unsafe well)

2. Region Discovery Framework
We assume we have spatial or spatio-temporal datasets with the following structure: (x, y, [z], [t]; <non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable). Clustering occurs in the (x, y, [z], [t])-space; regions are found in this space. The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself. For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.

Region Discovery Framework Continued
The algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates a clustering X = {c1, …, ck} as follows:
q(X) = Σc∈X reward(c) = Σc∈X interestingness(c)·size(c)^β, with β > 1
Objective: Find c1, …, ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X = {c1, …, ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous (each pair of objects belonging to ci has to be Delaunay-connected with respect to ci and to d)
4. c1 ∪ … ∪ ck ⊆ O
5. c1, …, ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
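The additive, reward-based fitness function above can be sketched in a few lines of Python. All names here are illustrative, not taken from an actual implementation of the framework; the interestingness measure is pluggable, as in the framework itself.

```python
def q(clustering, interestingness, beta=1.1):
    # q(X) = sum over clusters c of interestingness(c) * size(c)**beta, beta > 1
    return sum(interestingness(c) * len(c) ** beta for c in clustering)

# Toy usage: clusters are lists of (x, y, class) tuples; this hypothetical
# interestingness measure rewards clusters dominated by the class "unsafe".
def unsafe_fraction(cluster):
    return sum(1 for (_, _, cls) in cluster if cls == "unsafe") / len(cluster)

clusters = [
    [(0.0, 0.0, "unsafe"), (1.0, 0.0, "unsafe")],   # pure "unsafe" cluster
    [(5.0, 5.0, "safe"), (6.0, 5.0, "safe")],       # pure "safe" cluster
]
score = q(clusters, unsafe_fraction)   # only the first cluster earns a reward
```

Note that only the cluster sizes enter the size(c)^β term; the spatial coordinates matter for clustering, while the class attribute is consumed solely by the interestingness measure, mirroring the separation described above.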

Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high.
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture "what domain experts find interesting in spatial datasets".
3. Detection of regions at different levels of granularity (from very local to almost global patterns).
4. Detection of regions of arbitrary shapes.
5. Necessity to cope with very large datasets.
6. Regions should be properly ranked by relevance (reward); in many applications only the top-k regions are of interest.
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.

3. Fitness Function for Supervised Clustering
Class of interest: Unsafe_Well; prior probability: 20%.
Parameters: γ1 = 0.5, γ2 = 1.5; R+ = 1, R− = 1; β = 1.1.
Cluster 1: |c| = 50, P(c, Unsafe) = 20/50 = 40%
Cluster 2: |c| = 200, P(c, Unsafe) = 40/200 = 20%
Cluster 3: |c| = 200, P(c, Unsafe) = 10/200 = 5%
Cluster 4: |c| = 350, P(c, Unsafe) = 30/350 = 8.6%
Cluster 5: |c| = 200, P(c, Unsafe) = 100/200 = 50%
(The original slide also shows the reward each cluster receives, with thresholds of 10% and 30%.)

4. Fitness Functions for Other Region Discovery Tasks
4.1 Creating Contour Maps for Water Temperature (Temp)
1. Examples in the dataset WT have the form (x, y, temp); var(c,temp) denotes the variance of variable temp in region c.
2. interestingness(c) = IF var(c,temp) > var(WT,temp) THEN 0 ELSE min(1, log20(var(WT,temp)/var(c,temp))^η), with η being a parameter (default 1).
3. Basically, regions receive rewards if their variance is lower than the variance of the variable temperature for the whole dataset, and regions whose variance is at least 20 times lower receive the maximum reward of 1.
Fig. 1: Sea Surface Temperature in July. A single region and its summary: Mean = 11.2, Var = 2.2, Reward: 48.5, Rank: 3.

4.2 Finding Regions with High Water Temperature Differences
1. Examples in the dataset WT have the form (x, y, temp).
2. Fitness function: let c be a cluster to be evaluated:
interestingness(c) = IF var(c,temp) < var(WT,temp) THEN 0 ELSE min(1, log20(var(c,temp)/var(WT,temp))^η), with η being a parameter (default 1).
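Both variance-based interestingness measures (4.1 low-variance contour regions, 4.2 high-variance regions) can be written down directly. A minimal sketch, with `eta` standing in for the exponent parameter that defaults to 1:

```python
import math

def interestingness_low_var(var_c, var_wt, eta=1.0):
    # Section 4.1: reward regions whose variance is LOWER than the
    # dataset's; 20x lower (or more) earns the maximum reward of 1.
    if var_c > var_wt:
        return 0.0
    return min(1.0, math.log(var_wt / var_c, 20) ** eta)

def interestingness_high_var(var_c, var_wt, eta=1.0):
    # Section 4.2: the mirror image -- reward regions whose variance
    # is HIGHER than the dataset's.
    if var_c < var_wt:
        return 0.0
    return min(1.0, math.log(var_c / var_wt, 20) ** eta)
```

For example, a region whose temperature variance is exactly 20 times lower than the dataset variance gets log20(20) = 1, i.e. the maximum reward of 1 under 4.1.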

4.3 Programming Project Fitness Functions: Purity
We assume 3 classes. Region r1 contains 6 objects of class 1, 2 objects of class 2, and 2 objects of class 3, giving class counts (6, 2, 2); r2 has counts (0, 0, 5); r3 has counts (2, 2, 1).
We assume th = 0.5 and β = 2:
i(r1) = (0.6 − 0.5)² = 0.01
i(r2) = (1 − 0.5)² = 0.25
i(r3) = 0 (purity 0.4 is below th)
q(X) = q({r1, r2, r3}) = 0.01·10² + 0.25·5² + 0·5² = 7.25
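The purity computation can be checked with a short script. This is a sketch under the stated assumptions (threshold th = 0.5, squared interestingness, size exponent 2); function names are mine, not from the project code.

```python
def purity_interestingness(counts, th=0.5, exp=2):
    # counts: per-class object counts in a region; purity is the fraction
    # of the majority class. i(r) = (purity - th)**exp if purity > th, else 0.
    purity = max(counts) / sum(counts)
    return (purity - th) ** exp if purity > th else 0.0

def q(regions, th=0.5, exp=2, beta=2):
    # q(X) = sum of i(r) * size(r)**beta over all regions
    return sum(purity_interestingness(c, th, exp) * sum(c) ** beta for c in regions)

regions = [(6, 2, 2), (0, 0, 5), (2, 2, 1)]   # class counts for r1, r2, r3
score = q(regions)   # 0.01*10**2 + 0.25*5**2 + 0, i.e. 7.25
```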

Programming Project Fitness Function: Variance
Dataset O with Var(O) = 100; regions r1–r4 with var(r1) = 80, var(r2) = 200, var(r3) = 1100, var(r4) = 20.
We assume β = 1 and th = 1.5; i(r) = max(0, var(r)/Var(O) − th):
i(r1) = 0
i(r2) = (2 − 1.5) = 0.5
i(r3) = (11 − 1.5) = 9.5
i(r4) = 0
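The interestingness function implied by these numbers, i(r) = max(0, var(r)/Var(O) − th), takes one line; the sketch below reproduces the four values above (region names and variances from the slide, function name mine):

```python
def variance_interestingness(var_r, var_o, th=1.5):
    # reward regions whose variance exceeds the dataset variance
    # by more than the threshold factor th
    return max(0.0, var_r / var_o - th)

var_o = 100.0
region_vars = {"r1": 80.0, "r2": 200.0, "r3": 1100.0, "r4": 20.0}
scores = {r: variance_interestingness(v, var_o) for r, v in region_vars.items()}
# scores: r1 -> 0.0, r2 -> 0.5, r3 -> 9.5, r4 -> 0.0
```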

Interestingness Function: Binary Co-location
Binary co-location: i(o, {B1, B2}) = zB1(o)·zB2(o); a point labeled (−1, −4) means the z-value of B1 is −1 and the z-value of B2 is −4.
We assume β = 1, th = 0.1 and A = {B1, B2}.
r1 = {(1, 1), (−1, 1), (1, 0.6)}: i(r1) = |1 − 1 + 0.6|/3 − 0.1 = 0.1
r2 = {(−1, −4), (−0.5, −1), (−0.5, 0)}: i(r2) = |4 + 0.5 + 0|/3 − 0.1 = 1.4
r3: i(r3) = …
r4 = {(1, −1), (1, 1), (0.3, −0.1)}: i(r4) = 0 because |−1 + 1 − 0.03|/3 = 0.01 < 0.1
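The z-product interestingness can be verified mechanically. A sketch using the region data from the example (the function name is mine):

```python
def colocation_interestingness(points, th=0.1):
    # points: list of (z_B1, z_B2) pairs for the objects in a region;
    # i(o,{B1,B2}) = z_B1(o) * z_B2(o), and the region's interestingness
    # is the absolute mean product minus th (0 if the mean is below th).
    mean = abs(sum(z1 * z2 for z1, z2 in points)) / len(points)
    return mean - th if mean >= th else 0.0

r1 = [(1, 1), (-1, 1), (1, 0.6)]
r2 = [(-1, -4), (-0.5, -1), (-0.5, 0)]
r4 = [(1, -1), (1, 1), (0.3, -0.1)]
```

Note how r2 earns a large reward because both z-values are strongly negative together (a co-elevated deficit is still a co-location pattern), while the mixed signs in r4 cancel out.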

Programming Project Function: MSE
r1 = {(2, 2), (4, 4)}; r2 = {(−1, −1), (−7, −7), (−4, −4)}
MSE(r1) = (1² + 1² + 1² + 1²)/2 = 2
MSE(r2) = (3² + 3² + 3² + 3² + 0 + 0)/3 = 12
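MSE here is the mean squared Euclidean distance of a region's points to its centroid, which can be sketched as:

```python
def mse(points):
    # mean squared Euclidean distance of a region's points to its centroid
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n

r1 = [(2, 2), (4, 4)]      # centroid (3, 3)
r2 = [(-1, -1), (-7, -7), (-4, -4)]   # centroid (-4, -4)
# mse(r1) -> 2.0, mse(r2) -> 12.0
```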

4.4 Regional Co-location Mining
Global co-location: two classes are co-located in the whole dataset.
Regional co-location: co-location holds only in sub-regions R1, R2, R3, R4.
Task: Find co-location patterns for the given dataset.

A Reward Function for Binary Co-location
Task: Find regions in which the density of 2 or more classes is elevated. In general, a multiplier mC(r) is computed for every class C and region r, indicating how much the density of instances of class C is elevated in region r compared to C's density in the whole space, and the interestingness of a region with respect to two classes C1 and C2 is assessed proportionally to the product mC1(r)·mC2(r).
Example: Binary Co-location Reward Framework:
mC(r) = p(C,r)/prior(C)
MC1,C2 = 1/(prior(C1) + prior(C2)) ("maximum multiplier")
φC1,C2(r) = IF mC1(r) < 1 or mC2(r) < 1 THEN 0 ELSE sqrt((mC1(r) − 1)·(mC2(r) − 1))/(MC1,C2 − 1)
interestingness(r) = max over C1, C2 with C1 ≠ C2 of φC1,C2(r)
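Under these definitions, the pairwise reward for one class pair can be sketched as follows (function and variable names are mine; `p1`/`p2` are the classes' relative frequencies inside the region):

```python
from math import sqrt

def pair_interestingness(p1, prior1, p2, prior2):
    # multipliers: how much each class's density is elevated in the region
    m1 = p1 / prior1
    m2 = p2 / prior2
    max_mult = 1.0 / (prior1 + prior2)        # the "maximum multiplier"
    if m1 < 1 or m2 < 1:
        return 0.0                            # no reward unless BOTH are elevated
    return sqrt((m1 - 1) * (m2 - 1)) / (max_mult - 1)

# Region where both classes (priors of 20% each) double their density:
reward = pair_interestingness(0.4, 0.2, 0.4, 0.2)   # about 2/3
```

The geometric mean keeps the reward symmetric in the two classes, and dividing by the maximum multiplier normalizes it toward the [0, 1] range.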

The Ultimate Vision of the Presented Research
(Architecture diagram of a Region Discovery Engine: spatial databases feed a dataset through a database integration tool; a domain expert supplies a measure of interestingness via an acquisition tool, yielding a fitness function; a family of clustering algorithms performs region discovery; visualization tools display a ranked set of interesting regions and their properties.)

How to Apply the Suggested Methodology
1. With the assistance of domain experts, determine the structure of the dataset to be used.
2. Acquire a measure of interestingness for the problem at hand (this was purity, variance, MSE, and probability elevation of two or more classes in the examples discussed before).
3. Convert the measure of interestingness into a reward-based fitness function. The designed fitness function should assign a reward of 0 to "boring" regions. It is also a good idea to normalize rewards by limiting the maximum reward to 1.
4. After the region discovery algorithm has been run, rank and visualize the top k regions with respect to the rewards obtained (interestingness(c)·size(c)^β) and their properties, which are usually task-specific.

5. A Family of Clustering Algorithms for Region Discovery
1. Supervised Partitioning Around Medoids (SPAM)
2. Representative-based Clustering Using Randomized Hill Climbing (CLEVER)
3. Supervised Clustering using Evolutionary Computing (SCEC)
4. Single Representative Insertion/Deletion Hill Climbing with Restart (SRIDHCR)
5. Supervised Clustering using Multi-Resolution Grids (SCMRG)
6. Agglomerative Clustering (MOSAIC)
7. Supervised Clustering using Density Estimation Techniques (SCDE)
8. Clustering using Density Contouring (DCONTOUR)
Remark: For more details about SCEC, SPAM and SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCMRG.

CLEVER: see separate slideshow.

Steps of Grid-based Clustering Algorithms
Basic grid-based algorithm:
1. Define a set of grid cells.
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold τ.
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
Simple version of a grid-based algorithm: merge cells greedily as long as merging improves q(X).
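The four steps above can be turned into a compact sketch, assuming a uniform grid over 2-D points and 4-adjacency for contiguity (the greedy merge refinement from the last line is omitted):

```python
from collections import defaultdict

def grid_clusters(points, cell_size, min_density):
    # Steps 1-2: assign points to grid cells and count per-cell densities
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # Step 3: keep only cells whose density reaches the threshold
    dense = {c for c, pts in cells.items() if len(pts) >= min_density}
    # Step 4: form clusters from contiguous (4-adjacent) dense cells
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        stack, cluster = [cell], []
        seen.add(cell)
        while stack:
            i, j = stack.pop()
            cluster.append((i, j))
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(cluster)
    return clusters
```

Because the flood fill operates on cells rather than objects, the clustering cost depends on the number of populated cells, which is the complexity advantage discussed on the next slide.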

Advantages of Grid-based Clustering Algorithms
Fast:
– No distance computations
– Clustering is performed on summaries and not individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
Limitation: shapes are limited to unions of grid cells.

Ideas of SCMRG (Divisive, Multi-Resolution Grids)
Cell processing strategy:
1. If a cell receives a reward that is larger than the sum of the rewards of its children: return that cell.
2. If a cell and its children do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).
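The strategy can be sketched as a recursive drill-down. This is a simplification in the spirit of SCMRG, not its exact rules; `reward` and `children` are caller-supplied callbacks, and all names are hypothetical.

```python
def process_cell(cell, reward, children, depth=0, max_depth=3):
    # reward(cell): the cell's reward; children(cell): the sub-cells at the
    # next finer grid resolution (empty or None for leaf cells).
    r = reward(cell)
    kids = (children(cell) or []) if depth < max_depth else []
    if not kids:                              # finest resolution reached
        return [cell] if r > 0 else []
    kid_rewards = [reward(k) for k in kids]
    if r > 0 and r >= sum(kid_rewards):
        return [cell]                         # coarser cell beats its children
    if r == 0 and all(kr == 0 for kr in kid_rewards):
        return []                             # prune: no reward at either level
    found = []                                # otherwise drill down
    for kid in kids:
        found += process_cell(kid, reward, children, depth + 1, max_depth)
    return found
```

Usage with a toy two-level grid: if the root cell's reward is smaller than its children's combined reward, only the rewarding child is returned; if the root dominates, the coarse cell itself is kept.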

Code SCMRG: see separate slide.

Parameters SCMRG: see separate transparency.

6. Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. The framework finds interesting places and their associated patterns.
3. The framework extracts regional knowledge from spatial datasets.
4. The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.

Why Should People Use Region Discovery Engines (RDE)?
An RDE finds sub-regions with special characteristics in large spatial datasets and presents its findings in an understandable form. This is important for:
– Focused summarization.
– Finding interesting subsets in spatial datasets for further studies.
– Identifying regions with unexpected patterns; because they are unexpected, they deviate from global patterns, so their regional characteristics are frequently important for domain experts.
– Without powerful region discovery algorithms, finding regional patterns tends to be haphazard and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary.
– Exploratory data analysis for a mostly unknown dataset.
– Co-location statistics are frequently blurred when arbitrary region definitions are used, hiding the true relationship between two co-occurring phenomena: the relationship becomes invisible when averages are taken over regions in which it is watered down by objects that do not contribute to it (example: high crime rates along the major rivers in Texas).
– Dataset reduction; focused sampling.