
Corresponding Clustering: An Approach to Cluster Multiple Related Spatial Datasets
Vadeerat Rinsurongkawong and Christoph F. Eick
Department of Computer Science, University of Houston, USA

Organization
1. Motivation
2. Analyzing Related Datasets
3. Correspondence Clustering (Definition, Frameworks, Representative-based Correspondence Clustering Algorithms)
4. Assessing Agreement between Related Datasets
5. Experimental Evaluation
6. Conclusion and Future Work

1. Motivation

Clustering related datasets has many applications:
- Relating the habitats of animals and their sources of food
- Understanding changes in ozone concentrations due to industrial emissions of other pollutants
- Analyzing changes in water temperature

However, traditional clustering algorithms that cluster each dataset separately are not well suited to cluster related datasets:
- They do not consider correspondence between the datasets.
- The variance inherent in most clustering algorithms complicates analyzing related datasets.

2. Analyzing Related Datasets

Subtopics:
- Disparity Analysis / Emergent Pattern Discovery ("how do two groups differ with respect to their patterns?")
- Change Analysis ("what is new/different?") in temporal datasets; e.g., [CKT06] and [CSZHT07] utilize a concept of temporal smoothness which states that the clustering results for data in two consecutive time frames should not be dramatically different.
- Relational clustering [BBM07] clusters different types of objects based on their properties and relationships.
- Co-clustering [DMM03] partitions the rows and columns of a data matrix simultaneously to create clusters for two sets of objects.

Correspondence clustering centers on "mining good clusters in different datasets with interesting relationships between them".

Clustering with Plug-in Fitness Functions

- In the last 5 years, my research group has developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.
- This work is motivated by a mismatch between the evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.
- The presented paper generalizes this work to mine multiple datasets.

3. Correspondence Clustering

Definition: A correspondence clustering algorithm clusters the data in two or more datasets O = {O1,…,On} and generates clustering results X = {X1,…,Xn} such that, for 1 ≤ i ≤ n, Xi is created from Oi. The algorithm seeks clusterings Xi such that each Xi maximizes the interestingness i(Xi) with respect to Oi as well as the correspondence measure Corr(X1,…,Xn) between itself and the other clusterings Xj, 1 ≤ j ≤ n, j ≠ i.
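Stated compactly, and combining the two goals with a weight α as in the compound fitness functions used on the later slides (the weighted-sum form is one plausible reading; the definition itself only asks that both terms be maximized):

```latex
\max_{X_1,\dots,X_n}\;
\tilde{q}(X_1,\dots,X_n)
= \alpha \sum_{i=1}^{n} i(X_i) + (1-\alpha)\,\mathrm{Corr}(X_1,\dots,X_n),
\qquad \alpha \in [0,1]
```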

Example Correspondence Clustering

[Maps of the earthquake datasets: O1 (earthquakes 86-91) and O2 (earthquakes in a later period)]

Example: Analyze changes with respect to regions of high variance of earthquake depth. Find clusterings X1 for O1 and X2 for O2 maximizing the following objective function:

q̃(X1,X2) = α·(i(X1) + i(X2)) + (1−α)·Agreement(X1,X2)

where i(X) measures the interestingness of X based on the variance of earthquake depth within the clusters of X, and α determines the relative importance of per-dataset cluster quality and agreement between the two clusterings.
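A minimal Python sketch of this objective (all names are mine, and the exact reward the paper derives from depth variance is not specified here, so the variance term below is only illustrative):

```python
import numpy as np

def interestingness(clusters, depths):
    """i(X): reward clusterings whose clusters have a high variance in
    earthquake depth (illustrative; the paper's reward form may differ)."""
    total = 0.0
    for members in clusters:              # members: indices of one cluster's objects
        if len(members) > 1:
            total += len(members) * np.var(depths[members])
    return total

def compound_fitness(X1, X2, depths1, depths2, agreement, alpha=0.5):
    """q~(X1,X2) = alpha*(i(X1)+i(X2)) + (1-alpha)*Agreement(X1,X2)."""
    quality = interestingness(X1, depths1) + interestingness(X2, depths2)
    return alpha * quality + (1.0 - alpha) * agreement(X1, X2)
```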

What is Unique about Correspondence Clustering?

- It relies on clustering algorithms that support plug-in fitness functions, allowing for non-distance-based notions of interestingness.
- It is geared towards analyzing spatial datasets; the spatial attributes serve as the glue that relates the different spatial datasets.
- Correspondence clustering can be viewed as a multi-objective optimization problem in which we try to obtain good clusters in multiple datasets that also fit well with respect to a given correspondence relationship.

Algorithms for Correspondence Clustering

Two groups of algorithms can be distinguished:
- Iterative algorithms that improve the clustering of one dataset while keeping the clusters of the other datasets fixed.
- Concurrent algorithms that cluster all datasets in parallel.

In the following, an iterative representative-based correspondence clustering algorithm, C-CLEVER-I, will be briefly discussed.

Representative-based Clustering

[Figure: a 2D dataset (Attribute 1 vs. Attribute 2) partitioned by assigning objects to cluster representatives]

Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Characteristic: Clusters are formed by assigning each object to the closest representative.
Popular algorithms: K-means, K-medoids, CLEVER, …
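The defining assignment step as a short sketch (function and variable names are mine; `points` and `representatives` are assumed to be NumPy arrays of coordinates):

```python
import numpy as np

def assign_to_representatives(points, representatives):
    """Form clusters by assigning each object to its closest representative,
    the characteristic shared by K-means, K-medoids, and CLEVER."""
    # distance from every point to every representative: shape (n, k)
    d = np.linalg.norm(points[:, None, :] - representatives[None, :, :], axis=2)
    labels = d.argmin(axis=1)             # index of the nearest representative
    clusters = [np.where(labels == r)[0] for r in range(len(representatives))]
    return clusters, labels
```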

CLEVER [ACM-GIS’08]

- A representative-based clustering algorithm, similar to PAM.
- Searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination, and uses randomized hill climbing and adaptive sampling to reduce complexity.
- New solutions are generated in the neighborhood of the current solution by inserting, deleting, and replacing representatives.
- Searches for the optimal number of clusters.

C-CLEVER-I

Inputs: O1 and O2, TCond, k’, neighborhood-size, p, p’, α
Outputs: X1, X2, q(X1), q(X2), q̃(X1,X2), Corr(X1,X2)

Fitness functions (X1 and X2 are clusterings of O1 and O2; q is a single-dataset fitness function):
q̃(X1,X2) = α·(q(X1) + q(X2)) + (1−α)·Corr(X1,X2)
q̃1(X1) = α·q(X1) + (1−α)·Corr(X1,X2)   (q̃ with X2 fixed)
q̃2(X2) = α·q(X2) + (1−α)·Corr(X1,X2)   (q̃ with X1 fixed)

Algorithm:
1. Run CLEVER on dataset O1 with fitness function q, obtaining a clustering X1 and a set of representatives R1: (X1,R1) := Run CLEVER(O1, q)
2. Repeat until the termination condition TCond is met:
   a. Run CLEVER on dataset O2 with the compound fitness function q̃2, which uses the representatives R1 to calculate Corr(X1,X2): (X2,R2) := Run CLEVER(O2, R1, q̃2)
   b. Run CLEVER on dataset O1 with the compound fitness function q̃1, which uses the representatives R2 to calculate Corr(X1,X2): (X1,R1) := Run CLEVER(O1, R2, q̃1)
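The control flow as a Python sketch (`clever`, `make_compound`, and the fixed iteration count standing in for the generic termination condition TCond are all placeholders of mine):

```python
def c_clever_i(O1, O2, clever, q, make_compound, t_max=10):
    """Iterative correspondence clustering: re-cluster one dataset while the
    other dataset's representatives (and hence Corr) stay fixed, then swap.
    clever(dataset, fitness) -> (clustering, representatives) stands in for
    CLEVER with a plug-in fitness function."""
    X1, R1 = clever(O1, q)                    # 1. cluster O1 with plain q
    X2 = None
    for _ in range(t_max):                    # 2. repeat until TCond is met
        q2 = make_compound(q, fixed_reps=R1)  # q~2: Corr computed against R1
        X2, R2 = clever(O2, q2)               # 2a. re-cluster O2
        q1 = make_compound(q, fixed_reps=R2)  # q~1: Corr computed against R2
        X1, R1 = clever(O1, q1)               # 2b. re-cluster O1
    return X1, X2
```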

4. Assessing Agreement between Clusterings for Related Datasets

We assume that the two datasets share the same spatial attributes. The challenge of this task is that we do not have object identity; otherwise, we could directly compare the co-occurrence matrices M_X1 and M_X2 of the two clusterings.

Key idea: We can use the representatives of the clusters in one dataset to cluster the other dataset; then, for both datasets, we compute the similarity of the original clustering to the clustering obtained using the representatives of the other dataset. Finally, we assess agreement by averaging these two similarities.
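A sketch of this computation (the pair-counting similarity below is a Rand-index-style choice of mine, not necessarily the paper's measure; `assign(points, reps)` is assumed to return per-point nearest-representative labels, e.g. the `labels` output of the earlier sketch):

```python
import numpy as np

def partition_similarity(labels_a, labels_b):
    """Pair-counting similarity between two clusterings of the SAME objects,
    given as integer NumPy label arrays (Rand-index style)."""
    n = len(labels_a)
    same_a = labels_a[:, None] == labels_a[None, :]   # pair co-clustered in A?
    same_b = labels_b[:, None] == labels_b[None, :]   # pair co-clustered in B?
    agree = (same_a == same_b).sum() - n              # drop the diagonal
    return agree / (n * (n - 1))

def agreement(O1, O2, labels1, labels2, R1, R2, assign):
    """Average the two cross-dataset similarities described above."""
    sim1 = partition_similarity(labels1, assign(O1, R2))  # X1 vs. R2-induced clustering of O1
    sim2 = partition_similarity(labels2, assign(O2, R1))  # X2 vs. R1-induced clustering of O2
    return (sim1 + sim2) / 2.0
```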

5. Experimental Evaluation

[Maps of the earthquake datasets O1 and O2]

Example: Analyze changes with respect to regions of high variance of earthquake depth. Find clusterings X1 for O1 and X2 for O2 maximizing the following objective function:

q̃(X1,X2) = α·(i(X1) + i(X2)) + (1−α)·Agreement(X1,X2)

What is done in the experimental evaluation?
1. We compare running C-CLEVER-I with running CLEVER.
2. We analyze the potential of using agreement as a correspondence function to reduce the variance of clustering results.
3. We analyze different initialization strategies and parameter settings.

Comparison between CLEVER and C-CLEVER-I

[Table 5-3. Comparison of the average results of CLEVER and of C-CLEVER-I under two parameter settings (1.0e-5 and 2.0e-6), reporting fitness q(X1), fitness q(X2), q(X1) + q(X2), Agreement(X1,X2), the agreement among the X1’s, the agreement among the X2’s, and computation time]

Different Initialization Strategies

The following initialization strategies for C-CLEVER-I have been explored:
(1) Use random initial representatives (C-CLEVER-I-R)
(2) Use the nearest neighbors of the final representatives of the last iteration for the other dataset (C-CLEVER-I-C)
(3) Use the final representatives from the previous iteration for the same dataset (C-CLEVER-I-O)

[Table: comparison of the three initialization strategies C-CLEVER-I-C, C-CLEVER-I-O, and C-CLEVER-I-R, reporting compound fitness q̃(X1,X2), fitness q(X1), fitness q(X2), q(X1) + q(X2), Agreement(X1,X2), and computation time]

6. Conclusion

- A representative-based correspondence clustering framework has been introduced that relies on plug-in clustering and correspondence functions.
- Correspondence clustering algorithms that are generalizations of an algorithm called CLEVER were presented.
- Our experimental results suggest that correspondence clustering can reduce the variance inherent in representative-based clustering algorithms. Since the two datasets are related to each other, using one dataset to supervise the clustering of the other can lead to more reliable clusterings.
- As a by-product, a novel agreement assessment method has been introduced for comparing representative-based clusterings that originate from different datasets.

Future Work

- What about other correspondence measures, besides agreement and disagreement?
- What about applications that look for other forms of correspondence between clusters originating from different spatial datasets?
- Several implementation strategies for concurrent correspondence clustering are possible:
  – Cluster each dataset for a few iterations and switch
  – Find clusters for both datasets using some spatial traversal approach, creating clusters for subregions
  – …

CLEVER

Inputs: Dataset O, k’, neighborhood-size, p, p’, …
Outputs: Clustering X, fitness q

Algorithm:
1. Create a current solution by randomly selecting k’ representatives from O.
2. Create p neighbors of the current solution randomly, using the given neighborhood definition.
3. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.
4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p’ more neighbors. If re-sampling does not lead to a better solution, terminate, returning the current solution; otherwise, go back to step 2, replacing the current solution with the best solution found by re-sampling.
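A compact Python sketch of this search loop (the neighborhood operators, default parameter values, and the exact re-sampling policy are simplifications of mine; `q` scores a list of representatives, higher is better, and `k0` stands in for k’):

```python
import random

def clever(O, q, k0=5, p=20, p_prime=40):
    """Randomized hill climbing with re-sampling, after the outline above."""
    def neighbor(reps):
        # create a neighbor by inserting, deleting, or replacing a representative
        reps = list(reps)
        op = random.choice(["insert", "delete", "replace"])
        if op == "insert" or len(reps) <= 2:
            reps.append(random.choice(O))
        elif op == "delete":
            reps.pop(random.randrange(len(reps)))
        else:
            reps[random.randrange(len(reps))] = random.choice(O)
        return reps

    current = random.sample(list(O), k0)      # 1. k' random representatives
    while True:
        for sample_size in (p, p_prime):      # 2. sample p neighbors; 4. re-sample p'
            best = max((neighbor(current) for _ in range(sample_size)), key=q)
            if q(best) > q(current):          # 3. climb and go back to step 2
                current = best
                break
        else:                                 # neither sampling round improved q
            return current                    # terminate with the current solution
```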