Finding Regional Co-Location Patterns for Sets of Continuous Variables in Spatial Datasets
Christoph Eick (University of Houston, USA), Rachana Parmar (University of Houston, USA), Wei Ding (University of Massachusetts at Boston, USA), Tomasz Stepinski (Lunar and Planetary Institute, Houston, USA), Jean-Philippe Nicot (Bureau of Economic Geology, University of Texas at Austin, USA)
ACM-GIS 2008, Irvine, CA, November 6, 2008

Talk Outline
1. Introduction: Co-location Mining
2. Clustering with Plug-in Fitness Functions
3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
4. Case Study: Arsenic Pollution in Texas Water Wells
5. CLEVER: A Representative-based Clustering Algorithm
6. Conclusion

1. Introduction
- "Spatial co-locations represent the subsets of features which are frequently located together in geographic space" [Shekhar].
- Most past research centers on finding categorical co-location patterns that are global.
- However, many real-world datasets contain continuous variables, and global knowledge may be inconsistent with regional knowledge.

Regional Co-location Mining
Goal: discover regional co-location patterns involving continuous variables, in which the continuous variables take values from the wings of their statistical distributions.
A novel framework that operates in the continuous domain is proposed to accomplish this goal.
Dataset: (longitude, latitude, <continuous variables>+)

Why is Regional Knowledge Important in Spatial Data Mining?
- A special challenge in spatial data mining is that information is usually not uniformly distributed in spatial datasets.
- It has been pointed out in the literature that "whole map statistics are seldom useful", that "most relationships in spatial data sets are geographically regional, rather than global", and that "there is no average place on the Earth's surface" [Goodchild03, Openshaw99].
- Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional or local scale rather than a global scale.

Related Work
- Shekhar et al. discuss several interesting approaches to mine spatial co-location patterns of categorical features.
- Huang et al. address the problem of mining co-location patterns with rare features.
- Srikant and Agrawal use discretization of continuous variables to form categorical variables on which classical association rule mining is applied.
- Calders et al. introduce an approach that uses rank correlation to mine quantitative association rules.
- Achtert et al. give a method to derive quantitative, non-spatial models that describe correlation clusters.

2. Clustering with Plug-in Fitness Functions
Motivation:
- Finding subgroups in geo-referenced datasets has many applications.
- However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation.
- Domain knowledge frequently imposes additional requirements concerning what constitutes a "good" subgroup.
- Consequently, it is desirable to develop clustering algorithms with plug-in fitness functions that allow domain experts to express the desirable characteristics of the subgroups they are looking for.
- Very few clustering algorithms published in the literature provide plug-in fitness functions; consequently, existing clustering paradigms have to be modified and extended by our research to provide such capabilities.

Current Suite of Spatial Clustering Algorithms
- Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
- Grid-based: SCMRG
- Agglomerative: MOSAIC
- Density-based: SCDE, DCONTOUR (not really plug-in, but some fitness functions can be simulated)
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.

Spatial Clustering Algorithms (cont.)
- Datasets are assumed to have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude; <continuous variables>+).
- Clusters are found in the subspace of the spatial attributes, called regions in the following.
- The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
- Clustering algorithms are assumed to maximize reward-based fitness functions of the following structure: q(X) = Σ_{c∈X} interestingness(c)·size(c)^β, where β is a parameter that determines the premium put on cluster size (larger values ⇒ fewer, larger clusters).
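To make the structure of such a reward-based, plug-in fitness function concrete, here is a minimal Python sketch; it is not the authors' implementation, and the plug-in name `interestingness`, the parameter name `beta`, and the illustrative `arsenic` attribute are assumptions introduced only for this example.

```python
# Minimal sketch of a reward-based, plug-in fitness function
# q(X) = sum over clusters c in X of interestingness(c) * size(c)**beta, with beta > 1.
from typing import Callable, List, Sequence

Cluster = Sequence[dict]  # each object is a dict of attribute values (illustrative)

def q(clustering: List[Cluster],
      interestingness: Callable[[Cluster], float],
      beta: float = 1.1) -> float:
    """Total reward of a clustering X under a plug-in interestingness function."""
    return sum(interestingness(c) * (len(c) ** beta) for c in clustering)

# Illustrative plug-in: reward regions whose mean value of a chosen non-spatial
# attribute (here a hypothetical "arsenic" field) exceeds a threshold of 10.
def high_mean_arsenic(cluster: Cluster) -> float:
    values = [o["arsenic"] for o in cluster]
    return max(0.0, sum(values) / len(values) - 10.0)
```

The point of the design is that a domain expert only swaps the plug-in; the search algorithm and the size premium β stay the same.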

3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
- The goal is to discover interesting regions with interesting co-location patterns.
- Clustering algorithms that maximize fitness functions of the form q(X) = Σ_{c∈X} interestingness(c)·size(c)^β already exist.
- To use those algorithms for this task, an interestingness measure has to be designed.

Co-location Measure for Continuous Variables
- Products of z-scores of continuous variables are used to measure the interestingness of co-location patterns.
- Pattern A↑: attribute A has high values.
- Pattern A↓: attribute A has low values.

Interestingness of a Pattern
- Interestingness of a pattern B (e.g. B = {C↑, D↓, E↑}) for an object o: i(B,o) is the product of the z-scores of the attributes occurring in B, where z_A(o) enters the product for A↑ and −z_A(o) for A↓, with factors cut off at 0, so i(B,o) > 0 only if o matches every direction in B.
- Interestingness of a pattern B for a region c: φ(B,c) = (Σ_{o∈c} i(B,o) / |c|) · purity(B,c)^θ.
Remark: purity(B,c), the fraction of objects o in c with i(B,o) > 0, measures the percentage of objects that exhibit pattern B in region c.

Interestingness Computations for Pattern {C↑, D↓} for a Region c
[Table: for each of the four objects in c, the z-scores of C and D and the resulting value i({C↑,D↓}, o).]
Purity = 2/4 = 0.5. Assuming θ = 1 we obtain: φ({C↑,D↓}, c) = ((Σ_o i({C↑,D↓},o))/4) · 0.5^1 = 0.06.
Justification for the chosen approach: Domain experts are interested in identifying regions in which a few objects exhibit very high products, even if purity is low (choose θ = 0). Domain experts are also interested in finding regions with highly regular patterns with respect to the observed products (choose a high value for θ).
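The computation above can be sketched as follows. The exact cut-off treatment of the z-scores (negative factors clipped to 0, so only objects matching every direction in B contribute) is an assumption consistent with the purity definition and the "cut-off z-scores" mentioned in the Summary slide, not a verbatim transcription of the paper's formulas; the toy data and function names are illustrative.

```python
# Sketch of i(B, o) and phi(B, c) for a direction pattern such as {C-up, D-down}.
from statistics import mean, pstdev

def z_scores(values):
    """z-scores of one continuous attribute over the whole dataset."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def i_pattern(z_by_attr, directions, o):
    """i(B, o): product of cut-off z-scores; directions maps attribute -> 'up'/'down'."""
    prod = 1.0
    for attr, d in directions.items():
        z = z_by_attr[attr][o]
        prod *= max(z, 0.0) if d == "up" else max(-z, 0.0)
    return prod

def phi(z_by_attr, directions, region, theta=1.0):
    """phi(B, c) = (sum_o i(B, o) / |c|) * purity(B, c)**theta."""
    scores = [i_pattern(z_by_attr, directions, o) for o in region]
    purity = sum(s > 0 for s in scores) / len(scores)
    return (sum(scores) / len(scores)) * purity ** theta

# Toy example in the spirit of the slide: 4 objects, pattern {C-up, D-down};
# two objects match both directions, so purity = 2/4 = 0.5.
data = {"C": [12.0, 9.0, 15.0, 7.0], "D": [3.0, 8.0, 2.0, 9.0]}
z = {a: z_scores(v) for a, v in data.items()}
print(phi(z, {"C": "up", "D": "down"}, region=[0, 1, 2, 3], theta=1.0))
```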

Region Interestingness
- The interestingness of a region c is the value of its maximum-valued pattern: interestingness(c) = max_B φ(B,c).
- Region interestingness solely depends on the most interesting co-location set for the region.

Example of a Result (Experiment 1: β = 1.3, θ = 1.0)
[Table: the top 5 regions ranked by reward, each with region size, region reward, the maximum-valued pattern in the region (patterns over As, Mo, V, B, F⁻, Cl⁻, SO₄²⁻, and TDS), purity, and the average product for the maximum-valued pattern.]
All experiments: the pattern B must contain As↑ or As↓, and |B| < 5.

Summary
- Pattern interestingness in a region is evaluated using products of (cut-off) z-scores. In general, products of z-scores measure correlation.
- Additionally, purity is considered; its influence is controlled by the parameter θ.
- Finally, the parameter β determines how much premium is put on the size of a region when computing region rewards.

4. Case Study

Arsenic Water Pollution Problem
- Arsenic pollution is a serious problem in the Texas water supply.
- It is hard to explain what causes arsenic pollution to occur.
- Several datasets were created from the Ground Water Database (GWDB) of the Texas Water Development Board (TWDB), which tests water wells regularly; one of them was used in the experimental evaluation in the paper:
  - All wells have non-null samples for arsenic.
  - Multiple sample values are aggregated using avg/max functions.
  - Other chemicals may have null values.
  - Format: (longitude, latitude; <chemical concentrations>)
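A rough sketch of the per-well aggregation step described above, assuming a tabular export of the GWDB; the file name and the column names (well_id, longitude, latitude, arsenic) are hypothetical placeholders, not the actual TWDB schema.

```python
import pandas as pd

samples = pd.read_csv("gwdb_samples.csv")  # hypothetical per-sample export of the GWDB

# One row per well: keep the location, aggregate repeated arsenic samples with avg/max.
wells = (samples
         .groupby("well_id")
         .agg(longitude=("longitude", "first"),
              latitude=("latitude", "first"),
              arsenic_avg=("arsenic", "mean"),
              arsenic_max=("arsenic", "max")))

# Keep only wells that actually have arsenic samples; other chemicals may stay null.
wells = wells.dropna(subset=["arsenic_avg"])
```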

Interesting Observations
- High arsenic is a well-known problem in the Southern Ogallala aquifer in the Texas Panhandle and in the Southern Gulf Coast aquifer. The co-location mining framework was able to identify regions in these areas; for example, for β = 1.3, θ = 1.0, the rank 1, 2 and 3 regions are in the Ogallala aquifer and the rank 4 region is in the Gulf Coast aquifer.
- The approach not only identified that high arsenic is associated with high vanadium and molybdenum, but was also able to discriminate against companion elements like sulfate and fluoride.

Interesting Observations (cont.)
- For β = 1.5, the extent of arsenic contamination in Texas (Ogallala aquifer, Southern Gulf Coast, and West Texas basins) could be recognized.
- For β = 2.0, the looser cluster definition results in a display of the known, often described as sharp, boundaries between high- and low-arsenic areas in the Ogallala (ranks 2 and 4) and the Gulf Coast (ranks 1 and 3) aquifers.
- In general, for β = 1.3 and β = 1.5 the discovered regions tend to lie inside Texas aquifers, which is expected because wells inside the same aquifer are connected by water flow.
- The algorithm also finds some inconsistent co-location sets. For example, for β = 1.5, the rank 3 region in west Texas has high arsenic co-located with high chloride, while the rank 4 region in south Texas has low arsenic with high chloride, which can be attributed to geographical differences between the regions.
- When θ is increased to 5, not surprisingly, all top regions have purities of 90% or above.

Example: Differences in Results for Medium/High Rewards for Purity
Table 5. Top 5 regions ranked by reward (as per formula 8) for Experiment 2 (β = 1.5, θ = 1.0) and Experiment 4 (β = 1.5, θ = 1.0): region size, region reward, maximum-valued pattern in the region (patterns over As, Mo, V, B, F⁻, Cl⁻, SO₄²⁻, and TDS), purity, and average product for the maximum-valued pattern.
All experiments: the pattern B must contain As↑ or As↓, and |B| < 5.

High Reward Regions for θ = 1 and θ = 5
[Maps of the high-reward regions for θ = 1 and θ = 5.]

Challenges
- This is a "seeking a needle in a haystack" kind of problem, because we search for both interesting places and interesting patterns.
- The interestingness measure is not anti-monotone: a superset of a co-location set might be more interesting.
- Only considering the maximum-valued pattern when evaluating regions is somewhat crude (employed solution: use seeded patterns and run the algorithm multiple times).
- Observation: different fitness function parameter settings lead to quite different results, many of which are valuable to domain experts.
- New challenge: the results of many runs have to be analyzed, which is a lot of manual labor; a tool is needed for that.

Representative-based Clustering
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Properties: cluster shapes are convex polygons.
Popular algorithms: K-means, K-medoids.
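As a quick illustration of the idea (not the authors' code), the clustering induced by a set of representatives simply assigns every object to its nearest representative in the spatial attributes, which is what makes the resulting regions convex (Voronoi-cell) polygons:

```python
import math

def assign_to_representatives(points, representatives):
    """points and representatives are (longitude, latitude) tuples;
    returns one cluster (list of points) per representative."""
    clusters = [[] for _ in representatives]
    for p in points:
        nearest = min(range(len(representatives)),
                      key=lambda j: math.dist(p, representatives[j]))
        clusters[nearest].append(p)
    return clusters
```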

5. CLEVER (ClustEring using representatiVEs and Randomized hill climbing)
- A representative-based (sometimes called prototype-based) clustering algorithm.
- Uses a variable number of clusters and larger neighborhood sizes to battle premature termination, and randomized hill climbing with adaptive sampling to reduce complexity.
- Searches for the optimal number of clusters.
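The following is a highly simplified sketch of the kind of randomized hill climbing CLEVER performs over representative sets; the neighborhood operators (replace/insert/delete a representative), the sampling scheme, and the parameter names are illustrative assumptions, not the published algorithm. `q_of_reps` is expected to build the clustering induced by the representatives (e.g. with the assignment sketch above) and evaluate the plug-in fitness function q(X).

```python
import random

def clever_sketch(points, q_of_reps, k_init=10, samples=20, max_iter=100, seed=0):
    """Randomized hill climbing over sets of representatives (simplified sketch)."""
    rng = random.Random(seed)
    reps = rng.sample(points, k_init)
    best = q_of_reps(reps)
    for _ in range(max_iter):
        improved = False
        for _ in range(samples):  # sample a batch of neighboring solutions
            cand = list(reps)
            op = rng.choice(["replace", "insert", "delete"])
            if op == "replace":
                cand[rng.randrange(len(cand))] = rng.choice(points)
            elif op == "insert":
                cand.append(rng.choice(points))      # the number of clusters can grow ...
            elif len(cand) > 2:
                cand.pop(rng.randrange(len(cand)))   # ... or shrink
            score = q_of_reps(cand)
            if score > best:                         # greedy move to a better neighbor
                reps, best, improved = cand, score, True
                break
        if not improved:                             # no sampled neighbor improved q(X)
            break
    return reps, best
```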

6. Summary
- A novel framework is proposed for mining co-location patterns involving multiple continuous variables whose values come from the wings of their statistical distributions.
- Regional co-location mining is approached as a clustering problem in which a reward-based fitness function has to be maximized.
- The approach was successfully applied in a real-world case study involving arsenic contamination. The case study revealed known areas of arsenic contamination and also some unknown areas with interesting features. Different parameters lead to characterizations of arsenic patterns at different scales.
- In general, the regional co-location mining framework has been valuable to domain experts in that it provides a data-driven approach that suggests promising hypotheses for future research.
- A novel prototype-based clustering algorithm named CLEVER was also introduced.

References
- S. Shekhar and Y. Huang, "Discovering spatial co-location patterns: A summary of results," Lecture Notes in Computer Science, vol. 2121, pp. 236+, 2001.
- Y. Huang, J. Pei, and H. Xiong, "Mining co-location patterns with rare events from spatial data sets," Geoinformatica, vol. 10, no. 3, pp. 239–260, 2006.
- R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," in SIGMOD '96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1996, pp. 1–12.
- T. Calders, B. Goethals, and S. Jaroszewicz, "Mining rank-correlated sets of numerical attributes," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 96–105.
- E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, and A. Zimek, "Deriving quantitative models for correlation clusters," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2006, pp. 4–13.
- C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, "Discovery of interesting regions in spatial datasets using supervised clustering," in Proceedings of the 10th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 2006.

Region Discovery Framework
Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets.
Treats region discovery as a clustering problem.

Data Mining & Machine Learning Group ACM-GIS08 Region Discovery Framework Continued The clustering algorithms we currently investigate solve the following problem: Given: A dataset O with a schema R A distance function d defined on instances of R A fitness function q(X) that evaluates clustering X={c 1,…,c k } as follows: q(X)=  c  X reward(c)=  c  X interestingness(c)  size(c)  with  >1 Objective: Find c 1,…,c k  O such that: 1.c i  c j =  if i  j 2.X={c 1,…,c k } maximizes q(X) 3.All cluster c i  X are contiguous in the spatial subspace 4.c 1 ,…,  c k  O 5.c 1,…,c k are usually ranked based on the reward each cluster receives, and low reward clusters are frequently not reported