Ch. Eick: Region Discovery Project Part3

Region Discovery Project Part3: Overview
The goal of Project3 is to design a region discovery algorithm and evaluate it on the datasets used in Part2 and one other dataset. Planned as a group project. There are 5 algorithms to choose from; each group implements one algorithm:
–SCMRG (grid-based)
–SCAH (agglomerative)
–RG (sampling, radius-growing)
–PICPF-DBSCAN (density-based)
–SRIDHCR (representative-based)
Today you have to tell us what your top three algorithm choices are; groups will be created based on those preferences on Thursday.

Region Discovery Part3: Clustering Algorithms
The objective of Part3 is to design and implement a clustering/region discovery algorithm that returns a set of regions that maximizes a given fitness function q for a given spatial dataset. Inputs of the designed algorithm include:
–Clustering-algorithm-specific parameters (e.g. grid-cell size, number of clusters c)
–Parameter β of q(X)
–The measure of interestingness i(r) to be used, including measure-specific parameters (e.g. a shape parameter in some fitness functions)
The region discovery algorithm to be designed returns the set of clusters (regions) and their associated interestingness and cluster reward; each cluster is described by a triple (region, interestingness, reward).
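The slides do not spell q(X) out; the sketch below assumes the reward form used in Eick's region discovery framework, q(X) = sum over c in X of i(c)*|c|^β, with a purity-based i(r) as an illustrative plug-in (make_q and purity are hypothetical names, not part of the project code):

from typing import Callable, List, Sequence

def make_q(interestingness: Callable[[Sequence], float], beta: float):
    """Build q(X) = sum over clusters c in X of i(c) * |c|**beta."""
    def q(clustering: List[Sequence]) -> float:
        return sum(interestingness(c) * len(c) ** beta for c in clustering if c)
    return q

# Illustrative interestingness: purity of a class-labelled cluster,
# where each point is a tuple (x, y, label).
def purity(cluster):
    labels = [p[2] for p in cluster]
    return max(labels.count(l) for l in set(labels)) / len(labels)

q = make_q(purity, beta=1.1)  # beta > 1 rewards larger regions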

Region Discovery Part3: Preview of Representative-based Algorithms
Using PAM with fitness function q for a fixed number k of regions. Functions needed when implementing this algorithm include:
–An initialization function that selects k representatives at random
–Creating clusters for a given set of representatives
–Creating new sets of representatives by replacing a representative with a single non-representative
SRIDHCR (see next set of transparencies) is a representative-based clustering algorithm that, in contrast to PAM, removes representatives from and adds new representatives to the current set of representatives.

Version of the PAM Algorithm for Region Discovery
1. Randomly create an initial set of k representatives curr
2. WHILE NOT DONE DO
   1. Create new solutions S by replacing a single representative in curr by a single non-representative.
   2. Determine the element s in S for which q(s) is maximal (if there is more than one maximal element, randomly pick one).
   3. IF q(s)>q(curr) THEN curr:=s ELSE terminate, returning curr as the solution for this run.
curr: current set of cluster representatives
Not an algorithm to choose from in the course project!
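A minimal Python sketch of this loop (not part of the project; q and dist are assumed, externally supplied functions, and ties are broken by first occurrence rather than randomly as the slide asks):

import random

def pam_region_discovery(points, k, q, dist):
    # PAM-style hill climbing: in each iteration try every swap of one
    # representative for one non-representative and keep the best swap,
    # but only if it strictly improves the fitness q.
    def clusters(reps):
        # assign every point to its closest representative
        groups = {r: [] for r in reps}
        for p in points:
            groups[min(reps, key=lambda rep: dist(p, points[rep]))].append(p)
        return list(groups.values())

    curr = random.sample(range(len(points)), k)   # indices of representatives
    while True:
        curr_q = q(clusters(curr))
        best, best_q = curr, curr_q
        for i in range(len(curr)):
            for nr in range(len(points)):
                if nr in curr:
                    continue
                cand = curr[:i] + [nr] + curr[i + 1:]
                cand_q = q(clusters(cand))
                if cand_q > best_q:
                    best, best_q = cand, cand_q
        if best_q > curr_q:
            curr = best                            # accept the improving swap
        else:
            return curr, clusters(curr)            # local optimum reached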

Algorithm SRIDHCR
REPEAT r TIMES
   curr := a randomly created set of representatives (with size between k' and 2*k')
   WHILE NOT DONE DO
      1. Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr.
      2. Determine the element s in S for which q(s) is the largest (if there is more than one maximal element, randomly pick one).
      3. IF q(s)>q(curr) THEN curr:=s
         ELSE IF q(s)=q(curr) AND |s|<|curr| THEN curr:=s
         ELSE terminate and return curr as the solution for this run.
Report the best of the r solutions found.
Remark: r and k' are input parameters.
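A minimal Python sketch under the same assumptions (q_of_reps is a hypothetical function mapping a set of representative indices to q of the induced clustering; it requires 2*k' <= n, and ties among neighbors are broken by first occurrence):

import random

def sridhcr(points, q_of_reps, k_prime, runs):
    # Hill climbing that grows and shrinks the representative set:
    # neighbors are curr plus one non-representative or minus one representative.
    n = len(points)
    best_sol, best_q = None, float("-inf")
    for _ in range(runs):
        curr = set(random.sample(range(n), random.randint(k_prime, 2 * k_prime)))
        while True:
            neighbors = [curr | {i} for i in range(n) if i not in curr]
            if len(curr) > 2:                     # guard added for the sketch
                neighbors += [curr - {i} for i in curr]
            s = max(neighbors, key=q_of_reps)
            if q_of_reps(s) > q_of_reps(curr) or (
                q_of_reps(s) == q_of_reps(curr) and len(s) < len(curr)
            ):
                curr = s                          # better, or equal q with fewer reps
            else:
                break                             # local optimum for this run
        if q_of_reps(curr) > best_q:
            best_sol, best_q = set(curr), q_of_reps(curr)
    return best_sol, best_q                       # best of the r runs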

Example SRIDHCR (the numeric table did not survive in the transcript). The table traced one run in two parts: the first part listed, per trial, the set of medoids after adding one non-medoid together with its q(X) value; the second part listed the set of medoids after removing a medoid together with its q(X) value. A summary table listed, per run and starting from the initial solution, the set of medoids producing the lowest q(X) in the run, that q(X) value, and the purity. In this example, we assume q(X) has to be minimized.

SCAH (Agglomerative Hierarchical)
Inputs: A dataset O={o1,...,on}; a distance matrix D={d(oi,oj) | oi,oj ∈ O}
Output: Clustering X={c1,...,ck}
Algorithm:
1) Initialize: Create single-object clusters ci={oi}, 1≤i≤n; compute merge candidates based on "nearest clusters": MERGE-CANDIDATE(c1,c2)=true if c1 is closest to c2 or c2 is closest to c1
2) DO FOREVER
   a) Find the pair (ci,cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X={c1,...,ck}
   c) Delete the two clusters ci and cj from X and add the cluster ci ∪ cj to X
   d) Update inter-cluster distances incrementally
   e) Update merge candidates based on the new inter-cluster distances
Recommendation: Use min-dist/single link to compute inter-cluster distances.
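A minimal, non-incremental Python sketch of this loop (q and dist are assumed, externally supplied functions; a faithful implementation would update distances incrementally as step d) asks, rather than recomputing them):

def scah(points, q, dist):
    # Start from single-object clusters; repeatedly merge the pair of merge
    # candidates that improves q(X) the most; stop when no merge improves q.
    def single_link(c1, c2):                  # min-dist, as recommended
        return min(dist(a, b) for a in c1 for b in c2)

    X = [[p] for p in points]
    while len(X) > 1:
        # merge candidates: unordered pairs where one cluster is the
        # other's nearest cluster
        nearest = {i: min((j for j in range(len(X)) if j != i),
                          key=lambda j, i=i: single_link(X[i], X[j]))
                   for i in range(len(X))}
        candidates = {tuple(sorted(pair)) for pair in nearest.items()}
        best_pair, best_q = None, q(X)
        for i, j in candidates:
            merged = [c for idx, c in enumerate(X) if idx not in (i, j)]
            merged.append(X[i] + X[j])
            if q(merged) > best_q:
                best_pair, best_q = (i, j), q(merged)
        if best_pair is None:
            return X                          # no merge improves q(X)
        i, j = best_pair
        X = [c for idx, c in enumerate(X) if idx not in (i, j)] + [X[i] + X[j]]
    return X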

Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy
1. If a cell receives a reward that is larger than the sum of the rewards of its children: return that cell.
2. If a cell and its children do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).

'SCMRG Simple' Pseudo Code
1. Put the initial cells, with flag set to false, on the queue
2. WHILE queue NOT EMPTY DO
   1. c=pop(queue)
   2. If cell c receives a reward that is larger than the sum of the rewards of its children: add c to the reported results
   3. If cell c has stop=false and its children do not receive any reward: put its children on the queue with stop=true
   4. If cell c has stop=true and its children do not receive any reward: prune that cell
   5. Otherwise, process the children q of the cell (drill down) by putting (false,q) on the queue
Remark: Cells carry a Boolean flag called stop that is used for pruning; the queue contains (stop-flag, cell) pairs.
Idea: Use a queue of work still to be done as the main data structure.

Code SCMRG
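The code listing on this slide is not contained in the transcript; below is a minimal Python sketch of the 'SCMRG Simple' pseudo code above, assuming hypothetical helpers reward(cell) and children(cell) (children must return an empty list at the finest resolution so the drill-down terminates):

from collections import deque

def scmrg_simple(initial_cells, reward, children):
    # Queue entries are (stop, cell) pairs, mirroring the pseudo code.
    queue = deque((False, c) for c in initial_cells)
    results = []
    while queue:
        stop, c = queue.popleft()
        kids = children(c)
        kid_reward = sum(reward(k) for k in kids)
        if reward(c) > kid_reward:
            results.append(c)                       # cell beats its children: report it
        elif kid_reward == 0 and not stop:
            queue.extend((True, k) for k in kids)   # one reward-less look-ahead level
        elif kid_reward == 0 and stop:
            continue                                # still no reward anywhere: prune
        else:
            queue.extend((False, k) for k in kids)  # drill down
    return results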

PICPF-DBSCAN
Input parameters: plug-in core-point function corep, radius r
1. For each point p in the dataset, compute the region R(p,r) of points within radius r of p and determine whether p is a core point by calling corep(p,R)
2. Create clusters as DBSCAN does
Examples of plug-in core-point functions:
1. The region R contains 3 other points and its purity is above 80%
2. The region R contains 5 other points and the standard deviation of the continuous variable is at least twice the standard deviation for the whole dataset
3. The region R contains 4 other points; this simulates DBSCAN with MinPts=4
Remarks: It is okay to modify an existing implementation of DBSCAN if you find one. PICPF-DBSCAN does not fit 100% into the region discovery framework; therefore, its experiments have to be slightly modified.
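Illustrative Python sketches of the three plug-in core-point functions above, with a brute-force neighborhood helper (all names are hypothetical; plug-in 1 assumes points are (x, y, label) tuples, plug-in 2 assumes (x, y, value) tuples):

import math
from statistics import pstdev

def neighborhood(p, points, radius):
    """All points within `radius` of p, excluding p itself."""
    return [o for o in points
            if o is not p and math.dist(p[:2], o[:2]) <= radius]

def core_purity(p, region, min_pts=3, min_purity=0.8):
    """Plug-in 1: at least min_pts neighbors and purity above the threshold."""
    if len(region) < min_pts:
        return False
    labels = [o[2] for o in region + [p]]
    return max(labels.count(l) for l in set(labels)) / len(labels) >= min_purity

def make_core_variance(dataset_std, min_pts=5):
    """Plug-in 2: at least min_pts neighbors and local std of the continuous
    variable at least twice the dataset-wide std."""
    def corep(p, region):
        vals = [o[2] for o in region + [p]]
        return len(region) >= min_pts and pstdev(vals) >= 2 * dataset_std
    return corep

def core_minpts(p, region, min_pts=4):
    """Plug-in 3: plain DBSCAN behaviour with MinPts=4."""
    return len(region) >= min_pts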

Region Growing Algorithm (RG): Algorithm Sketch
Input parameters: r (size of radius), y (how many points will be selected to draw radii around)
1. Create a result data structure Top10 that contains the top ten regions found so far, sorted by their q(X) value.
2. DO y TIMES
   1. Randomly select a point p=(x,y) (it does not need to be a point in the dataset)
   2. Draw radii of size r, 1.1*r, 1.3*r, 1.7*r, 2.2*r, 2.8*r, 3.5*r, 4.3*r, 5.2*r, 6.3*r around p ("in general, follow some schedule to increase r")
   3. Add the region computed in step 2 with the highest q(X) value to Top10
3. Return the top ten regions and the sum of their rewards
Remarks:
–Returns overlapping regions
–Only returns the top 10 regions
–Similar to the popular SaTScan hotspot discovery algorithm
–Can be generalized by making k (10 in the above) an input parameter
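A minimal Python sketch of RG (q is an assumed, externally supplied fitness function; random centers are sampled from the dataset's bounding box, and the Top10 structure is kept as a heap of size k):

import heapq
import math
import random

def region_growing(points, q, r, y, k=10):
    # Sample y random centers, grow radii on the fixed schedule,
    # and keep the k best regions found.
    schedule = [1.0, 1.1, 1.3, 1.7, 2.2, 2.8, 3.5, 4.3, 5.2, 6.3]
    xmin, xmax = min(p[0] for p in points), max(p[0] for p in points)
    ymin, ymax = min(p[1] for p in points), max(p[1] for p in points)
    top = []                                   # min-heap of (q-value, id, region)
    for i in range(y):
        p = (random.uniform(xmin, xmax), random.uniform(ymin, ymax))
        best_region, best_q = None, float("-inf")
        for f in schedule:                     # grow the radius around p
            region = [o for o in points if math.dist(p, o[:2]) <= f * r]
            if region and q(region) > best_q:
                best_region, best_q = region, q(region)
        if best_region is not None:
            heapq.heappush(top, (best_q, i, best_region))
            if len(top) > k:
                heapq.heappop(top)             # drop the currently worst region
    top.sort(reverse=True)                     # best region first
    return [(reg, val) for val, _, reg in top], sum(val for val, _, _ in top)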

Region Discovery Project Part3: Visualization Issues
1. Datasets (without regions, prior to region discovery)
–Visualize the spatial objects in the dataset
–Visualize class labels of supervised datasets in different colors
–If a dataset has continuous variables, discretize them and display them like supervised datasets using an ordinal color coding (e.g. blue → yellow)
2. Datasets with regions (final or intermediate result of a region discovery algorithm)
–Region boundaries (draw a border around each region)
–If a representative-based clustering algorithm was used, display the region representative of each region
–Objects that belong to a region
–Interestingness and reward of each region
–Other region characteristics (these vary for different measures of interestingness and for different region discovery tasks)
3. Display an individual region (e.g. the one that received the highest reward)
–Use similar techniques as in 2.
Ideally, maps should be used as the background of displays to provide reference information and to make the display look nicer. Not that important this year!!!
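A minimal matplotlib sketch of items 1 and 2 above, coloring points by region and marking representatives (the data layout, a list of point lists, is an assumption):

import matplotlib.pyplot as plt

def plot_regions(regions, representatives=None):
    """Scatter each region in its own color; optionally mark representatives.
    `regions` is a list of point lists, each point an (x, y, ...) tuple."""
    for idx, region in enumerate(regions):
        xs, ys = [p[0] for p in region], [p[1] for p in region]
        plt.scatter(xs, ys, s=12, label=f"region {idx}")
    if representatives is not None:
        plt.scatter([p[0] for p in representatives],
                    [p[1] for p in representatives],
                    marker="x", c="black", s=80, label="representatives")
    plt.legend(loc="best")
    plt.show()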

Example: Discovery of "Interesting Regions" in Wyoming Census 2000 Datasets

Problems with SCAH
–No look-ahead
–Non-contiguous clusters, e.g.:
   XXX OOO
   OOO XXX
–Too restrictive definition of merge candidates

More on Grid Structures
Grid cells are pairs of integers (i,j) with i and j being numbers between 0 and g-1.
Let v be a value of the attribute att; then the number of v's grid cell is computed as follows:
   g' = floor(((v - att_min) * g) / (att_max - att_min))
Example: Let attribute att1 range between -50 and +50, let att2 range between 0 and 20, let g be 10, and let an example e=(att1=-5, att2=17) be given. Example e is assigned to the grid cell (4,8), because floor(((-5 - (-50)) * 10) / 100) = floor(450/100) = 4 and floor(((17 - 0) * 10) / 20) = floor(8.5) = 8.
For a 2D grid structure the following holds:
–Two different cells (i1,j1) and (i2,j2) are merge candidates ⇔ i1=i2 or j1=j2
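A small Python version of this computation, reproducing the worked example (the clamping of v=att_max into cell g-1 is an added safeguard, not from the slide):

import math

def grid_cell(value, att_min, att_max, g):
    """Map an attribute value to its grid index via
    floor((v - att_min) * g / (att_max - att_min))."""
    idx = math.floor((value - att_min) * g / (att_max - att_min))
    return min(idx, g - 1)   # clamp v = att_max into the last cell

# Worked example from the slide: e = (att1=-5, att2=17),
# att1 in [-50, 50], att2 in [0, 20], g = 10  ->  cell (4, 8)
cell = (grid_cell(-5, -50, 50, 10), grid_cell(17, 0, 20, 10))
assert cell == (4, 8)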