Clustering of Uncertain Data Objects by Voronoi-diagram-based Approach
Speaker: Chan Kai Fong, Paul, Dept of CS, HKU



Presentation Outline
- Introduction
  - Concept of clustering; clustering of uncertain objects
  - Example: application of clustering to uncertain data
  - UK-means algorithm
- Motivation
  - Voronoi-diagram-based (VD) clustering
  - MinMax-based (MM) clustering
  - VD is strictly better than MinMax
- Clustering algorithms
  - VDBi, VDBiP, and VD-based methods with Cluster Shift
  - When are VD-based methods better than MM-based methods?
- Experiments
- Conclusion

Introduction

Clustering
- Group similar data objects together to form clusters
Partition-based clustering
- Input: number of clusters (k), number of objects (n)
- Iterative method
- In each iteration, divide the n data objects into k groups to minimize an objective function, e.g., the sum of squared distances
- Stop when the result has converged
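
The iterative procedure on this slide can be sketched as a plain K-means loop on 2D points. This is only a minimal illustration, not the speaker's code: the function names `nearest` and `kmeans`, and the choice of the first k points as initial centroids, are assumptions made here for reproducibility.

```python
def nearest(p, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to p."""
    return min(
        range(len(centroids)),
        key=lambda j: (p[0] - centroids[j][0]) ** 2 + (p[1] - centroids[j][1]) ** 2,
    )


def kmeans(points, k, max_iter=100):
    """Plain K-means on 2D points; returns (centroids, labels)."""
    # Assumption: seed with the first k points (real code would pick seeds more carefully).
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels
```

Minimizing squared distances in the assignment step and averaging in the update step is what makes each iteration reduce the sum-of-squares objective.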

Introduction
To cluster data points in 2D space:
- Data objects: n data points
- Apply any partition-based clustering algorithm (e.g., K-means)
- Distance measure: Euclidean distance, Manhattan distance, etc.

Introduction
To cluster uncertain objects in 2D space:
- Uncertain objects: objects with uncertainty (e.g., location uncertainty)
- An uncertain object has no fixed coordinates in 2D space
- The object's location is described by a probability density function (pdf) over an uncertainty region
- Assume the pdf of each object can be obtained
- Uncertainty region (ur): the region in which the object may appear, with a certain probability distribution; the probability that the object appears outside its uncertainty region is zero
- Each object may have an irregular uncertainty region, and its pdf can be arbitrary
[Figure: object o_1, its uncertainty region o_1.ur, and the MBR of o_1.ur]

Expected distance computation
- The expected distance (ED) measures the distance between an uncertain object and a cluster representative:
  ED(o_i, p_j) = \int_{x \in o_i.ur} f_i(x) \, d(x, p_j) \, dx
- ED is the expected distance function, d is the Euclidean distance function, x is any point inside o_i's uncertainty region, f_i is the pdf of uncertain object o_i, and p_j is any cluster representative
- ED computations are very expensive; each iteration of K-means requires nk ED computations
[Figure: expected distance ED(o_i, p_j) between object o_i and cluster representative p_j]
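
In practice the integral is usually approximated from sample points of the pdf (the experiments later use a 20 * 20 grid per object). The sketch below assumes that representation; `grid_samples` and `expected_distance` are hypothetical helper names, and the MBR is assumed to be an `(xmin, ymin, xmax, ymax)` tuple.

```python
import math


def grid_samples(mbr, s):
    """Centres of an s-by-s grid of cells covering the MBR (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = mbr
    dx, dy = (xmax - xmin) / s, (ymax - ymin) / s
    return [
        (xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy)
        for i in range(s)
        for j in range(s)
    ]


def expected_distance(samples, weights, rep):
    """Approximate ED(o_i, p_j) as a weighted mean of Euclidean distances.

    samples: points inside the uncertainty region; weights: pdf mass at each point
    (normalised here, so they need not sum to 1); rep: cluster representative.
    """
    total = sum(weights)
    return sum(
        w * math.hypot(x[0] - rep[0], x[1] - rep[1])
        for x, w in zip(samples, weights)
    ) / total
```

With s * s samples per object, one ED call costs s * s distance evaluations, which is why nk such calls per iteration dominate the running time.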

Application: Clustering the vehicles
- Objective: get traffic patterns by clustering vehicles in a city
- Data objects: vehicles on a 2D map
- Uncertainty: location uncertainty of the vehicles; each pdf, defined over the object's uncertainty region, represents the probability distribution of the possible locations of a vehicle in a certain period of time

The degree of uncertainty of an object o_i is affected by the following factors:
1. Time
2. Traffic of the roads
3. Shape of the roads
4. Speed of the vehicles

Results

UK-means
- UK-means: the first extension of the K-means algorithm to handle uncertain objects
- Distance measure: expected distance (ED)
- Disadvantage: slow and inefficient
- Shows that K-means can be used to cluster uncertain objects

Two approaches to solve the clustering problem by UK-means
1. MinMax-based approach (Jacky)
2. Voronoi-diagram-based approach (Paul)

Motivation

Two approaches to solve the clustering problem by UK-means
1. MinMax-based approach (Jacky)
   - Basic MinMax distance pruning (MinMax)
   - MinMax with pre-computation of ED
   - MinMax with Cluster Shift (MinMax-Shift)
2. Voronoi-diagram-based approach (Paul)
   - Voronoi diagram with Bisector Pruning (VDBi)
   - Voronoi diagram with Bisector Pruning and Partial expected distance computations (VDBiP)
   - Voronoi diagram with Bisector Pruning and Cluster Shift (VDBi-Shift)
   - Voronoi diagram with Bisector Pruning, Partial expected distance computations and Cluster Shift (VDBiP-Shift)

MinMax-based Approach
UK-means with MinMax distance pruning
- Objective: avoid expected distance computations
- Use mindist and maxdist between an object's MBR and the cluster representatives as lower and upper bounds on ED(c_j, o_i) and ED(c_m, o_i)
- E.g., given an object o_i and cluster representatives c_j and c_m, if mindist(c_j, o_i) > maxdist(c_m, o_i), then ED(c_j, o_i) > ED(c_m, o_i), so c_j can be pruned and ED(c_j, o_i) need not be calculated
[Figure: object o_i with mindist(c_j, o_i) and maxdist(c_m, o_i)]
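
A minimal sketch of this pruning rule follows, assuming the MBR is an axis-aligned `(xmin, ymin, xmax, ymax)` tuple. The function names are illustrative, and comparing every representative against the smallest maxdist is one straightforward way to apply the pairwise rule, not necessarily how the original implementation organises it.

```python
import math


def mindist(mbr, c):
    """Smallest Euclidean distance from point c to any point of the MBR."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - c[0], 0.0, c[0] - xmax)
    dy = max(ymin - c[1], 0.0, c[1] - ymax)
    return math.hypot(dx, dy)


def maxdist(mbr, c):
    """Largest Euclidean distance from point c to any point of the MBR (a corner)."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(abs(c[0] - xmin), abs(c[0] - xmax))
    dy = max(abs(c[1] - ymin), abs(c[1] - ymax))
    return math.hypot(dx, dy)


def minmax_candidates(mbr, reps):
    """Indices of representatives that survive MinMax pruning.

    c_j is pruned when mindist(c_j) exceeds the smallest maxdist of any
    representative, because then some c_m certainly has a smaller ED.
    """
    best_upper = min(maxdist(mbr, c) for c in reps)
    return [j for j, c in enumerate(reps) if mindist(mbr, c) <= best_upper]
```

Only the surviving indices need an actual ED computation; if a single index survives, the object is assigned without touching its pdf.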

MinMax-based Approach
- The upper and lower bounds can be tightened with the Cluster Shift (CS) and ED Pre-computation (PC) methods
- These replace the loose mindist and maxdist estimates with tighter distance bounds
- For details, refer to Jacky's work

Voronoi-diagram-based approach
- Each object's uncertainty region is bounded by its minimum bounding rectangle (MBR)
- The objects' MBRs are indexed by an R-tree
- A Voronoi diagram is constructed for the cluster representatives in each iteration
[Figure: Voronoi diagram for 5 cluster representatives; uncertain object o_1 indexed by the R-tree]

Voronoi-diagram-based approach
- If the bisector of two cluster representatives p_1 and p_2 does not cut an object's MBR, and the MBR falls on the p_2 side of the bisector, then ED(p_1, o_1) > ED(p_2, o_1)
[Figure: object o_1, cluster representatives p_1 and p_2, and the bisector of p_1 and p_2]
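
The bisector test can be checked on the four MBR corners alone: each side of a bisector is a half-plane, so a rectangle lies entirely on one side exactly when all of its corners do. A minimal sketch, with the hypothetical name `closer_everywhere` and an assumed `(xmin, ymin, xmax, ymax)` MBR layout:

```python
def closer_everywhere(mbr, p_near, p_far):
    """True if every point of the MBR is strictly closer to p_near than to p_far.

    Equivalent to: the bisector of p_near and p_far does not cut the MBR,
    and the MBR lies on p_near's side.
    """
    xmin, ymin, xmax, ymax = mbr
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]

    def sq(a, b):
        # Squared Euclidean distance; enough for comparisons.
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return all(sq(c, p_near) < sq(c, p_far) for c in corners)
```

When the test returns True, ED to `p_far` is larger than ED to `p_near` for any pdf inside the MBR, so `p_far` can be discarded without computing either expected distance.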

Voronoi-diagram-based approach (Cluster Assignment)
- ED(o_1, p_2) < ED(o_1, p_1) and ED(o_1, p_2) < ED(o_1, p_3)
- o_1 is assigned to cluster p_2
[Figure: object o_1 with cluster representatives p_1, p_2 and p_3]

Voronoi-diagram-based approach
In each iteration:
- For each Voronoi cell (approximated by an MBR), issue a range query to the objects' R-tree to retrieve the candidate objects for the cluster
- If a candidate's MBR is completely enclosed in the Voronoi cell, assign the object to the cluster
- If a candidate's MBR intersects more than one Voronoi cell, special handling is required to prune away the unqualified clusters
[Figure: candidate objects retrieved for a cluster; an object enclosed entirely in a Voronoi cell; an object that intersects more than one Voronoi cell]
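
The enclosure check can be illustrated without building the diagram or the R-tree. Because Voronoi cells are convex, an MBR lies inside a cell when all four of its corners have the same nearest representative. The sketch below is a brute-force stand-in for the range query described on the slide; `enclosing_cell` is a hypothetical name, and corners that fall exactly on a cell boundary are resolved in favour of the lower index.

```python
def enclosing_cell(mbr, reps):
    """Index of the representative whose Voronoi cell contains the whole MBR, or None."""
    xmin, ymin, xmax, ymax = mbr
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]

    def sq(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Nearest representative of each corner (brute force over all k representatives).
    nearest = {min(range(len(reps)), key=lambda j: sq(c, reps[j])) for c in corners}
    # One shared nearest representative means the MBR does not cross a cell boundary.
    return nearest.pop() if len(nearest) == 1 else None
```

A returned index means the object can be assigned immediately; `None` marks the object for the special handling mentioned in the last bullet.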

Advantages of using Voronoi-diagram-based clustering
Avoid expected distance computation
1. If an object is completely enclosed in a Voronoi cell, the object must belong to that cluster
2. In the best case, no expensive expected distance calculations are needed, and the object's pdf need not be retrieved during clustering

Voronoi diagram construction cost is independent of the number of objects
- Only O(k log k) time is needed to compute the 2D Voronoi diagram in each iteration, where k is the number of clusters
- k does not depend on the number of objects n, and n is much larger than k

Difficulties of Voronoi-based clustering
1. Handling uncertain objects that intersect more than one Voronoi cell
- The nearest cluster cannot be determined just by looking at the Voronoi diagram
[Figure: object o_1 with cluster representatives c_1, c_2 and c_3]

Is VD better than basic MinMax?
Theorem: VD is strictly better than basic MinMax
- Given an object o_i that is assigned to cluster c_1, for any iteration of UK-means, if VD calculates ED(o_i, c_p) for some c_p, then MM must calculate ED(o_i, c_p) as well
- If VD does not calculate ED(o_i, c_p), MM must sometimes still calculate ED(o_i, c_p)
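
One way to see the first claim is through its contrapositive: whenever basic MinMax prunes c_j because of some c_m, the bisector test prunes c_j too. The derivation below is a sketch of that argument under the definitions of mindist and maxdist used earlier; it is not taken from the slides, and MBR(o_i) is notation introduced here for the object's bounding rectangle.

```latex
\begin{align*}
&\mathrm{mindist}(c_j, o_i) > \mathrm{maxdist}(c_m, o_i)
  && \text{(MinMax prunes } c_j\text{)} \\
\Rightarrow\ & \forall x \in \mathrm{MBR}(o_i):\quad
  d(x, c_j) \ \ge\ \mathrm{mindist}(c_j, o_i)
  \ >\ \mathrm{maxdist}(c_m, o_i) \ \ge\ d(x, c_m) \\
\Rightarrow\ & \mathrm{MBR}(o_i) \text{ lies strictly on } c_m\text{'s side of the bisector of } c_j \text{ and } c_m \\
\Rightarrow\ & \text{bisector pruning also removes } c_j .
\end{align*}
```

The converse fails: an MBR can lie entirely on one side of a bisector while mindist(c_j, o_i) is still smaller than maxdist(c_m, o_i), which is where VD saves computations that MinMax cannot.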

In some situations, VD-based methods are better
- VD-based methods are always better than basic MinMax, but they may not beat MinMax-Shift
- In some situations, VD-based methods outperform all MM-based methods
- When object uncertainty is very small, VD-based methods are preferred

Clustering algorithms

Clustering Methods
Voronoi-diagram-based approach
1. Voronoi diagram with bisector pruning (VDBi)
2. Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)

MinMax-based Methods
For each object:
- Find the upper and lower bounds of the ED values
  - If the Cluster Shift (CS) method is not enabled, the upper and lower bounds are estimated by maxdist and mindist respectively (MinMax)
  - If the CS method is enabled, the upper and lower bounds become tighter (MinMax-Shift)
- Prune unwanted clusters using the upper and lower bounds
- For all unpruned clusters, compute the ED values to determine the cluster assignment of the object

Voronoi-diagram-based Methods
Before each iteration, a Voronoi diagram is constructed for all cluster representatives.
For each cluster representative:
- Find the objects that are completely enclosed in the cluster's Voronoi cell
- Apply bisector pruning to prune unrelated clusters

Voronoi diagram with Bisector Pruning (VDBi)
- Comparing c_1 and c_3: o_1 falls on the c_1 side of bisector(c_1, c_3), so c_3 can be pruned
- Since the bisector of c_1 and c_2 cuts o_1's MBR, o_1 may be assigned to either c_1 or c_2
[Figure: object o_1 with cluster representatives c_1, c_2 and c_3]
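
Applied to every pair of representatives, the bisector test yields the candidate set for one object. The sketch below reproduces the situation on this slide with made-up coordinates; `bisector_prune` is a hypothetical name, and the quadratic scan over all pairs is a simplification chosen here, since the slides do not say how pairs are enumerated.

```python
def closer_everywhere(mbr, p_near, p_far):
    """True if all four MBR corners are strictly closer to p_near than to p_far."""
    xmin, ymin, xmax, ymax = mbr
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return all(
        (x - p_near[0]) ** 2 + (y - p_near[1]) ** 2
        < (x - p_far[0]) ** 2 + (y - p_far[1]) ** 2
        for x, y in corners
    )


def bisector_prune(mbr, reps):
    """Indices of representatives not ruled out by any bisector."""
    return [
        j
        for j in range(len(reps))
        if not any(
            m != j and closer_everywhere(mbr, reps[m], reps[j])
            for m in range(len(reps))
        )
    ]


# Illustrative coordinates only: c_1 and c_2 straddle the object, c_3 is far away.
o1_mbr = (4, 4, 6, 6)
c1, c2, c3 = (3, 5), (7, 5), (5, 20)
print(bisector_prune(o1_mbr, [c1, c2, c3]))  # indices of the surviving candidates
```

If one candidate survives, the object is assigned directly; otherwise expected distances are needed only for the survivors.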

Voronoi diagram with bisector pruning and partial expected distance computation (VDBiP)
- Cut the object's MBR into two equal halves, (a) and (b)
[Figure: object o_1 split into halves (a) and (b)]

VDBiP
- If the MBR of o_1(b) is completely enclosed in the Voronoi cell of c_2, then ED(o_1(b), c_2) < ED(o_1(b), c_1)
- Compute ED(o_1(a), c_1) and ED(o_1(a), c_2)
- If ED(o_1(a), c_2) < ED(o_1(a), c_1), then ED(o_1(a), c_2) + ED(o_1(b), c_2) < ED(o_1(a), c_1) + ED(o_1(b), c_1), so c_1 is pruned
[Figure: halves (a) and (b) of o_1 with ED(o_1(a), c_1) and ED(o_1(a), c_2)]
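
The argument above can be sketched for a sampled pdf as follows. Several details are assumptions made for illustration: the MBR is split vertically at its midpoint, samples on the dividing line are counted in the right half, and a "partial ED" is the unnormalised weighted sum over one half's samples, so that the two halves add up to the full sum. The names `partial_ed` and `vdbip_prunes` are hypothetical.

```python
import math


def closer_everywhere(mbr, p_near, p_far):
    """True if all four MBR corners are strictly closer to p_near than to p_far."""
    xmin, ymin, xmax, ymax = mbr
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return all(
        (x - p_near[0]) ** 2 + (y - p_near[1]) ** 2
        < (x - p_far[0]) ** 2 + (y - p_far[1]) ** 2
        for x, y in corners
    )


def partial_ed(pairs, rep):
    """Unnormalised ED contribution of a list of (sample, weight) pairs."""
    return sum(w * math.hypot(x[0] - rep[0], x[1] - rep[1]) for x, w in pairs)


def vdbip_prunes(samples, weights, mbr, c_far, c_near):
    """True if c_far can be pruned in favour of c_near using only one half's samples.

    False means the test is inconclusive, not that c_far is the better cluster.
    """
    xmin, ymin, xmax, ymax = mbr
    xmid = (xmin + xmax) / 2
    halves = [(xmin, ymin, xmid, ymax), (xmid, ymin, xmax, ymax)]
    pairs = list(zip(samples, weights))
    in_half = [
        [(x, w) for x, w in pairs if x[0] < xmid],   # half (a)
        [(x, w) for x, w in pairs if x[0] >= xmid],  # half (b)
    ]
    for idx in (0, 1):
        if closer_everywhere(halves[idx], c_near, c_far):
            # This half already favours c_near; evaluate only the other half.
            other = in_half[1 - idx]
            return partial_ed(other, c_near) < partial_ed(other, c_far)
    return False
```

Only half of the samples are touched when the test succeeds, which is the saving that VDBiP adds on top of VDBi.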

Experiments

Measures
- Efficiency (number of expected distance computations required)
Comparison with
- Basic MinMax distance pruning (MinMax)
- Voronoi diagram with Bisector Pruning (VDBi)
- Voronoi diagram with Bisector Pruning and Partial expected distance computation (VDBiP)
- MM-based with Cluster Shift (MinMax-Shift)
- VD-based with Cluster Shift (VDBi-Shift, VDBiP-Shift)

Experimental Settings
- Data set: randomly generated synthetic data set
- Probability density function: random
- Domain: 100 x 100 2D space
- Number of objects; number of clusters: vary
- Maximum length of an MBR's side: 10%, 1%, 0.1%
- Number of sample points: 20 * 20

Degree of uncertainty is large (MBR width = 10%)
1. VDBi performs only slightly better than basic MinMax
2. The Cluster Shift method greatly improves the performance of basic MinMax and VDBi

Degree of uncertainty is small (MBR width = 1%)
1. The Cluster Shift method cannot greatly improve the performance of MinMax
2. The VD-based approach outperforms the MM-based approach
3. The VD-based approach is still better than the MM-based approach, and VD performs slightly better when there are fewer clusters

Degree of uncertainty is very small (MBR width = 0.1%)

Performance analysis
Algorithm: Description
- MinMax: the worst one
- MinMax-Shift: good when objects are large
- VDBi: good when objects are small
- VDBi-Shift: good in all cases; outperforms MinMax-based methods
- VDBiP: better than VDBi; performs well when MBR width is small
- VDBiP-Shift: further improvement on VDBiP

Performance Analysis
- Basic MinMax performs badly because maxdist and mindist give loose upper and lower bound estimates
- When the degree of uncertainty of an object is small, MinMax with Cluster Shift (improved distance bounds) cannot greatly tighten the distance bounds, since mindist and maxdist are already accurate enough; MinMax-Shift's performance is then similar to that of basic MinMax
- Because the objects are smaller, fewer objects intersect multiple Voronoi cells; we also proved that VD is better than basic MinMax
- VD is good for small objects, and a hybrid of Cluster Shift (PC) and VD performs well in all cases
[Figure: maxdist(o_1, c_j) is a very loose upper bound, so the Cluster Shift method can improve it a lot; maxdist(o_2, c_j) is not a loose upper bound, so the Cluster Shift method cannot improve it much]

Conclusion
Clustering of uncertain objects
- Voronoi-diagram-based approach and MinMax-based approach
- VDBi is strictly better than basic MinMax
- The Voronoi-diagram-based approach beats the MinMax-based approach when object uncertainty is small
- The hybrid approach is good in all cases

Thank you Questions?