Yongli Zhang, Sujing Wang, Amar Mani Aryal, and Christoph F. Eick


“Serial” versus “Parallel”: a Comparison of Spatio-temporal Clustering Approaches
Yongli Zhang, Sujing Wang, Amar Mani Aryal, and Christoph F. Eick
University of Houston, USA; Lamar University, Beaumont
23rd International Symposium on Methodologies for Intelligent Systems (ISMIS 2017), June 28, 2017, Warsaw, Poland
Data Analysis and Intelligent Systems Lab

Talk Outline
1. Introduction
2. ST-DPOLY
3. ST-SNN
4. Experimental Results
5. Comparison
6. Future Work

1. Introduction
With the development of remote sensors and sensor networks, different types of spatio-temporal datasets are becoming increasingly available. Our work centers on developing spatio-temporal clustering and hotspot discovery algorithms that are capable of identifying dense regions in the location/time space.

Past Work
Kulldorff [1] introduces a spatial scan statistic for the detection of spatio-temporal cylinders in which point objects occur consistently for a significant period of time.
Iyengar [2] extends the basic scan statistic using a flexible square pyramid shape to detect clusters with restrictive shapes; the proposed framework can model growth and shifts in location over time.
Wang et al. [3] propose a spatio-temporal clustering algorithm, ST-GRID, which maps the spatial and temporal dimensions into multidimensional cells and then extracts and merges spatio-temporal dense regions to obtain the final clusters.
Birant and Kut [4] propose ST-DBSCAN as an extension of DBSCAN for spatio-temporal clustering by introducing a second parameter, a temporal neighborhood radius, in addition to the spatial neighborhood radius.


Motivation
However, most spatio-temporal clustering algorithms are not suitable for dealing with large data streams, as they pass over the data several times, cannot deal with very large datasets, and use time and location in a parallel fashion.
For example, ST-DBSCAN [4] treats time and location in parallel, assumes that the dataset fits into main memory instead of arriving in batches as a stream, and scans through the dataset multiple times during the clustering process.
Moreover, the comparison between serial and parallel approaches is rarely investigated in the literature.

2. ST-DPOLY
ST-DPOLY is a serial approach. It subdivides the incoming data into batches and first generates spatial clusters for each batch. Next, spatio-temporal clusters are formed by identifying continuing relationships between spatial clusters in consecutive batches.
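The batching step can be illustrated with a minimal sketch. It assumes each stream element is a (longitude, latitude, timestamp) tuple and that batches have a fixed length; the function name `split_into_batches` and the sample coordinates are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def split_into_batches(points, start, batch_length):
    """Group (lon, lat, timestamp) points by fixed-length time batch.

    Returns a dict mapping batch index -> list of (lon, lat) locations.
    The fixed batch length is an assumption made for this sketch.
    """
    batches = defaultdict(list)
    for lon, lat, ts in points:
        # timedelta // timedelta yields the integer batch index
        index = (ts - start) // batch_length
        batches[index].append((lon, lat))
    return dict(batches)

# Hypothetical sample: three pickups between 6:00 and 7:00 AM
start = datetime(2014, 1, 8, 6, 0)
pickups = [
    (-73.98, 40.75, datetime(2014, 1, 8, 6, 5)),
    (-73.97, 40.76, datetime(2014, 1, 8, 6, 25)),
    (-73.99, 40.74, datetime(2014, 1, 8, 6, 59)),
]
batches = split_into_batches(pickups, start, timedelta(minutes=20))
```

Each batch can then be clustered spatially on its own, so only one batch of points has to be held in memory at a time.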

ST-DPOLY cont.
Inputs: a point cloud stream (e.g., taxi pickup location streams described by longitude, latitude, and pickup time) and a data collection area (e.g., the New York metropolitan area).
Outputs: spatio-temporal clusters, which are graphs of related spatial clusters.

The Three Phases of ST-DPOLY
1. Obtain a spatial density function for the spatial point cloud collected in each batch.
2. Identify spatial clusters for each batch as polygons that are created from density contour lines of the spatial density function.
3. Identify relationships between spatial clusters in consecutive batches, and construct spatio-temporal clusters as continuing spatial clusters in consecutive batches.

ST-DPOLY Phase 1
In our approach, we use non-parametric kernel density estimation (KDE) [8] to obtain a 2-dimensional spatial density function f. The kernel density estimator is defined as follows (Eq. 1):
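The estimator's formula is not reproduced in this transcript. As a minimal sketch, assuming a Gaussian kernel with a single bandwidth h (both are assumptions; the kernel and bandwidth used in the talk are not stated here), a 2-dimensional kernel density estimate averages one Gaussian bump per data point. The function name `gaussian_kde_2d` is hypothetical.

```python
import math

def gaussian_kde_2d(points, x, y, bandwidth):
    """Evaluate an assumed 2-D Gaussian KDE at location (x, y).

    f(x, y) = (1/n) * sum_i exp(-d_i^2 / (2 h^2)) / (2 pi h^2),
    where d_i is the Euclidean distance from (x, y) to point i.
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        sq_dist = (x - px) ** 2 + (y - py) ** 2
        total += norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(points)

# Hypothetical sample: density is highest near the two nearby points
sample = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
near = gaussian_kde_2d(sample, 0.1, 0.0, bandwidth=1.0)
far = gaussian_kde_2d(sample, 2.5, 2.5, bandwidth=1.0)
```

The bandwidth controls how smooth the density surface is, which in turn affects how many contour polygons the later phases produce.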

ST-DPOLY Phase 2
1. Grid the data collection area.
2. Calculate a probability density for all grid intersection points using the spatial density function (given in Eq. 1), and obtain a density matrix. Create a table T to store the locations of all grid intersection points and the corresponding density values.
3. Pass T, along with a pair of density thresholds, to CONREC [5], which returns two sets of contour lines.
4. Close open contour lines.
5. Classify the obtained contour lines into holes and spatial clusters.
6. Construct spatial cluster polygons using the information obtained in step 5.
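Steps 1 and 2 above can be sketched as follows. The sketch builds the table T for an m × m grid of intersection points from any density function passed in; the toy density `bump`, the function names, and the simple threshold filter are hypothetical stand-ins. In particular, the filter only marks grid points above a threshold and is not an implementation of CONREC.

```python
import math

def density_table(density_fn, x_range, y_range, m):
    """Evaluate density_fn on an m x m grid (m >= 2); return table T.

    T is a list of (x, y, density) rows, one per grid intersection point.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    table = []
    for i in range(m):
        x = x_min + i * (x_max - x_min) / (m - 1)
        for j in range(m):
            y = y_min + j * (y_max - y_min) / (m - 1)
            table.append((x, y, density_fn(x, y)))
    return table

def above_threshold(table, threshold):
    """Grid points whose density reaches the threshold (not a contouring step)."""
    return [(x, y) for x, y, d in table if d >= threshold]

def bump(x, y):
    """Toy density used only for this example: one peak at the origin."""
    return math.exp(-(x * x + y * y))

T = density_table(bump, (-2.0, 2.0), (-2.0, 2.0), m=5)
dense = above_threshold(T, threshold=0.3)
```

A contouring routine such as CONREC would take T and the thresholds and return contour lines rather than a set of grid points; the filter above only shows which part of the grid such a contour would enclose.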

ST-DPOLY Phase 3
We define an overlap matrix for the two cluster sets X and X' obtained for every two consecutive batches i and i+1, with X having a list of clusters and X' having a list of N clusters. An entry of this matrix is calculated as follows:
If two spatial clusters in two consecutive batches have significant overlap, we conclude that the spatial cluster does not change significantly over the two batches, and we create a 'continuing' relationship between the two clusters.
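The formula for the matrix entries is not reproduced in this transcript. The sketch below assumes a Jaccard-style ratio (shared area divided by combined area) and represents each cluster polygon as the set of grid cells it covers; both choices, the function names, and the 0.5 threshold are assumptions for illustration only.

```python
def overlap_matrix(clusters_a, clusters_b):
    """Assumed overlap measure: |a & b| / |a | b| for every cluster pair.

    Each cluster is a set of (row, col) grid cells covered by its polygon.
    """
    matrix = []
    for a in clusters_a:
        row = []
        for b in clusters_b:
            union = len(a | b)
            row.append(len(a & b) / union if union else 0.0)
        matrix.append(row)
    return matrix

def continuing_pairs(matrix, threshold):
    """Index pairs (i, j) whose overlap is at least the threshold."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value >= threshold]

# Hypothetical clusters from two consecutive batches
batch_i = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
batch_next = [{(0, 1), (1, 0), (1, 1), (2, 1)}, {(9, 9)}]

overlaps = overlap_matrix(batch_i, batch_next)
links = continuing_pairs(overlaps, threshold=0.5)
```

Chaining the resulting links across all consecutive batch pairs yields the graphs of related spatial clusters that ST-DPOLY reports as spatio-temporal clusters.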


3. ST-SNN
ST-SNN is a “parallel” approach, which is based on the well-established generic clustering algorithm Shared Nearest Neighbor (SNN). It relies on a distance function that combines spatial and temporal distances.
It processes all the input data together and uses the distance function to compute shared neighbors for pairs of spatio-temporal objects.
Using density estimation techniques based on shared nearest neighbors, ST-SNN determines core objects, which are objects for which the sum of the similarities to their k-nearest neighbors is high.
Next, clusters are formed by computing the objects reachable from core points based on the k-nearest neighbor graph.

ST-SNN continued
similarity(p, q): the similarity between a pair of polygons p and q
Density(p): the SNN density of polygon p
Eps: the density threshold
coreP(D): all points in the dataset D that have an SNN density of at least MinPs
MinPs: the core polygon threshold
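The definitions above can be made concrete with a small sketch. It follows the conventional SNN formulation, which is assumed here rather than taken from the talk: the similarity of two objects is the number of k-nearest neighbors they share (counted only when each is in the other's neighbor list), and the density of an object is the number of its neighbors whose similarity reaches Eps. The weighted-sum distance `st_distance` and its weight `w` are hypothetical.

```python
import math

def st_distance(p, q, w=0.5):
    """Hypothetical combined distance for (x, y, t) objects."""
    spatial = math.hypot(p[0] - q[0], p[1] - q[1])
    temporal = abs(p[2] - q[2])
    return w * spatial + (1.0 - w) * temporal

def knn_lists(points, k):
    """For each object, the index set of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: st_distance(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors

def similarity(i, j, nn):
    """Shared-neighbor count, only for mutual k-nearest neighbors."""
    if i in nn[j] and j in nn[i]:
        return len(nn[i] & nn[j])
    return 0

def core_points(points, k, eps, min_ps):
    """Indices whose SNN density (neighbors with similarity >= eps) >= min_ps."""
    nn = knn_lists(points, k)
    core = []
    for i in range(len(points)):
        density = sum(1 for j in nn[i] if similarity(i, j, nn) >= eps)
        if density >= min_ps:
            core.append(i)
    return core

# Hypothetical data: a tight group of four objects and one distant object
objects = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (10, 10, 10)]
core = core_points(objects, k=3, eps=2, min_ps=2)
```

Because space and time are mixed in one distance, the relative scaling of the two (here the weight `w`) directly shapes which objects end up as neighbors.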


4. Experimental Results
Dataset: NYC taxi trip dataset [6]
The dataset contains data for over 1.1 billion taxi trips from January 2009 through June 2016. Each trip record contains the precise location coordinates where the trip started and ended, the timestamps when the trip started and ended, the trip distance, and the fares.

Taxi-Cab Results: ST-DPOLY
For the experiment, we use yellow taxi pick-up locations collected in 20-minute intervals as batches. We analyzed taxi pickups from 6 to 7 AM on January 8, 2014. We obtained 3 clusters for the 6:00-6:20 batch, 2 clusters for the 6:20-6:40 batch, and 4 clusters for the 6:40-7:00 batch.
According to the results, as far as pick-ups are concerned, east of Midtown in New York is a hotspot that is crowded with people looking for taxis early in the morning, as is the region centered around Grand Central Terminal.

Taxi-Cab Results: ST-SNN
We use Euclidean distance to compute the spatial distance. ST-SNN obtains 16 clusters. The figure visualizes clusters 2, 13, and 14; they are centered around several bus terminals and train stations, which is similar to the pattern in the clustering results generated by ST-DPOLY.
Clusters 2 and 13 are similar in the spatial domain; however, the time slots corresponding to these two clusters are different.
In general, many clusters overlap spatially, but their temporal scope is quite different, which suggests temporal gaps in the data.


5. Comparison
5.1 Time/Space Complexity
Note: the grid size we use is m × m, n is the total number of points, and e is the average number of edges a spatial cluster has.
Since ST-DPOLY is a serial approach, its overall time complexity is O(m^2 × n); when the number of data points is much larger than the number of grid cells (n >> m^2), ST-DPOLY's complexity becomes O(n).
ST-DPOLY is superior to ST-SNN in terms of both time and space complexity.

5. Comparison
5.2 Temporal Flexibility
For ST-DPOLY, time intervals can be selected based on application needs, but once selected they are fixed throughout the clustering process.
ST-SNN is more flexible in temporal terms, as its clusters show more variation with respect to temporal mean and standard deviation. Therefore, ST-SNN has the potential to detect “more optimal” time intervals. In general, there is both spatial and temporal overlap between its clusters.
However, the clustering result of ST-DPOLY is more straightforward, and for clustering data streams, ST-DPOLY as a serial approach is more appropriate.

5. Comparison
5.3 Other Aspects
To compare the quality of the clustering results, we measure the variation of the clusters obtained. The clusters generated by ST-SNN have smaller standard deviations and ranges of time, longitude, and latitude than the clusters identified by ST-DPOLY, but there are more of them.
As the distance function of ST-SNN balances spatial and temporal variation, more temporal variation can occur in the spatial core area of a cluster and less in its outskirts, and more spatial variation can occur in the temporal core area of a cluster.
ST-SNN does not actually form clusters based on a continuous density function; rather, it creates a graph from density values using thresholding and then uses “reachability” in this graph to form clusters, whereas ST-DPOLY operates directly on the density function to form clusters.
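The variation measures mentioned above (standard deviation and range of time, longitude, and latitude per cluster) can be computed as in the following sketch. The record layout, the function name `cluster_variation`, the use of the population standard deviation, and the sample values are assumptions made for illustration.

```python
import statistics

def cluster_variation(cluster):
    """Standard deviation and range of each attribute in one cluster.

    cluster is a list of (time, longitude, latitude) tuples; the
    population standard deviation is used here as an assumption.
    """
    summary = {}
    names = ("time", "longitude", "latitude")
    for name, values in zip(names, zip(*cluster)):
        summary[name] = {
            "std": statistics.pstdev(values),
            "range": max(values) - min(values),
        }
    return summary

# Hypothetical cluster: time in minutes since the start of the hour
members = [(0, 1.0, 2.0), (10, 1.0, 4.0), (20, 1.0, 6.0)]
variation = cluster_variation(members)
```

Computing this summary for every cluster of both algorithms gives a table that can be compared attribute by attribute.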

6. Future Work
Fixing cluster granularities to make the obtained results more comparable turned out to be a major challenge in this work; for ST-SNN in particular, this seems to be quite difficult due to the discrete characteristics of the algorithm. Why?
Enable ST-DPOLY to support dynamic or adaptive batch sizes.
Investigate semi-automatic and automatic parameter selection tools to facilitate the use of ST-DPOLY.
Conduct a more thorough experimental evaluation.

Future Work cont.
Extend our approach to support multiple density thresholds.

References
[1] Martin Kulldorff. Spatial scan statistics: models, calculations, and applications. In Scan Statistics and Applications, pages 303-322. Springer, 1999.
[2] Vijay S. Iyengar. On detecting space-time clusters. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 587-592. ACM, 2004.
[3] Min Wang, Aiping Wang, and Anbo Li. Mining spatial-temporal clusters from geo-databases. In International Conference on Advanced Data Mining and Applications, pages 263-270. Springer, 2006.
[4] Derya Birant and Alp Kut. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1):208-221, 2007.
[5] Paul D. Bourke. A contouring subroutine. Byte, 12(6):143-150, 1987.
[6] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (accessed August 23, 2016).
[7] Sujing Wang, T. Cai, and Christoph F. Eick. New spatiotemporal clustering algorithms and their applications to ozone pollution. In Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on, pages 1061-1068. IEEE, 2013.

Any Questions?