“Serial” versus “Parallel”: a Comparison of Spatio-temporal Clustering Approaches
Yongli Zhang, Sujing Wang, Amar Mani Aryal, and Christoph F. Eick
University of Houston, USA
Data Analysis and Intelligent Systems Lab
23rd International Symposium on Methodologies for Intelligent Systems (ISMIS 2017)
June 26, 2017, Warsaw, Poland

Talk Outline
1. Introduction
2. ST-DPOLY
3. ST-SNN
4. Experimental Results
5. Comparison
6. Future Work

1. Introduction
With the development of remote sensors and sensor networks, many types of spatio-temporal datasets are becoming increasingly available.
Our work centers on developing spatio-temporal clustering and hotspot discovery algorithms that are capable of identifying dense regions in the location/time space.

Past Work
- Kulldorff et al. [1] introduce a spatial scan statistic for the detection of spatio-temporal cylinders in which point objects occur consistently for a significant period of time.
- Iyengar et al. [2] extend the basic scan statistic using a flexible square pyramid shape to detect clusters with restrictive shapes; the proposed framework can model growth and shifts in location over time.
- Wang et al. [3] propose a spatio-temporal clustering algorithm, ST-GRID, which maps the spatial and temporal dimensions into multidimensional cells and then extracts and merges spatio-temporal dense regions to obtain the final clusters.
- Birant et al. [4] propose ST-DBSCAN as an extension of DBSCAN for spatio-temporal clustering, introducing a second parameter, the temporal neighborhood radius, in addition to the spatial neighborhood radius.

Motivation
However, most spatio-temporal clustering algorithms are not suitable for large data streams, as they:
- pass over the data several times
- cannot deal with very large datasets
- use time and location in a parallel fashion
For example, ST-DBSCAN [4] treats time and location in parallel, assumes that the dataset fits into main memory instead of arriving in batches as a stream, and scans through the dataset multiple times during the clustering process.
Moreover, comparisons between the serial approach and the parallel approach are rarely investigated in the literature.

2. ST-DPOLY
ST-DPOLY, a serial approach:
- Subdivides the incoming data into batches.
- Generates spatial clusters for each batch first.
- Then forms spatio-temporal clusters by identifying continuing relationships between spatial clusters in consecutive batches.

ST-DPOLY cont.
Inputs:
- Point cloud stream (e.g., taxi pickup location streams described by longitude, latitude, and pickup time).
- Data collection area (e.g., the New York metropolitan area).
Outputs:
- Spatio-temporal clusters, which are graphs of related spatial clusters.

The Three Phases of ST-DPOLY
1. Obtain a spatial density function for the spatial point cloud collected in each batch.
2. Identify spatial clusters for each batch as polygons that are created from density contour lines of the spatial density function.
3. Identify relationships between spatial clusters in consecutive batches, and construct spatio-temporal clusters as continuing spatial clusters in consecutive batches.

ST-DPOLY Phase 1
In our approach, we use non-parametric kernel density estimation (KDE) [8] to obtain a 2-dimensional spatial density function f. The kernel density estimator is defined as follows:

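The estimator itself is not reproduced in this text. As a minimal sketch, the code below assumes the textbook bivariate KDE with an isotropic Gaussian kernel and a single bandwidth h, that is, f(x, y) = 1/(n · 2πh²) · Σ exp(−d_i² / (2h²)), where d_i is the distance from (x, y) to the i-th point. The kernel, the bandwidth, and the names `gaussian_kde_2d` and `bandwidth` are assumptions for illustration and may differ from the form used in ST-DPOLY.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Return a density function f(x, y) estimated from 2-D points.

    Assumption: an isotropic Gaussian kernel with one shared bandwidth;
    the kernel and bandwidth used by ST-DPOLY may differ.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    # Normalizing constant of the 2-D Gaussian kernel, averaged over n points.
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            sq_dist = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density

# Made-up points: three close together, one far away.
pickups = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
f = gaussian_kde_2d(pickups, bandwidth=0.5)
print(f(0.0, 0.0) > f(2.0, 2.0))  # the dense corner has the higher density
```

Each density evaluation touches all n points, which is why evaluating f on a grid dominates the cost of the later phases.
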
ST-DPOLY Phase 2
1. Grid the data collection area.
2. Calculate a probability density for all grid intersection points using the spatial density function (given in Eq. 1), and obtain a density matrix. Create a table T to store the locations of all grid intersection points and the corresponding density values.
3. Pass T, along with a pair of density thresholds, to CONREC [5], which returns two sets of contour lines.
4. Close open contour lines.
5. Classify the obtained contour lines into holes and spatial clusters.
6. Construct spatial cluster polygons using the information obtained in step 5.

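Steps 1 and 2 can be sketched as follows. This is an illustration only: `build_density_table`, `toy_density`, and the threshold values are hypothetical, and the contouring step (CONREC) is not reimplemented here; a contouring routine would consume the table built below.

```python
def build_density_table(density, x_min, x_max, y_min, y_max, m):
    """Evaluate `density` at the m x m grid intersection points.

    Returns (table, matrix): `table` is a list of (x, y, value) rows,
    playing the role of table T; `matrix` is the row-major density matrix.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    xs = [x_min + i * (x_max - x_min) / (m - 1) for i in range(m)]
    ys = [y_min + j * (y_max - y_min) / (m - 1) for j in range(m)]
    matrix = [[density(x, y) for x in xs] for y in ys]
    table = [(x, y, matrix[j][i])
             for j, y in enumerate(ys)
             for i, x in enumerate(xs)]
    return table, matrix

def toy_density(x, y):
    # Hypothetical stand-in for the Phase 1 density: one peak at (0.5, 0.5).
    return max(0.0, 1.0 - 4.0 * ((x - 0.5) ** 2 + (y - 0.5) ** 2))

table, matrix = build_density_table(toy_density, 0.0, 1.0, 0.0, 1.0, m=5)

# Hypothetical pair of density thresholds, as passed to the contouring step.
thresholds = (0.3, 0.7)
for level in thresholds:
    above = sum(1 for _, _, value in table if value >= level)
    print(level, above)  # number of grid points at or above each threshold
```

The higher threshold selects a smaller, nested set of grid points, which is what produces the two sets of contour lines in step 3.
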
ST-DPOLY Phase 3
We define an overlap matrix for the two cluster sets X and X' obtained for every two consecutive batches i and i+1, where X holds the list of clusters of batch i and X' holds the list of N clusters of batch i+1. Each entry of the matrix measures the overlap between one cluster in X and one cluster in X'.
If two spatial clusters in two consecutive batches have significant overlap, we conclude that the spatial cluster does not change significantly between the two batches, and we create a 'continuing' relationship between the two clusters.

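The overlap formula is not reproduced in this text, so the sketch below rests on two loudly stated assumptions: clusters are approximated as sets of grid cells rather than polygons, and overlap is measured as intersection area over union area. The function names and the 0.5 threshold are hypothetical; the measure used by ST-DPOLY may be different.

```python
def overlap_matrix(clusters_a, clusters_b):
    """Entry [i][j] is |a_i & b_j| / |a_i | b_j| for clusters given as cell sets.

    Assumption: intersection-over-union on grid cells stands in for the
    polygon overlap measure, which is not given in this text.
    """
    matrix = []
    for a in clusters_a:
        row = []
        for b in clusters_b:
            union = len(a | b)
            row.append(len(a & b) / union if union else 0.0)
        matrix.append(row)
    return matrix

def continuing_pairs(matrix, threshold):
    """Index pairs (i, j) whose overlap reaches the threshold."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value >= threshold]

# Made-up clusters for batch i and batch i+1, as sets of (row, col) cells.
batch_i = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
batch_next = [{(0, 1), (1, 1), (1, 0), (2, 1)}, {(9, 9)}]

O = overlap_matrix(batch_i, batch_next)
links = continuing_pairs(O, threshold=0.5)  # hypothetical threshold
print(links)
```

Chaining these links across all consecutive batch pairs yields the graph of related spatial clusters that forms a spatio-temporal cluster.
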
3. ST-SNN
ST-SNN is a “parallel” approach based on the well-established generic clustering algorithm Shared Nearest Neighbor (SNN).
It relies on a distance function that combines spatial and temporal distances.
It processes all the input data together and uses the distance function to compute shared neighbors for pairs of spatio-temporal objects.

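The shared-neighbor idea can be sketched as follows. The text does not state how the spatial and temporal distances are combined, so the weighted sum in `st_distance` and its weight `w` are assumptions, as are all function names. The similarity follows the usual SNN definition: two objects share neighbors only if each appears in the other's k-nearest-neighbor list.

```python
import math

def st_distance(a, b, w=0.5):
    """Combined distance between objects given as (x, y, t).

    Assumption: a weighted sum of Euclidean spatial distance and absolute
    time difference; in practice both parts would need comparable scales.
    """
    spatial = math.hypot(a[0] - b[0], a[1] - b[1])
    temporal = abs(a[2] - b[2])
    return w * spatial + (1.0 - w) * temporal

def k_nearest(objects, k, w=0.5):
    """Map each object index to the set of indices of its k nearest neighbors."""
    neighbors = {}
    for i, a in enumerate(objects):
        others = [j for j in range(len(objects)) if j != i]
        others.sort(key=lambda j: st_distance(a, objects[j], w))
        neighbors[i] = set(others[:k])
    return neighbors

def snn_similarity(neighbors, i, j):
    """Number of shared neighbors; 0 unless i and j list each other."""
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])

# Made-up objects: two tight groups that are far apart in space.
objects = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.1, 0.2),
           (5.0, 5.0, 0.0), (5.1, 5.0, 0.1), (5.0, 5.1, 0.2)]
nn = k_nearest(objects, k=2)
print(snn_similarity(nn, 0, 1), snn_similarity(nn, 0, 3))
```

Because every object is compared with every other object, this brute-force neighbor search is quadratic in the number of objects, which is consistent with ST-SNN needing all input data at once.
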
4. Experimental Results
Dataset: NYC taxi trip dataset [6]
- Contains data for over 1.1 billion taxi trips from January 2009 through June 2016.
- Each individual trip record contains the precise location coordinates where the trip started and ended, timestamps for when the trip started and ended, the trip distance, and the fare.

Taxi-Cab Results: ST-DPOLY
For the experiment, we use yellow taxi pick-up locations collected in 20-minute intervals as batches.
We analyzed taxi pickups from 6 to 7 AM on January 8, 2014. We obtained 3 clusters for the 6:00-6:20 batch, 2 clusters for the 6:20-6:40 batch, and 4 clusters for the 6:40-7:00 batch.
According to the results, as far as pick-ups are concerned, the east of Midtown in New York is a hotspot crowded with people looking for taxis early in the morning, as is the region centered around Grand Central Terminal.

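The batching step can be sketched as follows. The record layout `(timestamp, lon, lat)`, the function name `split_into_batches`, and the coordinates are made up for illustration; only the date, the 6 AM start, and the 20-minute interval come from the experiment described above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def split_into_batches(records, start, interval):
    """Group (timestamp, lon, lat) records into fixed-width time batches.

    Returns a dict mapping batch index (0 for the first interval after
    `start`) to the list of (lon, lat) locations falling in that interval.
    """
    batches = defaultdict(list)
    for timestamp, lon, lat in records:
        if timestamp < start:
            continue  # ignore records before the analysis window
        index = (timestamp - start) // interval  # timedelta // timedelta is an int
        batches[index].append((lon, lat))
    return dict(batches)

start = datetime(2014, 1, 8, 6, 0)
# Made-up pickup records, not taken from the dataset.
records = [
    (datetime(2014, 1, 8, 6, 5), -73.977, 40.752),
    (datetime(2014, 1, 8, 6, 19, 59), -73.975, 40.751),
    (datetime(2014, 1, 8, 6, 20), -73.990, 40.757),
    (datetime(2014, 1, 8, 6, 45), -73.968, 40.755),
]
batches = split_into_batches(records, start, timedelta(minutes=20))
for index in sorted(batches):
    print(index, len(batches[index]))
```

Each batch can then be clustered independently, so only one batch needs to be held in memory at a time.
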
Taxi-Cab Results: ST-SNN
We use Euclidean distance to compute the spatial distance. 16 clusters are obtained.
The figure visualizes clusters 2, 13, and 14, which are centered around several bus terminals and train stations, a pattern similar to the clustering results generated by ST-DPOLY.
Clusters 2 and 13 are similar in the spatial domain. However, the time slots corresponding to these two clusters are different.

5. Comparison
5.1 Time/Space Complexity
Note: the grid size we use is $m \times m$, $n$ is the total number of points, and $e$ is the average number of edges a spatial cluster has.
Since ST-DPOLY is a serial approach, its overall time complexity is $O(m^2 \times n)$; in cases where the number of data points is much larger than the number of grid cells ($n \gg m^2$), ST-DPOLY's complexity becomes $O(n)$.
ST-DPOLY is superior to ST-SNN in terms of both time and space complexity.

5. Comparison
5.2 Temporal Flexibility
ST-SNN is more flexible temporally, as its clusters show more variation with respect to temporal mean and standard deviation.
In ST-DPOLY, the time interval can be selected based on application needs, but it stays fixed throughout the clustering process once selected. Therefore, ST-SNN has the potential to detect “more optimal” time intervals.
However, the clustering result of ST-DPOLY is more straightforward, and for clustering data streams, ST-DPOLY as a serial approach is more appropriate.

5. Comparison
5.3 Quality of Clusters
To compare the quality of the clustering results, we measure the variation of the clusters obtained.
The clusters generated by ST-SNN have smaller standard deviations and ranges of time, longitude, and latitude than the clusters identified by ST-DPOLY.

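The variation measures named above can be computed per cluster as in the following sketch. The member layout `(time, lon, lat)`, the function name `cluster_variation`, and the sample values are made up, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
import statistics

def cluster_variation(members):
    """Standard deviation and range per dimension for one cluster.

    `members` is a list of (time, lon, lat) tuples. Returns a dict mapping
    the dimension name to a (standard deviation, range) pair.
    Assumption: population standard deviation (statistics.pstdev).
    """
    summary = {}
    names = ("time", "longitude", "latitude")
    for name, values in zip(names, zip(*members)):
        summary[name] = (statistics.pstdev(values), max(values) - min(values))
    return summary

# Made-up cluster members, not experimental data.
members = [(0.0, 1.0, 2.0), (2.0, 1.0, 4.0), (4.0, 1.0, 6.0)]
summary = cluster_variation(members)
for name, (std, value_range) in summary.items():
    print(name, round(std, 3), value_range)
```

Smaller values in all three dimensions indicate a tighter cluster in both space and time.
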
6. Future Work
- Enable ST-DPOLY to support dynamic or adaptive batch sizes.
- Investigate semi-automatic and automatic parameter selection tools to facilitate the use of ST-DPOLY.
- Conduct a more thorough experimental evaluation.

Future Work cont.
- Extend our approach to support multiple density thresholds.

References
[1] Martin Kulldorff. Spatial scan statistics: models, calculations, and applications. In Scan Statistics and Applications, pages 303-322. Springer, 1999.
[2] Vijay S. Iyengar. On detecting space-time clusters. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 587-592. ACM, 2004.
[3] Min Wang, Aiping Wang, and Anbo Li. Mining spatial-temporal clusters from geo-databases. In International Conference on Advanced Data Mining and Applications, pages 263-270. Springer, 2006.
[4] Derya Birant and Alp Kut. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1):208-221, 2007.
[5] Paul D. Bourke. A contouring subroutine. Byte, 12(6):143-150, 1987.
[6] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (accessed August 23, 2016).
[7] S. Wang, T. Cai, and C. F. Eick. New spatiotemporal clustering algorithms and their applications to ozone pollution. In 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), pages 1061-1068. IEEE, 2013.

Any Questions?